Top PySpark Interview Questions and Answers
So, let’s start with the PySpark interview questions.

For example, an RDD can be created by parallelizing an existing collection in the driver program:

# Create an RDD by distributing a local Python list across the cluster
data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)
RDD Operations
RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, Spark can realize that a dataset created through map will only be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
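As a quick illustration, here is a minimal sketch reusing the distData RDD created above (the squared and total names are just placeholders for this example):

# map is a lazy transformation: nothing is computed on this line
squared = distData.map(lambda x: x * x)

# reduce is an action: it triggers the computation and returns the
# aggregated value to the driver program (1 + 4 + 9 + 16 + 25 = 55)
total = squared.reduce(lambda a, b: a + b)
print(total)  # 55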
By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
Code: the pyspark.SparkConf class holds key/value configuration for a Spark application; its constructor signature is:

class pyspark.SparkConf(
    loadDefaults = True,
    _jvm = None,
    _jconf = None
)
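A minimal usage sketch (the application name, master URL, and memory setting below are placeholders chosen for illustration):

from pyspark import SparkConf, SparkContext

# Build a configuration object and pass it to the SparkContext
conf = (SparkConf()
        .setAppName("interview-demo")   # placeholder app name
        .setMaster("local[*]")          # run locally using all cores
        .set("spark.executor.memory", "1g"))
sc = SparkContext(conf=conf)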
The storage levels available through pyspark.StorageLevel are listed below; a short persistence sketch follows the list.
- DISK_ONLY
- DISK_ONLY_2
- MEMORY_AND_DISK
- MEMORY_AND_DISK_2
- MEMORY_AND_DISK_SER
- MEMORY_AND_DISK_SER_2
- MEMORY_ONLY
- MEMORY_ONLY_2
- MEMORY_ONLY_SER
- MEMORY_ONLY_SER_2
- OFF_HEAP
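Below is a minimal persistence sketch, assuming an RDD like the distData created earlier:

from pyspark import StorageLevel

# Keep the RDD in memory and spill partitions to disk if they do not fit
distData.persist(StorageLevel.MEMORY_AND_DISK)

# cache() is equivalent to persist() with the default storage level
# distData.cache()

# Release the cached data once it is no longer needed
distData.unpersist()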
What do you mean by broadcast variables?
Ans. A broadcast variable, created with SparkContext.broadcast(), keeps a read-only copy of data cached on every node instead of shipping it with each task.
OR
Broadcast variables in Apache Spark are a mechanism for sharing read-only variables across executors. Without broadcast variables, these variables would be shipped to each executor for every transformation and action, which can cause network overhead.
OR
Broadcast variables are read-only variables that are cached in memory on the worker nodes, rather than a copy of the data being shipped with every task.
Broadcast variables are mostly used when the tasks across multiple stages require the same data or when caching the data in the deserialized form is required.
Broadcast variables are created using a variable v by calling SparkContext.broadcast(v).
The Broadcast variable is a wrapper around v, and its value can be accessed by calling the value method.
The data broadcasted this way is cached in a serialized form and deserialized before running each task.
For example:
- >>> from pyspark.context import SparkContext
- >>> sc = SparkContext('local', 'test')
- >>> b = sc.broadcast([1, 2, 3, 4, 5])
- >>> b.value
- [1, 2, 3, 4, 5]
- >>> sc.parallelize([0, 0]).flatMap(lambda x: b.value).collect()
- [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
- >>> b.unpersist()
- >>> large_broadcast = sc.broadcast(range(10000))
What are Accumulator variables?
Ans. Accumulators are variables used to aggregate information across executors through associative and commutative operations.
OR
Accumulator variables are used when a user wants to perform associative and commutative aggregations on the data.
Accumulators can be created with or without a name. If an accumulator is created with a name, it can be viewed in Spark’s UI, which is useful for understanding the progress of running stages.
An accumulator is created from an initial value v by calling SparkContext.accumulator(v).
Syntax:
acc = sc.accumulator(v)
Code: the pyspark.Accumulator class has the following signature:

class pyspark.Accumulator(aid, value, accum_param)
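A minimal accumulator sketch (the acc name and the sample values are placeholders for illustration):

>>> from pyspark.context import SparkContext
>>> sc = SparkContext('local', 'test')
>>> acc = sc.accumulator(0)  # accumulator with initial value 0
>>> sc.parallelize([1, 2, 3, 4, 5]).foreach(lambda x: acc.add(x))
>>> acc.value  # only the driver program can read the accumulated value
15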