TOP PYSPARK INTERVIEW QUESTIONS 2023

  • What is Apache Spark and how does it differ from Hadoop?
  • What are the benefits of using Spark over MapReduce?
  • What is a Spark RDD and what operations can be performed on it?
  • How does Spark handle fault tolerance and data consistency?
  • Explain the difference between Spark transformations and actions.
  • What is a Spark DataFrame and how is it different from an RDD?
  • What is Spark SQL and how does it work?
  • How can you optimize a Spark job to improve its performance?
  • How does Spark handle memory management and garbage collection?
  • Explain the role of the Spark Driver and Executors.
  • What is PySpark and how does it differ from Apache Spark?
  • How do you create a SparkContext in PySpark? What is the purpose of SparkContext?
  • What is RDD (Resilient Distributed Dataset)? How is it different from DataFrame and Dataset?
  • What are the different ways to create RDDs in PySpark?
  • What is the use of the persist() method in PySpark? How does it differ from the cache() method?
  • What is the use of broadcast variables in PySpark...

Top PySpark Interview Questions and Answers

In this PySpark article, we will go through the most frequently asked PySpark interview questions and answers. These interview questions will help both freshers and experienced candidates. Moreover, you will get a guide on how to crack a PySpark interview. Follow each link for better understanding.
So, let’s start with the PySpark interview questions.


What is PySpark?
Ans. PySpark helps data scientists interface with Resilient Distributed Datasets in Apache Spark from Python. Py4J is a popular library integrated within PySpark that lets Python interface dynamically with JVM objects (RDDs).

Is PySpark a language?
Ans. No. Spark refers to the Apache Spark distributed computing framework, originally accessible using the Scala programming language. PySpark is the interface that gives access to Spark using the Python programming language.

What is RDD in PySpark?
Ans. A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. In PySpark it is exposed as:

class pyspark.RDD(jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSerializer()))

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)  # distribute the local list into an RDD
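
The second approach, referencing a dataset in external storage, looks like the following minimal sketch ("data.txt" is a placeholder path; any file on a shared filesystem, HDFS, etc. works):

distFile = sc.textFile("data.txt")  # each line of the file becomes one element of the RDD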

RDD Operations

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
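
As a minimal sketch of these two kinds of operations (the input strings are arbitrary):

lines = sc.parallelize(["spark", "interview", "questions"])
lineLengths = lines.map(lambda s: len(s))             # transformation: recorded lazily, nothing runs yet
totalLength = lineLengths.reduce(lambda a, b: a + b)  # action: triggers the actual computation
print(totalLength)                                    # 5 + 9 + 9 = 23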

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
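
A minimal sketch of caching, assuming the SparkContext sc from above:

lengths = sc.parallelize(range(1000)).map(lambda x: x * 2)
lengths.cache()                            # keeps the RDD at the MEMORY_ONLY storage level
print(lengths.count())                     # first action computes and caches the partitions
print(lengths.reduce(lambda a, b: a + b))  # later actions reuse the cached data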



What is Py4J?
Ans. Py4J is a bidirectional bridge between Python and Java. It enables Python programs running in a Python interpreter to dynamically access Java objects in a JVM.
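
A minimal sketch of plain Py4J usage, independent of Spark (it assumes a py4j GatewayServer is already running on the Java side):

from py4j.java_gateway import JavaGateway

gateway = JavaGateway()                   # connect to the JVM through the gateway server
random = gateway.jvm.java.util.Random()   # instantiate a java.util.Random object inside the JVM
print(random.nextInt(10))                 # call a method on the Java object from Python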


What do you mean by PySpark SparkContext?
Ans. In simple words, an entry point to any Spark functionality is what we call SparkContext. When it comes to PySpark, SparkContext uses the Py4J library to launch a JVM and create a JavaSparkContext. By default, PySpark makes the SparkContext available as ‘sc’.
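
A minimal sketch of creating a SparkContext in a standalone script (in the PySpark shell, sc is already created for you; the master URL and app name below are illustrative):

from pyspark import SparkContext

sc = SparkContext("local[*]", "pyspark-interview-demo")  # master and app name are placeholders
print(sc.version)                                        # confirm the context is up
sc.stop()
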
Explain PySpark SparkConf?
Ans. We mainly use SparkConf to set the configurations and parameters needed to run a Spark application locally or on a cluster. In other words, SparkConf offers the configuration for running a Spark application; a usage sketch follows the class signature below.
  • Code
class pyspark.SparkConf(
    loadDefaults=True,
    _jvm=None,
    _jconf=None
)
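
A minimal sketch of typical SparkConf usage (the app name, master URL, and property value are illustrative):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("pyspark-interview-demo").setMaster("local[2]")
conf.set("spark.executor.memory", "1g")   # any spark.* property can be set this way
sc = SparkContext(conf=conf)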


Name different storage levels.
Ans. The different storage levels are listed below; the sketch after the list shows how to apply one:
  • DISK_ONLY 
  • DISK_ONLY_2 
  • MEMORY_AND_DISK 
  • MEMORY_AND_DISK_2 
  • MEMORY_AND_DISK_SER 
  • MEMORY_AND_DISK_SER_2 
  • MEMORY_ONLY 
  • MEMORY_ONLY_2
  • MEMORY_ONLY_SER 
  • MEMORY_ONLY_SER_2 
  • OFF_HEAP
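
A minimal sketch of applying a storage level explicitly, assuming an existing SparkContext sc:

from pyspark import StorageLevel

rdd = sc.parallelize(range(100))
rdd.persist(StorageLevel.MEMORY_AND_DISK)  # spill partitions to disk when memory is insufficient
print(rdd.count())                         # first action materialises the persisted RDD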

What do you mean by Broadcast variables?
Ans. In order to save a copy of data across all nodes, a broadcast variable is created with SparkContext.broadcast().
OR
Broadcast variables in Apache Spark are a mechanism for sharing read-only variables across executors. Without broadcast variables, these variables would be shipped to each executor for every transformation and action, and this can cause network overhead.
OR
Broadcast variables are read-only variables that are distributed across worker nodes in-memory instead of shipping a copy of data with tasks.
Broadcast variables are mostly used when the tasks across multiple stages require the same data or when caching the data in the deserialized form is required.
Broadcast variables are created using a variable v by calling SparkContext.broadcast(v).
The Broadcast variable is a wrapper around v, and its value can be accessed by calling the value method.
The data broadcasted this way is cached in a serialized form and deserialized before running each task.
For example:
>>> from pyspark.context import SparkContext
>>> sc = SparkContext('local', 'test')
>>> b = sc.broadcast([1, 2, 3, 4, 5])
>>> b.value
[1, 2, 3, 4, 5]
>>> sc.parallelize([0, 0]).flatMap(lambda x: b.value).collect()
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
>>> b.unpersist()
>>> large_broadcast = sc.broadcast(range(10000))

What are Accumulator variables?
Ans. We use accumulator variables to aggregate information through associative and commutative operations.
OR
Accumulator variables can only be used for operations on the data that are associative and commutative.
The accumulators can be created with or without a name. If the accumulators are created with a name, they can be viewed in Spark’s UI which will be useful to understand the progress of running stages.
Accumulators are created from an initial value v by calling SparkContext.accumulator(v).
Syntax:
acc = sc.accumulator(v)
  • Code
class pyspark.Accumulator(aid, value, accum_param)
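
A minimal sketch of an accumulator in PySpark, assuming an existing SparkContext sc:

acc = sc.accumulator(0)                                        # initial value v = 0
sc.parallelize([1, 2, 3, 4, 5]).foreach(lambda x: acc.add(x))  # tasks can only add to it
print(acc.value)                                               # 15, readable only on the driver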

