TOP PYSPARK INTERVIEW QUESTIONS 2023

What is Apache Spark and how does it differ from Hadoop?
What are the benefits of using Spark over MapReduce?
What is a Spark RDD and what operations can be performed on it?
How does Spark handle fault tolerance and data consistency?
Explain the difference between Spark transformations and actions.
What is a Spark DataFrame and how is it different from an RDD?
What is Spark SQL and how does it work?
How can you optimize a Spark job to improve its performance?
How does Spark handle memory management and garbage collection?
Explain the role of the Spark Driver and Executors.
What is PySpark and how does it differ from Apache Spark?
How do you create a SparkContext in PySpark? What is the purpose of SparkContext?
What is an RDD (Resilient Distributed Dataset)? How is it different from a DataFrame and a Dataset?
What are the different ways to create an RDD in PySpark?
What is the use of the persist() method in PySpark? How does it differ from the cache() method?
What is the use of broadcast variables in PySpark...

How Spark Internally Executes a Program

In this article, I will try to explain how Spark works internally and what the components of execution are: jobs, stages, and tasks.
As we all know, Spark gives us two kinds of operations for solving any problem: transformations and actions.
When we do a transformation on any RDD, it gives us a new RDD, but it does not start executing those transformations. Execution happens only when an action is performed on the resulting RDD, and that is what produces the final result.
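For example, here is a minimal sketch (using a made-up dataset created with sc.parallelize) that shows this laziness in practice:

 val nums    = sc.parallelize(1 to 1000000)   // create an RDD; no job runs yet
 val squared = nums.map(n => n * n)           // transformation: only records the lineage, nothing executes
 val total   = squared.count()                // action: only now does Spark actually launch a job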
So once you perform any action on an RDD, the Spark context submits your program to the driver.
The driver creates the DAG (directed acyclic graph) or execution plan (job) for your program. Once the DAG is created, the driver divides this DAG into a number of stages. These stages are then divided into smaller tasks and all the tasks are given to the executors for execution.
The Spark driver is responsible for converting a user program into units of physical execution called tasks. At a high level, all Spark programs follow the same structure. They create RDDs from some input, derive new RDDs from those using transformations, and perform actions to collect or save data. A Spark program implicitly creates a logical directed acyclic graph (DAG) of operations.
When the driver runs, it converts this logical graph into a physical execution plan.
So, let's take an example of word count for better understanding:
 val rdd = sc.textFile("address of your file")                            // read the input file as an RDD of lines
 rdd.flatMap(_.split(" ")).map(x => (x, 1)).reduceByKey(_ + _).collect()  // split into words, count them, and collect the result
Here you can see that collect() is an action that gathers all the data and returns the final result. As explained above, when I perform the collect action, the Spark driver creates a DAG.

In the image above, you can see that one job is created and executed successfully. Now, let's have a look at DAG and its stages.

Here, you can see that Spark created the DAG for the program written above and divided the DAG into two stages.
In this DAG, you can see a clear picture of the program. First, the text file is read. Then, the transformations flatMap and map are applied. Finally, reduceByKey is executed.
But why did Spark divide this program into two stages? Why not more or fewer than two? Basically, it depends on shuffling: whenever you perform a transformation where Spark needs to shuffle data by communicating with other partitions (a wide transformation such as reduceByKey), it creates a new stage for that transformation. A transformation that does not require shuffling your data (a narrow transformation such as map or flatMap) stays within a single stage.
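You can see where these stage boundaries fall by printing the RDD lineage with toDebugString; in its output, the indented block introduced by the shuffle marks where a new stage begins. The file path below is just a placeholder:

 val counts = sc.textFile("address of your file")
   .flatMap(_.split(" "))          // narrow: no shuffle, same stage
   .map(x => (x, 1))               // narrow: same stage
   .reduceByKey(_ + _)             // wide: needs a shuffle, so a new stage starts here
 println(counts.toDebugString)     // prints the lineage; the indentation shows the shuffle boundary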
Now, let's have a look at how many tasks have been created by Spark:

As I mentioned earlier, the Spark driver divides DAG stages into tasks. Here, you can see that each stage is divided into two tasks.
But why did Spark create only two tasks for each stage? It depends on the number of partitions.
In this program, we have only two partitions, so each stage is divided into two tasks, and a single task runs on a single partition. The total number of tasks for a job is roughly:
 (number of stages) * (number of partitions)
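To see or control this yourself, you can check the partition count of an RDD and change it with repartition. A quick sketch (the path and the numbers are only illustrative):

 val rdd = sc.textFile("address of your file", minPartitions = 2)
 println(rdd.getNumPartitions)    // e.g. 2 -> each stage of the word count runs 2 tasks

 val wider = rdd.repartition(8)   // repartitioning triggers a shuffle
 println(wider.getNumPartitions)  // 8 -> stages after this point run 8 tasks each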
Now, I think you may have a clear picture of how Spark works internally.
