TOP PYSPARK INTERVIEW QUESTIONS 2023

1. What is Apache Spark and how does it differ from Hadoop?
2. What are the benefits of using Spark over MapReduce?
3. What is a Spark RDD and what operations can be performed on it?
4. How does Spark handle fault-tolerance and data consistency?
5. Explain the difference between Spark transformations and actions.
6. What is a Spark DataFrame and how is it different from an RDD?
7. What is Spark SQL and how does it work?
8. How can you optimize a Spark job to improve its performance?
9. How does Spark handle memory management and garbage collection?
10. Explain the role of the Spark Driver and Executors.
11. What is PySpark and how does it differ from Apache Spark?
12. How do you create a SparkContext in PySpark? What is the purpose of SparkContext?
13. What is an RDD (Resilient Distributed Dataset)? How is it different from DataFrame and Dataset?
14. What are the different ways to create an RDD in PySpark?
15. What is the use of the persist() method in PySpark? How does it differ from the cache() method?
16. What is the use of broadcast variables in PySpark...

How does Spark read a large file (petabytes) when the file cannot fit in Spark's main memory?

What will happen with large files in these cases?
1) Spark gets the data's block locations from the NameNode. Will Spark stop at this point because the data size, as reported by the NameNode, is too large?
2) Spark partitions the data according to the DataNode block size, but not all of the data can fit in main memory. We are not using any StorageLevel here, so what happens?
3) Spark partitions the data and stores some of it in main memory; once the data in main memory has been processed, Spark loads the next batch from disk.
Ans. First of all, Spark only starts reading in the data when an action (like count, collect, or write) is called. Once an action is called, Spark loads the data in partitions; the number of concurrently loaded partitions depends on the number of cores you have available. So in Spark you can think of 1 partition = 1 core = 1 task. Note that all concurrently loaded partitions have to fit into memory, or you will get an OOM.
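To make the lazy-evaluation point concrete, here is a minimal PySpark sketch; the HDFS path is a hypothetical placeholder, and nothing is read until the count() action runs:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-read-demo").getOrCreate()

# Building the plan only -- no data is read at this point.
df = spark.read.text("hdfs:///data/huge_file")     # hypothetical path
upper = df.selectExpr("upper(value) AS value")     # transformation: still lazy

# The action triggers the actual read; partitions are loaded
# concurrently, roughly one per available core (1 partition = 1 task).
print(upper.count())

# How many partitions (and hence tasks) the read was split into:
print(upper.rdd.getNumPartitions())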
Assuming that you have several stages, Spark then runs the transformations from the first stage on the loaded partitions only. Once it has applied the transformations on the data in the loaded partitions, it stores the output as shuffle data and then reads in more partitions. It then applies the transformations on these partitions, stores the output as shuffle data, reads in more partitions, and so forth until all data has been read.
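As an illustration, a wide transformation such as groupBy introduces exactly this kind of shuffle boundary between stages. A hedged sketch (both paths are hypothetical placeholders):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stage-demo").getOrCreate()

df = spark.read.text("hdfs:///data/huge_file")     # hypothetical path

# Stage 1 (narrow): split each line into words, partition by partition.
words = df.select(F.explode(F.split("value", " ")).alias("word"))

# groupBy needs all rows for the same word on the same executor, so Spark
# writes the stage-1 output as shuffle data before stage 2 can start.
counts = words.groupBy("word").count()

counts.write.mode("overwrite").parquet("hdfs:///out/word_counts")  # hypothetical path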
If you apply no transformations but only do, for instance, a count, Spark will still read the data in partitions, but it will not store any data in your cluster, and if you do the count again it will read in all the data once again. To avoid reading the data several times, you can call cache or persist, in which case Spark will try to store the data in your cluster. With cache (which, for RDDs, is the same as persist(StorageLevel.MEMORY_ONLY)), it will store all partitions in memory; if they don't fit in memory you will get an OOM. If you call persist(StorageLevel.MEMORY_AND_DISK), it will store as much as it can in memory and the rest will be put on disk. If the data doesn't fit on disk either, the OS will usually kill your workers.
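A small sketch of the caching behavior described above, using the RDD API (the cache-equals-MEMORY_ONLY equivalence holds for RDDs; DataFrame.cache() defaults to MEMORY_AND_DISK). Again, the path is a hypothetical placeholder:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-demo").getOrCreate()
rdd = spark.sparkContext.textFile("hdfs:///data/huge_file")  # hypothetical path

# Keep what fits in memory, spill the remaining partitions to disk.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

rdd.count()      # first action: reads from the source and fills the cache
rdd.count()      # second action: served from the memory/disk cache, no re-read
rdd.unpersist()  # free the cached partitions when done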
Note that Spark has its own little memory management system. Some of the memory that you assign to your Spark job is used to hold the data being worked on, and some of the memory is used for storage if you call cache or persist.
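This split is governed by Spark's unified memory manager. A hedged sketch of the relevant settings; the values shown are simply Spark's documented defaults, not a tuning recommendation:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-demo")
    # Fraction of (JVM heap - 300 MB reserved) shared between execution
    # (shuffles, joins, aggregations) and storage (cache/persist); default 0.6.
    .config("spark.memory.fraction", "0.6")
    # Share of that unified region protected for storage, i.e. cached
    # blocks that execution cannot evict; default 0.5.
    .config("spark.memory.storageFraction", "0.5")
    .getOrCreate()
)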
I hope this explanation helps :)


