TOP PYSPARK INTERVIEW QUESTIONS 2023

1. What is Apache Spark and how does it differ from Hadoop?
2. What are the benefits of using Spark over MapReduce?
3. What is a Spark RDD and what operations can be performed on it?
4. How does Spark handle fault-tolerance and data consistency?
5. Explain the difference between Spark transformations and actions.
6. What is a Spark DataFrame and how is it different from an RDD?
7. What is Spark SQL and how does it work?
8. How can you optimize a Spark job to improve its performance?
9. How does Spark handle memory management and garbage collection?
10. Explain the role of the Spark Driver and Executors.
11. What is PySpark and how does it differ from Apache Spark?
12. How do you create a SparkContext in PySpark? What is the purpose of SparkContext?
13. What is an RDD (Resilient Distributed Dataset)? How is it different from DataFrame and Dataset?
14. What are the different ways to create an RDD in PySpark?
15. What is the use of the persist() method in PySpark? How does it differ from the cache() method?
16. What is the use of broadcast variables in PySpark...

How does Spark read a large file (petabytes) when the file cannot fit in Spark's main memory?

What will happen with large files in these cases?
1) Spark gets the data's block locations from the NameNode. Will Spark stop at this point because the data size, as reported by the NameNode, is too large?
2) Spark partitions the data according to the DataNode block size, but not all of the data can fit in main memory. We are not using any StorageLevel here, so what happens?
3) Spark partitions the data and stores some of it in main memory; once the data in main memory has been processed, Spark loads the next batch from disk.
Ans. First of all, Spark only starts reading in the data when an action (like count, collect, or write) is called. Once an action is called, Spark loads the data in partitions; the number of concurrently loaded partitions depends on the number of cores you have available. So in Spark you can think of 1 partition = 1 core = 1 task. Note that all concurrently loaded partitions have to fit into memory, or you will get an OOM.
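To make the lazy-evaluation point concrete, here is a minimal PySpark sketch; the HDFS path is a hypothetical placeholder, and nothing is read until the count() action runs:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-read-demo").getOrCreate()

# Building the plan only -- no data is read at this point.
df = spark.read.text("hdfs:///data/huge_file")     # hypothetical path
upper = df.selectExpr("upper(value) AS value")     # transformation: still lazy

# The action triggers the actual read; partitions are loaded
# concurrently, roughly one per available core (1 partition = 1 task).
print(upper.count())

# How many partitions (and hence tasks) the read was split into:
print(upper.rdd.getNumPartitions())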
Assuming that you have several stages, Spark then runs the transformations from the first stage on the loaded partitions only. Once it has applied the transformations on the data in the loaded partitions, it stores the output as shuffle data and then reads in more partitions. It then applies the transformations on these partitions, stores the output as shuffle data, reads in more partitions, and so forth until all data has been read.
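As an illustration, a wide transformation such as groupBy introduces exactly this kind of shuffle boundary between stages. A hedged sketch (both paths are hypothetical placeholders):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stage-demo").getOrCreate()

df = spark.read.text("hdfs:///data/huge_file")     # hypothetical path

# Stage 1 (narrow): split each line into words, partition by partition.
words = df.select(F.explode(F.split("value", " ")).alias("word"))

# groupBy needs all rows for the same word on the same executor, so Spark
# writes the stage-1 output as shuffle data before stage 2 can start.
counts = words.groupBy("word").count()

counts.write.mode("overwrite").parquet("hdfs:///out/word_counts")  # hypothetical path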
If you apply no transformations but only do, for instance, a count, Spark will still read the data in partitions, but it will not store any data in your cluster, and if you do the count again it will read in all the data once again. To avoid reading the data several times, you can call cache or persist, in which case Spark will try to store the data in your cluster. With cache (which, for RDDs, is the same as persist(StorageLevel.MEMORY_ONLY)), it will store all partitions in memory; if they don't fit in memory you will get an OOM. If you call persist(StorageLevel.MEMORY_AND_DISK), it will store as much as it can in memory and the rest will be put on disk. If the data doesn't fit on disk either, the OS will usually kill your workers.
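A small sketch of the caching behavior described above, using the RDD API (the cache-equals-MEMORY_ONLY equivalence holds for RDDs; DataFrame.cache() defaults to MEMORY_AND_DISK). Again, the path is a hypothetical placeholder:

from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persist-demo").getOrCreate()
rdd = spark.sparkContext.textFile("hdfs:///data/huge_file")  # hypothetical path

# Keep what fits in memory, spill the remaining partitions to disk.
rdd.persist(StorageLevel.MEMORY_AND_DISK)

rdd.count()      # first action: reads from the source and fills the cache
rdd.count()      # second action: served from the memory/disk cache, no re-read
rdd.unpersist()  # free the cached partitions when done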
Note that Spark has its own little memory management system. Some of the memory that you assign to your Spark job is used to hold the data being worked on, and some of the memory is used for storage if you call cache or persist.
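This split is governed by Spark's unified memory manager. A hedged sketch of the relevant settings; the values shown are simply Spark's documented defaults, not a tuning recommendation:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("memory-demo")
    # Fraction of (JVM heap - 300 MB reserved) shared between execution
    # (shuffles, joins, aggregations) and storage (cache/persist); default 0.6.
    .config("spark.memory.fraction", "0.6")
    # Share of that unified region protected for storage, i.e. cached
    # blocks that execution cannot evict; default 0.5.
    .config("spark.memory.storageFraction", "0.5")
    .getOrCreate()
)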
I hope this explanation helps :)


