TOP PYSPARK INTERVIEW QUESTIONS 2023

  • What is Apache Spark and how does it differ from Hadoop?
  • What are the benefits of using Spark over MapReduce?
  • What is a Spark RDD and what operations can be performed on it?
  • How does Spark handle fault tolerance and data consistency?
  • Explain the difference between Spark transformations and actions.
  • What is a Spark DataFrame and how is it different from an RDD?
  • What is Spark SQL and how does it work?
  • How can you optimize a Spark job to improve its performance?
  • How does Spark handle memory management and garbage collection?
  • Explain the role of the Spark Driver and Executors.
  • What is PySpark and how does it differ from Apache Spark?
  • How do you create a SparkContext in PySpark? What is the purpose of SparkContext?
  • What is RDD (Resilient Distributed Dataset)? How is it different from DataFrame and Dataset?
  • What are the different ways to create RDDs in PySpark?
  • What is the use of the persist() method in PySpark? How does it differ from the cache() method?
  • What is the use of broadcast variables in PySpark...

Top PySpark Interview Questions and Answers

In this PySpark article, we will go through the most frequently asked PySpark interview questions and answers. These interview questions will help both freshers and experienced candidates. Moreover, you will get a guide on how to crack a PySpark interview. Follow each link for better understanding.
So, let’s start with the PySpark interview questions.


What is PySpark?
Ans. PySpark helps data scientists interface with Resilient Distributed Datasets in Apache Spark from Python. Py4J is a popular library integrated within PySpark that lets Python interface dynamically with JVM objects (RDDs).

Is PySpark a language?
Ans. No. Spark refers to the Apache Spark distributed computing framework, originally accessible using the Scala programming language. PySpark is the interface that gives access to Spark using the Python programming language.

What is RDD in PySpark?
Ans. A Resilient Distributed Dataset (RDD) is the basic abstraction in Spark. It represents an immutable, partitioned collection of elements that can be operated on in parallel. In PySpark it is exposed as:

class pyspark.RDD(jrdd, ctx, jrdd_deserializer=AutoBatchedSerializer(PickleSerializer()))

Spark revolves around the concept of a resilient distributed dataset (RDD), which is a fault-tolerant collection of elements that can be operated on in parallel. There are two ways to create RDDs: parallelizing an existing collection in your driver program, or referencing a dataset in an external storage system, such as a shared filesystem, HDFS, HBase, or any data source offering a Hadoop InputFormat.

data = [1, 2, 3, 4, 5]
distData = sc.parallelize(data)  # distribute the local list into an RDD
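
The second approach, referencing a dataset in external storage, looks like the following minimal sketch ("data.txt" is a placeholder path; any file on a shared filesystem, HDFS, etc. works):

distFile = sc.textFile("data.txt")  # each line of the file becomes one element of the RDD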

RDD Operations

RDDs support two types of operations: transformations, which create a new dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. For example, map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. On the other hand, reduce is an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).

All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program. This design enables Spark to run more efficiently. For example, we can realize that a dataset created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset.
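
As a minimal sketch of these two kinds of operations (the input strings are arbitrary):

lines = sc.parallelize(["spark", "interview", "questions"])
lineLengths = lines.map(lambda s: len(s))             # transformation: recorded lazily, nothing runs yet
totalLength = lineLengths.reduce(lambda a, b: a + b)  # action: triggers the actual computation
print(totalLength)                                    # 5 + 9 + 9 = 23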

By default, each transformed RDD may be recomputed each time you run an action on it. However, you may also persist an RDD in memory using the persist (or cache) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting RDDs on disk, or replicated across multiple nodes.
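
A minimal sketch of caching, assuming the SparkContext sc from above:

lengths = sc.parallelize(range(1000)).map(lambda x: x * 2)
lengths.cache()                            # keeps the RDD at the MEMORY_ONLY storage level
print(lengths.count())                     # first action computes and caches the partitions
print(lengths.reduce(lambda a, b: a + b))  # later actions reuse the cached data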



What is Py4J?
Ans. Py4J is a bidirectional bridge between Python and Java. It enables Python programs running in a Python interpreter to dynamically access Java objects in a JVM.
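
A minimal sketch of plain Py4J usage, independent of Spark (it assumes a py4j GatewayServer is already running on the Java side):

from py4j.java_gateway import JavaGateway

gateway = JavaGateway()                   # connect to the JVM through the gateway server
random = gateway.jvm.java.util.Random()   # instantiate a java.util.Random object inside the JVM
print(random.nextInt(10))                 # call a method on the Java object from Python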


What do you mean by PySpark SparkContext?
Ans. In simple words, an entry point to any Spark functionality is what we call SparkContext. When it comes to PySpark, SparkContext uses the Py4J library to launch a JVM and create a JavaSparkContext. By default, PySpark makes the SparkContext available as ‘sc’.
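
A minimal sketch of creating a SparkContext in a standalone script (in the PySpark shell, sc is already created for you; the master URL and app name below are illustrative):

from pyspark import SparkContext

sc = SparkContext("local[*]", "pyspark-interview-demo")  # master and app name are placeholders
print(sc.version)                                        # confirm the context is up
sc.stop()
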
Explain PySpark SparkConf?
Ans. We mainly use SparkConf to set the configurations and parameters needed to run a Spark application locally or on a cluster. In other words, SparkConf offers the configuration for running a Spark application; a usage sketch follows the class signature below.
  • Code
class pyspark.SparkConf(
    loadDefaults=True,
    _jvm=None,
    _jconf=None
)
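
A minimal sketch of typical SparkConf usage (the app name, master URL, and property value are illustrative):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("pyspark-interview-demo").setMaster("local[2]")
conf.set("spark.executor.memory", "1g")   # any spark.* property can be set this way
sc = SparkContext(conf=conf)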


Name different storage levels.
Ans. The different storage levels are listed below; the sketch after the list shows how to apply one:
  • DISK_ONLY 
  • DISK_ONLY_2 
  • MEMORY_AND_DISK 
  • MEMORY_AND_DISK_2 
  • MEMORY_AND_DISK_SER 
  • MEMORY_AND_DISK_SER_2 
  • MEMORY_ONLY 
  • MEMORY_ONLY_2
  • MEMORY_ONLY_SER 
  • MEMORY_ONLY_SER_2 
  • OFF_HEAP
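
A minimal sketch of applying a storage level explicitly, assuming an existing SparkContext sc:

from pyspark import StorageLevel

rdd = sc.parallelize(range(100))
rdd.persist(StorageLevel.MEMORY_AND_DISK)  # spill partitions to disk when memory is insufficient
print(rdd.count())                         # first action materialises the persisted RDD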

What do you mean by Broadcast variables?
Ans. In order to save a copy of data across all nodes, a broadcast variable is created with SparkContext.broadcast().
OR
Broadcast variables in Apache Spark are a mechanism for sharing read-only variables across executors. Without broadcast variables, these variables would be shipped to each executor for every transformation and action, and this can cause network overhead.
OR
Broadcast variables are read-only variables that are distributed across worker nodes in-memory instead of shipping a copy of data with tasks.
Broadcast variables are mostly used when the tasks across multiple stages require the same data or when caching the data in the deserialized form is required.
Broadcast variables are created using a variable v by calling SparkContext.broadcast(v).
The Broadcast variable is a wrapper around v, and its value can be accessed by calling the value method.
The data broadcasted this way is cached in a serialized form and deserialized before running each task.
For example:
>>> from pyspark.context import SparkContext
>>> sc = SparkContext('local', 'test')
>>> b = sc.broadcast([1, 2, 3, 4, 5])
>>> b.value
[1, 2, 3, 4, 5]
>>> sc.parallelize([0, 0]).flatMap(lambda x: b.value).collect()
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
>>> b.unpersist()
>>> large_broadcast = sc.broadcast(range(10000))

What are Accumulator variables?
Ans. We use accumulator variables to aggregate information through associative and commutative operations.
OR
Accumulator variables can only be used for operations on the data that are associative and commutative.
The accumulators can be created with or without a name. If the accumulators are created with a name, they can be viewed in Spark’s UI which will be useful to understand the progress of running stages.
Accumulators are created from an initial value v by calling SparkContext.accumulator(v).
Syntax:
acc = sc.accumulator(v)
  • Code
class pyspark.Accumulator(aid, value, accum_param)
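
A minimal sketch of an accumulator in PySpark, assuming an existing SparkContext sc:

acc = sc.accumulator(0)                                        # initial value v = 0
sc.parallelize([1, 2, 3, 4, 5]).foreach(lambda x: acc.add(x))  # tasks can only add to it
print(acc.value)                                               # 15, readable only on the driver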

