TOP PYSPARK INTERVIEW QUESTION 2023

What is Apache Spark and how does it differ from Hadoop? What are the benefits of using Spark over MapReduce? What is a Spark RDD and what operations can be performed on it? How does Spark handle fault-tolerance and data consistency? Explain the difference between Spark transformations and actions. What is a Spark DataFrame and how is it different from an RDD? What is Spark SQL and how does it work? How can you optimize a Spark job to improve its performance? How does Spark handle memory management and garbage collection? Explain the role of Spark Driver and Executors. What is PySpark and how does it differ from Apache Spark? How do you create a SparkContext in PySpark? What is the purpose of SparkContext? What is RDD (Resilient Distributed Dataset)? How is it different from DataFrame and Dataset? What are the different ways to create RDD in PySpark? What is the use of persist() method in PySpark? How does it differ from cache() method? What is the use of broadcast variables in PySpark...

Spark SQL “case when” and “when otherwise”

Is Apache Spark going to replace Hadoop? | Aptuz Technology Solutions
 
Like SQL “case when” statement and Swith statement from popular programming languages, Spark SQL Dataframe also supports similar syntax using “when otherwise” or we can also use “case when” statement. So let’s see an example on how to check for multiple conditions and replicate SQL CASE statement.
  • Using “when otherwise” on DataFrame.
  • Using “case when” on DataFrame.
  • Using && and || operator
First Let’s do the imports that are needed and create spark context and DataFrame.

import org.apache.spark.sql.functions.{when, _}
val spark: SparkSession = SparkSession.builder()
      .master("local[1]")
      .appName("SparkByExamples.com")
      .getOrCreate()

import spark.sqlContext.implicits._
val data = List(("James","","Smith","36636","M",60000),
        ("Michael","Rose","","40288","M",70000),
        ("Robert","","Williams","42114","",400000),
        ("Maria","Anne","Jones","39192","F",500000),
        ("Jen","Mary","Brown","","F",0))

val cols = Seq("first_name","middle_name","last_name","dob","gender","salary")
val df = spark.createDataFrame(data).toDF(cols:_*)

1. Using “when otherwise” on Spark DataFrame.

when is a Spark function, so to use it first we should import using import org.apache.spark.sql.functions.when before. Above code snippet replaces the value of gender with newderived value. when value not qualified with the condition, we are assigning “Unknown” as value.

val df2 = df.withColumn("new_gender", when(col("gender") === "M","Male")
      .when(col("gender") === "F","Female")
      .otherwise("Unknown"))
when can also be used on Spark SQL select statement.

val df4 = df.select(col("*"), when(col("gender") === "M","Male")
      .when(col("gender") === "F","Female")
      .otherwise("Unknown").alias("new_gender"))

2. Using “case when” on Spark DataFrame.

Similar to SQL syntax, we could use “case when” with expression expr() .

val df3 = df.withColumn("new_gender", 
      expr("case when gender = 'M' then 'Male' " +
                       "when gender = 'F' then 'Female' " +
                       "else 'Unknown' end"))
Using within SQL select.

val df4 = df.select(col("*"),
      expr("case when gender = 'M' then 'Male' " +
                       "when gender = 'F' then 'Female' " +
                       "else 'Unknown' end").alias("new_gender"))

3. Using && and || operator

We can also use and (&&) or (||) within when function. To explain this I will use a new set of data to make it simple.

val dataDF = Seq(
      (66, "a", "4"), (67, "a", "0"), (70, "b", "4"), (71, "d", "4"
      )).toDF("id", "code", "amt")
dataDF.withColumn("new_column",
       when(col("code") === "a" || col("code") === "d", "A")
      .when(col("code") === "b" && col("amt") === "4", "B")
      .otherwise("A1"))
      .show()
Output:

+---+----+---+----------+
| id|code|amt|new_column|
+---+----+---+----------+
| 66|   a|  4|         A|
| 67|   a|  0|         A|
| 70|   b|  4|         B|
| 71|   d|  4|         A|
+---+----+---+----------+

Conclusion:

In this article, we have learned how to use spark “case when” using expr() function and “when otherwise” function on Dataframe also, we’ve learned how to use these functions with && and || logical operators. I hope you like this article.
Happy Learning !!

Comments

Popular posts from this blog

Top Hive Commands with Examples

SPARK : Ways to Rename column on Spark DataFrame