Spark: Ways to Rename Columns on a Spark DataFrame
We often need to rename one or more columns on a Spark DataFrame, and this becomes more involved when the column is nested. Let's discuss the possible ways to rename a column, with Scala examples.
Though the examples here are in Scala, the same concepts can be used in PySpark (Python Spark) to rename DataFrame columns.
Using withColumnRenamed – To rename Spark DataFrame column name
Using withColumnRenamed – To rename multiple columns
Using StructType – To rename nested column on Spark DataFrame
Using Select – To rename nested columns
Using withColumn – To rename nested columns
Using col() function – To Dynamically rename all or multiple columns
Using toDF() – To rename all or multiple columns
First, let's create the data for our examples. We use the Row class because we will convert this data to a Spark DataFrame.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
import org.apache.spark.sql.functions.col
val data = Seq(Row(Row("James ", "", "Smith"), "36636", "M", 3000),
  Row(Row("Michael ", "Rose", ""), "40288", "M", 4000),
  Row(Row("Robert ", "", "Williams"), "42114", "M", 4000),
  Row(Row("Maria ", "Anne", "Jones"), "39192", "F", 4000),
  Row(Row("Jen", "Mary", "Brown"), "", "F", -1))
Our base schema with nested structure.
val schema = new StructType()
  .add("name", new StructType()
    .add("firstname", StringType).add("middlename", StringType).add("lastname", StringType))
  .add("dob", StringType).add("gender", StringType).add("salary", IntegerType)
Let's create the DataFrame by calling parallelize on the data and providing the above schema.
val df = spark.createDataFrame(spark.sparkContext.parallelize(data),schema)
df.printSchema()
Below is our schema structure. I am not printing data here as it is not necessary for our examples. This schema has a nested structure.
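For reference, df.printSchema() on this DataFrame should print a structure along these lines (reproduced from the schema defined above):
root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)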
1. Using Spark withColumnRenamed – To rename DataFrame column name
Spark provides a withColumnRenamed function on DataFrame to change a column name. This is the most straightforward approach; the function takes two parameters: the first is the existing column name and the second is the new column name you want.
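For example, the statement below renames dob to DateOfBirth (the variable name dfRenamed is just illustrative):
val dfRenamed = df.withColumnRenamed("dob", "DateOfBirth")
dfRenamed.printSchema()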
The above statement changes the column "dob" to "DateOfBirth" on the Spark DataFrame. Note that withColumnRenamed returns a new DataFrame and doesn't modify the current one.
2. Using withColumnRenamed – To rename multiple columns
To change multiple column names, chain withColumnRenamed calls as shown below.
val df2 = df.withColumnRenamed("dob", "DateOfBirth")
  .withColumnRenamed("salary", "salary_amount")
df2.printSchema()
This creates a new DataFrame df2 with the dob and salary columns renamed.
3. Using Spark StructType – To rename a nested column in DataFrame
Changing a column name on nested data is not straightforward. We can do it by creating a new schema with the new column names using StructType and applying it to the nested column with the cast function, as shown below.
val schema2 = new StructType()
  .add("fname", StringType).add("middlename", StringType).add("lname", StringType)
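Casting the nested name column to this new struct renames its inner fields in place. A minimal sketch, assuming the df created above (the variable name df3 is illustrative):
val df3 = df.select(col("name").cast(schema2),
  col("dob"), col("gender"), col("salary"))
df3.printSchema()
4. Using Spark DataFrame select() – To rename nested columns
As listed above, we can also rename (and flatten) nested columns with select() and alias. A minimal sketch on the same df (the aliases and variable name are illustrative):
val dfSel = df.select(col("name.firstname").as("fname"),
  col("name.middlename").as("mname"),
  col("name.lastname").as("lname"),
  col("dob"), col("gender"), col("salary"))
dfSel.printSchema()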
5. Using Spark DataFrame withColumn – To rename nested columns
When you have nested columns on a Spark DataFrame and want to rename them, use withColumn on the DataFrame to create new columns from the existing nested fields, and then drop the original column. The example below creates fname, mname, and lname columns from name.firstname, name.middlename, and name.lastname, and drops the name column.
val df4 = df.withColumn("fname", col("name.firstname"))
  .withColumn("mname", col("name.middlename"))
  .withColumn("lname", col("name.lastname"))
  .drop("name")
df4.printSchema()
6. Using col() function – To Dynamically rename all or multiple columns
Another way to change all or multiple column names on a DataFrame is to use the col() function together with a mapping of old to new names, as sketched below.
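A minimal sketch of this approach, assuming the flattened df4 from the previous example (the old/new name lists and variable names are illustrative):
val oldColumns = Seq("dob", "gender", "salary", "fname", "mname", "lname")
val newColumns = Seq("DateOfBirth", "Sex", "salary_amount", "first_name", "middle_name", "last_name")
val renamedCols = oldColumns.zip(newColumns).map { case (oldName, newName) => col(oldName).as(newName) }
val df5 = df4.select(renamedCols: _*)
df5.printSchema()
7. Using toDF() – To rename all or multiple columns
Finally, toDF() renames all columns at once by taking a new name for every column, in order. A minimal sketch, again assuming df4 (the new names are illustrative):
val df6 = df4.toDF("DateOfBirth", "Sex", "salary_amount", "first_name", "middle_name", "last_name")
df6.printSchema()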
This article explains different ways to rename a single column, multiple columns, all columns, and nested columns on a Spark DataFrame. Besides what is explained here, we can also change column names using Spark SQL, and the same concepts can be applied in PySpark.