Spark: Ways to Rename Columns on a Spark DataFrame
We often need to rename one or more columns on a Spark DataFrame, and this becomes more involved when the column is nested. Let's discuss the possible ways to rename a column, with Scala examples.
Though the examples here are in Scala, the same concepts can be used in PySpark (Python Spark) to rename DataFrame columns.
Using withColumnRenamed – To rename Spark DataFrame column name
Using withColumnRenamed – To rename multiple columns
Using StructType – To rename nested column on Spark DataFrame
Using Select – To rename nested columns
Using withColumn – To rename nested columns
Using col() function – To Dynamically rename all or multiple columns
Using toDF() – To rename all or multiple columns
First, let's create the data for our examples. We use the Row class because we will convert this data to a Spark DataFrame.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
import org.apache.spark.sql.functions.col
val data = Seq(Row(Row("James ", "", "Smith"), "36636", "M", 3000),
  Row(Row("Michael ", "Rose", ""), "40288", "M", 4000),
  Row(Row("Robert ", "", "Williams"), "42114", "M", 4000),
  Row(Row("Maria ", "Anne", "Jones"), "39192", "F", 4000),
  Row(Row("Jen", "Mary", "Brown"), "", "F", -1))
Our base schema with nested structure.
val schema = new StructType()
  .add("name", new StructType()
    .add("firstname", StringType).add("middlename", StringType).add("lastname", StringType))
  .add("dob", StringType).add("gender", StringType).add("salary", IntegerType)
Let's create the DataFrame by calling parallelize on the data and providing the above schema.
val df = spark.createDataFrame(spark.sparkContext.parallelize(data),schema)
df.printSchema()
Below is our schema structure. I am not printing data here as it is not necessary for our examples. This schema has a nested structure.
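For reference, df.printSchema() on this DataFrame should print a structure along these lines (reproduced from the schema defined above):
root
 |-- name: struct (nullable = true)
 |    |-- firstname: string (nullable = true)
 |    |-- middlename: string (nullable = true)
 |    |-- lastname: string (nullable = true)
 |-- dob: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- salary: integer (nullable = true)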
1. Using Spark withColumnRenamed – To rename DataFrame column name
Spark provides a withColumnRenamed function on DataFrame to change a column name. This is the most straightforward approach; the function takes two parameters: the first is the existing column name and the second is the new column name you want.
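For example, the statement below renames dob to DateOfBirth (the variable name dfRenamed is just illustrative):
val dfRenamed = df.withColumnRenamed("dob", "DateOfBirth")
dfRenamed.printSchema()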
The above statement changes the column "dob" to "DateOfBirth" on the Spark DataFrame. Note that withColumnRenamed returns a new DataFrame and doesn't modify the current one.
2. Using withColumnRenamed – To rename multiple columns
To change multiple column names, chain withColumnRenamed calls as shown below.
val df2 = df.withColumnRenamed("dob", "DateOfBirth")
  .withColumnRenamed("salary", "salary_amount")
df2.printSchema()
This creates a new DataFrame df2 with the dob and salary columns renamed.
3. Using Spark StructType – To rename a nested column in DataFrame
Changing a column name on nested data is not straightforward. We can do it by creating a new schema with the new column names using StructType and applying it to the nested column with the cast function, as shown below.
val schema2 = new StructType()
  .add("fname", StringType).add("middlename", StringType).add("lname", StringType)
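Casting the nested name column to this new struct renames its inner fields in place. A minimal sketch, assuming the df created above (the variable name df3 is illustrative):
val df3 = df.select(col("name").cast(schema2),
  col("dob"), col("gender"), col("salary"))
df3.printSchema()
4. Using Spark DataFrame select() – To rename nested columns
As listed above, we can also rename (and flatten) nested columns with select() and alias. A minimal sketch on the same df (the aliases and variable name are illustrative):
val dfSel = df.select(col("name.firstname").as("fname"),
  col("name.middlename").as("mname"),
  col("name.lastname").as("lname"),
  col("dob"), col("gender"), col("salary"))
dfSel.printSchema()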
5. Using Spark DataFrame withColumn – To rename nested columns
When you have nested columns on a Spark DataFrame and want to rename them, use withColumn on the DataFrame to create new columns from the existing nested fields, and then drop the original column. The example below creates fname, mname, and lname columns from name.firstname, name.middlename, and name.lastname, and drops the name column.
val df4 = df.withColumn("fname", col("name.firstname"))
  .withColumn("mname", col("name.middlename"))
  .withColumn("lname", col("name.lastname"))
  .drop("name")
df4.printSchema()
6. Using col() function – To Dynamically rename all or multiple columns
Another way to change all or multiple column names on a DataFrame is to use the col() function together with a mapping of old to new names, as sketched below.
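A minimal sketch of this approach, assuming the flattened df4 from the previous example (the old/new name lists and variable names are illustrative):
val oldColumns = Seq("dob", "gender", "salary", "fname", "mname", "lname")
val newColumns = Seq("DateOfBirth", "Sex", "salary_amount", "first_name", "middle_name", "last_name")
val renamedCols = oldColumns.zip(newColumns).map { case (oldName, newName) => col(oldName).as(newName) }
val df5 = df4.select(renamedCols: _*)
df5.printSchema()
7. Using toDF() – To rename all or multiple columns
Finally, toDF() renames all columns at once by taking a new name for every column, in order. A minimal sketch, again assuming df4 (the new names are illustrative):
val df6 = df4.toDF("DateOfBirth", "Sex", "salary_amount", "first_name", "middle_name", "last_name")
df6.printSchema()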
This article explains different ways to rename a single column, multiple columns, all columns, and nested columns on a Spark DataFrame. Besides what is explained here, we can also change column names using Spark SQL, and the same concepts can be applied in PySpark.