TOP PYSPARK INTERVIEW QUESTIONS 2023

  • What is Apache Spark and how does it differ from Hadoop?
  • What are the benefits of using Spark over MapReduce?
  • What is a Spark RDD and what operations can be performed on it?
  • How does Spark handle fault tolerance and data consistency?
  • Explain the difference between Spark transformations and actions.
  • What is a Spark DataFrame and how is it different from an RDD?
  • What is Spark SQL and how does it work?
  • How can you optimize a Spark job to improve its performance?
  • How does Spark handle memory management and garbage collection?
  • Explain the role of the Spark Driver and Executors.
  • What is PySpark and how does it differ from Apache Spark?
  • How do you create a SparkContext in PySpark? What is the purpose of SparkContext?
  • What is an RDD (Resilient Distributed Dataset)? How is it different from DataFrame and Dataset?
  • What are the different ways to create an RDD in PySpark?
  • What is the use of the persist() method in PySpark? How does it differ from the cache() method?
  • What is the use of broadcast variables in PySpark?
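Several of these questions can be answered with a few lines of code. Here is a minimal PySpark sketch (the data and names are invented for illustration) showing a SparkContext, the difference between lazy transformations and actions, and persist() versus cache():

from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "InterviewExamples")

# Creating an RDD from a Python collection
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Transformations are lazy: nothing is computed yet
squares = rdd.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Actions trigger the actual computation
print(evens.collect())   # [4, 16]
print(squares.count())   # 5

# cache() is shorthand for persist() with the default memory-only
# storage level; persist() lets you choose the level explicitly
squares.persist(StorageLevel.MEMORY_AND_DISK)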

TOP 50 AWS Glue Interview Questions

 

What is AWS Glue?

AWS Glue helps in preparing data for analysis through automated extract, transform, and load (ETL) processes. It supports MySQL, Microsoft SQL Server, and PostgreSQL databases that run on Amazon EC2 (Elastic Compute Cloud) instances in an Amazon VPC (Virtual Private Cloud).
AWS Glue is an extract, transform, and load service that automates the time-consuming steps of data preparation for analytics.

What are the Benefits of AWS Glue?

Benefits of AWS Glue are as follows:
  • Fault Tolerance - Failed AWS Glue jobs can be retried, and their logs can be inspected for debugging.
  • Filtering - AWS Glue can filter out bad data.
  • Maintenance and Deployment - No maintenance or deployment effort is required, as the service is fully managed by AWS.

What are the components used by AWS Glue?



AWS Glue consists of:
  • Data Catalog is a Central Metadata Repository.
  • ETL Engine automatically generates Python or Scala code.
  • Flexible Scheduler handles dependency resolution, job monitoring, and retrying.
  • AWS Glue DataBrew helps in normalizing and cleaning data with a visual interface.
  • AWS Glue Elastic Views helps in replicating and combining data across multiple data stores.

What Data Sources are supported by AWS Glue?

Data Sources supported by AWS Glue are:
Amazon Aurora
Amazon RDS for MySQL
Amazon RDS for Oracle
Amazon RDS for PostgreSQL
Amazon RDS for SQL Server
Amazon Redshift
DynamoDB
Amazon S3
MySQL
Oracle
Microsoft SQL Server
AWS Glue also supports streaming sources such as:
Amazon MSK
Amazon Kinesis Data Streams
Apache Kafka

What are Development Endpoints?

Development endpoints are environments for developing and testing AWS Glue ETL scripts; the AWS Glue API includes operations for creating and managing custom DevEndpoints. The endpoint is where a developer can debug extract, transform, and load (ETL) scripts.

What are AWS Tags in AWS Glue?

AWS tags are labels that we assign to AWS resources.
Each tag consists of a key and an optional value, both of which we define. We can use tags in AWS Glue to organize and identify our resources, to create cost accounting reports, and to restrict access to resources.
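For example, tags can be attached to and read from a Glue resource with boto3 (the job ARN and tag values here are hypothetical):

import boto3

client = boto3.client('glue', region_name='us-east-1')

# Attach tags to a Glue job (the ARN is a made-up example)
client.tag_resource(
    ResourceArn='arn:aws:glue:us-east-1:123456789012:job/my-etl-job',
    TagsToAdd={'team': 'data-eng', 'env': 'prod'}
)

# Read the tags back from the same resource
print(client.get_tags(
    ResourceArn='arn:aws:glue:us-east-1:123456789012:job/my-etl-job'
)['Tags'])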

What is AWS Glue Data Catalog?

AWS Glue Data Catalog stores structural and operational metadata for all of our data assets. It provides a uniform repository where disparate systems can store and find metadata, keep track of data in data silos, and use that metadata to query and transform the data.
The Data Catalog stores each table's definition, physical location, and business-relevant attributes, and also tracks how the data has changed over time.



What are AWS Glue Crawlers?

An AWS Glue crawler connects to a data store and progresses through a prioritized list of classifiers to extract the schema of the data and other statistics. Crawlers scan data stores to automatically infer schemas and partition structures, and populate the Glue Data Catalog with the resulting table definitions and statistics.
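A crawler can be created and started with boto3, as in this sketch (the crawler name, role, database, and S3 path are assumptions):

import boto3

client = boto3.client('glue', region_name='us-east-1')

# Create a crawler that scans an S3 path and writes the
# inferred table definitions into a Data Catalog database
client.create_crawler(
    Name='sales-crawler',
    Role='arn:aws:iam::123456789012:role/GlueCrawlerRole',
    DatabaseName='sales_db',
    Targets={'S3Targets': [{'Path': 's3://my-bucket/sales/'}]}
)

# Run it on demand (crawlers can also run on a schedule)
client.start_crawler(Name='sales-crawler')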

What is AWS Glue Streaming ETL?

AWS Glue enables ETL operations on streaming data by using continuously running jobs. Streaming ETL is built on the Apache Spark Structured Streaming engine and can ingest streams from Kinesis Data Streams and from Apache Kafka, including Amazon Managed Streaming for Apache Kafka (MSK).
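As a sketch of the underlying idea in plain Spark Structured Streaming (not the Glue-specific API; the Kafka broker, topic, and S3 paths are assumptions, and the Kafka connector package must be on the classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("StreamingETL").getOrCreate()

# Read a stream from a Kafka topic
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")
          .option("subscribe", "clickstream")
          .load())

# A simple transformation: keep the message payload as a string
cleaned = events.selectExpr("CAST(value AS STRING) AS payload")

# Continuously write the results to S3
query = (cleaned.writeStream
         .format("parquet")
         .option("path", "s3://my-bucket/streaming-output/")
         .option("checkpointLocation", "s3://my-bucket/checkpoints/")
         .start())
query.awaitTermination()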

Is AWS Glue Schema Registry open-source?

Partially. The AWS Glue Schema Registry storage is an AWS service, while the serializers and deserializers used with it are Apache-licensed open-source components.

How can we list Databases and Tables in AWS Glue Catalog?

We can list databases and tables by using the following code:

import boto3

client = boto3.client('glue', region_name='us-east-1')

responseGetDatabases = client.get_databases()
databaseList = responseGetDatabases['DatabaseList']

for databaseDict in databaseList:
    databaseName = databaseDict['Name']
    print('\ndatabaseName: ' + databaseName)

    responseGetTables = client.get_tables(DatabaseName=databaseName)
    tableList = responseGetTables['TableList']

    for tableDict in tableList:
        tableName = tableDict['Name']
        print('\n-- tableName: ' + tableName)



How does AWS Glue deduplicate data?

AWS Glue can deduplicate data by merging the source into the destination and dropping duplicate rows, as in the following code (the database, table, and output variables are placeholders to be defined elsewhere):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Get the source data
src_data = glueContext.create_dynamic_frame.from_catalog(database=src_db, table_name=src_tbl)
src_df = src_data.toDF()

# Get the destination data
dst_data = glueContext.create_dynamic_frame.from_catalog(database=dst_db, table_name=dst_tbl)
dst_df = dst_data.toDF()

# Merge the two DataFrames and drop duplicate rows
merged_df = dst_df.union(src_df).dropDuplicates()

# Save the data to the destination in overwrite mode
# (the 'parquet' format and dst_path are placeholders)
merged_df.write.format('parquet').mode('overwrite').save(dst_path)
 
 

What are the Features of AWS Glue?

  • Automatic Schema Discovery - Crawlers automatically obtain schema-related information and store it in the Data Catalog.
  • Job Scheduler - Several jobs can be started in parallel, and users can specify dependencies between jobs.
  • Developer Endpoints - Help in creating custom readers, writers, and transformations.
  • Automatic Code Generation - Generates the ETL code (Python or Scala) for a job.
  • Integrated Data Catalog - Stores metadata from disparate sources in a single repository in the AWS pipeline.

How does AWS Glue manage the ETL service?


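AWS Glue manages ETL as jobs: you point a job at a script, and Glue provisions the Spark environment, runs the script, retries on failure, and records logs and metrics. A minimal boto3 sketch (the job name, role, and script location are assumptions):

import boto3

client = boto3.client('glue', region_name='us-east-1')

# Register an ETL job that points at a script in S3
client.create_job(
    Name='orders-etl',
    Role='arn:aws:iam::123456789012:role/GlueJobRole',
    Command={
        'Name': 'glueetl',  # Spark ETL job type
        'ScriptLocation': 's3://my-bucket/scripts/orders_etl.py'
    }
)

# Start a run; Glue handles provisioning, execution, and logging
run = client.start_job_run(JobName='orders-etl')
print(run['JobRunId'])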


What are the use cases of AWS Glue?

The use cases of AWS Glue are as follows:
Data extraction - helps in extracting data in a variety of formats.
Data transformation - helps in reformatting data for storage.
Data integration - helps in integrating data into enterprise data lakes and warehouses.


What are the drawbacks of AWS Glue?

  • Limited Compatibility - AWS Glue works with a variety of commonly used data sources, but only with services running on AWS.
  • No incremental data sync - Glue is not the best option for real-time ETL jobs.
  • Learning curve - Glue is built around Apache Spark, so teams accustomed to traditional relational database queries face a learning curve.


How can we Automate Data Onboarding?


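One common pattern is a scheduled trigger that runs a crawler, followed by a conditional trigger that starts the ETL job once the crawl succeeds, so new data is onboarded without manual steps. A hedged boto3 sketch (the names, schedule, crawler, and job are assumptions):

import boto3

client = boto3.client('glue', region_name='us-east-1')

# A scheduled trigger that starts the onboarding crawler every night
client.create_trigger(
    Name='nightly-onboarding',
    Type='SCHEDULED',
    Schedule='cron(0 2 * * ? *)',  # 02:00 UTC daily
    Actions=[{'CrawlerName': 'sales-crawler'}],
    StartOnCreation=True
)

# A conditional trigger that runs the ETL job after the crawl succeeds
client.create_trigger(
    Name='run-etl-after-crawl',
    Type='CONDITIONAL',
    Predicate={'Conditions': [{
        'LogicalOperator': 'EQUALS',
        'CrawlerName': 'sales-crawler',
        'CrawlState': 'SUCCEEDED'
    }]},
    Actions=[{'JobName': 'orders-etl'}],
    StartOnCreation=True
)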


How to list all databases and tables in AWS Glue Catalog?

import boto3

client = boto3.client('glue', region_name='us-east-1')

responseGetDatabases = client.get_databases()
databaseList = responseGetDatabases['DatabaseList']

for databaseDict in databaseList:
    databaseName = databaseDict['Name']
    print('\ndatabaseName: ' + databaseName)

    responseGetTables = client.get_tables(DatabaseName=databaseName)
    tableList = responseGetTables['TableList']

    for tableDict in tableList:
        tableName = tableDict['Name']
        print('\n-- tableName: ' + tableName)

What is AWS Glue Data Catalog?

AWS Glue Data Catalog is a persistent metadata store that holds structural and operational metadata for all data sets. It provides a uniform repository where disparate systems can store and find metadata, keep track of data in data silos, and use that metadata to query and transform the data. It also tracks data that has changed over time. The Data Catalog is a drop-in replacement for the Apache Hive Metastore for big data applications running on Amazon EMR, and it provides out-of-the-box integration with Athena, EMR, and Redshift Spectrum.

How can the AWS Glue Data Catalog be accessed from Amazon Athena?


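Amazon Athena uses the Glue Data Catalog as its metastore, so any table a crawler has catalogued can be queried directly. A sketch with boto3 (the database, table, and output location are assumptions):

import boto3

athena = boto3.client('athena', region_name='us-east-1')

# Query a table that a Glue crawler registered in the Data Catalog
response = athena.start_query_execution(
    QueryString='SELECT * FROM sales LIMIT 10',
    QueryExecutionContext={'Database': 'sales_db'},
    ResultConfiguration={'OutputLocation': 's3://my-bucket/athena-results/'}
)
print(response['QueryExecutionId'])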


What are AWS Glue Crawlers?

AWS Glue crawlers connect to a data store and progress through a prioritized list of classifiers to extract the schema of our data and other statistics, then populate the Glue Data Catalog with this metadata. They can run periodically to detect new data and changes to existing data, including table definition changes. Crawlers automatically add new tables, new partitions to existing tables, and new versions of table definitions.

What is the AWS Glue Schema Registry?

AWS Glue Schema Registry enables us to validate and control the evolution of streaming data using registered Apache Avro schemas, at no additional charge. The Schema Registry integrates with Java applications developed for Apache Kafka, Amazon Managed Streaming for Apache Kafka (MSK), Amazon Kinesis Data Streams, Apache Flink, Amazon Kinesis Data Analytics for Apache Flink, and AWS Lambda.

How can the Schema Registry be integrated?


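At the API level, a registry and a schema can be created with boto3, after which producers and consumers integrate through the registry's serializers and deserializers. The registry name, schema name, and schema definition below are assumptions:

import boto3

client = boto3.client('glue', region_name='us-east-1')

# Create a registry to group related schemas
client.create_registry(RegistryName='clickstream-registry')

# Register an Avro schema; producers and consumers validate against it
client.create_schema(
    RegistryId={'RegistryName': 'clickstream-registry'},
    SchemaName='click-event',
    DataFormat='AVRO',
    Compatibility='BACKWARD',
    SchemaDefinition='{"type":"record","name":"Click","fields":[{"name":"url","type":"string"}]}'
)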


How can we solve a HIVE_PARTITION_SCHEMA_MISMATCH error?

If we are using a crawler, we should select the following option:
Update all new and existing partitions with metadata from the table
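Programmatically, that console option corresponds to the crawler configuration shown in this sketch (the crawler name is an assumption):

import boto3
import json

client = boto3.client('glue', region_name='us-east-1')

# Update new and existing partitions with metadata from the table
client.update_crawler(
    Name='sales-crawler',
    Configuration=json.dumps({
        'Version': 1.0,
        'CrawlerOutput': {
            'Partitions': {'AddOrUpdateBehavior': 'InheritFromTable'}
        }
    })
)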

How to define a nested array to ingest data and convert it?

{ "class_id": "test0001", "students": [{ "student_id": "xxxx", "student_name": "AAAABBBCCC", "student_gpa": 123 }] }

How to execute AWS Glue scripts using Python 2.7 from a local machine?

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())
persons = glueContext.create_dynamic_frame.from_catalog(
    database="records",
    table_name="recordsrecords_converted_json")
print "Count: ", persons.count()
persons.printSchema()

What is AWS Glue Streaming ETL?

AWS Glue enables ETL operations on streaming data by using continuously running jobs. Streaming ETL is built on the Apache Spark Structured Streaming engine and can ingest streams from Kinesis Data Streams and from Apache Kafka, using Amazon Managed Streaming for Apache Kafka (MSK). It can clean and transform streaming data and load it into S3 or JDBC data stores, and it can process event data such as IoT streams, clickstreams, and network logs.

How to set the name for a crawled table?

import boto3

database_name = "database"
table_name = "prefix-dir_name"
new_table_name = "more_awesome_name"

client = boto3.client("glue")

response = client.get_table(DatabaseName=database_name, Name=table_name)
table_input = response["Table"]
table_input["Name"] = new_table_name

# Delete keys that cause create_table to fail
table_input.pop("CreatedBy")
table_input.pop("CreateTime")
table_input.pop("UpdateTime")

client.create_table(DatabaseName=database_name, TableInput=table_input)
 
 

Q: Do the AWS Glue APIs return the partition key fields in the order in which they were specified when the table was created?
Ans:

Yes, the partition keys would be returned in the same order as they were specified when the table was created.

Q: How to join / merge all rows of an RDD in PySpark / AWS Glue into one single long line?
Ans:

Each RDD row can be mapped into one string per row using map, and the result of the map call can then be aggregated into a single large string:

result = rdd.map(lambda r: " ".join(r) + "\n") \
            .aggregate("", lambda a, b: a + b, lambda a, b: a + b)

Q: How to create an AWS Glue job using CLI commands?
Ans:

We can create an AWS Glue job by using the command below:

aws glue create-job \
    --name ${GLUE_JOB_NAME} \
    --role ${ROLE_NAME} \
    --command "Name=glueetl,ScriptLocation=s3:///" \
    --connections Connections=${GLUE_CONN_NAME} \
    --default-arguments file://${DEFAULT_ARGUMENT_FILE}

Q: How to get the total number of partitions in AWS Glue for a specific table?
Ans:

By using the command below, we can get the total number of partitions of a table:

aws glue get-partitions --database-name xx --table-name xx --query 'length(Partitions[])'

Q: When an AWS Glue job times out, how do we retry it?
Ans:

Retrying a job only works if it has failed, not if it has timed out. For timeouts we need custom logic, such as an EventBridge rule that listens for Glue timeout events and then runs a Lambda function to restart the job.
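A hedged boto3 sketch of that pattern (the rule name and Lambda ARN are assumptions; the event pattern follows Glue's "Glue Job State Change" events):

import boto3
import json

events = boto3.client('events', region_name='us-east-1')

# Fire when any Glue job run ends in TIMEOUT
events.put_rule(
    Name='glue-timeout-retry',
    EventPattern=json.dumps({
        'source': ['aws.glue'],
        'detail-type': ['Glue Job State Change'],
        'detail': {'state': ['TIMEOUT']}
    })
)

# Route the event to a Lambda function that restarts the job
events.put_targets(
    Rule='glue-timeout-retry',
    Targets=[{
        'Id': 'retry-lambda',
        'Arn': 'arn:aws:lambda:us-east-1:123456789012:function:retry-glue-job'
    }]
)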

 
