
TOP PYSPARK INTERVIEW QUESTIONS 2023

1. What is Apache Spark and how does it differ from Hadoop?
2. What are the benefits of using Spark over MapReduce?
3. What is a Spark RDD and what operations can be performed on it?
4. How does Spark handle fault tolerance and data consistency?
5. Explain the difference between Spark transformations and actions.
6. What is a Spark DataFrame and how is it different from an RDD?
7. What is Spark SQL and how does it work?
8. How can you optimize a Spark job to improve its performance?
9. How does Spark handle memory management and garbage collection?
10. Explain the role of the Spark Driver and Executors.
11. What is PySpark and how does it differ from Apache Spark?
12. How do you create a SparkContext in PySpark? What is the purpose of SparkContext?
13. What is an RDD (Resilient Distributed Dataset)? How is it different from a DataFrame and a Dataset?
14. What are the different ways to create an RDD in PySpark?
15. What is the use of the persist() method in PySpark? How does it differ from the cache() method?
16. What is the use of broadcast variables in PySpark?
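As a hedged illustration of a few of the PySpark-specific questions above (creating a SparkContext, the different ways to create an RDD, persist() versus cache(), and broadcast variables), here is a minimal sketch; the app name, data, and commented-out file path are invented for the example:

```python
from pyspark import SparkConf, SparkContext, StorageLevel

# Create a SparkContext, the entry point for RDD-based PySpark programs.
conf = SparkConf().setAppName("interview-prep").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Two common ways to create an RDD: from a Python collection, or from a file.
numbers = sc.parallelize(range(1, 101), numSlices=4)
# lines = sc.textFile("hdfs:///path/to/file.txt")  # hypothetical path

# cache() is shorthand for persist(StorageLevel.MEMORY_ONLY) on an RDD;
# persist() lets you choose a different storage level explicitly.
numbers.cache()
evens = numbers.filter(lambda x: x % 2 == 0).persist(StorageLevel.MEMORY_AND_DISK)

# A broadcast variable ships a read-only lookup table to every executor once,
# instead of serializing it with every task closure.
lookup = sc.broadcast({0: "even", 1: "odd"})
labelled = numbers.map(lambda x: (x, lookup.value[x % 2]))

print(evens.count(), labelled.take(3))
sc.stop()
```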

In which cases should we use ORC, and in which cases should we use Parquet?

The decision to use ORC or Parquet can depend on several factors, including the specific requirements of your use case and the tools and technologies that you are using. Here are some common scenarios where ORC or Parquet might be a better choice.

Use cases for ORC:
Complex data types or a large number of nested structures: ORC is designed to handle complex data types efficiently, making it a good choice for datasets with a large number of nested structures, arrays, and maps.
Low-latency queries: ORC is optimized for low-latency queries and can quickly access specific columns, making it a good choice for real-time or interactive querying scenarios.
Apache Hive ecosystem: ORC was developed by the Apache Hive team and is optimized for use with Hive, making it a good choice if you are using Hive for data processing and analysis.

Use cases for Parquet:
Large datasets with a simpler schema: Parquet is efficient for storing and querying large datasets with a simpler schema, making it a good choice…
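As a hedged sketch (the paths and toy DataFrame below are invented for illustration), writing the same data in either format from PySpark is only a change of writer/reader method, which makes it easy to benchmark both formats against your own workload:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("orc-vs-parquet").getOrCreate()

# Toy DataFrame standing in for your real dataset.
df = spark.createDataFrame(
    [(1, "alice", 34.0), (2, "bob", 27.5)],
    ["id", "name", "score"],
)

# Same data, two columnar formats; the output paths are hypothetical.
df.write.mode("overwrite").orc("/tmp/demo_orc")
df.write.mode("overwrite").parquet("/tmp/demo_parquet")

# Reading back is symmetrical, so swapping formats is a one-line change.
orc_df = spark.read.orc("/tmp/demo_orc")
parquet_df = spark.read.parquet("/tmp/demo_parquet")
orc_df.show()
parquet_df.show()
```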
8 Performance Optimization Techniques Using Spark

Thanks to its fast, easy-to-use capabilities, Apache Spark helps enterprises process data faster and solve complex data problems quickly. We all know that during the development of any program, taking care of performance is equally important. A Spark job can be optimized by many techniques, so let's dig deeper into those techniques one by one. Apache Spark optimization helps with in-memory data computations; the bottleneck for these computations can be CPU, memory, or any other resource in the cluster.

1. Serialization
Serialization plays an important role in the performance of any distributed application. By default, Spark uses the Java serializer. Spark can also use another serializer, the Kryo serializer, for better performance. Kryo uses a compact binary format and offers processing up to 10x faster than the Java serializer. To set the serializer property:

conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
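A minimal PySpark sketch of that setting in context (the app name and the buffer value below are illustrative, and the extra tuning knob is optional):

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Enable Kryo serialization via the config key shown above.
conf = (
    SparkConf()
    .setAppName("kryo-demo")  # illustrative app name
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # Optional tuning knob; the value here is only an example.
    .set("spark.kryoserializer.buffer.max", "256m")
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
print(spark.sparkContext.getConf().get("spark.serializer"))
spark.stop()
```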

Optimizations In Spark

No need to guess: yes, we are talking about the Spark SQL Catalyst optimizer here. I am sure most of you who are interested in working with Spark and structured data have read blogs that describe the advantages of Spark SQL in a very neat and tidy way. To be precise, there are plenty of blogs out there telling you why you should use Spark SQL and what optimizations it brings out of the box. That seems rather one-sided, doesn't it? As if Spark were magic and merely writing SparkSession in your code were enough to process the "Big Data" you gathered from IoT devices. So today, we will look into some issues I faced while using Spark SQL. Let's get started.

What is Spark SQL? Let's begin with a brief overview of Spark SQL; if you are already familiar with it, you can skip this part. Spark SQL is a module in the Spark ecosystem…
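For readers who want that one-paragraph overview grounded in code, here is a small, hedged Spark SQL sketch (the table and column names are invented). Both the SQL string and the equivalent DataFrame expression go through the Catalyst optimizer, which is why they end up with the same optimized plan:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("spark-sql-demo").getOrCreate()

# Invented sample data; in practice this would come from files or tables.
events = spark.createDataFrame(
    [("sensor-1", 21.5), ("sensor-1", 23.0), ("sensor-2", 19.1)],
    ["device", "temperature"],
)
events.createOrReplaceTempView("events")

# SQL string and DataFrame API: two front ends, one Catalyst-optimized plan.
sql_result = spark.sql(
    "SELECT device, AVG(temperature) AS avg_temp FROM events GROUP BY device"
)
df_result = events.groupBy("device").avg("temperature")

sql_result.explain()  # inspect the physical plan Catalyst produced
sql_result.show()
spark.stop()
```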

Top 70+ Hadoop Interview Questions and Answers: Sqoop, Hive, HDFS, and more

HDFS Interview Questions

1. What are the different vendor-specific distributions of Hadoop?
The different vendor-specific distributions of Hadoop are Cloudera, MapR, Amazon EMR, Microsoft Azure, IBM InfoSphere, and Hortonworks (Cloudera).

2. What are the different Hadoop configuration files?
The different Hadoop configuration files include: hadoop-env.sh, mapred-site.xml, core-site.xml, yarn-site.xml, hdfs-site.xml, and the masters and slaves files.

3. What are the three modes in which Hadoop can run?
Standalone mode: the default mode. It uses the local FileSystem and a single Java process to run the Hadoop services.
Pseudo-distributed mode: uses a single-node Hadoop deployment to execute all Hadoop services.
Fully-distributed mode: uses separate nodes to run the Hadoop master and slave services.

4. What are the differences between a regular FileSystem and HDFS?
Regular FileSystem: in a regular FileSystem, data is maintained…

Spark Modes of Deployment – Cluster mode and Client Mode

When we talk about deployment modes in Spark, we mean where the driver program runs; basically, there are two possibilities. The driver can run on a worker node inside the cluster, which is known as Spark cluster mode, or on an external client, which we call client mode. In this blog, we will cover the whole concept of Apache Spark deployment modes. First, we will give a brief introduction to deployment modes in Spark from the YARN resource manager's perspective, since we mostly use YARN in a production environment; hence, we will learn the deployment modes in YARN in detail.

Spark deploy modes
When we submit a Spark job for execution, locally or on a cluster, its behaviour depends entirely on one component: the "Driver". Where the "Driver" component of the Spark job resides defines the behaviour of the job. Basically, there are two types of deploy modes: client mode and cluster mode.
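As a hedged sketch: the deploy mode is normally chosen when the job is submitted (for example via spark-submit's --deploy-mode flag), not inside the application, but the standard config key it maps to can be inspected from PySpark. The app name and the fallback value below are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("deploy-mode-check").getOrCreate()

# "client" means the driver runs where the job was submitted;
# "cluster" means the cluster manager (e.g. YARN) starts the driver on a worker node.
mode = spark.conf.get("spark.submit.deployMode", "client")
print(f"Driver deploy mode: {mode}")

spark.stop()
```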

Why is my EC2 Linux instance unreachable and failing one or both of its status checks?

My Amazon Elastic Compute Cloud (Amazon EC2) Linux instance is unreachable and is failing one or both of its status checks. How do I troubleshoot the status check failure?

Short description
Amazon EC2 monitors the health of each EC2 instance with two status checks:

System status check: detects issues with the underlying host that your instance runs on. If the underlying host is unresponsive or unreachable due to network, hardware, or software issues, then this status check fails.

Instance status check: a failure indicates a problem with the instance due to operating system-level errors such as the following: failure to boot the operating system, failure to mount volumes correctly, file system issues, incompatible drivers, or kernel panic. Instance status checks might also fail due to severe memory pressure caused by over-utilization…
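A hedged illustration of reading those two status checks programmatically with boto3 (the region and instance ID are placeholders; the EC2 console and AWS CLI expose the same information):

```python
import boto3

# Placeholder values; substitute your own region and instance ID.
ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.describe_instance_status(
    InstanceIds=["i-0123456789abcdef0"],
    IncludeAllInstances=True,  # also report instances that are not running
)

for status in response["InstanceStatuses"]:
    print("System status:  ", status["SystemStatus"]["Status"])
    print("Instance status:", status["InstanceStatus"]["Status"])
```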
