Spark Archives - Big Data In Real World

Spark

October 9, 2023

How to kill a running Spark application?

Apache Spark is a powerful open-source distributed computing system used for big data processing. However, sometimes you may need to kill a running Spark application for […]
September 25, 2023

What is the default number of executors in Spark?

This is going to be a short post. Number of executors in YARN deployments: spark.executor.instances controls the number of executors in YARN. By default, the number […]
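
A minimal sketch of setting this property through the SparkSession builder (the application name and value are illustrative; with nothing set and dynamic allocation off, YARN falls back to 2 executors):

import org.apache.spark.sql.SparkSession

// Illustrative only: request 4 executors in a YARN deployment.
val spark = SparkSession.builder()
  .appName("executor-count-example")
  .config("spark.executor.instances", "4")
  .getOrCreate()

println(spark.conf.get("spark.executor.instances"))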
September 18, 2023

What is the default number of cores and amount of memory allocated to an application in Spark?

Number of cores: spark.executor.cores controls the number of cores available to the executors. By default, it is 1 core per executor in YARN and all available […]
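
A hedged sketch of overriding both defaults at startup (the values are illustrative, not recommendations):

import org.apache.spark.sql.SparkSession

// Illustrative values: 2 cores and 2g of memory per executor.
// With nothing set, YARN gives 1 core per executor and spark.executor.memory defaults to 1g.
val spark = SparkSession.builder()
  .appName("executor-resources-example")
  .config("spark.executor.cores", "2")
  .config("spark.executor.memory", "2g")
  .getOrCreate()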
August 14, 2023

Improving Performance with Adaptive Query Execution in Apache Spark 3.0

Apache Spark, the popular distributed computing framework, has been widely adopted for processing large-scale data. With the release of Apache Spark 3.0, a groundbreaking feature called […]
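
As a minimal sketch, Adaptive Query Execution is toggled through a single configuration flag (it ships disabled in Spark 3.0 and is on by default from 3.2); the session setup here is illustrative:

// Assuming an existing spark-shell session where `spark` is available.
spark.conf.set("spark.sql.adaptive.enabled", "true")
// Optionally let AQE coalesce small shuffle partitions at runtime.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")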
August 7, 2023

Exploring the Power of Apache Spark 3.0: Adaptive Query Execution and More

Apache Spark, the popular distributed computing framework, has taken a significant leap forward with the release of Apache Spark 3.0. Packed with new features and enhancements, […]
July 31, 2023

What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism in Spark?

Both spark.sql.shuffle.partitions and spark.default.parallelism control the number of tasks that get executed at runtime, thereby controlling the distribution and parallelism. This means both properties have […]
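
A short sketch of where each property applies, assuming a spark-shell session where `spark` is already available:

// spark.sql.shuffle.partitions applies to DataFrame/Dataset shuffles (defaults to 200)
// and can be changed at runtime.
spark.conf.set("spark.sql.shuffle.partitions", "100")

// spark.default.parallelism applies to RDD operations (reduceByKey, join, parallelize)
// when no partition count is passed; it is read when the SparkContext is created,
// so it is normally set at submit time rather than here.
val rdd = spark.sparkContext.parallelize(1 to 1000)
println(rdd.getNumPartitions)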
July 24, 2023

What is the difference between client and cluster deploy modes in Spark?

This post describes the differences between the two deploy modes in Spark: client mode and cluster mode.
June 12, 2023

How to create and use UDF in Spark?

In this post we are going to create a Spark UDF which converts temperature from Fahrenheit to Celsius. Here is our data. We have day and […]
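
A minimal sketch of the idea (the sample data and column names are illustrative, not the post's exact dataset):

import spark.implicits._
import org.apache.spark.sql.functions.{udf, col}

// Hypothetical data: day and temperature in Fahrenheit.
val temps = Seq(("Monday", 98.6), ("Tuesday", 32.0)).toDF("day", "temperature_f")

// Wrap the conversion in a UDF and apply it to a column.
val fahrenheitToCelsius = udf((f: Double) => (f - 32) * 5 / 9)
temps.withColumn("temperature_c", fahrenheitToCelsius(col("temperature_f"))).show()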
May 22, 2023

How to add total count of DataFrame to an already grouped DataFrame?

Here is our data. We have an employee DataFrame with 3 columns, name, project and cost_to_project. An employee can belong to multiple projects and for each […]
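
One common way to do this (not necessarily the post's exact approach) is to compute the total once and attach it as a literal column; the sample rows below are illustrative:

import spark.implicits._
import org.apache.spark.sql.functions.{sum, lit}

val employees = Seq(
  ("Jerry", "Ingestion", 1000),
  ("Arya", "Ingestion", 2000),
  ("Jerry", "ML", 3000)
).toDF("name", "project", "cost_to_project")

// Total row count of the original DataFrame, added to the grouped result.
val totalRows = employees.count()
val grouped = employees
  .groupBy("project")
  .agg(sum("cost_to_project").as("total_cost"))
  .withColumn("total_rows_in_source", lit(totalRows))
grouped.show()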
May 15, 2023

How to query data from Snowflake in Spark?

If your organization works with lots of data, you might be leveraging Spark for distributed computation. You could also potentially have some or all of your […]
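
A hedged sketch using the Snowflake Spark connector (spark-snowflake); the connection values and table name are placeholders, and the connector must be on the classpath:

// All connection values below are illustrative placeholders.
val sfOptions = Map(
  "sfURL" -> "<account>.snowflakecomputing.com",
  "sfUser" -> "<user>",
  "sfPassword" -> "<password>",
  "sfDatabase" -> "<database>",
  "sfSchema" -> "<schema>",
  "sfWarehouse" -> "<warehouse>"
)

// "snowflake" is the short format name registered by the connector;
// the fully qualified name is net.snowflake.spark.snowflake.
val ordersDF = spark.read
  .format("snowflake")
  .options(sfOptions)
  .option("dbtable", "ORDERS")
  .load()

ordersDF.show()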
April 27, 2023

How to transpose a DataFrame from columns to rows in Spark?

Unfortunately, there is no built-in function to transpose a DataFrame from columns to rows in Spark. In this post we will show an easy way […]
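
One common pattern (not necessarily the post's exact approach) uses the stack expression to turn columns into rows; the column names here are illustrative:

import spark.implicits._

val costs = Seq(("Jerry", 1000, 2000)).toDF("name", "q1_cost", "q2_cost")

// stack(2, ...) turns the two cost columns into (quarter, cost) rows.
val transposed = costs.selectExpr(
  "name",
  "stack(2, 'q1_cost', q1_cost, 'q2_cost', q2_cost) as (quarter, cost)"
)
transposed.show()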
April 13, 2023

What is the difference between map and mapValues functions in Spark?

In this post we will look at the differences between map and mapValues functions and when it is appropriate to use either one. We have a […]
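
A minimal sketch of the difference on a pair RDD (the sample data is illustrative):

val pairs = spark.sparkContext.parallelize(Seq(("a", 1), ("b", 2)))

// map sees the whole (key, value) tuple, so the key can also be changed.
val mapped = pairs.map { case (k, v) => (k.toUpperCase, v * 10) }

// mapValues only touches the value and preserves the existing partitioner.
val mappedValues = pairs.mapValues(v => v * 10)

mapped.collect().foreach(println)
mappedValues.collect().foreach(println)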
March 30, 2023

How to read and write XML files with Spark?

We will be using the spark-xml package from Databricks to read and write XML files with Spark. Here is how we enter the spark shell to […]
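
A hedged sketch of reading and writing with spark-xml; file paths and tag names are illustrative, and the package must be on the classpath (for example via --packages com.databricks:spark-xml_2.12:<version>):

// Read: each <book> element becomes a row.
val books = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("books.xml")

// Write back out, wrapping rows in a <books> root element.
books.write
  .format("com.databricks.spark.xml")
  .option("rootTag", "books")
  .option("rowTag", "book")
  .save("books_out")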
March 16, 2023

How to read and write Excel files with Spark?

In this post we are going to see how to work with Excel files in Spark. We will be using the spark-excel package created by Crealytics. […]
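
A hedged sketch using the Crealytics spark-excel package; the file path and options are illustrative, and the package must be on the classpath (for example via --packages com.crealytics:spark-excel_2.12:<version>):

// Read an Excel sheet into a DataFrame.
val sales = spark.read
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("sales.xlsx")

// Write a DataFrame back to an Excel file.
sales.write
  .format("com.crealytics.spark.excel")
  .option("header", "true")
  .save("sales_out.xlsx")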
April 27, 2022

How to create a column with unique, incrementing index value in Spark?

Let’s say we have a DataFrame like below.

+---------+-------+---------------+
|  Project|   Name|Cost_To_Project|
+---------+-------+---------------+
|Ingestion|  Jerry|           1000|
|Ingestion|   Arya|   […]
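
One simple option (not necessarily the only approach the post covers) is monotonically_increasing_id, which gives unique but not necessarily consecutive values:

import spark.implicits._
import org.apache.spark.sql.functions.monotonically_increasing_id

val projects = Seq(
  ("Ingestion", "Jerry", 1000),
  ("Ingestion", "Arya", 2000)
).toDF("Project", "Name", "Cost_To_Project")

// Adds a unique (but not consecutive) id per row; for strictly consecutive
// indexes, a window-based row_number or RDD zipWithIndex is typically used instead.
projects.withColumn("id", monotonically_increasing_id()).show()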
April 20, 2022

How to find the number of partitions in a DataFrame?

Let’s say we have a DataFrame with the employee name, project and the cost of the employee to the project. From this data, we have a […]
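
As a quick sketch (sample rows are illustrative), the partition count of a DataFrame can be read off its underlying RDD:

import spark.implicits._

val employees = Seq(("Jerry", "Ingestion", 1000)).toDF("name", "project", "cost_to_project")

// getNumPartitions reports how many partitions back this DataFrame.
println(employees.rdd.getNumPartitions)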
March 23, 2022

How to pivot and unpivot a DataFrame in Spark?

In this post we are going to describe how to pivot and unpivot a DataFrame in Spark.  We have an employee DataFrame with 3 columns, name, […]
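
A minimal sketch of both directions (the sample rows and project names are illustrative):

import spark.implicits._
import org.apache.spark.sql.functions.sum

val employees = Seq(
  ("Jerry", "Ingestion", 1000),
  ("Arya", "Ingestion", 2000),
  ("Jerry", "ML", 3000)
).toDF("name", "project", "cost_to_project")

// Pivot: one column per project.
val pivoted = employees.groupBy("name").pivot("project").agg(sum("cost_to_project"))

// Unpivot: back to (name, project, cost) rows using stack.
val unpivoted = pivoted.selectExpr(
  "name",
  "stack(2, 'Ingestion', Ingestion, 'ML', ML) as (project, cost_to_project)"
).where("cost_to_project is not null")

pivoted.show()
unpivoted.show()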
March 16, 2022

Understanding stack function in Spark

The stack function in Spark takes the number of rows to generate as its first argument, followed by expressions: stack(n, expr1, expr2, ..., exprn). The stack function will generate n rows by […]
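
A quick illustrative call (the literal values are made up for the example):

// stack(2, ...) produces 2 rows of (name, cost) from the four expressions that follow.
spark.sql("SELECT stack(2, 'Jerry', 1000, 'Arya', 2000) AS (name, cost)").show()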
March 9, 2022

What is the difference between map and flatMap functions in Spark?

Both map and flatMap functions are transformation functions. When applied to an RDD, map and flatMap transform each element inside the RDD into something else. Consider this simple […]
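
A minimal sketch of the difference (the sample lines are illustrative):

val lines = spark.sparkContext.parallelize(Seq("big data", "in real world"))

// map: exactly one output element per input element (here, an Array of words per line).
val mapped = lines.map(_.split(" "))

// flatMap: each input element can produce zero or more output elements, flattened.
val flatMapped = lines.flatMap(_.split(" "))

println(mapped.count())      // 2 arrays
println(flatMapped.count())  // 5 words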
December 29, 2021

What is an efficient way to check if a Spark DataFrame is empty?

A quick answer that might come to your mind is to call the count() function on the DataFrame and check if the count is greater than […]
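
A cheaper check only needs to look at (at most) one row; a minimal sketch with an illustrative empty DataFrame:

import spark.implicits._

val df = Seq.empty[(String, Int)].toDF("name", "value")

println(df.head(1).isEmpty)  // true; fetches at most one row instead of scanning everything
println(df.isEmpty)          // shortcut available since Spark 2.4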