BLOG - Big Data In Real World

October 9, 2023

How to kill a running Spark application?

Apache Spark is a powerful open-source distributed computing system used for big data processing. However, sometimes you may need to kill a running Spark application for […]
October 2, 2023

How does a consumer know the offset to read after restart in Kafka?

Let’s say you have a consumer group which has 3 consumers at the moment consuming messages from a topic. Assume that you had to shut down […]
September 25, 2023

What is the default number of executors in Spark?

This is going to be a short post. Number of executors in YARN deployments: spark.executor.instances controls the number of executors in YARN. By default, the number […]
September 18, 2023

What is the default number of cores and amount of memory allocated to an application in Spark?

Number of cores: spark.executor.cores controls the number of cores available for the executors. By default, it is 1 core per executor in YARN and all available […]
September 11, 2023

How to find the number of objects in an S3 bucket?

There is no separate command in the AWS CLI to find the number of objects in an S3 bucket, but there is a workaround. Solution: aws s3 […]
August 14, 2023

Improving Performance with Adaptive Query Execution in Apache Spark 3.0

Apache Spark, the popular distributed computing framework, has been widely adopted for processing large-scale data. With the release of Apache Spark 3.0, a groundbreaking feature called […]
August 7, 2023

Exploring the Power of Apache Spark 3.0: Adaptive Query Execution and More

Apache Spark, the popular distributed computing framework, has taken a significant leap forward with the release of Apache Spark 3.0. Packed with new features and enhancements, […]
July 31, 2023

What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism in Spark?

Both spark.sql.shuffle.partitions and spark.default.parallelism control the number of tasks that get executed at runtime, thereby controlling the distribution and parallelism. This means both properties have […]
July 24, 2023

What is the difference between client and cluster deploy modes in Spark?

This post aims at describing the differences between client and cluster deploy modes in Spark. Client mode Cluster mode
July 17, 2023

Stream Processing vs. Message Processing: What’s the Difference?

In modern software systems, data is often generated and consumed in real-time. To handle these data streams, various processing techniques have been developed, including stream processing […]
July 10, 2023

How to fix unassigned shards issue in Elasticsearch?

By default, shards should be automatically allocated to nodes. But in extreme cases where you had to add or remove nodes to the cluster or after […]
July 3, 2023

How to recursively upload a folder to S3 using AWS CLI?

This is a pretty common requirement and here is the solution. Solution Let’s create a bucket named hirw-sample-aws-bucket first. [osboxes@wk1 ~]$ aws s3 mb s3://hirw-sample-aws-bucket Use […]
June 26, 2023

How to delete an index in Elasticsearch?

Simple problem with a simple solution. In this post we will see how to delete an index in Elasticsearch. Solution We have 3 indices in Elasticsearch […]
June 19, 2023

How to recursively delete files, folders or bucket from S3?

In this post we will see how to recursively delete files/objects, folders and buckets from S3. Recursively deleting a folder in S3: rm --recursive followed by […]
June 12, 2023

How to create and use UDF in Spark?

In this post we are going to create a Spark UDF which converts temperature from Fahrenheit to Celsius. Here is our data. We have day and […]
June 5, 2023

How to kill multiple YARN applications at once?

If you work with Apache Hadoop, you may find yourself needing to kill multiple YARN applications at once. While you can kill them one by one […]
May 29, 2023

How to list topics without accessing Zookeeper in Kafka?

Kafka uses Zookeeper to manage its internal state. So it is not possible to run Kafka without Zookeeper. Even if you don’t have access to Zookeeper […]
May 22, 2023

How to add total count of DataFrame to an already grouped DataFrame?

Here is our data. We have an employee DataFrame with 3 columns: name, project and cost_to_project. An employee can belong to multiple projects and for each […]
May 15, 2023

How to query data from Snowflake in Spark?

If your organization is working with lots of data, you might be leveraging Spark to distribute computation. You could also potentially have some or all your […]
May 11, 2023

What is the difference between sync and cp operations in S3?

This post describes the differences between the sync and cp operations in S3 and which one should be preferred. sync: aws s3 sync copies any files that […]
Hadoop In Real World is now Big Data In Real World!