Spark Optimizations

Release Date - Jan/01/2022
[Update: Released ! Click here to enroll]

We have rung in 2022 with a new update to the Spark Developer In Real World course. Spark optimization is one of the most requested topics from our students, and this update covers exactly that. In this new chapter, we will explore how Spark decides on the number of tasks in a stage and how you can tweak it. We will go over all the join algorithms that are available in Spark and how Spark chooses the join algorithm at runtime. Furthermore, we will see how to nudge Spark into selecting the join algorithm that we prefer to use at runtime.
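To give a flavor of the join selection described above, here is a minimal sketch in plain Python. It is a toy model, not Spark's planner: the function `choose_join` and the strategy names it returns are hypothetical. The only value taken from Spark is the 10 MB default of `spark.sql.autoBroadcastJoinThreshold`. The assumption is that a table small enough to broadcast leads to a broadcast hash join, anything else falls back to a sort-merge join, and an explicit hint overrides both.

```python
# Toy model only: Spark's real planner also weighs join type, join keys,
# table statistics and adaptive execution. Names here are hypothetical.
AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024  # mirrors Spark's 10 MB default


def choose_join(left_bytes, right_bytes, hint=None,
                threshold=AUTO_BROADCAST_THRESHOLD):
    """Return the join strategy this simplified model would pick."""
    if hint is not None:
        # A hint plays the role of "nudging" the planner.
        return hint
    if min(left_bytes, right_bytes) <= threshold:
        # The smaller side fits under the broadcast threshold.
        return "broadcast_hash_join"
    return "sort_merge_join"


small_table = 2 * 1024 * 1024   # 2 MB
large_table = 5 * 1024 ** 3     # 5 GB

print(choose_join(large_table, small_table))
print(choose_join(large_table, large_table))
print(choose_join(large_table, large_table, hint="shuffle_hash_join"))
```

In a real job the nudge would typically come from a join hint or from changing the broadcast threshold, rather than from a function like this one.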

Hive Window and Analytical Functions

Release Date - Feb/16/2020
[Update: Released ! Click here to enroll]

We have updated the Hadoop Developer In Real World course with Hive window and analytical functions. When you perform complex data analytics, window and analytical functions become unavoidable, so we have added 3 new lessons on the topic with Hive. With this update, you will get a deeper understanding of what window and analytical functions are. Once you understand them, we will explore the concepts hands-on with campaign data from Kickstarter. We will additionally explore how to perform time series operations with window and analytical functions in Hive.
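As a rough illustration of what a window function computes, the sketch below reproduces a running total per partition in plain Python. The rows are made-up sample values, not the Kickstarter data used in the lessons, and the column names and the `running_total` helper are hypothetical. It assumes the ordering column has no ties within a partition.

```python
from itertools import groupby
from operator import itemgetter

# Made-up sample rows for illustration only.
rows = [
    {"category": "music", "launched": "2019-03-01", "pledged": 60},
    {"category": "games", "launched": "2019-02-10", "pledged": 250},
    {"category": "games", "launched": "2019-01-05", "pledged": 100},
    {"category": "music", "launched": "2019-01-20", "pledged": 40},
]


def running_total(rows, partition_key, order_key, value_key):
    """Roughly what this SQL expresses (assuming no ties in the ordering):
    SUM(value) OVER (PARTITION BY partition_key ORDER BY order_key)
    """
    result = []
    # Sort by partition first, then by the ordering column inside it.
    ordered = sorted(rows, key=itemgetter(partition_key, order_key))
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        total = 0
        for row in group:
            total += row[value_key]
            # Every input row is kept; the window only adds a column.
            result.append({**row, "running_total": total})
    return result


for row in running_total(rows, "category", "launched", "pledged"):
    print(row)
```

Unlike a GROUP BY, the result keeps one output row per input row, which is what makes window functions useful for time series style calculations.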

Kafka Schema Registry

Release Date - Jan/01/2020
[Update: Released ! Click here to enroll]

We have started the New Year right with an update to the Hadoop Developer In Real World course. We have added 3 new lessons to the Kafka chapter in the course. With this update, you will be able to write production-ready Kafka applications with Spring Kafka. Following that, we will explain how Kafka schema registry helps us manage schemas better by creating transparency between producers and consumers. Finally, we will see how to safely evolve schemas in Kafka with the help of schema registry and compatibility types. This update will augment our last update on Schema Evolution with Avro and will give you a solid understanding of managing and evolving schemas with Kafka.

Schema Evolution in Avro

Release Date - Oct/25/2019
[Update: Released ! Click here to enroll]

One of the most requested topics from our students is to explain more on Schema Evolution, and these 3 lectures are designed to do exactly that. In these lectures, we will first understand the importance of schema in Avro. Next, we will understand how Avro supports changes to a schema, and what is and is not possible with schema evolution without breaking the clients that consume your data. These 3 lectures are added to the File Formats chapter and you will find them at the end of the chapter.
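The core idea of schema evolution can be sketched in a few lines of plain Python. This is not the Avro library. The `resolve` helper and the field names are hypothetical, and the sketch models only one rule from Avro's schema resolution: a field that the reader expects but the writer did not write is filled from the reader's default, and reading fails if no default exists.

```python
# Hypothetical sketch of one schema resolution rule; not the Avro API.
NO_DEFAULT = object()  # marks a reader field that has no default value


def resolve(record, reader_fields):
    """Read a record written with an older schema using a newer reader schema.

    reader_fields maps field name -> default value (or NO_DEFAULT).
    """
    resolved = {}
    for name, default in reader_fields.items():
        if name in record:
            resolved[name] = record[name]
        elif default is not NO_DEFAULT:
            # Field added after the data was written: use the default.
            resolved[name] = default
        else:
            raise ValueError(f"field '{name}' is missing and has no default")
    # Fields the writer wrote but the reader does not know are dropped.
    return resolved


old_record = {"id": 1, "name": "Asha", "legacy_code": "X"}
new_reader = {"id": NO_DEFAULT, "name": NO_DEFAULT, "email": "unknown"}

print(resolve(old_record, new_reader))
```

Adding a field with a default keeps old data readable, while adding one without a default breaks consumers reading older records.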

Spark Developer In Real World (An end to end project [Spark, Elasticsearch, Kibana, REST and Angular])

Release Date - Mar/07/2019
[Update: Released ! Click here to enroll]

We received a lot of emails asking for end to end real world projects in Spark. In this project, we will use the dataset from Stack Overflow and leverage Spark to transform the data, load it into Elasticsearch, and use Kibana to visualize it. Most real world projects involve not just big data tools but also technologies outside the big data ecosystem. So we will build a REST service that exposes the data in Elasticsearch, and an Angular application that consumes the data from the REST service.

Spark Developer In Real World (Spark & Data Sources & File Formats)

Release Date - Nov/14/2018
[Update: Released ! Click here to enroll]

This update will focus on using different file formats, such as Parquet and ORC, with Spark. Along with different file formats, we will also see how Spark works with other data sources like Hive, NoSQL (HBase) and RDBMS databases.

Spark Developer In Real World (Shuffle & Transformations)

Release Date - Sep/10/2018
[Update: Released ! Click here to enroll]

There are 2 new chapters in this update. In the first chapter, Shuffle in Spark, we cover both Hash Shuffle Manager and Sort Shuffle Manager in Spark. We go very deep in explaining the shuffle implementations in Spark. To be honest, we have not come across a course, even the course materials from Databricks (founded by the creators of Spark), that comes close to explaining shuffle concepts in such great detail.

The 2nd new chapter is named Spark Transformations. The goal of this chapter is to understand what happens behind the scenes and the internal RDDs that Spark creates when we use transformation functions in Spark. We will look into ways to avoid shuffle operations when we group and join RDDs. With each transformation, we will look into the dependencies (narrow & wide) involved behind the transformation. This level of understanding will help you optimize your Spark jobs better.

Spark Developer In Real World (New Course !)

Release Date - April/29/2018
[Update: Released ! Click here to enroll]

Spark Developer In Real World will cover all core concepts in Spark - RDD, DataFrame, Dataset, SQL etc. We will also go deep into Spark's architecture and cluster setup in the course. Spark is not complete without machine learning and streaming, so this course will also include both Spark ML and Spark Streaming.

Puppet for Hadoop Deployment (Hadoop Administrator In Real World)

Release Date - November/30/2017
[Update: Released ! Click here to enroll]

Puppet for Hadoop Deployment will be added to the Hadoop Administrator In Real World course as an update. This chapter will explore how Puppet can be used to deploy Hadoop solutions on a wide scale.

Apache Kafka (Hadoop Developer In Real World)

Release Date - November/30/2017
[Update: Released ! Click here to enroll]

Apache Kafka will be added to the Hadoop Developer In Real World course as an update. Kafka is one of the most requested topics from our students at Hadoop In Real World. Kafka is used to build scalable, distributed and real-time streaming applications.

Troubleshooting for Administrators (Hadoop Administrator In Real World)

Release Date - September/30/2017
[Update: Released ! Click here to enroll]

Troubleshooting for Administrators will be added to the Hadoop Administrator In Real World course as an update. Troubleshooting is an important topic for any Hadoop administrator. This chapter will give administrators the confidence to troubleshoot errors and performance issues with the cluster.

RCFile, ORCFile and Parquet (Hadoop Developer In Real World)

Release Date - September/05/2017
[Update: Released ! Click here to enroll]

RCFile, ORCFile and Parquet will be added to the Hadoop Developer In Real World course as an update. RCFile and ORCFile are optimized file formats to store relational data in big data environments. Parquet is a columnar storage format which can be used in Hadoop for faster execution and efficient processing of data.

Apache Ambari (Hadoop Administrator In Real World)

Release Date - June/23/2017
[Update: Released ! Click here to enroll]

Apache Ambari will be added to the Hadoop Administrator In Real World course as an update. Apache Ambari is Hadoop cluster management software used to provision, manage, and monitor Apache Hadoop clusters. Hortonworks Data Platform (HDP) includes Apache Ambari to manage Hadoop clusters.

Spark Starter Kit (New Course !)

Release Date - May/21/2017
[Update: Spark Starter Kit is LIVE ! Click here to enroll]

Spark Starter Kit is a new course and it is 100% free. This introductory course to Spark answers all the important questions that most new Spark learners have. Most courses and other online help, including Spark's documentation, are not good at helping students understand the foundational concepts. They explain what Spark is, what an RDD is, what this is and what that is, but students are most interested in understanding core fundamentals: why we need Spark when we have Hadoop, why RDDs are needed, how Spark is faster than Hadoop, and how Spark achieves the speed and efficiency it claims. That is exactly what you will learn in this free Spark Starter Kit course.