Time and time again, we see folks who are trying to understand RDDs ask questions online, on sites like Stack Overflow and other tech forums, and they are usually pointed to the RDD paper from the Spark authors. The RDD paper is great, but we have to understand that it is a research paper, meant to condense years of research and findings into a few pages. The RDD paper is not a good starting point if you are trying to understand what an RDD is. So students or aspiring Spark learners who are referred to this paper usually end up more lost and confused than before they encountered it.
So we have decided not to refer to this paper, or ask students to refer to it, in this post or in our Spark Starter Kit course. With that goal, let’s start this post.
The very first reason behind Spark’s speed is in-memory computing, but in-memory computing is not a revolutionary new concept. In-memory computing in Spark is special because Spark does it at a distributed scale – with hundreds, thousands, or even more commodity computers. In-memory computing sounds like a very simple idea, but implementing the same in a distributed environment is neither simple nor easy. There are a lot of complexities.
Let’s forget Spark for a minute and visualize a simple example. John is given a piece of paper with the number 10 written on it. You can think of this as an input dataset. John reads 10, adds 10, and memorizes the result, 20. Emily asks John for the result, John answers 20, Emily adds 20 and memorizes 40 as the result. Sam asks Emily for the result, Emily says 40, Sam adds 5 and memorizes 45 as the result. Finally, Riley asks Sam for the result, adds 40 to 45, and memorizes 85 as the result.
Note that each person is memorizing the result of their calculation. This is analogous to individual nodes keeping their intermediate output in memory.
Let’s now say Sam finished the calculation he was supposed to do and keeps the result, 45, in his memory. It is now Riley’s turn to do the calculation. Riley reaches out to Sam for the number, but Sam is not responding. Maybe Sam is thinking about something important, or maybe he is having a brain freeze :-). Whatever the reason may be, Sam is not responding to Riley. This is a problem: without the input from Sam, Riley cannot proceed with the calculation.
If you take the individuals out of this illustration and substitute nodes or hosts, you get the same problem. When a node goes down, everything stored in its main memory, or RAM, is lost. So what is the solution?
Let’s call the input that was given to John A. John takes A, adds 10, and produces 20 – let’s call this intermediate result B. Emily takes B, adds 20, and produces 40 – let’s call this C. Sam takes C, adds 5, and produces D, which is 45. Finally, Riley takes D, adds 40, and produces E, which is 85. Here we have a clear track of how our data is getting transformed. We know B is A plus 10, D is C plus 5, and finally E is D plus 40.
With this in mind, let’s walk through a failure scenario. It’s now Riley’s turn to do the calculation. Riley knows that she should take D and add 40 to it, so she reaches out to Sam for D. Sam is not responding. Sam memorized D, so if Sam is not responding, how would Riley know the value of D? Well, we know what Sam did to calculate D: he took C from Emily and added 5 to it. Sam is not available, but since we know how to get D, we can ask someone else to calculate it. Let’s call Jim to calculate D. Jim will ask Emily for C, add 5 to it, and calculate D, which is 45.
Now Riley asks Jim instead of Sam for D and completes the calculation. Since we kept track of all the transformations that happened to our data, we were able to tolerate Sam’s failure, and that is our solution to fault tolerance – keep track of all the transformations that happen to our dataset.
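The chain above can be sketched in a few lines of plain Scala (no Spark needed – the names A through E are just the values from the example). Writing each step as a recipe (a `def`) instead of a memorized value means any step can be recomputed on demand:

```scala
// Input dataset given to John.
val A = 10

// Each step is a recipe (def), not a memorized value, so it can be
// re-evaluated by anyone at any time.
def B: Int = A + 10   // John:  20
def C: Int = B + 20   // Emily: 40
def D: Int = C + 5    // Sam:   45
def E: Int = D + 40   // Riley: 85

// Sam is unresponsive? Jim simply re-evaluates the recipe for D.
val recomputedByJim = C + 5
println(recomputedByJim)   // 45
println(E)                 // 85
```

The point of the `def`s is exactly the point of lineage: nobody needs Sam’s memory, only Sam’s recipe.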
Spark does exactly this: it keeps track of every single operation or transformation that happens to your dataset, which is referred to as lineage. This way, even when there is a failure, Spark knows how to recover from it. Here is the million dollar question: how does Spark keep track of everything you do with your data? The answer is RDDs. RDD stands for Resilient Distributed Dataset.
Spark tightly controls what you do with the dataset. You cannot work with data in Spark without an RDD. Let me say that again: you cannot work with data in Spark without an RDD. To refer to your dataset in Spark, you will use the functions provided by Spark, and those functions will create an RDD behind the scenes. Can you join two datasets in Spark? Of course you can, using the join function provided by Spark, and internally the join function will result in an RDD. Can you do aggregations on your dataset? Of course you can; Spark provides functions for that too, and Spark will create an RDD behind the scenes any time you attempt to transform the data.
In short, Spark controls any operation that you do with your dataset. To perform any meaningful operation with your data, you need to use functions provided by Spark, and they will result in an RDD. This way Spark can keep track of everything that you are trying to do with the dataset – this is referred to as lineage – and it helps Spark tolerate failures effectively, which, by the way, is a very big deal in in-memory computing.
Let’s understand what an RDD is with a simple use case. We are going to count the number of HDFS related errors in a log file. You don’t need to be an expert in Scala or Spark to read and understand the lines of code here. In the first line, we refer to a text file in HDFS and assign it to logfile. In the next line, we filter all lines in logfile with the word ERROR in them and call the result errors. In the next line, we filter all lines in errors with the word HDFS in them and call the result hdfs. Finally, in the 4th line, we count the number of lines in hdfs.
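The four instructions described above look roughly like this. To keep the sketch self-contained and runnable, a plain Scala List with a few made-up log lines stands in for both the RDD and the HDFS file – in real Spark, the first line would be `val logfile = sc.textFile("hdfs://…")` – but the filter calls read the same against the actual RDD API:

```scala
// Stand-in for the log file in HDFS (made-up lines for illustration).
// Spark: val logfile = sc.textFile("hdfs://…")
val logfile = List(
  "INFO  job started",
  "ERROR HDFS block missing",
  "ERROR out of memory",
  "ERROR HDFS namenode timeout"
)
val errors = logfile.filter(_.contains("ERROR"))  // lines with ERROR
val hdfs   = errors.filter(_.contains("HDFS"))    // of those, lines with HDFS
val count  = hdfs.size                            // Spark: hdfs.count()
println(count)   // 2
```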
See, I told you – you don’t have to be a Scala or Spark expert to understand the code. Again, don’t worry about the syntax, the underscore, sc, etc. for now; we will cover all that very soon. Let me ask you a question: in these 4 lines of code, do you see the lineage?
Of course. First we have logfile, then we applied a filter function on logfile to get errors, and then we applied another filter function on errors, this time to get hdfs. We can say hdfs is dependent on errors and errors is dependent on logfile. Spark refers to this dependency chain as lineage. Think of your family lineage, for instance: you are a result of your father and mother, and your father and mother are results of their respective parents. textFile and filter are functions from the Spark API.
Assume our dataset is in HDFS and is divided into 5 blocks. When we refer to the dataset using the textFile function, the resulting logfile will have 5 blocks, and each of these 5 blocks can technically be on a different node. Spark calls a block of data a partition, so logfile will refer to partitions from different nodes. Since we are referring to a text file in HDFS, each partition will have several lines, or records. In Spark, we refer to each record as an element.
So each partition will have several elements. Next, we apply the filter function on logfile to keep only the lines with the word ERROR in them, and this operation results in errors. The filter function is applied on each element – in simple terms, each record – in each partition of logfile. The result, errors, will again have 5 partitions and will only have elements with the word ERROR in them. Next, we apply the filter operation on errors to keep only the lines with the word HDFS in them, and this operation results in hdfs. The filter function is applied on each element in each partition of the errors RDD, resulting in hdfs. Finally, we call the count function on hdfs, which counts the elements in each partition. Spark will sum up all the individual counts and send the result to the user.
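Here is a plain-Scala sketch of that partition view, with 5 made-up partitions (the contents are assumptions for illustration, not real data). Each partition is filtered element by element, each partition counts its own elements, and the per-partition counts are summed:

```scala
// The dataset as 5 partitions, each a list of line elements.
val logfileParts: Vector[List[String]] = Vector(
  List("ERROR HDFS a", "INFO b"),
  List("ERROR c"),
  List("INFO d", "ERROR HDFS e"),
  List("ERROR HDFS f"),
  List("INFO g")
)
// filter is applied element by element inside every partition.
val errorsParts = logfileParts.map(_.filter(_.contains("ERROR")))
val hdfsParts   = errorsParts.map(_.filter(_.contains("HDFS")))
// count: each partition counts its own elements, then the counts are summed.
val perPartition = hdfsParts.map(_.size)
val total        = perPartition.sum
println(perPartition)   // Vector(1, 0, 1, 1, 0)
println(total)          // 3
```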
Alright. Do you think errors in this illustration is an RDD?
R in RDD stands for Resilient, meaning resilient to failure – that is, fault tolerant. DD in RDD stands for Distributed Dataset. So RDD refers to a distributed dataset that is tolerant to failure. Now, with that in mind, let’s pick errors from the illustration.
Let’s first check whether errors is resilient or not.
Assume partition 2 of errors resides on node 2, and node 2 went down due to some hardware failure. That means we lost partition 2 of errors. If we are able to recover partition 2, we can say errors is resilient, correct? True.
We know exactly how to recover partition 2: to get partition 2 of errors, simply run the filter function on partition 2 of logfile. This is possible because we know the lineage, and because of lineage we know that errors is dependent on logfile and can be derived from logfile by applying the filter function. So errors is resilient.
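A toy sketch of that recovery, again in plain Scala with made-up partitions (0-indexed here): the lineage “recipe” for any errors partition is just filter applied to the matching logfile partition, so a lost partition can be rebuilt on any node:

```scala
// logfile as a handful of partitions (made-up lines for illustration).
val logfileParts = Vector(
  List("INFO a", "ERROR b"),
  List("ERROR c"),
  List("INFO d", "ERROR e")
)

// The lineage recipe: errors partition i = filter of logfile partition i.
def errorsPartition(i: Int): List[String] =
  logfileParts(i).filter(_.contains("ERROR"))

// The node holding errors partition 2 dies? Rebuild just that partition:
val recovered = errorsPartition(2)
println(recovered)   // List(ERROR e)
```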
DD in RDD stands for Distributed Dataset. Is errors a distributed dataset? errors has multiple partitions distributed across different nodes, which means it is a distributed dataset. So errors is both resilient and a distributed dataset, and hence an RDD.
The same is true for logfile and hdfs. So logfile is an RDD, errors is an RDD, and hdfs is an RDD. But wait a minute! In our instructions, we did not specify anywhere that we were creating RDDs, correct? So is Spark really treating logfile, errors, and hdfs as RDDs?
Why don’t we execute these instructions in Spark and check? The Spark shell is an interactive way to execute Spark code. What we have here is a 3 node Spark cluster; Spark’s architecture is very similar to a master and worker architecture. The master in our cluster runs on node1. We can enter the Spark shell with the command spark-shell --master spark://node1:7077.
Let’s execute the first 3 instructions in the Spark shell. You can see that Spark creates an RDD for each of our instructions. The reason Spark knows to create an RDD is the functions we are using in the instructions – the textFile and filter functions. These functions are from the Spark API; they create RDDs behind the scenes, and this helps Spark manage the dependencies and track all the operations.
Let’s execute the 4th instruction to get the count. So we have xxx HDFS related errors in the log file.
If you look at the Spark shell again, RDD is actually a type: logfile is an RDD of type MapPartitionsRDD, and so are errors and hdfs.
Let’s look at the API documentation for RDD. Here you can see that RDD is nothing but a class – an abstract class, to be exact – and you can see MappedRDD as one of its direct subclasses. On this page, I am interested in showing you the definition of RDD first: “A Resilient Distributed Dataset (RDD), the basic abstraction in Spark. RDD represents an immutable, partitioned collection of elements that can be operated on in parallel”. Let’s break that statement down.
RDD is a partitioned collection of elements. We know that: in our illustration, each RDD is made up of 5 partitions, and each partition has a collection of elements. Each partition is operated on in parallel across 5 nodes.
RDDs are immutable, meaning once an RDD is created, you cannot change the elements inside it. That is, once you create the errors RDD by applying a filter function on the logfile RDD, you cannot change the elements in the errors RDD. You can, by all means, create another RDD named newerrors by applying another function on the logfile RDD, but that operation will result in another RDD. The key point to understand is that the elements inside an RDD, once created, cannot be changed; the only way to change the elements is by applying another function to an RDD, and this will create a new RDD.
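Immutability in miniature, with a plain Scala List standing in for the RDD (transformations on real RDDs behave the same way): filter returns a new dataset, and the original is untouched.

```scala
val logfile = List("ERROR disk full", "INFO all good")
val errors  = logfile.filter(_.contains("ERROR"))   // a NEW dataset

// logfile is exactly what it was before – there is no operation that
// changes an element of an existing dataset in place.
println(logfile.size)   // 2
println(errors)         // List(ERROR disk full)
```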
Internally, each RDD is characterized by five main properties. These properties make up an RDD; without them, an RDD cannot exist. 2 of the 5 properties are optional, so we will skip them for now. Let’s look at the 3 non-optional properties that make up an RDD –
First, a list of partitions. An RDD must know its list of partitions. If you look at our illustration, you can see that each of the RDDs involved is made up of 5 partitions.
Second, a function for computing each split. An RDD must know which function to apply on the elements in its partitions.
The errors RDD is computed by a filter function, and the hdfs RDD is computed by a filter function. It is important to understand that the function you apply on an RDD will be applied to all the elements in the RDD. This means an RDD does not support fine grained operations: we cannot operate on specific elements in an RDD the way you would update specific rows in a database table. Instead, an RDD supports coarse grained operations – in simple terms, when you call a function on an RDD, the function is applied to all elements in the RDD. So RDDs support coarse grained operations, not fine grained operations, on their elements.
Third, a list of dependencies on other RDDs. In our example, the hdfs RDD is dependent on the errors RDD, and the errors RDD is dependent on the logfile RDD.
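The three properties can be sketched as a toy trait in a few lines of plain Scala. The names here are mine, for illustration only, though Spark’s real abstract RDD class has close counterparts such as `compute` and `getDependencies`:

```scala
// A toy RDD with the three non-optional properties.
trait MiniRDD[T] {
  def partitions: Seq[Int]                // 1. list of partitions
  def compute(split: Int): Iterator[T]    // 2. function for computing each split
  def dependencies: Seq[MiniRDD[_]]       // 3. dependencies on other RDDs
}

// A source RDD backed by in-memory partitions (stand-in for an HDFS file).
class SourceRDD(data: Vector[List[String]]) extends MiniRDD[String] {
  def partitions = data.indices
  def compute(split: Int) = data(split).iterator
  def dependencies = Seq.empty
}

// A filtered RDD only remembers its parent and the predicate; any split
// can be recomputed on demand from the parent – that is the lineage.
class FilteredRDD[T](parent: MiniRDD[T], p: T => Boolean) extends MiniRDD[T] {
  def partitions = parent.partitions
  def compute(split: Int) = parent.compute(split).filter(p)
  def dependencies = Seq(parent)
}

val logfile = new SourceRDD(Vector(List("ERROR a", "INFO b"), List("ERROR c")))
val errors  = new FilteredRDD[String](logfile, _.contains("ERROR"))
println(errors.compute(0).toList)   // List(ERROR a)
```

Notice that FilteredRDD stores no data at all – only its parent and a function – which is exactly why a lost partition can always be rebuilt.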
Now a big secret, ready? To work with Spark, you don’t need to know anything about RDDs, and we have seen that many Spark courses don’t cover RDDs at the level of depth we just did. So why do we care so much about RDDs? RDD is at the core of Spark. Without understanding how RDDs work, it will be difficult to understand how fault tolerance works in Spark. We also won’t be able to understand how Spark derives the logical and physical plans that are translated into tasks for execution. This means that without a good understanding of RDDs, we will miss out on understanding the efficiencies of Spark’s execution engine, and that is the reason we go deep into understanding RDDs.
So to summarize: in this post, we got introduced to RDDs. We know what an RDD is, why it is needed, and why it is important. We also looked at the API documentation for RDD and understood the key properties that make up an RDD. From what we know now, we can simply say that RDD is at the core and heart of Spark.