1. Overview

As we become more connected and move more of our lives online, and as globalization and the internet erase physical distance, the amount of data we create and handle grows every day.

Several techniques and technologies have emerged to handle these data volumes, such as the MapReduce programming model, computer cluster parallelism, and the Apache Spark framework, which provides a distributed engine for manipulating data.

In this tutorial, we will look at one fundamental data structure of Spark, the Resilient Distributed Dataset or RDD.

2. Spark RDD

An RDD is an immutable, resilient, and distributed representation of a collection of records partitioned across the nodes of a cluster.

In Spark programming, RDDs are the fundamental data structure: Datasets and DataFrames are built on top of RDDs.

Spark exposes RDDs through an API in which the dataset is represented as an object, and we apply logic to it through its methods. With this API, we define how Spark will execute and perform all transformations.

Also, with this low-level API, we get type safety and the flexibility to manipulate the data as we need.

2.1. Spark Architecture

Apache Spark is designed to get the best performance out of a cluster of machines. The architecture comprises three main components: the driver program, the cluster manager, and the workers.

The driver program orchestrates the Spark application, the cluster manager allocates the resources and workers the application uses, and finally, the workers execute the program’s tasks.
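
As a minimal sketch of how these components meet in code (the application name and the local[*] master URL below are just placeholders, with local[*] simulating a cluster inside a single JVM), the driver program starts by creating a SparkSession, which talks to the cluster manager and gives us the SparkContext, sc, that we’ll use throughout this tutorial:

import org.apache.spark.sql.SparkSession

// The driver program starts here; the master URL tells Spark which cluster
// manager to request executors from (local[*] runs everything in one JVM)
val spark = SparkSession.builder()
  .appName("rdd-tutorial")
  .master("local[*]")
  .getOrCreate()

// The SparkContext is the driver's entry point for creating RDDs
val sc = spark.sparkContext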

2.2. Features

The RDD data structure is built on principles and features that help us control and optimize our code. Some of them are:

  • Immutability: A crucial concept of functional programming that makes parallelism easier. Whenever we want to change the state of an RDD, we create a new one with all the transformations applied, as sketched right after this list.
  • In-memory computation: Spark lets us work with data in RAM instead of on disk, because loading and processing data is much faster in memory than on disk.
  • Fault tolerance: With this architecture, the system keeps working correctly even when failures occur.
  • Partitioning: RDD data is distributed across nodes so that computation can run in parallel.
  • Lazy evaluation: In addition to the performance benefits, Spark evaluates RDDs lazily so that it only processes what is necessary and can optimize the work (DataFrames and Datasets have their query plans optimized by Catalyst).
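
As a quick sketch of immutability (assuming the SparkContext sc from the setup above), a transformation never modifies the original RDD; it returns a new one that knows how to compute itself:

val namesRDD = sc.parallelize(List("ada", "grace", "alan"))

// map returns a brand new RDD; namesRDD itself is left untouched
val upperNamesRDD = namesRDD.map(_.toUpperCase)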

RDDs were the first data structure of Apache Spark, and newer structures have since proven more efficient in some cases. However, RDDs are not deprecated and are still commonly used.

As already mentioned, DataFrames and Datasets are built on top of RDDs, so RDDs are still the core of Spark.

Also, DataFrames and Datasets are great when it comes to code optimization, thanks to Catalyst, space efficiency with Tungsten, and structured data. In addition, the higher-level API is easier to write and understand.

But RDDs are still a good choice, giving us flexible, low-level control of the dataset and data manipulation without a DSL (Domain Specific Language), especially when working with unstructured data and performance is not critical.

2.3. Create an RDD

Considering all this theoretical information, let’s create one.

There are two ways to do it: parallelizing a collection and reading data from a source file.

Let’s see how we create an RDD by parallelizing a collection:

val animals = List("dog", "cat", "frog", "horse")
val animalsRDD = sc.parallelize(animals)

In the example above, we get animalsRDD of type RDD[String].

The second way is loading the data from somewhere. Let’s take a look:

val data = sc.textFile("data.txt")
// data: RDD[String] = data.txt MapPartitionsRDD[1] at textFile...

Additionally, we can always read the data as a DataFrame/Dataset and convert it to an RDD with the .rdd method:

val dataDF = spark.read.csv("data.csv")
val rdd = dataDF.rdd
// rdd: RDD[Row]

3. Operations

All operations on RDDs are performed through functions. These functions cover data manipulation, persistence, interaction, and loading, for example: map(), filter(), and saveAsTextFile().

Spark functions fall into two categories: transformations and actions.

3.1. Transformations

Transformations are functions that return a new, transformed RDD from the given one. These functions are lazy; in other words, they are only executed when an action happens.

That is, any function that changes the RDD will only run when its result is actually needed. For instance, map(), join(), sort(), and filter() are transformations.

Thanks to this laziness, we don’t send data from the workers to the driver unnecessarily. For example, if we map and reduce an RDD, the driver only needs the reduced result, so we don’t send the intermediate map result.
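
As a small sketch of this behavior using the animalsRDD created earlier, declaring the map does no work at all; only the reduce, an action we’ll cover in section 3.2, triggers the job, and only the final number travels back to the driver:

// The map is only recorded in the RDD lineage; nothing runs yet
val lengthsRDD = animalsRDD.map(_.length)

// reduce is an action: it triggers the computation, and only the
// aggregated Int is sent back to the driver
val totalLength = lengthsRDD.reduce(_ + _) // 15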

Let’s see the map function in action:

val countLengthRDD = animalsRDD.map(animal => (animal, animal.length))

With this code, looking into the RDD, we would have: (“dog”, 3), (“cat”, 3), (“frog”, 4), and (“horse”, 5).

Now, let’s remove all the animals that start with ‘c’:

val noCRDD = animalsRDD.filter(!_.startsWith("c"))

Looking into the RDD, we now have (“dog”), (“frog”), and (“horse”) as a result.

Some of the most commonly used transformations are: map(), flatMap(), filter(), sample(), union(), and join().
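
As a brief sketch of two of them beyond map() and filter() (again using animalsRDD and sc), flatMap() may emit several elements per input element, while union() concatenates two RDDs:

// flatMap emits zero or more elements per input element,
// here one Char per letter of each animal name
val lettersRDD = animalsRDD.flatMap(_.toList)

// union concatenates two RDDs without removing duplicates
val moreAnimalsRDD = sc.parallelize(List("lion", "tiger"))
val allAnimalsRDD = animalsRDD.union(moreAnimalsRDD)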

It’s essential to avoid operations with side effects, as well as non-associative operations when aggregating data.

3.2. Actions

Actions, on the other hand, return data to the driver. They are also what starts the computation of the tasks, that is, the execution of all the transformations.

One way to see the contents of an RDD is the collect method, which is an action that returns Array[T]:

countLengthRDD.collect()
// res0: Array[(String, Int)] = Array((dog,3), (cat,3), (frog,4), (horse,5))

noCRDD.collect()
// res1: Array[String] = Array(dog, frog, horse)

In our cases, the results come back to the driver as arrays.

Another widely used action is the reduce method. Let’s see how that works:

val numbers = sc.parallelize(List(1, 2, 3, 4, 5))
numbers.reduce(_ + _)

In the example above, we see the number 15 as a result.
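
This is also where the earlier warning about non-associative operations comes in: Spark reduces each partition independently and then merges the partial results, so a function like subtraction, which is neither associative nor commutative, can give different results depending on how the data is partitioned. A small sketch of the pitfall:

// Subtraction is neither associative nor commutative, so the result
// depends on how the elements are grouped across partitions
numbers.reduce(_ - _)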

Some of the most commonly used actions are: collect(), reduce(), saveAsTextFile(), and count().
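
As a short sketch of two more of them (the output path below is just a placeholder), count() sends a single number back to the driver, while saveAsTextFile() persists the RDD, writing one part file per partition:

// count() returns the number of elements to the driver
animalsRDD.count() // 4

// saveAsTextFile() writes the RDD to the given directory
animalsRDD.saveAsTextFile("/tmp/animals")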

3.3. Key-Pair RDD

One particular type of RDD, the Key-Pair RDD, has a few special operations available.

These methods are usually distributed “shuffle” operations to group and aggregate data by key.

Shuffling is the Spark mechanism for re-distributing data across nodes. To perform this re-distribution, Spark executes costly work such as disk I/O, data serialization, and network transfers. Shuffling also creates intermediate files, which increases the cost and memory usage.

To clarify the shuffling, let’s take as an example the join method:

val rddOne = sc.parallelize(List((1, "cat"), (2, "dog"), (3, "frog")))
val rddTwo = sc.parallelize(List((1, "mammal"), (3, "amphibian")))

rddOne.join(rddTwo).collect

As a result, we have: Array((1,(cat, mammal)), (3, (frog, amphibian))). To produce it, Spark computes partial results in each partition and then shuffles them together by key to assemble the final result.
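
Another common operation on key-pair RDDs is reduceByKey(), which, unlike a plain groupByKey(), combines the values locally within each partition before shuffling, so less data crosses the network. A short sketch:

val scoresRDD = sc.parallelize(List(("cat", 1), ("dog", 2), ("cat", 3)))

// Values are pre-aggregated per partition, then merged by key after the shuffle
val totalsRDD = scoresRDD.reduceByKey(_ + _)
totalsRDD.collect()
// Array((dog,2), (cat,4)) -- the order may vary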

4. Performance Review

With Apache Spark, performance is a primary aspect to consider.

The newer data structures, DataFrame and Dataset, are more efficient thanks to all the optimizations applied to them. However, as shown before, RDDs are still a solid choice in many cases.

Another important point is that RDDs perform better with Scala than with other languages.

Moreover, the Spark RDD API provides information about the execution plan. For example, let’s take an RDD of numbers:

val data = List(1, 2, 3, 4, 5)
val rdd = sc.parallelize(data)

val transformedRDD = rdd
  .filter(_ % 2 == 0)
  .map(_ * 2)

To get more information about the RDD lineage, we can use the .toDebugString method:

transformedRDD.toDebugString
res0: String =
(8) MapPartitionsRDD[2] at map at <console>:26 []
 |  MapPartitionsRDD[1] at filter at <console>:26 []
 |  ParallelCollectionRDD[0] at parallelize at <console>:27 []

When we’re using Datasets or DataFrames, we have the .explain method:

val df = rdd.toDF()
val transformedDF = df
  .filter($"value" % 2 === 0)
  .withColumn("value", $"value" * 2)

transformedDF.explain("formatted")

The explain method shows the steps and the description of each:

== Physical Plan ==
* Project (4)
+- * Filter (3)
   +- * SerializeFromObject (2)
      +- Scan (1)

(1) Scan
Output [1]: [obj#1]
Arguments: obj#1: int, ParallelCollectionRDD[0] at parallelize at <console>:27

(2) SerializeFromObject [codegen id : 1]
Input [1]: [obj#1]
Arguments: [input[0, int, false] AS value#2]

(3) Filter [codegen id : 1]
Input [1]: [value#2]
Condition : ((value#2 % 2) = 0)

(4) Project [codegen id : 1]
Output [1]: [(value#2 * 2) AS value#11]
Input [1]: [value#2]

5. Conclusion

In this article, we’ve seen that Spark provides multiple data structures to optimize data processing, each with its own complexities and features.

To summarize, RDDs are great when working with unstructured data and when performance is not essential.

They allow us to work with functional patterns and to control all of the data through the provided API, staying flexible while still benefiting from the Spark architecture.

In conclusion, we presented some theoretical background on Spark itself and on RDDs, and we learned how to use them and in which cases they fit best.

The source code is available over on GitHub.