The blog of Maxime Kjaer

# CS-206 Parallelism and Concurrency

These are my notes from the CS-206 Parallelism and Concurrency course. Prerequisites are:

Please note that these notes won’t be as good or complete as in the previous semester, as some of the lectures in this course were given ex cathedra instead of as a MOOC.

## Introduction

Almost every desktop, laptop and mobile device today has multiple processors; it is therefore important to learn how to harness these resources. We’ll see how functional programming applies to parallelization. We’ll also learn how to estimate and measure performance.

## Part 1: Parallelism

#### What is parallel computing?

Parallel computing is a type of computation in which many calculations are performed at the same time. The basic principle is to divide the computation into smaller subproblems, each of which can be solved simultaneously. This is, of course, assuming that parallel hardware is at our disposal, with shared access to memory. Parallel programming is much harder than sequential programming, but we can get significant speedups.

Parallelism and concurrency are closely related concepts:

• Parallel program: uses parallel hardware to execute computation more quickly. It is mainly concerned with division into subproblems and optimal use of parallel hardware
• Concurrent program: may or may not run several executions at the same time. It is mainly concerned with modularity, responsiveness or maintainability (convenience).

The two often overlap; neither is the superset of the other.

Parallelism manifests itself at different granularity levels.

• Bit-level parallelism: processing multiple bits of data in parallel
• Instruction-level parallelism: executing different instructions from the same instruction stream in parallel
• Task-level parallelism: executing separate instruction streams in parallel

The first two are mainly implemented in hardware or in compilers; as developers, we focus on task-level parallelism.

### Parallelism on the JVM

#### Definitions

A process is an instance of a program that is executing in the OS. The same program can be started as a process more than once, or even simultaneously in the same OS. The operating system multiplexes many different processes onto a limited number of CPUs, so that each process gets time slices of execution. This mechanism is called multitasking.

Two different processes cannot access each other’s memory directly — they are isolated. Interprocess communication methods exist, but they aren’t particularly straightforward.

Each process can contain multiple independent concurrency units called threads. They can be started programmatically within the program, and they share the same memory address space — this allows them to exchange information by doing memory read/writes.

Each thread has a program counter and a program stack. JVM threads can’t modify each other’s stack memory; they can only modify the shared heap memory.

#### Implementation

1. Define a Thread subclass.
2. Instantiate a new Thread object.
3. Call start on the Thread object.

Notice that the same class can be used to start multiple threads.

t.join() blocks the main thread’s execution until the t thread is done executing.
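The three steps above can be sketched as follows. This is a minimal illustration, not the course's original listing; the names `HelloThreads`, `Greeter`, `label` and `greeting` are made up for the example.

```scala
object HelloThreads {
  // Step 1: a Thread subclass; run() holds the code executed on the new thread.
  class Greeter(val label: String) extends Thread {
    // Written by the new thread, read by the caller only after join().
    var greeting: String = ""
    override def run(): Unit = {
      greeting = s"Hello from $label"
    }
  }

  def demo(): List[String] = {
    // Step 2: instantiate; the same class is used for two threads.
    val t = new Greeter("t")
    val s = new Greeter("s")
    // Step 3: start both threads.
    t.start()
    s.start()
    // join() blocks the calling thread until each thread has finished.
    t.join()
    s.join()
    List(t.greeting, s.greeting)
  }
}
```

Because both results are read only after the corresponding `join()`, the returned list does not depend on how the two threads were interleaved.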

Let’s look at a more complex example:

Running it multiple times might yield the following output:

On the first two executions, the threads happened to execute linearly; first t, then s. But on the third attempt, the first thread printed Hello, but then the second thread kicked in, also printed Hello — before the first had time to print out world!, and then they both completed.

### Atomicity

The above shows that two parallel threads can overlap arbitrarily. However, we sometimes want to ensure that a sequence of statements executes at once, as if they were just one statement, meaning that we don’t want them to overlap. This is called atomicity.

An operation is atomic if it appears as if it occurred instantaneously from the point of view of other threads.

The implementation of getUniqueId() below isn’t atomic, as it suffers from the same problem as the hello world example above.

#### Synchronized blocks

How can we secure it from this problem? How do we get it to execute atomically?

The synchronized block is used to achieve atomicity. A block of code passed to x.synchronized is never executed by two threads at once. The JVM ensures this by storing an object called the monitor in each object. At most one thread can own the monitor of x at any particular time; it acquires the monitor when entering the block and releases it when leaving.
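As a hedged sketch of how a synchronized block can make an ID generator atomic: the `lock` object, the `uidCount` counter and the `collect` helper are assumptions made for this example, not necessarily the names used in the lecture.

```scala
object UniqueIds {
  private val lock = new AnyRef   // the object whose monitor guards the counter
  private var uidCount = 0L

  // The read-increment-write sequence runs while holding lock's monitor,
  // so no two threads can interleave inside it.
  def getUniqueId(): Long = lock.synchronized {
    uidCount = uidCount + 1
    uidCount
  }

  // Hypothetical helper: several threads request IDs concurrently.
  def collect(threads: Int, perThread: Int): Set[Long] = {
    val results = Array.fill(threads)(Vector.empty[Long])
    val workers = (0 until threads).map { i =>
      new Thread {
        // Each thread writes only to its own slot of the array.
        override def run(): Unit = {
          results(i) = Vector.fill(perThread)(getUniqueId())
        }
      }
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
    results.toVector.flatten.toSet
  }
}
```

If `getUniqueId` were not synchronized, two threads could read the same counter value and the resulting set could contain fewer than `threads * perThread` distinct IDs.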

synchronized blocks can even be nested.

This way, the thread gets a monitor on account A, and then on account B. Once it has monitors on both, it can transfer the amount from A to B. Another thread can do this with C and D in parallel.

Sometimes though, this may cause the code to freeze, or to deadlock. This is a scenario in which two or more threads compete for resources (such as monitor ownership) and wait for each other to finish without releasing the resources they have already acquired.

The following code should cause a deadlock:

T1 gets the monitor for a, T2 gets the monitor for b. Then they both wait for each other to release the monitor, leaving us in a deadlock.

One approach to avoiding deadlocks is to always acquire resources in the same order. This assumes an ordering relationship on the resources. In our example, we can simply assign unique IDs to the accounts, and order our synchronized calls according to this ID.
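A minimal sketch of this lock-ordering idea, assuming each account carries a unique `uid`. The class layout and the `demo` helper are illustrative assumptions rather than the lecture's exact code.

```scala
object OrderedTransfer {
  class Account(val uid: Long, private var amount: Int) {
    def balance: Int = this.synchronized { amount }

    // Always called on the account with the smaller uid, so every thread
    // takes the two monitors in the same global order.
    private def lockAndTransfer(target: Account, n: Int): Unit =
      this.synchronized {
        target.synchronized {
          this.amount -= n
          target.amount += n
        }
      }

    def transfer(target: Account, n: Int): Unit =
      if (this.uid < target.uid) this.lockAndTransfer(target, n)
      else target.lockAndTransfer(this, -n)
  }

  // Two threads transfer in opposite directions; without the ordering
  // this pattern could deadlock.
  def demo(): (Int, Int) = {
    val a = new Account(1L, 1000)
    val b = new Account(2L, 1000)
    val t1 = new Thread { override def run(): Unit = for (_ <- 1 to 1000) a.transfer(b, 1) }
    val t2 = new Thread { override def run(): Unit = for (_ <- 1 to 1000) b.transfer(a, 1) }
    t1.start(); t2.start()
    t1.join(); t2.join()
    (a.balance, b.balance)
  }
}
```

Transferring `-n` in the reverse direction is the same as transferring `n` in the requested direction, which is what lets the lower-`uid` account always be locked first.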

#### Memory model

A memory model is a set of rules describing how threads interact when accessing shared memory. Java Memory Model is the memory model for the JVM. There are many rules, but the ones we chose to remember in the context of this course are:

1. Two threads writing to separate locations in memory do not need synchronization.
2. A thread X that calls join on another thread Y is guaranteed to observe all the writes by thread Y after join returns. Note that if we don’t call join, there’s no guarantee that X will see any of Y’s changes when it reads in memory.

We will not be using threads and the synchronized primitive directly in the remainder of the course. However, the methods in the course are based on these, and knowledge about them is indeed useful.

### Running computations in parallel

How can we run the following code in parallel?

We just add parallel!

Recursion works very well with parallelism. We can for instance spin up an arbitrary number of threads:

##### Signature of parallel

It returns the same value as it is given, but can do it faster than its sequential counterpart. From the point of view of the value, it is an identity function. Its arguments are taken by name (CBN); otherwise it wouldn’t be able to do much with them, as they would be evaluated sequentially before being sent to parallel. parallel needs the unevaluated computations to function, thus CBN.
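To make the role of by-name parameters concrete, here is one possible thread-based sketch of `parallel`. The course provides its own implementation; this version is an assumption for illustration and only shows the shape of the signature.

```scala
object ParallelSketch {
  // Both arguments are by-name (=> A, => B): they arrive unevaluated,
  // so each one can be evaluated on a different thread.
  def parallel[A, B](taskA: => A, taskB: => B): (A, B) = {
    var right: Option[B] = None
    val t = new Thread {
      override def run(): Unit = { right = Some(taskB) }
    }
    t.start()          // taskB runs on the new thread
    val left = taskA   // taskA runs on the calling thread meanwhile
    t.join()           // after join, the write to `right` is visible
    (left, right.get)
  }
}
```

With by-value parameters, both expressions would already have been evaluated one after the other before `parallel` was even entered.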

#### Underlying hardware architecture affects performance

Sometimes, we do not achieve any speedup even though we ran computations in parallel. For instance, if we sum up array elements instead of summing powers of array elements like above, we don’t see any speedups using parallelism. This is because this computation is bound by the memory bandwidth, which acts as a bottleneck to any speedup.

Therefore, when considering opportunities for speed-up, we must take into account not only the number of cores, but also the parallelism available for any other shared resources that we might need in order to perform computation, such as memory in this case.

In general, parallel computation takes as long as its slowest / longest thread.

t = task(e) starts a computation “in the background”; the main thread continues while the task is running (unless we use join in which case it waits). Tasks are easier to use; instead of this mess with nested calls to parallel:

We can easily get 4 tasks by doing:

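We don’t call join explicitly: in the course library, an implicit conversion from Task[A] to A calls join for us whenever the result of a task is used as a value.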
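A hedged sketch of what `task` could look like when built on plain threads, together with a four-task computation. `Task`, `task` and `sumOfSquares` are illustrative definitions, and `join` is called explicitly here rather than through an implicit conversion.

```scala
object TaskSketch {
  trait Task[A] { def join: A }

  // Starts the computation in the background and returns a handle to it.
  def task[A](body: => A): Task[A] = {
    var result: Option[A] = None
    val t = new Thread {
      override def run(): Unit = { result = Some(body) }
    }
    t.start()
    new Task[A] {
      // Waits for the background thread, then returns its result.
      def join: A = { t.join(); result.get }
    }
  }

  // Four tasks, one per quarter of the array, instead of nested parallel calls.
  def sumOfSquares(xs: Array[Int]): Long = {
    def seg(from: Int, until: Int): Long = {
      var i = from
      var s = 0L
      while (i < until) { s += xs(i).toLong * xs(i); i += 1 }
      s
    }
    val q = xs.length / 4
    val t1 = task(seg(0, q))
    val t2 = task(seg(q, 2 * q))
    val t3 = task(seg(2 * q, 3 * q))
    val t4 = task(seg(3 * q, xs.length))
    t1.join + t2.join + t3.join + t4.join
  }
}
```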

### How do we measure performance?

#### Work and depth

We introduce two measures for a program:

• Work W(e): number of steps e would take if there was no parallelism. This is simply the sequential execution time
• Depth D(e): number of steps if we had unbounded parallelism

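The key rules are:

• $W(\text{parallel}(e_1, e_2)) = W(e_1) + W(e_2) + c_2$.
• $D(\text{parallel}(e_1, e_2)) = \max(D(e_1), D(e_2)) + c_1$.

For parts of code where we do not use parallel explicitly, we must add up costs. For a function call or operation $f(e_1, \dots, e_n)$:

• $W(f(e_1, \dots, e_n)) = W(e_1) + \dots + W(e_n) + W(f)(v_1, \dots, v_n)$.
• $D(f(e_1, \dots, e_n)) = D(e_1) + \dots + D(e_n) + D(f)(v_1, \dots, v_n)$.

Here $v_i$ denotes the value of $e_i$. If $f$ is a primitive operation on integers, then $W(f)$ and $D(f)$ are constant functions, regardless of $v_i$.

Suppose we know $W(e)$ and $D(e)$ and our platform has $P$ parallel threads. It is reasonable to use this estimate for running time:

$$D(e) + \frac{W(e)}{P}$$

Given $W$ and $D$, we can estimate how programs behave for different $P$:

• If $P$ is constant but inputs grow, parallel programs have the same asymptotic time complexity as sequential ones.
• Even with infinite resources ($P \to \infty$) we have non-zero complexity given by $D(e)$.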

#### Asymptotic analysis

Asymptotic analysis allows us to understand how the runtime of our algorithm changes when the inputs get larger or when we have more parallel hardware available. Just like in Algorithms, we consider the worst case to get an upper bound using big-O notation.

Let’s look at an example:

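The running time is linear in the distance between s and t: $W(s, t) = O(t - s)$, a function of the form $c_1(t - s) + c_2$.

The cost (work) of the recursive version, which splits the segment at its middle m until the segment is shorter than a threshold, is:

$$W(s, t) = \begin{cases} c_1(t - s) + c_2 & \text{if } t - s < \text{threshold} \\ W(s, m) + W(m, t) + c_3 & \text{otherwise} \end{cases}$$

$c_3$ is some function overhead, the cost of calculating the middle m and starting two parallel threads. Assume $t - s = 2^N(\text{threshold} - 1)$, where $N$ is the depth of the tree (how many times we “split our program into two”). The computation tree has $2^N$ leaves and $2^N - 1$ internal nodes, so:

$$W(s, t) = 2^N\left(c_1(\text{threshold} - 1) + c_2\right) + (2^N - 1)c_3 = 2^N c_4 + c_5$$

For each internal node, we do $c_3$ work, and for each leaf we do $c_1(\text{threshold} - 1) + c_2$ work. If we look at what is constant, we can write it in the above form. If we pick our $N$ so that $2^{N-1} < \frac{t - s}{\text{threshold} - 1} \le 2^N$ (meaning we just find the closest power of 2 approximation), we have:

$$W(s, t) \le 2^N c_4 + c_5 < (t - s)\frac{2 c_4}{\text{threshold} - 1} + c_5$$

$W(s, t)$ is in $O(t - s)$, so the work of our function is linear in the length of the segment. What if we now make it parallel?

The depth of our computation tree is given by:

$$D(s, t) = \begin{cases} c_1(t - s) + c_2 & \text{if } t - s < \text{threshold} \\ \max(D(s, m), D(m, t)) + c_3 & \text{otherwise} \end{cases}$$

Since we’re running the branches of the computation tree in parallel, the depth will be the max of the left and the right branch, plus some constant overhead. Let’s assume again that $t - s$ is of the convenient form $2^N(\text{threshold} - 1)$, where $N$ is the depth of the tree. The computation tree has $2^N$ leaves and $2^N - 1$ internal nodes.

• Leaves of computation tree: $c_1(\text{threshold} - 1) + c_2$
• One level above: $c_1(\text{threshold} - 1) + c_2 + c_3$
• Two levels above: $c_1(\text{threshold} - 1) + c_2 + 2c_3$
• Root: $c_1(\text{threshold} - 1) + c_2 + N c_3$

The root is $N$ levels above the leaves, since $N$ is the depth of the tree. Our depth is thus bounded by $O(N)$. As before, running time is monotonic in $t - s$. Let’s pick our $N$ to approximate our depth, meaning that for $2^{N-1} < \frac{t - s}{\text{threshold} - 1} \le 2^N$, we have $N < \log_2(t - s) + c_6$. This is an important result, since combined with the idea that depth is bounded by $O(N)$, we conclude that $D(s, t)$ is $O(\log(t - s))$. This means that parallelization has taken us from linear to logarithmic runtime (assuming unlimited parallelization).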
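The analysis above can be tied to code with the following sketch, which sums $|a_i|^p$ over a segment, first sequentially and then by divide and conquer. The threshold value and the thread-based `parallel` helper are assumptions made for this example.

```scala
object SegmentSum {
  val threshold = 1000   // assumed cut-off below which we stop splitting

  // Assumed helper: evaluates both by-name arguments on separate threads.
  def parallel[A, B](a: => A, b: => B): (A, B) = {
    var right: Option[B] = None
    val t = new Thread { override def run(): Unit = { right = Some(b) } }
    t.start()
    val left = a
    t.join()
    (left, right.get)
  }

  // Sequential version: work and depth are both linear in t - s.
  def sumSegment(a: Array[Int], p: Double, s: Int, t: Int): Double = {
    var i = s
    var sum = 0.0
    while (i < t) {
      sum += math.pow(math.abs(a(i)), p)
      i += 1
    }
    sum
  }

  // Recursive version: linear work, logarithmic depth in t - s.
  def segmentRec(a: Array[Int], p: Double, s: Int, t: Int): Double =
    if (t - s < threshold) sumSegment(a, p, s, t)
    else {
      val m = s + (t - s) / 2
      val (left, right) = parallel(segmentRec(a, p, s, m), segmentRec(a, p, m, t))
      left + right
    }
}
```

Replacing the call to `parallel` with a plain tuple gives the sequential recursive version whose work was analysed first.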

#### Empirical analysis: Benchmarking

Measuring performance is difficult. To ensure somewhat reliable results, we need a strict measurement methodology involving:

• Multiple repetitions
• Statistical treatment – computing mean and variance
• Eliminating outliers
• Preventing anomalies (GC, JIT compilation, aggressive optimizations)

This is all quite complex, so we use a tool to do it for us: ScalaMeter. To use it, we first need to add it as a dependency in build.sbt:

Then we can use it as such:

But this yields unreliable results because of garbage collection, JIT compilation and other dynamic optimizations. Running it multiple times can yield anything from 7 to 50 ms. We notice that the program runs in about 7 ms after a few runs; this is the effect of JVM warm-up.

ScalaMeter can ensure that warm-up has taken place if we do:

If we are not entirely satisfied with the default settings of ScalaMeter, we can change them as such:

Finally, ScalaMeter can measure more than just the running time:

• Measurer.Default: plain running time
• IgnoringGC: running time without GC pauses
• OutlierElimination: removes statistical outliers
• MemoryFootprint: memory footprint of an object
• GarbageCollectionCycles: total number of GC pauses

### Parallelizing important algorithms

#### Parallel merge sort

As we mentioned in Algorithms, merge sort lends itself well to parallelization. To implement it in parallel, we use two arrays: the original array xs and a temporary array ys, and we copy elements back and forth between the two.

The merge implementation is sequential, so we will not go through it. Benchmarking this parallel merge sort against the Scala quicksort implementation shows up to a two-fold speedup in practice.

##### Copying array in parallel

To copy the temporary array into the original one, we need an optimized algorithm:

#### Parallel map

Some operations we saw in the previous course were map, fold and scan (like fold but stores intermediate results).

Lists aren’t terribly efficient for this, as splitting them in half and combining them takes linear time. As alternatives, we’ll use arrays and trees in our implementation. We’ll see more about Scala’s parallel collection libraries in future lectures.

Parallelization yields 5x or 6x speedup in certain benchmarks. From the benchmarks we can also tell that the parallelized map is basically as efficient as specialized implementations of operations in parallel.
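A sketch of a parallel map over an array segment, under the assumption of a thread-based `parallel` helper and a fixed `threshold`. The names `mapASegSeq` and `mapASegPar` are chosen for this example.

```scala
object ParMapArray {
  val threshold = 1000   // assumed cut-off for switching to sequential code

  // Assumed helper: evaluates both by-name arguments on separate threads.
  def parallel[A, B](a: => A, b: => B): (A, B) = {
    var right: Option[B] = None
    val t = new Thread { override def run(): Unit = { right = Some(b) } }
    t.start()
    val left = a
    t.join()
    (left, right.get)
  }

  // Sequential map of inp[left, right) into the same positions of out.
  def mapASegSeq[A, B](inp: Array[A], left: Int, right: Int, f: A => B, out: Array[B]): Unit = {
    var i = left
    while (i < right) {
      out(i) = f(inp(i))
      i += 1
    }
  }

  // Parallel version: the two halves write to disjoint parts of out,
  // so no synchronization is needed.
  def mapASegPar[A, B](inp: Array[A], left: Int, right: Int, f: A => B, out: Array[B]): Unit =
    if (right - left < threshold) mapASegSeq(inp, left, right, f, out)
    else {
      val mid = left + (right - left) / 2
      parallel(mapASegPar(inp, left, mid, f, out), mapASegPar(inp, mid, right, f, out))
      ()
    }
}
```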

If we use trees instead of arrays:

##### Comparison of arrays and immutable trees

Arrays:

• (+) random access to elements; on shared memory, tasks can share the array
• (+) good memory locality
• (-) imperative: must ensure parallel tasks write to disjoint parts
• (-) expensive to concatenate

Immutable trees:

• (+) purely functional, produce new trees, keep old ones
• (+) no need to worry about disjointness of writes by parallel tasks
• (+) efficient to combine two trees
• (-) high memory allocation overhead

#### Parallel reduce

For reduce (or fold), the order of operations matters in general. When we process the elements in parallel, we must therefore require that the operation be associative, meaning that the way the applications are grouped (the placement of parentheses) does not affect the result. Examples of associative operations include addition or concatenation of strings, but not subtraction.

An operation f: (A, A) => A is associative if and only if for every x, y, z, f(x, f(y, z)) == f(f(x, y), z).

We can represent the reduction as an operation tree, where every node corresponds to a single operation (say, addition or concatenation). If t1 and t2 are different tree representations of the same reduction (so they contain the same elements in the same order, but group them differently), and f: (A, A) => A is associative, then reduce(t1, f) == reduce(t2, f).

If we want to implement reduce for arrays instead of trees, we can just conceptually consider arrays as trees by cutting them in half at every step (until a certain threshold size):
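The "array as a conceptual tree" idea can be sketched as follows. The `threshold` and the thread-based `parallel` helper are assumptions of this example, and the input segment is assumed to be non-empty.

```scala
object ParReduceArray {
  val threshold = 1000   // assumed cut-off for switching to sequential code

  // Assumed helper: evaluates both by-name arguments on separate threads.
  def parallel[A, B](a: => A, b: => B): (A, B) = {
    var right: Option[B] = None
    val t = new Thread { override def run(): Unit = { right = Some(b) } }
    t.start()
    val left = a
    t.join()
    (left, right.get)
  }

  // Reduces the non-empty segment inp[left, right) with an associative f.
  def reduceSeg[A](inp: Array[A], left: Int, right: Int, f: (A, A) => A): A =
    if (right - left < threshold) {
      var res = inp(left)
      var i = left + 1
      while (i < right) {
        res = f(res, inp(i))
        i += 1
      }
      res
    } else {
      // Cutting in half corresponds to one internal node of the tree.
      val mid = left + (right - left) / 2
      val (a1, a2) = parallel(reduceSeg(inp, left, mid, f), reduceSeg(inp, mid, right, f))
      f(a1, a2)
    }

  def reduce[A](inp: Array[A], f: (A, A) => A): A =
    reduceSeg(inp, 0, inp.length, f)
}
```

Since the left half is always combined before the right half, the element order is preserved and only associativity of `f` is required, not commutativity.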

#### Associative and/or commutative operations

Associative and commutative operations:

• Addition and multiplication of mathematical integers (BigInt) and of exact rational numbers (given as, e.g., pairs of BigInts)
• Addition and multiplication modulo a positive integer (e.g. $2^{32}$), including the usual arithmetic on 32-bit Int or 64-bit Long values
• Union, intersection, and symmetric difference of sets
• Union of bags (multisets) that preserves duplicate elements
• Boolean operations &&, ||, xor
• Addition and multiplication of polynomials
• Addition of matrices of fixed dimension

Associative but not commutative operations:

• Concatenation (append) of lists: (x ++ y) ++ z == x ++ (y ++ z)
• Concatenation of Strings (which can be viewed as lists of Char)
• Matrix multiplication AB for matrices A and B of compatible dimensions
• Composition of relations
• Composition of functions

Many operations are commutative but not associative, such as $f(x, y) = x^2 + y^2$. Interestingly, addition or multiplication of floating point numbers is commutative, but not associative. This is because of rounding errors (results can be off in the last digits), so we don’t always have (x + y) + z == x + (y + z).

As a conclusion, proving commutativity alone does not prove associativity. Another thing to look out for is that associativity is not preserved by mapping; when combining and optimizing reduce and map invocations, we need to be careful that operations given to reduce remain associative.
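A small worked example of floating point addition being commutative but not associative. The specific values are chosen for illustration only.

```scala
object FloatAssoc {
  val x = 1e20
  val y = -1e20
  val z = 1.0

  // x and y cancel exactly first, then z is added.
  val leftFirst: Double = (x + y) + z

  // z is far smaller than the rounding granularity of y,
  // so y + z rounds back to y and z is lost.
  val rightFirst: Double = x + (y + z)

  // Swapping the operands of a single addition never changes the result.
  val commutes: Boolean = (x + z) == (z + x)
}
```

Here `leftFirst` is 1.0 while `rightFirst` is 0.0, so a parallel reduction over doubles may return slightly different results depending on how the elements are grouped.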

##### Making an operation commutative is easy

Suppose we have a binary operation g and a strict total ordering less (e.g. lexicographical ordering of bit representations). Then the operation f(x, y) = if (less(y, x)) g(y, x) else g(x, y) is commutative.

There is no such trick for associativity, though.
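The construction of a commutative operation from g and less can be sketched as a higher-order function. The name `commutative` is made up for this example.

```scala
object MakeCommutative {
  // Always applies g to its arguments in sorted order, so swapping
  // the arguments of the resulting function cannot change the outcome.
  def commutative[A](g: (A, A) => A, less: (A, A) => Boolean): (A, A) => A =
    (x, y) => if (less(y, x)) g(y, x) else g(x, y)
}
```

For instance, wrapping string concatenation with the lexicographic ordering yields a function that returns "ab" for both ("a", "b") and ("b", "a").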

##### Constructing associative operations

Suppose f1: (A1,A1) => A1 and f2: (A2,A2) => A2 are associative.

Then f: ((A1,A2), (A1,A2)) => (A1,A2) defined by f((x1, x2), (y1, y2)) = (f1(x1, y1), f2(x2, y2)) is also associative.

The following functions are also associative:
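As one illustration of the pairing construction (an assumed example, not necessarily the one used in the lecture), pairing two integer additions gives an associative operation on (sum, count) pairs, which lets us compute an average with a single reduce.

```scala
object PairAssoc {
  // Combines two associative operations component-wise.
  def pairwise[A1, A2](f1: (A1, A1) => A1, f2: (A2, A2) => A2): ((A1, A2), (A1, A2)) => (A1, A2) =
    (p, q) => (f1(p._1, q._1), f2(p._2, q._2))

  // (sum, count) pairs combined by adding both components.
  val sumAndCount: ((Int, Int), (Int, Int)) => (Int, Int) =
    pairwise[Int, Int](_ + _, _ + _)

  // Average of a non-empty list via one reduction over pairs.
  def average(xs: List[Int]): Double = {
    val (sum, count) = xs.map(x => (x, 1)).reduce(sumAndCount)
    sum.toDouble / count
  }
}
```

The division happens only once at the end; averaging itself is not associative, which is why the reduction works on pairs instead.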

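There are some situations where commutativity can help us establish associativity, but we need some additional property. Let E(x, y, z) = f(f(x, y), z).

We say arguments of E can rotate if E(x, y, z) = E(y, z, x), that is, f(f(x, y), z) = f(f(y, z), x).

If the above function f is commutative and the arguments of E can rotate, then f is also associative. Proof: f(f(x, y), z) = f(f(y, z), x) by rotation, and f(f(y, z), x) = f(x, f(y, z)) by commutativity.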

We can use this to prove associativity for the following examples:

Again, we should be wary of floating point numbers in proving associativity!

#### Parallel scan

Sequentially, scanLeft can be implemented as:

Can this be made parallel? We’ll assume f is associative. Our goal is to have an $O(\log n)$ algorithm (given infinite parallelism).

At first, this task seems almost impossible, because the value of the last element in sequence is computed from the previous element. And for every element, it looks like the natural way is indeed what we gave in the sequential algorithm. But even if we parallelize the individual applications of f, we would not be able to parallelize the traversal of the array itself. So this would give us still a linear algorithm even with infinite parallelism.

So, we need to perform the computation in a different order; the idea is to give up reusing all intermediate results. In fact, we will do more work and more applications of f than the simple sequential version. However, this will allow us to improve parallelism, and in terms of parallel running time this more than compensates for the fact that we are applying f a few more times than in the sequential algorithm.

To show that this is even possible in parallel, here’s how we’d define it in terms of the parallel map and reduce:
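A sketch of both variants, using sequential stand-ins: `scanLeftSeq` is the straightforward sequential scan, and `scanLeftViaReduce` computes every output element independently with a reduce over the corresponding prefix. All names are assumptions of this example, and the reduce and the outer loop are written sequentially here; in a parallel setting they would be replaced by the parallel reduce and map.

```scala
object ScanSketch {
  // Sequential scanLeft: each output reuses the previous intermediate result.
  def scanLeftSeq[A](inp: Array[A], a0: A, f: (A, A) => A, out: Array[A]): Unit = {
    out(0) = a0
    var a = a0
    var i = 0
    while (i < inp.length) {
      a = f(a, inp(i))
      i += 1
      out(i) = a
    }
  }

  // Reduce of inp[left, right) starting from a0 (stand-in for a parallel reduce).
  def reduceSeg1[A](inp: Array[A], left: Int, right: Int, a0: A, f: (A, A) => A): A = {
    var a = a0
    var i = left
    while (i < right) {
      a = f(a, inp(i))
      i += 1
    }
    a
  }

  // Each out(i) depends only on inp, never on out(i - 1), so the iterations
  // are independent and could be distributed by a parallel map.
  def scanLeftViaReduce[A](inp: Array[A], a0: A, f: (A, A) => A, out: Array[A]): Unit = {
    var i = 0
    while (i <= inp.length) {
      out(i) = reduceSeg1(inp, 0, i, a0, f)
      i += 1
    }
  }
}
```

The second variant performs a quadratic number of applications of f in total, which illustrates the trade of extra work for independence between outputs.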

##### On trees

Let’s implement scanLeft on trees.

Here’s how downsweep works:

### Data parallelism

Task parallelism is a form of parallelization that distributes execution processes across computing nodes.

In Scala, we express this with task and parallel.

Data-parallel programs have a different approach:

Data parallelism is a form of parallelization that distributes data across computing nodes.

Why would we want to use data-parallelism? It can be much faster than task parallelism (in the demo with Mandelbrot sets, it was 2x faster).

Why are data parallel programs sometimes faster than task parallel ones?

Different data-parallel programs have different workloads. Workload is a function that maps each input element to the amount of work required to process it.

initializeArray had a workload defined by a constant function, $w(i) = \text{const}$. We call this a uniform workload, and it’s really easy to parallelize.

An irregular workload is one where the work is described by an arbitrary function: $w(i) = f(i)$. This is where we have a data-parallel scheduler, whose role is to efficiently balance the workload across processors without any knowledge about $w(i)$. The idea of the scheduler is to shift the task of balancing the workload away from the programmer. Different schedulers have similar semantics, so we won’t study them in detail.

#### Parallel for-loop

To initialize arrays with a given value in parallel (writing v to every position in xs)

Here, the .par method converts the range to a parallel range; the for loop will be executed in parallel. Parallel for-loops are not functional, do not return a value, and can therefore only communicate with the rest of the program through some side effect, such as writing to an array. Therefore, the parallel for-loop must write to separate memory locations or be synchronized in order to work.

#### Non-parallelizable operations

In general terms, most sequential collections can be converted to parallel collections by using .par; some collection operations subsequently applied become data-parallel, but not all. Let’s look at an example:

This does not execute in parallel, as foldLeft must process the elements sequentially; foldRight, reduceLeft, reduceRight, scanLeft and scanRight similarly operate sequentially, even on parallel collections.

#### Parallelizable operations

However, fold (without any direction) can process elements in parallel (see LEGO bricks explanation in lecture videos). Our previous sum function, and max could be written like this instead:

It is important to note that fold will work with a function f if:

• f is an associative operation
• When applied to the neutral element z, it must act as an identity function.

In other words, the following relations must hold: f(a, f(b, c)) == f(f(a, b), c) and f(z, a) == f(a, z) == a.

In more formal terms, the neutral element z and the binary operator f must form a monoid. Commutativity is not important for fold, but it is important that the neutral element z be of the same type as the collection items (unlike foldLeft); this is clear if we look at the signature of fold: def fold(z: A)(f: (A, A) => A): A.

This seems like a lot of limitations, so we’ll need a more powerful data-parallel operation. Enter aggregate, with signature def aggregate[B](z: B)(f: (B, A) => B, g: (B, B) => B): B.

It divides the collection into pieces, applies the sequential folding operator f to each piece, and combines the results using the parallel folding operator g. Using it, we can do what we couldn’t do with fold: count the number of vowels in a character array:

Again, the parallel reduction operator g and the neutral element z should form a monoid (z should be the neutral element of g). Note that this is just an “if”, not “if and only if”; the only iff condition for aggregate to work is:

Many other parallel collection operations can be expressed in terms of aggregate. So far we’ve only seen accessor combinators (sum, max, fold, count, aggregate, …). Transformer combinators (such as map, filter, flatMap, groupBy) do not return a single value, but instead return new collections as a result.
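To illustrate the semantics of aggregate without depending on the parallel collections library, here is a hand-rolled sketch that splits a vector into chunks, folds each chunk with f in a Future, and combines the partial results with g. The chunking strategy, the `chunkSize` parameter and the use of Futures are assumptions of this example; the real parallel collections schedule the work differently.

```scala
object AggregateSketch {
  import scala.concurrent.{Await, Future}
  import scala.concurrent.ExecutionContext.Implicits.global
  import scala.concurrent.duration.Duration

  // chunkSize must be positive; z must be a neutral element for g.
  def aggregate[A, B](xs: Vector[A], chunkSize: Int)(z: B)(f: (B, A) => B, g: (B, B) => B): B = {
    val pieces = xs.grouped(chunkSize).toVector
    // Sequential fold with f inside each piece, pieces processed concurrently.
    val futures = pieces.map(piece => Future(piece.foldLeft(z)(f)))
    val partials = Await.result(Future.sequence(futures), Duration.Inf)
    // Partial results (all of type B) are combined with g, in order.
    partials.foldLeft(z)(g)
  }

  def isVowel(c: Char): Boolean = "aeiouAEIOU".indexOf(c) >= 0

  // The accumulator is an Int while the elements are Chars,
  // which is exactly what fold cannot express.
  def countVowels(text: String): Int =
    aggregate(text.toVector, 4)(0)((count, c) => if (isVowel(c)) count + 1 else count, _ + _)
}
```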

#### Parallel collections

In sequential collections, the hierarchy is as follows:

• Traversable[T]: collection of elements with type T, with operations implemented using foreach
• Iterable[T]: collection of elements with type T, with operations implemented using iterator
• Seq[T]: an ordered sequence of elements with type T
• Set[T]: a set of elements with type T (no duplicates)
• Map[K, V]: a map of keys with type K associated with values of type V (no duplicate keys)

Traits ParIterable[T], ParSeq[T], ParSet[T] and ParMap[K, V] are the parallel counterparts of different sequential traits. For code that is agnostic about parallelism, there exists a separate hierarchy of generic collection traits GenIterable[T], GenSeq[T], GenSet[T] and GenMap[K, V].

Using these generic collections, operations may or may not execute in parallel:

In practice, parallelizable collections are:

• ParArray[T]: parallel array of objects, counterpart of Array and ArrayBuffer
• ParRange: parallel range of integers, counterpart of Range
• ParVector[T]: parallel vector, counterpart of Vector
• immutable.ParHashSet[T]: counterpart of immutable.HashSet
• immutable.ParHashMap[K, V]: counterpart of immutable.HashMap
• mutable.ParHashSet[T]: counterpart of mutable.HashSet
• mutable.ParHashMap[K, V]: counterpart of mutable.HashMap
• ParTrieMap[K, V]: thread-safe parallel map with atomic snapshots, counterpart of TrieMap
• for other collections, par creates the closest parallel collection: e.g. a List is converted to a ParVector

The last point stresses the importance of picking data structures carefully and making sure that they are parallelizable; otherwise, the conversion might take longer than the parallel instructions themselves.

##### Avoiding parallel errors

As we’ve said previously, one should either synchronize or write to separate memory locations. To synchronize, we can use the Java ConcurrentSkipListSet[T] instead of the Scala mutable Set. To avoid side effects, we can use the right combinators (for instance, use filter instead of writing your own filtering loop).

A rule to avoid concurrent modifications during traversals is to never read or write to a parallel collection on which a data-parallel operation is in progress.

The TrieMap collection is an exception to this; it atomically takes snapshots whenever a parallel operation starts, so concurrent updates aren’t observed during that time. It offers the snapshot method (efficient: constant time), which can be used to efficiently grab the current state.

### Data-parallel abstractions

Transformer operations are collection operations that create another collection instead of a single value. Methods such as filter, map, flatMap, groupBy are examples of transformer operations.

#### Iterators

The iterator contract states that:

• next can be called only if hasNext returns true. That means that before calling next, one should always check hasNext.
• After hasNext returns false, it will always return false

#### Splitters

The splitter contract states that:

• After calling split, the original splitter is left in an undefined state
• The resulting splitters traverse disjoint subsets of the original splitter
• remaining is an estimate on the number of remaining elements
• split is an efficient method: $O(\log n)$ or better (since we invoke it in parallel in hopes of obtaining a speedup)

#### Builders

Builders are abstractions for creating new sequential collections. T denotes the type of the elements of the collection (e.g. String), and Repr is the type of the resulting collection of elements (e.g. Seq[String]).

The builder contract states that:

• Calling result returns a collection of type Repr, containing the elements that were previously added with +=
• Calling result leaves the Builder in an undefined state (after this we cannot use it anymore)
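The builder contract can be seen in action with the builders from the Scala standard library. The `upperCased` function is a made-up example; here T is String and Repr is Vector[String].

```scala
object BuilderSketch {
  def upperCased(words: Seq[String]): Vector[String] = {
    // A Builder[String, Vector[String]] obtained from the companion object.
    val b = Vector.newBuilder[String]
    // Elements are added one at a time with +=.
    for (w <- words) b += w.toUpperCase
    // result() yields the collection; the builder should not be reused afterwards.
    b.result()
  }
}
```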

#### Combiners

A combiner is a parallel version of a builder. It has the same += and result methods as it extends Builder, but adds a method combine to merge two combiners (invalidating the two old combiners in the process).

The combiner contract states that:

• Calling combine returns a new combiner that contains elements of input combiners
• Calling combine leaves both original Combiners in an undefined state
• combine is an efficient method: $O(\log n)$ or better (usually, this means that we should run it in parallel)

##### Implementing combiners

How do we implement it efficiently (in $O(\log n + \log m)$ time)? We’ll see that this depends on the underlying data structure; Repr could be a map, a set or a sequence, and that changes everything:

• When Repr is a set or a map, combine represents union
• When Repr is a sequence, combine represents concatenation

For arrays, there is no efficient combine operation. This has to do with how arrays are stored in memory; the two subarrays may be stored in different locations, which implies having to move one to the end of another; this cannot be done in logarithmic time, only in linear time.

For sets, we can use different data structures, that may have different runtimes for lookup, insertion and deletion:

• Hash tables: expected $O(1)$
• Balanced trees: $O(\log n)$

Unfortunately, most set implementations do not have an efficient union operation, so combine is tricky to implement.

For sequences, there are also a few data structures, with which the operation complexity varies.

• Mutable linked lists: $O(1)$ prepend and append, $O(n)$ insertion
• Functional (cons) lists: $O(1)$ prepend, $O(n)$ everything else
• Array lists: amortized $O(1)$ append, $O(1)$ random access, $O(n)$ otherwise

(Amortized means that we may need to copy the array to another location when we need more space, but overall we expect it to be constant time on average.)

Mutable linked lists can have $O(1)$ concatenation, but for most sequences, concatenation is $O(n)$.

All of the above shows that providing a combiner for the corresponding collections is not straightforward, since most parallel data structures do not have the efficient union or concatenation operation we want — but it is still possible to implement.

##### Two-phase construction

Most data structures can be constructed in parallel using two-phase construction. Here, the combiner doesn’t use the final data structure in its internal representation; it uses an intermediate data structure that:

• Has an efficient combine method: $O(\log n + \log m)$ or better
• Has an efficient += method (this ensures that individual processors can efficiently modify the data structure)
• Can be converted to the resulting data structure in $O(n/P)$ time (n size of data structure, P number of processors)

Let’s implement it for arrays:
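A hedged sketch of a two-phase array combiner: the intermediate data structure is a list of array buffers, and the final array is filled in parallel when `result` is called. Using one thread per buffer is a simplification chosen for this example; it is not how a production combiner would schedule the copy.

```scala
object ArrayCombinerSketch {
  import scala.collection.mutable.ArrayBuffer
  import scala.reflect.ClassTag

  class ArrayCombiner[T: ClassTag] {
    private var numElems = 0
    // Intermediate representation: a buffer of buffers.
    private val buffers = ArrayBuffer(new ArrayBuffer[T])

    // Amortized constant time: append to the last nested buffer.
    def +=(x: T): this.type = {
      buffers.last += x
      numElems += 1
      this
    }

    // Only pointers to nested buffers are copied, not the elements.
    def combine(that: ArrayCombiner[T]): ArrayCombiner[T] = {
      buffers ++= that.buffers
      numElems += that.numElems
      this
    }

    // Phase two: allocate the array, then copy every buffer into its
    // own disjoint range, one thread per buffer.
    def result(): Array[T] = {
      val array = new Array[T](numElems)
      val offsets = buffers.scanLeft(0)(_ + _.length)
      val workers = buffers.indices.map { i =>
        new Thread {
          override def run(): Unit = {
            buffers(i).copyToArray(array, offsets(i))
            ()
          }
        }
      }
      workers.foreach(_.start())
      workers.foreach(_.join())
      array
    }
  }
}
```

As long as each processor contributes a bounded number of nested buffers, combine stays cheap because it never touches individual elements.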

Benchmarks show over 2x speedups with 4 processors (not linear because of the memory access bottleneck). So for arrays:

1. partition the indices into subintervals
2. initialize the array in parallel

For hash tables:

1. partition the hash codes into buckets (e.g. linked lists of arrays) according to their hashcode prefix
2. allocate the table, and map hash codes from different buckets into different regions

For search trees:

1. partition the elements into non-overlapping intervals according to their ordering
2. construct search trees in parallel, and link non-overlapping trees (which is efficient when they’re non-overlapping)

Spatial data structures (see exercises):

1. spatially partition the elements
2. construct non-overlapping subsets and link them

So as a conclusion, how can we implement combiners?

1. Two-phase construction: the combiner uses an intermediate data structure with an efficient combine method to partition the elements. When result is called, the final data structure is constructed in parallel from the intermediate data structure.
2. An efficient concatenation or union operation: a preferred way when the resulting data structure allows this.
3. Concurrent data structure: different combiners share the same underlying data structure, and rely on synchronization to correctly update the data structure when += is called.

We’ll look more into the second method, which is more suited for parallel computations.

##### Conc-Trees

Conc is the parallel counterpart to functional cons lists, and is used to manipulate data. Let’s compare them to other data structures with an example: the implementation of filter:

Lists are built for sequential computations, and are traversed left to right:

Trees allow parallel computations – their subtrees can be traversed in parallel (this is not a search tree, just a regular tree):

Trees are not good for parallelism unless they are balanced. Let’s devise a data type called Conc, which represents balanced trees:

In addition, we will define the following invariants for Conc-trees:

1. A <> node can never contain Empty as its subtree. This guards us from sparse trees with too many empty subtrees
2. The level (read: height) difference between the left and the right subtree of a <> node is always 1 or less. This ensures that the height is bounded by $O(\log n)$.

Concatenation is then:

Concatenation takes $O(|h_1 - h_2|)$ time, where $h_1$ and $h_2$ are the heights of the two trees.

##### Combiners using Conc-Trees

First we’ll implement +=. To make it efficient ($O(1)$), we extend the Conc-Tree with a new node type:

The Append node has the same structure as a regular <> Conc node, and it has the same level and size; however, we will not impose the balance invariant on it. Trees of arbitrary size difference are allowed. To append a leaf:

This is indeed $O(1)$, but it creates an unbalanced tree, which means concatenation will be $O(n)$ instead of $O(\log n)$.

We have implemented an immutable data structure with:

• $O(1)$ appends
• $O(\log n)$ concatenation

Transforming a Conc-Tree with Append nodes into a regular Conc-Tree should be fairly straightforward from this point on (concatenate trees from append list together). We almost have a functioning combiner; we just need to do a little more work.

The ConcBuffer appends elements into an array of size k. When the array gets full, it is stored into a Chunk node and added into the Conc-tree.

Chunk nodes are similar to Single nodes, but instead of a single element, they hold an array of elements.

Summary:

• $O(\log n)$ combine concatenation
• Fast $O(1)$ += operation
• result operation

## Part 2: Concurrent programming

### A surprising program

If we try to mentally simulate all runtime scenarios, where the threads execute in parallel, we can distinguish three scenarios:

• y = 1, x = 0
• y = 0, x = 1
• y = 0, x = 0

In no scenario do we have x = 1, y = 1. Yet if we run this program, we do encounter this scenario! Let’s rebuild our intuition of concurrent programming.

Every concurrent programming model must answer two questions:

1. How to express that two executions are concurrent?
2. Given a set of concurrent executions, how can they exchange information (i.e. synchronize)?

In what follows, we will answer these two questions in the context of the JVM concurrency model.

The thread function is implemented as follows:
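A minimal sketch of such a helper, together with a program that is consistent with the three scenarios listed earlier. The original listings are not reproduced in these notes, so the wrapper object `SurprisingProgram`, the method `runOnce` and the variable names `a` and `b` are assumptions:

```scala
object SurprisingProgram {
  // Runs `body` on a new JVM thread and returns the thread so callers can join it.
  def thread(body: => Unit): Thread = {
    val t = new Thread {
      override def run(): Unit = body
    }
    t.start()
    t
  }

  // One run of the racy program; returns (x, y).
  def runOnce(): (Int, Int) = {
    var a, b = false
    var x, y = -1
    val t1 = thread { a = true; y = if (b) 0 else 1 }
    val t2 = thread { b = true; x = if (a) 0 else 1 }
    t1.join()
    t2.join()
    (x, y)
  }
}
```

Under sequential consistency `runOnce()` could never return `(1, 1)`: whichever thread runs its second statement last must already see the other thread's flag. On real hardware that result can occasionally show up when the program is run many times.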

We need threads, rather than working directly with the CPU, for two reasons:

• Portability: We don’t know which / how many CPUs to address, since this depends on the system
• The number of concurrent entities in a program can be much larger than the number of CPUs

Each thread keeps its own state in memory:

• Copies of processor registers
• The call stack (~2MB per default)

The operating system eventually assigns threads to processors (the OS guarantees liveness). Two approaches:

• Cooperative multitasking: a program has to explicitly give control (yield) back to the OS (think Windows 3.1)
• Preemptive multitasking: the OS has a hardware timer that periodically interrupts the running thread, and assigns a different thread to the CPU (time slices usually ~10 ms)

#### Some more definitions

Non-deterministic program: Given the same input, the program output is not unique between multiple runs. We want to write deterministic programs!

When join returns, the effects of the terminated thread are visible to the thread that called join.

To avoid race conditions, we want to ensure that all operations of a function are performed atomically, without another thread reading or writing intermediate results. To do so, we used synchronized blocks, as seen previously (remember how it places a lock on the object).
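As a reminder of what that looks like, here is a minimal sketch of a unique-ID generator whose read-increment-write sequence is made atomic with `synchronized`. The names `SynchronizedDemo`, `UidGenerator` and `hammer` are hypothetical and only serve this example:

```scala
object SynchronizedDemo {
  final class UidGenerator {
    private var uidCount = 0L   // guarded by the monitor of `this`

    // The whole read-increment-write sequence runs while holding the lock.
    def getUID(): Long = this.synchronized {
      uidCount += 1
      uidCount
    }
  }

  // Calls getUID from several threads at once and waits for all of them.
  def hammer(gen: UidGenerator, threads: Int, perThread: Int): Unit = {
    val workers = (1 to threads).map { _ =>
      new Thread {
        override def run(): Unit = for (_ <- 1 to perThread) gen.getUID()
      }
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
  }
}
```

Without the `synchronized` block two threads could read the same value of `uidCount` and hand out the same ID twice.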

### Monitors

How do we know if a monitor is released? We could do polling (also called busy waiting), but that consumes compute time while waiting. Instead, we can use a notification; indeed, all Monitor objects have the following methods:

• wait(): suspends the current thread
• notify(): wakes up one other thread waiting on the current object
• notifyAll(): wakes up all other threads waiting on the current object

The fine print:

• wait, notify and notifyAll should only be called from within a synchronized block on this
• wait will release the lock, so other threads can enter the monitor
• notify and notifyAll schedule other threads for execution after the calling thread has released the lock (has left the monitor)
• On the JVM runtime, it is possible that a thread calling wait sometimes wakes up even if nobody called notify or notifyAll. This is why we use while and not if.
• In practice, we’ll use notifyAll() much more often
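The rules above can be seen at work in a one-place buffer. This is a sketch under the assumption that a single producer and a single consumer exchange values; `MonitorDemo`, `OnePlaceBuffer` and `transfer` are names chosen for this example:

```scala
object MonitorDemo {
  final class OnePlaceBuffer[T] {
    private var elem: Option[T] = None   // guarded by the monitor of `this`

    def put(e: T): Unit = this.synchronized {
      while (elem.isDefined) this.wait()   // `while`, not `if`: re-check after every wakeup
      elem = Some(e)
      this.notifyAll()
    }

    def get(): T = this.synchronized {
      while (elem.isEmpty) this.wait()
      val e = elem.get
      elem = None
      this.notifyAll()
      e
    }
  }

  // Sends 1..n through the buffer and returns the sum seen by the consumer.
  def transfer(n: Int): Int = {
    val buf = new OnePlaceBuffer[Int]
    var sum = 0
    val producer = new Thread {
      override def run(): Unit = (1 to n).foreach(buf.put)
    }
    val consumer = new Thread {
      override def run(): Unit = for (_ <- 1 to n) sum += buf.get()
    }
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    sum
  }
}
```

Both methods call `wait` inside `synchronized`, which releases the lock while the thread is suspended, so the other side can enter the monitor and make progress.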

#### Memory model

A memory model is a set of rules that defines how and when the writes to memory by one thread become visible to other threads. Consider our introductory example:

When we initially analyzed the introductory example, we assumed that every read and write happens in the program order, and that every read and write goes to main memory. That specific memory model is called the sequential consistency model. More formally:

Consider all the reads and writes to program variables. If the result of the execution is the same as if the read and write operations were executed in some sequential order, and the operations of each individual processor appear in the program order, then the model is sequentially consistent.

Unfortunately, as we saw in our experiment, multicore processors and compilers do not implement the sequential consistency model.

The Java Memory Model (JMM) defines a “happens-before” relationship as follows.

• Program order: Each action in a thread happens-before every subsequent action in the same thread.
• Monitor locking: Unlocking a monitor happens-before every subsequent locking of that monitor.
• Volatile fields: A write to a volatile field happens-before every subsequent read of that field.
• Thread start: A call to start() on a thread happens-before all actions of that thread.
• Transitivity: If A happens-before B and B happens-before C, then A happens-before C.

This means:

• A program point of a thread t is guaranteed to see all actions that happen-before it.
• It may (but is not guaranteed to) also see actions that can occur before it in the sequential consistency (interleaving) model.

Back to our surprising program:

This can fail since the two threads operate on their separate copy of memory, and can therefore have outdated information compared to each other.

This would work, though, because synchronized synchronizes the processor caches with main memory at the end of its execution.

In general, a memory model is an abstraction of the hardware capabilities of different computer systems. It essentially abstracts over the underlying system’s cache coherence protocol.

##### Volatile fields

A volatile field is a field that may change at any time, thanks to another thread. Making a variable @volatile has several effects:

• Reads and writes to volatile variables are never reordered by the compiler.
• Reads and writes are never cached in CPU registers — they go directly to the main memory
• Writes to normal variables, that in the program precede a volatile write W, cannot be moved by the compiler after W
• Reads from normal variables that in the program appear after a volatile read R cannot be moved by the compiler before R.
• Before a volatile write, values cached in registers must be written back to main memory.
• After a volatile read, values cached in registers must be re-read from the main memory.

Writes to @volatile are somewhat expensive (~50% as much as synchronized), but reads are very cheap. But @volatile offers fewer guarantees than synchronized.
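To make the ordering rules concrete, here is a small sketch of safe publication through a volatile flag. The names `VolatileDemo`, `Box` and `publish` are assumptions for this example:

```scala
object VolatileDemo {
  final class Box {
    var payload = 0                // ordinary field
    @volatile var ready = false    // volatile flag guarding the payload
  }

  // Returns the payload value observed by the reader thread.
  def publish(): Int = {
    val box = new Box
    var seen = -1
    val reader = new Thread {
      override def run(): Unit = {
        while (!box.ready) ()      // spin on the volatile read
        seen = box.payload         // cannot be moved before the volatile read
      }
    }
    reader.start()
    box.payload = 42               // cannot be moved after the volatile write
    box.ready = true
    reader.join()
    seen
  }
}
```

The write to `payload` precedes the volatile write to `ready`, and the reader only reads `payload` after a volatile read of `ready` returned `true`, so by the happens-before rules it must see 42. If `ready` were a plain field, the reader might spin forever or read a stale payload.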

#### Executors

Threads have a lot of nice guarantees, but they are expensive to create. To counteract that, people use a pool of threads as workhorses that perform the tasks given to them. The number of available threads in a pool is typically some polynomial of the number of cores.

A task presented to an executor is encapsulated in a Runnable object:

Here’s how a task gets passed to the ForkJoinPool:

Note that there is no way to await the end of a task like we did with t.join() for threads. Instead, we pause the main thread to give the executor threads time to finish.
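A sketch of that pattern, assuming a trivial task that only bumps a counter. `ExecutorDemo` and `runTask` are hypothetical names, and the 500 ms pause is an arbitrary choice:

```scala
import java.util.concurrent.ForkJoinPool
import java.util.concurrent.atomic.AtomicInteger

object ExecutorDemo {
  // Submits one task and returns how many tasks had completed after the pause.
  def runTask(): Int = {
    val completed = new AtomicInteger(0)
    val executor = new ForkJoinPool()

    // The task is wrapped in a Runnable and handed to the pool.
    executor.execute(new Runnable {
      override def run(): Unit = {
        completed.incrementAndGet()
        ()
      }
    })

    // execute() gives us no handle to wait on, so we simply pause the caller.
    Thread.sleep(500)
    completed.get
  }
}
```

Sleeping is a blunt instrument: it wastes time when the task is fast and is not a guarantee when the task is slow, which is one reason to move on to higher-level abstractions.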

The scala.concurrent package defines the ExecutionContext trait and object which is similar to Executor but more specific to Scala.

Execution contexts are passed as implicit parameters to many of Scala’s concurrency abstractions. Here’s how one runs a task using the default execution context:

To hide all of this boilerplate, we can put it all in an execute function.
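One possible shape for such a helper, as a sketch. The object name `ExecuteHelper` and the `demo` method are assumptions; the latch in `demo` is only there so that the example can observe that the task ran:

```scala
import java.util.concurrent.{CountDownLatch, TimeUnit}
import scala.concurrent.ExecutionContext

object ExecuteHelper {
  // Wraps a block of code in a Runnable and hands it to the default execution context.
  def execute(body: => Unit): Unit =
    ExecutionContext.global.execute(new Runnable {
      override def run(): Unit = body
    })

  // Returns true if the submitted task ran within five seconds.
  def demo(): Boolean = {
    val done = new CountDownLatch(1)
    execute { done.countDown() }
    done.await(5, TimeUnit.SECONDS)
  }
}
```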

#### Atomic primitives

synchronized, wait, notify, notifyAll are complex and require support from the OS scheduler. We now look at the primitives in terms of which these higher-level operations are implemented.

An atomic variable is a memory location that supports linearizable operations (meaning that they can be executed atomically). Here’s how we can define getUID without synchronized:

AtomicLong offers the atomic operations incrementAndGet(), getAndSet(newValue: Long), and compareAndSet(expect: Long, update: Long):

compareAndSet, also known as CAS, is the building block with which other linearizable operations are implemented. It’s often built into the hardware, and takes on the order of a hundred cycles.

We can implement getUID using CAS directly:
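A sketch of both variants, the built-in `incrementAndGet` and a hand-written CAS retry loop. `AtomicUid`, `UidGenerator`, `getUIDWithCas` and `hammer` are names assumed for this example:

```scala
import java.util.concurrent.atomic.AtomicLong
import scala.annotation.tailrec

object AtomicUid {
  final class UidGenerator {
    private val uid = new AtomicLong(0L)

    // Variant 1: let AtomicLong do the atomic increment.
    def getUID(): Long = uid.incrementAndGet()

    // Variant 2: read, compute, then publish only if nobody interfered; otherwise retry.
    @tailrec final def getUIDWithCas(): Long = {
      val oldUid = uid.get
      val newUid = oldUid + 1
      if (uid.compareAndSet(oldUid, newUid)) newUid
      else getUIDWithCas()
    }
  }

  // Calls the CAS variant from several threads at once and waits for all of them.
  def hammer(gen: UidGenerator, threads: Int, perThread: Int): Unit = {
    val workers = (1 to threads).map { _ =>
      new Thread {
        override def run(): Unit = for (_ <- 1 to perThread) gen.getUIDWithCas()
      }
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
  }
}
```

A failed `compareAndSet` means another thread succeeded in the meantime, so the system as a whole always makes progress even though an individual thread may have to retry.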

### Programming without locks

Locks as implemented by synchronized are a convenient concurrency mechanism, but are also problematic (possibility of deadlock, possibility to arbitrarily delay other threads if a thread executes a long-running operation in a synchronized).

With atomic variables and their lock-free operation, we can avoid these problems. We can even simulate locks with atomic variables!
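To illustrate the last remark, here is a sketch of a simple spin lock built on an `AtomicBoolean`. The names `SpinLockDemo`, `SpinLock`, `withLock` and `count` are hypothetical:

```scala
import java.util.concurrent.atomic.AtomicBoolean

object SpinLockDemo {
  final class SpinLock {
    private val held = new AtomicBoolean(false)

    // Busy-waits until the flag could be flipped from false to true.
    def lock(): Unit = while (!held.compareAndSet(false, true)) ()

    def unlock(): Unit = held.set(false)

    def withLock[A](body: => A): A = {
      lock()
      try body finally unlock()
    }
  }

  // Increments a plain counter from several threads, protected by the spin lock.
  def count(threads: Int, perThread: Int): Int = {
    val lock = new SpinLock
    var counter = 0
    val workers = (1 to threads).map { _ =>
      new Thread {
        override def run(): Unit =
          for (_ <- 1 to perThread) lock.withLock { counter += 1 }
      }
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
    counter
  }
}
```

Note that code using this lock is blocking again: a thread that holds the lock for a long time delays everyone else, which is exactly the problem lock-free operations avoid.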

Here’s how we define lock-freedom:

An operation op is lock-free if, whenever there is a set of threads executing op, at least one thread completes the operation after a finite number of steps, regardless of the speed at which the different threads progress.

Essentially: at least one thread needs to complete the operation in a finite number of steps.

#### Lazy values

Here’s how scalac currently implements lazy values. It doesn’t use synchronized around the whole block since it’s quite costly, but still must set up some protections:

The problems with this implementation are:

• It’s not lock-free; E could take an arbitrarily long time.
• It uses this as a lock, which might conflict with application-defined locking.

The new Scala compiler, dotty, does this instead:

• The evaluation of E happens outside a monitor, therefore no arbitrary slowdowns
• Two short synchronized blocks instead of one arbitrarily long one
• No interference with user-defined locks
• Deadlocks are still possible but only in cases where sequential execution would give an infinite loop

#### Collections

Operations on mutable collections are usually not thread-safe. The safe way to deal with this is of course to use synchronized, but that often leads to too much blocking. To gain speed, we can use or implement special concurrent collection implementations.

As an example, here’s how concurrent queues could be implemented:

This is not lock-free, but it guarantees that we actually remove. We could also just give up if first == null and return None, which would be lock-free.

### Futures

| | One | Many |
| --- | --- | --- |
| Synchronous | Try[T] | Iterable[T] |
| Asynchronous | Future[T] | Observable[T] |

#### Synchronous: Try

The return type here is dishonest, since actions may fail. So if we want to expose the possibility of failure in the types, then we should do T => Try[S] instead of T => S. Our game might now look like this:

There are some higher-order functions available that manipulate Try[T]:
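As an illustration, here is a small sketch that chains two fallible steps with `flatMap` and `map` (through a for-comprehension) and handles one failure with `recover`. The functions `parseAmount`, `checkPositive`, `price` and `priceOrZero` are invented for this example and are not the game mentioned above:

```scala
import scala.util.{Failure, Success, Try}

object TryDemo {
  // Try(...) catches the NumberFormatException thrown by toInt.
  def parseAmount(s: String): Try[Int] = Try(s.trim.toInt)

  def checkPositive(n: Int): Try[Int] =
    if (n > 0) Success(n)
    else Failure(new IllegalArgumentException(s"not positive: $n"))

  // The for-comprehension desugars to flatMap and map; the first Failure short-circuits.
  def price(s: String): Try[Int] =
    for {
      n <- parseAmount(s)
      p <- checkPositive(n)
    } yield p * 2

  // recover turns selected failures back into successes; everything else stays a Failure.
  def priceOrZero(s: String): Int =
    price(s).recover { case _: NumberFormatException => 0 }.getOrElse(-1)
}
```

The failure is carried in the type, so callers of `price` cannot forget about it the way they can forget about an exception.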

#### Asynchronous: Future

Future[T] is a monad that handles exceptions and latency. Usually exceptions aren’t really nice in a multi-threaded context, but Future exists to abstract away from all of our worries. They can asynchronously notify consumers:

Sending a packet across the Atlantic is a situation in which we could use futures; it takes quite a while and we’d like to do other things in the meantime, and still manage failures:

This is a bit awkward, we can do much better:

We can then do this elegantly (?):

##### Recover and recoverWith

Here’s how we would send packets using futures robustly:

A sometimes cleaner way of doing it is to provide a fallback:
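A sketch of both styles using the standard library's `recoverWith` and `fallbackTo`. The `sendTo` function and the server names are hypothetical stand-ins for a real network call; here only the "usa" server is reachable:

```scala
import java.io.IOException
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object FallbackDemo {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical send operation: every server except "usa" fails.
  def sendTo(server: String, packet: String): Future[String] = Future {
    if (server == "usa") s"$packet via $server"
    else throw new IOException(s"$server unreachable")
  }

  // Retry on another server only once the first attempt has failed.
  def sendRecoverWith(packet: String): Future[String] =
    sendTo("europe", packet).recoverWith {
      case _: IOException => sendTo("usa", packet)
    }

  // Same outcome, but the standard fallbackTo takes its argument strictly,
  // so the second send starts right away instead of waiting for the failure.
  def sendFallbackTo(packet: String): Future[String] =
    sendTo("europe", packet).fallbackTo(sendTo("usa", packet))

  def await[A](f: Future[A]): A = Await.result(f, 5.seconds)
}
```

`Await.result` is only used here to inspect the outcome; production code would normally keep composing futures instead of blocking.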

#### Implementation of FlatMap on Future

The actual implementation is a bit more involved, as there’s some scheduling involved, but this is the gist of it.
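One way to express that gist with the public `Promise` and `onComplete` API is sketched below. It is a simplified stand-in, not the standard library's code; `FlatMapSketch` and `demo` are names assumed for this example:

```scala
import scala.concurrent.{Await, ExecutionContext, Future, Promise}
import scala.concurrent.duration._
import scala.util.{Failure, Success}
import scala.util.control.NonFatal

object FlatMapSketch {
  def flatMap[T, S](fut: Future[T])(f: T => Future[S])(implicit ec: ExecutionContext): Future[S] = {
    val p = Promise[S]()
    fut.onComplete {
      case Success(t) =>
        // Run the continuation and forward its outcome; f itself may throw.
        try f(t).onComplete(result => p.complete(result))
        catch { case NonFatal(e) => p.failure(e) }
      case Failure(e) =>
        // A failed first step skips the continuation entirely.
        p.failure(e)
    }
    p.future
  }

  // Returns the result of a successful chain and whether a failing chain failed.
  def demo(): (Int, Boolean) = {
    implicit val ec: ExecutionContext = ExecutionContext.global
    val ok = flatMap(Future(20))(n => Future(n + 22))
    val bad = flatMap(Future[Int](throw new RuntimeException("boom")))(n => Future(n + 1))
    (Await.result(ok, 5.seconds), Await.ready(bad, 5.seconds).value.exists(_.isFailure))
  }
}
```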

## Part 3: Actors

### Why Actors?

Actors were invented in 1973 for research on artificial intelligence. Actors were added to the Scala standard library in 2006. Akka, an Actor framework on the JVM with Java and Scala APIs, was created in 2009.

Today, CPUs aren’t really getting faster as much as they’re getting wider: we use multiple physical and virtual execution cores. To take advantage of these cores, we run multiple programs in parallel (multi-tasking) or we run parts of the same program in parallel (multi-threading).

We saw in the earlier example about bank accounts that if we don’t synchronize our actions, we may “create money” and write incorrect amounts to the balance. To avoid this, we saw how locks work with obj.synchronized { ... }. We also saw that we have to be careful to avoid deadlocks when using these. Since this blocking synchronization may introduce deadlocks, and since it is bad for CPU utilization, we will look into Actors, which are non-blocking objects.

### What is an Actor?

The Actor Model represents objects and their interactions, resembling human organizations. It is helpful to visualize Actors not as abstract objects on which we call methods, but as people talking to each other. More formally, an Actor:

• is an object with identity
• has a behavior
• only interacts using asynchronous message passing

#### The Actor Trait

For this, we use the Actor trait in Akka:

It defines one abstract method, receive, which returns a partial function from Any to Unit, describing the response of the Actor to a message. Any message could come in (hence Any), and the Actor may act upon it but cannot return anything, since the sender is long gone (hence Unit).

### A simple, stateful Actor

Let’s implement the Actor trait in an example class:

If the counter gets the message "incr", it simply increments the counter. But actors can also send messages to addresses they know (in Akka, these are ActorRefs). If the customer (the actor that sends the message) sends a ("get", customer: ActorRef) tuple, our counter will send the count back. Note that the exclamation mark ! is used to send messages in Akka (it means “fire and forget”, also known as tell).

#### How messages are sent

Let’s look at more parts of the Actor trait:

With this API in mind, we can make our previous example a little nicer:

#### The Actor’s Context

An Actor can do more things than just send messages: it can create other Actors, change its behavior, etc. The Actor type describes the behavior, while the execution is done by its ActorContext:

Each Actor has a stack of behaviors, and the topmost one is always the active one. The default mode of become is to replace the top of the stack with a new behavior, but it can also be used to push, and unbecome to pop behaviors. Let’s see it in action by reformulating our Counter Actor:

Functionally, it is equivalent to the previous version. It looks a bit like a tail-recursive function because it calls itself, but it is asynchronous, since context.become only evaluates when the next message is processed. There are advantages in this version though:

• State change is explicit; there’s only one place where the state is changed
• State is scoped to current behavior; there are no variables that can be left in an unknown state

#### Creating and Stopping Actors

Actors are always created by actors; that means that they always form a hierarchy. stop is often applied to self, meaning that the actor wants to terminate. We can now define an Actor application:

Note that in order to run this in an IDE, we’ll need to set the main class to akka.Main and give it as first argument the full class name of the Actor class which is to be instantiated.

### Message Processing

Access to the state of Actors is only possible by exchanging messages. Messages can be sent to known addresses (ActorRef):

• Every actor knows its own address (self)
• Creating an actor returns its address (not the Actor object)
• Addresses can be sent with messages (sender, for instance, which is captured automatically)

Actors are completely independent agents of computation:

• Local execution, no notion of global synchronization
• All actors run fully concurrently
• Message passing is the only way to interact
• Message passing is a one-way communication (an Actor doesn’t know if the sent message has been received or processed)

An Actor is effectively single-threaded:

• Behavior change is effective before processing the next message
• Processing one message is the atomic unit of execution

This has the same benefits as synchronized methods, but blocking is replaced by enqueuing messages for later execution.

#### Revisiting the Bank Account Example

It’s good practice to define an Actor’s messages (both what it can receive and send) in the Actor’s companion object:

#### Message Delivery Guarantees

All communication is inherently unreliable — we can’t know for sure that the message was actually received. As a fix, we can set the following resending policies:

• at-most-once: sending once delivers 0 or 1 times. This can be done without keeping any state
• at-least-once: resending until acknowledged delivers 1 or more times. The sender needs to buffer the message in case it needs to resend
• exactly-once: processing only first reception delivers 1 time. This requires a buffer and keeping track of which messages have been processed.

Luckily for us, messages support reliability:

• All messages can be persisted, meaning that we can take a copy and store it in some persistent storage.
• Messages can include unique IDs
• Delivery can be retried until successful

But this only works if we use acknowledgements from the receiver. We can make the WireTransfer reliable by:

• Logging activities of WireTransfer to persistent storage
• Giving each transfer a unique ID
• Adding IDs to Withdraw and Deposit
• Storing IDs of completed actions within BankAccount
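The last two points can be sketched without any Actor machinery. The following plain-Scala model (not Akka code) shows an account that remembers the IDs it has already processed, so that a resent message is acknowledged again but applied only once. `IdempotentAccount`, the message shapes and `currentBalance` are assumptions for this example:

```scala
object IdempotentAccount {
  final case class Deposit(id: Long, amount: BigInt)
  final case class Done(id: Long)   // acknowledgement sent back to the sender

  final class BankAccount {
    private var balance = BigInt(0)
    private var completed = Set.empty[Long]   // IDs of actions already applied

    // Applies the deposit at most once per ID, but always acknowledges it,
    // because the sender may be retrying after a lost acknowledgement.
    def receive(msg: Deposit): Done = {
      if (!completed.contains(msg.id)) {
        balance += msg.amount
        completed += msg.id
      }
      Done(msg.id)
    }

    def currentBalance: BigInt = balance
  }
}
```

Combining at-least-once delivery on the sender side with this kind of deduplication on the receiver side gives exactly-once processing.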

#### Message Ordering

If an actor sends multiple messages to the same destination, they will not arrive out of order (this is Akka-specific). Other than that, message ordering to different receivers is not prescribed by the Actor model.

### Designing Actor Systems

Imagine giving the task to a group of people, dividing it up. Consider the group to be of very large size, and the people to be easily replaceable (since Actors have quite low overhead, we can make plenty of them). Draw a diagram with how the task will be split up between the people and the communication lines between them. For example, if we want to build a web crawler, we might use:

• A web client which turns a URL into a HTTP body asynchronously
• A Getter actor for processing the body
• A Controller which spawns Getters for all links encountered
• A Receptionist managing one Controller per request

The video goes into a long example (seriously, 40 minutes), but here are the important lessons:

• A reactive application is non-blocking and event-driven top to bottom
• Actors are run by a dispatcher (potentially shared) which can also run Futures
• Prefer immutable data structures, since they can be shared
• Do not refer to actor state from code running asynchronously
• Prefer context.become for different states, with data local to the behavior

#### Actor-Based Logging

In Akka, you can log like this:

Logging includes IO which can block indefinitely, but Akka’s logging passes that task to dedicated actors. You can set the logging level with akka.loglevel=DEBUG (other levels are info, warning and error).

### Testing Actor Systems

Tests can only verify message passing. Say we have an Actor called Toggle that responds with happy and sad alternately:

We can also run a test within a TestProbe:

Sometimes, we need to test an Actor with real-life dependencies (accessing a DB, a production web service, etc). We don’t want to do testing with these, so the solution is to use dependency injection, or overridable factory methods.

When testing hierarchies, it’s good practice to verify the leaves first and work your way up.

### Failure Handling with Actors

Resilience demands:

• Containment of failure: failure is isolated, can’t spread to other components. This happens by design of the model, since actors are fully encapsulated objects
• Delegation of failure: failure cannot be handled by the failed component, since it is presumably compromised, so the failure must be handled by another actor.

This means that another Actor needs to decide whether the failed Actor is terminated or restarted. If it needs to be able to restart the failed Actor, then it is both the supervisor and the parent. In other words, the supervision and parental hierarchy are the same, which means failure is passed to the parent. In Akka, we call this mandatory parental supervision.

How does this supervision hierarchy translate to code?

Note that failure is sent and processed like a message, so we can do everything we usually can with it. However, to fix the problem Stop, Restart and Escalate are the only available tools.

#### Strategies

There are two strategies:

• OneForOneStrategy: always deal with each child Actor in isolation
• AllForOneStrategy: decision applies to all children (children need to live and die together (that’s dark)).

Each strategy can be configured to include a simple rate trigger:

#### Restarts

In Akka, the ActorRef stays valid after a restart (this is not the case in Erlang, for instance). This means other Actors can keep interacting with the failed Actor once the failure has been dealt with.

What does a restart really mean?

• Expected error conditions are typically handled explicitly within the Actor
• Unexpected errors indicate invalidated actor state. In this case, a restart restores the initial state.

#### Lifecycle Hooks

When an Actor starts, restarts and stops, we can define different hooks to be run, whose defaults are:

Any Actor can of course override these methods. If, when overriding preRestart, we do not stop child actors, the context will recursively restart them.

#### Lifecycle Monitoring

From the outside, the only observable transition occurs when an Actor stops. After a stop, there will be no more responses — but how do we distinguish an Actor that has been terminated from one that just doesn’t respond anymore? In Akka, there exists a feature called DeathWatch:

• An Actor registers its interest using context.watch(target)
• It will receive a Terminated(target) message when the target stops
• It will not receive any direct messages from target thereafter (indirect messages going through intermediate actors may still be en route, but not direct ones)

The DeathWatch API provides two methods:

Terminated is a special message. You can’t declare it yourself as it’s private to Akka. If you watch an existing Actor, Akka will send Terminated(...)(true) when it stops. If you try to watch a non-existing (or no-longer existing) Actor, Akka will reply immediately with Terminated(...)(false).

Terminated extends AutoReceiveMessage, which means that it is handled by the Actor context, and that Terminated messages cannot be forwarded.

Terminated messages are PossiblyHarmful.

#### The Children List

Each actor maintains a list of the actors it created:

• When context.actorOf returns, the child has been entered into the list
• When Terminated is received, the child has been removed from the list
• Even if we don’t use DeathWatch, the child is removed (but no Terminated is sent)

#### The Error Kernel

It’s good practice to keep important data near the root, and delegate risk to the leaves. This way, when an Actor restarts, the effect will be rather localized (it won’t have to restart children). This is called the Error Kernel pattern.

#### EventStream

So far, we’ve stated that Actors can direct messages only at known addresses. But the reality is that another option exists: it’s the equivalent of shouting something in a room full of people. The EventStream allows publication of messages to an unknown audience. Every Actor can optionally subscribe to (parts of) the EventStream.

An example in code:

#### Unhandled Messages

Actor.Receive is a partial function, so the behavior may not apply. Unhandled messages are passed into the unhandled method, whose default behavior is:

The supervisor’s default response to a DeathPactException is to send a Stop command. All other messages are published, so we could for example register a listener to log unhandled messages.

#### Persistent Actor State

The Error Kernel pattern allows us to keep important Actors relatively safe, but there are cases where we cannot lose data at all. Losing state due to a restart is not the only thing that can happen: we could also lose data due to buggy software or hardware, or due to a system failure. If we need to keep state across all these failure scenarios, then we need to make sure it’s stored on a hard drive, for instance.

There are two possibilities for persisting state:

• Have the Actor mirror a persistent storage location, and do in-place updates (replacements) of both.
    • Recovery of latest state in constant time
    • Data volume depends on number of records, not their change rate
• Save the chain of changes that are applied to state, in an append-only fashion (this is like an event log)
    • History can be replayed, audited or restored
    • Some processing errors (bugs) can be corrected retroactively
    • Additional insight can be gained on the process (a user’s behavior on a shopping site, for example)
    • Writing an append-only stream optimizes IO bandwidth
    • Changes are immutable and can freely be replicated

If you only want to persist the state of an actor, and put an upper bound on the time recovery may take, you can use snapshots. Snapshots combine both of the above methods by tying a current state to a change. That way you don’t have to replay all changes to find the final state, you can just recover from the latest saved state and replay the few changes that happened after that snapshot.
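The relationship between a full replay and a snapshot-based recovery can be sketched in plain Scala (this models the idea only and is not Akka's persistence API). The event types, `State`, `Snapshot`, `replayAll` and `recover` are all names assumed for this example:

```scala
object EventLogDemo {
  sealed trait Event
  final case class Deposited(amount: Int) extends Event
  final case class Withdrawn(amount: Int) extends Event

  final case class State(balance: Int) {
    def updated(e: Event): State = e match {
      case Deposited(a) => State(balance + a)
      case Withdrawn(a) => State(balance - a)
    }
  }

  // A snapshot ties a state to the number of events that produced it.
  final case class Snapshot(state: State, eventsApplied: Int)

  // Full recovery: replay every event from the initial state.
  def replayAll(log: Vector[Event]): State =
    log.foldLeft(State(0))((s, e) => s.updated(e))

  // Bounded recovery: start from the snapshot and replay only the later events.
  def recover(snapshot: Snapshot, log: Vector[Event]): State =
    log.drop(snapshot.eventsApplied).foldLeft(snapshot.state)((s, e) => s.updated(e))
}
```

Both functions reach the same state; the snapshot only shortens the part of the log that has to be replayed.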

How do Actors actually persist changes? There are two ways of doing it:

• Command-Sourcing: We make sure that the Actor doesn’t lose any command by persisting the command before processing it (same goes for acknowledgements). During recovery, all commands are replayed to recover the state. During this replay, the Actor may send messages in response, but those are rerouted through a persistent Channel which discards messages that have already been sent to other actors.
• Event-Sourcing: The focus isn’t on replaying the commands, but instead on saving and applying the change requests (“events”). We don’t save what we want to happen (commands) to the log, we save what has happened (events). During a replay, the Actor doesn’t need to see the commands again, just the events.

Here’s an example of this — an actor which processes blog posts, with a daily quota for each user:

There’s a problem with this, though: we might get a new post while we’re sending the previous one to the log (writing to disk does take some time), in which case it will pass the condition in the if statement, and will be added anyway. This is a general problem with applying and then persisting: it leaves the actor in a stale state.

If we persist and then apply, then we don’t have that problem, but we might have trouble regenerating after a system-wide crash, as the last state may not have been fully persisted yet.

So it seems like we need to choose between correct persistence and correct behavior, but there is a third option: do not process new messages while waiting for persistence. However, this option comes at a performance cost, as it increases latency and lowers throughput.

The ability to postpone messages which cannot yet be handled is provided by the Stash trait in Akka:

What about when we need to interact with some outside service that isn’t based on actors? Say we charge money from a credit card for each post, do we first charge the card (perform) and then persist? Or the other way?

• If we perform before persisting, we get at least one bank payment (the user may be charged again if the system crashes before the persisting happened).
• If we persist before performing, we get at most one bank payment

There’s no “better option” — what to choose depends on the application and business model.

In summary:

• Actors can persist incoming messages or generated events.
• Events are immutable, so they can be replicated, and used to inform other components since they can be read many times
• Recovery replays past commands or events; snapshots reduce this cost
• Actors can defer handling certain messages by using the Stash trait

### Actors are Distributed

Actors are inherently distributed. Normally, we run them on different CPUs on the same system, but nothing stops us from running them on different network-connected hosts. But it does take some effort to make Actors agree on a common truth. We call this eventual consistency.

#### The Impact of Network Communication

Compared to running the code locally, running it on a network means:

• Data can only be shared by value (not by name), since a copy has to be made.
• Bandwidth is lower
• Latency is higher
• Partial failures may happen: packets may get lost in transfer
• Data corruption can also happen (think one corruption per TB sent)

Distributed computing breaks many assumptions made by the synchronous programming model.

Some things still hold, though. Actors model a network locally (instead of modelling local mechanisms on the network), so message loss is a part of what we deal with, and they are so isolated that they act like different hosts on a network would.

All of this means that the effort of writing a distributed program using Actors is basically the same as writing a local variant (!!). The code itself won’t look much different.

#### Actor Path

Actors are Location Transparent, which means that their location is simply hidden behind an ActorRef. We know that Actors form a hierarchy. Under the hood, they have a path corresponding to the URI format:

Every Actor is like a folder in this URI. akka://HelloWorld/ is called the authority, and user/greeter is the path. Let’s look at a remote address example: akka.tcp://HelloWorld@10.2.4.6:6564/user/greeter is a /user/greeter Actor. It’s in an Akka system using the TCP protocol, also named HelloWorld, accessible at 10.2.4.6 on port 6564. An Actor has at least one such URI (but can have multiple, e.g. if it’s reachable on multiple IP addresses or ports).
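Since these addresses follow the URI format, the individual parts can be pulled apart with the JDK's `java.net.URI`. This is only a sketch to make the anatomy visible and has nothing to do with how Akka itself parses paths; `ActorPathParts`, `Parts` and `parse` are names assumed for this example:

```scala
import java.net.URI

object ActorPathParts {
  final case class Parts(protocol: String, system: String, host: String, port: Int, path: String)

  // Splits an address of the form protocol://system@host:port/path into its parts.
  def parse(address: String): Parts = {
    val uri = new URI(address)
    Parts(
      protocol = uri.getScheme,   // e.g. akka.tcp
      system = uri.getUserInfo,   // the actor system name sits in the user-info slot
      host = uri.getHost,
      port = uri.getPort,
      path = uri.getPath          // the position in the Actor hierarchy
    )
  }
}
```

For the remote address used above, `parse` yields the protocol `akka.tcp`, the system `HelloWorld`, the host `10.2.4.6`, the port from the address and the path `/user/greeter`.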

It is worth noting that an ActorPath is not an ActorRef:

• ActorPath is the full name, whether the Actor exists or not
    • Can only optimistically send a message
    • Cannot be watched (we don’t know if it exists)
• ActorRef points to one Actor which was started (an incarnation).
    • Can be watched
    • Example: akka://HelloWorld/user/greeter#43428347. It looks like a path, but has an extra UID

To resolve an ActorPath in a remote system, we need to talk to Actors that we haven’t created ourselves, and for which we have no means to acquire an ActorRef. We only know the ActorPath (host, port, etc.). To resolve an Actor, we can just send a Resolve(path) message to a Resolver Actor, which is defined as:

Every Actor automatically supports akka.actor.Identify, and replies with an ActorIdentity.

Relative actor paths also exist:

#### Clusters

A cluster is a set of nodes about which all members are in agreement. These nodes can then collaborate on a common task. A single node can declare itself a cluster (join itself). Another node can then join that cluster:

• A request is sent to any member
• Once all current members know about the new node, it is declared part of the cluster

Information is spread using a gossip/epidemic protocol, each Actor spreading the message to its neighbors.

To set up a cluster, there are a few prerequisites. First, we have an SBT dependency:

Then, we need some configuration enabling the cluster module, in application.conf:

Alternatively, we can use -Dakka.actor.provider=.... Now onto some code; this is the only full example I’ve included, as I think it covers the whole subject pretty well. We’ll take a look at how to implement work routing to different Controller workers, again to grab some URLs. The following will start a single-node cluster on port 2552:

If we code a second node, we need a separate port, using akka.remote.netty.tcp.port = 0.

How do we route work to cluster members? We create a Receptionist:

There’s a Customer Actor that makes sure the given url is retrieved, but the work is supposed to be performed at a remote node. For that, it creates a Controller:

The Controller is the worker; its parent is the Customer.

#### Eventual Consistency

We’ve now seen how Clusters work. What’s evident is that everything takes time: node joining takes time to disseminate the information among the rest of the cluster, it takes time until the welcome message arrives, etc. The decisions aren’t taken immediately, they’re taken eventually.

When we made a bank account thread safe, we made it strongly consistent: after an update, all subsequent reads will return the updated value.

We can also define weak consistency: after an update, certain conditions need to be met before the update is visible — in other words, it takes a little while before our update can be read. This is the inconsistency window.

Eventual consistency is a special kind of weak consistency. It states that once no more updates are made to an object, after a while (once everyone has communicated the value) reads will all return the last written value.

Let’s try it out with a simple Actor.

This should clarify the concept of eventual consistency a bit. Updates aren’t visible everywhere immediately, but eventually the “truth” propagates throughout the peer network.

Actors and eventual consistency are closely tied:

• An actor forms an island of consistency, surrounded by an ocean of non-determinism
• Collaborating actors can at most be eventually consistent — they can’t be strongly consistent, since messages take time to be sent. Note that eventual consistency isn’t a given, it doesn’t come automatically: some work still needs to be done to ensure this.
• Eventual consistency requires us to disseminate all updates to interested parties eventually (there needs to be a resend mechanism).
• The shared data structures need to be suitable for that. We can for instance use CRDTs (Conflict-free Replicated Data Types, also known as Convergent or Commutative Replicated Data Types, a class of data types).
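To make the idea of a CRDT a bit more concrete, here is a minimal sketch of a grow-only counter, one of the simplest CRDTs. It is plain Scala, not Akka code; the names `GCounter`, `nodeA` and `nodeB` are made up for illustration.

```scala
// A grow-only counter: each node only ever increments its own slot.
final case class GCounter(counts: Map[String, Long] = Map.empty) {
  // Local update: bump this node's own entry.
  def increment(node: String): GCounter =
    GCounter(counts.updated(node, counts.getOrElse(node, 0L) + 1L))

  // Merging keeps the per-node maximum, which makes the operation
  // commutative, associative and idempotent: replicas converge
  // regardless of the order or duplication of update messages.
  def merge(other: GCounter): GCounter =
    GCounter((counts.keySet ++ other.counts.keySet).map { n =>
      n -> math.max(counts.getOrElse(n, 0L), other.counts.getOrElse(n, 0L))
    }.toMap)

  def value: Long = counts.values.sum
}

object GCounterDemo {
  // Two replicas updated independently (hypothetical node names).
  val a: GCounter = GCounter().increment("nodeA").increment("nodeA")
  val b: GCounter = GCounter().increment("nodeB")

  def main(args: Array[String]): Unit = {
    println(a.merge(b).value)         // total number of increments seen
    println(a.merge(b) == b.merge(a)) // merge order does not matter
  }
}
```

Because merging is idempotent, a resend mechanism can deliver the same state several times without corrupting the count.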

## Part 4: Big Data Analysis with Scala and Spark

This part focuses on how to map some of the abstractions that we’ve learned so far to computations on multiple machines over massive datasets. Why Scala, why Spark? The alternatives for data processing, like R, Python and MATLAB don’t scale when our datasets get too large to fit into memory. Once we get more than a few gigabytes, we have to re-implement everything in some other language or system, like Spark or Hadoop.

• The functional paradigm scales very well
• Spark is more expressive: APIs are modeled after Scala collections, they look like functional lists.
• Spark is performant, in terms of running time (sometimes 120x faster), but also in terms of developer productivity. It’s also interactive, which Hadoop isn’t.
• Good for data science, since it enables iteration (most data science problems involve iteration). Once data is in memory, Spark applies all functional iterations, instead of spending time needlessly doing IO.

Throughout this part, we’ll use the Apache Spark framework for distributed data-parallel programming. Spark implements a data model called Resilient Distributed Datasets (RDDs), the distributed counterpart of a parallel collection.

### Data-Parallel to Distributed Data-Parallel

In this section, we’ll bridge the gap between parallelism with shared memory and distributed data parallelism. Let’s look at an example for shared memory:

We’ve seen before that the collection abstraction hides the details: if jar is a parallel array, then the data is split up between workers/threads, which combine their results when they’re done (if necessary), all under the hood. In the distributed case, we need to split the data over several nodes (instead of several workers). The nodes independently operate on data shards in parallel and combine their results when done, and we now have to worry about network latency between them. What would that look like?

It’s the same! Just like before, we can keep the collections abstraction over distributed data-parallel execution. So a lot of what we’ve learned can now be applied (including the problems, e.g. with non-associative reduction operations), but we also have latency as an extra problem.

#### Latency

Latency cannot be masked completely; we always have to think about it. It will be an important aspect that also impacts the programming model. Latency of a packet from the US to Europe and back to the US is approximately 150ms, compared to 100ns for reading from memory, so it’s over a million times slower. The network is so slow that it changes how we must work.

Hadoop has fault-tolerance (this is important because once you have 1000 nodes, crashes and network faults happen), but it comes at a cost: between each map and reduce step, in order to recover from potential failures, Hadoop/MapReduce shuffles its data and writes intermediate data to disk. This is also very slow. Spark manages to keep fault-tolerance, but reduces latency by keeping all data immutable and in-memory. Then fault tolerance is achieved by replaying functional transformations over the original dataset.

### RDDs, Spark’s Distributed Collections

RDDs look a lot like immutable sequential or parallel Scala collections, and make a lot of use of higher-order functions.

Every definition closely resembles the regular Scala Collection definition, except for aggregate, which takes z by value instead of by name (CBN) as in regular Scala. This is because copies need to be made: we can’t send a reference to z, since z may be sent over the network to another node.

Let’s try an example. Given val encyclopedia: RDD[String], say we want to search all of encyclopedia for mentions of EPFL, and count the number of pages mentioning it:

Let’s try word counting (this is like “Hello World” for large-scale data):

reduceByKey is a special Spark method that we’ll talk more about later.

#### Creating RDDs

RDDs can be created in two ways:

• Transforming an existing RDD: using higher order functions, for instance
• From a SparkContext or SparkSession object: SparkContext (wrapped by SparkSession since Spark 2.0) is our handle to the Spark cluster. It represents the connection between the Spark cluster and our running application, and offers a handful of methods to create and populate a new RDD:
  • parallelize: convert a local Scala collection to an RDD
  • textFile: read a text file from HDFS or the local file system and return an RDD[String]

#### Transformations and Actions

A reminder of transformers and accessors:

• Transformers: return new collections as results (map, filter, flatMap, groupBy, …)
• Accessors: return single value as result (reduce, fold, aggregate)

In Spark, we have transformations instead of transformers, and actions instead of accessors. Why do we use a different name, when there’s seemingly no difference? Because there’s a new, enormous difference: transformations are lazy, actions are eager. For transformations, the resulting RDD is not immediately computed, while the result of an action is. Laziness/eagerness is how we can limit network communication using the programming model, so this is important.

Let’s look at an example:

At this point, nothing has been computed! This is because transformations are deferred; all we get back is a reference to an RDD that doesn’t exist yet. To kick off the computation and wait for its result, we can add an action:

This is important to remember: nothing happens until we call an action.
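The same behavior can be observed locally with a lazy Scala Iterator. This is only an analogy in plain Scala, not Spark code: the `map` plays the role of a transformation, `sum` plays the role of an action, and the `evaluated` counter is there purely for illustration.

```scala
object LazyDemo {
  var evaluated = 0 // counts how many elements were actually processed

  // "Transformation": building the pipeline does no work yet.
  def pipeline(data: Seq[String]): Iterator[Int] =
    data.iterator.map { s => evaluated += 1; s.length }

  def main(args: Array[String]): Unit = {
    val lengths = pipeline(Seq("spark", "is", "lazy"))
    println(s"before the action: $evaluated elements evaluated")
    val total = lengths.sum // "action": forces the whole computation
    println(s"after the action: $evaluated elements evaluated, total = $total")
  }
}
```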

A few common transformations:

• map[B](f: A => B): RDD[B]: apply f to each element in the RDD and return RDD of the result
• flatMap[B](f: A => TraversableOnce[B]): RDD[B]: apply f to each element in the RDD and return an RDD of the contents of what the iterators returned.
• filter(pred: A => Boolean): RDD[A]: Apply pred to each element and return RDD of elements passing the predicate condition
• distinct(): RDD[A]: return an RDD with duplicates removed

Transformations on two RDDs:

• union(other: RDD[T]): RDD[T]: return an RDD containing elements from both RDDs
• intersection(other: RDD[T]): RDD[T]: return an RDD containing elements only found in both RDDs
• subtract(other: RDD[T]): RDD[T]: return an RDD containing the elements of this RDD that are not in the other RDD
• cartesian[U](other: RDD[U]): RDD[(T, U)]: Cartesian product with other RDD

Here are a few common actions:

• collect(): Array[T]: return all elements from RDD
• count(): Long: return the number of elements in the RDD
• take(num: Int): Array[T]: return the first num elements of the RDD
• reduce(op: (A, A) => A): A: combine the elements in the RDD together using op and return result
• foreach(f: T => Unit): Unit: apply f to each element in the RDD
• takeSample(withRepl: Boolean, num: Int): Array[T]: return array with a random sample of num elements of the dataset, with or without replacement
• takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T]: return the first num elements of the RDD using either their natural order or a custom comparator
• saveAsTextFile(path: String): Unit: write the elements of the dataset as a text file in the local filesystem or HDFS
• saveAsSequenceFile(path: String): Unit: write the elements of the dataset as a Hadoop SequenceFile in the local filesystem or HDFS

Let’s do another example. Assume that we have an RDD[String] which contains gigabytes of logs collected over the previous years. Dates come in the form YYYY-MM-DD:HH:MM:SS, errors are logged with a prefix including the word “error”. To determine the number of errors that were logged in December 2016:
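As a rough local sketch of that query, here is the same filter-then-count logic on an ordinary Scala Seq. The sample lines are invented, and the sketch assumes that the date sits at the start of each line; on an RDD, the same predicate would go into the filter transformation, followed by the count action.

```scala
object ErrorCount {
  // Hypothetical sample lines; the real input would be an RDD[String].
  val logs: Seq[String] = Seq(
    "2016-12-01:08:15:00 error: disk full",
    "2016-12-24:23:59:59 info: backup done",
    "2016-11-30:12:00:00 error: timeout",
    "2016-12-31:00:00:01 error: out of memory"
  )

  // Keep the lines dated December 2016 that mention "error", and count them.
  def decemberErrors(lines: Seq[String]): Int =
    lines.count(l => l.startsWith("2016-12") && l.contains("error"))

  def main(args: Array[String]): Unit =
    println(decemberErrors(logs))
}
```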

#### Benefits of laziness for Large-Scale Data

Consider the following examples:

Spark can leverage the fact that filter is deferred until take by analyzing and optimizing the chain of operations before executing it to reduce the number of iterations. Spark will not compute intermediate RDDs. Instead, in this case, as soon as 10 elements of the filtered RDD have been computed, it is done.
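A lazy Scala Iterator shows the same effect on a single machine. This is an analogy rather than Spark's actual execution strategy; the generated log lines and the `tested` counter are made up for the demonstration.

```scala
object TakeDemo {
  var tested = 0 // how many lines the predicate has looked at

  // A large, lazily generated "log": every second line is an error.
  def lines: Iterator[String] =
    Iterator.range(1, 1000001).map(i => if (i % 2 == 0) s"error $i" else s"info $i")

  // filter is deferred, so take(n) can stop pulling lines after n matches.
  def firstErrors(n: Int): List[String] =
    lines.filter { l => tested += 1; l.contains("error") }.take(n).toList

  def main(args: Array[String]): Unit = {
    val errors = firstErrors(10)
    println(s"${errors.size} errors found after testing $tested of 1000000 lines")
  }
}
```

Only a handful of lines are ever tested; the rest of the million-line log is never generated.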

#### Caching and Persistence

Let’s look at an example. A logistic regression follows the following formula:

Let’s try to implement it in a straightforward way:

We use a reduce, so we call numIterations actions. Notice how the reduce is done on points: this is problematic, because Spark recomputes an RDD every time we call an action on it. In our example above, we’re calling parsePoint waaaay too many times! To tell Spark to cache an RDD in memory (say if we need to use it multiple times), we can call .persist() or .cache() on it.
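The cost of recomputation can be illustrated with a small plain-Scala model (not Spark code). The lazy iterator stands in for an uncached RDD and the materialized Seq for a persisted one; `raw`, the `parsed` counter and this simplified `parsePoint` are assumptions made for the sketch.

```scala
object CacheDemo {
  var parsed = 0 // number of parsePoint calls so far
  def parsePoint(s: String): Double = { parsed += 1; s.toDouble }

  val raw: Seq[String] = Seq("1.0", "2.0", "3.0")
  val numIterations = 5

  // Without caching: the lazy map is re-run by every "action" (sum).
  def uncachedCalls(): Int = {
    parsed = 0
    def points = raw.iterator.map(parsePoint) // rebuilt on each use
    (1 to numIterations).foreach(_ => points.sum)
    parsed
  }

  // With caching: parse once, keep the result, reuse it in every iteration.
  def cachedCalls(): Int = {
    parsed = 0
    val points = raw.map(parsePoint) // materialized once
    (1 to numIterations).foreach(_ => points.sum)
    parsed
  }

  def main(args: Array[String]): Unit =
    println(s"uncached: ${uncachedCalls()} parses, cached: ${cachedCalls()} parses")
}
```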

There are many ways to configure how your data is persisted:

• In memory as regular Java objects
• On disk as regular Java objects
• In memory as serialized Java objects (more compact)
• On disk as serialized Java objects
• Both in memory and on disk (spill over to disk to avoid re-computation)

.cache() is a shorthand for the default, which is in memory only, as regular Java objects. .persist() can be customized.

### Reductions

Remember that fold and aggregate are parallelizable, but foldLeft isn’t. Therefore, Spark implements fold, reduce, aggregate on RDDs, but not foldLeft or foldRight. Why can’t it just implement it sequentially? Well, it takes a lot of synchronization to do things serially across a cluster, which is difficult, and it doesn’t make a lot of sense on a cluster.

So since those aren’t an option, we’ll have to use aggregate if we need to change the return type of our reduction operation.
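Here is a minimal local model of what aggregate does, assuming the data is already split into partitions (represented as a Seq of Seqs). It is a sketch of the semantics in plain Scala, not Spark's implementation: seqOp folds elements into an accumulator of a different type within each partition, and combOp merges the per-partition accumulators.

```scala
object AggregateDemo {
  // A: element type, B: result type (which may differ from A).
  def aggregate[A, B](partitions: Seq[Seq[A]])(z: B)(seqOp: (B, A) => B, combOp: (B, B) => B): B =
    partitions
      .map(_.foldLeft(z)(seqOp)) // can run independently on each partition
      .foldLeft(z)(combOp)       // merge the partial results

  // Hypothetical data: words spread over two partitions.
  val words: Seq[Seq[String]] = Seq(Seq("a", "bb"), Seq("ccc"))

  // Reduce Strings to an Int: the total number of characters.
  val totalLength: Int = aggregate(words)(0)((n, s) => n + s.length, _ + _)

  def main(args: Array[String]): Unit = println(totalLength)
}
```

As with fold, the result is only well-defined if z is a neutral element and combOp is associative.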

### Distributed Key-Value Pairs (Pair RDDs)

In single-node Scala, we had maps. In Spark, we think of those as key-value pairs, or Pair RDDs. In practice, this is a data structure that is used very often for big data analysis; most computations map data to a key-value pair, and reduce it to a final result.

Pair RDDs have additional, specialized methods for working with data associated with keys.

#### Creating a Pair RDD

Pair RDDs are most often created from already-existing non-pair RDDs, for example by using the map operation:

#### groupByKey

In regular Scala, we had groupBy:

In English: it breaks up a collection into two or more collections according to a function that we pass to it.

• Key: The result of the function
• Value: The collection of elements that return that key when the function is applied to it

Spark’s groupByKey can be thought of as a groupBy on Pair RDDs that is specialized in grouping all values that have the same key. As a result, it takes no argument, no discriminator function.

In English:

• Key: Key
• Value: The collection of values with the given key

#### reduceByKey

reduceByKey can be thought of as a combination of groupByKey and reduce on all the values per key. It’s more efficient, though, than using the two separately.

The function only cares about the values: we assume that the elements are already grouped by key, and now we apply this reduction function.
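The relationship between the two can be sketched on local collections. These are hypothetical stand-ins written in plain Scala, not the Spark methods themselves: grouping and then reducing gives the same result as folding values into a running result per key, but the second version never builds the intermediate groups.

```scala
object ReduceByKeyDemo {
  // Local stand-in for groupByKey: key -> all values with that key.
  def groupByKey[K, V](pairs: Seq[(K, V)]): Map[K, Seq[V]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // Local stand-in for reduceByKey: combine values per key as we go.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(f(_, v)).getOrElse(v))
    }

  // Word count on a tiny sample.
  val words: Seq[(String, Int)] = Seq("a", "b", "a", "c", "a", "b").map(w => (w, 1))
  val viaGroups: Map[String, Int] = groupByKey(words).map { case (k, vs) => k -> vs.reduce(_ + _) }
  val direct: Map[String, Int] = reduceByKey(words)(_ + _)

  def main(args: Array[String]): Unit = {
    println(viaGroups)
    println(direct)
  }
}
```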

#### mapValues

Can be thought of as a short-hand for:

That is, it simply applies a function to the values in a Pair RDD.

#### countByKey

Simply counts the number of elements per key in a Pair RDD, returning a normal Scala Map (this is an action) mapping keys to counts.

#### keys

This returns an RDD with the keys of each tuple (this is a transformation).

### Joins

Joins are unique to Pair RDDs. They’re used to combine multiple datasets. There are 2 kinds of joins:

• Inner joins (join)
• Outer joins (leftOuterJoin, rightOuterJoin)

The key difference between the two is what happens to the keys when both RDDs don’t contain the same key.

The choice between leftOuterJoin and rightOuterJoin should be made by deciding which data is optional.
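To see the difference on a small scale, here is a local model of an inner join and a left outer join over sequences of key-value pairs. It is plain Scala with invented sample data (customer ids, subscriptions and cities), and it only mirrors the shape of the results, not how Spark computes them.

```scala
object JoinDemo {
  // Inner join: only keys present on both sides survive.
  def innerJoin[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Seq[(K, (V, W))] =
    for ((k, v) <- l; (k2, w) <- r if k == k2) yield (k, (v, w))

  // Left outer join: every left key survives; a missing right side is None.
  def leftOuterJoin[K, V, W](l: Seq[(K, V)], r: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
    l.flatMap { case (k, v) =>
      val matches = r.filter(_._1 == k).map(_._2)
      if (matches.isEmpty) Seq((k, (v, Option.empty[W])))
      else matches.map(w => (k, (v, Some(w): Option[W])))
    }

  // Hypothetical data: customer id -> subscription, customer id -> city.
  val subscriptions: Seq[(Int, String)] = Seq(101 -> "AG", 102 -> "DemiTarif", 103 -> "DemiTarifVisa")
  val locations: Seq[(Int, String)] = Seq(101 -> "Bern", 101 -> "Thun", 102 -> "Lausanne")

  def main(args: Array[String]): Unit = {
    innerJoin(subscriptions, locations).foreach(println)
    leftOuterJoin(subscriptions, locations).foreach(println)
  }
}
```

Customer 103 has no location: it disappears from the inner join, but is kept by the left outer join with None as its location, since in that join the right-hand data is the optional part.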

### Shuffles

We typically have to move data from one node to another for it to be grouped by key; this is called “shuffling”. Shuffles are an enormous hit to performance because Spark must send data from one node to another over the network (latency costs!).

To minimize these latency costs, we can reduce the amount of data being sent over the network. This is why reduceByKey is more efficient than groupByKey: it does a groupByKey and reduce locally, then sends the reduced information around for a second round of reduceByKey. This means a non-trivial reduction of data shuffling (in a simple example, we get a 3x speedup).
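The saving can be estimated with a toy model that counts how many key-value pairs would have to leave each node. This is a simplified plain-Scala sketch with invented data: the three inner sequences play the role of three nodes, and the counts ignore everything else that a real shuffle involves.

```scala
object ShuffleDemo {
  type Node = Seq[(String, Int)]

  // Hypothetical word-count pairs sitting on three nodes.
  val nodes: Seq[Node] = Seq(
    Seq("a" -> 1, "a" -> 1, "b" -> 1, "a" -> 1),
    Seq("b" -> 1, "b" -> 1, "a" -> 1, "b" -> 1),
    Seq("a" -> 1, "a" -> 1, "a" -> 1, "b" -> 1)
  )

  // groupByKey: every single pair is sent before any reduction happens.
  val sentByGroupByKey: Int = nodes.map(_.size).sum

  // reduceByKey: reduce on each node first, then send one pair per key and node.
  def localReduce(node: Node): Map[String, Int] =
    node.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

  val preReduced: Seq[Map[String, Int]] = nodes.map(localReduce)
  val sentByReduceByKey: Int = preReduced.map(_.size).sum

  // Second round: merge the partial sums into the final result.
  val totals: Map[String, Int] =
    preReduced.flatMap(_.toSeq).groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).sum }

  def main(args: Array[String]): Unit =
    println(s"groupByKey sends $sentByGroupByKey pairs, reduceByKey sends $sentByReduceByKey; totals = $totals")
}
```

The more repeated keys each node holds, the bigger the gap between the two counts.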

#### Partitioning

How does Spark know which key to put on which machine? By default, it uses hash partitioning to determine which key-value pair should be sent to which machine.

The data within an RDD is split into several partitions.

• Partitions never span multiple machines; tuples in the same partition are guaranteed to be on the same machine
• Each machine in the cluster contains one or more partitions
• The number of partitions to use is configurable. By default, it’s the total number of cores on all executor nodes

There are two kinds of partitioning in Spark:

• Hash partitioning: attempts to spread the data evenly across partitions based on the keys (partition = k.hashCode() % numPartitions)
• Range partitioning: when an ordering is defined on the keys, range partitioning may be more efficient. Keys are partitioned according to their ordering and a set of sorted ranges of keys.
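Both schemes boil down to a function from a key to a partition index. The following is a minimal sketch in plain Scala, not Spark's Partitioner classes; the sample keys and the range boundaries in `upperBounds` are assumptions chosen for the example.

```scala
object PartitionDemo {
  // Hash partitioning: hashCode modulo the number of partitions. On the JVM
  // the remainder can be negative, so it is shifted into [0, numPartitions).
  def hashPartition(key: Any, numPartitions: Int): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }

  // Range partitioning: upperBounds holds the sorted, inclusive upper bound
  // of every partition except the last one.
  def rangePartition(key: Int, upperBounds: Seq[Int]): Int = {
    val i = upperBounds.indexWhere(key <= _)
    if (i >= 0) i else upperBounds.length
  }

  val keys: Seq[Int] = Seq(8, 96, 240, 400, 401, 800)
  val byHash: Map[Int, Seq[Int]] = keys.groupBy(hashPartition(_, 4))
  val byRange: Map[Int, Seq[Int]] = keys.groupBy(rangePartition(_, Seq(200, 400, 600)))

  def main(args: Array[String]): Unit = {
    println(s"hash:  $byHash")
    println(s"range: $byRange")
  }
}
```

With these particular keys, hashing puts almost everything into partition 0, while the ranges spread the keys over all four partitions. That kind of skew is the case where range partitioning pays off.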

Customizing a partitioning is only possible on Pair RDDs. There are two ways to set partitioning for our data:

• Call partitionBy on an RDD, providing an explicit Partitioner
• Using transformations that return RDDs with specific Partitioners
  • From parent RDD: the result of a transformation on a partitioned Pair RDD typically is configured to use the hash partitioner that was used to construct it
  • Automatically-set partitioners: some operations result in an RDD with a known partitioner, for when it makes sense (sortByKey uses a RangePartitioner, groupByKey uses a HashPartitioner)

Let’s look at partitionBy:

The result of partitionBy should always be persisted. Otherwise, the partitioning is repeatedly applied (involving shuffles) each time.

For partitioning resulting from transformations, the following operations on Pair RDDs hold on to (and propagate) a partitioner:

• cogroup, groupWith, groupByKey
• join, leftOuterJoin, rightOuterJoin
• reduceByKey, foldByKey, combineByKey
• partitionBy
• sort
• If the parent has a partitioner: mapValues, flatMapValues, filter

All other operations will produce a result without a partitioner. Interestingly, map and flatMap aren’t on the list, since they can change the key; using them makes us lose our partitioning.

#### Optimizing with Partitioners

Using range partitioners we can optimize use of reduceByKey so that it doesn’t involve any shuffling over the network:

This is almost 9x faster than our initial examples.

As a rule of thumb, a shuffle can occur when the resulting RDD depends on other elements from the same RDD or another RDD. Here’s a list of operations that might cause a shuffle:

• cogroup, groupWith, groupByKey
• join, leftOuterJoin, rightOuterJoin
• reduceByKey, combineByKey
• distinct, intersection
• repartition, coalesce

There are ways to use these operations and still avoid much or all network shuffling. For instance, by running reduceByKey on a pre-partitioned RDD, or by running join on two RDDs that are pre-partitioned with the same partitioner and cached on the same machine, we avoid almost all network shuffling.

#### Wide vs Narrow Dependencies

Computations on RDDs are represented as a lineage graph, a directed acyclic graph (DAG) representing the computations done on the RDD. Spark analyses the lineage graph to do optimizations.

Remember the rule of thumb above. Dependency information can tell us when a shuffle may occur. There are two kinds of dependencies:

• Narrow Dependencies: each partition of the parent RDD is used by at most one partition of the child RDD
  • join (with co-partitioned inputs)
  • map, mapValues, flatMap, filter, union, mapPartitions, mapPartitionsWithIndex
• Wide Dependencies: each partition of the parent RDD may be depended on by multiple child partitions
  • join, leftOuterJoin, rightOuterJoin (with inputs not co-partitioned)
  • cogroup, groupWith, groupByKey, groupBy, reduceByKey, combineByKey, distinct, intersection, repartition, coalesce

The former is fast! No shuffling is necessary, and optimizations like pipelining are possible. The latter is slow — it requires some or all of the data to be shuffled over the network.

There is a dependencies method on RDDs. It returns a sequence of Dependency objects, which are the dependencies used by Spark’s scheduler to know how this RDD depends on other RDDs.

The sorts of dependency objects it may return include:

• Narrow dependency objects: OneToOneDependency, PruneDependency, RangeDependency
• Wide dependency objects: ShuffleDependency

Another helpful method is toDebugString, which prints out a visualization of the RDD’s lineage, and other relevant scheduling information.

Lineage graphs are the key to fault tolerance in Spark: we can recover from failures by recomputing lost partitions from the lineage graphs (we can just recompute the partition, not the whole data set). This allows fault tolerance without writing to disk, which is why Spark is so fast.

Recomputing missing partitions is fast for narrow dependencies, but slow for wide dependencies.

### Structured and Unstructured Data

In Spark, there are often multiple ways of achieving the same result, often with very different running times. For instance, filtering before joining is faster than joining before filtering, which itself is faster than computing a Cartesian product and then filtering.
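The three strategies can be compared on local collections by looking at the size of the intermediate data each one builds. This is only a plain-Scala model with made-up users and orders; it says nothing about Spark's actual running times, but the ordering of the intermediate sizes is the point.

```scala
object JoinOrderDemo {
  // Hypothetical data: (id, name) and (userId, amount).
  val users = (1 to 100).map(i => (i, s"user$i"))
  val orders = (1 to 100).map(i => (i, i * 10))
  def bigOrder(amount: Int): Boolean = amount > 900

  // 1. Cartesian product, then filter: 100 * 100 intermediate pairs.
  val cartesian = for (u <- users; o <- orders) yield (u, o)
  val viaCartesian = cartesian.collect {
    case ((id, name), (uid, amount)) if id == uid && bigOrder(amount) => (name, amount)
  }

  // 2. Join, then filter: 100 intermediate pairs.
  val ordersById = orders.toMap
  val joined = users.flatMap { case (id, name) => ordersById.get(id).map(a => (name, a)) }
  val viaJoinThenFilter = joined.filter { case (_, a) => bigOrder(a) }

  // 3. Filter, then join: only 10 orders take part in the join.
  val bigOrders = orders.filter { case (_, a) => bigOrder(a) }
  val bigById = bigOrders.toMap
  val viaFilterThenJoin = users.flatMap { case (id, name) => bigById.get(id).map(a => (name, a)) }

  def main(args: Array[String]): Unit = {
    println(s"intermediate sizes: ${cartesian.size}, ${joined.size}, ${bigOrders.size}")
    println(viaCartesian == viaJoinThenFilter && viaJoinThenFilter == viaFilterThenJoin)
  }
}
```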

Sometimes, the way we pick isn’t the fastest one; it would be great if Spark could optimize our commands to the fastest version! That’s what Spark SQL does: given a bit of extra structural information, Spark can do many optimizations for us!

Not all data is created equal, structurally. It falls on a spectrum from unstructured (log files, images), to semi-structured (JSON, XML), to structured (database tables). With RDDs, we’ve been working with unstructured or semi-structured data, where we don’t know anything about the schema of the data we’re working with (we just have generic typing, without knowing what’s inside of the objects). The same can be said about computation; lambda operations are opaque to Spark (they’re not predefined like in SQL).

We’ve got to give up some of the freedom, flexibility and generality of the functional collections API in order to give Spark more opportunities to optimize.

### Spark SQL

Spark SQL is a Spark library with 3 main goals:

1. Support relational processing within Spark and on external data sources, so that we can mix SQL and functional APIs.
2. High performance
3. Support new data sources such as semi-structured data and external databases (it’s usually complicated to connect big data processing pipelines like Spark or Hadoop to an SQL database, but Spark SQL makes it easy).

The three main APIs are:

1. SQL literal syntax
2. DataFrames
3. Datasets

It also relies on two specialized backend components:

1. Catalyst, a query optimizer
2. Tungsten, off-heap serializer (encodes Scala objects efficiently off the heap, away from the garbage collector)

#### Getting started

DataFrame is Spark SQL’s core abstraction. Conceptually, it’s equivalent to a table in a relational database: an RDD full of records with a known schema. Unlike RDDs, DataFrames are untyped (no type parameter), but require schema information. One final terminological point: transformations on DataFrames are called untyped transformations.

To get started using Spark SQL, we’ll start with SparkSession (the equivalent of SparkContext):

DataFrames can be created in two ways. Either from an existing RDD:

Or by defining a schema explicitly:

We can also read from a source file (it supports JSON, CSV, Parquet, JDBC… see the documentation):

Once we have a DataFrame, we can freely write SQL syntax. We just have to register it as a temporary SQL view first:

See this cheat sheet for supported SQL statements.

#### DataFrames

DataFrames are a relational API over Spark’s RDDs, which can be aggressively optimized. Another thing to remember is that they’re untyped; the elements within DataFrames are Rows, which aren’t parameterized by a type, so the compiler cannot type check Spark SQL.

SQL does have types though; in order to access any of them, you must first import Spark SQL types:

The DataFrames API contains operations that look similar to SQL, including, but not limited to, select, where, limit, orderBy, groupBy, join, …

To see what a DataFrame looks like, we can use .show(), which pretty prints the data’s first 20 elements in tabular form. .printSchema() prints the schema in tree format.

There are a few different ways to specify columns. The syntax is flexible, as it goes through an SQL parser, but the first two are often a little less error-prone:

Note: where and filter are exactly equivalent. Use whichever looks/sounds best.

Like on RDDs, transformations on DataFrames are operations which return a DataFrame as a result, and are lazily evaluated. As an example, assume we have a dataset of homes currently for sale in an entire US state. We want the most expensive and the least expensive homes for sale per zip code.

Another example: we want to find the person with the most posts in each subforum of a forum:

Some of the transformations return specific kinds of DataFrames on which we can only execute certain actions. Here’s a list of methods that can be called on a RelationalGroupedDataset (after a groupBy), and here’s a list of methods that can be called within agg.

##### Cleaning Data with DataFrames

Sometimes, data sets have null or NaN values. In these cases it’s often desirable to:

• Drop rows with unwanted values:
  • drop() drops rows that contain null or NaN values in any column and returns a new DataFrame
  • drop("all") drops rows that contain null or NaN values in all columns and returns a new DataFrame
  • drop(Array("id", "name")) drops rows that contain null or NaN values in the specified columns and returns a new DataFrame
• Replace certain values with a constant:
  • fill(0) replaces all occurrences of null or NaN in numeric columns with the specified value and returns a new DataFrame
  • fill(Map("minBalance" -> 0)) replaces all occurrences of null or NaN in the specified column with the specified value and returns a new DataFrame
  • replace(Array("id"), Map(1234 -> 8923)) replaces the specified value (1234) in the specified column (id) with the specified replacement value (8923) and returns a new DataFrame

##### Common actions on DataFrames

• collect(): Array[Row]: returns an array containing all rows in the DataFrame
• count(): Long: returns the number of rows in the DataFrame
• first(): Row or head(): Row: returns the first row in the DataFrame
• show(): Unit: displays the top 20 rows in tabular form
• take(n: Int): Array[Row]: returns the first n rows

##### Joins on DataFrames

Joins on DataFrames are similar to those on Pair RDDs, but since DataFrames aren’t key/value pairs, we have to specify which columns to join on.

There are several types of joins available: inner, outer, left_outer, right_outer, leftsemi.

##### Optimizations on DataFrames

The great advantage of working with DataFrames is that queries are automatically optimized. Revisiting our award example, we don’t have to put too much thought into performance, and we can just do what seems most natural and enjoy optimized performance:

In practice, compared to the previous Spark RDD version, the DataFrame version is even faster! How is this possible?

Recall that Spark comes with Catalyst, which is Spark SQL’s query optimizer, which compiles Spark SQL programs down to an RDD. It can:

• Reorder operations: The laziness and structure inherent to Catalyst gives it the ability to analyze and rearrange the DAG of the computation before it’s executed
• Reduce the amount of data we must read: Catalyst can skip reading in, serializing and sending around parts of the data set that aren’t needed for our computation
• Prune unneeded partitions: it analyzes DataFrames and filter operations to figure out and skip partitions that are unneeded in our computation
• And more!

Tungsten is Spark SQL’s off-heap data encoder. It takes schema information and tightly packs serialized data into memory. This means more data can fit in memory, and faster serialization/deserialization can occur. Data is stored off-heap, where it’s free from garbage collection overhead.

##### Limitations

• Untyped: errors aren’t caught at compile time, they’re caught at execution time
• Limited data types: data that can’t be expressed by case class or Product or standard Spark SQL data types is hard to encode with Tungsten; it doesn’t always play well with old codebases.
• Requires semi-structured or structured data: some data doesn’t fit nicely in DataFrames (images, logs, etc).

#### Datasets

Datasets provide an alternative framework to DataFrames. As an example, let’s say we want the average price of home listings:

We wanted an Array[Double], what is this Row thing? We have to cast things because Rows don’t have type information associated with them.

This gives us an exception. Well, what about looking at the Row API docs? That reveals that we can do this:

Ah, so we had a mistake in our types! But rather than doing it that way, wouldn’t it be nice to have both Spark SQL and type safety? Enter Datasets: they combine type safety with Spark SQL. DataFrames are in fact Datasets themselves:

The Dataset API unifies the DataFrame and RDD APIs. We can mix and match relational and functional operators! Like DataFrames, they require structured/semi-structured data. Schemas and Encoders are a core part of Datasets.

Recall the Column type from DataFrames. On Datasets, typed operations tend to act on TypedColumn instead. To create a TypedColumn, all you have to do is call as[...] on your untyped Column.

##### Creating Datasets

First, see Getting Started for general setup. Once that is done, there are a number of ways of creating a Dataset:

##### Transformations on Datasets

The Dataset API includes:

• Typed transformations: typed variants of many DataFrame transformations, plus additional transformations such as the RDD-like higher-order functions map, flatMap, etc.
  • map[U](f: T => U): Dataset[U]
  • flatMap[U](f: T => TraversableOnce[U]): Dataset[U]: apply a function to each element in the Dataset and return a Dataset of the contents of the iterators returned
  • filter(pred: T => Boolean): Dataset[T]
  • distinct(): Dataset[T]
  • groupByKey[K](f: T => K): KeyValueGroupedDataset[K, T]: Like on DataFrames, Datasets have a special set of aggregation operations meant to be used after a call to groupByKey, hence the different return type. Operations include:
    • reduceGroups(f: (V, V) => V): Dataset[(K, V)]: reduce on the elements of each group. The function must be commutative and associative.
    • agg[U](col: TypedColumn[V, U]): Dataset[(K, U)]: aggregates using the given Spark SQL function, for instance someDS.agg(avg($"column").as[Double]). We use the .as method to make it a TypedColumn, and then everything type checks.
    • mapGroups[U](f: (K, Iterator[V]) => U): Dataset[U]: maps on the collection of values for each group. A big disclaimer on it though: it does not support partial aggregation, and as a result requires shuffling all the data in the Dataset. If an application intends to perform an aggregation over each key, it is best to use the reduce function or an Aggregator.
    • flatMapGroups[U](f: (K, Iterator[V]) => TraversableOnce[U]): Dataset[U]
  • coalesce(numPartitions: Int): Dataset[T]
  • repartition(numPartitions: Int): Dataset[T]
• Untyped transformations: the transformations we learned on DataFrames

##### Aggregators

An Aggregator is a class that helps you generically aggregate data, kind of like the aggregate method on RDDs. It has three type parameters:

• IN is the input type to the aggregator. When using an aggregator after groupByKey, this is the type that represents the value in the key/value pair
• BUF is the intermediate type during aggregation
• OUT is the type of the output of the aggregation
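To show how the three types fit together, here is a local stand-in that has the same four core methods as Spark's Aggregator (zero, reduce, merge, finish) but no encoders. `LocalAggregator`, its `run` helper and the `Average` example are hypothetical names for this sketch; it is plain Scala and does not need Spark.

```scala
// Same shape as an Aggregator[IN, BUF, OUT], minus the encoders.
trait LocalAggregator[IN, BUF, OUT] {
  def zero: BUF                    // initial (neutral) buffer
  def reduce(b: BUF, a: IN): BUF   // fold one input into a buffer
  def merge(b1: BUF, b2: BUF): BUF // combine two partial buffers
  def finish(r: BUF): OUT          // turn the final buffer into the output

  // Illustration only: reduce within each partition, then merge the buffers.
  def run(partitions: Seq[Seq[IN]]): OUT =
    finish(partitions.map(_.foldLeft(zero)(reduce)).foldLeft(zero)(merge))
}

// Average of Doubles: IN = Double, BUF = (sum, count), OUT = Double.
object Average extends LocalAggregator[Double, (Double, Long), Double] {
  def zero = (0.0, 0L)
  def reduce(b: (Double, Long), a: Double) = (b._1 + a, b._2 + 1)
  def merge(b1: (Double, Long), b2: (Double, Long)) = (b1._1 + b2._1, b1._2 + b2._2)
  def finish(r: (Double, Long)) = if (r._2 == 0) 0.0 else r._1 / r._2
}

object AverageDemo {
  def main(args: Array[String]): Unit =
    println(Average.run(Seq(Seq(1.0, 2.0), Seq(3.0, 6.0))))
}
```

The intermediate type BUF is what makes the average computable in parallel: a (sum, count) pair can be merged across partitions, whereas two partial averages cannot.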

To use it, we have to define the types and methods:

But this gives us an error telling us to define bufferEncoder and outputEncoder. To understand this, we first need to understand what an Encoder is. They are what convert your data between JVM objects and Spark SQL’s specialized internal tabular representation. They’re required by all Datasets! There are two ways to introduce encoders:

• Automatically (generally the case) via implicits from a SparkSession (import spark.implicits._)
• Explicitly via org.apache.spark.sql.Encoders, which contains a large selection of methods for creating Encoders from Scala primitive types and Products. Some examples of Encoder creation methods in Encoders:
• INT, LONG, STRING for nullable primitives
• scalaInt, scalaLong, scalaByte for Scala primitives
• product, tuple for Scala’s Product and tuple types

Example of explicitly creating Encoders:

So we actually also need to define the encoders for our example. This is pretty straightforward once we’ve replaced the IN, BUF and OUT types:

##### Dataset Actions

All the actions are exactly the same as we’ve previously seen on RDDs and on DataFrames.

• collect(): Array[T]
• count(): Long
• first(): T or head(): T
• foreach(f: T => Unit): Unit
• reduce(f: (T, T) => T): T
• show(): Unit
• take(n: Int): Array[T]

##### Limitations of Datasets

Catalyst can’t optimize all operations. Lambda functions can’t be optimized as they are opaque to the optimizer; equivalent relational calls can however be optimized.

• When using Datasets with higher-order functions like map, you miss out on many Catalyst optimizations
• When using Datasets with relational operations like select, you get all of Catalyst’s optimizations
• Though not all operations can be optimized by Catalyst, Tungsten is still running under the hood to serialize data in a highly optimized way

If the data can’t be expressed by case classes or Products and standard Spark SQL data types, it may be difficult to ensure that a Tungsten encoder exists (this is often a problem for legacy code using regular Scala classes).

If your unstructured data cannot be reformulated to adhere to some kind of schema it would be better to use RDDs.

#### Datasets vs DataFrames vs RDDs

To finish this course on Spark, a bit of discussion on when to use the different technologies:

Use Datasets when:

• you have structured or semi-structured data (JSON, XML, CSV, …)
• you want type safety
• you need to work with functional APIs
• you need good performance, but it doesn’t have to be the best

Use DataFrames when:

• you have structured or semi-structured data
• you want the best possible performance, automatically optimized for you

Use RDDs when:

• you have unstructured data
• you need to fine-tune and manage low-level details of RDD computations
• you have complex data types that cannot be serialized with Encoders