KZKY memo

A memo for myself.

Spark RDD (en)

RDD (Resilient Distributed Dataset)

I investigated RDD, the core technology of Spark, and eventually found that the RDD papers are the most useful sources for understanding it.

  • Matei Zaharia et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" 2012.
  • Matei Zaharia et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" 2011.
  • Matei Zaharia et al. "Spark: Cluster Computing with Working Sets" 2010.

The three papers have almost the same content, so I read the latest one and summarize it here; however, to gain deeper knowledge about RDD, we have to read the RDD/Spark code.

Terminology

  • RDD
    • Resilient Distributed Dataset
    • the core of Spark and its data-sharing abstraction
  • Transformation
    • lazy operation on an RDD, e.g., map, filter, reduceByKey, etc.; actual computation is not performed (see the sketch after this list).
  • Action
    • actual operation on an RDD, e.g., reduce, collect, etc.; actual computation is performed.
  • Narrow dependency
    • Each partition of the parent RDD is used by at most one partition of the child RDD.
    • Enables pipelined operations
  • Wide dependency
    • Multiple child partitions may depend on the same partition of the parent RDD.
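
A minimal Scala sketch of this behavior (the input path is a placeholder, and local[*] is just for trying it out): map and flatMap create narrow dependencies, reduceByKey creates a wide one, and nothing is computed until the collect action runs.

    import org.apache.spark.{SparkConf, SparkContext}

    object LazyDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("LazyDemo").setMaster("local[*]"))

        // Transformations: only the lineage graph is built here.
        val lines  = sc.textFile("hdfs://...")     // placeholder path
        val words  = lines.flatMap(_.split(" "))   // narrow dependency
        val pairs  = words.map(w => (w, 1))        // narrow dependency
        val counts = pairs.reduceByKey(_ + _)      // wide dependency (shuffle)

        // Action: the whole pipeline above actually executes now.
        counts.collect().take(10).foreach(println)

        sc.stop()
      }
    }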

f:id:KZKY:20141120004800p:plain

A brief explanation of RDD

  • Read-only, partitioned collection of records
  • Data Sharing Abstraction
  • Like a Distributed Shared Memory
  • Reads are fine-grained, writes are coarse-grained
  • Immutable

Advantages

  • Data and intermediate results are stored in memory to speed up computation and are placed on appropriate nodes for optimization.
  • Transformations can be applied to an RDD many times.
  • Lineage information about RDD transformations is recorded for failure recovery; if a failure occurs while operating on a partition, only that partition is recomputed.
  • Able to choose whether to persist on disk
    • Default is in memory
    • Able to place replicas on multiple nodes
    • If data does not fit in memory, it is spilled to disk (I want to see the code)
    • Better to make a checkpoint when a lineage is long or wide dependencies exist in it (see the sketch after this list).
      • Checkpointing is performed in the background
  • Data locality works for narrow dependencies
  • For wide dependencies, intermediate results are written to disk, like mapper outputs in MapReduce.
  • v.s. DSM (Distributed Shared Memory)
    • DSM is hard to make fault-tolerant on commodity servers
    • RDD is immutable, so it is easy to take a backup
      • In DSM, tasks access the same memory location and interfere with each other's updates.
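
A short sketch of the persistence and checkpointing knobs above, assuming a local SparkContext; the storage level and checkpoint directory are illustrative choices, not settings from the paper.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(
      new SparkConf().setAppName("PersistDemo").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 1000000).map(x => x.toLong * x)

    // Keep the RDD in memory with a replica on a second node.
    val cached = rdd.persist(StorageLevel.MEMORY_ONLY_2)

    // Cut a long lineage: checkpoint to reliable storage (placeholder directory).
    sc.setCheckpointDir("hdfs://...")
    cached.checkpoint()   // materialized when the next action runs

    cached.count()        // this action triggers both caching and the checkpoint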

f:id:KZKY:20141120004845p:plain

RDD interface

Five interfaces

  • partitions()
    • return a list of partitions
  • preferredLocations(p)
    • return a list of nodes where partition p can be accessed faster due to data locality
  • dependencies()
    • return a list of dependencies on parent RDDs
  • iterator(p, parentIters)
    • compute the elements of partition p given iterators for its parent partitions
  • partitioner()
    • return metadata specifying whether the RDD is hash/range partitioned
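
Written down as Scala, the five interfaces form a small trait. This is a paraphrase of the paper's Table 3, not Spark's actual RDD class (which uses compute/getPartitions/getDependencies instead); the Partition, Dependency, and Partitioner types below are placeholders.

    // Sketch of the RDD interface as described in the paper (not Spark's real API).
    trait Partition
    trait Dependency
    trait Partitioner

    trait PaperRDD[T] {
      def partitions(): Seq[Partition]                   // list of partition objects
      def preferredLocations(p: Partition): Seq[String]  // nodes with data locality for p
      def dependencies(): Seq[Dependency]                // parents and dependency types
      def iterator(p: Partition,
                   parentIters: Seq[Iterator[_]]): Iterator[T]  // compute p from parents
      def partitioner(): Option[Partitioner]             // hash/range partitioning metadata
    }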

Not suitable applications

  • Fine-grained and asynchronous updates
    • web application storage
    • web crawler
  • As a machine learning example, perhaps fully asynchronous updates with lock-free algorithms?

Partition

If HDFS is the data source, one partition corresponds to one block.
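
For example, the partition count can be checked directly; this assumes a SparkContext sc and a placeholder path. With 256 MB blocks, a 1 GB file would load as roughly four partitions.

    val lines = sc.textFile("hdfs://...")  // placeholder path
    println(lines.partitions.length)       // roughly the number of HDFS blocks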

Memory Management

Three types

  • deserialized object in memory
  • serialized object in memory
  • serialized object on disk
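
These three types correspond to Spark's storage levels; a sketch assuming a SparkContext sc. A storage level cannot be changed once assigned, hence the three separate RDDs.

    import org.apache.spark.storage.StorageLevel

    sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_ONLY)      // deserialized objects in memory (default)
    sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_ONLY_SER)  // serialized objects in memory
    sc.parallelize(1 to 1000).persist(StorageLevel.DISK_ONLY)        // serialized objects on disk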

Evaluation

I summarize only the case where the data size does not fit into the cluster memory.

  • Condition
    • m1.xlarge EC2 node
    • 4 cores
    • 15 GB RAM
    • HDFS with 256 MB blocks
    • 100 GB data
    • 25 nodes

Reading carefully, rather than using a data size that does not fit into memory, they actually conducted the experiment by varying the amount of memory available on each node and ran logistic regression (sketched below).

Quoted sentence

we configured Spark not to use more than a certain percentage of memory to store RDDs on each machine.

I am wondering about this: even with a 25% memory limit, at most 93.75 GB of memory (15 GB * 0.25 * 25 nodes) can be used.
Anyway, even when Spark uses 0% of memory for RDD storage, the computation is 2.67 times (184/68.8) faster than Hadoop.
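
The benchmark is essentially the logistic regression example from the paper: the training points are persisted once and rescanned on every iteration. A lightly adapted sketch assuming a SparkContext sc; the input path, the "y x1 ... xD" line format, and the dimension D are my assumptions.

    case class Point(x: Array[Double], y: Double)

    def parsePoint(line: String): Point = {
      val t = line.split(' ').map(_.toDouble)  // assumed format: "y x1 x2 ... xD"
      Point(t.tail, t.head)
    }

    val D = 10
    val points = sc.textFile("hdfs://...")     // placeholder path
                   .map(parsePoint)
                   .persist()                  // cached once, reused every iteration

    var w = Array.fill(D)(scala.util.Random.nextDouble())

    for (i <- 1 to 10) {
      val gradient = points.map { p =>
        val dot   = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
        val scale = (1.0 / (1.0 + math.exp(-p.y * dot)) - 1.0) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }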

f:id:KZKY:20141120004841p:plain

f:id:KZKY:20141120004837p:plain