RDD (Resilient Distributed Dataset)
- Matei Zaharia et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" 2012.
- Matei Zaharia et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" 2011.
- Matei Zaharia et al. "Spark: Cluster Computing with Working Sets" 2010.
- Resilient Distributed Dataset
- the core of Spark and its data-sharing abstraction
- Transformation: lazy operation on an RDD, e.g., map, filter, reduceByKey, etc.; actual computation is not performed.
- Action: operation on an RDD, e.g., reduce, collect, etc.; actual computation is performed.
- Narrow dependency: each partition of the parent RDD is used by at most one partition of the child RDD.
- Wide dependency: multiple child partitions may depend on the same partition of the parent RDD.
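The transformation/action split above can be sketched in plain Python. This is a toy model (the class name `LazyDataset` and its internals are made up for illustration, not Spark code): transformations only record the operation, and an action runs the whole recorded pipeline.

```python
# Toy sketch of lazy transformations vs. eager actions (not Spark internals).
from functools import reduce as _reduce

class LazyDataset:
    def __init__(self, data, ops=()):
        self.data = list(data)
        self.ops = ops          # recorded transformations, not yet applied

    # Transformations: return a new dataset, perform no work.
    def map(self, f):
        return LazyDataset(self.data, self.ops + (("map", f),))

    def filter(self, p):
        return LazyDataset(self.data, self.ops + (("filter", p),))

    # Actions: run the recorded pipeline and produce a result.
    def collect(self):
        out = self.data
        for kind, f in self.ops:
            out = [f(x) for x in out] if kind == "map" else [x for x in out if f(x)]
        return out

    def reduce(self, f):
        return _reduce(f, self.collect())

evens_squared = LazyDataset(range(10)).filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# Nothing has been computed yet; the actions below trigger the whole pipeline.
print(evens_squared.collect())                   # [0, 4, 16, 36, 64]
print(evens_squared.reduce(lambda a, b: a + b))  # 120
```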
A few words of explanation about RDDs
- Read-only, partitioned collection of records
- Data Sharing Abstraction
- Like a Distributed Shared Memory
- Reads are fine-grained; writes are coarse-grained.
- Data and intermediate results are stored in memory to speed up computation and are placed on appropriate nodes for optimization.
- Transformations can be applied to an RDD any number of times.
- Lineage information about RDD transformations is tracked for failure recovery: if a failure occurs while computing a partition, only that partition is recomputed.
- Persistence on disk is optional.
- Data locality works with narrow dependencies.
- Intermediate results of a wide dependency are written to disk, like mapper output in MapReduce.
- vs. DSM (Distributed Shared Memory)
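The narrow/wide distinction can be illustrated with a small Python sketch (a hypothetical toy, not Spark internals): a per-partition map keeps a one-to-one dependency, while hash repartitioning forces each child partition to read records from every parent partition — a shuffle.

```python
# Toy contrast of narrow vs. wide dependencies between partitioned datasets.
parent = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # 3 parent partitions

# Narrow dependency: map is applied partition-by-partition;
# child partition i depends only on parent partition i.
mapped = [[x * 10 for x in part] for part in parent]

# Wide dependency: hash-repartitioning into 2 child partitions;
# each child partition may pull records from every parent partition (a shuffle).
num_child = 2
child = [[] for _ in range(num_child)]
deps = [set() for _ in range(num_child)]     # which parent partitions each child reads
for i, part in enumerate(parent):
    for x in part:
        c = x % num_child                    # stand-in for a hash partitioner
        child[c].append(x)
        deps[c].add(i)

print(deps)   # every child partition depends on all 3 parent partitions
```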
- partitions(): return a list of partitions.
- preferredLocations(p): return a list of nodes with data locality for partition p.
- dependencies(): return a list of dependencies on parent RDDs.
- iterator(p, parentIters): compute the elements of partition p given iterators for its parent partitions.
- partitioner(): return metadata specifying whether the RDD is hash/range partitioned.
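The interface above can be sketched as a toy in Python (class names `SourceRDD`/`MappedRDD` are invented for illustration; no partitioner metadata is modeled). It also shows lineage-based recovery in miniature: pulling a partition's iterator re-runs the computation from the parent, so a lost partition can simply be recomputed.

```python
# Toy sketch of the RDD interface with a narrow, one-to-one dependency.
class SourceRDD:
    def __init__(self, partitions):
        self._parts = partitions

    def partitions(self):             return list(range(len(self._parts)))
    def preferred_locations(self, p): return []   # no locality info in this sketch
    def dependencies(self):           return []
    def iterator(self, p, parent_iters=None):
        return iter(self._parts[p])

class MappedRDD:
    """Result of a map: narrow dependency on a single parent RDD."""
    def __init__(self, parent, f):
        self.parent, self.f = parent, f

    def partitions(self):             return self.parent.partitions()
    def preferred_locations(self, p): return self.parent.preferred_locations(p)
    def dependencies(self):           return [self.parent]
    def iterator(self, p, parent_iters=None):
        # Lineage in action: asking for partition p re-runs f over the
        # parent's partition p, so a lost partition is just recomputed.
        return (self.f(x) for x in self.parent.iterator(p))

base = SourceRDD([[1, 2], [3, 4], [5, 6]])
doubled = MappedRDD(base, lambda x: 2 * x)
print([list(doubled.iterator(p)) for p in doubled.partitions()])  # [[2, 4], [6, 8], [10, 12]]
```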
Not suitable applications
- Fine-grained and asynchronous updates
- web application storage
- web crawler
- For a machine learning example: a fully asynchronous, lock-free update algorithm?
If HDFS is the data source, 1 partition corresponds to 1 block.
- deserialized object in memory
- serialized object in memory
- serialized object on disk
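The three persistence options above can be mimicked in plain Python, using `pickle` as a stand-in for Spark's serializer (an assumed analogy, not Spark code): live objects are fastest to access, a serialized byte string saves space at CPU cost, and spilling to disk frees RAM at I/O cost.

```python
# Toy analogy for the three storage levels, with pickle as the serializer.
import pickle, tempfile, os

partition = [{"id": i, "value": i * i} for i in range(100)]

# 1) Deserialized objects in memory: keep the live objects (fastest access).
in_memory = partition

# 2) Serialized objects in memory: one compact byte string (less space, more CPU).
in_memory_ser = pickle.dumps(partition)

# 3) Serialized objects on disk: spill the bytes to a file (slowest, frees RAM).
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(in_memory_ser)

with open(path, "rb") as f:
    restored = pickle.loads(f.read())
os.remove(path)

print(restored == in_memory)  # True: all three levels hold the same records
```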
I summarize only the case where the data size does not fit into the cluster memory.
- m1.xlarge EC2 node
- 4 cores
- 15 GB RAM
- HDFS with 256 MB blocks
- 100 GB data
- 25 nodes
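A back-of-the-envelope check of the setup above: since one HDFS block becomes one partition, 100 GB of input with 256 MB blocks yields 400 partitions, i.e., 16 per node.

```python
# Partition count implied by the experimental setup (1 HDFS block = 1 partition).
data_gb, block_mb, nodes, cores = 100, 256, 25, 4

partitions = data_gb * 1024 // block_mb
print(partitions)            # 400 partitions in total
print(partitions // nodes)   # 16 partitions per node
```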
Reading more carefully: rather than using a data size that does not fit into memory, they varied the amount of memory available on each node and ran logistic regression.
"we configured Spark not to use more than a certain percentage of memory to store RDDs on each machine."
I am wondering about this: even at 25% memory usage, at most 93.75 GB of memory (15 GB * 0.25 * 25 nodes) can be used.
Anyway, even when Spark uses 0% of memory for RDD storage, it is 2.67 times (184 / 68.8) faster than Hadoop.
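The two numbers above check out arithmetically (184 s and 68.8 s are the runtimes quoted in the comparison):

```python
# Verifying the cache-capacity and speedup arithmetic in the notes above.
ram_gb, frac, nodes = 15, 0.25, 25
print(ram_gb * frac * nodes)         # 93.75 GB available for cached RDDs at 25%

hadoop_s, spark_s = 184, 68.8
print(round(hadoop_s / spark_s, 2))  # 2.67x speedup even with 0% cache
```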