KZKY memo

A memo for myself.

Spark RDD (en)

RDD (Resilient Distributed Dataset)

I investigated RDD, the core technology of Spark, and eventually found that the RDD papers are the most useful sources for understanding it.

  • Matei Zaharia et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" 2012.
  • Matei Zaharia et al. "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing" 2011.
  • Matei Zaharia et al. "Spark: Cluster Computing with Working Sets" 2010.

The three papers have almost the same content, so I read the latest one and summarize it here; however, to gain deeper knowledge about RDD, we have to read the RDD/Spark code.

Terminology

  • RDD
    • Resilient Distributed Dataset
    • the core of Spark and its data-sharing abstraction
  • Transformation
    • lazy operation on an RDD, e.g., map, filter, reduceByKey, etc.; actual computation is not performed (see the sketch after this list).
  • Action
    • actual operation on an RDD, e.g., reduce, collect, etc.; actual computation is performed.
  • Narrow dependency
    • Each partition of the parent RDD is used by at most one partition of the child RDD.
    • Enables pipelined operations
  • Wide dependency
    • Multiple child partitions may depend on the same partition of the parent RDD.
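
A minimal Scala sketch of this behavior (the input path is a placeholder, and local[*] is just for trying it out): map and flatMap create narrow dependencies, reduceByKey creates a wide one, and nothing is computed until the collect action runs.

    import org.apache.spark.{SparkConf, SparkContext}

    object LazyDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("LazyDemo").setMaster("local[*]"))

        // Transformations: only the lineage graph is built here.
        val lines  = sc.textFile("hdfs://...")     // placeholder path
        val words  = lines.flatMap(_.split(" "))   // narrow dependency
        val pairs  = words.map(w => (w, 1))        // narrow dependency
        val counts = pairs.reduceByKey(_ + _)      // wide dependency (shuffle)

        // Action: the whole pipeline above actually executes now.
        counts.collect().take(10).foreach(println)

        sc.stop()
      }
    }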

f:id:KZKY:20141120004800p:plain

A brief explanation of RDD

  • Read-only, partitioned collection of records
  • Data Sharing Abstraction
  • Like a Distributed Shared Memory
  • Reads are fine-grained, writes are coarse-grained
  • Immutable

Advantages

  • Data and intermediate results are stored in memory to speed up computation and are placed on appropriate nodes for optimization.
  • Transformations can be applied to an RDD many times.
  • Lineage information about RDD transformations is recorded for failure recovery; if a failure occurs while operating on a partition, only that partition is recomputed.
  • Able to choose whether to persist on disk
    • Default is in memory
    • Able to place replicas on multiple nodes
    • If data does not fit in memory, it is spilled to disk (I want to see the code)
    • Better to make a checkpoint when a lineage is long or wide dependencies exist in it (see the sketch after this list).
      • Checkpointing is performed in the background
  • Data locality works for narrow dependencies
  • For wide dependencies, intermediate results are written to disk, like mapper outputs in MapReduce.
  • v.s. DSM (Distributed Shared Memory)
    • DSM is hard to make fault-tolerant on commodity servers
    • RDD is immutable, so it is easy to take a backup
      • In DSM, tasks access the same memory location and interfere with each other's updates.
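
A short sketch of the persistence and checkpointing knobs above, assuming a local SparkContext; the storage level and checkpoint directory are illustrative choices, not settings from the paper.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val sc = new SparkContext(
      new SparkConf().setAppName("PersistDemo").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 1000000).map(x => x.toLong * x)

    // Keep the RDD in memory with a replica on a second node.
    val cached = rdd.persist(StorageLevel.MEMORY_ONLY_2)

    // Cut a long lineage: checkpoint to reliable storage (placeholder directory).
    sc.setCheckpointDir("hdfs://...")
    cached.checkpoint()   // materialized when the next action runs

    cached.count()        // this action triggers both caching and the checkpoint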

f:id:KZKY:20141120004845p:plain

RDD interface

Five interfaces

  • partitions()
    • return a list of partitions
  • preferredLocations(p)
    • return a list of nodes where partition p can be accessed faster due to data locality
  • dependencies()
    • return a list of dependencies on parent RDDs
  • iterator(p, parentIters)
    • compute the elements of partition p given iterators for its parent partitions
  • partitioner()
    • return metadata specifying whether the RDD is hash/range partitioned
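
Written down as Scala, the five interfaces form a small trait. This is a paraphrase of the paper's Table 3, not Spark's actual RDD class (which uses compute/getPartitions/getDependencies instead); the Partition, Dependency, and Partitioner types below are placeholders.

    // Sketch of the RDD interface as described in the paper (not Spark's real API).
    trait Partition
    trait Dependency
    trait Partitioner

    trait PaperRDD[T] {
      def partitions(): Seq[Partition]                   // list of partition objects
      def preferredLocations(p: Partition): Seq[String]  // nodes with data locality for p
      def dependencies(): Seq[Dependency]                // parents and dependency types
      def iterator(p: Partition,
                   parentIters: Seq[Iterator[_]]): Iterator[T]  // compute p from parents
      def partitioner(): Option[Partitioner]             // hash/range partitioning metadata
    }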

Not suitable applications

  • Fine-grained and asynchronous updates
    • web application storage
    • web crawler
  • As a machine learning example, perhaps fully asynchronous updates with lock-free algorithms?

Partition

If HDFS is the data source, one partition corresponds to one block.
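
For example, the partition count can be checked directly; this assumes a SparkContext sc and a placeholder path. With 256 MB blocks, a 1 GB file would load as roughly four partitions.

    val lines = sc.textFile("hdfs://...")  // placeholder path
    println(lines.partitions.length)       // roughly the number of HDFS blocks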

Memory Management

Three types

  • deserialized object in memory
  • serialized object in memory
  • serialized object on disk
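
These three types correspond to Spark's storage levels; a sketch assuming a SparkContext sc. A storage level cannot be changed once assigned, hence the three separate RDDs.

    import org.apache.spark.storage.StorageLevel

    sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_ONLY)      // deserialized objects in memory (default)
    sc.parallelize(1 to 1000).persist(StorageLevel.MEMORY_ONLY_SER)  // serialized objects in memory
    sc.parallelize(1 to 1000).persist(StorageLevel.DISK_ONLY)        // serialized objects on disk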

Evaluation

I summarize only the case where the data size does not fit into the cluster memory.

  • Condition
    • m1.xlarge EC2 node
    • 4 cores
    • 15 GB RAM
    • HDFS with 256 MB blocks
    • 100 GB data
    • 25 nodes

Reading carefully, rather than using a data size that does not fit into memory, they actually conducted the experiment by varying the amount of memory available on each node and ran logistic regression (sketched below).

Quoted sentence

we configured Spark not to use more than a certain percentage of memory to store RDDs on each machine.

I am wondering about this: even with a 25% memory limit, at most 93.75 GB of memory (15 GB * 0.25 * 25 nodes) can be used.
Anyway, even when Spark uses 0% of memory for RDD storage, the computation is 2.67 times (184/68.8) faster than Hadoop.
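
The benchmark is essentially the logistic regression example from the paper: the training points are persisted once and rescanned on every iteration. A lightly adapted sketch assuming a SparkContext sc; the input path, the "y x1 ... xD" line format, and the dimension D are my assumptions.

    case class Point(x: Array[Double], y: Double)

    def parsePoint(line: String): Point = {
      val t = line.split(' ').map(_.toDouble)  // assumed format: "y x1 x2 ... xD"
      Point(t.tail, t.head)
    }

    val D = 10
    val points = sc.textFile("hdfs://...")     // placeholder path
                   .map(parsePoint)
                   .persist()                  // cached once, reused every iteration

    var w = Array.fill(D)(scala.util.Random.nextDouble())

    for (i <- 1 to 10) {
      val gradient = points.map { p =>
        val dot   = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
        val scale = (1.0 / (1.0 + math.exp(-p.y * dot)) - 1.0) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }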

f:id:KZKY:20141120004841p:plain

f:id:KZKY:20141120004837p:plain