KZKY memo

Notes to self.

Hadoop Cluster Provisioning (en)

HDD

  • Use JBOD (Just a Bunch of Disks) as the disk architecture when a node has multiple hard drives
  • On a master node, RAID 1+0 can be used for durability
  • It is better to have at least as many HDDs as CPU cores
  • Do not put HDFS on the HDD that holds the OS
  • Do not use VMs, because Hadoop MapReduce is I/O-bound
  • Mount HDDs with the noatime option to avoid updating file access times
  • Use a FUSE mount for HDFS; it is an efficient way to read files on HDFS
  • Provision everything I/O-related carefully, since HDFS is the foundation of the Hadoop ecosystem
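
The noatime advice above is easy to forget on one disk out of many, so here is a minimal sketch of a check. It assumes the HDFS data disks are mounted under a hypothetical /data prefix and that the input is in /etc/fstab format; the sample entries are made up for illustration.

```python
def mounts_missing_noatime(fstab_text, prefix="/data"):
    """Return mount points under `prefix` whose options lack noatime."""
    missing = []
    for line in fstab_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        fields = line.split()
        if len(fields) < 4:
            continue
        mount_point, options = fields[1], fields[3].split(",")
        if mount_point.startswith(prefix) and "noatime" not in options:
            missing.append(mount_point)
    return missing


# Hypothetical fstab content: one OS disk and two JBOD data disks.
SAMPLE_FSTAB = """\
# device    mount point  type  options           dump pass
/dev/sda1   /            ext4  defaults          0 1
/dev/sdb1   /data/1      ext4  defaults,noatime  0 2
/dev/sdc1   /data/2      ext4  defaults          0 2
"""

print(mounts_missing_noatime(SAMPLE_FSTAB))
```

On a real node the same function could be fed the contents of /etc/fstab instead of the sample string.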

Network

Kernel Parameters

  • Increase the number of file descriptors
  • Swap is not needed
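
For the file descriptor point, a minimal sketch of a check that can be run as the user that starts the Hadoop daemons. The threshold of 32768 is only an assumed example value, not a number from this memo; the `resource` module is in the Python standard library and works on Unix-like systems only.

```python
import resource


def nofile_limit_is_enough(minimum=32768):
    """Check whether the soft open-file limit of this process is at least `minimum`."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means "no limit", which always passes.
    return soft == resource.RLIM_INFINITY or soft >= minimum


soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("soft limit:", soft, "hard limit:", hard)
print("enough for the assumed threshold:", nofile_limit_is_enough())
```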

Master

Slaves

  • Set up HDFS on JBOD
  • No special care is needed

MapReduce

  • Keep the ratio of mappers to reducers (#map:#reduce) at approximately 2:1 across the cluster
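
A minimal sketch of the 2:1 arithmetic, assuming the ratio refers to the task slots available on one node. The function name and the slot counts are hypothetical. In Hadoop 1.x the resulting numbers would presumably go into `mapred.tasktracker.map.tasks.maximum` and `mapred.tasktracker.reduce.tasks.maximum`.

```python
def split_slots(total_slots, map_share=2, reduce_share=1):
    """Divide `total_slots` into (map_slots, reduce_slots) at roughly map_share:reduce_share."""
    if total_slots < 2:
        raise ValueError("need at least one map slot and one reduce slot")
    # Round the reduce side down, but keep at least one reduce slot.
    reduce_slots = max(1, total_slots * reduce_share // (map_share + reduce_share))
    map_slots = total_slots - reduce_slots
    return map_slots, reduce_slots


print(split_slots(12))  # (8, 4)
print(split_slots(8))   # (6, 2)
```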

Other Tips

Orchestration

If you have more than 3 nodes, it is troublesome to run commands on each node by hand, so use an orchestration tool!
For example,

It is better to use the first three; I used pdsh.
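
As a rough sketch of driving pdsh from a script: `pdsh -w` takes a comma-separated host list followed by the command to run on every host. The host names and the helper function below are hypothetical, and the code only builds and prints the command line rather than executing it.

```python
import shlex


def pdsh_command(hosts, remote_command):
    """Build the argv list for running `remote_command` on all `hosts` via pdsh."""
    return ["pdsh", "-w", ",".join(hosts), remote_command]


# Hypothetical node names, for illustration only.
argv = pdsh_command(["node01", "node02", "node03"], "df -h /data")
print(shlex.join(argv))
```

The returned list can be passed to `subprocess.run(argv)` on a machine where pdsh is installed and the nodes are reachable.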

PXE Boot

If your Hadoop cluster is not an ad hoc cluster and has 10+ nodes, it is better to set up a PXE boot server before setting up the cluster.

CombineFileInputFormat

Even after you have set up a Hadoop cluster, your jobs may still be very slow. In that case, suspect the following: Hadoop is not good at handling very small files, meaning files smaller than the block size (64MB by default). Use CombineFileInputFormat to combine small files into larger inputs, and store the data with bzip2 compression (or as raw text), since both are splittable. Hadoop then splits the combined input into block-size pieces, resulting in a further speed-up.
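
To show why combining helps, here is a minimal sketch of the idea only: pack small files greedily into groups of at most one block each, so that each group becomes one map input instead of each file. This is not how CombineFileInputFormat is actually implemented (the real class also takes node and rack locality into account), and the file names and sizes are made up.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, the default block size mentioned above


def pack_into_splits(files, max_split_size=BLOCK_SIZE):
    """Greedily group (name, size) pairs so each group totals at most `max_split_size`."""
    splits, current, current_size = [], [], 0
    for name, size in files:
        # Start a new group when adding this file would overflow the current one.
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits


# 200 hypothetical 1MB files: 200 map inputs without combining.
small_files = [("part-%03d.txt" % i, 1024 * 1024) for i in range(200)]
splits = pack_into_splits(small_files)
print(len(small_files), "files ->", len(splits), "splits")
```

With these made-up sizes the 200 files collapse into 4 groups, so the job would start 4 map tasks instead of 200.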

Notes

Hadoop experts, please post comments.