KZKY memo


Hadoop Cluster Provisioning (en)


  • Use JBOD (Just a Bunch of Disks) as the architecture when using multiple hard drives
  • For a master node, it is possible to use RAID 1+0 for durability
  • Use at least as many HDDs as there are CPU cores
  • Do not put HDFS on the HDD that holds the OS
  • Do not use VMs, because Hadoop MapReduce is I/O-bound
  • Use the noatime option when mounting HDDs to avoid updating file access times
  • Use a FUSE mount for HDFS; it is an efficient way to read files on HDFS
  • Be careful when provisioning anything I/O-related, since HDFS is the base of the Hadoop ecosystem
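
As a rough illustration of the noatime advice, the sketch below scans mount entries in the /proc/mounts layout and reports data directories that are not mounted with noatime. The function name, the sample text, and the /data/1 and /data/2 paths are assumptions made up for this example; they are not part of any Hadoop tool.

```python
def mounts_missing_noatime(mounts_text, data_dirs):
    """Return the data directories whose mount entry lacks the noatime option.

    mounts_text follows the /proc/mounts layout:
    device mountpoint fstype options dump pass
    """
    options_by_mountpoint = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        options_by_mountpoint[fields[1]] = fields[3].split(",")

    missing = []
    for data_dir in data_dirs:
        options = options_by_mountpoint.get(data_dir)
        # A directory with no mount entry of its own is reported as well.
        if options is None or "noatime" not in options:
            missing.append(data_dir)
    return missing


# Hypothetical sample; on a real node, read the text from /proc/mounts.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb1 /data/1 ext4 rw,noatime 0 0
/dev/sdc1 /data/2 ext4 rw,relatime 0 0
"""

if __name__ == "__main__":
    print(mounts_missing_noatime(SAMPLE, ["/data/1", "/data/2"]))
```

On a real node you could pass `open("/proc/mounts").read()` instead of the sample text.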


Kernel Parameters

  • Increase the number of file descriptors
  • Swap is not needed
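
A minimal sketch for checking the file descriptor limit of the current process on a Unix-like node, using Python's standard `resource` module. The threshold of 32768 is an assumed example value, not a recommendation from this memo; pick the value that suits your cluster.

```python
import resource

# Assumed example threshold; tune it for your own cluster.
MIN_OPEN_FILES = 32768


def limit_is_sufficient(soft_limit, minimum=MIN_OPEN_FILES):
    """Return True if the soft open-file limit is at least `minimum`."""
    if soft_limit == resource.RLIM_INFINITY:
        return True  # unlimited always satisfies the minimum
    return soft_limit >= minimum


def current_nofile_limits():
    """Return the (soft, hard) open-file limits of this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    soft, hard = current_nofile_limits()
    print("soft:", soft, "hard:", hard)
    if not limit_is_sufficient(soft):
        print("open-file limit is below", MIN_OPEN_FILES)
```

Run it as the user that starts the Hadoop daemons, since limits are set per user and per process.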



  • Set up HDFS on JBOD
  • No special care is needed


  • Keep approximately #map:#reduce = 2:1 across the cluster
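
The 2:1 rule can be turned into a small calculation. The sketch below assumes that "#map:#red" refers to the number of map and reduce task slots, and that you already know the total number of slots you want; `split_slots` and its parameters are hypothetical names for this example.

```python
def split_slots(total_slots, map_ratio=2, reduce_ratio=1):
    """Split total_slots into (map_slots, reduce_slots) at roughly map_ratio:reduce_ratio."""
    if total_slots < 2:
        raise ValueError("need at least one map slot and one reduce slot")
    # Integer share for reducers, but never fewer than one.
    reduce_slots = max(1, total_slots * reduce_ratio // (map_ratio + reduce_ratio))
    map_slots = total_slots - reduce_slots
    return map_slots, reduce_slots


if __name__ == "__main__":
    print(split_slots(12))  # (8, 4)
    print(split_slots(8))   # (6, 2)
```

When the total is not divisible by three, the remainder goes to the map side, so the result leans slightly above 2:1.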

Other Tips


If you have more than 3 nodes, it is troublesome to run commands on each node by hand, so use an orchestration tool!
For example,

It is better to use the first three; I used pdsh.
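
As a sketch of how pdsh fits in, the helper below builds a pdsh command line that runs one command on several nodes. It relies on pdsh's `-w` option, which takes a comma-separated list of target hosts. The host names node1 to node3 and the helper name are assumptions for this example, and the script only prints the command rather than running it.

```python
def build_pdsh_argv(hosts, command):
    """Build the argument list for running `command` on all `hosts` with pdsh."""
    if not hosts:
        raise ValueError("at least one host is required")
    # pdsh -w accepts a comma-separated list of target hosts.
    return ["pdsh", "-w", ",".join(hosts), command]


if __name__ == "__main__":
    # Hypothetical host names; replace them with your own nodes.
    argv = build_pdsh_argv(["node1", "node2", "node3"], "uptime")
    print(" ".join(argv))  # pdsh -w node1,node2,node3 uptime
```

To actually run it, you could pass the list to `subprocess.run(argv)` on a machine where pdsh is installed and the nodes are reachable.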

PXE Boot

It is better to set up a PXE boot server before setting up the Hadoop cluster if the cluster is not ad hoc and has 10+ nodes.


Even after you have established a Hadoop cluster, your job may be very slow. In that case, suspect the following: Hadoop is not good at handling very small files, that is, files under 64 MB (the default block size). Use CombineFileInputFormat to pack small files together, and store the input in a splittable format such as bzip2 compression (or raw text). Hadoop then splits the combined input into block-sized pieces, which results in a further speed-up.
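
To see why small files hurt, the following back-of-the-envelope sketch counts input splits (and hence map tasks) with and without combining. It is a simplified model, not the real behavior of Hadoop's `CombineFileInputFormat`: it assumes every non-empty file yields at least one split, that combined input is cut purely by block size, and it ignores data locality. The function names are made up for this example.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default block size mentioned above


def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def splits_per_file(sizes, block_size=BLOCK_SIZE):
    """Splits when each file is split on its own (one or more per file)."""
    return sum(ceil_div(size, block_size) for size in sizes if size > 0)


def splits_combined(sizes, block_size=BLOCK_SIZE):
    """Splits when all files are packed together and cut by block size."""
    return ceil_div(sum(sizes), block_size)


if __name__ == "__main__":
    one_mb = 1024 * 1024
    small_files = [one_mb] * 1000       # 1000 files of 1 MB each
    print(splits_per_file(small_files))  # 1000
    print(splits_combined(small_files))  # 16
```

In this model, 1000 files of 1 MB need 1000 map tasks when handled one by one, but only 16 when combined, so far less time goes into per-task startup overhead.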


Please post comments, Hadoop experts.