Hadoop Cluster Provisioning (en)
HDD
- Use JBOD (Just a Bunch Of Disks) as the layout when a node has multiple hard drives
- Do not use RAID
- For a master node, it is possible to use RAID 1+0 for durability
- Prefer to have at least as many HDDs as CPU cores
- Do not put HDFS on the HDD that holds the OS
- Do not use VMs, because Hadoop MapReduce is I/O-bound
- Use the noatime option when mounting HDDs to avoid updating file access times
- Use a FUSE mount for HDFS; it is an efficient way to read files on HDFS
- Provision everything I/O-related carefully, since HDFS is the base of the Hadoop ecosystem
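The noatime advice above is easy to check after provisioning. A minimal sketch, assuming Linux-style /proc/mounts text as input; the directories /data/1 and /data/2 and the sample text are made up for illustration:

```python
def mounts_missing_noatime(mounts_text, data_dirs):
    """Return the data directories that are not mounted with noatime.

    A directory that does not appear as a mount point at all is also reported.
    """
    options_by_mount_point = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        options_by_mount_point[fields[1]] = fields[3].split(",")

    missing = []
    for data_dir in data_dirs:
        options = options_by_mount_point.get(data_dir)
        if options is None or "noatime" not in options:
            missing.append(data_dir)
    return missing


# Hypothetical sample; on a real node, read the text from /proc/mounts instead.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb1 /data/1 ext4 rw,noatime 0 0
/dev/sdc1 /data/2 ext4 rw,relatime 0 0
"""

print(mounts_missing_noatime(SAMPLE, ["/data/1", "/data/2"]))
```

On a real node you would pass open("/proc/mounts").read() and your own list of HDFS data directories.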
Network
- Use 10G Ethernet
- Use DNS cache
- Better not to use /etc/hosts
- Is InfiniBand better?
- http://www.quora.com/Is-it-worth-using-Infiniband-in-a-Hadoop-cluster-Why-What-is-the-bang4-payoff
- If your job is network-bound, for example shuffle-heavy, it is worth considering, but usually you do not need it.
- InfiniBand is costly
- Do not forget to set up NTP on each node so that all nodes use the same time base.
Kernel Parameter
- Increase the number of file descriptors
- Swap is not needed
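To confirm that the file descriptor limit was really raised, you can check it from the user that runs the Hadoop daemons. A minimal sketch using Python's standard resource module (Unix only); the threshold 32768 is just an assumed example value, not a recommendation from these notes:

```python
import resource


def fd_limit_ok(minimum):
    """Return True if the soft limit on open files is at least `minimum`."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # RLIM_INFINITY means "no limit", which always satisfies the check.
    return soft == resource.RLIM_INFINITY or soft >= minimum


# Show the current (soft, hard) limits and check them against an assumed threshold.
print(resource.getrlimit(resource.RLIMIT_NOFILE))
print(fd_limit_ok(32768))
```

The limit is per user and per process, so run it as the same user as the DataNode or TaskTracker.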
Master
- High Availability
- NFS mount
- QuorumJournalManager is good for CDH users
- http://www.slideshare.net/Cloudera_jp/hdfsha-hcj13w
- Set up a Secondary NameNode
- Cold standby
- Do not run workers on the master node
- The master node should have the master role only
Slaves
- Set up HDFS on JBOD
- No special care is needed
MapReduce
- Approximately #map:#reduce = 2:1 on the cluster
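As a worked example of the 2:1 rule, here is a minimal sketch that splits a total number of task slots into map and reduce slots. It assumes the ratio refers to slots; the function name and the slot counts are hypothetical:

```python
def split_slots(total_slots, map_parts=2, reduce_parts=1):
    """Split total_slots into (map_slots, reduce_slots) at roughly map_parts:reduce_parts."""
    if total_slots < 2:
        raise ValueError("need at least 2 slots to have both map and reduce")
    # Round the reduce share down, but always keep at least one reduce slot.
    reduce_slots = max(1, total_slots * reduce_parts // (map_parts + reduce_parts))
    map_slots = total_slots - reduce_slots
    return map_slots, reduce_slots


print(split_slots(12))  # (8, 4)
print(split_slots(10))  # (7, 3)
```

When the total is not divisible by 3, the leftover slot goes to the map side, which keeps the split close to 2:1.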
Other Tips
Orchestration
If you have more than 3 nodes, it is troublesome to run commands on each node by hand, so use an orchestration tool!
For example,
- pdsh (sh)
- Fabric (python)
- Ansible (python)
- Capistrano (ruby)
It is better to use one of the first three; I used pdsh.
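To give a feel for what this looks like, here is a minimal sketch that builds a pdsh command line for a list of nodes. It relies on pdsh's -w option (comma-separated target hosts); the host names node01 to node03 are hypothetical, and the sketch only prints the command instead of running it:

```python
import shlex


def build_pdsh_command(hosts, remote_command):
    """Build the argument list for running remote_command on all hosts via pdsh."""
    if not hosts:
        raise ValueError("at least one host is required")
    # pdsh -w takes the target hosts as a comma-separated list.
    return ["pdsh", "-w", ",".join(hosts), remote_command]


# Hypothetical host names, for illustration only.
HOSTS = ["node01", "node02", "node03"]
command = build_pdsh_command(HOSTS, "uptime")
print(shlex.join(command))
```

To actually run it, pass the list to subprocess.run(command) on a machine where pdsh is installed and SSH access to the nodes is configured.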
PXE Boot
If your Hadoop cluster is not an ad-hoc cluster and has 10+ nodes, it is better to set up a PXE boot server before setting up the cluster.
CombineFileInputFormat
Even after you have set up a Hadoop cluster, your job may be very slow. In that case, suspect the following: Hadoop is not good at handling very small files, under 64MB (the default block size). Use CombineFileInputFormat to combine small files. In addition, use bzip2 compression (or raw text), which Hadoop can split, so that a combined file is split into block-size pieces, resulting in a further speed-up.
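The idea of combining small files can be illustrated with a small packing sketch. This is not Hadoop's actual algorithm (CombineFileInputFormat also takes node and rack locality into account); it is only a simplified greedy grouping, and the file names and sizes are made up:

```python
MB = 1024 * 1024


def combine_into_splits(files, max_split_size):
    """Greedily group (name, size) pairs into splits of at most max_split_size bytes.

    A single file larger than max_split_size ends up in a split of its own.
    """
    splits = []
    current, current_size = [], 0
    for name, size in files:
        # Start a new split when adding this file would exceed the limit.
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits


# Hypothetical small files, packed into splits of one 64MB block each.
FILES = [("a.txt", 10 * MB), ("b.txt", 30 * MB), ("c.txt", 30 * MB), ("d.txt", 5 * MB)]
print(combine_into_splits(FILES, 64 * MB))
```

Here four small files become two splits, so the job needs two map tasks instead of four; with thousands of tiny files the saving in task start-up overhead is much larger.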
Notes
Please post comments, Hadoop experts.