- Use JBOD (Just a Bunch of Disks) as the storage architecture when using multiple hard drives
- Do not use RAID
- For a master node, it is possible to use RAID 1+0 for durability
- Prefer a number of HDDs greater than or equal to the number of CPU cores
- Do not put HDFS on the HDD that holds the OS
- Do not use VMs, because Hadoop MapReduce is I/O-bound
- Use the noatime option when mounting HDDs to avoid updating file access times
- Use a FUSE mount for HDFS; it makes reading files on HDFS efficient
- Provision I/O-related resources carefully, since HDFS is the base of the Hadoop ecosystem
- Use 10G Ethernet
- Use DNS cache
- Better not to use /etc/hosts
- Is InfiniBand better?
- If your job is network-bound, for example in the shuffle phase, it is worth considering, but usually it is not needed.
- InfiniBand is costly
- Do not forget to set up NTP on each node so that all nodes share the same time base.
- Increase the number of file descriptors
- Swap is not needed
- High Availability
- NFS mount
- QuorumJournalManager is good for CDH users
- Set up a Secondary NameNode
- Cold standby
- Do not run workers on the master node
- Master node should have master role only
- Set up HDFS on JBOD
- No special care is needed
- Approximately #map:#reduce = 2:1 on the cluster
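
The 2:1 map-to-reduce ratio can be turned into concrete numbers with a little arithmetic. Below is a minimal sketch, assuming the ratio refers to map and reduce task slots and that you have already picked a total slot budget per node. The helper name `split_slots` and the budget values are hypothetical, not something Hadoop provides.

```python
from typing import Tuple


def split_slots(total_slots: int, map_ratio: int = 2, reduce_ratio: int = 1) -> Tuple[int, int]:
    """Split a slot budget into (map_slots, reduce_slots) at roughly map_ratio:reduce_ratio.

    Assumption: total_slots is a per-node budget you chose yourself,
    for example based on the number of CPU cores.
    """
    if total_slots < 2:
        raise ValueError("need at least 2 slots to get one map and one reduce slot")
    # Give the reduce side its share first, but never fewer than one slot.
    share = total_slots * reduce_ratio / (map_ratio + reduce_ratio)
    reduce_slots = max(1, round(share))
    # Everything that is left goes to the map side.
    map_slots = total_slots - reduce_slots
    return map_slots, reduce_slots
```

For example, `split_slots(12)` gives 8 map slots and 4 reduce slots. On an MRv1 cluster these numbers would probably end up in `mapred.tasktracker.map.tasks.maximum` and `mapred.tasktracker.reduce.tasks.maximum`, though that mapping is my reading of the ratio rather than something stated above.
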
If you have more than 3 nodes, it is troublesome to run commands on each node by hand, so use an orchestration tool!
It is better to use the first three; I used pdsh.
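
To give an idea of what running one command on every node looks like with pdsh, here is a small sketch that builds a pdsh command line from a host list. It relies only on pdsh's `-w` option for naming target hosts. The helper names `build_pdsh_command` and `run_on_all` are hypothetical, and so are any host names you pass in.

```python
import shutil
import subprocess
from typing import List, Sequence


def build_pdsh_command(hosts: Sequence[str], remote_command: str) -> List[str]:
    """Return the argv list for running remote_command on all hosts via pdsh."""
    if not hosts:
        raise ValueError("at least one host is required")
    # pdsh -w takes a comma-separated list of target hosts.
    return ["pdsh", "-w", ",".join(hosts), remote_command]


def run_on_all(hosts: Sequence[str], remote_command: str) -> subprocess.CompletedProcess:
    """Run remote_command on all hosts; requires pdsh to be installed locally."""
    if shutil.which("pdsh") is None:
        raise RuntimeError("pdsh was not found on PATH")
    return subprocess.run(
        build_pdsh_command(hosts, remote_command),
        capture_output=True,
        text=True,
        check=False,  # inspect returncode and stderr yourself
    )
```

Calling `run_on_all(["node01", "node02", "node03"], "ulimit -n")` would, for instance, report the file descriptor limit on three made-up nodes, which is a quick way to check that a setting has been applied everywhere.
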
Even after you have established a Hadoop cluster, your job may be very slow. In that case, suspect the following. Hadoop is not good at handling very small files, meaning files under 64MB (the default block size). Use CombineFileInputFormat to concatenate small files, and store the data with bzip2 compression (or as raw text), both of which are splittable. Hadoop then splits the concatenated file into block-sized pieces, which gives a further speed-up.
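
To see why small files hurt, it helps to count input splits, since each split usually becomes one map task. The sketch below is a deliberately rough model, not Hadoop's real split logic: it ignores data locality and other details, and the function names are hypothetical. It compares one-split-per-file behavior with the case where small files are packed together up to the block size.

```python
import math
from typing import Sequence

BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, the default block size mentioned above


def splits_per_file(file_sizes: Sequence[int], block_size: int = BLOCK_SIZE) -> int:
    """Rough split count when every file is split on its own."""
    # A non-empty file needs at least one split, however small it is.
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes if size > 0)


def splits_combined(file_sizes: Sequence[int], block_size: int = BLOCK_SIZE) -> int:
    """Rough split count when small files are packed into block-sized splits."""
    total = sum(size for size in file_sizes if size > 0)
    return math.ceil(total / block_size)


# Example: 1000 files of 1MB each.
small_files = [1024 * 1024] * 1000
separate = splits_per_file(small_files)   # one split per file
combined = splits_combined(small_files)   # ceil(1000MB / 64MB)
```

In this model the 1000 small files need 1000 splits when handled separately but only 16 when combined, so the per-task startup overhead is paid far fewer times.
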
Please post comments, Hadoop experts.