Wednesday 30 July 2014

Starting a CDH5 YARN Hadoop cluster with Whirr on OpenStack

Recently I was looking at tools for starting dynamic Hadoop clusters on our private OpenStack cloud.
The first one I evaluated was Apache Whirr. The project once had an active community, but lately it seems to be almost dead. It is, however, still part of the Cloudera CDH5 distribution, so I decided to give it a shot.

Cloudera provides information on how to set up a CDH5 cluster at: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_whirr_define.html

Unfortunately, these instructions do not work with the CDH5 Whirr, which does not seem to have been updated since CDH4 and is not aware of some of the CDH5 settings.

Another problem is a Whirr issue with the official Ubuntu cloud images (https://issues.apache.org/jira/browse/WHIRR-435). It appears that the new images use cloud-init to update the apt sources.list, and there is a short period of time when Whirr can already ssh to the machine (e.g. to start the JDK installation) but still sees the outdated sources, which breaks the JDK installation.
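
One general way to avoid this kind of race is to wait until cloud-init reports that it has finished before touching apt. The snippet below is only a minimal sketch of that idea in Python, not the patch I actually applied. It assumes the image runs a cloud-init version that writes the marker file /var/lib/cloud/instance/boot-finished at the end of boot; the helper name wait_for_file and the timeout values are made up for illustration.

```python
import os
import time

# Assumption: cloud-init creates this marker once its final stage completes.
BOOT_FINISHED = "/var/lib/cloud/instance/boot-finished"


def wait_for_file(path, timeout=300.0, interval=2.0):
    """Poll until `path` exists.

    Returns True if the file appeared before `timeout` seconds elapsed,
    False otherwise. (Hypothetical helper, not part of Whirr.)
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

On the node, something like wait_for_file(BOOT_FINISHED) would be called before running apt-get update and the JDK installation, and the installation would be aborted if it returns False.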

Finally, the last official version of Whirr (0.8.2) does not work with openjdk-6 (or with some newer versions of the Oracle JDK). The issue was fixed in 0.9.1, and luckily Cloudera's CDH5 Whirr has the fix.

I managed to patch the installation functions to fix the issues mentioned above and was finally able to start my cluster.

The patches and a sample configuration are available at: https://github.com/piotrszul/whirr-chd5-openstack