Just recently I was looking at some tools for starting dynamic Hadoop clusters on our private OpenStack cloud.
The first one I evaluated was Apache Whirr. The project once had an active community, but recently it seems to be almost dead. It is, however, still part of the Cloudera CDH5 distribution, so I decided to give it a shot.
Cloudera provides information on how to set up a CDH5 cluster at: http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH5/latest/CDH5-Installation-Guide/cdh5ig_whirr_define.html
Unfortunately it does not work with the CDH5 Whirr, as Whirr does not seem to have been updated since CDH4 and is not aware of some of the CDH5 settings.
Another problem is a Whirr issue with the official Ubuntu cloud images (https://issues.apache.org/jira/browse/WHIRR-435). It appears that the new images use cloud-init to update the apt sources.list, and there is a short period of time when Whirr can already ssh to the machine (e.g. to start the JDK installation) but still sees the outdated sources, which breaks the JDK installation.
Finally, the last official version of Whirr (0.8.2) does not work with openjdk-6 (and some newer versions of the Oracle JDK). The issue was fixed in 0.9.1. Luckily Cloudera's CDH5 Whirr has the fix.
I have managed to patch the installation functions to fix the issues mentioned above and was finally able to start my cluster.
The patches and a sample configuration are available at: https://github.com/piotrszul/whirr-chd5-openstack
Piotr's Technical
Wednesday 30 July 2014
Tuesday 11 February 2014
Hadoop and OpenCV
Compiling OpenCV with Java
Instructions for compiling OpenCV with Java support are available at:
http://docs.opencv.org/doc/tutorials/introduction/desktop_java/java_dev_intro.html
Here are some additional notes for opencv-2.4.8:
- Ubuntu 12.04: It compiles and works fine as is. However, some extra packages are required, including build-essential, libjpeg-dev, python-dev, ant, libpng-dev and perhaps some others (see the cmake output). JAVA_HOME needs to point to a JDK.
- CentOS 6.2: cmake needs to be upgraded to 2.8.x (2.8.12.2). Some extra packages may be required, e.g. libjpeg-devel, python-devel etc. There is a bug in opencv-2.4.8 that results in a SEGV while loading the library into the JVM. To fix it, apply this patch before compilation: https://github.com/djetter99/opencv/commit/6bf599b1bca8a58c7a656ddc169f7be0fc3094c6
- SUSE Linux Enterprise Server 11 SP2 (bragg cluster): apply the patch (see CentOS). Load the cmake and gcc modules (it does not compile with the Intel compiler).
To apply the patch, run the following in the OpenCV source root directory:
# wget https://github.com/djetter99/opencv/commit/6bf599b1bca8a58c7a656ddc169f7be0fc3094c6.patch
# git apply 6bf599b1bca8a58c7a656ddc169f7be0fc3094c6.patch
Loading custom native libraries in Hadoop
Native libraries need to be on the path defined by the java.library.path Java system property. It’s a bit confusing how to pass it to Hadoop workers, as it depends on the mode (local vs distributed) and the Hadoop version.
For a local run (that is, one that does not involve spawning child JVMs):
- set the additional path in the JAVA_LIBRARY_PATH environment variable
For distributed runs, use one of the following:
- copy the library to the standard Hadoop native path (e.g. /usr/lib/hadoop/lib/native)
- use the distributed cache as described at: http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-common/NativeLibraries.html
- in versions 0.20 and 1.0, pass the additional paths (they need to be available on the worker nodes) in mapred.child.java.opts by defining java.library.path, e.g.: -Dmapred.child.java.opts="-Djava.library.path=/usr/local/lib"
- in version 2.0, pass the additional paths in the mapred.child.env property by defining the JAVA_LIBRARY_PATH environment variable, e.g.: -Dmapred.child.env="JAVA_LIBRARY_PATH=/usr/local/lib"
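To see which of these settings actually reached a JVM, it can help to print the search path and check whether the library file is on it. The following is only a minimal sketch in plain Java: the class name NativeLibraryCheck is made up for illustration, and the default library name opencv_java248 is my assumption about how the OpenCV 2.4.8 Java library is named, so adjust it to whatever your build produced.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

/** Diagnostic helper (hypothetical name): looks for a native library on a search path. */
final class NativeLibraryCheck {

    /**
     * Returns every file on searchPath that matches the platform-specific
     * file name of libName (e.g. "foo" maps to "libfoo.so" on Linux).
     */
    static List<File> findLibrary(String libName, String searchPath) {
        String fileName = System.mapLibraryName(libName);
        List<File> hits = new ArrayList<>();
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty path entries
            }
            File candidate = new File(dir, fileName);
            if (candidate.isFile()) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Library name is an assumption; pass your own as the first argument.
        String libName = args.length > 0 ? args[0] : "opencv_java248";
        String path = System.getProperty("java.library.path", "");
        System.out.println("java.library.path = " + path);
        System.out.println("found: " + findLibrary(libName, path));
    }
}
```

Calling findLibrary from the setup code of a task (rather than from main) shows what the child JVM on a worker node sees, which is where the settings above tend to go wrong.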
Monday 1 April 2013
Image mosaic with Hadoop and Scoobi
I have recently been working on some prototypes applying
Hadoop and MapReduce to image processing.
One useful, albeit somewhat trivial, use for Hadoop is batch processing
of an image set with an embarrassingly parallel function. I was, however, looking
for a problem that would involve both map and reduce steps and, in addition, be
simple and visually compelling.
Eventually I settled on creating image mosaics,
that is, recreating an image from a potentially huge collection of other
images.
The problem is essentially simple. We split the input image
into small square cells, and then for each cell we find an image that matches
it best (according to some measure of sameness) and then replace the cell with
the best fit image in the output mosaic.
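The per-cell matching can be sketched in a few lines of plain Java. This is not the code from my implementation: the class name CellMatcher is made up, and the distance between mean colours is just one assumed measure of sameness, chosen because it is short.

```java
import java.awt.image.BufferedImage;

/** Sketch of per-cell matching; mean-colour distance is an assumed sameness measure. */
final class CellMatcher {

    /** Builds a w x h image filled with a single RGB colour (handy for experiments). */
    static BufferedImage solid(int w, int h, int rgb) {
        BufferedImage img = new BufferedImage(w, h, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                img.setRGB(x, y, rgb);
            }
        }
        return img;
    }

    /** Mean colour of the rectangle (x, y, w, h) of img, returned as {r, g, b}. */
    static double[] meanColour(BufferedImage img, int x, int y, int w, int h) {
        double r = 0, g = 0, b = 0;
        for (int j = y; j < y + h; j++) {
            for (int i = x; i < x + w; i++) {
                int rgb = img.getRGB(i, j);
                r += (rgb >> 16) & 0xFF;
                g += (rgb >> 8) & 0xFF;
                b += rgb & 0xFF;
            }
        }
        double n = (double) w * h;
        return new double[] {r / n, g / n, b / n};
    }

    /** Squared Euclidean distance between two colours; smaller means more alike. */
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int k = 0; k < a.length; k++) {
            sum += (a[k] - b[k]) * (a[k] - b[k]);
        }
        return sum;
    }

    /** Index of the reference image that best fits the cell at (cellX, cellY). */
    static int bestFit(BufferedImage input, int cellX, int cellY, int cellSize,
                       BufferedImage[] refs) {
        double[] target = meanColour(input, cellX * cellSize, cellY * cellSize,
                                     cellSize, cellSize);
        int best = -1;
        double bestDistance = Double.MAX_VALUE;
        for (int k = 0; k < refs.length; k++) {
            double[] ref = meanColour(refs[k], 0, 0, refs[k].getWidth(), refs[k].getHeight());
            double d = distance(target, ref);
            if (d < bestDistance) {
                bestDistance = d;
                best = k;
            }
        }
        return best;
    }
}
```

A mean colour throws away all the structure inside a cell, so a real measure would more likely compare the cell and the scaled-down reference image pixel by pixel.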
Simple enough, except that to get good results one needs a
huge set of reference images. How big the set needs to be depends on the
dimensions of the cells the input image is split into. For example, if we consider 16x16 cells, there
are (256^3)^(16 x 16) = 2^6144 different ways a cell may
look. Try to cover even 1% of that!
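The size of that number is easy to check with arbitrary-precision arithmetic. A minimal sketch, assuming three colour channels of 8 bits each per pixel (the class name CellCombinations is made up for illustration): each pixel contributes 24 bits, so a 16x16 cell has 24 x 256 = 6144 bits of freedom.

```java
import java.math.BigInteger;

/** Counts the distinct images a square cell of RGB pixels can show. */
final class CellCombinations {

    /** Number of distinct cells: (256^3)^(side*side), assuming 8 bits per channel. */
    static BigInteger count(int side) {
        BigInteger coloursPerPixel = BigInteger.valueOf(256).pow(3); // 2^24
        return coloursPerPixel.pow(side * side);
    }

    /** The same count as an exponent of two: 24 bits per pixel times the pixel count. */
    static int exponentOfTwo(int side) {
        return 24 * side * side;
    }

    public static void main(String[] args) {
        BigInteger n = count(16);
        System.out.println("exponent of two: " + exponentOfTwo(16));
        System.out.println("decimal digits:  " + n.toString().length());
    }
}
```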
In general, the larger the set of reference images, the better the
results, and this is where Hadoop comes to the rescue, because mosaic creation can be expressed quite naturally as a MapReduce computation.
In the map phase, for each image of the reference set we calculate
its distance (or sameness) to all the cells of the input image. The result of
the map phase can be visualized as a heat map, with greener cells representing a
better fit (one heat map per reference image).
In the reduce phase, for each cell we find the image with
the best fit (the greenest one in the heat maps) and choose it as the replacement.
This is a bit sketchy and the actual implementation does
not involve heat maps, but hopefully it’s easy enough to understand.
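The two phases can also be written down as a small in-memory program. This is only a sketch in plain Java collections: it uses neither the Hadoop nor the Scoobi API, the names (MosaicMapReduce, Score) are made up, and every cell and reference image is reduced to an assumed {r, g, b} mean colour to keep it short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** In-memory sketch of the mosaic map and reduce phases (not the Hadoop API). */
final class MosaicMapReduce {

    /** One map output record: reference image refId is `distance` away from cell `cell`. */
    static final class Score {
        final int cell;
        final String refId;
        final double distance;

        Score(int cell, String refId, double distance) {
            this.cell = cell;
            this.refId = refId;
            this.distance = distance;
        }
    }

    /** Squared Euclidean distance between two colours; smaller means a better fit. */
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int k = 0; k < a.length; k++) {
            sum += (a[k] - b[k]) * (a[k] - b[k]);
        }
        return sum;
    }

    /** Map step: score one reference image against every cell of the input image. */
    static List<Score> map(String refId, double[] refColour, double[][] cellColours) {
        List<Score> out = new ArrayList<>();
        for (int c = 0; c < cellColours.length; c++) {
            out.add(new Score(c, refId, distance(refColour, cellColours[c])));
        }
        return out;
    }

    /** Reduce step: for one cell, keep the reference image with the smallest distance. */
    static Score reduce(List<Score> scoresForCell) {
        Score best = scoresForCell.get(0);
        for (Score s : scoresForCell) {
            if (s.distance < best.distance) {
                best = s;
            }
        }
        return best;
    }

    /** Maps every reference image, groups the records by cell, then reduces each group. */
    static Map<Integer, String> run(Map<String, double[]> refs, double[][] cellColours) {
        Map<Integer, List<Score>> byCell = new HashMap<>();
        for (Map.Entry<String, double[]> e : refs.entrySet()) {
            for (Score s : map(e.getKey(), e.getValue(), cellColours)) {
                byCell.computeIfAbsent(s.cell, k -> new ArrayList<>()).add(s);
            }
        }
        Map<Integer, String> mosaic = new HashMap<>();
        for (Map.Entry<Integer, List<Score>> e : byCell.entrySet()) {
            mosaic.put(e.getKey(), reduce(e.getValue()).refId);
        }
        return mosaic;
    }
}
```

The grouping by cell inside run stands in for the shuffle that the framework performs between the two phases. On a cluster only map and reduce would be user code.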
A working implementation using Scoobi can be found here: https://bitbucket.org/piotrszul/scoobi-mosaic
Why Scoobi? Well, because it allows expressing MapReduce computations
in the functional programming paradigm.
Here are some sample mosaics created from ~10M images
harvested from Flickr.
To create them, however, I needed to switch from Java graphics
to a C image processing library (OpenCV with JavaCV bindings), as the
calculation of image sameness is quite computationally intensive and the Java version
was just way too slow.
Using a C library with Scoobi and optimizing the cluster for
image processing were interesting challenges in themselves, so I will write
about them in the next posts.