Sunday, March 15, 2009

data processing workflows

if you are faced with a data processing workflow that has to process or transform a huge amount of data in a limited amount of time, the implementation can end up pretty complex. even if you have enough hardware to do the job, you still need an infrastructure that makes use of that hardware.

getting hardware for a limited time isn't a big issue today. the cloud infrastructure out there (e.g. Amazon EC2) is perfect if you have to process a huge amount of data within a limited amount of time. you can scale the hardware up for as long as it is required and pay only for that duration.

now you also need a ready-to-use software infrastructure to implement the processing workflow. MapReduce is a programming model for exactly this kind of problem.
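
to make the MapReduce idea a bit more concrete, here is a minimal single-process sketch in plain Python. it is not Hadoop code and uses no Hadoop API; all function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`, `count_by_extension`, `total`) are made up for illustration. a real framework would run the map and reduce calls on many nodes and do the grouping step over the network.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record; each call yields (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group all values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    """Run map, shuffle and reduce one after another in a single process."""
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# Hypothetical mapper: emit (file extension, 1) for every file name.
def count_by_extension(filename):
    extension = filename.rsplit(".", 1)[-1].lower()
    yield extension, 1


# Hypothetical reducer: add up the counts collected for one extension.
def total(key, values):
    return sum(values)


files = ["a.svg", "b.SVG", "c.jpg"]
result = run_job(files, count_by_extension, total)
print(result)
```

the point of the pattern is that the mapper and the reducer only ever see one record or one key at a time, which is what lets a framework spread them over many machines.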

Apache Hadoop implements MapReduce, but it lacks ease of use: it is a pretty low-level infrastructure, and of course it lacks higher-level workflows, which MapReduce itself does not define.

Cascading closes the gap. it is based on "stream processing" and applies the MapReduce pattern for you. it is not too complicated: within one day i was able to create a simple application that converts 20 TB of SVG data to JPG, doing some transformation in between, using Batik and 20 concurrent hardware nodes.
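
the "stream processing" idea can be sketched roughly like this: small steps are chained into a pipe, and records flow through them one by one. this is a plain Python illustration, not Cascading's API and not my actual application; `pipe`, `keep`, `each` and `to_jpg_name` are hypothetical names, and the renaming step only stands in for the real SVG to JPG conversion (which Batik does in Java).

```python
def pipe(*steps):
    """Chain steps so the output stream of one becomes the input of the next."""
    def run(stream):
        for step in steps:
            stream = step(stream)
        return stream
    return run


def keep(predicate):
    """Step that passes through only the items matching the predicate."""
    return lambda stream: (item for item in stream if predicate(item))


def each(function):
    """Step that applies the function to every item in the stream."""
    return lambda stream: (function(item) for item in stream)


def to_jpg_name(path):
    # Placeholder for the real conversion: only the file name is changed here.
    return path[:-4] + ".jpg"


convert = pipe(
    keep(lambda path: path.lower().endswith(".svg")),  # ignore non-SVG input
    each(to_jpg_name),                                 # "convert" each file
)

outputs = list(convert(["logo.svg", "notes.txt", "map.SVG"]))
print(outputs)
```

because every step is a lazy generator, nothing is loaded into memory as a whole, which is the property that matters once the input is measured in terabytes.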

for some use cases, the power of the cloud is easy to demonstrate...

