I have a colleague who has been attending the Google I/O event ever since it began in 2008. This year was no exception, and in his trip report he highlighted what Google calls “Google Cloud Dataflow”. From what I can gather, it is sort of like Google’s version of Hadoop, but presumably better (at least from their perspective and use cases). “Cloud Dataflow is based on a highly efficient and popular model used internally at Google,” they write in a blog post introducing it, “which evolved from MapReduce and successor technologies like Flume and MillWheel.”
From what I can gather from the limited information released so far, this platform might be best suited for massive information flows that never really stop (think of a very large Twitter stream, web visits at a very popular website, or large volumes of search strings with associated data such as location). This doesn’t mean it couldn’t also handle data that updates on a less regular basis (think library catalog data), but that probably isn’t the use case receiving Google’s most rapt development attention.
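To make that distinction concrete, here is a minimal, hypothetical Python sketch (not Dataflow’s actual API, which Google has yet to detail) of the unbounded-stream idea: events keep arriving forever, so you tally them incrementally as they come in rather than waiting for a complete batch.

```python
import itertools

def unbounded_stream():
    """Simulate a never-ending feed of events (e.g., a search-query firehose)."""
    for i in itertools.count():
        # The 'query' field cycles through three hypothetical search strings.
        yield {"event_id": i, "query": f"search-{i % 3}"}

def running_counts(stream, limit):
    """Incrementally tally queries as events arrive.

    A real streaming system would run indefinitely; `limit` just stops
    the demonstration after a fixed number of events.
    """
    counts = {}
    for event in itertools.islice(stream, limit):
        counts[event["query"]] = counts.get(event["query"], 0) + 1
    return counts

print(running_counts(unbounded_stream(), 9))
# {'search-0': 3, 'search-1': 3, 'search-2': 3}
```

The point of the sketch is simply that the tallies are always current: there is no moment at which the stream is “finished” and a batch job could start.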
Also, it isn’t yet clear how this new technology will be made available. Certainly it will be available to users of Google’s existing Cloud Platform services (see the product/pricing page), but will it also be available to install locally, as Apache Hadoop is? Only time will tell.
Meanwhile, I’m intrigued enough to have signed up for the Google Group where announcements about this new technology will be made. If you have big data processing needs, you may want to do so as well.