May 17, 2021

Google Announces Google Cloud Dataflow

I have a colleague who has been attending the Google I/O event ever since it began in 2008. This year was no exception, and in his trip report he highlighted what Google calls “Google Cloud Dataflow”. From what I can gather, it is sort of like Google’s version of Hadoop, but presumably better (at least from their perspective and use cases). “Cloud Dataflow is based on a highly efficient and popular model used internally at Google,” they write in a blog post introducing it, “which evolved from MapReduce and successor technologies like Flume and MillWheel.”

Given how little specific information is available so far, my guess is that this platform is best suited to massive information flows that never really stop (think of a very large Twitter stream, web visits at a very popular website, or large volumes of search strings with associated data such as location). This doesn’t mean it couldn’t also handle data that updates less regularly (think library catalog data), but that probably isn’t the use case that receives Google’s most rapt development attention.

Also, it isn’t yet clear how this new technology will be made available. Certainly it will be available to users of Google’s existing Cloud Platform services (see the product/pricing page), but will it be available to install locally, as Apache Hadoop is? Only time will tell.

Meanwhile, I’m intrigued enough to have signed up for the Google Group where announcements about this new technology will be made. If you have big data processing needs, you may want to sign up as well.


About Roy Tennant

Roy Tennant is a Senior Program Officer for OCLC Research. He is the owner of the Web4Lib and XML4Lib electronic discussions, and the creator and editor of Current Cites, a current awareness newsletter published every month since 1990. His books include "Technology in Libraries: Essays in Honor of Anne Grodzins Lipow" (2008), "Managing the Digital Library" (2004), "XML in Libraries" (2002), "Practical HTML: A Self-Paced Tutorial" (1996), and "Crossing the Internet Threshold: An Instructional Handbook" (1993). Roy wrote a monthly column on digital libraries for Library Journal for a decade and has written numerous articles in other professional journals. In 2003, he received the American Library Association's LITA/Library Hi Tech Award for Excellence in Communication for Continuing Education. Follow him on Twitter @rtennant.