September 26, 2016

Google Announces Google Cloud Dataflow

I have a colleague who has been attending the Google I/O event ever since it began in 2008. This year was no exception, and in his trip report he highlighted what Google calls “Google Cloud Dataflow”. From what I can gather, it is sort of like Google’s version of Hadoop, but presumably better (at least from their perspective and use cases). “Cloud Dataflow is based on a highly efficient and popular model used internally at Google,” they write in a blog post introducing it, “which evolved from MapReduce and successor technologies like Flume and MillWheel.”

From what I can gather from the limited and not very specific information available, this platform might be best suited to massive information flows that never really stop (think of a very large Twitter stream, web visits at a very popular website, or large volumes of search strings with associated data such as location). This doesn’t mean it couldn’t also handle data that updates less regularly (think library catalog data), but that probably isn’t the use case that receives Google’s most rapt development attention.

Also, it isn’t yet clear how this new technology will be made available. Certainly it will be available to users of Google’s existing Cloud Platform services (see the product/pricing page), but will it be available to install locally, as Apache Hadoop is? Only time will tell.

Meanwhile, I’m intrigued enough to have signed up for the Google Group where announcements about this new technology will be made. If you have big data processing needs you may want to as well.


About Roy Tennant

Roy Tennant is a Senior Program Officer for OCLC Research. He is the owner of the Web4Lib and XML4Lib electronic discussions, and the creator and editor of Current Cites, a current awareness newsletter published every month since 1990. His books include "Technology in Libraries: Essays in Honor of Anne Grodzins Lipow" (2008), "Managing the Digital Library" (2004), "XML in Libraries" (2002), "Practical HTML: A Self-Paced Tutorial" (1996), and "Crossing the Internet Threshold: An Instructional Handbook" (1993). Roy wrote a monthly column on digital libraries for Library Journal for a decade and has written numerous articles in other professional journals. In 2003, he received the American Library Association's LITA/Library Hi Tech Award for Excellence in Communication for Continuing Education. Follow him on Twitter @rtennant.