Last week Amazon announced something very cool: the availability of public data sets hosted in its Elastic Block Store (EBS) service, part of its Elastic Compute Cloud (EC2) offering. These data sets are available for free, with only typical EC2 runtime charges applied for access and use of the data. If you're not familiar with EBS, it's Amazon's persistent storage service integrated with EC2, making it easy to read and write data sets from applications hosted in EC2.
The first data offerings cover genome, chemistry, and economic statistics. I'm not very familiar with the first two categories, but I have worked a lot with the economic ones Amazon has on tap: Census, Bureau of Labor Statistics (BLS), and (coming soon) Bureau of Economic Analysis (BEA) data. The BLS keeps track of job statistics, and the BEA tracks overall economic data like GDP, capital investment, and so on. Amazon indicates that it plans to grow the number of data sets it offers.
What is striking about the announcement is how Amazon keeps making unexpected moves in its cloud offerings. It pioneered the category, showing the way. Hundreds of startups have jumped on EC2, using it as the foundation for inexpensive offerings like backup services, image manipulation, and so on. Just as the rest of the industry started to catch up with that, Amazon comes along and offers pre-built data sets, free for the asking.
This is a great initiative, and offers real promise in a number of ways:
It provides a way for neophytes to learn how to build real systems, using the data sets as a jumping-off point. If Amazon wanted to create a perfect testbed for enterprises to get comfortable with EC2, it succeeded. These data sets mirror the most demanding corporate ones in terms of size and complexity. They are therefore a great starting point for corporations to experiment with Amazon's EC2 service. This is also a big win for Amazon in that it will induce more people to use its cloud services.
It offers great learning tools for educators. I was talking with a friend and he pointed out that a statistics teacher could easily use the economics data sets as the basis for class assignments. In fact, Amazon notes that the sets can be made even more valuable by creating customized system images that contain preconfigured apps that use one or more of the data sets; these images can be shared. A teacher could easily create a preconfigured system with data for all of the students in a class and allow them to start work immediately.
It offers great potential for innovation. These data sets can be mashed up with other public data sets or web services to create new applications. One need only look at the variety of imaginative mashups currently out on the web to recognize that these data sets could form the basis for many new applications -- most of which can't be envisioned beforehand but, once available, immediately strike one as inventive.
Amazon's continued progress in pioneering new means of system creation and new types of computer systems is exciting. I recently posted about its CloudFront content delivery initiative -- unexpected yet perfectly sensible once understood. With these public data sets, Amazon continues to increase the volume and variety of its offerings -- unlike many other recently announced cloud initiatives from other companies, which seem to be mostly focused on migrating existing products to a new hosting model (aka "your mess for less"). I can't wait to see what's next.
Bernard Golden is CEO of consulting firm HyperStratus, which specializes in virtualization, cloud computing and related issues. He is also the author of "Virtualization for Dummies," the best-selling book on virtualization to date.