How to Work with Firehose Data
It goes without saying that big data is, well, big. But it's not just the size that's an obstacle when dealing with big data: it's the rate at which that data can be coming into your data storage infrastructure.
Mon, January 30, 2012
Not too long ago, before the age of data automation, data would typically come into an organization at predetermined times, such as business hours. When data was entered from nine to five, it grew at perfectly predictable rates, and was accessed and analyzed at equally predictable rates. Even better, the downtime that occurred when everyone went home at night enabled the night-owl DBAs to make updates and repairs to the database in question.
There may have even been - are you sitting? - overtime pay in it for them.
Many businesses still work with data in this manner, some even exclusively. (Gone, in many cases, is this strange word known as "overtime.") But more and more often, data is coming in from automated sources that don't have downtime, and could be firing data to an organization every second of every day. And significant amounts, at that.
This, then, is what the data gurus call firehose data - a steady and powerful stream of data that your IT infrastructure may be required to manage, and when all is said and done, actually use for business decisions.
According to Josh Berkus, CEO of PostgreSQL Experts Inc., there are four inherent challenges of working with firehose data. Berkus addressed those characteristics in a talk Jan. 22 at the Southern California Linux Expo.
First, the firehose will have a lot of volume: anywhere from hundreds to thousands of facts per second. That volume may not arrive at a steady rate, Berkus added, as it can have spikes, come from multiple uncoordinated sources, and may grow over time.
The second challenge is that, while the volume can vary, the flow itself will be nearly constant, arriving on a 24/7 cycle. This means DBAs can't stop their systems to process the data, nor bring down an entire infrastructure for maintenance. This, and the fact that data can also arrive out of order, means extract, transform, and load (ETL) operations are pretty much not happening.
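To make the out-of-order problem concrete, here is a minimal Python sketch of one common way to cope with it: hold arriving facts in a small buffer and release them in timestamp order once they are old enough that nothing earlier is likely to show up. This is an illustration only, not something Berkus presented. The `reorder` function, the `(timestamp, payload)` event shape, and the `max_lateness` setting are all hypothetical names chosen for the example.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

# Hypothetical event shape: (event timestamp in seconds, payload).
Event = Tuple[float, str]


def reorder(events: Iterable[Event], max_lateness: float) -> Iterator[Event]:
    """Yield events in timestamp order while the stream keeps flowing.

    Assumption: no event arrives more than `max_lateness` seconds behind
    the newest timestamp seen so far. Events that break this assumption
    are still emitted, but possibly out of order.
    """
    heap: List[Event] = []
    newest = float("-inf")
    for event in events:
        heapq.heappush(heap, event)
        newest = max(newest, event[0])
        # Everything at or before the watermark is considered safe to release.
        watermark = newest - max_lateness
        while heap and heap[0][0] <= watermark:
            yield heapq.heappop(heap)
    # The stream ended (in a real firehose it rarely does): flush the rest.
    while heap:
        yield heapq.heappop(heap)


if __name__ == "__main__":
    arrivals = [(1.0, "a"), (3.0, "c"), (2.0, "b"), (5.0, "e"), (4.0, "d"), (9.0, "f")]
    for ts, payload in reorder(arrivals, max_lateness=2.0):
        print(ts, payload)
```

The trade-off is latency against correctness: a larger `max_lateness` tolerates later stragglers but delays every fact by that much, and the buffer grows with it.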
The third obstacle, Berkus told his audience, was that the database itself was going to be large - with multiple terabytes to petabytes of data to be handled.
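A quick back-of-the-envelope calculation shows how a flow of thousands of facts per second turns into terabytes. The sketch below is only arithmetic. The average fact size of 1,000 bytes is an assumption made for illustration, not a figure from Berkus's talk, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # the firehose never stops, so use the full day


def daily_growth_gb(facts_per_second: float, bytes_per_fact: float) -> float:
    """Raw data added per day, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return facts_per_second * bytes_per_fact * SECONDS_PER_DAY / 1e9


def days_to_reach_tb(target_tb: float, facts_per_second: float, bytes_per_fact: float) -> float:
    """Days of continuous flow needed to accumulate `target_tb` terabytes."""
    return target_tb * 1000 / daily_growth_gb(facts_per_second, bytes_per_fact)


if __name__ == "__main__":
    # Assumed workload: 1,000 facts per second at 1,000 bytes each.
    rate, size = 1_000, 1_000
    print(f"{daily_growth_gb(rate, size):.1f} GB per day")
    print(f"{days_to_reach_tb(1, rate, size):.1f} days to reach 1 TB")
```

Under those assumptions the stream adds roughly 86 GB a day and passes a terabyte in under two weeks, before counting indexes, replicas, or any growth in the incoming rate.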


