by Brian Hopkins

The Anthropology of Data

Oct 20, 2010
Enterprise Architecture

Copied from a March posting on my personal blog, this post explores the future of data as it grows and moves to the Cloud.

I wrote this in March 2010 on my personal blog, Practicing Enterprise Architecture, in response to a discussion I was having with the Enterprise Architect of a Fortune 50 company. I noticed recently that some of these ideas are popping up in analyst coverage (see Next Generation BI), so I thought it might be interesting to republish it here:


I was asked recently where I thought data was going; after mulling my answer for a few days I decided I’m not happy with it. My new answer is this – data is moving from being a matter of archeology to anthropology.

While I admit the analogy is obscure, consider this – we have been collecting electronic data for about 50 years now. The largest organizations now have warehouses that contain petabytes of it. Furthermore, there is no end in sight. As we become better at digitizing information and assigning useful metadata to it, we collect more and more. How will we use all this data in the future?

Even now, the most common use cases involve spending enormous effort moving, cleansing, enriching and transforming data into central stores to generate tabular reports. The value of the data we collect therefore depends on the skill and technology used to clean it, connect it and create reports that provide historical information in the hopes that we can make good decisions about the future. In other words, our data is archeological in nature – wait until it’s dead, move it into a secure place and inspect it.

Our ability to collect data has always outpaced our ability to make use of it; therefore, we will always have more of it than we know what to do with. While the ‘move it to the warehouse once it’s dead’ approach will continue to be helpful, tomorrow’s successful organizations will adopt approaches to data that are more akin to anthropology. Think about this – the most useful data is not old and dead, it is alive and evolving. While we can make use of old, dead data, the most useful technology will allow us to inspect the culture of living data and use trends and behaviors observed in the past to predict the future.

Compare these two scenarios –

#1 – Archeological Approach

At a certain point in each year, shoppers start buying orange juice and tissues together and in greater quantity. Not-So-SavvyMart takes the archeological approach – they discover this connection by mining the warehouse and concluding that during cold season people need tissues and want the vitamin C in OJ. As a result of this expensive analysis, they eventually co-locate these products to increase sales during certain times of the year. The archeological pattern – collect data, mine data, draw some “ah ha” conclusions, take some action, increase profit. Boy, that took a long time.

#2 – Anthropological Approach

This approach recognizes the connection between OJ and tissue is one of many connections between products that come and go; the culture of data (what people are buying and why) is most useful, however, when trends can be immediately recognized and taken advantage of. In this second scenario, SavvyMart employs RFID sensors and smart shopping carts to track what its customers are putting in their shopping carts in near real-time. They are also tracking where their customers are going and the patterns they follow through stores. Rather than wait to store and mine this data, they have designed a predictive model that spots trends in products purchased together. Rather than ask, “why” and take action to rearrange product placement, the monitors on the smart carts simply suggest related products when the first is bought, possibly even offering digital coupons. Managers use a product placement data mash-up application to suggest end-stand arrangements that optimize related product purchase opportunities based on the evolution of cart patterns and product purchases.

Next, SavvyMart’s procurement system begins noticing sales trends and advertises its desire to purchase more of high-demand items via its network of suppliers and the GoodRelations ontology. SavvySupplier notices the increased demand and reduces its price on large volumes, making this offer back to SavvyMart automatically without human intervention.

Notice the different pattern here – evaluate raw data, sense patterns, use past behaviors (customers often purchase products together) to create predictive models (what products will be purchased together), and act on data using advanced metadata techniques (ontologies). This is the anthropology of data – I assert that tomorrow’s successful organizations will be the ones that leverage it most effectively to gain competitive advantage.
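To make the "sense patterns, then predict" step a little more concrete, here is a minimal Python sketch of a co-purchase model that updates as each cart is observed instead of waiting for a warehouse load. It is only an illustration under stated assumptions: the `CoPurchaseTracker` class, its method names and the 0.5 confidence threshold are hypothetical, and a real smart-cart system would be far more involved.

```python
from collections import Counter, defaultdict
from itertools import combinations


class CoPurchaseTracker:
    """Hypothetical helper: counts which products show up in the same cart."""

    def __init__(self):
        # pair_counts[a][b] = number of carts that contained both a and b
        self.pair_counts = defaultdict(Counter)
        # item_counts[a] = number of carts that contained a
        self.item_counts = Counter()

    def observe_cart(self, items):
        """Update the counts from a single cart as soon as it is seen."""
        unique = sorted(set(items))
        self.item_counts.update(unique)
        for a, b in combinations(unique, 2):
            self.pair_counts[a][b] += 1
            self.pair_counts[b][a] += 1

    def suggest(self, item, top_n=3, min_confidence=0.5):
        """Products found in at least `min_confidence` of the carts holding `item`."""
        seen = self.item_counts[item]
        if seen == 0:
            return []
        ranked = self.pair_counts[item].most_common()
        return [other for other, n in ranked if n / seen >= min_confidence][:top_n]


# Illustrative carts only; not real sales data.
tracker = CoPurchaseTracker()
for cart in [
    ["orange juice", "tissues", "bread"],
    ["orange juice", "tissues"],
    ["bread", "milk"],
    ["orange juice", "milk"],
]:
    tracker.observe_cart(cart)

print(tracker.suggest("orange juice"))  # ['tissues']
```

The point of the sketch is the shape of the loop: every cart changes the model immediately, so a suggestion can be made while the trend is still alive, without anyone first asking "why".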

As part of the anthropology, here are some additional contributing trends and predictions:

  • The Data Cloud is the next big Cloud. Persistent Data will continue to grow; that’s a given. Cloud Platform Storage (an extension of scale-out storage, for example Google’s AppEngine) provides a technology that is significantly lowering the costs of extremely large-scale data storage environments, but it is not without drawbacks (see David Chappell’s February 27th 2009 blog). Most prominent is that nobody seems to know how to make a relational data store scale to really massive data volumes; this is because relational data stores scale up but not out. If you follow this trend, then the logical conclusion is that tomorrow’s technology for accessing massive amounts of data will have to be really ‘smart’ because we will not have nice little query languages like SQL to help out.
  • (New Content) Cloud Application Platforms will learn to deal with large, dirty and constantly changing data sets. Applications that do not need a “snapshot in time” view of extremely large and complex data sets will be able to operate in the Cloud in ways we are only just beginning to fathom. See the BOOM FAQ from Berkeley and watch for a future post, Bloom Goes Boom in the Cloud.
  • Legal requirements for data are not going anywhere, driving the expectation that data will be stored and accessible. I could reference all kinds of information on the impacts of the 2006 changes to the Federal Rules of Civil Procedure, but I don’t need to. I think we all understand this; however, when you combine this fact with the above trend you start to get the point. We are collecting a massive amount of persistent data and there is a growing legal expectation that we can access it. This adds a regulatory burden to our need to have really smart ways to access lots of data.
  • The availability of processing power will enable replacing traditional ETL with virtual ETL and near real-time knowledge extraction. Traditional ETL jobs run in batch processes and move massive amounts of data through cleaning, transforming, enriching and loading routines. The result is a secondary data store that can be queried by Business Intelligence technologies such as multidimensional OLAP. The problem with this is that it only runs so fast and requires a lot of transformation logic – i.e. time and expense. As processing power increases and smarter ways to query de-normalized, dirty data emerge, technology to extract that data and move it to useful forms will progress. I use the term Virtual ETL to refer to these types of near real-time data movements. Examples of this today exist in very primitive form as database materialized views and Business Objects Universes.
  • Data collection will move closer and closer to sources and become pervasive – think about what RFID technology has done for data. Now we can collect information about products that are on retailers’ shelves. We can also put sensors in shopping carts and track the interactions of customers and products in a store. What will sensor technology do to revolutionize insurance? Think of how that business will change once we can detect and prevent losses for many risks that are considered ‘uninsurable’ today.
  • Information Ontology technology will marry Predictive Analytics and Cloud Storage to enable smart mining of dirty, hierarchical data collected in near real time on cloud platforms. With this prediction, things start to get really interesting. Imagine a near future in which data collection spiders, like those of today’s search engines, find and transfer actual data or pointers to cloud storage devices in a massively scalable, hierarchical and dirty format. Smart query languages based on ontologies for specific knowledge domains perform lightweight (and virtual) ETL on this data to create knowledge stores. Predictive models can then be run on this knowledge, again leveraging Ontology languages to create business knowledge and predict future behavior – essentially delivering the information we get today from SQL, warehouses and batch ETL, but faster and with massively more data.
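As a rough illustration of the Virtual ETL idea from the list above, the following Python sketch cleans and transforms dirty records lazily, at the moment a question is asked, rather than batch-loading them into a secondary store first. The function names, the record layout and the sample feed are all assumptions made up for this example; it is a toy, not a description of any particular product.

```python
from typing import Iterable, Iterator


def clean(records: Iterable[dict]) -> Iterator[dict]:
    """Normalize records one at a time, silently skipping unusable rows."""
    for r in records:
        product = str(r.get("product", "")).strip().lower()
        try:
            qty = int(r.get("qty"))
        except (TypeError, ValueError):
            continue  # dirty row: quantity missing or not a number
        if product:
            yield {"product": product, "qty": qty}


def units_by_product(records: Iterable[dict]) -> dict:
    """Answer a query directly over the raw feed; nothing is staged or stored."""
    totals = {}
    for r in clean(records):
        totals[r["product"]] = totals.get(r["product"], 0) + r["qty"]
    return totals


# Hypothetical raw feed with inconsistent casing, stray spaces and bad values.
raw_feed = [
    {"product": " Orange Juice ", "qty": "2"},
    {"product": "tissues", "qty": 1},
    {"product": "", "qty": 4},
    {"product": "Tissues", "qty": "n/a"},
    {"product": "orange juice", "qty": 3},
]

print(units_by_product(raw_feed))  # {'orange juice': 5, 'tissues': 1}
```

Because `clean` is a generator, the transformation logic runs only when a query pulls data through it, which is the "virtual" part: the raw feed stays where it is and stays dirty.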

In conclusion, I answer the question, “where is data going?” very simply – it is moving towards being useful in ways that it has never been before. We will be able to leverage techniques proven in traditional Business Intelligence on massively large amounts of information in near real-time. This will allow us to study and leverage cultures of growing, changing and organic data to make better decisions faster.

Organizations that get this ahead of their peers will have a competitive advantage.


If you liked (or hated) this post, please join me on my personal blog where you can comment, see my other posts and follow me on Twitter. Thanks!