Not all data is created equal. One of the first questions enterprises face when deciding on a big data strategy is how and where they will store their data. The rush to make this decision often overshadows the need to understand what should be stored or thrown away and how the data will be processed to generate value.
There is nothing wrong with casting a big net to instrument as much as you can and capture as much as you can. However, the danger of collecting too much data is that it makes it harder for the relevant data to be discovered, used and processed. More data means more storage, more processing and more noise that needs to be dealt with by the user wanting to analyze this data.
Big data that remains big is a problem, and a growing one for most enterprises. A successful, value-oriented big data strategy requires the ability to quickly reduce big data down to smaller, more meaningful data.
This is easier said than done, and it is often overlooked or ignored because it requires a blend of strategic and analytical thinking to be applied upfront. However, certain key strategies can help enterprises get to this point. I call these strategies F.A.S.T, for Filter, Aggregate, Sample and Transform.
F is for Filter
Filtering is a capability that enables a divide-and-conquer approach to data organization and management. Raw or enriched data can be divided into logical groupings: by entity or event (entities are the things, people, places and organizations that participate in events), by where or when the data was generated, or by the problem it will be used to solve. Dividing data this way reduces large data sets to more manageable ones.
Similarly, use cases that require multiple data sets to be merged and enriched can benefit from storing their relevant data separately. Use-case-based storage makes it possible to select the optimal storage and processing technology for the unique needs of each use case (for example, interactive and ad hoc analysis vs. batch reporting).
Filtering also enables segmentation of data into the data sets that a particular group of data consumers cares about. For example, a sales rep looking into sales of a particular product through iPhones in Europe should not need to analyze sales data for the US unless the analysis requires comparisons or stack ranking. Segment-based organization of data enables quick discovery and analysis of the data relevant to the consumer’s need.
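As a rough illustration of segment-based filtering, the sketch below partitions a handful of sales events by product, channel and region so that a consumer reads only the segment they need. The field names, the sample records and the `partition` helper are assumptions made for this example, not part of any particular product.

```python
from collections import defaultdict

# Hypothetical raw sales events; field names and values are illustrative only.
events = [
    {"product": "widget", "channel": "iphone", "region": "EU", "amount": 20.0},
    {"product": "widget", "channel": "iphone", "region": "US", "amount": 35.0},
    {"product": "widget", "channel": "web", "region": "EU", "amount": 15.0},
    {"product": "gadget", "channel": "iphone", "region": "EU", "amount": 50.0},
]


def partition(records, key_fields):
    """Split records into smaller data sets, one per combination of key fields."""
    segments = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in key_fields)
        segments[key].append(record)
    return dict(segments)


segments = partition(events, ("product", "channel", "region"))

# The European sales rep works with one small segment instead of all events.
eu_iphone_widgets = segments[("widget", "iphone", "EU")]
print(eu_iphone_widgets)
```

In a real system each segment would more likely map to its own table, partition or file than to an in-memory list, but the principle of reading only the relevant slice is the same.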
A is for Aggregation
Aggregations compact a lot of data into smaller sets by reducing the fidelity of the data. A strong strategy for aggregation starts with a top-down analysis of what decisions the analytics will enable.
For example, if actions or decisions need to be made every hour or every day, events that arrive every second can be aggregated to per-hour or per-day levels. This ensures that a consumer interacting with this data does not have to process all events every time and has the option to look at pre-aggregated data at the fidelity required.
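The time-based roll-up described above can be sketched in a few lines. This is a minimal example under assumed inputs: the `raw_events` list of (timestamp, value) pairs and the `aggregate_hourly` helper are hypothetical, and a count plus a total stand in for whatever metrics the decision actually needs.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical per-second events as (ISO 8601 timestamp, value) pairs.
raw_events = [
    ("2024-01-01T09:00:01", 3),
    ("2024-01-01T09:00:02", 5),
    ("2024-01-01T09:59:59", 2),
    ("2024-01-01T10:00:00", 7),
]


def aggregate_hourly(events):
    """Roll events up to one row per hour, keeping a count and a total."""
    totals = defaultdict(lambda: {"count": 0, "total": 0})
    for timestamp, value in events:
        # Truncate the timestamp to the start of its hour.
        hour = datetime.fromisoformat(timestamp).replace(
            minute=0, second=0, microsecond=0
        )
        totals[hour]["count"] += 1
        totals[hour]["total"] += value
    return dict(totals)


hourly = aggregate_hourly(raw_events)
for hour, stats in sorted(hourly.items()):
    print(hour.isoformat(), stats)
```

Four raw events collapse into two hourly rows here; at per-second arrival rates the reduction is up to 3,600 to 1 for each hour.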
Another mechanism for aggregation is to aggregate the metrics or KPIs of interest over the dimensions of analysis. For example, if events arriving into the system represent user demand, and the analysis compares demand signals between users in different age groups, these events can be aggregated by age group, such as 0 – 18 years, 18 – 40 years, 40 – 60 years and 60 and above. Aggregations based on event attributes like these can span a combination of several attributes, such as user age and gender, or user age, gender and location, producing ready-to-use data at the point of analysis and decision making.
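A minimal sketch of aggregating a metric over dimensions follows. The age bands match the ones named above, but because those bands share their boundaries, the code assumes each band includes its lower bound and excludes its upper bound (so an 18-year-old falls in 18-40). The event fields, the `units` metric and both helper functions are hypothetical.

```python
from bisect import bisect_right
from collections import Counter

# Assumption: bands are half-open, e.g. 18 belongs to "18-40", 60 to "60+".
AGE_BOUNDS = [18, 40, 60]
AGE_LABELS = ["0-18", "18-40", "40-60", "60+"]


def age_group(age):
    """Map an age in years to its band label."""
    return AGE_LABELS[bisect_right(AGE_BOUNDS, age)]


# Hypothetical demand events; "units" is the metric being aggregated.
demand_events = [
    {"age": 16, "gender": "F", "units": 1},
    {"age": 18, "gender": "M", "units": 2},
    {"age": 35, "gender": "F", "units": 4},
    {"age": 35, "gender": "F", "units": 1},
    {"age": 72, "gender": "M", "units": 3},
]


def aggregate_demand(events, dimensions):
    """Sum units over the requested combination of dimensions."""
    totals = Counter()
    for event in events:
        row = {"age_group": age_group(event["age"]), "gender": event["gender"]}
        key = tuple(row[dimension] for dimension in dimensions)
        totals[key] += event["units"]
    return dict(totals)


by_age = aggregate_demand(demand_events, ("age_group",))
by_age_gender = aggregate_demand(demand_events, ("age_group", "gender"))
print(by_age)
print(by_age_gender)
```

The same function serves both the single-dimension and the combined view, which is one way to precompute several levels of detail from a single pass over the events.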
S is for Sampling
Sampling enables iterative analysis over large data sets: users can quickly identify relevant data sets and then progressively analyze larger samples of the data. Irrelevant data or analyses can be discarded early, and no time is spent analyzing the entire data set during the experimentation phase of analytics design. Using incrementally larger samples while designing a technique or algorithm for data processing can save valuable time and lets the data scientist fail fast on less promising techniques.
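One way to picture incremental sampling is sketched below, assuming the data fits in memory and that a uniform random sample is appropriate for the analysis. The `incremental_samples` helper and the chosen fractions are illustrative assumptions; each larger sample contains the smaller ones, so earlier work is extended instead of repeated.

```python
import random


def incremental_samples(records, fractions, seed=0):
    """Yield progressively larger random samples drawn from one shuffle.

    Fractions are expected in increasing order; because every sample is a
    prefix of the same shuffled list, each one contains the previous one.
    """
    rng = random.Random(seed)  # fixed seed keeps experiments repeatable
    shuffled = list(records)
    rng.shuffle(shuffled)
    for fraction in fractions:
        size = max(1, round(len(shuffled) * fraction))
        yield fraction, shuffled[:size]


# Hypothetical data set: try an estimate on 1%, then 10%, then everything.
data = list(range(10_000))
samples = []
for fraction, sample in incremental_samples(data, [0.01, 0.1, 1.0]):
    estimate = sum(sample) / len(sample)
    samples.append(sample)
    print(f"{fraction:>5.0%} of data, {len(sample)} records, mean={estimate:.1f}")
```

If a technique already looks unpromising on the 1% sample, the loop can stop there and the larger runs are never paid for.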
T is for Transform
Transformation of data is the process through which attributes and records are added to or removed from a data set. New attributes can be generated by applying a mapping function to one or more existing attributes, or gathered by merging the data with another data set. New records can be generated by combining two data sets with the same schema.
Transformed data is usually the data that powers dashboards and the reporting of insights represented by analytics, metrics and KPIs. The faster data can be transformed into its final state, the sooner updated insights can be delivered to their consumers.
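The three kinds of transformation mentioned above can be sketched together: a mapping function that derives a new attribute, a merge that gathers attributes from another data set, and a combination of two data sets that share a schema. All record layouts, the `size_band` attribute and its threshold of 100 are assumptions made for illustration.

```python
# Hypothetical order data sets with the same schema.
orders = [
    {"order_id": 1, "customer_id": "c1", "amount": 120.0},
    {"order_id": 2, "customer_id": "c2", "amount": 40.0},
]
more_orders = [
    {"order_id": 3, "customer_id": "c1", "amount": 15.0},
]

# Hypothetical second data set used for enrichment, keyed by customer_id.
customers = {"c1": {"country": "DE"}, "c2": {"country": "US"}}


def add_size_band(record):
    """Mapping function: derive a new attribute from an existing one."""
    band = "large" if record["amount"] >= 100 else "small"
    return {**record, "size_band": band}


def enrich_with_customer(record, lookup):
    """Merge: gather new attributes from another data set."""
    extra = lookup.get(record["customer_id"], {})
    return {**record, **extra}


# New records: combine two data sets that share a schema.
combined = orders + more_orders

transformed = [
    enrich_with_customer(add_size_band(record), customers) for record in combined
]
for row in transformed:
    print(row)
```

Each step returns new dictionaries instead of modifying its input, so the raw records remain available if a different transformation is needed later.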
Needle in the haystack
Converging big data into small, manageable data sets that lead directly to relevant insights is akin to finding the needle in a haystack. This divide-and-conquer technique makes data governance an easier problem and, when carefully controlled and conducted, can dramatically increase the speed to insights and value.