by Peter B. Nichol

Recharge your knowledge of the modern data warehouse

Mar 06, 2017
AnalyticsBig DataData and Information Security

Data warehousing is evolving from centralized repositories to logical data warehouses leveraging data virtualization and distributed processing. Make sure you’re not using old terminology to explain new initiatives.

Are you comfortable with source systems feeding ETL processes into operational data stores or master reference data through an enterprise service bus with the product, supply chain and business operational reports dumped into a presentation layer with soft analytics, dashboards, alerts and scorecards? That was yesterday.

Don’t get caught, explaining your new data warehouse initiative with old terminology.

Traditional vs. modern data warehouses

Data warehouses are not designed for transaction processing. Modern data warehouses are structured for analysis. In data architecture Version 1.0, a traditional transactional database was funneled into a database that was provided to sales. In data architecture Version 1.1, a second analytical database was added before data went to sales, with massively parallel processing and a shared-nothing architecture. The challenge was that this resulted in slow writes and fast reads. In data architecture Version 2.0, the transactional database populated a second database which flowed into a third analytical database, which connected to the presentation layer (business intelligence). In data architecture Version 2.1, multiple transactional databases fed the core database which provided information downstream to data stores (sales, marketing, finance) that connected to a business intelligence engine. At this point, traditional database structures end and modern structures begin: data architecture Version 3.0.

The two below examples highlight the difference between a traditional data warehouse and a data a modern data warehouse (using Hadoop for this example).

Traditional data warehouse:

  1. Operational systems: CRM, ERM, financial, billing.
  2. ETL: decision analysis model and data.
  3. Enterprise data warehouse: operational, customers and IT data marts.
  4. BI platform: KPI summary, monthly and quarterly reporting and daily summaries.
  5. Automatic customer value analysis: interactive data queries, static data analysis, and OLAP.
  6. BI collaboration portal: wholesale, OEM, sales, employees and external.

Modern data warehouse:

  1. HDFS: Hadoop Distributed File System.
  2. HCatalog: a metadata and table and storage management layer system.
  3. HBase: key-value database, columnar storage.
  4. MapReduce: a flexible parallel data processing framework for large data sets.
  5. Oozie: A MapReduce job scheduler.
  6. ZooKeeper: distributed hierarchical key-value store enabling synchronization across a cluster.
  7. Hadoop: open-source software framework to support data-intensive distributed applications (storage, processing of big data sets).
  8. Hive: A high-level language built on top of MapReduce for analyzing large data sets.
  9. Pig: Enables the analysis of large data sets using Pig Latin. Pig Latin is a high-level language compiled into MapReduce for parallel data processing.

Most database designs cover four functions: 1) data sources, 2) infrastructure, 3) applications and 4) analytics. This principle of design does apply to both traditional data warehouses and modern architectures. The design thinking, however, is different. In a modern data warehouse, there are four core functions: 1) object storage, 2) table storage, 3) computation and processing, and 4) programming languages.

Re-establish the structure for success

The lack of data governance, inadequately trained staff, weak security and non-existent business cases each factor into why data warehouse or business intelligence initiatives fail to achieve the desired outcomes. Keep your data warehouse program on track.

Start by strengthening your framework for business intelligence. If it’s been more than six months since you looked at your end-to-end operational state, it’s a good idea to revisit the original thinking and revalidate assumptions.

The three-tier structure outlined here can help guide your discussions and the assessment.

Primary functions

  1. Program management: portfolio, process, quality, change and services management.
  2. Business requirements: business management, information services, communities, capabilities, service levels and value.
  3. Development: data warehouse and database, services, data integration, systems, monitoring systems and business analytics.
  4. Business value: culture; data quality, analytics and data utilization; data evolution; and value measurements.

Secondary functions

  1. Intelligence architecture: data, integration, information, technology and organizational.
  2. Data governance: accountabilities, roles, people, processes, resources and outcomes
  3. Operations: data center operations (SaaS, DaaS, PaaS, IaaS), technology operations, application support and services delivery
  4. Intelligence applications: strategic intelligence, customer intelligence, financial intelligence, risk intelligence, operations intelligence and workforce intelligence

Tertiary functions

  1. Data integration: data consolidation, data quality and master data management.
  2. Informational resources: data management, informational access and metadata management.
  3. Informational delivery: query and reporting, monitoring and management, and business analytics.

Shrinking budgets, pressure to deliver and expanding data sources all encourage us as CIOs to accelerate progress. Much of this acceleration comes at the cost of not thinking. The modern data warehouse is being designed differently. This means we as leaders need a block of time to think. This time also allows us to upgrade our understanding of how modern data warehouses are planned, refresh the core elements of the progressive data ecosystem and upgrade our terminology.

Evaluating your current data warehouse initiative

Start by asking the following questions to determine if you’re running a modern data warehouse.

  1. Does our environment quickly handle diverse data sources and a variety of subject areas?
  2. Can we handle excessive volumes of data (social, sensor, transactional, operational, analytical)?
  3. Are we using structures such as data lakes, Hadoop and NoSQL databases, or are we running relational data mart structures?
  4. Do we support a multiplatform architecture to maximize scalability and performance?
  5. Do we utilize Lambda architecture (more about data processing than data storage) for near real-time analysis of high-velocity data?
  6. Have we leveraged new capabilities like data virtualization (cloud services) in additional to data integration?
  7. Has the organization applied data warehouse automated orchestration for improved agility, consistency and speed through the release life cycle?
  8. Is our organization running a bimodal business intelligence environment?
  9. If we asked our primary business sponsors, would they know where the data catalog is located to document business terminology?
  10. Are the BI development tools decoupled from the agile deployment models?
  11. Have we clearly defined how we certify enterprise BI and analytical environments?

Modern data warehouses are comprised of multiple platforms impervious to users. Polyglot persistence encourages the most suitable data storage technology based on your data. This “best-fit engineering” aligns multi-structure data into data lakes and considers NoSQL solutions for XML or JSON formats. Pursuing a polyglot persistence data strategy benefits from virtualization and takes advantage of the diverse infrastructure.

If you’re well into the modern data warehouse journey but have not seen the benefits initially forecasted, don’t fear, there is still hope. Allow me to share a few tips to uncover the underlying challenges preventing successful adoption. First, define all the data storage and compression formats in use today. There are many options, and each one offers benefits depending on the type of applications your organization is running. Second, look at the degree of multi-tenacy supported in your BI environment. Using a single instance of software to serve multiple customers improves cost savings, makes upgrades easy and simplifies customizations. Third, review the schema or schema-less nature of your databases and the data you’re storing. Understanding how data is loaded, processed and analyzed can help to determine how to optimize the schemas of objects stored in systems. Fourth, metadata management, while often overlooked, can be almost more important as the data itself.

Upgrading your team’s understanding of data warehouses will move your organization toward agile deliveries, measured in weeks not months.