The Seven Design Principles of an AI-Ready Data Architecture

These foundational components will position your AI initiatives for success


AI is having a big impact on organizations of all sizes, across all industries. But without the proper data architecture in place to support AI and machine learning, you're likely to be disappointed in the results. Here are seven principles to consider for an AI-ready data architecture.

  1. Plan for scale and elasticity.

Artificial intelligence (AI) is all about data, all the time. Does your IT team’s architecture enable computations to be performed on demand? Does the environment allow users the freedom to, say, apply a formula to a large dataset without first asking IT to check server capacity? Scale and elasticity are at the heart of AI. A cloud-enabled data architecture offers elasticity, letting your organization scale up when additional computing horsepower is needed and scale back down when it isn’t.

Data and analytics ecosystems on AWS that are built around Amazon S3 naturally inherit the virtually infinite scalability of S3 storage. However, as the volume of persisted data grows, the compute horsepower required to harness the value of these huge datasets needs to scale in tandem. Horizontally scalable compute platforms such as Apache Spark clusters and massively parallel processing data warehouses such as Amazon Redshift play key roles in building systems where compute scalability keeps up with ever-growing data volume.
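As a sketch of how storage and compute decouple in such an architecture, the following PySpark snippet reads a dataset directly from S3 and aggregates it. The bucket, paths, and cluster setup are hypothetical; in practice the job would run on an elastically sized cluster (for example, Amazon EMR), where the number of executors can be scaled independently of the data in S3.

```python
# Minimal PySpark sketch: S3 is the system of record; compute attaches on demand.
# The bucket and paths below are placeholders for illustration only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("elastic-aggregation").getOrCreate()

# Read a large dataset straight from the data lake.
events = spark.read.parquet("s3://example-data-lake/raw/events/")

# A simple aggregation whose cost scales with cluster size, not storage size.
daily_counts = (
    events
    .groupBy(F.to_date("event_time").alias("event_date"))
    .count()
)

daily_counts.write.mode("overwrite").parquet("s3://example-data-lake/curated/daily_counts/")
```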

  2. Shape an architecture that can ingest all types of data and measure changes over time.

An AI-ready architecture is able to address different shapes and granularities of data, such as transactions, logs, geospatial information, sensor readings, and social media data. In addition, real-time time-series data is key to the constant feed of input that propels data-driven devices, from smart-home appliances and health devices to self-driving cars. Make sure your AI architecture has the capability to consume different data structures in different time dimensions, especially real time.

AWS offers various tools for persisting time-series data in forms conducive to temporal analysis. At re:Invent 2018, AWS announced Amazon Timestream, a fully managed time-series database. It is an excellent choice as the back-end data store for very fast-moving time-series data, such as that generated by health-monitoring devices for critical patients. In addition to Amazon Timestream, several design patterns are available for using Amazon DynamoDB to persist time-series data.
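As an illustration of one common DynamoDB pattern, the sketch below writes readings under a per-device partition key with a timestamp sort key, then queries the most recent readings for that device as a contiguous range. The table name, key schema, and attributes are assumptions for illustration, not a prescribed design.

```python
# Hypothetical table "device_readings" with device_id (partition key) and ts (sort key).
import boto3
from boto3.dynamodb.conditions import Key
from datetime import datetime, timezone
from decimal import Decimal

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("device_readings")

# Write one time-series reading; DynamoDB numbers go in as Decimal.
table.put_item(
    Item={
        "device_id": "monitor-042",
        "ts": datetime.now(timezone.utc).isoformat(),
        "heart_rate": Decimal("72"),
    }
)

# Fetch the ten most recent readings for a single device, newest first.
response = table.query(
    KeyConditionExpression=Key("device_id").eq("monitor-042"),
    ScanIndexForward=False,
    Limit=10,
)
for item in response["Items"]:
    print(item["ts"], item["heart_rate"])
```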

  3. Be metadata driven from the start.

Is your organization identifying and classifying data at the point of ingestion? Most enterprises view metadata extraction as an afterthought, typically driven by compliance. Yet metadata is much easier to manage early in the process rather than later, and it has value to organizations far beyond compliance.

AWS Glue crawlers automate the process of building a technical metadata catalog from a variety of sources, including files, databases, and semi-structured data such as XML and JSON. This feature of AWS Glue creates a meaningful value proposition for the Chief Digital Officer (CDO), particularly when there is a need to build data lineage or business glossary capabilities. In these scenarios, tools such as Apache Atlas and Waterline Data can leverage the technical metadata catalog of AWS Glue to build more sophisticated solutions.
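As a sketch of making metadata capture part of ingestion rather than an afterthought, the snippet below registers and runs a Glue crawler with boto3. The crawler name, IAM role ARN, database name, and S3 path are placeholders.

```python
# Register the raw zone in the Glue Data Catalog at ingestion time.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="raw-zone-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    DatabaseName="raw_zone",
    Targets={"S3Targets": [{"Path": "s3://example-data-lake/raw/"}]},
    TablePrefix="raw_",
)

# Run the crawler as a step in the ingestion pipeline, not as a later compliance exercise.
glue.start_crawler(Name="raw-zone-crawler")
```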

  4. Provide open access across all layers.

Platforms have three layers of data: raw, curated, and consumption. Older architectures frequently grant access only to the consumption layer. That’s a problem for decision scientists, who often like to examine raw data for overlooked elements that may generate more information. Be sure all of your architecture’s layers are exposed and open for access.

With a data lake built on Amazon S3, access to data at all layers can be democratized in principle, since S3 offers simple HTTP-style access, eliminating the need for thick clients or custom code. At the same time, discretionary access can be implemented using bucket policies and Identity and Access Management (IAM) policies. Amazon Athena, powered by the Apache Presto engine, enables data scientists and analysts to analyze huge datasets residing in S3 buckets.
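The sketch below shows what querying a raw-layer table in place with Athena might look like from boto3; the database, table, and results bucket are placeholders, and access to each layer is still governed by the IAM and bucket policies mentioned above.

```python
# Query data in S3 directly with Athena and page through the first results.
import time
import boto3

athena = boto3.client("athena")

query = athena.start_query_execution(
    QueryString="SELECT event_date, count(*) AS n FROM raw_events GROUP BY event_date",
    QueryExecutionContext={"Database": "raw_zone"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:
        print([col.get("VarCharValue") for col in row["Data"]])
```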

  5. Enable autonomous data integration.

Mapping to target usage environments remains a largely manual process. As your team rethinks its AI data architecture, consider applying machine learning (ML) in data integration so that the integration layer can automatically detect changes in incoming data and adjust the integration patterns with no manual intervention.

In the context of autonomous data integration, AWS Glue has the ability to detect changes in the metadata of data sources it has been instructed to track. Recently, AWS launched its Lake Formation offering, which features an ML-driven record-matching and deduplication capability called FindMatches. This is a step toward more sophisticated models of intelligent and autonomous data integration. There are also third-party tools, such as Waterline Data, that can automatically identify functional and business attributes of source data using built-in ML models.
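One concrete lever for this in Glue is a crawler's SchemaChangePolicy, which controls how detected schema drift is applied to the catalog. The sketch below reuses the hypothetical crawler from the earlier example and configures it to pick up new or changed columns automatically while logging deletions instead of removing metadata.

```python
# Let the crawler apply schema changes to the catalog without manual intervention.
import boto3

glue = boto3.client("glue")

glue.update_crawler(
    Name="raw-zone-crawler",  # hypothetical crawler from the earlier sketch
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",  # pick up new and changed columns
        "DeleteBehavior": "LOG",                 # record removed columns rather than dropping metadata
    },
)
```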

  6. Get feature engineering right.

ML is often underutilized in enterprises because the data isn’t ready and usable. Feature engineering offers the answer. It transforms data into consumable forms and shapes that ML models can use.

Data preparation for training and evaluating ML models usually contributes a significant proportion of the overall effort behind building and deploying ML models at scale. Amazon SageMaker also has prerequisites regarding the format and location of the data it uses for training. With AWS, there are numerous tools available to a data engineer for data preparation, ranging from basic utilities such as the AWS CLI to sophisticated, high-performance, in-memory data processing platforms like AWS Glue (Apache Spark). Additionally, SageMaker comes with managed Jupyter notebooks, which enable data engineers to develop and test data preparation tasks before deploying them at scale.
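As a sketch of a typical preparation step, the snippet below engineers a couple of features with pandas and writes the training set in the CSV layout many SageMaker built-in algorithms expect (no header row, target variable in the first column) before uploading it to S3. The column names, source file, bucket, and prefix are placeholders for illustration.

```python
# Prepare a training file for a SageMaker built-in algorithm: header-less CSV, label first.
import boto3
import pandas as pd

df = pd.read_csv("raw_patients.csv")  # hypothetical source extract

# Simple feature engineering: derive a feature and one-hot encode a category.
df["bmi"] = df["weight_kg"] / (df["height_m"] ** 2)
df = pd.get_dummies(df, columns=["smoker"], drop_first=True)

# Reorder so the label column comes first, then drop the header.
label = "readmitted"
train = df[[label] + [c for c in df.columns if c != label]]
train.to_csv("train.csv", index=False, header=False)

# Stage the prepared data where SageMaker training jobs can read it.
boto3.client("s3").upload_file("train.csv", "example-ml-bucket", "train/train.csv")
```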

  7. Support a unified security model for data.

If your enterprise is like most, it relies on a complex, hybrid environment that blends cloud-based and on-premises services. Data resides in scattered locations, consumed by individuals and reports as well as other applications and devices. AI’s issues of trust and ethics also influence security. A unified security approach lets your organization consider security from the point that data is produced through every point of consumption and cycle of enrichment.
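One small, hedged example of a uniform baseline in an S3-centric lake is a bucket policy that denies any request not made over TLS, applied the same way to every layer; the bucket name below is a placeholder, and fine-grained grants would still come from IAM policies layered on top.

```python
# Apply a baseline bucket policy: deny any access that is not encrypted in transit.
import json
import boto3

bucket = "example-data-lake"
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}

boto3.client("s3").put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```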

Learn more about enabling your digital business on AWS Cloud.


Copyright © 2019 IDG Communications, Inc.