by John McDowall

It’s all about the data

Jul 30, 2019
Data Management, IT Strategy, Master Data Management

Gaining control of an enterprise’s data is a key element in taming enterprise complexity. But control without understanding is a recipe for trouble. Understanding data is an obligation of every enterprise architect.

Data is the lifeblood of any enterprise. It is now conventional wisdom that data is among any enterprise’s most valuable possessions, and chief data officers are increasingly recognized as first-tier members of corporate leadership teams. The emergence of the master data management discipline and the proliferation of enterprise data management tools likewise speak to the growing recognition that data is of vital importance to every enterprise.

Data is the foundation of knowledge, and as the old saying goes, knowledge is power. But data is often stored in many different systems, each using different formats, and the result is a complex web of interrelated and conflicting data models that must be reconciled before the enterprise can hope to wield all that potential power. This is a principal goal of an enterprise architecture. To successfully realize that goal, the architect must understand data.

The myth of metadata

Before we can truly understand data, there is one thing that must be distinctly understood: there is no such entity as “metadata.” This assertion is at odds with the assumptions which underlie any number of accepted tools and standards. Many master data management tools explicitly state that they gather “metadata” to help the enterprise organize its data, and the Dublin Core Metadata Initiative is the foundation of numerous data indexing and discovery methods. And yet this notion is mistaken.

Metadata is generally defined as “data about data.” While accurate, that is not a very helpful definition. The more astute practitioner will not use the word “metadata” without an adjective: structural metadata, semantic metadata, discovery metadata, etc. This at least narrows the definition of what is being described. But it does not resolve the basic contradiction at the heart of the definition of the term “metadata” (i.e., the assertion that some data is not really data). We can also see this in the assertion that one person’s data is another person’s metadata, an obvious truism. This is not the place for a discussion of the circumstances that gave rise to the notion of metadata; all that matters is that we reject the notion of metadata as a distinct entity.

In reality, metadata is a role that one piece of data takes in relation to another piece of data. And any (or all) of those pieces of data may be things the enterprise needs to analyze. A simple example will serve to illustrate the point. Consider a document, which has both an author and a publication date. In a document management system, the document is the main object and the author and the publication date are additional information about the document. But in a personnel management system, the author is the main object, while the document and its publication date are additional information about the author. When we try to integrate these two systems, this different representation presents a problem: is the author a piece of data or a piece of metadata? The truth is, the author is both, depending on the needs of the current analysis. If we want to know information about who produced a document, then the author is metadata about the document. If we want to know what documents a person has authored, then the document is metadata about the author.
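The document-and-author example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Fact` structure, the identifiers such as `doc-17` and `person-4`, and the `describe` helper are all hypothetical and do not come from any particular product. The point is that the same stored facts answer both questions, and which element plays the "metadata" role depends only on the entity being asked about.

```python
from collections import namedtuple

# Hypothetical store in which every element is a first-class value.
Fact = namedtuple("Fact", ["subject", "relation", "value"])

FACTS = [
    Fact("doc-17", "authored_by", "person-4"),
    Fact("doc-17", "published_on", "2019-07-30"),
    Fact("doc-22", "authored_by", "person-4"),
    Fact("person-4", "name", "Example Author"),
]


def describe(entity, facts=FACTS):
    """Return every fact that touches `entity`, seen from that entity's side."""
    about = []
    for fact in facts:
        if fact.subject == entity:
            about.append((fact.relation, fact.value))
        elif fact.value == entity:
            # The entity appears as a value, so read the relation backwards.
            about.append(("inverse:" + fact.relation, fact.subject))
    return about


# Document management view: the author is "metadata" about the document.
document_view = describe("doc-17")

# Personnel view: the documents are "metadata" about the author.
author_view = describe("person-4")
```

Nothing in the stored facts marks any element as data or as metadata; that distinction appears only when a question is asked.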

The important thing to remember about this discussion is that all data elements should be first-class entities within any data management system. The particular analysis being conducted at any given time will determine which data elements are metadata within that specific context. Any data management approach that promises results by organizing only the enterprise’s metadata cannot deliver a comprehensive understanding of the enterprise’s data holdings.

No silver bullets

In the beginning, there was the file storage system. Then came the relational database, followed by eXtensible Markup Language (XML), the key-value pair database and technologies like MapReduce and BigTable. The latest fad in data storage is the graph database. Each of these data storage techniques has been presented as the silver bullet that will slay all the monsters that plague enterprise data and prevent real understanding. The truth is, there are no silver bullets. Even if there were, as any fan of old monster movies will tell you, silver bullets only work against werewolves; mummies, vampires, and other monsters require different techniques. The same is true of data storage and management: different problems will require different solutions.

Ultimately, data comes in only two basic formats: as files or tuples. Files are those data entities that are stored as a single item, usually directly into non-volatile memory (e.g., disk drives). Think of documents, images, videos, and the like. (This type is often called “unstructured data.”) Each of these data entities can be said to make sense by itself. In contrast, the term “tuple” is the formal name used by data scientists to describe a set of ordered values that, taken together, have some meaning. Think of a row in a table or even an XML document. (This type is often called “structured data” or “semi-structured data.”) Tuples are just a series of values that are difficult to make sense of if they are not in some more formal structure (e.g., a table with labels on each column). All of an enterprise’s data management problems boil down to managing files and tuples.

In practice, we tend to manage files by managing tuples. Files are put onto a disk or into some type of file management system, and some set of tuples is cataloged as an index. This has usually been called metadata, but as I explained above, that’s not really appropriate. In reality, it is just a set of tuples that includes a link or pointer to the file they relate to or describe. If the descriptive tuple includes a link to the file it describes, then the important thing to manage is the tuple. So in the end, the problem of data management is always a tuple problem.
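A minimal sketch of that idea, assuming a hypothetical catalog whose field names and file paths are invented for illustration: each file is represented by a tuple that carries a pointer to it, and finding a file means querying the tuples, never opening the files themselves.

```python
from typing import NamedTuple


class CatalogEntry(NamedTuple):
    """A descriptive tuple; `location` is the pointer to the file."""
    title: str
    author: str
    published: str
    location: str  # hypothetical path, for illustration only


CATALOG = [
    CatalogEntry("Quarterly report", "person-4", "2019-07-30",
                 "/archive/reports/q2.pdf"),
    CatalogEntry("Site photo", "person-9", "2019-06-11",
                 "/archive/images/site.png"),
]


def locate(catalog, **criteria):
    """Return the locations of files whose tuple matches every criterion."""
    return [
        entry.location
        for entry in catalog
        if all(getattr(entry, field) == wanted
               for field, wanted in criteria.items())
    ]


matches = locate(CATALOG, author="person-4")
```

The search touches only the catalog, which is why managing the files well reduces to managing the tuples well.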

Managing tuples

I do not intend to describe the right way to store, index, retrieve and maintain all the tuples in an enterprise. There is no single right way to do this, as the nature of the tuples and the uses for them will vary depending on the intended use of the data and the needs of the enterprise. Google stores most of its data in key-value systems that facilitate thorough indexing and rapid searching. Banks store user account data in relational databases because of the transaction completeness guarantees provided by those systems. Different uses of data lend themselves to different types of storage, so using the right tool for the job is paramount.

Regardless of the storage method, no one can understand and make proper use of data without knowing two things: the syntax (format) of each value and the semantics (meaning) of each value. If either of these is missing, using the data becomes difficult if not impossible. The problem is that most data storage systems do not encode the semantics of the data; they encode only the syntax. Understanding the semantics is left to the user, and the semantics are often encoded in the application logic of the system.
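The gap between syntax and semantics can be shown with a small, hypothetical tuple. The values, the two candidate readings, and the `annotate` helper below are invented for illustration. The syntax is enough to parse the tuple, yet both readings fit it equally well, so the stored data alone cannot say which one is right.

```python
from datetime import date

# A raw tuple as it might come out of storage (hypothetical values).
row = ("A-1042", 72.5, "2019-07-30")

# Syntax: how to parse each position. This is all most stores record.
syntax = (str, float, date.fromisoformat)
parsed = tuple(convert(raw) for convert, raw in zip(syntax, row))

# Semantics: what each position means. Two readings with identical syntax.
reading_a = ("patient id", "body weight in kilograms", "date of measurement")
reading_b = ("sensor id", "temperature in degrees Fahrenheit",
             "date of calibration")


def annotate(values, meanings):
    """Attach a meaning to each value; the meanings must be supplied."""
    return dict(zip(meanings, values))


as_patient = annotate(parsed, reading_a)
as_sensor = annotate(parsed, reading_b)
```

Both annotations succeed without error, which is the problem: choosing between them requires knowledge that lives outside the stored tuple.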

Encoding semantics in the application logic of a system or an analysis tool makes the system brittle and makes it difficult to use the data in other systems without an extensive effort to rediscover the semantics of the data. System documentation may be of some use in this, but the unfortunate fact is that systems evolve over time and documentation is rarely updated.

A better approach is to formally model the data using a language like the Web Ontology Language (OWL), which is specifically intended to encode the semantics of data in a machine-readable format that makes widespread reuse easier. Once the data has been formally modeled, converting it from one storage format to another (e.g., relational to key-value pair) is a task that is readily automated. More importantly, such a formal data model captures the syntax and semantics of the data within the model itself, making it unnecessary to maintain a stand-alone data dictionary.
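As a rough illustration of why a formal model makes conversion easy to automate, the sketch below uses a plain Python structure as a stand-in for a real ontology. It is not OWL and uses no OWL tooling. The `Attribute` class, the `DOCUMENT_MODEL` contents, and both helper functions are assumptions made for this example only. The model records the syntax and the meaning of each attribute once, and both the relational-to-key-value conversion and the data dictionary are derived from it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str       # column name in the relational form
    datatype: type  # syntax of the value
    meaning: str    # semantics of the value


# Hypothetical model of a document record.
DOCUMENT_MODEL = (
    Attribute("doc_id", str, "unique identifier of the document"),
    Attribute("author_id", str, "identifier of the person who wrote it"),
    Attribute("published", str, "publication date in ISO 8601 form"),
)


def row_to_key_value(model, row, key_attribute="doc_id"):
    """Convert one relational row to key-value pairs, driven by the model."""
    if len(row) != len(model):
        raise ValueError("row does not match the model")
    record = {}
    for attribute, value in zip(model, row):
        if not isinstance(value, attribute.datatype):
            raise TypeError(f"{attribute.name} should be "
                            f"{attribute.datatype.__name__}")
        record[attribute.name] = value
    key = record[key_attribute]
    return {f"{key}:{name}": value
            for name, value in record.items() if name != key_attribute}


def data_dictionary(model):
    """Derive the data dictionary from the model instead of maintaining one."""
    return {a.name: (a.datatype.__name__, a.meaning) for a in model}


pairs = row_to_key_value(DOCUMENT_MODEL, ("doc-17", "person-4", "2019-07-30"))
dictionary = data_dictionary(DOCUMENT_MODEL)
```

A real ontology language expresses far more than this, such as relationships between classes and constraints on them, but the design choice is the same: keep meaning in the model, and generate everything else from it.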

A good formal data model makes it easy to understand the full scope of the enterprise’s data. Only when data is truly understood can leaders make informed decisions about how to manage all of the enterprise’s data as an enterprise asset instead of making piecemeal decisions based on the partial information that is typically available from each system.