by John McDowall

Effective data modeling

Aug 28, 2019
Data Management, Data Mining, Data Visualization

Understanding data is a prerequisite to gaining control of any enterprise. But understanding is only useful if that knowledge can be shared and transmitted. Effective data modeling should be a primary focus of any enterprise architect.

In my last missive, I opined that understanding an enterprise’s data is central to guiding an enterprise. But understanding is only half the problem. The other half is being able to document that understanding and share it with others.

It’s impossible to share data across systems or organizations without a common understanding of the data. Traditionally, this has been done using data dictionaries—documents that purport to explain the contents and format of every field in a data structure. The sad reality is that those documents have to be manually created and updated, and so are rarely updated. The result is outdated, useless documents and frustrated architects and developers. But there is a better way.

Modeling done right

Over the past couple of decades, data modeling efforts have generally focused on either relational data modeling or eXtensible Markup Language (XML) modeling. Relational data modeling is fine as long as data is being stored in a relational database, but otherwise it has no particular use. And XML cannot credibly be called a modeling language. XML is a specification for serializing data—that is, writing it to a file. XML provides a format for structuring the data’s serialization, but it is not a real model.

By “model,” I mean a formal specification grounded in mathematics. In practical terms, this means something that can be subjected to verification using formal methods. In layman’s terms, that means we can use mathematical operations to prove it is correct, and that we can automate that verification process. Capturing data in an XML schema does not qualify as a model under this definition. To be sure, we can use software to verify that the XML is well formed and conforms to some XML schema document. But that is not enough to truly model the data.

No one, whether a computer or a human, can understand data without understanding both its syntax (structure) and semantics (meaning). XML can capture syntax, but it cannot innately capture semantics. Semantics can be written in XML format, but those semantics must first be captured in some more formal modeling scheme. In other words, the enterprise needs a formal ontology. The majority of such modeling schemes are based in formal logic, usually either Common Logic or Description Logics.
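
To make the gap between syntax and semantics concrete, here is a minimal Python sketch. The record, its field names and the check are all hypothetical, and the check stands in for schema validation rather than reproducing any particular XML schema tool. The record passes every structural test even though it describes a building and labels it an event.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: well formed and structurally complete, but the
# <type> value contradicts what any human knows about a building.
RECORD = """
<entity>
  <name>Building 12</name>
  <type>Event</type>
  <latitude>38.87</latitude>
  <longitude>-77.05</longitude>
</entity>
"""

REQUIRED_FIELDS = ("name", "type", "latitude", "longitude")


def is_structurally_valid(xml_text):
    """Syntax only: the text parses and every required field is present."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return all(root.find(field) is not None for field in REQUIRED_FIELDS)


# The structural check has no way to notice that the type makes no sense.
structure_ok = is_structurally_valid(RECORD)
```

Nothing in the structure says what an "Event" is or why a building cannot be one. That knowledge has to live in a model of meaning, which is the role of the ontology.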

By far, the most commonly used semantic modeling language is the Web Ontology Language (OWL), which is based in Description Logics. This means that we can not only formally verify the model and the data it contains, but we can also infer new facts by reasoning about the data, and we can prove the correctness of those inferences. Because OWL is the de facto standard for ontology modeling, I will confine my remaining remarks to OWL.
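
As a rough illustration of what inferring new facts means, the following Python sketch follows subclass links to derive class memberships that were never stated directly. It is a toy, not an OWL reasoner: the class names and the individual are hypothetical, and a real Description Logics reasoner handles far more than subclass chains.

```python
# Toy ontology as (subclass, superclass) pairs. All names are hypothetical.
SUBCLASS_OF = {
    ("Destroyer", "Warship"),
    ("Warship", "Vessel"),
    ("Vessel", "PhysicalObject"),
}

# The only fact anyone actually wrote down: (individual, class).
ASSERTED_TYPES = {("Hull-1001", "Destroyer")}


def superclasses(cls, subclass_of):
    """Return every class reachable from cls by following subclass links."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for sub, sup in subclass_of:
            if sub == current and sup not in found:
                found.add(sup)
                frontier.append(sup)
    return found


def infer_types(asserted, subclass_of):
    """Add every membership implied by the subclass hierarchy."""
    inferred = set(asserted)
    for individual, cls in asserted:
        for sup in superclasses(cls, subclass_of):
            inferred.add((individual, sup))
    return inferred


ALL_TYPES = infer_types(ASSERTED_TYPES, SUBCLASS_OF)
NEW_FACTS = ALL_TYPES - ASSERTED_TYPES  # facts derived, not entered by hand
```

Each derived fact can be traced back through the chain of subclass statements that produced it, which is the sense in which an inference can be proven correct.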

But wait! None of this means you need to store your data as OWL. Hear me out before you get too concerned about forcing storage formats onto unwilling developers.

Of data models and data storage

Military planners have a maxim: “Amateurs worry about tactics; professionals worry about logistics.” The core idea that they’re trying to get at is that it does no good if you create a battle plan that will overwhelm your enemy’s defenses, but you cannot get your own troops the fuel and ammunition they need to carry out the plan. In a similar way, we can say that implementers worry about storage, but architects should worry about models. There is no reason that the data model should be dictated by the storage technology used by a particular system. And a well-defined model can be transformed into any storage format using a lossless process.

Too often, we begin with the storage solution and work backwards to a data format. Or multiple formats. When XML was first introduced some 20 years ago, it was hailed as the universal data exchange format. Given this assumption, the various systems that needed to exchange data took their current storage schemas (usually relational databases) and converted the data to XML for exchange with other systems. The result has been enterprise and system architects focusing on XML formats almost to the exclusion of the intended function of the system or the overall interoperability of the enterprise.

This problem is particularly acute in the Department of Defense. The Department supports a veritable cottage industry of XML specification creation and maintenance. Each one of the XML schemas is maintained separately, and every time one is updated every related specification must be reviewed for potential impacts (usually manually). On top of that, provision must be made in the XML schemas for systems that cannot be updated to conform to the new schema. The result is a confusing mishmash of specifications that force a focus on making the XML work together instead of focusing on the mission the XML is supposed to facilitate.

Instead of starting with a storage format and then determining how to represent it for information exchange, start with a storage-agnostic data model expressed in a language such as OWL, then use that model as the basis for generating database schemas and data exchange formats. This not only lets you focus on understanding the data as it exists (as opposed to how some developer wants to cram it into a database), but also minimizes the maintenance tail, because every data representation is created from the same base model. Any change to the enterprise’s data needs to be made manually only in the master model, and regenerating the other storage and exchange schemas from that model keeps those schemas consistent.
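
The idea of one master model feeding several generated schemas can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the master model is a plain dictionary rather than an OWL ontology, the class and property names are hypothetical, and the exchange format is JSON Schema rather than XML simply because it is shorter to show.

```python
import sqlite3

# Hypothetical master model: one class with typed properties.
MODEL = {
    "class": "Shipment",
    "properties": {
        "shipment_id": {"type": "integer", "required": True},
        "destination": {"type": "string", "required": True},
        "weight_kg": {"type": "number", "required": False},
    },
}

SQL_TYPES = {"integer": "INTEGER", "string": "TEXT", "number": "REAL"}


def to_sql_ddl(model):
    """Generate a storage schema (a CREATE TABLE statement) from the model."""
    columns = []
    for name, spec in model["properties"].items():
        null_rule = " NOT NULL" if spec["required"] else ""
        columns.append(f"{name} {SQL_TYPES[spec['type']]}{null_rule}")
    return f"CREATE TABLE {model['class']} ({', '.join(columns)})"


def to_json_schema(model):
    """Generate an exchange schema from the same model."""
    properties = model["properties"]
    return {
        "type": "object",
        "properties": {n: {"type": s["type"]} for n, s in properties.items()},
        "required": [n for n, s in properties.items() if s["required"]],
    }


DDL = to_sql_ddl(MODEL)
EXCHANGE_SCHEMA = to_json_schema(MODEL)

# Confirm that a real database engine accepts the generated DDL.
connection = sqlite3.connect(":memory:")
connection.execute(DDL)
```

Adding a property to MODEL and rerunning both generators changes the table definition and the exchange schema together, so the two cannot drift apart.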

Enterprise data modeling

If your concern is the enterprise, then obviously your data concerns span the entire enterprise, and right now you are probably thinking that the prospect of modeling all of the data in the enterprise is rather daunting. But fear not: this is a task you can safely delegate to a number of people if you are reasonably careful.

Creating a single enterprise data model is usually a fruitless endeavor. There is just too much data for one group to model, and too many competing interests trying to drive the model in their preferred direction and insisting that no other approach will work for them. But ontologies developed using OWL are modular, meaning you can integrate multiple models from different sources. Instead of creating a single model that covers the entire enterprise, each interest group (business area, development team, etc.) can define its own ontology for the data it is concerned about.

Unfortunately, this will almost certainly result in data models that overlap but model the same objects differently. The solution to this problem is to adopt a common upper ontology from which every ontology in the enterprise should derive. A common upper ontology will not prevent all interoperability problems, but a good one will bound them by preventing utterly ridiculous constructs such as making a “location” a type of “event” (no, seriously, I’ve seen that).
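
Here is a minimal Python sketch of how an upper ontology can bound the damage when separately authored ontologies are merged. The assumptions are loud ones: the two top-level categories and both team ontologies are hypothetical, the categories are not taken from any published upper ontology, and the check is a hand-rolled stand-in for the disjointness reasoning an OWL reasoner would perform.

```python
# Hypothetical upper ontology: top-level categories that must never overlap.
UPPER_DISJOINT = [("Location", "Event")]

# Two independently authored team ontologies as (subclass, superclass) pairs.
LOGISTICS = {("Warehouse", "Location"), ("Delivery", "Event")}
OPERATIONS = {("Checkpoint", "Location"), ("Location", "Event")}  # the kludge


def ancestors(cls, subclass_of):
    """Return every class reachable from cls by following subclass links."""
    found, frontier = set(), [cls]
    while frontier:
        current = frontier.pop()
        for sub, sup in subclass_of:
            if sub == current and sup not in found:
                found.add(sup)
                frontier.append(sup)
    return found


def find_violations(subclass_of, disjoint_pairs):
    """List classes that end up under two categories declared disjoint."""
    classes = {c for pair in subclass_of for c in pair}
    violations = []
    for cls in sorted(classes):
        lineage = ancestors(cls, subclass_of) | {cls}
        for first, second in disjoint_pairs:
            if first in lineage and second in lineage:
                violations.append(cls)
    return violations


CLEAN = find_violations(LOGISTICS, UPPER_DISJOINT)
MERGED = find_violations(LOGISTICS | OPERATIONS, UPPER_DISJOINT)
```

The logistics ontology passes on its own. Once the operations ontology is merged in, the check flags the location class and everything beneath it, so the error surfaces at integration time instead of in production data.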

There are a number of candidate upper ontologies available, and most of them try to divide all information into five or six top-level categories. But most of these ontologies run into the problem that someone has a class of data that does not fit into any of their foundational classes, and a kludge like making a location a type of event is the result. In my experience, Basic Formal Ontology (BFO) is the best thought out of them. In several years of working with BFO, I have not found a single case where the data under consideration does not fit within BFO’s class hierarchy.

Ultimately, the enterprise architect must choose the data modeling philosophy that works best in his or her specific circumstances. Regardless of what data modeling philosophy you choose, remember that you have an obligation to capture both the syntax and the semantics of all the data in your enterprise.