Big data and the risks of using NoSQL databases

Using big data to extract value from your data is one thing. Using NoSQL to do it, however, can increase your technical debt and expose your enterprise to data integrity problems and a lack of resilience.


Many NoSQL databases represent their data model with implementation-specific structures expressed in JSON. JavaScript was created at Netscape to handle tasks in the browser and was later standardized by Ecma International as ECMAScript. JavaScript Object Notation (JSON), a lightweight format for interchanging data over the Internet, was derived from JavaScript's object literal syntax and is itself standardized as ECMA-404. The downside of JSON is that it has no mechanism for enforcing referential integrity. The resulting data models are neither interoperable nor standardized, which means no data portability. JSON also provides no way to resolve ambiguity about the namespace in which your data is defined, or about its structure and data types.
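To make the referential integrity point concrete, here is a minimal Python sketch. The customer and order records are invented for illustration and do not come from any particular NoSQL product. Both documents parse as valid JSON even though one order points at a customer that does not exist; catching that is left entirely to application code such as the hypothetical `find_dangling_orders` helper.

```python
import json

# Hypothetical records, invented for illustration only.
customers_json = '[{"id": 1, "name": "Acme Corp"}]'
orders_json = '[{"id": 100, "customer_id": 1}, {"id": 101, "customer_id": 2}]'

# Both documents parse without complaint: JSON has no notion of a foreign key.
customers = json.loads(customers_json)
orders = json.loads(orders_json)


def find_dangling_orders(orders, customers):
    """Return the orders whose customer_id matches no customer record."""
    known_ids = {customer["id"] for customer in customers}
    return [order for order in orders if order["customer_id"] not in known_ids]


dangling = find_dangling_orders(orders, customers)
print(dangling)  # the order that references the missing customer 2
```

A relational database would reject order 101 at insert time through a foreign key constraint. With plain JSON, every consumer of the data has to remember to run a check like this one.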

The JSON community has an IETF working draft of a schema specification that would provide a format for defining the structure of JSON data. This specification would make it possible to define validation, documentation, hyperlink navigation, and interaction control for JSON data. However, it falls short of the primitive and built-in data types that the W3C provides in its XML data standards. Some vendors publish their own versions of a JSON schema, but these schemas are non-standard and cannot support portability between vendors.

Using JSON in environments that require interoperability of data across the enterprise makes governance more challenging. Gartner has published a white paper on the subject: “Does your NoSQL DBMS Result in Information Governance Debt?” (subscription required). Technology leaders looking to increase data integrity and the overall resilience of their enterprise can only conclude that storing data as plain JSON objects will inhibit them from achieving these goals.

Introduction to JSON-LD

To resolve the shortcomings of JSON, the W3C has approved JSON-LD (JSON for Linked Data) as a standard. JSON-LD allows interoperability of data over the internet and defines the concept of a context, which specifies the vocabulary of types and properties a document uses. Furthermore, JSON-LD is backed by another W3C standard, the Resource Description Framework (RDF), which provides a canonical data model for the data in a JSON-LD document.
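As a rough illustration of what a context does, the sketch below parses a small JSON-LD document with Python's standard `json` module and replaces each short key with the IRI that the `@context` assigns to it. The person record and the `example.org` identifier are invented for illustration, and `expand_terms` is a deliberately simplified stand-in: it handles only a flat context of term-to-IRI strings and is not the full expansion algorithm defined by the JSON-LD specification.

```python
import json

# A small JSON-LD document. The @context maps short keys to schema.org IRIs.
# The person and the example.org identifier are invented for illustration.
doc = json.loads("""
{
  "@context": {
    "name": "http://schema.org/name",
    "jobTitle": "http://schema.org/jobTitle"
  },
  "@id": "http://example.org/people/jane",
  "name": "Jane Doe",
  "jobTitle": "Data Architect"
}
""")


def expand_terms(document):
    """Replace each short key with the IRI its @context assigns to it.

    Simplified on purpose: only a flat context of term -> IRI strings is
    handled. Keys without a mapping (such as @id) are passed through as-is.
    """
    context = document.get("@context", {})
    expanded = {}
    for key, value in document.items():
        if key == "@context":
            continue
        expanded[context.get(key, key)] = value
    return expanded


expanded = expand_terms(doc)
for key, value in expanded.items():
    print(key, "->", value)
```

After expansion, the key `name` is no longer ambiguous: it is the globally identified property `http://schema.org/name`, and any other system that knows that vocabulary can interpret it the same way.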

RDF is a general-purpose framework for representing information in a graph format that describes data in context. It does this by representing each statement as a subject, predicate and object, commonly known as a triple. These triples are the foundation for bridging unstructured and structured data.
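A triple is simple enough to model in a few lines of Python. The sketch below is an assumption-laden toy, not an RDF library: the `ex:` names are hypothetical, identifiers are plain strings rather than real IRIs, and the `match` helper only imitates the kind of pattern matching a graph database would offer.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# A tiny graph of hypothetical facts; the ex: names are invented.
graph = {
    Triple("ex:Jane", "ex:worksFor", "ex:Acme"),
    Triple("ex:Jane", "ex:hasTitle", "Data Architect"),
    Triple("ex:Acme", "ex:locatedIn", "ex:Chicago"),
}


def match(graph, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    )


jane_facts = match(graph, subject="ex:Jane")
for fact in jane_facts:
    print(fact.subject, fact.predicate, fact.obj)
```

Because every fact has the same three-part shape, a statement extracted from a sentence in a document and a row exported from a database table can sit in the same graph and be queried together.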

RDF has been adopted by schema.org. This open community internet ontology is sponsored by Google, Microsoft, Yahoo and Yandex and is used by 2.5 billion web pages.

JSON-LD has been adopted by the likes of Google’s Gmail and Search, Microsoft Bing, Yandex, BBC, and the U.S. government to name a few. A complete list can be found at json-ld.org/wiki.

Developing a cognitive road map starting with JSON-LD

Big data that delivers value from both structured and unstructured data is a game changer. Using JSON-LD in a big data solution requires little extra effort over using plain JSON, since every JSON-LD document is also a valid JSON document. The syntax serializes directly into graphs, so most real-world data models can be expressed. A JSON-LD document can be read either as an RDF document or as a JSON document.

Using JSON-LD to express an RDF model is a powerful way to set the foundation for more advanced concepts, thanks to the framework's ability to define context. In RDF, you can define domain-range constraints, i.e. rules that restrict which subjects and objects can be used with a given predicate. For example, in the triple Employee hasSSN 123-45-5678, the predicate hasSSN can be defined so that it cannot be assigned to an organization, as organizations don’t have Social Security Numbers. RDF is also supported by the W3C Shapes Constraint Language (SHACL) standard, which lets you define the structure of an object; it is the graph-data equivalent of an XML XSD definition. There are tools in the marketplace, like Lymba’s Jaguar, that can read unstructured documents, extract the subject, predicate and object from a sentence, and store them as triples in a graph database.
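The hasSSN example can be sketched as a simple check in Python. Everything here is hypothetical: the `ex:` names, the `types` and `domains` lookup tables, and the `violates_domain` helper are invented for illustration and are not part of any RDF toolkit. Strictly speaking, RDF Schema treats a declared domain as an inference rule rather than a rejection rule, so refusing bad data in this way is closer to what a SHACL validator does.

```python
# Hypothetical vocabulary and data, invented for illustration.
# types: the declared class of each subject.
types = {
    "ex:alice": "ex:Employee",
    "ex:acme": "ex:Organization",
}

# domains: the class a subject must belong to before a predicate may be used.
domains = {
    "ex:hasSSN": "ex:Employee",
}


def violates_domain(subject, predicate):
    """Return True if the predicate's declared domain excludes the subject."""
    required = domains.get(predicate)
    if required is None:
        return False  # no domain declared, so nothing to check
    return types.get(subject) != required


candidates = [
    ("ex:alice", "ex:hasSSN", "123-45-5678"),
    ("ex:acme", "ex:hasSSN", "123-45-5678"),
]

rejected = [t for t in candidates if violates_domain(t[0], t[1])]
print(rejected)  # only the triple that gives an organization an SSN
```

The rule lives with the vocabulary rather than inside each application, so every system that loads the data can apply the same check.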

The benefits of this approach are twofold. First, you can now extract meaning from unstructured data. Second, you are now able to provide a full line of sight from both your unstructured and structured data to your business concepts.

Big data and the risks of using NoSQL databases. Mitch De Felice

Figure 1: Line of sight of knowledge representation.

Figure 1 represents the line of sight of knowledge acquisition technologies. The bottom left of the graph represents low knowledge acquisition and high cost of operation. As you move up the line and incorporate more advanced technologies, you increase your knowledge acquisition capabilities, finally reaching today’s machine learning. At this level, you will have achieved highly efficient knowledge acquisition at an overall lower cost of operation.

Pundits scoff at this modeling approach, arguing that RDF is too hard to learn and is not well adopted by the industry in general. This is an opportunity cost proposition. You can continue to use big data with NoSQL and reduce your infrastructure costs, but provide no additional value for your business over what enterprise data warehouses provide. In fact, one could argue that you are moving the needle backwards by introducing the referential integrity issues that were addressed above. Or, you can turn your data lake of facts into an ocean of knowledge using JSON-LD with W3C Semantic specifications as the backbone for knowledge acquisition.

JSON-LD with W3C Semantic specifications provides you with a line of sight of interoperability, from your data all the way through to your business concepts. This approach lays the groundwork for more advanced capabilities like natural language understanding and application auto-generation. Not only will you be able to build smarter applications, but you will build them much faster, since you now have data that is in the context of your business concepts.

These advanced capabilities will position you to manage new threats expeditiously and provide the foundation to leapfrog your competition.

This article is published as part of the IDG Contributor Network.
