The JSON community has an IETF working-draft schema specification that provides a format for defining the structure of JSON data. This specification covers validation, documentation, hyperlink navigation, and interaction control of JSON data. However, it falls short of delivering the data-type primitives and built-in capabilities that the W3C provides in its XML data standards. Some vendors publish their own versions of JSON schema, but these schemas are non-standard and offer no portability between vendors.
Using JSON in environments that require interoperability of data across the enterprise makes governance more challenging. Gartner has published a white paper titled “Does your NoSQL DBMS Result in Information Governance Debt?” (subscription required). For technology leaders looking to increase data integrity and the overall resilience of their enterprise, the conclusion is that storing data as plain JSON objects will inhibit you from achieving these goals.
Introduction to JSON-LD
Consequently, to resolve the shortcomings of JSON, the W3C standards body has approved a standard called JSON-LD (JSON for Linked Data). JSON-LD enables interoperability of data over the internet and defines the concept of context by specifying the vocabulary of types and properties. Furthermore, JSON-LD is backed by another W3C standard, the Resource Description Framework (RDF), which provides a canonical data model for the data that JSON-LD represents.
RDF is a general-purpose framework for representing information in a graph format that describes data in context. It does this by representing data as a subject, predicate and object, commonly known as a triple. These triples are the foundation for bridging unstructured to structured data.
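To make the triple idea concrete, here is a minimal sketch of how a fact pulled from an unstructured sentence maps onto the subject-predicate-object shape. The names are illustrative, not from any particular vocabulary:

```python
# The sentence "Alice works for Acme" expressed as an RDF-style
# (subject, predicate, object) triple. Names here are hypothetical
# examples, not terms from a published ontology.
triple = ("Alice", "worksFor", "Acme")

subject, predicate, obj = triple
print(f"{subject} --{predicate}--> {obj}")
```

A real RDF store would use full IRIs for each component, but the three-part structure is the same.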
RDF has been adopted by schema.org, an open community internet ontology sponsored by Google, Microsoft, Yahoo and Yandex and used by 2.5 billion web pages.
JSON-LD has been adopted by the likes of Google’s Gmail and Search, Microsoft Bing, Yandex, BBC, and the U.S. government to name a few. A complete list can be found at json-ld.org/wiki.
Developing a cognitive road map starting with JSON-LD
The realization that big data can deliver value from both structured and unstructured data is a game changer. Using JSON-LD in a big data solution requires no extra effort over using JSON. The syntax serializes directly into graphs, ensuring that most real-world data models can be expressed. A JSON-LD document can be processed either as an RDF document or as a plain JSON document.
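The dual nature of JSON-LD can be shown with a small example. The document below uses the schema.org vocabulary (the person data is invented for illustration); the @context maps ordinary JSON keys to well-defined terms, yet any JSON tooling can still consume the document unchanged:

```python
import json

# A minimal JSON-LD document. The @context gives the plain JSON keys
# ("name", "jobTitle") globally defined meanings from schema.org,
# so the same object is valid JSON and valid linked data.
doc = {
    "@context": "https://schema.org",
    "@type": "Person",
    "name": "Jane Doe",          # illustrative data, not a real record
    "jobTitle": "Enterprise Architect",
}

# Ordinary JSON tooling handles it with no special support.
serialized = json.dumps(doc)
round_tripped = json.loads(serialized)
print(round_tripped["name"])  # → Jane Doe
```

An RDF-aware processor would instead expand the same document into triples, which is what makes the data portable across vendors.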
Using JSON-LD to express an RDF framework is a powerful way to set the foundation for more advanced concepts, thanks to the framework's ability to define context. In RDF, you can define domain-range constraints, i.e. rules that restrict which subjects and objects can be used with a given predicate. For example, in the triple Employee hasSSN 123-45-5678, the predicate hasSSN can be defined so that it cannot be assigned to an organization, as organizations don't have Social Security Numbers. RDF is complemented by the W3C Shapes Constraint Language (SHACL) standard, which gives you the ability to define the object structure: the equivalent of an XML XSD definition for graph data. There are tools in the marketplace, such as Lymba's Jaguar, that can read unstructured documents, extract the subject, predicate and object from a sentence, and store them as triples in a graph database.
The benefits of this approach are twofold. First, you now have the ability to extract meaning from unstructured data. Second, you are now able to provide a full line of sight from both your unstructured and structured data to your business concepts.
Figure 1 represents the line of sight of knowledge acquisition technologies. The bottom left of the graph represents low knowledge acquisition and high cost of operation. As you move up the line, incorporating more advanced technologies, you increase your knowledge acquisition capabilities, finally reaching today's machine learning. At this level, you achieve highly efficient knowledge acquisition with an overall lower cost of operation.
Pundits scoff at this modeling approach, arguing that RDF is too hard to learn and not well adopted by the industry in general. This is an opportunity-cost proposition. You can continue to use big data with NoSQL and reduce your infrastructure costs, but provide no additional value for your business over what enterprise data warehouses already provide. In fact, one could argue that you are moving the needle backwards by introducing the referential integrity issues addressed above. Or, you can turn your data lake of facts into an ocean of knowledge by using JSON-LD with the W3C Semantic Web specifications as the backbone for knowledge acquisition.
JSON-LD with the W3C Semantic Web specifications provides you with a line of sight of interoperability, from your data all the way through your business concepts. This approach lays the groundwork for more advanced capabilities like natural language understanding and application auto-generation. Not only will you be able to build smarter applications, but you will build them much faster, since you now have data that is in the context of your business concepts.
These advanced capabilities will give you the advantage to manage any new threats in an expeditious manner and provide the foundation to leapfrog your competition.
Mitch DeFelice started his career off serving six years in the U.S. Navy as part of the Naval Security Group tactical electronic support staff. Mitch’s military tours included serving with Fleet Air Reconnaissance Squadron (VQ-1) in Guam and support staff for Admiral Thomas B. Hayward, Commander-in-Chief, U.S. Pacific Fleet (CINPACFLT), Honolulu, Hawaii.
Mitch is a TOGAF 9 certified Enterprise Architect. His primary focus is working with key business stakeholders and executive technology leadership to develop technology solutions that support unstructured data, including content management, records management, enterprise search and eDiscovery solutions. His passion lies in developing business solutions around cognitive computing capabilities.
Mitch is a frequent contributing author to trade magazines on unstructured data and cognitive computing related topics.
The opinions expressed in this blog are those of Mitch DeFelice and do not necessarily represent those of IDG Communications, Inc., its parent, subsidiary or affiliated companies.