By Sam Ramji, Chief Strategy Officer, DataStax

Is giving people the right to change, alter, and amend your software a good thing? What about doing this for your data? Companies used to think that publicizing their source code was the same as giving away their secret sauce.

But they’re beginning to realize the impact that open source has had in creating the things around them, such as mobile devices or TVs, and how much open source is a vehicle for change.

What if your secret sauce was the data you owned, and not the source code? Would you be as comfortable making it public? Is it possible to have a General Public License (GPL) for data?

I recently sat down with Larry Augustin to delve into this topic. Augustin is an open-source titan: he was part of the group that coined the term “open source.” He led the first open source IPO at VA Linux, led SugarCRM for a decade, and most recently served as Vice President of Applications at Amazon Web Services (AWS), responsible for services including Connect, Pinpoint, SES, Workspaces, Chime, Alexa for Business, and many others.

From open source to open source data

Augustin was in the open source world at its origins. He has watched open source move like the tide, receding into the distance at times, then rushing back in a gigantic wave.

Back in the 1990s and early 2000s, open source was the new kid on the block. Some people were excited about it, while the majority asked questions like “why does it matter?” and “what is the strategy around it?” until the hype wore off. Now, in the 2020s, businesses are being built on an open source model by default.

Augustin speaks about the movement of open source from data centers, like the ones he was building in the Linux days, into consumer devices. But the consumer often has no idea how open source benefits them.
As he points out, you wouldn’t have a functioning TV without open source; looking into your TV settings will likely show you the open source licenses of the software used to build it.

The future of software, however, is not about the source code. It’s about the data. In an AI-centric world, the machine learning code itself is not the powerful part. Its only purpose is to enable training (in other words, building a system of neural weights) based on vast streams of data. Given the data, you can reproduce the AI, but with just the code you cannot.

So as we move to the future, Augustin sees an “AI-native” era of apps and businesses that build on the fundamental premise of AI-powered software that elevates human work.

“Why should a salesperson have to enter data that the system already knows? Smart systems should import that data automatically. That’s a design principle I call ‘zero data entry.’ Instead, software should be helping the salesperson do their job. For example, help the salesperson know what information the customer likely wants next. I call that creating a ‘system of action,’ one that helps the person do something (take action) in their job,” Augustin said.

To reach the AI-native future, we’re going to have to figure out how to apply the heuristics of open source to the world of open source data.

Open source: two core themes

Augustin speaks of two core themes that have made a great impact on open source software and that he believes should be applied to open source data as well. The first is the ability to extend, enhance, and reuse software. The second is the ability to fix a bug or repair a problem.

Extend, enhance, and reuse

The notion of extending or enhancing code is something you might have run into when using software and finding a small thing that, if changed, would make your life easier.
Open source grants you the freedom to make these changes and share them with other people who might be in a similar situation.

The same applies to extending, enhancing, and reusing open source data, but it isn’t as simple as just sharing data. As Augustin puts it: “You have to have the correct licensing. There are access mechanisms. Does that mean you get the data in a structured format? Do you need to change the schema? People who think about data all the time don’t always think about the metadata that goes with it.”

Augustin has seen many companies providing data without the metadata. Metadata is a key component because it contains information about the history and the causality of how the data was generated. Without this metadata, the value of the data collapses, because we have crippled our ability to trust and analyze it.

Fixing bugs and repairing problems

The second core theme is the ability to fix a bug or repair a problem. It’s annoying when one little thing can prevent you from using the software as you want, all because of a small oversight in coding or a lack of clear understanding of internal workings.

As an example, Augustin brought up an issue he ran into using QuickBooks at a startup many years ago: “I was using QuickBooks to do the accounting. And there was this field. If I put in 12 characters, it crashed. But if I put in 11, everything worked. And it was very clear when you put the 12 characters in, it went off the end, and boom, everything blew up. I could see the person writing this code thinking, ‘Oh, yeah, these things will never be longer than 11 characters.’”

Augustin contacted QuickBooks support, but they weren’t interested in fixing the problem. It’s an example of why open source is so attractive: you don’t have to “live with it” or wrangle with workarounds when you run into a software issue.
You can change the code and share the fix with others who might also benefit from it. It’s about “permissionless innovation,” as Vint Cerf stated so well.

Data also needs to be “fixed” at times. It can be hard to think of data as “broken,” but Augustin said that he rarely sees a clean dataset. And the larger the dataset, the greater the amount of “noise” in the data. The ability to improve the signal-to-noise ratio is an important part of opening up data.

What is the GPL for data?

As in the software world, where a user gives up some control through a contribution agreement, users of open source data have to give up some rights to their data. But the question we’re facing now is, what would that agreement or General Public License (GPL) look like?

“On the data side, what are the set of rights that a contributor of data needs to give up to still feel comfortable that they can use their data the way they want to, the way they intended, that they haven’t sort of lowered their own rights?” Augustin says.

Contributors who understand this trade-off enable the open-source community to enhance and create new items from their data.

This user agreement also opens up the possibility of accelerated human progress. For instance, academic researchers in biological sciences are producing brand new data. Sharing their findings would allow others the opportunity to train new models on it.

The data-in-to-data-out ratio

If we take it one step further from the GPL for data, we begin to see the value equation of data, or “the data-in-to-data-out ratio” as Augustin calls it. He uses it to explain why people are so willing to give up parts of their data and privacy to websites: the small amount of data they’re handing over returns greater value to them.

Augustin sees the data-in-to-data-out ratio as a tipping point in open source data.
Calling it one of his application principles, Augustin suggests that data engineers should focus on providing users with more value while taking less and less information from them.

He also wants to figure out a way never to ask your users for anything, so that you’re only providing them an advantage. For example, new app users will always be asked for information. But how can we skip that step and collect data directly in exchange for providing value?

“Most people are willing to [give up data] because they get a lot of utility back. Think about the ratio of how much you put in versus how much you get back. You get back an awful lot. People are willing to give up so much of their personal information because they get a lot back,” he says.

The future landscape of AI-native applications will generate billions of dollars through improved efficiency of enterprises as systems. Perhaps more importantly, we have a chance to make work more meaningful and joyful for the people freed from data administration to create value. AI has taught us that computers can learn things, and that they can know things. What’s special about humans is that we are creative beings who love to spend our time connecting with other humans. Let’s design a future where the use of AI sets us free.

Learn more about DataStax, and subscribe to the Open||Source||Data podcast.

About Sam Ramji:

Sam leads strategy at DataStax.
A 25-year veteran of the Silicon Valley and Seattle technology scenes, Sam led Kubernetes and DevOps product management for Google Cloud, founded the Cloud Foundry Foundation, helped build two multibillion-dollar markets (API Management at Apigee and Enterprise Service Bus at BEA Systems), and redefined Microsoft’s open source and Linux strategy from “extinguish” to “embrace.”

He is nerdy about open source, platform economics, middleware, and cloud computing, with an emphasis on developer experience and enterprise software. He is an advisor to multiple companies including Dell Technologies, Accenture, Observable, Insight Engines, and the Linux Foundation.