Data management is a multi-billion-dollar industry with heavy competition and an often-confusing landscape. Although a period of expansion has given way to contraction and consolidation, the ecosystem continues to shift rapidly. Mergers, acquisitions, and displacements all affect the tools and platforms used to manage information, and new hardware and software tools can quickly upend how data is managed.
For researchers, collecting information to formulate a hypothesis, conduct experiments, or analyze and iterate on a research program can be a daunting task. The challenge compounds when advanced technologies and big data are involved, and it grows more difficult still under increased pressure from regulations and security constraints.
To address these challenges, research-focused organizations need to take a strategic approach to data management. But what are best practices for data management in an ever-evolving landscape of approaches, tools, and threats?
Data management can be broken down into a set of interconnected component parts. Taken as a whole, these components provide a structure to help various stakeholders – data engineers, data scientists, IT operations personnel, data users – understand how the evolution of data management is impacting the way that research is constructed and conducted, the skillsets necessary for users of the data, and what may be on the horizon for the data management ecosystem.
We’ve identified nine key pieces of this puzzle:
- Data movement
- Data locality
- Metadata management
- Data integration
- Search capabilities
- Data catalog(s)
- Data pipeline(s)
- Policy and governance
- Intrinsic security and trust
Organizations must carefully consider how they address these various components as part of their data management strategy to enable the research enterprise effectively, to generate efficiencies, and to protect all data as valuable assets. Read on for an overview of select components, or check out the full whitepaper, “Data Management for Research” by Adam Robyak and Dr. Jeffrey Lancaster.
Data movement. Two trends are likely to shape how data movement evolves over the coming years. First, organizations are adopting hybrid cloud environments in which data is stored not only in on-premises infrastructure and with cloud providers, but also on remote devices, in sensors, and at edge gateways. As researchers seek to use that data, it will need to be both accessible and secure, no matter where it is stored. Second, machine learning is increasingly being used to automate manual tasks that had previously been the responsibility of IT professionals. As a result, those IT professionals can expect to spend less time on rote processes and more time monitoring resource allocation and troubleshooting at a distance.
Data locality. Whether data is generated and stored in the cloud, in a data center, on the edge, or somewhere in between, understanding where data lives is critical to any data management strategy. Edge computing is a newer consideration that has emerged in response to decentralized IT, Web 3.0, and disaggregated data. Its computational advantage comes from pre-processing data at the edge so that only key data, aggregate data, or pre-analyzed data is transmitted back to a data center. In some cases, data does not need to make a round trip to a data center at all; it can be wholly processed at the edge. Edge computing can be employed for a range of applications, from AI and analytics to inference and localized learning. Edge systems can also aggregate data from multiple endpoints, and they can act as relays or nodes in a distributed network.
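As a rough illustration of edge pre-processing, the sketch below reduces a batch of raw sensor readings to a small summary before anything is sent to a data center. The function name `summarize_readings`, the `threshold` parameter, and the sample temperature values are all hypothetical and chosen only for illustration; a real edge deployment would depend on the devices and protocols in use.

```python
from statistics import mean


def summarize_readings(readings, threshold):
    """Reduce raw edge readings to a compact summary for transmission.

    Hypothetical helper: only aggregate values and the individual
    readings above `threshold` (the "key data") are kept.
    """
    if not readings:
        return {"count": 0, "mean": None, "min": None, "max": None, "alerts": []}
    return {
        "count": len(readings),
        "mean": mean(readings),
        "min": min(readings),
        "max": max(readings),
        # Individual values are forwarded only when they exceed the threshold.
        "alerts": [r for r in readings if r > threshold],
    }


# Illustrative raw readings collected at an edge device (made-up values).
raw = [20.0, 21.0, 19.0, 35.0, 20.0]
summary = summarize_readings(raw, threshold=30.0)
print(summary)
```

In this sketch, five raw values are reduced to one small dictionary, so the volume of data leaving the edge depends on the summary rather than on the number of readings.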
Data pipeline(s). Data pipelines provide an organized, efficient construct for delivering information from a data source to its destination. Pipelines should be automated whenever possible and can leverage machine learning and artificial intelligence to aid in sourcing as well as ingestion. To make the best use of data pipelines, researchers should be able to clearly articulate where, when, and how data is collected. Researchers and organizations with a mature data management strategy are likely to employ multiple data pipelines.
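One way to picture a pipeline that records where, when, and how data is collected is the minimal sketch below. Everything in it is an assumption made for illustration: the `Record` fields, the stage functions `drop_missing` and `to_kelvin`, and the source name `lab-sensor-01` are hypothetical and do not come from any particular product or from the whitepaper.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class Record:
    value: float
    source: str        # where the value was collected
    collected_at: str  # when it was collected (ISO 8601, UTC)
    method: str        # how it was collected
    steps: List[str] = field(default_factory=list)  # stages applied so far


def ingest(values, source, method):
    """Wrap raw values in records that carry their collection context."""
    now = datetime.now(timezone.utc).isoformat()
    return [Record(float(v), source, now, method, ["ingest"]) for v in values]


def drop_missing(records):
    """Remove records whose value is NaN (NaN is never equal to itself)."""
    kept = [r for r in records if r.value == r.value]
    for r in kept:
        r.steps.append("drop_missing")
    return kept


def to_kelvin(records):
    """Convert Celsius values to Kelvin, noting the step on each record."""
    for r in records:
        r.value += 273.15
        r.steps.append("to_kelvin")
    return records


def run_pipeline(records, stages):
    """Apply each stage in order, passing the output of one to the next."""
    for stage in stages:
        records = stage(records)
    return records


records = run_pipeline(
    ingest([21.5, float("nan"), 19.0], source="lab-sensor-01", method="thermocouple"),
    [drop_missing, to_kelvin],
)
for r in records:
    print(r)
```

Because each record keeps its source, timestamp, method, and list of applied stages, the path from collection to destination stays visible even when several pipelines feed the same destination.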
Policy and governance. Policy and governance have led to the expectation that researchers must have a plan for data management. The National Science Foundation and the National Institutes of Health, along with other federal agencies in the United States, mandate the inclusion of a data management plan as part of grant applications. Universities and colleges thereby assume responsibility for the proper stewardship of the data generated by the research enterprise. The burden on institutions continues to grow as the amount of research data for which they are responsible increases exponentially.
Intrinsic security and trust. The trust gaps associated with current solutions present an opportunity for new and emerging technologies: the Internet of Things is being secured through a mix of edge and telemetry data collection and processing; data provenance solutions are ensuring the accuracy and legitimacy of data, even for physical items procured through complex supply chains; and data security across hybrid cloud models is protecting data in transit. Even SecDevOps, the process of integrating security, development, and IT operations into a continuous and cohesive lifecycle management architecture, is a sign of the attention and importance afforded to the need for trust within data management.
By deconstructing the components of a data management strategy, researchers can ensure both that they are responsible stewards of the data and that they are employing best-in-class emerging technologies. The responsibility does not fall wholly on researchers; it must be shared by research administrators, students, and others. Only through collaboration among researchers, organizations, and IT operations can a data management strategy be implemented optimally for research.
For a more in-depth look at each of the components of a successful data management strategy, see the Dell Technologies whitepaper “Data Management for Research” by Adam Robyak and Dr. Jeffrey Lancaster.