Date: 2021-10-26

The history of data can be divided into two eras: pre-big data and post-big data.

In the pre-big data era, data was mostly structured and exchanged between enterprises through standard mechanisms such as Network Data Mover (NDM). The need for near-real-time insights was limited, and data extraction and transformation were batch-oriented and scheduled during non-peak hours to reduce MIPS usage and disruption to online production transactions.

Also, data formats were limited, the most common format being delimited flat files with headers and trailers. Both headers and trailers stored important information such as data arrival time, data producer information, and the number of records in the file.
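To illustrate how a consumer might use the record count carried in a trailer, here is a minimal Python sketch that parses a delimited file and checks the count. The layout (an "H" header line, "D" detail lines, a "T" trailer line), the pipe delimiter, the field positions, and the names FlatFile and parse_flat_file are all assumptions made for illustration; they are not a format described in this article.

```python
from dataclasses import dataclass


@dataclass
class FlatFile:
    producer: str       # data producer, taken from the header
    arrival_time: str   # data arrival time, taken from the header
    records: list       # detail rows, each a list of string fields


def parse_flat_file(text: str, delimiter: str = "|") -> FlatFile:
    """Parse a header/detail/trailer file and verify the trailer's record count.

    Assumed (hypothetical) layout:
      H|<producer>|<arrival time>
      D|<field>|<field>|...
      T|<number of detail records>
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        raise ValueError("file must contain at least a header and a trailer")

    header = lines[0].split(delimiter)
    trailer = lines[-1].split(delimiter)
    if header[0] != "H" or len(header) < 3:
        raise ValueError("malformed header")
    if trailer[0] != "T" or len(trailer) < 2:
        raise ValueError("malformed trailer")

    records = []
    for line in lines[1:-1]:
        fields = line.split(delimiter)
        if fields[0] != "D":
            raise ValueError(f"unexpected record type: {fields[0]!r}")
        records.append(fields[1:])

    # The trailer count lets the consumer detect truncated or partial files.
    expected = int(trailer[1])
    if expected != len(records):
        raise ValueError(
            f"trailer declares {expected} records, found {len(records)}"
        )

    return FlatFile(producer=header[1], arrival_time=header[2], records=records)


sample = "H|BILLING|2021-10-26T02:00:00\nD|1001|42.50\nD|1002|17.25\nT|2\n"
parsed = parse_flat_file(sample)
print(parsed.producer, parsed.arrival_time, len(parsed.records))
```

A file whose trailer count does not match the number of detail lines raises a ValueError, which is the kind of check a batch job could run before loading the data.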

Moreover, relational database management systems (RDBMS) such as DB2, hierarchical databases such as IMS DB, flat files, and custom extract, transform, load (ETL) logic written in COBOL or PL/I were sufficient to address data ingestion, analysis, and storage. Since sources of data generation were limited, it was easier to manage the volume of data.

As the era of big data arrived, enterprises expected more value from data because advances in technology provided the capacity to gather, store, and analyze exponentially growing volumes of data. Data is now a competitive advantage, empowering enterprises to analyze and distill business insights for timely and informed decisions.

Moreover, enterprises need to comply with regulations that demand ingesting data from diverse sources so that decisions are informed by several data points. Take, for instance, how utility companies in California need to ingest and analyze voluminous data. Wildfires in California take a huge economic toll on communities and businesses every year. Regulatory authorities mandate the collection and storage of data and the application of artificial intelligence or machine learning-based prediction techniques to reduce the disruption caused by wildfires. This shift in the dynamics of data resulted in exponential growth in data volume, data sources, data exchange patterns, and data formats.

Managing volume and complexity of data

Today, a significant amount of enterprise data is generated from external sources rather than internal systems of record (SORs). The data stored is both transactional data and engagement data, and the engagement data can be 10 to 20 times the volume of the transactional data. Although Hadoop and Spark introduced distributed storage and accelerated data processing through massively parallel processing, they do not address dynamically scaling data acquisition, storage, and processing based on demand.



Elastic scaling of compute and storage on-premises is labor-intensive, cumbersome, and expensive. Data acquisition from multiple external sources adds further overhead. Consequently, enterprises face several challenges with on-premises data management. It is difficult to:

Scale up data processing and storage for an exponential increase in polymorphic data

Manage different mechanisms to ingest data from external and internal systems

Ensure high availability of data and near-real-time secure access to data insights

Necessity is the mother of invention

The evolution of cloud computing coincided with an exponential growth in data. The cloud abstracted away the problem of scaling storage and processing power on demand. It also provided a managed data landing zone for data ingestion from various internal and external systems.

Here is an example. Infosys redesigned the data landscape of a device manufacturer to better manage almost a petabyte of data residing in on-premises network-attached storage (NAS). The data was growing by 300% year on year. The system allowed users to upload images, incident descriptions, and application logs related to device defects. Our team redesigned the data management system using AWS, with Amazon DocumentDB for metadata management. Our choice was determined by several factors:

Amazon Simple Storage Service (Amazon S3) provides a secure, scalable, and highly available object store for the petabyte-scale file storage previously held on the NAS.

AWS features such as the transfer manager help manage large file uploads through multipart uploads.

Amazon S3 Transfer Acceleration routes data to the nearest edge location and then over an optimized network path for faster and more secure transfer of files.

Amazon S3 provides a common and standard landing zone for data exchange between stakeholders.

Amazon DocumentDB, a managed NoSQL database on the cloud, is schema-free and a good fit for storing metadata that goes through frequent structural changes.
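To make the multipart upload point more concrete, the following Python sketch shows the kind of arithmetic an upload tool has to do when it splits a large file into parts. It is not the AWS SDK's implementation and makes no AWS calls. The limits used (parts of 5 MiB to 5 GiB, at most 10,000 parts per upload) are Amazon S3's documented multipart limits; the function name plan_parts and the 8 MiB preferred part size are assumptions chosen for illustration.

```python
MIB = 1024 ** 2
MIN_PART_SIZE = 5 * MIB          # S3 minimum size for every part except the last
MAX_PART_SIZE = 5 * 1024 * MIB   # S3 maximum part size (5 GiB)
MAX_PARTS = 10_000               # S3 maximum number of parts per upload


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, exact for arbitrarily large integers."""
    return -(-a // b)


def plan_parts(file_size: int, preferred_part_size: int = 8 * MIB):
    """Return (part_size, part_count) for uploading file_size bytes in parts.

    Hypothetical helper: the preferred part size is only a starting point
    and is raised when the file would otherwise exceed MAX_PARTS parts.
    """
    if file_size <= 0:
        raise ValueError("file_size must be positive")

    part_size = max(preferred_part_size, MIN_PART_SIZE)
    # Grow the part size so that the whole file fits in MAX_PARTS parts.
    part_size = max(part_size, ceil_div(file_size, MAX_PARTS))
    if part_size > MAX_PART_SIZE:
        raise ValueError("file is too large for a single multipart upload")

    return part_size, ceil_div(file_size, part_size)


small_plan = plan_parts(100 * MIB)    # a 100 MiB log bundle
large_plan = plan_parts(1024 ** 4)    # a 1 TiB archive
print(small_plan, large_plan)
```

For small files the preferred part size is used as is; for very large files the part size grows so that the part count stays within the limit. In practice an SDK transfer utility handles this planning, along with parallel part uploads and retries.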

Our experience of partnering with global clients across verticals suggests that a majority of enterprises face challenges of data acquisition, exchange, and analysis when using an on-premises technology stack. While the device manufacturer had an on-premises file system, the situation is similar for traditional on-premises databases. For databases, the problem is compounded by expensive database licenses and database management support costs.



Along with data, enterprises need to develop and manage complex on-premises data ingestion and replication mechanisms. Enterprises need a highly skilled workforce to manage data that is stored on-premises. As a result, enterprises are migrating their data processing and management to the cloud, with a majority preferring a managed service on the cloud.



Amazon Web Services (AWS) offers a broad spectrum of data management services catering to several types of data, be it relational, semi-structured, or unstructured. Amazon Relational Database Service (Amazon RDS) and Amazon Aurora cater to the relational domain, while Amazon DynamoDB is a fully managed NoSQL database service. Apart from these services, AWS provides managed services compatible with other popular NoSQL databases, such as Amazon DocumentDB (with MongoDB compatibility) and Amazon Keyspaces (for Apache Cassandra).

Navigating data migration, powered by AWS and Infosys migration strategy

Migrating data to the cloud demands a strategy to ensure seamless operations and business continuity. In some cases, it may be beneficial to retain certain types of data on-premises due to regulatory requirements. The data migration approach may vary based on the size and nature of the data.

For example, if the volume of data is huge (as in the device manufacturer use case), it is prudent to adopt the AWS Snow Family, which comprises AWS Snowcone, AWS Snowball, and AWS Snowmobile. This suite of services offers a number of physical devices and capacity points to help physically transport up to exabytes of data into the AWS Cloud.
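A rough calculation shows why physical transport becomes attractive at this scale. The Python sketch below estimates how long it would take to move one petabyte over a network link. The link speeds and the 80% average utilization are assumptions chosen for illustration, not figures from the device manufacturer engagement, and the function name transfer_days is hypothetical.

```python
SECONDS_PER_DAY = 86_400
PETABYTE = 10 ** 15  # bytes (decimal petabyte)


def transfer_days(data_bytes: float, link_bits_per_second: float,
                  utilization: float = 0.8) -> float:
    """Estimate the days needed to move data_bytes over a network link.

    utilization is the assumed fraction of the link's nominal bandwidth
    that the transfer sustains on average.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    seconds = (data_bytes * 8) / (link_bits_per_second * utilization)
    return seconds / SECONDS_PER_DAY


# Assumed link speeds, for illustration only.
links = [("100 Mbps", 100e6), ("1 Gbps", 1e9), ("10 Gbps", 10e9)]
for label, bits_per_second in links:
    days = transfer_days(PETABYTE, bits_per_second)
    print(f"{label}: about {days:.0f} days for 1 PB")
```

Under these assumptions, a dedicated 1 Gbps link needs several months to move a petabyte, and that is before accounting for data that keeps growing during the transfer. Estimates like this are one input when choosing between online transfer and shipping devices.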

In addition, for continuous data ingestion from various sources into the AWS Cloud, AWS provides migration and ingestion services such as AWS Database Migration Service (AWS DMS), which ingests relational data into AWS. Amazon Kinesis services also help ingest, store, and process streaming data.

For data transformation, AWS provides Amazon EMR (Elastic MapReduce), which manages Hadoop clusters in the cloud, and AWS Glue, a managed extract, transform, and load (ETL) service. Furthermore, Amazon Athena and Amazon Redshift Spectrum support a data lakehouse implementation in the cloud, and Amazon QuickSight adds a visualization layer for business users.



Post-migration, enterprises need to consider managing running costs. Implementing an observatory layer helps track and manage resource usage and optimization on the cloud. The observatory layer helped the Infosys team optimize usage and reduce costs.

A platform-based approach to migrating applications and data to the cloud is imperative for a seamless migration. The Infosys Modernization Suite and its component, the Infosys Database Migration Platform, part of Infosys Cobalt, enable enterprises to migrate from on-premises RDBMS to cloud databases such as Amazon Aurora, or to NoSQL databases such as Amazon DynamoDB and Amazon DocumentDB.

Migrating data and application workloads to the cloud is imperative for enterprises to future-proof their businesses. A well-orchestrated, automated approach allows enterprises to realize the benefits of migrating data to the cloud.



About the authors:

Rajib Deb is the Associate Vice President and Head of Architecture – Modernization Practice, Infosys

Saurabh Shrivastava is the AWS Global Solutions Architect Leader for Infosys