I won big at a recent “casino night” event by betting all my chips and hitting blackjack on the last hand. After lots of adulation from my peers for my courage, and a small prize (we weren’t playing for money), they asked me why I risked the bet. “There was nothing at stake,” I replied.
The same isn’t true for large businesses planning their migration to the cloud. The promise of on-demand capacity, low-cost storage, and a rich ecosystem of open-source and commercial tools is compelling. But the stakes are real, especially when it comes to migrating data. As hundreds of companies have now demonstrated, a single data breach can cause long-term economic, legal, and brand damage. Beyond data protection, simply managing data in the cloud is different, and if it’s not done right, the cost, complexity, and risk can bring down the house.
A simple “lift and shift” of a data warehouse or data lake to the cloud won’t generate enough cost savings to justify the effort. The cloud technologies that dramatically impact both TCO and scale are low-cost object storage (e.g., Amazon S3, ADLS) and elastic data processing (e.g., EMR, Spark). In fact, using these technologies to set up an elastic rather than fixed cloud data management environment can lower TCO by as much as 85%.
[Figure: How much does it cost to manage data in the cloud? Credit: Podium Data]
It’s important to note that the technologies driving down data storage costs provide significantly fewer data management capabilities. Hadoop is a lot cheaper than Teradata, but it provides none of the data integrity controls, load balancing, and automation of a mature RDBMS. Similarly, S3 is cheaper than storage on Hadoop data nodes, but it’s just a file system. There are no tables, fields, or datatypes. To query or process data on S3 you need to use commercial or open-source tools (e.g., AWS Glue, EMR) or write custom programs.
To manage and update data in S3, a data management tool (Redshift, Snowflake, Podium) is required. Data protection is limited to encrypting files, which is not very helpful when you want to analyze datasets that have PII in some fields. Although object storage is scalable, inexpensive, and flexible, it turns the clock back decades on data management.
As with many immature technologies, the limitations of object stores have been touted as features. They “allow” programmers to process data of any size, shape, or quality, and interpret its structure and contents. This “schema on read” approach works well for processing unstructured data or data that changes structure frequently. But it stymies the automation, standardization, and scale that are key to collaboration and reuse, because the meaning of the data is buried in the code. Sound familiar? It is. The rallying cry for relational databases was to make the structure and meaning of data declarative, not embedded in COBOL redefines (look it up).
Bridges built from a catalog-first strategy
The bridge between highly structured databases and “anything goes” object stores is a data catalog. The catalog is a shared database that provides structure and meaning to data in object stores. Hadoop catalogs include Hive, Atlas, and Navigator, which define how HDFS files comprise tables and fields. Through an API, programs can query the catalog to find the structure of a logical data object, its technical and business properties, access permissions, and the location of the data files. These programs can then push insights and results back into the catalog to enrich it.
However, many cloud catalogs are passive: they scan files and logs to infer the structure and usage of data after the data has been processed.
Data management, however, must be active to ensure that sensitive data is not exposed, important data standards are followed, and rogue actors don’t create a house of cards. All cloud migrations should adopt a catalog-centric policy:

All shared and sensitive data is registered in a common catalog
All programs access data through the catalog and log their activity

This allows a company to provide basic data management that supports a wide range of rapidly evolving technologies. A data lake on S3 can support Hadoop processing, custom PySpark code, R analytics, AWS Glue, etc., while maintaining (and enriching) a shared data asset. Furthermore, a standard can be set for how data is stored, updated, and checked for quality, which allows lights-out automation of these tasks.
The catalog also enables elasticity, which is central to cloud economics. The catalog can be available 24/7 on a single server, supporting business users shopping for data, developers designing new data products, and stewards checking quality and adding business definitions. Only data processing tasks (such as data loads, refreshes, preparation, and analytics) require parallel processing power. Relational databases and Hadoop have traditionally coupled storage, processing, and the catalog in one fixed system, so as data grows, costs rise across the board. In the new world, the catalog is again the bridge between processing power and cheap storage. Vast amounts of data can be managed affordably with the catalog, and processing costs can be controlled. In fact, if the catalog has profiling statistics (e.g., cardinality, min, max), it can optimize the processing of the data.
Another benefit of being catalog-centric is portability. Cloud vendors are eager to have you sign up for their integrated, proprietary tools. That is their strategy: once they have your data and code in their applications, they have you.
A catalog gives you choice: we literally migrated a customer from one cloud vendor to another in a weekend because everything was catalog-driven and automated.
Behind the firewall, a catalog-first strategy is best, and prepares you to be catalog-centric. An automated cataloging tool can give you insights into all your data assets (relational, mainframe, Hadoop, files) in a few weeks, and give you a playbook for migration that answers questions such as:

What sources should we migrate?
Where is the GDPR and PII data?
What duplicated and related data should we rationalize?
What is the profile, content, and quality of every field?

The objective is to create cloud-ready data with a verifiable audit trail that attests to its provenance, lineage, and quality. Furthermore, the catalog provides a foundation for agility and scale, through secure, self-service access to a broad user community. With real insights on the readiness of your data to move to the cloud, and a cloud-native catalog ready to manage it, you can accelerate migration with both confidence and control.
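
For readers who want something concrete, the catalog-centric policy described earlier (register shared and sensitive data in a common catalog, route every access through it, and log the activity) might look roughly like the Python sketch below. It is only an illustration under stated assumptions, not how any particular product works: the `DatasetEntry` and `Catalog` classes, the role names, and the `s3://example-bucket/customers/` location are all hypothetical, and a real catalog would persist its entries and audit trail rather than hold them in memory.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DatasetEntry:
    """One logical data object registered in the catalog (hypothetical model)."""
    name: str
    location: str            # where the data files live, e.g. an object-store prefix
    fields: dict             # field name -> datatype
    sensitive_fields: set = field(default_factory=set)  # e.g. fields holding PII
    allowed_roles: set = field(default_factory=set)     # roles permitted to read


class Catalog:
    """In-memory stand-in for a shared catalog that is consulted on every access."""

    def __init__(self):
        self._entries = {}
        self.audit_log = []   # every access attempt is recorded here

    def register(self, entry):
        self._entries[entry.name] = entry

    def resolve(self, name, user, role):
        """Return (location, visible schema) for a dataset and log the attempt."""
        entry = self._entries.get(name)
        granted = entry is not None and role in entry.allowed_roles
        self.audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "user": user,
            "dataset": name,
            "granted": granted,
        })
        if entry is None:
            raise KeyError(f"{name} is not registered in the catalog")
        if not granted:
            raise PermissionError(f"role {role!r} may not read {name}")
        # Assumption for this sketch: sensitive fields are hidden from every caller.
        visible = {f: t for f, t in entry.fields.items()
                   if f not in entry.sensitive_fields}
        return entry.location, visible


catalog = Catalog()
catalog.register(DatasetEntry(
    name="customers",
    location="s3://example-bucket/customers/",  # hypothetical location
    fields={"customer_id": "int", "email": "string", "region": "string"},
    sensitive_fields={"email"},
    allowed_roles={"analyst"},
))
location, schema = catalog.resolve("customers", user="dana", role="analyst")
print(location, schema)
```

The design choice worth noting is that the processing job never hard-codes a file path or a schema: it asks the catalog, which is also the single place where permissions are checked and the audit trail is written.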