Recently, there has been a lot of talk about whether ETL is still a necessary activity in the modern data architecture, or whether it should be abandoned in favor of newer techniques such as data wrangling, in-memory transformations and in-memory databases. It turns out ETL is more important than ever: though it can look slightly different today (traditional data warehouses, for example, can be too restrictive), it continues to play an extremely important role in the data value generation process.
ETL – short for Extract, Transform, Load – consists of three key stages. Extract is the process of collecting data from the data sources; Transform processes that data into the form suitable for the required use case; and Load moves the data into a storage layer that can power that use case.
ETL is not dead. In fact, it has become more complex and more necessary in a world of disparate data sources, complex data mergers and a diversity of data-driven applications and use cases.
Extract is the process of collecting data from all required data sources. Data sources come in many shapes and sizes: RDBMS systems, APIs and file shares; public and private sources; paid and free feeds. They can contain PII (personally identifiable information) or enterprise IP (intellectual property). They can be messy and unstructured, or structured and well described. They can generate data at varied frequencies or produce it continuously through data streams. They can support "pull" or "push" mechanisms, either synchronously or asynchronously.
This means that the extract part of modern ETL needs to be extremely flexible and resilient to support the diversity of data sources and the variations in data extraction procedures and protocols. Modern data architectures need to connect to multiple data sources in parallel and extract data for downstream processing without impacting the resilience of other extract processes.
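As a sketch of this isolation requirement, the snippet below runs several extractors in parallel and records any failure per source without letting it stop the other extracts. The extractor functions and source names ("orders_db", "events_api") are hypothetical, standing in for real database and API clients:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical extractors: each callable pulls raw records from one
# source. Real versions would wrap a DB driver, an HTTP client, etc.
def extract_orders_db():
    return [{"order_id": 1, "amount": 42.0}]

def extract_events_api():
    return [{"event": "click", "ts": "2024-01-01T00:00:00Z"}]

def run_extracts(extractors):
    """Run all extractors in parallel; a failure in one source is
    recorded per source and does not stop the others."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=len(extractors)) as pool:
        futures = {pool.submit(fn): name for name, fn in extractors.items()}
        for fut in as_completed(futures):
            name = futures[fut]
            try:
                results[name] = fut.result()
            except Exception as exc:
                errors[name] = exc  # isolate the failure to this source
    return results, errors

results, errors = run_extracts({
    "orders_db": extract_orders_db,
    "events_api": extract_events_api,
})
```

Keeping each source behind its own callable is what lets a flaky API degrade into an entry in `errors` rather than a failed pipeline run.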
Transform is the process where data is read in its raw form and converted into the form ready for use in multiple types of scenarios. Transformation is probably the part of ETL that has changed the least; however, technology advances have made this part of the process more resilient, stable and efficient. Transformation comprises three key subparts.
The first type of transformation process is the determination of whether data is high quality, complete and acceptable. Here, the system needs to ensure that data points are complete, adhere to the expected schema and do not contain values that are unreadable, corrupted or incoherent. Another type of data quality check compares newly arrived data against the past patterns associated with the same data set; if unexpected changes are noticed, the data quality can be marked as suspect.
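The two kinds of quality checks described above can be sketched as follows. The schema, field names and drift tolerance are illustrative assumptions, not a prescribed standard:

```python
# Assumed schema for illustration: field name -> expected type.
EXPECTED_SCHEMA = {"order_id": int, "amount": float}

def check_schema(records, schema=EXPECTED_SCHEMA):
    """Return indices of records that miss fields or carry wrong types."""
    bad = []
    for i, rec in enumerate(records):
        if not all(isinstance(rec.get(k), t) for k, t in schema.items()):
            bad.append(i)
    return bad

def volume_looks_suspect(new_count, past_counts, tolerance=0.5):
    """Compare this batch's row count against past arrivals; flag the
    batch when it deviates from the historical average by more than
    the tolerance fraction."""
    if not past_counts:
        return False
    avg = sum(past_counts) / len(past_counts)
    return abs(new_count - avg) > tolerance * avg

batch = [{"order_id": 1, "amount": 9.99}, {"order_id": "x"}]
bad_rows = check_schema(batch)                    # second record fails
suspect = volume_looks_suspect(2, [100, 110, 90]) # far below past volume
```

The first function is the "adheres to the expected schema" check; the second is the "compare against past patterns" check, reduced to row counts for brevity.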
The second type of transformation process ensures that the data meets the business quality requirements of the intended analysis. Here the data is inspected and analyzed for completeness from a business relevancy perspective, and if it is found to be missing key elements required to power business workflows, it is marked suspect.
The third type of transformation process shapes the data for the business purpose of the analysis. Here data can be aggregated, cubed, filtered, sampled or processed through algorithms to produce a transformed data set primed to support the intended business use case.
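A minimal sketch of one such shaping step, using a made-up order data set and a simple sum aggregation:

```python
from collections import defaultdict

def aggregate_by(records, key, value_field):
    """Sum value_field per distinct value of key - one common way a
    raw data set is shaped for an analytics use case."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[key]] += rec[value_field]
    return dict(totals)

# Hypothetical order records, for illustration only.
orders = [
    {"region": "EU", "amount": 10.0},
    {"region": "EU", "amount": 5.0},
    {"region": "US", "amount": 7.5},
]
totals = aggregate_by(orders, "region", "amount")
```

Cubing, filtering or sampling would be further functions of the same shape: raw records in, use-case-ready data set out.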
Because the same data can be used for multiple business use cases, transformations typically follow a one-to-many relationship: one data set is transformed multiple times, through multiple pieces of business logic, to produce multiple transformed data sets.
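The one-to-many pattern can be sketched as a fan-out of named transforms over a single input data set; the transform names and logic here are invented for illustration:

```python
def fan_out(records, transforms):
    """Apply every named transform to the same input data set,
    producing one output per business use case."""
    return {name: fn(records) for name, fn in transforms.items()}

data = [{"amount": 10.0}, {"amount": 250.0}, {"amount": 3.0}]
outputs = fan_out(data, {
    # One use case wants only high-value records...
    "high_value": lambda rs: [r for r in rs if r["amount"] > 100],
    # ...another wants a single aggregate figure.
    "total": lambda rs: sum(r["amount"] for r in rs),
})
```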
Load in ETL has gone through major changes in approach, especially with the advent of polyglot storage, where storage is designed to best empower the specific data scenario, be it analytics, search, alerting or visibility. Load in modern data architectures can, in parallel, load the same data into multiple different types of storage technologies to power end-user and customer applications as business requirements demand.
In modern load architectures, it is important that the system be able to simultaneously stream and load data into multiple technology stacks, again without impacting the resiliency or quality of other parallel loads.
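A sketch of such a parallel, failure-isolated load, using in-memory stand-ins for real storage clients (the sink names "warehouse", "search_index" and "flaky_store" are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-ins for real storage clients (warehouse, search
# index, alerting store, etc.).
class InMemorySink:
    def __init__(self, name):
        self.name, self.rows = name, []
    def write(self, records):
        self.rows.extend(records)

class FailingSink:
    name = "flaky_store"
    def write(self, records):
        raise RuntimeError("connection refused")

def load_all(records, sinks):
    """Write the same records to every sink in parallel; one sink's
    failure is recorded without blocking the other loads."""
    failures = {}
    def _load(sink):
        try:
            sink.write(records)
        except Exception as exc:
            failures[sink.name] = exc
    with ThreadPoolExecutor(max_workers=len(sinks)) as pool:
        list(pool.map(_load, sinks))
    return failures

rows = [{"order_id": 1, "amount": 42.0}]
warehouse, search = InMemorySink("warehouse"), InMemorySink("search_index")
failures = load_all(rows, [warehouse, search, FailingSink()])
```

The point of the sketch is the isolation: the flaky store ends up in `failures` while the warehouse and search index both receive the full batch.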
Though the nature of ETL has changed, the idea of ETL has not become stale or irrelevant. The ever-increasing list of ETL options is a sign that the ETL tools market is not only alive but growing. Especially with CIOs under pressure to drive higher quality and value from their data by applying AI and machine learning technologies, modern ETL is a significant part of any data architecture. ETL is not dead. It is just more complex, harder and more significant than ever.