Extract Transform Load (ETL)
ETL has long been the preferred way of doing batch data integrations. The reason is that the ETL method, where data is extracted from the source and transformed in memory before being loaded into the target, is an elegant way of processing data as long as the data fits into memory. In-memory transformations, where data is transformed without touching the disk and generating IO, are a huge performance factor in any integration. The issue arises when the data does not fit into memory and has to be paged to temporary storage on disk, or when the processing depends on relational data that is needed for lookups or functional transformations.
Imagine the following: you are able to incrementally load only a few rows of data from the source, which fit perfectly into a small memory buffer, but you depend on matching the data against millions of rows in your target database to process it correctly. With an ETL approach, you would either preload that relational data from the target database into memory, or do a row-by-row lookup against the target, creating multiple session connections. Neither approach is very efficient.
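The preload strategy above can be sketched in a few lines. This is a minimal illustration, not a real ETL tool: the lookup dict stands in for the preloaded target data, and the table and column names are hypothetical.

```python
# Sketch of the "preload" ETL lookup strategy: the relational data
# from the target is held in an in-memory dict, and each incoming
# source row is enriched with an O(1) lookup. All names are illustrative.

def transform_with_preload(source_rows, target_lookup):
    """Enrich each source row from a preloaded in-memory lookup table."""
    for row in source_rows:
        match = target_lookup.get(row["customer_id"])  # in-memory, no DB round trip
        yield {**row, "region": match["region"] if match else None}

# Preloaded target data -- fine at this size, but with millions of
# target rows this dict is exactly what no longer fits into memory.
target_lookup = {
    1: {"region": "EMEA"},
    2: {"region": "APAC"},
}

source_rows = [
    {"customer_id": 1, "amount": 100},
    {"customer_id": 3, "amount": 50},  # no match in the target data
]

result = list(transform_with_preload(source_rows, target_lookup))
```

The alternative, a row-by-row lookup, would replace the dict access with a database query per source row, trading memory pressure for many small round trips and session connections.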
What the ETL methodology does very well is centralize the business rules needed for the integration, since much of the processing is done by the ETL tool. This makes for clean integrations and minimizes complexity at the database level, meaning that you can potentially opt for cheaper database solutions.
| Pros | Cons |
| --- | --- |
| Centralizes business logic | Not suitable for larger data volumes |
| Minimizes stress on databases | In-memory data is isolated to the session and lost when the session ends |
| Very performant on data that fits into memory | Pure batch orientation |
| In-memory processing | Advanced processing transformations create back-pressure on the source |
Extract Load Transform (ELT) & Data Lake
ELT is a process where raw data is extracted from the source and stored in an archive or temporary storage area, before the transformation is invoked on the target/receiving end of the integration. This is done for a couple of reasons:
- Minimizes back-pressure on the source
- Opens up for parallel inserts into the target
- Utilizes target database capabilities like indexes
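The ELT flow can be sketched as two distinct steps: land the raw data untouched, then transform it inside the target engine. The sketch below uses `sqlite3` as a stand-in for the target database; the table names are illustrative.

```python
import sqlite3

# Minimal ELT sketch: raw rows are landed in a staging table exactly
# as extracted, and the transformation runs inside the target database
# afterwards. sqlite3 stands in for the target; names are illustrative.

raw_rows = [("alice", " 100 "), ("bob", "250")]  # raw, untrimmed strings

con = sqlite3.connect(":memory:")

# 1. Extract + Load: land the raw data without transforming it, so the
#    only load on the source is the read itself.
con.execute("CREATE TABLE staging_orders (name TEXT, amount TEXT)")
con.executemany("INSERT INTO staging_orders VALUES (?, ?)", raw_rows)

# 2. Transform: push the cleansing down to the target engine and use
#    its own capabilities (casts, set-based SQL, indexes).
con.execute("""
    CREATE TABLE orders AS
    SELECT name, CAST(TRIM(amount) AS INTEGER) AS amount
    FROM staging_orders
""")

total = con.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
```

The point of the split is that step 1 is dumb and fast, while step 2 runs as set-based SQL where the data already lives, instead of row by row in the integration layer.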
The ELT method is often chosen when the data volume, or velocity, increases to a point where traditional ETL becomes a bottleneck due to issues like sequential writes, insufficient use of indexes, or data not being processed by the engine responsible for the integration. The ELT method minimizes complexity in the integration layer by making raw data available faster without stressing the source system, which reduces development time at this stage. This complexity is by no means eliminated, however; it is shifted to another part of the system. That can be a good thing if your development resources are skewed towards supporting the target-side processing, but managers must make sure that cleansing and business rules are implemented in such a way that others can reuse them. A common issue with the ELT approach is that multiple actors get access to the raw data and start creating their own interpretations of it without aligning or sharing their code with others.
When ELT is used together with a data lake, you achieve two things. Firstly, it lets you take advantage of parallel writes, both to the data lake and to the connected database if it supports commands like “COPY INTO”. This means you can write in parallel all the way down to the target database, reducing write times from hours to a few seconds. Secondly, ELT with a data lake minimizes the number of connections and integrations against the sources: should a second system want the same data, it can reuse what is already available in the data lake.
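The parallel-write pattern can be sketched as follows. A temporary directory stands in for the data lake, each worker lands its own partition file independently, and a single bulk command then loads everything. The `COPY INTO` statement at the end is Snowflake-style syntax, shown only as an illustrative string; stage and table names are assumptions.

```python
import concurrent.futures
import csv
import pathlib
import tempfile

# Sketch of parallel writes into a "data lake" (a temp directory here):
# each worker writes one partition file on its own, with no coordination,
# and a single set-based COPY INTO then ingests all files into the target.

lake = pathlib.Path(tempfile.mkdtemp()) / "orders"
lake.mkdir(parents=True)

def write_partition(part_id, rows):
    """One worker lands one partition file, independently of the others."""
    path = lake / f"part-{part_id:04d}.csv"
    with path.open("w", newline="") as f:
        csv.writer(f).writerows(rows)
    return path

# Eight hypothetical partitions of extracted rows.
partitions = {i: [(i, f"row-{i}-{j}") for j in range(1000)] for i in range(8)}

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    files = list(pool.map(write_partition, partitions, partitions.values()))

# The connected database then ingests every partition in one command
# (illustrative Snowflake-style syntax; "@lake_stage" is a hypothetical stage):
copy_stmt = "COPY INTO orders FROM @lake_stage/orders FILE_FORMAT = (TYPE = CSV)"
```

The speedup described above comes from this shape: the slow part (writing) is fanned out across workers, while the load into the database stays a single set-based operation that the engine can itself parallelize.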
| Pros | Cons |
| --- | --- |
| Enables parallel processing | More frequent use of disk and IO |
| Removes integration bottlenecks | Business rules are often stored in the target databases, making them harder to maintain |
| Enables parallel writes | “Slow” for small amounts of data |
| Utilizes indexes on both source and target databases | |
| When used together with a data lake, implements a common data platform integration | |
| Minimizes complexity in the integration layer | |
Centralized vs Distributed
Centralized integrations are often built on heavyweight services like Azure Data Factory. When building a centralized integration, we create one piece of software responsible for delivering data to and from multiple services. These tools are often highly scalable and ship with a big stack of proprietary components and adapters that let you integrate without much custom code, often supplemented by a graphical interface. This makes business logic easier to understand and maintain, and lowers the threshold for developers. All of this leads to higher productivity and better governance for your integrations.
These kinds of software often have unit-testing tools and are adapted to a DevOps workflow where code is automatically tested and connections verified as you develop. In addition, they often come with pre-built change management tools, or are tightly integrated with them. This contributes to a developer workflow that pushes the idea of continuous integration/continuous deployment (CI/CD). The drawback is that it is often hard, or close to impossible, to implement test cases at the function level of the integration, leaving you with test cases on static data outputs only, which can make your integration results harder to verify. On the deployment side, centralized integration software and services tend to generate bigger deployments, as code is packed together in large XML or JSON files. This makes deployments riskier, and the complex structure of the files makes it harder to detect unwanted changes to already implemented integrations, as well as harder to roll back individual changes.
Centralized integration tools often come with security layers, where you can implement security settings to limit access to systems and their data. However, these settings are often hard to configure in such a way that a person working on the platform only gets access to the entities related to their specific workspace.
And then there is the price. Software or services that provide all of these features are usually not cheap, and for good reason, many would say. The companies developing them put a lot of resources into enabling your business to increase throughput, stability, security and productivity.
| Pros | Cons |
| --- | --- |
| Often graphical interfaces | Limitations to individual scaling |
| Ready-to-use source and target adapters | Costly |
| Minimizes use of custom code | Bigger deployments with higher risk |
| Easier to maintain | Less reuse of components from developed code |
| Built-in unit and integration testing capabilities | |
| Often pre-built or highly integrated with CI/CD tools | |
| Often pre-built encryption | |
A distributed integration strategy, on the other hand, is based on multiple lighter and smaller pieces of developed code, or solutions, that are more independent, each covering a smaller portion of the overall architecture. This creates the possibility of scaling each part individually, thereby lowering cost, increasing performance and mitigating the risk of a single point of failure. In Azure, this strategy can be implemented with components like Functions and Logic Apps. But there is of course a major drawback to this strategy, and that is the architectural complexity. Having multiple smaller parts that all contribute to a greater solution, while at the same time maintaining “law and order”, can be challenging and requires more leadership and governance than the centralized solutions.
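The core idea can be illustrated without any cloud service at all: several small, single-purpose components that know nothing about each other and are connected only by a message channel. The sketch below uses threads and a queue as stand-ins for, say, separate Functions connected by a queue service; all names are illustrative.

```python
import queue
import threading

# Minimal sketch of the distributed idea: two small, independent workers,
# each owning one step of the integration, connected only by queues.
# Each part could be scaled or replaced on its own. Names are illustrative.

raw_q, clean_q = queue.Queue(), queue.Queue()

def extract_worker():
    """Single-purpose component: pulls raw records and hands them off."""
    for record in ["  Alice ", "BOB"]:
        raw_q.put(record)
    raw_q.put(None)  # sentinel: no more data

def clean_worker():
    """Independent component: normalizes whatever arrives on its queue."""
    while (record := raw_q.get()) is not None:
        clean_q.put(record.strip().title())
    clean_q.put(None)

for worker in (extract_worker, clean_worker):
    threading.Thread(target=worker).start()

cleaned = list(iter(clean_q.get, None))
```

The “law and order” problem from the text shows up even here: the two workers only cooperate because they agree on the queue and the sentinel convention, and that contract is exactly what governance has to protect as the number of parts grows.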
However, if you manage to “keep a tight rein”, this strategy will enable you to increase development speed and significantly lower the cost of your solution. A great benefit of the distributed strategy is that it lets you be more flexible in which technology you use for the different parts of your solution, letting you optimize for each given case.
What you often lose, obviously, are the advantages of the bigger centralized software component: you are left with more custom code, and you are more on your own when it comes to CI/CD, and so on.
| Pros | Cons |
| --- | --- |
| Flexibility in technology choice | Without governance you will lose the game |
| More custom code | Has to be supported by source control software like GitHub |
| Re-usable code | More on your own when it comes to security |