Often when we meet customers struggling with performance, we see staging environments that exist more or less as an archive or a copy of the source. It is uncertain why staging environments are so often handled this way, but it may be a misunderstanding originating from reference books like “Kimball’s Data Warehouse Toolkit Classics”, where it is stated that the staging environment should reflect the source. This may be true at the metadata level, but it is hard to believe that these authors meant it should be a copy of the source data, persisted over time in a database that runs on indexes.
So why shouldn’t it be? The answer lies in how data is accessed and loaded on a database server. But before we go there, let’s establish some objectives for our staging environment:
- The staging environment should support fast inserts
- The staging environment should support fast selects
With these objectives in mind, what is the fastest way to insert data into, and select data from, a table?
Inserts are fastest when the storage engine does not have to take into account what order the data should be stored in, and selects are fastest when the read does not have to filter the result. This means that the fastest inserts go into a heap, and the fastest way to select from a table is to have a table that contains only the data you need, and then take advantage of a full scan.
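As a rough illustration of the index-maintenance cost on inserts, here is a minimal sketch using Python’s built-in `sqlite3` module. Note the assumptions: SQLite has no true heap tables (every table is a rowid B-tree), so the comparison here is a plain table versus one carrying an extra secondary index; all table and column names are invented for the example. The absolute numbers will vary by machine, but the shape of the effect is the point.

```python
import sqlite3
import time

# In-memory database for the demonstration
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# "heap_like" has no secondary index; "indexed" carries one that must be
# maintained on every insert.
cur.execute("CREATE TABLE heap_like (id INTEGER, payload TEXT)")
cur.execute("CREATE TABLE indexed (id INTEGER, payload TEXT)")
cur.execute("CREATE INDEX ix_payload ON indexed (payload)")

rows = [(i, f"row-{i % 1000}") for i in range(100_000)]

t0 = time.perf_counter()
cur.executemany("INSERT INTO heap_like VALUES (?, ?)", rows)
heap_seconds = time.perf_counter() - t0

t0 = time.perf_counter()
cur.executemany("INSERT INTO indexed VALUES (?, ?)", rows)
indexed_seconds = time.perf_counter() - t0

print(f"no index: {heap_seconds:.3f}s, with index: {indexed_seconds:.3f}s")
```

On a real warehouse engine such as SQL Server, the gap is typically larger, since a heap also avoids the page-ordering work a clustered index imposes.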
This in turn tells us that the staging environment performs best when there are no indexes to maintain on the tables, and when the tables contain only the data you want to select, meaning no WHERE clause.
To achieve this, we have to filter our incremental selects at the source and truncate our staging environment before we load new data.
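The truncate-and-load pattern above can be sketched end to end. This is a hypothetical example using `sqlite3` as a stand-in for both source and staging; the table names (`source_orders`, `stage_orders`), the `updated_at` watermark column, and the watermark value are all invented for illustration, and since SQLite lacks `TRUNCATE TABLE`, an unfiltered `DELETE` stands in for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simulated source and staging tables (staging has no indexes)
cur.execute("CREATE TABLE source_orders (id INTEGER, updated_at TEXT)")
cur.execute("CREATE TABLE stage_orders (id INTEGER, updated_at TEXT)")
cur.executemany(
    "INSERT INTO source_orders VALUES (?, ?)",
    [(1, "2024-01-01"), (2, "2024-01-02"), (3, "2024-01-03")],
)

# Watermark saved from the previous load
last_loaded = "2024-01-01"

# 1) Truncate staging so it will hold only the current increment
cur.execute("DELETE FROM stage_orders")

# 2) Apply the incremental filter at the SOURCE, then insert into the
#    bare staging table -- downstream reads need no WHERE clause at all
cur.execute(
    "INSERT INTO stage_orders "
    "SELECT id, updated_at FROM source_orders WHERE updated_at > ?",
    (last_loaded,),
)
conn.commit()

# Staging now contains only the new rows, ideal for a full scan
print(cur.execute("SELECT id FROM stage_orders ORDER BY id").fetchall())
# → [(2,), (3,)]
```

The key point is that the only filtering happens once, against the source; every later read of staging is an unfiltered full scan of exactly the rows it needs.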