Snowflake has taken the cloud analytics storage and compute market by storm, and for good reason. The time to market, the simplicity and the cheap storage are all appealing when choosing a storage and compute engine for analytics. In this review we will dive a little deeper into the software and what it has to offer.
When taking a look at Snowflake, one thing stands out: the simplicity. Snowflake has obviously put a lot of resources into making its software one of the simplest storage solutions out there. It does this by automating partitioning and maintenance, removing indexes, and simplifying and limiting data types. The steps Snowflake has taken here are important for the customer, as they lower the threshold for becoming a database developer and increase development speed, resulting in a much lower development cost for the product being built.
The simplicity, however, does not come without a cost. With fewer options to optimize your compute, you are often left with increasing the compute power, and thereby the cost, of the computation.
Snowflake has separated storage and compute, meaning that while not using the solution you are only paying for your storage. This lowers the cost and makes it ideal for solutions where usage varies throughout the day and throughout the month. When you open a session on Snowflake you spin up a pre-defined cluster of nodes that cooperate to complete your query. This is done through two well-known principles that are common to MPP (massively parallel processing) systems like Snowflake: hash distribution and broadcasting of data.
Broadcasting and hash distribution work together to give you the best performance when querying. Datasets that are small enough to fit into the memory of a compute unit (node) are copied and broadcast to the different nodes responsible for the compute, while bigger datasets are split into equal sizes by a hash distribution, resulting in a small partition of data on each node. This makes it possible for Snowflake to join the bigger dataset against the smaller one without any data movement between the nodes, since every node holds a full copy of the smaller dataset and can resolve its part of the join locally.
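To make the two strategies concrete, here is a hedged sketch in Snowflake SQL. The table names are hypothetical; the point is that the optimizer will typically broadcast the small dimension table while the large fact table stays hash-distributed, and `EXPLAIN` lets you inspect the chosen plan without executing the query:

```sql
-- Hypothetical tables: "sales" is a large fact table, "stores" a small dimension.
-- The optimizer will typically broadcast "stores" to every node, so each node
-- can join its hash-distributed slice of "sales" locally.
SELECT st.region, SUM(s.amount) AS total_amount
FROM sales AS s
JOIN stores AS st ON s.store_id = st.store_id
GROUP BY st.region;

-- Inspect the plan (including the join strategy) without running the query:
EXPLAIN
SELECT st.region, SUM(s.amount) AS total_amount
FROM sales AS s
JOIN stores AS st ON s.store_id = st.store_id
GROUP BY st.region;
```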
But what about joining two big tables that both require a hash distribution?
Well, hashing the join key makes it probable that the join finds a match on the node that the data is distributed to, but it is far from guaranteed, resulting in what is called shuffling of data between the nodes, which generates network traffic and slows down the query. Shuffling is issued when a node does not find a match between the two datasets that are initially distributed to it; the data is then sent over to other nodes in search of matching rows. This is of course a well-known problem for MPP systems in general, but on Snowflake we do not have indexes and are left with micro-partitioning and pruning on the storage side to minimize the data loaded into each node, which can make the issue a bigger concern for Snowflake than for other MPP systems like Synapse. This is definitely something to be aware of, and if you also have complex joins in your query, you are in for a coffee break the moment you hit execute.
On the storage side, Snowflake applies columnar storage and compression to all its data. By doing this, Snowflake compresses your data really well, minimizing the cost on the storage side as well as making room for great processing power. The columnar storage engine is built for analytic queries and is ideal for data warehousing.
On the compute side, Snowflake lets you pre-define compute clusters that it somewhat misleadingly calls warehouses. The warehouses can be scaled horizontally and vertically as long as your bank account can afford it. They are spun up at the point of execution and can be kept on or shut down automatically. While a warehouse is running, data is cached locally on the nodes and query results are kept in the result cache, making the exact same query return its result instantly. Note that the query has to be exactly the same as the previous one to utilize the result cache (try it out yourself and change a lower-case letter to an upper-case one). The local cache is cleared when the warehouse is suspended, so you have to weigh the value of the cache against the cost of keeping the cluster running.
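A warehouse definition covering both scaling directions and auto-suspend could look like the following sketch (the warehouse name, size and thresholds are examples, not recommendations):

```sql
-- Hypothetical warehouse definition.
CREATE WAREHOUSE IF NOT EXISTS reporting_wh
  WAREHOUSE_SIZE    = 'MEDIUM'  -- vertical scaling: bigger nodes
  MAX_CLUSTER_COUNT = 3         -- horizontal scaling: multi-cluster
  AUTO_SUSPEND      = 300       -- suspend after 300 s idle (clears the local cache)
  AUTO_RESUME       = TRUE;     -- wake up automatically on the next query
```

The `AUTO_SUSPEND` value is exactly the trade-off described above: a low value saves compute credits, a high value keeps the local cache warm.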
This intro to the internals can serve as a guideline for data modeling on Snowflake, which more or less follows the same principles as other MPP systems. The first thing to be aware of is that, due to the risk of data shuffling between nodes and the copying of data across multiple nodes, joins can and will be more costly than what you are used to on an SMP system. This speaks for allowing wider tables (de-normalization) than you normally would, and is further backed up by the columnar compression, which means you do not suffer as much on the storage side. However, one issue complicates this advice: a BI system sitting on top of your Snowflake solution that prefers normalized structures and is the main consumer of your data.
I do not have a definite answer to this issue, but in previous projects it has been solved by implementing a secondary consumption layer (a database schema) in the solution that is tuned towards the BI tool, with a normalized schema containing the already-implemented business rules, making room for clean selects without much logic.
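As a minimal sketch of such a consumption layer (all schema, table and column names here are hypothetical), the business rules are baked into views once, so the BI tool only issues clean selects:

```sql
-- Hypothetical secondary consumption layer for the BI tool.
CREATE SCHEMA IF NOT EXISTS consumption;

-- The business rule (here: deriving a customer status) is implemented once
-- in the view, instead of being repeated in every BI report.
CREATE OR REPLACE VIEW consumption.dim_customer AS
SELECT customer_key,
       customer_name,
       IFF(churn_date IS NULL, 'active', 'churned') AS customer_status
FROM core.customer;
```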
The second notion to address is the lack of indexes and thereby the lack of enforced constraints. In a way, this is also what makes Snowflake best in class at lowering the bar to becoming a successful database developer, but it is a significant drawback when it comes to data consistency and referential integrity. To be clear, it is not possible in Snowflake alone to enforce a constraint (with the exception of NOT NULL), so there is no way to be sure that your data and joins will not contain duplicates before they are discovered by the BI tool or the user. You can of course create checks yourself, or use an integration tool that checks this for you at the point of loading data; however, there is no way to ensure integrity at the database level, and thereby also no way to prevent deletion of referenced entities.
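The behaviour is easy to demonstrate. In the sketch below (table and column names are made up), the declared primary key is purely informational, and the duplicate-detection query is the kind of manual check you end up writing after each load:

```sql
-- The PRIMARY KEY is accepted but not enforced.
CREATE TABLE IF NOT EXISTS orders (
  order_id    NUMBER PRIMARY KEY,  -- informational only
  customer_id NUMBER
);

INSERT INTO orders VALUES (1, 10);
INSERT INTO orders VALUES (1, 20);  -- succeeds despite the duplicate key

-- A manual integrity check to run after loading:
SELECT order_id, COUNT(*) AS row_count
FROM orders
GROUP BY order_id
HAVING COUNT(*) > 1;
```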
The last modeling consideration I want to point out also stems from the fact that Snowflake does not allow indexes. This is a good thing in the sense that there is less to optimize and maintain, but it does leave us with some issues. For example, consider a situation where you would like to implement multiple keys on a table: a synthetic primary key for joining and SCD logic, a natural key for filtering, or multiple foreign keys to related dimensions. In this situation Snowflake lets you have one cluster key to cover all your joining and filtering needs. On another system you would probably have multiple indexes to cover all these purposes and ensure that the database system limits its use of IO, RAM and CPU. The result is that Snowflake will not be able to "prune" the data as well as other systems, so more data has to be lifted from disk to complete the query. There is one option: implement multiple materialized views, each with a different cluster key, mimicking the indexes. But then you end up with a lot of objects that need to be updated and that clutter your schema. The alternative is to just ignore the issue and scale up the processing units, and thereby your cost.
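The index-mimicking trick can be sketched like this (table, view and column names are hypothetical; note that materialized views require a higher-tier Snowflake edition):

```sql
-- One cluster key per table: optimize pruning for date-range filters.
ALTER TABLE sales CLUSTER BY (order_date);

-- A materialized view with a different cluster key mimics a secondary index
-- for store-based lookups, at the cost of another object to keep up to date.
CREATE MATERIALIZED VIEW sales_by_store
  CLUSTER BY (store_id)
AS
SELECT store_id, order_date, amount
FROM sales;
```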
Security and administration
Snowflake has done a lot on the security side since it started, making it possible to limit incoming IP addresses and offering Active Directory federation, multi-factor single sign-on, private link connections and data masking.
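For example, limiting incoming IP addresses is done with a network policy, roughly as sketched below (the policy name and IP range are placeholders, and applying a policy account-wide requires an administrative role):

```sql
-- Hypothetical policy: only allow traffic from one corporate IP range.
CREATE NETWORK POLICY corp_only
  ALLOWED_IP_LIST = ('203.0.113.0/24');

-- Apply the policy to the whole account.
ALTER ACCOUNT SET NETWORK_POLICY = corp_only;
```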
However, some of these features are not available on all licenses, and it is currently (at the time of writing) only possible to limit access to entities, not to objects as you may be used to. This forces you to consider whether Snowflake is secure enough for your data, and whether you can afford the more expensive licenses to get the security you need.
On the administration side, Snowflake has automated most of what you would otherwise have to do: partitioning, backup (time travel), memory management, disk handling and so on. The Active Directory integration, however, could have been better, as you currently have to code your way around syncing groups and users from Active Directory. There is also some untapped potential on the third-party modeling side, as there is no tool that I am aware of that can handle multiple database environments in one Snowflake instance without messing up the schema and its entities. This makes it harder to migrate and store code separately from Snowflake without pulling out your "custom coding Swiss army knife".
Querying and programming
Tables and views
There are three types of tables and three types of views. We have the permanent table, which offers fail-safe (backup/time travel) functionality; the transient table, which is just a simple table with nothing more to it; and the temporary table, which is created and destroyed within the session. On the view side we have regular views, secure views where the DDL is only visible to given roles, and materialized views where the data is pre-calculated and stored with a custom cluster key that can be used to optimize performance. From other database solutions you may be used to creating materialized views to speed up slow queries; however, in Snowflake a materialized view can only be built on data that resides in a single table, cannot contain UDFs or window functions, and comes with a lot of other restrictions. All this makes them a bit "useless", and we are patiently waiting for Snowflake to overcome these limitations.
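The three table types and a secure view can be sketched as follows (all object names are examples):

```sql
-- Permanent table: gets fail-safe and time travel.
CREATE TABLE customers (id NUMBER, name STRING);

-- Transient table: no fail-safe, cheaper for re-loadable staging data.
CREATE TRANSIENT TABLE staging_customers (id NUMBER, name STRING);

-- Temporary table: exists only for the duration of the session.
CREATE TEMPORARY TABLE tmp_customers (id NUMBER, name STRING);

-- Secure view: the view definition (DDL) is hidden from non-privileged roles.
CREATE SECURE VIEW v_customers AS
  SELECT id, name FROM customers;
```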
Snowflake is a great piece of equipment, but it has its pitfalls. Most of them can be dealt with through modeling techniques and third-party software, but it is certainly not for everyone.
If, however, you are building an analytical solution like a data warehouse, and you consider the database development cost to be a significant part of the total cost of the solution (and you probably do), Snowflake should at least be considered. Just bear in mind that the automated and simple solution may leave you with a greater cost on the consumption side than what would be possible on comparable solutions, provided you can find qualified personnel to develop on those more complex platforms.
If you would like to find out more about Snowflake, take a look at https://www.snowflake.com/