A Song of Snow and Bricks
Who’s going to conquer the SQL kingdom?
Business battles tend to converge into a duopoly: Nike vs Adidas, Pepsi vs Coca-Cola, Visa vs MasterCard. Tech is no exception. If anything, the pattern is even stronger in tech. The duopoly of Google and Apple in the mobile space has been going on for more than a decade and likely won’t change anytime soon. I’m not going to discuss Google vs Apple today, but a similar contest is emerging in the data management space, and it revolves around the same trade-off: flexibility vs simplicity. Say hello to our contenders: Databricks on the red side and Snowflake on the blue side.
Before we dive into each platform, let’s set the stage by briefly defining the data management world. Every company has a collection of data and a place to store it. Small companies can store their data on paper or in spreadsheets. As they get bigger, finding specific data or analysing it to answer a business question gets more challenging. That’s where data management tools come in, more specifically, databases. There are two types of database based on purpose: transactional and analytical. The space that I focus on is the (very) large analytical databases, commonly known as data warehouses.
There was a time when people used the same database tech for both transactional and analytical work. Understandable, since the proven databases were transactional-minded, like Oracle Database or IBM DB2. They still worked, so there was no need for new tech, right? Oh, big data begs to differ. The Volume of big data slows down analysis significantly. Data can also come in a Variety of formats, like long texts, images, and voice recordings, which those databases wouldn’t know how to handle. Finally, queues build up because writing to the database is slower than the Velocity at which new data arrives.
Big data translates to big problems, and big problems translate to big opportunities. At first, new databases arose claiming to be more big data friendly. The most notable were NoSQL databases, like MongoDB, which chose to compromise on the traditional database rules in favour of handling the 3Vs of big data (regarding the rules, see: ACID vs BASE vs CAP). While the concept resolved the big data issue for transactional purposes, performing analysis was still too slow. What can I say, data analysts are difficult to satisfy.
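In retrospect, we needed a brand new database to support the analytical purpose. Or did we? One word: columnar. I love solutions that are so simple, I feel silly after I discover them. I won’t go into detail, but the gist is this: analytical queries usually read a few columns across a huge number of rows, so a database that stores each column together only has to read the data the query actually needs. With a columnar DB, analytical queries become much more efficient.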
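For the curious, here is a toy sketch of the columnar idea in Python. Everything in it is made up for illustration: the sales table, its column names, and the helper functions are my own assumptions, not how Redshift or any real engine is implemented. Real columnar databases add compression, vectorised execution, and plenty more on top.

```python
# A hypothetical sales table, stored row by row (illustrative data only).
rows = [
    {"order_id": 1, "customer": "Arya", "region": "North", "amount": 120.0},
    {"order_id": 2, "customer": "Sansa", "region": "North", "amount": 80.0},
    {"order_id": 3, "customer": "Tyrion", "region": "West", "amount": 200.0},
]

# The same table stored column by column: one list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_amount_row_store(rows):
    """Sum 'amount' from a row layout; every whole row has to be visited."""
    return sum(row["amount"] for row in rows)


def total_amount_column_store(columns):
    """Sum 'amount' from a column layout; only one column is visited."""
    return sum(columns["amount"])


def values_scanned_row_store(rows):
    """Rough cost model: a row store reads every field of every row."""
    return sum(len(row) for row in rows)


def values_scanned_column_store(columns, needed):
    """Rough cost model: a column store reads only the needed columns."""
    return sum(len(columns[name]) for name in needed)


if __name__ == "__main__":
    print(total_amount_row_store(rows), total_amount_column_store(columns))
    print(values_scanned_row_store(rows))
    print(values_scanned_column_store(columns, ["amount"]))
```

Both layouts give the same answer, but in this rough cost model the row layout touches every field of every row while the column layout touches only the `amount` column. Scale that from three rows and four columns to billions of rows and hundreds of columns, and the gap becomes the whole point.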
The first time I encountered a columnar DB was through Amazon Redshift. Actually, the columnar concept had been around for years before Redshift. It is often lumped in with the NoSQL family (strictly speaking, NoSQL’s wide-column stores are a different design), but it was much less known than the other types. So why did it only take off with Redshift? I posit that the catalyst was cloud computing. You see, at the same time as the big data boom, cloud computing was the other buzzword. At that time, cloud computing was practically synonymous with Amazon’s AWS. Since companies were already exposed to the AWS stack, Redshift was on the menu. Compare this to the era of traditional databases, when no one would have even considered the new kids on the block.
Not wanting to miss the opportunity, Google soon followed suit with BigQuery and Microsoft with Azure Synapse. In most cases, people just chose according to which cloud flavour they liked more. Google’s GCP, for example, is better known for its data science capabilities. In terms of data warehouse features, there was not much difference. Barring the hassle of occasional integration issues from one cloud provider to another, everything seemed fine. However, as the saying goes, it is always the calmest before the snowstorm.
As a bastard son of Lord Stark of Winterfell, Jon Snow lives his life believing he is unwanted. Unbeknownst to him, he has a rightful claim to the Iron Throne and is destined to rule the Seven Kingdoms. Will he succeed or not? Well, I am not going to spoil it here, but hardly anyone who watches Game of Thrones hates Jon Snow. Besides his obvious charming presence and delightful accent, people love the story of underdogs like Jon Snow. That’s also probably why people love the story of another Snow: Snowflake.
Snowflake emerged amid the clash of the cloud titans: Amazon, Google, and Microsoft. I only heard about Snowflake very recently, but since then, I have heard its name a lot. Like, too much actually. Everyone has been talking about it, like the Game of Thrones fans when they were theorising about Jon Snow’s parents (Remember R+L=J, anyone?).
Frankly speaking, when I first noticed the hype, a big question mark popped in my head. I just couldn’t seem to grasp what was so special about it. To anyone who is familiar with data warehouse tools, Snowflake offers nothing novel. It is just a data warehouse. Those who idolise it would speak about cost-saving and how the performance is blazing fast (an overused term nowadays). Yet, those kinds of traits tend to be temporary. There will always be something cheaper and faster in the future. I was left buried under the snow.
Only after completing Snowflake’s (free) self-paced online course could I confirm my suspicion: it is indeed just a data warehouse, but that’s not what they’re selling. Their play is about simplicity. Before I elaborate on why simplicity matters, it’s time to introduce the other side of the ring.
I had heard of, and even used, Databricks well before Snowflake, around 2016. Hands down, Databricks trumps Snowflake in terms of first impression. It was much easier to answer what’s novel about it: the flexibility of a data lake wrapped in the robustness of a data warehouse. Quite mind-blowing. Same feeling as when I encountered the columnar concept. What’s more, Databricks is built with various open-standard technologies. Whenever I am curious about how a component works, I just need to read the documentation. I felt like Daenerys, rising from the fire with three hatched dragon eggs.
Fast-forward to 2022, the dragons are rarely found flying in the sky. In fact, Snowflake is the cool kid now. Let’s go back to the power of simplicity. Dragons, while powerful, are scary. Databricks has a lot of features that enable us to achieve things the way we want. Snowflake, on the other hand, is like a one-trick pony: if you wanna do something, there’s just one way to do it. Counterintuitive as it may seem, the latter wins.
People talk about how they love flexibility, but when presented with it, they scratch their heads. As with the paradox of choice, flexibility is only good up to a certain degree. Snowflake successfully reinvented the data warehouse by bringing back the old ways of doing things. Oh, you like Python? Too bad, let’s just do everything in SQL. You want graphical job orchestration? Here you go, stored procedures and cron jobs. Apparently, people like being told there are no other options. We prefer to just get things done.
One more question left: if Snowflake is just a data warehouse, why is it better than Redshift or BigQuery? Again, choices. Snowflake can sit on top of the three biggest cloud providers: Amazon AWS, Google GCP, and Microsoft Azure. Wait, then we have to choose? Not really, it is the illusion of choice. Put it this way: whatever cloud provider you already use (or may use in the future), you can continue to use it. Whereas if you consider Redshift or BigQuery and you’re not on AWS or GCP, good luck with the migration.
Jon Snow left King’s Landing with his heart shattered. He may have won, but at what cost? The blood of Daenerys, his lover, is on his hands. Daenerys Targaryen can withstand any kind of fire without getting burned, except for one: her fiery love for Jon Snow. The almighty dragons can’t help you. It is no longer about who is the most powerful, but who is the most loved. Jon Snow knows nothing, and that’s why everyone loves him.
Disclaimer: All opinions are my own and not sponsored by anyone.