Single Source of Trap
As companies find themselves drowning in data lakes, they become confused about which data to use. At first, volume brings joy. Then, gradually, you find conflicting data here and there. Enough is enough, management says. We need to make sure we only have one version of the data. The correct one. The single source of truth.
Good luck with that.
I will put this bluntly: there is no such thing as a single source of truth. It is just a mirage; a naïve solution to recurring data consistency problems. I will elaborate on the practical reasons, but the underlying cause is philosophical: the truth is subjective.
In data analytics, the subjectivity arises when we need to define things. Let’s take a simple example. We have 100 unique emails, but only 80 unique phone numbers associated with those emails. Some phone numbers are linked to more than one email, or to none at all, and vice versa.
How many users do we truthfully have? It depends on how you define a user: are we counting unique emails or unique phone numbers as “users”? For another recent case, take the bot issue on Twitter. What is the percentage of bots on Twitter? Well, it depends on how you flag accounts as bots. You can state a reasonable proportion and then find a definition that yields that number.
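To make the counting example concrete, here is a minimal Python sketch. The records and function names are made up for illustration; they do not come from any real system. It counts "users" in three ways: by unique email, by unique phone number, and by groups of records linked through a shared email or phone. The third definition is just one more plausible option, not a recommendation.

```python
# Hypothetical toy records: (email, phone). None means the value is missing.
records = [
    ("ann@example.com", "555-0001"),
    ("ann.work@example.com", "555-0001"),  # two emails share one phone
    ("bob@example.com", "555-0002"),
    ("bob@example.com", "555-0003"),       # one email with two phones
    ("cat@example.com", None),             # email with no phone
    ("dan@example.com", None),             # email with no phone
]


def count_users_by_email(rows):
    """Definition A: a user is a unique email address."""
    return len({email for email, _ in rows if email is not None})


def count_users_by_phone(rows):
    """Definition B: a user is a unique phone number."""
    return len({phone for _, phone in rows if phone is not None})


def count_users_by_linked_identity(rows):
    """Definition C: a user is a group of emails and phones that are
    connected through shared records (a simple union-find)."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for email, phone in rows:
        keys = []
        if email is not None:
            keys.append(("email", email))
        if phone is not None:
            keys.append(("phone", phone))
        for key in keys:
            find(key)  # register the identifier
        if len(keys) == 2:
            parent[find(keys[0])] = find(keys[1])  # link email and phone

    return len({find(node) for node in list(parent)})


print(count_users_by_email(records))            # 5
print(count_users_by_phone(records))            # 3
print(count_users_by_linked_identity(records))  # 4
```

Same data, three defensible answers. None of them is wrong; each one answers a slightly different question.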
In practice, this subjectivity manifests as three problems:
One-size-fits-all means a size that exactly fits no one
The main proposition of data analytics is context-based insight. The more closely it relates to the business concern, the better. Meanwhile, a single source often forces everyone to agree on one definition. Or worse, on a forced assumption that causes the least conflict. Instead of getting answers, you’ll end up with more questions.
Malicious segregation of concern
Having a single source of truth requires dedicated maintainers, often the unwitting data engineers. People are assigned solely to maintain the single source of truth. Their job description revolves around keeping everything up and running, and improving performance if they can. However, they know little, if anything, about the business purpose. In the end, work hours are spent on unfruitful debates between engineers and analysts.
People will invent back doors
Agility is a prerequisite for data analytics, because no one knows exactly what they want at the beginning. Sadly, many companies proclaim agility as one of their core values, yet bureaucracy persists.
A single door for all data looks great on paper in terms of security. In practice, though, it turns into a bottleneck. Different levels of confidentiality are blended into a single bucket, so the strictest security scenario ends up applying to all the data. Do I really need to fill out all these forms and get all these approvals? Before long, analysts will invent their own ways to get the data anyway, defeating the whole purpose of the single source of truth.
What is better then?
To be frank, I’m not sure. All I know, based on my experience, is that requiring everyone to take data only from a single source creates more problems than benefits. So don’t do that. My suggestion is to avoid technically oriented data structures and strive for business-oriented segregation of concern. Organize data in terms of business context, like datamarts. If multiple business lines share the same definition, they’ll converge on the same source. Everyone hates arguing over data consistency anyway. Simplification will emerge naturally over time.
Remember that agility goes hand-in-hand with unpredictability. When there’s conflicting data, let the analysts debate which definition is more suitable for the scenario. If there are two definitions with different assumptions, check the assumptions. Both definitions can still be correct. Keep in mind, the truth is subjective.
I would rather be vaguely right than precisely wrong.
- John Maynard Keynes