Prologue
How many times have you heard something like “we need to implement a couple of microservices for this task”? Or, maybe, “let’s keep the telemetry from production and development environments on separate servers”? These sometimes seem like the most logical things to do, but they turn out to be really annoying and time-consuming in the long run. In this post, I’ll share some examples of what I call overisolation — the phenomenon of separating things for no good reason.
Examples of overisolation
To get a better understanding of what exactly I’m talking about, let’s look at some architectural mistakes that I consider to be typical examples of overisolation.
VDC per department in a cloud
Some of my previous workplaces had separate virtual datacenters, and even separate accounts, for different departments. While perhaps convenient for billing (it’s relatively easy to control costs when each department gets its own bill), this turned out to be absolute hell to maintain. Say you have a data collection pipeline owned by the data engineering team and a Jira server owned by the IT support team. One day the data folks decide to create Jira tickets automatically in one of their DAGs. This wouldn’t normally be a problem, but since everyone lives in their own VDC, some additional requirements need to be met beforehand:
- VPC peering (with potential IP address conflicts) or mTLS (with the need to issue and monitor certificates);
- routes;
- firewall rules;
- etc.
A single, relatively simple task of calling an API becomes a major pain, potentially requiring days of preparatory work that could be avoided altogether. Now imagine what happens as the number of departments and their IT systems grows, and so, inevitably, do the number and complexity of potential integrations. The only way for the company to untangle itself from such a mess is to spend ungodly amounts of resources either migrating everything to a single VDC or automating the interconnections — both of which, I assure you, are equally hard and daunting.
Most often, a better approach is to keep everything together, using network policies for isolation and tags for billing.
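As an illustration of that alternative, here is a hedged sketch in Kubernetes terms: the teams share one cluster, a NetworkPolicy provides the isolation, and labels drive cost attribution. All namespace, label, and port values below are made up for the example.

```yaml
# Hypothetical setup: two teams share one cluster.
# Namespaces + NetworkPolicies give isolation; labels give billing.
apiVersion: v1
kind: Namespace
metadata:
  name: data-engineering            # made-up team namespace
  labels:
    cost-center: data-eng           # tag consumed by the chargeback/billing tooling
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-jira-clients
  namespace: it-support             # Jira lives here in this sketch
spec:
  podSelector:
    matchLabels:
      app: jira
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              cost-center: data-eng # let the data team's DAGs call the Jira API
      ports:
        - protocol: TCP
          port: 8080                # Jira's default HTTP port
```

Opening this path for a new integration is a one-object change, instead of peering, routes, and firewall tickets across VDCs.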
Unique resources for each development environment
Another approach I’ve seen in quite a number of companies. Instead of keeping and managing just two separate sets (production and non-production) of resources such as Kubernetes clusters, databases, brokers, and so on, people often create separate entities for every single environment (production, staging, multiple development environments, and so on) and every service. Sometimes this stretches even further, extending to observability tools like log and metric databases. What happens next? Pain for both development and operations teams, as well as anyone in between.
Operations teams spend more time automating routine tasks: just compare the difficulty of running a simple CREATE DATABASE ... on a communal cluster against bootstrapping a whole new cluster just for some new URL-shortening service.
Development teams, on the other hand, end up getting lost in the woods trying to navigate the ever growing IT landscape of the company. Where are the metrics for my service? Which credentials do I use to connect to the staging database?
A better approach is to pool and standardize resources as much as possible, separating only where it truly matters. It’s probably not the best idea to give everyone access to production databases with customer data. However, there is nothing preventing you from using the same credentials across all development instances, provided that you don’t keep any sensitive data there (and you really shouldn’t!).
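As a sketch of what that pooling can look like (all hostnames, database names, and credentials below are hypothetical), the per-environment configuration collapses to a handful of values, with only production kept genuinely separate:

```yaml
# Hypothetical per-environment config: non-production environments share
# one communal cluster and one set of credentials; only production differs.
environments:
  dev:
    db_host: pg-shared.internal     # shared non-prod Postgres cluster
    db_name: urlshortener_dev
    db_user: app_nonprod            # same credentials everywhere outside prod,
    metrics_url: http://prometheus.internal   # fine as long as no sensitive data lives here
  staging:
    db_host: pg-shared.internal
    db_name: urlshortener_staging
    db_user: app_nonprod
    metrics_url: http://prometheus.internal
  production:                       # the only place where isolation truly matters
    db_host: pg-prod.internal
    db_name: urlshortener
    db_user: app_prod               # distinct credentials, injected from a secret store
    metrics_url: http://prometheus-prod.internal
```

Developers then have exactly one non-production answer to “which credentials do I use?”, and operations provisions a new environment with a CREATE DATABASE instead of a new cluster.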
Gitflow
Not strictly an overisolation example, more of a dated approach that creates a lot of toil by introducing unnecessary entities. Introduced some fifteen years ago, this model was widely adopted by development teams and is still considered the defining standard by some professionals. For those not familiar: Gitflow requires developers to maintain the current project state in multiple git branches (develop, release, master, etc.), each with its own unique lifecycle.
Well, my experience shows that most projects need exactly two kinds of branches: one long-lived main branch and short-lived feature branches. In most cases (excluding maybe some LTS projects that require frequent backports), adding anything on top of that just wastes everyone’s time: it turns out we humans are not very good at juggling multiple parallel versions of reality, and once those start to pile up, we end up lost in cherry-picks and merge conflicts.
A more modern and healthier approach is to focus on a single (most recent) version of the code and update it frequently: this way, the complexity doesn’t grow along with the number of people working on the project.
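Here is a minimal sketch of that trunk-based flow in a scratch repository. Branch and commit names are made up, and it assumes a reasonably recent git (2.28+ for `init -b`, 2.23+ for `switch`):

```shell
# A scratch repo demonstrating trunk-based flow:
# one long-lived main branch plus short-lived feature branches.
cd "$(mktemp -d)"
git init -q -b main                     # -b needs git >= 2.28
git config user.email dev@example.com && git config user.name dev
git commit --allow-empty -qm "initial commit"

git switch -c feature/shorten-api       # branch off main for one focused change
git commit --allow-empty -qm "Add URL shortening endpoint"

git switch -q main                      # back to the single source of truth
git merge --no-ff -m "Merge shorten-api" feature/shorten-api
git branch -d feature/shorten-api       # delete it: no parallel realities to maintain
```

In practice the merge step is usually a pull request, but the shape is the same: every feature branch is born from main and dies back into it within days, so there is never a second long-lived version of reality to reconcile.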
The root of all evils
The examples above are just some of the common overisolation cases. If you’ve worked in IT, you’ve probably seen something similar more than once in your career. So what is the common thread that all of these examples share?
The best way I can describe it is:
Choosing the wrong level of abstraction to isolate things
Let’s look at the examples from the perspective of abstraction levels:
- corporate systems should have been isolated with network policies rather than accounts or VDCs;
- most environments should have been isolated with namespaces and databases rather than clusters; in some aspects (like telemetry and credentials for non-production environments), they shouldn’t have been isolated at all;
- software versions shouldn’t have been isolated either, beyond short-lived feature branches.
In all the examples, a poor choice of isolation level led to the creation of extra entities (tunnels, servers, etc.). Each of these entities requires some effort to set up and maintain. To make matters worse, the entities are interconnected, and the number of connections between them grows multiplicatively as new ones are introduced.
Avoiding overisolation
Of course, there is no single recipe for success when it comes to something as complex as IT (or any other engineering field). My best advice is, as always, to apply common sense. When designing any system, ask yourself:
What’s the lowest level of isolation that I can get away with?
Does this feature really need a new microservice, or could it be a module in an existing one? Should I create separate credentials for this database, or can I safely reuse existing ones? When unsure, choose the path of least resistance (i.e. the one with the lowest level of isolation). You can always refactor later if the level you chose turns out to be insufficient — that will be much cheaper than going the other way around.
Epilogue
Why do we overcomplicate things so often? I think the reason is our inherent need to create order from chaos. That’s why sorting things out and keeping them in separate silos so often seems like the most logical thing to do. And even if isolating everything is bad, it is still an approach, i.e. a formalized way of doing things, which, I guess, is better than complete chaos.
The best we can do to avoid such traps is to stay mindful and discuss our decisions openly and without bias, hopefully finding good compromises along the way.