Architectural principles of a DevOps team

Trying to tame complexity

Over the last few years my team (“DevOps” – the name’s not entirely accurate, but close enough) have been putting together a handful of charter-like documents that we can agree on and work to. This isn’t an especially new idea and we shamelessly cribbed the initial version from one of the other teams. The reasoning was simple enough: we were finding it increasingly difficult to bring our convoluted internal systems in line with the changing demands of the business, and setting ourselves standards we would adhere to was a basic part of fixing that complexity problem.

This exercise has coalesced over time into three documents – Coding Standards, Development Practices and Architectural Principles. They’re all living documents and we update them now and again as our opinions and understanding change. The first two are fairly pedestrian and contain pretty much exactly what you’d expect in terms of things like how we prefer to format code and which tools and techniques we prefer to use.

I think the third one, though, is a little less common. It’s not something I’ve seen in many places, and yet in some ways it turns out to be the most frequently useful. We’re also trying to word it carefully. There is a nuance we’re trying to achieve in the way we present it – it’s a set of principles we aspire to, not a set of hurdles we’re setting ourselves.

Just writing things down

We’ve broadly identified what we think are our most significant problems in working with the intertwined systems and tried to tease out guidelines that will help us untangle them over time. This helps with all the good stuff such as frequent deployments, overall system robustness, monitoring and metrics, and perhaps most of all understandability.

It’s not something we were deliberately aspiring to when we wrote them, but just writing things like this down is a useful step towards improvement. It gives you a description of what you currently consider to be best practice, which is a basic necessity for working out whether you live up to that best practice, figuring out how you will if you don’t, and – most importantly – reviewing and revising what ‘best practice’ actually is.

Here they are:

We like the single responsibility principle

We aim to divide things, at all levels, into isolated parts each of which is named in terms of the one job it does. The goal is to be able to look at an individual component and understand from its name the responsibility it has. Simultaneously, when looking at a system as a whole and searching for where some business logic or responsibility is contained, the names of the components should guide us straight to it.

Whether at the method, class, project or service level, we should always aim to consider whether the thing we’re producing is “as simple as possible, but no simpler”.
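To make that concrete, here’s a minimal sketch (in Python, with entirely hypothetical class names) of the naming idea: each component is named for the one job it does, so a search for a responsibility leads straight to it rather than into a catch-all manager class.

```python
# Hypothetical sketch: a reader asking "where do we calculate VAT?" is
# led straight to VatCalculator, not a sprawling InvoiceManager.

class VatCalculator:
    """Knows only how to compute VAT for a net amount."""

    def __init__(self, rate):
        self.rate = rate

    def vat_for(self, net_amount):
        return round(net_amount * self.rate, 2)


class InvoiceTotaliser:
    """Knows only how to total an invoice's line amounts."""

    def total(self, line_amounts):
        return sum(line_amounts)
```

Each class could hardly be simpler, and that’s the point: the name tells you its one responsibility, and there is nothing else hiding inside.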

We like asynchronous events

If there is a chain of processing steps to be taken, we would prefer that these are decoupled into separate ‘services’ each listening to the ‘completed’ events output by the previous ones. This aligns with the single responsibility principle and also gives us robustness, flexibility of deployment, metrics & monitoring and freedom to easily alter or add to the processing chains in the future.

We name events in terms of what has just happened (past tense) not what we hope/expect to happen next. E.g. “Invoice.Created” not “Email.Invoice.To.Customer”.
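A tiny in-memory sketch of the shape we’re after – a real system would use a proper message bus, and the handlers and event payloads here are hypothetical – showing two independent services hanging off the same past-tense event:

```python
from collections import defaultdict


class Bus:
    """Toy stand-in for a message bus: name -> list of subscribers."""

    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, event_name, handler):
        self.handlers[event_name].append(handler)

    def publish(self, event_name, payload):
        for handler in self.handlers[event_name]:
            handler(payload)


bus = Bus()
actions = []

# Each subscriber does one job, and each listens for what *has happened*
# ("Invoice.Created"), not for what should happen next.
bus.subscribe("Invoice.Created", lambda inv: actions.append(f"email sent for {inv['id']}"))
bus.subscribe("Invoice.Created", lambda inv: actions.append(f"crm updated for {inv['id']}"))

bus.publish("Invoice.Created", {"id": "INV-1"})
```

Adding a third step to the chain later means subscribing a new handler – the publisher, and the existing handlers, never change.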

We think carefully about boundaries and seams

We aim to focus most on the boundaries between two systems. These are the places which are difficult to fix when mistakes are found.

More than anything else – algorithms, technology choices, or whatever – the seams between systems will be the defining qualities of the systems we build.

Whether this applies to Events on the bus, API calls or C# Interfaces, we should focus our thinking and reviewing effort on this area.

We don’t leak implementations or domains

We try to keep a clean separation between the internal domain representation a system has and the interface it offers, even if this means potentially duplicating classes and mapping between them.

Similarly, we make an effort not to expose the implementation details of the service we’re creating through its interfaces. It should look as close to an ‘ideal’ service for its consumers as possible, and the interface design should allow for as much implementation re-work as possible.
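A minimal sketch of the kind of mapping we mean, with hypothetical field names: the internal record carries implementation details, while the DTO offered at the interface does not, even though the two look duplicated.

```python
from dataclasses import dataclass


@dataclass
class InvoiceRecord:
    """Internal representation: free to change as the implementation evolves."""
    id: str
    net_pennies: int        # stored as integer pennies internally
    legacy_batch_ref: str   # implementation detail, never exposed


@dataclass
class InvoiceDto:
    """What consumers see: named and shaped for their needs, nothing more."""
    id: str
    net_amount: float


def to_dto(record: InvoiceRecord) -> InvoiceDto:
    # Explicit mapping: a little duplication, but the seam stays clean
    # and the internal storage format can be reworked freely.
    return InvoiceDto(id=record.id, net_amount=record.net_pennies / 100)
```

The mapping function is the only place that knows both shapes, so a change to the internal record touches one function rather than every consumer.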

We dislike using the database as an interface

When two apps share data through the database we have a subtle and difficult-to-manage coupling. Data should be segregated and strongly owned by individual systems. If one system needs information from elsewhere it should use normal, open methods (e.g. API calls) to retrieve it.

Ideally, external information should arrive at a system as part of its received Events, so it will not need to ‘import’ data from anywhere else.

Also, we try not to merge concerns in the data. If extra information is required for reporting (for example) then we can store that as part of a separate service and merge them in a data warehouse managed by BI. If we are capturing some user behaviour (e.g. a download) and will later be running stateful processing on that, then we should again separate those two and not combine the event-recording and process-state in the same table (or even the same database).
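As a rough sketch of keeping those two concerns apart – in-memory structures standing in for separate tables, all names hypothetical – the behaviour record is append-only while the process state is derived from it and mutated separately:

```python
# Append-only record of what happened: never mutated, safe to replay.
download_events = []

# Derived, mutable process state: kept entirely separate from the record.
downloads_per_user = {}


def record_download(user, filename):
    """Capture the behaviour itself, and nothing else."""
    download_events.append({"user": user, "file": filename})


def apply_event(event):
    """Stateful processing, driven off the event record."""
    user = event["user"]
    downloads_per_user[user] = downloads_per_user.get(user, 0) + 1


record_download("alice", "report.pdf")
record_download("alice", "summary.pdf")
for event in download_events:
    apply_event(event)
```

Because the record and the state are separate, the stateful processing can be rebuilt, changed or rerun later without losing what actually happened.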

We test logic and behaviour

And we prefer to unit test.

A system (or unit) will gather data from various sources, combine that data by making decisions, and produce an output.

We should aim to be able to test that decision making since this will represent the bulk of the behaviour of the system. This should encompass both the ‘happy path’ scenarios and error conditions.

The implication of this is that there will be a ‘core’ block of code which can act in a pure functional manner and this core can be tested against the expected behaviour of the system.

We push non-deterministic code (e.g. getting current time) towards the outside so the core remains deterministic and testable.
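The classic example is the clock. Here’s a sketch, with a hypothetical overdue check, of pushing ‘now’ to the edge so the decision-making core stays pure and directly testable:

```python
from datetime import datetime, timezone


def is_invoice_overdue(due_date, now):
    """Pure core: same inputs always give the same answer, so it's
    trivially testable against both happy-path and edge cases."""
    return now > due_date


def check_overdue(due_date, clock=lambda: datetime.now(timezone.utc)):
    # Non-deterministic edge: the real clock lives out here and is
    # injected, so tests can pin "now" to any instant they like.
    return is_invoice_overdue(due_date, clock())


# In a test we simply hand the core a fixed "now":
due = datetime(2024, 1, 1, tzinfo=timezone.utc)
overdue = is_invoice_overdue(due, datetime(2024, 6, 1, tzinfo=timezone.utc))
not_yet = is_invoice_overdue(due, datetime(2023, 6, 1, tzinfo=timezone.utc))
```

The same pattern applies to random numbers, environment lookups and I/O: gather them at the boundary, pass the values in, and the core remains deterministic.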

Performance isn’t a problem until it’s proved to be

… and your solution isn’t a solution until it’s proved to be, either. Performance is a very tricky and complicated subject and naive approaches often cause time to be sunk into worthless – even detrimental – code. So, before you decide there’s a performance problem, observe it in a fairly real-world scenario and assess whether it’s really a problem.

If it is, and you have to make a fix, check the numbers to make sure that whatever you did really improved the situation.

Data changes over time and we need to track it

Traditional database design tends to hold the “current” state of a system and mutate that state over time. While this is convenient for simple reporting and performance, it loses a significant amount of information, making later questions difficult to answer and data corrections often impossible.

Once this problem has been spotted, the normal approach is to bolt on some kind of ‘audit’ table. However, these typically capture only a small number of scenarios and can be difficult to maintain.

For preference, we should build change-tracking into our data systems from the outset – for example, holding a separate row for every historic version of an entity rather than just representing the current state.
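A toy sketch of the row-per-version idea, using an in-memory list in place of a table (all names hypothetical): writes append a new version instead of mutating in place, so history survives by construction.

```python
# Every version of every entity, append-only: a stand-in for a
# versioned table.
history = []


def save(entity_id, data):
    """Append a new version rather than overwriting the old one."""
    version = sum(1 for row in history if row["id"] == entity_id) + 1
    history.append({"id": entity_id, "version": version, **data})


def current(entity_id):
    """The 'current' state is just the latest version."""
    rows = [row for row in history if row["id"] == entity_id]
    return rows[-1] if rows else None


save("cust-1", {"email": "a@example.com"})
save("cust-1", {"email": "b@example.com"})  # a correction, not an overwrite
```

Questions like “what was this customer’s email last March?” or “when did it change?” stay answerable, because no write ever destroys information.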

A word of warning here: it is very tempting to baulk at the performance implications of solutions like this. See the note about performance, above.

Errors and metrics are also Events

We raise logging information and metrics on the queue like all other system events. They can then be handled in the same way as all other system behaviour.

We’re moving away from having metrics as separate messages: if an event happened we can put it on the bus and then the event is its own metric. We don’t, in general, want to be putting messages on the bus purely as metrics.

Ok, so perhaps some of those don’t quite belong under the heading of “Architectural”, and some definitely overlap with the Coding and Development documents, but this is a matter of building consensus, so we’re comfortable with that.

It’s entirely likely we’ll change our minds about some of these over time, but they seem to be working for us at the moment.

Guided thinking

One interesting issue we occasionally run up against is with people trying to be overly prescriptive, or trying to put hard boundaries around these principles. As engineers we have a tendency to want to see things like this as a set of rules which we can use to strictly reason about the world. It’s a kind of habit we naturally fall into. We’re writing these principles in direct contravention of that expectation – they are meant to guide thinking on the subject, not replace it. They are neither an algorithm for success nor a boundary within which success is defined. They’re just a tool we can use to help us with the work of thinking-about-system-development.

They change the tone of discussions around technical issues. For example, they turn the question “should we put an event on the bus and write a new service that handles it?” into the more loaded, but probably easier to answer, question “do we have a good reason for not putting a message on the bus and writing a new service?”. It’s somewhat easier to answer an asymmetrical technical question like this: can we justify breaking our principles here? No? Then the answer’s clear.

Guided changes

It’s also true that we’re working with systems which have been built, over many years, on completely different principles, and so at times we’re in a difficult situation. We often find ourselves having to move in the direction of the ideal goals outlined here while being unable (usually for time or scope reasons) to re-architect things the way we’d like.

As an example, most of the invoicing, quoting and shopping cart related processes are handled by ‘legacy’ code – a mixture of C# libraries and stored procedures baked into web front ends of various flavours. A while back we decided to introduce some more complicated transactional emailing capability and as part of that we raised a set of events such as ‘invoice.created’. Most of the processing didn’t change, but now we had an asynchronous event we could hang email processing off, separated from everything else. When we later came to change our CRM system and needed to perform some updates based off the invoicing process we had a ready-made integration point we could use, and we deployed these process changes without having to touch any of the existing system – no code changes, no redeployments.

Does it make a difference?

So far following these principles does seem to be slowly pulling the systems we’re responsible for into a place we can more easily work with and understand. Our confidence that the systems are robust is building (evidence from metrics is broadly supporting that gut feeling). Generally our ability to understand the behaviour of the systems (both desired and real!) seems to be improving too.

In some ways that’s not the point. We’ve demonstrated that we can use principles to think about our processes and analyse them against some ideals and that, I think, is a more important outcome.