The Distributed World built by DevOps – #GOTOAMS 2018

Distributed computing is the challenge of today, and at the GOTO Amsterdam 2018 conference there was quite some talk about it. The practices needed to build and operate these systems also got a lot of attention. This blog post contains some of my takeaways from various talks.

Distributed Systems are too hard

In his keynote, Brendan Burns, co-founder of Kubernetes, talked about all the things we need to do to deploy an application on Kubernetes. Just as in the past, we need to develop tools and patterns to make this process easier. In his view, the tools to configure and deploy distributed applications should be a formal programming language.

Currently, to deploy an application you need to know Dockerfile syntax, the Docker CLI, kubectl, YAML, and so on. He showed his experiment Metaparticle, which aims to make the deployment configuration of your application part of its source code. That way, people only need to understand the concepts of distributed deployments and do not have to learn the various tools and (configuration) languages that are needed today.
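To make the idea concrete, here is a minimal sketch of deployment-as-code in Python. The containerize decorator and its parameters are my own illustrative assumptions, not the actual Metaparticle API; they only show what it means for replica counts and ports to live next to the application code.

```python
# Illustrative sketch only: the decorator and its parameters are assumptions
# that mimic the Metaparticle idea (deployment config inside the source code),
# not the real Metaparticle library.
from http.server import BaseHTTPRequestHandler, HTTPServer


def containerize(replicas, ports):
    """Hypothetical decorator: instead of just running main(), build a
    container image and deploy `replicas` copies listening on `ports`."""
    def wrapper(main):
        def deploy():
            # A real implementation would build an image and submit a
            # Deployment and Service to the cluster here.
            print(f"deploying {replicas} replicas exposing ports {ports}")
            main()
        return deploy
    return wrapper


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"hello from a replicated service\n")


@containerize(replicas=3, ports=[8080])
def main():
    HTTPServer(("", 8080), Handler).serve_forever()


if __name__ == "__main__":
    main()
```

The point is that everything the cluster needs to know is expressed in the language the developer already uses, instead of in a separate stack of Dockerfiles and YAML manifests.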

Distribute a Database like a CDN

Ben Darnell from Cockroach Labs demonstrated how one can build a distributed database. By assigning locality to tables, CockroachDB can place data close to where it is queried and thereby keep latency low.

Multiple replicas are stored in a distributed database cluster. Reads only happen from the master replica for consistency, but CockroachDB is smart enough to move the master to the location where most reads come from. In this way it is even possible to let the master ‘follow the sun’: when it is daytime in the US, the master is in the US; when it is daytime in Asia, the master is in Asia.

With Baidu as one of their customers, it might be a very interesting option for globally distributed systems, especially since it is queried with SQL, which makes CockroachDB compatible with almost everything.
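Because CockroachDB speaks the PostgreSQL wire protocol, an ordinary PostgreSQL driver is enough to talk to it. The sketch below assumes a locally running node in insecure mode on the default port 26257; the zone-configuration statement that pins the leaseholder to a region is illustrative and should be checked against the CockroachDB documentation for your version.

```python
# A minimal sketch, assuming a local CockroachDB node in insecure mode.
# Connection parameters and the zone-configuration syntax are assumptions
# to be verified against the CockroachDB docs.
import psycopg2

conn = psycopg2.connect(host="localhost", port=26257, user="root", dbname="defaultdb")
conn.autocommit = True

with conn.cursor() as cur:
    # Plain SQL: any PostgreSQL-compatible driver or ORM can be used.
    cur.execute("CREATE TABLE IF NOT EXISTS users (id INT PRIMARY KEY, name STRING)")

    # Hint where the leaseholder (the 'master' replica that serves reads)
    # should live, so reads are answered close to the users.
    cur.execute("""
        ALTER TABLE users CONFIGURE ZONE USING
            num_replicas = 3,
            lease_preferences = '[[+region=us-east1]]'
    """)

    cur.execute("SELECT id, name FROM users")
    for row in cur.fetchall():
        print(row)

conn.close()
```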

Uncoupling

Coupling is everywhere; in fact, your systems would not do very much if they were not coupled in some way. In software development coupling has a bad reputation, but in other disciplines it is often a good thing. Michael Nygard takes a more nuanced view on coupling and distinguishes different types:

  • Operational coupling: one cannot run without the other
  • Development coupling: one cannot change without the other
  • Semantic coupling: one needs to use the same concepts as the other
  • Functional coupling: one shares the same responsibility as the other
  • Incidental coupling: one unexpectedly breaks when the other breaks

Only incidental coupling is always bad. The other types of coupling are often necessary or at least not harmful. An application that can only function when its database is up is not necessarily bad design. Likewise, the fact that SMTP libraries all use the same semantics is not necessarily bad: the SMTP protocol is well understood and unlikely to change.
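A small sketch of what incidental coupling can look like in practice: a report endpoint that also calls an unrelated recommendation service. Without a timeout and a fallback, an outage of that service unexpectedly takes report generation down with it. The service URL and the load_items helper are hypothetical.

```python
# Sketch of incidental coupling and how to defuse it. The URL and load_items
# helper are made up for illustration.
import requests

RECOMMENDATIONS_URL = "http://recommendations.internal/api/suggestions"


def load_items(user_id: int) -> list:
    # Operational coupling to the primary database is accepted here:
    # without the database there simply is no report to build.
    return [{"id": 1, "owner": user_id}]


def build_report(user_id: int) -> dict:
    report = {"user_id": user_id, "items": load_items(user_id)}
    try:
        # A bounded timeout plus a fallback keeps a failure of the
        # recommendation service from breaking reports as well.
        resp = requests.get(RECOMMENDATIONS_URL, params={"user": user_id}, timeout=0.5)
        resp.raise_for_status()
        report["suggestions"] = resp.json()
    except requests.RequestException:
        report["suggestions"] = []  # degrade gracefully instead of failing
    return report
```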

It is important to examine your architecture, look carefully at the different types of coupling, and ask yourself whether you actually have a problem.

Developer eXperience

When we transition from a monolithic application to a serverless landscape with thousands of little services and APIs, we are going to experience the problem of discoverability. How can we make our API fun to use and easier to find?

Graham Brooks made the case to focus more on DX, Developer eXperience. Typically, when a developer cannot figure out how to use an API within 30 minutes, she will move on to find another one. Therefore, specification and documentation need to focus on how to use the API.

Executable specifications, like RAML, help you write a specification that gives you a working stub and acts as documentation of your API. An alternative is Swagger, although it does not offer the same executability (such as a stub implementation).

AsciiDoc is a nice way to write documentation in a plain-text format. It can reference code (for example from your unit tests) to explain how your API works. This means that if you change your API and its unit tests, the documentation is kept up to date automatically.
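For example, Asciidoctor can include a tagged region of a test file directly into the documentation with include::test_orders.py[tag=create-order]. The test below marks such a region with tag comments; the file name, endpoint and payload are made up for illustration.

```python
# test_orders.py -- the region between the tag comments is pulled into the
# AsciiDoc documentation via:  include::test_orders.py[tag=create-order]
# Endpoint and payload are illustrative assumptions.
import requests

BASE_URL = "http://localhost:8000"


def test_create_order():
    # tag::create-order[]
    response = requests.post(
        f"{BASE_URL}/orders",
        json={"item": "book", "quantity": 2},
    )
    assert response.status_code == 201
    assert response.json()["status"] == "pending"
    # end::create-order[]
```

If the API changes, this test has to change with it, and the documentation picks up the new snippet on the next build.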

The only way to write specifications and documentation is in a plain-text format. This ensures they can always be parsed, even if the specific tooling is no longer available. Once you have a comprehensive specification and documentation, you can make them discoverable via any service you like.

Site Reliability Engineering @ Google

To optimize the balance between reliability and the development of new features, Google uses a process called Site Reliability Engineering. Christof Leng, an SRE manager, explained how the system works.

Basically, there are various feedback loops that make sure the system optimizes properly. This is done by introducing the error budget. 100% uptime is not realistic, much too expensive, and often not even necessary. When your mobile app already fails to reach the server for 1 out of every 100 calls, it is perfectly acceptable for your service itself to have an error in 1 out of every 1000 calls. The end user will not notice it; the errors drown in the background noise of other errors.

This all means that your Service Level Objective (SLO) can be less than 100%. The difference, 100% minus your SLO, is your error budget. A couple of constraints then make sure that effort is suitably divided between the SRE team and the dev team. First of all, they both hire from the same headcount: one extra SRE means one developer less. An SRE cannot spend more than 50% of their time on operations; if a system has more incidents than the SREs can handle within that time, the tickets are routed to the developer team. And developer teams can only deploy if they have a positive error budget.
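A minimal sketch of the bookkeeping behind this, with made-up numbers: the SLO defines the budget, the observed error rate consumes it, and deploys are only allowed while something is left.

```python
# Error-budget sketch; the SLO, window and request counts are invented.
SLO = 0.999                   # 99.9% of requests must succeed
ERROR_BUDGET = 1.0 - SLO      # so 0.1% of requests may fail in this window

total_requests = 40_000_000   # requests served in the current window
failed_requests = 27_500      # requests that violated the SLO

observed_error_rate = failed_requests / total_requests
budget_remaining = ERROR_BUDGET - observed_error_rate

print(f"error budget:    {ERROR_BUDGET:.4%}")
print(f"observed errors: {observed_error_rate:.4%}")
print(f"remaining:       {budget_remaining:.4%}")

# The constraint from the talk: feature deploys are only allowed while the
# budget is positive; otherwise the dev team works on reliability instead.
if budget_remaining > 0:
    print("budget positive: the dev team may deploy new features")
else:
    print("budget exhausted: no feature deploys until reliability improves")
```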

These constraints and the error budget make sure that developer teams have the greatest interest in fixing problems. If one developer spoils the error budget by committing a bug, the others cannot release their features, which makes the dev team ‘self-police’ its activities. On the other hand, the SREs can spend 50% of their time on making the system better. Everybody wins.

More in-depth information can be found in the SRE book, as there are many details that I do not have the space to cover properly here.

Breaking things on purpose

Outages are expensive. An hour of downtime can easily cost an organization many thousands of euros. You really want to proactively avoid this. Therefore it is good to actively think about the failures that might happen, test for them and fix any bugs that may arise.

Kolton Andrus, co-founder of Gremlin and former Chaos Engineer at Netflix, presented the Game Day: a day on which your team runs a scientific experiment on possible failures.

First you formulate a hypothesis about how your application might fail. Then you design a test scenario; it is very important to think carefully about the blast radius of your experiment, and to communicate very clearly to everyone that your team will be running it. When the Game Day arrives, you execute the test in an iterative loop: increase the strength of the failure until your application breaks or one of the abort criteria is met. Then fix the problems you have encountered.
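The loop itself can be as simple as the sketch below. The inject_latency, system_healthy and abort_criteria_met hooks are hypothetical stand-ins for whatever failure-injection and monitoring tooling you use.

```python
# A minimal Game Day loop: ramp up the failure until the application breaks
# or an abort criterion is met, then always clean up. All hooks are stubs.
import time


def inject_latency(ms: int) -> None:
    """Stub: call your failure-injection tooling here (limit the blast radius!)."""


def remove_latency() -> None:
    """Stub: undo the injected failure."""


def system_healthy() -> bool:
    """Stub: check health endpoints / dashboards; return False once it breaks."""
    return True


def abort_criteria_met() -> bool:
    """Stub: e.g. real customer impact observed or error budget burning too fast."""
    return False


def run_game_day(max_latency_ms: int = 2000, step_ms: int = 100) -> None:
    try:
        for latency_ms in range(step_ms, max_latency_ms + 1, step_ms):
            print(f"injecting {latency_ms} ms of latency into the dependency")
            inject_latency(latency_ms)
            time.sleep(60)  # let the system settle and observe the effect

            if abort_criteria_met():
                print("abort criterion met, stopping the experiment")
                break
            if not system_healthy():
                print(f"application broke at {latency_ms} ms: record it and fix it")
                break
    finally:
        remove_latency()  # always remove the injected failure


if __name__ == "__main__":
    run_game_day()
```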

Compared to Netflix’s Chaos Monkey, the advantage is that this procedure is not random. It is much easier to reason about what happens when you conduct the experiment under controlled circumstances.

Once you have increased your confidence in your environment, you should really start running your tests in production. This is the ultimate proof of confidence.

The new era of computing

As we have seen, great advances are being made in how we run and operate our IT. These takeaways from the GOTO conference are a nice illustration of that.

NOTE: I will update this post later with links to the talks.