You sunk my Battleship! Distributing metadata in HA clusters.

By Chris Kranz | Archive & DR

Who doesn’t know the game Battleship? Oh, okay then, Battleship is a game where you distribute your ships across a board in a random fashion, in a way that you think your opponent won’t be able to guess and won’t be able to target. Your opponent does the same. The idea is to sink all your opponent’s ships, the most important one being the battleship. Bonus points if you add sound effects, “for the kids” (awooga, awooga, man the lifeboats, abandon ship!).

You can’t distribute your ships evenly across every available grid reference (without cheating); you have to strategically choose the exact spots to avoid hits. You are effectively designing a grid to avoid single points of failure and minimise the impact of any direct loss. Sounds familiar, huh? Needless to say, the Kranzlings are trained from an early age!

At this point, you’re probably wondering why I’m talking about Battleships. Well, turns out it’s the perfect analogy for a distributed system and how Hedvig works as a software-defined storage platform.

Are you playing a game with your metadata and cluster roles?

The same logic can be applied to any distributed system. You have a grid reference across your data centre and your goal is to distribute your workloads, failure domains and clustered systems across the data centre to minimise any outage or loss. But is your technology vendor limiting the size of your battleship, meaning you can’t “cheat” and distribute a single role across every available node? If you have a fixed number of nodes for a role, you have an architectural constraint.

Let’s pick on the old implementation of VMware HA, which leveraged Legato AAM and had a limitation of five masters in an HA cluster. Hopefully this is fair game, as everyone recognised it as a limitation and VMware have long since fixed it; Duncan Epping’s HA Deep Dive is a great read and covers the details of how it works now. The limit meant that any cluster larger than 10 hosts couldn’t be split across two sites and still have site resilience (good old vSphere Metro Storage Cluster). It also meant that when designing blade chassis you had to be conscious of where your HA controllers sat in the cluster, to avoid a single chassis failure causing a complete HA failure. VMware fixed this by fully distributing the voting process: socialism for the HA master role, in which any host can partake (though only one is the actual master at any time). I won’t get into FSMO roles in Active Directory; I’m still going through therapy on that one.

In the storage world there are a lot of different roles, processes and data operations you must factor in when going from a traditional monolithic controller architecture to a scale-out architecture. You may find yourself wondering:

  • Do I have metadata nodes and roles that are highly accelerated and are able to handle all my metadata requests, but become a single point of failure?
  • Do I distribute a replication role across a small number of member nodes, so that I have some resilience, but some hot-spots too?
  • Do I fully distribute all my roles across all my nodes and handle the engineering complexity of making a storage system that can work with that architecture?

For an existing platform this can be a difficult thing to retrofit, whether to a storage system or to any other system (look how long it took VMware to rewrite HA clustering).

Storage built with distributed systems DNA

Here at Hedvig we have some of the best distributed systems engineers in the world, led by Avinash Lakshman, who helped write the book on distributed data systems (just go take a look at Dynamo and Cassandra).

We’ve built our platform from the ground up to be fully distributed, with the sole aim of eliminating as many single points of failure as possible. The key here is to distribute replicas and stripes of all data across all nodes. By distributing all the roles, including data storage and metadata roles, across the entire cluster, you get the whole power of every node in the cluster for performance, capacity and operational resilience. We split up and stripe the larger contexts that would make no logical or practical sense to copy in full onto every node (think metadata tables, dedupe hash tables and the actual data blocks). This gives very predictable scale, where every node shares the full responsibilities of the entire cluster, with fixed overheads.
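As a loose illustration of that idea (Hedvig’s actual placement logic is proprietary, so the ring structure, hash choice and names below are all assumptions for the sketch), a consistent-hash ring is one common way a distributed store spreads every shard’s replicas across all nodes rather than pinning a role to a fixed subset:

```python
import hashlib
from bisect import bisect_right

# Illustrative sketch only, not Hedvig's implementation. Each node gets many
# virtual points on a hash ring so that keys (data blocks, metadata shards)
# spread evenly, and every node participates in every role.

class Ring:
    def __init__(self, nodes, vnodes=64):
        # Many virtual points per node smooth out the distribution.
        self.points = sorted(
            (self._hash(f"{n}:{v}"), n) for n in nodes for v in range(vnodes)
        )

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def replicas(self, key, rf=3):
        """Walk clockwise from the key's position, collecting rf distinct nodes."""
        h = self._hash(key)
        idx = bisect_right(self.points, (h, chr(0x10FFFF)))
        chosen = []
        for i in range(len(self.points)):
            node = self.points[(idx + i) % len(self.points)][1]
            if node not in chosen:
                chosen.append(node)
            if len(chosen) == rf:
                break
        return chosen
```

With, say, `Ring([f"node{i}" for i in range(6)])`, every key deterministically maps to three distinct nodes, and adding or removing a node only moves the keys adjacent to its points, which is the property that makes fully distributed placement practical.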

Within the Hedvig Distributed Storage Platform, you can lose any node (or even an entire rack or site) in a cluster with zero impact on data, metadata, management and cluster operations. The cluster automatically starts repairing itself, rebuilding any striped and replicated structures back to the level we have defined. For example, with a Replication Factor of 3, after a drive or node failure we rebuild replicas across the cluster, fully autonomously, until every data block is back at three copies.
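A toy sketch of that self-healing step (hypothetical code with invented names, not Hedvig’s implementation): after a failure, scan for blocks with fewer than RF live replicas and choose extra nodes to copy them onto, preferring lightly loaded nodes:

```python
# Hypothetical repair planner. placement maps each block to the set of nodes
# believed to hold a replica; live_nodes is the current cluster membership.

RF = 3  # target Replication Factor, as in the example above

def plan_repairs(placement, live_nodes):
    """Return a dict of block -> list of nodes to copy new replicas onto."""
    plan = {}
    for block, holders in placement.items():
        survivors = holders & live_nodes
        missing = RF - len(survivors)
        if missing > 0:
            # Prefer nodes holding the fewest replicas, to keep load even.
            candidates = sorted(
                live_nodes - survivors,
                key=lambda n: sum(n in h for h in placement.values()),
            )
            plan[block] = candidates[:missing]
    return plan
```

The real system additionally has to throttle rebuild traffic, respect rack and site boundaries, and coordinate which node drives each copy, but the core loop is this: detect under-replication, pick targets, restore the defined factor with no operator involvement.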

The key is a purpose-built storage system designed from the ground up on a distributed systems model. We haven’t had to retrofit or rewrite part of an existing platform to add new features (and pay for the inefficiencies that brings). We cheat at Battleships by cutting each of our ships into N+1 pieces, making copies and distributing them across the entire grid. Every hit might as well be a miss, and no shot is ever a critical one: after every shot we rebuild the lost pieces immediately.

It’s very boring playing Battleships against us!