Facebook’s new storage system (Delos)

Gaurav Sehgal
Jan 24, 2021 · 6 min read

In this blog post, I’m going to share what I learned about the new way of doing consensus that Facebook developed to support its unprecedented scale.

Virtual Consensus!

This system currently runs in production, handling 1.8 billion transactions per day. Amazing, right?

So why did they do it?

In 2019, Facebook stored the state of its different applications in a combination of MySQL, ZippyDB, and ZooKeeper. MySQL provides rich APIs for transactions, range queries, and so on, but falls short on fault tolerance, whereas ZooKeeper gives better fault-tolerance guarantees but lacks the rich APIs that applications need. To solve this problem, they built a rich-API database called Delos, backed by a general-purpose consensus implementation that is not tied to any particular database and can be used alongside any application. With this virtual consensus, you can, in principle, combine the rich API of MySQL with distributed state replication, making the system more reliable and fault-tolerant.

Well, the interesting part here is not so much the new database as the way they implemented consensus. Many distributed systems today tie their core implementation to one particular consensus algorithm, which makes them very hard to evolve. For example, etcd only works with Raft, and Kafka still relies on ZooKeeper and its ZAB protocol, an algorithm that is well over a decade old. On the research side, we get new algorithms almost every year, but we can’t really use them because, first, getting consensus right is a really hard problem in any complex distributed system and, second, swapping the consensus layer in an existing implementation requires a lot of rewriting, which, to be honest, everybody hates.

So the concept of putting a virtual layer over consensus is a real game-changer. With it, you can switch to a new algorithm without any downtime and without changing the core business logic, which makes the system much faster to innovate on.

When Delos switched from ZooKeeper-based consensus to their own simple native implementation using this virtual layer, they saw a huge drop in latency.

Source: Delos Paper

Some interesting observations

  • The drop is the point where they switched the algorithm, with no major downtime in production.
  • The spikes in multi-put latency come from a regular cron job that dumps a huge amount of data into the database.

Now let’s go through the core architecture behind virtual consensus.

Source: Delos Paper

Virtual consensus builds on the well-known concept of a shared log. A shared log is a simple yet powerful primitive for ensuring consistency in a distributed system: a highly available, fault-tolerant, append-only log is shared by a cluster of machines to coordinate their operations. For example, a distributed database can use the shared log for state replication (i.e., consensus) by appending all operations to the log and replaying them on the other servers.
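To make that concrete, here’s a minimal sketch in Go (with made-up types; Delos itself isn’t written like this) of a toy key-value store that replicates state by appending operations to a shared log and replaying them on every replica:

```go
package main

import "fmt"

// Op is a single state-machine command; here just a key-value put.
type Op struct {
	Key, Val string
}

// SharedLog stands in for a highly available, fault-tolerant,
// append-only log shared by all replicas. In a real system the log
// lives on a separate service; here it is just an in-memory slice.
type SharedLog struct {
	entries []Op
}

// Append adds an operation at the tail and returns its log position.
func (l *SharedLog) Append(op Op) int {
	l.entries = append(l.entries, op)
	return len(l.entries) - 1
}

// ReadNext returns the entry at pos, or ok=false once the reader
// has caught up with the tail.
func (l *SharedLog) ReadNext(pos int) (Op, bool) {
	if pos >= len(l.entries) {
		return Op{}, false
	}
	return l.entries[pos], true
}

// Replica is a toy key-value store whose state is derived purely
// from replaying the shared log.
type Replica struct {
	pos   int
	state map[string]string
}

// CatchUp replays every entry the replica has not applied yet.
// Because all replicas apply the same entries in the same order,
// they converge to identical state; that is the consensus part.
func (r *Replica) CatchUp(log *SharedLog) {
	for {
		op, ok := log.ReadNext(r.pos)
		if !ok {
			return
		}
		r.state[op.Key] = op.Val
		r.pos++
	}
}

func main() {
	log := &SharedLog{}
	log.Append(Op{"user1", "alice"})
	log.Append(Op{"user2", "bob"})

	a := &Replica{state: map[string]string{}}
	b := &Replica{state: map[string]string{}}
	a.CatchUp(log)
	b.CatchUp(log)
	fmt.Println(a.state) // map[user1:alice user2:bob]
	fmt.Println(b.state) // identical state on every replica
}
```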

As you can see in the figure above, virtual consensus has two major components: the virtual log and loglets.

Virtual Log

The virtual log is itself a shared log, but it doesn’t actually store log entries; it relies on different loglet implementations to do that. It only stores metadata that maps ranges of log positions to different loglets. A typical chain could look like this: positions [0, 30) are stored in a ZooKeeper-based loglet and [30, inf) in a Raft-based loglet.
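Here’s a rough sketch of that metadata as Go types (hypothetical names; the real Delos schema is richer than this): the chain is just an ordered list of ranges, each pointing at the loglet that owns it, and looking up a log position means finding the range that contains it.

```go
package chain

// LogletRef identifies a concrete loglet instance (hypothetical fields).
type LogletRef struct {
	Kind string // e.g. "zk", "native", "raft"
	Addr string
}

// Segment maps one contiguous range of log positions to a loglet.
type Segment struct {
	Start  int64 // inclusive
	End    int64 // exclusive; -1 means unbounded, i.e. the active segment
	Loglet LogletRef
}

// Chain is the full mapping the virtual log maintains.
type Chain struct {
	Segments []Segment
}

// LogletFor returns the loglet that stores a given log position.
func (c *Chain) LogletFor(pos int64) (LogletRef, bool) {
	for _, s := range c.Segments {
		if pos >= s.Start && (s.End == -1 || pos < s.End) {
			return s.Loglet, true
		}
	}
	return LogletRef{}, false
}

// The example chain from the text: [0, 30) lives on a ZooKeeper-based
// loglet, [30, inf) on the currently active Raft-based loglet.
var Example = Chain{Segments: []Segment{
	{Start: 0, End: 30, Loglet: LogletRef{Kind: "zk", Addr: "zk-loglet-1"}},
	{Start: 30, End: -1, Loglet: LogletRef{Kind: "raft", Addr: "raft-loglet-1"}},
}}
```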

The virtual log persists this map in a metastore. The metastore has to be fault-tolerant and highly available, but it doesn’t need to be fast, which is fortunate, since getting all three properties at once in a distributed system is essentially impossible. That’s fine, because the metastore is only touched during a reconfiguration to a different loglet. Say you switch from a ZooKeeper-based loglet to a Raft-based one at tail position 30: only at that moment is the metastore accessed, to change the active loglet and the range mapping.

Along with storing this information in the metastore, the virtual log also caches the same mapping in its client library for fast access. I’ll explain the cache-invalidation process when I get to the loglet’s seal command.

Loglet

A loglet is a distributed system that orders the log entries it receives from the different virtual log clients. This is where years of consensus research come into play: different variations of Paxos can be different loglets, Raft can be another, and you can even wrap existing systems like ZooKeeper or LogDevice. The choice of algorithm is yours, based on the latency and throughput you need, because every consensus implementation comes with trade-offs; and as your system grows, you can easily switch to another implementation without touching your core business logic.

Virtual Log APIs and interaction with Loglets

At a high level, the virtual log exposes three major APIs: append, checkTail, and readNext. Clients such as databases use these APIs to append operations to the log, to find out the current tail of the log, and to read operations up to that tail for replay. I’ll sketch these interfaces together with the loglet’s API in a moment.

Similarly, each loglet exposes the same three APIs plus one more: a seal command. Seal is called during the switch to a different implementation (say, ZK to Raft) and essentially locks the loglet against further appends. After that, if a client tries to append to the sealed loglet, the operation fails; the client then re-syncs its cache from the metastore and retries on the new active loglet.
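Putting the two API surfaces together, here’s a hypothetical Go rendering (not the paper’s actual interfaces) of a loglet, a minimal metastore view, and a virtual log client whose append retries on the new active loglet after hitting a sealed one:

```go
package virtuallog

import "errors"

// ErrSealed is what a sealed loglet returns for further appends.
var ErrSealed = errors.New("loglet is sealed")

// Loglet is the contract every pluggable consensus implementation
// (ZooKeeper-backed, native, Raft, LogDevice, ...) has to satisfy.
type Loglet interface {
	Append(entry []byte) (pos int64, err error)
	CheckTail() (tail int64, err error)
	ReadNext(pos int64) (entry []byte, err error)
	Seal() error // lock the loglet against further appends
}

// MetaStore is a minimal view of the metastore: it can always tell a
// client which loglet is currently active.
type MetaStore interface {
	ActiveLoglet() Loglet
}

// VirtualLog exposes append/checkTail/readNext to database clients and
// hides which loglet actually stores the entries.
type VirtualLog struct {
	active Loglet    // cached active loglet in the client library
	meta   MetaStore // only consulted when the cache turns out stale
}

// Append tries the cached active loglet. If that loglet was sealed by a
// reconfiguration, the append fails with ErrSealed, the client refreshes
// its cache from the metastore, and the append is retried on the new
// active loglet.
func (v *VirtualLog) Append(entry []byte) (int64, error) {
	for {
		pos, err := v.active.Append(entry)
		if err == nil {
			return pos, nil
		}
		if !errors.Is(err, ErrSealed) {
			return 0, err
		}
		v.active = v.meta.ActiveLoglet() // cache invalidation + retry
	}
}
```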

This tells us two things. First, the seal command has to be fault-tolerant so that log entries don’t get lost. Second, the append operation on a loglet does not have to be fault-tolerant or highly available, because whenever it fails the virtual log can switch to another loglet implementation at runtime. These two points really matter, because a fault-tolerant seal is much easier to achieve than a fault-tolerant append: sealing can be implemented with a simple atomic bit set on each loglet server.
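As a rough illustration of why sealing is the easy part, here is a sketch (hypothetical, assuming a simple loglet server that keeps a local sealed flag): the server just sets one bit under a lock and checks it before accepting appends, with no ordering or leader election involved.

```go
package loglet

import (
	"errors"
	"sync"
)

var errSealed = errors.New("loglet server is sealed")

// LogletServer is one storage node of a simple loglet (hypothetical).
type LogletServer struct {
	mu      sync.Mutex
	sealed  bool
	entries [][]byte
}

// Seal flips a single flag; retrying it on failure is trivial, which is
// why a fault-tolerant seal is much easier than a fault-tolerant append.
func (s *LogletServer) Seal() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.sealed = true
}

// Append refuses new entries once the server is sealed, so a sealed
// loglet can never grow and its tail position is final.
func (s *LogletServer) Append(entry []byte) (int64, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.sealed {
		return 0, errSealed
	}
	s.entries = append(s.entries, entry)
	return int64(len(s.entries) - 1), nil
}
```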

Reconfiguration

Source: Delos Paper

This is where the magic of virtual consensus comes into play.

Reconfiguration means switching to a new loglet (i.e., consensus) implementation. It happens in three steps: first, seal the currently active loglet so it accepts no new appends; second, update the mapping in the metastore to point to the new loglet; and third, invalidate the clients’ cached copies of the map. The last step can even be skipped, because cache invalidation also happens lazily when an append fails against a sealed loglet.
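Here’s what those three steps could look like as a single routine (a hypothetical sketch; the real Delos protocol also has to handle races between concurrent reconfigurations through the metastore):

```go
package reconfig

// Minimal shapes for this sketch; hypothetical, see the earlier
// interface sketch for the fuller loglet API.
type Loglet interface {
	Seal() error               // stop accepting appends
	CheckTail() (int64, error) // tail is final once the loglet is sealed
}

type MetaStore interface {
	// InstallSegment records that log positions [start, inf) now live
	// on the given loglet, making it the active one.
	InstallSegment(start int64, loglet Loglet) error
}

// Reconfigure switches the virtual log from oldActive to newActive.
func Reconfigure(meta MetaStore, oldActive, newActive Loglet) error {
	// Step 1: seal the old loglet so it accepts no further appends.
	if err := oldActive.Seal(); err != nil {
		return err
	}
	// Step 2: read the now-frozen tail and update the metastore's map so
	// that everything from this position onward goes to the new loglet.
	tail, err := oldActive.CheckTail()
	if err != nil {
		return err
	}
	if err := meta.InstallSegment(tail, newActive); err != nil {
		return err
	}
	// Step 3: invalidating client caches eagerly is optional; clients
	// also refresh lazily when an append fails against the sealed loglet.
	return nil
}
```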

There are three major situations in which reconfiguration occurs:

  • Planned reconfiguration, when switching to a better loglet implementation. When Facebook switched from the ZK-based loglet to the native one, they saw a huge drop in latency.
  • On loglet failure. If a loglet’s append operations start failing, the system can automatically switch to another loglet at runtime.
  • On truncation, when the system deletes (or backs up) old log entries that are no longer needed. At this point, the virtual log reconfigures itself onto a new loglet and then trims the old loglets.

Conclusion

I think this is a great concept that brings the research world closer to production systems, because until now, once you had implemented consensus in your application, it was very hard to change even if you knew a better solution existed. When I first read the paper, I was surprised that nobody had done something like this in a production system before, because at a high level it seems like an obvious thing to do. But that’s how things evolve in computer science: once someone does it at scale, everyone starts to follow. I think in the future we will see many distributed systems implement something similar, even if they name it differently.

Finally, I want to congratulate the Delos team on this amazing work and encourage you to read their paper for a deeper understanding of the system.

https://www.usenix.org/system/files/osdi20-balakrishnan.pdf
