Hedvig internals: Log-structured merge-trees and folding of Bloom filters
One of the really cool things about Hedvig is that we get to work on very different projects, even if they all orbit the same product. Since joining, I’ve worked on distributed protocols such as atomic commitment, consensus, and anti-entropy, which is pretty standard for a company with “distributed” in its DNA (yeah, that’s what the “d” in Hedvig stands for). However, one of my early projects had to do with how data is stored in and read from the system, using log-structured merge-trees and, specifically, folding of Bloom filters.
Log-structured merge-trees are widely used in the NoSQL key-value store world. The technique roughly consists of recording every write operation in a redo log while keeping the latest set of key-value pairs in memory, for ease of access. Whenever the number of key-value pairs in memory grows beyond a threshold, they are flushed into what is generally called a Sorted Strings Table (SSTable). So, in summary, a write operation goes like this: update the log, update the memory table and, if the memory table is full, flush it into an SSTable sorted by key (Figure 1).
Figure 1: Courtesy of DataStax
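In Python-flavored sketch form, the write path looks roughly like this (the names `LSMStore` and `flush_threshold` are illustrative, not Hedvig’s actual API):

```python
# A minimal sketch of the LSM write path: append to a redo log, update the
# in-memory table, and flush to a sorted SSTable when the table fills up.

class LSMStore:
    def __init__(self, flush_threshold=4):
        self.log = []            # stands in for the on-disk redo log
        self.memtable = {}       # latest key -> value pairs, in memory
        self.sstables = []       # each SSTable: list of (key, value), sorted by key
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.log.append((key, value))   # 1. update the log
        self.memtable[key] = value      # 2. update the memory table
        if len(self.memtable) >= self.flush_threshold:
            self._flush()               # 3. flush when the table is full

    def _flush(self):
        sstable = sorted(self.memtable.items())  # sorted by key
        self.sstables.append(sstable)
        self.memtable.clear()

store = LSMStore(flush_threshold=2)
store.write("a", 1)
store.write("b", 2)    # second write triggers a flush
print(store.sstables)  # [[('a', 1), ('b', 2)]]
```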
When a read operation for a certain key is received, the data may be available in memory, from where it is promptly returned; but if the data is only on disk, the SSTables need to be searched for the key before the query can be answered. Depending on how the data is structured, multiple SSTables may need to be read before the operation can complete, which becomes too costly very quickly. To read only the SSTables that potentially have data associated with the searched key, it is common practice to associate a Bloom filter with each table.
Bloom filters are very useful data structures, with dozens of variants, used to test membership of elements in sets. Results are probabilistic in the sense that if the filter says that an element is not in the set, then it definitely isn’t, but if it says that the element is in the set, then it very likely is, with a small chance that it is not. The “small” in the previous sentence can be made as small as needed by tweaking some parameters of the filter.
So, to speed up reads from SSTables, we add all keys to the filter and first check if the key is in the Bloom filter associated with the SSTable when doing the read.
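A minimal sketch of that read path, where `SetBloom` is an exact stand-in for a real Bloom filter (a plain set has no false positives, but exposes the same `might_contain` check):

```python
# Sketch of the read path: serve from memory when possible, and consult each
# SSTable's Bloom filter before paying for a read of the table itself.

class SetBloom:
    """Exact stand-in for a Bloom filter; a real one could answer "maybe"."""
    def __init__(self, keys):
        self.keys = set(keys)

    def might_contain(self, key):
        return key in self.keys

def read(key, memtable, sstables):
    """sstables: list of (bloom_filter, sorted_pairs), newest first."""
    if key in memtable:                   # 1. data still in memory: cheap
        return memtable[key]
    for bloom, table in sstables:
        if not bloom.might_contain(key):  # 2. "definitely not here":
            continue                      #    skip the expensive table read
        for k, v in table:                # 3. linear scan; a real system
            if k == key:                  #    would binary-search the table
                return v
    return None

sstables = [(SetBloom(["a", "c"]), [("a", 1), ("c", 3)])]
print(read("c", {}, sstables))   # 3
print(read("z", {}, sstables))   # None: the filter skips the table entirely
```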
The size of the filter depends on how little you want to err on queries and also on how many elements you plan to add to it. Being more specific about implementations, we can see a Bloom filter as a bit vector, in which each bit is initially zero. For each element in the corresponding set, we hash the element to an index of the filter and set the indexed bit to 1; we do so with multiple hash functions. The total number of hashes used is a function of how precise we need the filter to be and how much space is available. To test if an element is in the set, we compute all the hashes and check if the corresponding indexed bits are set to 1. It is easy to see that as you add more elements, you get more and more false positives, so you need to size the filter properly.
Figure 2: Courtesy of Wikipedia
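A toy bit-vector Bloom filter along these lines might look like the following (the parameters m = 64 bits and k = 3 hashes, and the salted SHA-256 hashing, are illustrative choices, not what any particular store uses):

```python
import hashlib

class BloomFilter:
    """Toy bit-vector Bloom filter with k salted SHA-256 hashes."""

    def __init__(self, m=64, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m          # every bit starts at zero

    def _indexes(self, item):
        # Derive k indexes by salting the hash input with the hash number.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item):
        for idx in self._indexes(item):
            self.bits[idx] = 1       # set the indexed bit

    def might_contain(self, item):
        # All k bits set => "probably in the set"; any clear => "definitely not".
        return all(self.bits[idx] for idx in self._indexes(item))

bf = BloomFilter()
bf.add("key-1")
print(bf.might_contain("key-1"))   # True: never a false negative
```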
To get rid of stale data in the SSTables and speed up reads, these files undergo a compaction process, which merges a set of SSTables into a new one. Let’s say that the SSTables created by the flushing process are level 0 SSTables and that a compaction merges SSTables from level X into SSTables of level X+1. Then, if you represent this merging process as a graph, it will be a tree, which is the reason for the “merge-tree” in the name of the technique. For instance, in the figure shown below, there are 3 compaction levels with each box representing an SSTable and the number inside each box representing the number of keys in that SSTable.
Figure 3: Merge-tree
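The merge step of a compaction can be sketched like this, assuming newer tables win when two tables contain the same key (so stale entries do not survive):

```python
# Sketch of compacting a set of level-X SSTables into one level-(X+1) table:
# merge the sorted runs, keeping only the newest value for each key.

def compact(sstables):
    """sstables: list of sorted (key, value) lists, oldest first."""
    merged = {}
    for table in sstables:          # later tables overwrite earlier,
        for key, value in table:    # stale entries for the same key
            merged[key] = value
    return sorted(merged.items())   # the output is again sorted by key

old = [("a", 1), ("b", 2)]
new = [("b", 20), ("c", 3)]         # ("b", 2) is stale and will not survive
print(compact([old, new]))          # [('a', 1), ('b', 20), ('c', 3)]
```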
When doing the merge, you must create a new Bloom filter as well. The size of the new filter is a decision that must be made before adding elements, that is, before the merged SSTable is created. However, even though you know the number of elements in each input table, you cannot be sure how many there will be in the output, since many of the entries may be stale (entries for the same key but with older values) and will not survive the compaction. We cannot risk creating too small a filter, or it would give too many false positives, but we also cannot afford to over-dimension the filter. So we finally arrive at folding of Bloom filters (as you will see, it should actually have been called stacking of Bloom filter slices, but folding was the name that stuck).
FOLDING OF BLOOM FILTERS
A curious thing about Bloom filters is that if you take two filters with the same specification, that is, with the same length and set of hash functions, then bitwise OR-ing/AND-ing them together yields a filter corresponding to the union/intersection of the two original sets (for AND, the result may report slightly more false positives than a filter built directly from the intersection, but it never misses an element of it).
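The union property is easy to check with a toy filter (the salted SHA-256 hashing and the sizes are just illustrative choices):

```python
import hashlib

M, K = 64, 3   # shared specification: same length, same hash functions

def indexes(item):
    for i in range(K):
        digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
        yield int.from_bytes(digest, "big") % M

def make_filter(items):
    bits = [0] * M
    for item in items:
        for idx in indexes(item):
            bits[idx] = 1
    return bits

f1 = make_filter(["a", "b"])
f2 = make_filter(["b", "c"])

# Bitwise OR of the two filters equals the filter of the union, bit for bit.
union = [x | y for x, y in zip(f1, f2)]
assert union == make_filter(["a", "b", "c"])
```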
In our system, we have used this feature to resize a filter. That is, upon a merge, we dimension the new filter to ensure that even if all values from the input tables make it to the output table, the precision of the filter will be good enough. Then, after we finish the merge, we split the filter into disjoint and equally sized parts and OR them together to get a smaller filter. The number of hashes remains the same, and the indexes produced by those hashes are taken modulo the new size of the filter. For instance, in the figure shown below, the size of the filter is cut in half from 16 to 8, and the new index is computed as “hash(key) % 8” instead of “hash(key) % 16”.
Figure 4: Bloom filter folding
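A sketch of the folding operation itself. It works because when the new size divides the old one, (h % 16) % 8 == h % 8, so every bit a key set in the large filter lands on the position the key will query in the small one:

```python
import hashlib

def fold(bits, factor):
    """OR together `factor` equal, disjoint slices of the bit vector."""
    new_m = len(bits) // factor
    folded = [0] * new_m
    for i, bit in enumerate(bits):
        folded[i % new_m] |= bit
    return folded

def hashes(item, k=3):
    # Toy salted hashing; yields the raw hash values, before any modulo.
    for i in range(k):
        digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
        yield int.from_bytes(digest, "big")

# Build a 16-bit filter, fold it to 8 bits, and check that membership
# queries still succeed with "hash(key) % 8" instead of "hash(key) % 16".
bits = [0] * 16
for h in hashes("key"):
    bits[h % 16] = 1

folded = fold(bits, 2)
assert all(folded[h % 8] for h in hashes("key"))
```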
The exact new size depends on the maximum false positive rate desired and the number of elements actually in the filter. To have multiple options for sizes, we start with a size that satisfies the worst-case merge scenario (all keys are copied to the new SSTable) and that is also a multiple of 2^X for some chosen X. For example, if X = 6, the Bloom filter size will be a multiple of 64, but also of 32, 16, 8, 4, and 2 and, therefore, can be folded by a factor of 2, 4, 8, 16, 32, or 64. Given the options, we pick the largest folding that still satisfies the desired false positive rate. This configuration has worked well for all of our use cases.
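Picking the fold factor can be sketched as follows, using the standard approximation p ≈ (1 − e^(−kn/m))^k for the false positive rate of an m-bit filter holding n keys with k hashes; the concrete numbers are illustrative, not Hedvig’s actual parameters:

```python
import math

def false_positive_rate(m, n, k):
    """Standard approximation for a filter of m bits, n keys, k hashes."""
    return (1 - math.exp(-k * n / m)) ** k

def best_fold(m, n, k, target_fpp, max_x=6):
    """Largest power-of-two fold of an m-bit filter that still meets
    target_fpp for the n keys that actually survived the merge."""
    for x in range(max_x, 0, -1):    # try the largest fold first; the rate
        factor = 2 ** x              # only gets worse as the filter shrinks
        if m % factor == 0 and false_positive_rate(m // factor, n, k) <= target_fpp:
            return factor
    return 1                         # no fold possible: keep the full size

# Filter sized for the worst case, but only ~2000 keys survived the merge:
print(best_fold(65536, 2000, 5, 0.01))   # → 2 (fold 65536 bits to 32768)
```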
Folding Bloom filters is not exactly distributed systems stuff, but it has had a big impact on our I/O rate and index loading times. Really cool, isn’t it?
To learn more about Hedvig, please download our technical whitepaper