How We Found a Data Corruption Bug in RocksDB

Background

As a distributed, open-source HTAP database, TiDB uses TiKV as its storage engine. Inside TiKV, we use RocksDB as the local storage. RocksDB is a great project: it's mature, fast, tunable, and widely used in very large-scale production environments. We have been working closely with the RocksDB team. Recently, we found a bug in the DeleteRange feature in RocksDB.

DeleteRange

We know how to delete a single key in RocksDB: we append a DELETE entry for it. But what if we want to delete a whole range of keys? Suppose the store contains the following entries, where each entry is a (key, sequence number, operation) tuple:

(a,1,PUT),(b,2,PUT),(c,3,PUT),(c,4,DELETE)

If we want to delete keys in the range [a,c), we can first scan the range, which yields "a" and "b" (key "c" has already been deleted), and then delete them one by one. The result will be:

(a,1,PUT),(b,2,PUT),(c,3,PUT),(c,4,DELETE),(a,5,DELETE),(b,6,DELETE)
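
For illustration, here is a minimal sketch of this scan-and-delete approach using the RocksDB C++ API (the function name and the choice to batch the deletes are ours, not from the original post):

    #include <memory>
    #include <string>
    #include "rocksdb/db.h"
    #include "rocksdb/write_batch.h"

    // Delete every key in [begin, end) by scanning the range and
    // issuing one point delete per key. Each delete appends a new
    // (key, seq, DELETE) entry; nothing is reclaimed until compaction.
    rocksdb::Status DeleteKeysOneByOne(rocksdb::DB* db,
                                       const std::string& begin,
                                       const std::string& end) {
      std::unique_ptr<rocksdb::Iterator> it(
          db->NewIterator(rocksdb::ReadOptions()));
      rocksdb::WriteBatch batch;
      for (it->Seek(begin); it->Valid() && it->key().compare(end) < 0;
           it->Next()) {
        batch.Delete(it->key());
      }
      if (!it->status().ok()) return it->status();
      return db->Write(rocksdb::WriteOptions(), &batch);
    }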
This approach has some problems:

  1. Scanning the range reads every key just to delete it, which is slow for large ranges.
  2. Deleting all the keys appends a lot of delete entries. So as you can see, deleting data can temporarily result in more data (before compaction) and more I/O amplification.

To address these problems, RocksDB introduced the DeleteRange feature, which deletes a whole range of keys with a single DELETE_RANGE entry. The result of deleting [a,c) in the above example becomes:

(a,1,PUT),(b,2,PUT),(c,3,PUT),(c,4,DELETE),([a,c),5,DELETE_RANGE)
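
A minimal sketch of the same deletion using DeleteRange (this is the standard RocksDB C++ API; the wrapper function is just for the example):

    #include "rocksdb/db.h"

    // Delete every key in [begin, end) with a single range tombstone.
    // Only one ([begin,end), seq, DELETE_RANGE) entry is written,
    // no matter how many keys the range covers.
    rocksdb::Status DeleteWholeRange(rocksdb::DB* db,
                                     const rocksdb::Slice& begin,
                                     const rocksdb::Slice& end) {
      return db->DeleteRange(rocksdb::WriteOptions(),
                             db->DefaultColumnFamily(), begin, end);
    }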

The BUG

Before telling the story, let's first look at how we check data in our tests. TiKV has a useful consistency check mechanism, which has helped expose some serious bugs. When it is enabled, TiKV runs the consistency check periodically. The check calculates a hash of each replica and compares the hashes of all replicas to verify that they are consistent. If the hashes don't match, something must be terribly wrong, so TiKV panics.
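
Conceptually, the check boils down to something like the following sketch (a deliberate simplification, not TiKV's actual implementation; the FNV-1a hash and the function name are our choices for illustration):

    #include <cstdint>
    #include <memory>
    #include "rocksdb/db.h"

    // Illustration only: fold every key and value in [begin, end)
    // into a single 64-bit FNV-1a digest. Replicas holding the same
    // data must produce the same digest; a mismatch means the
    // replicas have diverged and something is terribly wrong.
    uint64_t HashReplicaRange(rocksdb::DB* db, const rocksdb::Slice& begin,
                              const rocksdb::Slice& end) {
      uint64_t h = 14695981039346656037ULL;  // FNV offset basis
      auto fold = [&h](const rocksdb::Slice& s) {
        for (size_t i = 0; i < s.size(); i++) {
          h ^= static_cast<uint8_t>(s[i]);
          h *= 1099511628211ULL;  // FNV prime
        }
      };
      std::unique_ptr<rocksdb::Iterator> it(
          db->NewIterator(rocksdb::ReadOptions()));
      for (it->Seek(begin); it->Valid() && it->key().compare(end) < 0;
           it->Next()) {
        fold(it->key());
        fold(it->value());
      }
      return h;
    }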

Round 1

Cluster A had four TiKV instances: KV1, KV2, KV3, and KV4. The consistency check showed that we had an abnormal replica R2 in KV2 and two normal replicas R1 and R3 in KV1 and KV3. We used tikv-ctl to print the diff between R2 and R1 to see what was wrong. The diff showed that R2 had lost some ranges of data, and some deleted data had reappeared too. Since we were testing the DeleteRange branch in this cluster, we guessed that this could happen if a DeleteRange entry dropped some entries incorrectly during compaction.

Round 2

A few days later, another cluster, Cluster B, panicked, which gave us another chance to hunt. This time, we protected the crime scene very well and made a backup before doing anything.

Inspecting the backup, we found an SST file whose metadata violated a basic invariant:

  1. A file's smallest key must be smaller than its largest key. But file 185830's smallest key was larger than its largest key.
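
This kind of invariant is easy to check from outside the database. A minimal sketch using RocksDB's live-file metadata (assuming the default bytewise comparator, so plain string comparison matches key order; the function name is ours):

    #include <cstdio>
    #include <vector>
    #include "rocksdb/db.h"
    #include "rocksdb/metadata.h"

    // Scan the metadata of every live SST file and report any file
    // whose recorded smallest key sorts after its largest key, which
    // should be impossible for a well-formed file.
    void CheckSstKeyRanges(rocksdb::DB* db) {
      std::vector<rocksdb::LiveFileMetaData> files;
      db->GetLiveFilesMetaData(&files);
      for (const auto& f : files) {
        if (f.smallestkey > f.largestkey) {
          std::fprintf(stderr, "bad file %s: smallest key > largest key\n",
                       f.name.c_str());
        }
      }
    }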

The End

The RocksDB team reacted quickly and released a bugfix version for us, which we highly appreciate. However, it turns out that DeleteRange does not work well with another feature we are using, DeleteFilesInRange (see issue: overlapped files of delete range), and both features are somewhat dangerous, so we still need to wait and test them thoroughly.

Lessons learned:

  1. Protect your crime scene well. Some bugs are hard to reproduce, so seize every opportunity.
  2. Collect more evidence before diving into the details. It's inefficient to hunt with no clear direction, and what's worse, it's easy to go too deep in the wrong direction.
  3. Test new features (especially high-risk ones) carefully and under different conditions; some bugs rarely happen but can be destructive. Time tries truth.
