Reducing P99 Latency to 150 μs and Hardware Cost by 75% with a Scale-Out DBMS

How TiKV outperforms AWS Aurora and Apache Ignite.

Our business challenge: real-time responses to massive data

Our equipment processes 84 billion requests worldwide every day, and the average peak transactions per second (TPS) reaches 1.5 million. We need to keep our average query response time below 10 milliseconds. We’re in the IoT industry, where there is no off-peak traffic and the write volume is very large. We spent six years looking for the most suitable data architecture.
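To put these figures in perspective, the short calculation below converts the daily request count into an average per-second rate and compares it with the peak TPS. It is only a back-of-the-envelope sketch, and it assumes that requests and transactions can be compared one to one, which the text does not state.

```python
# Back-of-the-envelope load figures, using the numbers quoted in the text.
# Assumption: one request is treated as one transaction for comparison.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

requests_per_day = 84e9  # 84 billion requests per day
peak_tps = 1.5e6         # average peak transactions per second

avg_rps = requests_per_day / SECONDS_PER_DAY
peak_to_avg = peak_tps / avg_rps

print(f"average requests/s: {avg_rps:,.0f}")        # average requests/s: 972,222
print(f"peak-to-average ratio: {peak_to_avg:.2f}")  # peak-to-average ratio: 1.54
```

An average of nearly one million requests per second, with a peak only about 1.5 times higher, is consistent with the remark that there is no off-peak traffic.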

Database exploration

We used Aurora for three years, but as our data size grew, it no longer met our business requirements. We then switched to Ignite and used it for two years, but it was not ideal either.

AWS Aurora couldn’t withstand our data volume surge

Previously, we used AWS Aurora, whose architecture separates the storage and compute layers. Our application ran stably on Aurora for three years, and during that time Aurora fully met our application demands. This was because six or seven years ago IoT was not yet popular, and smart home devices were not widely used.

The AWS Aurora-based architecture

Apache Ignite scaling risked data loss

We also tried Apache Ignite, a key-value system similar to TiKV, but it couldn’t meet our business needs either. Its partitions were large: one partition stored 1 GB of data. Unlike TiKV, it did not scale linearly. When our business volume doubled and we needed to scale out our database, we had to shut down our machines. This created a risk of data loss, which is unacceptable for IoT devices. To solve this issue, we placed Aurora behind an Ignite server for disaster recovery, and data was written to Aurora synchronously.
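The disaster-recovery arrangement described above can be pictured as a synchronous dual write. The sketch below is a minimal illustration, not our production code: `InMemoryStore` and `DualWriter` are hypothetical names, both stores are plain dictionaries standing in for Ignite and Aurora, and the write order (backup first) is an assumption.

```python
class InMemoryStore:
    """Hypothetical stand-in for a real store such as Ignite or Aurora."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)


class DualWriter:
    """Writes every record to a backup store and a primary store synchronously."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup

    def put(self, key, value):
        # Assumption: the durable backup is written first, so a record that
        # reaches the primary always exists in the backup as well.
        self.backup.put(key, value)
        self.primary.put(key, value)

    def get(self, key):
        # Reads are served from the fast primary store only.
        return self.primary.get(key)

    def recover(self):
        # Rebuild the primary from the backup after the primary loses data.
        self.primary.data = dict(self.backup.data)


ignite = InMemoryStore("ignite")   # hypothetical primary
aurora = InMemoryStore("aurora")   # hypothetical disaster-recovery copy
writer = DualWriter(ignite, aurora)

writer.put("device-1", "on")
ignite.data.clear()                # simulate data loss during a scale-out
writer.recover()
```

The cost of this design is that every write pays the latency of both stores, which is one reason such an arrangement is hard to sustain at high write volume.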

The Apache Ignite-based architecture

TiKV is an optimal solution

TiDB is an open-source distributed SQL database built by PingCAP and its open-source community. We tested TiDB 3.0 and TiDB 4.0, but they didn’t meet our requirements for low query latency and high throughput. The PingCAP team analyzed these problems and found that the SQL parser layer consumed most of the time, while TiKV, TiDB’s underlying storage engine, was completely idle.

  • Latency existed in the SQL layer.
  • Although IoT devices’ data had high TPS, their application logic was not that complex.

Seek duration in TiKV

Write duration in TiKV
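Because the application logic is simple, device data can be read and written by key without going through a SQL layer. The sketch below illustrates that access pattern only. `RawKV` is a hypothetical in-memory stand-in and not the TiKV client API, and the key layout and JSON encoding are assumptions made for the example.

```python
import json


class RawKV:
    """Hypothetical in-memory key-value store; NOT the TiKV client API."""

    def __init__(self):
        self._data = {}

    def put(self, key: bytes, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: bytes):
        return self._data.get(key)


def device_key(device_id: str) -> bytes:
    # Assumed key layout: one key per device holding its latest state.
    return f"device/{device_id}/state".encode()


def save_state(kv: RawKV, device_id: str, state: dict) -> None:
    # A single put replaces an INSERT/UPDATE statement; nothing is parsed or planned.
    kv.put(device_key(device_id), json.dumps(state).encode())


def load_state(kv: RawKV, device_id: str):
    # A single get replaces a SELECT by primary key.
    raw = kv.get(device_key(device_id))
    return None if raw is None else json.loads(raw)


kv = RawKV()
save_state(kv, "lamp-42", {"on": True, "brightness": 80})
print(load_state(kv, "lamp-42"))
```

With this kind of point lookup and point write, the request goes straight to the storage layer, so the time that SQL processing would add is avoided.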

Our new challenge: deploying TiKV across regions

All of our applications were deployed in three regions, and we needed cross-region calls. The communication between three replicas consumed network traffic, and we had to pay for that traffic. But TiKV did not support calls within a region. Even though our hardware cost was reduced by 75%, our network cost was higher than before.
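A rough way to see why replication raises the network bill is to count how many copies of each write must leave the leader's region. The sketch below is an estimate under stated assumptions: one replica per region, replication traffic roughly equal to the written bytes per remote follower, and no allowance for heartbeats or other protocol overhead. The function names, the write volume, and the price per GB are all hypothetical placeholders, not our real figures.

```python
def cross_region_gb_per_day(write_gb_per_day, replicas=3, replicas_in_leader_region=1):
    """Estimate the GB per day sent to followers outside the leader's region."""
    remote_followers = replicas - replicas_in_leader_region
    return write_gb_per_day * remote_followers


def daily_transfer_cost(write_gb_per_day, price_per_gb, **placement):
    """Estimate the daily cross-region transfer cost for replication traffic."""
    return cross_region_gb_per_day(write_gb_per_day, **placement) * price_per_gb


# Placeholder inputs for illustration only.
write_volume_gb = 1000.0   # hypothetical daily write volume
price_per_gb = 0.02        # hypothetical cross-region transfer price

remote_gb = cross_region_gb_per_day(write_volume_gb)
cost = daily_transfer_cost(write_volume_gb, price_per_gb)
print(f"cross-region traffic: {remote_gb:.0f} GB/day, cost: {cost:.2f} per day")
```

Under these assumptions, every byte written crosses a region boundary twice, so the transfer cost grows in step with write volume regardless of how much the hardware cost drops.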

Upgrading our architecture from x86 to Arm to reduce costs and increase efficiency

The IoT industry focuses on reducing costs because its gross profit margin is very low. In June 2020, AWS launched Amazon EC2 C6g instances and stated that they delivered up to 40% better price-performance than C5 instances.
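It helps to translate a price-performance claim into a cost figure. The sketch below assumes that "40% better price-performance" means 40% more work per dollar, which is the usual reading but is an interpretation on our part. `relative_cost` is a hypothetical helper written for this example.

```python
def relative_cost(price_performance_gain):
    """Cost of running the same workload, relative to the baseline instance.

    price_performance_gain is the fractional increase in work done per dollar,
    for example 0.40 for a 40% improvement.
    """
    return 1.0 / (1.0 + price_performance_gain)


cost = relative_cost(0.40)
print(f"relative cost: {cost:.3f}, saving: {1 - cost:.1%}")
# relative cost: 0.714, saving: 28.6%
```

Because the 40% figure is an upper bound, a saving of about 28.6% for the same workload is the best case under this reading, not a guaranteed result.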

Our future plans

In the future, with the help of TiKV 5.0 and 5.1, we’re confident that we can handle our large business volume. We estimate that by the end of 2021, TiKV’s traffic will increase by two to three times.

PingCAP is the team behind TiDB, an open-source MySQL compatible NewSQL database. Official website: https://pingcap.com/ GitHub: https://github.com/pingcap
