Reducing P99 Latency to 150 μs and Hardware Cost by 75% with a Scale-Out DBMS
Written by Yunsong Liu (DBA at Tuya Smart)
Tuya Smart is a global Internet of Things (IoT) development platform. It builds interconnectivity standards to bridge the intelligent needs of brands, original equipment manufacturers, developers, and retail chains across a broad range of smart devices and industries. By the end of June 2021, the Tuya IoT Development Platform served 384,000+ developers around the world. Now, smart devices “Powered by Tuya” are available in 200+ countries and regions in 100,000+ stores all over the world.
As our business developed, our data volume grew sharply. We needed to ensure that our average query response time was less than 10 milliseconds. We tried AWS Aurora and Apache Ignite, but they didn't meet our business requirements. Thanks to TiKV, a highly scalable, low-latency key-value database, we solved our database problem. With TiKV, we reduced our hardware cost by 75%. Our P99 query latency was 150 microseconds, and our write latency was 360 microseconds.
In this post, I'll share our business pain points, why we chose TiKV, the challenges we faced with TiKV, and our future plans.
Our business challenge: real-time responses to massive data
Our equipment processes 84 billion requests every day around the world. The average peak transactions per second (TPS) reached 1.5 million. We need to ensure that our average response time for queries is less than 10 milliseconds. We’re in the IoT industry where there is no off-peak traffic, and the amount of writes is very large. We spent six years looking for the most suitable data architecture solution.
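As a back-of-the-envelope check on these figures, the 84 billion daily requests work out to an average of roughly 970,000 TPS, so the quoted 1.5 million peak is only about 1.5x the average. This is a small illustrative calculation, not from the original post:

```python
# Back-of-the-envelope check on the traffic figures quoted above.
DAILY_REQUESTS = 84_000_000_000    # 84 billion requests per day
SECONDS_PER_DAY = 24 * 60 * 60     # 86,400

average_tps = DAILY_REQUESTS / SECONDS_PER_DAY
print(f"average TPS ≈ {average_tps:,.0f}")      # ≈ 972,222

# The quoted 1.5 million peak TPS is roughly 1.5x the daily average —
# consistent with an IoT workload that has no real off-peak window.
peak_to_average = 1_500_000 / average_tps
print(f"peak/average ≈ {peak_to_average:.2f}")  # ≈ 1.54
```

A peak-to-average ratio this close to 1 is exactly why sharded or shut-down-to-scale architectures are painful here: there is no quiet period in which to do maintenance.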
We tried Aurora for three years. But as our data size grew, it didn’t meet our business requirements. Then, we switched to Ignite and used it for two years. But it was not ideal, either.
AWS Aurora couldn’t withstand our data volume surge
Previously, we used AWS Aurora, whose architecture separates the storage and computing layers. Our application ran stably on Aurora for three years, and during that time Aurora fully met our needs. This was largely because, six or seven years ago, IoT was still a niche market and smart home devices were not widely used.
However, as our business developed in recent years, the number of our devices has grown exponentially, increasing three to five times every year. Aurora couldn't withstand the data volume surge, let alone keep IoT response times within 10 milliseconds. Even after we sharded our database and split the cluster, we couldn't meet our business needs.
Apache Ignite scaling risked data loss
We also tried Apache Ignite, a key-value system similar to TiKV, but it couldn't meet our business needs either. Its partitions were large: a single partition stored 1 GB of data. Unlike TiKV, its scalability was not linear. When our business volume doubled and we needed to scale out the database, we had to shut down our machines, which carried a risk of data loss; that is unacceptable for IoT devices. To mitigate this, we ran Aurora behind the Ignite servers for disaster recovery, writing data to Aurora synchronously.
TiKV is an optimal solution
TiDB is an open-source distributed SQL database built by PingCAP and its open-source community. We tested TiDB 3.0 and TiDB 4.0, but they didn’t meet our requirements for low query latency and high throughput. The PingCAP team analyzed these problems and found that the SQL parser layer consumed most of the time, while TiKV, TiDB’s underlying storage engine, was completely idle.
We thought we could remove the SQL layer (TiDB) and write directly to TiKV, because:
- Latency existed in the SQL layer.
- Although IoT devices’ data had high TPS, their application logic was not that complex.
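To make the idea concrete, here is a minimal sketch of talking to TiKV's raw key-value API directly, skipping SQL parsing entirely. The client package (`tikv-client` on PyPI), the PD endpoint, and the `dev:<id>:<metric>` key scheme are all illustrative assumptions, not Tuya's actual code:

```python
# Sketch: bypass the SQL layer and use TiKV's raw KV API directly.
# The key scheme and endpoint below are assumptions for illustration.

def encode_device_key(device_id: str, metric: str) -> bytes:
    """Build a flat key like b'dev:<id>:<metric>'. A shared prefix keeps
    one device's data adjacent in TiKV's sorted key space."""
    return f"dev:{device_id}:{metric}".encode("utf-8")

def write_directly_to_tikv():
    # Requires a running TiKV cluster and `pip install tikv-client`;
    # call this manually -- it is not run at import time.
    from tikv_client import RawClient
    client = RawClient.connect(["127.0.0.1:2379"])  # PD endpoint (assumed)
    key = encode_device_key("light-0042", "last_seen")
    client.put(key, b"2021-06-30T12:00:00Z")  # no SQL parsing on this path
    return client.get(key)
```

Because reads and writes go straight to the storage layer, the per-request cost is a gRPC round trip plus a RocksDB operation, with no parser or planner in between, which is where the sub-millisecond latencies come from.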
After we used TiKV in production, the results were exciting and were recognized by the whole company.
Then, we launched TiKV 4.0 in various regions around the world. After a year of testing, no problems occurred and our systems ran normally. Without TiKV, we would have needed 12 machines; with TiKV, under the same configuration, three machines were enough. That cut our hardware cost by 75%.
When we used TiKV in the production environment, our throughput was already 200,000 TPS. At that time, we used TiKV 4.0.8 in our cluster located in North America. The P99 query latency was 150 microseconds, and the write latency was 360 microseconds (both less than one millisecond). If you have similar scenarios, you can also try it.
Our new challenge: deploying TiKV across regions
All of our applications were deployed across three regions, so we needed cross-region calls. Communication among the three replicas consumed network traffic, and we had to pay for it. But TiKV did not support keeping calls within a single region. So even though our hardware cost dropped by 75%, our network cost was higher than before.
Our solution was to enable gRPC message compression to reduce network traffic. But this only compressed the traffic for replicating Regions (the basic unit of key-value data movement in TiKV); it didn't reduce the application's own cross-region traffic.
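For reference, gRPC message compression in TiKV is a server option in `tikv.toml` (also editable via `tiup cluster edit-config`); a minimal sketch of the setting we mean:

```toml
# tikv.toml -- trade some CPU for lower cross-region network traffic.
# Note: this compresses gRPC messages between TiKV nodes; as described
# above, it does not remove the application's own cross-region traffic.
[server]
grpc-compression-type = "gzip"  # default is "none"; "deflate" is also accepted
```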
The root cause was that TiKV did not perform server-side filtering: we had to pull the data stored in TiKV to the application side, filter it there, and then write it back. We discussed this with the TiKV R&D team. Later TiKV versions may introduce server-side filtering to reduce server load, and the traffic cost may decrease as well.
Upgrading our architecture from x86 to Arm to reduce costs and increase efficiency
The IoT industry focuses on reducing costs because its gross profit margin is very low. In June 2020, AWS launched Amazon EC2 C6g instances, claiming up to 40% better price-performance than C5 instances.
We tried C6g instances, but when we compiled and deployed TiKV using TiUP, the package manager of the TiDB ecosystem, the response time was six to seven times longer than on the x86 architecture. TiUP deployed a generically compiled binary that was not optimized for the hardware. After testing TiKV, we found that TiKV 4.0 and its bundled RocksDB did not support hardware-accelerated CRC32 on Arm (on x86, RocksDB accelerates CRC32 via the SSE4.2 instruction set).
At that time, our compromise was a mixed deployment: TiKV ran on x86, while the other nodes ran on Arm. But this was inconvenient. When we upgraded TiKV, the image it pointed to would sometimes be x86 and sometimes Arm. That was troublesome, so we switched back to x86.
In 2021, TiKV 5.0 was released. It bundles RocksDB 6.4.6, which supports AArch64-optimized CRC32C (the Arm counterpart of the x86 SSE4.2 CRC32 instructions). We expect this to solve the latency issue we encountered with TiKV 4.0 on Arm. We'll test it in the second half of 2021, and our costs may be significantly reduced.
Our future plans
In the future, with the help of TiKV 5.0 and 5.1, we’re confident that we can handle our large business volume. We estimate that by the end of 2021, TiKV’s traffic will increase by two to three times.
Our big data platform also uses TiDB for large-screen dashboards. To make the application easier to use, we're considering TiKV 5.1 for storage in our IoT device pipeline. In the second half of 2021, we plan to deploy the Arm version of TiDB.
If you’d like to know more about our story or have any questions, you’re welcome to join the TiKV community on Slack and send us your feedback.