Cluster Diagnostics: Troubleshoot Cluster Issues Using Only SQL Queries

Cluster diagnostic report

How cluster diagnostics makes your job easier

Troubleshooting should offer a consistent user experience and a shallow learning curve. A lot of information is relevant to troubleshooting, including static cluster information (such as topology, versions, and configuration) and runtime information (such as logs, load, and monitoring metrics). We reorganize this information across the entire cluster so that you can access all of it with SQL queries, without the need for external tools. At the same time, you can save common SQL statements as scripts and write different troubleshooting scripts for different application scenarios.
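
For example, the following queries (a minimal sketch, assuming a TiDB 4.0 or later cluster, which ships the information_schema.cluster_info and cluster_config system tables) pull the topology and configuration of every component from a single SQL session:

-- List every TiDB, TiKV, and PD instance along with its version and uptime.
SELECT type, instance, version, start_time, uptime
FROM information_schema.cluster_info;

-- Check log-related configuration items across all components in one query.
SELECT type, instance, `key`, value
FROM information_schema.cluster_config
WHERE `key` LIKE 'log%';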

Querying logs with cluster diagnostics

In a TiDB cluster, a single transaction might involve multiple instances of TiDB’s storage engine, TiKV. Before TiDB 4.0, if you wanted to view logs related to a specific transaction ID (txn_id), you might need to log in to every node and search its logs with the grep command. But in TiDB 4.0 and later, we offer the cluster log table, information_schema.cluster_log, so you can view all relevant logs across the cluster with a single SQL statement. For example:

SELECT * FROM information_schema.cluster_log WHERE message LIKE '%{txn_id}%' AND time > '2020-03-27 15:39:00' AND time < '2020-03-27 15:50:00';
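
The same table answers broader questions as well. As a sketch (in TiDB 4.0+ cluster_log exposes time, type, instance, level, and message columns; the filters below are illustrative), this pulls only error-level TiKV logs in the same window:

-- Only error-level log entries from TiKV instances in the given window.
SELECT time, instance, level, message
FROM information_schema.cluster_log
WHERE type = 'tikv'
  AND level = 'error'
  AND time > '2020-03-27 15:39:00'
  AND time < '2020-03-27 15:50:00';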

Cluster diagnostic reports

In TiDB 4.0, if you want to diagnose or inspect the cluster within a time range, or check its load, you can generate a diagnostic report for that period in TiDB Dashboard. The report contains the diagnostic results for the period as well as monitoring and configuration information for each component in the system.

The instance CPU usage report

The instance CPU usage report lets you view the average (AVG), maximum (MAX), and minimum (MIN) CPU usage for TiDB, Placement Driver (PD), and TiKV instances. You can use this report to quickly judge whether the cluster’s load is balanced and whether it has hotspots.

The instance CPU usage report
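
If you prefer SQL to the Dashboard view, you can approximate these numbers from the metrics summary tables. The query below is a sketch, assuming the CPU metrics in your deployment have names containing "cpu" and that the time_range hint bounds the summarized window; exact column sets can vary slightly by TiDB version:

-- Average, maximum, and minimum values for CPU-related metrics in the window.
SELECT /*+ time_range("2020-03-27 15:39:00", "2020-03-27 15:50:00") */
       metrics_name, label, avg_value, max_value, min_value
FROM information_schema.metrics_summary_by_label
WHERE metrics_name LIKE '%cpu%'
ORDER BY avg_value DESC;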

The monitoring execution time report

The monitoring execution time report presents the execution time of each monitoring item in the cluster and what percentage of the total query execution time it accounts for. You can use this report to quickly determine whether a component’s execution time is too long and whether there is a bottleneck.

The monitoring execution time report
  • LABEL: The label information for the monitoring item. To see more detailed monitoring information for each label, click expand.
  • TIME_RATIO: The ratio of this monitoring item’s total execution time to the total execution time of the monitoring item whose TIME_RATIO is 1. For example, the ratio of the total execution time for tidb_kv_request to that for tidb_query is 151709.3/174648.19 = 0.87.
  • TOTAL_TIME: The total execution time, in seconds, for the monitoring item.
  • TOTAL_COUNT: The number of times the monitoring item ran.
  • P999: The execution time, in seconds, that 99.9% of the monitoring samples fall below.
  • P99: The execution time, in seconds, that 99% of monitoring samples fall below.
  • P90: The execution time, in seconds, that 90% of monitoring samples fall below.
  • P80: The execution time, in seconds, that 80% of monitoring samples fall below.
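
TiDB Dashboard generates this report for you, but you can pull similar raw totals yourself from information_schema.metrics_summary. The query below is a sketch: the %total_time% name filter is illustrative, and the listed columns are the common subset available in TiDB 4.0+:

-- The monitoring items that accumulated the most time in the chosen window.
SELECT /*+ time_range("2020-03-30 23:45:00", "2020-03-30 23:50:00") */
       metrics_name, sum_value, avg_value, comment
FROM information_schema.metrics_summary
WHERE metrics_name LIKE '%total_time%'
ORDER BY sum_value DESC
LIMIT 10;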

How to generate diagnostic reports

To generate a diagnostic report:

  • Choose a time range
  • View the full report
  • Generate a comparison report

Cases for automatic diagnostics and time range comparison

Here are two cases that show how cluster diagnostics helped us quickly find system problems.

Automatic diagnostics and system inspection

Cluster diagnostics automatically diagnoses system faults and potential problems in the current cluster. This feature lets you inspect the cluster as a whole and analyze the system for bottlenecks, without checking the monitoring information for each node one by one. Diagnostic results are output to the information_schema.inspection_result system table, and querying this table is what triggers the diagnostics. When you encounter a problem, you can query this table first to find the cause. Current diagnostic rules include:

  • version: Checks version consistency. It checks whether versions of the same type of components are consistent.
  • node-load: Checks server node load. It checks whether CPU, memory, and disk usage are too high.
  • critical-error: Checks critical errors in the system, such as server is busy, component restarting, and failures to write the binlog.
  • threshold-check: Checks whether some monitoring metrics exceed threshold values, for example:
      • Whether the CPU usage of a thread in TiKV’s components exceeds threshold values
      • Whether leaders and Regions (the basic data storage unit in TiKV) are balanced among TiKV instances
      • Whether a TiKV instance has too many Regions (for example, more than 20,000 Regions on a single instance)
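
You can also filter the results to a single rule. As a sketch (the time_range hint bounds the inspected window, just as in the case below):

-- Check only the threshold-check rule for the given five-minute window.
SELECT /*+ time_range("2020-03-30 23:45:00", "2020-03-30 23:50:00") */ *
FROM information_schema.inspection_result
WHERE rule = 'threshold-check';
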
For example, in one case the 999th percentile latency and QPS suddenly jittered. Querying inspection_result with a time_range hint for that period immediately surfaced the cause:
mysql> SELECT /*+ time_range("2020-03-30 23:45:00", "2020-03-30 23:50:00") */ * FROM inspection_result;
+----------------+-------------+------+-------------------+-------+-----------+----------+--------------------------------------------------------------------+
| RULE           | ITEM        | TYPE | INSTANCE          | VALUE | REFERENCE | SEVERITY | DETAILS                                                            |
+----------------+-------------+------+-------------------+-------+-----------+----------+--------------------------------------------------------------------+
| critical-error | server-down | tikv | 172.16.5.40:23151 |       |           | critical | tikv 172.16.5.40:22151 restarted at time '2020/03/30 23:48:38.763' |
+----------------+-------------+------+-------------------+-------+-----------+----------+--------------------------------------------------------------------+

Comparing two time ranges to quickly find system problems

Automatic diagnostics does not always pinpoint the problem. More often than not, you need to find the problem based on monitoring and other information. However, a TiDB cluster has many monitoring metrics. To quickly identify abnormal monitoring items, you can compare two time ranges.

For example, the query’s 999th percentile latency and QPS suddenly dropped. We compared the abnormal period with a normal one:
  • The time range t1 with a normal cluster status: ("2020-03-03 17:08:00", "2020-03-03 17:11:00")
  • The time range t2 with an abnormal cluster status: ("2020-03-03 17:18:00", "2020-03-03 17:21:00")

The following statement joins the metrics summaries of the two ranges and sorts the metrics by how much they changed:
SELECT GREATEST(t1.avg_value, t2.avg_value) / LEAST(t1.avg_value, t2.avg_value) AS ratio,
       t1.metrics_name,
       t1.avg_value AS t1_avg_value,
       t2.avg_value AS t2_avg_value,
       t2.comment
FROM
  (SELECT /*+ time_range("2020-03-03 17:08:00", "2020-03-03 17:11:00") */ *
   FROM information_schema.metrics_summary) t1
JOIN
  (SELECT /*+ time_range("2020-03-03 17:18:00", "2020-03-03 17:21:00") */ *
   FROM information_schema.metrics_summary) t2
ON t1.metrics_name = t2.metrics_name
ORDER BY ratio DESC LIMIT 10;
The query results showed that:
  • The tidb_distsql_partial_scan_key_total_num (the number of keys scanned by TiDB's distsql requests) in t2 was about 3,648 times that in t1.
  • The tidb_slow_query_cop_wait_total_time (cop requests' waiting time in TiDB slow queries) in t2 was about 267 times that in t1.
  • The tikv_cop_total_response_size (the size of the result returned by TiKV's cop requests) in t2 was about 192 times that in t1.
  • The tikv_cop_scan_details (the number of scans of TiKV's cop requests) in t2 was about 105 times that in t1.
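
From there, you can drill into a single suspicious metric to see which label or component drove the change. As a sketch (assuming the metrics_summary_by_label table, which breaks the same summaries down by label, is available in your TiDB 4.0+ cluster):

-- Break down the cop scan metric for the abnormal range t2 by label.
SELECT /*+ time_range("2020-03-03 17:18:00", "2020-03-03 17:21:00") */
       metrics_name, label, sum_value, avg_value
FROM information_schema.metrics_summary_by_label
WHERE metrics_name = 'tikv_cop_scan_details'
ORDER BY sum_value DESC;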

Summary

The cluster diagnostics feature simplifies and streamlines TiDB cluster monitoring. Say goodbye to multiple tools and multiple learning curves. Further, you can monitor and analyze the entire cluster at the same time. No more checking each node individually.
