TiDB Internal (III) — Scheduling

Why scheduling?

From the first post of this TiDB Internal series, we know that the TiKV cluster is the distributed KV storage engine of TiDB. Data is replicated and managed in Regions, and each Region has multiple Replicas distributed on different TiKV nodes. Among these Replicas, the Leader is in charge of reads and writes, while the Followers replicate the Raft log sent by the Leader. Now, please think about the following questions:

  • When the TiKV cluster is deployed across multiple sites for disaster recovery, how do we guarantee that multiple Replicas of a Raft Group are not lost when one datacenter suffers an outage?
  • After a new node is added to the TiKV cluster, how do we move data from the other nodes onto it?
  • What happens when a node fails? What does the whole cluster need to do? If the node fails only temporarily (e.g. a service restart), how should it be handled? What about a long-term failure (e.g. disk failure or data loss)?
  • Assume that each Raft Group is required to have N Replicas. A single Raft Group might have too few Replicas (e.g. after a node failure loses one) or too many (e.g. a once-failed node functions again and automatically rejoins the cluster). How do we adjust the number of Replicas?
  • Since reads and writes are performed by the Leader, what happens to the cluster if all Leaders gather on a few nodes?
  • Not all Regions are accessed equally often, and the hotspot probably resides in a few Regions. In this case, what should we do?
  • The cluster needs to migrate data during load balancing. Will this data migration consume substantial network bandwidth, disk I/O, and CPU, and affect the online service?

The Requirements of Scheduling

Let me categorize and sort out the questions listed above. In general, they fall into two types.

As a distributed, highly-available storage system, TiKV must meet the following requirements:

  • Each Raft Group has the correct number of Replicas.
  • Replicas should be distributed on different machines.
  • After new nodes are added, Replicas on other nodes can be migrated to them.
  • When a node goes offline, the data on that node should be migrated away.

As a good distributed system, TiKV also needs balanced resource usage and manageable node states:

  • A balanced distribution of Leaders among nodes.
  • A balanced storage capacity on each node.
  • A balanced distribution of hotspot accesses.
  • Control over the speed of balancing, so that it does not impact the online service.
  • Management of node states, including manually taking nodes online/offline and automatically taking faulty nodes offline.

The Basic Operations of Scheduling

The basic operations of scheduling are the simplest part: they are what we can actually do to carry out a scheduling policy, and they are the essence of the whole scheduler. The scheduling requirements above may look complicated, but they can all be served by three basic operations:

  • Add a Replica.
  • Delete a Replica.
  • Transfer the Leader role between different Replicas of a Raft Group.

The Raft protocol happens to support exactly these operations through its membership-change (add/remove Replica) and leader-transfer mechanisms, as sketched below.
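
For concreteness, here is a minimal sketch in Go of how a scheduler could represent these three operations as commands to be applied to a Region. It is an illustration under simplified assumptions, not PD's actual API; all names are made up for this post.

    package scheduler

    // OperatorKind identifies one of the three basic scheduling operations.
    type OperatorKind int

    const (
        AddReplica     OperatorKind = iota // add a Replica on a target store
        RemoveReplica                      // delete a Replica from a store
        TransferLeader                     // move the Leader role to another Replica
    )

    // Operator is a single scheduling step applied to one Region.
    type Operator struct {
        Kind     OperatorKind
        RegionID uint64
        StoreID  uint64 // the store to add to, remove from, or transfer the Leader to
    }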

Information Collecting

Scheduling depends on collecting information about the whole cluster. Simply put, PD needs to know the state of every TiKV node and of every Region. The TiKV cluster reports two kinds of information to PD.

Each TiKV node regularly reports the state of the node, including:

  • free disk capacity
  • the number of Regions on the node
  • the data writing speed
  • the number of Snapshots sent/received (Replicas synchronize data through Snapshots)
  • whether the node is overloaded
  • label information (a Label is a series of Tags with a hierarchical relationship)

The Leader of each Raft Group regularly reports the state of its Region, including:

  • the positions of the Leader and the Followers
  • the number of offline Replicas
  • the data reading/writing speed of the Region
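
As a rough illustration, the two kinds of heartbeat information could be modeled as follows. This is a sketch with simplified, assumed field names, not the actual pdpb protobuf definitions used by PD.

    package scheduler

    // StoreState is what each TiKV node reports about itself.
    type StoreState struct {
        StoreID        uint64
        FreeBytes      uint64            // free disk capacity
        RegionCount    int               // number of Regions on this store
        WriteBytesSec  uint64            // data writing speed
        SendingSnaps   int               // Snapshots currently being sent
        ReceivingSnaps int               // Snapshots currently being received
        IsBusy         bool              // whether the node is overloaded
        Labels         map[string]string // e.g. {"zone": "z1", "rack": "r3", "host": "h7"}
    }

    // RegionState is what the Leader of each Raft Group reports about its Region.
    type RegionState struct {
        RegionID       uint64
        LeaderStoreID  uint64   // position of the Leader
        FollowerStores []uint64 // positions of the Followers
        DownReplicas   int      // number of offline Replicas
        ReadBytesSec   uint64   // data reading speed of the Region
        WriteBytesSec  uint64   // data writing speed of the Region
    }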

The Policy of Scheduling

After gathering this information, PD needs a set of policies to turn it into a concrete scheduling plan.

First of all, the number of Replicas of a Region must be correct. When PD learns through a Region Leader's heartbeat that the Replica count no longer matches the configured value, it adds or removes Replicas. The count can become wrong when, for example:

  • a node fails and the Replicas on it are lost, so new Replicas need to be added elsewhere;
  • a dropped node functions again and automatically rejoins the cluster, leaving a redundant Replica that needs to be removed;
  • the administrator has modified the replica policy, i.e. the max-replicas configuration.

In addition, multiple Replicas of one Raft Group should not be placed in the same failure domain, so that deployment requirements like the following can be met:

  • TiKV nodes are distributed on multiple physical servers: when a server powers down, the system should still be available.
  • TiKV nodes are distributed on multiple IDCs: when a datacenter powers down, the system should still be available.

The label information reported by TiKV tells PD which nodes share a machine, a rack, or a datacenter, and the placement policies are evaluated against these labels. A sketch of the first policy follows.
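
Building on the Operator and heartbeat types sketched above, a deliberately naive version of the replica-count policy might look like this. The maxReplicas constant and the helper functions are assumptions made for this sketch; PD's real scheduler is considerably more sophisticated and also weighs labels, capacity, and load.

    package scheduler

    // maxReplicas stands in for the max-replicas configuration.
    const maxReplicas = 3

    // checkReplicaCount returns the operations needed to bring one Region back
    // to the configured number of Replicas, or nil if the count is already correct.
    func checkReplicaCount(r RegionState, stores []StoreState) []Operator {
        count := 1 + len(r.FollowerStores) // the Leader plus its Followers
        switch {
        case count < maxReplicas:
            // Too few Replicas (e.g. a node failed): add one on a suitable store.
            target := pickTargetStore(r, stores)
            return []Operator{{Kind: AddReplica, RegionID: r.RegionID, StoreID: target}}
        case count > maxReplicas:
            // Too many Replicas (e.g. a failed node rejoined): remove a redundant one.
            victim := pickRedundantReplica(r)
            return []Operator{{Kind: RemoveReplica, RegionID: r.RegionID, StoreID: victim}}
        default:
            return nil
        }
    }

    // pickTargetStore picks a store that does not already hold a Replica of r,
    // preferring the one with the most free space. A real policy would also
    // require a different machine/rack/IDC based on the reported labels.
    func pickTargetStore(r RegionState, stores []StoreState) uint64 {
        holds := map[uint64]bool{r.LeaderStoreID: true}
        for _, id := range r.FollowerStores {
            holds[id] = true
        }
        var best StoreState // zero value: StoreID 0 means "no candidate found"
        for _, s := range stores {
            if !holds[s.StoreID] && !s.IsBusy && s.FreeBytes > best.FreeBytes {
                best = s
            }
        }
        return best.StoreID
    }

    // pickRedundantReplica picks a Follower to remove (never the Leader). A real
    // policy would prefer the Replica whose removal best preserves placement rules.
    func pickRedundantReplica(r RegionState) uint64 {
        return r.FollowerStores[0]
    }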

The Implementation of Scheduling

Now let's see how the scheduling process works. PD continuously collects cluster information through the heartbeats described above and checks it against the policies. When PD receives a heartbeat from a Region Leader, it decides whether any operation should be performed on that Region and sends the operation back in the heartbeat reply. These operations are only suggestions: the Region Leader decides whether and when to execute them based on its own state. A minimal sketch of this heartbeat-driven loop is given below.
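
Continuing the same sketch (again, assumed names rather than PD's real code), the loop can be pictured as: every heartbeat updates PD's view of the cluster, and every Region heartbeat is an opportunity to hand the Region Leader a suggested operation in the reply.

    package scheduler

    // HeartbeatResponse carries at most one suggested operation back to the
    // Region Leader; an empty response means there is nothing to do.
    type HeartbeatResponse struct {
        Op *Operator
    }

    // Scheduler holds PD's latest view of the cluster.
    type Scheduler struct {
        stores  map[uint64]StoreState
        regions map[uint64]RegionState
    }

    // NewScheduler returns an empty scheduler.
    func NewScheduler() *Scheduler {
        return &Scheduler{
            stores:  make(map[uint64]StoreState),
            regions: make(map[uint64]RegionState),
        }
    }

    // OnStoreHeartbeat records the latest state of a TiKV node.
    func (s *Scheduler) OnStoreHeartbeat(st StoreState) {
        s.stores[st.StoreID] = st
    }

    // OnRegionHeartbeat records the Region state and, if a policy asks for a
    // change, returns one operation as a suggestion in the heartbeat reply.
    // It is up to the Region Leader to decide whether and when to execute it.
    func (s *Scheduler) OnRegionHeartbeat(r RegionState) HeartbeatResponse {
        s.regions[r.RegionID] = r

        stores := make([]StoreState, 0, len(s.stores))
        for _, st := range s.stores {
            stores = append(stores, st)
        }
        if ops := checkReplicaCount(r, stores); len(ops) > 0 {
            return HeartbeatResponse{Op: &ops[0]}
        }
        // Other policies (balance Leaders, balance Regions, spread hotspots)
        // would be consulted here in the same way.
        return HeartbeatResponse{}
    }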

Summary

This post covers information you might not easily find elsewhere. We hope it has given you a better understanding of what needs to be considered when building a scheduling system for a distributed storage system, and of how to decouple policies from implementation so that scheduling policies can be extended more flexibly.
