10x Performance Improvement for Expression Evaluation Made Possible by Vectorized Execution and the Community

  • Changed the once-per-tuple iteration (the Volcano model) to a once-per-batch iteration (1024 tuples). See PR #5178.
  • Optimized various query operators based on the principles of vectorized query execution. See PR #5184.

Why do we need vectorized execution?

Previously, TiDB implemented a Volcano-style execution engine. This iterator model uses a standard data access interface: open()-next()-close() calls between algebra operators, processing data one row at a time. The Volcano model is simple and extensible.
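To make the iterator model concrete, here is a minimal sketch of what a Volcano-style operator interface looks like. The names Operator, Next, and drain are illustrative, not TiDB's actual definitions, and Row stands for whatever tuple representation the engine uses:

// A minimal sketch of the Volcano iterator model. Each operator
// implements open()-next()-close(), and Next produces one row per call.
type Operator interface {
    Open() error
    Next() (row Row, ok bool, err error) // ok is false once the input is exhausted
    Close() error
}

// Pulling rows through an operator tree is a once-per-tuple loop.
func drain(op Operator) error {
    if err := op.Open(); err != nil {
        return err
    }
    defer op.Close()
    for {
        _, ok, err := op.Next()
        if err != nil || !ok {
            return err
        }
    }
}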

An expression evaluation tree
type Node interface {
    evalReal(row Row) (val float64, isNull bool)
}

func (node *multiplyRealNode) evalReal(row Row) (float64, bool) {
    v1, null1 := node.leftChild.evalReal(row)
    v2, null2 := node.rightChild.evalReal(row)
    return v1 * v2, null1 || null2
}

func (node *constantNode) evalReal(row Row) (float64, bool) {
    return node.someConstantValue, node.isNull // 0.8 in this case
}

func (node *columnNode) evalReal(row Row) (float64, bool) {
    return row.GetReal(node.colIdx)
}
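Evaluating such a tree means walking it once for every row. The driver loop below is hypothetical (evalAll is not a TiDB function), but it makes the once-per-tuple pattern explicit:

// Hypothetical driver: each row triggers a full recursive descent
// through the expression tree, one interface call per node per row.
func evalAll(root Node, rows []Row) (vals []float64, nulls []bool) {
    for _, row := range rows {
        v, isNull := root.evalReal(row)
        vals = append(vals, v)
        nulls = append(nulls, isNull)
    }
    return vals, nulls
}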

The interpretation overhead of row-based execution

Similar to the Volcano model, the expression implementation above iterates over rows. Below is the row-based implementation of real-number multiplication in TiDB; the number in the comment at the end of each line is roughly how many assembly instructions that line compiles to, which shows where the interpretation overhead accumulates.

func (s *builtinArithmeticMultiplyRealSig) evalReal(row chunk.Row) (float64, bool, error) { // 9
    a, isNull, err := s.args[0].EvalReal(s.ctx, row) // 27
    if isNull || err != nil { // 3
        return 0, isNull, err
    }
    b, isNull, err := s.args[1].EvalReal(s.ctx, row) // 25
    if isNull || err != nil { // 3
        return 0, isNull, err
    }
    result := a * b // 2
    if math.IsInf(result, 0) { // 6
        return 0, true, types.ErrOverflow.GenWithStackByArgs(...)
    }
    return result, false, nil // 7
}

Batch processing reduces interpretation overhead

As with vectorized optimization for SQL operators, we can process and return one batch of data (1024 rows) at a time. This amortizes the interpretation overhead of expression evaluation over the whole batch.

type Node interface {
    batchEvalReal(rows []Row) (vals []float64, isNull []bool)
}
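Under this batch interface, the multiplication node from earlier might look like the sketch below (illustrative only; TiDB's actual interface is shown in the next section). The recursive interface calls now happen once per batch rather than once per row, which is where the savings come from:

// Illustrative batch implementation of multiplyRealNode under the
// batchEvalReal interface above; not TiDB's actual code.
func (node *multiplyRealNode) batchEvalReal(rows []Row) ([]float64, []bool) {
    v1, null1 := node.leftChild.batchEvalReal(rows)
    v2, null2 := node.rightChild.batchEvalReal(rows)
    // Reuse the left child's slices as the output to avoid extra allocations.
    for i := range v1 {
        v1[i] *= v2[i]
        null1[i] = null1[i] || null2[i]
    }
    return v1, null1
}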

What does our vector-processing interface look like? Why?

The real source code is not exactly the same as the model shown above. It looks like this:

type VecNode interface {
    vecEvalReal(input *Chunk, result *Column)
}
The interface operates on TiDB's Chunk structure. A Chunk consists of multiple columns, which come in two forms:
  • Fixed-length columns, in which the data has a fixed length that cannot change.
  • Variable-length columns, in which the data length can change.
TiDB’s Chunk structure

New vector access interface
  • For variable-length data, such as strings, we can only use GetString(rowIdx int) string to obtain the data in the corresponding row, and can only append data to update the column. Randomly modifying an element in a variable-length column would involve moving all the subsequent data, which creates a heavy overhead. To improve overall performance, this operation is not implemented in Column.
  • We accessed data (which was stored in Column.data) via Column instead of Row to reduce the number of function calls. This helped decrease interpretation overhead and sped up data access.
  • We put the columns that store data in parameters and passed the parameters instead of directly returning []float64 and []bool slices. This improved the memory reuse rate and reduced the Golang GC overhead. The sketch after this list shows the resulting access pattern.
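For example, a read path that combines these points might look like the following sketch. The function sumAndCopy is hypothetical, and AppendString is shorthand for the append-only update described above; Float64s, GetString, and IsNull are the Column accessors used elsewhere in this post:

// Hypothetical example of the column-wise access pattern.
func sumAndCopy(nums, strs, out *Column, n int) float64 {
    sum := 0.0
    f64s := nums.Float64s() // fixed-length data: one call exposes the whole column
    for i := 0; i < n; i++ {
        if !nums.IsNull(i) {
            sum += f64s[i]
        }
        // Variable-length data: read by row, update by appending only.
        out.AppendString(strs.GetString(i))
    }
    return sum
}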

How do we implement vectorized execution?

This section covers how to implement vectorized execution for functions.

func (node *multiplyRealNode) vecEvalReal(input *Chunk, result *Column) {
    buf := pool.AllocColumnBuffer(TypeReal, input.NumRows())
    defer pool.ReleaseColumnBuffer(buf)
    // Evaluate both children: the left child into result, the right into buf.
    node.leftChild.vecEvalReal(input, result)
    node.rightChild.vecEvalReal(input, buf)
    f64s1 := result.Float64s()
    f64s2 := buf.Float64s()
    // A row is NULL if either input is NULL.
    result.MergeNulls(buf)
    for i := range f64s1 {
        if result.IsNull(i) {
            continue
        }
        f64s1[i] *= f64s2[i]
    }
}
Let’s look at how this function processes the data:

  1. It calls the vecEvalReal interfaces of its left and right children to get their data.
  2. Column.MergeNulls(cols...) merges the NULL flags of multiple columns. This method is like result.nulls[i] = result.nulls[i] || buf.nulls[i]. Column internally uses a bitmap to maintain NULL flags, so when this function is called, the column merges the NULLs with a bitwise operation (see the sketch after this list).
  3. A loop directly multiplies the data of the left and right child nodes.

  • Most of the computational work is within a simple loop. This facilitates CPU branch prediction and instruction pipelining.
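A minimal sketch of such a bitmap merge follows; the field name nullBitmap and its bit convention are assumptions for illustration, not TiDB's actual layout:

// Illustrative bitmap-based NULL merge: with one bit per row,
// a single OR instruction merges the NULL flags of 64 rows.
type column struct {
    nullBitmap []uint64 // bit i set => row i is NULL (illustrative convention)
}

func (c *column) mergeNulls(other *column) {
    for i := range c.nullBitmap {
        c.nullBitmap[i] |= other.nullBitmap[i]
    }
}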

Benchmark comparison between vectorized execution and row-based execution

In this section, I’ll use the TiDB source code for benchmark testing and compare the performance before and after code vectorization.

BenchmarkVec-12    152166     7056 ns/op    0 B/op    0 allocs/op
BenchmarkRow-12     28944    38461 ns/op    0 B/op    0 allocs/op
Before-and-after performance comparison for various LT functions

Before-and-after performance comparison for arithmetic functions

Performance improvement for vectorized functions

How do we vectorize 360+ built-in functions?

On our journey toward TiDB 4.0, expression vectorization was a huge project, as expressions involve more than 500 built-in functions. Because TiDB has so many built-in functions, and we have a fairly small staff, handcrafting them one by one was a nearly impossible mission.

Vectorization using a template

When we vectorized built-in functions, we found that many functions had similarities. For example, most of the LT (<), GT (>), and LE (<=) functions have similar logic. They only differ in the operators they use. Therefore, it’s possible to use a template to generate the code of these functions.

var builtinCompareVecTpl = template.Must(template.New("").Parse(`
func (b *builtin{{ .compare.CompareName }}{{ .type.TypeName }}Sig) vecEvalInt(input *chunk.Chunk, result *chunk.Column) error {
    n := input.NumRows()
    buf0, err := b.bufAllocator.get(types.ET{{ .type.ETName }}, n)
    if err != nil {
        return err
    }
    if err := b.args[0].VecEval{{ .type.TypeName }}(b.ctx, input, buf0); err != nil {
        return err
    }
    ...
    for i := 0; i < n; i++ {
        {{ if eq .type.ETName "Json" }}
        val := json.CompareBinary(buf0.GetJSON(i), buf1.GetJSON(i))
        {{ else if eq .type.ETName "Real" }}
        val := types.CompareFloat64(arg0[i], arg1[i])
        {{ else if eq .type.ETName "String" }}
        val := types.CompareString(buf0.GetString(i), buf1.GetString(i))
        ...
        if val {{ .compare.Operator }} 0 {
            i64s[i] = 1
        } else {
            i64s[i] = 0
        }
    }
    return nil
}
`))
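The template is only half the story: a small driver program renders it for every (comparison, type) combination and writes the gofmt-ed result to a generated file. The sketch below shows what such a driver can look like; the context structs, the combinations, and the output file name are assumptions for illustration, and it presumes a complete template body (no elisions):

// Illustrative generator driving the template above; the data
// structures and file name are assumptions, not TiDB's actual code.
package main

import (
    "bytes"
    "go/format"
    "log"
    "os"
)

type compareCtx struct{ CompareName, Operator string }
type typeCtx struct{ TypeName, ETName string }

func main() {
    var buf bytes.Buffer
    compares := []compareCtx{{"LT", "<"}, {"LE", "<="}, {"GT", ">"}}
    evalTypes := []typeCtx{{"Real", "Real"}, {"String", "String"}, {"Json", "Json"}}
    for _, cmp := range compares {
        for _, tp := range evalTypes {
            data := map[string]interface{}{"compare": cmp, "type": tp}
            if err := builtinCompareVecTpl.Execute(&buf, data); err != nil {
                log.Fatal(err)
            }
        }
    }
    // format.Source accepts a list of declarations, so the generated
    // functions can be gofmt-ed without a package clause.
    src, err := format.Source(buf.Bytes())
    if err != nil {
        log.Fatal(err)
    }
    if err := os.WriteFile("builtin_compare_vec_generated.go", src, 0644); err != nil {
        log.Fatal(err)
    }
}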

Vectorization with the community’s help

Besides using a template to vectorize built-in functions, we also started a vectorization campaign in the community to get more hands involved. At the beginning of the campaign, we found that conflicts often occurred when we merged PRs. Therefore, we wrote a script that generates vectorization interface stubs for all built-in functions and reserved these interfaces in the code:

func (b *builtinCastStringAsIntSig) vecEvalInt(input *chunk.Chunk, result *chunk.Column) error {
    return errors.Errorf("not implemented")
}

func (b *builtinCastStringAsDurationSig) vectorized() bool {
    return false
}
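These stubs let the evaluator choose a path per function at run time: if vectorized() returns false, it falls back to row-based evaluation. The dispatch sketch below is illustrative, not TiDB's actual evaluator; it assumes an interface (here called builtinFunc) exposing the vectorized, vecEvalInt, and evalInt methods shown in this post, plus Column helpers such as ResizeInt64 and SetNull:

// Illustrative dispatch between the vectorized and row-based paths.
func evalIntColumn(sig builtinFunc, input *chunk.Chunk, result *chunk.Column) error {
    if sig.vectorized() {
        return sig.vecEvalInt(input, result)
    }
    // Row-based fallback: one interface call per row.
    n := input.NumRows()
    result.ResizeInt64(n, false)
    i64s := result.Int64s()
    for i := 0; i < n; i++ {
        v, isNull, err := sig.evalInt(input.GetRow(i))
        if err != nil {
            return err
        }
        if isNull {
            result.SetNull(i, true)
        } else {
            i64s[i] = v
        }
    }
    return nil
}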
We also provided a test and benchmark framework, so contributors only need to register a case for the function they vectorized:

var vecBuiltinArithmeticCases = map[string][]vecExprBenchCase{
    ast.Div: {
        {retEvalType: types.ETReal, childrenTypes: []types.EvalType{types.ETReal, types.ETReal}},
    },
}

func (s *testEvaluatorSuite) TestVectorizedBuiltinArithmeticFunc(c *C) {
    testVectorizedBuiltinFunc(c, vecBuiltinArithmeticCases)
}

func BenchmarkVectorizedBuiltinArithmeticFunc(b *testing.B) {
    benchmarkVectorizedBuiltinFunc(b, vecBuiltinArithmeticCases)
}
> GO111MODULE=on go test -check.f TestVectorizedBuiltinArithmeticFunc
PASS: builtin_arithmetic_vec_test.go:30: testEvaluatorSuite.TestVectorizedBuiltinArithmeticFunc  0.002s
OK: 1 passed
PASS
ok      github.com/pingcap/tidb/expression      0.636s

> go test -v -benchmem -run=BenchmarkVectorizedBuiltinArithmeticFunc -bench=BenchmarkVectorizedBuiltinArithmeticFunc
BenchmarkVectorizedBuiltinArithmeticFunc/builtinArithmeticDivideRealSig-VecBuiltinFunc-12       424515    2808 ns/op    0 B/op    0 allocs/op
BenchmarkVectorizedBuiltinArithmeticFunc/builtinArithmeticDivideRealSig-NonVecBuiltinFunc-12     47856   25272 ns/op    0 B/op    0 allocs/op
PASS
ok      github.com/pingcap/tidb/expression      4.116s

What’s next

By introducing vectorized execution, we’ve dramatically improved the performance of expression execution. Currently, vectorized execution is enabled on our master branch by default. Once all built-in functions of an expression support vectorized execution, this expression will use vectorized execution. In the future, we’ll also use vectorized execution to improve TiDB’s HTAP capabilities.

Beyond expressions, we have also begun vectorizing operators:

  • In hash aggregation, we vectorized the data encoding. In a local test, the performance was 20% to 60% faster than before. For details, see PR #12729.
  • In the StreamAggregation operator, we vectorized dividing data into groups. In a local test, the performance improved by about 60%. For details, see PR #12903.
