Try to Fix Two Linux Kernel Bugs While Testing TiDB Operator in K8s

Bug #1: Unstable kmem accounting

Keyword: SLUB: Unable to allocate memory on node -1

Issue description

When we tested TiKV’s online transaction processing (OLTP) performance in K8s, I/O performance occasionally jittered. However, the following items appeared normal:

  • CPU (no bottleneck)
  • Memory and disk statistics from the load information

Issue analysis

We used funcslower in perf-tools to trace kernel functions that were executed slowly and adjusted the threshold value of the hung_task_timeout_secs kernel parameter. When we performed the write operations in TiKV, we found the following kernel path information:

Solution

Based on our issue analysis, we can fix the bug in either of the following ways:

  • Disable the kmem accounting feature when starting the container.
$ make BUILDTAGS="nokmem"
func EnableKernelMemoryAccounting(path string) error {
return nil
}
func setKernelMemory(path string, kernelMemoryLimit int64) error {
return nil
}
$ cat /sys/fs/cgroup/memory/kubepods/burstable/pod<pod-uid>/<container-id>/memory.kmem.slabinfo
> cat: memory.kmem.slabinfo: Input/output error

Bug #2: Network device reference count leak

Keyword: kernel:unregister_netdevice: waiting for eth0 to become free. Usage count = 1

Issue description

After the K8s platform ran for a time, the following message was displayed: “Kernel:unregister_netdevice: waiting for eth0 to become free. Usage count = 1”. In addition, multiple processes would be in an uninterruptible state. The only workaround for this problem was to restart the server.

Issue analysis

We used the crash tool to analyze vmcore. We found that the netdev_wait_allrefs function blocked the kernel thread, and that the function was in an infinite loop waiting for dev->refcnt to drop to 0. Because the pod had been released, we suspected that this issue was due to a reference count leak. After analyzing K8s issues, we found that the kernel was the crux of the issue, but there was no simple, stable, and reliable method to reproduce the issue. Also, this issue still occurred in the later community version of the kernel.

  • It was difficult to judge whether a reference count leak issue existed in the kernel module. netdev_wait_allrefs would continually retry releasing the NETDEV_UNREGISTER and NETDEV_UNREGISTER_FINAL messages to all message subscribers via notification chains. We traced this issue and found there were up to 22 subscribers. It was difficult to access the processing logic of all the callback functions registered by these subscribers and therefore difficult to decide whether we had a way to avoid misjudgement.

Solution

As we prepared to drill down into the callback function logic registered by each subscriber, we also followed up on the progress of kernel patches and RHEL. We found that solutions:3659011 of RHEL had an update that mentioned a patch submitted by the upstream link: route: set the deleted fnhe fnhe_daddr to 0 in ip_del_fnhe to fix a race.

UNAME=$(uname -r)
sudo yum install gcc kernel-devel-${UNAME%.*} elfutils elfutils-devel
sudo yum install pesign yum-utils zlib-devel \
binutils-devel newt-devel python-devel perl-ExtUtils-Embed \
audit-libs audit-libs-devel numactl-devel pciutils-devel bison

# enable CentOS 7 debug repo
sudo yum-config-manager --enable debug

sudo yum-builddep kernel-${UNAME%.*}
sudo debuginfo-install kernel-${UNAME%.*}

# optional, but highly recommended - enable EPEL 7
sudo yum install ccache
ccache --max-size=5G
git clone https://github.com/dynup/kpatch && cd kpatch
make
sudo make install
systemctl enable kpatch
curl -SOL https://raw.githubusercontent.com/pingcap/kdt/master/kpatchs/route.patch
kpatch-build -t vmlinux route.patch (Compiles and generates the kernel module)
mkdir -p /var/lib/kpatch/${UNAME}
cp -a livepatch-route.ko /var/lib/kpatch/${UNAME}
systemctl restart kpatch (Loads the kernel module)
kpatch list (Checks the loaded module)

Conclusion

We’ve fixed these kernel bugs expediently. However, there are more effective long-term solutions. For Bug #1, we hope that the K8s community can provide an argument for kubelet to allow users to disable or enable the kmem accounting feature.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
PingCAP

PingCAP

PingCAP is the team behind TiDB, an open-source MySQL compatible NewSQL database. Official website: https://pingcap.com/ GitHub: https://github.com/pingcap