Ubuntu 18.04升级内核以及内存性能研究
自从用上了 AMD 7002 CPU之后,服务器上出现了 EDAC amd64: Error: F0 not found, device 0x1460 (broken BIOS?)
。
通过 suse 的论坛支持,确定这是因为 “AMD ROMA” 这一代太高级了,Linux Kernel 没跟上,所以会出现这个问题。Ubuntu 18.04 的内核版本是 4.15.0-55-generic;Ubuntu 20.04 的内核版本是 5.4.0-33-generic;官网提供的稳定版内核版本是 5.7.2-050702-generic。
那么就要评估要不要升级内核了?是直接升级操作系统还是仅升级内核?
要不要升级内核?
这需要判断这个错误对 性能
和 稳定性
的影响程度。
先整理出两台机器:
- mtest01:7542, 64G 2933 DDR4 * 8,内核 4.15.0-55-generic
- mtest02: 7542, 64G 2933 DDR4 * 8,内核 5.7.2-050702-generic
重启后,通过 dmesg 检测,发现 mtest01 出现 EDAC amd64: Error: F0 not found, device 0x1460 (broken BIOS?)
问题,而 mtest02 没出现这个问题。
memtester 测试
# mtest01
memtester 5G 2
memtester version 4.3.0 (64-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).
pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 5120MB (5368709120 bytes)
got 5120MB (5368709120 bytes), trying mlock ...locked.
Loop 1/2:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : ok
Walking Ones : ok
Walking Zeroes : ok
8-bit Writes : ok
16-bit Writes : ok
Loop 2/2:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : ok
Walking Ones : ok
Walking Zeroes : ok
8-bit Writes : ok
16-bit Writes : ok
Done.
# mtest02
memtester 5G 2
memtester version 4.3.0 (64-bit)
Copyright (C) 2001-2012 Charles Cazabon.
Licensed under the GNU General Public License version 2 (only).
pagesize is 4096
pagesizemask is 0xfffffffffffff000
want 5120MB (5368709120 bytes)
got 5120MB (5368709120 bytes), trying mlock ...locked.
Loop 1/2:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : ok
Walking Ones : ok
Walking Zeroes : ok
8-bit Writes : ok
16-bit Writes : ok
Loop 2/2:
Stuck Address : ok
Random Value : ok
Compare XOR : ok
Compare SUB : ok
Compare MUL : ok
Compare DIV : ok
Compare OR : ok
Compare AND : ok
Sequential Increment: ok
Solid Bits : ok
Block Sequential : ok
Checkerboard : ok
Bit Spread : ok
Bit Flip : ok
Walking Ones : ok
Walking Zeroes : ok
8-bit Writes : ok
16-bit Writes : ok
Done.
sysbench 测试
# mtest01
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 16384KiB
total size: 102400MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
Total operations: 6400 ( 869.54 per second)
102400.00 MiB transferred (13912.57 MiB/sec)
General statistics:
total time: 7.3591s
total number of events: 6400
Latency (ms):
min: 0.99
avg: 1.15
max: 13.56
95th percentile: 1.32
sum: 7353.34
Threads fairness:
events (avg/stddev): 6400.0000/0.00
execution time (avg/stddev): 7.3533/0.00
# mtest02
sysbench --test=memory --memory-block-size=16M --memory-total-size=100G run
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 16384KiB
total size: 102400MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
Total operations: 6400 ( 662.80 per second)
102400.00 MiB transferred (10604.75 MiB/sec)
General statistics:
total time: 9.6551s
total number of events: 6400
Latency (ms):
min: 0.95
avg: 1.51
max: 13.93
95th percentile: 2.00
sum: 9647.11
Threads fairness:
events (avg/stddev): 6400.0000/0.00
execution time (avg/stddev): 9.6471/0.00
mbw 测试
数据暂未获取
总结
从三项数据上看,低版本内核的内存性能可能更好,EDAC amd64: Error: F0 not found, device 0x1460 (broken BIOS?)
也未导致明显错误。
细节:更新内核
# 仅针对 Ubuntu
# 打开 https://link.zhihu.com/?target=https%3A//kernel.ubuntu.com/~kernel-ppa/mainline/,查找确定需要的内核,并下载
$ wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.7.3/amd64/linux-headers-5.7.3-050703-generic_5.7.3-050703.202006171531_amd64.deb
$ wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.7.3/amd64/linux-headers-5.7.3-050703_5.7.3-050703.202006171531_all.deb
$ wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.7.3/amd64/linux-image-unsigned-5.7.3-050703-generic_5.7.3-050703.202006171531_amd64.deb
$ wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.7.3/amd64/linux-modules-5.7.3-050703-generic_5.7.3-050703.202006171531_amd64.deb
$ sudo dpkg -i *.deb
$ reboot
$ uname -r
更新内核后,NVIDIA 显卡驱动无法安装。
细节:edac 到底是什么
Error Detection And Correction。Linux 2.6.16 引入系统。该内核模块的目的是发现并报告硬件错误。
EDAC 有一个工具 edac-util,但根据我的测试,在兼容机上执行该指令只会出现 edac-util: Error: No memory controller data found.
。经过查询,现已使用 mcesys 来判断硬件信息。