136.26 Crash Analysis

2022/06/11


First crash analysis

CPU: 42 PID: 7755 Comm: lotus-seal-work Tainted: P           OE    4.15.0-184-generic #194-Ubuntu

The message shows that the kernel is tainted; for the meaning of the taint flags see: https://www.kernel.org/doc/html/v5.8/admin-guide/tainted-kernels.html
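
In this trace the flags are P, O and E: a proprietary module is loaded (the nvidia driver), an out-of-tree module is loaded, and an unsigned module is loaded; the W flag that appears in later traces is added once the kernel has issued a warning such as the mod_timer WARNING below. A minimal Python sketch (my own helper, not tooling that ships with the box) to decode the current taint bitmask on the machine:

#!/usr/bin/env python3
# Decode /proc/sys/kernel/tainted into the flag letters that appear in the
# "Tainted:" field of the traces above. Bit meanings follow the kernel
# admin guide linked above; only the flags seen on this host plus a few
# common ones are listed.

TAINT_BITS = {
    0:  ("P", "proprietary module loaded (here: the nvidia driver)"),
    7:  ("D", "kernel has died recently (OOPS/BUG)"),
    9:  ("W", "kernel issued a warning (e.g. the mod_timer WARNING)"),
    12: ("O", "out-of-tree module loaded"),
    13: ("E", "unsigned module loaded"),
    14: ("L", "soft lockup occurred"),
}

def main() -> None:
    with open("/proc/sys/kernel/tainted") as f:
        value = int(f.read().strip())
    print(f"tainted = {value}")
    for bit, (letter, meaning) in sorted(TAINT_BITS.items()):
        if value & (1 << bit):
            print(f"  {letter}: {meaning}")

if __name__ == "__main__":
    main()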

Judging from the three log excerpts below, the problem may be related to the newly added NVMe drive.

[Sat Jun 11 10:21:50 2022] WARNING: CPU: 42 PID: 7755 at /build/linux-dJz1Tt/linux-4.15.0/kernel/time/timer.c:898 mod_timer+0x3d8/0x3f0
[Sat Jun 11 10:21:50 2022] Modules linked in: binfmt_misc amd64_edac_mod edac_mce_amd kvm_amd kvm irqbypass nvidia_drm(POE) nvidia_modeset(POE) ipmi_ssif xfs snd_hda_codec_hdmi input_leds joydev nvidia(POE) snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer snd soundcore k10temp shpchp ipmi_si ipmi_devintf ipmi_msghandler mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ast ttm drm_kms_helper syscopyarea crct10dif_pclmul crc32_pclmul sysfillrect ghash_clmulni_intel sysimgblt pcbc ixgbe igb fb_sys_fops aesni_intel aes_x86_64 crypto_simd glue_helper dca hid_generic
[Sat Jun 11 10:21:50 2022]  i2c_algo_bit cryptd nvme ptp usbhid drm ahci nvme_core pps_core libahci hid mdio i2c_piix4
[Sat Jun 11 10:21:50 2022] CPU: 42 PID: 7755 Comm: lotus-seal-work Tainted: P           OE    4.15.0-184-generic #194-Ubuntu
[Sat Jun 11 10:21:50 2022] Hardware name: Supermicro Super Server/H11DSi, BIOS 2.3 08/02/2021
[Sat Jun 11 10:21:50 2022] RIP: 0010:mod_timer+0x3d8/0x3f0
[Sat Jun 11 10:21:50 2022] RSP: 0018:ffff8e4e4f083d10 EFLAGS: 00010097
[Sat Jun 11 10:21:50 2022] RAX: 00000001000542fd RBX: 0000000100054300 RCX: 00000001000542f3
[Sat Jun 11 10:21:50 2022] RDX: 00000001000542f9 RSI: ffff8e4e4f083d30 RDI: ffff8e4e4ed9f740
[Sat Jun 11 10:21:50 2022] RBP: ffff8e4e4f083d68 R08: 0000000000000082 R09: ffff8e4e4e406d00
[Sat Jun 11 10:21:50 2022] R10: ffff8e4e4f083d78 R11: 0000000000000001 R12: ffff8e4e4ed9f740
[Sat Jun 11 10:21:50 2022] R13: ffffc5c3cf29a0d0 R14: 00000000ffffffff R15: ffff8f4dea833858
[Sat Jun 11 10:21:50 2022] FS:  00007febae7fc700(0000) GS:ffff8e4e4f080000(0000) knlGS:0000000000000000
[Sat Jun 11 10:21:50 2022] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Sat Jun 11 10:21:50 2022] CR2: 00007f70c57f0000 CR3: 00000060f3da8000 CR4: 0000000000340ee0
[Sat Jun 11 10:21:50 2022] Call Trace:
[Sat Jun 11 10:21:50 2022]  <IRQ>
[Sat Jun 11 10:21:50 2022]  queue_iova+0x129/0x140
[Sat Jun 11 10:21:50 2022]  __unmap_single.isra.27+0xa3/0x100
[Sat Jun 11 10:21:50 2022]  unmap_sg+0x5f/0x70
[Sat Jun 11 10:21:50 2022]  nvme_pci_complete_rq+0xb4/0x130 [nvme]
[Sat Jun 11 10:21:50 2022]  __blk_mq_complete_request+0xd2/0x140
[Sat Jun 11 10:21:50 2022]  blk_mq_complete_request+0x19/0x20
[Sat Jun 11 10:21:50 2022]  nvme_process_cq+0xdf/0x1c0 [nvme]
[Sat Jun 11 10:21:50 2022]  nvme_irq+0x23/0x50 [nvme]
[Sat Jun 11 10:21:50 2022]  __handle_irq_event_percpu+0x44/0x1a0
[Sat Jun 11 10:21:50 2022]  handle_irq_event_percpu+0x32/0x80
[Sat Jun 11 10:21:50 2022]  handle_irq_event+0x3b/0x60
[Sat Jun 11 10:21:50 2022]  handle_edge_irq+0x83/0x1a0
[Sat Jun 11 10:21:50 2022]  handle_irq+0x20/0x30
[Sat Jun 11 10:21:50 2022]  do_IRQ+0x50/0xe0
[Sat Jun 11 10:21:50 2022]  common_interrupt+0x90/0x90
[Sat Jun 11 10:21:50 2022]  </IRQ>
[Sat Jun 11 10:21:50 2022] RIP: 0010:get_page_from_freelist+0x203/0x1400
[Sat Jun 11 10:21:50 2022] RSP: 0018:ffffa6c59ef47878 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffdb
[Sat Jun 11 10:21:50 2022] RAX: 00000000000000a6 RBX: 0000000000000004 RCX: fffff189000d6020
[Sat Jun 11 10:21:50 2022] RDX: 000000000002bdaa RSI: 0000000000000004 RDI: 0000000000000202
[Sat Jun 11 10:21:50 2022] RBP: ffffa6c59ef47988 R08: 000000000002bd60 R09: fffff18900632760
[Sat Jun 11 10:21:50 2022] R10: ffffffffffffffff R11: 0000000000000000 R12: ffff8e504f2d5d00
[Sat Jun 11 10:21:50 2022] R13: ffffa6c59ef47998 R14: fffff18900632740 R15: fffff189000663c0
[Sat Jun 11 10:21:50 2022]  ? pagevec_lookup_range+0x24/0x30
[Sat Jun 11 10:21:50 2022]  __alloc_pages_nodemask+0x11c/0x2c0
[Sat Jun 11 10:21:50 2022]  alloc_pages_current+0x6a/0xe0
[Sat Jun 11 10:21:50 2022]  __page_cache_alloc+0x81/0xa0
[Sat Jun 11 10:21:50 2022]  __do_page_cache_readahead+0x113/0x2c0
[Sat Jun 11 10:21:50 2022]  ondemand_readahead+0x11a/0x2c0
[Sat Jun 11 10:21:50 2022]  ? ondemand_readahead+0x11a/0x2c0
[Sat Jun 11 10:21:50 2022]  page_cache_async_readahead+0x71/0x80
[Sat Jun 11 10:21:50 2022]  generic_file_read_iter+0x795/0xc00
[Sat Jun 11 10:21:50 2022]  ? xfs_file_write_iter+0xa8/0xc0 [xfs]
[Sat Jun 11 10:21:50 2022]  ? _cond_resched+0x19/0x40
[Sat Jun 11 10:21:50 2022]  ? down_read+0x12/0x40
[Sat Jun 11 10:21:50 2022]  xfs_file_buffered_aio_read+0x5a/0x100 [xfs]
[Sat Jun 11 10:21:50 2022]  xfs_file_read_iter+0x72/0xe0 [xfs]
[Sat Jun 11 10:21:50 2022]  generic_file_splice_read+0xdb/0x180
[Sat Jun 11 10:21:50 2022]  do_splice_to+0x79/0x90
[Sat Jun 11 10:21:50 2022]  splice_direct_to_actor+0xd0/0x230
[Sat Jun 11 10:21:50 2022]  ? do_splice_from+0x30/0x30
[Sat Jun 11 10:21:50 2022]  do_splice_direct+0x98/0xd0
[Sat Jun 11 10:21:50 2022]  vfs_copy_file_range+0x2e2/0x310
[Sat Jun 11 10:21:50 2022]  SyS_copy_file_range+0x127/0x1d0
[Sat Jun 11 10:21:50 2022]  do_syscall_64+0x73/0x130
[Sat Jun 11 10:21:50 2022]  entry_SYSCALL_64_after_hwframe+0x41/0xa6
[Sat Jun 11 10:21:50 2022] RIP: 0033:0x7fec0ea1825a
[Sat Jun 11 10:21:50 2022] RSP: 002b:00007febae7f8e10 EFLAGS: 00000293 ORIG_RAX: 0000000000000146
[Sat Jun 11 10:21:50 2022] RAX: ffffffffffffffda RBX: 0000000000000008 RCX: 00007fec0ea1825a
[Sat Jun 11 10:21:50 2022] RDX: 0000000000000009 RSI: 0000000000000000 RDI: 0000000000000008
[Sat Jun 11 10:21:50 2022] RBP: 0000000000000000 R08: 0000000040000000 R09: 0000000000000000
[Sat Jun 11 10:21:50 2022] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000000
[Sat Jun 11 10:21:50 2022] R13: 0000000000000009 R14: 0000000000000000 R15: 0000000040000000
[Sat Jun 11 10:21:50 2022] Code: fd ff ff 49 89 44 24 10 48 89 45 c0 e9 b1 fc ff ff 0f 0b e9 aa fc ff ff 49 89 44 24 10 e9 bf fc ff ff 49 89 41 10 e9 55 fd ff ff <0f> 0b e9 af fc ff ff 0f 0b 41 8b 47 20 e9 47 fd ff ff e8 81 f9
[Sat Jun 11 10:21:50 2022] ---[ end trace f41e5ef4806c7438 ]---
Jun 11 09:56:03 fil kernel: [    1.456460] dpc: probe of 0000:60:03.1:pcie010 failed with error -524
Jun 11 09:56:03 fil kernel: [    1.456505] dpc: probe of 0000:40:01.1:pcie010 failed with error -524
Jun 11 09:56:03 fil kernel: [    1.456548] dpc: probe of 0000:40:01.2:pcie010 failed with error -524
Jun 11 09:56:03 fil kernel: [    1.456592] dpc: probe of 0000:40:01.3:pcie010 failed with error -524
Jun 11 09:56:03 fil kernel: [    1.456636] dpc: probe of 0000:20:03.1:pcie010 failed with error -524
Jun 11 09:56:03 fil kernel: [    1.456680] dpc: probe of 0000:00:01.1:pcie010 failed with error -524
Jun 11 09:56:03 fil kernel: [    1.456726] dpc: probe of 0000:80:01.1:pcie010 failed with error -524
[Sat Jun 11 10:05:57 2022] INFO: task lotus-seal-work:2651 blocked for more than 120 seconds.
[Sat Jun 11 10:05:57 2022]       Tainted: P           OE    4.15.0-184-generic #194-Ubuntu
[Sat Jun 11 10:05:57 2022] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Sat Jun 11 10:05:57 2022] lotus-seal-work D    0  2651      1 0x00000000
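
The hung-task watchdog fires when a task stays in uninterruptible sleep (state D) for more than 120 seconds. If this recurs, a quick way to see which tasks are stuck and where in the kernel they are waiting is to walk /proc; a rough sketch (needs root, and only a triage aid, not a replacement for the hung-task report itself):

#!/usr/bin/env python3
# List tasks currently in uninterruptible sleep (state D, the state behind
# the "blocked for more than 120 seconds" message) and dump their kernel
# stacks. Reading /proc/<pid>/stack requires root.
import glob

for status_path in glob.glob("/proc/[0-9]*/status"):
    pid = status_path.split("/")[2]
    try:
        with open(status_path) as f:
            fields = dict(line.split(":\t", 1) for line in f if ":\t" in line)
        if not fields.get("State", "").startswith("D"):
            continue
        with open(f"/proc/{pid}/stack") as f:
            stack = f.read()
        print(f"=== {fields['Name'].strip()} (pid {pid}) ===")
        print(stack)
    except (FileNotFoundError, PermissionError, ProcessLookupError):
        continue  # task exited or insufficient privileges; skip it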

Second crash analysis report

  • basic: the machine is estimated to have gone down around 13:47 on 2022-06-12
  • Monitoring anomalies:
    • CPU utilization was pegged at the time of the crash
    • memory was released abnormally several times
    • traffic on NIC enp33s0 was fairly high and showed a "spike up fast, drop off fast" pattern, which correlates with the repeated abnormal memory releases
    • I/O wait on the 4 SSDs was fairly long

System log analysis follows:

Jun 12 03:24:24 fil kernel: [63030.417478] Out of memory: Kill process 58870 (lotus-seal-work) score 376 or sacrifice child
Jun 12 03:24:24 fil kernel: [63030.417620] Killed process 58870 (lotus-seal-work) total-vm:266876256kB, anon-rss:201982792kB, file-rss:17192kB, shmem-rss:0kB
Jun 12 03:24:45 fil kernel: [63052.111820] oom_reaper: reaped process 58870 (lotus-seal-work), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
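
To count how many times the OOM killer fired in a given window, the kernel log can be scanned for the "Out of memory: Kill process" lines above; a small sketch, assuming the Ubuntu default log path /var/log/kern.log:

#!/usr/bin/env python3
# Rough count of OOM-killer invocations, by scanning the kernel log for the
# "Out of memory: Kill process" lines shown above. The log path is an
# assumption (Ubuntu default); pass a different path as the first argument.
import re
import sys

LOG = sys.argv[1] if len(sys.argv) > 1 else "/var/log/kern.log"
PATTERN = re.compile(r"Out of memory: Kill process (\d+) \(([^)]+)\)")

hits = []
with open(LOG, errors="replace") as f:
    for line in f:
        m = PATTERN.search(line)
        if m:
            # keep the syslog timestamp prefix plus the victim's comm and pid
            hits.append((line[:15], m.group(2), m.group(1)))

for ts, comm, pid in hits:
    print(f"{ts}  killed {comm} (pid {pid})")
print(f"total OOM kills: {len(hits)}")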

From 08:00 to 13:47 today, a window of roughly six hours, the OOM killer fired 7 times. In most cases the OOM kill was followed by the page allocation failure shown below.

Jun 12 03:27:26 fil kernel: [63212.460575] worker-thread-2: page allocation failure: order:2, mode:0x14050c0(GFP_KERNEL|__GFP_NORETRY|__GFP_COMP), nodemask=0
Jun 12 03:27:26 fil kernel: [63212.460577] worker-thread-2 cpuset=/ mems_allowed=0-1
Jun 12 03:27:26 fil kernel: [63212.460583] CPU: 46 PID: 56407 Comm: worker-thread-2 Tainted: P        W  OE    4.15.0-184-generic #194-Ubuntu
Jun 12 03:27:26 fil kernel: [63212.460584] Hardware name: Supermicro Super Server/H11DSi, BIOS 2.3 08/02/2021
Jun 12 03:27:26 fil kernel: [63212.460584] Call Trace:
Jun 12 03:27:26 fil kernel: [63212.460593]  dump_stack+0x6d/0x8b
Jun 12 03:27:26 fil kernel: [63212.460597]  warn_alloc+0xff/0x1a0
Jun 12 03:27:26 fil kernel: [63212.460600]  ? __alloc_pages_direct_compact+0x51/0x100
Jun 12 03:27:26 fil kernel: [63212.460601]  __alloc_pages_slowpath+0xdc5/0xe00
Jun 12 03:27:26 fil kernel: [63212.460603]  __alloc_pages_nodemask+0x29a/0x2c0
Jun 12 03:27:26 fil kernel: [63212.460607]  alloc_pages_current+0x6a/0xe0
Jun 12 03:27:26 fil kernel: [63212.460609]  kmalloc_order+0x18/0x40
Jun 12 03:27:26 fil kernel: [63212.460611]  kmalloc_order_trace+0x24/0xb0
Jun 12 03:27:26 fil kernel: [63212.460614]  __kmalloc+0x1fe/0x210
Jun 12 03:27:26 fil kernel: [63212.460631]  alloc_internal+0x25/0x60 [nvidia_uvm]
Jun 12 03:27:26 fil kernel: [63212.460640]  __uvm_kvmalloc+0x22/0x60 [nvidia_uvm]
Jun 12 03:27:26 fil kernel: [63212.460805]  ? _nv032864rm+0x40/0x40 [nvidia]
Jun 12 03:27:26 fil kernel: [63212.460816]  uvm_va_range_map_rm_allocation+0x293/0x4a0 [nvidia_uvm]
Jun 12 03:27:26 fil kernel: [63212.460826]  ? entry_offset_pascal+0x20/0x20 [nvidia_uvm]
Jun 12 03:27:26 fil kernel: [63212.460836]  uvm_map_external_allocation_on_gpu+0x24a/0x350 [nvidia_uvm]
Jun 12 03:27:26 fil kernel: [63212.460844]  ? uvm_map_external_allocation_on_gpu+0x24a/0x350 [nvidia_uvm]
Jun 12 03:27:26 fil kernel: [63212.460853]  uvm_api_map_external_allocation+0x24b/0x4f0 [nvidia_uvm]
Jun 12 03:27:26 fil kernel: [63212.460864]  ? uvm_va_range_create_external+0x39/0x100 [nvidia_uvm]
Jun 12 03:27:26 fil kernel: [63212.460873]  uvm_ioctl+0x12d5/0x14b0 [nvidia_uvm]
Jun 12 03:27:26 fil kernel: [63212.460880]  ? uvm_ioctl+0x12d5/0x14b0 [nvidia_uvm]
Jun 12 03:27:26 fil kernel: [63212.461026]  ? _nv036268rm+0x6c/0x90 [nvidia]
Jun 12 03:27:26 fil kernel: [63212.461171]  ? _nv009235rm+0x1d/0x30 [nvidia]
Jun 12 03:27:26 fil kernel: [63212.461316]  ? _nv036272rm+0x100/0x100 [nvidia]
Jun 12 03:27:26 fil kernel: [63212.461464]  ? _nv009260rm+0x60/0x80 [nvidia]
Jun 12 03:27:26 fil kernel: [63212.461576]  ? os_acquire_spinlock+0x12/0x20 [nvidia]
Jun 12 03:27:26 fil kernel: [63212.461686]  ? os_release_spinlock+0x1a/0x20 [nvidia]
Jun 12 03:27:26 fil kernel: [63212.461802]  ? _nv040195rm+0x9b/0x190 [nvidia]
Jun 12 03:27:26 fil kernel: [63212.461960]  ? rm_ioctl+0x63/0xb0 [nvidia]
Jun 12 03:27:26 fil kernel: [63212.461969]  uvm_unlocked_ioctl+0x35/0x60 [nvidia_uvm]
Jun 12 03:27:26 fil kernel: [63212.461977]  ? uvm_unlocked_ioctl+0x35/0x60 [nvidia_uvm]
Jun 12 03:27:26 fil kernel: [63212.461985]  uvm_unlocked_ioctl_entry+0x87/0xb0 [nvidia_uvm]
Jun 12 03:27:26 fil kernel: [63212.461989]  do_vfs_ioctl+0xa8/0x630
Jun 12 03:27:26 fil kernel: [63212.461991]  ? __schedule+0x256/0x890
Jun 12 03:27:26 fil kernel: [63212.461993]  SyS_ioctl+0x79/0x90
Jun 12 03:27:26 fil kernel: [63212.461996]  do_syscall_64+0x73/0x130
Jun 12 03:27:26 fil kernel: [63212.461998]  entry_SYSCALL_64_after_hwframe+0x41/0xa6
Jun 12 03:27:26 fil kernel: [63212.461999] RIP: 0033:0x7fe07522f217
Jun 12 03:27:26 fil kernel: [63212.462000] RSP: 002b:00007fd2af5f7a18 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Jun 12 03:27:26 fil kernel: [63212.462002] RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fe07522f217
Jun 12 03:27:26 fil kernel: [63212.462002] RDX: 00007fd2af5f7e70 RSI: 0000000000000021 RDI: 000000000000000d
Jun 12 03:27:26 fil kernel: [63212.462003] RBP: 00007fd21ffdf460 R08: 00007fd21ffdf4f0 R09: 0000000000000000
Jun 12 03:27:26 fil kernel: [63212.462003] R10: 00007fcc74000000 R11: 0000000000000246 R12: 00007fd2af5f7a30
Jun 12 03:27:26 fil kernel: [63212.462004] R13: 00007fd2af5f7e70 R14: 00007fd2af5f7a48 R15: 00007fd21ffdf460

A page allocation failure like this generally means memory reclaim is having trouble. However, the stack trace points heavily at the GPU driver (nvidia_uvm / nvidia), so it may be that GPU-related memory cannot be reclaimed; GPU usage probably needs closer monitoring (a minimal polling sketch follows, and the Mem-Info dump that accompanied the failure is reproduced after it).
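
A simple way to watch whether GPU memory actually comes back after the worker is killed is to poll nvidia-smi and log per-GPU memory usage; a minimal sketch (interval and log file name are arbitrary choices):

#!/usr/bin/env python3
# Periodically record per-GPU memory usage so we can check whether device
# memory is actually released after lotus-seal-work is OOM-killed. Uses the
# nvidia-smi CLI shipped with the driver; the 60 s interval and the log file
# name are arbitrary choices.
import subprocess
import time

QUERY = [
    "nvidia-smi",
    "--query-gpu=timestamp,index,memory.used,memory.total",
    "--format=csv,noheader",
]

while True:
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    with open("gpu-mem.log", "a") as f:
        f.write(out.stdout)
    time.sleep(60)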

Jun 12 03:27:25 fil kernel: [63212.238655] warn_alloc_show_mem: 3 callbacks suppressed
Jun 12 03:27:25 fil kernel: [63212.238656] Mem-Info:
Jun 12 03:27:25 fil kernel: [63212.238675] active_anon:126795005 inactive_anon:10572 isolated_anon:0
Jun 12 03:27:25 fil kernel: [63212.238675]  active_file:30842620 inactive_file:97103226 isolated_file:0
Jun 12 03:27:25 fil kernel: [63212.238675]  unevictable:0 dirty:292215 writeback:6584 unstable:0
Jun 12 03:27:25 fil kernel: [63212.238675]  slab_reclaimable:1641251 slab_unreclaimable:146650
Jun 12 03:27:25 fil kernel: [63212.238675]  mapped:8871896 shmem:10752 pagetables:283475 bounce:0
Jun 12 03:27:25 fil kernel: [63212.238675]  free:7038246 free_pcp:472 free_cma:0
Jun 12 03:27:25 fil kernel: [63212.238677] Node 0 active_anon:498074288kB inactive_anon:24976kB active_file:22211580kB inactive_file:1585860kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:24036252kB dirty:366112kB writeback:21232kB shmem:25308kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 71680kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
Jun 12 03:27:25 fil kernel: [63212.238680] Node 0 DMA free:15884kB min:28kB low:424kB high:820kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15884kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Jun 12 03:27:25 fil kernel: [63212.238682] lowmem_reserve[]: 0 2551 515783 515783 515783
Jun 12 03:27:25 fil kernel: [63212.238687] Node 0 DMA32 free:2054768kB min:5428kB low:73856kB high:142284kB active_anon:619248kB inactive_anon:0kB active_file:41852kB inactive_file:8kB unevictable:0kB writepending:1536kB present:2740208kB managed:2737216kB mlocked:0kB kernel_stack:0kB pagetables:1148kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
Jun 12 03:27:25 fil kernel: [63212.238690] lowmem_reserve[]: 0 0 513232 513232 513232
Jun 12 03:27:25 fil kernel: [63212.238692] Node 0 Normal free:1044152kB min:1042948kB low:14181860kB high:27320772kB active_anon:497455040kB inactive_anon:24976kB active_file:22169728kB inactive_file:1585852kB unevictable:0kB writepending:385808kB present:533974016kB managed:525556604kB mlocked:0kB kernel_stack:14344kB pagetables:1089336kB bounce:0kB free_pcp:1736kB local_pcp:0kB free_cma:0kB
Jun 12 03:27:25 fil kernel: [63212.238695] lowmem_reserve[]: 0 0 0 0 0
Jun 12 03:27:25 fil kernel: [63212.238697] Node 0 DMA: 1*4kB (U) 1*8kB (U) 0*16kB 0*32kB 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15884kB
Jun 12 03:27:25 fil kernel: [63212.238702] Node 0 DMA32: 294*4kB (UME) 229*8kB (UME) 113*16kB (UME) 163*32kB (UME) 185*64kB (UME) 130*128kB (UE) 154*256kB (UME) 67*512kB (UME) 37*1024kB (UME) 18*2048kB (UME) 456*4096kB (UME) = 2054768kB
Jun 12 03:27:25 fil kernel: [63212.238707] Node 0 Normal: 21040*4kB (UME) 1267*8kB (UME) 80*16kB (UME) 1673*32kB (UME) 6881*64kB (UME) 2241*128kB (UME) 395*256kB (UME) 102*512kB (ME) 12*1024kB (E) 0*2048kB 0*4096kB = 1041976kB
Jun 12 03:27:25 fil kernel: [63212.238713] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jun 12 03:27:25 fil kernel: [63212.238714] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 12 03:27:25 fil kernel: [63212.238715] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Jun 12 03:27:25 fil kernel: [63212.238716] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 12 03:27:25 fil kernel: [63212.238716] 127956864 total pagecache pages
Jun 12 03:27:25 fil kernel: [63212.238717] 0 pages in swap cache
Jun 12 03:27:25 fil kernel: [63212.238718] Swap cache stats: add 0, delete 0, find 0/0
Jun 12 03:27:25 fil kernel: [63212.238718] Free swap  = 0kB
Jun 12 03:27:25 fil kernel: [63212.238719] Total swap = 0kB
Jun 12 03:27:25 fil kernel: [63212.238720] 268400027 pages RAM
Jun 12 03:27:25 fil kernel: [63212.238720] 0 pages HighMem/MovableOnly
Jun 12 03:27:25 fil kernel: [63212.238721] 4204131 pages reserved
Jun 12 03:27:25 fil kernel: [63212.238721] 0 pages cma reserved
Jun 12 03:27:25 fil kernel: [63212.238721] 0 pages hwpoisoned

Combining the monitoring data and the logs, there are several tentative hypotheses linking these observations:

  • while performing certain network-transfer-related operations, the program requests a large amount of memory and ends up triggering the OOM killer
  • once the process is OOM-killed its network transfers stop, which matches the rapid rise and fall of the network traffic
  • for some reason the GPU memory is not reclaimed when the process is killed
  • when a new worker starts it again requests memory / GPU memory; the CPU keeps retrying these requests, but no memory is actually freed
  • write waits on the SSDs are too long; this needs a separate test to verify (a quick probe is sketched after this list), and it may also explain the rise in CPU utilization
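
For the SSD write-wait question, a quick sanity probe can time fsync'd writes on each suspect mount; the sketch below is only a rough check (the mount path is a placeholder), and the real verification should use a dedicated benchmark such as fio with direct I/O:

#!/usr/bin/env python3
# Crude write-latency probe: time N fsync'd 1 MiB writes to a file on the
# target mount point and report rough p50/p99. The mount path below is a
# placeholder; a proper verification should use a dedicated benchmark such
# as fio with direct I/O.
import os
import statistics
import time

TARGET = "/mnt/ssd1/latency-probe.tmp"   # placeholder path on the SSD under test
BLOCK = b"\0" * (1 << 20)                # 1 MiB per write
N = 256

latencies = []
fd = os.open(TARGET, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
try:
    for _ in range(N):
        t0 = time.perf_counter()
        os.write(fd, BLOCK)
        os.fsync(fd)
        latencies.append(time.perf_counter() - t0)
finally:
    os.close(fd)
    os.unlink(TARGET)

latencies.sort()
print(f"p50 = {statistics.median(latencies) * 1000:.2f} ms")
print(f"p99 = {latencies[int(N * 0.99)] * 1000:.2f} ms")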
