ttyS0串口冲突导致的RCU stall

  1. 问题背景
  2. 排查过程

问题背景

某款机型的机器在系统启动的时候稳定出现RCU stall的栈,需要排查并定位

排查过程

  • 上游搜索相关patch

首先看一下发生问题时候的栈

[   45.941504] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[   45.949072] 00:04: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
[   45.957833] 00:05: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
[   45.966229] printk: console [ttyS0] disabled
[   45.988682] serial 00:04: Runtime PM usage count underflow!
[   45.989099] 00:06: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
[   45.989254] printk: console [ttyS0] enabled
[   66.990978] rcu: INFO: rcu_preempt self-detected stall on CPU
[   66.990978] rcu:     81-....: (2686 ticks this GP) idle=8cb4/1/0x4000000000000000 softirq=115/115 fqs=1750
[   66.990978] rcu:              hardirqs   softirqs   csw/system
[   66.990978] rcu:      number:        0        171            0
[   66.990978] rcu:     cputime:        0          0         5360   ==> 10492(ms)
[   66.990978] rcu:     (t=5251 jiffies g=-755 q=66 ncpus=256)
[   66.990978] CPU: 81 PID: 1 Comm: swapper/0 Not tainted 6.6.0-3.0.0.7 #1
[   66.990978] Hardware name: Inventec P9000G7/P9000G7 MLB, BIOS P9000G7094 02/13/2026
[   66.990978] RIP: 0010:console_flush_all+0x19f/0x450
[   66.990978] Code: 01 49 89 46 58 8b 05 60 60 f0 01 83 f8 ff 0f 84 a8 00 00 00 31 c0 41 88 04 24 e8 5c 2b 00 00 80 e7 02 74 01 f
b 41 0f b6 1c 24 <84> db 0f 85 1a 01 00 00 49 8b 46 58 49 3b 45 00 76 04 49 89 45 00
[   66.990978] RSP: 0000:ffffc90000087728 EFLAGS: 00000202
[   66.990978] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[   66.990978] RDX: 0000000000000001 RSI: 0000000000000046 RDI: ffffffff83badf40
[   66.990978] RBP: 0000000000000001 R08: 6166203a33312052 R09: 4142203a302e3030
[   66.990978] R10: 302e30303a61633a R11: 3030303020696370 R12: ffffc900000877af
[   66.990978] R13: ffffc900000877b0 R14: ffffffff83254ba0 R15: ffffffff83bad320
[   66.990978] FS:  0000000000000000(0000) GS:ffff8b7a7e240000(0000) knlGS:0000000000000000
[   66.990978] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   66.990978] CR2: 0000000000000000 CR3: 0000000005e2e001 CR4: 0000000000770ee0
[   66.990978] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   66.990978] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[   66.990978] PKRU: 55555554
[   66.990978] Call Trace:
[   66.990978]  <IRQ>
[   66.990978]  ? rcu_dump_cpu_stacks+0xa2/0x110
[   66.990978]  ? rcu_sched_clock_irq+0x35e/0x970
[   66.990978]  ? update_process_times+0x5b/0x90
[   66.990978]  ? tick_sched_handle.isra.9+0x1e/0x50
[   66.990978]  ? tick_sched_timer+0x6f/0x90
[   66.990978]  ? __pfx_tick_sched_timer+0x10/0x10
[   66.990978]  ? __hrtimer_run_queues+0x11c/0x2a0
[   66.990978]  ? hrtimer_interrupt+0x100/0x240
[   66.990978]  ? __sysvec_apic_timer_interrupt+0x4e/0x130
[   66.990978]  ? sysvec_apic_timer_interrupt+0x69/0x90
[   66.990978]  </IRQ>
[   66.990978]  <TASK>
[   66.990978]  ? asm_sysvec_apic_timer_interrupt+0x16/0x20
[   66.990978]  ? console_flush_all+0x19f/0x450
[   66.990978]  ? console_flush_all+0x194/0x450
[   66.990978]  console_unlock+0x75/0xd0
[   66.990978]  vprintk_emit+0x209/0x290
[   66.990978]  _printk+0x58/0x80
[   66.990978]  ? kernfs_notify+0x53/0xb0
[   66.990978]  register_console+0x1ff/0x440
[   66.990978]  serial_core_register_port+0x767/0x780
[   66.990978]  ? __pfx_ignore_unknown_bootoption+0x10/0x10
[   66.990978]  serial8250_register_8250_port+0x3cc/0x610
[   66.990978]  serial_pnp_probe+0x276/0x390
[   66.990978]  ? __pfx_serial_pnp_probe+0x10/0x10
[   66.990978]  pnp_device_probe+0x59/0xb0
[   66.990978]  really_probe+0x126/0x3b0
[   66.990978]  __driver_probe_device+0x80/0x160
[   66.990978]  driver_probe_device+0x1f/0x90
[   66.990978]  ? __pfx___driver_attach+0x10/0x10
[   66.990978]  __driver_attach+0xd3/0x170
[   66.990978]  ? __pfx___driver_attach+0x10/0x10
[   66.990978]  bus_for_each_dev+0x78/0xd0
[   66.990978]  bus_add_driver+0x177/0x230
[   66.990978]  driver_register+0x55/0x100
[   66.990978]  serial8250_init+0x86/0x1e0
[   66.990978]  ? __pfx_serial8250_init+0x10/0x10
[   66.990978]  do_one_initcall+0x4c/0x230
[   66.990978]  kernel_init_freeable+0x327/0x460
[   66.990978]  ? __pfx_kernel_init+0x10/0x10
[   66.990978]  kernel_init+0x16/0x130
[   66.990978]  ret_from_fork+0x2d/0x40
[   66.990978]  ? __pfx_kernel_init+0x10/0x10
[   66.990978]  ret_from_fork_asm+0x1b/0x30
[   66.990978]  </TASK>

发现有一个serial 00:04: Runtime PM usage count underflow!,这里产生了计数器下溢

搜索上游patch,发现ed2761958ad7(“tty: serial: 8250: Fix another runtime PM usage counter underflow”)可能解决这个问题

于是合入这个patch重编内核,测试发现,不再产生underflow了,但是RCU stall还是存在

也就是说,RCU stall的产生与这个underflow日志并没有关系

  • 与正常机器对比

找一个”正常机器”(与异常机器并非同一机型,因为异常机器所属的机型都存在问题)对比下看看

发现”正常机器”并没有printk: console [ttyS0] disabled然后又enable的过程

所以此时怀疑就是这一次disable->enable的过程导致了RCU stall

  • 去掉console=ttyS0实验

把cmdline中的console=ttyS0去掉,然后重启机器,发现没有call trace了

所以基本可以确定就是ttyS0在系统启动过程中的一次disable->enable导致的RCU stall

  • 看代码+做实验实锤猜测

可以看到register_console中有一段:

void register_console(struct console *newcon)
{
......
        /*
         * If we have a bootconsole, and are switching to a real console,
         * don't print everything out again, since when the boot console, and
         * the real console are the same physical device, it's annoying to
         * see the beginning boot messages twice
         */
        if (bootcon_registered &&
            ((newcon->flags & (CON_CONSDEV | CON_BOOT)) == CON_CONSDEV)) {
                newcon->flags &= ~CON_PRINTBUFFER;
        }
......
}

这里可以看出,如果是bootcon->realcon的切换的话,新注册的realcon中会清除掉CON_PRINTBUFFER这个标志

这个标志的作用可以在下面代码中体现出来:初始化seq的时候,如果newcon是bootcon或者带有CON_PRINTBUFFER标志,那么这个newcon的seq就赋值为syslog_seq(这个syslog_seq就是syslog的buffer最开始的位置,只要buffer足够大,那syslog_seq基本就是系统启动的第一行log的位置)。

static void console_init_seq(struct console *newcon, bool bootcon_registered)
{
        struct console *con;
        bool handover;

        if (newcon->flags & (CON_PRINTBUFFER | CON_BOOT)) {
                /* Get a consistent copy of @syslog_seq. */
                mutex_lock(&syslog_lock);
                newcon->seq = syslog_seq;
                mutex_unlock(&syslog_lock);
        } else {
                /* Begin with next message added to ringbuffer. */
                newcon->seq = prb_next_seq(prb);
                ......
        }
        ......
}

在这块加点log重编内核测试一下

--- a/kernel/printk/printk.c
+++ b/kernel/printk/printk.c
@@ -3505,6 +3505,7 @@ static void console_init_seq(struct console *newcon, bool bootcon_registered)
                        newcon->seq = prb_first_valid_seq(prb);
                } else {
                        /* Get a consistent copy of @syslog_seq. */
+                       pr_warn("yzwddsg newcon is in newcon->flags & (CON_PRINTBUFFER | CON_BOOT)\n");
                        mutex_lock(&syslog_lock);
                        newcon->seq = syslog_seq;
                        mutex_unlock(&syslog_lock);
@@ -3521,7 +3522,9 @@ static void console_init_seq(struct console *newcon, bool bootcon_registered)
                 * start with the message of the enabled boot console that is
                 * the furthest behind.
                 */
+
                if (bootcon_registered && !keep_bootcon) {
+                       pr_warn("yzwddsg unregister bootcon\n");
                        /*
                         * Hold the console_lock to stop console printing and
                         * guarantee safe access to console->seq.
@@ -3650,6 +3653,7 @@ void register_console(struct console *newcon)
         */
        if (bootcon_registered &&
            ((newcon->flags & (CON_CONSDEV | CON_BOOT)) == CON_CONSDEV)) {
+               pr_warn("yzwddsg: newcon removed CON_PRINTBUFFER\n");
                newcon->flags &= ~CON_PRINTBUFFER;
        }

发现确实没有清除CON_PRINTBUFFER标志,直接去走了刚才说的newcon->seq = syslog_seq;路径(因为之前并没有bootcon)

那这里就可以确认:RCU stall的直接原因就是因为异常机型的系统启动过程中存在一次ttyS0的disable->enable的过程,这里在enable注册新的ttyS0的过程中,console结构体中的seq会被赋值成syslog_seq,也就导致新ttyS0被注册enable之后需要从头到尾重新打印一遍系统启动的log,而这时候syslog buffer中的log已经有3000多行,串口打印的速度又不是很快,这就导致了系统出现RCU stall。

事情到这还没完,还需要确定为什么异常机型上会存在ttyS0 disable->enable的过程?

  • 对比正常机器日志

在”正常机器”和异常机器的cmdline中都分别添加trace_buf_size=20M trace_options=stacktrace ftrace=function ftrace_filter=serial8250*,然后重启机器看一下ftrace的输出,过滤一下结果

正常机器输出如下:

          <idle>-0       [000] ...2.     1.219179: serial8250_isa_init_ports.part.0 <-univ8250_console_init
          <idle>-0       [000] ...2.     1.219180: serial8250_setup_port <-serial8250_isa_init_ports.part.0
          <idle>-0       [000] ...2.     1.219180: serial8250_init_port <-serial8250_setup_port
          <idle>-0       [000] ...2.     1.219180: serial8250_set_defaults <-serial8250_setup_port
          <idle>-0       [000] ...2.     1.219180: serial8250_setup_port <-serial8250_isa_init_ports.part.0
          <idle>-0       [000] ...2.     1.219180: serial8250_init_port <-serial8250_setup_port
          <idle>-0       [000] ...2.     1.219180: serial8250_set_defaults <-serial8250_setup_port
          <idle>-0       [000] ...2.     1.219180: serial8250_setup_port <-serial8250_isa_init_ports.part.0
          <idle>-0       [000] ...2.     1.219180: serial8250_init_port <-serial8250_setup_port
          <idle>-0       [000] ...2.     1.219181: serial8250_set_defaults <-serial8250_setup_port
          <idle>-0       [000] ...2.     1.219181: serial8250_setup_port <-serial8250_isa_init_ports.part.0
          <idle>-0       [000] ...2.     1.219181: serial8250_init_port <-serial8250_setup_port
          <idle>-0       [000] ...2.     1.219181: serial8250_set_defaults <-serial8250_setup_port
          <idle>-0       [000] ...2.     1.219182: serial8250_console_setup <-univ8250_console_setup
          <idle>-0       [000] ...2.     1.219183: serial8250_set_termios <-uart_set_options
          <idle>-0       [000] ...2.     1.219183: serial8250_do_set_termios <-uart_set_options
          <idle>-0       [000] ...2.     1.219183: serial8250_get_baud_rate <-serial8250_do_set_termios
          <idle>-0       [000] ...2.     1.219184: serial8250_get_divisor <-serial8250_do_set_termios
          <idle>-0       [000] d..3.     1.219186: serial8250_set_divisor <-serial8250_do_set_termios
          <idle>-0       [000] d..3.     1.219187: serial8250_do_set_divisor <-serial8250_do_set_termios
          <idle>-0       [000] d..3.     1.219197: serial8250_set_mctrl.part.8 <-serial8250_do_set_termios
          <idle>-0       [000] d..3.     1.219197: serial8250_do_set_mctrl <-serial8250_do_set_termios
          ......
          systemd-1       [241] ...1.    15.112899: serial8250_pnp_init <-serial8250_init
         systemd-1       [241] ...1.    15.112920: serial8250_register_8250_port <-serial_pnp_probe
         systemd-1       [241] ...1.    15.112921: serial8250_set_defaults <-serial8250_register_8250_port
         systemd-1       [241] ...1.    15.112964: serial8250_config_port <-univ8250_config_port
         systemd-1       [241] ...1.    15.112965: serial8250_request_std_resource <-serial8250_config_port
         systemd-1       [241] d..2.    15.112996: serial8250_clear_fifos.part.1 <-serial8250_config_port
         systemd-1       [241] d..2.    15.113004: serial8250_clear_IER <-serial8250_config_port
         systemd-1       [241] ...1.    15.113008: serial8250_type <-serial_core_register_port
         systemd-1       [241] ...1.    15.120475: serial8250_pm <-serial_core_register_port
         systemd-1       [241] ...1.    15.120475: serial8250_set_sleep <-serial_core_register_port
         systemd-1       [241] d..2.    15.120475: serial8250_set_mctrl <-serial_core_register_port
         systemd-1       [241] d..2.    15.120475: serial8250_set_mctrl.part.8 <-serial_core_register_port
         systemd-1       [241] d..2.    15.120476: serial8250_do_set_mctrl <-serial_core_register_port
         systemd-1       [241] ...1.    15.120520: serial8250_get_port <-serial_pnp_probe
         systemd-1       [241] ...1.    15.120575: serial8250_config_port <-univ8250_config_port
         systemd-1       [241] ...1.    15.120575: serial8250_request_std_resource <-serial8250_config_port
         systemd-1       [241] ...1.    15.120586: serial8250_release_std_resource <-serial8250_config_port
         systemd-1       [247] ...1.    15.120646: serial8250_config_port <-univ8250_config_port
         systemd-1       [247] ...1.    15.120646: serial8250_request_std_resource <-serial8250_config_port
         systemd-1       [247] ...1.    15.120657: serial8250_release_std_resource <-serial8250_config_port
         systemd-1       [241] ...1.    15.120710: serial8250_config_port <-univ8250_config_port
         systemd-1       [241] ...1.    15.120710: serial8250_request_std_resource <-serial8250_config_port
         systemd-1       [241] ...1.    15.120721: serial8250_release_std_resource <-serial8250_config_port
         systemd-1       [247] ...1.    15.120764: serial8250_probe <-platform_probe

异常机器输出如下:

=> trace_event_raw_event_initcall_start
 => console_init
 => start_kernel
 => x86_64_start_reservations
 => x86_64_start_kernel
 => secondary_startup_64_no_verify
          <idle>-0       [000] ...2.     9.735705: serial8250_isa_init_ports.part.0 <-univ8250_console_init
          <idle>-0       [000] ...2.     9.735705: serial8250_setup_port <-serial8250_isa_init_ports.part.0
          <idle>-0       [000] ...2.     9.735705: serial8250_init_port <-serial8250_setup_port
          <idle>-0       [000] ...2.     9.735705: serial8250_set_defaults <-serial8250_setup_port
          <idle>-0       [000] ...2.     9.735706: serial8250_setup_port <-serial8250_isa_init_ports.part.0
          <idle>-0       [000] ...2.     9.735706: serial8250_init_port <-serial8250_setup_port
          <idle>-0       [000] ...2.     9.735706: serial8250_set_defaults <-serial8250_setup_port
          <idle>-0       [000] ...2.     9.735706: serial8250_setup_port <-serial8250_isa_init_ports.part.0
          <idle>-0       [000] ...2.     9.735706: serial8250_init_port <-serial8250_setup_port
          <idle>-0       [000] ...2.     9.735706: serial8250_set_defaults <-serial8250_setup_port
          <idle>-0       [000] ...2.     9.735706: serial8250_setup_port <-serial8250_isa_init_ports.part.0
          <idle>-0       [000] ...2.     9.735706: serial8250_init_port <-serial8250_setup_port
          <idle>-0       [000] ...2.     9.735706: serial8250_set_defaults <-serial8250_setup_port
          <idle>-0       [000] ...2.     9.735707: serial8250_console_setup <-univ8250_console_setup
          <idle>-0       [000] ...2.     9.735708: serial8250_set_termios <-uart_set_options
          <idle>-0       [000] ...2.     9.735708: serial8250_do_set_termios <-uart_set_options
          <idle>-0       [000] ...2.     9.735709: serial8250_get_baud_rate <-serial8250_do_set_termios
          <idle>-0       [000] ...2.     9.735709: serial8250_get_divisor <-serial8250_do_set_termios
          <idle>-0       [000] d..3.     9.735726: serial8250_set_divisor <-serial8250_do_set_termios
          <idle>-0       [000] d..3.     9.735726: serial8250_do_set_divisor <-serial8250_do_set_termios
          <idle>-0       [000] d..3.     9.735810: serial8250_set_mctrl.part.8 <-serial8250_do_set_termios
          <idle>-0       [000] d..3.     9.735810: serial8250_do_set_mctrl <-serial8250_do_set_termios
          <idle>-0       [000] ...1.    11.567817: initcall_finish: func=univ8250_console_init+0x0/0x40 ret=0
          <idle>-0       [000] ...1.    11.567818: <stack trace>
          
  .......
 => trace_event_raw_event_initcall_start
 => do_one_initcall
 => kernel_init_freeable
 => kernel_init
 => ret_from_fork
 => ret_from_fork_asm
         systemd-1       [240] ...1.    48.507441: serial8250_init <-do_one_initcall
         systemd-1       [240] ...1.    48.514565: serial8250_pnp_init <-serial8250_init
         systemd-1       [240] ...1.    48.514583: serial8250_register_8250_port <-serial_pnp_probe
         systemd-1       [240] ...1.    48.514585: serial8250_set_defaults <-serial8250_register_8250_port
         systemd-1       [240] ...1.    48.514623: serial8250_config_port <-univ8250_config_port
         systemd-1       [240] ...1.    48.514623: serial8250_request_std_resource <-serial8250_config_port
         systemd-1       [240] d..2.    48.514916: serial8250_clear_fifos.part.1 <-serial8250_config_port
         systemd-1       [240] d..2.    48.514983: serial8250_clear_IER <-serial8250_config_port
         systemd-1       [240] ...1.    48.515008: serial8250_type <-serial_core_register_port
         systemd-1       [240] ...1.    48.529539: serial8250_pm <-serial_core_register_port
         systemd-1       [240] ...1.    48.529539: serial8250_set_sleep <-serial_core_register_port
         systemd-1       [240] d..2.    48.529540: serial8250_set_mctrl <-serial_core_register_port
         systemd-1       [240] d..2.    48.529540: serial8250_set_mctrl.part.8 <-serial_core_register_port
         systemd-1       [240] d..2.    48.529540: serial8250_do_set_mctrl <-serial_core_register_port
         systemd-1       [240] ...1.    48.529595: serial8250_get_port <-serial_pnp_probe
         systemd-1       [240] ...1.    48.529605: serial8250_register_8250_port <-serial_pnp_probe
         systemd-1       [240] ...1.    48.529606: serial8250_set_defaults <-serial8250_register_8250_port
         systemd-1       [240] ...1.    48.529625: serial8250_config_port <-univ8250_config_port
         systemd-1       [240] ...1.    48.529625: serial8250_request_std_resource <-serial8250_config_port
         systemd-1       [240] d..2.    48.529919: serial8250_clear_fifos.part.1 <-serial8250_config_port
         systemd-1       [240] d..2.    48.529986: serial8250_clear_IER <-serial8250_config_port
         systemd-1       [240] ...1.    48.530003: serial8250_type <-serial_core_register_port
         systemd-1       [240] ...1.    48.544532: serial8250_pm <-serial_core_register_port
         systemd-1       [240] ...1.    48.544532: serial8250_set_sleep <-serial_core_register_port
         systemd-1       [240] d..2.    48.544532: serial8250_set_mctrl <-serial_core_register_port
         systemd-1       [240] d..2.    48.544532: serial8250_set_mctrl.part.8 <-serial_core_register_port
         systemd-1       [240] d..2.    48.544532: serial8250_do_set_mctrl <-serial_core_register_port
         systemd-1       [240] ...1.    48.544549: serial8250_pm <-serial_core_register_port
         systemd-1       [240] ...1.    48.544549: serial8250_set_sleep <-serial_core_register_port
         systemd-1       [240] ...1.    48.544582: serial8250_get_port <-serial_pnp_probe
        systemd-1       [240] ...1.    48.544588: serial8250_register_8250_port <-serial_pnp_probe
         systemd-1       [241] ...1.    48.558764: serial8250_console_exit <-unregister_console
         systemd-1       [241] ...1.    48.558765: serial8250_release_port <-serial_core_unregister_port
         systemd-1       [241] ...1.    48.558766: serial8250_release_std_resource <-serial_core_unregister_port
         systemd-1       [241] ...1.    48.558799: serial8250_set_defaults <-serial8250_register_8250_port
         systemd-1       [241] ...1.    48.558824: serial8250_config_port <-univ8250_config_port
         systemd-1       [241] ...1.    48.558824: serial8250_request_std_resource <-serial8250_config_port
         systemd-1       [241] d..2.    48.559121: serial8250_clear_fifos.part.1 <-serial8250_config_port
         systemd-1       [241] d..2.    48.559186: serial8250_clear_IER <-serial8250_config_port
         systemd-1       [241] ...1.    48.559205: serial8250_type <-serial_core_register_port
         systemd-1       [241] ...1.    48.559212: serial8250_pm <-serial_core_register_port
         systemd-1       [241] ...1.    48.559212: serial8250_set_sleep <-serial_core_register_port
         systemd-1       [241] d..2.    48.559212: serial8250_set_mctrl <-serial_core_register_port
         systemd-1       [241] d..2.    48.559212: serial8250_set_mctrl.part.8 <-serial_core_register_port
         systemd-1       [241] d..2.    48.559212: serial8250_do_set_mctrl <-serial_core_register_port
         systemd-1       [241] ...1.    48.559229: serial8250_console_setup <-univ8250_console_setup
         systemd-1       [241] ...1.    48.559230: serial8250_set_termios <-uart_set_options
         systemd-1       [241] ...1.    48.559230: serial8250_do_set_termios <-uart_set_options
         systemd-1       [241] ...1.    48.559230: serial8250_get_baud_rate <-serial8250_do_set_termios
         systemd-1       [241] ...1.    48.559230: serial8250_get_divisor <-serial8250_do_set_termios
         systemd-1       [241] d..2.    48.559246: serial8250_set_divisor <-serial8250_do_set_termios
         systemd-1       [241] d..2.    48.559246: serial8250_do_set_divisor <-serial8250_do_set_termios
         systemd-1       [241] d..2.    48.559346: serial8250_set_mctrl.part.8 <-serial8250_do_set_termios
         systemd-1       [241] d..2.    48.559346: serial8250_do_set_mctrl <-serial8250_do_set_termios
         systemd-1       [113] ...1.    73.198659: serial8250_get_port <-serial_pnp_probe
         systemd-1       [113] ...1.    73.198713: serial8250_config_port <-univ8250_config_port
         systemd-1       [113] ...1.    73.198713: serial8250_request_std_resource <-serial8250_config_port
         systemd-1       [113] ...1.    73.198771: serial8250_release_std_resource <-serial8250_config_port
         systemd-1       [114] ...1.    73.198842: serial8250_config_port <-univ8250_config_port
         systemd-1       [114] ...1.    73.198842: serial8250_request_std_resource <-serial8250_config_port
         systemd-1       [114] ...1.    73.198891: serial8250_release_std_resource <-serial8250_config_port
         systemd-1       [113] ...1.    73.198939: serial8250_probe <-platform_probe
         systemd-1       [113] .....    73.198943: initcall_finish: func=serial8250_init+0x0/0x1e0 ret=0
         systemd-1       [113] .....    73.198946: <stack trace>

可以发现,异常机器中是比正常机器多两次serial_pnp_probe的

  • 看代码做实验

看一下serial8250_register_8250_port函数

int serial8250_register_8250_port(const struct uart_8250_port *up)
{
......

        uart = serial8250_find_match_or_unused(&up->port);
        if (!uart) {
                uart = serial8250_setup_port(nr_uarts);
                if (!uart)
                        goto unlock;
                nr_uarts++;
        }

        if (uart->port.type != PORT_8250_CIR) {
                struct mctrl_gpios *gpios;

                if (uart->port.dev)
                        uart_remove_one_port(&serial8250_reg, &uart->port);
                        ......
}

应该就是这里serial8250_find_match_or_unused匹配到了上一个ttyS0是match的或者没有使用,所以后面去触发了uart_remove_one_port去disable旧的ttyS0,然后才去enable新的ttyS0

在serial8250_find_match_or_unused相关函数中添加一些log,然后重编内核重启机器

发现确实是match的,而且match的原因是port相同

这也符合我们的日志看到的00:04和00:06设备的端口都是0x3f8

[   45.949072] 00:04: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
......
[   45.966229] printk: console [ttyS0] disabled
......
[   45.989099] 00:06: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
[   45.989254] printk: console [ttyS0] enabled

所以这里我们也可以确定了:ttyS0的disable->enable操作是因为00:04和00:06这两个设备都映射的是0x3f8这个端口号。当serial8250_register_8250_port扫到00:06这个设备的时候发现历史上有一个00:04是同样的端口号,所以需要先把00:04给disable掉,然后才能enable 00:06。

这个我们也可以通过系统中的一些sysfs接口看出来:如下,可以看到00:04和00:06的io端口是一样的。

[root@yzwddsg tmp]# cat /sys/bus/pnp/devices/00:04/resources
state = active
io 0x3f8-0x3ff
irq 4
dma disabled
[root@yzwddsg tmp]# cat /sys/bus/pnp/devices/00:06/resources
state = active
io 0x3f8-0x3ff
irq 4
  • 继续追查00:04和00:06设备

如上所说,已经可以明确00:04和00:06是两个不同的设备但是映射到相同的0x3f8端口号

dump acpi表看一下DSDT表中的COM1

            Device (UAR1)
            {
                Name (_HID, EisaId ("PNP0501") /* 16550A-compatible COM Serial Port */)  // _HID: Hardware ID
                Name (_UID, 0x00)  // _UID: Unique ID
                Name (LDN, 0x02)
                Method (_STA, 0, NotSerialized)  // _STA: Status
                {
                    Return (^^SIO1.DSTA (0x00))
                }

                Method (_DIS, 0, NotSerialized)  // _DIS: Disable Device
                {
                    ^^SIO1.DCNT (0x00, 0x00)
                }

                Method (_CRS, 0, NotSerialized)  // _CRS: Current Resource Settings
                {
                    Return (^^SIO1.DCRS (0x00, 0x00))
                }

                Method (_SRS, 1, NotSerialized)  // _SRS: Set Resource Settings
                {
                    ^^SIO1.DSRS (Arg0, 0x00)
                }
                
                Name (_DDN, "COM1")  // _DDN: DOS Device Name
                Name (_PRS, ResourceTemplate ()  // _PRS: Possible Resource Settings
                {
                    StartDependentFn (0x00, 0x00)
                    {
                        IO (Decode16,
                            0x03F8,             // Range Minimum
                            0x03F8,             // Range Maximum
                            0x01,               // Alignment
                            0x08,               // Length
                            )
                        IRQNoFlags ()
                            {4}
                        DMA (Compatibility, NotBusMaster, Transfer8, )
                            {}
                    }
                    StartDependentFnNoPri ()
                    {
                        IO (Decode16,
                            0x03F8,             // Range Minimum
                            0x03F8,             // Range Maximum
                            0x01,               // Alignment
                            0x08,               // Length
                            )
                        IRQNoFlags ()
                            {3,4,5,7,9,10,11,12}
                        DMA (Compatibility, NotBusMaster, Transfer8, )
                            {}
                    }
                    StartDependentFnNoPri ()
                    {
                        IO (Decode16,
                            0x02F8,             // Range Minimum
                            0x02F8,             // Range Maximum
                            0x01,               // Alignment
                            0x08,               // Length
                            )
                        IRQNoFlags ()
                            {3,4,5,7,9,10,11,12}
                        DMA (Compatibility, NotBusMaster, Transfer8, )
                            {}
                    }
                    StartDependentFnNoPri ()
                    {
                        IO (Decode16,
                            0x03E8,             // Range Minimum
                            0x03E8,             // Range Maximum
                            0x01,               // Alignment
                            0x08,               // Length
                            )
                        IRQNoFlags ()
                            {3,4,5,7,9,10,11,12}
                        DMA (Compatibility, NotBusMaster, Transfer8, )
                            {}
                    }
                    StartDependentFnNoPri ()
                    {
                        IO (Decode16,
                            0x02E8,             // Range Minimum
                            0x02E8,             // Range Maximum
                            0x01,               // Alignment
                            0x08,               // Length
                            )
                        IRQNoFlags ()
                            {3,4,5,7,9,10,11,12}
                        DMA (Compatibility, NotBusMaster, Transfer8, )
                            {}
                    }
                    EndDependentFn ()
                })
            }

还有个COM2串口,COM2串口也是类似的,只是StartDependentFn中首选是0x02F8那些

怀疑是不是COM2串口使用了StartDependentFnNoPri变成了去映射0x3f8端口导致和COM1冲突了?

但是看了下系统的sysfs接口,这个怀疑被排除了,00:05这个设备就是COM2

[root@yzwddsg bbb]# cat /sys/bus/pnp/devices/00\:05/resources
state = active
io 0x2f8-0x2ff
irq 3
dma disabled

那么00:06是什么设备?

继续看DSDT表中发现这个IUR0也是个PNP0501串口设备

                Device (IUR0)
                {
                    Name (_HID, EisaId ("PNP0501") /* 16550A-compatible COM Serial Port */)  // _HID: Hardware ID
                    Name (_UID, "IblUart0")  // _UID: Unique ID
                    Method (_STA, 0, NotSerialized)  // _STA: Status
                    {
                        If ((UM00 == 0x02) || (UM00 == 0x03))
                        {
                            Return (0x0F)
                        }
                        Else
                        {
                            Return (0x00)
                        }
                    }

                    Method (_CRS, 0, Serialized)  // _CRS: Current Resource Settings
                    {
                        Return (UACR (UC00, UI00))
                    }
                }

那么会不会是这个呢?

[root@yzwddsg bbb]# cat /sys/bus/pnp/devices/00\:04/firmware_node/path
\_SB_.LPC0.UAR1
[root@yzwddsg bbb]# cat /sys/bus/pnp/devices/00\:05/firmware_node/path
\_SB_.LPC0.UAR2
[root@yzwddsg bbb]# cat /sys/bus/pnp/devices/00\:06/firmware_node/path
\_SB_.UAR1

这里可以看到00:06是直接挂在_SB_下面的

去dump bios看一下,有这个可疑选项

Setup Question  = Serial Device
Help String     = Sets the Serial Device used to output bios serial log
Token   =9003   // Do NOT change this line
Offset  =2F8
Width   =01
BIOS Default =[01]BMC
Options =*[01]BMC       // Move "*" to the desired Option
         [02]S3M

它的作用就是dump串口日志到BMC上,所以00:06极有可能就是这个选项控制的串口设备

把该选项改成S3M然后重启测试发现已经没有RCU stall了

所以现在可以确定就是Serial Device选择了BMC导致00:06串口设备和主板COM1的串口设备都被系统检测到并register,而且这俩设备都映射到相同的0x3f8端口。内核在register 00:06的时候检测到0x3f8端口已经被COM1端口使用了,必须先把它disable,然后才能register 00:06。对于00:06来说,这是一个新的console,所以是带有CON_PRINTBUFFER标志位的,当它被enable的时候,就需要把syslog buffer中的所有日志重新flush到这个串口上,当时syslog buffer中已经有3000多条启动日志,它们需要一次flush到串口上,所以出现了RCU stall。

后记:但是如果把Serial Device设置成S3M的话,串口日志就不能打到BMC上了,ilo也不能看到log和交互了。

天天开心噢11st


转载请注明来源,欢迎对文章中的引用来源进行考证,欢迎指出任何有错误或不够清晰的表达。可以在下面评论区评论,也可以邮件至 857879363@qq.com