Troubleshooting a crash caused by an RT scheduler bug


Published: 2025-03-22 21:08



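Background

When measuring system jitter, we sometimes use the cyclictest tool.

However, when we ran it on Ubuntu 22, the machine crashed almost every time, while other systems did not show the problem.

After investigation, we confirmed that the cause is a bug in the kernel's RT scheduler.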

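Investigation

The conditions known to trigger the crash are:

multipathd is running on the system (a multipath I/O daemon that lets a device fall back to another path when one path fails)
cyclictest is running on the system

From the crash stack we can tell that the crash was caused by a softlockup, and it always happened in multipathd, so multipathd is clearly involved; killing the multipathd process makes the problem go away.

So why does the problem appear once the multipathd service is loaded? Let's dig further.

Analyzing the vmcore with crash: although each core was taken at a different place, it was always in the multipathd process.

The multipathd process has 7 threads in total: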

root@XXX:~# pidstat -p 4087 -t
Linux 5.15.0-72-generic (XXX) 	01/20/2025 	_x86_64_	(384 CPU)

11:31:06 AM   UID      TGID       TID    %usr %system  %guest   %wait    %CPU   CPU  Command
11:31:06 AM     0      4087         -    0.00    0.00    0.00    0.00    0.00   279  multipathd
11:31:06 AM     0         -      4087    0.00    0.00    0.00    0.00    0.00   279  |__multipathd
11:31:06 AM     0         -      4097    0.00    0.00    0.00    0.00    0.00   287  |__multipathd
11:31:06 AM     0         -      4098    0.00    0.00    0.00    0.00    0.00    87  |__multipathd
11:31:06 AM     0         -      4099    0.00    0.00    0.00    0.00    0.00    87  |__multipathd
11:31:06 AM     0         -      4100    0.00    0.00    0.00    0.00    0.00   272  |__multipathd
11:31:06 AM     0         -      4101    0.00    0.00    0.00    0.00    0.00    80  |__multipathd
11:31:06 AM     0         -      4102    0.00    0.00    0.00    0.00    0.00    80  |__multipathd

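Then, on a system with the same environment that had not crashed yet, we looked at what each thread was doing via cat /proc/XXX/task/XXX/stack.

multipathd has 7 threads: three are in poll, three are waiting on a queue to be woken up, and the remaining one, the sixth, performs periodic check work.

The vmcores confirm that the multipathd thread that cores is always this sixth thread, the one doing the checks.

The normal stack of this thread is: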

root@XXX:~# cat /proc/4087/task/4101/stack
[<0>] hrtimer_nanosleep+0x99/0x120
[<0>] common_nsleep+0x44/0x50
[<0>] __x64_sys_clock_nanosleep+0xd2/0x160
[<0>] do_syscall_64+0x59/0xc0
[<0>] entry_SYSCALL_64_after_hwframe+0x61/0xcb

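What it runs appears to be the checkerloop function in https://github.com/opensvc/multipath-tools/blob/master/multipathd/main.c, which roughly checks the paths and then sleeps.

At this point we can confirm that the crash is triggered by multipathd's sixth thread, the checkerloop thread.

Next, look at the priority and scheduling class of the processes involved.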

#define SCHED_NORMAL            0
#define SCHED_FIFO              1
#define SCHED_RR                2
#define SCHED_BATCH             3
/* SCHED_ISO: reserved but not implemented yet */
#define SCHED_IDLE              5
#define SCHED_DEADLINE          6

## This is the multipathd thread that triggers the crash: an RT thread with the RR policy, i.e. a time-sliced RT thread
crash> task ff1d16af4e8f8000|grep policy
  policy = 2,
  mempolicy = 0x0,
crash> task ff1d16af4e8f8000|grep sched_class
  sched_class = 0xffffffffa1eea290 <rt_sched_class>,
crash> task ff1d16af4e8f8000 |grep prio
  prio = 0,
  static_prio = 120,
  normal_prio = 0,
  rt_priority = 99,
    prio = 0,
    prio_list = {

## This is the cyclictest thread: an RT thread with the FIFO policy
crash> task ff1d16af6f935e80 |grep policy
  policy = 1,
  mempolicy = 0x0,
crash> task ff1d16af6f935e80 |grep sched_class
  sched_class = 0xffffffffa1eea290 <rt_sched_class>,
crash> task ff1d16af6f935e80 |grep prio
  prio = 19,
  static_prio = 120,
  normal_prio = 19,
  rt_priority = 80,
    prio = 140,
    prio_list = {

## cyclictest's priority of 80 is lower than multipathd's 99

So multipathd's checkerloop thread and cyclictest are both RT threads, and the checkerloop thread's priority of 99 is higher than cyclictest's 80.
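
The prio values in the crash output follow directly from rt_priority. As a minimal sketch, assuming the usual kernel convention that MAX_RT_PRIO is 100 (the helper name rt_prio_to_kernel_prio is made up for illustration and is not a kernel function):

```c
/* Assumption: MAX_RT_PRIO is 100, as in mainline kernels. */
#define MAX_RT_PRIO 100

/*
 * Hypothetical helper: map a user-visible rt_priority (1..99) to the
 * kernel's internal prio, where a smaller number means higher priority.
 */
static int rt_prio_to_kernel_prio(int rt_priority)
{
        return MAX_RT_PRIO - 1 - rt_priority;
}
```

With this mapping, rt_priority 99 gives prio 0 (multipathd) and rt_priority 80 gives prio 19 (cyclictest), which matches the prio fields printed by crash and the indexes shown by runq.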

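With that settled, let's continue analyzing the core.

Look at the timers of the CPU that triggered the softlockup: sure enough, there is work that has expired but has not run yet (other CPUs do not have this problem; CPU 0 has a large number of ghes_poll_func timers queued, but that should be unrelated to this core, so we set it aside).

The timers and runqueue of that CPU are shown below; a timer has indeed expired.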

crash> timer -C 240
JIFFIES
4306062733

TIMER_BASES[240][BASE_STD]: ff1d176b4f4207c0
  EXPIRES        TTE        TIMER_LIST     FUNCTION
  4306058428   -4305  ffffffffa2e83ee0  ffffffffa097a210  <clocksource_watchdog>
  4306097530   34797  ff1d176b4f41fc20  ffffffffa0861530  <mce_timer_fn>
TIMER_BASES[240][BASE_DEF]: ff1d176b4f421a40
  EXPIRES        TTE     TIMER_LIST     FUNCTION
  (none)


crash> runq -c 240
CPU 240 RUNQUEUE: ff1d176b4f430b80
  CURRENT: PID: 4069   TASK: ff1d16af4e8f8000  COMMAND: "multipathd"
  RT PRIO_ARRAY: ff1d176b4f430e40
     [  0] PID: 4069   TASK: ff1d16af4e8f8000  COMMAND: "multipathd"
     [ 19] PID: 1392590  TASK: ff1d16af58b48000  COMMAND: "cyclictest"
  CFS RB_ROOT: ff1d176b4f430cc0
     [120] PID: 3073   TASK: ff1d16af5f4b8000  COMMAND: "kworker/240:2"
     [100] PID: 3898   TASK: ff1d16af6d128000  COMMAND: "kworker/240:1H"
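
The TTE column is simply the expiry time minus the current jiffies value, so a negative number is an overdue timer. A small sketch of that arithmetic follows; the helper names are ours, and the tick rate is an assumption (CONFIG_HZ=250), since the vmcore output above does not show it:

```c
#include <stdint.h>

/* Assumption: CONFIG_HZ=250; the crash output above does not confirm it. */
#define ASSUMED_HZ 250

/* Time-to-expire in jiffies, as in the TTE column: negative means overdue. */
static int64_t timer_tte(uint64_t expires, uint64_t jiffies)
{
        return (int64_t)expires - (int64_t)jiffies;
}

/* Convert an overdue TTE to milliseconds under the assumed tick rate. */
static int64_t overdue_ms(int64_t tte)
{
        return tte < 0 ? (-tte * 1000) / ASSUMED_HZ : 0;
}
```

For clocksource_watchdog, timer_tte(4306058428ULL, 4306062733ULL) is -4305, and under the assumed 250 Hz that is about 17.2 seconds overdue.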

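So what could prevent the softlockup watchdog's stop_work from updating its timestamp? The first suspicion was that multipathd had entered the kernel and hit an infinite loop or a deadlock. But the code of the checkerloop function in multipathd looks fine, and it calls nanosleep, so in theory it does not hog the CPU: sleeping calls schedule and yields the CPU.

At this point there were no obvious leads. So we added verbosity 4 to /etc/multipath.conf, ran multipath -r and then systemctl restart multipathd so that multipathd printed more detailed logs, and triggered the cyclictest crash again. Nothing looked wrong: it was still performing the periodic checks in checkerloop, and then the machine suddenly crashed.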

[Image]

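From the multipathd log we can confirm that the checkerloop thread does call nanosleep, with nothing abnormal.

So here is a bold guess: is this related to the RT scheduler? Check whether any IPI work is queued on the CPU of the checkerloop thread in the vmcore.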

crash> p call_single_queue:240
per_cpu(call_single_queue, 240) = $1 = {
  first = 0xff1d176e5011d878
}
crash> list 0xff1d176e5011d878
ff1d176e5011d878
crash> call_single_data_t ff1d176e5011d878
struct call_single_data_t {
  node = {
    llist = {
      next = 0x0
    },
    {
      u_flags = 35,
      a_flags = {
        counter = 35
      }
    },
    src = 0,
    dst = 0
  },
  func = 0xffffffffa091f400 <rto_push_irq_work_func>,
  info = 0xfc4ab0200000001
}

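The call_single_queue of the problematic CPU does have work queued, whereas the other CPUs have none; the queued work is rto_push_irq_work_func. Could this CPU be handling IPIs continuously, and thereby triggering the softlockup watchdog?

Let's see what rto_push_irq_work_func does; it looks related to RT load balancing.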

void rto_push_irq_work_func(struct irq_work *work)
{
        struct root_domain *rd =
                container_of(work, struct root_domain, rto_push_work);
        struct rq *rq;
        int cpu;

        rq = this_rq();

        /*
         * We do not need to grab the lock to check for has_pushable_tasks.
         * When it gets updated, a check is made if a push is possible.
         */
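         // has_pushable_tasks checks whether rq->rt.pushable_tasks is non-empty
         // push_rt_task: if this CPU has more than one RT task, check whether a task that is not
         // running can be migrated to another CPU that is running a lower-priority task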
        if (has_pushable_tasks(rq)) {
                raw_spin_rq_lock(rq);
                while (push_rt_task(rq, true))
                        ;
                raw_spin_rq_unlock(rq);
        }

        raw_spin_lock(&rd->rto_lock);

        /* Pass the IPI to the next rt overloaded queue */
        // Send the IPI to the next RT overloaded queue, i.e. presumably to the next overloaded CPU, so that it runs the balancing too
        cpu = rto_next_cpu(rd);

        raw_spin_unlock(&rd->rto_lock);

        if (cpu < 0) {
                sched_put_rd(rd);
                return;
        }

        /* Try the next RT overloaded CPU */
        // queue an irq work
        irq_work_queue_on(&rd->rto_push_work, cpu);
}

So we can guess that this CPU is busy with RT scheduler load-balancing work.

First, roughly who ends up queuing this function?

balance_rt
pull_rt_task
tell_cpu_to_push
     irq_work_queue_on(&rq->rd->rto_push_work, cpu);

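When does pick_next_task call balance? With the irrelevant parts of the fair class fast path removed, we can see that if prev's sched_class is not at or below fair (cyclictest and multipathd are in the RT class, so they do not take the fast path), put_prev_task_balance runs first. What does put_prev_task_balance do? It walks the scheduling classes starting from prev's class and calls each class's balance; for RT that is balance_rt.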

__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
..........
        if (likely(prev->sched_class <= &fair_sched_class &&
                   rq->nr_running == rq->cfs.h_nr_running)) {
..............
        }

restart:
        put_prev_task_balance(rq, prev, rf);
        for_each_class(class) {
                p = class->pick_next_task(rq);
                if (p)
                        return p;
        }


static void put_prev_task_balance(struct rq *rq, struct task_struct *prev,
                                  struct rq_flags *rf)
{
        /* ... */
        const struct sched_class *class;
        for_class_range(class, prev->sched_class, &idle_sched_class) {
                if (class->balance(rq, prev, rf))
                        break;
        }
        /* ... */
        put_prev_task(rq, prev);
}

So we can basically conclude that balance runs whenever cyclictest or multipathd is scheduled out.

So start reading from balance_rt.

static int balance_rt(struct rq *rq, struct task_struct *p, struct rq_flags *rf)
{
        if (!on_rt_rq(&p->rt) && need_pull_rt_task(rq, p)) {
                rq_unpin_lock(rq, rf);
                pull_rt_task(rq);
                rq_repin_lock(rq, rf);
        }

        return sched_stop_runnable(rq) || sched_dl_runnable(rq) || sched_rt_runnable(rq);
}

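## If the highest-priority task left on this rq has lower priority than the task that was just running (prev), tasks need to be pulled over?
## Judging from the core, something seems off here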
static inline bool need_pull_rt_task(struct rq *rq, struct task_struct *prev)
{
        /* Try to pull RT tasks here if we lower this rq's prio */
        return rq->online && rq->rt.highest_prio.curr > prev->prio;
}
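
To make the comparison concrete, here is a toy userspace version of the same check. The struct and function names are ours, and EMPTY_RT_PRIO is an assumed stand-in for the value an rt_rq reports when no RT task is queued (the exact constant differs between kernel versions):

```c
#include <stdbool.h>

/*
 * Assumption: an rt_rq with no queued RT task reports a sentinel that is
 * numerically larger than the prio of any queued RT task.
 */
#define EMPTY_RT_PRIO 100

/* Reduced stand-in for the fields need_pull_rt_task() reads. */
struct toy_rq {
        bool online;
        int highest_prio_curr;  /* best prio still queued; smaller = higher */
};

/* Same comparison as need_pull_rt_task(): did this rq's top prio drop? */
static bool toy_need_pull(const struct toy_rq *rq, int prev_prio)
{
        return rq->online && rq->highest_prio_curr > prev_prio;
}
```

On a CPU whose only RT task is cyclictest (prio 19), going to sleep leaves the rt_rq empty, so the check is true and a pull is attempted. On the CPU that still has multipathd (prio 0) queued, the same check is false.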

The rough flow of pull_rt_task:

static void pull_rt_task(struct rq *this_rq)
{
        int this_cpu = this_rq->cpu, cpu;
        bool resched = false;
        struct task_struct *p, *push_task;
        struct rq *src_rq;
        int rt_overload_count = rt_overloaded(this_rq);

        //return if no CPU is overloaded
        if (likely(!rt_overload_count))
                return;

        /*
         * Match the barrier from rt_set_overloaded; this guarantees that if we
         * see overloaded we must also see the rto_mask bit.
         */
        smp_rmb();

        /* If we are the only overloaded CPU do nothing */
        //if the current CPU is the only overloaded one, do nothing
        if (rt_overload_count == 1 &&
            cpumask_test_cpu(this_rq->cpu, this_rq->rd->rto_mask))
                return;

#ifdef HAVE_RT_PUSH_IPI
        if (sched_feat(RT_PUSH_IPI)) {
                //tell other CPUs to push tasks?
                tell_cpu_to_push(this_rq);
                return;
        }
#endif
        //iterate over all CPUs in the mask
        for_each_cpu(cpu, this_rq->rd->rto_mask) {
                //skip the current CPU
                if (this_cpu == cpu)
                        continue;

                //get the source rq, i.e. the rq of the CPU being visited, since tasks are pulled from it onto this CPU
                src_rq = cpu_rq(cpu);

                /*
                 * Don't bother taking the src_rq->lock if the next highest
                 * task is known to be lower-priority than our current task.
                 * This may look racy, but if this value is about to go
                 * logically higher, the src_rq will push this task away.
                 * And if its going logically lower, we do not care
                 */
                 //if highest_prio.next of the src rq, i.e. its next RT task to run, is numerically >= highest_prio.curr of this rq, skip
                 //that is, skip if the src rq's next task does not have higher priority than ours
                if (src_rq->rt.highest_prio.next >=
                    this_rq->rt.highest_prio.curr)
                        continue;

                /*
                 * We can potentially drop this_rq's lock in
                 * double_lock_balance, and another CPU could
                 * alter this_rq
                 */
                push_task = NULL;
                double_lock_balance(this_rq, src_rq);

                /*
                 * We can pull only a task, which is pushable
                 * on its rq, and no others.
                 */
                 //pick the highest-priority pushable task from the src rq
                p = pick_highest_pushable_task(src_rq, this_cpu);

                /*
                 * Do we have an RT task that preempts
                 * the to-be-scheduled task?
                 */
                 //is the task taken from the src rq higher priority than the highest-priority task on this CPU?
                 //if so, it preempts?
                if (p && (p->prio < this_rq->rt.highest_prio.curr)) {
                        WARN_ON(p == src_rq->curr);
                        WARN_ON(!task_on_rq_queued(p));

                        /*
                         * There's a chance that p is higher in priority
                         * than what's currently running on its CPU.
                         * This is just that p is waking up and hasn't
                         * had a chance to schedule. We only pull
                         * p if it is lower in priority than the
                         * current task on the run queue
                         */
                         //if p has higher priority than the task currently running on its own CPU, leave it alone
                        if (p->prio < src_rq->curr->prio)
                                goto skip;
                        //if migration is disabled for p, grab the push task first?
                        //otherwise remove it from the src rq, set its CPU to this CPU, and activate it here?
                        if (is_migration_disabled(p)) {
                                push_task = get_push_task(src_rq);
                        } else {
                                deactivate_task(src_rq, p, 0);
                                set_task_cpu(p, this_cpu);
                                activate_task(this_rq, p, 0);
                                resched = true;
                        }
                        /*
                         * We continue with the search, just in
                         * case there's an even higher prio task
                         * in another runqueue. (low likelihood
                         * but possible)
                         */
                }
skip:
                double_unlock_balance(this_rq, src_rq);
                //for the migration-disabled case above, send a stop_work to the source CPU so that it runs push_cpu_stop?
                if (push_task) {
                        raw_spin_rq_unlock(this_rq);
                        stop_one_cpu_nowait(src_rq->cpu, push_cpu_stop,
                                            push_task, &src_rq->push_work);
                        raw_spin_rq_lock(this_rq);
                }
        }

        if (resched)
                resched_curr(this_rq);
}




tell_cpu_to_push
irq_work_queue_on
__smp_call_single_queue
send_call_function_single_ipi

Continue looking at the core.

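Now we can see that the rto_mask bit is indeed set. How does it get set?

Roughly, enqueue and dequeue increment or decrement the RT task counters, and that is where the overload check is made.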

enqueue_task_rt/sched_rt_rq_enqueue
enqueue_rt_entity
__enqueue_rt_entity
inc_rt_tasks
inc_rt_migration
rt_set_overload

static inline void rt_set_overload(struct rq *rq)
{
	if (!rq->online)
		return;

	cpumask_set_cpu(rq->cpu, rq->rd->rto_mask);
	/*
	 * Make sure the mask is visible before we set
	 * the overload count. That is checked to determine
	 * if we should look at the mask. It would be a shame
	 * if we looked at the mask, but the mask was not
	 * updated yet.
	 *
	 * Matched by the barrier in pull_rt_task().
	 */
	smp_wmb();
	atomic_inc(&rq->rd->rto_count);
}

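(In the vmcore the root_domain's overload is 1, which suggests that after the se was enqueued and overload was set to 1, this CPU never dequeued an RT task again, because it stayed overloaded? So the CPU was occupied the whole time?)

So we can guess that the CPU running multipathd is overloaded, which means it keeps receiving IPIs asking it to push tasks away.

The rq fields below show that the problematic CPU has two RT tasks, one of which is migratable. That one is multipathd, since cyclictest is pinned to its CPU and cannot migrate.

This matches how update_rt_migration sets overload: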

crash> rq ff40900c4ddf0b80 |grep rt_nr_total
    rt_nr_total = 2,
crash> rq ff40900c4ddf0b80 |grep rt_nr_migratory
    rt_nr_migratory = 1,


static void update_rt_migration(struct rt_rq *rt_rq)
{
        if (rt_rq->rt_nr_migratory && rt_rq->rt_nr_total > 1) {
                if (!rt_rq->overloaded) {
                        rt_set_overload(rq_of_rt_rq(rt_rq));
                        rt_rq->overloaded = 1;
                }
        } else if (rt_rq->overloaded) {
                rt_clear_overload(rq_of_rt_rq(rt_rq));
                rt_rq->overloaded = 0;
        }
}
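
The counters alone decide the overload state, regardless of whether anything can actually be pushed. A toy model of that gap, with names of our own choosing (nr_pushable is a hypothetical field added only for illustration; the kernel keeps a pushable_tasks list instead):

```c
#include <stdbool.h>

/* Reduced stand-in for the rt_rq counters used by update_rt_migration(). */
struct toy_rt_rq {
        unsigned int rt_nr_total;      /* RT tasks on this CPU, running one included */
        unsigned int rt_nr_migratory;  /* of those, tasks allowed on more than one CPU */
        unsigned int nr_pushable;      /* hypothetical: migratable tasks that are not running */
};

/* Same condition as update_rt_migration(): counters only. */
static bool toy_overloaded(const struct toy_rt_rq *rt_rq)
{
        return rt_rq->rt_nr_migratory && rt_rq->rt_nr_total > 1;
}

/* What a push IPI could actually act on. */
static bool toy_has_pushable(const struct toy_rt_rq *rt_rq)
{
        return rt_rq->nr_pushable > 0;
}
```

With rt_nr_total = 2 and rt_nr_migratory = 1 as in the vmcore, toy_overloaded() is true, while toy_has_pushable() is false whenever the only migratable task is the one running.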

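So why does being overloaded lead to a crash?

Connecting this with the multipathd log: could it be related to the check thread ticking once per second? It shouldn't be; it sleeps for 1 second and yields the CPU, so how could that cause a softlockup?

So the leads and open questions are:

The crash is probably related to RT load balancing, because the crashing CPU has the rto_push_irq_work_func irq work queued. That in turn should come from the CPU being RT overloaded: a CPU is generally considered overloaded when more than one RT task is queued on it (there are other conditions too, such as whether a task can migrate to another CPU, but the core shows this CPU is definitely overloaded, as seen from the root_domain's rto_mask), so balancing is needed.
Why do the other multipathd threads not cause the crash? Why is it always the checkerloop thread?

We still needed a reproducing environment to debug, but with ftrace enabled the problem became hard to reproduce. So we wrote a bpftrace script to monitor rto_push_irq_work_func.

What is certain is that calls to rto_push_irq_work_func surge when the problem occurs. So who queues it? Only tell_cpu_to_push and rto_push_irq_work_func itself. tell_cpu_to_push is called from pull_rt_task, i.e. when another CPU wants the abnormal CPU to push tasks so that they can be pulled. The other call site is inside rto_push_irq_work_func: could the CPU be sending an IPI to itself, running rto_push_irq_work_func, and sending itself another IPI from there, in an endless loop that keeps the CPU handling interrupts? Reading the logic, in theory it should not: it looks for the next CPU after the current one, that is, the next overloaded CPU in the rto cpumask, and the assignment of rd->rto_cpu looks fine.

To find out who keeps the abnormal CPU busy handling interrupts, we need to trace exactly who is queuing rto_push_irq_work_func in such volume.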

perf probe 'rto_push_irq_work_func'

perf record -e probe:rto_push_irq_work_func -aR sleep 30

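Then we also ran perf probe on tell_cpu_to_push, and the picture became clear.

When cyclictest is scheduled, RT balancing is triggered, which sends an IPI asking the abnormal CPU to push tasks away; but when a large number of IPIs arrive in a short time, the CPU cannot keep up.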

https://lore.kernel.org/lkml/xhsmhpm3vb6ws.mognet@vschneid.remote.csb/T/

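It turned out that someone in the community had hit a similar problem.

So the sequence is now clear:

Start cyclictest: every CPU gets a pinned cyclictest RT thread with priority 80.
Ubuntu 22 runs a multipathd process, which has a checkerloop RT thread with priority 99.
When multipathd's checkerloop runs, its CPU (call it CPU X) also has a cyclictest RT thread, so there are two RT threads and the system considers this CPU overloaded.
When cyclictest on another CPU runs and then schedules or sleeps (which calls schedule; scheduling to swapper is a switch to a lower-priority task), it finds that balancing is needed and uses tell_cpu_to_push (which increments rq->rd->rto_loop_next) to send an IPI to X asking it to push tasks.
X is currently running multipathd's checkerloop, and the cyclictest on that CPU is pinned and cannot migrate, so there are no pushable tasks and nothing gets pushed. rto_push_irq_work_func then calls rto_next_cpu to find the next overloaded CPU. From the code of rto_next_cpu, it searches rto_mask for the next overloaded CPU after X; it finds none, because X is the only overloaded CPU, so rd->rto_cpu is set to -1. But it then compares rto_loop_next with rto_loop; if they differ, it sets rto_loop to rto_loop_next and loops again. On that next iteration rto_cpu is -1, so the search returns X. irq_work_queue_on is therefore called to send an IPI to X itself to run rto_push_irq_work_func again, and when X runs the irq work it still cannot push anything: the only migratable task is multipathd's checkerloop, but it is the running task, so there is no pushable task. Meanwhile the other CPUs keep sending it IPIs, so it keeps handling interrupts, which finally leads to the softlockup.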
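
The self-IPI behaviour in the last step can be reproduced in userspace. The sketch below is our own reduced rendering of the loop in rto_next_cpu, without locking or atomics and with a 64-bit mask standing in for the cpumask; all toy_ names are made up for illustration:

```c
#include <stdint.h>

#define TOY_NR_CPUS 64

/* Reduced stand-in for the root_domain fields used by the IPI loop. */
struct toy_rd {
        uint64_t rto_mask;   /* bit n set = CPU n is marked overloaded */
        int rto_cpu;         /* last CPU handed out, -1 when idle */
        int rto_loop;        /* generation the loop has caught up to */
        int rto_loop_next;   /* bumped by every tell_cpu_to_push() */
};

/* First set bit strictly after prev (prev == -1 means start from bit 0). */
static int toy_cpumask_next(int prev, uint64_t mask)
{
        for (int cpu = prev + 1; cpu < TOY_NR_CPUS; cpu++)
                if (mask & (UINT64_C(1) << cpu))
                        return cpu;
        return TOY_NR_CPUS;
}

/* Userspace rendering of the loop in rto_next_cpu(). */
static int toy_rto_next_cpu(struct toy_rd *rd)
{
        for (;;) {
                int cpu = toy_cpumask_next(rd->rto_cpu, rd->rto_mask);

                rd->rto_cpu = cpu;
                if (cpu < TOY_NR_CPUS)
                        return cpu;

                rd->rto_cpu = -1;
                if (rd->rto_loop == rd->rto_loop_next)
                        break;          /* nobody asked again: stop */
                rd->rto_loop = rd->rto_loop_next;
        }
        return -1;
}
```

With only CPU 5 set in rto_mask, one bump of rto_loop_next makes the first call return 5, and the call made from CPU 5's own irq work returns 5 again because rto_loop has not caught up yet. Every further bump from another CPU buys one more self-IPI, and the function only returns -1 once no new request has arrived in between.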

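How should this be fixed? One option, like that patch, is to return immediately if there is no pushable task. Another is to return if the target CPU is the current CPU, so that a CPU does not send an IPI to itself (is that unreasonable? this case should probably be allowed). Or could we change how overload is determined (when every CPU has a pinned RT process like cyclictest and a higher-priority RT task is running on one CPU, should that not count as overload? No, that is not reasonable either: we cannot assume what the pinned RT tasks on each CPU look like, so whenever a CPU has a pushable task it should be considered overloaded)?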

https://lkml2.uits.iu.edu/hypermail/linux/kernel/1704.2/04980.html

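Taken together with this, there does seem to be a problem. rto_loop_next is bumped whenever a CPU schedules to a lower-priority task (for example cyclictest sleeping and scheduling to swapper), and cyclictest does this over and over. rto_loop may therefore never catch up with rto_loop_next, so besides the cyclictest CPUs sending IPIs to X, X also sends IPIs to itself while running rto_push_irq_work_func.

Is the following change reasonable? Not letting a CPU send an IPI to itself should eliminate many useless IPIs, but it still does not stop the cyclictest CPUs from sending IPIs to X.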

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index bd66a46b06ac..2ebb7f75ff79 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -2165,8 +2165,12 @@ static int rto_next_cpu(struct root_domain *rd)

                rd->rto_cpu = cpu;

-               if (cpu < nr_cpu_ids)
-                       return cpu;
+               if (cpu < nr_cpu_ids) {
+                       if (cpu != smp_processor_id())
+                               return cpu;
+                       else
+                               return -1;
+               }

                rd->rto_cpu = -1;

It turns out the community has already fixed this:

sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask(https://lore.kernel.org/lkml/20230811112044.3302588-1-vschneid@redhat.com/T/)

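This fix is elegant and addresses the root cause. The root cause is not that rto_push_irq_work_func sends IPIs to itself; it is that the many cyclictest RT processes going to sleep make schedule decide that RT balancing is needed. How is RT balancing done? It checks whether overload has been set in the root_domain, looks up the overloaded CPUs in rto_mask, and sends them IPIs to run the rto_push_irq_work_func irq work, whose main job is to push pushable tasks away.

Why did cyclictest cause the crash? On the CPU that hit the softlockup there are two RT tasks, cyclictest and multipathd. cyclictest is pinned and cannot migrate; only multipathd is migratable. Under the old logic, when multipathd is enqueued, rt_nr_migratory becomes 1 because it is migratable, and rt_nr_total is 2, greater than 1 (the pinned cyclictest is already there), so the condition in update_rt_migration holds and the CPU is considered overloaded. But while multipathd is running, the pushable task list is actually empty: multipathd is the only migratable task, and it is the one running. The other cyclictest CPUs see that multipathd's CPU is overloaded and send it IPIs, but on that CPU rto_push_irq_work_func does no real work because there is nothing to push. That is the problem: the other CPUs keep sending IPIs because they still consider it overloaded, while this CPU never pushes any task away, so it runs rto_push_irq_work_func in vain over and over and eventually hits the softlockup.

So the root cause is: in this situation (two RT threads, one pinned and the other migratable, with the migratable one currently running on the CPU), the CPU should not be considered overloaded, because it has no task that can be pushed.

How does the patch sched/rt: Make rt_rq->pushable_tasks updates drive rto_mask solve this? It removes the original overload check, which was made in enqueue_rt_entity/dequeue_rt_entity and ended up in update_rt_migration (based on rt_nr_migratory and rt_nr_total), and moves it into enqueue_pushable_task and dequeue_pushable_task, which are called from enqueue_task_rt/dequeue_task_rt. As shown below, enqueue_task_rt used to detect and set overload directly in enqueue_rt_entity; now that happens in enqueue_pushable_task. Take the cyclictest crash scenario again: when multipathd is enqueued, it reaches the check if (!task_current(rq, p) && p->nr_cpus_allowed > 1). task_current(rq, p) is tested first, i.e. whether the task is the one currently running; if it is, overload is not set, so the situation no longer arises. Very neat; hats off to the community developers.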

static void
enqueue_task_rt(struct rq *rq, struct task_struct *p, int flags)
{
        struct sched_rt_entity *rt_se = &p->rt;

        if (flags & ENQUEUE_WAKEUP)
                rt_se->timeout = 0;

        check_schedstat_required();
        update_stats_wait_start_rt(rt_rq_of_se(rt_se), rt_se);

        enqueue_rt_entity(rt_se, flags);

        if (!task_current(rq, p) && p->nr_cpus_allowed > 1)
                enqueue_pushable_task(rq, p);
}
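
To see the effect of that condition on the crash scenario, here is a toy model. The struct and function names are ours, and the rule "overloaded only while some task is pushable" is our reading of the patch title rather than a copy of the patched kernel code:

```c
#include <stdbool.h>

/* Reduced stand-in for the task fields the check reads. */
struct toy_task {
        bool is_current;      /* task is the one running on this CPU */
        int nr_cpus_allowed;  /* 1 = pinned to a single CPU */
};

/* Condition guarding enqueue_pushable_task() in enqueue_task_rt(). */
static bool toy_is_pushable(const struct toy_task *p)
{
        return !p->is_current && p->nr_cpus_allowed > 1;
}

/*
 * Assumption based on the patch title: the CPU counts as overloaded only
 * while at least one task on it is pushable.
 */
static bool toy_overloaded_after_patch(const struct toy_task *tasks, int n)
{
        for (int i = 0; i < n; i++)
                if (toy_is_pushable(&tasks[i]))
                        return true;
        return false;
}
```

With a running, migratable multipathd and a pinned cyclictest, no task is pushable, so the CPU is not reported as overloaded and the other CPUs have no reason to send it push IPIs.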

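Please credit the source when reposting. Readers are welcome to verify the sources cited in this article and to point out anything that is wrong or unclear, either in the comments or by email to 857879363@qq.com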