
Process Context Switch in Linux Kernel

Basic call path of process scheduling in Linux (starting from kernel/sched.c): schedule() -> context_switch() -> switch_to() -> __switch_to()

When switching process context, two main parts must be handled:

1. switching the page global directory (the address space)
2. switching the kernel stack and the hardware context

These steps are done in context_switch(), switch_to() and __switch_to(). context_switch() switches the address space, switch_to() switches the kernel stack, and __switch_to() handles the hardware context switch. Here is the source code of switch_to():

```c
/*
 * Saving eflags is important. It switches not only IOPL between tasks,
 * it also protects other tasks from NT leaking through sysenter etc.
 */
#define switch_to(prev, next, last)                                     \
do {                                                                    \
        /*                                                              \
         * Context-switching clobbers all registers, so we clobber      \
         * them explicitly, via unused output variables.                \
         * (EAX and EBP is not listed because EBP is saved/restored     \
         * explicitly for wchan access and EAX is the return value of   \
         * __switch_to())                                               \
         */                                                             \
        unsigned long ebx, ecx, edx, esi, edi;                          \
                                                                        \
        asm volatile("pushfl\n\t"               /* save flags */        \
                     "pushl %%ebp\n\t"          /* save EBP */          \
                     "movl %%esp,%[prev_sp]\n\t" /* save ESP */         \
                     "movl %[next_sp],%%esp\n\t" /* restore ESP */      \
                     "movl $1f,%[prev_ip]\n\t"  /* save EIP */          \
                     "pushl %[next_ip]\n\t"     /* restore EIP */       \
                     __switch_canary                                    \
                     "jmp __switch_to\n"        /* regparm call */      \
                     "1:\t"                                             \
                     "popl %%ebp\n\t"           /* restore EBP */       \
                     "popfl\n"                  /* restore flags */     \
                                                                        \
                     /* output parameters */                            \
                     : [prev_sp] "=m" (prev->thread.sp),                \
                       [prev_ip] "=m" (prev->thread.ip),                \
                       "=a" (last),                                     \
                                                                        \
                       /* clobbered output registers: */                \
                       "=b" (ebx), "=c" (ecx), "=d" (edx),              \
                       "=S" (esi), "=D" (edi)                           \
                                                                        \
                       __switch_canary_oparam                           \
                                                                        \
                       /* input parameters: */                          \
                     : [next_sp] "m" (next->thread.sp),                 \
                       [next_ip] "m" (next->thread.ip),                 \
                                                                        \
                       /* regparm parameters for __switch_to(): */      \
                       [prev] "a" (prev),                               \
                       [next] "d" (next)                                \
                                                                        \
                       __switch_canary_iparam                           \
                                                                        \
                     : /* reloaded segment registers */                 \
                       "memory");                                       \
} while (0)
```

The switch_to() macro does three things:

1. it saves the flags and EBP on prev's kernel stack, then switches %esp to next's kernel stack
2. it jumps to __switch_to() for the hardware context switch
3. it restores EBP and the flags from the stack it resumes on

The code is well designed, and the comments describe each step clearly.

Note that just before the jump to __switch_to(), %[next_ip] is pushed onto the stack, so the saved %eip of the next process becomes the return address of __switch_to(). It is not a parameter: the regparm parameters prev and next travel in %eax and %edx. The next line ("1:\t") is a local label, and its address is what "movl $1f,%[prev_ip]\n\t" stores in prev->thread.ip. When prev is later switched back in, it therefore resumes at label 1, where %ebp is popped and the flags are restored.

If the process we switch to is a new process, %[next_ip] holds the address of ENTRY(ret_from_fork) (in file arch/x86/kernel/entry_32.S). This is why "jmp __switch_to" cannot be replaced with "call __switch_to": "call" would push the address of the following instruction (label 1) as the return address, so __switch_to() could never return to ret_from_fork. The macro does the "pushl" by hand instead and then jumps.
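The sequence of steps in the macro can be traced with a small userspace model. This is only a sketch under stated assumptions: the "CPU" is a struct, each kernel stack is an array, and the names `struct task`, `struct cpu`, `task_init_new()`, `model_switch_to()` and the constants `LABEL_1` / `RET_FROM_FORK` are all hypothetical stand-ins, not kernel APIs. A new task is modeled with an empty stack, which simplifies what copy_thread() really sets up.

```c
#include <assert.h>
#include <stddef.h>

#define STACK_WORDS 16

/* Hypothetical stand-ins for the two code addresses a task can resume at. */
enum resume_point { LABEL_1 = 1, RET_FROM_FORK = 2 };

struct task {
    unsigned long stack[STACK_WORDS]; /* simulated kernel stack */
    size_t sp;                        /* models thread.sp (index; stack grows down) */
    unsigned long ip;                 /* models thread.ip */
};

struct cpu {
    unsigned long *stack;             /* stack that ESP currently points into */
    size_t esp;
    unsigned long ebp;
    unsigned long eflags;
};

static void push(struct cpu *c, unsigned long v)
{
    assert(c->esp > 0);
    c->stack[--c->esp] = v;
}

static unsigned long pop(struct cpu *c)
{
    assert(c->esp < STACK_WORDS);
    return c->stack[c->esp++];
}

/* Assumption: a never-run task starts with an empty stack and ip = RET_FROM_FORK. */
void task_init_new(struct task *t)
{
    t->sp = STACK_WORDS;
    t->ip = RET_FROM_FORK;
}

/* Follows the macro line by line; returns where __switch_to()'s ret lands. */
unsigned long model_switch_to(struct cpu *c, struct task *prev, struct task *next)
{
    unsigned long resume;

    push(c, c->eflags);        /* pushfl */
    push(c, c->ebp);           /* pushl %ebp */
    prev->sp = c->esp;         /* movl %esp,%[prev_sp] */
    c->stack = next->stack;    /* movl %[next_sp],%esp */
    c->esp = next->sp;
    prev->ip = LABEL_1;        /* movl $1f,%[prev_ip] */
    push(c, next->ip);         /* pushl %[next_ip] */
                               /* jmp __switch_to (hardware context work omitted) */
    resume = pop(c);           /* ret inside __switch_to() */
    if (resume == LABEL_1) {   /* 1: */
        c->ebp = pop(c);       /* popl %ebp */
        c->eflags = pop(c);    /* popfl */
    }
    return resume;
}
```

In this model, switching from a running task A to a new task B returns `RET_FROM_FORK`, and a later switch from B back to A returns `LABEL_1` with A's saved ebp and eflags restored from A's own stack.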

A conceptual view

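As noted earlier, on each hardware processor the current process can be identified just by examining the stack. Earlier versions of Linux did not store the kernel stack and the process descriptor together; instead they were forced to introduce a global static variable, current, to identify the descriptor of the running process. On multiprocessor systems, current had to be defined as an array with one element per available CPU.
To summarize the passage above: esp identifies the current process. Switching esp therefore switches the process, and this is exactly how the Linux kernel implements a process switch: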

```asm
movl %[next_sp],%%esp
```

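This instruction loads next->thread.sp into esp, which is the esp switch. The instructions that follow already run on the next process; what remains is the hardware context switch (in earlier versions the processor hardware did this automatically). The instructions after the stack switch can therefore be seen as the next process performing the hardware context switch in place of the hardware: because next is in kernel mode, it can access both prev and next, so it can save prev's hardware context and restore its own. Seen from another angle, kernel code runs on top of prev or next in order to manage them. It runs on prev before the stack switch and on next after it, yet what is running is best regarded as kernel code rather than prev or next themselves.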
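The claim that the stack pointer alone identifies the current process can be sketched in userspace C. The sketch assumes the layout described in "Understanding the Linux Kernel": an 8 KB, 8 KB-aligned kernel stack area with a small descriptor at its lowest address that points to the process descriptor. `THREAD_SIZE` is that assumed size, and `struct task_model`, `struct thread_info_model`, `alloc_stack_area()`, `thread_info_from_sp()` and `current_from_sp()` are hypothetical names used only for illustration. It relies on C11 `aligned_alloc()`.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define THREAD_SIZE 8192u  /* assumed size and alignment of the stack area */

struct task_model { int pid; };                        /* stand-in for task_struct */
struct thread_info_model { struct task_model *task; }; /* lives at the base of the area */

/* Mask off the low bits of any stack address to find the base of its stack area. */
struct thread_info_model *thread_info_from_sp(uintptr_t sp)
{
    return (struct thread_info_model *)(sp & ~((uintptr_t)THREAD_SIZE - 1));
}

/* The "current" process is whatever the descriptor at the base points to. */
struct task_model *current_from_sp(uintptr_t sp)
{
    return thread_info_from_sp(sp)->task;
}

/* Allocate an aligned stack area and link it to its task; returns NULL on failure. */
void *alloc_stack_area(struct task_model *t)
{
    void *area = aligned_alloc(THREAD_SIZE, THREAD_SIZE);

    if (area != NULL)
        ((struct thread_info_model *)area)->task = t;
    return area;
}
```

Any address inside the area, from just above the descriptor up to the top of the stack, maps back to the same task, which is why loading a different value into esp is enough to change which process counts as current.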

Why not simplify this to call __switch_to

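Suppose we are switching to process B. If B has been switched out by switch_to before, [next_ip] holds the address of label 1 in the macro. If B has just been created and has never been switched out, [next_ip] holds ret_from_fork instead (see copy_thread()). This is why the macro uses jmp rather than call __switch_to: call would automatically push the address of the following instruction (label 1) onto the stack, so __switch_to() could only ret to label 1 and could never ret to ret_from_fork when that is what is needed.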
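The difference between the two instructions comes down to who chooses the return address. Below is a minimal model, assuming a tiny array-based stack; `LABEL_1`, `RET_FROM_FORK`, `ret_target_with_call()` and `ret_target_with_push_jmp()` are hypothetical names for illustration, not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two code addresses discussed in the text. */
enum code_addr { LABEL_1 = 1, RET_FROM_FORK = 2 };

#define STACK_WORDS 4

struct stack_model {
    unsigned long words[STACK_WORDS];
    size_t top;                 /* index of the top word; the stack grows down */
};

static void push(struct stack_model *s, unsigned long v)
{
    assert(s->top > 0);
    s->words[--s->top] = v;
}

static unsigned long pop(struct stack_model *s)
{
    assert(s->top < STACK_WORDS);
    return s->words[s->top++];
}

/* "call __switch_to": the CPU pushes the address of the instruction after the call. */
unsigned long ret_target_with_call(unsigned long next_ip)
{
    struct stack_model s = { {0}, STACK_WORDS };

    (void)next_ip;              /* call offers no way to use it */
    push(&s, LABEL_1);          /* implicit push performed by call */
    return pop(&s);             /* ret at the end of __switch_to() */
}

/* "pushl %[next_ip]; jmp __switch_to": the macro chooses the return address. */
unsigned long ret_target_with_push_jmp(unsigned long next_ip)
{
    struct stack_model s = { {0}, STACK_WORDS };

    push(&s, next_ip);          /* explicit pushl %[next_ip] */
    return pop(&s);             /* jmp pushes nothing, so ret pops next_ip */
}
```

With `RET_FROM_FORK` as the saved ip of a new task, the call variant still lands on `LABEL_1`, while the push-and-jmp variant lands on `RET_FROM_FORK`.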

Is next a new process (one that has never been through switch_to, such as a child created by fork)?

1. If next is a new process, then although execution after

```asm
movl %[next_sp],%%esp
```

runs on the next process, that point is not where the new process really starts; ret_from_fork is.

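Note one thing: when the processor switches to a process, it does not have to continue at the instruction that follows the point where the process was last switched out. In other words, switching to next does not run ret_from_fork right away. The remaining bookkeeping of the switch (saving and restoring the hardware context) is done first, and only then does execution jump to ret_from_fork, which happens when __switch_to returns. In short, switching to next (the stack switch) does not complete the switch; the process is fully switched only after the hardware context has been restored.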

2. If next is not a new process (it has been through switch_to before), then although execution after

```asm
movl %[next_sp],%%esp
```

is where next really resumes, the hardware context at that point is still not next's. Only after __switch_to has restored the hardware context is next fully restored.

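3. Taking the two cases together: from the stack switch to the end of switch_to, execution is already on the next process, but this does not fully count as next having started. Only once switch_to has finished (that is, once __switch_to() has returned to label 1 or to ret_from_fork) is next truly restored.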

Simplified, the switch_to macro is:

```asm
/* See "Understanding the Linux Kernel", 3rd edition, p. 111.
   Operands such as prev->thread.sp are symbolic, as in the macro above. */
movl prev,%eax
movl next,%edx
pushfl
pushl %ebp
movl %esp,prev->thread.sp
movl next->thread.sp,%esp
movl $1f,prev->thread.ip
pushl next->thread.ip
jmp __switch_to
1:
popl %ebp
popfl
```