Process Context Switch in the Linux Kernel
Basic call path of process scheduling in Linux (starting from kernel/sched.c):
schedule()->context_switch()->switch_to()->__switch_to()
When switching process context, two main parts must be taken into consideration:
1. Switch the Page Global Directory, which installs the address space of the next process.
2. Switch the kernel stack and the hardware context.
These steps are done in context_switch(), switch_to() and __switch_to(). context_switch() switches the address space and then invokes switch_to(); switch_to() switches the kernel stack, and __switch_to() handles the hardware context switch.
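The two parts can be sketched with a toy model in plain C. This is only an illustration, not kernel code: the toy_* structures and functions are hypothetical stand-ins invented for this example, and the hardware context handled by __switch_to() is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct toy_mm   { uintptr_t pgd; };                  /* page global directory */
struct toy_task { struct toy_mm *mm; uintptr_t sp; };/* saved kernel stack pointer */
struct toy_cpu  { uintptr_t cr3; uintptr_t esp; };   /* the two registers we model */

/* Part 1: install the address space of next. */
static void toy_switch_mm(struct toy_cpu *cpu, const struct toy_task *next)
{
    cpu->cr3 = next->mm->pgd;
}

/* Part 2: save the stack pointer of prev, then load the one of next. */
static void toy_switch_stack(struct toy_cpu *cpu, struct toy_task *prev,
                             const struct toy_task *next)
{
    prev->sp = cpu->esp;
    cpu->esp = next->sp;
}

/* Both parts in the order described above. */
static void toy_context_switch(struct toy_cpu *cpu, struct toy_task *prev,
                               struct toy_task *next)
{
    assert(prev != next);
    toy_switch_mm(cpu, next);
    toy_switch_stack(cpu, prev, next);
}
```

After toy_context_switch(&cpu, &a, &b), the model CPU uses the page directory and stack pointer of b, and the old stack pointer is kept in a.sp so that a can be resumed later.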
Here is the source code of switch_to():
/*
 * Saving eflags is important. It switches not only IOPL between tasks,
 * it also protects other tasks from NT leaking through sysenter etc.
 */
#define switch_to(prev, next, last) \
do { \
    /* \
     * Context-switching clobbers all registers, so we clobber \
     * them explicitly, via unused output variables. \
     * (EAX and EBP is not listed because EBP is saved/restored \
     * explicitly for wchan access and EAX is the return value of \
     * __switch_to()) \
     */ \
    unsigned long ebx, ecx, edx, esi, edi; \
    \
    asm volatile("pushfl\n\t"               /* save flags */ \
                 "pushl %%ebp\n\t"          /* save EBP */ \
                 "movl %%esp,%[prev_sp]\n\t" /* save ESP */ \
                 "movl %[next_sp],%%esp\n\t" /* restore ESP */ \
                 "movl $1f,%[prev_ip]\n\t"  /* save EIP */ \
                 "pushl %[next_ip]\n\t"     /* restore EIP */ \
                 __switch_canary \
                 "jmp __switch_to\n"        /* regparm call */ \
                 "1:\t" \
                 "popl %%ebp\n\t"           /* restore EBP */ \
                 "popfl\n"                  /* restore flags */ \
                 \
                 /* output parameters */ \
                 : [prev_sp] "=m" (prev->thread.sp), \
                   [prev_ip] "=m" (prev->thread.ip), \
                   "=a" (last), \
                   \
                   /* clobbered output registers: */ \
                   "=b" (ebx), "=c" (ecx), "=d" (edx), \
                   "=S" (esi), "=D" (edi) \
                   \
                   __switch_canary_oparam \
                   \
                   /* input parameters: */ \
                 : [next_sp] "m" (next->thread.sp), \
                   [next_ip] "m" (next->thread.ip), \
                   \
                   /* regparm parameters for __switch_to(): */ \
                   [prev] "a" (prev), \
                   [next] "d" (next) \
                   \
                   __switch_canary_iparam \
                   \
                 : /* reloaded segment registers */ \
                   "memory"); \
} while (0)
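We can see that three things are done in the switch_to() macro:
1. Save the state of prev (flags, %ebp, %esp and the resume address) and load the %esp of next, which switches the kernel stack.
2. Switch the hardware context (__switch_to()).
3. Restore %ebp and the flags of next from its kernel stack.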
This code is well designed, and the comments describe what each instruction does. Note that before jumping to __switch_to(), %[next_ip] is pushed onto the stack, so the saved %eip of the next process becomes the return address of __switch_to(); its actual parameters, prev and next, are passed in %eax and %edx. The next line ("1:\t") is a local label. Its address is stored in %[prev_ip] by the line "movl $1f,%[prev_ip]\n\t", so when prev is scheduled again later it resumes at that label, where %ebp is popped and the flags are restored. If the process we switch to is a new process, %[next_ip] instead holds the address of ENTRY(ret_from_fork) (in arch/x86/kernel/entry_32.S). This is why "call __switch_to" cannot replace "jmp __switch_to": "call" would push the address of the following instruction (label 1) as the return address, and __switch_to() could then never return to ret_from_fork. The macro therefore does the "pushl" by hand and then jumps.
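The difference between the two entry styles can be sketched with a simulated stack. This is a toy model, not kernel code: the addresses RESUME_LABEL_1 and RET_FROM_FORK are made-up values, and the toy_* helpers are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Made-up addresses standing in for label "1:" and ret_from_fork. */
enum { RESUME_LABEL_1 = 0x1000, RET_FROM_FORK = 0x2000 };

/* A tiny simulated kernel stack. */
struct toy_stack { uint32_t slot[16]; int top; };

static void toy_push(struct toy_stack *s, uint32_t v)
{
    assert(s->top < 16);
    s->slot[s->top++] = v;
}

static uint32_t toy_pop(struct toy_stack *s)
{
    assert(s->top > 0);
    return s->slot[--s->top];
}

/* "pushl next_ip; jmp __switch_to": jmp pushes nothing, so the final
 * ret inside __switch_to pops next_ip and continues there. */
static uint32_t toy_enter_by_jmp(struct toy_stack *next_stack, uint32_t next_ip)
{
    toy_push(next_stack, next_ip);
    return toy_pop(next_stack);          /* models the ret */
}

/* "call __switch_to": the CPU pushes the address of the instruction
 * after the call (label 1), so ret always continues at label 1. */
static uint32_t toy_enter_by_call(struct toy_stack *next_stack, uint32_t next_ip)
{
    (void)next_ip;                       /* call cannot make use of it */
    toy_push(next_stack, RESUME_LABEL_1);
    return toy_pop(next_stack);          /* models the ret */
}
```

With next_ip set to RET_FROM_FORK, toy_enter_by_jmp() yields RET_FROM_FORK while toy_enter_by_call() yields RESUME_LABEL_1, which mirrors why the macro pushes the address itself.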
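A conceptual view
On each hardware processor, the current process can be found just by examining the stack. Earlier versions of Linux did not store the kernel stack and the process descriptor together; they had to introduce a global static variable, current, to identify the descriptor of the running process. On multiprocessor systems, current had to be defined as an array with one element per available CPU.
In short, esp identifies the current process, so switching esp is switching the process. This is exactly how the Linux kernel implements the process switch: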
movl %[next_sp],%%esp
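This instruction loads next->thread.sp into esp, that is, it switches the stack. From this point on the instructions already run on the next process, but the hardware context still has to be switched (in earlier versions the processor did this automatically). The instructions after the stack switch can therefore be seen as the next process doing the hardware context switch itself, in place of the processor: because next runs in kernel mode, it can access both prev and next, so it can save the hardware context of prev and restore its own. From another angle, kernel code runs on top of prev or next in order to manage them: it runs on prev before the stack switch and on next after it, yet this is best viewed not as prev or next running but as kernel code running.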
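The idea that the stack pointer alone identifies the current process can be sketched in user space. The sketch assumes the layout described for 2.6-era 32-bit x86 kernels, where an aligned 8 KB area holds a thread_info structure at its lowest address and the kernel stack above it. The toy_* names are hypothetical, and aligned_alloc() (C11) only stands in for the kernel's own stack allocation.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_THREAD_SIZE 8192u   /* assumed: two 4 KB pages per kernel stack */

struct toy_task        { int pid; };
struct toy_thread_info { struct toy_task *task; };

/* Mask off the low bits of any address inside the stack area to reach
 * the thread_info stored at the base of that area. */
static struct toy_thread_info *toy_thread_info_from_sp(uintptr_t sp)
{
    return (struct toy_thread_info *)(sp & ~((uintptr_t)TOY_THREAD_SIZE - 1));
}

/* "current" derived from nothing but a stack pointer value. */
static struct toy_task *toy_current(uintptr_t sp)
{
    return toy_thread_info_from_sp(sp)->task;
}

/* Allocate an aligned stack area and record its owner at the base. */
static unsigned char *toy_alloc_stack(struct toy_task *owner)
{
    void *area = aligned_alloc(TOY_THREAD_SIZE, TOY_THREAD_SIZE);
    assert(area != NULL);
    ((struct toy_thread_info *)area)->task = owner;
    return area;
}
```

Any address inside the area of a task maps back to that task, so loading a different value into esp is enough to change what toy_current() reports.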
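Why not simplify it to call __switch_to
Call the next process B. If B has been switched out by switch_to before, [next_ip] holds the address of the 1: label in the macro. But if B has just been created and has never been switched out by switch_to, [next_ip] holds ret_from_fork (see the copy_thread() function). That is why jmp is used here rather than call __switch_to: call would automatically push the address of the following instruction (label 1:) onto the stack, so __switch_to() could only ever ret to that label and could not ret to ret_from_fork when needed.
Is next a new process (one that has never been through switch_to, such as a child created by fork)?
1. If next is a new process, then although the code after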
movl %[next_sp],%%esp
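runs on the next process, that point is not the real starting point of next (the new process); ret_from_fork is.
Note that when the processor switches to a process, execution does not have to continue at the instruction that follows the point where the process was last switched out. In other words, after switching to next the kernel does not run ret_from_fork right away; it first completes the remaining bookkeeping of the switch (saving and restoring the hardware context) and only then jumps to ret_from_fork, which happens when __switch_to() returns. To sum up, the switch is not complete at the moment the stack is switched to next; only after the hardware context has been restored has the process really been switched.
2. If next is not a new process (it has been through switch_to before), then although the code after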
movl %[next_sp],%%esp
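is where next really resumes, the hardware context at that moment is still not that of next. Only after __switch_to() has restored the hardware context is next completely restored.
3. Putting the two cases together: from the stack switch until switch_to finishes, the code runs on the next process, but this does not fully count as next starting; only after switch_to returns has next truly been restored.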
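The two cases have a loose user-space analogy in the POSIX ucontext API (as provided by glibc): a freshly made context starts at its entry function, much as a new process starts at ret_from_fork, while a context that was switched out earlier resumes right after its own swapcontext() call, much as an old process resumes at label 1. This is only an analogy and not how the kernel does it; the trace array and function names are invented for the sketch.

```c
#include <assert.h>
#include <ucontext.h>

static ucontext_t main_ctx, child_ctx;
static char child_stack[64 * 1024];
static int trace[8];
static int n_events;

static void record(int ev)
{
    if (n_events < 8)
        trace[n_events++] = ev;
}

static void child_entry(void)
{
    record(2);                            /* first run: starts at the entry point */
    swapcontext(&child_ctx, &main_ctx);   /* switched out; resume point is here */
    record(4);                            /* second run: continues after the swap */
}

/* Returns the number of recorded events; trace[] holds their order. */
int run_demo(void)
{
    n_events = 0;
    getcontext(&child_ctx);
    child_ctx.uc_stack.ss_sp = child_stack;
    child_ctx.uc_stack.ss_size = sizeof child_stack;
    child_ctx.uc_link = &main_ctx;        /* where to go when child_entry returns */
    makecontext(&child_ctx, child_entry, 0);

    record(1);
    swapcontext(&main_ctx, &child_ctx);   /* new context: begins at child_entry */
    record(3);
    swapcontext(&main_ctx, &child_ctx);   /* old context: resumes mid-function */
    record(5);
    return n_events;
}
```

The events are recorded in the order 1, 2, 3, 4, 5: the first switch enters child_entry() from its beginning, and the second switch continues it from the point where it gave up the CPU.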
Simplified, the switch_to macro is:
// See Understanding the Linux Kernel, 3rd edition, p. 111
movl prev,%eax
movl next,%edx
pushfl
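pushl %ebp
movl %esp,prev->thread.sp
movl next->thread.sp,%esp
movl $1f,prev->thread.ip
pushl next->thread.ip
jmp __switch_to
1:
popl %ebp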
popfl