6.828 笔记4

date: 2019-02-18
tags: OS 6.828

这里会记录阅读6.828课程lecture note的我的个人笔记。可能会中英混杂，不是很适合外人阅读，也请见谅。

Lecture 5: Isolation mechanisms

多个进程同时运行导致了对操作系统的3项主要要求：

multiplexing
isolation
interaction / sharing

而这其中isolation是最不好完成的要求。

那么isolation要完成什么呢？

用隔离来包裹住错误
进程是isolation的基本单元
防止进程x监视进程y
防止进程干预操作系统

kernel用硬件机制来辅助得到process isolation

processor有user / kernel mode flag
每个进程有分配address spaces
timeslicing
system call interface

先来说硬件上的user / kernel mode flag。

控制instruction是否能有权限使用privileged h / w
在x86上叫CPL，是%cs的后两位

CPL=0 -- kernel mode

CPL=3 -- user mode
x86 CPL保护了很多寄存器，包括：
- I / O port access
- control register access (eflag, %cs4)，包括%cs自己
- 间接影响内存访问
每个非娱乐性质的microprocessor都会有user / kernel mode flag

如何用system call来切换CPL，考虑如下几个问题

可不可以设置一个system call直接
```
set CPL=0
```
这样肯定不行，因为用户可以直接设置CPL，就没有办法保护内核了。
那么如果要求这个system call必须要直接跳入kernel中的一个位置呢？
这样也不好，因为可能会跳到一个很尴尬的位置，没法运行下去了

所以x86给出的答案是：

给出几个permissible kernel entry points
INT 指令会设置CPL=0，然后跳到某一个entry point
system call在返回的时候设置CPL=3，再运行user code

这样就有well-defined notion of user vs kernel，不会出现再kernel mode运行user code，更不会有user mode里运行kernel code。

之后来说如何隔离进程内存，也就是address space。

address space的目的是让每个process可以有内存来访问自己的code, variables, heap, stack并不让其访问其他的内存。

那么如何建立isolated address spaces呢？

xv6用的是x86的memory management unit(MMU)里的 "paging hardware"，MMU会把所以地址进行翻译：

CPU -> MMU -> RAM
        |
     pagetable
VA -> PA

MMU会对所有memory reference进行翻译:user and kernel。

instructions and data

注意instruction用的永远都是virtual address(VA), 从来不用physical address (PA)

最后来说xv6的system call是如何实现的

xv6的process / stack diagram:

user process ; kernel thread
user stack ; kernel stack
two mechanisms: switch between user/kernel switch between kernel threads
trap frame
kernel function calls...
struct context

简单的xv6的user/kernel virtual address-space设置：

  FFFFFFFF:
            ...
  80000000: kernel
            user stack
            user data
  00000000: user instructions

kernel通过设置MMU来让user code只能访问到lower half。每个进程会有不同的address space，但是kernel的mapping都是一样的。

xv6 中system call的流程

下面来看看xv6的代码层面是如何完成的调用一个system call并返回的。

我们选择的system call是sh.asm里的write，注意这里note有误，应该是b * 0x0d32而不是0d42。而且运行x/3i的结果和sh.asm里面记录的内容也不一样。。。神奇...正常运行的代码应该是：

00000d32 <write>:
SYSCALL(write)
     d32:	b8 10 00 00 00       	mov    $0x10,%eax
     d37:	cd 40                	int    $0x40
     d39:	c3                   	ret

其中0x10是write的system call number。这里还有一个疑问，就是系统是怎么在启动之后自动调用shell的，没太明白。下面就是按照lecture中的要求进行调试：

(gdb) b * 0x0d32
Breakpoint 1 at 0xd32
(gdb) c
Continuing.
...
(gdb) info reg
eax            0x3f7a   16250
ecx            0x24     36
edx            0x0      0
ebx            0x24     36
esp            0x3f4c   0x3f4c
ebp            0x3f98   0x3f98
esi            0x11b9   4537
edi            0x0      0
eip            0xd32    0xd32
eflags         0x216    [ PF AF IF ]
cs             0x1b     27
ss             0x23     35
ds             0x23     35
es             0x23     35
fs             0x0      0
gs             0x0      0

可以看到这时的%cs是0x1b也就是CPL=3，处于user mode。

(gdb) x/4x $esp
0x3f4c: 0x00000ea5      0x00000002      0x00003f7a      0x00000001

esp的这4个值分别是ebf(return address？这里存疑...)，2是fd，0x3f7a是buffer的地址，1是count，对应的就是write(2, 0x3f7a, 1)

(gdb) x/c 0x3f7a
0x3f7a: 36 '$'

就是说buffer里面存的就是要写出的$。如果继续往下运行，运行两步之后

(gdb) info reg
eax            0x10     16
ecx            0x24     36
edx            0x0      0
ebx            0x24     36
esp            0x8dffefe8       0x8dffefe8
ebp            0x3f98   0x3f98
esi            0x11b9   4537
edi            0x0      0
eip            0x80105d49       0x80105d49 <vector64+2>
eflags         0x216    [ PF AF IF ]
cs             0x8      8
ss             0x10     16
ds             0x23     35
es             0x23     35
fs             0x0      0
gs             0x0      0

%cs变为8，也就是CPL=0，进入了kernel模式，且注意eip已经进入了kernel memory里面，esp已经在kernel stack中。

(gdb) x/6wx $esp
0x8dffefe8:     0x00000000      0x00000d39      0x0000001b      0x00000216
0x8dffeff8:     0x00003f4c      0x00000023

可以看到INT指令把一些之前的register放在堆栈保存起来了，保存在了kernel stack。保存的register包括err, eip, cs, eflags, esp, ss。之所以会进行保存，是因为INT会overwrite这些register。

总结来说INT做了如下的内容：

切换为kernel stack（调整esp）
保存用户的register于kernel stack
设置CPL=0
让eip指向kernel-supplied vector。

前文我们知道eip是给定的kernel-supplied vector，那么esp来源于哪里呢？

kernel会在创建进程的时候告诉硬件应该用哪个kernel stack。

为什么INT需要保存用户状态？应该保存多少状态？

transparency vs speed (这里transparency指在调用system call的时候不会对外部状态造成太多影响）

而保存剩余的register用的是trapasm.S里头的alltraps函数。

  # vectors.S sends all traps here.
.globl alltraps
alltraps:
  # Build trap frame.
  pushl %ds
  pushl %es
  pushl %fs
  pushl %gs
  pushal
  
  # Set up data segments.
  movw $(SEG_KDATA<<3), %ax
  movw %ax, %ds
  movw %ax, %es

  # Call trap(tf), where tf=%esp
  pushl %esp
  call trap
  addl $4, %esp

  # Return falls through to trapret...
.globl trapret
trapret:
  popal
  popl %gs
  popl %fs
  popl %es
  popl %ds
  addl $0x8, %esp  # trapno and errcode
  iret

这个函数就是现在eip对应的0x80105d49对应的位置。所以堆栈里面存有的是19words:

    ss
    esp
    eflags
    cs
    eip
    err    -- INT saved from here up
    trapno
    ds
    es
    fs
    gs
    eax..edi

这些都是之后会被恢复的。同时有的时候kernel的C code需要对这些尽心读写，就通过x86.h中的struct trapframe进行操作。

可以看到调用了trap函数(在trap.c文件中)，其中pushl %esp就是trap函数的参数tf。进入这个函数之后，有：

(gdb) print tf
$1 = (struct trapframe *) 0x8dffefb4
(gdb) print *tf
$2 = {edi = 0, esi = 4537, ebp = 16280, oesp = 2382360532, ebx = 36, edx = 0,
  ecx = 36, eax = 16, gs = 0, padding1 = 0, fs = 0, padding2 = 0, es = 35,
  padding3 = 0, ds = 35, padding4 = 0, trapno = 64, err = 0, eip = 3385,
  cs = 27, padding5 = 0, eflags = 534, esp = 16204, ss = 35, padding6 = 0}

trap函数的代码如下：

void
trap(struct trapframe *tf) {
  if(tf->trapno == T_SYSCALL){
    if(myproc()->killed)
      exit();
    myproc()->tf = tf;
    syscall();
    if(myproc()->killed)
      exit();
    return;
  }
  ...

进入trap函数的不只有system call，还有interrupt和fault，所以会先判断一下tf->trapno是不是T_SYSCALL。在本次运行的代码里，的确是（注意T_SYSCALL是60就是INT调用的0x40）。

判断完就会运行myproc()（函数在proc.c中，函数返回的struct proc在proc.h中）。myproc()->tf = tf;会被当成当前的system call的单数。注意这个myproc函数很重要，返回的永远都是当前的process，所以可以在其他函数中通过调用它达到共享参数的功能：

// Disable interrupts so that we are not rescheduled
// while reading proc from the cpu structure
struct proc*
myproc(void) {
  struct cpu *c;
  struct proc *p;
  pushcli();
  c = mycpu();
  p = c->proc;
  popcli();
  return p;
}

然后就调用sycall()，在syscall.c中

void
syscall(void) {
  int num;
  struct proc *curproc = myproc();

  num = curproc->tf->eax;
  if(num > 0 && num < NELEM(syscalls) && syscalls[num]) {
    curproc->tf->eax = syscalls[num]();
  } else {
    cprintf("%d %s: unknown sys call %d\n",
            curproc->pid, curproc->name, num);
    curproc->tf->eax = -1;
  }
}

用curproc->tf->eax得到当前的system call的序号，这里是0x10，对应的函数是sys_write，而sys_write在sysfile.c中，

int
sys_write(void) {
  struct file *f;
  int n;
  char *p;

  if(argfd(0, 0, &f) < 0 || argint(2, &n) < 0 || argptr(1, &p, n) < 0)
    return -1;
  return filewrite(f, p, n);
}

上面的arg*函数会从user stack中读取write(fd, buf, n)的参数，例如argint函数在syscall.c中：

// Fetch the nth 32-bit system call argument.
int
argint(int n, int *ip) {
  return fetchint((myproc()->tf->esp) + 4 + 4*n, ip);
}

就是从myproc()中读出整数。等写入完了，也就是sys_write中的filewrite返回了，就开始一层一层返回，最终要还原user register。我们来一步一步看这个。

syscall函数把filewrite的返回值设为了curproc->tf->eax，之后就一步一步没有操作得返回到了trapasm.S。之后一连串直接运行到iret。在运行iret之前，就已经把除了int保存的那5个register以外的都恢复了，iret会把eip, cs, eflags, esp, ss这5个还原，因为还原了cs，所以也把模式还原到了CPL=3，user mode。从而完成了

(gdb) info reg
eax            0x3f7a   16250
ecx            0x20     32
edx            0x0      0
ebx            0x20     32
esp            0x3f4c   0x3f4c
ebp            0x3f98   0x3f98
esi            0x11ba   4538
edi            0x0      0
eip            0xd32    0xd32
eflags         0x216    [ PF AF IF ]
cs             0x1b     27
ss             0x23     35
ds             0x23     35
es             0x23     35
fs             0x0      0
gs             0x0      0

还原之后的register（不是很知道这些许的不同是什么意思，不过不同的都不是iret还原的，应该就是调用system call所需要产生的变化吧）。

总结一下：

INT -> alltrap -> trap -> syscall -> syswrite -> IRET
user/kernel transition是很复杂的，如果出了个bug就gg了。
kernel必须假设process是有敌意的，不能相信user stack，而且需要在kernel里面检查argument（应该是指syswrite里的argument检查）。
下一讲会讲解page table是怎么限制user program可以读写的内存的。

zhuzilin's Blog

about

6.828 笔记4

Lecture 5: Isolation mechanisms

xv6 中system call的流程