zhuzilin's Blog

about

6.828 lab4 Preemptive Multitasking

date: 2019-03-31
tags: OS  6.828  

Part A: Multiprocessor Support and Cooperative Multitasking

第一部分要把JOS拓展到多处理器的系统,并实现一些system call来让user-level environments创建新的环境。然后还需要实现cooperative的round-robin,也就是让当前的用户环境可以主动的退出。之后的part C会实现pre-emptive的版本。

Multiprocessor Support

我们首先来实现symmetric multiprocessing (SMP)。这是一个多处理器的模型,所有的CPU都有同样的系统资源权限,如内存和I/O。因为在SMP中所有的处理器都一样,在boot的过程中,CPU被分为两类:

  • the bootstrap processor (BSP): 用于初始化系统并boot操作系统。
  • the application processors (APs): 在操作系统正常运行之后由BSP激活。

哪个CPU是BSP是由硬件和BIOS决定的。截止到现在,JOS都运行在BSP上。

在一个SMP系统里,每个CPU都有an accompanying local APIC (LAPIC) unit. 之前提到过,LAPIC是用于在系统中deliver interrupt。LAPIC同时还是其对应的CPU的unique identifier。这次的lab里面我们会用LAPIC unit的一些基本的功能(在kern/lapic.c)。

  • cpunum()函数里用LAPIC的identifier来获得当前的代码运行在哪个CPU上:

    #define ID      (0x0020/4)   // ID
    ...
    int
    cpunum(void)
    {
    	if (lapic)
    		return lapic[ID] >> 24;
    	return 0;
    }
  • lapic_startap()函数来让BSP给APs发送STARTUP interprocessor interrupt (IPI)来启动其他的CPU

    #define IO_RTC  0x70
    
    // Start additional processor running entry code at addr.
    // See Appendix B of MultiProcessor Specification.
    void
    lapic_startap(uint8_t apicid, uint32_t addr)
    {
    	int i;
    	uint16_t *wrv;
    
    	// "The BSP must initialize CMOS shutdown code to 0AH
    	// and the warm reset vector (DWORD based at 40:67) to point at
    	// the AP startup code prior to the [universal startup algorithm]."
    	outb(IO_RTC, 0xF);  // offset 0xF is shutdown code
    	outb(IO_RTC+1, 0x0A);
    	wrv = (uint16_t *)KADDR((0x40 << 4 | 0x67));  // Warm reset vector
    	wrv[0] = 0;
    	wrv[1] = addr >> 4;
    
    	// "Universal startup algorithm."
    	// Send INIT (level-triggered) interrupt to reset other CPU.
    	lapicw(ICRHI, apicid << 24);
    	lapicw(ICRLO, INIT | LEVEL | ASSERT);
    	microdelay(200);
    	lapicw(ICRLO, INIT | LEVEL);
    	microdelay(100);    // should be 10ms, but too slow in Bochs!
    
    	// Send startup IPI (twice!) to enter code.
    	// Regular hardware is supposed to only accept a STARTUP
    	// when it is in the halted state due to an INIT.  So the second
    	// should be ignored, but it is part of the official Intel algorithm.
    	// Bochs complains about the second one.  Too bad for Bochs.
    	for (i = 0; i < 2; i++) {
    		lapicw(ICRHI, apicid << 24);
    		lapicw(ICRLO, STARTUP | (addr >> 12));
    		microdelay(200);
    	}
    }
  • 在PART C中,我们会给LAPIC的内置timer来trigger clock interrupt以实现preemptive multitasking,在apic_init()中。

一个处理器通过memory-mapping I/O (MMIO,之前note2中提到过)来访问其LAPIC。在MMIO中,一部跟物理内存是hardwired到一些I/O设备的寄存器上,所以用来访问内存的指令也可以用来访问device registers。我们之前提到过在物理地址0xa0000上有一个IO hole(见之前note2的物理地址layout),这个是用来写入VGA显示buffer的。LAPIC类似,其PA在0xfe000000开始的32MB。因为我们的虚拟内存只映射到KERNBASE,也就是0xf0000000,所以我们我们不能直接访问。在JOS的虚拟内存中,有从MMIOBASE开始的4MB的gap,以映射LAPIC这样的设备。因为之后的之后需要使用更多的MMIO,我们需要写一个函数来从这个区域中分配内存以映射device memory。

Exercise 1

实现kern/lapic.c中的mmio_map_region。需要做完下一个exercise才能测试。

//
// Reserve size bytes in the MMIO region and map [pa,pa+size) at this
// location.  Return the base of the reserved region.  size does *not*
// have to be multiple of PGSIZE.
//
void *
mmio_map_region(physaddr_t pa, size_t size)
{
	// Where to start the next region.  Initially, this is the
	// beginning of the MMIO region.  Because this is static, its
	// value will be preserved between calls to mmio_map_region
	// (just like nextfree in boot_alloc).
	static uintptr_t base = MMIOBASE;

	// Reserve size bytes of virtual memory starting at base and
	// map physical pages [pa,pa+size) to virtual addresses
	// [base,base+size).  Since this is device memory and not
	// regular DRAM, you'll have to tell the CPU that it isn't
	// safe to cache access to this memory.  Luckily, the page
	// tables provide bits for this purpose; simply create the
	// mapping with PTE_PCD|PTE_PWT (cache-disable and
	// write-through) in addition to PTE_W.  (If you're interested
	// in more details on this, see section 10.5 of IA32 volume
	// 3A.)
	//
	// Be sure to round size up to a multiple of PGSIZE and to
	// handle if this reservation would overflow MMIOLIM (it's
	// okay to simply panic if this happens).
	//
	// Hint: The staff solution uses boot_map_region.
	//
	// Your code here:
	size = ROUNDUP(size, PGSIZE);
	if ((base + size) > MMIOLIM || (base + size < base))
		panic("overflow MMIOLIM");
	boot_map_region(kern_pgdir, base, size, pa, PTE_W | PTE_PCD | PTE_PWT);
	base += size;
	return (void *)(base - size);
}

按照注释要求写就行了,唯一需要注意的就是base + size可能会overflow,所以需要检查base + size >= base

Application Processor Bootstrap

在启动APs之前,BSP应该首先收集多处理器系统的相关信息,比如CPU总数,他们的APIC ID以及他们的LAPIC在MMIO里的地址。kern/mpconfig.c中的mp_init()就通过读MP configuration table中的信息来得到这些信息,这个表在BIOS' s region of memory。

kern/init.c中的boot_aps()进行AP bootstrap process。

// Start the non-boot (AP) processors.
static void
boot_aps(void)
{
	extern unsigned char mpentry_start[], mpentry_end[];
	void *code;
	struct CpuInfo *c;

	// Write entry code to unused memory at MPENTRY_PADDR
	code = KADDR(MPENTRY_PADDR);
	memmove(code, mpentry_start, mpentry_end - mpentry_start);

	// Boot each AP one at a time
	for (c = cpus; c < cpus + ncpu; c++) {
		if (c == cpus + cpunum())  // We've started already.
			continue;

		// Tell mpentry.S what stack to use 
		mpentry_kstack = percpu_kstacks[c - cpus] + KSTKSIZE;
		// Start the CPU at mpentry_start
		lapic_startap(c->cpu_id, PADDR(code));
		// Wait for the CPU to finish some basic setup in mp_main()
		while(c->cpu_status != CPU_STARTED)
			;
	}
}

就像是boot/boot.S中的,所以boot_aps()把AP entry code(在kern/mpentry.S中)移到a memory location that is addressable in the real mode。不像bootloader,我们对AP在哪里会运行AP的代码,我们把entry code复制到 0x7000 (MPENTRY_PADDR),但实际上任何640KB以下的地址都行。

在完成上述的这个memmove之后,boot_aps()一个一个通过发STARTUP IPIs到LAPIC unit来激活AP(也就是运行lapic_startap)。启动之后,会返回一个初始的CS:IP,AP就会从这个地址开始运行entry code,这个地址就是MPENTRY_PADDR

这个entry code就是kern/mpentry.S。它之中的代码和boot/boot.S很像,先做一些设置之后,就会让AP转到protected mode,然后运行kern/init.c中的mp_main()boot_aps()会等待c->cpu_status != CPU_STARTED,这样这一个AP就被启动了,然后在boot_aps的循环里面就可以进行下一个AP了。

// Setup code for APs
void
mp_main(void)
{
	// We are in high EIP now, safe to switch to kern_pgdir 
	lcr3(PADDR(kern_pgdir));
	cprintf("SMP: CPU %d starting\n", cpunum());

	lapic_init();
	env_init_percpu();
	trap_init_percpu();
	xchg(&thiscpu->cpu_status, CPU_STARTED); // tell boot_aps() we're up

	// Now that we have finished some basic setup, call sched_yield()
	// to start running processes on this CPU.  But make sure that
	// only one CPU can enter the scheduler at a time!
	//
	// Your code here:

	// Remove this after you finish Exercise 6
	for (;;);
}

Exercise 2

修改page_init()以腾空MPENTRY_PADDR。是的我们可以安全的把AP bootstrap code复制到对应的位置。

//
// Initialize page structure and memory free list.
// After this is done, NEVER use boot_alloc again.  ONLY use the page
// allocator functions below to allocate and deallocate physical
// memory via the page_free_list.
//
void
page_init(void)
{
	// LAB 4:
	// Change your code to mark the physical page at MPENTRY_PADDR
	// as in use
	struct PageInfo *mpentry_page = pa2page(MPENTRY_PADDR);
	// The example code here marks all physical pages as free.
	// However this is not truly the case.  What memory is free?
	//  1) Mark physical page 0 as in use.
	//     This way we preserve the real-mode IDT and BIOS structures
	//     in case we ever need them.  (Currently we don't, but...)
	//  2) The rest of base memory, [PGSIZE, npages_basemem * PGSIZE)
	//     is free.
	//  3) Then comes the IO hole [IOPHYSMEM, EXTPHYSMEM), which must
	//     never be allocated.
	//  4) Then extended memory [EXTPHYSMEM, ...).
	//     Some of it is in use, some is free. Where is the kernel
	//     in physical memory?  Which pages are already in use for
	//     page tables and other data structures?
	//
	// Change the code to reflect this.
	// NB: DO NOT actually touch the physical memory corresponding to
	// free pages!
	size_t i;
	for (i = 1; i < npages_basemem; i++) {
		if (pages + i == mpentry_page)
			continue;
		pages[i].pp_ref = 0;
		pages[i].pp_link = page_free_list;
		page_free_list = &pages[i];
	}
  for(i = PADDR(boot_alloc(0))/PGSIZE; i < npages; i++) {
    pages[i].pp_ref = 0;
		pages[i].pp_link = page_free_list;
		page_free_list = &pages[i];
  }
}

加上这个之后,测试可以有:

$ make qemu-nox
...
check_page_free_list() succeeded!
check_page_alloc() succeeded!
check_page() succeeded!
kernel panic on CPU 0 at kern/pmap.c:813: assertion failed: check_va2pa(pgdir, KERNBASE + i) == i
Welcome to the JOS kernel monitor!
Type 'help' for a list of commands.
K>
  • 比较kern/mpentry.Sboot/boot.S. 牢记kern/mpentry.S和kernel中的其他代码一样,是被编译与链接在KERNBASE之上的。mpentry.SMPBOOTPHYS这个宏来?为什么boot.S不需要?换句话说,如果我们不用这个宏,会出什么问题呢?

    mpentry.S的注释里面有写这两者的区别,那就是:

    This code is similar to boot/boot.S except that

    • it does not need to enable A20
    • it uses MPBOOTPHYS to calculate absolute addresses of its

      symbols, rather than relying on the linker to fill them

    这种转换是因为bootloader的LMA和VMA都在0x7c00,并没有进行什么映射,所以运行boot.S时虚拟地址就是物理地址,不需要转换。但是kernel中的则不然,已经进行了转换了,但是在加载GDT的时候需要物理地址。具体来看,MPBOOTPHYS的代码如下:

    #define MPBOOTPHYS(s) ((s) - mpentry_start + MPENTRY_PADDR)

    mpentry-start就是这个函数的地址,而MPENTRY_PADDR是APs的startup code的物理地址。相当于是把gdt的位置相对于mpentry_start这个函数的地址的对应到MPENTRY_PADDR + gdt - mpentry_start,就如同boot.Sgdt对应到在start + gdt - start是一样的(startboot.S的那个函数的地址)。

Per-CPU State and Initialization

当写一个多处理器的OS的时候,区分per-CPU state that is private to each processor与global state that the whole system shares是非常重要的。kern/cpu.h中定义了大多数per-CPU state,包括了CpuInfo,其存储了per-CPU变量。cpunum()总会返回CPU的ID,这个ID可以用来作为cpus这样的数组的index。thiscpu表示当前的CPU对应的struct CpuInfo

一些值得注意的per-CPU state:

  • per-CPU kernel stack

    因为多个CPU可能会同时trap进kernel,所以我们需要给每个CPU一个单独的kernel stack。percpu_kstacks[NCPU][KSTKSIZE]这个数组就是干这个的。

    在lab2中,我们把bootstack对应的BSP的kernel stack映射到了KSTACKTOP下面:

    	//////////////////////////////////////////////////////////////////////
    	// Use the physical memory that 'bootstack' refers to as the kernel
    	// stack.  The kernel stack grows down from virtual address KSTACKTOP.
    	// We consider the entire range from [KSTACKTOP-PTSIZE, KSTACKTOP)
    	// to be the kernel stack, but break this into two pieces:
    	//     * [KSTACKTOP-KSTKSIZE, KSTACKTOP) -- backed by physical memory
    	//     * [KSTACKTOP-PTSIZE, KSTACKTOP-KSTKSIZE) -- not backed; so if
    	//       the kernel overflows its stack, it will fault rather than
    	//       overwrite memory.  Known as a "guard page".
    	//     Permissions: kernel RW, user NONE
    	// Your code goes here:
    	boot_map_region(kern_pgdir, KSTACKTOP-KSTKSIZE, KSTKSIZE, PADDR(bootstack), 
                      PTE_W | PTE_P);

    在lab4中,我们会把所有的kernel stack都映射到这个区域,并包含一个guard page。CPU 0的stack会从KSTACKTOP开始,CPU 1会在CPU 0的stack再向下KSTGAP。在更新了个inc/memorylayout.h中有详细的体现:

    /*
    * Virtual memory map:                                Permissions
    *                                                    kernel/user
    *
    *    4 Gig -------->  +------------------------------+
    *                     |                              | RW/--
    *                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    *                     :              .               :
    *                     :              .               :
    *                     :              .               :
    *                     |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~| RW/--
    *                     |                              | RW/--
    *                     |   Remapped Physical Memory   | RW/--
    *                     |                              | RW/--
    *    KERNBASE, ---->  +------------------------------+ 0xf0000000      --+
    *    KSTACKTOP        |     CPU0's Kernel Stack      | RW/--  KSTKSIZE   |
    *                     | - - - - - - - - - - - - - - -|                   |
    *                     |      Invalid Memory (*)      | --/--  KSTKGAP    |
    *                     +------------------------------+                   |
    *                     |     CPU1's Kernel Stack      | RW/--  KSTKSIZE   |
    *                     | - - - - - - - - - - - - - - -|                 PTSIZE
    *                     |      Invalid Memory (*)      | --/--  KSTKGAP    |
    *                     +------------------------------+                   |
    *                     :              .               :                   |
    *                     :              .               :                   |
    *    MMIOLIM ------>  +------------------------------+ 0xefc00000      --+
    *                     |       Memory-mapped I/O      | RW/--  PTSIZE
    * ULIM, MMIOBASE -->  +------------------------------+ 0xef800000
    *                     |  Cur. Page Table (User R-)   | R-/R-  PTSIZE
    *    UVPT      ---->  +------------------------------+ 0xef400000
    *                     |          RO PAGES            | R-/R-  PTSIZE
    *    UPAGES    ---->  +------------------------------+ 0xef000000
    *                     |           RO ENVS            | R-/R-  PTSIZE
    * UTOP,UENVS ------>  +------------------------------+ 0xeec00000
    * UXSTACKTOP -/       |     User Exception Stack     | RW/RW  PGSIZE
    *                     +------------------------------+ 0xeebff000
    *                     |       Empty Memory (*)       | --/--  PGSIZE
    *    USTACKTOP  --->  +------------------------------+ 0xeebfe000
    *                     |      Normal User Stack       | RW/RW  PGSIZE
    *                     +------------------------------+ 0xeebfd000
    *                     |                              |
    *                     |                              |
    *                     ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    *                     .                              .
    *                     .                              .
    *                     .                              .
    *                     |~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|
    *                     |     Program Data & Heap      |
    *    UTEXT -------->  +------------------------------+ 0x00800000
    *    PFTEMP ------->  |       Empty Memory (*)       |        PTSIZE
    *                     |                              |
    *    UTEMP -------->  +------------------------------+ 0x00400000      --+
    *                     |       Empty Memory (*)       |                   |
    *                     | - - - - - - - - - - - - - - -|                   |
    *                     |  User STAB Data (optional)   |                 PTSIZE
    *    USTABDATA ---->  +------------------------------+ 0x00200000        |
    *                     |       Empty Memory (*)       |                   |
    *    0 ------------>  +------------------------------+                 --+
    *
    * (*) Note: The kernel ensures that "Invalid Memory" is *never* mapped.
    *     "Empty Memory" is normally unmapped, but user programs may map pages
    *     there if desired.  JOS user programs map pages temporarily at UTEMP.
    */
  • per-CPU TSS and TSS descriptor

    首先先说一下什么是TSS:

    The task state segment (TSS) is a special structure on x86-based computers which holds information about a task. It is used by the operating system kernel for task management. Specifically, the following information is stored in the TSS:

    • Processor register state
    • I/O port permissions
    • Inner-level stack pointers
    • Previous TSS link

    The TSS may reside anywhere in memory. A special segment register called the task register (TR) holds a segment selector that points to a valid TSS segment descriptor which resides in the GDT (a TSS descriptor may not reside in the LDT). Therefore, to use a TSS the following must be done by the operating system kernel:

    1. Create a TSS descriptor entry in the GDT
    2. Load the TR with the segment selector for that segment
    3. Add information to the TSS in memory as needed

    For security purposes, the TSS should be placed in memory that is accessible only to the kernel.

    quote from https://en.wikipedia.org/wiki/Task_state_segment

    所以说,TSS就是一个用来保存一个task (在scheduling中指一个thread或者一个process)的相关信息的。而TSS descriptor会被在GDT中分配,因为GDT就是用来管理每个segment是做什么的用的。

    因为需要明确每个CPU的stack在哪里,所以需要per-CPU TSS。CPU i的TSS被存在cpus[i].cpu_ts中,对应的TSS descriptor在gdt[(GDT_TSS0 >> 3) + i]。在kern/trap.c中使用个全局变量struct Taskstate ts 将变得没什么用。

  • per-CPU current environment pointer

    因为CPU们可以同时跑不同的进程,所以不能只有一个curenv了,我们把它改为cpus[cpunum()].cpu_env(或者thiscpu->cpu_env)。这指向当前CPU正在执行的环境。

  • per-CPU system register

    所有的寄存器,包括系统寄存器都是private to CPU。所以对于寄存器的初始化函数,如lcr3(), ltr(), lgdt(), lidt()都需要在每个CPU上运行一下。

除此之外,乳沟还有加入任何per-CPU state,或者和CPU相关的初始化,注意都要转化为在CPU上运行。

Exercise 3

修改kern.pmap.c中的mem_init_mp()来把多个CPU的stack按照上面所说的映射上。

写法就是按照注释的要求,注意stacktop是最上面:

// Modify mappings in kern_pgdir to support SMP
//   - Map the per-CPU stacks in the region [KSTACKTOP-PTSIZE, KSTACKTOP)
//
static void
mem_init_mp(void)
{
	// Map per-CPU stacks starting at KSTACKTOP, for up to 'NCPU' CPUs.
	//
	// For CPU i, use the physical memory that 'percpu_kstacks[i]' refers
	// to as its kernel stack. CPU i's kernel stack grows down from virtual
	// address kstacktop_i = KSTACKTOP - i * (KSTKSIZE + KSTKGAP), and is
	// divided into two pieces, just like the single stack you set up in
	// mem_init:
	//     * [kstacktop_i - KSTKSIZE, kstacktop_i)
	//          -- backed by physical memory
	//     * [kstacktop_i - (KSTKSIZE + KSTKGAP), kstacktop_i - KSTKSIZE)
	//          -- not backed; so if the kernel overflows its stack,
	//             it will fault rather than overwrite another CPU's stack.
	//             Known as a "guard page".
	//     Permissions: kernel RW, user NONE
	//
	// LAB 4: Your code here:
	for (int i=0; i<NCPU; i++) {
		uintptr_t kstacktop_i = KSTACKTOP - i * (KSTKSIZE + KSTKGAP);
		boot_map_region(kern_pgdir, kstacktop_i - KSTKSIZE, KSTKSIZE, 
                        PADDR(percpu_kstacks[i]), PTE_W | PTE_P);
	}
}

运行之后的结果为:

$ make qemu-nox
...
check_page_free_list() succeeded!
check_page_alloc() succeeded!
check_page() succeeded!
check_kern_pgdir() succeeded!
check_page_free_list() succeeded!
check_page_installed_pgdir() succeeded!
SMP: CPU 0 found 1 CPU(s)
enabled interrupts: 1 2
[00000000] new env 00001000
kernel panic on CPU 0 at kern/trap.c:220: page fault in kernel
Welcome to the JOS kernel monitor!
Type 'help' for a list of commands.
K>

Exercise 4

修改kern/trap.c中的trap_init_percpu(),以初始化BSP的TSS和TSS descriptor。注意,改完之后的代码应该不包含ts这个全局变量了。

// Initialize and load the per-CPU TSS and IDT
void
trap_init_percpu(void)
{
	// The example code here sets up the Task State Segment (TSS) and
	// the TSS descriptor for CPU 0. But it is incorrect if we are
	// running on other CPUs because each CPU has its own kernel stack.
	// Fix the code so that it works for all CPUs.
	//
	// Hints:
	//   - The macro "thiscpu" always refers to the current CPU's
	//     struct CpuInfo;
	//   - The ID of the current CPU is given by cpunum() or
	//     thiscpu->cpu_id;
	//   - Use "thiscpu->cpu_ts" as the TSS for the current CPU,
	//     rather than the global "ts" variable;
	//   - Use gdt[(GD_TSS0 >> 3) + i] for CPU i's TSS descriptor;
	//   - You mapped the per-CPU kernel stacks in mem_init_mp()
	//   - Initialize cpu_ts.ts_iomb to prevent unauthorized environments
	//     from doing IO (0 is not the correct value!)
	//
	// ltr sets a 'busy' flag in the TSS selector, so if you
	// accidentally load the same TSS on more than one CPU, you'll
	// get a triple fault.  If you set up an individual CPU's TSS
	// wrong, you may not get a fault until you try to return from
	// user space on that CPU.
	//
	// LAB 4: Your code here:

	// Setup a TSS so that we get the right stack
	// when we trap to the kernel.
	thiscpu->cpu_ts.ts_esp0 = KSTACKTOP - cpunum() * (KSTKSIZE + KSTKGAP);
	thiscpu->cpu_ts.ts_ss0 = GD_KD;
	thiscpu->cpu_ts.ts_iomb = sizeof(struct Taskstate);

	// Initialize the TSS slot of the gdt.
	gdt[(GD_TSS0 >> 3) + cpunum()] = SEG16(STS_T32A, (uint32_t) (&thiscpu->cpu_ts),
					sizeof(struct Taskstate) - 1, 0);
	gdt[(GD_TSS0 >> 3) + cpunum()].sd_s = 0;

	// Load the TSS selector (like other segment selectors, the
	// bottom three bits are special; we leave them 0)
	ltr(GD_TSS0 + (cpunum() << 3));

	// Load the IDT
	lidt(&idt_pd);
}

测试结果如下:

$ make qemu-nox CPUS=4
...
Physical memory: 131072K available, base = 640K, extended = 130432K
check_page_free_list() succeeded!
check_page_alloc() succeeded!
check_page() succeeded!
check_kern_pgdir() succeeded!
check_page_free_list() succeeded!
check_page_installed_pgdir() succeeded!
SMP: CPU 0 found 4 CPU(s)
enabled interrupts: 1 2
SMP: CPU 1 starting
SMP: CPU 2 starting
SMP: CPU 3 starting
[00000000] new env 00001000
...

Locking

现在当在mp_main()初始化AP之后进入spin:

// Setup code for APs
void
mp_main(void)
{
	// We are in high EIP now, safe to switch to kern_pgdir 
	lcr3(PADDR(kern_pgdir));
	cprintf("SMP: CPU %d starting\n", cpunum());

	lapic_init();
	env_init_percpu();
	trap_init_percpu();
	xchg(&thiscpu->cpu_status, CPU_STARTED); // tell boot_aps() we're up

	// Now that we have finished some basic setup, call sched_yield()
	// to start running processes on this CPU.  But make sure that
	// only one CPU can enter the scheduler at a time!
	//
	// Your code here:

	// Remove this after you finish Exercise 6
	for (;;);
}

再让AP进行更复杂的操作之前,我们需要解决这里面可能有的一些race。(这里不太明白,明明boot_aps是一个CPU一个CPU处理的,为啥会有race...)最简单的方法就是用一个big kernel lock。big kernel lock是一个全局锁,当一个环境进入kernel mode的时候就会获取这个锁,当环境返回user mode的时候就会释放。这样让use mode的环境们可以并行运行,但是如果要进入kernel就只能有一个了。

kern/spinlock.h中声明了big kernel lock,名为kernel_lock,其接口为lock_kernel()unlock_kernel()。我们需要在以下的4个地方加入这个锁相关的代码:

  • i386_init(), BSP唤醒其他CPUs之前acquire。
  • mp_main(), 初始化AP后acquire, 之后调用sched_yield()来在这个AP上运行环境。
  • trap(), 当trap是来自user mode的时候acquire。
  • env_run(), release the lock right before switching to user mode. 或早或晚都会导致races or deadlocks.

Exercise 5

根据上面的4条修改代码:

// i386_init()
	// Acquire the big kernel lock before waking up APs
	// Your code here:
	lock_kernel();

	// Starting non-boot CPUs
	boot_aps();

// mp_main()
	// Now that we have finished some basic setup, call sched_yield()
	// to start running processes on this CPU.  But make sure that
	// only one CPU can enter the scheduler at a time!
	//
	// Your code here:
	lock_kernel();
	sched_yield();
	// Remove this after you finish Exercise 6
	for (;;);

// trap()
	if ((tf->tf_cs & 3) == 3) {
		// Trapped from user mode.
		// Acquire the big kernel lock before doing any
		// serious kernel work.
		// LAB 4: Your code here.
        lock_kernel();
		assert(curenv);
		...

// env_run()
void
env_run(struct Env *e)
{
	// Step 1: If this is a context switch (a new environment is running):
	//	   1. Set the current environment (if any) back to
	//	      ENV_RUNNABLE if it is ENV_RUNNING (think about
	//	      what other states it can be in),
	//	   2. Set 'curenv' to the new environment,
	//	   3. Set its status to ENV_RUNNING,
	//	   4. Update its 'env_runs' counter,
	//	   5. Use lcr3() to switch to its address space.
	// Step 2: Use env_pop_tf() to restore the environment's
	//	   registers and drop into user mode in the
	//	   environment.

	// Hint: This function loads the new environment's state from
	//	e->env_tf.  Go back through the code you wrote above
	//	and make sure you have set the relevant parts of
	//	e->env_tf to sensible values.

	// LAB 3: Your code here.
	if(curenv && curenv->env_status == ENV_RUNNING)
		curenv->env_status = ENV_RUNNABLE;
	curenv = e;
	curenv->env_status = ENV_RUNNING;
	curenv->env_runs++;
	lcr3(PADDR(curenv->env_pgdir));
	unlock_kernel();
	env_pop_tf(&(curenv->env_tf));
	// panic("env_run not yet implemented");
}

最后的这个env_run是通过上面的注释加在了step 1和step 2之间了。

需要做完下一个exercise 才能测试这一个。

  • 如果big kernel lock保证了只能有一个环境运行在kernel mode,为什么我们还需要不同的kernel stack呢?

    回忆进入trap函数之前,在trapentry.S_alltraps中是需要存入当前的esp等状态信息的。如果先后有两个不同CPU的触发从user mode到kernel mode的中断,那么会先后推入esp等信息,如CPU 0先推入,然后CPU 1再推入,但是如果CPU 0先返回了,就会恢复成CPU 0的状态了,就出问题了。简而言之,在进入trap()之前仍然是有关于栈的操作的,所以只用在trap里的锁是午发避免race的。

Round-Robin Scheduling

下一个任务就是实现Round-Robin。其在ROS中的工作方式如下:

  • kern/sched.c中的sched_yield()函数会循环搜索envs[],找到第一个状态为ENV_RUNNABLE的环境就调用env_run()来进入这个环境。
  • sched_yield()需要保证一个环境同时只能运行在一个CPU上。已经被运行了的环境的状态为ENV_RUNNING
  • 我们实现了一个新的system call,名为sys_yield(),其让用户环境可以主动的调用sched_yield()来让出CPU。

Exercise 6

实现sched_yield(),并把sys_yield()加入syscall()。修改kern/init.c来创建3个环境,让他们都运行user/yield.c

首先是sched_yield()

// Choose a user environment to run and run it.
void
sched_yield(void)
{
	struct Env *idle;

	// Implement simple round-robin scheduling.
	//
	// Search through 'envs' for an ENV_RUNNABLE environment in
	// circular fashion starting just after the env this CPU was
	// last running.  Switch to the first such environment found.
	//
	// If no envs are runnable, but the environment previously
	// running on this CPU is still ENV_RUNNING, it's okay to
	// choose that environment.
	//
	// Never choose an environment that's currently running on
	// another CPU (env_status == ENV_RUNNING). If there are
	// no runnable environments, simply drop through to the code
	// below to halt the cpu.

	// LAB 4: Your code here.
	idle = curenv == NULL ? envs : (curenv + 1);
	int i;
	for (i = 0; i != NENV; i++) {
		struct Env *e = envs + (idle - envs + i) % NENV;
		if (e->env_status != ENV_RUNNABLE)
			continue;
		env_run(e);
		break;
	}
	if(i == NENV && curenv && curenv->env_status == ENV_RUNNING) {
		env_run(curenv);
	}
	// sched_halt never returns
	sched_halt();
}

然后在syscall()里面加上sys_yield。注意SYS_yield已经在inc/syscall.h里面被加上了。

		case SYS_yield:
			sys_yield();
			return 0;

最后在init()中加入3个yield:

#if defined(TEST)
	// Don't touch -- used by grading script!
	ENV_CREATE(TEST, ENV_TYPE_USER);
#else
	// Touch all you want.
	ENV_CREATE(user_yield, ENV_TYPE_USER);
	ENV_CREATE(user_yield, ENV_TYPE_USER);
	ENV_CREATE(user_yield, ENV_TYPE_USER);
	//ENV_CREATE(user_primes, ENV_TYPE_USER);
#endif // TEST*
	// Schedule and run the first user environment!
	sched_yield();
}

运行可以得到:

$ make qemu-nox
...
[00000000] new env 00001000
[00000000] new env 00001001
[00000000] new env 00001002
Hello, I am environment 00001000.
Hello, I am environment 00001001.
Hello, I am environment 00001002.
Back in environment 00001000, iteration 0.
Back in environment 00001001, iteration 0.
Back in environment 00001002, iteration 0.
Back in environment 00001000, iteration 1.
Back in environment 00001001, iteration 1.
Back in environment 00001002, iteration 1.
Back in environment 00001000, iteration 2.
Back in environment 00001001, iteration 2.
Back in environment 00001002, iteration 2.
Back in environment 00001000, iteration 3.
Back in environment 00001001, iteration 3.
Back in environment 00001002, iteration 3.
Back in environment 00001000, iteration 4.
All done in environment 00001000.
[00001000] exiting gracefully
[00001000] free env 00001000
Back in environment 00001001, iteration 4.
All done in environment 00001001.
[00001001] exiting gracefully
[00001001] free env 00001001
Back in environment 00001002, iteration 4.
All done in environment 00001002.
[00001002] exiting gracefully
[00001002] free env 00001002
No runnable environments in the system!
...

如果设置CPUS=2会有:

$ make qemu-nox CPUS=2
***
*** Use Ctrl-a x to exit qemu
***
...
[00000000] new env 00001000
[00000000] new env 00001001
[00000000] new env 00001002
Hello, I am environment 00001000.
Hello, I am environment 00001001.
Back in environment 00001000, iteration 0.
Hello, I am environment 00001002.
Back in environment 00001001, iteration 0.
Back in environment 00001000, iteration 1.
Back in environment 00001002, iteration 0.
Back in environment 00001001, iteration 1.
Back in environment 00001000, iteration 2.
Back in environment 00001002, iteration 1.
Back in environment 00001001, iteration 2.
Back in environment 00001000, iteration 3.
Back in environment 00001001, iteration 3.
Back in environment 00001002, iteration 2.
Back in environment 00001000, iteration 4.
Back in environment 00001001, iteration 4.
All done in environment 00001000.
All done in environment 00001001.
[00001000] exiting gracefully
[00001000] free env 00001000
[00001001] exiting gracefully
[00001001] free env 00001001
Back in environment 00001002, iteration 3.
Back in environment 00001002, iteration 4.
All done in environment 00001002.
[00001002] exiting gracefully
[00001002] free env 00001002
No runnable environments in the system!
...
  • 在现在的env_run()中,为什么在进行context switch前后都可以dereference e

    因为e在kernel部分,所有的环境的里对应的地址是一样的。

  • 再切换的时候是在哪里保存的旧的寄存器状态的?为什么要这么做?

    如果不这么做切换回来就不能继续运行了...保存是在_alltraps中做的,恢复是在env_runenv_pop_tf()做的。

System Calls for Environment Creation

尽管现在可以子啊用户环境之间相互切换了,创建环境仍然需要在kernel初始化的时候进行。现在开始需要实现system calls来让用户环境可以创建新的用户环境。

Unix用fork()作为process creation primitive。其会复制整个address space。父进程和子进程的唯一区别在于其process ID与parent process ID。在父进程中,fork()返回子进程ID,子进程中返回0。在默认设置下,这两个进程之后对内存的修改互不影响。

我们将创建一个不同的,更primitive的一组system call。用这些system call我们将可以完全在用户环境下实现fork。新的system call有:

  • sys_exofork

    创建一个新的几乎完全是空白的环境。这个环境的address space在用户部分什么都没有映射,且环境不能运行。但是其寄存器状态会和父进程相同。在父进程中,sys_exofork会返回子进程的envid_t,子进程中返回0(因为子进程不能运行,所以直到parent把child标记为runnable之后才会返回这个0)。

  • sys_env_set_status

    把某一个环境的状态设置为ENV_RUNNABLEENV_NOT_RUNNABLE。基本上是用来表示一个新环境可以运行了。

  • sys_page_alloc

    把某一page的物理内存映射到制定address space的虚拟地址。

  • sys_page_map

    把一个page(注意不是page的内容)从一个地址空间映射到另一个。保存memory sharing arrangement inplace使得两者会指向同样的物理内存。

  • sys_page_unmap

    在某一给定地址空间中,unmap a page mapped at a given VA。

对于以上这些输入中包含environment ID的system call,JOS支持0表示当前环境。这一支持在envid2env()中被实现了。

user/dumbfork.c中已经实现了一个比较蠢的fork,测试程序会使用上述system call来创建一个子进程。然后父进程和子进程会相互调用sys_yield,10个循环之后parent exit,20个之后child退出。

Exercise 7

kern/syscall.c中实现上述system call。

首先是sys_exofork

// Allocate a new environment.
// Returns envid of new environment, or < 0 on error.  Errors are:
//	-E_NO_FREE_ENV if no free environment is available.
//	-E_NO_MEM on memory exhaustion.
static envid_t
sys_exofork(void)
{
	// Create the new environment with env_alloc(), from kern/env.c.
	// It should be left as env_alloc created it, except that
	// status is set to ENV_NOT_RUNNABLE, and the register set is copied
	// from the current environment -- but tweaked so sys_exofork
	// will appear to return 0.

	// LAB 4: Your code here.
	int r;
	struct Env *e = NULL;
	
	if((r = env_alloc(&e, curenv->env_id)) < 0)
		return r;
	e->env_status = ENV_NOT_RUNNABLE;
	e->env_tf = curenv->env_tf;
	e->env_tf.tf_regs.reg_eax = 0;
	return e->env_id;
	//panic("sys_exofork not implemented");
}

然后是sys_env_set_status:

// Set envid's env_status to status, which must be ENV_RUNNABLE
// or ENV_NOT_RUNNABLE.
//
// Returns 0 on success, < 0 on error.  Errors are:
//	-E_BAD_ENV if environment envid doesn't currently exist,
//		or the caller doesn't have permission to change envid.
//	-E_INVAL if status is not a valid status for an environment.
static int
sys_env_set_status(envid_t envid, int status)
{
	// Hint: Use the 'envid2env' function from kern/env.c to translate an
	// envid to a struct Env.
	// You should set envid2env's third argument to 1, which will
	// check whether the current environment has permission to set
	// envid's status.

	// LAB 4: Your code here.
	int r;
	struct Env *e = NULL;
	
	if(status != ENV_NOT_RUNNABLE && status != ENV_RUNNABLE)
		return -E_INVAL;
	if ((r = envid2env(envid, &e, 1)) < 0)
		return r;
	e->env_status = status;
	return 0;
	//panic("sys_env_set_status not implemented");
}

然后是sys_page_alloc:

// Allocate a page of memory and map it at 'va' with permission
// 'perm' in the address space of 'envid'.
// The page's contents are set to 0.
// If a page is already mapped at 'va', that page is unmapped as a
// side effect.
//
// perm -- PTE_U | PTE_P must be set, PTE_AVAIL | PTE_W may or may not be set,
//         but no other bits may be set.  See PTE_SYSCALL in inc/mmu.h.
//
// Return 0 on success, < 0 on error.  Errors are:
//	-E_BAD_ENV if environment envid doesn't currently exist,
//		or the caller doesn't have permission to change envid.
//	-E_INVAL if va >= UTOP, or va is not page-aligned.
//	-E_INVAL if perm is inappropriate (see above).
//	-E_NO_MEM if there's no memory to allocate the new page,
//		or to allocate any necessary page tables.
static int
sys_page_alloc(envid_t envid, void *va, int perm)
{
	// Hint: This function is a wrapper around page_alloc() and
	//   page_insert() from kern/pmap.c.
	//   Most of the new code you write should be to check the
	//   parameters for correctness.
	//   If page_insert() fails, remember to free the page you
	//   allocated!

	// LAB 4: Your code here.
	int r;
	struct Env *e = NULL;
	struct PageInfo *pp = NULL;
	
	if ((uint32_t)va >= UTOP || ROUNDDOWN((uint32_t)va, PGSIZE) != (uint32_t)va)
		return -E_INVAL;
	if((perm & (PTE_U | PTE_P)) != (PTE_U | PTE_P) || (perm & ~PTE_SYSCALL) != 0)
		return -E_INVAL;
	if ((r = envid2env(envid, &e, 1)) < 0)
		return r;
	if ((pp = page_alloc(ALLOC_ZERO))== NULL)
		return -E_NO_MEM;
	if((r = page_insert(e->env_pgdir, pp, va, perm)) < 0) {
		page_free(pp);
		return r;
	}
	return 0;
	//panic("sys_page_alloc not implemented");
}

然后是sys_page_map

// Map the page of memory at 'srcva' in srcenvid's address space
// at 'dstva' in dstenvid's address space with permission 'perm'.
// Perm has the same restrictions as in sys_page_alloc, except
// that it also must not grant write access to a read-only
// page.
//
// Return 0 on success, < 0 on error.  Errors are:
//	-E_BAD_ENV if srcenvid and/or dstenvid doesn't currently exist,
//		or the caller doesn't have permission to change one of them.
//	-E_INVAL if srcva >= UTOP or srcva is not page-aligned,
//		or dstva >= UTOP or dstva is not page-aligned.
//	-E_INVAL is srcva is not mapped in srcenvid's address space.
//	-E_INVAL if perm is inappropriate (see sys_page_alloc).
//	-E_INVAL if (perm & PTE_W), but srcva is read-only in srcenvid's
//		address space.
//	-E_NO_MEM if there's no memory to allocate any necessary page tables.
static int
sys_page_map(envid_t srcenvid, void *srcva,
	     envid_t dstenvid, void *dstva, int perm)
{
	// Hint: This function is a wrapper around page_lookup() and
	//   page_insert() from kern/pmap.c.
	//   Again, most of the new code you write should be to check the
	//   parameters for correctness.
	//   Use the third argument to page_lookup() to
	//   check the current permissions on the page.

	// LAB 4: Your code here.
	int r;
	struct Env *srce = NULL;
	struct Env *dste = NULL;
	struct PageInfo *pp = NULL;
	pte_t *src_pte = NULL;

	if ((r = envid2env(srcenvid, &srce, 1)) < 0)
		return r;
	if ((r = envid2env(dstenvid, &dste, 1)) < 0)
		return r;
	if ((uint32_t)srcva >= UTOP || ROUNDDOWN((uint32_t)srcva, PGSIZE) != (uint32_t)srcva)
		return -E_INVAL;
	if ((uint32_t)dstva >= UTOP || ROUNDDOWN((uint32_t)dstva, PGSIZE) != (uint32_t)dstva)
		return -E_INVAL;
	if((perm & (PTE_U | PTE_P)) != (PTE_U | PTE_P) || (perm & ~PTE_SYSCALL) != 0)
		return -E_INVAL;
	
	if((pp = page_lookup(srce->env_pgdir, srcva, &src_pte)) == NULL)
		return -E_INVAL;
	if((*src_pte & PTE_W) == 0 && (perm & PTE_W) != 0)
		return -E_INVAL;
	if((r = page_insert(dste->env_pgdir, pp, dstva, perm)) < 0)
		return r;
	return 0;
	//panic("sys_page_map not implemented");
}

最后是sys_page_unmap

// Unmap the page of memory at 'va' in the address space of 'envid'.
// If no page is mapped, the function silently succeeds.
//
// Return 0 on success, < 0 on error.  Errors are:
//	-E_BAD_ENV if environment envid doesn't currently exist,
//		or the caller doesn't have permission to change envid.
//	-E_INVAL if va >= UTOP, or va is not page-aligned.
static int
sys_page_unmap(envid_t envid, void *va)
{
	// Hint: This function is a wrapper around page_remove().

	// LAB 4: Your code here.
	int r;
	struct Env *e = NULL;
	
	if ((uint32_t)va >= UTOP || ROUNDDOWN((uint32_t)va, PGSIZE) != (uint32_t)va)
		return -E_INVAL;
	if ((r = envid2env(envid, &e, 1)) < 0)
		return r;
	page_remove(e->env_pgdir, va);
	return 0;
	//panic("sys_page_unmap not implemented");
}

最后把这些system call都放到syscall()函数:

// Dispatches to the correct kernel function, passing the arguments.
int32_t
syscall(uint32_t syscallno, uint32_t a1, uint32_t a2, uint32_t a3, uint32_t a4, uint32_t a5)
{
	// Call the function corresponding to the 'syscallno' parameter.
	// Return any appropriate return value.
	// LAB 3: Your code here.
	// panic("syscall not implemented");
	switch (syscallno) {
		case SYS_cputs:
			sys_cputs((char *)a1, (size_t)a2);
			return 0;
		case SYS_cgetc:
			return sys_cgetc();
		case SYS_getenvid:
			return sys_getenvid();
		case SYS_env_destroy:
			return sys_env_destroy(sys_getenvid());
		case SYS_yield:
			sys_yield();
			return 0;
		case SYS_page_alloc:
			return sys_page_alloc((envid_t)a1, (void *)a2, (int)a3);
		case SYS_page_map:
			return sys_page_map((envid_t)a1, (void *)a2, (envid_t)a3, 
			                    (void *)a4, (int)a5);
		case SYS_page_unmap:
			return sys_page_unmap((envid_t)a1, (void *)a2);
		case SYS_exofork:
			return sys_exofork();
		case SYS_env_set_status:
			return sys_env_set_status((envid_t)a1, (int)a2);
		default:
			return -E_INVAL;
	}
}

然后就可以运行make grade了:

$ make grade
...
dumbfork: OK (1.1s)
Part A score: 5/5

Part B: Copy-on-Write Fork

xv6中实现的fork()就是基本上和dumbfork()一样的,把parent的address space复制到child。而这个赋值操作也就是fork开销最大的地方。

然而,很多时候fork后面都直接会跟着exec,那么这个复制就显得很浪费了。

基于这个原因,Unix的后续版本利用virtual memory hardware来允许parent和child共享内存,直到某一个进行修改内存。这个技术被称为copy-on-write。为了达到这一目的,fork的时候系统复制的时候address space mappings 而不是page里具体的内容,并标记这些page为read-only。当任意一个进程进行写入的时候就会触发page fault,这时在给这个出发了fault的进程分配一个new, private, writable copy of the page。这样直到写入才会真正进行复制,从而使得fork之后直接exec的开销小了很多:child很可能只需要复制1个page(the current page of its stack,这个没明白...)

我们接下来就来实现copy-on-wirte fork

User-level page fault handling

一个user-level copy-on-write fork需要获取在write-protected pages上的page faults。注意copy-on-write只是user-level page fault handling的众多应用之一。

创建一个address space然后用page faults来进行后续操作是非常常见的。如大多数Unix kernel最开始都只会分配1个page作为stack,之后随着需求再扩增(lazy allocation)。一个典型的Unix kernel需要keep track of what action to take when a page fault occers in each region of a process's space.比如说,再stack触法的一般会需要分配并映射一个新的physical page,在BSS(存储未初始化全局变量的segment)中的往往需要分配一个新的page,全部填充为0,然后再映射。在ystem with demand-paged executables(这啥...),在text region触法的会从硬盘读入对应page的binary,并进行映射。

kernel需要track很多信息。于其采用传统Unix的方法,我们将在user space里面决定不同的page fault要做什么,这样bugs are less damaging。这种设计让应用定义其内存区域的自动度更大了:我们将在后续的disk-based file system中使用user-level page fault handling。

Setting the Page Fault Handler

为了handle自己的page fault,一个用户环境需要在JOS kernel中注册一个page fault handler entrypoint。用户环境会用sys_env_set_pgfault_upcall这个system call来注册其page fault entrypoint。我们已经在Env中加入了env_pgfault_upcall来进行记录了。

Exercise 8

实现sys_env_set_pgfault_upcall。注意在查询environment id的时候要进行permission checking(对应参数设为1),因为这是一个很危险的system call。

// Set the page fault upcall for 'envid' by modifying the corresponding struct
// Env's 'env_pgfault_upcall' field.  When 'envid' causes a page fault, the
// kernel will push a fault record onto the exception stack, then branch to
// 'func'.
//
// Returns 0 on success, < 0 on error.  Errors are:
//	-E_BAD_ENV if environment envid doesn't currently exist,
//		or the caller doesn't have permission to change envid.
static int
sys_env_set_pgfault_upcall(envid_t envid, void *func)
{
	// LAB 4: Your code here.
	int r;
	struct Env *e = NULL;
	
	if ((r = envid2env(envid, &e, 1)) < 0)
		return r;
	e->env_pgfault_upcall = func;
	return 0;
	//panic("sys_env_set_pgfault_upcall not implemented");
}

Normal and Exception Stacks in User Environments

在正常(normal)的执行过程中,JOS的一个用户环境会正常地运行在user stack上:其ESP寄存器最开始会为USTACKTOP,且其栈的数据会在USTACKTOP-PGSIZEUSTACKTOP-1之间。然而,当发生了page fault的时候,kernel会重启用户环境,让其运行a designated user-level page fault handler在一个不同的stack上——user exception stack。本质上,我们需要让JOS代表用户环境实现这个自动的"stack switch",和x86处理器代表JOS从user mode转换到kernel mode一样。

JOS的user exception stack也是1 page大,在UXSTACKTOP-PGSIZEUSTACKTOP-1之间。当在这个栈上运行的时候,user-level page fault handler可以用JOS的regular system call来映射新page或解决page fault对应的问题。然后通过一个assemble language stub,user-level page fault handler 返回到origin stack的错误代码处。

每个想要支持user-level page fault handling的用户环境都需要使用sys-page_alloc()来给其exception stack分配内存。

Invoking the User Page Fault Handler

我们将需要改变kern/trap.c中的page fault handling code以按照如下方法在user mode处理page fault。我们将把fault出现了时候的用户环境的状态称为trap-time state。

如果没有注册page fault handler,JOS会和以前一样destroy the user environment。如若不然,kernel会在exception stack上创建一个类似于inc/trap.h中的struct UTrapframe的trap frame。

                    <-- UXSTACKTOP
trap-time esp
trap-time eflags
trap-time eip
trap-time eax       start of struct PushRegs
trap-time ecx
trap-time edx
trap-time ebx
trap-time esp
trap-time ebp
trap-time esi
trap-time edi       end of struct PushRegs
tf_err (error code)
fault_va            <-- %esp when handler is run

UTrapframe的结构如下:

struct UTrapframe {
	/* information about the fault */
	uint32_t utf_fault_va;	/* va for T_PGFLT, 0 otherwise */
	uint32_t utf_err;
	/* trap-time return state */
	struct PushRegs utf_regs;
	uintptr_t utf_eip;
	uint32_t utf_eflags;
	/* the trap-time stack to return to */
	uintptr_t utf_esp;
} __attribute__((packed));

kernel之后会用这个stack和page fault handler来让user environment继续运行。注意,fault_va就是出了错误的VA。

如果用户环境已经运行在exception stack的情况下出现了要给exception,那么说明page fault handler出错了。这种情况下,会在现在的tf->tf_esp下重新创建一个stack frame。我们需要先放入一个空的32-bit word,然后创建一个struct UTrapframe

我们可以通过检查tf->tf_esp是否在UXSTACKTOP-PGSIZEUSTACKTOP-1之间来判断是不是已经在exception stack上了。

Exercise 9

实现kern/trap.c中的page_fault_handler。如果exception stack runs out of space了怎么办?

void
page_fault_handler(struct Trapframe *tf)
{
	uint32_t fault_va;

	// Read processor's CR2 register to find the faulting address
	fault_va = rcr2();

	// Handle kernel-mode page faults.

	// LAB 3: Your code here.

	// We've already handled kernel-mode exceptions, so if we get here,
	// the page fault happened in user mode.

	// Call the environment's page fault upcall, if one exists.  Set up a
	// page fault stack frame on the user exception stack (below
	// UXSTACKTOP), then branch to curenv->env_pgfault_upcall.
	//
	// The page fault upcall might cause another page fault, in which case
	// we branch to the page fault upcall recursively, pushing another
	// page fault stack frame on top of the user exception stack.
	//
	// It is convenient for our code which returns from a page fault
	// (lib/pfentry.S) to have one word of scratch space at the top of the
	// trap-time stack; it allows us to more easily restore the eip/esp. In
	// the non-recursive case, we don't have to worry about this because
	// the top of the regular user stack is free.  In the recursive case,
	// this means we have to leave an extra word between the current top of
	// the exception stack and the new stack frame because the exception
	// stack _is_ the trap-time stack.
	//
	// If there's no page fault upcall, the environment didn't allocate a
	// page for its exception stack or can't write to it, or the exception
	// stack overflows, then destroy the environment that caused the fault.
	// Note that the grade script assumes you will first check for the page
	// fault upcall and print the "user fault va" message below if there is
	// none.  The remaining three checks can be combined into a single test.
	//
	// Hints:
	//   user_mem_assert() and env_run() are useful here.
	//   To change what the user environment runs, modify 'curenv->env_tf'
	//   (the 'tf' variable points at 'curenv->env_tf').

	// LAB 4: Your code here.
	bool flag = true;
	if (curenv->env_pgfault_upcall == NULL)
		flag = false;
	if (fault_va < UXSTACKTOP && fault_va >= UXSTACKTOP - PGSIZE)
		flag = false;
	if (flag) {  // this makes sure it's from user mode
		struct UTrapframe *utf = NULL;
		if (curenv->env_tf.tf_esp < UXSTACKTOP && curenv->env_tf.tf_esp >= UXSTACKTOP - PGSIZE)
			utf = (struct UTrapframe *)(curenv->env_tf.tf_esp - sizeof(struct UTrapframe) - 4);
		else
			utf = (struct UTrapframe *)(UXSTACKTOP - sizeof(struct UTrapframe));
		// check the overflow and write permission
		user_mem_assert(curenv, utf, sizeof(struct UTrapframe), PTE_W);

		utf->utf_fault_va = fault_va;
		utf->utf_err = tf->tf_err;
		utf->utf_regs = tf->tf_regs;
		utf->utf_eip = tf->tf_eip;
		utf->utf_eflags = tf->tf_eflags;
		utf->utf_esp = tf->tf_esp;
		tf->tf_eip = (uintptr_t)curenv->env_pgfault_upcall;
		tf->tf_esp = (uintptr_t)utf;
		env_run(curenv);
	}
	// Destroy the environment that caused the fault.
	cprintf("[%08x] user fault va %08x ip %08x\n",
		curenv->env_id, fault_va, tf->tf_eip);
	print_trapframe(tf);
	env_destroy(curenv);
}

User-mode Page Fault Entrypoint

Next, you need to implement the assembly routine that will take care of calling the C page fault handler and resume execution at the original faulting instruction. This assembly routine is the handler that will be registered with the kernel using sys_env_set_pgfault_upcall().

Exercise 10

实现lib/pfentry.S中的_pgfault_upcall

.text
.globl _pgfault_upcall
_pgfault_upcall:
	// Call the C page fault handler.
	pushl %esp			// function argument: pointer to UTF
	movl _pgfault_handler, %eax
	call *%eax
	addl $4, %esp			// pop function argument
	
	// Now the C page fault handler has returned and you must return
	// to the trap time state.
	// Push trap-time %eip onto the trap-time stack.
	//
	// Explanation:
	//   We must prepare the trap-time stack for our eventual return to
	//   re-execute the instruction that faulted.
	//   Unfortunately, we can't return directly from the exception stack:
	//   We can't call 'jmp', since that requires that we load the address
	//   into a register, and all registers must have their trap-time
	//   values after the return.
	//   We can't call 'ret' from the exception stack either, since if we
	//   did, %esp would have the wrong value.
	//   So instead, we push the trap-time %eip onto the *trap-time* stack!
	//   Below we'll switch to that stack and call 'ret', which will
	//   restore %eip to its pre-fault value.
	//
	//   In the case of a recursive fault on the exception stack,
	//   note that the word we're pushing now will fit in the
	//   blank word that the kernel reserved for us.
	//
	// Throughout the remaining code, think carefully about what
	// registers are available for intermediate calculations.  You
	// may find that you have to rearrange your code in non-obvious
	// ways as registers become unavailable as scratch space.
	//
	// LAB 4: Your code here.
	movl 0x28(%esp), %edx # trap-time eip
	subl $0x4, 0x30(%esp) # we have to use subl now because we can't use after popfl
	movl 0x30(%esp), %eax # trap-time esp-4
	movl %edx, (%eax)
	addl $0x8, %esp

	// Restore the trap-time registers.  After you do this, you
	// can no longer modify any general-purpose registers.
	// LAB 4: Your code here.
	popal

	// Restore eflags from the stack.  After you do this, you can
	// no longer use arithmetic operations or anything else that
	// modifies eflags.
	// LAB 4: Your code here.
	addl $0x4, %esp #eip
	popfl

	// Switch back to the adjusted trap-time stack.
	// LAB 4: Your code here.
	popl %esp

	// Return to re-execute the instruction that faulted.
	// LAB 4: Your code here.
	ret

然后是完成lib/pgfault.c中的set_pgfault_handler()

//
// Set the page fault handler function.
// If there isn't one yet, _pgfault_handler will be 0.
// The first time we register a handler, we need to
// allocate an exception stack (one page of memory with its top
// at UXSTACKTOP), and tell the kernel to call the assembly-language
// _pgfault_upcall routine when a page fault occurs.
//
void
set_pgfault_handler(void (*handler)(struct UTrapframe *utf))
{
	int r;

	if (_pgfault_handler == 0) {
		// First time through!
		// LAB 4: Your code here.
		if ((r = sys_page_alloc(0, (void *)(UXSTACKTOP - PGSIZE), PTE_W | PTE_U | PTE_P)) < 0)
			panic("fail to allocate exception stack");
		sys_env_set_pgfault_upcall(0, _pgfault_upcall);
		//panic("set_pgfault_handler not implemented");
	}

	// Save handler pointer for assembly to call.
	_pgfault_handler = handler;
}

然后就是进行测试,我们来一次看这些个测试样例。

首先是user_faultread:

void
umain(int argc, char **argv)
{
	cprintf("I read %08x from location 0!\n", *(unsigned*)0);
}

单纯的就是访问了一个没有被分配的内存,没有设置_pgfault_upcall,所以直接destroy,输出为:

$ make run-faultread-nox
...
[00000000] new env 00001000
[00001000] user fault va 00000000 ip 00800039
TRAP frame at 0xf02b0000 from CPU 0
  edi  0x00000000
  esi  0x00000000
  ebp  0xeebfdfd0
  oesp 0xf0235fdc
  ebx  0x00000000
  edx  0x00000000
  ecx  0x00000000
  eax  0xeec00000
  es   0x----0023
  ds   0x----0023
  trap 0x0000000e Page Fault
  cr2  0x00000000
  err  0x00000004 [user, read, not-present]
  eip  0x00800039
  cs   0x----001b
  flag 0x00000086
  esp  0xeebfdfc0
  ss   0x----0023
[00001000] free env 00001000
No runnable environments in the system!
...

然后是user_faultdie:

void
handler(struct UTrapframe *utf)
{
	void *addr = (void*)utf->utf_fault_va;
	uint32_t err = utf->utf_err;
	cprintf("i faulted at va %x, err %x\n", addr, err & 7);
	sys_env_destroy(sys_getenvid());
}

void
umain(int argc, char **argv)
{
	set_pgfault_handler(handler);
	*(int*)0xDeadBeef = 0;
}

会输出错误地址与错误信息,然后destroy。所以结果是:

$ make run-faultdie-nox
...
[00000000] new env 00001000
i faulted at va deadbeef, err 6
[00001000] exiting gracefully
[00001000] free env 00001000
No runnable environments in the system!
...

然后是user_faultalloc

void
handler(struct UTrapframe *utf)
{
	int r;
	void *addr = (void*)utf->utf_fault_va;

	cprintf("fault %x\n", addr);
	if ((r = sys_page_alloc(0, ROUNDDOWN(addr, PGSIZE),
				PTE_P|PTE_U|PTE_W)) < 0)
		panic("allocating at %x in page fault handler: %e", addr, r);
	snprintf((char*) addr, 100, "this string was faulted in at %x", addr);
}

void
umain(int argc, char **argv)
{
	set_pgfault_handler(handler);
	cprintf("%s\n", (char*)0xDeadBeef);
	cprintf("%s\n", (char*)0xCafeBffe);
}

如果遇到问题,尝试分配一个page,然后输出出错的位置。snprintf是把字符串存在对应地址用的。所以输出结果为:

$ make run-faultalloc-nox
...
[00000000] new env 00001000
fault deadbeef
this string was faulted in at deadbeef
fault cafebffe
fault cafec000
this string was faulted in at cafebffe
[00001000] exiting gracefully
[00001000] free env 00001000
No runnable environments in the system!
...

经过测试可以知道是在进行第二次snprintf的时候在exception stack里面又触发了中断。这是因为分配的buffer是100 byte,而0xcafebffe + 100已经到了下一个page了,所以就需要再进行一次分配。

最后是user_faultallocbad

void
handler(struct UTrapframe *utf)
{
	int r;
	void *addr = (void*)utf->utf_fault_va;

	cprintf("fault %x\n", addr);
	if ((r = sys_page_alloc(0, ROUNDDOWN(addr, PGSIZE),
				PTE_P|PTE_U|PTE_W)) < 0)
		panic("allocating at %x in page fault handler: %e", addr, r);
	snprintf((char*) addr, 100, "this string was faulted in at %x", addr);
}

void
umain(int argc, char **argv)
{
	set_pgfault_handler(handler);
	sys_cputs((char*)0xDEADBEEF, 4);
}

注意和上面的区别是没有使用cprintf,而是调用了sys_cputs这一system call。

$ make run-faultallocbad-nox
...
[00000000] new env 00001000
[00001000] user_mem_check assertion failure for va deadbeef
[00001000] free env 00001000
No runnable environments in the system!
...

cprintfsys_cputs的区别在于,cprintf再实现的时候先使用了user lib中的vprintfmtvprintfmt会访问0xdeadbeef里面存储的内容(把内容复制到buffer中),从而在在user mode触法page fault。而直接调用sys_cputs会在触法page fault之前触法system call,然后在sys_cputs里面的user_mem_assert会发现这个地址不能访问,从而导致destroy。

Implementing Copy-on-Write Fork

现在我们来实现copy-on-write fork。其代码在lib/fork.c,主要的控制流如下:

  1. parent把pgfault()作为page fault handler
  2. parent调用set_exofork来创建一个child environment
  3. 对于每个在UTOP之下的writable或是copy-on-write page,parent调用duppage。其会map这个page到child的对应位置,然后再map到自己的位置,只不过是把权限改为copy-on-write。用PTE_COW表示copy-on-write page。

    exception stack不会这样map,child需要重新分配一个新的page作为exception stack,不然就没法正常运行page fault handler了。

  4. parent重新为child设置page fault entry point
  5. 截止到这个时候,child已经可以运行了,所以parent把child设置为ENV_RUNNABLE

每次一个环境访问要向一个copy-on-write page写入的时候就会触法page fault handler,其控制流如下:

  1. kernel通过_pgfault_upcall调用pgfault()
  2. pgfault检查page fault是否为一个写入,以及page是PTE_COW
  3. pgfault现在一个临时的地方分配一个新的page,然后把需要写入的page的内容复制到这个临时的page中,然后在把临时page map到对应的位置,给予read/write权限。最后unmap这个临时的地方。

Exercise 12

按照上述的流程补全lib/fork.c

具体代码如下,其中大量参考了dumbfork,尤其是pgfault,实际上和dumbfork做的事情一样。

// implement fork from user space

#include <inc/string.h>
#include <inc/lib.h>

// PTE_COW marks copy-on-write page table entries.
// It is one of the bits explicitly allocated to user processes (PTE_AVAIL).
#define PTE_COW		0x800
extern void _pgfault_upcall(void);

//
// Custom page fault handler - if faulting page is copy-on-write,
// map in our own private writable copy.
//
static void
pgfault(struct UTrapframe *utf)
{
	void *addr = (void *) utf->utf_fault_va;
	uint32_t err = utf->utf_err;
	int r;

	// Check that the faulting access was (1) a write, and (2) to a
	// copy-on-write page.  If not, panic.
	// Hint:
	//   Use the read-only page table mappings at uvpt
	//   (see <inc/memlayout.h>).

	// LAB 4: Your code here.
	if(!((err & FEC_WR) && 
	     (uvpd[PDX(addr)] & PTE_P) &&
		   (uvpt[PGNUM(addr)] & PTE_P) &&
		   (uvpt[PGNUM(addr)] & PTE_COW)))
		panic("fault not on copy-on-write");
	// Allocate a new page, map it at a temporary location (PFTEMP),
	// copy the data from the old page to the new page, then move the new
	// page to the old page's address.
	// Hint:
	//   You should make three system calls.

	// LAB 4: Your code here.
	addr = ROUNDDOWN(addr, PGSIZE);
	if((r = sys_page_alloc(0, PFTEMP, PTE_W | PTE_U | PTE_P)) < 0)
		panic("fail to alloc");
	memmove(PFTEMP, addr, PGSIZE);
	if((r = sys_page_map(0, PFTEMP, 0, addr, PTE_W | PTE_U | PTE_P)) < 0)
		panic("fail to map");
	if((r = sys_page_unmap(0, PFTEMP)) < 0)
		panic("fail to umap");
	//panic("pgfault not implemented");
}

//
// Map our virtual page pn (address pn*PGSIZE) into the target envid
// at the same virtual address.  If the page is writable or copy-on-write,
// the new mapping must be created copy-on-write, and then our mapping must be
// marked copy-on-write as well.  (Exercise: Why do we need to mark ours
// copy-on-write again if it was already copy-on-write at the beginning of
// this function?)
//
// Returns: 0 on success, < 0 on error.
// It is also OK to panic on error.
//
static int
duppage(envid_t envid, unsigned pn)
{
	int r;

	// LAB 4: Your code here.
	void *addr = (void *)(pn * PGSIZE);
	if(!(uvpd[PDX(addr)] & PTE_P) || !(uvpt[pn] & PTE_P))
		return -1;
	if(uvpt[pn] & (PTE_W | PTE_COW)) {
		if ((r = sys_page_map(0, addr, envid, addr, PTE_COW | PTE_U | PTE_P)) < 0)
			return r;
		if ((r = sys_page_map(0, addr, 0, addr, PTE_COW | PTE_U | PTE_P)) < 0)
			return r;
	} else {
		if ((r = sys_page_map(0, addr, envid, addr, PTE_U | PTE_P)) < 0)
			return r;
	}
	//panic("duppage not implemented");
	return 0;
}

//
// User-level fork with copy-on-write.
// Set up our page fault handler appropriately.
// Create a child.
// Copy our address space and page fault handler setup to the child.
// Then mark the child as runnable and return.
//
// Returns: child's envid to the parent, 0 to the child, < 0 on error.
// It is also OK to panic on error.
//
// Hint:
//   Use uvpd, uvpt, and duppage.
//   Remember to fix "thisenv" in the child process.
//   Neither user exception stack should ever be marked copy-on-write,
//   so you must allocate a new page for the child's user exception stack.
//
envid_t
fork(void)
{
	// LAB 4: Your code here.
	envid_t envid;
	uint8_t *addr;
	int r;
	
	set_pgfault_handler(pgfault);
	envid = sys_exofork();
	if (envid < 0)
		panic("sys_exofork: %e", envid);
	if (envid == 0) {
		thisenv = &envs[ENVX(sys_getenvid())];
		return 0;
	}
	if((r = sys_page_alloc(envid, (void *)(UXSTACKTOP - PGSIZE), PTE_U | PTE_W | PTE_P)) < 0)
		panic("fail to alloc for exception stack");
	for (addr = 0; (uint32_t)addr < UTOP; addr += PGSIZE) {
		if ((uint32_t)addr == UXSTACKTOP - PGSIZE)
			continue;
		else if((uvpd[PDX(addr)] & PTE_P) && (uvpt[PGNUM(addr)] & (PTE_P))) {
			if((r = duppage(envid, PGNUM(addr))) < 0)
				panic("duppage failed");
		}
	}
	sys_env_set_pgfault_upcall(envid, _pgfault_upcall);
	if ((r = sys_env_set_status(envid, ENV_RUNNABLE)) < 0)
		panic("sys_env_set_status: %e", r);
	return envid;
	//panic("fork not implemented");
}

运行测试,有:

$ make run-forktree-nox
...
[00000000] new env 00001000
1000: I am ''
[00001000] new env 00001001
[00001000] new env 00001002
[00001000] exiting gracefully
[00001000] free env 00001000
1001: I am '0'
[00001001] new env 00002000
[00001001] new env 00001003
[00001001] exiting gracefully
[00001001] free env 00001001
2000: I am '00'
[00002000] new env 00002001
[00002000] new env 00001004
[00002000] exiting gracefully
[00002000] free env 00002000
2001: I am '000'
[00002001] exiting gracefully
[00002001] free env 00002001
1002: I am '1'
[00001002] new env 00003001
[00001002] new env 00003000
[00001002] exiting gracefully
[00001002] free env 00001002
3000: I am '11'
[00003000] new env 00002002
[00003000] new env 00001005
[00003000] exiting gracefully
[00003000] free env 00003000
3001: I am '10'
[00003001] new env 00004000
[00003001] new env 00001006
[00003001] exiting gracefully
[00003001] free env 00003001
4000: I am '100'
[00004000] exiting gracefully
[00004000] free env 00004000
2002: I am '110'
[00002002] exiting gracefully
[00002002] free env 00002002
1003: I am '01'
[00001003] new env 00003002
[00001003] new env 00005000
[00001003] exiting gracefully
[00001003] free env 00001003
5000: I am '011'
[00005000] exiting gracefully
[00005000] free env 00005000
3002: I am '010'
[00003002] exiting gracefully
[00003002] free env 00003002
1004: I am '001'
[00001004] exiting gracefully
[00001004] free env 00001004
1005: I am '111'
[00001005] exiting gracefully
[00001005] free env 00001005
1006: I am '101'
[00001006] exiting gracefully
[00001006] free env 00001006
...

然后就完成了Part B

$ make grade
...
dumbfork: OK (1.4s)
Part A score: 5/5

faultread: OK (0.9s)
faultwrite: OK (1.0s)
faultdie: OK (1.0s)
faultregs: OK (2.0s)
faultalloc: OK (1.0s)
faultallocbad: OK (1.9s)
faultnostack: OK (2.1s)
faultbadhandler: OK (2.0s)
faultevilhandler: OK (1.9s)
forktree: OK (2.1s)
Part B score: 50/50

Part C Preemptive Multitasking and Inter-Process communication (IPC)

这部分我们会实现preempty uncooperative environment以及environment之间的通信。

Clock Interrupts and Preemption

运行user/spin,会发现child里面的循环会永久得占据CPU,导致parent或是kernel没有办法运行。

// user/spin.c
void
umain(int argc, char **argv) {
	envid_t env;

	cprintf("I am the parent.  Forking the child...\n");
	if ((env = fork()) == 0) {
		cprintf("I am the child.  Spinning...\n");
		while (1)
			/* do nothing */;
	}
	cprintf("I am the parent.  Running the child...\n");
	sys_yield();
	sys_yield();
	sys_yield();
	sys_yield();
	sys_yield();
	sys_yield();
	sys_yield();
	sys_yield();
	cprintf("I am the parent.  Killing the child...\n");
	sys_env_destroy(env);
}

显然这不是一个好的设计,因为user-mode environment可以直接导致整个系统halt。为了让kernel可以preempt a running environment,我们需要让JOS kernel实现external hardware interrupt from the clock hardware。

Interrupt discipline

External interrupts被称为IRQ。有16个IRQ,从0到15。IRQ number到IDT entry的映射还没有实现。

inc/trap.c中定义了IRQ_OFFSET为32,所以IDT中的32~47对应了IRQ的0~15。比如IDT[IRQ_OFFSET + 0]对应了始终中断的handler。采用这个IRQ_OFFSET是为了让device interrupt和processor exception分开。(最早版本的MS-DOS不进行区分,导致了很多问题...)

在JOS中,我们相较于xv6做了简化,在kernel里面externel device interrupt都被禁用了。在JOS中externel device interrupt是用%eflagsFL_IF位控制的,为1则开启。尽管我们有很多种方法可以修改这一位,为了简化,我们仅仅在进入或离开user mode的时候进行%eflags的恢复或存储。

我们需要保证在用户环境中FL_IF位为1,从而能够接受中断。在bootloader的最初处我们加载了仅用了所有中断,然后截止到现在我们没有开启过他们。

Exercise 13

修改kern/trapentry.S, kern/trap.c来加入IRQ 0~15。同时更改env_alloc()让用户环境开启中断。并且取消sched_haltsti前面的注释。让idle CPU启用中断。

下面是对应的代码。首先是trapentry.S,注意提示中又说the processor never pushes an error code when invoking a hardware interrupt handler,所以应该采用TRAPHANDLER_NOEC(至少我是这么觉得的...)

TRAPHANDLER_NOEC(IRQ_0, IRQ_OFFSET + 0)
TRAPHANDLER_NOEC(IRQ_1, IRQ_OFFSET + 1)
TRAPHANDLER_NOEC(IRQ_2, IRQ_OFFSET + 2)
TRAPHANDLER_NOEC(IRQ_3, IRQ_OFFSET + 3)
TRAPHANDLER_NOEC(IRQ_4, IRQ_OFFSET + 4)
TRAPHANDLER_NOEC(IRQ_5, IRQ_OFFSET + 5)
TRAPHANDLER_NOEC(IRQ_6, IRQ_OFFSET + 6)
TRAPHANDLER_NOEC(IRQ_7, IRQ_OFFSET + 7)
TRAPHANDLER_NOEC(IRQ_8, IRQ_OFFSET + 8)
TRAPHANDLER_NOEC(IRQ_9, IRQ_OFFSET + 9)
TRAPHANDLER_NOEC(IRQ_10, IRQ_OFFSET + 10)
TRAPHANDLER_NOEC(IRQ_11, IRQ_OFFSET + 11)
TRAPHANDLER_NOEC(IRQ_12, IRQ_OFFSET + 12)
TRAPHANDLER_NOEC(IRQ_13, IRQ_OFFSET + 13)
TRAPHANDLER_NOEC(IRQ_14, IRQ_OFFSET + 14)
TRAPHANDLER_NOEC(IRQ_15, IRQ_OFFSET + 15)

然后是kern/trap.c,添加的代码如下:

	// IRQ
	void IRQ_0();
	void IRQ_1();
	void IRQ_2();
	void IRQ_3();
	void IRQ_4();
	void IRQ_5();
	void IRQ_6();
	void IRQ_7();
	void IRQ_8();
	void IRQ_9();
	void IRQ_10();
	void IRQ_11();
	void IRQ_12();
	void IRQ_13();
	void IRQ_14();
	void IRQ_15();
	SETGATE(idt[IRQ_OFFSET + 0], 0, GD_KT, IRQ_0, 0);
	SETGATE(idt[IRQ_OFFSET + 1], 0, GD_KT, IRQ_1, 0);
	SETGATE(idt[IRQ_OFFSET + 2], 0, GD_KT, IRQ_2, 0);
	SETGATE(idt[IRQ_OFFSET + 3], 0, GD_KT, IRQ_3, 0);
	SETGATE(idt[IRQ_OFFSET + 4], 0, GD_KT, IRQ_4, 0);
	SETGATE(idt[IRQ_OFFSET + 5], 0, GD_KT, IRQ_5, 0);
	SETGATE(idt[IRQ_OFFSET + 6], 0, GD_KT, IRQ_6, 0);
	SETGATE(idt[IRQ_OFFSET + 7], 0, GD_KT, IRQ_7, 0);
	SETGATE(idt[IRQ_OFFSET + 8], 0, GD_KT, IRQ_8, 0);
	SETGATE(idt[IRQ_OFFSET + 9], 0, GD_KT, IRQ_9, 0);
	SETGATE(idt[IRQ_OFFSET + 10], 0, GD_KT, IRQ_10, 0);
	SETGATE(idt[IRQ_OFFSET + 11], 0, GD_KT, IRQ_11, 0);
	SETGATE(idt[IRQ_OFFSET + 12], 0, GD_KT, IRQ_12, 0);
	SETGATE(idt[IRQ_OFFSET + 13], 0, GD_KT, IRQ_13, 0);
	SETGATE(idt[IRQ_OFFSET + 14], 0, GD_KT, IRQ_14, 0);
	SETGATE(idt[IRQ_OFFSET + 15], 0, GD_KT, IRQ_15, 0);

最开始在实现的时候出了一个问题,那就是无法通过trap()中的。

assert(!(read_eflags() & FL_IF));

也就是在trap的时候FL_IF处有值。经过网上搜索解决方案,发现是在lab 3中的SETGATE使用出了错误。

// Set up a normal interrupt/trap gate descriptor.
// - istrap: 1 for a trap (= exception) gate, 0 for an interrupt gate.
    //   see section 9.6.1.3 of the i386 reference: "The difference between
    //   an interrupt gate and a trap gate is in the effect on IF (the
    //   interrupt-enable flag). An interrupt that vectors through an
    //   interrupt gate resets IF, thereby preventing other interrupts from
    //   interfering with the current interrupt handler. A subsequent IRET
    //   instruction restores IF to the value in the EFLAGS image on the
    //   stack. An interrupt through a trap gate does not change IF."
// - sel: Code segment selector for interrupt/trap handler
// - off: Offset in code segment for interrupt/trap handler
// - dpl: Descriptor Privilege Level -
//	  the privilege level required for software to invoke
//	  this interrupt/trap gate explicitly using an int instruction.
#define SETGATE(gate, istrap, sel, off, dpl)			\
{								\
	(gate).gd_off_15_0 = (uint32_t) (off) & 0xffff;		\
	(gate).gd_sel = (sel);					\
	(gate).gd_args = 0;					\
	(gate).gd_rsv1 = 0;					\
	(gate).gd_type = (istrap) ? STS_TG32 : STS_IG32;	\
	(gate).gd_s = 0;					\
	(gate).gd_dpl = (dpl);					\
	(gate).gd_p = 1;					\
	(gate).gd_off_31_16 = (uint32_t) (off) >> 16;		\
}

阅读SETGATE的注释可以看到istrap部分如果设置为0才能保证在中断的里面不会有中断,这也是JOS的要求(和xv6不同),所以所有的SETGATE(包括lab3的部分)都需要采用istrap=0

然后在env_alloc()中:

	// Enable interrupts while in user mode.
	// LAB 4: Your code here.
	e->env_tf.tf_eflags |= FL_IF;

最后是sched_halt()

	// Reset stack pointer, enable interrupts and then halt.
	asm volatile (
		"movl $0, %%ebp\n"
		"movl %0, %%esp\n"
		"pushl $0\n"
		"pushl $0\n"
		// Uncomment the following line after completing exercise 13
		"sti\n"
		"1:\n"
		"hlt\n"
		"jmp 1b\n"
	: : "a" (thiscpu->cpu_ts.ts_esp0));

再用user/spin测试会发现中断已经被打开了:

$ make run-spin-nox
...
[00000000] new env 00001000
I am the parent.  Forking the child...
[00001000] new env 00001001
I am the parent.  Running the child...
I am the child.  Spinning...
TRAP frame at 0xf02b707c from CPU 0
  edi  0x00000000
  esi  0x00000000
  ebp  0xeebfdfd0
  oesp 0xf023cfdc
  ebx  0x00000000
  edx  0xeebfde88
  ecx  0x0000001d
  eax  0x0000001d
  es   0x----0023
  ds   0x----0023
  trap 0x00000020 Hardware Interrupt
  err  0x00000000
  eip  0x00800060
  cs   0x----001b
  flag 0x00000282
  esp  0xeebfdfc8
  ss   0x----0023
[00001001] free env 00001001
I am the parent.  Killing the child...
[00001000] exiting gracefully
[00001000] free env 00001000
No runnable environments in the system!
...

Handling Clock Interrupts

我们接下来来设置时钟中断的handler。lapic_initpic_init已经设置了始终和中断控制器,我们只需要写handler了。

Exercise 14

修改trap_dispatch,添加timer interrupt handler。

		// Handle clock interrupts. Don't forget to acknowledge the
		// interrupt using lapic_eoi() before calling the scheduler!
		// LAB 4: Your code here.
		case IRQ_OFFSET + IRQ_TIMER:
			lapic_eoi();
			sched_yield();

这个时候我们就可以正常运行user/spin了。

$ make run-spin-nox
...
[00000000] new env 00001000
I am the parent.  Forking the child...
[00001000] new env 00001001
I am the parent.  Running the child...
I am the child.  Spinning...
I am the parent.  Killing the child...
[00001000] destroying 00001001
[00001000] free env 00001001
[00001000] exiting gracefully
[00001000] free env 00001000
No runnable environments in the system!
...

Inter-Process communication (IPC)

截止到现在,我们非常重视isolation,但是让程序可以相互之间交流也是很重要的一个功能。让程序可以互动可以带来强大的功能。interprocess communication有很多的模型,到今天为止哪个模型最好仍然没有定论。我们会实现一个简单的模型。

IPC in JOS

我们会实现几个简单的system call来实现一个简单的interprocess communication mechanism。两个system call分别实sys_ipc_recvsys_ipc_try_send。之后我们会把他们包装在ipc_recvipc_send里。

在JOS的IPC中相互传输的信息由两部分组成,一个32位的值和一个可选的single page mapping。让环境可以相互之间传page mapping是一种传输大数据量的高效方法,同时也让环境更好设置共享内存。

Sending and Receiving Messages

如果要接收到信息,环境需要调用sys_ipc_recv,这个system call会de-schedules当前环境并只有收到信息之后才会继续运行。当一个环境等待获取信息的时候,其他的环境可以发信息。注意是是任何环境,因为IPC的设计本身确保了安全,发信息不会搞崩接受的环境,所以不需要环境之间由parent child这样的关系。

如果要发送一条信息,环境需要调用sys_ipc_try_send,以自己和接受环境为参数。如果对应的环境确实在准备接收,那么数据会被传过去并返回0,不然返回-E_IPC_NOT_RECV

User space中的库函数ipc_recv会负责调用sys_ipc_recv以及在struct Env中查询接收到的数据。

同样ipc_send会重复调用sys_ipc_try_send直到成功发送。

Transferring Pages

当一个环境以一个合理的dstva(在UTOP之下)调用sys_ipc_recv的时候,环境表示其可以接受一个mapping。如果sender发过来一个page,那么这个page就会被map在dstva处。如果dstva原来就有一个mapping,这个mapping会被unmap。

当一个环境以一个合理的srcva(在UTOP之下)调用sys_ipc_try_send的时候,环境表示其想要发送一个map在srcva的page,其permission为perm。IPC成功之后,sender会保留原始的mapping,但是receiver和它会share这个page。

如果sender和receive中的一个不愿意发送或是接受page,那么就不会有page被传输。IPC之后,Env中的env_ipc_perm会为接收到的page的permission或0。

Implementing IPC

Exercise 15

完成sys_ipc_recv, sys_ipc_try_send以及ipc_recvipc_send。注意在使用envid2env的时候checkperm应设为0,原因上面提到了。

按照详细的注释一点一点加上就好了,首先是sys_ipc_recv

// Block until a value is ready.  Record that you want to receive
// using the env_ipc_recving and env_ipc_dstva fields of struct Env,
// mark yourself not runnable, and then give up the CPU.
//
// If 'dstva' is < UTOP, then you are willing to receive a page of data.
// 'dstva' is the virtual address at which the sent page should be mapped.
//
// This function only returns on error, but the system call will eventually
// return 0 on success.
// Return < 0 on error.  Errors are:
//	-E_INVAL if dstva < UTOP but dstva is not page-aligned.
static int
sys_ipc_recv(void *dstva)
{
	// LAB 4: Your code here.
	if((uint32_t)dstva < UTOP)
		if(ROUNDDOWN((uint32_t)dstva, PGSIZE) != (uint32_t)dstva)
			return -E_INVAL;
	curenv->env_ipc_recving = 1;
	curenv->env_ipc_dstva = dstva;
	curenv->env_status = ENV_NOT_RUNNABLE;
	sched_yield();
	//panic("sys_ipc_recv not implemented");
	return 0;
}

然后是sys_ipc_try_send

// Try to send 'value' to the target env 'envid'.
// If srcva < UTOP, then also send page currently mapped at 'srcva',
// so that receiver gets a duplicate mapping of the same page.
//
// The send fails with a return value of -E_IPC_NOT_RECV if the
// target is not blocked, waiting for an IPC.
//
// The send also can fail for the other reasons listed below.
//
// Otherwise, the send succeeds, and the target's ipc fields are
// updated as follows:
//    env_ipc_recving is set to 0 to block future sends;
//    env_ipc_from is set to the sending envid;
//    env_ipc_value is set to the 'value' parameter;
//    env_ipc_perm is set to 'perm' if a page was transferred, 0 otherwise.
// The target environment is marked runnable again, returning 0
// from the paused sys_ipc_recv system call.  (Hint: does the
// sys_ipc_recv function ever actually return?)
//
// If the sender wants to send a page but the receiver isn't asking for one,
// then no page mapping is transferred, but no error occurs.
// The ipc only happens when no errors occur.
//
// Returns 0 on success, < 0 on error.
// Errors are:
//	-E_BAD_ENV if environment envid doesn't currently exist.
//		(No need to check permissions.)
//	-E_IPC_NOT_RECV if envid is not currently blocked in sys_ipc_recv,
//		or another environment managed to send first.
//	-E_INVAL if srcva < UTOP but srcva is not page-aligned.
//	-E_INVAL if srcva < UTOP and perm is inappropriate
//		(see sys_page_alloc).
//	-E_INVAL if srcva < UTOP but srcva is not mapped in the caller's
//		address space.
//	-E_INVAL if (perm & PTE_W), but srcva is read-only in the
//		current environment's address space.
//	-E_NO_MEM if there's not enough memory to map srcva in envid's
//		address space.
static int
sys_ipc_try_send(envid_t envid, uint32_t value, void *srcva, unsigned perm)
{
	// LAB 4: Your code here.
	int r;
	struct Env *e;
	struct PageInfo *pp = NULL;
	pte_t *pte = NULL;
	if ((r = envid2env(envid, &e, 0)) < 0)
		return r;
	if(!e->env_ipc_recving)
		return -E_IPC_NOT_RECV;
	if ((uint32_t)srcva < UTOP && (uint32_t)e->env_ipc_dstva < UTOP) {
		if(ROUNDDOWN((uint32_t)srcva, PGSIZE) != (uint32_t)srcva)
			return -E_INVAL;
		if((perm & (PTE_U | PTE_P)) != (PTE_U | PTE_P) || (perm & ~PTE_SYSCALL) != 0)
			return -E_INVAL;
		if((pp = page_lookup(curenv->env_pgdir, srcva, &pte)) == NULL)
			return -E_INVAL;
		if((*pte & perm) != perm)
			return -E_INVAL;
		if((perm & PTE_W) && !(*pte & PTE_W))
			return -E_INVAL;
		if((r = page_insert(e->env_pgdir, pp, e->env_ipc_dstva, perm)) < 0)
			return r;
		e->env_ipc_perm = perm;
	}
	e->env_ipc_recving = 0;
	e->env_ipc_from = curenv->env_id;
	e->env_ipc_value = value;
	e->env_status = ENV_RUNNABLE;
	e->env_tf.tf_regs.reg_eax = 0;
	return 0;
	//panic("sys_ipc_try_send not implemented");
}

注意写完这两个之后别忘了在syscall()里面加上这两个函数。

然后是ipc_recv

// Receive a value via IPC and return it.
// If 'pg' is nonnull, then any page sent by the sender will be mapped at
//	that address.
// If 'from_env_store' is nonnull, then store the IPC sender's envid in
//	*from_env_store.
// If 'perm_store' is nonnull, then store the IPC sender's page permission
//	in *perm_store (this is nonzero iff a page was successfully
//	transferred to 'pg').
// If the system call fails, then store 0 in *fromenv and *perm (if
//	they're nonnull) and return the error.
// Otherwise, return the value sent by the sender
//
// Hint:
//   Use 'thisenv' to discover the value and who sent it.
//   If 'pg' is null, pass sys_ipc_recv a value that it will understand
//   as meaning "no page".  (Zero is not the right value, since that's
//   a perfectly valid place to map a page.)
int32_t
ipc_recv(envid_t *from_env_store, void *pg, int *perm_store)
{
	// LAB 4: Your code here.
	int r;
	void *dstva = (void *)UTOP;

	if(pg)
		dstva = pg;
	if (from_env_store)
			*from_env_store = 0;
	if (perm_store)
		*perm_store = 0;
	if((r = sys_ipc_recv(dstva)) < 0)
		return r;
	if (from_env_store)
		*from_env_store = thisenv->env_ipc_from;
	if (perm_store)
		*perm_store = thisenv->env_ipc_perm;
	return thisenv->env_ipc_value;
	//panic("ipc_recv not implemented");
}

最后是ipc_send

// Send 'val' (and 'pg' with 'perm', if 'pg' is nonnull) to 'toenv'.
// This function keeps trying until it succeeds.
// It should panic() on any error other than -E_IPC_NOT_RECV.
//
// Hint:
//   Use sys_yield() to be CPU-friendly.
//   If 'pg' is null, pass sys_ipc_try_send a value that it will understand
//   as meaning "no page".  (Zero is not the right value.)
void
ipc_send(envid_t to_env, uint32_t val, void *pg, int perm)
{
	// LAB 4: Your code here.
	int r;
	void *srcva = (void *)UTOP;
	if(pg)
		srcva = pg;

	while ((r = sys_ipc_try_send(to_env, val, pg, perm)) < 0) {
		if (r == 0)
			break;
		if (r != -E_IPC_NOT_RECV)
			panic("ipc send fail");
		sys_yield();
	}
	//panic("ipc_send not implemented");
}

完成了之后就可以用user/pingpong或者是user/primes来进行测试。

最后来make grade一下:

$ make grade
...
dumbfork: OK (2.5s)
Part A score: 5/5

faultread: OK (1.9s)
faultwrite: OK (2.1s)
faultdie: OK (2.0s)
faultregs: OK (2.1s)
faultalloc: OK (0.9s)
faultallocbad: OK (1.9s)
faultnostack: OK (2.0s)
faultbadhandler: OK (2.2s)
faultevilhandler: OK (1.9s)
forktree: OK (2.2s)
Part B score: 50/50

spin: OK (2.0s)
stresssched: OK (2.2s)
sendpage: OK (1.7s)
pingpong: OK (2.0s)
primes: OK (4.4s)
Part C score: 25/25

Score: 80/80

lab4就完成了。