zhuzilin's Blog

about

6.828 lab3 User Environments

date: 2019-03-05
tags: OS  6.828  

注意,在运行lab3之前,需要修改kern/kernel.ld文件中的bss部分为:

	.bss : {
		PROVIDE(edata = .);
		*(.dynbss)
		*(.bss .bss.*)
		*(COMMON)
		PROVIDE(end = .);
	}

非常感谢解决了这个问题的同学,解决的原文在这里

Part A: User Environments and Exception Handling

首先我们需要看一下新的inc/env.h文件,其中包含了user environment的基本定义:

typedef int32_t envid_t;

// An environment ID 'envid_t' has three parts:
//
// +1+---------------21-----------------+--------10--------+
// |0|          Uniqueifier             |   Environment    |
// | |                                  |      Index       |
// +------------------------------------+------------------+
//                                       \--- ENVX(eid) --/
//
// The environment index ENVX(eid) equals the environment's index in the
// 'envs[]' array.  The uniqueifier distinguishes environments that were
// created at different times, but share the same environment index.
//
// All real environments are greater than 0 (so the sign bit is zero).
// envid_ts less than 0 signify errors.  The envid_t == 0 is special, and
// stands for the current environment.

#define LOG2NENV		10
#define NENV			(1 << LOG2NENV)
#define ENVX(envid)		((envid) & (NENV - 1))

// Values of env_status in struct Env
enum {
	ENV_FREE = 0,
	ENV_DYING,
	ENV_RUNNABLE,
	ENV_RUNNING,
	ENV_NOT_RUNNABLE
};

// Special environment types
enum EnvType {
	ENV_TYPE_USER = 0,
};

struct Env {
	struct Trapframe env_tf;	// Saved registers
	struct Env *env_link;		// Next free Env
	envid_t env_id;			// Unique environment identifier
	envid_t env_parent_id;		// env_id of this env's parent
	enum EnvType env_type;		// Indicates special system environments
	unsigned env_status;		// Status of the environment
	uint32_t env_runs;		// Number of times environment has run

	// Address space
	pde_t *env_pgdir;		// Kernel virtual address of page dir
};

虽然这个lab只会去创建一个user environment,但是为了之后的lab,需要能够支持多个environment。

kern/env.c的前几行定义了kernel中的3个和环境相关的重要全局变量:

struct Env *envs = NULL;		// All environments
struct Env *curenv = NULL;		// The current env
static struct Env *env_free_list;	// Free environment list
					// (linked by Env->env_link)

当JOS启动的时候,envs会指向一个struct Env的数组表示系统中所有的环境。在JOS的设计中,最多有NENV(1024)个环境(一般远远达不到这个值)。这个数组中会存在一个能够保存这NENV个环境的数据结构。

就像page_free_list一样,JOS有一个env_free_list用来表示inactive Env,用来进行allocation, deallocation。curenv是当前正在执行的环境的指针,初始化为NULL

Environment State

回到inc/env.h,我们来看一下Env

struct Env {
	struct Trapframe env_tf;	// Saved registers
	struct Env *env_link;		// Next free Env
	envid_t env_id;			// Unique environment identifier
	envid_t env_parent_id;		// env_id of this env's parent
	enum EnvType env_type;		// Indicates special system environments
	unsigned env_status;		// Status of the environment
	uint32_t env_runs;		// Number of times environment has run

	// Address space
	pde_t *env_pgdir;		// Kernel virtual address of page dir
};

对于这些field的更详细的解释是:

  • env_tf: 当该环境不运行的时候保存寄存器,比如从user mode到kernel mode的转换过程。
  • env_link: 指向env_free_list里的下一个Env
  • env_id: 保存一个uniquely identifier。注意如果一个环境被释放了,之后又有环境用了这个struct Env,他们的env_id会不同。
  • env_parent_id: 保存创建了这个环境的环境的env_id。从而可以建立一个树,从而方便一些security decision,也就是决定某个环境是否有某个权限。
  • env_type: 用来区分特殊环境的,普通的都是ENV_TYPE_USER
  • env_status: 状态,具体取值如下:

    // Values of env_status in struct Env
    enum {
    	ENV_FREE = 0,  // inactive, Env在env_free上
    	ENV_DYING,  // zombie, environment, 下次trap到kernel的时候会被释放
    	ENV_RUNNABLE,  // waiting to run
    	ENV_RUNNING,  // currently running
    	ENV_NOT_RUNNABLE  // currently active, but not ready to run,如等待IPC
    };
  • env_pgdir: 该环境的page directory

同Unix process一样,JOS环境结合了thread与address space。thread用保存的寄存器确定(env_tf),address space用env_pgdir确定。kernel必须要设置好这两者以成功运行某个环境。

Allocating the Environments Array

修改mem_init以分配envs的地址。并把envs映射到kernel page directory的对应位置。

Exercise 1

	//////////////////////////////////////////////////////////////////////
	// Make 'envs' point to an array of size 'NENV' of 'struct Env'.
	// LAB 3: Your code here.
	envs = (struct Env *)boot_alloc(NENV*sizeof(struct Env));
...
    //////////////////////////////////////////////////////////////////////
	// Map the 'envs' array read-only by the user at linear address UENVS
	// (ie. perm = PTE_U | PTE_P).
	// Permissions:
	//    - the new image at UENVS  -- kernel R, user R
	//    - envs itself -- kernel RW, user NONE
	// LAB 3: Your code here.
	boot_map_region(kern_pgdir, UENVS, PTSIZE, PADDR(envs), PTE_U | PTE_P);

注意后面的这个映射位置以及大小是参照的JOS的虚拟内存分布。写完之后运行kernel应该会出现好几个succeeded:

check_page_free_list() succeeded!
check_page_alloc() succeeded!
check_page() succeeded!
check_kern_pgdir() succeeded!
check_page_free_list() succeeded!
check_page_installed_pgdir() succeeded!

Creating and Running Environments

因为现在还没有file system,所以要运行一个用户环境需要让kernel去加载一个静态的二进制image。这些影响都在obj/user/中,这些在kern/Makefrag中也有体现:

# Binary program images to embed within the kernel.
# Binary files for LAB3
KERN_BINFILES :=	user/hello \
			user/buggyhello \
			user/buggyhello2 \
			user/evilhello \
			user/testbss \
			user/divzero \
			user/breakpoint \
			user/softint \
			user/badsegment \
			user/faultread \
			user/faultreadkernel \
			user/faultwrite \
			user/faultwritekernel
...
# How to build the kernel itself
$(OBJDIR)/kern/kernel: $(KERN_OBJFILES) $(KERN_BINFILES) kern/kernel.ld \
	  $(OBJDIR)/.vars.KERN_LDFLAGS
	@echo + ld $@
	$(V)$(LD) -o $@ $(KERN_LDFLAGS) $(KERN_OBJFILES) $(GCC_LIB) -b binary $(KERN_BINFILES)
	$(V)$(OBJDUMP) -S $@ > $@.asm
	$(V)$(NM) -n $@ > $@.sym

这里的-b binary表示把文件当成raw unterpreted binary而不是编译器生成的.o文件。如果查看obj/kern/kernel.sym,可以看到一系列神奇的symbol

00008acc A _binary_obj_user_hello_size
00008ad0 A _binary_obj_user_badsegment_size
00008ad0 A _binary_obj_user_breakpoint_size
00008ad0 A _binary_obj_user_buggyhello_size
00008ad0 A _binary_obj_user_evilhello_size
00008ad0 A _binary_obj_user_faultread_size
00008ad0 A _binary_obj_user_faultwrite_size
00008ad0 A _binary_obj_user_softint_size
00008ad8 A _binary_obj_user_faultreadkernel_size
00008ad8 A _binary_obj_user_faultwritekernel_size
00008ae4 A _binary_obj_user_divzero_size
00008ae8 A _binary_obj_user_testbss_size
00008aec A _binary_obj_user_buggyhello2_size

linker生成了这些symbol以让kernel可以调用这些二进制文件。

kern/init.c中,i386_init()函数会调用这些二进制文件中的一个(默认是user_hello)。但是这个函数里面和环境相关的部分都还没有完成,下面就是要填充上这些函数了。

Exercise 2

完成kern/env.c中的如下函数:

  • env_init()

    初始化envsenv_free_list

    // Mark all environments in 'envs' as free, set their env_ids to 0,
    // and insert them into the env_free_list.
    // Make sure the environments are in the free list in the same order
    // they are in the envs array (i.e., so that the first call to
    // env_alloc() returns envs[0]).
    //
    void
    env_init(void)
    {
    	// Set up envs array
    	// LAB 3: Your code here.
    	memset(envs, 0, sizeof(envs));
    	env_free_list = NULL;
    	for(int i=NENV-1; i>=0; i--) {
    		envs[i].env_link = env_free_list;
    		env_free_list = envs + i;
    	}
    	// Per-CPU part of the initialization
    	env_init_percpu();
    }

    注意这里注释要求env_free_list顺序和envs的一样所以这里和page_init的顺序相反。(不明白为啥...)

  • env_setup_vm()

    把初始化env_pgdir,并把其中的kernel部分的内存分配好。

    // Initialize the kernel virtual memory layout for environment e.
    // Allocate a page directory, set e->env_pgdir accordingly,
    // and initialize the kernel portion of the new environment's address space.
    // Do NOT (yet) map anything into the user portion
    // of the environment's virtual address space.
    //
    // Returns 0 on success, < 0 on error.  Errors include:
    //	-E_NO_MEM if page directory or table could not be allocated.
    //
    static int
    env_setup_vm(struct Env *e)
    {
    	int i;
    	struct PageInfo *p = NULL;
    
    	// Allocate a page for the page directory
    	if (!(p = page_alloc(ALLOC_ZERO)))
    		return -E_NO_MEM;
    
    	// Now, set e->env_pgdir and initialize the page directory.
    	//
    	// Hint:
    	//    - The VA space of all envs is identical above UTOP
    	//	(except at UVPT, which we've set below).
    	//	See inc/memlayout.h for permissions and layout.
    	//	Can you use kern_pgdir as a template?  Hint: Yes.
    	//	(Make sure you got the permissions right in Lab 2.)
    	//    - The initial VA below UTOP is empty.
    	//    - You do not need to make any more calls to page_alloc.
    	//    - Note: In general, pp_ref is not maintained for
    	//	physical pages mapped only above UTOP, but env_pgdir
    	//	is an exception -- you need to increment env_pgdir's
    	//	pp_ref for env_free to work correctly.
    	//    - The functions in kern/pmap.h are handy.
    
    	// LAB 3: Your code here.
    	e->env_pgdir = page2kva(p);
    	p->pp_ref++;
    	memcpy(e->env_pgdir, kern_pgdir, PGSIZE);
    	// UVPT maps the env's own page table read-only.
    	// Permissions: kernel R, user R
    	e->env_pgdir[PDX(UVPT)] = PADDR(e->env_pgdir) | PTE_P | PTE_U;
    
    	return 0;
    }

    注意这里用内存复制是因为boot_region_map是一个静态函数,不能调用。

  • region_alloc

    在当前环境下在虚拟地址va处分配长为len的内存。

    // Allocate len bytes of physical memory for environment env,
    // and map it at virtual address va in the environment's address space.
    // Does not zero or otherwise initialize the mapped pages in any way.
    // Pages should be writable by user and kernel.
    // Panic if any allocation attempt fails.
    //
    static void
    region_alloc(struct Env *e, void *va, size_t len)
    {
    	// LAB 3: Your code here.
    	// (But only if you need it for load_icode.)
    	//
    	// Hint: It is easier to use region_alloc if the caller can pass
    	//   'va' and 'len' values that are not page-aligned.
    	//   You should round va down, and round (va + len) up.
    	//   (Watch out for corner-cases!)
    	int r;
    	void *v = ROUNDDOWN(va, PGSIZE);
    	void* end = ROUNDUP(va + len, PGSIZE);
    	struct PageInfo *p = NULL;
    	for(; v < end; v += PGSIZE) {
    		if((p = page_alloc(ALLOC_ZERO)) == NULL)
    			panic("region_alloc: %e", -E_NO_MEM);
    		if((r = page_insert(e->env_pgdir, p, v, PTE_U | PTE_W | PTE_P)) < 0)
    			panic("region_alloc: %e", r);
    	}
    }
  • load_icode()

    设置initial program binary, stack与processor flags。如注释所说,主要是模仿boot/main.c中的函数。注意需要切换pgdir,因为进入这个函数的时候是kernel mode,但是分配内存要在用户的地址空间分配。

    // Set up the initial program binary, stack, and processor flags
    // for a user process.
    // This function is ONLY called during kernel initialization,
    // before running the first user-mode environment.
    //
    // This function loads all loadable segments from the ELF binary image
    // into the environment's user memory, starting at the appropriate
    // virtual addresses indicated in the ELF program header.
    // At the same time it clears to zero any portions of these segments
    // that are marked in the program header as being mapped
    // but not actually present in the ELF file - i.e., the program's bss section.
    //
    // All this is very similar to what our boot loader does, except the boot
    // loader also needs to read the code from disk.  Take a look at
    // boot/main.c to get ideas.
    //
    // Finally, this function maps one page for the program's initial stack.
    //
    // load_icode panics if it encounters problems.
    //  - How might load_icode fail?  What might be wrong with the given input?
    //
    static void
    load_icode(struct Env *e, uint8_t *binary)
    {
    	// Hints:
    	//  Load each program segment into virtual memory
    	//  at the address specified in the ELF segment header.
    	//  You should only load segments with ph->p_type == ELF_PROG_LOAD.
    	//  Each segment's virtual address can be found in ph->p_va
    	//  and its size in memory can be found in ph->p_memsz.
    	//  The ph->p_filesz bytes from the ELF binary, starting at
    	//  'binary + ph->p_offset', should be copied to virtual address
    	//  ph->p_va.  Any remaining memory bytes should be cleared to zero.
    	//  (The ELF header should have ph->p_filesz <= ph->p_memsz.)
    	//  Use functions from the previous lab to allocate and map pages.
    	//
    	//  All page protection bits should be user read/write for now.
    	//  ELF segments are not necessarily page-aligned, but you can
    	//  assume for this function that no two segments will touch
    	//  the same virtual page.
    	//
    	//  You may find a function like region_alloc useful.
    	//
    	//  Loading the segments is much simpler if you can move data
    	//  directly into the virtual addresses stored in the ELF binary.
    	//  So which page directory should be in force during
    	//  this function?
    	//
    	//  You must also do something with the program's entry point,
    	//  to make sure that the environment starts executing there.
    	//  What?  (See env_run() and env_pop_tf() below.)
    
    	// LAB 3: Your code here.
    	struct Elf *elfhdr = (struct Elf *)binary;
    	if (elfhdr->e_magic != ELF_MAGIC)
    		panic("load_icode: not valid elf file");
    	struct Proghdr *ph, *eph;
    	ph = (struct Proghdr *) (binary + elfhdr->e_phoff);
    	eph = ph + elfhdr->e_phnum;
    	lcr3(PADDR(e->env_pgdir));
    	for (; ph < eph; ph++) {
    		if(ph->p_type == ELF_PROG_LOAD) {
    			region_alloc(e, (void *)ph->p_va, ph->p_memsz);
    			memcpy((void *)(ph->p_va), (void *)(binary + ph->p_offset), ph->p_filesz);
    		}
    	}
    	e->env_tf.tf_eip = elfhdr->e_entry;
    	lcr3(PADDR(kern_pgdir));
    	// Now map one page for the program's initial stack
    	// at virtual address USTACKTOP - PGSIZE.
    	// LAB 3: Your code here.
    	region_alloc(e, (void *)(USTACKTOP - PGSIZE), PGSIZE);
    }
  • env_create

    就是结合上面的两个函数,先env_allocload_icode

    // Allocates a new env with env_alloc, loads the named elf
    // binary into it with load_icode, and sets its env_type.
    // This function is ONLY called during kernel initialization,
    // before running the first user-mode environment.
    // The new env's parent ID is set to 0.
    //
    void
    env_create(uint8_t *binary, enum EnvType type)
    {
    	// LAB 3: Your code here.
    	int r;
    	struct Env *e = NULL;
    	if((r = env_alloc(&e, 0)) < 0)
    		panic("env_create: %e", r);
    	load_icode(e, binary);
    	e->env_type = type;
    }
  • env_run()

    按照注释的要求一步一步写就好了。注意别忘了最开始curenv可能是NULL

    // Context switch from curenv to env e.
    // Note: if this is the first call to env_run, curenv is NULL.
    //
    // This function does not return.
    //
    void
    env_run(struct Env *e)
    {
    	// Step 1: If this is a context switch (a new environment is running):
    	//	   1. Set the current environment (if any) back to
    	//	      ENV_RUNNABLE if it is ENV_RUNNING (think about
    	//	      what other states it can be in),
    	//	   2. Set 'curenv' to the new environment,
    	//	   3. Set its status to ENV_RUNNING,
    	//	   4. Update its 'env_runs' counter,
    	//	   5. Use lcr3() to switch to its address space.
    	// Step 2: Use env_pop_tf() to restore the environment's
    	//	   registers and drop into user mode in the
    	//	   environment.
    
    	// Hint: This function loads the new environment's state from
    	//	e->env_tf.  Go back through the code you wrote above
    	//	and make sure you have set the relevant parts of
    	//	e->env_tf to sensible values.
    
    	// LAB 3: Your code here.
    	if(curenv && curenv->env_status == ENV_RUNNING)
    		curenv->env_status = ENV_RUNNABLE;
    	curenv = e;
    	curenv->env_status = ENV_RUNNING;
    	curenv->env_runs++;
    	lcr3(PADDR(curenv->env_pgdir));
    	env_pop_tf(&(curenv->env_tf));
    	// panic("env_run not yet implemented");
    }

完成了这几步之后运行当启动kernel的时候会进行如下操作:

  • start (kern/entry.S):kernel的entry,也就是boot loader加载kernel的entry
  • i386_init(kern/init.c):上面的entry调用了这个函数,对kernel进行初始化

    • cons_init:初始化console
    • mem_init:初始化kernel address space
    • env_init:初始化所有的环境
    • trap_init (still incomplete at this point):初始化中断
    • env_create:创建一个用户环境
    • env_run:运行用户环境
    • env_pop_tf:从trapframe中还原这个用户环境所需要的寄存器状态。

完成了exercise 2之后因为并没有初始化中断,所以会在user_hello第一次进行system call的时候报triple fault的错。这是因为:When the CPU discovers that it is not set up to handle this system call interrupt, it will generate a general protection exception, find that it can't handle that, generate a double fault exception, find that it can't handle that either, and finally give up with what's known as a "triple fault".

我们可以使用gdb来检测是否进入了用户环境,在env_pop_tf中加断点之后逐步运行可以发现其会运行至地址为0x800020(可能会有出入)的指令,也就是进入了user mode。然后在int $0x30处加断点,之后再运行1步就会进入triple fault了。

Handling Interrupts and Exceptions

我们来完成中断部分。

Exercise 3

读书,在这里就不记录了。

Basics of Protected Control Transfer

Exeception和interrupt都是protected control transfer,其在用户代码不能干涉kernel的情况下,让处理器进入kernel mode。在intel的术语中,interrupt是由处理器外部的异步事件,如IO引起的,而exception是同步运行的代码引起的,如除0或者page fault。

之前提到过,为了能够做到protected,处理器的中断机制让用户只能进入几个固定的kernel位置。在x86中,由2种机制可以确保这种protection。

  • The Interrupt Descriptor Table (IDT)

    一个在kernel private memory中的表,记录了0~255这256种不同的中断的EIPCS,前者是中断进入的kernel code的位置,后者是中断的privilege level(在JOS中都是0,也就是kernel mode)。

  • The Task State Segment (TSS)

    用于存放中断前的old processor state,用于中断之后还原状态。注意这部分也是存储在kernel stack中的。

    尽管TSS很大,可以有很多功用,JOS仅仅记录中断转移到的kernel stack。处理器用ESP0SS0来定义kernel mode,且JOS中不使用TSS的其他field。

Types of Exceptions and Interrupts

大于31的中断为software interrupt或hardware interrupt,前者是可以用int指令进入,后者是外部硬件发出的。

在这节里,我们会拓展JOS使其可以处理它自己产生的0~31中断。之后一节我们会处理48(0x30),也就是system call,注意这个48是随机选的。lab4里面我们会处理硬件中断。

An Example

例如,代码中出现了除0,那么:

  1. processor会通过TSS中的ESP0SS0来切换到kernel stack。在JOS中,这两个值分别是GD_KDKSTACKTOP
  2. 处理器会把exception parameter推进kernel stack,其地址始于KSTACKTOP

    +--------------------+ KSTACKTOP             
    | 0x00000 | old SS   |     " - 4
    |      old ESP       |     " - 8
    |     old EFLAGS     |     " - 12
    | 0x00000 | old CS   |     " - 16
    |      old EIP       |     " - 20 <---- ESP 
    +--------------------+             
  3. 对于除0这种情况,在x86中对应的是vector 0,处理器会读取IDT中entry 0,并设置对应的CS:IP
  4. 最后会运行这个exception对应的handler,例如结束程序。

对于一些特殊的exception,除了会推入上述的5个words,处理器还会退入error code。可以阅读80386的manual来查看不同的error code意味着什么。

+--------------------+ KSTACKTOP             
| 0x00000 | old SS   |     " - 4
|      old ESP       |     " - 8
|     old EFLAGS     |     " - 12
| 0x00000 | old CS   |     " - 16
|      old EIP       |     " - 20
|     error code     |     " - 24 <---- ESP
+--------------------+             

Nested Exceptions and Interrupts

中断即可以在user mode中产生,也可以从kernel mode中产生。但是x86处理器只会在从user到kernel的过程中自动保存old register state。如果发生中断时已经在kernel里了,CPU只会继续向同样的kernel stack中推入值,从而使kernel可以处理嵌套的中断。

具体来说,因为不需要换栈,所以就不需要保存SSESP,所以handler眼中的第二个中断对应的stack就会是这样:

+--------------------+ <---- old ESP
|     old EFLAGS     |     " - 4
| 0x00000 | old CS   |     " - 8
|      old EIP       |     " - 12
+--------------------+ 

Setting Up the IDT

我们来设置0~31的IDT。我们需要用到的一些定义在inc/trap.hkern/trap.h中。

注意,0~31中的有一些中断已经被intel保留了,所以处理器永远都不会产生这些中断,怎么处理都行。

整个的控制方式应该如下:

      IDT                   trapentry.S         trap.c
   
+----------------+                        
|   &handler1    |---------> handler1:          trap (struct Trapframe *tf)
|                |             // do stuff      {
|                |             call trap          // handle the exception/interrupt
|                |             // ...           }
+----------------+
|   &handler2    |--------> handler2:
|                |            // do stuff
|                |            call trap
|                |            // ...
+----------------+
       .
       .
       .
+----------------+
|   &handlerX    |--------> handlerX:
|                |             // do stuff
|                |             call trap
|                |             // ...
+----------------+

每个中断都应该在trapentry.Strap_init()中有其对应的地址。

Exercise 4

这部分我主要是通过和xv6的对应部分对照着写的。

首先写trapentry.S,这个文件分为两部分,第一是写handler:

/*
 * Lab 3: Your code here for generating entry points for the different traps.
 */
TRAPHANDLER_NOEC(T_DIVIDE_handler, T_DIVIDE)
TRAPHANDLER_NOEC(T_DEBUG_handler, T_DEBUG)
TRAPHANDLER_NOEC(T_NMI_handler, T_NMI)
TRAPHANDLER_NOEC(T_BRKPT_handler, T_BRKPT)
TRAPHANDLER_NOEC(T_OFLOW_handler, T_OFLOW)
TRAPHANDLER_NOEC(T_BOUND_handler, T_BOUND)
TRAPHANDLER_NOEC(T_ILLOP_handler, T_ILLOP)
TRAPHANDLER_NOEC(T_DEVICE_handler, T_DEVICE)
TRAPHANDLER(T_DBLFLT_handler, T_DBLFLT)
TRAPHANDLER(T_TSS_handler, T_TSS)
TRAPHANDLER(T_SEGNP_handler, T_SEGNP)
TRAPHANDLER(T_STACK_handler, T_STACK)
TRAPHANDLER(T_GPFLT_handler, T_GPFLT)
TRAPHANDLER(T_PGFLT_handler, T_PGFLT)
TRAPHANDLER_NOEC(T_FPERR_handler, T_FPERR)
TRAPHANDLER(T_ALIGN_handler, T_ALIGN)
TRAPHANDLER_NOEC(T_MCHK_handler, T_MCHK)
TRAPHANDLER_NOEC(T_SIMDERR_handler, T_SIMDERR)
TRAPHANDLER_NOEC(T_SYSCALL_handler, T_SYSCALL)

具体是使用TRAPHANDLER还是TRAPHANDLER_NOEC可以对照xv6的vector.S文件。

然后是写_alltrap:

/*
 * Lab 3: Your code here for _alltraps
 */
  # vectors.S sends all traps here.
_alltraps:
  # Build trap frame.
  pushl %ds
  pushl %es
  pushal
  
  # Set up data segments.
  movw $GD_KD, %ax
  movw %ax, %ds
  movw %ax, %es

  # Call trap(tf), where tf=%esp
  pushl %esp
  call trap
  addl $4, %esp

  popal
  popl %es
  popl %ds
  addl $0x8, %esp  # trapno and errcode
  iret

注意要对照着inc/trap.h中的Trapframe的定义来写,同时要参照xv6中的trapasm.Sx86.h(有trapframe的定义)来写。最后是trap_init()。因为在trapentry.S中只有函数名是全局变量,所以只能重复性的写很多...

void
trap_init(void)
{
	extern struct Segdesc gdt[];

	// LAB 3: Your code here.
	void T_DIVIDE_handler();
	void T_DEBUG_handler();
	void T_NMI_handler();
	void T_BRKPT_handler();
	void T_OFLOW_handler();
	void T_BOUND_handler();
	void T_ILLOP_handler();
	void T_DEVICE_handler();
	void T_DBLFLT_handler();
	void T_TSS_handler();
	void T_SEGNP_handler();
	void T_STACK_handler();
	void T_GPFLT_handler();
	void T_PGFLT_handler();
	void T_FPERR_handler();
	void T_ALIGN_handler();
	void T_MCHK_handler();
	void T_SIMDERR_handler();
	void T_SYSCALL_handler();
	SETGATE(idt[T_DIVIDE], 1, GD_KT, T_DIVIDE_handler, 0);
	SETGATE(idt[T_DEBUG], 1, GD_KT, T_DEBUG_handler, 0);
	SETGATE(idt[T_NMI], 1, GD_KT, T_NMI_handler, 0);
	SETGATE(idt[T_BRKPT], 1, GD_KT, T_BRKPT_handler, 0);
	SETGATE(idt[T_OFLOW], 1, GD_KT, T_OFLOW_handler, 0);
	SETGATE(idt[T_BOUND], 1, GD_KT, T_BOUND_handler, 0);
	SETGATE(idt[T_ILLOP], 1, GD_KT, T_ILLOP_handler, 0);
	SETGATE(idt[T_DEVICE], 1, GD_KT, T_DEVICE_handler, 0);
	SETGATE(idt[T_DBLFLT], 1, GD_KT, T_DBLFLT_handler, 0);
	SETGATE(idt[T_TSS], 1, GD_KT, T_TSS_handler, 0);
	SETGATE(idt[T_SEGNP], 1, GD_KT, T_SEGNP_handler, 0);
	SETGATE(idt[T_STACK], 1, GD_KT, T_STACK_handler, 0);
	SETGATE(idt[T_GPFLT], 1, GD_KT, T_GPFLT_handler, 0);
	SETGATE(idt[T_PGFLT], 1, GD_KT, T_PGFLT_handler, 0);
	SETGATE(idt[T_FPERR], 1, GD_KT, T_FPERR_handler, 0);
	SETGATE(idt[T_ALIGN], 1, GD_KT, T_ALIGN_handler, 0);
	SETGATE(idt[T_MCHK], 1, GD_KT, T_MCHK_handler, 0);
	SETGATE(idt[T_SIMDERR], 1, GD_KT, T_SIMDERR_handler, 0);
	SETGATE(idt[T_SYSCALL], 0, GD_KT, T_SYSCALL_handler, 3);
	// Per-CPU setup 
	trap_init_percpu();
}

然后运行make grade,就通过了Part A。注意这里的代码虽然可以通过lab3,但是到lab4会出问题...因为其istrap参数的问题,详情请见lab4。

回答两个问题:

  • 为什么要每个中断一个handler?那样就不能分开设置SETGATE中的trapit了,也就是不能区分exception和interruption了,同时也不能给不同的中断设置不同的中断等级了。
  • 为什么user/softint中的int $14会进入vector 13?14的privilege level是0,也就是user不能调用,在上面的代码中不是$30都会被识别为general protection fault,也就是中断13。

Part B: Page Faults, Breakpoints Exceptions, and System Calls

处理其他的中断。

Handling Page Faults

处理page fault,也就是14。当发生page fault的时候,处理器会把产生错误的地址存在CR2寄存器中。

Exercise 5

trap_dispatch()里面加入page_fault_handler()

static void
trap_dispatch(struct Trapframe *tf)
{
	// Handle processor exceptions.
	// LAB 3: Your code here.
	switch(tf->tf_trapno) {
		case T_PGFLT:
			page_fault_handler(tf);
			return;
		default:
			break;
	}
	// Unexpected trap: The user process or the kernel has a bug.
	print_trapframe(tf);
	if (tf->tf_cs == GD_KT)
		panic("unhandled trap in kernel");
	else {
		env_destroy(curenv);
		return;
	}
}

The Breakpoint Exception

Exercise 6

对于断点中断,需要调用的是kern/monitor.c中的monitor函数,不过注意,因为breakpoint.c中是通过直接触法来进行测试的,所以需要把断点的等级调为3

SETGATE(idt[T_BRKPT], 1, GD_KT, T_BRKPT_handler, 3);

然后trap_dispatch为:

	switch(tf->tf_trapno) {
		case T_PGFLT:
			page_fault_handler(tf);
			return;
		case T_BRKPT:
			monitor(tf);
			return;
		default:
			break;
	}

System calls

在JOS中,我们使用int $0x30来进行system call。应用会自己把system call需要的参数以及其编号川籍来,所以kernel就不需要去操作用户环境或者instruction stream了。system call number会在%eax, 参数(前5个)会在 %edx, %ecx, %ebx, %edi, 和 %esi。同样,kernel会把返回值存在%eax中。syscall函数在lb/syscall.c中。

static inline int32_t
syscall(int num, int check, uint32_t a1, uint32_t a2, uint32_t a3, uint32_t a4, uint32_t a5)
{
	int32_t ret;

	// Generic system call: pass system call number in AX,
	// up to five parameters in DX, CX, BX, DI, SI.
	// Interrupt kernel with T_SYSCALL.
	//
	// The "volatile" tells the assembler not to optimize
	// this instruction away just because we don't use the
	// return value.
	//
	// The last clause tells the assembler that this can
	// potentially change the condition codes and arbitrary
	// memory locations.

	asm volatile("int %1\n"
		     : "=a" (ret)
		     : "i" (T_SYSCALL),
		       "a" (num),
		       "d" (a1),
		       "c" (a2),
		       "b" (a3),
		       "D" (a4),
		       "S" (a5)
		     : "cc", "memory");

	if(check && ret > 0)
		panic("syscall %d returned %d (> 0)", num, ret);

	return ret;
}

上面的这种写法叫gcc内联汇编,感兴趣的同学可以取查一下。

注意这里的和xv6的对比,明显JOS比xv6要简单很多,并没有通过用户的stack(esp)来掏出来参数,而且JOS也没有myproc这样一个全局状态。

Exercise 7

加入system call的handler。由于我们已经加过了基本设置,所以只需要修改trap_dispatch()kern/syscall.c中的syscall()了。

首先是trap_dispatch():

	switch(tf->tf_trapno) {
		case T_PGFLT:
			page_fault_handler(tf);
			return;
		case T_BRKPT:
			monitor(tf);
			return;
		case T_SYSCALL:
			tf->tf_regs.reg_eax = syscall(
				tf->tf_regs.reg_eax, tf->tf_regs.reg_edx,
				tf->tf_regs.reg_ecx, tf->tf_regs.reg_ebx,
				tf->tf_regs.reg_edi, tf->tf_regs.reg_esi
			);
			return;
		default:
			break;
	}

注意别忘了用返回值更新eax

其次是syscall():

// Dispatches to the correct kernel function, passing the arguments.
int32_t
syscall(uint32_t syscallno, uint32_t a1, uint32_t a2, uint32_t a3, uint32_t a4, uint32_t a5)
{
	// Call the function corresponding to the 'syscallno' parameter.
	// Return any appropriate return value.
	// LAB 3: Your code here.

	// panic("syscall not implemented");

	switch (syscallno) {
		case SYS_cputs:
			sys_cputs((char *)a1, (size_t)a2);
			return;
		case SYS_cgetc:
			return sys_cgetc();
		case SYS_getenvid:
			return sys_getenvid();
		case SYS_env_destroy:
			return sys_env_destroy((envid_t)a1);
		default:
			return -E_INVAL;
	}
}

User-mode startup

用户应用会从lib/entry进入,然后调用lib/libmain.c中的libmain(),之后libmain会调用umain也就是进入了比如hello这样的函数中。我们希望能够在用户应用中使用thisenv也就是当前的环境状态。由于我们已经有了sys_getenvid()这样的函数,这个函数在lib/syscall.c中被声明,用来掉system call中的SYS_getenvid。有了envid之后,因为从inc/env.h中得知:

// An environment ID 'envid_t' has three parts:
//
// +1+---------------21-----------------+--------10--------+
// |0|          Uniqueifier             |   Environment    |
// | |                                  |      Index       |
// +------------------------------------+------------------+
//                                       \--- ENVX(eid) --/
//
// The environment index ENVX(eid) equals the environment's index in the
// 'envs[]' array.  The uniqueifier distinguishes environments that were
// created at different times, but share the same environment index.
//
// All real environments are greater than 0 (so the sign bit is zero).
// envid_ts less than 0 signify errors.  The envid_t == 0 is special, and
// stands for the current environment.

#define LOG2NENV		10
#define NENV			(1 << LOG2NENV)
#define ENVX(envid)		((envid) & (NENV - 1))

我们只需要取后10位就可以得到当前环境在envs中的序号了,所以有:

thisenv = &envs[ENVX(sys_getenvid())];

Page faults and memory protection

内存保护是操作系统非常重要的一部分,也是保证bug不能破坏其他程序或者kernel的一个重要手段。

操作系统通常通过硬件来实现内存保护。OS让硬件知道哪些虚拟地址是可以访问的,哪些不行。当一个程序试图访问非法地址的之后,处理器会trap。如果问题可以结局,那么kernel就会解决这个问题并让程序继续运行,如果不行,那么程序就不会继续运行。

一个常见的解决方法是自动扩充stack。一般默认就分配一个page作为用户的stack,如果触发了page fault,就自动再进行分配。

system call会导致一个很有趣的问题。很多system call允许用户传指针进kernel,这些指针会指向读写的buffer。这种做法有两个问题:

  • kernel中的page fault会比user program中的严重许多。如果kernel中的page fault不能解决,那么就会panic整个系统。但是事实上,在上面谈到的问题里,那些buffer带来的page fault是user program的,而不是kernel的。
  • kernel往往有更强的权限,上面的这个system call可能会泄露一些kernel的private memory。

基于这两个原因,我们需要很谨慎的处理传进kernel的指针。

我们讲用一个机制来解决这两个问题。当程序向kernel传递指针的时候,kernel会检查该指针是不是在用户地址内,以及对应的page table允许内存操作。这样,kernel就不会因为dereference用户指针导致page fault了。

Exercise 9

首先给trap中加上page fault在kernel mode,直接panic:

	switch(tf->tf_trapno) {
		case T_PGFLT:
			if ((tf->tf_cs & 0x3) == 0)
				panic("page fault in kernel");
			page_fault_handler(tf);
			return;

之后补全kern/pmap.c中的user_mem_check

// Check that an environment is allowed to access the range of memory
// [va, va+len) with permissions 'perm | PTE_P'.
// Normally 'perm' will contain PTE_U at least, but this is not required.
// 'va' and 'len' need not be page-aligned; you must test every page that
// contains any of that range.  You will test either 'len/PGSIZE',
// 'len/PGSIZE + 1', or 'len/PGSIZE + 2' pages.
//
// A user program can access a virtual address if (1) the address is below
// ULIM, and (2) the page table gives it permission.  These are exactly
// the tests you should implement here.
//
// If there is an error, set the 'user_mem_check_addr' variable to the first
// erroneous virtual address.
//
// Returns 0 if the user program can access this range of addresses,
// and -E_FAULT otherwise.
//
int
user_mem_check(struct Env *env, const void *va, size_t len, int perm)
{
	// LAB 3: Your code here.
	uintptr_t v = ROUNDDOWN((uintptr_t)va, PGSIZE);
	uintptr_t end = ROUNDUP((uintptr_t)va + len, PGSIZE);
	for(;v < end; v += PGSIZE) {
		pte_t *pte = pgdir_walk(env->env_pgdir, (void *)v, 0);
		if (!pte || (*pte & perm) != perm) {
			if(v < (uintptr_t)va)
				user_mem_check_addr = (uintptr_t)va;
			else
				user_mem_check_addr = v;
			return -E_FAULT;
		}
	}
	return 0;
}

注意需要返回的是这区间里的第一个地址,所以如果vva小,返回的应该是va

然后修改syscall.c中的sys_cputs以检查指针。

static void
sys_cputs(const char *s, size_t len)
{
	// Check that the user has permission to read memory [s, s+len).
	// Destroy the environment if not.

	// LAB 3: Your code here.
	user_mem_assert(curenv, s, len, PTE_U);
	// Print the string supplied by the user.
	cprintf("%.*s", len, s);
}

之后,为了在breakpoint中实现backtrace功能,在kern/kdebug.cdebuginfo_eip()中加入如下代码:

	// Find the relevant set of stabs
	if (addr >= ULIM) {
		stabs = __STAB_BEGIN__;
		stab_end = __STAB_END__;
		stabstr = __STABSTR_BEGIN__;
		stabstr_end = __STABSTR_END__;
	} else {
		// The user-application linker script, user/user.ld,
		// puts information about the application's stabs (equivalent
		// to __STAB_BEGIN__, __STAB_END__, __STABSTR_BEGIN__, and
		// __STABSTR_END__) in a structure located at virtual address
		// USTABDATA.
		const struct UserStabData *usd = (const struct UserStabData *) USTABDATA;

		// Make sure this memory is valid.
		// Return -1 if it is not.  Hint: Call user_mem_check.
		// LAB 3: Your code here.
		if(user_mem_check(curenv, (void *)usd, sizeof(struct UserStabData), PTE_U))
			return -1;

		stabs = usd->stabs;
		stab_end = usd->stab_end;
		stabstr = usd->stabstr;
		stabstr_end = usd->stabstr_end;

		// Make sure the STABS and string table memory is valid.
		// LAB 3: Your code here.
		if(user_mem_check(curenv, (void *)stabs, stab_end - stabs, PTE_U))
			return -1;
		if(user_mem_check(curenv, (void *)stabstr, stabstr_end - stabstr, PTE_U))
			return -1;
	}

之后运行make run-breakpoint-nox进入中断之后,如果运行bracktrack就会有如下结果:

K> backtrace
Stack backtrace:
  ebp efffff00  eip f0100ad7  args 00000001 efffff28 f01d2000 f0106781 f011af48
      kern/monitor.c:151: monitor+353
  ebp efffff80  eip f010429b  args f01d2000 efffffbc f0150508 00000092 f011afd8
      kern/trap.c:191: trap+282
  ebp efffffb0  eip f0104389  args efffffbc 00000000 00000000 eebfdfc0 efffffdc
      kern/trapentry.S:87: <unknown>+0
  ebp eebfdfc0  eip 00800087  args 00000000 00000000 eebfdff0 00800058 00000000
      lib/libmain.c:25: libmain+78
  ebp eebfdff0  eip 00800031  args 00000000 00000000Incoming TRAP frame at 0xeffffe64
kernel panic at kern/trap.c:187: page fault in kernel

这里为什么没有搞懂。。。

Exercise 10

当完成exercise 9的时候,exercise 10自动完成了。9和10的唯一区别就是传入的指针,一个是未分配的,另外一个是传入了对应kernel部分的地址,这两者都可以用上面的检查方法搞定。

最后来运行一下make grade

divzero: OK (1.0s)
softint: OK (0.9s)
badsegment: OK (1.0s)
Part A score: 30/30

faultread: OK (1.0s)
faultreadkernel: OK (2.0s)
faultwrite: OK (1.1s)
faultwritekernel: OK (1.8s)
breakpoint: OK (1.1s)
testbss: OK (1.9s)
hello: OK (2.1s)
buggyhello: OK (2.0s)
buggyhello2: OK (2.2s)
evilhello: OK (1.8s)
Part B score: 50/50

Score: 80/80