mm/swap_state.c

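    When a page has to interact with an external storage device, it needs an
address_space; for swap that is swapper_space. It also needs a set of
address_space_operations; for swap that is swap_aops.
    The page cache, swap cache, shmem and filemap all follow this pattern.
    
    Setting up these two structures only solves the problem of writing pages
out. Reading pages in relies on direct function calls made from
handle_pte_fault: for swap that is do_swap_page, while file mappings (mmap)
go through vma->vm_ops->nopage. There is no unified solution.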
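
    The asymmetry described above can be pictured with a small user-space
sketch. Everything in it is hypothetical: the *_model types and functions only
borrow the kernel's names, and read_in_path simply returns the name of the path
the notes say is taken. It is an illustration of the dispatch shape, not kernel
code.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical miniature types: the names echo the kernel's, nothing more. */
struct page_model;

struct aops_model {
	int (*writepage)(struct page_model *page);
};

struct mapping_model {
	const struct aops_model *a_ops;
};

struct page_model {
	struct mapping_model *mapping;
	int written;
};

static int swap_writepage_model(struct page_model *page)
{
	page->written = 1;	/* stands in for starting swap I/O */
	return 0;
}

static const struct aops_model swap_aops_model = {
	.writepage = swap_writepage_model,
};

static struct mapping_model swapper_space_model = {
	.a_ops = &swap_aops_model,
};

/* Write-out: one generic call site, whatever kind of page it is. */
static int write_out(struct page_model *page)
{
	return page->mapping->a_ops->writepage(page);
}

/* Read-in: the fault path chooses the function by hand. */
enum fault_kind { FAULT_SWAPPED_PTE, FAULT_EMPTY_PTE_WITH_NOPAGE };

static const char *read_in_path(enum fault_kind kind)
{
	switch (kind) {
	case FAULT_SWAPPED_PTE:
		return "do_swap_page";
	case FAULT_EMPTY_PTE_WITH_NOPAGE:
		return "vma->vm_ops->nopage";
	}
	return "unknown";
}

/* Usage: the write side needs no knowledge of the page type. */
void dispatch_examples(void)
{
	struct page_model page = { &swapper_space_model, 0 };

	assert(write_out(&page) == 0 && page.written == 1);
	assert(strcmp(read_in_path(FAULT_SWAPPED_PTE), "do_swap_page") == 0);
}
```

    The write side is reached through page->mapping alone, while the read side
has to know which kind of fault it is handling.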
             
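     I do not intend to analyse these in much detail. The focus here is on
physical memory pages, page->count, and the reference count of a swap entry.
(Is it really necessary to go through them function by function here?)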
               
               
               
               
               
               
               
               
                            Where do pages go?
                  
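   Look at page_alloc.c and the buddy system: every physical page is managed
by the buddy allocator (except reserved pages, which are device memory or
serve special purposes). To see where pages go, it is enough to follow the
callers of the allocation functions.
   Allocation interfaces provided by page_alloc.c (only these are actually
used in 2.4):
     
     1. alloc_pages (called by page_cache_alloc)
        Pages that leave through this interface end up in the page cache (or
        swap cache) and cache the contents of files on disk (or something
        disk-like). The concrete users are: swap cache, page cache, file read
        (page cache), filemap (page cache, or a copy of a page cache page),
        COW (may not be in the page cache), shmem_nopage (page cache).
     
     2. __get_free_pages
         Widely used by drivers, kernel hash tables, the task struct,
         networking (hash tables etc.), buffers (file system metadata, block
         device file reads and writes; there are also on-the-fly buffers
         created for pages that need I/O, but those pages do not come from
         __get_free_pages), and slab.
         
     3. __get_free_page
         Page tables (page directory, pmd), drivers, user argument pages.
     
     4. get_zeroed_page
         Drivers (tty), shmem (to build the direct/indirect mapping tables,
         which live in the kernel and never deal with a backing store).
   
     5. alloc_page
         Buffers, string argument pages, anonymous pages (page fault),
         vmalloc (kernel pages, never swapped).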
         
     
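     Now we can answer the question: where does all the physical memory go?
     (FIXME: I think everything is listed here.)
     1) The kernel's "own" use
        Drivers, networking, page tables (kernel or process), various hash
        tables, arguments copied from user space, slab caches for kernel data
        structures (inode, dentry, ...), shmem mapping tables, and the kernel
        pages used by vmalloc.
        
     2) page cache / swap cache
        A page can be in only one of these two caches. They cache the
        contents (not the metadata) of files on disk, including regular
        files, filemap (shared) and shmem.
       
     3) buffers
        Cache file system metadata and serve as the cache for device file
        reads and writes. This does not include the on-the-fly buffers
        created to perform page I/O, although those also do page->count++.
         
     4) User process pages
        This is a mixture. Pages used by a process may also be in the page
        cache or swap cache, and may have buffers. Besides these owned pages,
        a process also uses anonymous pages, i.e. pages with no mapping:
        process pages that have not yet entered swap, filemap pages that are
        not shared, and COW pages.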
       
     
     
                      
                      
                                page->count  
                           
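  First, a comment from mm.h that is worth reading. Note that it is badly out
of date: inode->i_pages no longer exists in 2.4, and the sentence "For pages
belonging to inodes, the page->count is the number of attaches, plus 1 if
buffers are allocated to the page." is no longer correct. (Like our own
documentation, it has not been updated for a long time; in 2.6 it is
acceptable.)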

/*
 * Various page->flags bits:
 *
 * PG_reserved is set for a page which must never be accessed (which
 * may not even be present).
 *
 * PG_DMA has been removed, page->zone now tells exactly wether the
 * page is suited to do DMAing into.
 *
 * Multiple processes may "see" the same page. E.g. for untouched
 * mappings of /dev/null, all processes see the same page full of
 * zeroes, and text pages of executables and shared libraries have
 * only one copy in memory, at most, normally.
 *
 * For the non-reserved pages, page->count denotes a reference count.
 *   page->count == 0 means the page is free.
 *   page->count == 1 means the page is used for exactly one purpose
 *   (e.g. a private data page of one process).
 *
 * A page may be used for kmalloc() or anyone else who does a
 * __get_free_page(). In this case the page->count is at least 1, and
 * all other fields are unused but should be 0 or NULL. The
 * management of this page is the responsibility of the one who uses
 * it.
 *
 * The other pages (we may call them "process pages") are completely
 * managed by the Linux memory manager: I/O, buffers, swapping etc.
 * The following discussion applies only to them.
 *
 * A page may belong to an inode's memory mapping. In this case,
 * page->inode is the pointer to the inode, and page->offset is the
 * file offset of the page (not necessarily a multiple of PAGE_SIZE).
 *
 * A page may have buffers allocated to it. In this case,
 * page->buffers is a circular list of these buffer heads. Else,
 * page->buffers == NULL.
 *
 * For pages belonging to inodes, the page->count is the number of
 * attaches, plus 1 if buffers are allocated to the page.
 *
 * All pages belonging to an inode make up a doubly linked list
 * inode->i_pages, using the fields page->next and page->prev. (These
 * fields are also used for freelist management when page->count==0.)
 * There is also a hash table mapping (inode,offset) to the page
 * in memory if present. The lists for this hash table use the fields
 * page->next_hash and page->pprev_hash.
 *
 * All process pages can do I/O:
 * - inode pages may need to be read from disk,
 * - inode pages which have been modified and are MAP_SHARED may need
 *   to be written to disk,
 * - private pages which have been modified may need to be swapped out
 *   to swap space and (later) to be read back into memory.
 * During disk I/O, PG_locked is used. This bit is set before I/O
 * and reset when I/O completes. page->wait is a wait queue of all
 * tasks waiting for the I/O on this page to complete.
 * PG_uptodate tells whether the page's contents is valid.
 * When a read completes, the page becomes uptodate, unless a disk I/O
 * error happened.
 *
 * For choosing which pages to swap out, inode pages carry a
 * PG_referenced bit, which is set any time the system accesses
 * that page through the (inode,offset) hash table.
 *
 * PG_skip is used on sparc/sparc64 architectures to "skip" certain
 * parts of the address space.
 *
 * PG_error is set to indicate that an I/O error occurred on this page.
 *
 * PG_arch_1 is an architecture specific page state bit.  The generic
 * code guarentees that this bit is cleared for a page when it first
 * is entered into the page cache.
 */

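  Based on the page flows analysed above, page->count can be summarised
  roughly as follows:
  1) Pages of the first kind, used by the kernel itself, generally have a
     count of 1. (FIXME)
  
  2) Being in the page cache or swap cache adds 1; having buffers adds 1.
  
  3) User processes: each process adds 1.
  
  4) Many places take a reference temporarily to keep a page from being freed
     and drop it again shortly afterwards. These are ignored here.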
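
  The four rules can be turned into a small arithmetic model. This is a
  user-space sketch only: struct page_model and expected_count are
  hypothetical names, not kernel structures, and the model assumes the rules
  above are the only sources of references.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of where references to one page come from. */
struct page_model {
	bool kernel_owned;	/* rule 1: kernel's own page, +1 */
	bool in_cache;		/* rule 2: page cache or swap cache, +1 */
	bool has_buffers;	/* rule 2: buffers attached, +1 */
	int mappers;		/* rule 3: +1 per process using the page */
	int temp_refs;		/* rule 4: short-lived get/put pairs */
};

static int expected_count(const struct page_model *p)
{
	return (p->kernel_owned ? 1 : 0)
	     + (p->in_cache ? 1 : 0)
	     + (p->has_buffers ? 1 : 0)
	     + p->mappers
	     + p->temp_refs;
}

/* Usage: a few typical pages and the count the rules predict. */
void count_examples(void)
{
	struct page_model slab_page = { .kernel_owned = true };
	struct page_model anon_page = { .mappers = 1 };
	struct page_model file_page = {
		.in_cache = true, .has_buffers = true, .mappers = 2,
	};

	assert(expected_count(&slab_page) == 1);
	assert(expected_count(&anon_page) == 1);
	assert(expected_count(&file_page) == 4);
}
```

  A page cache page with buffers that two processes map therefore carries
  1 + 1 + 2 = 4 references in this model.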
  
  
  
                       
                          page->count: a worked example

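  I chose the function is_page_shared for a detailed analysis. In October
2002 linuxforum was very lively, and the discussion of this function was
drowned in the flood. The curiosity and the arguments about page->count have
never stopped, though. Perhaps forums abroad no longer see active discussion
of such questions, but we will carry on.
  Please read the comments carefully.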
/*
 * Work out if there are any other processes sharing this page, ignoring
 * any page reference coming from the swap cache, or from outstanding
 * swap IO on this page.  (The page cache _does_ count as another valid
 * reference to the page, however.)
 */
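 /* I) Sources of page references in this situation:
  *   1. processes, one per process  2. swap cache or page cache, one
  *   3. buffers, one
  *  
  * II) The swap entry associated with the page:
  *     The page has been added to the swap cache. When the reference count
  *  of its swap entry is not 1 (e.g. tmpfs), somewhere else still expects to
  *  find this page through the swap entry (tmpfs). This amounts to one more
  *  anonymous reference to the page.
  *
  * III) The page cache counts as "another process" (see the English comment
  *  above).
  */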
static inline int is_page_shared(struct page *page)
{
	unsigned int count;
	if (PageReserved(page))
		return 1;
	count = page_count(page); /* the page's own reference count */

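   /*  II) The page is in the swap cache (not the page cache):
    *      references from all processes =
    *          page count + swap entry count - (swap's own references)
    *      Swap's own references are:
    *        1 for the swap cache's reference to the page, 1 for the page's
    *        reference to the swap entry and, if there are buffers, 1 that is
    *        counted as swap's reference (it is not a process in any case).
    */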
	if (PageSwapCache(page))
		count += swap_count(page) - 2 - !!page->buffers;

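    /* III) The page is in the page cache, or in no cache at all:
     *    In this case, if the page has buffers it must belong to the page
     *  cache (filemap); otherwise why would a process page need to be
     *  written to disk?
     *    references from processes + page cache (with bound buffers)
     *        = page count
     */

	 /* If the page is in the swap cache, one of the remaining counts is the
	  * current process, so only count > 1 means that other processes are
	  * using this page.
	  */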
	 return  count > 1;
}
  
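  Its meaning has been analysed above. Now for the precondition and the way
  it is actually used.
  The function assumes that some process is already using the page (holding
  one reference); that is the precondition. It is called from three places:
  1. do_wp_page: the goal is pte_mkwrite. The reference count is known: it is
     2 if only the swap cache references the page (there will be no buffers),
     in which case the operation is safe. The function fits.
  2. do_swap_page: the page is certainly in the swap cache, and even if it
     has buffers the read has already completed, so the buffers reference can
     be subtracted.
  3. memory.c: free_pte -> free_page_and_swap_cache (in every case a process
     wants to release its own pte). The page is known to be in the swap
     cache, and its buffers will also be released afterwards (with the page
     locked). This case is probably what the function was originally written
     for.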
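
  The arithmetic in is_page_shared is easier to check with concrete numbers.
  The sketch below is a user-space model, not the kernel function: struct
  page_model and model_is_page_shared are hypothetical names, and the counts
  in the examples assume the accounting described in these notes (one
  reference per mapping process, one for the cache, one for buffers, and one
  swap entry reference held by the swap cache).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fields is_page_shared() looks at. */
struct page_model {
	bool reserved;
	bool in_swap_cache;
	bool has_buffers;
	int count;	/* models page_count(page) */
	int swap_count;	/* models swap_count(page), refs on the swap entry */
};

static int model_is_page_shared(const struct page_model *p)
{
	int count;

	if (p->reserved)
		return 1;
	count = p->count;
	if (p->in_swap_cache)
		count += p->swap_count - 2 - (p->has_buffers ? 1 : 0);
	return count > 1;
}

void shared_examples(void)
{
	/* one process + swap cache; entry held only by the swap cache */
	struct page_model private_page = {
		.in_swap_cache = true, .count = 2, .swap_count = 1,
	};
	/* same, but buffers are attached: 3 + 1 - 2 - 1 = 1 */
	struct page_model with_buffers = {
		.in_swap_cache = true, .has_buffers = true,
		.count = 3, .swap_count = 1,
	};
	/* someone else (e.g. tmpfs) still holds the swap entry: 2 + 2 - 2 = 2 */
	struct page_model entry_shared = {
		.in_swap_cache = true, .count = 2, .swap_count = 2,
	};
	/* page cache page mapped by one process: the cache counts as a sharer */
	struct page_model file_page = { .count = 2 };

	assert(!model_is_page_shared(&private_page));
	assert(!model_is_page_shared(&with_buffers));
	assert(model_is_page_shared(&entry_shared));
	assert(model_is_page_shared(&file_page));
}
```

  The last case is the one the English comment singles out: the page cache
  reference is not subtracted, so a page cache page with a single user is
  still reported as shared.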
  
  
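  Another example is deactivate_page_nolock; see try_to_swap_out ->
deactivate_page -> deactivate_page_nolock:
   try_to_swap_out: while examining the current process it decides that the
page should be deactivated. If the page is referenced from anywhere other than
the swap cache, the current process and possibly buffers, it is not
deactivated yet; the real deactivation waits until the other process also
decides to deactivate it.
   Also, page_ramdisk pages must not be deactivated, so that ramdisk pages
stay resident in memory.
   refill_inactive_scan is a special case; see the relevant code.
   
   After deactivation the page moves to the LRU inactive_dirty_list. What
happens to the pages on this list?
   page_launder cleans the dirty pages (if it is dirty, wash it! ^_^). Cleaning
requires locking the page, and if another process, or something like tmpfs or
ramdisk, is quietly using the page, the consequences would be disastrous. So a
page that has references other than the swap cache and buffers must not be
laundered. (The caller's extra reference, or the current process's reference,
is dropped with page->count-- after this function returns; see
try_to_swap_out.)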


/**
 * (de)activate_page - move pages from/to active and inactive lists
 * @page: the page we want to move
 * @nolock - are we already holding the pagemap_lru_lock?
 *
 * Deactivate_page will move an active page to the right
 * inactive list, while activate_page will move a page back
 * from one of the inactive lists to the active list. If
 * called on a page which is not on any of the lists, the
 * page is left alone.
 */
void deactivate_page_nolock(struct page * page)
{
	/*
	 * One for the cache, one for the extra reference the
	 * caller has and (maybe) one for the buffers.
	 *
	 * This isn't perfect, but works for just about everything.
	 * Besides, as long as we don't move unfreeable pages to the
	 * inactive_clean list it doesn't need to be perfect...
	 */
	 /* extra reference: the current process or the caller. Keep the
	   * three sources of the ref count in mind and you can apply them
	   * flexibly.
	   */
	int maxcount = (page->buffers ? 3 : 2);
	page->age = 0;
	ClearPageReferenced(page);

	/*
	 * Don't touch it if it's not on the active list.
	 * (some pages aren't on any list at all)
	 */
	if (PageActive(page) && page_count(page) <= maxcount && !page_ramdisk(page)) {
		del_page_from_active_list(page);
		add_page_to_inactive_dirty_list(page);
	}
}	
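
  The maxcount test can be checked in isolation. This is a hypothetical
  user-space model (struct page_model and model_may_deactivate are not kernel
  names); it reproduces only the condition that decides whether the page is
  moved to the inactive_dirty list.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the state deactivate_page_nolock() tests. */
struct page_model {
	bool active;
	bool has_buffers;
	bool ramdisk;
	int count;	/* models page_count(page) */
};

/* True if the page would be moved from the active list to inactive_dirty. */
static bool model_may_deactivate(const struct page_model *p)
{
	/* one for the cache, one for the caller, maybe one for buffers */
	int maxcount = p->has_buffers ? 3 : 2;

	return p->active && p->count <= maxcount && !p->ramdisk;
}

void deactivate_examples(void)
{
	struct page_model only_caller = { .active = true, .count = 2 };
	struct page_model extra_user = { .active = true, .count = 3 };
	struct page_model buffered = {
		.active = true, .has_buffers = true, .count = 3,
	};
	struct page_model ramdisk = {
		.active = true, .ramdisk = true, .count = 2,
	};

	assert(model_may_deactivate(&only_caller));
	assert(!model_may_deactivate(&extra_user));	/* someone else holds it */
	assert(model_may_deactivate(&buffered));
	assert(!model_may_deactivate(&ramdisk));
}
```

  A count of 3 is acceptable only when the third reference is accounted for
  by buffers; otherwise it signals another user and the page stays active.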
  
  
   That is the way to approach page->count.
   
   
   
                         
                        An aside: the reference count of a swap entry
  Here we only analyse how shmem_writepage handles the reference count of the
swap entry.
/*
 * Move the page from the page cache to the swap cache
 * (no actual write is done here; that is left to the swap cache)
 */
 /*  page_launder:page->mapping->a_ops->writepage
   *  filemap_fdatasync-> page->mapping->a_ops->writepage
   */
static int shmem_writepage(struct page * page)
{
	int error;
	struct shmem_inode_info *info;
	swp_entry_t *entry, swap;

  /*
   *  
	 */
	info = &page->mapping->host->u.shmem_i;
	if (info->locked)
		return 1;
	swap = __get_swap_page(2); /* allocate a swap page with 2 refs: the tmpfs mapping table + the swap cache (page->index) */
	if (!swap.val)
		return 1;

	spin_lock(&info->lock);
	/* look up the tmpfs table that records swap entries */
	entry = shmem_swp_entry (info, page->index);
	if (!entry)	/* this had been allocted on page allocation */
		BUG();
	error = -EAGAIN;
	if (entry->val) { /* a swap entry already corresponds to this index */
                __swap_free(swap, 2);
		goto out;
        }

        *entry = swap; /* tmpfs refs the swap entry; the ref is dropped in shmem_unuse-..>shmem_clear_swp */
	error = 0;
	/* Remove the from the page cache */
	lru_cache_del(page);
	remove_inode_page(page);

	/* Add it to the swap cache */
	add_to_swap_cache(page, swap); /* swap cache refs the swap entry; the ref is dropped in try_to_unuse or __delete_from_swap_cache */
	page_cache_release(page);
	set_page_dirty(page);
	info->swapped++;
out:
	spin_unlock(&info->lock);
	UnlockPage(page);
	return error;
}

   In short, the purpose of a ref count is this: whenever an object can be
reached from some place, that place should hold one ref.
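
   The swap entry accounting in shmem_writepage can be followed with a tiny
model. The names below (swap_entry_model and the *_model functions) are
hypothetical, and the sketch keeps only the counter: two references are taken
up front, both are returned on the -EAGAIN path, and otherwise one ends up
owned by the tmpfs table and one by the swap cache.

```c
#include <assert.h>

/* Hypothetical model of one swap entry: only its reference counter. */
struct swap_entry_model {
	int refs;
};

/* Models __get_swap_page(count): a fresh entry with `count` references. */
static void get_swap_page_model(struct swap_entry_model *e, int count)
{
	e->refs = count;
}

/* Models __swap_free(entry, count): drop `count` references. */
static void swap_free_model(struct swap_entry_model *e, int count)
{
	assert(e->refs >= count);
	e->refs -= count;
}

/*
 * Models the counting in shmem_writepage.  slot_already_set says whether
 * the tmpfs table already holds a swap entry for this index.
 * Returns 0 on success, -1 for the -EAGAIN path.
 */
static int shmem_writepage_model(struct swap_entry_model *e,
				 int slot_already_set)
{
	get_swap_page_model(e, 2);
	if (slot_already_set) {
		swap_free_model(e, 2);	/* give both references back */
		return -1;
	}
	/* one ref now belongs to the tmpfs table, one to the swap cache */
	return 0;
}

void swap_entry_examples(void)
{
	struct swap_entry_model e = { 0 };

	assert(shmem_writepage_model(&e, 0) == 0 && e.refs == 2);
	swap_free_model(&e, 1);	/* swap cache lets go of the entry */
	swap_free_model(&e, 1);	/* tmpfs table slot is cleared */
	assert(e.refs == 0);

	assert(shmem_writepage_model(&e, 1) == -1 && e.refs == 0);
}
```

   Each of the two places from which the entry can be reached holds exactly
one reference, and the entry is free only after both have dropped theirs.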