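mm/swap_state.c

Whenever a page has to interact with an external storage device, an address_space must be set up for it; for swap this is swapper_space. An address_space_operations must be supplied as well; for swap this is swap_aops. The page cache, swap cache, shmem and filemap all follow this pattern.

Setting up these two structures only solves the problem of writing pages out. Reading pages in relies on direct function calls from handle_pte_fault: for swap that is do_swap_page, while file maps/mmap go through vma->vm_ops->nopage. There is no unified solution.

I do not intend to analyse these much further. The focus here is on physical memory pages, page->count, and the reference count of swap entries. (Is a function-by-function listing really needed here?)

page: where does it go?

See page_alloc.c and the buddy system: every physical page is managed by the buddy allocator (except reserved pages, which are device memory or serve special purposes). To find out where pages go, it is enough to follow the callers of the allocation functions.

Allocation interfaces provided by page_alloc.c (only these are actually used, in 2.4):

1. alloc_pages (called by page_cache_alloc):
   Pages that flow out of this interface all sit in the page cache (swap cache) and are used to cache disk files (or things that look like disk files). The concrete users are: swap cache, page cache, file read (page cache), filemap (page cache, or a copy from the page cache), COW (may not be in the page cache), shmem_nopage (page cache).
2. __get_free_pages:
   Widely used by drivers, kernel hash tables, the task struct, networking (hashes etc.), buffers (filesystem metadata, block device file reads and writes; there are also on-the-fly buffers created on pages that need I/O, but such pages do not come out of __get_free_pages), and slab.
3. __get_free_page:
   Page tables (pdir, pmd), drivers, user argument pages.
4. get_zeroed_page:
   Drivers (tty), shmem (for the direct/indirect mapping tables it builds in the kernel, which never deal with backing store).
5. alloc_page:
   Buffers, string argument pages, anonymous pages (page fault), vmalloc (kernel pages, never swapped).

Now the question can be answered: where does all the physical memory go? (Fix me; I think everything is listed here.)

1) Used by the kernel 'itself'
   This includes drivers, networking, page tables (kernel or process), various hash tables, arguments copied from user space, slab caches for kernel data structures (inode, dentry, ...), shmem mapping tables, and the kernel pages used by vmalloc.
2) page cache/swap cache
   A page can be in only one of these two caches. They cache the contents (not the metadata) of files on disk, including regular files, filemap (shared), and shmem.
3) buffers
   These cache filesystem metadata and reads and writes of device files. This does not include the on-the-fly buffers created to do page I/O, although those also do page->count++.
4) User process pages
   This is a mixture. Pages used by a process can also sit in the page cache/swap cache and can have buffers. Besides these owned pages, a process also uses so-called anonymous pages, i.e. pages with no mapping. These include process pages that have not yet entered swap, filemap (not shared) pages, and COW pages.

page->count

First, a comment pasted from mm.h that is worth reading. Note that this comment is very old: inode->i_pages no longer exists in 2.4. The sentence "For pages belonging to inodes, the page->count is the number of attaches, plus 1 if buffers are allocated to the page." is no longer correct. (Like our own documentation, it has not been updated for a long time; the 2.6 version is acceptable.)

/*
 * Various page->flags bits:
 *
 * PG_reserved is set for a page which must never be accessed (which
 * may not even be present).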
 *
 * PG_DMA has been removed, page->zone now tells exactly whether the
 * page is suited to do DMAing into.
 *
 * Multiple processes may "see" the same page. E.g. for untouched
 * mappings of /dev/null, all processes see the same page full of
 * zeroes, and text pages of executables and shared libraries have
 * only one copy in memory, at most, normally.
 *
 * For the non-reserved pages, page->count denotes a reference count.
 *   page->count == 0 means the page is free.
 *   page->count == 1 means the page is used for exactly one purpose
 *   (e.g. a private data page of one process).
 *
 * A page may be used for kmalloc() or anyone else who does a
 * __get_free_page(). In this case the page->count is at least 1, and
 * all other fields are unused but should be 0 or NULL. The
 * management of this page is the responsibility of the one who uses
 * it.
 *
 * The other pages (we may call them "process pages") are completely
 * managed by the Linux memory manager: I/O, buffers, swapping etc.
 * The following discussion applies only to them.
 *
 * A page may belong to an inode's memory mapping. In this case,
 * page->inode is the pointer to the inode, and page->offset is the
 * file offset of the page (not necessarily a multiple of PAGE_SIZE).
 *
 * A page may have buffers allocated to it. In this case,
 * page->buffers is a circular list of these buffer heads. Else,
 * page->buffers == NULL.
 *
 * For pages belonging to inodes, the page->count is the number of
 * attaches, plus 1 if buffers are allocated to the page.
 *
 * All pages belonging to an inode make up a doubly linked list
 * inode->i_pages, using the fields page->next and page->prev. (These
 * fields are also used for freelist management when page->count==0.)
 * There is also a hash table mapping (inode,offset) to the page
 * in memory if present. The lists for this hash table use the fields
 * page->next_hash and page->pprev_hash.
 *
 * All process pages can do I/O:
 * - inode pages may need to be read from disk,
 * - inode pages which have been modified and are MAP_SHARED may need
 *   to be written to disk,
 * - private pages which have been modified may need to be swapped out
 *   to swap space and (later) to be read back into memory.
 * During disk I/O, PG_locked is used. This bit is set before I/O
 * and reset when I/O completes. page->wait is a wait queue of all
 * tasks waiting for the I/O on this page to complete.
 * PG_uptodate tells whether the page's contents is valid.
 * When a read completes, the page becomes uptodate, unless a disk I/O
 * error happened.
 *
 * For choosing which pages to swap out, inode pages carry a
 * PG_referenced bit, which is set any time the system accesses
 * that page through the (inode,offset) hash table.
 *
 * PG_skip is used on sparc/sparc64 architectures to "skip" certain
 * parts of the address space.
 *
 * PG_error is set to indicate that an I/O error occurred on this page.
 *
 * PG_arch_1 is an architecture specific page state bit. The generic
 * code guarantees that this bit is cleared for a page when it first
 * is entered into the page cache.
 */

Based on the analysis above of where physical pages go, page->count can be described simply as follows:

1) Pages of the first kind, used by the kernel itself, generally have a reference count of 1. (fixme)
2) Pages in the page/swap cache: the cache adds 1, and buffers add 1.
3) User processes: each process adds 1.
4) Many places take a temporary reference to keep the page from being freed and drop it again soon after. These are ignored here.

page->count: a worked example

I chose the function is_page_shared for a detailed analysis. In October 2002 linuxforum was very lively, and the discussion of this function was drowned in a sea of posts. Yet the curiosity and the arguments about page->count never stopped. Perhaps forums abroad no longer see active discussion of such questions, but we will carry on.

Please read the comments carefully.

/*
 * Work out if there are any other processes sharing this page, ignoring
 * any page reference coming from the swap cache, or from outstanding
 * swap IO on this page.  (The page cache _does_ count as another valid
 * reference to the page, however.)
 */
/* I) Sources of page references in this situation:
 *    1. processes, one per process
 *    2. swap cache or page cache, one
 *    3. buffers, one
 *
 * II) The swap entry associated with the page:
 *    The page has been added to the swap cache. When the reference
 *    count of its swap entry is not 1 (e.g. tmpfs), some other place
 *    (tmpfs) still expects to find this page through the swap entry.
 *    This amounts to one more anonymous reference to the page.
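 *
 * III) The page cache counts as "another process" (see the English
 *    comment above).
 */
static inline int is_page_shared(struct page *page)
{
	unsigned int count;
	if (PageReserved(page))
		return 1;
	count = page_count(page);	/* the page's own reference count */

	/* II) The page is in the swap cache (not in the page cache):
	 *   references from all processes =
	 *       page count + swap entry count - (references owed to swap itself)
	 *   References owed to swap itself are: 1 for the swap cache's
	 *   reference to the page, 1 for this page's reference to the
	 *   swap entry, and, if there are buffers, 1 more counted as
	 *   swap's reference (it is not a process, in any case).
	 */
	if (PageSwapCache(page))
		count += swap_count(page) - 2 - !!page->buffers;

	/* III) The page is in the page cache, or in no cache at all:
	 *   In this case, if it has buffers it must belong to the page
	 *   cache (filemap); otherwise why would a process page need to
	 *   be written to disk?
	 *   References from processes + page cache (with its bound
	 *   buffers) = page count.
	 */

	/* If the page is in the swap cache, one of the remaining
	 * references belongs to the current process, so only a value
	 * > 1 means that another process is using the page.
	 */
	return count > 1;
}

Its meaning has been analysed above. Now for the precondition and the call sites.

The function assumes that some process is already using the page (one reference); that is the precondition. It is called from three places:

1. do_wp_page: the goal is pte_mkwrite. The reference count is known to be 2 here; if only the swap cache references the page (there will be no buffers), the operation is safe. The function applies.
2. do_swap_page: the page is certainly in the swap cache, and even if it has buffers the read has already completed, so the buffers reference can be subtracted.
3. memory.c: free_pte -> free_page_and_swap_cache (in every case a process wants to free its own pte). The page is known to be in the swap cache, and its buffers will be released afterwards too (with the page locked). This use is probably what the function was originally written for.

Then there is deactivate_page_nolock; see try_to_swap_out -> deactivate_page -> deactivate_page_nolock. In try_to_swap_out, while examining the current process, the kernel decides that this page should be deactivated. But if the page is referenced from anywhere other than the swap cache, the current process, and possibly buffers, it is left alone for now and is really deactivated only when the other process also decides to deactivate it.

Also, page_ramdisk pages must not be deactivated, which keeps ramdisk pages resident in memory. refill_inactive_scan is a special case; refer to the relevant code.

After deactivation the page moves to the LRU inactive_dirty_list. What happens to pages on that list? page_launder cleans the dirty ones (if it is dirty, wash it! ^_^). Cleaning requires locking the page; if another process, or something like tmpfs or ramdisk, were quietly using the page, the consequences would be dire. So a page with references other than the swap cache/buffers must not be laundered. (The caller's extra reference, or the current process's reference, is dropped with page->count-- after this function returns; see try_to_swap_out.)

/**
 * (de)activate_page - move pages from/to active and inactive lists
 * @page: the page we want to move
 * @nolock - are we already holding the pagemap_lru_lock?
 *
 * Deactivate_page will move an active page to the right
 * inactive list, while activate_page will move a page back
 * from one of the inactive lists to the active list.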
 * If called on a page which is not on any of the lists, the
 * page is left alone.
 */
void deactivate_page_nolock(struct page * page)
{
	/*
	 * One for the cache, one for the extra reference the
	 * caller has and (maybe) one for the buffers.
	 *
	 * This isn't perfect, but works for just about everything.
	 * Besides, as long as we don't move unfreeable pages to the
	 * inactive_clean list it doesn't need to be perfect...
	 */
	/* extra reference: the current process or the caller. Keep the
	 * three sources of references in mind to apply this flexibly.
	 */
	int maxcount = (page->buffers ? 3 : 2);
	page->age = 0;
	ClearPageReferenced(page);

	/*
	 * Don't touch it if it's not on the active list.
	 * (some pages aren't on any list at all)
	 */
	if (PageActive(page) && page_count(page) <= maxcount && !page_ramdisk(page)) {
		del_page_from_active_list(page);
		add_page_to_inactive_dirty_list(page);
	}
}

That is the way to reason about page count.

Aside: swap entry reference counts

Here we only analyse how shmem_writepage handles the reference count of a swap entry.

/*
 * Move the page from the page cache to the swap cache
 * (no real write is done here; it is left to the swap cache)
 */
/* Callers:
 *   page_launder -> page->mapping->a_ops->writepage
 *   filemap_fdatasync -> page->mapping->a_ops->writepage
 */
static int shmem_writepage(struct page * page)
{
	int error;
	struct shmem_inode_info *info;
	swp_entry_t *entry, swap;

	info = &page->mapping->host->u.shmem_i;
	if (info->locked)
		return 1;
	/* allocate a swap page: tmpfs (mapping table) + swap cache
	 * (page->index) will both reference it, so refs is 2 */
	swap = __get_swap_page(2);
	if (!swap.val)
		return 1;

	spin_lock(&info->lock);
	/* find the slot in the tmpfs table that records swap entries */
	entry = shmem_swp_entry (info, page->index);
	if (!entry)	/* this had been allocated on page allocation */
		BUG();
	error = -EAGAIN;
	if (entry->val) {	/* a swap entry already corresponds to this page */
		__swap_free(swap, 2);
		goto out;
	}

	/* tmpfs references the swap entry; for the release of this
	 * reference see shmem_unuse-..>shmem_clear_swp */
	*entry = swap;
	error = 0;
	/* Remove the page from the page cache */
	lru_cache_del(page);
	remove_inode_page(page);

	/* Add it to the swap cache */
	/* the swap cache references the swap entry; for the release of
	 * this reference see try_to_unuse or __delete_from_swap_cache */
	add_to_swap_cache(page, swap);
	page_cache_release(page);
	set_page_dirty(page);
	info->swapped++;
out:
	spin_unlock(&info->lock);
	UnlockPage(page);
	return error;
}

In short, the point of a ref count is this: every place from which the object can be reached should hold one reference to it.
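
The arithmetic in is_page_shared can be tried out in user space. What follows is only a minimal sketch, not kernel code: struct toy_page and toy_is_page_shared are hypothetical names invented for this illustration, and the fields merely stand in for PageReserved, PageSwapCache, page->buffers, page_count and swap_count as they are used in the function quoted earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-in for the state is_page_shared() reads. */
struct toy_page {
    bool reserved;       /* stands in for PageReserved(page)            */
    bool in_swap_cache;  /* stands in for PageSwapCache(page)           */
    bool has_buffers;    /* stands in for page->buffers != NULL         */
    int  count;          /* stands in for page_count(page)              */
    int  swap_count;     /* stands in for swap_count(page), i.e. the    */
                         /* number of references on the swap entry      */
};

/* Same arithmetic as the quoted is_page_shared(), on the toy struct. */
int toy_is_page_shared(const struct toy_page *p)
{
    int count;

    assert(p != NULL);
    if (p->reserved)
        return 1;

    count = p->count;
    if (p->in_swap_cache) {
        /* Add the swap entry references, then subtract what belongs
         * to swap itself: the cache's reference on the page, the
         * cache's reference on the swap entry, and the buffers. */
        count += p->swap_count - 2 - (p->has_buffers ? 1 : 0);
    }
    /* One remaining reference is the current process. */
    return count > 1;
}
```

Under this model, a page mapped by one process and sitting in the swap cache (count 2, swap entry count 1) is reported as not shared, with or without buffers (count 3 in the latter case). If someone else still holds the swap entry (swap entry count 2), or a second process maps the page (count 3), it is reported as shared. A page-cache page mapped by one process (count 2, not in the swap cache) is also reported as shared, which matches the remark that the page cache counts as another valid reference.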