2006-12-21   
 
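   Ordinary file systems contain a great deal of code that talks to the disk, so before
reading fs it is worth studying the relevant parts of the IDE driver first. That should
clear up a few questions that have stayed fuzzy for a while.
                         Large-capacity disk issues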
                        
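  The original problem is that the BIOS designers and the ATA interface designers never
agreed: the total number of bits that BIOS and ATA allot to CHS, and the number of bits given
to each of cylinder, head and sector, all differ. Worse, nobody seems to have foreseen how
quickly disk capacity would grow!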
  
  First, the capacity limits of the various interfaces:
BIOS int 13 interface:
             cylinder   head     sector   limit                      limit reached
bits         10         8         6       total 24 bits  8.4 GB
ATA interface:
             cylinder   head     sector
bits         16         4         8       total 28 bits  137.4 GB    Sept 2001, 160 GB Maxtor DiamondMax
Extended int 13 interface (supported by most BIOSes from 1997 on):
bits         8 bytes * 8 = 64-bit LBA number    9.4 trillion gigabytes!
ATA-6 interface:
bits         6 bytes * 8 = 48-bit LBA number    less than extended int 13, but large enough!
  As shown, the BIOS was designed to address at most about 8 GB, and ATA-1 through ATA-5 can
address only 137 GB. These are design flaws, and they are where the two limits come from.

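A limit worth noting before the 8 GB one is the 528 MB limit:
It comes from combining BIOS int 13 with ATA (up to ATA-5): an old BIOS passes the caller's CHS
straight through to ATA (BIOS mode -> Normal), so each field is limited to the narrower of the two.
    cylinder    head  sector  limit                                     time
     10           4     6     20 bits: 1024 * 16 * 63 * 512 B = 528 MB  became a problem around 1993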
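
The limits quoted above follow from plain arithmetic on the field widths. Below is a minimal sketch, assuming 512-byte sectors and that CHS sector numbers start at 1 (so a 6-bit sector field addresses 63 sectors). The function names are illustrative and do not come from any kernel or BIOS source.

```c
#include <stdint.h>

#define SECTOR_BYTES 512ULL  /* assumption: classic 512-byte sectors */

/* Capacity in bytes of a CHS geometry, given the number of addressable
 * cylinders, heads and sectors per track. */
uint64_t chs_capacity_bytes(uint64_t cyls, uint64_t heads, uint64_t sects)
{
    return cyls * heads * sects * SECTOR_BYTES;
}

/* Capacity in bytes of a flat LBA space that is `bits` wide.
 * Only meaningful for bits < 55, otherwise the result overflows 64 bits. */
uint64_t lba_capacity_bytes(unsigned bits)
{
    return (1ULL << bits) * SECTOR_BYTES;
}
```

With these, chs_capacity_bytes(1024, 16, 63) is 528,482,304 (the 528 MB limit), chs_capacity_bytes(1024, 256, 63) is 8,455,716,864 (the int 13 limit), and lba_capacity_bytes(28) is 137,438,953,472 (the 137.4 GB ATA limit).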
    
  To solve this problem the BIOS introduced Extended CHS (ECHS), called "large mode" in some BIOSes.
Take a 2.95 GiB disk as an example; the hardware reports a CHS of 6136/16/63:
                              Cylinders      Heads      Sectors   Capacity
IDE/ATA Limits                 65,536           16       256       128 GiB
Hard Disk Logical Geometry      6,136           16       63        2.95 GiB
BIOS Translation Factor    divide by 8  multiply by 8     --
BIOS Translated Geometry          767          128       63        2.95 GiB
BIOS Int 13h Limits             1,024          256       63        7.88 GiB
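
The translation in the table can be sketched as repeated halving and doubling. This is a simplified illustration only: real BIOSes use several different translation algorithms, and the struct and function names here are made up for the example.

```c
/* Hypothetical container for a CHS geometry. */
struct chs_geometry {
    unsigned cyls;
    unsigned heads;
    unsigned sects;
};

/* Halve the cylinder count and double the head count until the cylinder
 * count fits the 10-bit Int 13h field (at most 1024 cylinders) or the
 * head count reaches the Int 13h ceiling of 256. Integer division may
 * drop a few trailing cylinders. */
struct chs_geometry echs_translate(struct chs_geometry g)
{
    while (g.cyls > 1024 && g.heads * 2 <= 256) {
        g.cyls /= 2;
        g.heads *= 2;
    }
    return g;
}
```

For the 6136/16/63 disk the loop runs three times (a factor of 8) and yields 767/128/63, matching the translated geometry in the table.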
    
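   The only way past the 8 GB limit is extended int 13 plus LBA mode (ATA). If a program still
uses the old int 13 calls while the disk is in LBA mode, the BIOS converts the CHS address
directly into an LBA address; this is called assisted LBA.
   Either way, as long as the old BIOS CHS interface is used, whether through ECHS translation
or assisted LBA, the 8.5 GB limit cannot be exceeded. One more detail is worth mentioning: the
maximum head count is 255, not 256, because DOS and Windows 95 cannot handle 256 heads. The
real limit is therefore slightly below 8.5 GB.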
           
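   ATA specifies that disks larger than 8.4 GB should report a CHS of 16383/16/63. This means
`geometry' is obsolete: the total size can no longer be computed from the geometry and must be
taken from the LBA capacity field returned by the IDENTIFY command. Disks larger than 137.4 GB
should report an LBA capacity of 0xfffffff = 268435455 sectors (137 GB); the correct disk size
is in the new 48-bit field.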
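
The rule above can be sketched as a small helper that picks the right capacity field from the IDENTIFY data. The word offsets used here (words 60-61 for the 28-bit capacity, word 83 bit 10 for 48-bit support, words 100-103 for the 48-bit capacity) are my reading of ATA-6 and should be checked against the specification; the function name is hypothetical and is not kernel code.

```c
#include <stdint.h>

/* id: the 256 16-bit words returned by IDENTIFY DEVICE.
 * Returns the number of addressable sectors. */
uint64_t identify_total_sectors(const uint16_t id[256])
{
    /* 28-bit LBA capacity, words 60-61 (assumed offsets, see lead-in) */
    uint32_t lba28 = (uint32_t)id[60] | ((uint32_t)id[61] << 16);
    /* word 83 bit 10: 48-bit address feature set supported */
    int has_lba48 = (id[83] & (1u << 10)) != 0;

    if (has_lba48) {
        /* 48-bit LBA capacity, words 100-103 */
        uint64_t lba48 = (uint64_t)id[100]
                       | ((uint64_t)id[101] << 16)
                       | ((uint64_t)id[102] << 32)
                       | ((uint64_t)id[103] << 48);
        if (lba48 != 0)
            return lba48;
    }
    return lba28;
}
```

A drive that only fills in the 28-bit field is capped at 268435455 sectors; a drive that advertises the 48-bit feature set reports its true size through the wider field.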
  
   Linux support for large disks:
   >8.4 GB     kernel should be 2.0.34 or later.
   >33.8 GB    kernel should be 2.0.39/2.2.14/2.3.21 or later.
   > 137 GB     kernel should be 2.4.19/2.5.3 or later. 
   
   To check whether a given Linux version supports large disks, look at the function
do_rw_disk (ide-disk.c).
References:
1. Large Disk Drives >8.4Gb (in addition, an IBM doc is attached)
http://www-oss.fnal.gov/projects/fermilinux/common/faq/old/0009.html
2. PC Guide, hard disk reference
   http://www.pcguide.com/ref/hdd/index.htm
   
   
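                        Block size issues
   The mm notes already covered several points about do_generic_file_read; the key is to grasp
the basic ideas. First, files on disk are cached in memory as far as possible so that reads and
writes are fast. The basic unit of that cache is the memory page, which is normally 4k on i386.
   When a file is read through the cache, the byte-based offset and size given by the user are
first converted into 4k pages, so the data can be copied straight to the user. If the data is
not in the cache it has to be read from disk, for example one page at a time through
block_read_full_page.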
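
The offset-to-page conversion described above can be sketched as follows. This assumes 4k pages as on i386; the struct and function are illustrative and are not the code in do_generic_file_read.

```c
#define PAGE_SHIFT_4K 12                        /* assumption: 4k pages */
#define PAGE_SIZE_4K  (1UL << PAGE_SHIFT_4K)

/* Hypothetical description of the pages touched by a byte range. */
struct page_span {
    unsigned long first_page;   /* index of the first page touched */
    unsigned long last_page;    /* index of the last page touched */
    unsigned long first_offset; /* byte offset inside the first page */
};

/* Convert a byte range (size must be > 0) into page-cache indices. */
struct page_span byte_range_to_pages(unsigned long offset, unsigned long size)
{
    struct page_span s;

    s.first_page = offset >> PAGE_SHIFT_4K;
    s.last_page = (offset + size - 1) >> PAGE_SHIFT_4K;
    s.first_offset = offset & (PAGE_SIZE_4K - 1);
    return s;
}
```

Reading 4000 bytes at offset 5000, for example, touches pages 1 and 2 and starts 904 bytes into page 1.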
   
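    A file lives on a concrete file system, and that file system has its own allocation unit,
the block. For ext2 the block is the smallest unit that can be allocated; the usual ext2 block
size is 1k, 2k and 4k are also possible, but it cannot exceed 4k (see ext2_read_super).
    For a file stored on ext2, all blocks that belong to it are recorded in the file's inode,
as the block number of each block. ext2 block numbers start at 1 (block 0 is the boot block),
and the maximum depends on the disk capacity. In this way every file is divided, in units of
the block size, into blocks numbered from 0. Each file is a linear space of this kind, mapped
through an array in the inode onto the ext2 block space that starts at 1.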
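
The mapping from a file's own block numbering to file system block numbers can be sketched with a toy inode that only has direct block pointers. ext2 really does keep 12 direct pointers, but everything else here (the struct, the function, ignoring indirect blocks) is a simplification for illustration.

```c
#include <stdint.h>

#define N_DIRECT 12   /* ext2 inodes hold 12 direct block pointers */

/* Toy inode: one fs block number per file block, 0 means unmapped. */
struct toy_inode {
    uint32_t block[N_DIRECT];
};

/* Map a byte offset in the file to a file system block number.
 * Returns 0 for a hole or for offsets beyond the direct blocks. */
uint32_t toy_bmap(const struct toy_inode *ino, uint32_t block_size,
                  uint64_t file_offset)
{
    uint64_t logical = file_offset / block_size; /* file block, from 0 */

    if (logical >= N_DIRECT)
        return 0;
    return ino->block[logical];
}
```

With a 1k block size, byte 2500 of the file falls in file block 2, so the lookup returns whatever fs block number the inode stores in slot 2.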
    
    After this step we have to deal with the disk. The usual interface is bread(block#, size).
struct buffer_head * bread(kdev_t dev, int block, int size)
{
	struct buffer_head * bh;
	bh = getblk(dev, block, size); /* bh carries the block number and the block size */
	if (buffer_uptodate(bh))
		return bh;
	ll_rw_block(READ, 1, &bh); /* hand the request to the disk driver */
	wait_on_buffer(bh);
	if (buffer_uptodate(bh))
		return bh;
	brelse(bh);
	return NULL;
}
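   The function reads block number `block' using a block size of `size'. Put another way,
bread offers access to the disk the way the file system sees it: size gives the block size and
block selects which block to read. How the disk is divided into sectors is not the caller's
concern.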
                           IDE Driver overview
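   Starting from bread, let us see how the disk driver reads sectors. bread was covered above,
so we start at ll_rw_block. The code looks complicated, but the main line is simple: install a
callback on the buffer; when the disk finishes the read, the callback sets the uptodate bit of
the bh and wakes any task that is waiting for the bh.
  When a request is submitted, it is placed on a queue; all queued requests are then scheduled
to improve disk I/O throughput, and the reads are carried out in the scheduled order.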
void ll_rw_block(int rw, int nr, struct buffer_head * bhs[])
{
	unsigned int major;
	int correct_size;
	int i;
	/*
	 * First a series of checks (code elided):
	 *  1. Determine correct block size for this device.
	 *  2. Verify requested block sizes.
	 *  3. For a write, check that the device is writable.
	 */

	/* Then set b_end_io of each bh to end_buffer_io_sync; waiting
	 * processes are notified through this function. */
	for (i = 0; i < nr; i++) {
		struct buffer_head *bh;
		bh = bhs[i];
		/* Only one thread can actually submit the I/O. */
		if (test_and_set_bit(BH_Lock, &bh->b_state))
			continue;
		/* We have the buffer lock */
		bh->b_end_io = end_buffer_io_sync;
  
		......... /* handle a few possible races (elided) */
		
		submit_bh(rw, bh); /* submit the request to the disk driver */
	}
	return;
	.... /* cleanup (elided) */
}
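
The callback mechanism used by ll_rw_block can be illustrated with a small user-space model. This is a sketch only: toy_bh, toy_submit and toy_complete are invented names, and the wake-up of sleeping tasks is reduced to setting a flag.

```c
/* Toy stand-in for struct buffer_head. */
struct toy_bh {
    int uptodate;                              /* data valid? */
    int locked;                                /* I/O in flight? */
    int waiter_woken;                          /* stands in for wake_up() */
    void (*end_io)(struct toy_bh *bh, int ok); /* completion callback */
};

/* Completion callback: record the result, unlock, wake the waiter. */
static void toy_end_io_sync(struct toy_bh *bh, int ok)
{
    bh->uptodate = ok;
    bh->locked = 0;
    bh->waiter_woken = 1;
}

/* Submit side: lock the buffer and install the callback. */
void toy_submit(struct toy_bh *bh)
{
    bh->locked = 1;
    bh->waiter_woken = 0;
    bh->end_io = toy_end_io_sync;
}

/* What the "interrupt handler" does when the transfer finishes. */
void toy_complete(struct toy_bh *bh, int ok)
{
    bh->end_io(bh, ok);
}
```

The submitter never polls the device; it only installs the callback, and the completion path decides when the buffer becomes uptodate and who gets woken.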
    
   Next, submit_bh hands the request to the disk driver:
void submit_bh(int rw, struct buffer_head * bh)
{
	if (!test_bit(BH_Lock, &bh->b_state))
		BUG();
	set_bit(BH_Req, &bh->b_state);
	/*
	 * First step, 'identity mapping' - RAID or LVM might
	 * further remap this.
	 * The file system's block number (in units of b_size) is
	 * converted to a sector number here.
	 */
	bh->b_rdev = bh->b_dev;
	bh->b_rsector = bh->b_blocknr * (bh->b_size>>9);
	generic_make_request(rw, bh);
	switch (rw) {
		case WRITE:
			kstat.pgpgout++;
			break;
		default:
			kstat.pgpgin++;
			break;
	}
}
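
The block-to-sector conversion in submit_bh is a single multiplication. A standalone sketch, assuming 512-byte sectors (hence the shift by 9) and a block size that is a multiple of 512; the function name is illustrative.

```c
#include <stdint.h>

/* Convert a file system block number to the first 512-byte sector of
 * that block, mirroring b_rsector = b_blocknr * (b_size >> 9). */
uint64_t block_to_sector(uint64_t blocknr, unsigned block_size)
{
    return blocknr * (block_size >> 9);
}
```

Block 100 of a 1k-block file system starts at sector 200; block 7 of a 4k-block file system starts at sector 56.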
                          
  Now see how the request is submitted to the disk driver:
void generic_make_request (int rw, struct buffer_head * bh)
{
	int major = MAJOR(bh->b_rdev);
	request_queue_t *q;
 
  ..... /* check that the range exists on the disk, e.g. does not exceed the last sector (elided) */
	/*
	 * Resolve the mapping until finished. (drivers are
	 * still free to implement/resolve their own stacking
	 * by explicitly returning 0)
	 */
	/* NOTE: we don't repeat the blk_size check for each new device.
	 * Stacking drivers are expected to know what they are doing.
	 */
	do {
		q = blk_get_queue(bh->b_rdev);
		if (!q) {
			printk(KERN_ERR
			       "generic_make_request: Trying to access nonexistent block-device %s (%ld)\n",
			       kdevname(bh->b_rdev), bh->b_rsector);
			buffer_IO_error(bh);
			break;
		}
	}
	while (q->make_request_fn(q, rw, bh)); /* see blk_init_queue: initialized to __make_request */
}
 
  The request is submitted inside a loop here, but for IDE the loop is unnecessary: __make_request always returns 0.
static int __make_request(request_queue_t * q, int rw,
				  struct buffer_head * bh)
{
	unsigned int sector, count;
	int max_segments = MAX_SEGMENTS;
	struct request * req = NULL, *freereq = NULL;
	int rw_ahead, max_sectors, el_ret;
	struct list_head *head;
	int latency;
	elevator_t *elevator = &q->elevator;
 
 again:
  ........
	if (list_empty(head)) {
		q->plug_device_fn(q, bh->b_rdev); /* is atomic */
		                                /* for IDE this is generic_plug_device; see blk_init_queue */
		goto get_rq;
	}
	el_ret = elevator->elevator_merge_fn(q, &req, bh, rw,
					     &max_sectors, &max_segments);
	switch (el_ret) {
		case ELEVATOR_BACK_MERGE:
			if (!q->back_merge_fn(q, req, bh, max_segments))
				break;
			req->bhtail->b_reqnext = bh;
			req->bhtail = bh;
			req->nr_sectors = req->hard_nr_sectors += count;
			req->e = elevator;
			drive_stat_acct(req->rq_dev, req->cmd, count, 0);
			attempt_back_merge(q, req, max_sectors, max_segments);
			goto out;
		case ELEVATOR_FRONT_MERGE:
			if (!q->front_merge_fn(q, req, bh, max_segments))
				break;
			bh->b_reqnext = req->bh;
			req->bh = bh;
			req->buffer = bh->b_data;
			req->current_nr_sectors = count;
			req->sector = req->hard_sector = sector;
			req->nr_sectors = req->hard_nr_sectors += count;
			req->e = elevator;
			drive_stat_acct(req->rq_dev, req->cmd, count, 0);
			attempt_front_merge(q, head, req, max_sectors, max_segments);
			goto out;
		/*
		 * elevator says don't/can't merge. get new request
		 */
		case ELEVATOR_NO_MERGE:
			break;
		default:
			printk("elevator returned crap (%d)\n", el_ret);
			BUG();
	}
		
	/*
	 * Grab a free request from the freelist. Read first try their
	 * own queue - if that is empty, we steal from the write list.
	 * Writes must block if the write list is empty, and read aheads
	 * are not crucial.
	 */
get_rq:
	if (freereq) {
		req = freereq;
		freereq = NULL;
	} else if ((req = get_request(q, rw)) == NULL) {
		spin_unlock_irq(&io_request_lock);
		if (rw_ahead)
			goto end_io;
		freereq = __get_request_wait(q, rw);
		goto again;
	}
/* fill up the request-info, and add it to the queue */
	req->cmd = rw;
	req->errors = 0;
	req->hard_sector = req->sector = sector;
	req->hard_nr_sectors = req->nr_sectors = count;
	req->current_nr_sectors = count;
	req->nr_segments = 1; /* Always 1 for a new request. */
	req->nr_hw_segments = 1; /* Always 1 for a new request. */
	req->buffer = bh->b_data;
	req->sem = NULL;
	req->bh = bh;
	req->bhtail = bh;
	req->rq_dev = bh->b_rdev;
	req->e = elevator;
	add_request(q, req, head, latency); /* queue the request for the disk driver */
out:
	if (!q->plugged)
		(q->request_fn)(q); /* see ide_init_queue: initialized to do_ide_request */

	if (freereq)
		blkdev_release_request(freereq);
	spin_unlock_irq(&io_request_lock);
	return 0;
end_io:
	bh->b_end_io(bh, test_bit(BH_Uptodate, &bh->b_state));
	return 0;
}
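
The merge decision made by the elevator can be sketched in isolation. This simplified model only checks sector adjacency; the real code also checks the device, the direction of the transfer, max_sectors and max_segments. All names below are hypothetical.

```c
enum toy_merge { TOY_NO_MERGE, TOY_FRONT_MERGE, TOY_BACK_MERGE };

/* Toy request: a contiguous run of sectors. */
struct toy_request {
    unsigned long sector;      /* first sector of the request */
    unsigned long nr_sectors;  /* length in sectors */
};

/* Decide whether a new buffer (sector, count) touches the request. */
enum toy_merge toy_merge_check(const struct toy_request *rq,
                               unsigned long sector, unsigned long count)
{
    if (rq->sector + rq->nr_sectors == sector)
        return TOY_BACK_MERGE;   /* buffer starts where the request ends */
    if (sector + count == rq->sector)
        return TOY_FRONT_MERGE;  /* buffer ends where the request starts */
    return TOY_NO_MERGE;
}

/* Grow the request the way the two merge cases above require. */
void toy_merge_apply(struct toy_request *rq, enum toy_merge m,
                     unsigned long sector, unsigned long count)
{
    if (m == TOY_BACK_MERGE) {
        rq->nr_sectors += count;
    } else if (m == TOY_FRONT_MERGE) {
        rq->sector = sector;
        rq->nr_sectors += count;
    }
}
```

A request covering sectors 100 to 107 back-merges a buffer at sector 108 and front-merges a two-sector buffer at sector 98; anything else becomes a new request.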
  The meaning of q->plugged is discussed later. First, what do_ide_request does:
void do_ide_request(request_queue_t *q)
{
	ide_do_request(q->queuedata, 0);
}
static void ide_do_request(ide_hwgroup_t *hwgroup, int masked_irq)
{
	ide_drive_t	*drive;
	ide_hwif_t	*hwif;
	ide_startstop_t	startstop;
	ide_get_lock(&ide_lock, ide_intr, hwgroup);	/* for atari only: POSSIBLY BROKEN HERE(?) */
	__cli();	/* necessary paranoia: ensure IRQs are masked on local CPU */
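	while (!hwgroup->busy) {               /* only do work while the hwgroup is idle; otherwise this function is a no-op */
		hwgroup->busy = 1;           /* If busy is set, another thread of control is already inside
		                                this loop. Whoever enters first handles the requests of every
		                                drive attached to this hwgroup. The drives of one hwgroup
		                                share a single interrupt.
		                               */
		drive = choose_drive(hwgroup); /* Pick a drive. The request handled here need not be the one
		                                 you just submitted: you may be reading hda while hdc is
		                                 chosen. Only a drive with drive->queue.plugged == 0 is
		                                 chosen; a plugged queue is still collecting requests and is
		                                 started later when tq_disk runs. Once a request is in flight,
		                                 further calls come from the interrupt path
		                                 (ide_intr -> ide_do_request), not from this thread.
		                               */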
		if (drive == NULL) {
			unsigned long sleep = 0;
			hwgroup->rq = NULL;
			drive = hwgroup->drive;
			do {
				if (drive->sleep && (!sleep || 0 < (signed long)(sleep - drive->sleep)))
					sleep = drive->sleep;
			} while ((drive = drive->next) != hwgroup->drive);
			if (sleep) {
				/*
				 * Take a short snooze, and then wake up this hwgroup again.
				 * This gives other hwgroups on the same a chance to
				 * play fairly with us, just in case there are big differences
				 * in relative throughputs.. don't want to hog the cpu too much.
				 */
				if (0 < (signed long)(jiffies + WAIT_MIN_SLEEP - sleep)) 
					sleep = jiffies + WAIT_MIN_SLEEP;
#if 1
				if (timer_pending(&hwgroup->timer))
					printk("ide_set_handler: timer already active\n");
#endif
				hwgroup->sleeping = 1;	/* so that ide_timer_expiry knows what to do */
				mod_timer(&hwgroup->timer, sleep);
				/* we purposely leave hwgroup->busy==1 while sleeping */
			} else {
				/* Ugly, but how can we sleep for the lock otherwise? perhaps from tq_disk? */
				ide_release_lock(&ide_lock);	/* for atari only */
				hwgroup->busy = 0;
			}
			return;		/* no more work for this hwgroup (for now) */
		}
		hwif = HWIF(drive);
		if (hwgroup->hwif->sharing_irq && hwif != hwgroup->hwif && hwif->io_ports[IDE_CONTROL_OFFSET]) {
			/* set nIEN for previous hwif */
			SELECT_INTERRUPT(hwif, drive);
		}
		hwgroup->hwif = hwif;
		hwgroup->drive = drive;
		drive->sleep = 0;
		drive->service_start = jiffies;
		if ( drive->queue.plugged )	/* paranoia */
			printk("%s: Huh? nuking plugged queue\n", drive->name);
		hwgroup->rq = blkdev_entry_next_request(&drive->queue.queue_head);
		/*
		 * Some systems have trouble with IDE IRQs arriving while
		 * the driver is still setting things up.  So, here we disable
		 * the IRQ used by this interface while the request is being started.
		 * This may look bad at first, but pretty much the same thing
		 * happens anyway when any interrupt comes in, IDE or otherwise
		 *  -- the kernel masks the IRQ while it is being handled.
		 */
		if (masked_irq && hwif->irq != masked_irq)
			disable_irq_nosync(hwif->irq);
		spin_unlock(&io_request_lock);
		ide__sti();	/* allow other IRQs while we start this request */
		startstop = start_request(drive);
		spin_lock_irq(&io_request_lock);
		if (masked_irq && hwif->irq != masked_irq)
			enable_irq(hwif->irq);
		if (startstop == ide_stopped)
			hwgroup->busy = 0;
	}
}
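   At this point we reach the 'core' logic of disk operation: __make_request, ide_do_request,
plugged, ide_intr and tq_disk. __make_request and tq_disk mainly schedule disk read/write
requests; ide_do_request and ide_intr operate the IDE interface and actually perform the disk I/O.
   When __make_request receives the first request (the queue is empty), it puts the request on
the queue, sets plugged, and queues the unplug on tq_disk, which postpones the call to
ide_do_request. Later requests are first run through the scheduler, and only then is it decided
whether to start a hardware operation right away. Once a request has been issued to the hardware
(ide_do_request has run), the interrupt handler takes over calling ide_do_request; by then the
queue's plugged flag has been cleared and the hwgroup's busy flag is set. While a queue is
plugged and waiting on tq_disk, no hardware operation is started for it (ide_do_request only
picks unplugged queues).

   After the interrupt handler has taken over the calls to ide_do_request, it does not
necessarily process every pending request; that depends on the scheduling between drives, which
is ide_do_request's job. Note head_active here: for IDE it is always 1, which means that while
scheduling requests on an unplugged queue the first req must not be touched (when unplugged,
I/O may be in progress, i.e. ide_intr may already be performing the real transfer).

   A plugged queue is really waiting for requests to be scheduled so as to get better I/O
throughput. It cannot wait forever, though. If we search for tq_disk we find many places where
the kernel balances throughput against latency. The details are not listed here.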
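
The plug/unplug idea can be modelled in a few lines. This toy deliberately ignores head_active, the interrupt path and the elevator; it only shows that a plugged queue accumulates requests and that unplugging hands the whole batch to the driver. The names are invented for the example.

```c
/* Toy request queue. */
struct toy_queue {
    int plugged;     /* 1 while requests are being collected */
    int pending;     /* requests waiting in the queue */
    int dispatched;  /* requests handed to the "driver" so far */
};

/* Stands in for q->request_fn: take everything that is queued. */
static void toy_request_fn(struct toy_queue *q)
{
    q->dispatched += q->pending;
    q->pending = 0;
}

/* The first request on an empty queue plugs it; later ones just queue up. */
void toy_add_request(struct toy_queue *q)
{
    if (q->pending == 0)
        q->plugged = 1;
    q->pending++;
}

/* Stands in for the unplug run from tq_disk. */
void toy_unplug(struct toy_queue *q)
{
    if (q->plugged) {
        q->plugged = 0;
        if (q->pending)
            toy_request_fn(q);
    }
}
```

Three submissions followed by one unplug result in a single batch of three requests reaching the driver, which is the throughput gain that plugging trades against latency.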
   The code that actually drives the IDE hardware is start_request and drive->do_request (do_rw_disk for IDE hard disks):
/*
 * do_rw_disk() issues READ and WRITE commands to a disk,
 * using LBA if supported, or CHS otherwise, to address sectors.
 * It also takes care of issuing special DRIVE_CMDs.
 */
static ide_startstop_t do_rw_disk (ide_drive_t *drive, struct request *rq, unsigned long block)
{
	if (IDE_CONTROL_REG)
		OUT_BYTE(drive->ctl,IDE_CONTROL_REG);
	OUT_BYTE(rq->nr_sectors,IDE_NSECTOR_REG);

	if (drive->select.b.lba) { /* LBA: as can be seen, the 2.4.0 kernel does not yet support 48-bit LBA, so it cannot handle disks larger than 137 GB */

#ifdef DEBUG
		printk("%s: %sing: LBAsect=%ld, sectors=%ld, buffer=0x%08lx\n",
			drive->name, (rq->cmd==READ)?"read":"writ",
			block, rq->nr_sectors, (unsigned long) rq->buffer);
#endif
		OUT_BYTE(block,IDE_SECTOR_REG);
		OUT_BYTE(block>>=8,IDE_LCYL_REG);
		OUT_BYTE(block>>=8,IDE_HCYL_REG);
		OUT_BYTE(((block>>8)&0x0f)|drive->select.all,IDE_SELECT_REG);
	} else {
		unsigned int sect,head,cyl,track;
		track = block / drive->sect;
		sect  = block % drive->sect + 1;
		OUT_BYTE(sect,IDE_SECTOR_REG);
		head  = track % drive->head;
		cyl   = track / drive->head;
		OUT_BYTE(cyl,IDE_LCYL_REG);
		OUT_BYTE(cyl>>8,IDE_HCYL_REG);
		OUT_BYTE(head|drive->select.all,IDE_SELECT_REG);
#ifdef DEBUG
		printk("%s: %sing: CHS=%d/%d/%d, sectors=%ld, buffer=0x%08lx\n",
			drive->name, (rq->cmd==READ)?"read":"writ", cyl,
			head, sect, rq->nr_sectors, (unsigned long) rq->buffer);
#endif
	}
#ifdef CONFIG_BLK_DEV_PDC4030
	if (IS_PDC4030_DRIVE) {
		extern ide_startstop_t do_pdc4030_io(ide_drive_t *, struct request *);
		return do_pdc4030_io (drive, rq);
	}
#endif /* CONFIG_BLK_DEV_PDC4030 */
	if (rq->cmd == READ) {
#ifdef CONFIG_BLK_DEV_IDEDMA
		if (drive->using_dma && !(HWIF(drive)->dmaproc(ide_dma_read, drive)))
			return ide_started;
#endif /* CONFIG_BLK_DEV_IDEDMA */
		ide_set_handler(drive, &read_intr, WAIT_CMD, NULL);
		OUT_BYTE(drive->mult_count ? WIN_MULTREAD : WIN_READ, IDE_COMMAND_REG);
		return ide_started;
	}
	if (rq->cmd == WRITE) {
		ide_startstop_t startstop;
#ifdef CONFIG_BLK_DEV_IDEDMA
		if (drive->using_dma && !(HWIF(drive)->dmaproc(ide_dma_write, drive)))
			return ide_started;
#endif /* CONFIG_BLK_DEV_IDEDMA */
		OUT_BYTE(drive->mult_count ? WIN_MULTWRITE : WIN_WRITE, IDE_COMMAND_REG);
		if (ide_wait_stat(&startstop, drive, DATA_READY, drive->bad_wstat, WAIT_DRQ)) {
			printk(KERN_ERR "%s: no DRQ after issuing %s\n", drive->name,
				drive->mult_count ? "MULTWRITE" : "WRITE");
			return startstop;
		}
		if (!drive->unmask)
			__cli();	/* local CPU only */
		if (drive->mult_count) {
			ide_hwgroup_t *hwgroup = HWGROUP(drive);
			/*
			 * Ugh.. this part looks ugly because we MUST set up
			 * the interrupt handler before outputting the first block
			 * of data to be written.  If we hit an error (corrupted buffer list)
			 * in ide_multwrite(), then we need to remove the handler/timer
			 * before returning.  Fortunately, this NEVER happens (right?).
			 *
			 * Except when you get an error it seems...
			 */
			hwgroup->wrq = *rq; /* scratchpad */
			ide_set_handler (drive, &multwrite_intr, WAIT_CMD, NULL);
			if (ide_multwrite(drive, drive->mult_count)) {
				unsigned long flags;
				spin_lock_irqsave(&io_request_lock, flags);
				hwgroup->handler = NULL;
				del_timer(&hwgroup->timer);
				spin_unlock_irqrestore(&io_request_lock, flags);
				return ide_stopped;
			}
		} else {
			ide_set_handler (drive, &write_intr, WAIT_CMD, NULL);
			idedisk_output_data(drive, rq->buffer, SECTOR_WORDS);
		}
		return ide_started;
	}
	printk(KERN_ERR "%s: bad command: %d\n", drive->name, rq->cmd);
	ide_end_request(0, HWGROUP(drive));
	return ide_stopped;
}
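
The two addressing branches of do_rw_disk can be reproduced outside the kernel to see which value lands in which register. This is a sketch: toy_taskfile and both functions are illustrative names, and only the low nibble of the select register is modelled (the kernel ORs it with drive->select.all).

```c
#include <stdint.h>

/* Values that would be written to the IDE address registers. */
struct toy_taskfile {
    uint8_t sector;  /* IDE_SECTOR_REG */
    uint8_t lcyl;    /* IDE_LCYL_REG */
    uint8_t hcyl;    /* IDE_HCYL_REG */
    uint8_t select;  /* low nibble of IDE_SELECT_REG */
};

/* LBA28: the 28-bit block number is split into 8 + 8 + 8 + 4 bits. */
struct toy_taskfile lba28_regs(uint32_t block)
{
    struct toy_taskfile tf;

    tf.sector = block & 0xff;
    tf.lcyl = (block >> 8) & 0xff;
    tf.hcyl = (block >> 16) & 0xff;
    tf.select = (block >> 24) & 0x0f;
    return tf;
}

/* CHS: convert the linear block number using the drive geometry. */
struct toy_taskfile chs_regs(uint32_t block, unsigned heads, unsigned sects)
{
    struct toy_taskfile tf;
    uint32_t track = block / sects;
    uint32_t cyl = track / heads;

    tf.sector = (uint8_t)(block % sects + 1); /* sectors count from 1 */
    tf.lcyl = cyl & 0xff;
    tf.hcyl = (cyl >> 8) & 0xff;
    tf.select = (track % heads) & 0x0f;       /* head number */
    return tf;
}
```

Because the LBA branch has only 28 bits to distribute over these four registers, the 137 GB ceiling discussed earlier is visible directly in the code.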