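Linux Kernel Adventures: Reading the md Source Code, Part 10: RAID5 Data Flow, the Sync Data Path
When reposting, please credit the source: http://blog.csdn.net/liumangxiong

The previous part explained that, in RAID5's sync function sync_request, the potato chips get fried by handle_stripe. Everything so far, from creating the array, to allocating all kinds of resources, to setting up each array's personality, was preparation for the arrival of the data flow. It is like studying hard for years just to get into university. The data flow itself is like campus life: rich, colorful and challenging. Once you get over this hurdle, kernel code is no longer mysterious, and what remains is only a matter of time.

First, let's see where handle_stripe takes our potato chips: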
- 3379 static void handle_stripe(struct stripe_head *sh)
- 3380 {
- 3381 struct stripe_head_state s;
- 3382 struct r5conf *conf = sh->raid_conf;
- 3383 int i;
- 3384 int prexor;
- 3385 int disks = sh->disks;
- 3386 struct r5dev *pdev, *qdev;
- 3387
- 3388 clear_bit(STRIPE_HANDLE, &sh->state);
- 3389 if (test_and_set_bit_lock(STRIPE_ACTIVE, &sh->state)) {
- 3390 /* already being handled, ensure it gets handled
- 3391 * again when current action finishes */
- 3392 set_bit(STRIPE_HANDLE, &sh->state);
- 3393 return;
- 3394 }
- 3395
- 3396 if (test_and_clear_bit(STRIPE_SYNC_REQUESTED, &sh->state)) {
- 3397 set_bit(STRIPE_SYNCING, &sh->state);
- 3398 clear_bit(STRIPE_INSYNC, &sh->state);
- 3399 }
- 3400 clear_bit(STRIPE_DELAYED, &sh->state);
- 3401
- 3402 pr_debug("handling stripe %llu, state=%#lx cnt=%d, "
- 3403 "pd_idx=%d, qd_idx=%d\n, check:%d, reconstruct:%d\n",
- 3404 (unsigned long long)sh->sector, sh->state,
- 3405 atomic_read(&sh->count), sh->pd_idx, sh->qd_idx,
- 3406 sh->check_state, sh->reconstruct_state);
- 3407
- 3408 analyse_stripe(sh, &s);
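This function is long, so here is the first part, which analyses the stripe. The analysis does some preprocessing based on the stripe's state, and that state then determines which concrete operation comes next. Take sync as an example: first the disks are read; once the reads return, the parity is checked; then, if it does not match, the parity is written. These steps are not all completed in a single pass through handle_stripe. Disk I/O is asynchronous, so handle_stripe has to be called again after the callback for the previous disk request. Each data flow usually enters handle_stripe several times, and each pass takes a somewhat different path through the code.

struct stripe_head has many state flags. They decide how the stripe is handled, so they must be treated with great care. There are a lot of them; for now, a quick overview:
- enum {
- STRIPE_ACTIVE, // being handled
- STRIPE_HANDLE, // needs handling
- STRIPE_SYNC_REQUESTED, // sync requested
- STRIPE_SYNCING, // sync in progress
- STRIPE_INSYNC, // stripe is in sync
- STRIPE_PREREAD_ACTIVE, // preread
- STRIPE_DELAYED, // handling delayed
- STRIPE_DEGRADED, // degraded
- STRIPE_BIT_DELAY, // waiting for bitmap processing
- STRIPE_EXPANDING,
- STRIPE_EXPAND_SOURCE,
- STRIPE_EXPAND_READY,
- STRIPE_IO_STARTED, /* do not count towards 'bypass_count' */ // I/O issued
- STRIPE_FULL_WRITE, /* all blocks are set to be overwritten */ // full write
- STRIPE_BIOFILL_RUN, // bio fill, i.e. copying the page into the bio
- STRIPE_COMPUTE_RUN, // compute running
- STRIPE_OPS_REQ_PENDING, // used for queuing handle_stripe
- STRIPE_ON_UNPLUG_LIST, // marks whether the stripe is on the unplug list during batched release_stripe
- };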
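The opening lines of handle_stripe (3388 to 3394) use STRIPE_ACTIVE and STRIPE_HANDLE as a small re-entry handshake. The following is a minimal user-space sketch of that idea using C11 atomics. The names demo_stripe, demo_enter, demo_leave and the DEMO_* masks are hypothetical stand-ins, not kernel identifiers, and the kernel itself works with bit numbers and test_and_set_bit_lock rather than with masks.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for STRIPE_ACTIVE and STRIPE_HANDLE. */
#define DEMO_ACTIVE 1u
#define DEMO_HANDLE 2u

struct demo_stripe {
    atomic_uint state;
};

/* Returns true if the caller now owns the stripe and may process it.
 * Returns false if it is already being processed; in that case
 * DEMO_HANDLE is left set so the stripe gets another pass later. */
static bool demo_enter(struct demo_stripe *sh)
{
    assert(sh != NULL);
    atomic_fetch_and(&sh->state, ~DEMO_HANDLE);          /* like line 3388 */
    unsigned old = atomic_fetch_or(&sh->state, DEMO_ACTIVE); /* like line 3389 */
    if (old & DEMO_ACTIVE) {
        atomic_fetch_or(&sh->state, DEMO_HANDLE);        /* like line 3392 */
        return false;
    }
    return true;
}

/* Called when the current pass over the stripe is finished. */
static void demo_leave(struct demo_stripe *sh)
{
    atomic_fetch_and(&sh->state, ~DEMO_ACTIVE);
}
```

A second caller that arrives while the first is still inside gets false back, and the DEMO_HANDLE bit it leaves behind is what asks for one more pass.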
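Line 3388: clear the needs-handling flag.
Line 3389: set the being-handled flag.
Line 3392: if the stripe is already being handled, set the flag so that it is handled again next time, and return.
Line 3396: if this is a sync request.
Line 3397: set the sync-in-progress flag.
Line 3398: clear the in-sync flag.
Line 3400: clear the delayed flag.
Line 3408: analyse the stripe. This function is long, so it is covered in several parts: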
- 3198 static void analyse_stripe(struct stripe_head *sh, struct stripe_head_state *s)
- 3199 {
- 3200 struct r5conf *conf = sh->raid_conf;
- 3201 int disks = sh->disks;
- 3202 struct r5dev *dev;
- 3203 int i;
- 3204 int do_recovery = 0;
- 3205
- 3206 memset(s, 0, sizeof(*s));
- 3207
- 3208 s->expanding = test_bit(STRIPE_EXPAND_SOURCE, &sh->state);
- 3209 s->expanded = test_bit(STRIPE_EXPAND_READY, &sh->state);
- 3210 s->failed_num[0] = -1;
- 3211 s->failed_num[1] = -1;
- 3212
- 3213 /* Now to look around and see what can be done */
- 3214 rcu_read_lock();
Data initialization and taking the RCU read lock. Moving on:
- 3215 for (i=disks; i--; ) {
- 3216 struct md_rdev *rdev;
- 3217 sector_t first_bad;
- 3218 int bad_sectors;
- 3219 int is_bad = 0;
- 3220
- 3221 dev = &sh->dev[i];
- 3222
- 3223 pr_debug("check %d: state 0x%lx read %p write %p written %p\n",
- 3224 i, dev->flags,
- 3225 dev->toread, dev->towrite, dev->written);
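Next comes a big loop. It runs once per member disk of the stripe (sh->disks), and the object it works on is dev from line 3221. dev is a struct r5dev, so let's look at that structure first. It is embedded in struct stripe_head: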
- struct r5dev {
- /* rreq and rvec are used for the replacement device when
- * writing data to both devices.
- */
- struct bio req, rreq;
- struct bio_vec vec, rvec;
- struct page *page;
- struct bio *toread, *read, *towrite, *written;
- sector_t sector; /* sector of this page */
- unsigned long flags;
- } dev[1]; /* allocated with extra space depending of RAID geometry */
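The comment on dev[1] says the array is allocated with extra space depending on the RAID geometry, that is, one slot per member disk in a single allocation. A minimal sketch of that layout trick follows. It deliberately uses a C99 flexible array member instead of the kernel's dev[1] idiom, and demo_dev, demo_stripe and demo_stripe_alloc are hypothetical names with only a couple of illustrative fields.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-in for struct r5dev. */
struct demo_dev {
    unsigned long flags;
    unsigned long long sector;
};

/* Hypothetical, trimmed-down stand-in for struct stripe_head. */
struct demo_stripe {
    int disks;
    struct demo_dev dev[];   /* flexible array member: one slot per member disk */
};

/* Allocate one stripe with room for 'disks' per-device slots,
 * header and slots in a single zeroed block. */
static struct demo_stripe *demo_stripe_alloc(int disks)
{
    assert(disks > 0);
    struct demo_stripe *sh =
        calloc(1, sizeof(*sh) + (size_t)disks * sizeof(sh->dev[0]));
    if (sh)
        sh->disks = disks;
    return sh;
}
```

A caller would write something like `struct demo_stripe *sh = demo_stripe_alloc(4);`, index `sh->dev[0]` through `sh->dev[3]`, and release everything with a single `free(sh)`.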
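Start with the comment: rreq and rvec are used for the replacement device when writing data. The r prefix is short for replacement. What is a replacement? It is a substitute for an original data disk. Replacement is a feature introduced only in the last few kernel versions; it matters a lot in real products, and its implementation is covered later. page is the cache page, typically used for running computations. The bio pointers that follow are the heads of the read and write bio lists. sector is the sector position this page corresponds to. flags holds the flags of the struct r5dev.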
- 3226 /* maybe we can reply to a read
- 3227 *
- 3228 * new wantfill requests are only permitted while
- 3229 * ops_complete_biofill is guaranteed to be inactive
- 3230 */
- 3231 if (test_bit(R5_UPTODATE, &dev->flags) && dev->toread &&
- 3232 !test_bit(STRIPE_BIOFILL_RUN, &sh->state))
- 3233 set_bit(R5_Wantfill, &dev->flags);
- 3234
- 3235 /* now count some things */
- 3236 if (test_bit(R5_LOCKED, &dev->flags))
- 3237 s->locked++;
- 3238 if (test_bit(R5_UPTODATE, &dev->flags))
- 3239 s->uptodate++;
- 3240 if (test_bit(R5_Wantcompute, &dev->flags)) {
- 3241 s->compute++;
- 3242 BUG_ON(s->compute > 2);
- 3243 }
- 3244
- 3245 if (test_bit(R5_Wantfill, &dev->flags))
- 3246 s->to_fill++;
- 3247 else if (dev->toread)
- 3248 s->to_read++;
- 3249 if (dev->towrite) {
- 3250 s->to_write++;
- 3251 if (!test_bit(R5_OVERWRITE, &dev->flags))
- 3252 s->non_overwrite++;
- 3253 }
- 3254 if (dev->written)
- 3255 s->written++;
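The bookkeeping in lines 3231 to 3255 above is a single pass that tallies per-device flags into the stripe state. A simplified user-space sketch of that tally is shown here. All demo_* names and DEMO_* masks are hypothetical, the bio list heads are reduced to booleans, and the STRIPE_BIOFILL_RUN test from line 3232 is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical flag masks; the kernel uses bit numbers with test_bit(). */
enum {
    DEMO_LOCKED    = 1 << 0,
    DEMO_UPTODATE  = 1 << 1,
    DEMO_OVERWRITE = 1 << 2
};

struct demo_dev {
    unsigned flags;
    bool toread, towrite, written;   /* stand-ins for the bio list heads */
};

struct demo_state {
    int locked, uptodate, to_fill, to_read, to_write, non_overwrite, written;
};

/* Tally the per-device state of one stripe into *s. */
static void demo_count(const struct demo_dev *dev, int disks,
                       struct demo_state *s)
{
    assert(disks >= 0);
    memset(s, 0, sizeof(*s));
    for (int i = disks; i--; ) {
        const struct demo_dev *d = &dev[i];
        /* data already valid and someone wants to read it: copy only */
        bool want_fill = (d->flags & DEMO_UPTODATE) && d->toread;

        if (d->flags & DEMO_LOCKED)
            s->locked++;
        if (d->flags & DEMO_UPTODATE)
            s->uptodate++;
        if (want_fill)
            s->to_fill++;
        else if (d->toread)
            s->to_read++;
        if (d->towrite) {
            s->to_write++;
            if (!(d->flags & DEMO_OVERWRITE))
                s->non_overwrite++;   /* partial write, old data is needed */
        }
        if (d->written)
            s->written++;
    }
}
```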
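Line 3231: which r5dev gets the R5_Wantfill flag? One that is up to date, has a pending read request, and is not in the middle of a copy. What does that mean? The data needed is already current, so all that remains is to copy it from the page into the bio.
Line 3236: count locked disks.
Line 3238: count up-to-date disks.
Line 3240: count disks that need to be computed.
Line 3245: count disks that need a copy (fill) operation.
Line 3247: count disks that need to be read.
Line 3249: count disks that need to be written.
Line 3251: count disks whose pending write is not a full overwrite.
Line 3254: count disks with writes already issued.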
- 3256 /* Prefer to use the replacement for reads, but only
- 3257 * if it is recovered enough and has no bad blocks.
- 3258 */
- 3259 rdev = rcu_dereference(conf->disks[i].replacement);
- 3260 if (rdev && !test_bit(Faulty, &rdev->flags) &&
- 3261 rdev->recovery_offset >= sh->sector + STRIPE_SECTORS &&
- 3262 !is_badblock(rdev, sh->sector, STRIPE_SECTORS,
- 3263 &first_bad, &bad_sectors))
- 3264 set_bit(R5_ReadRepl, &dev->flags);
- 3265 else {
- 3266 if (rdev)
- 3267 set_bit(R5_NeedReplace, &dev->flags);
- 3268 rdev = rcu_dereference(conf->disks[i].rdev);
- 3269 clear_bit(R5_ReadRepl, &dev->flags);
- 3270 }
- 3271 if (rdev && test_bit(Faulty, &rdev->flags))
- 3272 rdev = NULL;
- 3273 if (rdev) {
- 3274 is_bad = is_badblock(rdev, sh->sector, STRIPE_SECTORS,
- 3275 &first_bad, &bad_sectors);
- 3276 if (s->blocked_rdev == NULL
- 3277 && (test_bit(Blocked, &rdev->flags)
- 3278 || is_bad < 0)) {
- 3279 if (is_bad < 0)
- 3280 set_bit(BlockedBadBlocks,
- 3281 &rdev->flags);
- 3282 s->blocked_rdev = rdev;
- 3283 atomic_inc(&rdev->nr_pending);
- 3284 }
- 3285 }
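The choice made in lines 3259 to 3272 above, replacement first and the original disk as fallback, can be sketched as a small pure function. This is only an illustration: demo_rdev, demo_pick_read_dev and DEMO_STRIPE_SECTORS are hypothetical, the bad-block lookup is reduced to one boolean per device, and the value 8 assumes a 4 KiB stripe unit counted in 512-byte sectors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_STRIPE_SECTORS 8ULL   /* assumption: 4 KiB / 512 bytes */

struct demo_rdev {
    bool faulty;
    unsigned long long recovery_offset; /* sectors below this are rebuilt */
    bool has_bad_block;                 /* stand-in for is_badblock() on the range */
};

/* Pick the device to read one stripe unit from, or NULL if none is usable. */
static const struct demo_rdev *
demo_pick_read_dev(const struct demo_rdev *rdev,
                   const struct demo_rdev *replacement,
                   unsigned long long stripe_sector)
{
    /* guard against wrap-around of the end-of-range computation */
    assert(stripe_sector + DEMO_STRIPE_SECTORS > stripe_sector);

    const struct demo_rdev *pick = rdev;
    if (replacement && !replacement->faulty &&
        replacement->recovery_offset >= stripe_sector + DEMO_STRIPE_SECTORS &&
        !replacement->has_bad_block)
        pick = replacement;          /* corresponds to setting R5_ReadRepl */
    if (pick && pick->faulty)
        pick = NULL;                 /* a failed disk cannot be read */
    return pick;
}
```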
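Line 3256: prefer reading from the replacement disk, provided it has been rebuilt far enough and has no bad sectors.
Line 3264: mark the device to be read from the replacement disk.
Line 3267: mark the device as needing a write to the replacement disk.
Line 3271: the disk has failed.
Line 3273: check for bad sectors.
The rest of the loop is not listed here:
Line 3286: initialize the dev state.
Line 3300: no bad sectors, set the in-sync flag.
Line 3312: write error handling.
Line 3325: repair handling for the data disk.
Line 3336: repair handling for the replacement disk.
Line 3352: record the disks that are not in sync.
Line 3360: decide between sync and rebuilding the replacement disk.

That is the end of analyse_stripe. So what did this function do for a sync? It merely set s.syncing=1. Don't be fooled by how long the function is: on any single pass it does very little.

Back in handle_stripe, we skip the code in between that is not executed and arrive here: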
- 3468 /* Now we might consider reading some blocks, either to check/generate
- 3469 * parity, or to satisfy requests
- 3470 * or to load a block that is being partially written.
- 3471 */
- 3472 if (s.to_read || s.non_overwrite
- 3473 || (conf->level == 6 && s.to_write && s.failed)
- 3474 || (s.syncing && (s.uptodate + s.compute < disks))
- 3475 || s.replacing
- 3476 || s.expanding)
- 3477 handle_stripe_fill(sh, &s, disks);
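Line 3468: this is where disk reads are prepared. Reads may be needed when generating parity and when serving read or write requests.
Line 3474: analyse_stripe set the syncing flag and nothing is up to date yet, so this condition holds and we enter handle_stripe_fill.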
- 2707 /**
- 2708 * handle_stripe_fill - read or compute data to satisfy pending requests.
- 2709 */
- 2710 static void handle_stripe_fill(struct stripe_head *sh,
- 2711 struct stripe_head_state *s,
- 2712 int disks)
- 2713 {
- 2714 int i;
- 2715
- 2716 /* look for blocks to read/compute, skip this if a compute
- 2717 * is already in flight, or if the stripe contents are in the
- 2718 * midst of changing due to a write
- 2719 */
- 2720 if (!test_bit(STRIPE_COMPUTE_RUN, &sh->state) && !sh->check_state &&
- 2721 !sh->reconstruct_state)
- 2722 for (i = disks; i--; )
- 2723 if (fetch_block(sh, s, i, disks))
- 2724 break;
- 2725 set_bit(STRIPE_HANDLE, &sh->state);
- 2726 }
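Line 2720: if a compute, check or reconstruct is already in progress, there is no need to read the disks again.
Line 2722: loop over every r5dev to see whether it needs a disk read.
Follow into fetch_block:
- 2618 /* fetch_block - checks the given member device to see if its data needs
- 2619 * to be read or computed to satisfy a request.
- 2620 *
- 2621 * Returns 1 when no more member devices need to be checked, otherwise returns
- 2622 * 0 to tell the loop in handle_stripe_fill to continue
- 2623 */
It checks whether data needs to be read in for the given device. A return value of 1 means the remaining devices need not be checked; 0 means the remaining devices still have to be checked.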
- 2624 static int fetch_block(struct stripe_head *sh, struct stripe_head_state *s,
- 2625 int disk_idx, int disks)
- 2626 {
- 2627 struct r5dev *dev = &sh->dev[disk_idx];
- 2628 struct r5dev *fdev[2] = { &sh->dev[s->failed_num[0]],
- 2629 &sh->dev[s->failed_num[1]] };
- 2630
- 2631 /* is the data in this block needed, and can we get it? */
- 2632 if (!test_bit(R5_LOCKED, &dev->flags) &&
- 2633 !test_bit(R5_UPTODATE, &dev->flags) &&
- 2634 (dev->toread ||
- 2635 (dev->towrite && !test_bit(R5_OVERWRITE, &dev->flags)) ||
- 2636 s->syncing || s->expanding ||
- 2637 (s->replacing && want_replace(sh, disk_idx)) ||
- 2638 (s->failed >= 1 && fdev[0]->toread) ||
- 2639 (s->failed >= 2 && fdev[1]->toread) ||
- 2640 (sh->raid_conf->level <= 5 && s->failed && fdev[0]->towrite &&
- 2641 !test_bit(R5_OVERWRITE, &fdev[0]->flags)) ||
- 2642 (sh->raid_conf->level == 6 && s->failed && s->to_write))) {
- 2643 /* we would like to get this block, possibly by computing it,
- 2644 * otherwise read it if the backing disk is insync
- 2645 */
- 2646 BUG_ON(test_bit(R5_Wantcompute, &dev->flags));
- 2647 BUG_ON(test_bit(R5_Wantread, &dev->flags));
- 2648 if ((s->uptodate == disks - 1) &&
- 2649 (s->failed && (disk_idx == s->failed_num[0] ||
- 2650 disk_idx == s->failed_num[1]))) {
- 2651 /* have disk failed, and we're requested to fetch it;
- 2652 * do compute it
- 2653 */
- 2654 pr_debug("Computing stripe %llu block %d\n",
- 2655 (unsigned long long)sh->sector, disk_idx);
- 2656 set_bit(STRIPE_COMPUTE_RUN, &sh->state);
- 2657 set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);
- 2658 set_bit(R5_Wantcompute, &dev->flags);
- 2659 sh->ops.target = disk_idx;
- 2660 sh->ops.target2 = -1; /* no 2nd target */
- 2661 s->req_compute = 1;
- 2662 /* Careful: from this point on 'uptodate' is in the eye
- 2663 * of raid_run_ops which services 'compute' operations
- 2664 * before writes. R5_Wantcompute flags a block that will
- 2665 * be R5_UPTODATE by the time it is needed for a
- 2666 * subsequent operation.
- 2667 */
- 2668 s->uptodate++;
- 2669 return 1;
- 2670 } else if (s->uptodate == disks-2 && s->failed >= 2) {
- 2671 /* Computing 2-failure is *very* expensive; only
- 2672 * do it if failed >= 2
- 2673 */
- 2674 int other;
- 2675 for (other = disks; other--; ) {
- 2676 if (other == disk_idx)
- 2677 continue;
- 2678 if (!test_bit(R5_UPTODATE,
- 2679 &sh->dev[other].flags))
- 2680 break;
- 2681 }
- 2682 BUG_ON(other < 0);
- 2683 pr_debug("Computing stripe %llu blocks %d,%d\n",
- 2684 (unsigned long long)sh->sector,
- 2685 disk_idx, other);
- 2686 set_bit(STRIPE_COMPUTE_RUN, &sh->state);
- 2687 set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);
- 2688 set_bit(R5_Wantcompute, &sh->dev[disk_idx].flags);
- 2689 set_bit(R5_Wantcompute, &sh->dev[other].flags);
- 2690 sh->ops.target = disk_idx;
- 2691 sh->ops.target2 = other;
- 2692 s->uptodate += 2;
- 2693 s->req_compute = 1;
- 2694 return 1;
- 2695 } else if (test_bit(R5_Insync, &dev->flags)) {
- 2696 set_bit(R5_LOCKED, &dev->flags);
- 2697 set_bit(R5_Wantread, &dev->flags);
- 2698 s->locked++;
- 2699 pr_debug("Reading block %d (sync=%d)\n",
- 2700 disk_idx, s->syncing);
- 2701 }
- 2702 }
- 2703
- 2704 return 0;
- 2705 }
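Once fetch_block has decided that a block is wanted (the long condition at lines 2632 to 2642), it picks one of three ways to obtain it. The sketch below isolates just that three-way choice. It is a simplification under stated assumptions: demo_view, demo_action and demo_fetch_decision are hypothetical names, and the flag updates and counters touched by the real function are omitted.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_action { DEMO_SKIP, DEMO_COMPUTE_ONE, DEMO_COMPUTE_TWO, DEMO_READ };

struct demo_view {
    int disks;       /* member disks in the stripe */
    int uptodate;    /* blocks already valid in the stripe cache */
    int failed;      /* failed member disks (0, 1 or 2) */
    bool is_failed;  /* is the block under consideration on a failed disk? */
    bool insync;     /* is its backing disk in sync (R5_Insync)? */
};

/* Decide how to obtain a block that is wanted but not yet valid. */
static enum demo_action demo_fetch_decision(const struct demo_view *v)
{
    assert(v->uptodate <= v->disks);
    if (v->uptodate == v->disks - 1 && v->failed && v->is_failed)
        return DEMO_COMPUTE_ONE;   /* everything else is valid: rebuild it */
    if (v->uptodate == v->disks - 2 && v->failed >= 2)
        return DEMO_COMPUTE_TWO;   /* expensive two-block rebuild */
    if (v->insync)
        return DEMO_READ;          /* plain read from the backing disk */
    return DEMO_SKIP;              /* nothing can be done on this pass */
}
```

For a fresh sync with nothing read yet (uptodate equal to 0 and no failed disks) the function returns DEMO_READ, which matches the branch at line 2695 in the listing.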
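Entering this function, the only card in our hand is s.syncing. Can this card do anything for us here?
Line 2632: decide whether the device needs to be read.
Line 2636: this test is clearly true because s.syncing==1; the other tests can be ignored for now.
Line 2648: no device has been read in yet, so s->uptodate==0 and the condition is false.
Line 2670: false for the same reason.
Line 2695: this is the branch that is actually taken.
Line 2696: set the locked flag on the device.
Line 2697: set the want-read flag on the device.
Line 2698: increment the stripe's count of locked devices.

Once handle_stripe_fill has finished, every struct r5dev of the stripe has the R5_Wantread flag set. Next, handle_stripe calls ops_run_io to do the reads: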
- 3673 ops_run_io(sh, &s);
Let's follow into this function. To keep the focus, only the sync-related code is listed here:
- 537 static void ops_run_io(struct stripe_head *sh, struct stripe_head_state *s)
- 538 {
- 539 struct r5conf *conf = sh->raid_conf;
- 540 int i, disks = sh->disks;
- 541
- 542 might_sleep();
- 543
- 544 for (i = disks; i--; ) {
- 545 int rw;
- 546 int replace_only = 0;
- 547 struct bio *bi, *rbi;
- 548 struct md_rdev *rdev, *rrdev = NULL;
- …
- 554 } else if (test_and_clear_bit(R5_Wantread, &sh->dev[i].flags))
- 555 rw = READ;
- …
- 560 } else
- 561 continue;
- 564
- 565 bi = &sh->dev[i].req;
- 566 rbi = &sh->dev[i].rreq; /* For writing to replacement */
- 567
- 568 bi->bi_rw = rw;
- 569 rbi->bi_rw = rw;
- 570 if (rw & WRITE) {
- 573 } else
- 574 bi->bi_end_io = raid5_end_read_request;
- 575
- 576 rcu_read_lock();
- 577 rrdev = rcu_dereference(conf->disks[i].replacement);
- 578 smp_mb(); /* Ensure that if rrdev is NULL, rdev won't be */
- 579 rdev = rcu_dereference(conf->disks[i].rdev);
- 580 if (!rdev) {
- 581 rdev = rrdev;
- 582 rrdev = NULL;
- 583 }
- …
- 598 if (rdev)
- 599 atomic_inc(&rdev->nr_pending);
- …
- 604 rcu_read_unlock();
- …
- 643 if (rdev) {
- 644 if (s->syncing || s->expanding || s->expanded
- 645 || s->replacing)
- 646 md_sync_acct(rdev->bdev, STRIPE_SECTORS);
- 647
- 648 set_bit(STRIPE_IO_STARTED, &sh->state);
- 649
- 650 bi->bi_bdev = rdev->bdev;
- 651 pr_debug("%s: for %llu schedule op %ld on disc %d\n",
- 652 __func__, (unsigned long long)sh->sector,
- 653 bi->bi_rw, i);
- 654 atomic_inc(&sh->count);
- 655 if (use_new_offset(conf, sh))
- 656 bi->bi_sector = (sh->sector
- 657 + rdev->new_data_offset);
- 658 else
- 659 bi->bi_sector = (sh->sector
- 660 + rdev->data_offset);
- 661 if (test_bit(R5_ReadNoMerge, &sh->dev[i].flags))
- 662 bi->bi_rw |= REQ_FLUSH;
- 663
- 664 bi->bi_flags = 1 << BIO_UPTODATE;
- 665 bi->bi_idx = 0;
- 666 bi->bi_io_vec[0].bv_len = STRIPE_SIZE;
- 667 bi->bi_io_vec[0].bv_offset = 0;
- 668 bi->bi_size = STRIPE_SIZE;
- 669 bi->bi_next = NULL;
- 670 if (rrdev)
- 671 set_bit(R5_DOUBLE_LOCKED, &sh->dev[i].flags);
- 672 generic_make_request(bi);
- 673 }
- …
- 709 }
- 710 }
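Lines 655 to 660 above turn the stripe's sector into the start sector of the bio on the member disk by adding that disk's data offset. A tiny sketch of the arithmetic follows. demo_rdev and demo_bio_sector are hypothetical names, and the decision the kernel makes in use_new_offset is passed in here as a plain boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned long long demo_sector_t;

struct demo_rdev {
    demo_sector_t data_offset;      /* where array data starts on this disk */
    demo_sector_t new_data_offset;  /* alternative start selected by use_new_offset */
};

/* Start sector on the member disk for the stripe unit at stripe_sector. */
static demo_sector_t demo_bio_sector(demo_sector_t stripe_sector,
                                     const struct demo_rdev *rdev,
                                     bool use_new_offset)
{
    assert(rdev != NULL);
    return stripe_sector +
           (use_new_offset ? rdev->new_data_offset : rdev->data_offset);
}
```

With a data offset of 2048 sectors, the stripe unit at sector 8 is read from sector 2056 of the member disk.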
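Line 542: the function may sleep.
Line 544: iterate over every r5dev.
Line 554: the device has R5_Wantread set, so this is a read.
Line 568: mark the bio as a read.
Line 574: set the bio's completion callback to raid5_end_read_request. This is where execution resumes after the read request has been issued.
Line 598: increment the device's nr_pending.
Line 646: accounting.
Line 648: set the I/O-started flag.
Line 650: point the bio at the corresponding disk device.
Line 654: increment the stripe_head reference count.
Lines 655-660: set the bio's start sector, which has to include the data offset on the disk.
Line 661: for a NoMerge read, set REQ_FLUSH on the bio.
Line 664: then fill in the bio's remaining fields.
Line 672: submit the bio to the disk.

When the disk has completed the read request, raid5_end_read_request is called: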
- 1710 static void raid5_end_read_request(struct bio * bi, int error)
- 1711 {
- ...
- 1824 rdev_dec_pending(rdev, conf->mddev);
- 1825 clear_bit(R5_LOCKED, &sh->dev[i].flags);
- 1826 set_bit(STRIPE_HANDLE, &sh->state);
- 1827 release_stripe(sh);
- 1828 }
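The pairing of atomic_inc(&sh->count) at line 654 with release_stripe at line 1827 is ordinary reference counting: each in-flight bio holds a reference, and the stripe can be queued for another pass once the last holder lets go. The sketch below models only that idea. demo_stripe, demo_get, demo_release and demo_end_read are hypothetical, and the queueing condition is an assumption about the general pattern rather than a copy of the kernel's release path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct demo_stripe {
    atomic_int count;  /* references: the handler plus one per in-flight bio */
    bool handle;       /* stand-in for STRIPE_HANDLE */
    bool queued;       /* stand-in for "waiting on the handle list" */
};

/* Take a reference, as done before each bio is submitted. */
static void demo_get(struct demo_stripe *sh)
{
    atomic_fetch_add(&sh->count, 1);
}

/* Drop a reference; the last holder queues the stripe if it needs handling. */
static void demo_release(struct demo_stripe *sh)
{
    int old = atomic_fetch_sub(&sh->count, 1);
    assert(old > 0);
    if (old == 1 && sh->handle)
        sh->queued = true;
}

/* What a read-completion callback does in this model. */
static void demo_end_read(struct demo_stripe *sh)
{
    sh->handle = true;   /* ask for another handling pass */
    demo_release(sh);
}
```

In this model a stripe with two reads in flight is queued only when the second completion arrives, so the next pass sees all of the data at once.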
This function clears the R5_LOCKED flag and queues the stripe_head for handling again. By way of raid5d, handle_stripe is called again. This time analyse_stripe increments s->uptodate once for each disk, so s->uptodate equals the number of disks. handle_stripe then reaches:
- 3528 if (sh->check_state ||
- 3529 (s.syncing && s.locked == 0 &&
- 3530 !test_bit(STRIPE_COMPUTE_RUN, &sh->state) &&
- 3531 !test_bit(STRIPE_INSYNC, &sh->state))) {
- 3532 if (conf->level == 6)
- 3533 handle_parity_checks6(conf, sh, &s, disks);
- 3534 else
- 3535 handle_parity_checks5(conf, sh, &s, disks);
- 3536 }
We take line 3535 to do the parity check and enter handle_parity_checks5:
- 2881 switch (sh->check_state) {
- 2882 case check_state_idle:
- 2883 /* start a new check operation if there are no failures */
- 2884 if (s->failed == 0) {
- 2885 BUG_ON(s->uptodate != disks);
- 2886 sh->check_state = check_state_run;
- 2887 set_bit(STRIPE_OP_CHECK, &s->ops_request);
- 2888 clear_bit(R5_UPTODATE, &sh->dev[sh->pd_idx].flags);
- 2889 s->uptodate--;
- 2890 break;
- 2891 }
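Line 2881: check_state is 0, so the branch at line 2882 is taken.
Line 2886: set the check_state_run state.
Line 2887: request the STRIPE_OP_CHECK operation.
Line 2889: decrement s->uptodate.

Because STRIPE_OP_CHECK has been requested here, handle_stripe goes on to call raid_run_ops, which in turn reaches: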
- 1412 if (test_bit(STRIPE_OP_CHECK, &ops_request)) {
- 1413 if (sh->check_state == check_state_run)
- 1414 ops_run_check_p(sh, percpu);
ops_run_check_p checks whether the stripe is in sync. Its completion callback is:
- 1301 static void ops_complete_check(void *stripe_head_ref)
- 1302 {
- 1303 struct stripe_head *sh = stripe_head_ref;
- 1304
- 1305 pr_debug("%s: stripe %llu\n", __func__,
- 1306 (unsigned long long)sh->sector);
- 1307
- 1308 sh->check_state = check_state_check_result;
- 1309 set_bit(STRIPE_HANDLE, &sh->state);
- 1310 release_stripe(sh);
- 1311 }
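For RAID5 the parity block is the XOR of the data blocks, so checking whether a stripe is in sync amounts to verifying that the XOR of all blocks, parity included, is zero everywhere. Here is a byte-wise sketch of that arithmetic. demo_parity_ok is a hypothetical helper written for illustration; it does not reproduce the asynchronous interface that ops_run_check_p uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true when the XOR of all blocks (data plus parity) is zero
 * at every byte offset, i.e. when the stored parity matches the data. */
static bool demo_parity_ok(const unsigned char *const *blocks,
                           int nblocks, size_t len)
{
    assert(nblocks > 0);
    for (size_t off = 0; off < len; off++) {
        unsigned char acc = 0;
        for (int i = 0; i < nblocks; i++)
            acc ^= blocks[i][off];
        if (acc)
            return false;   /* mismatch: the stripe is not in sync */
    }
    return true;
}
```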
Line 1308 sets the state to check_state_check_result, and the stripe is put back on the handle_list. handle_stripe calls handle_parity_checks5 once more, but this time check_state==check_state_check_result:
- 2916 case check_state_check_result:
- 2917 sh->check_state = check_state_idle;
- 2918
- 2919 /* if a failure occurred during the check operation, leave
- 2920 * STRIPE_INSYNC not set and let the stripe be handled again
- 2921 */
- 2922 if (s->failed)
- 2923 break;
- 2924
- 2925 /* handle a successful check operation, if parity is correct
- 2926 * we are done. Otherwise update the mismatch count and repair
- 2927 * parity if !MD_RECOVERY_CHECK
- 2928 */
- 2929 if ((sh->ops.zero_sum_result & SUM_CHECK_P_RESULT) == 0)
- 2930 /* parity is correct (on disc,
- 2931 * not in buffer any more)
- 2932 */
- 2933 set_bit(STRIPE_INSYNC, &sh->state);
- 2934 else {
- 2935 conf->mddev->resync_mismatches += STRIPE_SECTORS;
- 2936 if (test_bit(MD_RECOVERY_CHECK, &conf->mddev->recovery))
- 2937 /* don't try to repair!! */
- 2938 set_bit(STRIPE_INSYNC, &sh->state);
- 2939 else {
- 2940 sh->check_state = check_state_compute_run;
- 2941 set_bit(STRIPE_COMPUTE_RUN, &sh->state);
- 2942 set_bit(STRIPE_OP_COMPUTE_BLK, &s->ops_request);
- 2943 set_bit(R5_Wantcompute,
- 2944 &sh->dev[sh->pd_idx].flags);
- 2945 sh->ops.target = sh->pd_idx;
- 2946 sh->ops.target2 = -1;
- 2947 s->uptodate++;
- 2948 }
- 2949 }
- 2950 break;
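Line 2929: if the check says the stripe is in sync.
Line 2933: simply mark the stripe as in sync; nothing else needs to be done.
Line 2934: if the stripe is not in sync.
Line 2940: set check_state to check_state_compute_run.
Line 2942: set ops_request to STRIPE_OP_COMPUTE_BLK, i.e. get ready to compute the parity.
Line 2943: the compute target is the stripe's parity disk.
Line 2947: uptodate was decremented earlier when the check was started; restore it here.

If the stripe is already in sync, we arrive back in handle_stripe carrying the STRIPE_INSYNC flag: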
- 3550 if ((s.syncing || s.replacing) && s.locked == 0 &&
- 3551 test_bit(STRIPE_INSYNC, &sh->state)) {
- 3552 md_done_sync(conf->mddev, STRIPE_SECTORS, 1);
- 3553 clear_bit(STRIPE_SYNCING, &sh->state);
- 3554 }
If the stripe is not in sync, we arrive in raid_run_ops carrying the STRIPE_OP_COMPUTE_BLK flag, and that function calls __raid_run_ops:
- 1383 if (test_bit(STRIPE_OP_COMPUTE_BLK, &ops_request)) {
- 1384 if (level < 6)
- 1385 tx = ops_run_compute5(sh, percpu);
In the end ops_run_compute5 is called to compute the value of the stripe's parity block. Its completion callback is ops_complete_compute:
- 856 static void ops_complete_compute(void *stripe_head_ref)
- 857 {
- 858 struct stripe_head *sh = stripe_head_ref;
- 859
- 860 pr_debug("%s: stripe %llu\n", __func__,
- 861 (unsigned long long)sh->sector);
- 862
- 863 /* mark the computed target(s) as uptodate */
- 864 mark_target_uptodate(sh, sh->ops.target);
- 865 mark_target_uptodate(sh, sh->ops.target2);
- 866
- 867 clear_bit(STRIPE_COMPUTE_RUN, &sh->state);
- 868 if (sh->check_state == check_state_compute_run)
- 869 sh->check_state = check_state_compute_result;
- 870 set_bit(STRIPE_HANDLE, &sh->state);
- 871 release_stripe(sh);
- 872 }
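Computing the parity block for RAID5 means XOR-ing all the other blocks of the stripe into the target. A byte-wise sketch of that computation is given below. demo_compute_block is a hypothetical helper for illustration only; the kernel hands this work to its asynchronous XOR machinery and then marks the target up to date in the callback listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Rebuild block 'target' as the XOR of every other block in the stripe. */
static void demo_compute_block(unsigned char *const *blocks, int nblocks,
                               int target, size_t len)
{
    assert(target >= 0 && target < nblocks);
    memset(blocks[target], 0, len);
    for (int i = 0; i < nblocks; i++) {
        if (i == target)
            continue;
        for (size_t off = 0; off < len; off++)
            blocks[target][off] ^= blocks[i][off];
    }
}
```

The same routine rebuilds a data block as well: XOR-ing the parity with the remaining data blocks yields the missing one, which is why the single-target compute path can serve both sync and recovery.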
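Line 864: mark the parity disk's dev as R5_UPTODATE.
Line 869: handle_parity_checks5 set check_state_compute_run, so here the state moves on to check_state_compute_result.
Line 870: set the needs-handling flag; after line 871, handle_stripe is entered once more.

When handle_stripe is entered again and reaches handle_parity_checks5 once more, the state this time is check_state_compute_result: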
- 2894 case check_state_compute_result:
- 2895 sh->check_state = check_state_idle;
- 2896 if (!dev)
- 2897 dev = &sh->dev[sh->pd_idx];
- 2898
- 2899 /* check that a write has not made the stripe insync */
- 2900 if (test_bit(STRIPE_INSYNC, &sh->state))
- 2901 break;
- 2902
- 2903 /* either failed parity check, or recovery is happening */
- 2904 BUG_ON(!test_bit(R5_UPTODATE, &dev->flags));
- 2905 BUG_ON(s->uptodate != disks);
- 2906
- 2907 set_bit(R5_LOCKED, &dev->flags);
- 2908 s->locked++;
- 2909 set_bit(R5_Wantwrite, &dev->flags);
- 2910
- 2911 clear_bit(STRIPE_DEGRADED, &sh->state);
- 2912 set_bit(STRIPE_INSYNC, &sh->state);
- 2913 break;
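At a glance we can see that line 2912 sets the STRIPE_INSYNC flag, which signals the end of the stripe's sync. But don't celebrate too early: looking back, line 2908 does s->locked++, and one of the conditions for finishing the sync is s->locked==0. So one thing remains before the sync can end. Line 2909 sets R5_Wantwrite, which tells us that ops_run_io has to be called once more to write the freshly computed parity to the stripe's parity disk. When that write succeeds and returns, the conditions for ending the sync are met. And with that, one simple sync pass is complete.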
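The whole sync pass can be summarized as a small state machine over check_state, driven alternately by the handler and by the completion callbacks. The sketch below is a compressed model of the transitions described in this article. All demo_* names are hypothetical, failure handling (s->failed) is left out, and the parity result and the check-only mode are passed in as plain booleans.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum demo_check_state {
    DEMO_IDLE, DEMO_RUN, DEMO_CHECK_RESULT, DEMO_COMPUTE_RUN, DEMO_COMPUTE_RESULT
};

struct demo_sync {
    enum demo_check_state check_state;
    bool insync;      /* stand-in for STRIPE_INSYNC */
    bool want_write;  /* stand-in for R5_Wantwrite on the parity block */
};

/* Step taken by the handler, modeled on the switch in handle_parity_checks5. */
static void demo_handle(struct demo_sync *s, bool parity_matches, bool check_only)
{
    assert(s != NULL);
    switch (s->check_state) {
    case DEMO_IDLE:
        if (!s->insync)
            s->check_state = DEMO_RUN;          /* start a parity check */
        break;
    case DEMO_CHECK_RESULT:
        s->check_state = DEMO_IDLE;
        if (parity_matches || check_only)
            s->insync = true;                   /* nothing to repair */
        else
            s->check_state = DEMO_COMPUTE_RUN;  /* recompute the parity */
        break;
    case DEMO_COMPUTE_RESULT:
        s->check_state = DEMO_IDLE;
        s->want_write = true;                   /* parity must be written back */
        s->insync = true;
        break;
    default:
        break;   /* an asynchronous operation is still in flight */
    }
}

/* Step taken by the completion callbacks of the check and compute operations. */
static void demo_complete(struct demo_sync *s)
{
    if (s->check_state == DEMO_RUN)
        s->check_state = DEMO_CHECK_RESULT;
    else if (s->check_state == DEMO_COMPUTE_RUN)
        s->check_state = DEMO_COMPUTE_RESULT;
}
```

A mismatching stripe walks through idle, run, check result, compute run, compute result and back to idle with want_write set; a matching stripe stops after the check result with nothing to write.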