When to write disk IO after one byte of write file

2020-11-08 16:12:00 【Zhang Yanfei Allen】

After finishing the previous article, "How much disk IO actually happens when you read one byte of a file?", I was tempted to be lazy: I figured that covering reads was enough to introduce every module of the Linux IO stack. But many readers asked me to write a companion piece on the write path. Since there is real demand, here it is.

The Linux kernel is really complicated. Its source has grown from under two hundred thousand lines in version 1.0 into a giant of well over ten million lines today. If you dive straight in, it is easy to get lost in a dazzling maze of calls and never find your way out. Let me share how I approach the kernel: I pick one concrete question that I really want answered. However much I jump around in the code, I keep that question in mind, spend as little attention as possible on unrelated parts, and stop once the question is answered.

The question I want to answer now is how write behaves in the most common case: no O_DIRECT and no O_SYNC. (There are several ways to write a file, including sync mode, direct mode, and mmap memory mapping.) The C example is as follows:

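#include <fcntl.h>
#include <unistd.h>

int main()
{
    char c = 'a';
    int out;

    /* O_CREAT requires a third argument giving the permission bits */
    out = open("out.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(out, &c, 1);   /* buffered write: no O_DIRECT, no O_SYNC */
    ...
}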

To refine the question: after we write one byte to the opened file,

How is the write function executed inside the kernel?
When does the data actually get written to the disk?

The discussion inevitably refers to kernel source code. The kernel version I use is 3.10.1. If needed, you can download it from https://mirrors.edge.kernel.org/pub/linux/kernel/v3.x/.

Analysis of the write implementation

I spent a lot of time tracing the calls and returns as write goes into the ext4 file system, and organized them into an interaction diagram. To keep the focus, I left out many details, such as direct IO and the ext4 journal, and kept only the calls I consider critical.

[Figure: interaction diagram of the write call path into ext4]

In the flow chart above, where does the write finally end up? In __block_commit_write near the end, all that happens is that the page is marked dirty. After that, in most cases the function call simply returns (balance_dirty_pages_ratelimited is discussed later). The data is still in memory, in the page cache, and has not actually been written to the hard disk.

Why do it this way instead of writing straight to the disk? Because disks, and mechanical disks in particular, are simply too slow. On a server-grade spinning disk, the average random-access latency is on the order of several milliseconds, which translates into only about 100 to 200 IOPS. Imagine that every user request to your back-end interface required one random disk IO: however good your server is, about 200 QPS would saturate the disk. If you serve millions, tens of millions, or hundreds of millions of users, that is something you cannot tolerate.

Of course, the page cache approach also has a side effect: if the server loses power, everything still in memory is lost. So Linux adds another "patch", delayed write-back, to mitigate the problem. Note that I say mitigate; it does not solve the problem completely.

Now for balance_dirty_pages_ratelimited. In most cases write returns as soon as the data is in the page cache. There is one case, though, in which the user process has to wait for write-back before it can return: when balance_dirty_pages_ratelimited decides that a limit has been exceeded. This function checks whether the current amount of dirty pages exceeds the upper limit set by dirty_bytes or dirty_ratio, and makes the caller wait if it does. Only one of the two parameters is in effect at a time; the other is 0. Taking dirty_ratio as an example, a setting of 30 means that once dirty pages exceed 30% of memory, the write call has to wait for write-back before returning. You can view both settings under /proc/sys/vm/ on your machine.

# cat /proc/sys/vm/dirty_bytes
0
# cat /proc/sys/vm/dirty_ratio
30
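
The rule that only one of the two parameters is in effect can be sketched as a small C function. This is a simplified model, not the kernel's code: the name effective_dirty_limit and the dirtyable_bytes argument are assumptions made for illustration, and the real kernel applies the ratio to its own estimate of dirtyable memory rather than to a number supplied by the caller.

```c
#include <stdint.h>

/*
 * Simplified model (not kernel code): return the dirty-page limit in bytes.
 * If the *_bytes setting is non-zero it is used directly; otherwise the
 * *_ratio setting is applied as a percentage of dirtyable_bytes.
 * dirtyable_bytes is a hypothetical, caller-supplied memory figure.
 */
uint64_t effective_dirty_limit(uint64_t limit_bytes, unsigned ratio_percent,
                               uint64_t dirtyable_bytes)
{
    if (limit_bytes != 0)
        return limit_bytes;
    return dirtyable_bytes * ratio_percent / 100;
}
```

The same shape of rule applies to any pair of settings where one is a byte count and the other a percentage; passing dirty_bytes = 0 and dirty_ratio = 30 makes the percentage branch decide.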

Delayed write-back in the kernel

So when does the kernel actually write the data to the hard disk? To get the overall picture quickly, my approach was to use the systemtap tool: find a key function in the kernel's write IO path and print the call stack whenever it is hit. After a good deal of digging, I settled on the do_writepages function.

#!/usr/bin/stap
probe kernel.function("do_writepages")
{
    printf("--------------------------------------------------------\n"); 
    print_backtrace(); 
    printf("--------------------------------------------------------\n"); 
}

After tracing with systemtap, the printed output is as follows:

 0xffffffff8118efe0 : do_writepages+0x0/0x40 [kernel]
 0xffffffff8122d7d0 : __writeback_single_inode+0x40/0x220 [kernel]
 0xffffffff8122e414 : writeback_sb_inodes+0x1c4/0x490 [kernel]
 0xffffffff8122e77f : __writeback_inodes_wb+0x9f/0xd0 [kernel]
 0xffffffff8122efb3 : wb_writeback+0x263/0x2f0 [kernel]
 0xffffffff8122f35c : bdi_writeback_workfn+0x1cc/0x460 [kernel]
 0xffffffff810a881a : process_one_work+0x17a/0x440 [kernel]
 0xffffffff810a94e6 : worker_thread+0x126/0x3c0 [kernel]
 0xffffffff810b098f : kthread+0xcf/0xe0 [kernel]
 0xffffffff816b4f18 : ret_from_fork+0x58/0x90 [kernel]

The output above shows that the actual file write is carried out by a worker kernel thread. (It has nothing to do with our application process; by this point our application's write call returned long ago.) This worker thread performs write-back periodically, with the period set by the kernel parameter dirty_writeback_centisecs. As the name suggests, its unit is one hundredth of a second.

# cat /proc/sys/vm/dirty_writeback_centisecs
500

My configuration is 500, so write-back runs every 5 seconds. Coming back to our question: what we care about most is when the data gets written, so I only followed the code along that line. Tracing further along the call stack, jumping from function to function, I finally found the code below. It shows that in for_background mode, write-back proceeds only while the over_bground_thresh check succeeds.

static long wb_writeback(struct bdi_writeback *wb,
                         struct wb_writeback_work *work)
{
    work->older_than_this = &oldest_jif;
    ...
    /* background write-back stops once dirty pages are below the threshold */
    if (work->for_background && !over_bground_thresh(wb->bdi))
        break;
    ...

    /* periodic (kupdate) write-back: data dirtied before oldest_jif has expired */
    if (work->for_kupdate) {
        oldest_jif = jiffies -
                msecs_to_jiffies(dirty_expire_interval * 10);
    } else ...
}

static long wb_check_background_flush(struct bdi_writeback *wb)
{
    /* start background write-back only when over the threshold */
    if (over_bground_thresh(wb->bdi)) {
        ...
        return wb_writeback(wb, &work);
    }
}

So what does the over_bground_thresh function check? It checks whether the current amount of dirty pages exceeds the kernel parameter dirty_background_ratio or dirty_background_bytes; if not, nothing is written back (the code is at fs/fs-writeback.c:1440, which I will not paste here for lack of space). Only one of the two parameters is actually in effect: dirty_background_ratio is a percentage, and dirty_background_bytes is a byte count.

The two parameters are configured as follows on my machine, which means that background write-back starts once dirty pages exceed 10%.

# cat /proc/sys/vm/dirty_background_bytes
0
# cat /proc/sys/vm/dirty_background_ratio
10

What if the dirty pages never exceed that percentage? Are they never written? No. In the wb_writeback function above we saw that in for_kupdate mode an expiry cutoff is recorded in work->older_than_this, and the code that follows also writes back pages that meet this condition. Where does the variable dirty_expire_interval come from? kernel/sysctl.c has the clue: it comes from the /proc/sys/vm/dirty_expire_centisecs setting.

    {
        .procname       = "dirty_expire_centisecs",
        .data           = &dirty_expire_interval,
        .maxlen         = sizeof(dirty_expire_interval),
        .mode           = 0644,
        .proc_handler   = proc_dointvec_minmax,
        .extra1         = &zero,
    },

On my machine the value is 3000. The unit is one hundredth of a second, so pages that have been dirty for more than 30 seconds are considered expired by the kernel thread and written back to disk.

# cat /proc/sys/vm/dirty_expire_centisecs
3000
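
The unit conversion behind the kernel's `dirty_expire_interval * 10` (centiseconds to milliseconds) and the resulting expiry check can be sketched in user-space C. This is a simplified model under stated assumptions: the function names and the millisecond timestamps are hypothetical, whereas the kernel itself compares jiffies.

```c
#include <stdbool.h>
#include <stdint.h>

/* Convert a *_centisecs setting to milliseconds (1 centisecond = 10 ms). */
uint64_t centisecs_to_ms(uint64_t centisecs)
{
    return centisecs * 10;
}

/*
 * Simplified model of the for_kupdate cutoff: data dirtied at dirtied_at_ms
 * counts as expired once it is at least expire_centisecs old at now_ms.
 */
bool dirty_data_expired(uint64_t now_ms, uint64_t dirtied_at_ms,
                        uint64_t expire_centisecs)
{
    if (dirtied_at_ms > now_ms)
        return false;   /* timestamp in the future: treat as not expired */
    return now_ms - dirtied_at_ms >= centisecs_to_ms(expire_centisecs);
}
```

With the value 3000 from the author's machine, the cutoff works out to 30000 ms, which is the 30 seconds mentioned above.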

Conclusion

In our demo code, write in most cases returns as soon as the data is in the page cache; nothing has actually been written to the disk. The actual disk IO requests for our data are issued in the following three situations:

Case 1: when the write system call runs, it finds too many dirty pages in the page cache, more than dirty_ratio or dirty_bytes allows, and write has to wait.
Case 2: write returns right after writing to the page cache. When the worker kernel thread later runs asynchronously, it checks the dirty-page level again and issues write-back requests if it exceeds dirty_background_ratio or dirty_background_bytes.
Case 3: write has again already returned. When the worker kernel thread runs asynchronously, the system's dirty pages are still below dirty_background_ratio or dirty_background_bytes, but some pages have stayed dirty in memory longer than dirty_expire_centisecs, so write-back is issued for them as well.
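
The three cases above can be condensed into one decision function. This is only an illustrative model, not kernel code: the enum, the function name, and all parameters are hypothetical, and the limits are assumed to have already been resolved to byte counts.

```c
#include <stdint.h>

/* Hypothetical labels for the three situations described in the text. */
enum writeback_trigger {
    WB_NONE,          /* data simply stays in the page cache for now */
    WB_WRITER_WAITS,  /* case 1: over dirty_ratio / dirty_bytes */
    WB_BACKGROUND,    /* case 2: over dirty_background_ratio / _bytes */
    WB_EXPIRED        /* case 3: older than dirty_expire_centisecs */
};

/*
 * Simplified model: decide which condition, if any, would cause disk IO.
 * hard_limit and background_limit are byte thresholds; the ages are in
 * centiseconds to match the /proc/sys/vm settings.
 */
enum writeback_trigger classify_writeback(uint64_t dirty_now,
                                          uint64_t hard_limit,
                                          uint64_t background_limit,
                                          uint64_t oldest_dirty_age_cs,
                                          uint64_t expire_cs)
{
    if (dirty_now > hard_limit)
        return WB_WRITER_WAITS;
    if (dirty_now > background_limit)
        return WB_BACKGROUND;
    if (oldest_dirty_age_cs >= expire_cs)
        return WB_EXPIRED;
    return WB_NONE;
}
```

The checks are ordered from the most to the least urgent condition; in the real kernel the first check happens in the writer's context and the other two in the worker thread.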

If you are not satisfied with these settings, you can adjust them in /etc/sysctl.conf. Do not forget to run sysctl -p afterwards so that the changes take effect.
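
To confirm from inside a program which values are currently in effect, the files under /proc/sys/vm/ can be read like any other text file. The helper below is a minimal sketch; its name read_long_from_file is an assumption for illustration, and it simply parses the first integer in whatever file it is given.

```c
#include <stdio.h>

/*
 * Read a single integer from a text file such as
 * /proc/sys/vm/dirty_ratio.
 * Returns 0 on success and stores the value in *out, or -1 on failure.
 */
int read_long_from_file(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int parsed = (fscanf(f, "%ld", out) == 1);
    fclose(f);
    return parsed ? 0 : -1;
}
```

Calling it with "/proc/sys/vm/dirty_expire_centisecs" on a Linux machine would return the same number that cat prints for that file.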

Finally, keep in mind that the first goal of this page cache plus write-back mechanism is performance, not a guarantee that written data will never be lost. If the power fails, dirty pages that have not yet been dirty for longer than dirty_expire_centisecs really are lost. If your business is critical and involves money, and you must be sure the data has reached the disk before returning, you may need to consider using fsync.
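
A minimal sketch of the fsync approach follows. The helper name write_durably is hypothetical, and the error handling is deliberately simple (for example, it does not retry on EINTR); it only illustrates calling fsync before reporting success.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Write len bytes to path and report success only after fsync() has
 * returned without error. Returns 0 on success, -1 on any failure.
 */
int write_durably(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);   /* data lands in the page cache */
        if (n <= 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    if (fsync(fd) < 0) {                 /* ask the kernel to flush it now */
        close(fd);
        return -1;
    }
    return close(fd);
}
```

The price is latency: every call now waits for the device instead of returning as soon as the page cache is updated, which is exactly the trade-off the write-back mechanism exists to avoid.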


My official account is 「 Develop internal skill and practice 」. It is not only about technical theory, and not only about practical experience, but about combining the two: using practice to deepen the understanding of theory, and using theory to improve practical skills. You are welcome to follow it and share it with your friends.

Copyright notice
This article was written by [Zhang Yanfei Allen]. Please include a link to the original when reposting. Thank you.
