How Application I/Os Are Treated by the I/O Scheduler


Recently I have been doing some FIO benchmarking, and I found that the IOPS reported by FIO != the IOPS reported by iostat. Which made me think: why the heck?

So here is my FIO job with bs=4M and a sequential write workload:

$ fio --filename=/dev/sdb --name=write-4M --rw=write --ioengine=libaio --bs=4M --numjobs=1 --direct=1 --randrepeat=0  --iodepth=1 --runtime=100 --ramp_time=5 --size=100G --group_reporting

write-4M: (g=0): rw=write, bs=4M-4M/4M-4M/4M-4M, ioengine=libaio, iodepth=1
fio-2.11-12-g82e6
Starting 1 process
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/220.3MB/0KB /s] [0/55/0 iops] [eta 00m:00s]
write-4M: (groupid=0, jobs=1): err= 0: pid=424038: Tue Jun  7 23:48:32 2016
  write: io=22332MB, bw=228677KB/s, iops=55, runt=100001msec

As you can see, FIO reported 55 IOPS.

At the same time, while the FIO job was running, I was monitoring the iostat output for the device.
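A view like the ones below typically comes from an extended, per-second iostat run; I did not record the exact invocation, so take the command here as an assumption about what produced it:

$ iostat -dxk sdb 1

Here are a few of the one-second snapshots: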

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  441.00     0.00 225792.00  1024.00     4.39    9.94    0.00    9.94   2.25  99.30


Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  445.00     0.00 227840.00  1024.00     4.38    9.90    0.00    9.90   2.23  99.40


Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  455.00     0.00 232960.00  1024.00     4.39    9.65    0.00    9.65   2.18  99.10


Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  449.00     0.00 229888.00  1024.00     4.41    9.81    0.00    9.81   2.21  99.40


Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00    0.00  443.00     0.00 226816.00  1024.00     4.37    9.80    0.00    9.80   2.24  99.30

As you can see, the w/s field of iostat, which represents writes per second, i.e. write IOPS, is over 400.

So here is the question: the application, FIO, reports 55 write IOPS, but the system utility iostat reports 400+ write IOPS. Why is that?

What's going on?

To find out, I read quite a lot about how things really happen at the I/O scheduler level (thanks to Jens Axboe and the FIO community for their pointers). Here are the key things I learned:

  • The I/O scheduler splits any I/O that is larger than 512 KB, irrespective of whether the pattern is sequential or random.
  • The I/O scheduler merges I/Os that are destined for identical or adjacent regions of the disk. This happens with sequential requests.

How does this work?

The application submits an I/O request to the block I/O layer, which consists of an I/O queue and an I/O scheduler. The I/O scheduler sorts the I/O queue and performs splits and merges where required.
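If you want to watch this happening live, blktrace and blkparse can trace block-layer events while the FIO job runs. This is only a sketch: it assumes blktrace is installed and that the device is /dev/sdb as above, and depending on the kernel version the breakup of large I/Os shows up either as X (split) events or simply as several queue/dispatch events per application I/O, while merges show up as M or F events:

$ blktrace -d /dev/sdb -o - | blkparse -i -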

Merging occurs when an I/O request is issued to an identical or adjacent region of the disk. Instead of being issued as a new request on its own, it is merged into the existing request for that region. This minimizes the number of outstanding requests.
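You can relate merging directly to the iostat output above: the rrqm/s and wrqm/s columns count merged read and write requests per second. They are 0.00 in my runs, which fits, since each request is already at the 512 KB cap and there is nothing left to merge. Whether the scheduler is allowed to merge at all is controlled through sysfs (device name assumed to be sdb, as above):

$ cat /sys/block/sdb/queue/nomerges
0

A value of 0 means all merging is enabled, 1 allows only simple merges, and 2 disables merging entirely.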

A split occurs when the I/O submitted by the application is greater than 512 KB; the I/O scheduler splits the application I/O into several chunks of 512 KB each.
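The 512 KB boundary is not arbitrary; it is the maximum request size the block layer advertises for the device, readable from sysfs. On the machine above it would presumably show 512; the value printed below is what I would expect, not captured output:

$ cat /sys/block/sdb/queue/max_sectors_kb
512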

Finally, the I/O scheduler selects one request at a time and dispatches it to the block device driver, which in turn writes it to disk.
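Which scheduler is doing this sorting and dispatching can also be checked via sysfs; the entry in square brackets is the active one. The schedulers listed depend on your kernel, so the output below is only an example:

$ cat /sys/block/sdb/queue/scheduler
noop [deadline] cfq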

Here is the proof

Let’s verify this theory using the output shown above.

  • FIO submits 4 MB sequential I/Os to the block I/O layer. Since the block size used by FIO is greater than 512 KB, the I/O scheduler splits each I/O into 8 chunks of 512 KB each, and those chunks are written to disk. This split is abstracted away from the application, so FIO knows nothing about it; from FIO's perspective the writes happened at a 4 MB block size, so it reports 55 IOPS.
  • However, the writes actually hit the disk at a 512 KB block size, because the I/O scheduler split the application I/O, so iostat reports the higher IOPS; the quick check below bears this out.
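A quick sanity check of those numbers with shell arithmetic, using only the values reported above:

$ echo $(( 4096 / 512 ))   # a 4 MB request split into 512 KB chunks
8
$ echo $(( 55 * 8 ))       # FIO IOPS x chunks per request
440

440 is right in line with the 441 to 455 w/s that iostat was reporting.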

Now let's do some reverse engineering on the iostat output to prove this.

As we know, block size = bandwidth / IOPS.

So block size = ( wkB/s ) / ( w/s ) = 225792 kB/s / 441 = 512 KB. This is the block size of each request written to disk, and it is why the iostat output shows 400+ IOPS while FIO shows 55 IOPS.
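The same check in the shell, plus one more corroborating column: avgrq-sz is reported in 512-byte sectors, and 1024 sectors x 512 bytes is again 512 KB per request.

$ echo $(( 225792 / 441 ))   # wkB/s divided by w/s
512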

I hope this gives you some idea of how application I/Os are treated at the block I/O layer.
