Recently I have been doing FIO benchmarking and I found that the IOPS reported by FIO != the IOPS reported by iostat. Which made me think: why the heck?
So here is my FIO job with bs=4M and sequential writes:
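I don't have the original job file any more, but a job along these lines reproduces the setup (only bs=4M and the sequential-write pattern come from the discussion; every other name and value here is an assumption):

```ini
; hypothetical job file -- only bs=4M and rw=write are from the post
[seqwrite]
rw=write        ; sequential writes
bs=4M           ; 4M block size, larger than the 512k split limit
ioengine=libaio
direct=1        ; bypass the page cache so iostat sees the raw requests
size=1G
numjobs=1
```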
fio reported 55 write IOPS for this job.
While the FIO job was running, I was monitoring the iostat output, which looked like this:
The w/s field of iostat, which represents writes per second, i.e. write IOPS, was over 400.
So the question is: why does the application (fio) report write IOPS as 55 while the system utility (iostat) reports write IOPS as 400+? What's going on?
To find out, I read quite a lot about how things really happen at the I/O scheduler level (thanks to Jens Axboe and the FIO community for their pointers). Here are the key things I learned:
- The I/O scheduler splits any I/O larger than 512k, irrespective of sequential or random pattern.
- The I/O scheduler merges I/Os that are destined for adjacent regions of the disk. This happens with sequential requests.
How does this work?
The application submits an I/O request to the block I/O layer, which consists of the I/O queue and the I/O scheduler. The I/O scheduler sorts the I/O queue and performs splits and merges where required.
Merging occurs when an I/O request is issued to a region of the disk identical or adjacent to that of a pending request. Instead of being issued as a new request on its own, it is merged into the identical or adjacent request. This minimizes the number of outstanding requests.
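The merge path can be sketched as a toy model (plain Python, not kernel code; the request layout is invented for illustration):

```python
# Toy model of I/O request merging: requests to adjacent disk regions
# are coalesced into one larger request before dispatch.
# Illustrative only -- the real elevator lives in the kernel block layer.

def merge_adjacent(requests):
    """requests: list of (offset, length) tuples, sorted by offset."""
    merged = []
    for off, length in requests:
        if merged and merged[-1][0] + merged[-1][1] == off:
            # Back-merge: the new request starts exactly where the
            # previous one ends, so extend the previous request.
            prev_off, prev_len = merged[-1]
            merged[-1] = (prev_off, prev_len + length)
        else:
            merged.append((off, length))
    return merged

# Four sequential 4k writes collapse into a single 16k request:
print(merge_adjacent([(0, 4096), (4096, 4096), (8192, 4096), (12288, 4096)]))
# [(0, 16384)]
```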
A split occurs when the I/O submitted by the application is greater than 512k: the I/O scheduler splits the application's I/O into several chunks of 512k each (this limit typically comes from the device's max_sectors_kb setting under /sys/block/<dev>/queue/).
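The split path in the same toy style (the 512k limit is taken from the discussion above, not queried from a real device):

```python
# Toy model of the split path: one request larger than the limit is
# broken into 512k chunks before reaching the block device driver.
MAX_IO = 512 * 1024  # split threshold, 512k in this example

def split_request(offset, length, limit=MAX_IO):
    """Split an (offset, length) request into chunks no larger than limit."""
    chunks = []
    while length > 0:
        step = min(length, limit)
        chunks.append((offset, step))
        offset += step
        length -= step
    return chunks

# A single 4M write becomes 8 requests of 512k each:
chunks = split_request(0, 4 * 1024 * 1024)
print(len(chunks))  # 8
```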
Finally, the I/O scheduler selects one request at a time and dispatches it to the block device driver, which in turn writes it to disk.
Here is the proof
Let’s verify this theory using the output shown above.
- FIO submits 4M sequential I/Os to the block I/O layer. Since the block size used by FIO is > 512k, the I/O scheduler splits each I/O into 8 chunks of 512k each, and those chunks are written to disk. Since this split operation is hidden from the application, FIO knows nothing about it; from FIO's perspective the writes happened at a 4M block size, so it reports 55 IOPS.
- However, the writes actually hit the disk at a 512k block size, since the I/O scheduler split the application's I/Os. So iostat reports higher IOPS.
Now let's do some reverse engineering on the iostat output to prove this.
As we know, block size = bandwidth / IOPS.
So block size = (wkB/s) / (w/s) = 225792k / 441 = 512k. This is the block size of each request actually written to disk, which is why the iostat output shows 400+ IOPS while FIO shows 55 IOPS.
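The arithmetic checks out directly (numbers taken from the iostat and fio figures quoted above):

```python
# Reverse-engineering the iostat numbers from the post.
wkb_per_s = 225792          # wkB/s reported by iostat
w_per_s = 441               # w/s reported by iostat

block_size_kb = wkb_per_s / w_per_s
print(block_size_kb)        # 512.0 -> each dispatched request is 512k

# Cross-check against fio: 55 IOPS at 4M, each split into 8 x 512k chunks,
# predicts roughly 55 * 8 = 440 requests/sec at the device level.
fio_iops = 55
chunks_per_io = 4 * 1024 // 512
print(fio_iops * chunks_per_io)  # 440, close to iostat's 441 w/s
```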
I hope this gives you some idea of how application I/Os are treated at the block I/O layer.