Storage test gimmicks. What to watch out for when comparing storage performance results between systems and cloud vendors.
There are multiple websites offering “Performance benchmarking for your server” or “Benchmark your server” services. Generally, you download their tools, run them and get a performance report, which is supposedly representative of different aspects of your dedicated physical server or VM. They measure CPU, networking and storage performance and assume that overall performance of each of these aspects could be quickly and easily measured.
The eagerness of the authors of benchmarking websites to give really quick results to their users clashes with the best practices for performance measurement. Most users will spend less than 15 seconds on a web page, so making the report short and easy to read sounds reasonable. However, when it comes to storage subsystem performance, important details are omitted or even simply disregarded. Even though we generally advise against taking such “quick test” results seriously, we will point out a few important details to look out for when reading or comparing such results.
The good thing about most performance measurement websites is that they provide the command line and the raw output of the tool used for measuring the storage performance aspect. These are usually very popular and widely used open source tools like the “Flexible IO tester” or in short fio and the Linux coreutils tool for converting and copying files – dd (also known as “disk destroyer” by some sysadmins).
Some “Direct” consequences
While most services run multiple tests, and at least some of them use the “O_DIRECT” flag, most will show the result obtained without this flag as the summary or headline figure. So first, a step back to explain what O_DIRECT means and why it is important. With this flag set, the kernel is instructed to bypass its cache subsystem and send read and write requests directly to the underlying block device being tested. This buffer-cache-bypass mechanism works even when the test is performed on a file. It prevents cache pollution and double caching, both of which distort the results. The only valid omissions of O_DIRECT we could think of are:
- the cache system itself is being tested; or
- the test involves a write-only workload and the O_SYNC flag is also set.
The latter has its own set of corner cases, which we will not discuss in detail here. A few words on combining O_SYNC with O_DIRECT follow below.
If the testing tool was fio and the test measured read performance, e.g. “--rw=read” or “--rw=randread”, then “--direct=1” should also be set. If the test involves “--rw=write” (sequential write) or “--rw=randwrite” (random write), either “--direct=1” or “--sync=1” should also be specified.
There are cases when both flags could be set, but some combinations are completely broken, for example O_SYNC+O_DIRECT on a file in a regular file system, or on a virtio block device. Both deliver test results skewed towards exposing a particular bottleneck (serialization) in the file system or the virtualization stack. They are not close to what a real application does, so they should not be taken as representative.
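The rules above can be turned into a quick sanity check for the fio command line printed next to a published result. The following is only a minimal sketch: the function name and the warning texts are our own, only the “--option=value” spelling is parsed, and only the literal value 1 is treated as “set”.

```python
# Hypothetical helper: flags fio command lines that may be measuring the
# page cache instead of the storage. Not part of fio itself.
READ_MODES = {"read", "randread", "rw", "readwrite", "randrw"}
WRITE_ONLY_MODES = {"write", "randwrite"}


def fio_cache_warnings(args):
    """Return a list of warnings for a list of fio command line arguments."""
    opts = {}
    for arg in args:
        if arg.startswith("--"):
            key, _, value = arg[2:].partition("=")
            opts[key] = value

    # fio defaults to sequential reads when no --rw/--readwrite is given.
    mode = opts.get("rw", opts.get("readwrite", "read")).split(":")[0]
    direct = opts.get("direct") == "1"
    sync = opts.get("sync") == "1"

    warnings = []
    if mode in READ_MODES and not direct:
        warnings.append("reads without --direct=1 may be served from cache")
    if mode in WRITE_ONLY_MODES and not (direct or sync):
        warnings.append("writes without --direct=1 or --sync=1 may only hit cache")
    if direct and sync:
        warnings.append("--direct=1 with --sync=1 may mostly expose serialization")
    return warnings


if __name__ == "__main__":
    for warning in fio_cache_warnings(["--name=quick", "--rw=randread", "--bs=4k"]):
        print("WARNING:", warning)
```

A command line that passes this check is not automatically a good test; it only rules out the most common way of benchmarking the cache by accident.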
The same rule applies to dd. Even though dd tests with an I/O depth of 1, for some devices the results with and without O_DIRECT will differ significantly. Look for a missing “iflag=direct” when the result comes from reading from a device or file, and for a missing “oflag=direct” when the test involved writing to the device or file. Also, see “Say no to zeroes” below.
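The same kind of check can be sketched for dd. The helper below is hypothetical and deliberately simple: it looks only at the “if=”, “of=”, “iflag=” and “oflag=” operands and ignores a few well-known pseudo devices. The device and file paths used with it are placeholders.

```python
# Hypothetical helper: reports which direct-I/O flags a dd command lacks.
PSEUDO_DEVICES = {"/dev/zero", "/dev/null", "/dev/urandom", "/dev/random"}


def dd_missing_direct(operands):
    """Return the direct flags missing from a list of dd operands."""
    opts = dict(op.split("=", 1) for op in operands if "=" in op)
    source = opts.get("if")
    target = opts.get("of")
    # iflag/oflag accept a comma-separated list, e.g. oflag=direct,dsync
    iflags = opts.get("iflag", "").split(",")
    oflags = opts.get("oflag", "").split(",")

    missing = []
    if source and source not in PSEUDO_DEVICES and "direct" not in iflags:
        missing.append("iflag=direct")
    if target and target not in PSEUDO_DEVICES and "direct" not in oflags:
        missing.append("oflag=direct")
    return missing


if __name__ == "__main__":
    # /dev/sdX is a placeholder, not a real device name.
    print(dd_missing_direct(["if=/dev/sdX", "of=/dev/null", "bs=1M", "count=1024"]))
```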
The lesson here is not to believe a test result blindly. First, understand what the test actually does. Even better, make some changes to the testing tool's parameters and try to construct a test that resembles the way your application actually uses the storage system.
Depths and queues
The queue depth is essentially the number of outstanding (also known as in-flight) requests that the storage device has to serve at any given moment. The fio parameters to look out for are “--ioengine=” and “--iodepth=”. The preferred engine for tests in Linux is libaio. We rarely see other engines used with “--thread”, “--group_reporting” and “--numjobs=…”, so we will not dig deeper into these options. However, you could check the fio man page for more info on engines.
To oversimplify (the experts are kindly invited to move directly to the next section 🙂), most of the time you could read “--thread --group_reporting --numjobs=N” as “--iodepth=N”. As a side note, very few applications, mainly RDBMS, actually use Linux native AIO (the libaio engine), so a test with threads will probably better represent the case of multiple applications accessing the storage system simultaneously.
A single request at a time produces an I/O depth of 1. With fio this is stated as “--iodepth=1”, which is also the default. This is the usual I/O depth for measuring the “unloaded latency” of a storage service (and not its throughput).
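The relation between queue depth, latency and IOPS follows Little's law: on average, the number of in-flight requests equals the request rate multiplied by the time each request takes. A minimal sketch of the arithmetic is shown below; the function names are ours and the numbers in the example are purely illustrative, not measurements.

```python
# Little's law for storage: queue_depth = iops * latency.
# All function names here are illustrative helpers, not fio options.

def effective_depth(numjobs, iodepth):
    """Upper bound on in-flight requests when each job keeps its own queue."""
    return numjobs * iodepth


def iops_at(queue_depth, mean_latency_seconds):
    """IOPS sustained at a given average depth and mean latency."""
    return queue_depth / mean_latency_seconds


def latency_at(queue_depth, iops):
    """Mean latency implied by a measured IOPS figure at a given depth."""
    return queue_depth / iops


if __name__ == "__main__":
    # Illustrative numbers: 200 microseconds per request at depth 1
    # cannot yield more than roughly 5000 IOPS.
    print(round(iops_at(1, 200e-6)))
    # The same IOPS figure reported at depth 32 implies a much higher latency.
    print(latency_at(32, 5000))
```

This is why a headline IOPS number without the queue depth it was measured at says very little about latency.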
The iostat tool is very handy for checking the average queue depth of a block device and great for validating that the tested I/O depth is actually “seen” by the device. We have seen cases when the filesystem heavily distorts the actual I/O depth. The iostat -x /path/to/block/device command provides extended statistics for the underlying block device. The avgqu-sz field in the output (named aqu-sz in newer sysstat releases) shows the average I/O depth for this device since the last reboot. Note that any bursts have been averaged out.
Even better is to run iostat during the test with an additional parameter “1” at the end, which instructs iostat to print the average statistics once per second. Using this as a form of validation during the test might reveal important details that are missing or difficult to see otherwise. For example, the device might stall for several seconds during the test or, as stated earlier, might see a much lower actual I/O depth due to filesystem options, request merges and other factors.
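For scripted validation, the same average can be approximated directly from the kernel counters in /proc/diskstats. The sketch below assumes the documented Linux layout, in which the eleventh statistics field is the weighted time spent doing I/O in milliseconds, and assumes that dividing its growth by the elapsed time gives the average queue depth over the interval, which to our understanding is how iostat derives its figure. The function names are hypothetical.

```python
import time


def weighted_io_ms(diskstats_line):
    """Return (device name, weighted ms spent doing I/O) from one line."""
    fields = diskstats_line.split()
    # fields[0:3] are major, minor and name; the 11th statistic is fields[13].
    return fields[2], int(fields[13])


def average_queue_depth(before_line, after_line, interval_seconds):
    """Average number of in-flight requests between two samples."""
    name_before, before = weighted_io_ms(before_line)
    name_after, after = weighted_io_ms(after_line)
    if name_before != name_after:
        raise ValueError("samples belong to different devices")
    return (after - before) / (interval_seconds * 1000.0)


def read_diskstats_line(device):
    """Read the current /proc/diskstats line for a device name (Linux only)."""
    with open("/proc/diskstats") as stats:
        for line in stats:
            if line.split()[2] == device:
                return line
    raise LookupError(device)


def sample_queue_depth(device, interval_seconds=1.0):
    """Sample a device twice and return its average depth in between."""
    before = read_diskstats_line(device)
    time.sleep(interval_seconds)
    after = read_diskstats_line(device)
    return average_queue_depth(before, after, interval_seconds)
```

Calling sample_queue_depth in a loop while fio runs gives a per-interval view similar to iostat with the trailing “1”, and makes stalls or an unexpectedly low depth easy to log alongside the test result.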
Real-world workloads rarely spike beyond depths of 15 and are usually in the single digits. However, with some heavily loaded databases the I/O depth could easily get into the hundreds. So a test with a higher queue depth might provide insight for the loaded RDBMS use cases, but would hardly be meaningful for the typical web server virtual machine.
Size up the test
Let’s assume that the test was properly configured to use Linux native AIO (the fio “--ioengine=libaio” parameter) and the Linux caching system did not interfere with the test run. And let’s also assume that some reasonable “--iodepth” and block size values were selected. This could mean that the test run was closer to the storage patterns of real-world workloads. Even then, the impact of the size of the file or block device being tested will be significant. For example, testing performance on a 1GB file will give quite different results than testing on a 100GB file, due to caching and data proximity/locality effects.
Small files combined with caching mechanisms in the storage system might skew the result. Our aim is to test storage performance as it would be seen by a real application, at real scale, so we need to use a test size that matches the application. Using files or block devices that are too big is not good either, as the tests start to deviate seriously from the majority of real-world workloads and turn into purely synthetic benchmarks.
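Putting the points so far together, a test command can be assembled with the size stated explicitly rather than left to whatever a quick-test script chose. This is a sketch under our own assumptions: the helper name, the job name, the default values and the file path are all placeholders, and the size should be replaced by the working set of your own application.

```python
import shlex


def fio_command(filename, rw, size, block_size="4k", iodepth=1,
                runtime_seconds=60):
    """Build a fio argument list with direct I/O and an explicit test size."""
    # WARNING: with a write mode, fio overwrites the target file or device.
    return [
        "fio",
        "--name=sized-test",           # arbitrary job name
        f"--filename={filename}",
        f"--rw={rw}",
        f"--bs={block_size}",
        f"--size={size}",              # match the application's data set
        "--ioengine=libaio",
        f"--iodepth={iodepth}",
        "--direct=1",                  # bypass the Linux page cache
        f"--runtime={runtime_seconds}",
        "--time_based",
    ]


if __name__ == "__main__":
    # Placeholder path and sizes; compare a small and a large test area.
    for size in ("1G", "100G"):
        print(shlex.join(fio_command("/mnt/test/fio.dat", "randread", size)))
```

Running the same parameters against two or three sizes and comparing the results is a simple way to see how much of a headline figure comes from caching and locality.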
Say no to zeroes
The notes here mostly concern the use of the dd tool. With fio it is also possible to write zero buffers using the “--zero_buffers” option, but we rarely see people using it.
We usually see dd used to read from a device or file directly without the device or file being filled with non-zero data prior to running the test. If the device has never been populated with real data, then some of the storage systems might just return zeroes with a speed similar to reading them from /dev/zero, which essentially tests the memory read speed. When testing storage performance, this is probably not what we want.
Similarly, writing zeroes is accelerated on some devices: some storage systems might handle writing zeroes differently than writing non-zero data. With zeroes, you could get a result of several gigabytes per second, but it bears no relation to real workloads.
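One way to avoid both traps is to populate the test area with non-zero data before measuring anything. The following is a minimal sketch with a hypothetical helper name; it is meant only for the pre-fill step, since generating random data can itself be slower than the storage, and it overwrites whatever is at the given path.

```python
import os


def prefill_with_random_data(path, total_bytes, chunk_bytes=1024 * 1024):
    """Overwrite path with total_bytes of random data; return bytes written."""
    written = 0
    with open(path, "wb") as target:
        while written < total_bytes:
            chunk = min(chunk_bytes, total_bytes - written)
            target.write(os.urandom(chunk))
            written += chunk
        # Make sure the data has reached the storage before any read test.
        target.flush()
        os.fsync(target.fileno())
    return written
```

After the pre-fill, a read test with “iflag=direct” measures the storage returning real data rather than a shortcut for never-written blocks.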
A benchmark result should be treated as a temporary state, merely a snapshot of the performance of a given system at a particular point in time. Measuring performance regularly and collecting detailed info about the environment, like versions of the tools used, parameters for the test and so on, is the best advice we can give. This approach tends to be very useful, especially when a test shows unexpectedly poorer or better results. Digging into historical results with excruciating details of the environment is the perfect start towards quickly pointing out the culprit.
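Keeping such a history can be as simple as storing one small record next to every result. The sketch below shows one possible shape of that record; the field names and helper names are our own choice, and the tool version is looked up by running “fio --version”, which yields nothing if fio is not installed.

```python
import json
import platform
import subprocess
from datetime import datetime, timezone


def tool_version(command):
    """Return the first line a version command prints, or None on failure."""
    try:
        done = subprocess.run(command, capture_output=True, text=True,
                              check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    lines = (done.stdout or done.stderr).strip().splitlines()
    return lines[0] if lines else None


def benchmark_record(parameters, result):
    """Bundle a test result with the environment it was measured in."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "kernel": platform.release(),
        "machine": platform.machine(),
        "fio_version": tool_version(["fio", "--version"]),
        "parameters": parameters,   # the exact options used for the run
        "result": result,           # whatever the tool reported
    }


if __name__ == "__main__":
    record = benchmark_record({"rw": "randread", "bs": "4k", "iodepth": 1},
                              {"note": "fill in from the tool output"})
    print(json.dumps(record, indent=2))
```

Appending one such JSON document per run to a file gives exactly the kind of history that makes a sudden change in results traceable.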
If you have any questions feel free to contact us at firstname.lastname@example.org