OCFS2 single file performance limit? 8GB/sec

I have an interesting problem, and I’m hoping that someone out there has done something like this.

I’m working on a hardware platform where I need to write data to a single file (from a single node) at ~13GB/sec, and I need to use a parallel file system (OCFS2, shared read-only to another node for processing later). I can only get to 8GB/sec to a single file (writes, from only one node). If I increase the number of files to 2, I can get to ~14GB/sec, which is the limit for this section of the hardware config. So I know I can reach the 13GB/sec target, but I would have to use 2 files. (I can also get there if I just mount the volume as xfs and use a single file, so the underlying hardware platform can do this.)

Block size of the application is 1MB

Here are the mkfs.ocfs2 options I’m using on the volume:

-b 4096
-C 1M
-T datafiles
-Jblock64
--fs-features=sparse
--fs-feature-level=max-features
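
For reference, those options together come out to something like the full command below (using the same /dev/md117 volume that shows up in the mount lines further down):

mkfs.ocfs2 -b 4096 -C 1M -T datafiles -J block64 --fs-features=sparse --fs-feature-level=max-features /dev/md117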

I’m trying to find any internal metrics/monitoring that I can use to figure out what’s going on inside of OCFS2. Anyone out there with “under the covers” monitoring knowledge of OCFS2?
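
The only “under the covers” views I’ve come across so far are the debugfs.ocfs2 commands and the o2cb DLM entries under debugfs, e.g. something like the following, though I’m not sure they expose the counters that matter here:

debugfs.ocfs2 -R "stats" /dev/md117      # superblock / feature info for the volume
debugfs.ocfs2 -R "fs_locks" /dev/md117   # cluster lock state held by this node
ls /sys/kernel/debug/o2dlm/              # per-domain DLM debugging state (o2cb stack)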

Has anyone else done some performance testing with OCFS2 that they could share that might indicate an upper limit?

According to the above, you can play with the cache coherency setting to increase bandwidth.

Thanks for the recommendation! Unfortunately, that didn’t seem to make much of a difference. It might be because I’m only writing to the file from a single node (there are 8x threads in the application). In fact, the config that yields the highest throughput uses only the “localflocks” option (with nothing else). Weird…
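
(By “only localflocks” I mean literally nothing more than: mount -o localflocks /dev/md117 /mnt/ocfs2-R0)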

ocfs2 file system test
From fio, 8x jobs to a single file
jobocfs2-R0-1: (groupid=0, jobs=8): err= 0: pid=7716: Thu Jan 16 06:01:40 2020
write: IOPS=8278, BW=8279MiB/s (8681MB/s)(485GiB/60002msec)
slat (usec): min=142, max=812399, avg=962.75, stdev=13224.77
clat (usec): min=46, max=813134, avg=2897.54, stdev=22854.13
lat (usec): min=386, max=813504, avg=3860.69, stdev=26365.31
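
For anyone who wants to reproduce this: the job was 8 sequential writers against one file, roughly the invocation below. Treat the ioengine/iodepth/direct/size flags and the file name as placeholders rather than the exact original job:

fio --name=jobocfs2-R0-1 --filename=/mnt/ocfs2-R0/testfile --rw=write --bs=1M \
    --numjobs=8 --size=64G --ioengine=libaio --iodepth=1 --direct=1 \
    --time_based --runtime=60 --group_reporting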

ext4 file system test as a hardware baseline, showing that the hardware can do more than 8GB/sec
From fio, 8x jobs to a single file
write: IOPS=18.3k, BW=17.9GiB/s (19.2GB/s)(1075GiB/60002msec)
slat (usec): min=154, max=2428, avg=215.03, stdev=25.61
clat (usec): min=275, max=6916, avg=1528.54, stdev=91.62
lat (usec): min=542, max=7103, avg=1743.82, stdev=91.47

Anyone hit a higher limit on OCFS2 than 8GB/sec?

Or… Are there any internal utilities/logs that I can use to see what OCFS2 is doing when it gets to 8GB/sec?

What mount options are you using?
Have you tested with a 4M block size?

[QUOTE=strahil-nikolov-dxc;59232]What mount options are you using?
Have you tested with a 4M block size?[/QUOTE]

Great idea. From some quick testing, the mount options “-o localflocks,coherency=buffered,noatime” produced good results (although only a limited number of tests were performed).

The application that I’m working with writes with 1MB block sizes, but if larger blocks get us there, then I can make a case for it. Below are the tests for 1M through 16M block sizes with the above mount options. While the 4M test case got the throughput up to 9235MiB/s, the latency really picks up at 4M (~10ms). Block sizes 1M (1.1ms) and 2M (5.7ms) are more in line with what the application can handle. (Although 5.7ms >>> 1.1ms.)

I have the cluster and block sizes of the mkfs command set as large as I can (-C 1M, -b 4096).

1M writes
mount -o localflocks,coherency=buffered,noatime /dev/md117 /mnt/ocfs2-R0
write: IOPS=7076, BW=7076MiB/s (7420MB/s)(415GiB/60002msec)
slat (usec): min=152, max=1124.6k, avg=1125.54, stdev=17354.68
clat (usec): min=47, max=1125.2k, avg=2263.23, stdev=24523.08
lat (usec): min=458, max=1125.7k, avg=3389.35, stdev=30009.56

2M writes
mount -o localflocks,coherency=buffered,noatime /dev/md117 /mnt/ocfs2-R0
write: IOPS=4192, BW=8386MiB/s (8793MB/s)(491GiB/60003msec)
slat (usec): min=321, max=1000.7k, avg=1905.87, stdev=20998.16
clat (usec): min=103, max=1001.4k, avg=3816.29, stdev=29642.73
lat (usec): min=578, max=1002.1k, avg=5722.46, stdev=36239.94

4M writes
mount -o localflocks,coherency=buffered,noatime /dev/md117 /mnt/ocfs2-R0
write: IOPS=2308, BW=9235MiB/s (9683MB/s)(541GiB/60005msec)
slat (usec): min=484, max=805147, avg=3462.84, stdev=29867.22
clat (usec): min=234, max=806260, avg=6930.31, stdev=42104.41
lat (usec): min=1464, max=807378, avg=10393.46, stdev=51402.95

8M writes
mount -o localflocks,coherency=buffered,noatime /dev/md117 /mnt/ocfs2-R0
write: IOPS=988, BW=7907MiB/s (8292MB/s)(463GiB/60009msec)
slat (usec): min=1544, max=2473.8k, avg=8090.49, stdev=73644.51
clat (usec): min=193, max=2475.7k, avg=16183.57, stdev=103988.25
lat (msec): min=2, max=3234, avg=24.27, stdev=129.85

16M writes
mount -o localflocks,coherency=buffered,noatime /dev/md117 /mnt/ocfs2-R0
write: IOPS=490, BW=7843MiB/s (8224MB/s)(460GiB/60018msec)
slat (msec): min=3, max=6150, avg=16.32, stdev=136.79
clat (usec): min=137, max=8301.4k, avg=32592.63, stdev=205172.65
lat (msec): min=4, max=8305, avg=48.91, stdev=256.05
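
All of the runs above are the same fio job with only --bs changed; a loop along these lines covers the sweep (again, treat the engine/depth/size flags and the file name as placeholders):

for bs in 1M 2M 4M 8M 16M; do
    fio --name=jobocfs2-R0-$bs --filename=/mnt/ocfs2-R0/testfile --rw=write --bs=$bs \
        --numjobs=8 --size=64G --ioengine=libaio --iodepth=1 --direct=1 \
        --time_based --runtime=60 --group_reporting
done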

(My counterpart in the lab is trying the same test with GPFS and cLVM, so it’s a race to 12.5GB/sec)

Has anyone else tried to benchmark OCFS2 on workloads with high write throughput rates?

Sadly, I have never had such requirements in front of me.

Have you also considered GFS2, or even XFS with DRBD below it?
As a last resort, a Gluster/Ceph cluster can also be tested, although the (hardware) costs will be higher.

I would be happy to learn more about your project.

P.S.: you can also add the ‘nobarrier’ mount option if you have a battery-backed cache.
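
(Something like: mount -o localflocks,coherency=buffered,noatime,nobarrier /dev/md117 /mnt/ocfs2-R0; untested, and depending on the OCFS2 version the documented spelling may be barrier=0 rather than nobarrier.)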

Edit: There is also another option: Lustre. There are testimonials on the web where Lustre has reached 100 GB/s.

[QUOTE=strahil-nikolov-dxc;59243]Sadly, I have never had such requirements in front of me.

Have you also considered GFS2, or even XFS with DRBD below it?
As a last resort, a Gluster/Ceph cluster can also be tested, although the (hardware) costs will be higher.

I would be happy to learn more about your project.

P.S.: you can also add the ‘nobarrier’ mount option if you have a battery-backed cache.[/QUOTE]

One of the other guys on my team is trying GFS2 with cLVM. My next thing to try is Lustre. We are doing this on a box called “Axellio” that has a shared internal PCIe fabric that lets both nodes access the NVMe drives (both nodes can see all of the SSDs, hence the requirement for a parallel file system). One node will be writing text files at 12.5GB/sec, and the other will be reading them as fast as it can for analysis.

https://axellio.com/platforms/fabricxpress

For Lustre, I’m envisioning both of the nodes as OSS, with clients installed on both so each can access the data.
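
Very rough sketch of what I have in mind (device names, NIDs, and fsname below are placeholders; I haven’t built any of this yet):

# node 1: combined MGS/MDT
mkfs.lustre --fsname=axfs --mgs --mdt --index=0 /dev/nvme0n1
# one OST per node, registered against the MGS on node 1
mkfs.lustre --fsname=axfs --ost --index=0 --mgsnode=node1@tcp /dev/nvme1n1
mkfs.lustre --fsname=axfs --ost --index=1 --mgsnode=node1@tcp /dev/nvme2n1
# Lustre client mounted on both nodes so each can access the data
mount -t lustre node1@tcp:/axfs /mnt/lustre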

I’ll PM you with my contact info. It’s a pretty interesting project.