Using IO500 for HPC Disk Testing

Ed Davis

Center for Quantitative Life Sciences, Oregon State University

2025-10-31

IO500 Represents a Holistic Benchmark

Fio

  • Flexible I/O tester, the “Swiss Army knife” of I/O benchmarking.
  • Excellent for micro-benchmarks (raw throughput, IOPS, latency).
  • Can simulate many I/O patterns (random, sequential, mixed).
  • Limitation: It tests a single aspect of I/O at a time. It doesn’t fully represent a complex HPC workload out-of-the-box.

IO500

  • A benchmark suite that bundles trusted tools (ior, mdtest, pfind).
  • Designed to measure performance for common HPC workloads, including both data and metadata operations.
  • Provides a balanced, holistic view by combining bandwidth, IOPS, and metadata tests into a single score.
  • Represents both optimized (‘easy’) and challenging (‘hard’) I/O patterns.
  • The de facto standard for comparing HPC storage systems.

The Workflow at a Glance

A complete cycle from setup to analysis.

  1. Setup Environment: Prepare the necessary tools.
  2. Gather Node Metadata: Collect hardware details of the test machine.
  3. Run Benchmark: Execute the IO500 tests using a wrapper script.
  4. Parse & Consolidate: Post-process raw results into a standard format.
  5. Aggregate Results: Combine all test runs from all nodes into a single dataset.
  6. Analyze & Report: Visualize the data to draw conclusions.

Core Component: io500_wrapper.sh

This is the main entry point for running a benchmark.

#!/usr/bin/env bash
# IO500 Benchmark Wrapper Script

# Usage: 
# ./io500_wrapper.sh <cluster> <node> <storage> <device> <test_dir> [volume]
# Example: 
# ./io500_wrapper.sh cluster1 node01 nfs hdd /path/to/nfs/dir
  • Purpose: Standardizes how IO500 is run and where results are stored.
  • Key Features:
    • Creates a timestamped output directory.
    • Gathers run-specific metadata (cluster, node, storage type, etc.).
    • Selects an appropriate IO500 config file (config-nfs.ini, config-ssd.ini).
    • Uses envsubst to inject variables ($TEST_DIR, $OUTPUT_DIR) into the config.
    • Runs io500 (with mpirun if multiple processors are detected).
    • Calls a parser to create results.json from the raw result_summary.txt.
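The envsubst step is plain shell-variable substitution into a config template. In Python terms it behaves roughly like string.Template; this is a sketch of the idea, not the wrapper's actual code, and the template text is illustrative rather than copied from the real config files:

```python
from string import Template

# Mimic what `envsubst` does: replace $VAR placeholders in the
# config template with values from the run environment.
template = Template("""[global]
datadir = $TEST_DIR
resultdir = $OUTPUT_DIR
""")

rendered = template.substitute(
    TEST_DIR="/scratch/davised/disk_test",
    OUTPUT_DIR="./io500_results/wildwood/chrom1",
)
print(rendered)
```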

Core Component: io500_wrapper.sh

The wrapper also has a --reprocess mode.

# io500_wrapper.sh (continued)

# OR for reprocessing existing results:
# Usage: 
# ./io500_wrapper.sh --reprocess <results_directory>
# Example: 
# ./io500_wrapper.sh --reprocess ./results/cluster1/node01/local/ssd/2025.10.23...
  • Purpose: If the parsing logic changes or a parse fails, we can regenerate results.json without re-running the hours-long benchmark.
  • This is handled by the parse_results function, which embeds a Python script.
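The core of that embedded parser can be sketched as follows. The line format is assumed from typical IO500 summary output ([RESULT] lines plus an [INVALID] marker), not copied from our script:

```python
import json
import re

# Hypothetical sketch of parsing result_summary.txt into the
# results.json structure. Assumed summary lines look roughly like:
#   [RESULT] ior-easy-write 0.186573 GiB/s : time 357.073 seconds
LINE_RE = re.compile(
    r"\[RESULT\]\s+(\S+)\s+([\d.]+)\s+(\S+)\s*:\s*time\s+([\d.]+)"
)

def parse_summary(text):
    results = {}
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m:
            name, value, unit, secs = m.groups()
            results[name] = {
                "value": float(value),
                "unit": unit,
                "time": float(secs),
                "valid": "[INVALID]" not in line,  # stonewall check
            }
    return results

sample = "[RESULT] ior-easy-write 0.186573 GiB/s : time 357.073 seconds"
print(json.dumps(parse_summary(sample), indent=2))
```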

Understanding the Configuration

Our wrapper automates config file selection for different storage types.

We use different .ini files to tune the workload for the target storage:

  • config-ssd.ini: n = 50000000
  • config-nfs.ini: n = 1000000
  • config.ini (default): n = 10000000

The n parameter in the [mdtest-*] sections defines the number of files to create.

  • For SSDs, we increase n to stress the device properly.
  • For NFS systems, we use a smaller n.

Automatic Selection

The io500_wrapper.sh script automatically chooses:

  • config-ssd.ini for local ssd
  • config-nfs.ini for nfs hdd
  • config.ini for all other combinations.
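The selection logic amounts to a lookup on the (storage, device) pair. A Python sketch of the same decision (the wrapper itself does this in bash):

```python
def pick_config(storage, device):
    """Mirror of the wrapper's config selection (sketch, not the real code)."""
    table = {
        ("local", "ssd"): "config-ssd.ini",
        ("nfs", "hdd"): "config-nfs.ini",
    }
    # All other combinations fall through to the default config.
    return table.get((storage, device), "config.ini")

print(pick_config("local", "ssd"))  # config-ssd.ini
print(pick_config("nfs", "ssd"))    # config.ini (fallback)
```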

Manual Override

CONFIG_FILE="config-powerscale.ini" ./io500_wrapper.sh ...

Stonewall Time & Valid Results

IO500 has a mechanism to ensure it’s measuring sustained performance.

  • The stonewall-time = 300 setting in the config file mandates that key write/create phases must run for at least 300 seconds.
  • Why? This prevents storage systems with large, fast caches from finishing instantly and posting unrealistic scores. It forces the system to work long enough that performance settles to a steady state.
  • If a test phase (e.g., ior-easy-write) completes in less than 300 seconds, the benchmark correctly marks the result as [INVALID].
  • Our parsing script detects this flag and saves it in the final data. This is a feature, not an error. It helps us identify results that aren’t representative of a sustained workload.
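In the config files this looks roughly like the fragment below; the section and option names follow the standard IO500 ini layout, and the values are illustrative:

```ini
[global]
# Force key write/create phases to run at least 300 s of sustained I/O
stonewall-time = 300

[mdtest-easy]
# Number of files per process; tuned per storage type (see previous slide)
n = 1000000
```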

Metadata is Crucial

Performance numbers are useless without context.

gather_node_metadata.sh

This script captures the hardware and software configuration of the node under test.

  • CPU Model, Core Count
  • Total Memory
  • Network Interface & Speed
  • OS & Kernel Version
  • Hostname

It saves this information to a node_metadata.json file in a structured path:

./io500_results/<cluster>/<node>/node_metadata.json

From Raw Data to Usable CSV

Individual JSON files are good, but a single table is better for analysis.

aggregate_results.py

  • Scans the results directory tree for all completed benchmark runs.
  • Loads the run metadata (run_metadata.json) for each test.
  • Loads the corresponding node hardware metadata (node_metadata.json).
  • Loads the parsed benchmark results (results.json).
  • Combines all this information into a single, wide-format aggregated_results.csv.

The final CSV contains one row per benchmark run, with columns for hardware specs, run parameters, and every metric from the IO500 test.
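Conceptually, the aggregation flattens each run's nested JSON into one wide row. A simplified sketch with pandas, showing only the flattening step (the real aggregate_results.py also merges node_metadata.json; the column-name prefix scheme here is an assumption, chosen to match columns like score_bandwidth used later in the analysis):

```python
import pandas as pd

# Sample fragments of run_metadata.json and results.json for one run.
run_meta = {"cluster_name": "wildwood", "node_name": "chrom1"}
results = {
    "ior-easy-write": {"value": 0.186573, "unit": "GiB/s",
                       "time": 357.073, "valid": True},
    "score": {"bandwidth": 0.183131, "iops": 42.79838,
              "total": 2.799595, "valid": True},
}

# Flatten nested dicts into a single wide row: one column per metric.
row = dict(run_meta)
for test, metrics in results.items():
    for key, val in metrics.items():
        row[f"{test.replace('-', '_')}_{key}"] = val

df = pd.DataFrame([row])
print(df[["cluster_name", "ior_easy_write_value", "score_total"]])
```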

The Workflow in Action: Step-by-Step

A practical guide to running a new benchmark.

1. Setup the Environment

Note on Environment

I use pixi to help manage dependencies and tasks.

Project code and results are in GitLab.

If you need a CQLS account: https://access.cqls.oregonstate.edu/

I will add your account to the repo.

To start, a user just needs to run the following (first downloading a pixi release if necessary):

git clone ssh://git@gitlab.cqls.oregonstate.edu:732/cgrb-biocomputing/hpc-disk-bench.git
cd hpc-disk-bench
git checkout io500
pixi install
pixi run test-io500

The Workflow in Action: Step-by-Step

2. Prepare for the Run

The io500_wrapper.sh script handles metadata collection automatically.

Automatic Metadata Gathering

You do not need to run gather_node_metadata.sh manually before every test.

When io500_wrapper.sh starts, it checks if node_metadata.json exists for the target node. If the file is missing, the wrapper will automatically call gather_node_metadata.sh to create it.

This ensures we never have a benchmark run without the corresponding hardware details.

{
  "cluster_name": "wildwood",
  "node_name": "chrom1",
  "hostname": "chrom1.hpc.oregonstate.edu",
  "cpu_model": "AMD Opteron(tm) Processor 6376",
  "cpu_count": 64,
  "cpu_cores": 8,
  "memory_gb": 995,
  "network_interface": "enp9s0f0",
  "network_speed": "10Gbps",
  "kernel": "5.14.0-427.18.1.el9_4.x86_64",
  "os": "Rocky Linux 9.4 (Blue Onyx)",
  "metadata_timestamp": "2025-10-28T20:58:30Z"
}

The Workflow in Action: Step-by-Step

3. Run the Benchmark

Execute the wrapper script with the correct parameters for the test. The script is run via pixi run within our controlled environment.

Example Commands

The arguments specify the cluster, node, storage backend, and test path. An optional final argument can be used to name the specific storage volume.

Local HDD on chrom1:

pixi run ./io500_wrapper.sh wildwood chrom1 local hdd /scratch/davised/disk_test

NFS-backed HDD on chrom1:

pixi run ./io500_wrapper.sh wildwood chrom1 nfs hdd /nfs7/core/scratch/davised/disk_test nfs7

Local SSD on olympus:

pixi run ./io500_wrapper.sh wildwood olympus local ssd /scratch/davised/disk_test md126

These commands are typically submitted as a parallel batch job, as shown next.

Scaling the Benchmark: Processor Count

How It Works

  • Our io500_wrapper.sh script automatically detects the number of allocated CPUs from the Slurm environment variable $SLURM_CPUS_ON_NODE.
  • This number determines how many parallel processes (mpirun -np $NUM_PROCS) will be used for the benchmark.
  • Using multiple processes generates enough I/O load to saturate modern storage systems and get a realistic performance measurement. A single-process test would likely be bottlenecked by CPU.
  • If you are not using Slurm for job submission, you can manually set NUM_PROCS=XX pixi run ...
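The detection order, sketched in Python (the wrapper does the equivalent in bash; SLURM_CPUS_ON_NODE and NUM_PROCS are the variables it actually reads, but the exact precedence shown here is an assumption):

```python
import os

def detect_num_procs():
    """Sketch of the wrapper's CPU-count detection: an explicit
    NUM_PROCS override wins, then Slurm's allocation, then 1."""
    for var in ("NUM_PROCS", "SLURM_CPUS_ON_NODE"):
        value = os.environ.get(var)
        if value and value.isdigit():
            return int(value)
    return 1

os.environ.pop("NUM_PROCS", None)          # no manual override
os.environ["SLURM_CPUS_ON_NODE"] = "16"    # as set by Slurm
print(detect_num_procs())  # 16
```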

Scaling the Benchmark: Processor Count

Submitting a Parallel Job

We use a wrapper script hqsub to submit our benchmark jobs to Slurm, specifying the number of processors with the -p flag. We typically use 16 CPUs for these tests.

# Example of submitting a 16-core job to test
# a local SSD on the 'olympus' node.

hqsub 'pixi run ./io500_wrapper.sh wildwood olympus local ssd /scratch/davised/disk_test md126' \
      -p 16 \
      -r job.olympus_ssd_io500 \
      -w olympus \
      -q sharpton

The script then uses this count to launch with mpirun.

# Inside io500_wrapper.sh
if [ "$NUM_PROCS" -gt 1 ]; then
    mpirun -np "$NUM_PROCS" io500 ...
else
    io500 ...
fi

Standardized outputs

  • Each run generates a run_metadata.json and a results.json that we later aggregate into a csv file.

Run metadata

{
  "cluster_name": "wildwood",
  "node_name": "chrom1",
  "storage_type": "local",
  "storage_volume": "",
  "device_type": "hdd",
  "timestamp": "2025.10.29-14.56.43",
  "num_procs": 16,
  "test_dir": "/scratch/davised/disk_test"
}

Output

{
  "ior-easy-write": {
    "value": 0.186573,
    "unit": "GiB/s",
    "time": 357.073,
    "valid": true
  },
  "mdtest-easy-write": {
    "value": 21.306493,
    "unit": "kIOPS",
    "time": 301.924,
    "valid": true
  },
  "ior-hard-write": {
    "value": 0.177066,
    "unit": "GiB/s",
    "time": 356.216,
    "valid": true
  },
  "mdtest-hard-write": {
    "value": 14.479136,
    "unit": "kIOPS",
    "time": 313.218,
    "valid": true
  },
  "find": {
    "value": 603.6442,
    "unit": "kIOPS",
    "time": 18.092,
    "valid": true
  },
  "ior-easy-read": {
    "value": 0.18754,
    "unit": "GiB/s",
    "time": 354.587,
    "valid": true
  },
  "mdtest-easy-stat": {
    "value": 65.507102,
    "unit": "kIOPS",
    "time": 98.751,
    "valid": true
  },
  "ior-hard-read": {
    "value": 0.18154,
    "unit": "GiB/s",
    "time": 347.031,
    "valid": true
  },
  "mdtest-hard-stat": {
    "value": 141.592495,
    "unit": "kIOPS",
    "time": 32.922,
    "valid": true
  },
  "mdtest-easy-delete": {
    "value": 17.76809,
    "unit": "kIOPS",
    "time": 366.711,
    "valid": true
  },
  "mdtest-hard-read": {
    "value": 15.494298,
    "unit": "kIOPS",
    "time": 292.032,
    "valid": true
  },
  "mdtest-hard-delete": {
    "value": 23.672366,
    "unit": "kIOPS",
    "time": 195.132,
    "valid": true
  },
  "score": {
    "bandwidth": 0.183131,
    "iops": 42.79838,
    "total": 2.799595,
    "valid": true
  }
}

The Workflow in Action: Step-by-Step

4. Aggregate All Results

After one or more benchmark runs are complete, update the master CSV file.

# This can be run from anywhere with access to the results directory
./aggregate_results.py ./io500_results aggregated_results.csv
# or
pixi run aggregate
 pixi run aggregate
 Pixi task (aggregate in report): aggregate_results.py io500_results
============================================================
IO500 Results Aggregation
============================================================
Scanning base directory: io500_results
Output file: aggregated_results.csv
============================================================

Recursively scanning io500_results for result directories...
Found 55 result directories
   Loaded: wildwood/ayaya01/nfs/nfs7/hdd
   Loaded: wildwood/ayaya01/local/ssd
   Loaded: wildwood/aspen12/local/ssd
   Skipping wildwood/aspen12/nfs/hdd/2025.10.30-14.51.24: missing results.json
   Loaded: wildwood/build1/local/hdd
   Loaded: wildwood/build1/nfs/fs0/ssd
   Loaded: wildwood/build1/nfs/nfs6/hdd
   Loaded: wildwood/build1/nfs/nfs4/hdd
   Loaded: wildwood/microbiome/nfs/nfs6/hdd
   Loaded: wildwood/microbiome/nfs/nfs4/hdd
   Loaded: wildwood/microbiome/nfs/fs0/ssd
   Loaded: wildwood/microbiome/local/hdd
   Loaded: wildwood/ayaya02/nfs/nfs6/hdd
   Loaded: wildwood/ayaya02/nfs/nfs7/hdd
   Loaded: wildwood/ayaya02/nfs/nfs4/hdd
   Loaded: wildwood/ayaya02/local/ssd
   Loaded: wildwood/jackson/local/ssd
   Loaded: wildwood/jackson/nfs/fs0/ssd
   Loaded: wildwood/jackson/nfs/fs0/ssd
   Loaded: wildwood/jackson/nfs/nfs4/hdd
   Loaded: wildwood/jackson/nfs/nfs7/hdd
   Loaded: wildwood/jackson/nfs/nfs6/hdd
   Loaded: wildwood/darwin/local/hdd
   Loaded: wildwood/darwin/nfs/nfs4/hdd
   Loaded: wildwood/darwin/nfs/nfs6/hdd
   Loaded: wildwood/cascade/local/ssd
   Skipping wildwood/cascade/nfs/hdd/2025.10.30-12.40.14: missing results.json
   Loaded: wildwood/chrom1/local/hdd
   Loaded: wildwood/chrom1/nfs/fs0/ssd
   Loaded: wildwood/chrom1/nfs/nfs7/hdd
   Loaded: wildwood/chrom1/nfs/nfs4/hdd
   Loaded: wildwood/chrom1/nfs/nfs6/hdd
   Loaded: wildwood/cqls-gpu4/nfs/nfs6/hdd
   Loaded: wildwood/cqls-gpu4/nfs/nfs4/hdd
   Loaded: wildwood/cqls-gpu4/local/ssd
   Loaded: wildwood/bact0/local/hdd
   Loaded: wildwood/bact0/nfs/fs0/ssd
   Loaded: wildwood/bact0/nfs/nfs6/hdd
   Loaded: wildwood/bact0/nfs/nfs7/hdd
   Loaded: wildwood/bact0/nfs/nfs4/hdd
   Loaded: wildwood/olympus/local/nvme0n1/ssd
   Loaded: wildwood/olympus/local/md126/ssd
   Loaded: wildwood/olympus/nfs/nfs7/hdd
   Loaded: wildwood/olympus/nfs/nfs4/hdd
   Loaded: wildwood/olympus/nfs/nfs6/hdd
   Loaded: wildwood/olympus/nfs/fs0/ssd
   Loaded: wildwood/samwise/local/hdd
   Loaded: wildwood/samwise/nfs/fs0/ssd
   Loaded: wildwood/samwise/nfs/nfs4/hdd
   Loaded: wildwood/samwise/nfs/nfs6/hdd
   Loaded: wildwood/neo/local/hdd
   Loaded: wildwood/neo/nfs/fs0/ssd
   Loaded: wildwood/neo/nfs/nfs7/hdd
   Loaded: wildwood/neo/nfs/nfs6/hdd
   Loaded: wildwood/neo/nfs/nfs4/hdd

 Warning: 2 directories are missing results.json
  You can regenerate these by running:
    ./io500_wrapper.sh --reprocess <directory>

============================================================
Successfully loaded 53 benchmark results
============================================================

Found 12 unique test types

============================================================
 Aggregated results written to: aggregated_results.csv
 Total benchmark runs: 53
 Total test types: 12
============================================================

The script will scan for any new results and add them to the CSV.

The Workflow in Action: Step-by-Step

5. Generate the report

 pixi run render
 Pixi task (render): quarto render io500_analysis.qmd
Starting python3 kernel...Done

Executing 'io500_analysis.quarto_ipynb'
  Cell 1/24: 'setup'........................Done
  Cell 2/24: 'summary-stats'................Done
  Cell 3/24: 'score-table'..................Done
  Cell 4/24: 'score-plot'...................Done
  Cell 5/24: 'bandwidth-prep'...............Done
  Cell 6/24: 'bandwidth-by-storage'.........Done
  Cell 7/24: 'bandwidth-comparison-table'...Done
  Cell 8/24: 'bandwidth-by-difficulty'......Done
  Cell 9/24: 'iops-prep'....................Done
  Cell 10/24: 'iops-by-operation'............Done
  Cell 11/24: 'iops-summary-table'...........Done
  Cell 12/24: 'network-speed-impact'.........Done
  Cell 13/24: 'nfs-volume-comparison'........Done
  Cell 14/24: 'storage-comparison-plot'......Done
  Cell 15/24: 'cluster-heatmap'..............Done
  Cell 16/24: 'cluster-table'................Done
  Cell 17/24: 'device-comparison'............Done
  Cell 18/24: 'top-performers'...............Done
  Cell 19/24: 'time-analysis-prep'...........Done
  Cell 20/24: 'time-by-test'.................Done
  Cell 21/24: 'recommendations'..............Done
  Cell 22/24: 'key-statistics'...............Done
  Cell 23/24: 'raw-data'.....................Done
  Cell 24/24: 'system-info'..................Done

pandoc
  to: html
  output-file: io500_analysis.html
  standalone: true
  embed-resources: true
  section-divs: true
  html-math-method: mathjax
  wrap: none
  default-image-extension: png
  toc: true
  toc-depth: 3
  variables: {}

metadata
  document-css: false
  link-citations: true
  date-format: long
  lang: en
  title: IO500 Benchmark Analysis
  subtitle: HPC Cluster Storage Performance Comparison
  author: Ed Davis
  date: today
  theme: cosmo

Output created: io500_analysis.html

pixi run render  40.00s user 3.96s system 31% cpu 2:20.92 total
  • The report is generated as a self-contained HTML file.

Utilities for Maintenance

We have helper scripts to manage the data.

reprocess_all_results.sh

  • Sometimes the parsing logic in io500_wrapper.sh is improved.
  • This script finds every result_summary.txt in the entire results directory and re-runs the parsing step to regenerate all results.json files.

Analysis: Local vs. Networked Storage

Categorization Strategy

  • Local Storage:
    • Includes direct-attached SSDs, NVMe drives, or HDDs.
    • Performance is primarily dependent on the local system.
  • Networked Storage:
    • Includes all shared filesystems accessed over the network, such as NFS, Isilon, and PowerScale.
    • Performance is a complex interplay between the client node, the network fabric (Ethernet, InfiniBand), and the storage server.

Implementation in Analysis

We implement this separation in our analysis scripts. For example, to create a dataset containing only networked filesystems, we filter by storage_type:

# Select all storage types that are not 'local'
# This includes 'nfs', 'powerscale', 'isilon', etc.
networked = results[results['storage_type'] != 'local'].copy()

# Ensure we have the necessary data for plotting
networked = networked.dropna(
    subset=['score_bandwidth', 'score_iops', 'network_speed']
)

This allows us to perform specific analyses, such as correlating network speed with I/O bandwidth, which would be meaningless for local drives.
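With the networked subset in hand, the correlation itself is a short pandas groupby. A sketch with made-up numbers (the column names match the aggregated CSV; the values do not come from our data):

```python
import pandas as pd

# Illustrative data only; real values come from aggregated_results.csv.
networked = pd.DataFrame({
    "network_speed": ["1Gbps", "10Gbps", "10Gbps", "25Gbps"],
    "score_bandwidth": [0.05, 0.18, 0.17, 0.41],
})

# Mean bandwidth per link speed: a rough check on whether the
# network, rather than the storage server, is the bottleneck.
by_speed = networked.groupby("network_speed")["score_bandwidth"].mean()
print(by_speed)
```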

From Data to Insight: The Analysis Pipeline

The Toolchain

  • Quarto: We use a Quarto document (.qmd) as the foundation for the report. This allows us to combine code, text, and visualizations in one place.
  • Python: All data manipulation and plotting is done in Python using libraries like pandas for data wrangling and plotly for interactive visualizations.
  • Reproducibility: Because the report is generated from code, it can be re-run and updated with one command whenever new benchmark data is added.

Key Areas of Analysis

The report is structured to answer critical performance questions by breaking down the data:

  • Overall Performance Rankings
  • Bandwidth Analysis (Read vs. Write, Easy vs. Hard)
  • IOPS Deep Dive (Stat, Read, Write, Delete operations)
  • Storage Comparisons (Local vs. Networked, HDD vs. SSD)
  • Network Impact Analysis
  • Cluster-to-Cluster Performance

The Interactive HTML Report

The final output is a self-contained, interactive HTML file that allows for deep exploration of the results.

This report moves beyond static images and provides a dynamic way to understand our storage infrastructure. It allows us to answer specific questions like:

  • Which NFS volume provides the best balance of bandwidth and IOPS?
  • How much faster are SSDs for metadata-heavy workloads (e.g., stat)?
  • Is the network a bottleneck for our PowerScale or NFS storage?

Live Report

The live, interactive HTML report generated from the Quarto document is available for review. Here you can hover over data points, filter results, and explore the detailed tables.

Link to Live HPC Storage Performance Report

Issues

Sometimes jobs fail:

 cat job.aspen12_nfs7_io500/job.aspen12_nfs7_io500.e1857286
##hpcman.jobs={'runid':'1857286','runname':'job.aspen12_nfs7_io500','host':'aspen12','wd':'/nfs4/core/home/davised/projects/hpc-disk-bench','taskid':''}
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
  Proc: [[13476,1],0]
  Errorcode: -1

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
prterun has exited due to process rank 0 with PID 0 on node aspen12 calling
"abort". This may have caused other processes in the application to be
terminated by signals sent by prterun (as reported here).
--------------------------------------------------------------------------
Command exited with non-zero status 255
        Memory (kb):                       131292
        # SWAP  (freq):                    0
        # Waits (freq):                    9914621
        CPU (percent):                     19%
        Time (seconds):                    2769.47
        Time (hh:mm:ss.ms):                46:09.47
        System CPU Time (seconds):         508.76
        User   CPU Time (seconds):         22.38
  • Difficult to determine why some jobs fail
  • Usually just re-run and the jobs complete

Next Steps

  • Clone the repo and begin testing
  • Example runs:
pixi run ./io500_wrapper.sh wildwood amaterasu01 powerscale hdd /ceoas/olsont/disk_test
  • Reminders:
    • Set NUM_PROCS if you aren’t using Slurm
    • Name the storage type based on local vs. nfs/powerscale/isilon
    • Specify hdd or ssd for the device type
    • Provide the full path to the test directory (preferably one that doesn’t already exist)
    • If ambiguity exists (e.g., nfs4 and nfs6 are both ZFS), provide a volume name after the test directory

Questions?