Binary Multi-Channel Loading Performance Analysis

Author: Performance Investigation
Published: December 29, 2024

Overview

This document summarizes a performance investigation into loading multi-channel binary analog data in WhiskerToolbox. The goal was to identify bottlenecks in the current loading approach and evaluate alternative strategies.

Use Case: Loading 32-channel int16 binary files (e.g., 5 GB electrophysiology recordings) and distributing them into separate AnalogTimeSeries objects per channel.

Current Implementation

The current multi-channel loader in binary_loaders.hpp (readBinaryFileMultiChannel) reads data one time slice at a time:

// One small read per time slice: num_channels samples of type T.
std::vector<T> time_slice_buffer(num_channels);
while (file.read(reinterpret_cast<char *>(time_slice_buffer.data()),
                 sizeof(T) * num_channels)) {
    // Scatter this time slice across the per-channel output vectors.
    for (size_t i = 0; i < num_channels; ++i) {
        data[i][current_time_index] = time_slice_buffer[i];
    }
    current_time_index++;
}

For 32 channels of int16 data, this reads 64 bytes per syscall - an extremely inefficient pattern.

Benchmark Results

Three benchmarks were created in benchmark/adhoc/ to isolate performance characteristics.

1. Raw Disk Read Performance

Testing how read chunk size affects throughput:

File Size   Single Read   1 MB Chunks   64 KB Chunks   64-byte Reads
20 MB       4871 MB/s     3437 MB/s     4398 MB/s      2146 MB/s
100 MB      5043 MB/s     3784 MB/s     3866 MB/s      1862 MB/s
500 MB      4910 MB/s     3844 MB/s     3613 MB/s      N/A (too slow)
1 GB        5097 MB/s     2222 MB/s     4545 MB/s      N/A
2 GB        5512 MB/s     1698 MB/s     3436 MB/s      N/A
5 GB        5541 MB/s     4028 MB/s     4698 MB/s      N/A

Key Finding: Single large reads achieve ~5 GB/s, near the SSD's theoretical maximum. Tiny 64-byte reads degrade throughput to ~2 GB/s even for small files and would be catastrophically slow for large ones.
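For reference, the chunked-read pattern measured in the middle columns can be sketched as follows. This is a minimal illustration, not the benchmark code itself; the function name and chunk size are arbitrary:

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Read an entire file through a reusable fixed-size buffer.
// Each file.read() moves chunk_bytes at once instead of one time slice.
std::size_t read_in_chunks(std::string const & path, std::size_t chunk_bytes) {
    std::ifstream file(path, std::ios::binary);
    std::vector<char> buffer(chunk_bytes);
    std::size_t total = 0;
    while (file) {
        file.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        total += static_cast<std::size_t>(file.gcount());
        // ...process buffer[0 .. gcount()) here...
    }
    return total;
}
```

The final, partial read sets the stream's fail state but still reports its byte count through gcount(), so the loop accounts for every byte.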

2. Int16 to Float Conversion

Testing conversion strategies for 2.5 billion samples (~5 GB of int16):

Method                      Throughput (M samples/s)
Simple loop                 641
std::transform (presized)   786
Unrolled 8x                 844
With scaling                780
From char buffer            852

Key Finding: Conversion achieves ~800M samples/sec (~1.6 GB/s of int16 data). This is faster than disk I/O, meaning conversion is NOT the bottleneck.
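As a sketch of what these strategies look like, here is a minimal version of the presized std::transform variant with optional scaling. The function name and signature are illustrative, not the benchmark's API:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Convert raw int16 samples to float with an optional gain, writing into a
// pre-sized output vector ("std::transform (presized)": no push_back, no
// reallocation during the conversion loop).
std::vector<float> convert_int16_to_float(std::vector<int16_t> const & samples,
                                          float gain = 1.0f) {
    std::vector<float> out(samples.size());
    std::transform(samples.begin(), samples.end(), out.begin(),
                   [gain](int16_t s) { return static_cast<float>(s) * gain; });
    return out;
}
```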

3. Multi-Channel Loading Strategies

Comparing different approaches for a 5 GB file (78M time samples × 32 channels):

Method                   Read Time   Process Time   Total Time   Throughput
Time-slice reads (all)   -           -              VERY SLOW    ~600 MB/s
Bulk then parcelate      1128 ms     5822 ms        6950 ms      256 MB/s
Bulk single pass         477 ms      2306 ms        2783 ms      639 MB/s
Strided copy             446 ms      6103 ms        6550 ms      271 MB/s
Chunked 1k               238 ms      1992 ms        2547 ms      698 MB/s
Chunked 10k              249 ms      2010 ms        2577 ms      690 MB/s
Chunked 100k             286 ms      2042 ms        2638 ms      674 MB/s

Key Findings:

  1. Time-slice reading (current approach) is by far the worst - each 64-byte read incurs syscall overhead
  2. Chunked approach with ~1000 time samples per chunk achieves the best balance
  3. Strided copy (processing one channel at a time) has poor cache locality
  4. Bulk single pass is good but requires 2× memory (raw buffer + output vectors)

Effect of Channel Count

Testing with 10M time samples, varying channels:

Channels   File Size   Bulk Single Pass   Strided Copy
1          20 MB       858 MB/s           1463 MB/s
4          80 MB       1148 MB/s          960 MB/s
16         320 MB      449 MB/s           468 MB/s
32         640 MB      359 MB/s           326 MB/s
64         1.2 GB      290 MB/s           192 MB/s

Key Finding: More channels mean more cache pressure during parcelation (scattering interleaved samples into per-channel vectors); the strided access pattern becomes increasingly problematic as the channel count grows.
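For reference, the strided (channel-at-a-time) pattern being measured looks roughly like this illustrative function (not the benchmark code). Each call walks the entire interleaved buffer with a stride of num_channels, so every cache line is fetched once per channel:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Extract one channel from an interleaved int16 buffer.
// Offset = ch, stride = num_channels. Called once per channel, this re-reads
// the whole buffer num_channels times: poor locality as channels grow.
std::vector<float> extract_channel(std::vector<int16_t> const & interleaved,
                                   std::size_t num_channels,
                                   std::size_t ch) {
    std::size_t const num_samples = interleaved.size() / num_channels;
    std::vector<float> out(num_samples);
    for (std::size_t t = 0; t < num_samples; ++t) {
        out[t] = static_cast<float>(interleaved[t * num_channels + ch]);
    }
    return out;
}
```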

Root Cause Analysis

The bottleneck is NOT:

  • Disk I/O speed (SSDs can easily sustain 5+ GB/s)
  • int16 → float conversion (~1.6 GB/s throughput)

The bottleneck IS:

  1. Syscall overhead: reading 64 bytes at a time issues one read call per time sample (~78 million for a 5 GB file)
  2. Cache thrashing: strided writes across 32 different vectors have poor locality
  3. Double handling: the current approach reads and distributes in the same loop, so I/O and processing cannot overlap

Benchmark Code Location

Ad-hoc benchmarks are in benchmark/adhoc/:

  • disk_read_benchmark.cpp - Raw disk read patterns
  • int16_to_float_benchmark.cpp - Conversion overhead
  • multichannel_loading_benchmark.cpp - End-to-end loading strategies

To build and run:

cd benchmark/adhoc
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j
./benchmark_multichannel_loading

These are temporary benchmarks for this investigation. They can be deleted once the fix is implemented and verified.

Memory-Mapped Alternative

For extremely large files where even chunked reading is slow, the memory-mapped path (MmapStorageConfig) in Analog_Time_Series_Binary.cpp already provides an excellent solution:

  • No explicit I/O - OS handles paging
  • Lazy loading - only pages actually accessed are read
  • Can handle files larger than RAM
  • Already implemented for single-channel strided access

For multi-channel memory-mapped loading, each channel gets its own MemoryMappedAnalogDataStorage with appropriate stride and offset, which works well for sequential access patterns.

Conclusion

The current multi-channel loading performance issue is caused by excessive syscalls from reading one time slice (64 bytes for 32-channel int16) at a time. Switching to a chunked reading strategy with roughly 1,000-10,000 time samples per chunk (the benchmarks show both perform similarly) should improve performance by approximately 20-25×, bringing 5 GB file loading time from over a minute down to ~3 seconds.