Binary Multi-Channel Loading Performance Analysis

Author: Performance Investigation
Published: December 29, 2024

Overview

This document summarizes a performance investigation into loading multi-channel binary analog data in WhiskerToolbox. The goal was to identify bottlenecks in the current loading approach and evaluate alternative strategies.

Use Case: Loading 32-channel int16 binary files (e.g., 5 GB electrophysiology recordings) and distributing them into separate AnalogTimeSeries objects per channel.

Current Implementation

The current multi-channel loader in binary_loaders.hpp (readBinaryFileMultiChannel) reads data one time slice at a time:

// One small read per time slice: num_channels samples of type T.
std::vector<T> time_slice_buffer(num_channels);
while (file.read(reinterpret_cast<char *>(time_slice_buffer.data()),
                 sizeof(T) * num_channels)) {
    // Scatter this time slice across the per-channel output vectors.
    for (size_t i = 0; i < num_channels; ++i) {
        data[i][current_time_index] = time_slice_buffer[i];
    }
    current_time_index++;
}

For 32 channels of int16 data, this reads 64 bytes per syscall - an extremely inefficient pattern.

Benchmark Results

Three benchmarks were created in benchmark/adhoc/ to isolate performance characteristics.

1. Raw Disk Read Performance

Testing how read chunk size affects throughput:

File Size   Single Read   1 MB Chunks   64 KB Chunks   64-byte Reads
20 MB       4871 MB/s     3437 MB/s     4398 MB/s      2146 MB/s
100 MB      5043 MB/s     3784 MB/s     3866 MB/s      1862 MB/s
500 MB      4910 MB/s     3844 MB/s     3613 MB/s      N/A (too slow)
1 GB        5097 MB/s     2222 MB/s     4545 MB/s      N/A
2 GB        5512 MB/s     1698 MB/s     3436 MB/s      N/A
5 GB        5541 MB/s     4028 MB/s     4698 MB/s      N/A

Key Finding: Single large reads achieve ~5 GB/s, near the SSD's theoretical maximum. Tiny 64-byte reads degrade throughput to ~2 GB/s even for small files and would be catastrophically slow for large ones.
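For reference, the chunked-read pattern measured in the middle columns can be sketched as follows. This is a minimal illustration, not the benchmark code itself; the function name and chunk size are arbitrary:

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Read an entire file through a reusable fixed-size buffer.
// Each file.read() moves chunk_bytes at once instead of one time slice.
std::size_t read_in_chunks(std::string const & path, std::size_t chunk_bytes) {
    std::ifstream file(path, std::ios::binary);
    std::vector<char> buffer(chunk_bytes);
    std::size_t total = 0;
    while (file) {
        file.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        total += static_cast<std::size_t>(file.gcount());
        // ...process buffer[0 .. gcount()) here...
    }
    return total;
}
```

The final, partial read sets the stream's fail state but still reports its byte count through gcount(), so the loop accounts for every byte.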

2. Int16 to Float Conversion

Testing conversion strategies for 2.5 billion samples (~5 GB of int16):

Method                      Throughput (M samples/s)
Simple loop                 641
std::transform (presized)   786
Unrolled 8x                 844
With scaling                780
From char buffer            852

Key Finding: Conversion achieves ~800M samples/sec (~1.6 GB/s of int16 data). This is faster than disk I/O, meaning conversion is NOT the bottleneck.
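As a sketch of what these strategies look like, here is a minimal version of the presized std::transform variant with optional scaling. The function name and signature are illustrative, not the benchmark's API:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Convert raw int16 samples to float with an optional gain, writing into a
// pre-sized output vector ("std::transform (presized)": no push_back, no
// reallocation during the conversion loop).
std::vector<float> convert_int16_to_float(std::vector<int16_t> const & samples,
                                          float gain = 1.0f) {
    std::vector<float> out(samples.size());
    std::transform(samples.begin(), samples.end(), out.begin(),
                   [gain](int16_t s) { return static_cast<float>(s) * gain; });
    return out;
}
```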

3. Multi-Channel Loading Strategies

Comparing different approaches for a 5 GB file (78M time samples × 32 channels):

Method                   Read Time   Process Time   Total Time   Throughput
Time-slice reads (all)   -           -              VERY SLOW    ~600 MB/s
Bulk then parcelate      1128 ms     5822 ms        6950 ms      256 MB/s
Bulk single pass         477 ms      2306 ms        2783 ms      639 MB/s
Strided copy             446 ms      6103 ms        6550 ms      271 MB/s
Chunked 1k               238 ms      1992 ms        2547 ms      698 MB/s
Chunked 10k              249 ms      2010 ms        2577 ms      690 MB/s
Chunked 100k             286 ms      2042 ms        2638 ms      674 MB/s

Key Findings:

  1. Time-slice reading (current approach) is by far the worst - each 64-byte read incurs syscall overhead
  2. Chunked approach with ~1000 time samples per chunk achieves the best balance
  3. Strided copy (processing one channel at a time) has poor cache locality
  4. Bulk single pass is good but requires 2× memory (raw buffer + output vectors)

Effect of Channel Count

Testing with 10M time samples, varying channels:

Channels   File Size   Bulk Single Pass   Strided Copy
1          20 MB       858 MB/s           1463 MB/s
4          80 MB       1148 MB/s          960 MB/s
16         320 MB      449 MB/s           468 MB/s
32         640 MB      359 MB/s           326 MB/s
64         1.2 GB      290 MB/s           192 MB/s

Key Finding: More channels mean more cache pressure during parcelation (scattering interleaved samples into per-channel vectors); the strided access pattern becomes increasingly problematic as the channel count grows.
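For reference, the strided (channel-at-a-time) pattern being measured looks roughly like this illustrative function (not the benchmark code). Each call walks the entire interleaved buffer with a stride of num_channels, so every cache line is fetched once per channel:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Extract one channel from an interleaved int16 buffer.
// Offset = ch, stride = num_channels. Called once per channel, this re-reads
// the whole buffer num_channels times: poor locality as channels grow.
std::vector<float> extract_channel(std::vector<int16_t> const & interleaved,
                                   std::size_t num_channels,
                                   std::size_t ch) {
    std::size_t const num_samples = interleaved.size() / num_channels;
    std::vector<float> out(num_samples);
    for (std::size_t t = 0; t < num_samples; ++t) {
        out[t] = static_cast<float>(interleaved[t * num_channels + ch]);
    }
    return out;
}
```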

Root Cause Analysis

The bottleneck is NOT:

  • Disk I/O speed (SSDs can easily sustain 5+ GB/s)
  • int16 → float conversion (~1.6 GB/s throughput)

The bottleneck IS:

  1. Syscall overhead: reading 64 bytes at a time issues one read call per time sample (~78 million for a 5 GB file)
  2. Cache thrashing: strided writes across 32 different vectors have poor locality
  3. Double handling: the current approach reads and distributes in the same loop, so I/O and processing cannot overlap

Benchmark Code Location

Ad-hoc benchmarks are in benchmark/adhoc/:

  • disk_read_benchmark.cpp - Raw disk read patterns
  • int16_to_float_benchmark.cpp - Conversion overhead
  • multichannel_loading_benchmark.cpp - End-to-end loading strategies

To build and run:

cd benchmark/adhoc
mkdir -p build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
make -j
./benchmark_multichannel_loading

These are temporary benchmarks for this investigation. They can be deleted once the fix is implemented and verified.

Memory-Mapped Alternative

For extremely large files where even chunked reading is slow, the memory-mapped path (MmapStorageConfig) in Analog_Time_Series_Binary.cpp already provides an excellent solution:

  • No explicit I/O - OS handles paging
  • Lazy loading - only pages actually accessed are read
  • Can handle files larger than RAM
  • Already implemented for single-channel strided access

For multi-channel memory-mapped loading, each channel gets its own MemoryMappedAnalogDataStorage with appropriate stride and offset, which works well for sequential access patterns.

Conclusion

The current multi-channel loading performance issue is caused by excessive syscalls from reading one time slice (64 bytes for 32-channel int16) at a time. Switching to a chunked reading strategy with roughly 1,000-10,000 time samples per chunk (the benchmarks show both perform similarly) should improve performance by approximately 20-25×, bringing 5 GB file loading time from over a minute down to ~3 seconds.