Data Structure Performance

Author: WhiskerToolbox Team
Published: January 2, 2026

Overview

This document provides profiling information and design rationale for the core data structures used in WhiskerToolbox. Understanding the performance characteristics of these structures is essential for making informed architectural decisions.

RaggedTimeSeries

RaggedTimeSeries<TData> is a templated container for time series data where multiple entries can exist at each timestamp. It is the base class for:

  • MaskData - Binary masks over time
  • LineData - Lines/curves over time
  • PointData - Points over time

Each entry has:

  1. A TimeFrameIndex - when this entry occurs
  2. A LocalIndex - distinguishing multiple entries at the same time (0, 1, 2, …)
  3. An EntityId - unique identifier for tracking and grouping
  4. The actual data (Mask2D, Line2D, Point2D<float>)

Current Implementation

The current implementation uses a map of vectors:

template<typename TData>
struct DataEntry {
    TData data;
    EntityId entity_id;
};

// One vector of entries per timestamp; an entry's LocalIndex is its
// position within that vector.
std::map<TimeFrameIndex, std::vector<DataEntry<TData>>> _data;

Characteristics

Operation                   Complexity      Notes
Lookup by TimeFrameIndex    O(log n)        Map lookup
Lookup by EntityId          O(n)            Full scan required
Insert at time (append)     O(log n)        Map insert + vector push_back
Insert at time (middle)     O(log n + k)    k = entries at that time
Iterate all entries         O(n)            Must traverse map + vectors
Memory layout               Fragmented      Many small vector allocations
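
To make the O(n) EntityId row concrete, consider what a lookup has to do in this layout. A minimal sketch (find_by_entity_id is a hypothetical helper, not the actual API):

// Every per-timestamp vector must be scanned until the id is found;
// with n total entries this is O(n) in the worst case.
template<typename TData>
std::optional<TData> find_by_entity_id(
        std::map<TimeFrameIndex, std::vector<DataEntry<TData>>> const& data,
        EntityId id) {
    for (auto const& [time, entries] : data) {
        for (auto const& entry : entries) {
            if (entry.entity_id == id) return entry.data;
        }
    }
    return std::nullopt;
}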

Alternative: Structure of Arrays (SoA)

An alternative storage strategy uses parallel arrays:

template<typename TData>
struct SoAStorage {
    std::vector<TimeFrameIndex> times;
    std::vector<LocalIndex> local_indices;
    std::vector<TData> data;
    std::vector<EntityId> entity_ids;
    
    // Acceleration structures
    std::map<TimeFrameIndex, std::pair<size_t, size_t>> time_ranges;
    std::unordered_map<EntityId, size_t> entity_to_index;
};

Characteristics

Operation                   Complexity        Notes
Lookup by TimeFrameIndex    O(log n)          Via time_ranges map
Lookup by EntityId          O(1)              Via entity_to_index map
Append                      O(1) amortized    Vector push_back
Middle insert               O(n)              Must shift all arrays
Iterate all entries         O(n)              Contiguous memory
Memory layout               Contiguous        Cache-friendly
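
Where does the SoA bookkeeping live? A minimal sketch of an in-order append, assumed to be a member of the SoAStorage above (not the actual code):

// Keeps the four parallel arrays aligned and updates both
// acceleration structures. This extra work at construction time is
// what buys O(1) EntityId lookup later.
void append(TimeFrameIndex t, TData d, EntityId id) {
    size_t const idx = times.size();

    // Extend (or create) the contiguous [first, last) range for time t.
    // Assumes entries arrive in non-decreasing time order.
    auto [it, inserted] = time_ranges.try_emplace(t, idx, idx + 1);
    if (!inserted) it->second.second = idx + 1;

    times.push_back(t);
    local_indices.push_back(static_cast<LocalIndex>(idx - it->second.first));
    data.push_back(std::move(d));
    entity_ids.push_back(id);
    entity_to_index.emplace(id, idx);
}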

Profiling: Trivially Copyable vs Non-Trivially Copyable Types

A critical performance consideration is whether the stored type is trivially copyable.

What is Trivially Copyable?

A type is trivially copyable if it can be copied with a simple memcpy/memmove:

// Trivially copyable - just raw bytes
struct Point2D {
    float x, y;  // 8 bytes total
};
static_assert(std::is_trivially_copyable_v<Point2D>);

// NOT trivially copyable - contains std::vector
struct Mask2D {
    std::vector<uint8_t> data;  // 24 bytes (pointer + size + capacity)
};
static_assert(!std::is_trivially_copyable_v<Mask2D>);

Why Does It Matter?

When inserting into the middle of a std::vector:

For trivially copyable types:

// Compiler generates a single memmove call
memmove(dest, src, num_elements * sizeof(T));

For non-trivially copyable types:

// Compiler must call move constructor for EACH element
for (size_t i = num_elements; i > insert_pos; --i) {
    new (&dest[i]) T(std::move(src[i-1]));
    src[i-1].~T();
}

Does Data Size Inside Mask2D Affect Performance?

No! The move constructor for std::vector only transfers three pointer-sized fields; it never touches the heap buffer:

// std::vector move constructor (simplified)
vector(vector&& other) noexcept {
    _data = other._data;           // Steal the pointer
    _size = other._size;
    _capacity = other._capacity;
    other._data = nullptr;         // Leave source empty
    other._size = 0;
    other._capacity = 0;
}

Measured results with 200,000 elements and 100 middle insertions:

Mask data size    Time per insert
0 bytes           ~148 µs
100 bytes         ~172 µs
1 KB              ~145 µs

The times are essentially identical regardless of mask content size.
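
A quick way to see why, assuming the Mask2D definition above (a sketch):

// Only the constant-size vector shell participates in the move; the
// heap buffer it owns never moves. The shell is the same size whether
// the mask holds 0 bytes or 1 KB.
Mask2D small_mask;                       // owns 0 bytes
Mask2D big_mask;
big_mask.data.resize(1024);              // owns 1 KB on the heap
static_assert(sizeof(Mask2D) == sizeof(std::vector<uint8_t>));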

Profiling: Middle Insertion Performance

The following benchmark compares insertion performance across different data types and storage strategies.
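
The harness itself is not reproduced here, but the core of such a measurement looks roughly like this (a sketch; time_middle_inserts is an illustrative name):

#include <chrono>
#include <cstddef>
#include <vector>

// Pre-fill a vector, then repeatedly insert into the middle and
// report the average per-insert cost in microseconds. Each insert
// shifts roughly half the elements one slot to the right.
template<typename T>
double time_middle_inserts(std::size_t initial, std::size_t inserts) {
    std::vector<T> v(initial);
    auto const start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < inserts; ++i) {
        v.insert(v.begin() + static_cast<std::ptrdiff_t>(v.size() / 2), T{});
    }
    auto const stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count()
           / static_cast<double>(inserts);
}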

Results

=============================================================================
RaggedTimeSeries Storage Benchmark
=============================================================================

Configuration:
  Initial elements: 200000
  Middle insertions: 1000
  Average elements moved: 100000

-----------------------------------------------------------------------------
MIDDLE INSERTION PERFORMANCE
-----------------------------------------------------------------------------
Type                              Size    Per Insert    Bytes Moved
-----------------------------------------------------------------------------

-- Trivially Copyable Types (uses memmove) --
Point2D<float>                  8 bytes        18.6 µs        800000 bytes
EntityId                        8 bytes        18.7 µs        800000 bytes
TimeFrameIndex                  8 bytes        19.2 µs        800000 bytes
SoAKey (composite)             24 bytes        65.0 µs       2400000 bytes

-- Non-Trivially Copyable Types (move constructor per element) --
Mask2D (empty)                 24 bytes       190.7 µs       2400000 bytes
Mask2D (100 bytes)             24 bytes       210.9 µs       2400000 bytes
Mask2D (1KB)                   24 bytes       155.7 µs       2400000 bytes
Line2D (empty)                 24 bytes       159.6 µs       2400000 bytes
Line2D (50 points)             24 bytes       169.0 µs       2400000 bytes

-----------------------------------------------------------------------------
APPEND PERFORMANCE (for comparison)
-----------------------------------------------------------------------------
Type                        Per Append
-----------------------------------------------------------------------------
Point2D<float>                 1.38 ns
SoAKey (composite)             5.14 ns
Mask2D (empty)                 2.66 ns
Mask2D (1KB)                 245.80 ns

Analysis

Key findings:

  1. Trivially copyable types are 3-10× faster for middle insertion:

    • Point2D (8 bytes): 18.6 µs per insert
    • SoAKey (24 bytes): 65.0 µs per insert
    • Mask2D (24 bytes, non-trivial): 155-210 µs per insert
  2. Data size inside containers doesn’t matter for moves:

    • Mask2D with 0 bytes: 190.7 µs
    • Mask2D with 1KB: 155.7 µs (within noise)

    This is because std::vector's move constructor only transfers three pointer-sized fields.

  3. Append is dramatically faster than middle insertion:

    • Point2D append: 1.38 ns (13,000× faster than insert)
    • Mask2D append: 2.66 ns (60,000× faster than insert)
  4. For SoA with views: Keep metadata (TimeFrameIndex, EntityId) in trivially copyable arrays; keep actual data (Mask2D) in append-only storage.

Profiling: Storage Strategy Comparison

This benchmark compares the current map-based implementation against the SoA approach.

Results

=============================================================================
Storage Strategy Benchmark: Map-Based vs Structure of Arrays
=============================================================================

Configuration:
  Total entries: 200000
  Time points: 50000
  Avg entries per time: 4
  Lookups per test: 10000
  Filter set size: 1000

-----------------------------------------------------------------------------
CONSTRUCTION (append 200000 entries in time order)
-----------------------------------------------------------------------------
Map-Based:  35.3 ms
SoA:        85.5 ms
Speedup:    0.41x

-----------------------------------------------------------------------------
LOOKUP BY EntityId (10000 random lookups)
-----------------------------------------------------------------------------
Map-Based:  9105.2 ms total, 910.5 µs per lookup
SoA:        0.7 ms total, 69.4 ns per lookup
Speedup:    13120x

-----------------------------------------------------------------------------
LOOKUP BY TimeFrameIndex (10000 random lookups)
-----------------------------------------------------------------------------
Map-Based:  2.9 ms total, 293.6 ns per lookup
SoA:        4.0 ms total, 403.5 ns per lookup
Speedup:    0.73x

-----------------------------------------------------------------------------
FILTER BY EntityId SET (filter 1000 EntityIds)
-----------------------------------------------------------------------------
Map-Based (full copy):    8.1 ms, 998 results
SoA (scan, indices only): 4.6 ms, 998 results
SoA (hash, indices only): 0.1 ms, 998 results
Speedup (hash vs map):    75x

-----------------------------------------------------------------------------
ITERATE ALL ENTRIES
-----------------------------------------------------------------------------
Map-Based:  2.8 ms
SoA:        0.5 ms
Speedup:    5.17x

Analysis

Operation                     Map-Based    SoA        Winner
Construction                  35.3 ms      85.5 ms    Map (2.4×)
EntityId Lookup               910 µs       69 ns      SoA (13,120×)
TimeFrameIndex Lookup         294 ns       404 ns     Map (1.4×)
EntityId Filter (1000 ids)    8.1 ms       0.1 ms     SoA (75×)
Iterate All                   2.8 ms       0.5 ms     SoA (5×)

Key findings:

  1. EntityId operations are dramatically faster with SoA:

    • Single lookup: 13,120× faster (910 µs → 69 ns)
    • Filtering 1000 EntityIds: 75× faster (8.1 ms → 0.1 ms)

    This is because SoA maintains an unordered_map<EntityId, size_t> for O(1) lookup, while the map-based layout requires an O(n) full scan.

  2. TimeFrameIndex lookup is slightly faster with Map:

    • Both are O(log m) where m = unique time points
    • Map wins by ~1.4× due to direct access vs indirection
  3. Construction favors Map:

    • Map: 35.3 ms
    • SoA: 85.5 ms (2.4× slower)

    This is because SoA maintains additional acceleration structures.

  4. Iteration strongly favors SoA:

    • 5× faster due to cache-friendly sequential memory access
    • Map traverses fragmented memory (many small vectors)
  5. For view-based filtering, the SoA approach returns indices only (8 bytes each), while map-based must copy the entire DataEntry<TData> including the data.

Implications for View-Based Filtering

One key use case is creating filtered views of data (e.g., by EntityId set or TimeFrameIndex range).

Current Approach: Copy on Filter

// Creates a new RaggedTimeSeries with copied data
auto filtered = source.copyByEntityIds(entity_set);

Alternative: Index-Based Views

With SoA storage, filtering can return a lightweight view:

// View is just a vector of indices into the source
struct RaggedView {
    SoAStorage const* source;
    std::vector<size_t> indices;  // Which entries are included
};

// Creating a filtered view - O(n) scan, O(m) allocation for m matches
// No data copying!
auto filtered_view = source.createEntityIdView(entity_set);
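
A possible implementation of that factory (hypothetical; createEntityIdView is not shown here, and template parameters are elided as in the struct above):

// One O(n) pass over the entity_ids array collects matching indices;
// no TData is read or copied.
RaggedView createEntityIdView(SoAStorage const& source,
                              std::unordered_set<EntityId> const& wanted) {
    RaggedView view{&source, {}};
    for (size_t i = 0; i < source.entity_ids.size(); ++i) {
        if (wanted.count(source.entity_ids[i]) != 0) {
            view.indices.push_back(i);
        }
    }
    return view;
}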

Performance Comparison

Operation         Copy Approach        View Approach
Create filter     O(n) + copy TData    O(n) + O(m) indices
Memory usage      Full copy            Just indices (8 bytes each)
Access element    O(1)                 O(1) + 1 indirection
Modify source     Independent          View sees changes

Recommendations

Based on the profiling results:

  1. Append-only data: SoA storage is highly efficient for data that arrives chronologically. With 200,000 entries, iteration is 5× faster due to cache-friendly memory layout.

  2. Frequent EntityId lookups: The current map-based implementation is O(n) per lookup (910 µs with 200k entries). Adding an entity_to_index map provides O(1) access (69 ns), a 13,000× improvement.

  3. View-based filtering: Use index vectors rather than copying data. Filtering 1000 EntityIds takes 8.1 ms with copy vs 0.1 ms with indices only (75× faster).

  4. Middle insertions: Avoid when possible; batch operations and sort. A single middle insertion into 200k Mask2D elements costs ~170 µs, while append costs ~3 ns.

  5. TimeFrameIndex operations: The current map-based approach is already optimal for time-based lookup (O(log m)). SoA provides similar performance.

  6. Hybrid approach: For best overall performance (see the sketch after this list):

    • Keep Mask2D/Line2D data in append-only storage (arena)
    • Maintain parallel vectors of trivially-copyable metadata (TimeFrameIndex, EntityId)
    • Use hash maps for O(1) EntityId lookup
    • Views hold index vectors referencing the source data
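
Under those assumptions, such a layout might look like this (a sketch; HybridStorage and its member names are illustrative):

// Hot, trivially copyable metadata lives in parallel vectors that
// memmove can shift cheaply; heavy payloads live in an append-only
// arena whose elements never move once created.
template<typename TData>
struct HybridStorage {
    std::vector<TimeFrameIndex> times;
    std::vector<EntityId> entity_ids;
    std::vector<size_t> payload_index;   // slot in the arena below

    // std::deque gives stable references on push_back, so views can
    // safely point at payloads while new data is appended.
    std::deque<TData> arena;

    // O(1) EntityId lookup.
    std::unordered_map<EntityId, size_t> entity_to_index;
};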

Derived Storage (Views) Performance

When creating filtered views that reference source data without copying, we measured the overhead of index indirection.

Results

=============================================================================
Derived Storage Overhead Benchmark
=============================================================================

Configuration:
  Source entries: 200000
  View size: 10000 (5% of source)

-----------------------------------------------------------------------------
SEQUENTIAL ITERATION (all elements)
-----------------------------------------------------------------------------
Direct array access:     7438.8 µs  (37.2 ns/element)
Source (virtual):        6958.1 µs  (34.8 ns/element)
Derived 'all' view:      6627.0 µs  (33.1 ns/element)

Overhead analysis (vs direct array):
  Virtual dispatch:      0.94x
  View indirection:      0.89x

-----------------------------------------------------------------------------
FILTERED VIEW ITERATION (subset)
-----------------------------------------------------------------------------
Filtered view (9742 entries): 1724.7 µs  (177.0 ns/element)
Source + filter check:   7853.9 µs  (scanning all 200000)

Filtered view speedup:   4.6x faster

-----------------------------------------------------------------------------
CACHE EFFECTS: SEQUENTIAL vs RANDOM VIEW INDICES
-----------------------------------------------------------------------------
Sequential indices:      306.9 µs  (30.7 ns/element)
Random indices:          1670.0 µs  (171.2 ns/element)

Cache penalty:           5.44x slower

-----------------------------------------------------------------------------
MEMORY OVERHEAD
-----------------------------------------------------------------------------
Source storage:
  Total:                  30468 KB

Derived view (9742 entries):
  Index vector:           76 KB
  Local EntityId map:     152 KB
  Data copied:            0 KB (references source)
  Total:                  228 KB

Memory savings:          82.9% vs full copy

Analysis

Metric                            Result             Notes
Virtual dispatch overhead         0.94x (faster!)    Compiler optimizes well
Index indirection overhead        0.89x (faster!)    CPU prefetching helps
Filtered iteration                4.6x faster        vs scanning all + filter check
Cache penalty (random indices)    5.4x slower        171 ns vs 31 ns per element
Memory savings                    83%                228 KB view vs full copy
View creation                     233 ns/EntityId    One-time cost

Key insights:

  1. Virtual dispatch is essentially free - The compiler devirtualizes with -O3

  2. Index indirection overhead is negligible for sequential access due to CPU prefetching

  3. The REAL cost is cache misses with random access patterns (5.4x penalty)

  4. Views pay for themselves quickly - Creation cost is amortized in < 1 iteration

  5. Sort view indices when possible for cache-friendly access
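
Insight 5 in code form (view is the RaggedView-style index view from earlier):

// Sorting the indices turns random jumps through the source arrays
// back into mostly sequential access, avoiding the ~5.4x cache
// penalty measured above.
std::sort(view.indices.begin(), view.indices.end());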

CRTP + Variant vs Virtual Inheritance

An alternative to virtual inheritance for polymorphic storage is the CRTP (Curiously Recurring Template Pattern) combined with std::variant for type erasure. This pattern is already used in AnalogDataStorage.hpp.

The Two Approaches

Virtual Inheritance (Classic OOP):

template<typename TData>
class IRaggedStorage {
public:
    virtual ~IRaggedStorage() = default;
    virtual size_t size() const = 0;
    virtual TData const& getData(size_t idx) const = 0;
    // ... other pure virtual methods
};

class SourceStorage : public IRaggedStorage<Mask2D> { /* ... */ };
class DerivedStorage : public IRaggedStorage<Mask2D> { /* ... */ };

// Usage via base pointer
std::shared_ptr<IRaggedStorage<Mask2D>> storage = /*...*/;
auto& data = storage->getData(i);  // vtable dispatch

CRTP + Variant (Modern C++):

template<typename Derived, typename TData>
class StorageBase {
public:
    // Static dispatch via CRTP
    size_t size() const { 
        return static_cast<Derived const*>(this)->sizeImpl(); 
    }
    // ...
};

class SourceStorage : public StorageBase<SourceStorage, Mask2D> { /* ... */ };
class DerivedStorage : public StorageBase<DerivedStorage, Mask2D> { /* ... */ };

// Type erasure via variant
using StorageVariant = std::variant<
    std::shared_ptr<SourceStorage>,
    std::shared_ptr<DerivedStorage>
>;

StorageVariant storage = /*...*/;
auto& data = std::visit([i](auto& ptr) -> auto& { 
    return ptr->getData(i); 
}, storage);  // std::visit dispatch
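
For concreteness, a minimal sketch of what one of the elided class bodies might contain (member names are illustrative):

// The derived class supplies sizeImpl(); StorageBase::size() resolves
// it at compile time, so no vtable is involved.
class SourceStorage : public StorageBase<SourceStorage, Mask2D> {
public:
    size_t sizeImpl() const { return _masks.size(); }
    Mask2D const& getData(size_t idx) const { return _masks[idx]; }
private:
    std::vector<Mask2D> _masks;
};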

Results

=============================================================================
CRTP + Variant vs Virtual Inheritance Benchmark
=============================================================================

Configuration:
  Entries: 200000
  View size: 10000
  Iterations: 100

=============================================================================
PART 1: Point2D (trivially copyable) - Isolating Dispatch Overhead
=============================================================================

POINT2D: SEQUENTIAL ITERATION (dispatch overhead visible)
Virtual (base ptr):      691.7 µs  (3.5 ns/elem)
CRTP (direct):           332.3 µs  (1.7 ns/elem)
CRTP (variant):          320.8 µs  (1.6 ns/elem)

Speedup vs Virtual:
  CRTP direct:  2.08x
  CRTP variant: 2.16x

=============================================================================
PART 2: Mask2D (non-trivially copyable) - Data Access Dominates
=============================================================================

SOURCE STORAGE: SEQUENTIAL ITERATION
Virtual (base ptr):      7366.7 µs  (36.8 ns/elem)
CRTP (direct):           6635.9 µs  (33.2 ns/elem)
CRTP (variant):          7000.2 µs  (35.0 ns/elem)

Speedup vs Virtual:
  CRTP direct:  1.11x
  CRTP variant: 1.05x

ENTITY ID LOOKUP (10000 lookups)
Virtual:                 4406.9 µs  (440.7 ns/lookup)
CRTP (direct):           3778.3 µs  (377.8 ns/lookup)
CRTP (variant):          2993.7 µs  (299.4 ns/lookup)

Speedup vs Virtual:
  CRTP direct:  1.17x
  CRTP variant: 1.47x

Analysis Summary

Access Pattern       Virtual         CRTP Direct    CRTP Variant
Point2D iteration    3.5 ns/elem     1.7 ns         1.6 ns
Mask2D iteration     36.8 ns/elem    33.2 ns        35.0 ns
EntityId lookup      440.7 ns        377.8 ns       299.4 ns

Key insights:

  1. CRTP is 2× faster for lightweight data (Point2D):

    • Dispatch overhead becomes visible when data access is cheap
    • Both direct and variant approaches eliminate vtable lookup
  2. When data access dominates, dispatch doesn’t matter:

    • For Mask2D (vector-backed), virtual vs CRTP vs variant are within 10%
    • The CPU spends most time chasing pointers, not dispatching
  3. CRTP Variant wins on hash lookups (1.47× faster):

    • std::visit can be faster than virtual dispatch when the compiler can see all types
    • Better inlining opportunities
  4. When to use each approach:

    Approach        Best For
    Virtual         Open extension (plugins), external libraries
    CRTP Direct     Maximum performance, type known at compile time
    CRTP Variant    Closed type set, value semantics, pattern matching
  5. CRTP + Variant advantages:

    • No heap allocation for small types (in-place storage)
    • Exhaustive handling enforced by compiler (std::visit)
    • No virtual destructor overhead
    • Better optimization across type boundaries
  6. CRTP + Variant disadvantages:

    • More complex code
    • All types must be known at compile time
    • Larger object size (variant stores largest type + discriminant)
    • Recompile required to add new types