# Data Structure Performance

## Overview
This document provides profiling information and design rationale for the core data structures used in WhiskerToolbox. Understanding the performance characteristics of these structures is essential for making informed architectural decisions.
## RaggedTimeSeries

`RaggedTimeSeries<TData>` is a templated container for time series data where multiple entries can exist at each timestamp. It is the base class for:

- `MaskData` - binary masks over time
- `LineData` - lines/curves over time
- `PointData` - points over time
Each entry has:

- A `TimeFrameIndex` - when this entry occurs
- A `LocalIndex` - distinguishing multiple entries at the same time (0, 1, 2, …)
- An `EntityId` - a unique identifier for tracking and grouping
- The actual data (`Mask2D`, `Line2D`, `Point2D<float>`)
### Current Implementation

The current implementation uses a map of vectors:

```cpp
std::map<TimeFrameIndex, std::vector<DataEntry<TData>>> _data;
```

where:

```cpp
template<typename TData>
struct DataEntry {
    TData data;
    EntityId entity_id;
};
```

#### Characteristics
| Operation | Complexity | Notes |
|---|---|---|
| Lookup by TimeFrameIndex | O(log n) | Map lookup |
| Lookup by EntityId | O(n) | Full scan required |
| Insert at time (append) | O(log n) | Map insert + vector push_back |
| Insert at time (middle) | O(log n + k) | k = entries at that time |
| Iterate all entries | O(n) | Must traverse map + vectors |
| Memory layout | Fragmented | Many small vector allocations |
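The EntityId lookup is the weak point of this layout. A minimal sketch of why it is O(n), assuming the `DataEntry` definition above (`findByEntityId` is a hypothetical helper, not part of the WhiskerToolbox API):

```cpp
#include <map>
#include <vector>

// Hypothetical helper illustrating the O(n) EntityId lookup: with no
// reverse index, every vector under every time point must be scanned.
template<typename TData>
DataEntry<TData> const* findByEntityId(
        std::map<TimeFrameIndex, std::vector<DataEntry<TData>>> const& data,
        EntityId target) {
    for (auto const& [time, entries] : data) {   // every time point...
        for (auto const& entry : entries) {      // ...and every entry in it
            if (entry.entity_id == target) {
                return &entry;
            }
        }
    }
    return nullptr;                              // O(n) in the worst case
}
```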
### Alternative: Structure of Arrays (SoA)

An alternative storage strategy uses parallel arrays:

```cpp
template<typename TData>
struct SoAStorage {
    std::vector<TimeFrameIndex> times;
    std::vector<LocalIndex> local_indices;
    std::vector<TData> data;
    std::vector<EntityId> entity_ids;

    // Acceleration structures
    std::map<TimeFrameIndex, std::pair<size_t, size_t>> time_ranges;
    std::unordered_map<EntityId, size_t> entity_to_index;
};
```

#### Characteristics
| Operation | Complexity | Notes |
|---|---|---|
| Lookup by TimeFrameIndex | O(log n) | Via time_ranges map |
| Lookup by EntityId | O(1) | Via entity_to_index map |
| Append | O(1) amortized | Vector push_back |
| Middle insert | O(n) | Must shift all arrays |
| Iterate all entries | O(n) | Contiguous memory |
| Memory layout | Contiguous | Cache-friendly |
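Because all entries at a given time occupy a contiguous span, a time lookup resolves to an index range rather than scattered heap nodes. A minimal sketch, assuming the `SoAStorage` layout above (`forEachAtTime` is a hypothetical helper):

```cpp
// Hypothetical helper: resolve a TimeFrameIndex to its [begin, end) span
// via time_ranges, then walk the parallel arrays sequentially.
template<typename TData, typename Callback>
void forEachAtTime(SoAStorage<TData> const& storage, TimeFrameIndex time,
                   Callback&& callback) {
    auto it = storage.time_ranges.find(time);   // O(log m) lookup
    if (it == storage.time_ranges.end()) {
        return;                                 // no entries at this time
    }
    auto [begin, end] = it->second;
    for (size_t i = begin; i < end; ++i) {
        callback(storage.data[i], storage.entity_ids[i]);  // contiguous reads
    }
}
```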
## Profiling: Trivially Copyable vs Non-Trivially Copyable Types

A critical performance consideration is whether the stored type is trivially copyable.

### What is Trivially Copyable?

A type is trivially copyable if it can be copied with a simple memcpy/memmove:
```cpp
// Trivially copyable - just raw bytes
struct Point2D {
    float x, y;  // 8 bytes total
};
static_assert(std::is_trivially_copyable_v<Point2D>);

// NOT trivially copyable - contains std::vector
struct Mask2D {
    std::vector<uint8_t> data;  // 24 bytes (pointer + size + capacity)
};
static_assert(!std::is_trivially_copyable_v<Mask2D>);
```

### Why Does It Matter?
When inserting into the middle of a `std::vector`:

For trivially copyable types:

```cpp
// Compiler generates a single memmove call
memmove(dest, src, num_elements * sizeof(T));
```

For non-trivially copyable types:

```cpp
// Compiler must call the move constructor for EACH element
for (size_t i = num_elements; i > insert_pos; --i) {
    new (&dest[i]) T(std::move(src[i - 1]));
    src[i - 1].~T();
}
```

### Does Data Size Inside Mask2D Affect Performance?
No! The move constructor for `std::vector` just swaps three pointers:

```cpp
// std::vector move constructor (simplified)
vector(vector&& other) noexcept {
    _data = other._data;        // Steal the pointer
    _size = other._size;
    _capacity = other._capacity;
    other._data = nullptr;      // Leave source empty
    other._size = 0;
    other._capacity = 0;
}
```

Measured results with 200,000 elements and 100 middle insertions:
| Mask data size | Time per insert |
|---|---|
| 0 bytes | ~148 µs |
| 100 bytes | ~172 µs |
| 1 KB | ~145 µs |
The times are essentially identical regardless of mask content size.
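The measurement is straightforward to reproduce. Below is a minimal, illustrative harness using `std::chrono`; this is an assumed reconstruction, not the project's actual benchmark code (element and iteration counts match the table above):

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative harness: insert 100 copies into the middle of a 200,000
// element vector of "masks" and report the average cost per insertion.
static double timeMiddleInserts(std::vector<uint8_t> const& mask) {
    std::vector<std::vector<uint8_t>> v(200'000, mask);
    auto const start = std::chrono::steady_clock::now();
    for (int i = 0; i < 100; ++i) {
        // Each insert shifts ~100k elements (move constructor per element)
        v.insert(v.begin() + static_cast<std::ptrdiff_t>(v.size() / 2), mask);
    }
    auto const stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count() / 100.0;
}

int main() {
    std::printf("0 bytes: %.1f us/insert\n", timeMiddleInserts({}));
    std::printf("1 KB:    %.1f us/insert\n",
                timeMiddleInserts(std::vector<uint8_t>(1024)));
}
```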
## Profiling: Middle Insertion Performance

The following benchmark compares insertion performance across different data types and storage strategies.

### Results
```text
=============================================================================
RaggedTimeSeries Storage Benchmark
=============================================================================
Configuration:
  Initial elements:       200000
  Middle insertions:      1000
  Average elements moved: 100000
-----------------------------------------------------------------------------
MIDDLE INSERTION PERFORMANCE
-----------------------------------------------------------------------------
Type                     Size        Per Insert    Bytes Moved
-----------------------------------------------------------------------------
-- Trivially Copyable Types (uses memmove) --
Point2D<float>           8 bytes     18.6 µs       800000 bytes
EntityId                 8 bytes     18.7 µs       800000 bytes
TimeFrameIndex           8 bytes     19.2 µs       800000 bytes
SoAKey (composite)       24 bytes    65.0 µs       2400000 bytes
-- Non-Trivially Copyable Types (move constructor per element) --
Mask2D (empty)           24 bytes    190.7 µs      2400000 bytes
Mask2D (100 bytes)       24 bytes    210.9 µs      2400000 bytes
Mask2D (1KB)             24 bytes    155.7 µs      2400000 bytes
Line2D (empty)           24 bytes    159.6 µs      2400000 bytes
Line2D (50 points)       24 bytes    169.0 µs      2400000 bytes
-----------------------------------------------------------------------------
APPEND PERFORMANCE (for comparison)
-----------------------------------------------------------------------------
Type                     Per Append
-----------------------------------------------------------------------------
Point2D<float>           1.38 ns
SoAKey (composite)       5.14 ns
Mask2D (empty)           2.66 ns
Mask2D (1KB)             245.80 ns
```

### Analysis
Key findings:

1. **Trivially copyable types are 3-10× faster for middle insertion:**
   - `Point2D` (8 bytes): 18.6 µs per insert
   - `SoAKey` (24 bytes): 65.0 µs per insert
   - `Mask2D` (24 bytes, non-trivial): 155-210 µs per insert
2. **Data size inside containers doesn't matter for moves:**
   - `Mask2D` with 0 bytes: 190.7 µs
   - `Mask2D` with 1 KB: 155.7 µs (within noise)

   This is because `std::vector`'s move constructor just swaps three pointers.
3. **Append is dramatically faster than middle insertion:**
   - `Point2D` append: 1.38 ns (13,000× faster than insert)
   - `Mask2D` append: 2.66 ns (60,000× faster than insert)
4. **For SoA with views:** keep metadata (`TimeFrameIndex`, `EntityId`) in trivially copyable arrays; keep actual data (`Mask2D`) in append-only storage. A sketch of this split follows below.
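A minimal sketch of that split; the stand-in type aliases and the `HybridMaskStorage` struct are illustrative, not the project's types:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-ins for the project's handle types (assumed integer-like here).
using TimeFrameIndex = int64_t;
using EntityId = uint64_t;
using Mask2D = std::vector<uint8_t>;

// Illustrative hybrid layout: hot metadata in trivially copyable parallel
// arrays (cheap to memmove, sort, and filter), heavy payloads in an
// append-only vector that is never reordered.
struct HybridMaskStorage {
    std::vector<TimeFrameIndex> times;   // trivially copyable
    std::vector<EntityId> entity_ids;    // trivially copyable
    std::vector<size_t> payload_index;   // index into `payloads`
    std::vector<Mask2D> payloads;        // append-only; never shifted
};
```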
## Profiling: Storage Strategy Comparison

This benchmark compares the current map-based implementation against the SoA approach.

### Results
```text
=============================================================================
Storage Strategy Benchmark: Map-Based vs Structure of Arrays
=============================================================================
Configuration:
  Total entries:        200000
  Time points:          50000
  Avg entries per time: 4
  Lookups per test:     10000
  Filter set size:      1000
-----------------------------------------------------------------------------
CONSTRUCTION (append 200000 entries in time order)
-----------------------------------------------------------------------------
  Map-Based: 35.3 ms
  SoA:       85.5 ms
  Speedup:   0.41x
-----------------------------------------------------------------------------
LOOKUP BY EntityId (10000 random lookups)
-----------------------------------------------------------------------------
  Map-Based: 9105.2 ms total, 910.5 µs per lookup
  SoA:       0.7 ms total, 69.4 ns per lookup
  Speedup:   13120x
-----------------------------------------------------------------------------
LOOKUP BY TimeFrameIndex (10000 random lookups)
-----------------------------------------------------------------------------
  Map-Based: 2.9 ms total, 293.6 ns per lookup
  SoA:       4.0 ms total, 403.5 ns per lookup
  Speedup:   0.73x
-----------------------------------------------------------------------------
FILTER BY EntityId SET (filter 1000 EntityIds)
-----------------------------------------------------------------------------
  Map-Based (full copy):    8.1 ms, 998 results
  SoA (scan, indices only): 4.6 ms, 998 results
  SoA (hash, indices only): 0.1 ms, 998 results
  Speedup (hash vs map):    75x
-----------------------------------------------------------------------------
ITERATE ALL ENTRIES
-----------------------------------------------------------------------------
  Map-Based: 2.8 ms
  SoA:       0.5 ms
  Speedup:   5.17x
```

### Analysis
| Operation | Map-Based | SoA | Winner |
|---|---|---|---|
| Construction | 35.3 ms | 85.5 ms | Map (2.4×) |
| EntityId Lookup | 910 µs | 69 ns | SoA (13,120×) |
| TimeFrameIndex Lookup | 294 ns | 404 ns | Map (1.4×) |
| EntityId Filter (1000 ids) | 8.1 ms | 0.1 ms | SoA (75×) |
| Iterate All | 2.8 ms | 0.5 ms | SoA (5×) |
Key findings:

1. **EntityId operations are dramatically faster with SoA:**
   - Single lookup: 13,120× faster (910 µs → 69 ns)
   - Filtering 1000 EntityIds: 75× faster (8.1 ms → 0.1 ms)

   This is because SoA maintains an `unordered_map<EntityId, size_t>` for O(1) lookup, while the map-based implementation requires an O(n) full scan (see the sketch after this list).
2. **TimeFrameIndex lookup is slightly faster with Map:**
   - Both are O(log m), where m = unique time points
   - Map wins by ~1.4× due to direct access vs. indirection
3. **Construction favors Map:**
   - Map: 35.3 ms
   - SoA: 85.5 ms (2.4× slower)

   This is because SoA maintains additional acceleration structures.
4. **Iteration strongly favors SoA:**
   - 5× faster due to cache-friendly sequential memory access
   - Map traverses fragmented memory (many small vectors)

For view-based filtering, the SoA approach returns indices only (8 bytes each), while the map-based approach must copy the entire `DataEntry<TData>`, including the data.
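For reference, the O(1) path is just a hash lookup into the parallel arrays. A minimal sketch assuming the `SoAStorage` layout above (`findByEntityId` is a hypothetical helper):

```cpp
// Hypothetical O(1) EntityId lookup via the entity_to_index hash map.
template<typename TData>
TData const* findByEntityId(SoAStorage<TData> const& storage, EntityId id) {
    auto it = storage.entity_to_index.find(id);   // O(1) average
    return it == storage.entity_to_index.end()
            ? nullptr
            : &storage.data[it->second];          // direct array access
}
```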
## Implications for View-Based Filtering

One key use case is creating filtered views of data (e.g., by EntityId set or TimeFrameIndex range).

### Current Approach: Copy on Filter

```cpp
// Creates a new RaggedTimeSeries with copied data
auto filtered = source.copyByEntityIds(entity_set);
```

### Alternative: Index-Based Views
With SoA storage, filtering can return a lightweight view:

```cpp
// View is just a vector of indices into the source
struct RaggedView {
    SoAStorage const* source;
    std::vector<size_t> indices;  // Which entries are included
};

// Creating a filtered view - O(n) scan, O(m) allocation for m matches.
// No data copying!
auto filtered_view = source.createEntityIdView(entity_set);
```

### Performance Comparison
| Operation | Copy Approach | View Approach |
|---|---|---|
| Create filter | O(n) + copy TData | O(n) + O(m) indices |
| Memory usage | Full copy | Just indices (8 bytes each) |
| Access element | O(1) | O(1) + 1 indirection |
| Modify source | Independent | View sees changes |
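The "+1 indirection" in the table amounts to one extra array read. A minimal sketch using the `RaggedView` struct above (`getViewElement` is a hypothetical accessor, with the element type fixed to `Mask2D` for illustration):

```cpp
// Hypothetical accessor: one extra index load, then a normal O(1) access
// into the (unmodified) source arrays.
Mask2D const& getViewElement(RaggedView const& view, size_t i) {
    size_t const source_index = view.indices[i];  // the indirection
    return view.source->data[source_index];       // direct read from source
}
```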
## Recommendations

Based on the profiling results:

1. **Append-only data:** SoA storage is highly efficient for data that arrives chronologically. With 200,000 entries, iteration is 5× faster due to the cache-friendly memory layout.
2. **Frequent EntityId lookups:** The current map-based implementation is O(n) per lookup (910 µs with 200k entries). Adding an `entity_to_index` map provides O(1) access (69 ns), a 13,000× improvement.
3. **View-based filtering:** Use index vectors rather than copying data. Filtering 1000 EntityIds takes 8.1 ms with a copy vs. 0.1 ms with indices only (75× faster).
4. **Middle insertions:** Avoid when possible; batch operations and sort instead. A single middle insertion into 200k `Mask2D` elements costs ~170 µs, while an append costs ~3 ns.
5. **TimeFrameIndex operations:** The current map-based approach is already optimal for time-based lookup (O(log m)). SoA provides similar performance.
6. **Hybrid approach:** For the best overall performance:
   - Keep `Mask2D`/`Line2D` data in append-only storage (arena)
   - Maintain parallel vectors of trivially copyable metadata (`TimeFrameIndex`, `EntityId`)
   - Use hash maps for O(1) EntityId lookup
   - Have views hold index vectors referencing the source data
## Derived Storage (Views) Performance

When creating filtered views that reference source data without copying, we measured the overhead of index indirection.

### Results
```text
=============================================================================
Derived Storage Overhead Benchmark
=============================================================================
Configuration:
  Source entries: 200000
  View size:      10000 (5% of source)
-----------------------------------------------------------------------------
SEQUENTIAL ITERATION (all elements)
-----------------------------------------------------------------------------
  Direct array access: 7438.8 µs (37.2 ns/element)
  Source (virtual):    6958.1 µs (34.8 ns/element)
  Derived 'all' view:  6627.0 µs (33.1 ns/element)
  Overhead analysis (vs direct array):
    Virtual dispatch: 0.94x
    View indirection: 0.89x
-----------------------------------------------------------------------------
FILTERED VIEW ITERATION (subset)
-----------------------------------------------------------------------------
  Filtered view (9742 entries): 1724.7 µs (177.0 ns/element)
  Source + filter check:        7853.9 µs (scanning all 200000)
  Filtered view speedup:        4.6x faster
-----------------------------------------------------------------------------
CACHE EFFECTS: SEQUENTIAL vs RANDOM VIEW INDICES
-----------------------------------------------------------------------------
  Sequential indices: 306.9 µs (30.7 ns/element)
  Random indices:     1670.0 µs (171.2 ns/element)
  Cache penalty:      5.44x slower
-----------------------------------------------------------------------------
MEMORY OVERHEAD
-----------------------------------------------------------------------------
  Source storage:
    Total: 30468 KB
  Derived view (9742 entries):
    Index vector:       76 KB
    Local EntityId map: 152 KB
    Data copied:        0 KB (references source)
    Total:              228 KB
  Memory savings: 82.9% vs full copy
```

### Analysis
| Metric | Result | Notes |
|---|---|---|
| Virtual dispatch overhead | 0.94x (faster!) | Compiler optimizes well |
| Index indirection overhead | 0.89x (faster!) | CPU prefetching helps |
| Filtered iteration | 4.6x faster | vs scanning all + filter check |
| Cache penalty (random indices) | 5.4x slower | 171 ns vs 31 ns per element |
| Memory savings | 83% | 228 KB view vs full copy |
| View creation | 233 ns/EntityId | One-time cost |
Key insights:

- **Virtual dispatch is essentially free** - the compiler devirtualizes at `-O3`
- **Index indirection overhead is negligible** for sequential access, thanks to CPU prefetching
- **The real cost is cache misses** with random access patterns (5.4× penalty)
- **Views pay for themselves quickly** - the creation cost is amortized in less than one iteration
- **Sort view indices when possible** for cache-friendly access (see the sketch below)
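A minimal sketch of that last point (the function name is illustrative): sorting a view's indices once restores mostly sequential access to the source arrays.

```cpp
#include <algorithm>

// Illustrative: sort the view's indices so iteration walks the source
// arrays in ascending order, avoiding the ~5.4x random-access penalty.
// Only valid when the consumer doesn't depend on the original view order.
void sortForCacheFriendlyIteration(RaggedView& view) {
    std::sort(view.indices.begin(), view.indices.end());
}
```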
## CRTP + Variant vs Virtual Inheritance

An alternative to virtual inheritance for polymorphic storage is CRTP (the Curiously Recurring Template Pattern) combined with `std::variant` for type erasure. This pattern is already used in `AnalogDataStorage.hpp`.

### The Two Approaches
**Virtual Inheritance (Classic OOP):**

```cpp
template<typename TData>
class IRaggedStorage {
public:
    virtual ~IRaggedStorage() = default;
    virtual size_t size() const = 0;
    virtual TData const& getData(size_t idx) const = 0;
    // ... other pure virtual methods
};

class SourceStorage : public IRaggedStorage<Mask2D> { /* ... */ };
class DerivedStorage : public IRaggedStorage<Mask2D> { /* ... */ };

// Usage via base pointer
std::shared_ptr<IRaggedStorage<Mask2D>> storage = /*...*/;
auto& data = storage->getData(i);  // vtable dispatch
```

**CRTP + Variant (Modern C++):**
```cpp
template<typename Derived, typename TData>
class StorageBase {
public:
    // Static dispatch via CRTP
    size_t size() const {
        return static_cast<Derived const*>(this)->sizeImpl();
    }
    // ...
};

class SourceStorage : public StorageBase<SourceStorage, Mask2D> { /* ... */ };
class DerivedStorage : public StorageBase<DerivedStorage, Mask2D> { /* ... */ };

// Type erasure via variant
using StorageVariant = std::variant<
    std::shared_ptr<SourceStorage>,
    std::shared_ptr<DerivedStorage>
>;

StorageVariant storage = /*...*/;
auto& data = std::visit([i](auto& ptr) -> auto& {
    return ptr->getData(i);
}, storage);  // std::visit dispatch
```

### Results
```text
=============================================================================
CRTP + Variant vs Virtual Inheritance Benchmark
=============================================================================
Configuration:
  Entries:    200000
  View size:  10000
  Iterations: 100
=============================================================================
PART 1: Point2D (trivially copyable) - Isolating Dispatch Overhead
=============================================================================
POINT2D: SEQUENTIAL ITERATION (dispatch overhead visible)
  Virtual (base ptr): 691.7 µs (3.5 ns/elem)
  CRTP (direct):      332.3 µs (1.7 ns/elem)
  CRTP (variant):     320.8 µs (1.6 ns/elem)
  Speedup vs Virtual:
    CRTP direct:  2.08x
    CRTP variant: 2.16x
=============================================================================
PART 2: Mask2D (non-trivially copyable) - Data Access Dominates
=============================================================================
SOURCE STORAGE: SEQUENTIAL ITERATION
  Virtual (base ptr): 7366.7 µs (36.8 ns/elem)
  CRTP (direct):      6635.9 µs (33.2 ns/elem)
  CRTP (variant):     7000.2 µs (35.0 ns/elem)
  Speedup vs Virtual:
    CRTP direct:  1.11x
    CRTP variant: 1.05x
ENTITY ID LOOKUP (10000 lookups)
  Virtual:        4406.9 µs (440.7 ns/lookup)
  CRTP (direct):  3778.3 µs (377.8 ns/lookup)
  CRTP (variant): 2993.7 µs (299.4 ns/lookup)
  Speedup vs Virtual:
    CRTP direct:  1.17x
    CRTP variant: 1.47x
```

### Analysis Summary
| Access Pattern | Virtual | CRTP Direct | CRTP Variant |
|---|---|---|---|
| Point2D iteration | 3.5 ns/elem | 1.7 ns | 1.6 ns |
| Mask2D iteration | 36.8 ns/elem | 33.2 ns | 35.0 ns |
| EntityId lookup | 440.7 ns | 377.8 ns | 299.4 ns |
Key insights:

1. **CRTP is 2× faster for lightweight data (Point2D):**
   - Dispatch overhead becomes visible when data access is cheap
   - Both the direct and variant approaches eliminate the vtable lookup
2. **When data access dominates, dispatch doesn't matter:**
   - For Mask2D (vector-backed), virtual vs. CRTP vs. variant are within 10%
   - The CPU spends most of its time chasing pointers, not dispatching
3. **CRTP Variant wins on hash lookups (1.47× faster):**
   - `std::visit` can be faster than virtual dispatch when the compiler can see all types
   - Better inlining opportunities

When to use each approach:

| Approach | Best For |
|---|---|
| Virtual | Open extension (plugins), external libraries |
| CRTP Direct | Maximum performance, type known at compile time |
| CRTP Variant | Closed type set, value semantics, pattern matching |

CRTP + Variant advantages:
- No heap allocation for small types (in-place storage)
- Exhaustive handling enforced by the compiler (`std::visit`)
- No virtual destructor overhead
- Better optimization across type boundaries

CRTP + Variant disadvantages:

- More complex code
- All types must be known at compile time
- Larger object size (a variant stores the largest type plus a discriminant)
- A recompile is required to add new types
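A minimal sketch of the exhaustiveness point above (the `Overloaded` helper and `storageSize` function are illustrative): when a new alternative is added to `StorageVariant`, every non-exhaustive `std::visit` call fails to compile, whereas a forgotten virtual override only surfaces at runtime.

```cpp
#include <memory>
#include <variant>

// Classic overload-set helper for std::visit (uses C++20 aggregate CTAD;
// C++17 would need an explicit deduction guide).
template<typename... Fs>
struct Overloaded : Fs... { using Fs::operator()...; };

// Illustrative: one lambda per alternative. Adding a third storage type to
// StorageVariant makes this fail to compile until it is handled here too.
size_t storageSize(StorageVariant const& storage) {
    return std::visit(Overloaded{
        [](std::shared_ptr<SourceStorage> const& s) { return s->size(); },
        [](std::shared_ptr<DerivedStorage> const& s) { return s->size(); },
    }, storage);
}
```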