Profiling and Performance

This document describes the benchmarking infrastructure for Neuralyzer, including how to create benchmarks, run them, and analyze performance using various profiling tools.

Quick Start

# Build with benchmarks enabled (they're on by default)
cmake --preset linux-clang-release -DENABLE_BENCHMARK=ON
cmake --build --preset linux-clang-release

# Run all benchmarks
cd out/build/Clang/Release/benchmark
./benchmark_MaskArea

# Run specific benchmarks with filtering
./benchmark_MaskArea --benchmark_filter=Pipeline

# Run with different output formats
./benchmark_MaskArea --benchmark_format=json > results.json
./benchmark_MaskArea --benchmark_format=csv > results.csv

Architecture

CMake Infrastructure

The benchmarking system is built on cmake/BenchmarkUtils.cmake, which provides:

  • add_selective_benchmark() - Create individual benchmark executables
  • configure_benchmark_for_profiling() - Add profiling tool support
  • print_benchmark_summary() - Display configuration summary

Each benchmark can be individually enabled/disabled via CMake options:

# Disable a specific benchmark
cmake -DBENCHMARK_MASK_AREA=OFF ..

# Only build specific benchmarks
cmake -DBENCHMARK_MASK_AREA=ON -DBENCHMARK_OTHER=OFF ..

Test Data Fixtures

benchmark/fixtures/BenchmarkFixtures.hpp provides fixtures for generating realistic test data:

  • MaskDataFixture - Generate MaskData with configurable properties
  • LineDataFixture - Generate LineData for curve/line benchmarks
  • PointDataFixture - Generate PointData for point-based benchmarks

Presets are available for common scenarios:

  • Presets::SmallMaskData() - Quick iteration (10 frames)
  • Presets::MediumMaskData() - Realistic testing (100 frames)
  • Presets::LargeMaskData() - Stress testing (1000 frames)
  • Presets::SparseMaskData() - Few masks, large time gaps
  • Presets::DenseMaskData() - Many small masks per frame

Example usage:

#include "fixtures/BenchmarkFixtures.hpp"
#include <benchmark/benchmark.h>

// MyBenchmark must be a fixture class derived from benchmark::Fixture
// (see "Creating New Benchmarks" below for a full fixture definition)
BENCHMARK_F(MyBenchmark, TestCase)(benchmark::State& state) {
    // Generate realistic test data before the timed loop
    auto fixture = MaskDataFixture(Presets::MediumMaskData());
    auto mask_data = fixture.generate();

    for (auto _ : state) {
        // Benchmark code here
    }
}

Creating New Benchmarks

1. Create Benchmark Source File

Create benchmark/MyFeature.benchmark.cpp:

#include "fixtures/BenchmarkFixtures.hpp"
#include <benchmark/benchmark.h>

// Simple function benchmark
static void BM_MyFunction(benchmark::State& state) {
    // Setup
    auto data = setupTestData();
    
    for (auto _ : state) {
        auto result = myFunction(data);
        benchmark::DoNotOptimize(result);
    }
}
BENCHMARK(BM_MyFunction);

// Fixture-based benchmark
class MyFeatureBenchmark : public benchmark::Fixture {
public:
    void SetUp(benchmark::State const& state) override {
        // Generate test data once
        test_data_ = generateData();
    }
    
protected:
    std::shared_ptr<DataType> test_data_;
};

BENCHMARK_F(MyFeatureBenchmark, TestCase)(benchmark::State& state) {
    for (auto _ : state) {
        auto result = process(test_data_);
        benchmark::DoNotOptimize(result);
    }
}

BENCHMARK_MAIN();

2. Register in CMake

Add to benchmark/CMakeLists.txt:

add_selective_benchmark(
    NAME MyFeature
    SOURCES 
        MyFeature.benchmark.cpp
    LINK_LIBRARIES 
        DataManager
        MyFeatureLib
    DEFAULT ON
)

# Optional: Enable profiling support
if(TARGET benchmark_MyFeature)
    configure_benchmark_for_profiling(
        TARGET benchmark_MyFeature
        ENABLE_PERF ON
        ENABLE_HEAPTRACK ON
        GENERATE_ASM ON
    )
endif()

3. Build and Run

cmake --build --preset linux-clang-release
./out/build/Clang/Release/benchmark/benchmark_MyFeature

Google Benchmark Features

Parameterized Benchmarks

Test with different input sizes:

BENCHMARK(BM_MyFunction)
    ->Arg(100)
    ->Arg(1000)
    ->Arg(10000)
    ->Unit(benchmark::kMillisecond);

// Or use ranges (RangeMultiplier must be set before Range to take effect)
BENCHMARK(BM_MyFunction)
    ->RangeMultiplier(2)
    ->Range(8, 8<<10);  // 8 to 8192, powers of 2
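
The value registered with ->Arg() or ->Range() is read inside the benchmark body via state.range(0). A minimal self-contained sketch (the hypothetical BM_SumVector and its std::vector workload are placeholders, not Neuralyzer code):

#include <benchmark/benchmark.h>

#include <cstddef>
#include <vector>

static void BM_SumVector(benchmark::State& state) {
    // The argument supplied at registration time arrives as state.range(0)
    std::vector<int> data(static_cast<std::size_t>(state.range(0)), 1);

    for (auto _ : state) {
        long sum = 0;
        for (int v : data) sum += v;
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK(BM_SumVector)->RangeMultiplier(2)->Range(8, 8 << 10);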

Fixtures with Parameters

// Define with BENCHMARK_DEFINE_F (BENCHMARK_F would also register it), then
// register with custom parameters via BENCHMARK_REGISTER_F.
BENCHMARK_DEFINE_F(MyBenchmark, TestCase)(benchmark::State& state) {
    size_t size = state.range(0);

    for (auto _ : state) {
        // Use the size parameter to scale the workload
    }
}
BENCHMARK_REGISTER_F(MyBenchmark, TestCase)
    ->DenseRange(0, 4)  // Parameters 0, 1, 2, 3, 4
    ->Unit(benchmark::kMicrosecond);
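
SetUp() also receives the state, so the fixture itself can depend on the parameter. A minimal self-contained sketch with a plain std::vector payload (hypothetical names, not Neuralyzer types):

#include <benchmark/benchmark.h>

#include <cstddef>
#include <numeric>
#include <vector>

class SizedBenchmark : public benchmark::Fixture {
public:
    void SetUp(benchmark::State const& state) override {
        // Build the payload outside the timed loop, sized by the parameter
        data_.assign(static_cast<std::size_t>(state.range(0)), 1);
    }

protected:
    std::vector<int> data_;
};

BENCHMARK_DEFINE_F(SizedBenchmark, Sum)(benchmark::State& state) {
    for (auto _ : state) {
        long sum = std::accumulate(data_.begin(), data_.end(), 0L);
        benchmark::DoNotOptimize(sum);
    }
}
BENCHMARK_REGISTER_F(SizedBenchmark, Sum)->Arg(1000)->Arg(100000);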

Custom Counters

Track additional metrics:

for (auto _ : state) {
    auto result = myFunction(data);
    benchmark::DoNotOptimize(result);
}

// Set custom counters once, after the timed loop
state.counters["items_processed"] = static_cast<double>(data.size());
state.counters["bytes_processed"] = static_cast<double>(data.size() * sizeof(Item));

// Built-in throughput metrics, reported as items/s and bytes/s
state.SetItemsProcessed(state.iterations() * data.size());
state.SetBytesProcessed(state.iterations() * data.size() * sizeof(Item));
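
A custom counter can also be flagged as a rate so Google Benchmark divides it by elapsed time. A self-contained sketch (the hypothetical BM_CountItems and its workload are placeholders):

#include <benchmark/benchmark.h>

#include <vector>

static void BM_CountItems(benchmark::State& state) {
    std::vector<int> items(1000, 1);  // placeholder workload
    long sum = 0;

    for (auto _ : state) {
        for (int v : items) sum += v;
        benchmark::DoNotOptimize(sum);
    }

    // kIsRate: reported as items per second rather than a raw count
    state.counters["items_per_second"] = benchmark::Counter(
            static_cast<double>(state.iterations() * items.size()),
            benchmark::Counter::kIsRate);
}
BENCHMARK(BM_CountItems);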

Performance Analysis Tools

1. Perf (CPU Profiling)

Profile CPU usage and generate call graphs:

# Record profile data
perf record -g ./benchmark_MaskArea --benchmark_filter=Pipeline

# View interactive report
perf report

# Generate flamegraph
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

# View specific functions
perf annotate functionName

# Show hot spots
perf top

Key perf options:

  • -g - Enable call-graph (stack trace) recording
  • -e cycles - Profile CPU cycles
  • -e cache-misses - Profile cache misses
  • --call-graph dwarf - Use DWARF for more accurate call graphs

2. Heaptrack (Memory Profiling)

Analyze memory allocation patterns:

# Run with heaptrack
heaptrack ./benchmark_MaskArea

# View results in GUI
heaptrack_gui heaptrack.benchmark_MaskArea.*.gz

# View results in terminal
heaptrack_print heaptrack.benchmark_MaskArea.*.gz

What heaptrack shows:

  • Total memory allocated
  • Peak memory usage
  • Number of allocations
  • Call stacks for allocations
  • Memory leaks
  • Temporary allocations

3. Valgrind Cachegrind (Cache Analysis)

Analyze cache behavior:

# Run cachegrind
valgrind --tool=cachegrind ./benchmark_MaskArea --benchmark_filter=CacheBehavior

# Visualize results
cg_annotate cachegrind.out.<pid>

# Interactive visualization
kcachegrind cachegrind.out.<pid>

Metrics provided:

  • First-level (L1) and last-level (LL) cache hits/misses
  • Instruction cache behavior
  • Data cache behavior
  • Branch prediction statistics (with --branch-sim=yes)
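
To see the kind of access pattern these numbers expose, compare row-major and column-major traversal of the same matrix. A generic sketch (not Neuralyzer code); running both under cachegrind typically shows a much higher D1 miss rate for the column-major version:

#include <cstddef>
#include <vector>

// Row-major traversal touches memory contiguously (cache friendly).
long sum_row_major(std::vector<int> const& m, std::size_t rows, std::size_t cols) {
    long sum = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            sum += m[r * cols + c];
    return sum;
}

// Column-major traversal strides `cols` elements per step (cache unfriendly).
long sum_col_major(std::vector<int> const& m, std::size_t rows, std::size_t cols) {
    long sum = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            sum += m[r * cols + c];
    return sum;
}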

4. Assembly Inspection

View generated assembly for optimization:

# Method 1: objdump
objdump -d -C -S ./benchmark_MaskArea | less

# Search for specific function
objdump -d -C -S ./benchmark_MaskArea | grep -A 50 "calculateMaskArea"

# Method 2: During compilation (if GENERATE_ASM=ON)
# Assembly files (.s) generated alongside object files
find . -name "*.s" | xargs less

What to look for (see the sketch below):

  • Loop vectorization (SIMD instructions)
  • Unnecessary branches
  • Memory access patterns
  • Inlining decisions
  • Register usage
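
As a concrete target for the first item, a simple element-wise loop like the one below (generic sketch, not Neuralyzer code) is usually auto-vectorized at -O3, so its disassembly should contain packed SSE/AVX instructions (e.g. addps/vaddps on x86-64) rather than one scalar add per element:

#include <cstddef>

// Independent element-wise adds: a good auto-vectorization candidate.
// Marking the pointers __restrict (a GCC/Clang extension) helps the compiler
// prove there is no aliasing and avoid an extra runtime overlap check.
void add_arrays(float const* a, float const* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}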

5. Time Command (Quick Stats)

Get quick overview of resource usage:

/usr/bin/time -v ./benchmark_MaskArea

# Key metrics:
# - Maximum resident set size (peak memory)
# - Page faults (major = disk I/O, minor = memory)
# - Context switches
# - CPU percentage

6. Clang Build Time Analysis

If built with -DENABLE_TIME_TRACE=ON:

# View compilation time breakdown
ClangBuildAnalyzer --all build_trace build_results.bin
ClangBuildAnalyzer --analyze build_results.bin

Example Workflows

Workflow 1: Optimize a Hot Function

# 1. Identify hot spots
perf record -g ./benchmark_MaskArea
perf report  # Identify slowest function

# 2. View assembly
objdump -d -C -S ./benchmark_MaskArea | grep -A 100 "mySlowFunction"

# 3. Check cache behavior
valgrind --tool=cachegrind ./benchmark_MaskArea --benchmark_filter=MyFunction

# 4. Make changes and compare
./benchmark_MaskArea --benchmark_filter=MyFunction \
    --benchmark_out=before.json --benchmark_out_format=json
# ... make changes ...
./benchmark_MaskArea --benchmark_filter=MyFunction \
    --benchmark_out=after.json --benchmark_out_format=json
tools/compare.py benchmarks before.json after.json  # From Google Benchmark tools

Workflow 2: Fix Memory Issues

# 1. Identify allocation hot spots
heaptrack ./benchmark_MaskArea

# 2. View in GUI
heaptrack_gui heaptrack.benchmark_MaskArea.*.gz
# Look for:
# - Temporary allocations in loops
# - Large peak allocations
# - Memory leaks

# 3. Check for leaks with valgrind
valgrind --leak-check=full ./benchmark_MaskArea --benchmark_filter=MyTest

# 4. Verify fix
heaptrack ./benchmark_MaskArea
# Compare before/after total allocations and peak memory

Workflow 3: Compare Algorithms

# Benchmark different implementations
./benchmark_MaskArea --benchmark_filter="Algorithm_v1|Algorithm_v2" \
    --benchmark_format=json > comparison.json

# Use benchmark tools to compare
tools/compare.py benchmarks baseline.json comparison.json  # baseline first, contender second

# Or export to CSV for plotting
./benchmark_MaskArea --benchmark_format=csv > data.csv
# Import into spreadsheet/plotting tool

Best Practices

Do’s

  1. Use Fixtures - Set up expensive test data once
  2. DoNotOptimize - Prevent compiler from optimizing away benchmarks
  3. Test Multiple Sizes - Use parameter ranges to test scaling
  4. Add Counters - Track domain-specific metrics (items/sec, throughput)
  5. Disable Benchmarks - Turn off unused benchmarks to speed up builds
  6. Run Multiple Times - Benchmarks automatically run multiple iterations
  7. Profile Before Optimizing - Use profiling tools to find actual bottlenecks

Don’ts

  1. Don’t Measure Setup - Put expensive setup in SetUp(), not in loop
  2. Don’t Ignore Variability - Check standard deviation in results
  3. Don’t Benchmark Debug Builds - Always use Release or RelWithDebInfo
  4. Don’t Optimize Without Data - Profile first, optimize second
  5. Don’t Test Only One Size - Real data varies, test across ranges

Code Patterns

// ✅ Good: Setup outside loop
void SetUp(benchmark::State const& state) override {
    test_data_ = generateExpensiveData();
}

BENCHMARK_F(MyBench, Test)(benchmark::State& state) {
    for (auto _ : state) {
        auto result = process(test_data_);
        benchmark::DoNotOptimize(result);
    }
}

// ❌ Bad: Setup in loop
static void BM_MyBench(benchmark::State& state) {
    for (auto _ : state) {
        auto data = generateExpensiveData();  // Measured every time!
        auto result = process(data);
        benchmark::DoNotOptimize(result);
    }
}
BENCHMARK(BM_MyBench);

// ✅ Good: Prevent optimization
auto result = expensiveComputation();
benchmark::DoNotOptimize(result);

// ❌ Bad: Result might be optimized away
auto result = expensiveComputation();
// Compiler might remove entire computation if result unused
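
One more pattern worth knowing (a generic sketch, not Neuralyzer code): when results are written into a buffer rather than returned by value, benchmark::ClobberMemory() can be paired with DoNotOptimize to keep those writes observable:

#include <benchmark/benchmark.h>

#include <algorithm>
#include <vector>

static void BM_FillBuffer(benchmark::State& state) {
    std::vector<int> buffer(1024);
    for (auto _ : state) {
        std::fill(buffer.begin(), buffer.end(), 42);  // placeholder workload
        benchmark::DoNotOptimize(buffer.data());
        benchmark::ClobberMemory();  // treat the pending buffer writes as observable
    }
}
BENCHMARK(BM_FillBuffer);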

Interpreting Results

Benchmark Output

-------------------------------------------------------------------
Benchmark                         Time             CPU   Iterations
-------------------------------------------------------------------
BM_Pipeline/0                  1.23 ms         1.23 ms          569
BM_Pipeline/1                 12.45 ms        12.43 ms           56
BM_Pipeline/2                124.56 ms       124.12 ms            6

  • Time: Wall-clock time (includes I/O, system time)
  • CPU: CPU time (excludes I/O wait)
  • Iterations: How many times the benchmark ran

What’s Fast Enough?

  • < 1 microsecond: Excellent for element operations
  • < 1 millisecond: Good for frame-level processing
  • < 100 milliseconds: Acceptable for batch operations
  • > 1 second: Consider optimization or progress reporting

Profiling Result Interpretation

Perf Report:

  • Focus on functions with >5% of total time
  • Check for unexpected function calls
  • Look for optimization opportunities in hot paths

Heaptrack:

  • Peak memory < 2x working set is good
  • Allocation count should be O(1) for inner loops
  • Look for unnecessary temporary allocations

Cachegrind:

  • L1 miss rate < 5% is excellent
  • Last-level (LL) miss rate < 1% is good
  • Look for cache-unfriendly access patterns

Continuous Performance Monitoring

Save Baseline Results

# Save current performance as baseline
./benchmark_MaskArea --benchmark_out=baseline.json \
    --benchmark_out_format=json

# After changes, compare
./benchmark_MaskArea --benchmark_out=current.json \
    --benchmark_out_format=json

# Compare results
tools/compare.py benchmarks baseline.json current.json

Troubleshooting

Benchmarks Too Fast

If benchmarks complete in < 1 microsecond:

// Increase work per iteration
for (auto _ : state) {
    for (int i = 0; i < 100; ++i) {  // Amortize overhead
        auto result = fastFunction();
        benchmark::DoNotOptimize(result);
    }
}
state.SetItemsProcessed(state.iterations() * 100);

High Variability

If standard deviation > 10% of mean:

  1. Close other applications
  2. Disable CPU frequency scaling: sudo cpupower frequency-set --governor performance
  3. Pin to specific CPU: taskset -c 0 ./benchmark_MaskArea
  4. Increase minimum measurement time: --benchmark_min_time=5s (older Google Benchmark releases accept a bare number of seconds)

Build Issues

# Benchmarks not building
cmake -DENABLE_BENCHMARK=ON ..

# Specific benchmark disabled
cmake -DBENCHMARK_MASK_AREA=ON ..

# Google Benchmark not found
# (Should auto-fetch with FetchContent, but can install manually)
sudo apt install libbenchmark-dev  # Ubuntu/Debian

Additional Resources