Profiling and Performance
This document describes the benchmarking infrastructure for Neuralyzer, including how to create benchmarks, run them, and analyze performance using various profiling tools.
Quick Start
# Build with benchmarks enabled (they're on by default)
cmake --preset linux-clang-release -DENABLE_BENCHMARK=ON
cmake --build --preset linux-clang-release
# Run all benchmarks
cd out/build/Clang/Release/benchmark
./benchmark_MaskArea
# Run specific benchmarks with filtering
./benchmark_MaskArea --benchmark_filter=Pipeline
# Run with different output formats
./benchmark_MaskArea --benchmark_format=json > results.json
./benchmark_MaskArea --benchmark_format=csv > results.csv
Architecture
CMake Infrastructure
The benchmarking system is built on cmake/BenchmarkUtils.cmake, which provides:
- add_selective_benchmark() - Create individual benchmark executables
- configure_benchmark_for_profiling() - Add profiling tool support
- print_benchmark_summary() - Display configuration summary
Each benchmark can be individually enabled/disabled via CMake options:
# Disable a specific benchmark
cmake -DBENCHMARK_MASK_AREA=OFF ..
# Only build specific benchmarks
cmake -DBENCHMARK_MASK_AREA=ON -DBENCHMARK_OTHER=OFF ..
Test Data Fixtures
benchmark/fixtures/BenchmarkFixtures.hpp provides fixtures for generating realistic test data:
- MaskDataFixture - Generate MaskData with configurable properties
- LineDataFixture - Generate LineData for curve/line benchmarks
- PointDataFixture - Generate PointData for point-based benchmarks
Presets are available for common scenarios:
- Presets::SmallMaskData() - Quick iteration (10 frames)
- Presets::MediumMaskData() - Realistic testing (100 frames)
- Presets::LargeMaskData() - Stress testing (1000 frames)
- Presets::SparseMaskData() - Few masks, large time gaps
- Presets::DenseMaskData() - Many small masks per frame
Example usage:
#include "fixtures/BenchmarkFixtures.hpp"
#include <benchmark/benchmark.h>
static void BM_MaskDataExample(benchmark::State& state) {
    auto fixture = MaskDataFixture(Presets::MediumMaskData());
    auto mask_data = fixture.generate(); // generated once, outside the timing loop
    for (auto _ : state) {
        // Benchmark code here
    }
}
BENCHMARK(BM_MaskDataExample);
Creating New Benchmarks
1. Create Benchmark Source File
Create benchmark/MyFeature.benchmark.cpp:
#include "fixtures/BenchmarkFixtures.hpp"
#include <benchmark/benchmark.h>
// Simple function benchmark
static void BM_MyFunction(benchmark::State& state) {
// Setup
auto data = setupTestData();
for (auto _ : state) {
auto result = myFunction(data);
benchmark::DoNotOptimize(result);
}
}
BENCHMARK(BM_MyFunction);
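// Registration returns a builder object, so options can be chained
// (a sketch using standard Google Benchmark options):
// BENCHMARK(BM_MyFunction)->Unit(benchmark::kMicrosecond)->MinTime(2.0);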
// Fixture-based benchmark
class MyFeatureBenchmark : public benchmark::Fixture {
public:
void SetUp(benchmark::State const& state) override {
// Generate test data once
test_data_ = generateData();
}
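    // Optional cleanup hook, also provided by benchmark::Fixture:
    // void TearDown(benchmark::State const& state) override { test_data_.reset(); }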
protected:
std::shared_ptr<DataType> test_data_;
};
BENCHMARK_F(MyFeatureBenchmark, TestCase)(benchmark::State& state) {
for (auto _ : state) {
auto result = process(test_data_);
benchmark::DoNotOptimize(result);
}
}
BENCHMARK_MAIN();
2. Register in CMake
Add to benchmark/CMakeLists.txt:
add_selective_benchmark(
NAME MyFeature
SOURCES
MyFeature.benchmark.cpp
LINK_LIBRARIES
DataManager
MyFeatureLib
DEFAULT ON
)
# Optional: Enable profiling support
if(TARGET benchmark_MyFeature)
configure_benchmark_for_profiling(
TARGET benchmark_MyFeature
ENABLE_PERF ON
ENABLE_HEAPTRACK ON
GENERATE_ASM ON
)
endif()
3. Build and Run
cmake --build --preset linux-clang-release
./out/build/Clang/Release/benchmark/benchmark_MyFeature
Google Benchmark Features
Parameterized Benchmarks
Test with different input sizes:
BENCHMARK(BM_MyFunction)
->Arg(100)
->Arg(1000)
->Arg(10000)
->Unit(benchmark::kMillisecond);
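// Google Benchmark can also fit an asymptotic complexity to the ranged runs
// (a sketch; call state.SetComplexityN(state.range(0)) in the benchmark body):
// BENCHMARK(BM_MyFunction)->RangeMultiplier(2)->Range(8, 8<<10)->Complexity();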
// Or use ranges
BENCHMARK(BM_MyFunction)
    ->RangeMultiplier(2)   // set the multiplier before Range so it applies
    ->Range(8, 8<<10);     // 8 to 8192, powers of 2
Fixtures with Parameters
BENCHMARK_DEFINE_F(MyBenchmark, TestCase)(benchmark::State& state) {
    size_t size = state.range(0);
    // Use size parameter
}
BENCHMARK_REGISTER_F(MyBenchmark, TestCase)
    ->DenseRange(0, 4) // Parameters 0, 1, 2, 3, 4
    ->Unit(benchmark::kMicrosecond);
Custom Counters
Track additional metrics:
for (auto _ : state) {
    auto result = myFunction(data);
    benchmark::DoNotOptimize(result);
}
// Set counters once, after the timing loop
state.counters["items_processed"] = data.size();
state.counters["bytes_processed"] = data.size() * sizeof(Item);
state.SetItemsProcessed(state.iterations() * data.size());
state.SetBytesProcessed(state.iterations() * data.size() * sizeof(Item));
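Counters can also be tagged as rates, in which case Google Benchmark divides by elapsed time when reporting; a minimal sketch (the counter name is illustrative):
state.counters["items_per_sec"] = benchmark::Counter(
    static_cast<double>(state.iterations() * data.size()),
    benchmark::Counter::kIsRate);
Performance Analysis Tools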
1. Perf (CPU Profiling)
Profile CPU usage and generate call graphs:
# Record profile data
perf record -g ./benchmark_MaskArea --benchmark_filter=Pipeline
# View interactive report
perf report
# Generate flamegraph
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg
# View specific functions
perf annotate functionName
# Show hot spots
perf top
Key perf options:
- -g - Enable call-graph (stack trace) recording
- -e cycles - Profile CPU cycles
- -e cache-misses - Profile cache misses
- --call-graph dwarf - Use DWARF for more accurate call graphs
2. Heaptrack (Memory Profiling)
Analyze memory allocation patterns:
# Run with heaptrack
heaptrack ./benchmark_MaskArea
# View results in GUI
heaptrack_gui heaptrack.benchmark_MaskArea.*.gz
# View results in terminal
heaptrack_print heaptrack.benchmark_MaskArea.*.gz
What heaptrack shows:
- Total memory allocated
- Peak memory usage
- Number of allocations
- Call stacks for allocations
- Memory leaks
- Temporary allocations
3. Valgrind Cachegrind (Cache Analysis)
Analyze cache behavior:
# Run cachegrind
valgrind --tool=cachegrind ./benchmark_MaskArea --benchmark_filter=CacheBehavior
# Visualize results
cg_annotate cachegrind.out.<pid>
# Interactive visualization
kcachegrind cachegrind.out.<pid>
Metrics provided:
- First-level (I1/D1) and last-level (LL) cache hits/misses
- Instruction cache behavior
- Data cache behavior
- Branch prediction statistics (with --branch-sim=yes)
4. Assembly Inspection
View generated assembly for optimization:
# Method 1: objdump
objdump -d -C -S ./benchmark_MaskArea | less
# Search for specific function
objdump -d -C -S ./benchmark_MaskArea | grep -A 50 "calculateMaskArea"
# Method 2: During compilation (if GENERATE_ASM=ON)
# Assembly files (.s) generated alongside object files
find . -name "*.s" | xargs less
What to look for:
- Loop vectorization (SIMD instructions)
- Unnecessary branches
- Memory access patterns
- Inlining decisions
- Register usage
5. Time Command (Quick Stats)
Get quick overview of resource usage:
/usr/bin/time -v ./benchmark_MaskArea
# Key metrics:
# - Maximum resident set size (peak memory)
# - Page faults (major = disk I/O, minor = memory)
# - Context switches
# - CPU percentage
6. Clang Build Time Analysis
If built with -DENABLE_TIME_TRACE=ON:
# View compilation time breakdown
ClangBuildAnalyzer --all build_trace build_results.bin
ClangBuildAnalyzer --analyze build_results.bin
Example Workflows
Workflow 1: Optimize a Hot Function
# 1. Identify hot spots
perf record -g ./benchmark_MaskArea
perf report # Identify slowest function
# 2. View assembly
objdump -d -C -S ./benchmark_MaskArea | grep -A 100 "mySlowFunction"
# 3. Check cache behavior
valgrind --tool=cachegrind ./benchmark_MaskArea --benchmark_filter=MyFunction
# 4. Make changes and compare
./benchmark_MaskArea --benchmark_filter=MyFunction --benchmark_out=before.json --benchmark_out_format=json
# ... make changes ...
./benchmark_MaskArea --benchmark_filter=MyFunction --benchmark_out=after.json --benchmark_out_format=json
tools/compare.py benchmarks before.json after.json # From Google Benchmark's tools/
Workflow 2: Fix Memory Issues
# 1. Identify allocation hot spots
heaptrack ./benchmark_MaskArea
# 2. View in GUI
heaptrack_gui heaptrack.benchmark_MaskArea.*.gz
# Look for:
# - Temporary allocations in loops
# - Large peak allocations
# - Memory leaks
# 3. Check for leaks with valgrind
valgrind --leak-check=full ./benchmark_MaskArea --benchmark_filter=MyTest
# 4. Verify fix
heaptrack ./benchmark_MaskArea
# Compare before/after total allocations and peak memory
Workflow 3: Compare Algorithms
# Benchmark different implementations
./benchmark_MaskArea --benchmark_filter="Algorithm_v1|Algorithm_v2" \
--benchmark_format=json > comparison.json
# Use benchmark tools to compare
tools/compare.py benchmarks baseline.json comparison.json # baseline first, then contender
# Or export to CSV for plotting
./benchmark_MaskArea --benchmark_format=csv > data.csv
# Import into spreadsheet/plotting tool
Best Practices
Do’s
- Use Fixtures - Set up expensive test data once
- DoNotOptimize - Prevent compiler from optimizing away benchmarks
- Test Multiple Sizes - Use parameter ranges to test scaling
- Add Counters - Track domain-specific metrics (items/sec, throughput)
- Disable Benchmarks - Turn off unused benchmarks to speed up builds
- Run Multiple Times - Benchmarks automatically run multiple iterations
- Profile Before Optimizing - Use profiling tools to find actual bottlenecks
Don’ts
- Don’t Measure Setup - Put expensive setup in SetUp(), not in the timing loop
- Don’t Ignore Variability - Check the standard deviation in results (see the sketch after this list)
- Don’t Benchmark Debug Builds - Always use Release or RelWithDebInfo
- Don’t Optimize Without Data - Profile first, optimize second
- Don’t Test Only One Size - Real data varies, test across ranges
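To quantify that variability, Google Benchmark can repeat a whole benchmark and report aggregate statistics:
BENCHMARK(BM_MyFunction)
    ->Repetitions(10)             // run the full benchmark 10 times
    ->ReportAggregatesOnly(true); // report mean/median/stddev, not each run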
Code Patterns
// ✅ Good: Setup outside loop
void SetUp(benchmark::State const& state) override {
test_data_ = generateExpensiveData();
}
BENCHMARK_F(MyBench, Test)(benchmark::State& state) {
for (auto _ : state) {
auto result = process(test_data_);
benchmark::DoNotOptimize(result);
}
}
// ❌ Bad: Setup in loop
static void BM_MyBenchBad(benchmark::State& state) {
    for (auto _ : state) {
        auto data = generateExpensiveData(); // Measured every time!
        auto result = process(data);
        benchmark::DoNotOptimize(result);
    }
}
BENCHMARK(BM_MyBenchBad);
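// If per-iteration setup truly cannot be hoisted, timing can be paused
// around it (a sketch; PauseTiming/ResumeTiming add per-iteration
// overhead, so prefer SetUp() when possible):
static void BM_MyBenchPaused(benchmark::State& state) {
    for (auto _ : state) {
        state.PauseTiming();
        auto data = generateExpensiveData(); // excluded from measurement
        state.ResumeTiming();
        auto result = process(data);
        benchmark::DoNotOptimize(result);
    }
}
BENCHMARK(BM_MyBenchPaused);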
// ✅ Good: Prevent optimization
auto result = expensiveComputation();
benchmark::DoNotOptimize(result);
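// For code that writes through pointers, ClobberMemory() forces pending
// writes to complete so the compiler cannot elide them (a sketch):
// buffer.push_back(compute());
// benchmark::ClobberMemory(); // flush writes; pairs with DoNotOptimize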
// ❌ Bad: Result might be optimized away
auto result = expensiveComputation();
// Compiler might remove the entire computation if the result is unused
Interpreting Results
Benchmark Output
-------------------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------------------
BM_Pipeline/0 1.23 ms 1.23 ms 569
BM_Pipeline/1 12.45 ms 12.43 ms 56
BM_Pipeline/2 124.56 ms 124.12 ms 6
- Time: Wall-clock time (includes I/O, system time)
- CPU: CPU time (excludes I/O wait); by default this clock drives the iteration count (see the sketch below)
- Iterations: How many times the benchmark ran
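For I/O-bound or multi-threaded benchmarks, Google Benchmark can instead stabilize the iteration count on wall-clock time:
BENCHMARK(BM_Pipeline)->UseRealTime(); // iterate until wall-clock time is stable, not CPU time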
What’s Fast Enough?
- < 1 microsecond: Excellent for element operations
- < 1 millisecond: Good for frame-level processing
- < 100 milliseconds: Acceptable for batch operations
- > 1 second: Consider optimization or progress reporting
Profiling Result Interpretation
Perf Report:
- Focus on functions with >5% of total time
- Check for unexpected function calls
- Look for optimization opportunities in hot paths
Heaptrack:
- Peak memory < 2x working set is good
- Allocation count should be O(1) for inner loops
- Look for unnecessary temporary allocations
Cachegrind:
- L1 miss rate < 5% is excellent
- Last-level (LL) miss rate < 1% is good
- Look for cache-unfriendly access patterns
Continuous Performance Monitoring
Save Baseline Results
# Save current performance as baseline
./benchmark_MaskArea --benchmark_out=baseline.json \
--benchmark_out_format=json
# After changes, compare
./benchmark_MaskArea --benchmark_out=current.json \
--benchmark_out_format=json
# Compare results
tools/compare.py benchmarks baseline.json current.json
Troubleshooting
Benchmarks Too Fast
If benchmarks complete in < 1 microsecond:
// Increase work per iteration
for (auto _ : state) {
for (int i = 0; i < 100; ++i) { // Amortize overhead
auto result = fastFunction();
benchmark::DoNotOptimize(result);
}
}
state.SetItemsProcessed(state.iterations() * 100);
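Alternatively, Google Benchmark's KeepRunningBatch() counts each pass as a batch of iterations, which keeps the reported per-item time meaningful; a sketch:
while (state.KeepRunningBatch(100)) {
    for (int i = 0; i < 100; ++i) {
        benchmark::DoNotOptimize(fastFunction());
    }
}
High Variability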
If standard deviation > 10% of mean:
- Close other applications
- Disable CPU frequency scaling: sudo cpupower frequency-set --governor performance
- Pin to a specific CPU: taskset -c 0 ./benchmark_MaskArea
- Increase the minimum iteration time: --benchmark_min_time=5 (newer Google Benchmark releases expect a unit suffix, e.g. =5s)
Build Issues
# Benchmarks not building
cmake -DENABLE_BENCHMARK=ON ..
# Specific benchmark disabled
cmake -DBENCHMARK_MASK_AREA=ON ..
# Google Benchmark not found
# (Should auto-fetch with FetchContent, but can install manually)
sudo apt install libbenchmark-dev # Ubuntu/Debian