Background Inference Roadmap
Overview
This document tracks the implementation plan for moving deep learning batch inference off the main thread so the UI stays responsive during long runs.
The plan has two stages: first, a simple fork-join approach where the worker accumulates all results and bulk-writes them after completion (Phases 1–2). Then, an upgrade to the write reservation + merge pattern where results appear progressively in the UI during inference (Phase 3). Both stages follow the Concurrency Architecture.
Design Rationale
This roadmap follows the thread confinement + merge model described in the Concurrency Architecture. DataManager stays single-threaded; the worker uses private data; results are merged on the main thread.
Why a separate MediaData instance?
VideoData wraps ffmpeg_wrapper::VideoDecoder, which holds stateful FFmpeg context (current decode position, codec context, frame buffer). If the worker thread seeks to frame N while the UI is displaying frame M, they corrupt each other’s state. Creating a second VideoData at the same file path gives the worker its own independent FFmpeg decoder — both can read frames concurrently from the same file without interference.
This aligns with ConcurrencyTraits<VideoData>::supports_cheap_clone = true — VideoData cannot be shared across threads but can be cheaply cloned. See Concurrency Architecture for the full access pattern decision matrix.
Why all-at-once results first?
For the initial implementation (Phases 1–2), the worker accumulates all results in a private buffer and bulk-writes them on the main thread after completion. This is the simplest correct approach: no synchronization, no merge timer, no partial-state concerns.
Phase 3 upgrades to the write reservation + periodic merge pattern from the Concurrency Architecture, giving progressive visibility — the user sees mask frames appearing as the model processes them.
Key Existing Patterns
MLCoreWidget PipelineWorker
src/WhiskerToolbox/MLCore_Widget/MLCoreWidget.cpp (L46–80):
- `class PipelineWorker : public QThread` with `Q_OBJECT`
- Emits `progressUpdate(int, QString)` signal from the worker thread
- Qt automatically delivers cross-thread signals via `QueuedConnection`
- Main thread connects `QThread::finished` to a lambda that harvests results and calls `deleteLater()`
- Panels disabled during execution via `_setPipelineRunning(bool)`
producer_consumer_pipeline
src/DataManager/transforms/Media/producer_consumer_pipeline.hpp:
- Uses `std::mutex` around `MediaData` access when multiple threads read frames
- Demonstrates that creating separate media access per thread is the safe approach
NotifyObservers::No
`addAtTime(frame_idx, data, NotifyObservers::No)` suppresses per-frame observer callbacks; a single `notifyObservers()` call follows after all data is written. This avoids firing N observer callbacks during bulk writes.
Implementation Plan
Phase 1: Worker Thread Infrastructure
Phase 2: Wiring and UI
- On `QThread::finished`, the main-thread handler harvests the worker's results, writes them all to `DataManager` on the main thread using `addAtTime()` with `NotifyObservers::No`, calls `notifyObservers()` once on affected data objects, calls `worker->deleteLater()`, and re-enables the UI.
- Disable “Run Batch”, “Run Single”, and “Run Recurrent” buttons while running (following the `_setPipelineRunning()` pattern).
Phase 3: Progressive Visibility via Write Reservations
Goal: Instead of waiting for the entire batch to finish, show results appearing frame-by-frame in the UI while the worker runs. This uses the write reservation + merge pattern from the Concurrency Architecture.
Prerequisite: createWriteReservation<T>() must be implemented on DataManager (see Concurrency Architecture Stage 2).
Phase 4: Testing and Documentation
Files to Modify
| File | Changes |
|---|---|
| `SlotAssembler.hpp` | `BatchInferenceResult` struct, `runBatchRangeOffline()` declaration, `ResultCallback` type alias |
| `SlotAssembler.cpp` | `runBatchRangeOffline()` and `decodeOutputsToBuffer()` implementation, progressive callback support |
| `DeepLearningPropertiesWidget.hpp` | Worker pointer member, `_setBatchRunning()` method, `QTimer*` and `WriteReservation` members, `_mergeResults()` |
| `DeepLearningPropertiesWidget.cpp` | `BatchInferenceWorker` class, rewritten `_onRunBatch()`, cancel handler, merge timer, progressive merge |
Files to Create
| File | Purpose |
|---|---|
| `WriteReservation.hpp` | Thread-safe buffer for progressive result delivery |
Verification
- Build — `cmake --build --preset linux-clang-release` must succeed
- Existing tests — All `DeepLearning` and `SlotAssembler` tests must pass
- Manual — UI responsiveness — Start batch inference on >100 frames; verify the `Media_Widget` slider and other widgets remain interactive
- Manual — cancellation — Start batch, cancel mid-run; verify partial results are written and UI recovers cleanly
- Manual — correctness — Compare batch output against single-frame inference for the same frames; results must match
- Manual — video contention — During batch inference, scrub the media slider to confirm the UI’s video decoder works independently from the worker’s
Design Decisions
- All-at-once first, progressive merge second — Phases 1–2 use all-at-once delivery (simplest correct implementation). Phase 3 upgrades to write reservation + periodic merge for progressive visibility. This staged approach lets us ship a working background inference quickly and add polish later.
- Separate `VideoData` instance per worker — `VideoData` has `supports_concurrent_read = false` and `supports_cheap_clone = true` in `ConcurrencyTraits`. Cloning gives the worker its own FFmpeg decoder.
- `QThread` (not `std::jthread`) — Consistent with the existing `MLCoreWidget` pattern; integrates naturally with Qt signals/slots for cross-thread communication.
- Scope: deep learning only — `DataTransform_Widget` threading uses the same pattern but is out of scope for this plan.
Relationship to Concurrency Architecture
This roadmap implements the merge pattern from the Concurrency Architecture for the specific case of deep learning batch inference.
| Architecture Concept | How It Applies Here |
|---|---|
| Thread confinement | DataManager writes only on main thread |
| `ConcurrencyTraits` | `VideoData` cloned (not shared); output data uses reservation buffer |
| Write reservation | Worker writes to private buffer; main thread merges periodically |
| Fork-join (Phases 1–2) | Worker returns BatchInferenceResult; main thread writes all at once |
| Merge (Phase 3) | Timer-driven merge for progressive visibility |
The same pattern generalizes to DataTransform_Widget and any other long-running computation. See the architecture document for the full design.