Background Inference Roadmap
Overview
This document tracks the implementation plan for moving deep learning batch inference off the main thread so the UI stays responsive during long runs.
The plan has two stages: first, a simple fork-join approach where the worker accumulates all results and bulk-writes them after completion (Phases 1–2). Then, an upgrade to the write reservation + merge pattern where results appear progressively in the UI during inference (Phase 3). Both stages follow the Concurrency Architecture.
Design Rationale
This roadmap follows the thread confinement + merge model described in the Concurrency Architecture. DataManager stays single-threaded; the worker uses private data; results are merged on the main thread.
Why a separate MediaData instance?
VideoData wraps ffmpeg_wrapper::VideoDecoder, which holds stateful FFmpeg context (current decode position, codec context, frame buffer). If the worker thread seeks to frame N while the UI is displaying frame M, they corrupt each other’s state. Creating a second VideoData at the same file path gives the worker its own independent FFmpeg decoder — both can read frames concurrently from the same file without interference.
This aligns with ConcurrencyTraits<VideoData>::supports_cheap_clone = true — VideoData cannot be shared across threads but can be cheaply cloned. See Concurrency Architecture for the full access pattern decision matrix.
Why all-at-once results first?
For the initial implementation (Phases 1–2), the worker accumulates all results in a private buffer and bulk-writes them on the main thread after completion. This is the simplest correct approach: no synchronization, no merge timer, no partial-state concerns.
Phase 3 upgrades to the write reservation + periodic merge pattern from the Concurrency Architecture, giving progressive visibility — the user sees mask frames appearing as the model processes them.
Key Existing Patterns
MLCoreWidget PipelineWorker
src/WhiskerToolbox/MLCore_Widget/MLCoreWidget.cpp (L46–80):
- `class PipelineWorker : public QThread` with `Q_OBJECT`
- Emits `progressUpdate(int, QString)` signal from the worker thread
- Qt automatically delivers cross-thread signals via `QueuedConnection`
- Main thread connects `QThread::finished` to a lambda that harvests results and calls `deleteLater()`
- Panels disabled during execution via `_setPipelineRunning(bool)`
producer_consumer_pipeline
src/DataManager/transforms/Media/producer_consumer_pipeline.hpp:
- Uses `std::mutex` around `MediaData` access when multiple threads read frames
- Demonstrates that creating separate media access per thread is the safe approach
NotifyObservers::No
`addAtTime(frame_idx, data, NotifyObservers::No)` suppresses per-frame observer callbacks; a single `notifyObservers()` call follows after all data is written. This avoids firing N observer callbacks during bulk writes.
Implementation Plan
Phase 1: Worker Thread Infrastructure
Phase 2: Wiring and UI
- On `QThread::finished`, the main-thread handler harvests the worker's results, writes them all to `DataManager` on the main thread using `addAtTime()` with `NotifyObservers::No`, calls `notifyObservers()` once on affected data objects, calls `worker->deleteLater()`, and re-enables the UI.
- Disable “Run Batch”, “Run Single”, and “Run Recurrent” buttons while running (following the `_setPipelineRunning()` pattern).
Phase 3: Progressive Visibility via Write Reservations
Goal: Instead of waiting for the entire batch to finish, show results appearing frame-by-frame in the UI while the worker runs. This uses the write reservation + merge pattern from the Concurrency Architecture.
Prerequisite: createWriteReservation<T>() must be implemented on DataManager (see Concurrency Architecture Stage 2).
Phase 4: Testing and Documentation
Files to Modify
| File | Changes |
|---|---|
| `SlotAssembler.hpp` | `BatchInferenceResult` struct, `runBatchRangeOffline()` declaration, `ResultCallback` type alias |
| `SlotAssembler.cpp` | `runBatchRangeOffline()` and `decodeOutputsToBuffer()` implementation, progressive callback support |
| `DeepLearningPropertiesWidget.hpp` | Worker pointer member, `_setBatchRunning()` method, `QTimer*` and `WriteReservation` members, `_mergeResults()` |
| `DeepLearningPropertiesWidget.cpp` | `BatchInferenceWorker` class, rewritten `_onRunBatch()`, cancel handler, merge timer, progressive merge |
Files to Create
| File | Purpose |
|---|---|
| `WriteReservation.hpp` | Thread-safe buffer for progressive result delivery |
Verification
- Build — `cmake --build --preset linux-clang-release` must succeed
- Existing tests — All `DeepLearning` and `SlotAssembler` tests must pass
- Manual — UI responsiveness — Start batch inference on >100 frames; verify the `Media_Widget` slider and other widgets remain interactive
- Manual — cancellation — Start batch, cancel mid-run; verify partial results are written and UI recovers cleanly
- Manual — correctness — Compare batch output against single-frame inference for the same frames; results must match
- Manual — video contention — During batch inference, scrub the media slider to confirm the UI’s video decoder works independently from the worker’s
Design Decisions
- All-at-once first, progressive merge second — Phases 1–2 use all-at-once delivery (simplest correct implementation). Phase 3 upgrades to write reservation + periodic merge for progressive visibility. This staged approach lets us ship a working background inference quickly and add polish later.
- Separate `VideoData` instance per worker — `VideoData` has `supports_concurrent_read = false` and `supports_cheap_clone = true` in `ConcurrencyTraits`. Cloning gives the worker its own FFmpeg decoder.
- `QThread` (not `std::jthread`) — Consistent with the existing `MLCoreWidget` pattern; integrates naturally with Qt signals/slots for cross-thread communication.
- Scope: deep learning only — `DataTransform_Widget` threading uses the same pattern but is out of scope for this plan.
Relationship to Concurrency Architecture
This roadmap implements the merge pattern from the Concurrency Architecture for the specific case of deep learning batch inference.
| Architecture Concept | How It Applies Here |
|---|---|
| Thread confinement | DataManager writes only on main thread |
| `ConcurrencyTraits` | `VideoData` cloned (not shared); output data uses reservation buffer |
| Write reservation | Worker writes to private buffer; main thread merges periodically |
| Fork-join (Phases 1–2) | Worker returns BatchInferenceResult; main thread writes all at once |
| Merge (Phase 3) | Timer-driven merge for progressive visibility |
The same pattern generalizes to DataTransform_Widget and any other long-running computation. See the architecture document for the full design.