Background Inference Roadmap

Overview

This document tracks the implementation plan for moving deep learning batch inference off the main thread so the UI stays responsive during long runs.

The plan has two stages: first, a simple fork-join approach where the worker accumulates all results and bulk-writes them after completion (Phases 1–2). Then, an upgrade to the write reservation + merge pattern where results appear progressively in the UI during inference (Phase 3). Both stages follow the Concurrency Architecture.

Design Rationale

This roadmap follows the thread confinement + merge model described in the Concurrency Architecture. DataManager stays single-threaded; the worker uses private data; results are merged on the main thread.

Why a separate MediaData instance?

VideoData wraps ffmpeg_wrapper::VideoDecoder, which holds stateful FFmpeg context (current decode position, codec context, frame buffer). If the worker thread seeks to frame N while the UI is displaying frame M, they corrupt each other’s state. Creating a second VideoData at the same file path gives the worker its own independent FFmpeg decoder — both can read frames concurrently from the same file without interference.

This aligns with ConcurrencyTraits<VideoData>::supports_cheap_clone = true: VideoData cannot be shared across threads but can be cheaply cloned. See the Concurrency Architecture for the full access-pattern decision matrix.
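The per-worker decoder idea can be sketched in plain C++. FakeDecoder, decodeRange, and concurrentReadDemo below are illustrative stand-ins, not the real VideoData API: each thread constructs its own stateful decoder from the same path, so seeks never interfere.

```cpp
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Illustrative stand-in for VideoData: a stateful decoder whose seek
// position would be corrupted by concurrent use from two threads.
struct FakeDecoder {
    explicit FakeDecoder(std::string path) : path_(std::move(path)) {}
    int readFrame(int n) { position_ = n; return position_; }  // seek + decode
    std::string path_;
    int position_ = -1;
};

// Worker-side loop: the decoder is constructed inside the worker, so it is
// private to that thread (the "cheap clone" in the real design).
std::vector<int> decodeRange(std::string const & path, int count) {
    FakeDecoder decoder(path);
    std::vector<int> frames;
    for (int f = 0; f < count; ++f) frames.push_back(decoder.readFrame(f));
    return frames;
}

// UI and worker read the same file concurrently without any shared state.
std::pair<std::vector<int>, int> concurrentReadDemo(std::string const & path) {
    std::vector<int> worker_frames;
    std::thread worker([&] { worker_frames = decodeRange(path, 5); });
    FakeDecoder ui_decoder(path);            // the UI keeps its own instance
    int const ui_frame = ui_decoder.readFrame(42);
    worker.join();
    return {worker_frames, ui_frame};
}
```

The key point is that no mutex is needed: neither thread ever touches the other's decoder state.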

Why all-at-once results first?

For the initial implementation (Phases 1–2), the worker accumulates all results in a private buffer and bulk-writes them on the main thread after completion. This is the simplest correct approach: no synchronization, no merge timer, no partial-state concerns.

Phase 3 upgrades to the write reservation + periodic merge pattern from the Concurrency Architecture, giving progressive visibility — the user sees mask frames appearing as the model processes them.

Key Existing Patterns

MLCoreWidget PipelineWorker

src/WhiskerToolbox/MLCore_Widget/MLCoreWidget.cpp (L46–80):

  • class PipelineWorker : public QThread with Q_OBJECT
  • Emits progressUpdate(int, QString) signal from the worker thread
  • Qt automatically delivers cross-thread signals via QueuedConnection
  • Main thread connects QThread::finished to a lambda that harvests results and calls deleteLater()
  • Panels disabled during execution via _setPipelineRunning(bool)

producer_consumer_pipeline

src/DataManager/transforms/Media/producer_consumer_pipeline.hpp:

  • Uses std::mutex around MediaData access when multiple threads read frames
  • Demonstrates that creating separate media access per thread is the safe approach

NotifyObservers::No

addAtTime(frame_idx, data, NotifyObservers::No) suppresses the per-frame observer callback; a single notifyObservers() call after all data is written replaces the N callback fires that a bulk write would otherwise trigger.
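A minimal sketch of the suppressed-notification bulk write. MaskSeries is an illustrative stub standing in for a DataManager-owned data object, and the addAtTime() signature is simplified; only the NotifyObservers names come from the plan.

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

enum class NotifyObservers { Yes, No };

// Illustrative stand-in for a DataManager-owned data object; the real
// addAtTime() signature differs.
class MaskSeries {
public:
    void addAtTime(int frame, int value, NotifyObservers notify) {
        data_[frame] = value;
        if (notify == NotifyObservers::Yes) notifyObservers();
    }
    void notifyObservers() { ++notify_count_; }
    int notifyCount() const { return notify_count_; }
    std::size_t size() const { return data_.size(); }

private:
    std::map<int, int> data_;
    int notify_count_ = 0;
};

// Bulk write: suppress the per-frame callback, then notify once at the end.
int bulkWrite(MaskSeries & series,
              std::vector<std::pair<int, int>> const & results) {
    for (auto const & [frame, value] : results)
        series.addAtTime(frame, value, NotifyObservers::No);
    series.notifyObservers();  // single observer fire after all writes
    return series.notifyCount();
}
```

With 100 frames written this way, observers fire once instead of 100 times.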

Implementation Plan

Phase 1: Worker Thread Infrastructure

Phase 2: Wiring and UI

    1. Connect QThread::finished to a handler that (1) harvests results from the worker, (2) writes all results to DataManager on the main thread using addAtTime() with NotifyObservers::No, (3) calls notifyObservers() once on affected data objects, (4) calls worker->deleteLater(), and (5) re-enables the UI. Disable the “Run Batch”, “Run Single”, and “Run Recurrent” buttons while running (following the _setPipelineRunning() pattern).
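The Phases 1–2 fork-join shape can be sketched Qt-free, with std::thread standing in for QThread. BatchInferenceResult is named in the plan but its fields here are placeholders, as is the inference body.

```cpp
#include <thread>
#include <utility>
#include <vector>

// Name taken from the plan; fields here are placeholders.
struct BatchInferenceResult {
    std::vector<std::pair<int, int>> frames;  // (frame index, mask id)
};

// Worker body: accumulates everything in a private buffer and touches no
// shared state; real model inference replaces the placeholder computation.
BatchInferenceResult runBatchRangeOffline(int first, int last) {
    BatchInferenceResult result;
    for (int f = first; f <= last; ++f)
        result.frames.emplace_back(f, f * 2);  // placeholder for the model
    return result;
}

// Fork-join: run the batch on a worker, then hand the result back to the
// caller, which plays the role of the main thread's finished handler.
BatchInferenceResult forkJoinDemo(int first, int last) {
    BatchInferenceResult result;
    std::thread worker([&] { result = runBatchRangeOffline(first, last); });
    worker.join();
    // The real handler would now bulk-write with NotifyObservers::No,
    // call notifyObservers() once, deleteLater() the worker, re-enable UI.
    return result;
}
```

In the Qt version the join is implicit: the QThread::finished signal is delivered to the main thread via QueuedConnection, so the handler runs only after the worker is done.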

Phase 3: Progressive Visibility via Write Reservations

Goal: Instead of waiting for the entire batch to finish, show results appearing frame-by-frame in the UI while the worker runs. This uses the write reservation + merge pattern from the Concurrency Architecture.

Prerequisite: createWriteReservation<T>() must be implemented on DataManager (see Concurrency Architecture Stage 2).
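A minimal sketch of the reservation buffer, assuming the eventual createWriteReservation<T>() hands back something shaped roughly like this; the class shape and method names below are guesses at the Stage 2 design, not the final API.

```cpp
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Worker appends under a mutex; the main thread drains on a timer tick.
template <typename T>
class WriteReservation {
public:
    void push(T item) {
        std::lock_guard<std::mutex> lock(mutex_);
        pending_.push_back(std::move(item));
    }

    // Main-thread side (e.g. a QTimer slot): take everything written so
    // far, leaving the buffer empty for the worker to keep filling.
    std::vector<T> drain() {
        std::lock_guard<std::mutex> lock(mutex_);
        return std::exchange(pending_, {});
    }

private:
    std::mutex mutex_;
    std::vector<T> pending_;
};

// Simulate periodic merge ticks racing against a producing worker.
std::vector<int> progressiveMergeDemo(int total) {
    WriteReservation<int> reservation;
    std::vector<int> merged;
    std::thread worker([&] {
        for (int f = 0; f < total; ++f) reservation.push(f);
    });
    while (static_cast<int>(merged.size()) < total) {
        auto batch = reservation.drain();  // one "timer tick"
        merged.insert(merged.end(), batch.begin(), batch.end());
    }
    worker.join();
    return merged;
}
```

Because drain() swaps the whole vector out under the lock, the critical section stays short and the worker is never blocked for long.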

Phase 4: Testing and Documentation

Files to Modify

  • SlotAssembler.hpp — BatchInferenceResult struct, runBatchRangeOffline() declaration, ResultCallback type alias
  • SlotAssembler.cpp — runBatchRangeOffline() and decodeOutputsToBuffer() implementation, progressive callback support
  • DeepLearningPropertiesWidget.hpp — Worker pointer member, _setBatchRunning() method, QTimer* and WriteReservation members, _mergeResults()
  • DeepLearningPropertiesWidget.cpp — BatchInferenceWorker class, rewritten _onRunBatch(), cancel handler, merge timer, progressive merge

Files to Create

  • WriteReservation.hpp — Thread-safe buffer for progressive result delivery

Verification

  1. Build — cmake --build --preset linux-clang-release must succeed
  2. Existing tests — All DeepLearning and SlotAssembler tests must pass
  3. Manual — UI responsiveness — Start batch inference on >100 frames; verify Media_Widget slider and other widgets remain interactive
  4. Manual — cancellation — Start batch, cancel mid-run; verify partial results are written and UI recovers cleanly
  5. Manual — correctness — Compare batch output against single-frame inference for the same frames; results must match
  6. Manual — video contention — During batch inference, scrub the media slider to confirm the UI’s video decoder works independently from the worker’s

Design Decisions

  • All-at-once first, progressive merge second — Phases 1–2 use all-at-once delivery (simplest correct implementation). Phase 3 upgrades to write reservation + periodic merge for progressive visibility. This staged approach lets us ship a working background inference quickly and add polish later.
  • Separate VideoData instance per worker — VideoData has supports_concurrent_read = false and supports_cheap_clone = true in ConcurrencyTraits. Cloning gives the worker its own FFmpeg decoder.
  • QThread (not std::jthread) — Consistent with the existing MLCoreWidget pattern; integrates naturally with Qt signal/slot for cross-thread communication.
  • Scope: deep learning only — DataTransform_Widget threading uses the same pattern but is out of scope for this plan.

Relationship to Concurrency Architecture

This roadmap implements the merge pattern from the Concurrency Architecture for the specific case of deep learning batch inference.

  • Thread confinement — DataManager writes only on the main thread
  • ConcurrencyTraits — VideoData cloned (not shared); output data uses the reservation buffer
  • Write reservation — Worker writes to a private buffer; main thread merges periodically
  • Fork-join (Phases 1–2) — Worker returns BatchInferenceResult; main thread writes all at once
  • Merge (Phase 3) — Timer-driven merge for progressive visibility

The same pattern generalizes to DataTransform_Widget and any other long-running computation. See the architecture document for the full design.