Deep Learning Roadmap

Overview

This document tracks remaining development tasks for the Deep Learning library and widget. The core library (encoders, decoders, model abstraction, multi-backend inference, registry, runtime JSON models, and widget UI) is complete. The items below represent outstanding work for full production readiness.

Remaining Tasks

Integration Testing

End-to-end tests that exercise the full pipeline with real model weights are still needed. These require Python model export as a prerequisite.

Tensor Preview Visualization

The View widget has placeholder areas for tensor channel previews.

Widget Polish

Model Export Documentation

Additional Encoder/Decoder Support

General Encoder Model Wrapper

NeuroSAM is a task-specific wrapper with a hard-coded 256×256 resolution and a three-input forward pass. Many use cases need only a standalone encoder (e.g. a ViT, ResNet, or similar backbone) that maps an input image to a feature tensor. This section covers a reusable GeneralEncoder wrapper.

Design Notes — Compile-Time vs. Runtime Shapes

Input and output resolution do not need to be known at compile time. The existing architecture already separates shape declarations (which are metadata for the UI and pre-allocation) from actual tensor operations:

  • TorchScript (.pt) accepts arbitrary dynamic shapes natively.
  • AOT Inductor (.pt2) supports dynamic shapes via torch.export.Dim() at export time — dimensions marked as dynamic can vary at runtime within declared bounds.
  • RuntimeModel already reads shapes from a JSON spec at runtime, so a GeneralEncoder can also be fully JSON-driven using RuntimeModelSpec rather than requiring a new compiled ModelBase subclass.

Therefore the recommended approach is:

  1. For most users: Define the encoder via a RuntimeModelSpec JSON file specifying input shape (e.g. [3, 224, 224]) and output shape (e.g. [384, 16, 16]). No C++ changes needed.
  2. For convenience: Provide a thin GeneralEncoderModel C++ wrapper with configurable resolution (constructor parameters or setters) that registers in the ModelRegistry and offers sensible defaults.
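
As a sketch of approach 1, a hypothetical RuntimeModelSpec JSON for a small ViT encoder might look like the following. The field names are illustrative assumptions; the actual schema is defined by RuntimeModel, not by this document.

```json
{
  "name": "general_encoder_vit_s",
  "weights": "vit_s_224.pt",
  "inputs":  [ { "name": "image",    "shape": [3, 224, 224] } ],
  "outputs": [ { "name": "features", "shape": [384, 16, 16] } ]
}
```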

Tasks

Post-Encoder Feature Extraction Modules

After an encoder produces a spatial feature tensor (e.g. [B, C, H, W] such as [1, 384, 16, 16]), downstream tasks often need a reduced feature vector rather than the full spatial map. This section covers composable post-processing modules that run after the encoder forward pass, using libtorch operations directly (no additional model weights required).

These modules operate on the output tensor before it reaches the decoder stage. They transform a [B, C, H, W] feature map into a [B, C] (or [B, C, 1, 1]) feature vector suitable for downstream use such as writing to TensorData or feeding into a classification head.

Integration with GeneralEncoderModel

Post-encoder modules are exposed as an optional extension of GeneralEncoderModel in the widget UI. The user configures the encoder normally (image input binding, weights file), then optionally selects a post-encoder module from a combo box. By default (None) the raw spatial feature tensor is passed through unchanged.

When a module is selected, GeneralEncoderModel holds an optional std::unique_ptr<PostEncoderModule> and applies it inside forward() before returning output. The output slot shape advertised by outputSlots() updates to reflect the post-processed shape (e.g. [C] after global average pooling).
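A minimal sketch of the optional-module plumbing described above, using std::vector<float> as a torch-free stand-in for torch::Tensor so the example is self-contained. Names not taken from this document (runEncoder, setPostModule, the identity encoder stub) are assumptions for illustration.

```cpp
#include <cassert>
#include <memory>
#include <vector>

using Tensor = std::vector<float>;  // stand-in for torch::Tensor

// Reduced interface; the full one also carries a name() for the widget.
class PostEncoderModule {
public:
    virtual ~PostEncoderModule() = default;
    virtual Tensor apply(Tensor const & features) = 0;
};

class GeneralEncoderModel {
public:
    // Hypothetical setter; wired to the widget combo box in practice.
    void setPostModule(std::unique_ptr<PostEncoderModule> m) { post_ = std::move(m); }

    Tensor forward(Tensor const & input) {
        Tensor features = runEncoder(input);
        return post_ ? post_->apply(features) : features;  // None -> pass-through
    }

private:
    // Placeholder for the real encoder forward pass (identity here).
    Tensor runEncoder(Tensor const & input) { return input; }

    std::unique_ptr<PostEncoderModule> post_;  // empty when "None" is selected
};
```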

Widget behaviour per module:

  • None (default): no extra UI; output shape [B, C_out, H_out, W_out] (raw encoder output)
  • Global Average Pooling: no extra UI; output shape [B, C_out]
  • Spatial Point Extraction: combo box to select a PointData key from DataManager; output shape [B, C_out]

Architecture — PostEncoderModule Interface

class PostEncoderModule {
public:
    virtual ~PostEncoderModule() = default;

    // Human-readable module name (shown in the widget combo box).
    virtual std::string name() const = 0;

    // Transform the encoder's output feature tensor.
    virtual torch::Tensor apply(torch::Tensor const & features) = 0;
};

Modules are pure libtorch operations — no model weights, no DataManager dependency. They compose into a pipeline via a PostEncoderPipeline that chains modules sequentially. The pipeline is configured per-model (either in code or via JSON spec).
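The chaining described above can be sketched as follows, with std::vector<float> standing in for torch::Tensor so the example compiles without libtorch. PostEncoderPipeline's real API is not specified in this document, so add/apply are assumed names, and Scale is a toy module used only to demonstrate composition.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

using Tensor = std::vector<float>;  // stand-in for torch::Tensor

class PostEncoderModule {
public:
    virtual ~PostEncoderModule() = default;
    virtual std::string name() const = 0;
    virtual Tensor apply(Tensor const & features) = 0;
};

// Chains modules in insertion order; each module's output feeds the next.
class PostEncoderPipeline {
public:
    void add(std::unique_ptr<PostEncoderModule> m) { modules_.push_back(std::move(m)); }

    Tensor apply(Tensor features) {
        for (auto & m : modules_) features = m->apply(features);
        return features;
    }

private:
    std::vector<std::unique_ptr<PostEncoderModule>> modules_;
};

// Toy module: scales every element, just to demonstrate composition.
class Scale : public PostEncoderModule {
public:
    explicit Scale(float s) : s_(s) {}
    std::string name() const override { return "Scale"; }
    Tensor apply(Tensor const & f) override {
        Tensor out = f;
        for (auto & v : out) v *= s_;
        return out;
    }
private:
    float s_;
};
```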

Module: Global Average Pooling

Reduces [B, C, H, W] → [B, C] by averaging over the spatial dimensions. This is the standard approach for converting a spatial feature map into a global feature descriptor (e.g. for classification or retrieval).

Implementation: torch::adaptive_avg_pool2d(features, {1, 1}).squeeze(-1).squeeze(-1).
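The same reduction written out torch-free, to make the arithmetic explicit. This sketch operates on a single [C, H, W] map stored row-major (batch dimension omitted for brevity).

```cpp
#include <cassert>
#include <vector>

// Global average pooling over a [C, H, W] feature map stored row-major.
// Equivalent to torch::adaptive_avg_pool2d(features, {1, 1}) followed by
// the two squeezes, applied per batch element.
std::vector<float> globalAvgPool(std::vector<float> const & chw,
                                 int C, int H, int W) {
    std::vector<float> out(C, 0.0f);
    for (int c = 0; c < C; ++c) {
        float sum = 0.0f;
        for (int i = 0; i < H * W; ++i) sum += chw[c * H * W + i];
        out[c] = sum / static_cast<float>(H * W);  // mean over spatial dims
    }
    return out;
}
```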

Module: Spatial Point Feature Extraction

Given a 2D position from a PointData object (in original image coordinates), extract the feature vector at the nearest spatial location in the feature map. This is useful for probing “what does the encoder think at this location?” — e.g. extracting the feature vector at a whisker tip or a labeled landmark.

The module takes the feature tensor [B, C, H, W] and a Point2D<float> in source image coordinates, scales the point to feature map coordinates (x_feat = x_src * W / src_W, y_feat = y_src * H / src_H), rounds to the nearest integer index, and extracts features[:, :, y_feat, x_feat] → [B, C].
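The coordinate mapping and extraction can be written out torch-free for a single [C, H, W] map (batch dimension omitted). Clamping to valid indices is an assumption about edge handling not stated above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Map a source-image point to the nearest feature-map index and extract
// the per-channel feature vector there. Sketch of the module's arithmetic;
// the real module operates on torch::Tensor.
std::vector<float> featureAtPoint(std::vector<float> const & chw,
                                  int C, int H, int W,
                                  float x_src, float y_src,
                                  float src_W, float src_H) {
    // Scale to feature-map coordinates and round to nearest index.
    int x = static_cast<int>(std::lround(x_src * W / src_W));
    int y = static_cast<int>(std::lround(y_src * H / src_H));
    // Clamp so points near the image border stay in bounds (assumed policy).
    x = std::min(std::max(x, 0), W - 1);
    y = std::min(std::max(y, 0), H - 1);

    std::vector<float> out(C);
    for (int c = 0; c < C; ++c) out[c] = chw[c * H * W + y * W + x];
    return out;
}
```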

Pipeline and Configuration

Documentation

Performance