Deep Learning Roadmap

Overview

This document tracks remaining development tasks for the Deep Learning library and widget. The core library (encoders, decoders, model abstraction, multi-backend inference, registry, runtime JSON models, and widget UI) is complete. The items below represent outstanding work for full production readiness.

Remaining Tasks

Integration Testing

End-to-end tests that exercise the full pipeline with real model weights are still needed. These require Python model export as a prerequisite.

Tensor Preview Visualization

The View widget has placeholder areas for tensor channel previews.

Widget Polish

Model Export Documentation

Additional Encoder/Decoder Support

General Encoder Model Wrapper

NeuroSAM is a task-specific wrapper with a hard-coded 256×256 resolution and a three-input forward pass. Many use cases need only a standalone encoder (e.g. a ViT, ResNet, or similar backbone) that maps an input image to a feature tensor. This section covers a reusable GeneralEncoder wrapper.

Design Notes — Compile-Time vs. Runtime Shapes

Input and output resolution do not need to be known at compile time. The existing architecture already separates shape declarations (which are metadata for the UI and pre-allocation) from actual tensor operations:

  • TorchScript (.pt) accepts arbitrary dynamic shapes natively.
  • AOT Inductor (.pt2) supports dynamic shapes via torch.export.Dim() at export time — dimensions marked as dynamic can vary at runtime within declared bounds.
  • RuntimeModel already reads shapes from a JSON spec at runtime, so a GeneralEncoder can also be fully JSON-driven using RuntimeModelSpec rather than requiring a new compiled ModelBase subclass.

Therefore the recommended approach is:

  1. For most users: Define the encoder via a RuntimeModelSpec JSON file specifying input shape (e.g. [3, 224, 224]) and output shape (e.g. [384, 16, 16]). No C++ changes needed.
  2. For convenience: Provide a thin GeneralEncoderModel C++ wrapper with configurable resolution (constructor parameters or setters) that registers in the ModelRegistry and offers sensible defaults.
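
As a sketch of approach 1, a hypothetical RuntimeModelSpec JSON for a small ViT encoder might look like the following. The field names are illustrative assumptions; the actual schema is defined by RuntimeModel, not by this document.

```json
{
  "name": "general_encoder_vit_s",
  "weights": "vit_s_224.pt",
  "inputs":  [ { "name": "image",    "shape": [3, 224, 224] } ],
  "outputs": [ { "name": "features", "shape": [384, 16, 16] } ]
}
```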

Tasks

Post-Encoder Feature Extraction Modules

After an encoder produces a spatial feature tensor (e.g. [B, C, H, W] such as [1, 384, 16, 16]), downstream tasks often need a reduced feature vector rather than the full spatial map. This section covers composable post-processing modules that run after the encoder forward pass, using libtorch operations directly (no additional model weights required).

These modules operate on the output tensor before it reaches the decoder stage. They transform a [B, C, H, W] feature map into a [B, C] (or [B, C, 1, 1]) feature vector suitable for downstream use such as writing to TensorData or feeding into a classification head.

Integration with GeneralEncoderModel

Post-encoder modules are exposed as an optional extension of GeneralEncoderModel in the widget UI. The user configures the encoder normally (image input binding, weights file), then optionally selects a post-encoder module from a combo box. By default (None) the raw spatial feature tensor is passed through unchanged.

When a module is selected, GeneralEncoderModel holds an optional std::unique_ptr<PostEncoderModule> and applies it inside forward() before returning output. The output slot shape advertised by outputSlots() updates to reflect the post-processed shape (e.g. [C] after global average pooling).
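A minimal sketch of the optional-module plumbing described above, using std::vector<float> as a torch-free stand-in for torch::Tensor so the example is self-contained. Names not taken from this document (runEncoder, setPostModule, the identity encoder stub) are assumptions for illustration.

```cpp
#include <cassert>
#include <memory>
#include <vector>

using Tensor = std::vector<float>;  // stand-in for torch::Tensor

// Reduced interface; the full one also carries a name() for the widget.
class PostEncoderModule {
public:
    virtual ~PostEncoderModule() = default;
    virtual Tensor apply(Tensor const & features) = 0;
};

class GeneralEncoderModel {
public:
    // Hypothetical setter; wired to the widget combo box in practice.
    void setPostModule(std::unique_ptr<PostEncoderModule> m) { post_ = std::move(m); }

    Tensor forward(Tensor const & input) {
        Tensor features = runEncoder(input);
        return post_ ? post_->apply(features) : features;  // None -> pass-through
    }

private:
    // Placeholder for the real encoder forward pass (identity here).
    Tensor runEncoder(Tensor const & input) { return input; }

    std::unique_ptr<PostEncoderModule> post_;  // empty when "None" is selected
};
```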

Widget behaviour per module:

  • None (default): no extra UI; output shape [B, C_out, H_out, W_out] (raw encoder output)
  • Global Average Pooling: no extra UI; output shape [B, C_out]
  • Spatial Point Extraction: combo box to select a PointData key from DataManager; output shape [B, C_out]

Architecture — PostEncoderModule Interface

class PostEncoderModule {
public:
    virtual ~PostEncoderModule() = default;

    // Human-readable module name (shown in the widget combo box).
    virtual std::string name() const = 0;

    // Transform the encoder's output feature tensor.
    virtual torch::Tensor apply(torch::Tensor const & features) = 0;
};

Modules are pure libtorch operations — no model weights, no DataManager dependency. They compose into a pipeline via a PostEncoderPipeline that chains modules sequentially. The pipeline is configured per-model (either in code or via JSON spec).
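The chaining described above can be sketched as follows, with std::vector<float> standing in for torch::Tensor so the example compiles without libtorch. PostEncoderPipeline's real API is not specified in this document, so add/apply are assumed names, and Scale is a toy module used only to demonstrate composition.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

using Tensor = std::vector<float>;  // stand-in for torch::Tensor

class PostEncoderModule {
public:
    virtual ~PostEncoderModule() = default;
    virtual std::string name() const = 0;
    virtual Tensor apply(Tensor const & features) = 0;
};

// Chains modules in insertion order; each module's output feeds the next.
class PostEncoderPipeline {
public:
    void add(std::unique_ptr<PostEncoderModule> m) { modules_.push_back(std::move(m)); }

    Tensor apply(Tensor features) {
        for (auto & m : modules_) features = m->apply(features);
        return features;
    }

private:
    std::vector<std::unique_ptr<PostEncoderModule>> modules_;
};

// Toy module: scales every element, just to demonstrate composition.
class Scale : public PostEncoderModule {
public:
    explicit Scale(float s) : s_(s) {}
    std::string name() const override { return "Scale"; }
    Tensor apply(Tensor const & f) override {
        Tensor out = f;
        for (auto & v : out) v *= s_;
        return out;
    }
private:
    float s_;
};
```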

Module: Global Average Pooling

Reduces [B, C, H, W] → [B, C] by averaging over the spatial dimensions. This is the standard approach for converting a spatial feature map into a global feature descriptor (e.g. for classification or retrieval).

Implementation: torch::adaptive_avg_pool2d(features, {1, 1}).squeeze(-1).squeeze(-1).
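The same reduction written out torch-free, to make the arithmetic explicit. This sketch operates on a single [C, H, W] map stored row-major (batch dimension omitted for brevity).

```cpp
#include <cassert>
#include <vector>

// Global average pooling over a [C, H, W] feature map stored row-major.
// Equivalent to torch::adaptive_avg_pool2d(features, {1, 1}) followed by
// the two squeezes, applied per batch element.
std::vector<float> globalAvgPool(std::vector<float> const & chw,
                                 int C, int H, int W) {
    std::vector<float> out(C, 0.0f);
    for (int c = 0; c < C; ++c) {
        float sum = 0.0f;
        for (int i = 0; i < H * W; ++i) sum += chw[c * H * W + i];
        out[c] = sum / static_cast<float>(H * W);  // mean over spatial dims
    }
    return out;
}
```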

Module: Spatial Point Feature Extraction

Given a 2D position from a PointData object (in original image coordinates), extract the feature vector at the nearest spatial location in the feature map. This is useful for probing “what does the encoder think at this location?” — e.g. extracting the feature vector at a whisker tip or a labeled landmark.

The module takes the feature tensor [B, C, H, W] and a Point2D<float> in source image coordinates, scales the point to feature map coordinates (x_feat = x_src * W / src_W, y_feat = y_src * H / src_H), rounds to the nearest integer index, and extracts features[:, :, y_feat, x_feat] → [B, C].
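The coordinate mapping and extraction can be written out torch-free for a single [C, H, W] map (batch dimension omitted). Clamping to valid indices is an assumption about edge handling not stated above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Map a source-image point to the nearest feature-map index and extract
// the per-channel feature vector there. Sketch of the module's arithmetic;
// the real module operates on torch::Tensor.
std::vector<float> featureAtPoint(std::vector<float> const & chw,
                                  int C, int H, int W,
                                  float x_src, float y_src,
                                  float src_W, float src_H) {
    // Scale to feature-map coordinates and round to nearest index.
    int x = static_cast<int>(std::lround(x_src * W / src_W));
    int y = static_cast<int>(std::lround(y_src * H / src_H));
    // Clamp so points near the image border stay in bounds (assumed policy).
    x = std::min(std::max(x, 0), W - 1);
    y = std::min(std::max(y, 0), H - 1);

    std::vector<float> out(C);
    for (int c = 0; c < C; ++c) out[c] = chw[c * H * W + y * W + x];
    return out;
}
```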

Pipeline and Configuration

Documentation

Performance