Exporting Encoder Models

Overview

WhiskerToolbox can use general-purpose image encoder models (such as ConvNeXt, ViT, or ResNet) to extract feature maps from video frames. This guide explains how to export an encoder model from Python and load it in WhiskerToolbox.

Prerequisites

Install PyTorch and torchvision:

pip install torch torchvision

Quick Start: Export ConvNeXt

WhiskerToolbox ships with a ready-to-use export script for ConvNeXt models:

python examples/export_convnext_encoder.py

This exports a ConvNeXt-Tiny encoder and produces three files:

File                             Description
convnext_tiny_encoder.pt         TorchScript model (works everywhere)
convnext_tiny_encoder.pt2        AOT Inductor model (faster, PyTorch 2.1+)
convnext_tiny_encoder_spec.json  Model specification for WhiskerToolbox

Using a Different ConvNeXt Variant

# ConvNeXt-Small (768-dim features)
python examples/export_convnext_encoder.py --model convnext_small

# ConvNeXt-Base (1024-dim features)
python examples/export_convnext_encoder.py --model convnext_base

# ConvNeXt-Large (1536-dim features)
python examples/export_convnext_encoder.py --model convnext_large

Export Options

python examples/export_convnext_encoder.py --help
Option        Description
--model       ConvNeXt variant (convnext_tiny, convnext_small, convnext_base, convnext_large)
--output-dir  Directory for exported files (default: current directory)
--no-pt       Skip TorchScript export
--no-pt2      Skip AOT Inductor export

Loading in WhiskerToolbox

Method 1: Widget UI

  1. Open the Deep Learning widget (View → Deep Learning)
  2. Select “General Encoder” from the model dropdown
  3. Under the “image” slot, bind to your video data source
  4. Click Load Weights and select the exported .pt or .pt2 file
  5. Click Run to extract features

Method 2: JSON Specification

For more control, use the generated JSON specification file:

  1. Open the Deep Learning widget
  2. Click Load Model Spec and select the _spec.json file
  3. The model will appear with the correct input/output shapes pre-configured

Exporting Custom Encoders

To export your own encoder (not a ConvNeXt), create a Python wrapper that outputs the spatial feature map and follows this pattern:

import torch

class MyEncoderWrapper(torch.nn.Module):
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, C, H, W] input image
        # return: [B, C_out, H_out, W_out] feature map
        return self.backbone.extract_features(x)

# Export as TorchScript
wrapper = MyEncoderWrapper(my_model).eval()
example = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(wrapper, example)
traced.save("my_encoder.pt")
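As a sanity check before loading the export in WhiskerToolbox, you can run the same pattern end to end with a stand-in backbone and verify the traced model's output shape. This is a minimal sketch, assuming PyTorch is installed; DummyBackbone is a hypothetical placeholder for your real model, sized here to mimic a 32x-downsampling encoder:

```python
import torch

class DummyBackbone(torch.nn.Module):
    """Hypothetical stand-in exposing extract_features like the wrapper expects."""
    def __init__(self):
        super().__init__()
        # A single stride-32 conv mimics a 32x-downsampling encoder.
        self.conv = torch.nn.Conv2d(3, 512, kernel_size=32, stride=32)

    def extract_features(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)

class MyEncoderWrapper(torch.nn.Module):
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone.extract_features(x)

wrapper = MyEncoderWrapper(DummyBackbone()).eval()
example = torch.randn(1, 3, 224, 224)
traced = torch.jit.trace(wrapper, example)
out = traced(example)
print(tuple(out.shape))  # -> (1, 512, 7, 7)
```

The traced output shape is what goes into the "outputs" entry of the JSON specification below.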

Then create a JSON specification:

{
  "model_id": "my_encoder",
  "display_name": "My Custom Encoder",
  "weights_path": "my_encoder.pt",
  "inputs": [
    { "name": "image", "shape": [3, 224, 224], "recommended_encoder": "ImageEncoder" }
  ],
  "outputs": [
    { "name": "features", "shape": [512, 7, 7] }
  ]
}
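The specification can also be written from Python so the declared shapes stay in sync with your export script. A minimal sketch using only the standard library; the field names follow the example above:

```python
import json

# Mirror the shapes used when tracing the model so the spec never drifts.
spec = {
    "model_id": "my_encoder",
    "display_name": "My Custom Encoder",
    "weights_path": "my_encoder.pt",
    "inputs": [
        {"name": "image", "shape": [3, 224, 224], "recommended_encoder": "ImageEncoder"}
    ],
    "outputs": [
        {"name": "features", "shape": [512, 7, 7]}
    ],
}

with open("my_encoder_spec.json", "w") as f:
    json.dump(spec, f, indent=2)
```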

Output Format

The encoder output is a spatial feature tensor of shape [B, C, H, W]:

  • B — batch size (number of frames processed simultaneously)
  • C — number of feature channels (depends on the encoder architecture)
  • H, W — spatial dimensions of the feature map

For ConvNeXt with 224×224 input, the output spatial size is 7×7 (224 / 32).
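The spatial size follows directly from the encoder's total downsampling stride. A small helper illustrating the arithmetic (feature_map_size is a hypothetical name, and the floor division assumes the usual behavior of strided convolutions):

```python
def feature_map_size(height, width, total_stride=32):
    """Spatial size of the output feature map for a given input size and total stride."""
    return height // total_stride, width // total_stride

print(feature_map_size(224, 224))  # -> (7, 7), matching ConvNeXt's stride-32 backbone
```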

This feature tensor can be used as input to downstream processing such as global average pooling, spatial point extraction, or classification heads (see the Deep Learning roadmap for planned post-encoder modules).