File IO

Current Architecture

Data Type	Format	Dependency	Raw Data for Testing?
Analog	Binary
Analog	CSV		Some
DigitalEvent	CSV		Yes
DigitalInterval	Binary
DigitalInteral	CSV
Line Data	Binary	Cap’n Proto
Line Data	CSV
Mask Data	HDF5	HDF5
Mask Data	Image	OpenCV
Media Data	Video	ffmpeg	Yes
Media Data	Images	OpenCV
Media Data	HDF5	HDF5
Point Data	CSV		Yes
Tensor	Numpy	Numpy

File Input/Output and Memory Management Considerations

Overview of File IO Requirements

Because Neuralizer is compatible with a wide range of data, its file IO needs to support a wide range of file formats. Some of these file formats will be pre-subscribed, such as data acquisition-specific formats for electrophysiology. Others are more specific to this package, such as for line data that varies over time. We must also be aware from the start that some electrophysiology file sizes can be expected to be larger than easy to work with on a regular desktop machine. For instance, Neuropixel probes record almost 400 channels at 20 kHz and are becoming more routine for use. Consequently, multiple file formats should have the ability to be easily memory-mapped and loaded from disk rather than loading the entire file into RAM.

Incremental Data Adjustments and Key-Value Stores

Another important consideration from the start is that many of these data types will be incrementally adjusted. For instance, processing a video sequence frame by frame or manually adjusting the output of an automated algorithm requires changing some subset of data in a much larger scheme. For some of these file types that are greater than hundreds of megabytes, resaving an entire file for point manipulations could be onerous and time-consuming. Consequently, for some file types, it would be advantageous to use a key-value database type structure that allows us to easily make incremental adjustments and only save those changes, consequently saving data much more quickly. To my knowledge, this process is quite common in the wild but uncommon in neuroscience.

Serialization with Captain Proto

The process of data serialization and deserialization will be accomplished with the Captain Proto library. This is able to easily create binary files from more complex data structures, such as those that hold multiple lines of varying size per unit time. This library can define output file structures that are purely binary and would be saved and loaded as entire objects. Alternatively, different definitions can be used to save objects as key-value pairs; for instance, each time point can be the key and the data at that time can be the value.

Memory Mapping Implementation

Captain Proto does not perform memory mapping itself. Memory mapping is different across different types of file systems, so to maintain cross-platform use, we will need to either specify different interfaces for interacting with Captain Proto or we could use another third-party library that does this for us. Boost.Interprocess is commonly cited for this purpose, but there are others.

Key-Value Database Libraries

For key-value database structures, there are multiple libraries available. Many of these are mature and designed to work with servers of extremely large datasets, and some have been developed by very large companies. One of the most popular is called LMDB, but in my test cases, this appears to be not as great with Windows. Using this library requires specifying some maximum file size; on Linux, the resulting file size is only the size of the data, but with my testing on Windows, this large maximum file size seems to persist. An alternative database is RocksDB. This software is developed by Facebook, seems to be quite mature, and should also be able to save as a key-value pair across platforms.

Prioritization and Future Considerations

Currently, the data that would benefit from key-value updates is only theoretical. Having a line for whiskers for each frame in a high-speed video of up to 200,000 frames results in a file size that is almost 300 MB. However, saving a binary file in its entirety when making whisker-specific updates through manual curation still takes less than a second to save and is not noticeable to the user. Consequently, I think prioritizing having binary outputs to save entire files and memory mapping of those binary file types that are read-only, such as for Neuropixel analog traces, may be the most desirable next feature. Key-value pairs with something like RocksDB should be remembered for the future, but I am going to make that less of a priority until I see a clear use case that would benefit.