sparank.data.MemmapDataset

class sparank.data.MemmapDataset(input_path, label_path, valid_samples, max_samples, seq_len, num_classes, segment_layout, context_path=None, cl_mode=False, mrp_mode=False)

Bases: Dataset

Memory-mapped dataset supporting 1 to N modalities.

The concatenated token sequence is split into per-modality segments according to segment_layout. Contrastive Learning (CL) dropout and Masked Region Prediction (MRP) masking rates are applied to each segment independently.
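The per-segment behaviour described above can be illustrated with a minimal sketch. It is not the library's implementation: the helper names `segment_slices` and `mask_per_segment`, the modality names, and the rates are hypothetical, and it assumes that masking replaces a token with its segment's `mask_id`. Only the dict keys come from the segment_layout parameter documented on this page.

```python
import numpy as np

# Hypothetical layout: names, sizes, mask ids, and rates are illustrative only.
segment_layout = [
    {"name": "rna", "top_k": 4, "mask_id": 1,
     "cl_dropout_rate": 0.25, "mrp_mask_rate": 0.5},
    {"name": "atac", "top_k": 2, "mask_id": 2,
     "cl_dropout_rate": 0.0, "mrp_mask_rate": 0.0},
]


def segment_slices(layout):
    """Return (name, slice) pairs for contiguous segments, in layout order."""
    out, start = [], 0
    for seg in layout:
        out.append((seg["name"], slice(start, start + seg["top_k"])))
        start += seg["top_k"]
    return out


def mask_per_segment(tokens, layout, rate_key, rng):
    """Mask each segment at its own rate, using that segment's mask_id.

    Assumption: a masked token is replaced by the segment's mask_id.
    Returns the masked copy and a boolean array of masked positions.
    """
    masked = tokens.copy()
    hit = np.zeros(tokens.shape, dtype=bool)
    for seg, (_, sl) in zip(layout, segment_slices(layout)):
        drop = rng.random(seg["top_k"]) < seg[rate_key]
        seg_view = masked[sl]          # basic slicing returns a view
        seg_view[drop] = seg["mask_id"]
        hit[sl] = drop
    return masked, hit


seq_len = sum(seg["top_k"] for seg in segment_layout)
tokens = np.arange(10, 10 + seq_len)   # stand-in for one concatenated sample
rng = np.random.default_rng(0)
masked, hit = mask_per_segment(tokens, segment_layout, "mrp_mask_rate", rng)
```

Calling the same helper twice with `"cl_dropout_rate"` would give two independently augmented views of one sample, which is one plausible reading of how view_a and view_b are produced when cl_mode is enabled.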

Parameters:

input_path (str) – Path to the memmap file for token inputs.
label_path (str) – Path to the memmap file for labels.
valid_samples (int) – Number of valid (written) samples to read.
max_samples (int) – Total allocated row count in the memmap files.
seq_len (int) – Total token-sequence length (the sum of the per-modality top_k values).
num_classes (int) – Number of cell-type classes.
segment_layout (List[Dict]) –
One dict per modality defining the layout:
```
{"name": str, "top_k": int, "mask_id": int,
 "cl_dropout_rate": float, "mrp_mask_rate": float}
```
Segments are contiguous, and their top_k values must sum to seq_len.
context_path (str, optional) – Path to the context memmap file. If None, no context is returned.
cl_mode (bool, default False) – If True, each sample produces two dropout-augmented views (view_a and view_b) for Contrastive Learning.
mrp_mode (bool, default False) – If True, produces masked targets and positions for Masked Region Prediction.
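The relationship between max_samples (allocated rows) and valid_samples (rows actually written) can be sketched with `numpy.memmap`. This is an illustration under stated assumptions, not the library's file format: the `int32` dtype, the `(max_samples, seq_len)` shape, and the file name `inputs.bin` are all hypothetical, since this page does not specify them.

```python
import os
import tempfile

import numpy as np

# Hypothetical sizes; the real values depend on how the files were written.
max_samples, valid_samples, seq_len = 8, 3, 6
path = os.path.join(tempfile.mkdtemp(), "inputs.bin")  # hypothetical file name

# Writer side: allocate max_samples rows, fill only the first valid_samples.
# Assumption: tokens are stored as int32 in a (max_samples, seq_len) array.
writer = np.memmap(path, dtype=np.int32, mode="w+",
                   shape=(max_samples, seq_len))
writer[:valid_samples] = np.arange(valid_samples * seq_len).reshape(
    valid_samples, seq_len)
writer.flush()
del writer

# Reader side: map the full allocation, then expose only the written rows.
reader = np.memmap(path, dtype=np.int32, mode="r",
                   shape=(max_samples, seq_len))
valid = reader[:valid_samples]
```

Mapping with the full allocated shape and slicing to the valid rows keeps the file layout fixed while more samples are appended later, which is a likely reason for passing both counts.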

__init__(input_path, label_path, valid_samples, max_samples, seq_len, num_classes, segment_layout, context_path=None, cl_mode=False, mrp_mode=False)

Parameters:

input_path (str)
label_path (str)
valid_samples (int)
max_samples (int)
seq_len (int)
num_classes (int)
segment_layout (List[Dict])
context_path (str | None)
cl_mode (bool)
mrp_mode (bool)

Methods

__init__(input_path, label_path, ...[, ...])