sparank.data.MemmapDataset
- class sparank.data.MemmapDataset(input_path, label_path, valid_samples, max_samples, seq_len, num_classes, segment_layout, context_path=None, cl_mode=False, mrp_mode=False)[source]
Bases: Dataset
Memory-mapped dataset for 1-to-N modalities.
The concatenated token sequence is split into per-modality segments according to segment_layout, and CL dropout / MRP masking rates are applied independently per segment.
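The per-segment behaviour described above can be illustrated with a small NumPy sketch. This is not the sparank implementation: split_segments and mask_per_segment are hypothetical helpers, the modality names "rna" and "atac" are made up, and it is an assumption that a dropped or masked token is replaced by the segment's mask_id.

```python
import numpy as np


def split_segments(tokens, segment_layout):
    """Split a 1-D token array into per-modality slices, in layout order."""
    segments = {}
    start = 0
    for seg in segment_layout:
        end = start + seg["top_k"]
        segments[seg["name"]] = tokens[start:end]
        start = end
    if start != len(tokens):
        raise ValueError(f"layout covers {start} tokens, sequence has {len(tokens)}")
    return segments


def mask_per_segment(tokens, segment_layout, rate_key, rng):
    """Apply each segment's own rate; masked tokens become that segment's mask_id.

    Assumption: masking means substitution with mask_id. The real class may differ.
    """
    out = tokens.copy()
    masked = np.zeros(len(tokens), dtype=bool)
    start = 0
    for seg in segment_layout:
        end = start + seg["top_k"]
        # Draw an independent Bernoulli mask for this segment only.
        hit = rng.random(seg["top_k"]) < seg[rate_key]
        seg_view = out[start:end]  # basic slice, so writes go through to `out`
        seg_view[hit] = seg["mask_id"]
        masked[start:end] = hit
        start = end
    return out, masked


# Hypothetical two-modality layout; the rates are chosen so the result is deterministic.
layout = [
    {"name": "rna", "top_k": 4, "mask_id": 0, "cl_dropout_rate": 0.5, "mrp_mask_rate": 1.0},
    {"name": "atac", "top_k": 2, "mask_id": 1, "cl_dropout_rate": 0.0, "mrp_mask_rate": 0.0},
]
tokens = np.arange(10, 16)  # seq_len = 4 + 2 = 6

parts = split_segments(tokens, layout)
masked_tokens, positions = mask_per_segment(
    tokens, layout, "mrp_mask_rate", np.random.default_rng(0)
)
```

With a rate of 1.0 every token in the first segment is replaced and with 0.0 the second segment is left untouched, which shows that the rates act on each segment separately. Passing "cl_dropout_rate" as rate_key and calling the helper twice would give two independently augmented views under the same assumption.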
- Parameters:
input_path (str) – Path to the memmap file for token inputs.
label_path (str) – Path to the memmap file for labels.
valid_samples (int) – Number of valid (written) samples to read.
max_samples (int) – Total allocated row count in the memmap files.
seq_len (int) – Total token-sequence length (= sum of per-modality top_k).
num_classes (int) – Number of cell-type classes.
segment_layout (List[Dict]) –
One dict per modality defining the layout:
{"name": str, "top_k": int, "mask_id": int, "cl_dropout_rate": float, "mrp_mask_rate": float}
Segments are contiguous, and their top_k values must sum to seq_len.
context_path (str, optional) – Path to the context memmap file. If None, no context is returned.
cl_mode (bool, default False) – If True, each sample produces two dropout-augmented views (view_a and view_b) for Contrastive Learning.
mrp_mode (bool, default False) – If True, produces masked targets and positions for Masked Region Prediction.
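The relationship between valid_samples, max_samples, and seq_len can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the class's own loading code: validate_layout and open_valid_rows are hypothetical helpers, and the int32 dtype and row-major (max_samples, seq_len) file shape are guesses about the on-disk format.

```python
import os
import tempfile

import numpy as np


def validate_layout(segment_layout, seq_len):
    """Check that the per-modality top_k values add up to seq_len."""
    total = sum(seg["top_k"] for seg in segment_layout)
    if total != seq_len:
        raise ValueError(f"top_k values sum to {total}, expected seq_len={seq_len}")


def open_valid_rows(path, valid_samples, max_samples, seq_len, dtype=np.int32):
    """Map the full allocation, then expose only the rows already written.

    Assumption: the file stores a (max_samples, seq_len) array of `dtype`.
    """
    if valid_samples > max_samples:
        raise ValueError("valid_samples cannot exceed max_samples")
    full = np.memmap(path, dtype=dtype, mode="r", shape=(max_samples, seq_len))
    return full[:valid_samples]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "inputs.dat")

    # Allocate 8 rows but write only the first 3, as a partially filled file would be.
    writer = np.memmap(path, dtype=np.int32, mode="w+", shape=(8, 6))
    writer[:3] = np.arange(18, dtype=np.int32).reshape(3, 6)
    writer.flush()
    del writer

    # Copy into memory so nothing keeps the file mapped after the directory is removed.
    rows = np.array(open_valid_rows(path, valid_samples=3, max_samples=8, seq_len=6))
```

Mapping with the allocated shape and slicing to valid_samples keeps unwritten rows out of the dataset without rewriting the file.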
- __init__(input_path, label_path, valid_samples, max_samples, seq_len, num_classes, segment_layout, context_path=None, cl_mode=False, mrp_mode=False)[source]
Methods
__init__(input_path, label_path, ...[, ...])