sparank.data.MemmapDataset

class sparank.data.MemmapDataset(input_path, label_path, valid_samples, max_samples, seq_len, num_classes, segment_layout, context_path=None, cl_mode=False, mrp_mode=False)

Bases: Dataset

Memory-mapped dataset supporting one or more modalities.

The concatenated token sequence is split into per-modality segments according to segment_layout. The contrastive-learning (CL) dropout rate and the masked-region-prediction (MRP) masking rate are applied independently to each segment.
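The per-segment behaviour described above can be sketched roughly as follows. This is an illustrative sketch, not the library's implementation: the helpers segment_bounds and mask_per_segment are hypothetical, the layout values are made up, and it assumes masking means replacing a token with the segment's mask_id.

```python
import numpy as np

# Hypothetical two-modality layout; names, sizes, and rates are illustrative.
segment_layout = [
    {"name": "rna", "top_k": 4, "mask_id": 1,
     "cl_dropout_rate": 0.5, "mrp_mask_rate": 0.5},
    {"name": "atac", "top_k": 3, "mask_id": 2,
     "cl_dropout_rate": 0.0, "mrp_mask_rate": 0.0},
]


def segment_bounds(layout):
    """Return (name, start, stop) for each contiguous segment."""
    bounds, start = [], 0
    for seg in layout:
        stop = start + seg["top_k"]
        bounds.append((seg["name"], start, stop))
        start = stop
    return bounds


def mask_per_segment(tokens, layout, rng):
    """Mask each segment at its own mrp_mask_rate (assumed semantics)."""
    masked = tokens.copy()
    is_masked = np.zeros(tokens.shape, dtype=bool)
    for seg, (_, start, stop) in zip(layout, segment_bounds(layout)):
        # Draw one uniform number per position in this segment only.
        hit = rng.random(stop - start) < seg["mrp_mask_rate"]
        # masked[start:stop] is a view, so this writes into `masked`.
        masked[start:stop][hit] = seg["mask_id"]
        is_masked[start:stop] = hit
    return masked, is_masked


tokens = np.arange(10, 17)          # one sample with seq_len = 4 + 3 = 7
rng = np.random.default_rng(0)
bounds = segment_bounds(segment_layout)
masked, is_masked = mask_per_segment(tokens, segment_layout, rng)
```

Because the second segment has a rate of 0.0, its tokens are never altered, while positions in the first segment are replaced with that segment's own mask_id.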

Parameters:
  • input_path (str) – Path to the memmap file for token inputs.

  • label_path (str) – Path to the memmap file for labels.

  • valid_samples (int) – Number of valid (written) samples to read.

  • max_samples (int) – Total allocated row count in the memmap files.

  • seq_len (int) – Total token-sequence length, equal to the sum of the per-modality top_k values.

  • num_classes (int) – Number of cell-type classes.

  • segment_layout (List[Dict]) –

    One dict per modality defining the layout:

    {"name": str, "top_k": int, "mask_id": int,
     "cl_dropout_rate": float, "mrp_mask_rate": float}
    

    Segments are contiguous, and their top_k values must sum to seq_len.

  • context_path (str, optional) – Path to the context memmap file. If None, no context is returned.

  • cl_mode (bool, default False) – If True, each sample produces two dropout-augmented views (view_a and view_b) for contrastive learning (CL).

  • mrp_mode (bool, default False) – If True, each sample also produces masked targets and mask positions for masked region prediction (MRP).
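The relationship between max_samples, valid_samples, and segment_layout can be illustrated with plain NumPy. This is a minimal sketch under stated assumptions: the on-disk format (int64 tokens in a 2-D array of shape (max_samples, seq_len)), the file name, and the check_layout helper are hypothetical and are not taken from the library.

```python
import os
import tempfile

import numpy as np

max_samples, valid_samples, seq_len = 8, 5, 7  # illustrative sizes

# Assumed format: int64 tokens, one row per sample, max_samples rows allocated.
path = os.path.join(tempfile.mkdtemp(), "inputs.bin")
writer = np.memmap(path, dtype=np.int64, mode="w+",
                   shape=(max_samples, seq_len))
# Only the first valid_samples rows are actually written.
writer[:valid_samples] = np.arange(valid_samples * seq_len).reshape(
    valid_samples, seq_len)
writer.flush()
del writer

# Reopen read-only with the full allocated shape, then expose written rows only.
reader = np.memmap(path, dtype=np.int64, mode="r",
                   shape=(max_samples, seq_len))
inputs = reader[:valid_samples]


def check_layout(layout, seq_len):
    """Hypothetical check that the per-modality top_k values fill seq_len."""
    total = sum(seg["top_k"] for seg in layout)
    if total != seq_len:
        raise ValueError(
            f"top_k values sum to {total}, expected seq_len={seq_len}")


layout = [
    {"name": "rna", "top_k": 4, "mask_id": 1,
     "cl_dropout_rate": 0.1, "mrp_mask_rate": 0.15},
    {"name": "atac", "top_k": 3, "mask_id": 2,
     "cl_dropout_rate": 0.2, "mrp_mask_rate": 0.15},
]
check_layout(layout, seq_len)  # passes: 4 + 3 == 7
```

The memmap must be opened with the allocated row count (max_samples) so the file maps correctly, while slicing to valid_samples keeps unwritten rows out of the dataset.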

__init__(input_path, label_path, valid_samples, max_samples, seq_len, num_classes, segment_layout, context_path=None, cl_mode=False, mrp_mode=False)
Parameters:
  • input_path (str)

  • label_path (str)

  • valid_samples (int)

  • max_samples (int)

  • seq_len (int)

  • num_classes (int)

  • segment_layout (List[Dict])

  • context_path (str | None)

  • cl_mode (bool)

  • mrp_mode (bool)

Methods

__init__(input_path, label_path, ...[, ...])