sparank.data.tokenize_batch
- sparank.data.tokenize_batch(adata, vocabs, modality_names, top_ks, cell_types, context2id=None, context_key=None, mode='train')
Tokenise an AnnData batch across 1-to-N modalities.
Features in adata must generally be prefixed (e.g., "rna-GAPDH", "adt-CD3"). For unimodal workflows without a prefix, the vocab dict keys must match adata.var_names exactly.
- Parameters:
adata (AnnData) – Annotated data matrix containing the batch to tokenise.
vocabs (Dict[str, Dict[str, int]]) – Per-modality vocabulary dictionaries mapping feature names to token IDs.
modality_names (List[str]) – List of modality names to process.
top_ks (Dict[str, int]) – Dictionary mapping modality names to their target sequence lengths.
cell_types (List[str]) – List of all possible cell-type labels.
context2id (Dict[str, int], optional) – Mapping of context labels to token IDs.
context_key (str, optional) – Key in adata.obs denoting context.
mode (str, default "train") – Processing mode. If "train", label matrices are computed.
- Returns:
A tuple containing:
- token_matrix: np.ndarray of shape (N, total_seq_len) or None.
- label_matrix: np.ndarray of shape (N, C) or None.
- context_ids: np.ndarray of shape (N,) or None.
- Return type:
Tuple[Optional[np.ndarray], Optional[np.ndarray], Optional[np.ndarray]]
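Example: the sketch below illustrates how the vocabs, modality_names, and top_ks arguments might be prepared for prefixed feature names. The helper build_vocabs is hypothetical and not part of sparank. It assumes that vocabulary keys keep the full prefixed feature name and that token IDs start at 0 within each modality; the library may reserve special tokens or expect a different key format, so treat this as an illustration of the input shapes only.

```python
from typing import Dict, List


def build_vocabs(
    var_names: List[str], modality_names: List[str]
) -> Dict[str, Dict[str, int]]:
    """Hypothetical helper (not part of sparank): group prefixed feature
    names by modality and assign consecutive token IDs per modality."""
    vocabs: Dict[str, Dict[str, int]] = {m: {} for m in modality_names}
    for name in var_names:
        # "rna-GAPDH" -> prefix "rna"; names without a known prefix are skipped.
        prefix, sep, _ = name.partition("-")
        if sep and prefix in vocabs:
            # Assumption: IDs start at 0 and keys keep the prefix.
            vocabs[prefix][name] = len(vocabs[prefix])
    return vocabs


# Stand-in for list(adata.var_names) on a two-modality dataset.
var_names = ["rna-GAPDH", "rna-ACTB", "adt-CD3"]
modality_names = ["rna", "adt"]

vocabs = build_vocabs(var_names, modality_names)
top_ks = {"rna": 2, "adt": 1}  # target sequence length per modality

# With sparank installed and a matching AnnData object `adata`, the call
# would follow the documented signature:
#
# token_matrix, label_matrix, context_ids = sparank.data.tokenize_batch(
#     adata, vocabs, modality_names, top_ks,
#     cell_types=["T cell", "B cell"], mode="train",
# )
```

Because each element of the returned tuple may be None, callers should check each one before using it.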