sparank.data.tokenize_batch

sparank.data.tokenize_batch(adata, vocabs, modality_names, top_ks, cell_types, context2id=None, context_key=None, mode='train')

Tokenise an AnnData batch spanning one or more modalities.

Feature names in adata.var_names are generally expected to carry a modality prefix (e.g., "rna-GAPDH", "adt-CD3"). For unimodal workflows without a prefix, the keys of the vocabulary dict must match adata.var_names exactly.
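
To illustrate the prefix convention, here is a minimal sketch that groups feature names by modality prefix before they are passed to tokenize_batch. The helper group_features_by_prefix is hypothetical and not part of sparank, and the "-" separator is an assumption inferred from the examples above.

```python
from collections import defaultdict


def group_features_by_prefix(var_names, modality_names, sep="-"):
    """Group feature names by modality prefix.

    Hypothetical helper, not part of sparank. The "-" separator is an
    assumption based on names such as "rna-GAPDH" and "adt-CD3".
    """
    groups = defaultdict(list)
    for name in var_names:
        # Split at the first separator only, so "rna-HLA-DRA" keeps prefix "rna".
        prefix, found, _ = name.partition(sep)
        if found and prefix in modality_names:
            groups[prefix].append(name)
        else:
            # Names without a recognised prefix are collected under None.
            groups[None].append(name)
    return dict(groups)


var_names = ["rna-GAPDH", "rna-HLA-DRA", "adt-CD3", "CD19"]
grouped = group_features_by_prefix(var_names, ["rna", "adt"])
```

Names that end up under the None key are candidates for the unprefixed, unimodal case, where the vocabulary keys have to match adata.var_names exactly.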

Parameters:

adata (AnnData) – Annotated data matrix containing the batch to tokenise.
vocabs (Dict[str, Dict[str, int]]) – Per-modality vocabulary dictionaries mapping feature names to token IDs.
modality_names (List[str]) – List of modality names to process.
top_ks (Dict[str, int]) – Dictionary mapping modality names to their target sequence lengths.
cell_types (List[str]) – List of all possible cell-type labels.
context2id (Dict[str, int], optional) – Mapping of context labels to token IDs.
context_key (str, optional) – Key in adata.obs denoting context.
mode (str, default "train") – Processing mode. If "train", label matrices are computed.
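
Since vocabs, modality_names, and top_ks are all keyed by modality name, it may help to check that they agree before calling the function. The sketch below uses a hypothetical helper, check_tokenize_inputs, which is not part of sparank. The vocabulary entries and token IDs are placeholder values, and the positivity check on top_ks is an assumption about what counts as a valid sequence length.

```python
def check_tokenize_inputs(vocabs, modality_names, top_ks):
    """Return a list of inconsistencies between the per-modality arguments.

    Hypothetical helper, not part of sparank.
    """
    problems = []
    for name in modality_names:
        if name not in vocabs:
            problems.append(f"no vocabulary for modality {name!r}")
        if name not in top_ks:
            problems.append(f"no top_k entry for modality {name!r}")
        elif top_ks[name] <= 0:
            # Assumption: a target sequence length should be positive.
            problems.append(f"top_k for modality {name!r} must be positive")
    return problems


# Placeholder inputs; feature names and token IDs are illustrative only.
vocabs = {
    "rna": {"rna-GAPDH": 0, "rna-ACTB": 1},
    "adt": {"adt-CD3": 0},
}
modality_names = ["rna", "adt"]
top_ks = {"rna": 2, "adt": 1}

problems = check_tokenize_inputs(vocabs, modality_names, top_ks)
```

An empty list means every requested modality has both a vocabulary and a target sequence length.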

Returns:

A tuple containing:

- token_matrix: np.ndarray of shape (N, total_seq_len) or None.
- label_matrix: np.ndarray of shape (N, C) or None.
- context_ids: np.ndarray of shape (N,) or None.

Return type:

Tuple[Optional[np.ndarray], Optional[np.ndarray], Optional[np.ndarray]]
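
Because every element of the returned tuple is Optional, callers may want to guard against None before indexing. The following sketch shows one way to do that. The helper summarize_outputs is hypothetical, and the stand-in objects merely mimic the .shape attribute of np.ndarray so that the example runs without calling sparank.

```python
from types import SimpleNamespace


def summarize_outputs(outputs):
    """Map each returned element to its shape, or None if it is absent.

    Hypothetical helper, not part of sparank.
    """
    names = ("token_matrix", "label_matrix", "context_ids")
    return {
        name: None if arr is None else tuple(arr.shape)
        for name, arr in zip(names, outputs)
    }


# Stand-ins for the real return value; only .shape is mimicked here.
fake_outputs = (
    SimpleNamespace(shape=(4, 10)),  # token_matrix: (N, total_seq_len)
    None,                            # label_matrix absent in this example
    SimpleNamespace(shape=(4,)),     # context_ids: (N,)
)
summary = summarize_outputs(fake_outputs)
```

With the real function, the same helper could presumably be applied directly to the result of tokenize_batch(...).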