API Reference#

Writing#

genvarloader.write(path, bed, variants=None, tracks=None, annot_tracks=None, samples=None, max_jitter=None, overwrite=False, max_mem='4g', extend_to_length=True)[source]#

Write a GVL dataset.

Parameters:

path (str | Path) – Path to write the dataset to.
bed (str | Path | DataFrame) – BED-like file or DataFrame of regions satisfying the BED3+ specification. Specifically, it must have columns ‘chrom’, ‘chromStart’, and ‘chromEnd’. If ‘strand’ is present, its values must be either ‘+’ or ‘-’. Negative-strand regions will be reverse complemented during sequence and/or track reconstruction.
variants (str | Path | VCF | PGEN | SparseVar | SparseVar2 | None, default: None) – A genoray VCF or PGEN instance (genoray is a GVL dependency, so it is always importable). All variants must be left-aligned, bi-allelic, and atomized. Multi-allelic variants can be included by splitting them into bi-allelic half-calls. For VCFs, the bcftools norm command can perform all of this normalization. For PGEN files, see the PLINK2 documentation; flags of interest include --make-bpgen for splitting variants, --normalize for left-aligning and atomizing overlapping variants, and --ref-from-fa for REF allele correction.
tracks (IntervalTrack | Sequence[IntervalTrack] | None, default: None) – An IntervalTrack (e.g. BigWigs, Table) or a sequence of them. Each track must have a unique name; the on-disk layout writes to <path>/intervals/<track.name>/.
annot_tracks (dict[str, str | Path | DataFrame | LazyFrame] | None, default: None) – Sample-independent annotation tracks, as a mapping of track name to source. Each source is a path to an interval table, a path to a bigWig, or a polars DataFrame/LazyFrame interpreted as a BED-like interval table (columns chrom, chromStart, chromEnd, score). Table/DataFrame sources are served by the Rust COITrees overlap backend. Written to <path>/annot_intervals/<name>/.
samples (list[str] | None, default: None) – Samples to include in the dataset.
max_jitter (int | None, default: None) – Maximum jitter to add to the regions.
overwrite (bool, default: False) – Whether to overwrite an existing dataset.
max_mem (int | str, default: '4g') – Approximate maximum total memory to use, including the genoray variant index. The reader’s index is loaded eagerly at the start of write() (for VCF and PGEN) so that nbytes reflects its true size; that value is subtracted from max_mem to determine the budget available for genotype chunking. A ValueError is raised if the remaining budget is too small to fit even a single variant chunk. Otherwise max_mem is a soft limit on overall usage and may be exceeded by a small amount.
extend_to_length (bool, default: True) – Whether to continue reading/writing variants until all haplotypes have a length at least as long as the intervals in bed. Otherwise, deletions can cause the length of haplotypes to be less than the intervals in bed. This can be disabled if having haplotypes shorter than the intervals is acceptable, in which case they will be padded with reference bases when appropriate. Disabling this also reduces the amount of data read/written and is faster to run.
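
The max_mem rule described above (subtract the index size, then fail if nothing usable remains) can be sketched as follows. This is only an illustration under stated assumptions: parse_mem and chunk_budget are hypothetical helpers, not GVL functions, and the binary unit suffixes and the minimum chunk size are guesses rather than documented behavior.

```python
def parse_mem(max_mem):
    """Convert an int (bytes) or a string such as '4g' to a byte count."""
    if isinstance(max_mem, int):
        return max_mem
    # Assumption: suffixes are binary multiples (1g == 2**30 bytes).
    units = {"k": 2**10, "m": 2**20, "g": 2**30, "t": 2**40}
    text = max_mem.strip().lower()
    if text and text[-1] in units:
        return int(float(text[:-1]) * units[text[-1]])
    return int(text)


def chunk_budget(max_mem, index_nbytes, min_chunk_nbytes):
    """Bytes left for genotype chunking once the variant index is loaded."""
    budget = parse_mem(max_mem) - index_nbytes
    if budget < min_chunk_nbytes:
        # Mirrors the documented ValueError when not even one chunk fits.
        raise ValueError(
            f"max_mem leaves {budget} bytes, need at least {min_chunk_nbytes}"
        )
    return budget


# Hypothetical numbers: a 512 MiB index and a 1 MiB minimum chunk.
budget = chunk_budget("4g", index_nbytes=512 * 2**20, min_chunk_nbytes=2**20)
```

With these made-up numbers the chunking budget is the 4 GiB limit minus the 512 MiB index.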

Notes

The dataset directory is built atomically: all data is written to a private sibling temp directory and published via os.replace(). A best-effort filelock prevents redundant parallel rebuilds, but correctness relies on the atomic rename — the lock is advisory only.

Out of scope: genoray .gvi index files and pysam .fai/.gzi index files are created by those libraries and are not covered by gvl’s atomic/locked creation. Concurrent jobs that trigger index creation for those files depend on the upstream libraries’ behavior.
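
The temp-directory-then-rename pattern mentioned in these notes can be sketched with the standard library. This is a generic illustration, not GVL's code: publish_atomically and the file names are hypothetical, the advisory file lock is omitted, and the sketch assumes the final directory does not exist yet.

```python
import os
import shutil
import tempfile
from pathlib import Path


def publish_atomically(final_dir, files):
    """Build a directory in a sibling temp dir, then rename it into place."""
    final_dir = Path(final_dir)
    # A sibling temp dir keeps the rename on a single filesystem.
    tmp_dir = Path(
        tempfile.mkdtemp(prefix=f".{final_dir.name}.", dir=final_dir.parent)
    )
    try:
        for name, payload in files.items():
            (tmp_dir / name).write_bytes(payload)
        # Readers see either no dataset or the complete one, never a partial one.
        os.replace(tmp_dir, final_dir)
    except BaseException:
        # Remove the half-built temp dir if anything failed before publishing.
        shutil.rmtree(tmp_dir, ignore_errors=True)
        raise


with tempfile.TemporaryDirectory() as root:
    target = Path(root) / "dataset.gvl"
    publish_atomically(target, {"metadata.json": b"{}"})
    published = sorted(p.name for p in target.iterdir())
    leftovers = sorted(p.name for p in Path(root).iterdir())
```

After the rename, only the published directory remains next to where the temp directory was created.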

genvarloader.update(dataset, tracks=None, annot_tracks=None, *, overwrite=False, max_mem='4g')[source]#

Add tracks to an existing on-disk GVL dataset, analogous to write().

Parameters:

dataset (str | Path | Dataset) – Path to a dataset directory, or an opened Dataset (its .path is used). A live dataset can be read while it is being updated; it will not observe the new track until reopened.
tracks (IntervalTrack | Sequence[IntervalTrack] | None, default: None) – Per-sample IntervalTrack source(s) (BigWigs, Table), written to <path>/intervals/<name>/. The track’s sample set must match the dataset’s exactly (no missing, no extra); samples are reordered to the dataset order automatically.
annot_tracks (dict[str, str | Path | DataFrame | LazyFrame] | None, default: None) – Sample-independent sources, identical to write()’s annot_tracks, written to <path>/annot_intervals/<name>/.
overwrite (bool, default: False) – Replace a track of the same name if present; otherwise adding a duplicate name raises FileExistsError.
max_mem (int | str, default: '4g') – Approximate memory budget, divided across concurrently-running categories.

Return type:

None
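
The sample-matching rule for tracks (exact same set, reordered to the dataset order) can be pictured with a small sketch. reorder_to_dataset is a hypothetical helper and the sample names are made up; the actual error type raised by update() is not stated above, so ValueError here is an assumption.

```python
def reorder_to_dataset(dataset_samples, track_samples):
    """Return indices that put track_samples into dataset order."""
    missing = set(dataset_samples) - set(track_samples)
    extra = set(track_samples) - set(dataset_samples)
    if missing or extra:
        # No missing and no extra samples are allowed.
        raise ValueError(f"missing={sorted(missing)}, extra={sorted(extra)}")
    position = {sample: i for i, sample in enumerate(track_samples)}
    return [position[sample] for sample in dataset_samples]


order = reorder_to_dataset(["s1", "s2", "s3"], ["s3", "s1", "s2"])
```

Here order lists, for each dataset sample in turn, where that sample sits in the track source.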

genvarloader.get_splice_bed(gtf, contigs=None, transcript_support_level='1', require_multiple_of_3=True)[source]#

Process a GTF into a BED-compatible DataFrame for splicing datasets.

The result has columns chrom, chromStart (0-based), chromEnd, strand, gene_name, transcript_id, and exon_number, sorted by chromosome (natural order) and chromStart. Pass it directly to gvl.write() for splicing datasets.

Parameters:

gtf (str | Path) – Path to a GTF file (gzipped or plain) accepted by seqpro.gtf.scan().
contigs (list[str] | None, default: None) – If provided, keep only rows whose seqname is in this list.
transcript_support_level (str | None, default: '1') – If a string, require the GTF transcript_support_level attribute to equal it. None disables the filter.
require_multiple_of_3 (bool, default: True) – If True, keep only transcripts whose summed CDS length is a multiple of 3.

Return type:

DataFrame
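
The require_multiple_of_3 filter amounts to summing CDS lengths per transcript and keeping those divisible by 3. A minimal sketch on made-up rows follows; it does not parse a GTF, and the rows use 0-based half-open coordinates purely for convenience (GTF itself is 1-based, inclusive).

```python
from collections import defaultdict

# Made-up (transcript_id, start, end) CDS rows, 0-based half-open.
cds_rows = [
    ("tx1", 100, 160),
    ("tx1", 200, 230),
    ("tx2", 50, 120),
    ("tx2", 300, 331),
]


def transcripts_with_whole_codons(rows):
    """Keep transcripts whose summed CDS length is a multiple of 3."""
    total = defaultdict(int)
    for transcript_id, start, end in rows:
        total[transcript_id] += end - start
    return {tx for tx, length in total.items() if length % 3 == 0}


kept = transcripts_with_whole_codons(cds_rows)
```

In this toy input tx1 sums to 90 bases and is kept, while tx2 sums to 101 and is dropped.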

genvarloader.read_bedlike(path)#

Reads a BED-like (BED3+) file as a polars DataFrame. The file type is inferred from the file extension; .bed, .narrowPeak, and .broadPeak are supported.

Return type:: DataFrame
Parameters:: path (str | Path)

Parameters#

path: Path to the bed-like file.

Returns#

pl.DataFrame: BED-like DataFrame with typed columns and zero-based coordinate metadata.

genvarloader.with_length(bed, length)#

Set the length of regions in a BED-like DataFrame to a fixed length by expanding or shrinking relative to the center (or peak) of the window. If the original region size + length is odd, the center will be 1 position closer to the right end.

Return type:

TypeVar(FrameT, DataFrame[Any], LazyFrame[Any])

Parameters:

bed (FrameT)
length (int)

Parameters#

bed: BED-like DataFrame with at least the columns “chromStart” and “chromEnd”.
length: Desired length of the windows. Must be non-negative.

Returns#

FrameT: DataFrame of the same type as the input with updated “chromStart” and “chromEnd” columns.
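
The resizing arithmetic can be illustrated on plain integers. This is a sketch only: resize_about_center is a hypothetical function, it ignores any peak column, and the rounding of the center in the odd case is an assumption that may differ from the library's exact convention.

```python
def resize_about_center(start, end, length):
    """Return a (start, end) window of the given length around the midpoint."""
    if length < 0:
        raise ValueError("length must be non-negative")
    center = (start + end) // 2  # assumed rounding; see the note above
    new_start = center - length // 2
    return new_start, new_start + length


# Shrink a 10 bp region and expand a 3 bp region to 4 bp each.
resized = [resize_about_center(s, e, 4) for s, e in [(10, 20), (10, 13)]]
```

Whatever the rounding convention, every output window has exactly the requested length.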

class genvarloader.BigWigs[source]#

__init__(name, paths)[source]#

Read data from bigWig files.

Parameters:

name (str) – Name of the reader, for example ‘signal’.
paths (dict[str, str]) – Dictionary of sample names and paths to bigWig files for those samples.

Return type:

None

classmethod from_table(name, table)[source]#

Read data from bigWig files listed in a table of sample names and paths.

Parameters:

name (str) – Name of the reader, for example ‘signal’.
table (str | Path | DataFrame) – Path to a table or a DataFrame containing sample names and paths to bigWig files for those samples. It must have columns “sample” and “path”.

class genvarloader.Table[source]#

Long-form interval track keyed by (sample_id, chrom, start, end, value).

Overlap queries are served by a Rust COITrees backend. Coordinates are zero-based, half-open [start, end).

__init__(name, data, column_map=None)[source]#

Parameters:

name (str)
data (DataFrame | Mapping[str, DataFrame])
column_map (Mapping[str, str] | None)

Return type:

None

Insertion fill#

Strategies controlling how re-aligned track values are filled across inserted bases (indels). Pass an instance to gvl.Dataset.with_insertion_fill(). InsertionFill is the abstract base; instantiate one of the concrete strategies.

class genvarloader.InsertionFill[source]#

Base class for track insertion fill strategies. Do not instantiate directly.

__init__()[source]#

class genvarloader.Constant[source]#

Write a fixed value at every inserted position.

Parameters:: value (float, default: nan) – Value to write. Defaults to NaN.

__init__(value=nan)#

Parameters:: value (float)
Return type:: None

class genvarloader.FlankSample[source]#

Sample (with replacement) from the 2*flank_width+1 reference values centered at the variant POS.

Each inserted position samples independently. Out-of-bounds neighbors are clamped to in-bounds values.

Parameters:: flank_width (int, default: 5) – Half-width of the flanking pool. Must be >= 0.

__init__(flank_width=5)#

Parameters:: flank_width (int)
Return type:: None
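
A rough sketch of the FlankSample idea on a plain list: build the pool of 2*flank_width+1 values around POS, clamp out-of-bounds neighbors, and draw independently for each inserted position. flank_sample is a hypothetical function, and clamping the index (rather than the value) is an assumption about what "clamped" means here.

```python
import random


def flank_sample(track, pos, n_inserted, flank_width=5, rng=None):
    """Fill n_inserted positions by sampling values near track[pos]."""
    if flank_width < 0:
        raise ValueError("flank_width must be >= 0")
    rng = rng or random.Random()
    last = len(track) - 1
    # Clamp neighbor indices that fall outside the track to the nearest end.
    pool = [
        track[min(max(i, 0), last)]
        for i in range(pos - flank_width, pos + flank_width + 1)
    ]
    # Each inserted position samples independently, with replacement.
    return [rng.choice(pool) for _ in range(n_inserted)]


track = [0.0, 1.0, 2.0, 3.0]
filled = flank_sample(track, pos=0, n_inserted=3, flank_width=1, rng=random.Random(0))
```

With pos=0 and flank_width=1 the pool is built from indices -1, 0, and 1, so the out-of-bounds neighbor contributes the first track value again.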

class genvarloader.Interpolate[source]#

Polynomial interpolation across the inserted region.

order=1: linear between track[v_rel_pos] and track[v_rel_pos + 1]. order=2,3: Lagrange polynomial through ceil((order+1)/2) reference values on each side of the variant, clamped at boundaries.

Parameters:: order (int, default: 1) – Polynomial order. Must be in {1, 2, 3}.

__init__(order=1)#

Parameters:: order (int)
Return type:: None
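
For order=1, the fill is a straight line between the two reference values flanking the insertion. The sketch below assumes the inserted positions are evenly spaced strictly between those two values; that spacing is an assumption, linear_fill is a hypothetical name, and orders 2 and 3 are not covered.

```python
def linear_fill(left, right, n_inserted):
    """Linearly interpolate n_inserted values strictly between left and right."""
    step = (right - left) / (n_inserted + 1)
    return [left + step * (k + 1) for k in range(n_inserted)]


# left plays the role of track[v_rel_pos], right of track[v_rel_pos + 1].
filled = linear_fill(2.0, 6.0, 3)
```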

class genvarloader.Repeat5p[source]#

Repeat the value at the variant POS across the entire inserted region. Current default behavior.

__init__()#

Return type:: None

class genvarloader.Repeat5pNormalized[source]#

Repeat track[v_rel_pos] / (v_diff + 1) across the inserted region.

Preserves the sum: when the full insertion stretch is written, the total written value equals track[v_rel_pos]. If the insertion is truncated at the output boundary, the sum is reduced proportionally.

__init__()#

Return type:: None
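
The difference between Repeat5p and Repeat5pNormalized is easiest to see in the arithmetic. The sketch assumes v_diff is the number of inserted bases and that the written stretch spans v_diff + 1 output positions (the anchor base plus the insertion); both function names are hypothetical.

```python
def repeat_5p(value, v_diff):
    """Repeat the value at the variant POS across the stretch."""
    return [value] * (v_diff + 1)


def repeat_5p_normalized(value, v_diff):
    """Repeat value / (v_diff + 1) so the stretch sums to the original value."""
    return [value / (v_diff + 1)] * (v_diff + 1)


plain = repeat_5p(8.0, 3)
normalized = repeat_5p_normalized(8.0, 3)
# If only half of the stretch fits in the output, half of the sum survives.
truncated_sum = sum(normalized[:2])
```

The unnormalized fill inflates the total signal by a factor of v_diff + 1, which is what the normalized variant avoids.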

Dataset maintenance#

Utilities for upgrading on-disk datasets written by older GVL versions.

genvarloader.migrate(path)[source]#

Migrate a GVL dataset’s track intervals from format 1.x (array-of-structs) to format 2.0 (struct-of-arrays), in place.

Streaming and crash-safe: peak extra disk is one track’s interval store. Genotypes, regions, and reference are untouched. Idempotent — a no-op (with leftover-AoS cleanup) on a dataset that is already 2.0.

Parameters:: path (str | Path) – Path to the GVL dataset directory.
Return type:: None
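
The two layouts named above differ only in how the same interval records are arranged. The sketch shows the idea in memory; the field names start, end, and value are hypothetical and say nothing about GVL's actual on-disk format.

```python
# Array-of-structs: one record per interval.
aos = [
    {"start": 0, "end": 5, "value": 1.5},
    {"start": 5, "end": 9, "value": 0.0},
    {"start": 9, "end": 12, "value": 2.25},
]


def to_struct_of_arrays(records, fields=("start", "end", "value")):
    """Regroup records into one contiguous list per field."""
    return {field: [record[field] for record in records] for field in fields}


soa = to_struct_of_arrays(aos)
```

A struct-of-arrays layout lets a reader load one field (for example, only the values) without touching the others.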

genvarloader.migrate_svar_link(gvl_path)[source]#

Upgrade a legacy GVL dataset’s link.svar symlink to an svar_link entry in metadata.json and remove the symlink.

Idempotent. No-op when svar_link is already populated, or when the dataset has no SVAR dependency. Raises FileNotFoundError if the legacy symlink is dangling.

Return type:: None
Parameters:: gvl_path (str | Path)

Reading#

Personalized data#

class genvarloader.Dataset[source]#

A dataset of genotypes, reference sequences, and intervals.

Note

This class is not meant to be instantiated directly. Use the Dataset.open() method to open a dataset after writing the data with genvarloader.write() or the GenVarLoader CLI.

GVL Datasets act like a collection of lazy ragged arrays that can be lazily subset or eagerly indexed as a 2D NumPy array. They have an effective shape of (n_regions, n_samples, [tracks], [ploidy], output_length), but only the region and sample dimensions can be indexed directly since the return value is generally a tuple of arrays.

Eager indexing

dataset[0, 9]  # first region, 10th sample
dataset[:10]  # first 10 regions and all samples
dataset[:10, :5]  # first 10 regions and 5 samples
dataset[[2, 2], [0, 1]]  # 3rd region, 1st and 2nd samples

Lazy indexing

See Dataset.subset_to(). This is useful, for example, to create splits for training, validation, and testing, or to filter out regions or samples after writing a full dataset. This is also necessary if you intend to create a PyTorch DataLoader from the Dataset using Dataset.to_dataloader().

Return values

The return value depends on the Dataset state, namely sequence_type, active_tracks, and output_length. These can all be modified after opening a Dataset using the following methods:

Dataset.with_seqs()
Dataset.with_tracks()
Dataset.with_len()

static open(path, reference=None, jitter=0, rng=False, deterministic=True, rc_neg=True, min_af=None, max_af=None, var_fields=None, region_names=None, splice_info=None, var_filter=None, *, svar=None, svar2=None)[source]#

Open a dataset from a path.

If no reference genome is provided, the dataset cannot yield sequences. When possible, the dataset is initialized to return tracks and haplotypes (or reference sequences if there are no genotypes). If tracks are available, they will be returned in alphabetical order.

Parameters:

path (str | Path) – Path to a dataset.
reference (str | Path | Reference | None, default: None) – Path to a reference genome.
jitter (int, default: 0) – Amount of jitter to use, cannot be more than the maximum jitter of the dataset.
rng (int | Generator | None, default: False) – Random seed or np.random.Generator for any stochastic operations.
deterministic (bool, default: True) – Whether to use randomized or deterministic algorithms. If set to True, this will disable random shifting of longer-than-requested haplotypes.
rc_neg (bool, default: True) – Whether to reverse-complement sequences and reverse tracks on negative strands.
min_af (float | None, default: None) – The minimum allele frequency to include in the dataset. If dataset is not backed by SVAR genotypes, this will raise an error.
max_af (float | None, default: None) – The maximum allele frequency to include in the dataset. If dataset is not backed by SVAR genotypes, this will raise an error.
var_fields (list[str] | None, default: None) – The variant fields to include in the dataset. Defaults to the minimum useful set ["alt", "ilen", "start"]. Pass additional field names (e.g. "ref", "dosage", or any info column present in the source variants table) to load them eagerly at open time. Must be a subset of available_var_fields.
region_names (str | None, default: None) – The name of the column in the region-of-interest table (BED) to use as the region names.
splice_info (str | tuple[str, str] | None, default: None) – A string or tuple of strings representing the splice information to use. If a string, it will be used as the transcript ID and the exons are expected to be in order. If a tuple of strings, the first string will be used as the transcript ID and the second string will be used as the exon number. If a dictionary, the keys will be used as the transcript ID and the values should be the row number for each exon, in order. If False, splicing will be disabled.
var_filter (Optional[Literal['exonic']], default: None) – Whether to filter variants. If set to "exonic", only exonic variants will be applied.
svar (str | Path | None, default: None) – Override the recorded SVAR location. Use when the original SVAR has moved and the dataset cannot find it via the stored relative/absolute path or by sibling discovery.
svar2 (str | Path | None, default: None) – Override the recorded .svar2 location. Use when the original .svar2 store has moved and the dataset cannot find it via the stored relative/absolute path or by sibling discovery.

Return type:

RaggedDataset[TypeVar(MaybeRSEQ, None, RaggedSeqs, RaggedAnnotatedHaps, RaggedVariants), TypeVar(MaybeRTRK, None, Ragged[float32], RaggedIntervals)]
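
What rc_neg does for a negative-strand region can be sketched on plain Python values: the sequence is reverse complemented and the track is reversed. orient and reverse_complement are hypothetical helpers for illustration; the real Dataset operates on arrays, not on bytes and lists.

```python
# Translation table for DNA complements; other bytes (e.g. N) pass through.
_COMPLEMENT = bytes.maketrans(b"ACGTacgt", b"TGCAtgca")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient(seq, track, strand):
    """Return sequence and track in the orientation of the given strand."""
    if strand == "-":
        return reverse_complement(seq), track[::-1]
    return seq, track


seq, track = orient(b"AACG", [1.0, 2.0, 3.0, 4.0], "-")
```

Track values are only reversed, not altered, so each value still lines up with the base it was measured at.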

with_settings(jitter=None, rng=None, deterministic=None, rc_neg=None, min_af=None, max_af=None, var_fields=None, splice_info=None, var_filter=None, flank_length=None, token_alphabet=None, unknown_token=None, dummy_variant=None, unphased_union=None, realign_tracks=None)[source]#

Modify settings of the dataset, returning a new dataset without modifying the old one.

Parameters:

jitter (int | None, default: None) – How much jitter to use. Must be non-negative and <= the max_jitter of the dataset.
rng (int | Generator | None, default: None) – Random seed or np.random.Generator for non-deterministic operations e.g. jittering and shifting longer-than-requested haplotypes.
deterministic (bool | None, default: None) – Whether to use randomized or deterministic algorithms. If set to True, this will disable random shifting of longer-than-requested haplotypes and, for unphased variants, will enable deterministic variant assignment and always apply the highest CCF group. Note that for unphased variants, this will mean not all possible haplotypes can be returned.
rc_neg (bool | None, default: None) – Whether to reverse-complement sequences and reverse tracks on negative strands.
min_af (Union[float, Literal[False], None], default: None) – The minimum allele frequency to include in the dataset. If set to False, disables this filter. If dataset is not backed by SVAR genotypes, this will raise an error.
max_af (Union[float, Literal[False], None], default: None) – The maximum allele frequency to include in the dataset. If set to False, disables this filter. If dataset is not backed by SVAR genotypes, this will raise an error.
var_fields (list[str] | None, default: None) – The variant fields to include in the dataset.
splice_info (Union[str, tuple[str, str], Literal[False], None], default: None) – A string or tuple of strings representing the splice information to use. If a string, it will be used as the transcript ID and the exons are expected to be in order. If a tuple of strings, the first string will be used as the transcript ID and the second string will be used as the exon number. If a dictionary, the keys will be used as the transcript ID and the values should be the row number for each exon, in order. If False, splicing will be disabled.
var_filter (Optional[Literal[False, 'exonic']], default: None) – Whether to filter variants. If set to "exonic", only exonic variants will be applied.
flank_length (int | None, default: None) – Number of reference-sequence bases to fetch as flanks around each variant. Stored on the Haps reconstructor for use by the flat-window output mode.
token_alphabet (str | bytes | NucleotideAlphabet | None, default: None) – Characters that define the token alphabet (e.g. b"ACGT", "ACGT", or seqpro.alphabets.DNA). Accepts a str, bytes, or seqpro.NucleotideAlphabet and is normalized to bytes; position i in the alphabet maps to integer token i. Must be supplied together with unknown_token.
unknown_token (int | None, default: None) – Integer token to assign to any byte not present in token_alphabet. Must be supplied together with token_alphabet.
dummy_variant (Union[DummyVariant, Literal[False], None], default: None) – A DummyVariant to insert into empty (region, sample, ploid) variant groups so every group has at least one variant. Valid for the "variants" and "variant-windows" outputs (see with_seqs); indexing any other output kind with a dummy set raises. For token outputs (the ride-along flank_tokens and the variant-window token buffers) the dummy entry is filled entirely with unknown_token. Pass False to disable.
unphased_union (bool | None, default: None) – When True, fold the stored ploidy haplotypes onto a single haploid sequence: the union of called ALTs per (region, sample). ds.ploidy and n_variants(...) then report ploidy 1, and "variants" / "variant-windows" output decode at ploidy 1. Phase is discarded (suited to unphased somatic calls); ALT occurrences are concatenated across haplotypes with no sort or dedup (a hom call appears once per haplotype). Requires a dataset with genotypes and is incompatible with "haplotypes" / "annotated" output (raises). See issue #222.
realign_tracks (bool | None, default: None) – Whether to re-align track values to haplotype coordinates when both haplotypes and float tracks are active. Default True. Set False for reference-coordinate (as-is) tracks; required False for variant-windows + tracks and for kind="intervals" with any variant-aware seq mode.

Return type:

Self

with_len(output_length)[source]#

Modify the output length of the dataset, returning a new dataset without modifying the old one.

Parameters:: output_length (Union[Literal['ragged', 'variable'], int]) – The output length. Can be set to "ragged" or "variable" to allow for variable length sequences. If set to an integer, all sequences will be padded or truncated to this length. See the online documentation for more information.
Return type:: ArrayDataset | RaggedDataset

with_seqs(kind, window_opt=None)[source]#

Return a new dataset with the specified sequence type.

The sequence type can be one of the following:

"reference": reference sequences.
"haplotypes": personalized haplotype sequences.
"annotated": annotated haplotype sequences, which includes personalized haplotypes along with annotations.
"variants": no sequences, just variants as RaggedVariants

Annotated haplotypes are returned as the AnnotatedHaps class which is roughly:

class AnnotatedHaps:
    haps: NDArray[np.bytes_]
    var_idxs: NDArray[np.int32]
    ref_coords: NDArray[np.int32]

where haps are the haplotypes as bytes/S1, and var_idxs and ref_coords are arrays with the same shape as haps that annotate every nucleotide with the variant index and reference coordinate it corresponds to. A variant index of -1 corresponds to a reference nucleotide, and a reference coordinate of -1 corresponds to padded nucleotides that were added for regions beyond the bounds of the reference genome, i.e. when the region’s start position is negative or its end position is beyond the end of the reference genome.

For example, a toy result for chr1:1-10 could be:

haps:        A C G  T ...  T T  A ...
var_idxs:   -1 3 3 -1 ... -1 4 -1 ...
ref_coords:  1 2 2  3 ...  6 7  9 ...

where variant 3 is a 1 bp CG insertion and variant 4 is a 1 bp deletion T-. Note that the first nucleotide of every indel maps to a reference position since gvl.write() expects that variants are all left-aligned.

Important

The var_idxs are numbered with respect to the full set of variants even if the variants were extracted from per-chromosome VCFs/PGENs. So a variant index of 0 corresponds to the first variant across all chromosomes. Thus, if you want to map the variant index to per-chromosome VCFs/PGENs, you will need to subtract the number of variants on all other chromosomes before the variant index to get the correct variant index in the VCF/PGEN. Relevant values can be obtained by instantiating a gvl.Variants class from the VCFs/PGENs and accessing the Variants.records.contig_offsets attribute.
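
The index arithmetic described above can be sketched with a sorted list of offsets. The contigs and offsets below are made up, and the sketch assumes that each offset is the global index of the first variant on that contig; check the actual layout of Variants.records.contig_offsets before relying on this.

```python
from bisect import bisect_right

# Hypothetical values: global index of the first variant on each contig.
contigs = ["chr1", "chr2", "chr3"]
contig_offsets = [0, 1000, 1750]


def to_contig_local(var_idx):
    """Map a global variant index to (contig, index within that contig)."""
    if var_idx < 0:
        raise ValueError("-1 marks a reference nucleotide, not a variant")
    i = bisect_right(contig_offsets, var_idx) - 1
    return contigs[i], var_idx - contig_offsets[i]


local = to_contig_local(1200)
```

Entries equal to -1 in var_idxs mark reference nucleotides and should be filtered out before the lookup.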

If the Dataset’s output length is "ragged", then annotated haplotypes will be RaggedAnnotatedHaps where each field is a Ragged array instead of NumPy arrays.

Parameters:

kind (Optional[Literal['haplotypes', 'reference', 'annotated', 'variants', 'variant-windows']]) – The type of sequences to return. Can be one of "reference", "haplotypes", "annotated", "variants", "variant-windows", or None to return no sequences.
window_opt (VarWindowOpt | None, default: None) – Required when kind="variant-windows". A VarWindowOpt configuring the flank length, token alphabet, and unknown token used to extract fixed-length windows around each variant.

with_tracks(tracks=None, kind=None)[source]#

Modify which tracks to return, returning a new dataset without modifying the old one.

Parameters:

tracks (Union[str, list[str], Literal[False], None], default: None) – The tracks to return. Can be a (list of) track names or False to return no tracks.
kind (Optional[Literal['tracks', 'intervals']], default: None) – The container type to return tracks as: "tracks" for RaggedTracks or "intervals" for RaggedIntervals. If None, keeps the dataset’s current track container kind.

with_insertion_fill(fill)[source]#

Configure how track values are filled at insertion sites.

Only meaningful when the dataset returns haplotypes and tracks (i.e. when the reconstructor is HapsTracks). Pure-reference and pure-haplotype datasets have no insertion fill to configure.

Parameters:: fill (InsertionFill | Mapping[str, InsertionFill]) – Either a single InsertionFill strategy applied to every active track, or a dict mapping track name to strategy. Tracks not in the dict fall back to Repeat5p.
Return type:: Self

with_output_format(fmt)[source]#

Return a copy that yields fmt containers from eager indexing.

Parameters:: fmt (Literal['ragged', 'flat']) – "ragged" for _core.Ragged-backed Ragged/RaggedVariants (default), or "flat" for pure-numpy FlatRagged/FlatVariants.
Return type:: Dataset

path: Path#: Path to the dataset.

output_length: Literal['ragged', 'variable'] | int#: The output length. Can be set to "ragged" or "variable" to allow for variable length sequences. If set to an integer, all sequences will be padded or truncated to this length. See the online documentation for more information.

max_jitter: int#: Maximum jitter.

return_indices: bool#: Whether to return row and sample indices corresponding to the full dataset (no subsetting).

contigs: list[str]#: List of unique contigs.

jitter: int#: How much jitter to use.

deterministic: bool#: Whether to use randomized or deterministic algorithms. If set to False, this will enable random shifting of longer-than-requested haplotypes and, for unphased variants, enable choosing sets of compatible variants proportional to their CCF; otherwise the dataset will always apply compatible sets with the highest CCF.

Note

This setting is independent of jitter. If you want no jitter, set jitter to 0.
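
For unphased variants, the two modes differ in how a set of compatible variants is chosen. The sketch below illustrates that choice only: pick_group is a hypothetical function, the groups and CCF values are made up, and how GVL builds compatible sets is not shown.

```python
import random


def pick_group(groups, ccfs, deterministic, rng=None):
    """Choose one compatible variant group, by highest CCF or by CCF weight."""
    if deterministic:
        # Always apply the group with the highest CCF.
        best = max(range(len(ccfs)), key=ccfs.__getitem__)
        return groups[best]
    rng = rng or random.Random()
    # Sample one group with probability proportional to its CCF.
    return rng.choices(groups, weights=ccfs, k=1)[0]


groups = [("v1",), ("v2", "v3"), ("v4",)]
ccfs = [0.2, 0.7, 0.1]
best = pick_group(groups, ccfs, deterministic=True)
sampled = pick_group(groups, ccfs, deterministic=False, rng=random.Random(0))
```

In deterministic mode the lower-CCF groups are never returned, which is why not all possible haplotypes can be observed.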

rc_neg: bool#: Whether to reverse-complement the sequences on negative strands.

output_format: Literal['ragged', 'flat']#: Container format for eager indexing. "ragged" (default) returns seqpro _core.Ragged / RaggedVariants; "flat" returns pure-numpy FlatRagged / FlatVariants with zero allocations on the hot path. See with_output_format.

realign_tracks: bool#: Whether to re-align track values to haplotype coordinates when both haplotypes and float tracks (kind="tracks") are active. True (default) uses the indel-aware realignment kernel; False returns reference-coordinate (as-is) tracks. Only affects Haps + float tracks; a no-op otherwise. Required False for variant-windows + tracks and for kind="intervals" with any variant-aware seq mode.

property is_subset: bool#: Whether the dataset is a subset.

property is_spliced: bool#: Whether the dataset is spliced.

property has_reference: bool#: Whether the dataset was provided a reference genome.

property reference: Reference | None#: The reference genome.

property has_genotypes#: Whether the dataset has genotypes.

property has_intervals: bool#: Whether the dataset has intervals.

property samples: list[str]#: The samples in the dataset.

property regions: DataFrame#

The input regions in the dataset, as they were provided to gvl.write(), i.e. with all BED columns plus any extra columns that were present.

property n_regions: int#: The number of (spliced) regions in the dataset.

property spliced_regions: DataFrame | None#: The spliced regions in the dataset.

property n_samples: int#: The number of samples in the dataset.

property ploidy: int | None#

The ploidy of the dataset.

Reports 1 when unphased_union is set (the two stored haplotypes are folded onto a single haploid sequence); otherwise the stored ploidy.

property shape: tuple[int, int]#

Return the shape of the dataset: (n_rows, n_samples).

property full_shape: tuple[int, int]#

Return the full shape of the dataset, ignoring any subsetting: (n_rows, n_samples).

property available_var_fields: list[str]#: Available variant fields.

property active_var_fields: list[str]#: Active variant fields.

property available_tracks: list[str] | None#: The available tracks in the dataset.

property active_tracks: list[str] | None#: The active tracks in the dataset.

property sequence_type: Literal['haplotypes', 'reference', 'annotated', 'variants', 'variant-windows'] | None#: The type of sequences in the dataset.

subset_to(regions=None, samples=None)[source]#

Subset the dataset to specific regions and/or samples by index or a boolean mask.

If regions or samples are not provided, the corresponding dimension will not be subset.

Parameters:

regions (int | integer | ndarray[tuple[Any, ...], dtype[integer]] | slice | Sequence[int] | Sequence[bool] | ndarray[tuple[Any, ...], dtype[bool]] | Series | str | Sequence[str] | ndarray[tuple[Any, ...], dtype[str_]] | ndarray[tuple[Any, ...], dtype[object_]] | None, default: None) – The regions to subset to.
samples (int | integer | ndarray[tuple[Any, ...], dtype[integer]] | slice | Sequence[int] | Sequence[bool] | ndarray[tuple[Any, ...], dtype[bool]] | Series | str | Sequence[str] | ndarray[tuple[Any, ...], dtype[str_]] | ndarray[tuple[Any, ...], dtype[object_]] | None, default: None) – The samples to subset to.

Return type:

Self

Examples

Subsetting to the first 10 regions:

ds.subset_to(slice(10))

Subsetting to the 2nd and 4th samples:

ds.subset_to(samples=[1, 3])

Subsetting to chromosome 1, assuming it’s labeled "chr1":

r_idx = ds.regions["chrom"] == "chr1"
ds.subset_to(regions=r_idx)

Subsetting to regions labeled by a column “split”, assuming “split” existed in the input regions:

r_idx = ds.regions["split"] == "train"
ds.subset_to(regions=r_idx)

Subsetting to the intersection with another set of regions:

import genvarloader as gvl
import seqpro as sp

regions = gvl.read_bedlike("regions.bed")
regions_pr = sp.bed.to_pyr(regions)
ds_regions_pr = sp.bed.to_pyr(ds.regions.with_row_index())
r_idx = ds_regions_pr.overlap(regions_pr).df["index"].to_numpy()
ds.subset_to(regions=r_idx)

to_full_dataset()[source]#

Return a full sized dataset, undoing any subsetting.

Return type:: Self

haplotype_lengths(regions=None, samples=None)[source]#

The lengths of jitter-extended haplotypes for specified regions and samples.

If the dataset is not phased or not deterministic, this will return None because the haplotypes are not guaranteed to be a consistent length due to randomness in what variants are used.

Note

Currently not implemented for spliced datasets.

Parameters:

regions (int | integer | ndarray[tuple[Any, ...], dtype[integer]] | slice | Sequence[int] | Sequence[bool] | ndarray[tuple[Any, ...], dtype[bool]] | Series | None, default: None) – Regions to compute haplotype lengths for.
samples (int | integer | ndarray[tuple[Any, ...], dtype[integer]] | slice | Sequence[int] | Sequence[bool] | ndarray[tuple[Any, ...], dtype[bool]] | Series | str | Sequence[str] | None, default: None) – Samples to compute haplotype lengths for.

Return type:

ndarray[tuple[Any, ...], dtype[int32]] | None

n_variants(regions=None, samples=None)[source]#

The number of variants in the dataset for specified regions and samples.

Parameters:

regions (int | integer | ndarray[tuple[Any, ...], dtype[integer]] | slice | Sequence[int] | Sequence[bool] | ndarray[tuple[Any, ...], dtype[bool]] | Series | None, default: None) – Regions to compute the number of variants for.
samples (int | integer | ndarray[tuple[Any, ...], dtype[integer]] | slice | Sequence[int] | Sequence[bool] | ndarray[tuple[Any, ...], dtype[bool]] | Series | str | Sequence[str] | ndarray[tuple[Any, ...], dtype[str_]] | ndarray[tuple[Any, ...], dtype[object_]] | None, default: None) – Samples to compute the number of variants for.

Return type:

ndarray[tuple[Any, ...], dtype[int32]]

Returns:

Array with shape (…, ploidy). The number of variants in the dataset for the specified regions and samples. If the dataset does not have genotypes, this will return None.

n_intervals(regions=None, samples=None)[source]#

The number of intervals in the dataset for specified regions and samples.

Parameters:

regions (int | integer | ndarray[tuple[Any, ...], dtype[integer]] | slice | Sequence[int] | Sequence[bool] | ndarray[tuple[Any, ...], dtype[bool]] | Series | None, default: None) – Regions to compute the number of intervals for.
samples (int | integer | ndarray[tuple[Any, ...], dtype[integer]] | slice | Sequence[int] | Sequence[bool] | ndarray[tuple[Any, ...], dtype[bool]] | Series | str | Sequence[str] | ndarray[tuple[Any, ...], dtype[str_]] | ndarray[tuple[Any, ...], dtype[object_]] | None, default: None) – Samples to compute the number of intervals for.

Return type:

ndarray[tuple[Any, ...], dtype[int32]]

Returns:

Array with shape (…, tracks). The number of intervals in the dataset for the specified regions and samples. If the dataset does not have intervals, this will return None.

write_transformed_track(new_track, existing_track, transform, max_mem=1073741824, overwrite=False)[source]#

Write transformed tracks to the dataset.

Parameters:

new_track (str) – The name of the new track.
existing_track (str) – The name of the existing track to transform.
transform (Callable[[ndarray[tuple[Any, ...], dtype[int64]], ndarray[tuple[Any, ...], dtype[int64]], Ragged[float32]], Ragged[float32]]) – A function to apply to the existing track to get a new, transformed track. This will be done in chunks such that the tracks provided will not exceed max_mem. The arguments given to the transform will be the region and sample indices as numpy arrays and the tracks themselves as a Ragged array with shape (regions, samples). The tracks must be a Ragged array since regions may be different lengths to accommodate indels. This function should then return the transformed tracks as a Ragged array with the same shape and lengths.
max_mem (int, default: 1073741824) – The maximum memory to use in bytes, by default 1 GiB (2**30 bytes)
overwrite (bool, default: False) – Whether to overwrite the existing track, by default False

Return type:

ArrayDataset | RaggedDataset
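
The contract for transform can be illustrated with a small sketch. It is not GVL code: nested Python lists stand in for the Ragged[float32] argument (an assumption made so the example runs without GVL installed), and the name log1p_transform is hypothetical. The point it shows is that the returned structure keeps the same shape and per-entry lengths as the input.

```python
import math


def log1p_transform(regions, samples, tracks):
    """Apply log(1 + x) elementwise, preserving every entry's length.

    `tracks` is a nested list standing in for a Ragged array of shape
    (regions, samples); a real transform would receive a Ragged[float32].
    """
    return [[[math.log1p(v) for v in entry] for entry in row] for row in tracks]


# Two regions x one sample; the regions have different lengths.
regions = [0, 1]
samples = [0]
tracks = [[[0.0, 1.0, 3.0]], [[7.0]]]
out = log1p_transform(regions, samples, tracks)
in_lengths = [[len(e) for e in row] for row in tracks]
out_lengths = [[len(e) for e in row] for row in out]
```

A transform that changes any entry's length (for example by pooling) would violate the "same shape and lengths" requirement stated above.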

to_torch_dataset(return_indices, transform)[source]#

Convert the dataset to a PyTorch Dataset.

Requires PyTorch to be installed.

Parameters:

return_indices (bool) – Whether to append arrays of row and sample indices of the non-subset dataset to each batch.
transform (Callable | None) –
The transform to apply to each batch of data. The transform should take input matching the output of the dataset and can return anything that can be converted to a PyTorch tensor. In combination with indices, this allows you to combine arbitrary row- and sample-specific data with dataset output on-the-fly.

Note

Depending on how transforms are implemented, they can easily introduce a dataloading bottleneck. If you find dataloading is slow, it’s often a good idea to try disabling your transform to see if it’s impacting throughput.

Return type:

TorchDataset

to_dataloader(batch_size=1, shuffle=False, sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, generator=None, *, prefetch_factor=None, persistent_workers=False, pin_memory_device='', return_indices=False, transform=None, mode=None, buffer_bytes=2147483648, copy=True, heartbeat_seconds=60.0)[source]#

Convert the dataset to a PyTorch DataLoader.

The parameters are the same as a DataLoader with a few omissions e.g. batch_sampler. Requires PyTorch to be installed.

Parameters:

batch_size (int, default: 1) – How many samples per batch to load.
shuffle (bool, default: False) – Set to True to have the data reshuffled at every epoch.
sampler (Sampler | Iterable | None, default: None) –
Defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented. If specified, shuffle must not be specified.

Important

Do not provide a BatchSampler here. GVL Datasets use multithreading when indexed with batches of indices to avoid the overhead of multi-processing. To leverage this, GVL will automatically wrap the sampler with a BatchSampler so that lists of indices are given to the GVL Dataset instead of one index at a time. See this post for more information.
num_workers (int, default: 0) –
How many subprocesses to use for dataloading. 0 means that the data will be loaded in the main process.

Tip

For GenVarLoader, it is generally best to set this to 0 or 1 since almost everything in GVL is multithreaded. However, if you are using a transform that is compute intensive and single threaded, there may be a benefit to setting this > 1.
collate_fn (Callable | None, default: None) – Merges a list of samples to form a mini-batch of Tensor(s).
pin_memory (bool, default: False) – If True, the data loader will copy Tensors into device/CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the memory pinning section of the PyTorch DataLoader documentation.
drop_last (bool, default: False) – Set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller.
timeout (float, default: 0) – If positive, the timeout value for collecting a batch from workers. Should always be non-negative.
worker_init_fn (Callable | None, default: None) – If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading.
multiprocessing_context (Callable | None, default: None) – If None, the default multiprocessing context of your operating system will be used.
generator (Generator | None, default: None) – If not None, this RNG will be used by RandomSampler to generate random indexes and multiprocessing to generate base_seed for workers.
prefetch_factor (int | None, default: None) – Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers. (default value depends on the set value for num_workers. If value of num_workers=0 default is None. Otherwise, if value of num_workers > 0 default is 2).
persistent_workers (bool, default: False) – If True, the data loader will not shut down the worker processes after a dataset has been consumed once. This keeps the workers' Dataset instances alive.
pin_memory_device (str, default: '') – The device to pin_memory to if pin_memory is True.
return_indices (bool, default: False) – Whether to append arrays of row and sample indices of the non-subset dataset to each batch.
transform (Callable | None, default: None) –
The transform to apply to each batch of data. The transform should take input matching the output of the dataset and can return anything that can be converted to a PyTorch tensor. In combination with indices, this allows you to combine arbitrary row- and sample-specific data with dataset output on-the-fly.

Note

Depending on how transforms are implemented, they can easily introduce a dataloading bottleneck. If you find dataloading is slow, it’s often a good idea to try disabling your transform to see if it’s impacting throughput.
mode (Optional[Literal['buffered', 'double_buffered']], default: None) – Dataloading strategy. None (default) uses the standard PyTorch DataLoader over a map-style dataset. "buffered" and "double_buffered" use a prefetching producer that fills an in-memory buffer ahead of consumption to hide read latency; both are incompatible with num_workers > 0 since the loader is itself the concurrency strategy. "double_buffered" serializes chunks into two fixed-size shared-memory slots, allowing a producer thread to fill one slot while the consumer drains the other.
buffer_bytes (int, default: 2147483648) – Total byte budget for the prefetch buffer when mode is "buffered" or "double_buffered". For "double_buffered" this is split across two shared-memory slots (buffer_bytes / 2 each). Defaults to 2 GiB.
copy (bool, default: True) – Only used when mode="double_buffered". If True (default), each batch owns its data. If False, batches are zero-copy views into shared memory and are only valid until the next batch is yielded.
heartbeat_seconds (float, default: 60.0) – Only used when mode="double_buffered". Seconds to wait per slot before checking that the producer is still alive.

Return type:

DataLoader
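
The "double_buffered" description (a producer fills one slot while the consumer drains the other, with a periodic liveness check) can be sketched with the standard library. This is an illustration of the general technique only, not GVL's implementation: it uses plain Python lists and threads rather than shared memory, and the function name double_buffered is hypothetical.

```python
import queue
import threading


def double_buffered(chunks, heartbeat_seconds=60.0):
    """Yield chunks while a producer thread fills the other of two slots."""
    slots = [None, None]
    free = queue.Queue()    # indices of slots the producer may overwrite
    filled = queue.Queue()  # indices of slots ready for the consumer
    free.put(0)
    free.put(1)

    def produce():
        for chunk in chunks:
            i = free.get()  # blocks while both slots are still in use
            slots[i] = list(chunk)
            filled.put(i)
        filled.put(None)  # sentinel: no more data

    producer = threading.Thread(target=produce, daemon=True)
    producer.start()
    while True:
        try:
            i = filled.get(timeout=heartbeat_seconds)
        except queue.Empty:
            # Heartbeat: nothing arrived in time, so check the producer.
            if not producer.is_alive():
                raise RuntimeError("producer stopped before finishing")
            continue
        if i is None:
            break
        yield list(slots[i])  # copy, so the slot can be reused safely
        free.put(i)           # hand the slot back to the producer
    producer.join()


batches = list(double_buffered([[1, 2], [3], [4, 5, 6]]))
```

Yielding a copy corresponds to the copy=True behaviour described above; yielding the slot itself would mirror copy=False, where a batch is only valid until the next one is requested.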

genvarloader.get_dummy_dataset(spliced=False)[source]#

Return a dummy Dataset with 4 regions, 4 samples, max jitter of 2, a reference genome of all "N", genotypes, and 1 track “read-depth” where each track is [1, 2, 3, 4, 5, 6] in the reference coordinate system, where 3 is aligned with each region’s start coordinate.

Is initialized to return ragged haplotypes and tracks with no jitter and deterministic reconstruction algorithms.

Parameters:

spliced (bool, default: False) –

If True, the dataset will be set up for splicing with all regions moved to chromosome 1 and a splice indexer with 2 genes, “tp53” and “shh”, corresponding to regions:

{
    "tp53": [3, 0, 2],
    "shh": [1],
}

class genvarloader.DummyVariant[source]#

Per-field values for the dummy variant inserted into empty (region, sample, ploid) groups.

Unspecified info fields default to 0 for integer columns and NaN for float columns.

scalar_for(name, dtype)[source]#

Return the dummy fill value for a scalar field, as a numpy scalar of dtype.

Parameters:

name (str)
dtype (dtype)

__init__(start=-1, ilen=0, dosage=0.0, ref=b'N', alt=b'N', info=<factory>)#

Parameters:

start (int)
ilen (int)
dosage (float)
ref (bytes)
alt (bytes)
info (dict[str, Any])

Return type:

None

class genvarloader.VarWindowOpt[source]#

Options for with_seqs('variant-windows').

Bundles every variant-window setting in one place so they are explicit rather than inherited from with_settings. ref and alt are chosen independently: "window" emits the flanked, tokenized window (ref = [start-L, end+L) reference read; alt = flank5 . alt . flank3), while "allele" emits the bare tokenized allele with no flanks.
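
The two window definitions can be made concrete with a short sketch. It is a plain-Python illustration rather than GVL code: variant_window is a hypothetical helper, the reference is a bytes object, and clamping the flanks at the contig boundaries is an assumption (the library may handle edges differently). Tokenization is left out here.

```python
def variant_window(reference, start, end, alt, flank_length):
    """Return (ref_window, alt_window) for a variant spanning [start, end)."""
    lo = max(start - flank_length, 0)               # assumed: clamp at contig start
    hi = min(end + flank_length, len(reference))    # assumed: clamp at contig end
    flank5 = reference[lo:start]
    flank3 = reference[end:hi]
    ref_window = reference[lo:hi]        # [start - L, end + L) reference read
    alt_window = flank5 + alt + flank3   # flank5 . alt . flank3
    return ref_window, alt_window


reference = b"AACCGGTT"
# A SNP G>T at 0-based position 4, with 2 bases of flank on each side.
ref_window, alt_window = variant_window(reference, 4, 5, b"T", 2)
```

With ref="allele" or alt="allele", the corresponding output would instead be just the bare allele bytes, without flank5 and flank3.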

token_alphabet accepts a str, bytes, or seqpro.NucleotideAlphabet (e.g. seqpro.alphabets.DNA) and is normalized to bytes on construction; each byte’s position is its token id, so ordering is preserved verbatim.
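
As a rough sketch of the "position is the token id" rule, the following plain-Python function maps each byte to its index in the alphabet and falls back to unknown_token otherwise. The function name tokenize is hypothetical, and the fallback behaviour for bytes outside the alphabet is an assumption based on the parameter name.

```python
def tokenize(seq, token_alphabet, unknown_token):
    """Map each byte of `seq` to its position in `token_alphabet`."""
    if isinstance(token_alphabet, str):
        token_alphabet = token_alphabet.encode("ascii")  # normalize to bytes
    lookup = {byte: i for i, byte in enumerate(token_alphabet)}
    return [lookup.get(byte, unknown_token) for byte in seq]


tokens = tokenize(b"ACGTNA", "ACGT", unknown_token=4)
reordered = tokenize(b"ACGT", "TGCA", unknown_token=4)
```

Because ordering is preserved verbatim, passing "TGCA" instead of "ACGT" changes every token id, as the second call shows.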

__init__(flank_length, token_alphabet, unknown_token, ref='window', alt='window')#

Parameters:

flank_length (int)
token_alphabet (str | bytes | NucleotideAlphabet)
unknown_token (int)
ref (Literal['window', 'allele'])
alt (Literal['window', 'allele'])

Return type:

None

class genvarloader.RaggedDataset[source]#: Only for type checking purposes, you should never instantiate this class directly.

class genvarloader.ArrayDataset[source]#: Only for type checking purposes, you should never instantiate this class directly.

Reference genome(s)#

class genvarloader.Reference[source]#

A reference genome kept in-memory.

Typically this is only instantiated to be passed to Dataset.open and avoid data duplication.

Note

Do not instantiate this class directly. Use Reference.from_path() instead.

path: Path#: The path to the reference genome.

reference: ndarray[tuple[Any, ...], dtype[uint8]]#: The reference genome as a numpy array, with contigs concatenated.

offsets: ndarray[tuple[Any, ...], dtype[int64]]#: The offsets of the contigs in the reference genome. Shape (n_contigs + 1).

pad_char: int#: The padding character used in the reference genome.
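
The relationship between reference, offsets, and pad_char can be sketched as follows. This is an illustration with plain Python bytes and lists, not the library's code: the fetch helper and the toy contigs are hypothetical, and padding reads that run past a contig's end with pad_char is an assumption about how the padding character is used.

```python
contigs = {"chr1": b"ACGT", "chr2": b"GG", "chr3": b"TTA"}
names = list(contigs)

# Contigs concatenated into one buffer, with n_contigs + 1 offsets.
reference = b"".join(contigs.values())
offsets = [0]
for seq in contigs.values():
    offsets.append(offsets[-1] + len(seq))

pad_char = ord("N")


def fetch(contig, start, end):
    """Read [start, end) from a contig, right-padding past the contig end."""
    i = names.index(contig)
    lo, hi = offsets[i], offsets[i + 1]
    seq = reference[lo + start : min(lo + end, hi)]
    return seq + bytes([pad_char]) * (end - start - len(seq))


inside = fetch("chr3", 1, 3)
overhang = fetch("chr2", 1, 4)
```

Contig i occupies reference[offsets[i]:offsets[i + 1]], which is why the offsets array has one more entry than there are contigs. Negative start positions are not handled in this sketch.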

classmethod from_path(fasta, contigs=None, in_memory=True)[source]#

Load a reference genome from a FASTA file.

Parameters:

fasta (str | Path) – Path to a .fa/.fa.bgz FASTA file or an existing .gvlfa cache directory. When a FASTA path is given, a sibling .gvlfa cache is built on first use and reused on subsequent calls; a legacy .fa.gvl flat cache is automatically migrated to the new format.
contigs (list[str] | None, default: None) –
List of contig names to load. If None, all contigs in the FASTA file are loaded. Can be either UCSC or Ensembl style (i.e. with or without the “chr” prefix) and will be handled appropriately to match the underlying FASTA.

Note

Reordering or subsetting contigs requires in_memory=True. With in_memory=False the memory-mapped reference stays in FASTA order, so contigs must be None or exactly the full FASTA contig order; anything else raises ValueError.
in_memory (bool, default: True) – Whether to load the reference genome into memory. If True, the reference genome is loaded into memory. If False, the reference genome is read on-demand from a memory mapped array. This will still be much faster than reading from FASTA but slower than keeping it in memory. This is useful if you need to work with many reference genomes or have very limited RAM. Because the memory map preserves FASTA order, in_memory=False cannot reorder or subset contigs (see above).
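
Matching UCSC and Ensembl style names might look like the sketch below. The helper match_contig is hypothetical and only handles the "chr" prefix; special cases such as chrM versus MT are not covered, and the library's actual matching rules may differ.

```python
def match_contig(name, fasta_contigs):
    """Return the FASTA contig matching `name`, with or without 'chr'."""
    if name in fasta_contigs:
        return name
    # Try the other naming style: strip or add the "chr" prefix.
    other = name[3:] if name.startswith("chr") else "chr" + name
    if other in fasta_contigs:
        return other
    raise ValueError(f"contig {name!r} not found in FASTA")


ensembl_hit = match_contig("chr2", ["1", "2", "X"])
ucsc_hit = match_contig("X", ["chr1", "chrX"])
exact_hit = match_contig("chr1", ["chr1"])
```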

class genvarloader.RefDataset[source]#

A reference dataset for pulling out sequences from a reference genome.

When splice_info is provided, the dataset returns per-transcript concatenated reference sequence, with one row per splice group instead of one row per BED region. Same semantics as Dataset.open(splice_info=...).

reference: Reference#: The reference genome.

full_bed: DataFrame#: A table of regions to extract from the reference genome. The table must have the following columns:

- chrom: The name of the contig (e.g. “chr1”, “chr2”, etc.)
- chromStart: The start position of the region (0-based).
- chromEnd: The end position of the region (0-based, exclusive).

A strand column can also be included, in which case the regions will be reverse complemented if the strand is -1 and the rc_neg parameter is set to True.

jitter: int#: The maximum length for randomly shifting start positions.

output_length: Literal['ragged', 'variable'] | int#: The output length of the dataset. Same meaning as Dataset.output_length.

deterministic: bool#: If true, fixed length sequences will be right truncated from their full length to the output length. If false, fixed length sequences will be randomly shifted to be within the output length.
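
The difference between the two settings can be sketched in plain Python. The helper fix_length is hypothetical and works on a single bytes sequence; sequences shorter than the output length are returned unchanged here, which is an assumption and not a statement about how the library pads.

```python
import random


def fix_length(seq, output_length, deterministic=True, rng=None):
    """Cut `seq` down to `output_length` bases."""
    extra = len(seq) - output_length
    if extra <= 0:
        return seq  # assumed: nothing to trim
    if deterministic:
        shift = 0  # right truncation: always keep the leftmost bases
    else:
        shift = (rng or random).randint(0, extra)  # random window start
    return seq[shift : shift + output_length]


full = b"ACGTACGT"
truncated = fix_length(full, 5)
shifted = fix_length(full, 5, deterministic=False, rng=random.Random(0))
```

With deterministic=True the same bases come back on every access, which is usually what validation and test loops want; the random shift acts as a light augmentation during training.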

rc_neg: bool#: Whether to reverse complement the regions that are on the negative strand.
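
For readers unfamiliar with the operation, a minimal reverse-complement sketch follows. It is not the library's implementation; the orient helper is hypothetical and accepts either "-" or -1 as the negative-strand marker because the encoding used internally is not specified here.

```python
_COMPLEMENT = bytes.maketrans(b"ACGTacgt", b"TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse; other bytes (e.g. N) pass through."""
    return seq.translate(_COMPLEMENT)[::-1]


def orient(seq, strand, rc_neg=True):
    """Reverse complement only negative-strand regions when rc_neg is set."""
    if rc_neg and strand in ("-", -1):
        return reverse_complement(seq)
    return seq


neg = orient(b"AACGN", "-")
pos = orient(b"AACGN", "+")
disabled = orient(b"AACGN", "-", rc_neg=False)
```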

region_names: str | None#: The name of the column in the full_bed table to use as the region names.

splice_info: str | tuple[str, str] | None#: If set, the dataset is spliced. Either the column name with rows already in splice order or a (group_col, sort_col) pair applied against full_bed.
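
The (group_col, sort_col) form can be pictured as "group rows by one column, order each group by another, then concatenate". The sketch below uses a list of dicts in place of the full_bed DataFrame; the column names gene and exon, the seq field, and the splice helper are all hypothetical, and strand handling is ignored.

```python
rows = [
    {"gene": "tp53", "exon": 2, "seq": b"GG"},
    {"gene": "shh", "exon": 1, "seq": b"TTT"},
    {"gene": "tp53", "exon": 1, "seq": b"AC"},
]


def splice(rows, group_col, sort_col):
    """Concatenate per-region sequences within each group, in sort order."""
    groups = {}
    for row in sorted(rows, key=lambda r: r[sort_col]):
        groups.setdefault(row[group_col], []).append(row["seq"])
    return {name: b"".join(parts) for name, parts in groups.items()}


spliced = splice(rows, "gene", "exon")
```

The result has one entry per splice group rather than one per BED region, matching the "one row per splice group" behaviour described for spliced datasets.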

property is_spliced: bool#: Whether the dataset is spliced.

property spliced_regions: DataFrame#: The spliced BED, subset to the current row subset.

property shape: tuple[int]#: Shape of the dataset.

subset_to(regions)[source]#

Subset the dataset to a subset of regions (or transcripts, when spliced).

Parameters:: regions (int | integer | ndarray[tuple[Any, ...], dtype[integer]] | slice | Sequence[int] | Sequence[bool] | ndarray[tuple[Any, ...], dtype[bool]] | Series | str | Sequence[str] | ndarray[tuple[Any, ...], dtype[str_]] | ndarray[tuple[Any, ...], dtype[object_]])

to_full_dataset()[source]#

Reset the dataset to the full dataset.

Return type:: Self

to_torch_dataset(return_indices=False, transform=None)[source]#

Convert the dataset to a PyTorch dataset.

Parameters:

return_indices (bool, default: False) – If True, the dataset will return the indices of the regions in the reference genome.
transform (Callable | None, default: None) – A function to transform the data. Should accept a numpy array of S1 with shape (batch_size, length). If return_indices is true, the function should accept a tuple of (sequences, indices).

Return type:

TorchDataset

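to_dataloader(batch_size=1, shuffle=False, sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, generator=None, *, prefetch_factor=None, persistent_workers=False, pin_memory_device='', return_indices=False, transform=None)[source]#

Convert the dataset to a PyTorch DataLoader.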

The parameters are the same as a DataLoader with a few omissions e.g. batch_sampler. Requires PyTorch to be installed.

Parameters:

batch_size (int, default: 1) – How many samples per batch to load.
shuffle (bool, default: False) – Set to True to have the data reshuffled at every epoch.
sampler (Sampler | Iterable | None, default: None) –
Defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented. If specified, shuffle must not be specified.

Important

Do not provide a BatchSampler here. GVL Datasets use multithreading when indexed with batches of indices to avoid the overhead of multi-processing. To leverage this, GVL will automatically wrap the sampler with a BatchSampler so that lists of indices are given to the GVL Dataset instead of one index at a time. See this post for more information.
num_workers (int, default: 0) –
How many subprocesses to use for dataloading. 0 means that the data will be loaded in the main process.

Tip

For GenVarLoader, it is generally best to set this to 0 or 1 since almost everything in GVL is multithreaded. However, if you are using a transform that is compute intensive and single threaded, there may be a benefit to setting this > 1.
collate_fn (Callable | None, default: None) – Merges a list of samples to form a mini-batch of Tensor(s).
pin_memory (bool, default: False) – If True, the data loader will copy Tensors into device/CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the memory pinning section of the PyTorch DataLoader documentation.
drop_last (bool, default: False) – Set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller.
timeout (float, default: 0) – If positive, the timeout value for collecting a batch from workers. Should always be non-negative.
worker_init_fn (Callable | None, default: None) – If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading.
multiprocessing_context (Callable | None, default: None) – If None, the default multiprocessing context of your operating system will be used.
generator (Generator | None, default: None) – If not None, this RNG will be used by RandomSampler to generate random indexes and multiprocessing to generate base_seed for workers.
prefetch_factor (int | None, default: None) – Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers. (default value depends on the set value for num_workers. If value of num_workers=0 default is None. Otherwise, if value of num_workers > 0 default is 2).
persistent_workers (bool, default: False) – If True, the data loader will not shut down the worker processes after a dataset has been consumed once. This keeps the workers' Dataset instances alive.
pin_memory_device (str, default: '') – The device to pin_memory to if pin_memory is True.
return_indices (bool, default: False) – If True, the dataset will return the indices of the regions in the reference genome.
transform (Callable | None, default: None) – A function to transform the data. Should accept a numpy array of S1 with shape (batch_size, length). If return_indices is true, the function should accept a tuple of (sequences, indices).

Return type:

DataLoader
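
The effect of wrapping a sampler so that the dataset receives lists of indices, together with the drop_last behaviour, can be sketched without PyTorch. The generator batch_indices is a hypothetical stand-in for that wrapping and is only meant to show what the dataset ends up being indexed with.

```python
def batch_indices(sampler, batch_size, drop_last=False):
    """Group indices from `sampler` into lists of length `batch_size`."""
    batch = []
    for idx in sampler:
        batch.append(idx)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch and not drop_last:
        yield batch  # final, smaller batch


kept = list(batch_indices(range(5), batch_size=2))
dropped = list(batch_indices(range(5), batch_size=2, drop_last=True))
```

Each yielded list is handed to the dataset in a single indexing call, which is what lets a multithreaded dataset process the whole batch at once instead of one index at a time.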

__init__(reference, full_bed, jitter=0, output_length='ragged', deterministic=True, rc_neg=True, seed=None, region_names=None, splice_info=None)#

Parameters:

reference (Reference)
full_bed (DataFrame)
jitter (int)
output_length (Literal['ragged', 'variable'] | int)
deterministic (bool)
rc_neg (bool)
seed (int | Generator | None)
region_names (str | None)
splice_info (str | tuple[str, str] | None)

Return type:

None

Non-personal/site-only variants#

class genvarloader.DatasetWithSites[source]#

__init__(dataset, sites, max_variants_per_region=1)[source]#

Dataset with variant sites, used to apply site-only variants e.g. from ClinVar to a Dataset of haplotypes.

Currently only supports bi-allelic SNPs. Takes the intersection of the dataset regions and the sites, and applies the site-only variants to the Dataset’s haplotypes.

Accessed just like a Dataset, but where the rows are combinations of dataset regions and sites. Will return two AnnotatedHaps with variants applied and flags indicating whether the variant was applied, deleted, or existed. The flags are 0 for applied, 1 for deleted, and 2 for existed. If the dataset has tracks, they will be returned as well and reflect any site-only variants. The first AnnotatedHaps is the wildtype haplotypes and the second is the mutated haplotypes. The mutant haplotypes will also have their variant indices and reference coordinates updated to reflect the applied variants. Locations where a site-only variant was applied will have a variant index of -2.

Parameters:

dataset (ArrayDataset[TypeVar(SEQ, ndarray[tuple[Any, ...], dtype[bytes_]], AnnotatedHaps, RaggedVariants), TypeVar(MaybeTRK, None, ndarray[tuple[Any, ...], dtype[float32]], RaggedIntervals)]) – Dataset of haplotypes and potentially tracks.
sites (DataFrame) – Table of variant site information.
max_variants_per_region (int, default: 1) – Maximum number of variants per region. Currently only 1 is supported.
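
The three flag values and the -2 variant index can be illustrated with a toy model of applying one SNP to one haplotype. Everything here is a sketch under stated assumptions: SiteFlag and apply_snp are hypothetical names, "deleted" is read as "the site's reference position is absent from the haplotype", and "existed" is read as "the haplotype already carries the alternate base". The library's exact rules may differ.

```python
from enum import IntEnum


class SiteFlag(IntEnum):
    APPLIED = 0
    DELETED = 1
    EXISTED = 2


def apply_snp(hap, ref_coords, var_idxs, pos, alt):
    """Apply a site-only SNP at reference position `pos` to one haplotype."""
    if pos not in ref_coords:
        return hap, var_idxs, SiteFlag.DELETED  # assumed meaning of "deleted"
    i = ref_coords.index(pos)
    if hap[i : i + 1] == alt:
        return hap, var_idxs, SiteFlag.EXISTED  # assumed meaning of "existed"
    mutated = hap[:i] + alt + hap[i + 1 :]
    new_idxs = var_idxs[:i] + [-2] + var_idxs[i + 1 :]  # -2 marks a site-only variant
    return mutated, new_idxs, SiteFlag.APPLIED


hap = b"ACGT"
ref_coords = [10, 11, 13, 14]  # reference position 12 is deleted in this haplotype
var_idxs = [-1, -1, -1, -1]

applied = apply_snp(hap, ref_coords, var_idxs, 11, b"T")
deleted = apply_snp(hap, ref_coords, var_idxs, 12, b"A")
existed = apply_snp(hap, ref_coords, var_idxs, 13, b"G")
```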

Examples:

    import genvarloader as gvl

    sites = gvl.sites_vcf_to_table("path/to/variants.vcf")

    ds = gvl.Dataset.open("path/to/dataset.gvl", "path/to/reference.fasta")
    ds_sites = gvl.DatasetWithSites(ds, sites)
    # flags is a np.uint8 (or an array of np.uint8 when accessing multiple rows/samples)
    wt_haps, mut_haps, flags = ds_sites[0, 0]

    ds_sites.dataset = ds_sites.dataset.with_tracks("read-depth")
    wt_haps, mut_haps, flags, tracks = ds_sites[0, 0]

sites: DataFrame#: Table of variant site information.

dataset: ArrayDataset[AnnotatedHaps, MaybeTRK]#: Dataset of haplotypes and potentially tracks.

rows: DataFrame#: Rows of this object, where each row is a combination of a dataset region and a site.

genvarloader.sites_vcf_to_table(vcf, attributes=None, info_fields=None)[source]#

Extract a table of variant site info from a VCF. All sites must be bi-allelic.

Parameters:

vcf (str | Path | VCF) – Path to a VCF or a genoray.VCF instance. Note that genoray.VCF can accept a filter function.
attributes (list[str] | None, default: None) – A list of attributes to include in the output table. Note that “CHROM”, “POS”, “REF”, and “ALT” are always included even if not in this list.
info_fields (list[str] | None, default: None) – A list of INFO fields to include in the output table.

Return type:

DataFrame

genvarloader.SitesSchema = SitesSchema[source]#

Schema to validate a table of variant sites.

Return type:: DataFrameBase[Self]

Data registry#

genvarloader.data_registry.fetch(name)[source]#

Download and cache data for constructing/opening a GVL dataset.

Files are cached in the user’s home directory under ~/.cache/genvarloader.

Parameters:

name (Literal['geuvadis_ebi', '1kgp']) –

The name of the dataset to fetch. Can be one of:

- "geuvadis_ebi": Geuvadis data for the original analyses by Lappalainen et al. 2013. Phased, normalized, and split into biallelic variants.
- "1kgp": 1000 Genomes Project, all 3,202 individuals. Phased, normalized, and split into biallelic variants.

Return type:

dict[str, Path]

Returns:

A dictionary of paths to the fetched data.

Containers#

Classes that GVL Datasets may return.

class genvarloader.AnnotatedHaps[source]#

AnnotatedHaps(haps: NDArray[np.bytes_], var_idxs: NDArray[np.int32], ref_coords: NDArray[np.int32])

haps: ndarray[tuple[Any, ...], dtype[bytes_]]#: Haplotypes with dtype S1.

var_idxs: ndarray[tuple[Any, ...], dtype[int32]]#: Variant indices for each position in the haplotypes. A value of -1 indicates no variant was applied at the position.

ref_coords: ndarray[tuple[Any, ...], dtype[int32]]#: Reference coordinates for each position in haplotypes.

property shape#: Shape of the haplotypes and all annotations.

reshape(shape)[source]#

Reshape the haplotypes and all annotations.

Parameters:: shape (int | tuple[int, ...]) – New shape for the haplotypes and all annotations. The total number of elements must remain the same.

squeeze(axis=None)[source]#

Squeeze the haplotypes and all annotations along the specified axis.

Parameters:: axis (int | tuple[int, ...] | None, default: None) – Axis or axes to squeeze. If None, all axes of length 1 will be squeezed.
Return type:: AnnotatedHaps

class genvarloader.Ragged[source]#

A non-branching ragged array with a single ragged axis (Spec A).

static from_fields(fields)[source]#

Build a record (struct-of-arrays) from named single-field Ragged inputs that share one ragged axis. Supports numeric, char, string-under-axis, and R=2 fields; record-of-record and R>=3 fields are not supported.

Return type:: Ragged[Any]
Parameters:: fields (dict[str, Ragged[Any]])

property data: ndarray[tuple[Any, ...], dtype[Any]]#: Return the underlying data array. For record Raggeds, returns the dict of fields.

property fields: list[str]#: Field names for a record Ragged. Raises TypeError on non-record arrays.

property is_string: bool#: True for an opaque variable-width string Ragged (dtype ‘S’, shape (N,)).

to_chars()[source]#

Zero-copy view of an opaque string (‘S’, (…, None?)) as ascii chars (‘S1’, (…, None?, None)); str_offsets becomes the innermost real axis.

Return type:: Ragged[Any]

to_strings()[source]#

Zero-copy view of a 1-D ascii-char leaf (‘S1’, (…, None)) as an opaque string (‘S’, (…)); the innermost length axis becomes an uncounted byte leaf.

Return type:: Ragged[Any]

hash(algo, *, seed=None)[source]#

Hash each string element. Thin delegator to seqpro.rag.hash().

Return type:

Union[ndarray[tuple[Any, ...], dtype[Any]], Ragged[Any]]

Parameters:

algo (Literal['md5', 'sha256', 'rapidhash'])
seed (int | None)

class genvarloader.RaggedAnnotatedHaps[source]#

Ragged version of AnnotatedHaps.

haps: Ragged[bytes_]#: Haplotypes with dtype S1.

var_idxs: Ragged[int32]#: Variant indices for each position in the haplotypes. A value of -1 indicates no variant was applied at the position.

ref_coords: Ragged[int32]#: Reference coordinates for each position in haplotypes.

property shape#: Shape of the haplotypes and all annotations.

to_padded()[source]#

Convert this Ragged array to a rectilinear array by right-padding each entry with appropriate values.

The final axis will have the maximum length across all entries.

Return type:: AnnotatedHaps

reshape(shape)[source]#

Reshape the haplotypes and all annotations.

Parameters:: shape (int | tuple[int, ...]) – New shape for the haplotypes and all annotations. The total number of elements must remain the same.
Return type:: RaggedAnnotatedHaps

squeeze(axis=None)[source]#

Squeeze the haplotypes and all annotations along the specified axis.

Parameters:: axis (int | tuple[int, ...] | None, default: None) – Axis or axes to squeeze. If None, all axes of length 1 are squeezed.
Return type:: RaggedAnnotatedHaps

to_numpy()[source]#

If all entries in the ragged array have the same shape, convert to a rectilinear AnnotatedHaps. This method takes no arguments.

Return type:: AnnotatedHaps

class genvarloader.RaggedVariants[source]#

Variable-length variants as a single record Ragged with shape (batch, ploidy, ~variants). alt/ref are opaque-string fields; start and optional ilen/dosage/extra fields are numeric. Guaranteed: alt, start, and one of ref/ilen.

classmethod from_record(rag)[source]#

Wrap an existing record Ragged directly (no copy), preserving subclass.

Return type:: RaggedVariants
Parameters:: rag (Ragged)

property alt: Ragged#: Alternative alleles (opaque-string Ragged, shape (b,p,~v)).

property ref: Ragged#: Reference alleles (opaque-string Ragged, shape (b,p,~v)).

property start: Ragged#: 0-based start positions (numeric Ragged, shape (b,p,~v)).

property dosage: Ragged#: Dosages (numeric Ragged, shape (b,p,~v)).

property ilen: Ragged#: Indel lengths. Infallible — derived from alt/ref char lengths when absent.

property end: Ragged#: 0-based exclusive end positions.
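
How ilen and end relate to the other fields can be written out for a single variant. This sketch assumes the common conventions that ilen is len(alt) - len(ref) and that end is start + len(ref); the fallback from ilen alone additionally assumes atomized, left-aligned variants, where the reference allele is one base plus any deleted bases. The function names are hypothetical.

```python
def indel_length(ref, alt):
    """Indel length: positive for insertions, negative for deletions."""
    return len(alt) - len(ref)


def end_from_ref(start, ref):
    """0-based exclusive end of the reference span."""
    return start + len(ref)


def end_from_ilen(start, ilen):
    """Same end, derived from ilen when ref is absent (atomized variants)."""
    return start + 1 - min(ilen, 0)


deletion = indel_length(b"ACG", b"A")
insertion = indel_length(b"A", b"ATT")
snp = indel_length(b"A", b"G")
```

For the deletion ACG>A at start 100, both end_from_ref(100, b"ACG") and end_from_ilen(100, -2) give 103.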

to_nested_tensor_batch(device='cpu', tokenizer=None)[source]#

Convert a RaggedVariants object to a dictionary of nested tensors.

Numeric fields (start, ilen, dosage, any extra) are flattened across the ploidy dimension so their shape is (batch * ploidy, ~variants). Allele fields (alt, ref) are flattened across both the ploidy and variant dimensions so their shape is (batch * ploidy * ~variants, ~alt_len).

Parameters:

device (str | device, default: 'cpu') – Device to move tensors to.
tokenizer (Union[Literal['seqpro'], Callable[[ndarray[tuple[Any, ...], dtype[bytes_]]], ndarray[tuple[Any, ...], dtype[integer]]], None], default: None) –
How to encode allele characters.
- "seqpro" — use seqpro.tokenize (ACGTN → 0 1 2 3 4).
- None — uint8 ASCII values (ACGTN → 65 67 71 84 78).
- Callable — called with the flat NDArray[np.bytes_] data, returns an integer array of the same length.

Returns:

"alt" — nested tensor (batch*ploidy*~vars, ~alt_len)
"ref" — nested tensor (batch*ploidy*~vars, ~ref_len) (if present)
numeric field keys — nested tensor (batch*ploidy, ~vars)
"max_n_vars" — int
"max_alt_len" — int
"max_ref_len" — int (if ref present)

Return type:

dict
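
The flattening and the two built-in tokenizer choices can be traced on a toy batch. Nested Python lists stand in for the ragged fields and no tensors are created, so this only illustrates the shapes and token values described above; the variable names are hypothetical.

```python
# Toy data with shape (batch=2, ploidy=2, ~variants).
alts = [
    [[b"A", b"GT"], []],
    [[b"C"], [b"T", b"N", b"AC"]],
]
starts = [
    [[5, 9], []],
    [[2], [1, 4, 8]],
]

# Numeric fields: flatten the ploidy axis -> (batch * ploidy, ~variants).
flat_starts = [hap for sample in starts for hap in sample]

# Allele fields: flatten ploidy and variants -> (batch * ploidy * ~variants, ~alt_len).
flat_alts = [a for sample in alts for hap in sample for a in hap]

# tokenizer=None: uint8 ASCII values.
ascii_tokens = [list(a) for a in flat_alts]

# tokenizer="seqpro": ACGTN -> 0 1 2 3 4 (mapping taken from the text above).
lookup = {byte: i for i, byte in enumerate(b"ACGTN")}
seqpro_like = [[lookup[byte] for byte in a] for a in flat_alts]

max_n_vars = max(len(hap) for hap in flat_starts)
max_alt_len = max(len(a) for a in flat_alts)
```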

class genvarloader.RaggedIntervals[source]#

RaggedIntervals(starts: Ragged[np.int32], ends: Ragged[np.int32], values: Ragged[np.float32])

property shape#: Shape of the intervals.

to_padded(start, end, value)[source]#

Convert this RaggedIntervals to a tuple of rectilinear arrays by right-padding each entry with appropriate values.

The final axis will have the maximum length across all entries.

Return type:

tuple[ndarray[tuple[Any, ...], dtype[int32]], ndarray[tuple[Any, ...], dtype[int32]], ndarray[tuple[Any, ...], dtype[float32]]]

Parameters:

start (int)
end (int)
value (float)
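
Right-padding three parallel ragged fields to a common width can be sketched as below. Lists of lists stand in for the ragged starts, ends, and values, and pad_intervals is a hypothetical helper, not the method itself.

```python
def pad_intervals(starts, ends, values, start, end, value):
    """Right-pad each row to the longest row's length with the given fills."""
    width = max(len(row) for row in starts)

    def pad(rows, fill):
        return [row + [fill] * (width - len(row)) for row in rows]

    return pad(starts, start), pad(ends, end), pad(values, value)


starts = [[0, 5], [3], []]
ends = [[2, 9], [4], []]
values = [[1.0, 0.5], [2.0], []]
padded_starts, padded_ends, padded_values = pad_intervals(
    starts, ends, values, start=-1, end=-1, value=0.0
)
```

Choosing start and end fills that cannot be valid coordinates, such as -1, makes the padding easy to mask out downstream.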

reshape(shape)[source]#

Reshape the intervals.

Parameters:: shape (int | tuple[int, ...]) – New shape for the intervals. The total number of elements must remain the same.
Return type:: RaggedIntervals

squeeze(axis=None)[source]#

Squeeze the intervals along the specified axis.

Parameters:: axis (int | tuple[int, ...] | None, default: None) – Axis or axes to squeeze. If None, all axes of length 1 are squeezed.
Return type:: RaggedIntervals

to_fixed_shape(shape)[source]#

If all entries in the ragged array have the same shape, convert to a rectilinear shape.

Parameters:: shape (tuple[int, ...]) – Shape to convert to, including the length axis. The total number of elements must remain the same.
Return type:: tuple[ndarray[tuple[Any, ...], dtype[int32]], ndarray[tuple[Any, ...], dtype[int32]], ndarray[tuple[Any, ...], dtype[float32]]]

to_packed()[source]#

Pack all arrays into contiguous buffers.

Return type:: RaggedIntervals

prepend_pad_itv(start=-1, end=-1, value=0.0)[source]#

Prepend a pad interval so that every group is guaranteed to have at least 1 interval.

Parameters:

start (int, default: -1) – The start position to use for the pad interval
end (int, default: -1) – The end position to use for the pad interval
value (float, default: 0.0) – The value to use for the pad interval

Return type:

RaggedIntervals

Flat containers#

Returned in place of the ragged containers when a Dataset uses with_output_format("flat"). Each carries flat data/offsets buffers and a to_ragged() escape hatch back to the ragged form.
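
The data/offsets layout is the usual way to store variable-length rows in one contiguous buffer. The sketch below shows the idea with plain lists; the free function to_ragged here is a hypothetical stand-in that borrows the method's name and says nothing about the flat containers' actual attributes.

```python
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
offsets = [0, 2, 2, 6]  # row i is data[offsets[i]:offsets[i + 1]]


def to_ragged(data, offsets):
    """Split one flat buffer into variable-length rows."""
    return [data[offsets[i] : offsets[i + 1]] for i in range(len(offsets) - 1)]


def lengths(offsets):
    """Row lengths are the differences between consecutive offsets."""
    return [b - a for a, b in zip(offsets, offsets[1:])]


rows = to_ragged(data, offsets)
row_lengths = lengths(offsets)
```

Empty rows cost nothing in the data buffer; they appear only as two equal consecutive offsets.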

genvarloader.FlatRagged#: alias of _Flat

genvarloader.FlatAnnotatedHaps#: alias of _FlatAnnotatedHaps

class genvarloader.FlatIntervals[source]#

Flat-buffer analog of RaggedIntervals over three _Flat instances.

Pure-numpy (data, offsets, shape) per field; converts to RaggedIntervals only via to_ragged(). Returned by eager indexing when with_tracks(kind="intervals") is combined with with_output_format("flat").

genvarloader.FlatVariants#: alias of _FlatVariants

genvarloader.FlatAlleles#: alias of _FlatAlleles

genvarloader.FlatVariantWindows#: alias of _FlatVariantWindows

PyTorch interop#

genvarloader.to_nested_tensor(rag)[source]#

Convert a Ragged array to a PyTorch nested tensor.

Will cast byte arrays (dtype “S1”) to uint8.

Parameters:: rag (Ragged) – Ragged array to convert.
Return type:: Tensor
