API Reference#
Writing#
- genvarloader.write(path, bed, variants=None, tracks=None, samples=None, max_jitter=None, overwrite=False, max_mem='4g', extend_to_length=True)[source]#
Write a GVL dataset.
- Parameters:
bed (
str|Path|DataFrame) –BED-likefile or DataFrame of regions satisfying the BED3+ specification. Specifically, it must have columns ‘chrom’, ‘chromStart’, and ‘chromEnd’. If ‘strand’ is present, its values must be either ‘+’ or ‘-‘. Negative stranded regions will be reverse complemented during sequence and/or track reconstruction.variants (
str|Path|VCF|PGEN|SparseVar|None, default:None) – AgenorayVCF or PGEN instance (genorayis a GVL dependency so it will be import-able). All variants must be left-aligned, bi-allelic, and atomized. Multi-allelic variants can be included by splitting them into bi-allelic half-calls. For VCFs, the bcftools norm command can do all of this normalization. Likewise, see the PLINK2 documentation for PGEN files. Commands of interest include--make-bpgenfor splitting variants,--normalizefor left-aligning and atomizing overlapping variants, and--ref-from-fafor REF allele correction.tracks (
IntervalTrack|Sequence[IntervalTrack] |None, default:None) – AnIntervalTrack(e.g.BigWigs,Table) or a sequence of them. Each track must have a uniquename; the on-disk layout writes to<path>/intervals/<track.name>/.samples (
list[str] |None, default:None) – Samples to include in the datasetmax_jitter (
int|None, default:None) – Maximum jitter to add to the regionsoverwrite (
bool, default:False) – Whether to overwrite an existing datasetmax_mem (
int|str, default:'4g') – Approximate maximum total memory to use, including the genoray variant index. The reader’s index is loaded eagerly at the start ofwrite()(forVCFandPGEN) so thatnbytesreflects its true size; that value is subtracted frommax_memto determine the budget available for genotype chunking. AValueErroris raised if the remaining budget is too small to fit even a single variant chunk. Otherwisemax_memis a soft limit on overall usage and may be exceeded by a small amount.extend_to_length (
bool, default:True) – Whether to continue reading/writing variants until all haplotypes have a length at least as long as the intervals in bed. Otherwise, deletions can cause the length of haplotypes to be less than the intervals in bed. This can be disabled if having haplotypes shorter than the intervals is acceptable, in which case they will be padded with reference bases when appropriate. Disabling this also reduces the amount of data read/written and is faster to run.
Notes
The dataset directory is built atomically: all data is written to a private sibling temp directory and published via
os.replace(). A best-effortfilelockprevents redundant parallel rebuilds, but correctness relies on the atomic rename — the lock is advisory only.Out of scope:
genoray.gviindex files andpysam.fai/.gziindex files are created by those libraries and are not covered by gvl’s atomic/locked creation. Concurrent jobs that trigger index creation for those files depend on the upstream libraries’ behavior.
- genvarloader.get_splice_bed(gtf, contigs=None, transcript_support_level='1', require_multiple_of_3=True)[source]#
Process a GTF into a BED-compatible DataFrame for splicing datasets.
The result has columns
chrom,chromStart(0-based),chromEnd,strand,gene_name,transcript_id, andexon_number, sorted by chromosome (natural order) andchromStart. Pass it directly togvl.write()for splicing datasets.- Parameters:
gtf (
str|Path) – Path to a GTF file (gzipped or plain) accepted byseqpro.gtf.scan().contigs (
list[str] |None, default:None) – If provided, keep only rows whoseseqnameis in this list.transcript_support_level (
str|None, default:'1') – If a string, require the GTFtranscript_support_levelattribute to equal it.Nonedisables the filter.require_multiple_of_3 (
bool, default:True) – IfTrue, keep only transcripts whose summed CDS length is a multiple of 3.
- Return type:
DataFrame
- genvarloader.read_bedlike(path)#
Reads a bed-like (BED3+) file as a pandas DataFrame. The file type is inferred from the file extension and supports .bed, .narrowPeak, and .broadPeak.
- genvarloader.with_length(bed, length)#
Set the length of regions in a BED-like DataFrame to a fixed length by expanding or shrinking relative to the center (or peak) of the window. If the original region size + length is odd, the center will be 1 position closer the right end.
- Parameters:
- Returns:
DataFrame of the same type as the input with updated “chromStart” and “chromEnd” columns.
- Return type:
FrameT
- class genvarloader.BigWigs[source]#
Reading#
Personalized data#
- class genvarloader.Dataset[source]#
A dataset of genotypes, reference sequences, and intervals.
Note
This class is not meant to be instantiated directly. Use the
Dataset.open()method to open a dataset after writing the data withgenvarloader.write()or the GenVarLoader CLI.GVL Datasets act like a collection of lazy ragged arrays that can be lazily subset or eagerly indexed as a 2D NumPy array. They have an effective shape of
(n_regions, n_samples, [tracks], [ploidy], output_length), but only the region and sample dimensions can be indexed directly since the return value is generally a tuple of arrays.Eager indexing
dataset[0, 9] # first region, 10th sample dataset[:10] # first 10 regions and all samples dataset[:10, :5] # first 10 regions and 5 samples dataset[[2, 2], [0, 1]] # 3rd region, 1st and 2nd samples
Lazy indexing
See
Dataset.subset_to(). This is useful, for example, to create splits for training, validation, and testing, or filter out regions or samples after writing a full dataset. This is also necessary if you intend to create a PytorchDataLoaderfrom the Dataset usingDataset.to_dataloader().Return values
The return value depends on the
Datasetstate, namelysequence_type,active_tracks, andoutput_length. These can all be modified after opening aDatasetusing the following methods: -Dataset.with_seqs()-Dataset.with_tracks()-Dataset.with_len()- static open(path, reference=None, jitter=0, rng=False, deterministic=True, rc_neg=True, min_af=None, max_af=None, var_fields=None, region_names=None, splice_info=None, var_filter=None, *, svar=None)[source]#
Open a dataset from a path. If no reference genome is provided, the dataset cannot yield sequences. Will initialize the dataset such that it will return tracks and haplotypes (reference sequences if no genotypes) if possible. If tracks are available, they will be set to be returned in alphabetical order.
- Parameters:
reference (
str|Path|Reference|None, default:None) – Path to a reference genome.jitter (
int, default:0) – Amount of jitter to use, cannot be more than the maximum jitter of the dataset.rng (
int|Generator|None, default:False) – Random seed or np.random.Generator for any stochastic operations.deterministic (
bool, default:True) – Whether to use randomized or deterministic algorithms. If set to True, this will disable random shifting of longer-than-requested haplotypes.rc_neg (
bool, default:True) – Whether to reverse-complement sequences and reverse tracks on negative strands.min_af (
float|None, default:None) – The minimum allele frequency to include in the dataset. If dataset is not backed by SVAR genotypes, this will raise an error.max_af (
float|None, default:None) – The maximum allele frequency to include in the dataset. If dataset is not backed by SVAR genotypes, this will raise an error.var_fields (
list[str] |None, default:None) – The variant fields to include in the dataset. Defaults to the minimum useful set["alt", "ilen", "start"]. Pass additional field names (e.g."ref","dosage", or any info column present in the source variants table) to load them eagerly at open time. Must be a subset ofavailable_var_fields.splice_info (
str|tuple[str,str] |None, default:None) – A string or tuple of strings representing the splice information to use. If a string, it will be used as the transcript ID and the exons are expected to be in order. If a tuple of strings, the first string will be used as the transcript ID and the second string will be used as the exon number. If a dictionary, the keys will be used as the transcript ID and the values should be the row number for each exon, in order. If False, splicing will be disabled.var_filter (
Optional[Literal['exonic']], default:None) – Whether to filter variants. If set to"exonic", only exonic variants will be applied.svar (
str|Path|None, default:None) – Override the recorded SVAR location. Use when the original SVAR has moved and the dataset cannot find it via the stored relative/absolute path or by sibling discovery.region_names (str | None)
- Return type:
RaggedDataset[TypeVar(MaybeRSEQ,None,RaggedSeqs,RaggedAnnotatedHaps,RaggedVariants),TypeVar(MaybeRTRK,None,Ragged[float32],RaggedIntervals)]
- with_settings(jitter=None, rng=None, deterministic=None, rc_neg=None, min_af=None, max_af=None, var_fields=None, splice_info=None, var_filter=None)[source]#
Modify settings of the dataset, returning a new dataset without modifying the old one.
- Parameters:
jitter (
int|None, default:None) – How much jitter to use. Must be non-negative and <= themax_jitterof the dataset.rng (
int|Generator|None, default:None) – Random seed or np.random.Generator for non-deterministic operations e.g. jittering and shifting longer-than-requested haplotypes.deterministic (
bool|None, default:None) – Whether to use randomized or deterministic algorithms. If set to True, this will disable random shifting of longer-than-requested haplotypes and, for unphased variants, will enable deterministic variant assignment and always apply the highest CCF group. Note that for unphased variants, this will mean not all possible haplotypes can be returned.rc_neg (
bool|None, default:None) – Whether to reverse-complement sequences and reverse tracks on negative strands.min_af (
Union[float,Literal[False],None], default:None) – The minimum allele frequency to include in the dataset. If set toFalse, disables this filter. If dataset is not backed by SVAR genotypes, this will raise an error.max_af (
Union[float,Literal[False],None], default:None) – The maximum allele frequency to include in the dataset. If set toFalse, disables this filter. If dataset is not backed by SVAR genotypes, this will raise an error.var_fields (
list[str] |None, default:None) – The variant fields to include in the dataset.splice_info (
Union[str,tuple[str,str],Literal[False],None], default:None) – A string or tuple of strings representing the splice information to use. If a string, it will be used as the transcript ID and the exons are expected to be in order. If a tuple of strings, the first string will be used as the transcript ID and the second string will be used as the exon number. If a dictionary, the keys will be used as the transcript ID and the values should be the row number for each exon, in order. If False, splicing will be disabled.var_filter (
Optional[Literal[False,'exonic']], default:None) – Whether to filter variants. If set to"exonic", only exonic variants will be applied.
- Return type:
Self
- with_len(output_length)[source]#
Modify the output length of the dataset, returning a new dataset without modifying the old one.
- Parameters:
output_length (
Union[Literal['ragged','variable'],int]) – The output length. Can be set to"ragged"or"variable"to allow for variable length sequences. If set to an integer, all sequences will be padded or truncated to this length. See the online documentation for more information.- Return type:
- with_seqs(kind)[source]#
Return a new dataset with the specified sequence type. The sequence type can be one of the following:
"reference": reference sequences."haplotypes": personalized haplotype sequences."annotated": annotated haplotype sequences, which includes personalized haplotypes along with annotations."variants": no sequences, just variants asRaggedVariants
Annotated haplotypes are returned as the
AnnotatedHapsclass which is roughly:class AnnotatedHaps: haps: NDArray[np.bytes_] var_idxs: NDArray[np.int32] ref_coords: NDArray[np.int32]
where
hapsare the haplotypes as bytes/S1, andvar_idxsandref_coordsare arrays with the same shape ashapsthat annotate every nucleotide with the variant index and reference coordinate it corresponds to. A variant index of -1 corresponds to a reference nucleotide, and a reference coordinate of -1 corresponds to padded nucleotides that were added for regions beyond the bounds of the reference genome. i.e. if the region’s start position is negative or the end position is beyond the end of the reference genome.For example, a toy result for
chr1:1-10could be:haps: A C G T ... T T A ... var_idxs: -1 3 3 -1 ... -1 4 -1 ... ref_coords: 1 2 2 3 ... 6 7 9 ...
where variant 3 is a 1 bp
CGinsertion and variant 4 is a 1 bp deletionT-. Note that the first nucleotide of every indel maps to a reference position sincegvl.write()expects that variants are all left-aligned.Important
The
var_idxsare numbered with respect to the full set of variants even if the variants were extracted from per-chromosome VCFs/PGENs. So a variant index of 0 corresponds to the first variant across all chromosomes. Thus, if you want to map the variant index to per-chromosome VCFs/PGENs, you will need to subtract the number of variants on all other chromosomes before the variant index to get the correct variant index in the VCF/PGEN. Relevant values can be obtained by instantiating a gvl.Variants class from the VCFs/PGENs and accessing the Variants.records.contig_offsets attribute.If the Dataset’s output length is
"ragged", then annotated haplotypes will beRaggedAnnotatedHapswhere each field is a Ragged array instead of NumPy arrays.
- with_tracks(tracks=None, kind=None)[source]#
Modify which tracks to return, returning a new dataset without modifying the old one.
- with_insertion_fill(fill)[source]#
Configure how track values are filled at insertion sites.
Only meaningful when the dataset returns haplotypes and tracks (i.e. when the reconstructor is
HapsTracks). Pure-reference and pure-haplotype datasets have no insertion fill to configure.
- output_length: Literal['ragged', 'variable'] | int#
The output length. Can be set to
"ragged"or"variable"to allow for variable length sequences. If set to an integer, all sequences will be padded or truncated to this length. See the online documentation for more information.
- return_indices: bool#
Whether to return row and sample indices corresponding to the full dataset (no subsetting).
- deterministic: bool#
Whether to use randomized or deterministic algorithms. If set to
False, this will enable random shifting of longer-than-requested haplotypes and, for unphased variants, enable choosing sets of compatible variants proportional to their CCF; otherwise the dataset will always apply compatible sets with the highest CCF.
- property has_genotypes#
Whether the dataset has genotypes.
- property regions: DataFrame#
The input regions in the dataset as they were provided to
gvl.write()i.e. with all BED columns plus any extra columns that were present.
- property full_shape: tuple[int, int]#
Return the full shape of the dataset, ignoring any subsetting.
(n_rows, n_samples)
- property sequence_type: Literal['haplotypes', 'reference', 'annotated', 'variants'] | None#
The type of sequences in the dataset.
- subset_to(regions=None, samples=None)[source]#
Subset the dataset to specific regions and/or samples by index or a boolean mask. If regions or samples are not provided, the corresponding dimension will not be subset.
- Parameters:
regions (
int|integer|ndarray[tuple[Any,...],dtype[integer]] |slice|Sequence[int] |Sequence[bool] |ndarray[tuple[Any,...],dtype[bool]] |Series|str|Sequence[str] |ndarray[tuple[Any,...],dtype[str_]] |ndarray[tuple[Any,...],dtype[object_]] |None, default:None) – The regions to subset to.samples (
int|integer|ndarray[tuple[Any,...],dtype[integer]] |slice|Sequence[int] |Sequence[bool] |ndarray[tuple[Any,...],dtype[bool]] |Series|str|Sequence[str] |ndarray[tuple[Any,...],dtype[str_]] |ndarray[tuple[Any,...],dtype[object_]] |None, default:None) – The samples to subset to.
- Return type:
Self
Examples
Subsetting to the first 10 regions:
ds.subset_to(slice(10))
Subsetting to the 2nd and 4th samples:
ds.subset_to(samples=[1, 3])
Subsetting to chromosome 1, assuming it’s labeled
"chr1":r_idx = ds.regions["chrom"] == "chr1" ds.subset_to(regions=r_idx)
Subsetting to regions labeled by a column “split”, assuming “split” existed in the input regions:
r_idx = ds.regions["split"] == "train" ds.subset_to(regions=r_idx)
Subsetting to the intersection with another set of regions:
import seqpro as sp regions = gvl.read_bedlike("regions.bed") regions_pr = sp.bed.to_pyr(regions) ds_regions_pr = sp.bed.to_pyr(ds.regions.with_row_index()) r_idx = ds_regions_pr.overlap(regions_pr).df["index"].to_numpy() ds.subset_to(regions=r_idx)
- haplotype_lengths(regions=None, samples=None)[source]#
The lengths of jitter-extended haplotypes for specified regions and samples. If the dataset is not phased or not deterministic, this will return
Nonebecause the haplotypes are not guaranteed to be a consistent length due to randomness in what variants are used.Note
Currently not implemented for spliced datasets.
- Parameters:
regions (
int|integer|ndarray[tuple[Any,...],dtype[integer]] |slice|Sequence[int] |Sequence[bool] |ndarray[tuple[Any,...],dtype[bool]] |Series|None, default:None) – Regions to compute haplotype lengths for.samples (
int|integer|ndarray[tuple[Any,...],dtype[integer]] |slice|Sequence[int] |Sequence[bool] |ndarray[tuple[Any,...],dtype[bool]] |Series|str|Sequence[str] |None, default:None) – Samples to compute haplotype lengths for.
- Return type:
- n_variants(regions=None, samples=None)[source]#
The number of variants in the dataset for specified regions and samples.
- Parameters:
regions (
int|integer|ndarray[tuple[Any,...],dtype[integer]] |slice|Sequence[int] |Sequence[bool] |ndarray[tuple[Any,...],dtype[bool]] |Series|None, default:None) – Regions to compute the number of variants for.samples (
int|integer|ndarray[tuple[Any,...],dtype[integer]] |slice|Sequence[int] |Sequence[bool] |ndarray[tuple[Any,...],dtype[bool]] |Series|str|Sequence[str] |ndarray[tuple[Any,...],dtype[str_]] |ndarray[tuple[Any,...],dtype[object_]] |None, default:None) – Samples to compute the number of variants for.
- Return type:
- Returns:
Array with shape (…, ploidy). The number of variants in the dataset for the specified regions and samples. If the dataset does not have genotypes, this will return
None.
- n_intervals(regions=None, samples=None)[source]#
The number of intervals in the dataset for specified regions and samples.
- Parameters:
regions (
int|integer|ndarray[tuple[Any,...],dtype[integer]] |slice|Sequence[int] |Sequence[bool] |ndarray[tuple[Any,...],dtype[bool]] |Series|None, default:None) – Regions to compute the number of intervals for.samples (
int|integer|ndarray[tuple[Any,...],dtype[integer]] |slice|Sequence[int] |Sequence[bool] |ndarray[tuple[Any,...],dtype[bool]] |Series|str|Sequence[str] |ndarray[tuple[Any,...],dtype[str_]] |ndarray[tuple[Any,...],dtype[object_]] |None, default:None) – Samples to compute the number of intervals for.
- Return type:
- Returns:
Array with shape (…, tracks). The number of intervals in the dataset for the specified regions and samples. If the dataset does not have intervals, this will return
None.
- write_transformed_track(new_track, existing_track, transform, max_mem=1073741824, overwrite=False)[source]#
Write transformed tracks to the dataset.
- Parameters:
new_track (
str) – The name of the new track.existing_track (
str) – The name of the existing track to transform.transform (
Callable[[ndarray[tuple[Any,...],dtype[int64]],ndarray[tuple[Any,...],dtype[int64]],Ragged[float32]],Ragged[float32]]) – A function to apply to the existing track to get a new, transformed track. This will be done in chunks such that the tracks provided will not exceedmax_mem. The arguments given to the transform will be the region and sample indices as numpy arrays and the tracks themselves as aRaggedarray with shape (regions, samples). The tracks must be aRaggedarray since regions may be different lengths to accommodate indels. This function should then return the transformed tracks as aRaggedarray with the same shape and lengths.max_mem (
int, default:1073741824) – The maximum memory to use in bytes, by default 1 GiB (2**30 bytes)overwrite (
bool, default:False) – Whether to overwrite the existing track, by default False
- Return type:
- write_annot_tracks(tracks, overwrite=False)[source]#
Write annotation tracks to the dataset. Returns a new dataset with the tracks available. Activate them with
with_tracks().- Parameters:
tracks (
dict[str,str|Path|DataFrame]) –Paths to the annotation tracks (or literal tables) in BED-like format. Keys should be the track names and values should be the paths to the BED files or polars.DataFrames.
Note
Only supports BED files for now.
overwrite (
bool, default:False) – Whether to overwrite the existing tracks, by default False
- Return type:
Self
- to_torch_dataset(return_indices, transform)[source]#
Convert the dataset to a PyTorch
Dataset. Requires PyTorch to be installed.- Parameters:
return_indices (
bool) – Whether to append arrays of row and sample indices of the non-subset dataset to each batch.The transform to apply to each batch of data. The transform should take input matching the output of the dataset and can return anything that can be converted to a PyTorch tensor. In combination with indices, this allows you to combine arbitrary row- and sample-specific data with dataset output on-the-fly.
Note
Depending on how transforms are implemented, they can easily introduce a dataloading bottleneck. If you find dataloading is slow, it’s often a good idea to try disabling your transform to see if it’s impacting throughput.
- Return type:
TorchDataset
- to_dataloader(batch_size=1, shuffle=False, sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, generator=None, *, prefetch_factor=None, persistent_workers=False, pin_memory_device='', return_indices=False, transform=None, mode=None, buffer_bytes=2147483648, copy=True, heartbeat_seconds=60.0)[source]#
Convert the dataset to a PyTorch
DataLoader. The parameters are the same as aDataLoaderwith a few omissions e.g.batch_sampler. Requires PyTorch to be installed.- Parameters:
batch_size (
int, default:1) – How many samples per batch to load.shuffle (
bool, default:False) – Set to True to have the data reshuffled at every epoch.sampler (
Sampler|Iterable|None, default:None) –Defines the strategy to draw samples from the dataset. Can be any
Iterablewith__len__implemented. If specified, shuffle must not be specified.Important
Do not provide a
BatchSamplerhere. GVL Datasets use multithreading when indexed with batches of indices to avoid the overhead of multi-processing. To leverage this, GVL will automatically wrap thesamplerwith aBatchSamplerso that lists of indices are given to the GVL Dataset instead of one index at a time. See this post for more information.num_workers (
int, default:0) –How many subprocesses to use for dataloading.
0means that the data will be loaded in the main process.Tip
For GenVarLoader, it is generally best to set this to 0 or 1 since almost everything in GVL is multithreaded. However, if you are using a transform that is compute intensive and single threaded, there may be a benefit to setting this > 1.
collate_fn (
Callable|None, default:None) – Merges a list of samples to form a mini-batch of Tensor(s).pin_memory (
bool, default:False) – IfTrue, the data loader will copy Tensors into device/CUDA pinned memory before returning them. If your data elements are a custom type, or yourcollate_fnreturns a batch that is a custom type, see the example below.drop_last (
bool, default:False) – Set toTrueto drop the last incomplete batch, if the dataset size is not divisible by the batch size. IfFalseand the size of dataset is not divisible by the batch size, then the last batch will be smaller.timeout (
float, default:0) – If positive, the timeout value for collecting a batch from workers. Should always be non-negative.worker_init_fn (
Callable|None, default:None) – If notNone, this will be called on each worker subprocess with the worker id (an int in[0, num_workers - 1]) as input, after seeding and before data loading.multiprocessing_context (
Callable|None, default:None) – IfNone, the default multiprocessing context of your operating system will be used.generator (
Generator|None, default:None) – If notNone, this RNG will be used by RandomSampler to generate random indexes and multiprocessing to generatebase_seedfor workers.prefetch_factor (
int|None, default:None) – Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers. (default value depends on the set value for num_workers. If value of num_workers=0 default is None. Otherwise, if value of num_workers > 0 default is 2).persistent_workers (
bool, default:False) – IfTrue, the data loader will not shut down the worker processes after a dataset has been consumed once. This allows to maintain the workers Dataset instances alive.pin_memory_device (
str, default:'') – The device topin_memoryto ifpin_memoryisTrue.return_indices (
bool, default:False) – Whether to append arrays of row and sample indices of the non-subset dataset to each batch.transform (
Callable|None, default:None) –The transform to apply to each batch of data. The transform should take input matching the output of the dataset and can return anything that can be converted to a PyTorch tensor. In combination with indices, this allows you to combine arbitrary row- and sample-specific data with dataset output on-the-fly.
Note
Depending on how transforms are implemented, they can easily introduce a dataloading bottleneck. If you find dataloading is slow, it’s often a good idea to try disabling your transform to see if it’s impacting throughput.
mode (str | None)
buffer_bytes (int)
copy (bool)
heartbeat_seconds (float)
- Return type:
- genvarloader.get_dummy_dataset(spliced=False)[source]#
Return a dummy
Datasetwith 4 regions, 4 samples, max jitter of 2, a reference genome of all"N", genotypes, and 1 track “read-depth” where each track is[1, 2, 3, 4, 5, 6]in the reference coordinate system, where3is aligned with each region’s start coordinate. Is initialized to return ragged haplotypes and tracks with no jitter and deterministic reconstruction algorithms.- Parameters:
spliced (
bool, default:False) –If
True, the dataset will be setup for splicing with all regions moved to chromosome 1 and a splice indexer with 2 genes, “tp53” and “shh”, corresponding to regions:{ "tp53": [3, 0, 2], "shh": [1], }
Reference genome(s)#
- class genvarloader.Reference[source]#
A reference genome kept in-memory. Typically this is only instantiated to be passed to
Dataset.openand avoid data duplication.Note
Do not instantiate this class directly. Use
Reference.from_path()instead.- reference: ndarray[tuple[Any, ...], dtype[uint8]]#
The reference genome as a numpy array, with contigs concatenated.
- offsets: ndarray[tuple[Any, ...], dtype[int64]]#
(n_contigs + 1)
- Type:
The offsets of the contigs in the reference genome. Shape
- classmethod from_path(fasta, contigs=None, in_memory=True)[source]#
Load a reference genome from a FASTA file.
- Parameters:
fasta (
str|Path) – Path to a.fa/.fa.bgzFASTA file or an existing.gvlfacache directory. When a FASTA path is given, a sibling.gvlfacache is built on first use and reused on subsequent calls; a legacy.fa.gvlflat cache is automatically migrated to the new format.contigs (
list[str] |None, default:None) – List of contig names to load. If None, all contigs in the FASTA file are loaded. Can be either UCSC or Ensembl style (i.e. with or without the “chr” prefix) and will be handled appropriately to match the underlying FASTA.in_memory (
bool, default:True) – Whether to load the reference genome into memory. If True, the reference genome is loaded into memory. If False, the reference genome is read on-demand from a memory mapped array. This will still be much faster than reading from FASTA but slower than keeping it in memory. This is useful if you need to work with many reference genomes or have very limited RAM.
- class genvarloader.RefDataset[source]#
A reference dataset for pulling out sequences from a reference genome.
When
splice_infois provided, the dataset returns per-transcript concatenated reference sequence, with one row per splice group instead of one row per BED region. Same semantics asDataset.open(splice_info=...).- full_bed: DataFrame#
A table of regions to extract from the reference genome. The table must have the following columns: - chrom: The name of the contig (e.g. “chr1”, “chr2”, etc.) - chromStart: The start position of the region (0-based). - chromEnd: The end position of the region (0-based). A strand column can also be included, in which case the regions will be reverse complemented if the strand is -1 and the rc_neg parameter is set to True.
- output_length: Literal['ragged', 'variable'] | int#
The output length of the dataset. Same meaning as
Dataset.output_length.
- deterministic: bool#
If true, fixed length sequences will be right truncated from their full length to the output length. If false, fixed length sequences will be randomly shifted to be within the output length.
- splice_info: str | tuple[str, str] | None#
If set, the dataset is spliced. Either the column name with rows already in splice order or a (group_col, sort_col) pair applied against
full_bed.
- property spliced_regions: DataFrame#
The spliced BED, subset to the current row subset.
- subset_to(regions)[source]#
Subset the dataset to a subset of regions (or transcripts, when spliced).
- __init__(reference, full_bed, jitter=0, output_length='ragged', deterministic=True, rc_neg=True, seed=None, region_names=None, splice_info=None)#
- to_torch_dataset(return_indices=False, transform=None)[source]#
Convert the dataset to a PyTorch dataset.
- Parameters:
return_indices (
bool, default:False) – If True, the dataset will return the indices of the regions in the reference genome.transform (
Callable|None, default:None) – A function to transform the data. Should accept a numpy array of S1 with shape (batch_size, length). If return_indices is true, the function should accept a tuple of (sequences, indices).
- Return type:
TorchDataset
- to_dataloader(batch_size=1, shuffle=False, sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, generator=None, *, prefetch_factor=None, persistent_workers=False, pin_memory_device='', return_indices=False, transform=None)[source]#
Convert the dataset to a PyTorch
DataLoader. The parameters are the same as aDataLoaderwith a few omissions e.g.batch_sampler. Requires PyTorch to be installed.- Parameters:
batch_size (
int, default:1) – How many samples per batch to load.shuffle (
bool, default:False) – Set to True to have the data reshuffled at every epoch.sampler (
Sampler|Iterable|None, default:None) –Defines the strategy to draw samples from the dataset. Can be any
Iterablewith__len__implemented. If specified, shuffle must not be specified.Important
Do not provide a
BatchSamplerhere. GVL Datasets use multithreading when indexed with batches of indices to avoid the overhead of multi-processing. To leverage this, GVL will automatically wrap thesamplerwith aBatchSamplerso that lists of indices are given to the GVL Dataset instead of one index at a time. See this post for more information.num_workers (
int, default:0) –How many subprocesses to use for dataloading.
0means that the data will be loaded in the main process.Tip
For GenVarLoader, it is generally best to set this to 0 or 1 since almost everything in GVL is multithreaded. However, if you are using a transform that is compute intensive and single threaded, there may be a benefit to setting this > 1.
collate_fn (
Callable|None, default:None) – Merges a list of samples to form a mini-batch of Tensor(s).pin_memory (
bool, default:False) – IfTrue, the data loader will copy Tensors into device/CUDA pinned memory before returning them. If your data elements are a custom type, or yourcollate_fnreturns a batch that is a custom type, see the example below.drop_last (
bool, default:False) – Set toTrueto drop the last incomplete batch, if the dataset size is not divisible by the batch size. IfFalseand the size of dataset is not divisible by the batch size, then the last batch will be smaller.timeout (
float, default:0) – If positive, the timeout value for collecting a batch from workers. Should always be non-negative.worker_init_fn (
Callable|None, default:None) – If notNone, this will be called on each worker subprocess with the worker id (an int in[0, num_workers - 1]) as input, after seeding and before data loading.multiprocessing_context (
Callable|None, default:None) – IfNone, the default multiprocessing context of your operating system will be used.generator (
Generator|None, default:None) – If notNone, this RNG will be used by RandomSampler to generate random indexes and multiprocessing to generatebase_seedfor workers.prefetch_factor (
int|None, default:None) – Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers. (default value depends on the set value for num_workers. If value of num_workers=0 default is None. Otherwise, if value of num_workers > 0 default is 2).persistent_workers (
bool, default:False) – IfTrue, the data loader will not shut down the worker processes after a dataset has been consumed once. This allows to maintain the workers Dataset instances alive.pin_memory_device (
str, default:'') – The device topin_memoryto ifpin_memoryisTrue.return_indices (
bool, default:False) – If True, the dataset will return the indices of the regions in the reference genome.transform (
Callable|None, default:None) – A function to transform the data. Should accept a numpy array of S1 with shape (batch_size, length). If return_indices is true, the function should accept a tuple of (sequences, indices).
- Return type:
Non-personal/site-only variants#
- class genvarloader.DatasetWithSites[source]#
- __init__(dataset, sites, max_variants_per_region=1)[source]#
Dataset with variant sites, used to apply site-only variants e.g. from ClinVar to a Dataset of haplotypes. Currently only supports bi-allelic SNPs. Takes the intersection of the dataset regions and the sites, and applies the site-only variants to the Dataset’s haplotypes.
Accessed just like a Dataset, but where the rows are combinations of dataset regions and sites. Will return two
AnnotatedHapswith variants applied and flags indicating whether the variant was applied, deleted, or existed. The flags are 0 for applied, 1 for deleted, and 2 for existed. If the dataset has tracks, they will be returned as well and reflect any site-only variants. The firstAnnotatedHapsis the wildtype haplotypes and the second is the mutated haplotypes. The mutant haplotypes will also have their variant indices and reference coordinates updated to reflect the applied variants. Locations where a site-only variant was applied will have a variant index of -2.- Parameters:
dataset (
ArrayDataset[TypeVar(SEQ,ndarray[tuple[Any,...],dtype[bytes_]],AnnotatedHaps,RaggedVariants),TypeVar(MaybeTRK,None,ndarray[tuple[Any,...],dtype[float32]],RaggedIntervals)]) – Dataset of haplotypes and potentially tracks.sites (
DataFrame) – Table of variant site information.max_variants_per_region (
int, default:1) – Maximum number of variants per region. Currently only 1 is supported.
Examples
import genvarloader as gvl sites = gvl.sites_vcf_to_table("path/to/variants.vcf") ds = gvl.Dataset.open("path/to/dataset.gvl", "path/to/reference.fasta") ds_sites = gvl.DatasetWithSites(ds, sites) wt_haps, mut_haps, flags = ds_sites[0, 0] # flags is a np.uint8 (or an array of np.uint8 when accessing multiple rows/samples) ds_sites.dataset = ds_sites.dataset.with_tracks("read-depth") wt_haps, mut_haps, flags, tracks = ds_sites[0, 0]
- sites: DataFrame#
Table of variant site information.
- dataset: ArrayDataset[AnnotatedHaps, MaybeTRK]#
Dataset of haplotypes and potentially tracks.
- rows: DataFrame#
Rows of this object, where each row is a combination of a dataset region and a site.
- genvarloader.sites_vcf_to_table(vcf, attributes=None, info_fields=None)[source]#
Extract a table of variant site info from a VCF. All sites must be bi-allelic.
- Parameters:
vcf (
str|Path|VCF) – Path to a VCF or agenoray.VCFinstance. Note thatgenoray.VCFcan accept a filter function.attributes (
list[str] |None, default:None) – A list of attributes to include in the output table. Note that “CHROM”, “POS”, “REF”, and “ALT” are always included even if not in this list.info_fields (
list[str] |None, default:None) – A list of INFO fields to include in the output table.
- Return type:
DataFrame
Data registry#
- genvarloader.data_registry.fetch(name)[source]#
Download and cache data for constructing/opening a GVL dataset. Files are cached in the user’s home directory under
~/.cache/genvarloader.- Parameters:
name (
Literal['geuvadis_ebi','1kgp']) –The name of the dataset to fetch. Can be one of:
”geuvadis_ebi”: Geuvadis data for the original analyses by Lappalainen et al. 2013. Phased, normalized, and split into biallelic variants.
”1kgp”: 1000 Genomes Project, all 3,202 individuals. Phased, normalized, and split into biallelic variants.
- Return type:
- Returns:
A dictionary of paths to the fetched data.
Containers#
Classes that GVL Datasets may return.
- class genvarloader.AnnotatedHaps[source]#
AnnotatedHaps(haps: ‘NDArray[np.bytes_]’, var_idxs: ‘NDArray[np.int32]’, ref_coords: ‘NDArray[np.int32]’)
- var_idxs: ndarray[tuple[Any, ...], dtype[int32]]#
Variant indices for each position in the haplotypes. A value of -1 indicates no variant was applied at the position.
- ref_coords: ndarray[tuple[Any, ...], dtype[int32]]#
Reference coordinates for each position in haplotypes.
- property shape#
Shape of the haplotypes and all annotations.
- class genvarloader.Ragged[source]#
An awkward array with exactly 1 ragged dimension. The ragged dimension is None in its shape tuple.
- !!! warning
Ragged arrays only support a subset of Awkward array features.
Strings are not supported since ASCII is sufficient for the bioinformatics domain.
Bytestrings count as a ragged dimension, and we break from the Awkward convention to not include a “var” in the type string.
Record-layout Ragged arrays (produced by ak.zip of Ragged inputs or by passing a record-layout ak.Array) return field-keyed dicts from dtype, data, and parts. Use rag[“field”] for zero-copy single-field access. view, apply, and to_numpy are not defined on record layouts; access individual fields. Union types remain unsupported.
- static from_offsets(data, shape, offsets)[source]#
Create a Ragged array from data, offsets, and shape.
- Parameters:
data (
ndarray[tuple[Any,...],dtype[TypeVar(DTYPE_co, bound=number|bytes_|datetime64|timedelta64|bool|void, covariant=True)]]) – The data to create the Ragged array from.shape (
tuple[int|None,...]) – The shape of the Ragged array.offsets (
ndarray[tuple[Any,...],dtype[int64]]) – The offsets to create the Ragged array from.
- Return type:
Ragged[DTYPE_co]
- static from_lengths(data, lengths)[source]#
Create a Ragged array from data and lengths.
- Parameters:
- Return type:
Ragged[DTYPE_co]
- parts#
The parts of the Ragged array. For record layouts, a dict of field name -> RagParts; all share the same offsets ndarray.
- data#
The data of the Ragged array. For record layouts, a dict of field name -> zero-copy ndarray view, in awkward field order.
- property offsets: ndarray[tuple[Any, ...], dtype[int64]]#
The offsets of the Ragged array. May have shape (n_ragged + 1) or (2, n_ragged).
- Return type:
NDArray[np.int64]
- property shape: tuple[int | None, ...]#
The shape of the Ragged array. The ragged dimension is None.
- Return type:
tuple[int | None,]
- property dtype: dtype[RDTYPE_co]#
The dtype of the Ragged array.
For non-record layouts, returns the numpy dtype of the flat data buffer (e.g.
np.dtype('int32')).For record layouts, returns a numpy structured dtype whose field names and per-field dtypes match the Ragged record fields — for example:
np.dtype([("seq", "S1"), ("score", "f4")])
Note
Memory layout is SoA, not AoS. A numpy structured dtype normally implies Array-of-Structs packing, but here each field lives in its own contiguous buffer (Structure of Arrays). The structured dtype is used purely as a convenient, numpy-compatible descriptor: it carries all field/dtype information in a single object without inventing a new type.
- Return type:
np.dtype[RDTYPE_co]
- property lengths: ndarray[tuple[Any, ...], dtype[integer]]#
The lengths of the segments.
- Return type:
NDArray[np.integer]
- view(dtype)[source]#
Return a view of the data with the given dtype.
- Parameters:
dtype (
type[TypeVar(DTYPE_co, bound=number|bytes_|datetime64|timedelta64|bool|void, covariant=True)] |str) – Target dtype.- Returns:
Zero-copy view with reinterpreted dtype.
- Return type:
Ragged[DTYPE_co]
- classmethod empty(shape, dtype)[source]#
Create an empty Ragged array with the given shape and dtype.
- property is_base: bool#
Whether the Ragged array is a base array (owns its data, contiguous, no offset).
- Return type:
- to_numpy(allow_missing=False)[source]#
Convert to a dense NumPy array. Not zero-copy if offsets or data are non-contiguous.
- Parameters:
allow_missing (
bool, default:False) – Passed through to ak.Array.to_numpy.- Return type:
NDArray[RDTYPE_co]
- to_packed(*, copy=True)[source]#
Pack into a fresh contiguous, zero-based Ragged (1-D offsets).
Numba-parallelized replacement for
Ragged(ak.to_packed(self)). Seeseqpro.rag.to_packed()for thecopysemantics.- Parameters:
copy (
bool, default:True) – WhenTrue(default), return a freshly allocated owned array. WhenFalse, return zero-copy if already packed, else raise.- Return type:
Ragged[RDTYPE_co]
- squeeze(axis=None)[source]#
Squeeze the ragged array along the given non-ragged axis. If squeezing would result in a 1D array, return the data as a numpy array. For record layouts, dispatches per-field; if fields collapse to 1D ndarrays, returns a dict of ndarrays, otherwise returns a record Ragged.
- class genvarloader.RaggedAnnotatedHaps[source]#
Ragged version of
AnnotatedHaps.- var_idxs: Ragged[int32]#
Variant indices for each position in the haplotypes. A value of -1 indicates no variant was applied at the position.
- property shape#
Shape of the haplotypes and all annotations.
- to_padded()[source]#
Convert this Ragged array to a rectilinear array by right-padding each entry with appropriate values. The final axis will have the maximum length across all entries.
- Return type:
- reshape(shape)[source]#
Reshape the haplotypes and all annotations.
- class genvarloader.RaggedVariants[source]#
An awkward record array with shape
(batch, ploidy, ~variants, [~length]). Guaranteed to at least have the field"alt"and"start"and one of"ref"or"ilen".- classmethod from_ak(arr)[source]#
Create a RaggedVariants object from an awkward array.
- Parameters:
arr (
Array) – The awkward array to create a RaggedVariants object from.- Return type:
- property alt: Array#
Alternative alleles.
- squeeze(axis=None, **kwargs)[source]#
Squeeze first axis.
- Return type:
Self- Parameters:
axis (int | None)
- infer_germline_ccfs_(ccf_field='dosages', max_ccf=1.0)[source]#
Infer germline CCFs in-place.
Germline variants are identified by having missing CCFs i.e. they have a variant index but missing CCFs. Missing CCFs are inferred to be
max_ccf- sum(overlapping CCFs).
- to_packed()[source]#
Pack all fields into contiguous, zero-based arrays.
Replaces the previous
ak.to_packed()call with field-wise packing: seqproto_packed()for numericRaggedfields, and an allele-level seqpro pack + group-offset rebase +_build_allele_layout()rebuild for the doubly-nestedalt/reffields.- Return type:
Self
- rc_(to_rc=None)[source]#
Reverse complement the alleles. This is an in-place operation.
- Parameters:
to_rc (
ndarray[tuple[Any,...],dtype[bool]] |None, default:None) – A boolean mask of the same shape as the variant dimension. IfTrue, the alternative allele will be reverse complemented. IfNone, will reverse complement all alternative alleles.- Return type:
Self- Returns:
The RaggedVariants object with the alleles reverse complemented.
- to_nested_tensor_batch(device='cpu', tokenizer=None)[source]#
Convert a RaggedVariants object to a dictionary of nested tensors. Will flatten across the ploidy dimension for attributes ILEN, starts, and dosages such that their shapes are (batch * ploidy, ~variants). For the alternative alleles, will flatten across both the ploidy and variant dimensions such that the shape is (batch * ploidy * ~variants, ~alt_len).
Important
This function assumes all variant data is packed (see
ak.to_packed()).- Parameters:
device (
str|device, default:'cpu') – The device to move the tensors to.tokenizer (
Union[Literal['seqpro'],Callable[[ndarray[tuple[Any,...],dtype[bytes_]]],ndarray[tuple[Any,...],dtype[integer]]],None], default:None) –The tokenizer to use for the alternative alleles.
If
"seqpro", will useseqpro.tokenize()to convertACGTN -> 0 1 2 3 4.If
None, will use the integer ASCII value of each character i.e.ACGTN -> 65 67 71 84 78.Otherwise, will use the provided callable to convert the alternative alleles to a tensor of integers.
- Return type:
- Returns:
Dictionary of nested tensors and integers with the following keys:
"alts"with shape(batch * ploidy * ~variants, ~alt_len)"ilens"with shape(batch * ploidy, ~variants)"starts"with shape(batch * ploidy, ~variants)"dosages"with shape(batch * ploidy, ~variants)"max_n_vars": int, maximum number of variants"max_alt_len": int, maximum length of an alternative allele"max_ref_len": int, maximum length of a reference allele
- pad(allele=b'N', ilen=0, start=-1, dosage=0.0, **pad_values)[source]#
Append a pad variant so that every group is guaranteed to have at least 1 variant. If the group has variants, no variant is appended.
- Parameters:
allele (
str|bytes, default:b'N') – The allele to use for ALTs and REFsilen (
int, default:0)start (
int, default:-1) – The start position to use for the pad variantdosage (
float, default:0.0) – The dosage to use for the pad variant**pad_values (
Any) – Additional values to use for each field. Raises a ValueError if any field does not have a pad value.
- Return type:
Self- Returns:
The RaggedVariants object with the pad variant appended to each group that has no variants.
- class genvarloader.RaggedIntervals[source]#
RaggedIntervals(starts: ‘Ragged[np.int32]’, ends: ‘Ragged[np.int32]’, values: ‘Ragged[np.float32]’)
- property shape#
Shape of the haplotypes and all annotations.
- to_padded(start, end, value)[source]#
Convert this RaggedIntervals to a tuple of rectilinear arrays by right-padding each entry with appropriate values. The final axis will have the maximum length across all entries.
- reshape(shape)[source]#
Reshape the haplotypes and all annotations.
- to_fixed_shape(shape)[source]#
If all entries in the ragged array have the same shape, convert to a rectilinear shape.
- to_packed()[source]#
Apply
ak.to_packed()to all arrays.- Return type: