GenVarLoader#

Features#

GenVarLoader provides a fast, memory-efficient data structure for training sequence models on genetic variation. For example, it can be used to train a DNA language model on human genetic variation (e.g. Dalla-Torre et al.) or to train sequence-to-function models with genetic variation (e.g. Celaj et al., Drusinsky et al., He et al., and Rastogi et al.).

Avoid writing any sequences to disk (can save >2,000x storage vs. writing personalized genomes with bcftools consensus)
Generate haplotypes up to 1,000 times faster than reading a FASTA file
Generate tracks up to 450 times faster than reading a BigWig
Support indels and re-align tracks to haplotypes that contain them
Extensible to new file formats: drop a feature request! Currently supports VCF, PGEN, and BigWig

See our preprint for benchmarking and implementation details.

Installation#

pip install genvarloader

PyTorch is not installed as a dependency because its installation may require special instructions; install it separately to use PyTorch DataLoaders.

Quick Start#

Write a `gvl.Dataset`#

import genvarloader as gvl

gvl.write(
    path="cool_dataset.gvl",
    bed="interesting_regions.bed",
    variants="cool_variants.vcf",
    tracks=gvl.BigWigs.from_table("bigwig", "samples_to_bigwigs.csv"),
    max_jitter=128,
)

Here, `samples_to_bigwigs.csv` is a table with the columns `sample` and `path`, which map each sample to its BigWig file.
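
As a minimal sketch, a table with this layout could be written with Python's standard library. The sample names and BigWig paths below are placeholders for illustration, not files that ship with GenVarLoader.

```python
import csv

# Hypothetical sample names and BigWig paths; replace with your own.
sample_to_bigwig = {
    "David": "bigwigs/David.bw",
    "Aaron": "bigwigs/Aaron.bw",
}

with open("samples_to_bigwigs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sample", "path"])  # header: the two expected columns
    writer.writerows(sample_to_bigwig.items())  # one row per sample
```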

Open a `gvl.Dataset` and get a PyTorch DataLoader#

import genvarloader as gvl

dataset = gvl.Dataset.open(path="cool_dataset.gvl", reference="hg38.fa")
train_samples = ["David", "Aaron"]
train_dataset = dataset.subset_to(regions="train_regions.bed", samples=train_samples)
train_dataloader = train_dataset.to_dataloader(
    batch_size=32, shuffle=True, num_workers=1
)

# use it in your training loop
for haplotypes, tracks in train_dataloader:
    ...

Inspect specific instances#

dataset[0, 9]  # first region, 10th sample
dataset[:10, 4]  # first 10 regions, 5th sample
dataset[:10, :5]  # first 10 regions and first 5 samples

Transform the data on-the-fly#

import seqpro as sp
from einops import rearrange


def transform(haplotypes, tracks):
    ohe = sp.DNA.ohe(haplotypes)
    ohe = rearrange(ohe, "... length alphabet -> ... alphabet length")
    return ohe, tracks


transformed_dataset = dataset.with_settings(transform=transform)
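
To make the shapes concrete, here is a dependency-free sketch of what a transform like the one above produces for a single sequence. It assumes the channel order A, C, G, T and encodes any other character (such as N) as all zeros; seqpro's actual conventions may differ, and the helper names `one_hot` and `to_channels_first` are hypothetical.

```python
ALPHABET = "ACGT"  # assumed channel order, for illustration only


def one_hot(seq):
    """Encode a DNA string as a (length, alphabet) nested list of 0/1."""
    return [[int(base == letter) for letter in ALPHABET] for base in seq]


def to_channels_first(ohe):
    """Swap the last two axes: (length, alphabet) -> (alphabet, length)."""
    return [list(channel) for channel in zip(*ohe)]


ohe = one_hot("ACGTN")
channels_first = to_channels_first(ohe)
print(len(ohe), len(ohe[0]))  # 5 4
print(len(channels_first), len(channels_first[0]))  # 4 5
```

The channels-first layout (alphabet, length) is the one that PyTorch 1D convolution layers expect, which is why the transform swaps the last two axes.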

Performance tips#

GenVarLoader uses multithreading extensively, so it’s best to use 0 or 1 workers with your DataLoader.
A GenVarLoader Dataset is most efficient when given batches of indices, rather than one at a time. By default, DataLoaders use one index at a time, so if you want to use a custom Sampler you should wrap it with a BatchSampler before passing it to Dataset.to_dataloader().
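
The difference between sampling single indices and sampling batches can be illustrated without PyTorch. In this sketch, `batched` is a hypothetical helper that groups a stream of single indices into lists, which is conceptually what `torch.utils.data.BatchSampler` does for a sampler; it is not GenVarLoader or PyTorch code.

```python
import random
from itertools import islice


def batched(indices, batch_size):
    """Group an iterable of single indices into lists of up to batch_size."""
    iterator = iter(indices)
    while batch := list(islice(iterator, batch_size)):
        yield batch


# A stand-in for a custom sampler: dataset indices in a seeded random order.
rng = random.Random(0)
order = list(range(10))
rng.shuffle(order)

batches = list(batched(order, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

With PyTorch, the equivalent wrapping is `BatchSampler(sampler, batch_size, drop_last)`, so that each batch of indices is requested from the dataset in a single call rather than one index at a time.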
