GenVarLoader#

Features#

GenVarLoader provides a fast, memory-efficient data structure for training sequence models on genetic variation. For example, it can be used to train a DNA language model on human genetic variation (e.g. Dalla-Torre et al.) or to train sequence-to-function models with genetic variation (e.g. Celaj et al., Drusinsky et al., He et al., and Rastogi et al.).

  • Avoid writing any sequences to disk (can save >2,000x storage vs. writing personalized genomes with bcftools consensus)

  • Generate haplotypes up to 1,000 times faster than reading a FASTA file

  • Generate tracks up to 450 times faster than reading a BigWig

  • Supports indels and re-aligns tracks to haplotypes that have them

  • Extensible to new file formats: currently supports VCF, PGEN, and BigWig, so drop a feature request if you need another

See our preprint for benchmarking and implementation details.

Installation#

pip install genvarloader

PyTorch is not included as a dependency because installing it may require special instructions. Install it separately if you plan to use PyTorch DataLoaders.

Quick Start#

Write a gvl.Dataset#

import genvarloader as gvl

gvl.write(
    path="cool_dataset.gvl",
    bed="interesting_regions.bed",
    variants="cool_variants.vcf",
    tracks=gvl.BigWigs.from_table("bigwig", "samples_to_bigwigs.csv"),
    max_jitter=128,
)

Here, samples_to_bigwigs.csv has two columns, sample and path, which map each sample to its BigWig file.
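As a minimal sketch of how such a table could be produced, the snippet below writes a two-column CSV with Python's standard library. The sample names are borrowed from the example further down, and the BigWig paths are hypothetical placeholders, not files shipped with GenVarLoader.

```python
import csv
from pathlib import Path

# Hypothetical sample names and BigWig paths, for illustration only.
sample_to_bigwig = {
    "David": "bigwigs/David.bw",
    "Aaron": "bigwigs/Aaron.bw",
}

table_path = Path("samples_to_bigwigs.csv")
with table_path.open("w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sample", "path"])  # the two column names described above
    writer.writerows(sample_to_bigwig.items())  # one row per sample

print(table_path.read_text())
```

Each row pairs one sample with one BigWig path.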

Open a gvl.Dataset and get a PyTorch DataLoader#

import genvarloader as gvl

dataset = gvl.Dataset.open(path="cool_dataset.gvl", reference="hg38.fa")
train_samples = ["David", "Aaron"]
train_dataset = dataset.subset_to(regions="train_regions.bed", samples=train_samples)
train_dataloader = train_dataset.to_dataloader(
    batch_size=32, shuffle=True, num_workers=1
)

# use it in your training loop
for haplotypes, tracks in train_dataloader:
    ...

Inspect specific instances#

dataset[0, 9]  # first region, 10th sample
dataset[:10, 4]  # first 10 regions, 5th sample
dataset[:10, :5]  # first 10 regions and first 5 samples

Transform the data on-the-fly#

import seqpro as sp
from einops import rearrange


def transform(haplotypes, tracks):
    ohe = sp.DNA.ohe(haplotypes)
    ohe = rearrange(ohe, "... length alphabet -> ... alphabet length")
    return ohe, tracks


transformed_dataset = dataset.with_settings(transform=transform)

Performance tips#

  • GenVarLoader uses multithreading extensively, so it’s best to use 0 or 1 workers with your DataLoader.

  • A GenVarLoader Dataset is most efficient when it receives batches of indices rather than one index at a time. By default, a DataLoader requests one index at a time, so if you want to use a custom Sampler, wrap it in a BatchSampler before passing it to Dataset.to_dataloader().
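To illustrate what "batches of indices" means in the tip above, here is a pure-Python sketch that groups the output of a custom index order into lists, roughly the way a BatchSampler does. The helper batch_indices and the example index order are hypothetical and are not part of GenVarLoader or PyTorch. In real code you would likely use torch.utils.data.BatchSampler instead.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def batch_indices(indices: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Group a stream of indices into lists of at most `batch_size` (hypothetical helper)."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    it = iter(indices)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:  # the index stream is exhausted
            return
        yield batch


# A stand-in for a custom sampler: every other index, in order.
custom_order = range(0, 20, 2)
batches = list(batch_indices(custom_order, batch_size=4))

for batch in batches:
    print(batch)
```

Each yielded list is the kind of object the Dataset prefers to receive in a single call, instead of one integer per call.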