GenVarLoader#
Features#
GenVarLoader provides a fast, memory-efficient data structure for training sequence models on genetic variation. For example, it can be used to train a DNA language model on human genetic variation (e.g. Dalla-Torre et al.) or to train sequence-to-function models with genetic variation (e.g. Celaj et al., Drusinsky et al., He et al., and Rastogi et al.).
Avoid writing any sequences to disk (can save >2,000x storage vs. writing personalized genomes with bcftools consensus)
Generate haplotypes up to 1,000 times faster than reading a FASTA file
Generate tracks up to 450 times faster than reading a BigWig
Support indels and re-align tracks to haplotypes that contain them
Extensible to new file formats: drop a feature request! Currently supports VCF, PGEN, and BigWig
See our preprint for benchmarking and implementation details.
Installation#
pip install genvarloader
PyTorch is not included as a dependency, since installing it may require special instructions.
Quick Start#
Write a gvl.Dataset#
import genvarloader as gvl
gvl.write(
    path="cool_dataset.gvl",
    bed="interesting_regions.bed",
    variants="cool_variants.vcf",
    tracks=gvl.BigWigs.from_table("bigwig", "samples_to_bigwigs.csv"),
    max_jitter=128,
)
Here, samples_to_bigwigs.csv has the columns sample and path, mapping each sample to its BigWig file.
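As a minimal sketch, a table with those two columns could be produced with Python's standard csv module. The sample names and BigWig paths below are hypothetical placeholders, not files that ship with GenVarLoader.

```python
import csv

# Hypothetical sample names and BigWig paths, for illustration only.
sample_to_bigwig = {
    "David": "bigwigs/David.bw",
    "Aaron": "bigwigs/Aaron.bw",
}

# Write one row per sample under the "sample" and "path" column headers.
with open("samples_to_bigwigs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["sample", "path"])
    writer.writerows(sample_to_bigwig.items())
```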
Open a gvl.Dataset and get a PyTorch DataLoader#
import genvarloader as gvl
dataset = gvl.Dataset.open(path="cool_dataset.gvl", reference="hg38.fa")
train_samples = ["David", "Aaron"]
train_dataset = dataset.subset_to(regions="train_regions.bed", samples=train_samples)
train_dataloader = train_dataset.to_dataloader(
    batch_size=32, shuffle=True, num_workers=1
)
# use it in your training loop
for haplotypes, tracks in train_dataloader:
    ...
Inspect specific instances#
dataset[0, 9] # first region, 10th sample
dataset[:10, 4] # first 10 regions, 5th sample
dataset[:10, :5] # first 10 regions and first 5 samples
Transform the data on-the-fly#
import seqpro as sp
from einops import rearrange
def transform(haplotypes, tracks):
    ohe = sp.DNA.ohe(haplotypes)
    ohe = rearrange(ohe, "... length alphabet -> ... alphabet length")
    return ohe, tracks
transformed_dataset = dataset.with_settings(transform=transform)
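To make the shape change in this transform concrete, here is a NumPy-only sketch of the same two steps: one-hot encoding, then moving the alphabet axis in front of the length axis. It is an illustration, not GenVarLoader's or seqpro's implementation. The ACGT alphabet order, the (batch, ploidy, length) layout, and the helper name one_hot are all assumptions.

```python
import numpy as np

# Assumed alphabet order; stored as ASCII codes so comparison is integer-based.
ALPHABET = np.frombuffer(b"ACGT", dtype=np.uint8)


def one_hot(seqs: np.ndarray) -> np.ndarray:
    """One-hot encode ASCII-coded sequences of shape (..., length)."""
    # Broadcasting appends an alphabet axis: (..., length, alphabet).
    return (seqs[..., None] == ALPHABET).astype(np.uint8)


# Hypothetical batch: 1 instance, 2 haplotypes, 5 nucleotides each.
haps = np.frombuffer(b"ACGTATGCAT", dtype=np.uint8).reshape(1, 2, 5)

ohe = one_hot(haps)                             # (batch, ploidy, length, alphabet)
ohe_channels_first = np.moveaxis(ohe, -1, -2)   # (batch, ploidy, alphabet, length)
```

Putting the alphabet axis before the length axis gives the channels-first layout that 1D convolution layers in PyTorch expect.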
Performance tips#
GenVarLoader uses multithreading extensively, so it’s best to use 0 or 1 workers with your DataLoader.
A GenVarLoader Dataset is most efficient when given batches of indices, rather than one at a time. By default, DataLoaders use one index at a time, so if you want to use a custom Sampler you should wrap it with a BatchSampler before passing it to Dataset.to_dataloader().
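A minimal sketch of that wrapping, using PyTorch's own BatchSampler. SequentialSampler stands in for a custom Sampler, and the instance count is hypothetical. This page does not show how a sampler is passed to Dataset.to_dataloader(), so the sampler= keyword in the final comment is an assumption.

```python
from torch.utils.data import BatchSampler, SequentialSampler

n_instances = 10  # hypothetical number of dataset instances

# Stand-in for a custom Sampler that yields one index at a time.
base_sampler = SequentialSampler(range(n_instances))

# Wrap it so that each iteration yields a list of indices instead.
batch_sampler = BatchSampler(base_sampler, batch_size=4, drop_last=False)

batches = list(batch_sampler)

# Assumption: the wrapped sampler is then given to the dataset, e.g.
# train_dataloader = train_dataset.to_dataloader(sampler=batch_sampler)
```

With drop_last=False the final, smaller batch is kept, so no instance is skipped.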