How to write a Dataset#

Depending on your needs, you’ll potentially need to normalize variants and prepare a mapping from sample names to BigWig files before you can write a Dataset.

Preparing input data#

Normalizing variants#

Before passing variants to GenVarLoader, they must be:

  • left-aligned

  • bi-allelic

  • atomic (no MNPs or compound MNP-indels)

  • for VCFs, indexed with bcftools index or tabix

In general, any VCF can be preprocessed to meet these requirements using bcftools norm, for example:

bcftools norm -f $reference \
    -a --atom-overlaps . \
    -m -any --multi-overlaps . \
    -O b -o $out $in

Alternatively, if your data is already in the PGEN format you can use PLINK 2.0:

plink2 --make-bpgen --pfile $input --out $intermediate
plink2 --make-pgen --normalize --ref-from-fa --fa reference.fa --bpfile $intermediate --out $normalized

See the PLINK 2.0 documentation for details.

BigWig table#

To process BigWig data, GenVarLoader needs a mapping from sample names to BigWig file paths. This can be provided either as a dictionary using BigWig() or a table using BigWig.from_table(). If using a table, it must at least contain the columns sample and path, for example:

sample

path

Aang

/data/bigwigs/aang.bw

Katara

/data/bigwigs/katara.bw

Sokka

/data/bigwigs/sokka.bw

Regions of interest (ROIs)#

Whether you’re working with variants, BigWigs, or both, you will need regions of interest specified in a BED-like table. This means your table must have the columns chrom, chromStart, and chromEnd as 0-based coordinates. You can also include a strand column consisting of +, -, and . (treated as +) which will let you choose to reverse and/or reverse complement data on negative strands. Any extra columns will be kept in the table and exist in Dataset.regions after writing. An example of a BED-like table:

split

chrom

chromStart

chromEnd

strand

train

chr1

0

5

-

val

chr2

3

8

+

test

chr3

20

21

.

Writing data#

Once your data is prepared, you can use gvl.write() to convert the data for regions of interest to a format that gvl.Dataset can open. You can include a single source of variants and/or any number of BigWig files. Some examples:

import genvarloader as gvl

gvl.write(
    path="1000_genomes_haplotypes.gvl",
    bed="tiling_windows.bed",
    variants="all_chroms.bcf",
    # OR variants='all_chroms.pgen',
)

This dataset would have haplotypes available for all samples in all_chroms.bcf.

gvl.write(
    path="1000_genomes_lncRNA.gvl",
    bed="lncRNA.bed",  # can be varying length regions
    variants="all_chroms.bcf",
    tracks=[
        gvl.BigWigs.from_table("pos", "pos_strands.tsv"),
        gvl.BigWigs.from_table("neg", "pos_strands.tsv"),
    ],
)

This dataset would have both haplotypes and two tracks (pos and neg) available for samples that exist in both all_chroms.bcf and the BigWig tables (i.e. gvl.write() performs an inner join on samples).