FAQ

Why does a Dataset return “Ragged” objects and what are they?

For the reasoning, see “What’s a gvl.Dataset?”. Ragged arrays are similar to NumPy arrays except that the final axis has a variable size. For example, a 2D ragged array with three rows might look like [[1, 2], [3], [4, 5, 6]].

To store this, a Ragged array minimally consists of two NumPy arrays: a 1D array data with shape (size,) containing the values, and a 1D or 2D array offsets with shape (n_rows + 1,) or (2, n_rows), respectively, specifying the start and end (exclusive) position of each row within data. The above example can thus be created as follows:

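import genvarloader as gvl
import numpy as np

data = np.array([1, 2, 3, 4, 5, 6])  # flat values of all rows
offsets = np.array([0, 2, 3, 6])  # row i spans data[offsets[i]:offsets[i + 1]]
shape = (3,)  # three rows
ragged = gvl.Ragged.from_offsets(data, shape, offsets)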
# [
#     [1, 2],
#     [3],
#     [4, 5, 6]
# ]

Ragged arrays are backed by seqpro’s Ragged type (a Rust-backed _core.Ragged). GVL computes on the data and offsets buffers directly in Rust, which is relatively straightforward (i.e. iterating over the rows of data via the offsets array). (Earlier releases subclassed Awkward Arrays; GVL no longer depends on awkward.)
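To make the data/offsets layout concrete, here is a minimal pure-Python sketch of how rows can be recovered from the two buffers. It is only an illustration of the idea, not GVL's Rust implementation; the helper names `rows_from_offsets` and `rows_from_starts_ends` are hypothetical, and plain lists stand in for NumPy arrays.

```python
def rows_from_offsets(data, offsets):
    """Rows for the 1D layout: row i spans offsets[i] up to offsets[i + 1]."""
    return [data[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


def rows_from_starts_ends(data, starts_ends):
    """Rows for the (2, n_rows) layout: an explicit start and end per row."""
    starts, ends = starts_ends
    return [data[s:e] for s, e in zip(starts, ends)]


data = [1, 2, 3, 4, 5, 6]
rows_a = rows_from_offsets(data, [0, 2, 3, 6])
rows_b = rows_from_starts_ends(data, [[0, 2, 3], [2, 3, 6]])
print(rows_a)  # [[1, 2], [3], [4, 5, 6]]
print(rows_a == rows_b)  # True
```

Both helpers rely only on indexing and slicing, so they should behave the same way on NumPy arrays.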

.. note::

GVL Datasets can also return several other kinds of objects; see the [API reference](api.md#containers) for more details.

I have multiple tracks per sample, how can I add them?

If you provide multiple BigWigs to gvl.write(), all of them can be returned simultaneously from the resulting Dataset, placed along the track axis and sorted by name. By default, a Dataset sets all tracks to active when opened, so tracks have shape (batch, tracks, [ploidy], length). For example:

import genvarloader as gvl

pos_strand = gvl.BigWigs.from_table("pos", "pos_strand.tsv")
neg_strand = gvl.BigWigs.from_table("neg", "neg_strand.tsv")
gvl.write(
    "path/to/dataset.gvl", bed="path/to/regions.bed", tracks=[pos_strand, neg_strand]
)
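Because the track axis is sorted by name rather than by the order passed to gvl.write(), it can help to work out which index each track lands on. The sketch below assumes that the first argument to BigWigs.from_table is the track name and that "sorted by name" means plain lexicographic order; both are assumptions, not confirmed API behavior.

```python
# Track names as written in the example above (assumed to be the first
# argument of BigWigs.from_table).
written = ["pos", "neg"]

# Assumption: "sorted by name" is ordinary lexicographic sorting.
track_axis = sorted(written)
index_of = {name: i for i, name in enumerate(track_axis)}

print(track_axis)  # ['neg', 'pos']
print(index_of["pos"])  # 1
```

Under these assumptions, "neg" would occupy index 0 of the track axis even though "pos" was listed first.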

How does GVL handle negative stranded regions provided to `gvl.write()`?

By default, GVL will automatically reverse (and complement) negative stranded regions. You can modify this behavior using gvl.Dataset.with_settings() and setting rc_neg to False.
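For readers unfamiliar with the operation, reverse-complementing a DNA sequence can be sketched in a few lines of Python. This is a generic illustration and not GVL's own code; the helper name `reverse_complement` is hypothetical.

```python
# Translation table mapping each base to its complement (N stays N).
_COMPLEMENT = bytes.maketrans(b"ACGTNacgtn", b"TGCANtgcan")


def reverse_complement(seq: bytes) -> bytes:
    """Complement every base, then reverse the order."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement(b"GATTACA"))  # b'TGTAATC'
```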

How does GVL handle unphased genotypes?

GVL assumes all genotypes are phased and will not warn you if any genotypes are unphased. Generally, unphased genotypes cannot be resolved into haplotypes so we make this simplifying assumption. If you aren’t sure whether your genotypes are phased, it is relatively easy to inspect from the CLI using bcftools view or plink2:

# for VCF, -p filters for records where all samples are phased
bcftools view -Hp $vcf | wc -l
# returns number of phased records

# for PLINK 2: --pgen-info takes no argument, so the fileset is given via --pfile
plink2 --pfile $prefix --pgen-info
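If genotype strings are already available in Python, the same check can be sketched directly. In VCF, the GT field separates alleles with "|" when phased and "/" when unphased. The helper names below are hypothetical, and treating haploid calls (no separator) as phased is an assumption of this sketch.

```python
def is_phased(gt: str) -> bool:
    """True if a VCF GT string such as '0|1' contains no unphased separator."""
    return "/" not in gt


def all_phased(gts) -> bool:
    """True if every genotype in the iterable is phased."""
    return all(is_phased(gt) for gt in gts)


print(all_phased(["0|1", "1|1", "0|0"]))  # True
print(all_phased(["0|1", "0/1"]))  # False
```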

How do I control how many threads GVL uses?

GVL’s read path (haplotype reconstruction and track re-alignment) is parallelized in Rust with rayon. By default it uses one worker per available CPU: on Linux this is detected from the cgroup cpuset (sched_getaffinity), so container limits are respected, and elsewhere it falls back to os.cpu_count(). Three environment variables tune this:

GVL_NUM_THREADS — set the worker count explicitly (e.g. GVL_NUM_THREADS=4). Overrides cgroup detection. Resolved once, on first use, so set it before your first GVL call.
GVL_FORCE_PARALLEL — set to a truthy value (1, true, yes, on) to force the multithreaded paths even on small inputs. By default GVL runs small inputs serially because thread overhead would dominate; this bypasses that size gate. Mainly useful for benchmarking.
RAYON_NUM_THREADS — GVL overwrites this with its own resolved count so an inherited value (e.g. baked into a base image) can’t defeat the cgroup-aware cap. To size the pool yourself, use GVL_NUM_THREADS instead.
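The resolution order described above can be sketched in Python as follows. This only mirrors the documented behavior for illustration; the actual logic lives in GVL's Rust code, the function names are hypothetical, and case-insensitive matching of truthy values is an assumption.

```python
import os

TRUTHY = {"1", "true", "yes", "on"}


def resolve_num_threads(env=os.environ) -> int:
    """An explicit GVL_NUM_THREADS wins, then CPU affinity, then os.cpu_count()."""
    explicit = env.get("GVL_NUM_THREADS")
    if explicit:
        return int(explicit)
    if hasattr(os, "sched_getaffinity"):  # Linux only; respects cpusets
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def force_parallel(env=os.environ) -> bool:
    """True when GVL_FORCE_PARALLEL holds one of the truthy values."""
    # Assumption: matching ignores case and surrounding whitespace.
    return env.get("GVL_FORCE_PARALLEL", "").strip().lower() in TRUTHY


print(resolve_num_threads({"GVL_NUM_THREADS": "4"}))  # 4
print(force_parallel({"GVL_FORCE_PARALLEL": "yes"}))  # True
```

In practice, since the count is resolved once on first use, set `os.environ["GVL_NUM_THREADS"]` (or export it in the shell) before the first GVL call.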

Should I use `.svar` or `.svar2` as my variant source?

Both are sparse columnar variant archives from genoray that gvl.write(variants=...) accepts alongside BCF/PGEN; see write.md for how to build one. The two differ in their read-time behavior:

.svar reconstructs by building an interval search tree over the queried window and a per-read dense union of the overlapping variants.
.svar2 reconstructs via a read-bound path: gvl.write caches small per-(region, sample, ploid) variant-key ranges at write time, and Dataset.__getitem__ gathers directly off that cache and calls all-Rust kernels — it builds no interval search tree and no dense union per read. .svar2 stores are also typically smaller on disk than .svar, especially for large cohorts.
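The practical difference between searching at read time and gathering from a write-time cache can be illustrated with a toy example. Nothing here reflects the actual genoray file formats or GVL kernels: the data, names, and the use of a sorted list of SNP positions in place of an interval tree are all assumptions made for illustration.

```python
from bisect import bisect_left

# Toy data: sorted 0-based SNP positions for one sample and haplotype.
positions = [5, 12, 18, 30, 41]


def overlapping_range(start: int, end: int) -> tuple:
    """Index range of SNPs with start <= position < end (a search)."""
    return bisect_left(positions, start), bisect_left(positions, end)


# "Write time": resolve each region once and cache the index range.
regions = {"r1": (10, 20), "r2": (25, 50)}
cache = {name: overlapping_range(s, e) for name, (s, e) in regions.items()}


def read(name: str) -> list:
    """'Read time': a cached range is just a slice, with no search needed."""
    lo, hi = cache[name]
    return positions[lo:hi]


print(read("r1"))  # [12, 18]
print(read("r2"))  # [30, 41]
```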

.svar2 support is currently at Phase-1 scope. A handful of combinations aren’t wired yet and raise NotImplementedError rather than silently mis-computing: annotated haplotypes, min_af/max_af, VarWindowOpt(ref="allele"), fixed-length haplotype-realigned tracks, spliced variant-windows, and variants/variant-windows output with jitter. Haplotype and variants output support splicing, var_filter="exonic", and negative-strand reverse-complementation. Also supported are "variant-windows" output, unphased_union (for both "variants" and "variant-windows"), and var_fields-selected store INFO/FORMAT fields (again for both, when the .svar2 was written with them). See the genvarloader skill’s .svar2 section or docs/source/format.md for the full list. Everything else (haplotypes, tracks, and variants/variant-windows at any supported jitter/output-length combination) is byte-identical between the two backends.

One documented difference in raw output: for a pure deletion, with_seqs("variants") on a .svar dataset reports the VCF anchor base as ALT (e.g. b"G" for GTA>G), while a .svar2 dataset reports the atomized empty ALT (b"") — a genoray .svar2 format convention, not a bug. Reconstructed haplotypes are unaffected; only RaggedVariants.alt differs (and FlatVariantWindows.alt/.alt_window for "variant-windows"), and only for pure-deletion records. ref_window is byte-identical between the two backends.
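To see where the empty ALT comes from, consider trimming the bases that REF and ALT share at the start of a record. The sketch below is only an illustration of that idea, not genoray's atomization code; the helper name `trim_anchor` is hypothetical, and it ignores the accompanying shift in position.

```python
def trim_anchor(ref: bytes, alt: bytes) -> tuple:
    """Drop the leading bases shared by REF and ALT."""
    n = 0
    while n < len(ref) and n < len(alt) and ref[n] == alt[n]:
        n += 1
    return ref[n:], alt[n:]


# Pure deletion GTA>G: the anchor base G is shared, leaving an empty ALT.
print(trim_anchor(b"GTA", b"G"))  # (b'TA', b'')
# A SNP has no shared prefix, so it is unchanged.
print(trim_anchor(b"A", b"G"))  # (b'A', b'G')
```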

How can I get personalized protein/spliced RNA sequences?

Write a dataset from an exon-level BED containing transcript and exon-order columns, then open it with splice_info. Use var_filter="exonic" to drop variants whose reference span crosses an exon boundary:

ds = gvl.Dataset.open(
    "transcripts.gvl",
    reference="ref.fa",
    splice_info=("transcript_id", "exon_number"),
    var_filter="exonic",
).with_seqs("haplotypes")

This works with .svar and .svar2 variant sources. Negative-strand transcripts are reverse-complemented automatically when the BED includes strand="-". See the splicing guide for BED construction and output shapes.
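Conceptually, the exonic filter and the splicing step can be sketched as below. This is a simplified illustration under stated assumptions, not GVL's implementation: the helper names are hypothetical, coordinates are taken to be 0-based half-open as in BED, and exon sequences are assumed to be given on the forward strand.

```python
_COMPLEMENT = bytes.maketrans(b"ACGT", b"TGCA")


def is_exonic(var_start, var_end, exon_start, exon_end) -> bool:
    """True if the variant's reference span lies entirely inside the exon."""
    return exon_start <= var_start and var_end <= exon_end


def splice(exons, strand="+") -> bytes:
    """Join exon sequences in exon-number order.

    exons: list of (exon_number, forward_strand_sequence) pairs.
    For "-" transcripts each exon is reverse-complemented before joining.
    """
    ordered = [seq for _, seq in sorted(exons)]
    if strand == "-":
        ordered = [seq.translate(_COMPLEMENT)[::-1] for seq in ordered]
    return b"".join(ordered)


print(is_exonic(198, 203, 100, 200))  # False: the span crosses the exon end
print(splice([(2, b"AAC"), (1, b"GGT")]))  # b'GGTAAC'
print(splice([(2, b"AAC"), (1, b"GGT")], strand="-"))  # b'ACCGTT'
```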

Use .with_seqs("variants") on the same dataset to receive one complete RaggedVariants cell per (transcript, sample, phase). GVL performs the exon queries, exonic filtering, decode, and transcript regrouping internally; callers do not need to iterate over or concatenate exons.

Why aren’t the methods `with_len()`, `with_seqs()`, etc. combined into `with_settings()`?

These methods modify the type of output returned by a gvl.Dataset. To let type checkers like mypy and pyright know how these settings change the output type, each is given its own method. As a result, if you use a type checker, you get an improved developer workflow in which many common issues are reported at type-check time, before the code runs. Examples include using an incompatible transform or unpacking return values into the wrong number of variables.
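The general pattern can be sketched with a small generic class. This is a toy illustration of why separate methods help type checkers, not GVL's actual class hierarchy; `ToyDataset`, `Seqs`, and `Tracks` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Generic, Type, TypeVar

T = TypeVar("T")


class Seqs:
    """Stand-in for a sequence output type."""


class Tracks:
    """Stand-in for a track output type."""


@dataclass(frozen=True)
class ToyDataset(Generic[T]):
    kind: Type[T]

    # Each method has a distinct return annotation, so a type checker can
    # track how the output type changes from call to call.
    def with_seqs(self) -> "ToyDataset[Seqs]":
        return ToyDataset(Seqs)

    def with_tracks(self) -> "ToyDataset[Tracks]":
        return ToyDataset(Tracks)

    def get(self) -> T:
        return self.kind()


ds = ToyDataset(Tracks).with_seqs()
out = ds.get()  # a type checker infers Seqs here
print(type(out).__name__)  # Seqs
```

A single catch-all method taking keyword arguments would have one return annotation for every combination of settings, so a checker could not narrow the output type in the same way.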
