Dataset format
A GVL dataset is a directory written by gvl.write and read
by gvl.Dataset.open. This page is the authoritative
description of its on-disk layout.
Directory layout
dataset_dir/
├── metadata.json # the Metadata schema (below)
├── input_regions.arrow # original BED regions + region-index map
├── genotypes/ # present iff variants were provided to gvl.write
│ ├── offsets.npy # per (region, sample, ploidy) offsets into variant_idxs.npy
│ ├── svar_meta.json # shape + dtype of offsets.npy — present iff source was .svar
│ ├── variant_idxs.npy # variant indices; absent when sourced from .svar
│ ├── dosages.npy # optional, absent when sourced from .svar
│ └── variants.arrow # variant table; absent when sourced from .svar
└── intervals/ # or annot_intervals/ when annotated; present iff tracks given
When the dataset was built from an .svar, the heavy per-variant arrays (variant_idxs.npy,
dosages.npy, index.arrow) are not duplicated into the dataset. Instead the dataset
records a back-reference to the source .svar in metadata.json (see svar_link below).
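As a rough illustration of the layout above, the following sketch checks which documented parts exist in a dataset directory. The helper name `describe_dataset_dir` and the keys of the returned dict are hypothetical and not part of GVL; only the file and directory names come from the layout above.

```python
from pathlib import Path


def describe_dataset_dir(dataset_dir):
    """Report which parts of the documented layout exist under dataset_dir.

    Hypothetical helper for illustration; it only inspects file names and
    never opens or validates the files themselves.
    """
    root = Path(dataset_dir)
    geno = root / "genotypes"
    return {
        "has_metadata": (root / "metadata.json").is_file(),
        "has_input_regions": (root / "input_regions.arrow").is_file(),
        # genotypes/ is documented as present iff variants were provided
        "has_genotypes": geno.is_dir(),
        # svar_meta.json is documented as present iff the source was an .svar
        "svar_sourced": (geno / "svar_meta.json").is_file(),
        # tracks live in intervals/ or, when annotated, annot_intervals/
        "has_tracks": (root / "intervals").is_dir()
        or (root / "annot_intervals").is_dir(),
    }
```

Checking for `svar_meta.json` is only a convenient heuristic here; the `svar_link` entry in metadata.json is the record the dataset itself keeps of its source .svar.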
metadata.json schema
metadata.json is the serialization of genvarloader._dataset._write.Metadata:
| Field | Type | Notes |
|---|---|---|
| | | Sample identifiers, sorted. |
| | | Contig names used to interpret BED coords. |
| | | Number of regions (after jitter padding). |
| | | Ploidy when the dataset has genotypes. |
| | | Maximum coordinate jitter (defaults to 0). |
| | | Package version that wrote this dataset. Drives format dispatch. |
| svar_link | | Back-reference to a source .svar. |
SvarLink:
| Field | Type | Notes |
|---|---|---|
| relative_path | | POSIX path to the source .svar, resolved against the dataset directory. |
| absolute_path | | Original absolute path; used as a fallback. |
| | | Integrity check (see below). |
SvarFingerprint:
| Field | Type | Notes |
|---|---|---|
| | | Row count of the svar’s |
| | | Byte size of the svar’s |
SVAR resolution at open time
When opening a dataset whose metadata.svar_link is non-null,
Dataset.open resolves the svar in this order:
1. Caller-provided svar=... argument.
2. svar_link.relative_path resolved against the dataset directory.
3. svar_link.absolute_path.
4. A unique *.svar directory next to the dataset.
If none match, a FileNotFoundError is raised naming the expected .svar basename. After
resolution, the fingerprint is verified; a mismatch raises ValueError and lists both
expected and observed values.
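The resolution order above can be sketched in a few lines. This is not the code inside Dataset.open; `resolve_svar` is a hypothetical helper, `svar_link` is assumed to be the parsed dict from metadata.json, "next to the dataset" is assumed to mean a sibling of the dataset directory, and candidates that do not exist are assumed to fall through to the next step. Fingerprint verification is left out.

```python
from pathlib import Path


def resolve_svar(dataset_dir, svar_link, svar=None):
    """Return the first existing .svar directory, trying the documented order.

    Hypothetical sketch: the real lookup may differ in details such as how a
    missing caller-provided path is handled.
    """
    dataset_dir = Path(dataset_dir)
    candidates = []
    if svar is not None:
        candidates.append(Path(svar))  # 1. caller-provided argument
    if svar_link.get("relative_path"):
        # 2. relative path, resolved against the dataset directory
        candidates.append(dataset_dir / svar_link["relative_path"])
    if svar_link.get("absolute_path"):
        candidates.append(Path(svar_link["absolute_path"]))  # 3. fallback
    siblings = sorted(dataset_dir.parent.glob("*.svar"))
    if len(siblings) == 1:
        candidates.append(siblings[0])  # 4. unique *.svar beside the dataset

    for candidate in candidates:
        if candidate.is_dir():
            return candidate

    # Name the expected basename in the error, as the text above describes.
    expected = Path(
        svar_link.get("absolute_path")
        or svar_link.get("relative_path")
        or "*.svar"
    ).name
    raise FileNotFoundError(f"could not locate source svar {expected!r}")
```

The sibling lookup is skipped when more than one `*.svar` directory is present, since picking one arbitrarily could silently pair the dataset with the wrong variants.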
Format changelog
| Version | Change |
|---|---|
| | Variant coordinates stored 0-based. |
| | Variant coordinates switched to 1-based. |
| | |
Upgrading legacy datasets. A dataset written before 0.25.0 that was built from an .svar
will still open (with a DeprecationWarning). Run genvarloader.migrate_svar_link(path) to convert the symlink layout to the new metadata layout in place.
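To decide whether a dataset predates 0.25.0, the recorded package version has to be compared numerically rather than as a string (as strings, "0.9.0" sorts after "0.25.0"). A minimal sketch, assuming a plain `major.minor.patch` version string; `is_pre_0_25` is a hypothetical helper, not a GVL function, and it does not handle pre-release suffixes such as `rc1`.

```python
def is_pre_0_25(version):
    """Return True when a 'major.minor.patch' string sorts before 0.25.0.

    Hypothetical helper: compares the numeric components as a tuple so that
    0.9.0 is correctly treated as older than 0.25.0.
    """
    parts = tuple(int(piece) for piece in version.split(".")[:3])
    return parts < (0, 25, 0)
```

A dataset for which this returns True, and which was built from an .svar, would be a candidate for genvarloader.migrate_svar_link(path) as described above.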