What’s a gvl.Dataset?

How a Dataset represents data

In general, GenVarLoader Datasets represent a lazy collection of ragged arrays. Sequences have shape (regions, samples, [ploidy], length) and tracks (e.g. read depth) have shape (regions, samples, tracks, [ploidy], length). The length axis is always ragged (i.e. variable length). The ploidy axis is optional: it is added to all output arrays when the dataset is set to return personalized data.

For example, consider a personalized dataset with 4 regions and 4 samples.

Note

You can create this dummy dataset using get_dummy_dataset():

import genvarloader as gvl
ds = gvl.get_dummy_dataset()
Figure: A personalized Dataset with 4 regions of length 3 but varying sequence lengths.

All of the sequences in this dataset have shape (regions: 4, samples: 4, ploidy: 1, length: var).

Important

Even though all regions have a length of 3, the personalized lengths can differ because of indels. Datasets can also represent variable length regions, in which case non-personalized data will also have different lengths.
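To make the effect of indels concrete, here is a minimal pure-Python sketch. It is not part of GenVarLoader: the `apply_variants` helper and the variants themselves are hypothetical and only illustrate how a region of length 3 can yield haplotypes of length 3, 2, or 4.

```python
def apply_variants(ref: str, variants: list[tuple[int, str, str]]) -> str:
    """Apply (position, ref_allele, alt_allele) variants to a reference string.

    Hypothetical helper for illustration only. Positions are 0-based and the
    variants are assumed not to overlap.
    """
    out = []
    cursor = 0
    for pos, ref_allele, alt_allele in sorted(variants):
        out.append(ref[cursor:pos])      # unchanged reference up to the variant
        out.append(alt_allele)           # substitute the alternate allele
        cursor = pos + len(ref_allele)   # skip the reference allele
    out.append(ref[cursor:])             # remainder of the reference
    return "".join(out)


reference = "ACG"  # a region of length 3
no_variant = apply_variants(reference, [])               # no change
deletion = apply_variants(reference, [(1, "CG", "C")])   # 1 bp deletion
insertion = apply_variants(reference, [(1, "C", "CT")])  # 1 bp insertion

print(len(no_variant), len(deletion), len(insertion))  # 3 2 4
```

The same reference region therefore produces personalized sequences of different lengths, which is why the length axis has to be ragged.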

Obtaining ragged, variable, or fixed length data

The Dataset.output_length setting determines whether the Dataset returns ragged, variable, or fixed length data. In the ragged case, the data is returned exactly as it exists in the dataset. For example, suppose we define the personalized dataset from above:

ds = gvl.get_dummy_dataset()

Ragged

By default, Dataset.output_length is set to "ragged". So, when we index the Dataset to get three sequences, we get a gvl.Ragged object.

haps = ds[0, :3]  # shape: (batch: 3, ploidy, length: var)
len(haps[0, 0])  # 3
len(haps[1, 0])  # 2
len(haps[2, 0])  # 4
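As a mental model for ragged data, one common layout is a flat buffer plus an offsets array. The sketch below is an assumption for illustration only and does not describe how gvl.Ragged is implemented; the names `data`, `offsets`, and `get_item` are hypothetical.

```python
# Illustrative only: NOT the internals of gvl.Ragged.
# Three haplotypes with lengths 3, 2, and 4, stored back to back.
data = b"ACG" + b"AC" + b"ACGT"
offsets = [0, 3, 5, 9]  # item i occupies data[offsets[i]:offsets[i + 1]]


def get_item(i: int) -> bytes:
    """Return the i-th variable-length item from the flat buffer."""
    return data[offsets[i]:offsets[i + 1]]


lengths = [len(get_item(i)) for i in range(len(offsets) - 1)]
print(lengths)  # [3, 2, 4]
```

No padding is stored in this layout, so each item keeps its true length.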

Variable

We can change the Dataset.output_length to "variable" and each batch of data will be a NumPy array.

ds = ds.with_len("variable")
haps = ds[0, :3]
haps.shape  # (3, ploidy, 4)

With variable length output, a batch of ragged data is converted to be rectilinear by right-padding each item to have the same length as the longest item in the batch.

Figure: A variable length batch of data is right-padded to match the longest length in the batch.

The pad value depends on the data type: for sequences it is b'N' and for tracks it is 0. Because each batch is padded to its own longest item, the length of a batch changes from one batch to the next, hence the name "variable" for this setting.
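The right-padding described above can be sketched in a few lines of pure Python. This is an illustration of the idea rather than GenVarLoader's implementation; the `right_pad` helper is hypothetical and only handles sequences, using b'N' as the pad value.

```python
def right_pad(batch: list[bytes], pad: bytes = b"N") -> list[bytes]:
    """Right-pad every item to the length of the longest item in the batch.

    Hypothetical helper for illustration; `pad` must be a single byte.
    """
    longest = max(len(item) for item in batch)
    return [item.ljust(longest, pad) for item in batch]


ragged_batch = [b"ACG", b"AC", b"ACGT"]  # lengths 3, 2, 4
padded = right_pad(ragged_batch)
print(padded)  # [b'ACGN', b'ACNN', b'ACGT']
```

A different batch whose longest item has length 3 would be padded to length 3 instead, which is what makes the output length "variable".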

Fixed

Finally, we can also obtain fixed length NumPy arrays from a Dataset by providing an integer:

ds = ds.with_len(2)
haps = ds[0, :3]
haps.shape  # (3, ploidy, 2)

In this case, the personalized length may be either shorter or longer than the output length so the Dataset can do a combination of padding, random shifting, and truncating to achieve the desired length. For example, suppose we have a personalized length \(L = 10\) and an output length \(L_{\text{out}} = 5\).

Figure: Personalized data that is too long is randomly shifted and truncated to the desired length.

Datasets can either deterministically truncate the extra length or apply a random shift \(s \sim U(0, L - L_{\text{out}})\) before truncating the rest.

Note

Datasets are deterministic by default (\(s = 0\)), but random shifting can be enabled via settings:

ds = ds.with_settings(deterministic=False)
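The shift-then-truncate step for data that is too long can be sketched as follows. This is a minimal illustration, not the library's code: `shift_and_truncate` is a hypothetical helper, and it assumes the shift is an integer drawn uniformly from \(0\) to \(L - L_{\text{out}}\) inclusive.

```python
import random
from typing import Optional


def shift_and_truncate(
    seq: bytes, out_len: int, rng: Optional[random.Random] = None
) -> bytes:
    """Return a window of length out_len from seq.

    Hypothetical helper for illustration. With rng=None the shift is s = 0
    (deterministic); otherwise s is drawn uniformly from [0, L - L_out],
    assumed inclusive on both ends.
    """
    max_shift = len(seq) - out_len
    if max_shift < 0:
        raise ValueError("sequence is shorter than the output length")
    s = 0 if rng is None else rng.randint(0, max_shift)
    return seq[s:s + out_len]


seq = b"ACGTACGTAC"  # personalized length L = 10
deterministic = shift_and_truncate(seq, 5)             # s = 0
shifted = shift_and_truncate(seq, 5, random.Random(0))  # s in {0, ..., 5}
print(deterministic)  # b'ACGTA'
```

Either way the result has exactly \(L_{\text{out}} = 5\) elements; only the starting offset differs.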

Alternatively, we could have a personalized length \(L = 3\) and an output length \(L_{\text{out}} = 5\).

Figure: Personalized data that is too short is padded with personalized data to the desired length.

In this case, the Dataset pads the output with additional personalized data until it reaches the desired length.

Important

Datasets have a maximum fixed output length! The BED file given to gvl.write() specifies what data should exist in the Dataset, and data outside of those regions cannot be generated by the Dataset. This means the output length must be no more than the length of the smallest region in the Dataset, i.e. \(L_{\text{out}} \le \min(|\text{regions}|)\). However, Datasets support jitter, so the data in the Dataset actually correspond to regions that have been expanded by max_jitter on either side. Thus, the full constraint on output length is:

\[L_{\text{out}} + 2 j \le \min(|\text{regions}|) + 2 j_{\text{max}}\]

where \(j\) is the current jitter and \(j_{\text{max}}\) is the maximum jitter allowed by the Dataset. If you request an output length that violates this constraint, the Dataset will raise an error.
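Rearranging the constraint gives \(L_{\text{out}} \le \min(|\text{regions}|) + 2 (j_{\text{max}} - j)\). The sketch below computes that bound so an output length can be checked ahead of time. The `max_output_length` helper is hypothetical and not part of GenVarLoader; it also assumes \(0 \le j \le j_{\text{max}}\).

```python
def max_output_length(region_lengths: list[int], jitter: int, max_jitter: int) -> int:
    """Largest L_out satisfying L_out + 2*j <= min(|regions|) + 2*j_max.

    Hypothetical helper for illustration; assumes 0 <= jitter <= max_jitter.
    """
    if not 0 <= jitter <= max_jitter:
        raise ValueError("jitter must be between 0 and max_jitter")
    return min(region_lengths) + 2 * (max_jitter - jitter)


# Four regions of length 3, written with a maximum jitter of 2.
bound_no_jitter = max_output_length([3, 3, 3, 3], jitter=0, max_jitter=2)
bound_full_jitter = max_output_length([3, 3, 3, 3], jitter=2, max_jitter=2)
print(bound_no_jitter, bound_full_jitter)  # 7 3
```

Using more of the available jitter leaves less room for the output length, and at \(j = j_{\text{max}}\) the bound reduces to the length of the smallest region.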