What’s a gvl.Dataset?
How a Dataset represents data
In general, a GenVarLoader Dataset represents a lazy collection of ragged arrays. Sequences have shape (regions, samples, [ploidy], length) and tracks (e.g. read depth) have shape (regions, samples, tracks, [ploidy], length), where the length axis is always ragged (i.e. variable length). The ploidy axis is optional: it is added to all output arrays when the dataset is set to return personalized data.
For example, consider a personalized dataset with 4 regions and 4 samples.
All of the sequences in this dataset have shape (regions: 4, samples: 4, ploidy: 1, length: var).
Important
Even though all regions have a length of 3, the personalized lengths can differ because of indels. Datasets can also represent variable length regions, in which case non-personalized data will also have different lengths.
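To make the effect of indels concrete, here is a minimal sketch in plain Python. The helper `apply_variants` and the toy variants are hypothetical and only illustrate why haplotypes built from a length-3 region can end up with different lengths; this is not how GenVarLoader stores or applies variants.

```python
# Toy illustration only: apply_variants is a hypothetical helper,
# not part of the GenVarLoader API.

def apply_variants(ref: str, variants: list[tuple[int, str, str]]) -> str:
    """Apply non-overlapping (pos, ref_allele, alt_allele) variants to ref.

    Positions are 0-based. Variants are applied from right to left so that
    an edit never shifts the coordinates of a variant still to be applied.
    """
    hap = ref
    for pos, ref_allele, alt_allele in sorted(variants, reverse=True):
        if hap[pos : pos + len(ref_allele)] != ref_allele:
            raise ValueError(f"REF allele mismatch at position {pos}")
        hap = hap[:pos] + alt_allele + hap[pos + len(ref_allele) :]
    return hap


reference = "ACG"  # a single region of length 3

haplotypes = [
    apply_variants(reference, [(1, "C", "T")]),   # SNP: length unchanged
    apply_variants(reference, [(0, "AC", "A")]),  # deletion: one base shorter
    apply_variants(reference, [(2, "G", "GT")]),  # insertion: one base longer
]
lengths = [len(h) for h in haplotypes]
print(haplotypes, lengths)
```

All three haplotypes come from the same fixed-length region, yet only the SNP preserves the reference length.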
Obtaining ragged, variable, or fixed length data
The Dataset.output_length setting controls whether the Dataset returns ragged, variable, or fixed length data. In the ragged case, the data is returned exactly as it exists in the dataset. For example, suppose we define the personalized dataset from above:
import genvarloader as gvl

ds = gvl.get_dummy_dataset()
Ragged
By default, Dataset.output_length is set to "ragged". So, when we index the Dataset to get three sequences, we get a gvl.Ragged object.
haps = ds[0, :3] # shape: (batch: 3, ploidy, length: var)
len(haps[0, 0]) # 3
len(haps[1, 0]) # 2
len(haps[2, 0]) # 4
Variable
We can change the Dataset.output_length to "variable" and each batch of data will be a NumPy array.
ds = ds.with_len("variable")
haps = ds[0, :3]
haps.shape # (3, ploidy, 4)
With variable length output, a batch of ragged data is converted to be rectilinear by right-padding each item to have the same length as the longest item in the batch.
The pad value depends on the data type: for sequences it is b'N' and for tracks it is 0. As a result, the length of any given batch depends on the longest item in that batch, hence the name "variable" for this setting.
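The right-padding described above can be sketched in plain Python. The helpers `pad_sequences` and `pad_tracks` are hypothetical names used for illustration, and the sketch works on lists rather than the NumPy arrays the Dataset returns; only the pad values (b'N' for sequences, 0 for tracks) come from the text.

```python
# Hypothetical helpers illustrating right-padding to the longest item in a batch.

def pad_sequences(batch: list[bytes], pad: bytes = b"N") -> list[bytes]:
    """Right-pad every sequence with `pad` up to the longest length in the batch."""
    longest = max(len(seq) for seq in batch)
    return [seq.ljust(longest, pad) for seq in batch]


def pad_tracks(batch: list[list[float]], pad: float = 0.0) -> list[list[float]]:
    """Right-pad every track with `pad` up to the longest length in the batch."""
    longest = max(len(track) for track in batch)
    return [track + [pad] * (longest - len(track)) for track in batch]


ragged_seqs = [b"ATG", b"AG", b"ACGT"]
padded_seqs = pad_sequences(ragged_seqs)

ragged_tracks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
padded_tracks = pad_tracks(ragged_tracks)

print(padded_seqs)
print(padded_tracks)
```

Because the target length is recomputed per batch, a batch without the length-4 item would be padded only to length 3.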
Fixed
Finally, we can also obtain fixed length NumPy arrays from a Dataset by providing an integer:
ds = ds.with_len(2)
haps = ds[0, :3]
haps.shape # (3, ploidy, 2)
In this case, the personalized length may be either shorter or longer than the output length so the Dataset can do a combination of padding, random shifting, and truncating to achieve the desired length. For example, suppose we have a personalized length \(L = 10\) and an output length \(L_{\text{out}} = 5\).
Datasets can either deterministically truncate the extra length or apply a random shift \(s \sim U(0, L - L_{\text{out}})\) before truncating the rest.
Note
Datasets are deterministic by default (\(s = 0\)), but random shifting can be enabled via settings:
ds = ds.with_settings(deterministic=False)
Alternatively, we could have a personalized length \(L = 3\) and an output length \(L_{\text{out}} = 5\).
In this case, the personalized data is extended with more personalized data until it reaches the desired length.
Important
Datasets have a maximum fixed output length! The BED file given to gvl.write() specifies what data should exist in the Dataset; data outside of those regions cannot be generated by the Dataset. Without jitter, this means the output length must be no more than the length of the smallest region in the Dataset, i.e. \(L_{\text{out}} \le \min(|\text{regions}|)\). However, Datasets support jitter, so the data in the Dataset actually correspond to regions that have been expanded by max_jitter on either side. Thus, the full constraint on output length is:
\[L_{\text{out}} + 2j \le \min(|\text{regions}|) + 2 j_{\text{max}}\]
where \(j\) is the current jitter and \(j_{\text{max}}\) is the maximum jitter allowed by the Dataset. If you try to use an output length greater than this, the Dataset will raise an error.
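As a rough sanity check, the bound can be computed by hand. This sketch assumes the constraint takes the form \(L_{\text{out}} + 2j \le \min(|\text{regions}|) + 2 j_{\text{max}}\), which follows from the stored data covering each region expanded by max_jitter on either side if the current jitter reserves \(j\) bases on each side. The function `max_output_length` is hypothetical and not part of the GenVarLoader API.

```python
# Hypothetical helper; the formula is an assumption derived from the text,
# not taken from the GenVarLoader source.

def max_output_length(region_lengths: list[int], max_jitter: int, jitter: int = 0) -> int:
    """Largest fixed output length allowed under the assumed constraint."""
    if not 0 <= jitter <= max_jitter:
        raise ValueError("jitter must be between 0 and max_jitter")
    return min(region_lengths) + 2 * (max_jitter - jitter)


region_lengths = [3, 3, 3, 3]

no_jitter_used = max_output_length(region_lengths, max_jitter=2, jitter=0)
full_jitter_used = max_output_length(region_lengths, max_jitter=2, jitter=2)
print(no_jitter_used, full_jitter_used)
```

Under this assumption, using the full jitter budget brings the limit back down to the length of the smallest region.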