# What's a `gvl.Dataset`?

## How a `Dataset` represents data

In general, GenVarLoader Datasets represent a **lazy collection of ragged arrays**. Sequences have shape `(regions, samples, [ploidy], length)` and tracks (e.g. read depth) have shape `(regions, samples, tracks, [ploidy], length)`, where the length axis is always ragged (i.e. variable length). The optional ploidy axis is added to all output arrays when the dataset is set to return personalized data.

For example, consider a personalized dataset with 4 regions and 4 samples.

:::{note}
You can create this dummy dataset using [`get_dummy_dataset()`](api.md#genvarloader.get_dummy_dataset):
```python
import genvarloader as gvl

ds = gvl.get_dummy_dataset()
```
:::

:::{image} _static/personalized_length.svg
:alt: A personalized Dataset with 4 regions of length 3 but varying sequence lengths.
:align: center
:width: 600
:::

All of the sequences in this dataset have shape `(regions: 4, samples: 4, ploidy: 1, length: var)`.

:::{important}
Even though **all regions have a length of 3**, the personalized lengths can **differ because of indels**. Datasets can also represent variable length regions, in which case non-personalized data will also have different lengths.
:::
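
To make "ragged" concrete, the sketch below stores three sequences of different lengths in a single flat buffer with an offsets array. This is a generic NumPy illustration of a ragged layout; the names `data`, `offsets`, and `get_item` are hypothetical, and it is not necessarily how GenVarLoader stores data internally.

```python
import numpy as np

# Three sequences of lengths 3, 2, and 4 stored back to back in one buffer.
data = np.frombuffer(b"ACGACACGT", dtype="S1")
lengths = np.array([3, 2, 4])

# offsets[i] and offsets[i + 1] bound the i-th sequence in the flat buffer.
offsets = np.concatenate([[0], np.cumsum(lengths)])


def get_item(i):
    """Return the i-th variable-length sequence as a view of the buffer."""
    return data[offsets[i] : offsets[i + 1]]


[len(get_item(i)) for i in range(3)]  # [3, 2, 4]
```

Because no padding is stored, each item keeps its true length, which is why the length axis cannot be described by a single number.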

## Obtaining ragged, variable, or fixed length data

The [`Dataset.output_length`](api.md#genvarloader.Dataset.output_length) setting controls whether a Dataset returns ragged, variable, or fixed length data. In the ragged case, the data is returned exactly as it exists in the dataset. For example, suppose we create the personalized dataset from above:

```python
ds = gvl.get_dummy_dataset()
```

### Ragged

By default, the [`Dataset.output_length`](api.md#genvarloader.Dataset.output_length) is set to `"ragged"`. So, when we index it to get three sequences, we will get a [`gvl.Ragged`](api.md#genvarloader.Ragged) object.

```python
haps = ds[0, :3]  # shape: (batch: 3, ploidy, length: var)
len(haps[0, 0])  # 3
len(haps[1, 0])  # 2
len(haps[2, 0])  # 4
```

### Variable

We can change the [`Dataset.output_length`](api.md#genvarloader.Dataset.output_length) to `"variable"` and each batch of data will be a NumPy array.

```python
ds = ds.with_len("variable")
haps = ds[0, :3]
haps.shape  # (3, ploidy, 4)
```

With variable length output, a batch of ragged data is converted to be rectilinear by right-padding each item to have the same length as the longest item in the batch.

:::{image} _static/var_pad.svg
:alt: A variable length batch of data is right-padded to match the longest length in the batch.
:align: center
:width: 150
:::

The pad value depends on the data type: for sequences it is `b'N'` and for tracks it is `0`. As a result, the length of any given batch depends on the longest item in that batch, hence the name `"variable"` for this setting.
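
The right-padding step can be sketched in plain NumPy. This is only an illustration of the idea, not GenVarLoader's implementation; the helper `right_pad` and the example sequences are hypothetical.

```python
import numpy as np


def right_pad(items, pad_value):
    """Right-pad 1D arrays to the longest length and stack them into a 2D array."""
    max_len = max(len(item) for item in items)
    out = np.full((len(items), max_len), pad_value, dtype=items[0].dtype)
    for i, item in enumerate(items):
        out[i, : len(item)] = item
    return out


# Three hypothetical haplotypes with lengths 3, 2, and 4.
ragged = [
    np.frombuffer(b"ACG", dtype="S1"),
    np.frombuffer(b"AC", dtype="S1"),
    np.frombuffer(b"ACGT", dtype="S1"),
]

batch = right_pad(ragged, b"N")
batch.shape  # (3, 4)
batch[1].tobytes()  # b'ACNN'
```

The same helper pads tracks if called with `pad_value=0` on numeric arrays.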

### Fixed

Finally, we can also obtain fixed length NumPy arrays from a [`Dataset`](api.md#genvarloader.Dataset) by providing an integer:

```python
ds = ds.with_len(2)
haps = ds[0, :3]
haps.shape  # (3, ploidy, 2)
```

In this case, the personalized length may be either shorter or longer than the output length, so the [`Dataset`](api.md#genvarloader.Dataset) uses a combination of padding, random shifting, and truncating to achieve the desired length. For example, suppose we have a personalized length $L = 10$ and an output length $L_{\text{out}} = 5$.

:::{image} _static/shift_trunc.svg
:alt: Personalized data that is too long is randomly shifted and truncated to the desired length.
:align: center
:width: 250
:::

[`Datasets`](api.md#genvarloader.Dataset) can either deterministically truncate the extra length or apply a random shift $s \sim U(0, L - L_{\text{out}})$ before truncating the rest.

:::{note}
[`Datasets`](api.md#genvarloader.Dataset) are deterministic by default ($s = 0$), but random shifting can be enabled via settings:

```python
ds = ds.with_settings(deterministic=False)
```

:::
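
As a rough sketch of the shift-and-truncate logic described above, the hypothetical helper below selects a window of length $L_{\text{out}}$ from a longer sequence. It is not GenVarLoader's code, and it assumes the shift is an integer drawn uniformly from $0$ to $L - L_{\text{out}}$ inclusive.

```python
import numpy as np


def shift_and_truncate(seq, out_len, rng=None):
    """Return a window of length out_len from seq.

    With rng=None the shift is 0 (the deterministic case). Otherwise the
    shift is drawn uniformly from 0..len(seq) - out_len, inclusive (assumed).
    """
    extra = len(seq) - out_len
    if extra < 0:
        raise ValueError("seq is shorter than out_len; it needs padding instead")
    shift = 0 if rng is None else int(rng.integers(0, extra + 1))
    return seq[shift : shift + out_len]


seq = np.frombuffer(b"ACGTACGTAC", dtype="S1")  # L = 10

fixed = shift_and_truncate(seq, 5)  # s = 0
fixed.tobytes()  # b'ACGTA'

# Passing a generator picks a random window that still has length 5.
shifted = shift_and_truncate(seq, 5, np.random.default_rng(0))
```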

Alternatively, we could have a personalized length $L = 3$ and an output length $L_{\text{out}} = 5$.

:::{image} _static/pad.svg
:alt: Personalized data that is too short is padded with personalized data to the desired length.
:align: center
:width: 150
:::

In this case, the personalized data is padded **with more personalized data** to the desired length.

:::{important}
[`Datasets`](api.md#genvarloader.Dataset) have a maximum fixed output length! The BED file given to [`gvl.write()`](api.md#genvarloader.write) specifies what data should exist in the [`Dataset`](api.md#genvarloader.Dataset), so data outside of those regions cannot be generated by the [`Dataset`](api.md#genvarloader.Dataset). This means the output length must be no more than the length of the shortest region in the [`Dataset`](api.md#genvarloader.Dataset), i.e. $L_{\text{out}} \le \min(|\text{regions}|)$. However, [`Datasets`](api.md#genvarloader.Dataset) support jitter, so the data in the [`Dataset`](api.md#genvarloader.Dataset) actually corresponds to regions that have been expanded by `max_jitter` on either side. Thus, the full constraint on output length is:

$$L_{\text{out}} + 2 j \le \min(|\text{regions}|) + 2 j_{\text{max}}$$

where $j$ is the current jitter and $j_{\text{max}}$ is the maximum jitter allowed by the [`Dataset`](api.md#genvarloader.Dataset). If you try to use an output length greater than this, the [`Dataset`](api.md#genvarloader.Dataset) will raise an error.
:::
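
Rearranging the constraint above gives the largest allowed output length, $L_{\text{out}} \le \min(|\text{regions}|) + 2 (j_{\text{max}} - j)$. The hypothetical helper below simply evaluates that bound. It is not part of GenVarLoader's API, and the region lengths are made-up values.

```python
def max_output_length(region_lengths, jitter, max_jitter):
    """Largest L_out satisfying L_out + 2*j <= min(|regions|) + 2*j_max."""
    if not 0 <= jitter <= max_jitter:
        raise ValueError("jitter must be between 0 and max_jitter")
    return min(region_lengths) + 2 * (max_jitter - jitter)


# Hypothetical regions of length 100, 120, and 90 written with max_jitter = 8.
max_output_length([100, 120, 90], jitter=0, max_jitter=8)  # 106
max_output_length([100, 120, 90], jitter=8, max_jitter=8)  # 90
```

With no jitter in use, the extra `max_jitter` bases on each side are available for longer outputs; at full jitter, the bound falls back to the length of the shortest region.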