API Reference

Contents

API Reference#

Writing#

genvarloader.write(path, bed, variants=None, tracks=None, samples=None, max_jitter=None, overwrite=False, max_mem='4g', extend_to_length=True)[source]#

Write a GVL dataset.

Parameters:
  • path (str | Path) – Path to write the dataset to.

  • bed (str | Path | DataFrame) – BED-like file or DataFrame of regions satisfying the BED3+ specification. Specifically, it must have columns ‘chrom’, ‘chromStart’, and ‘chromEnd’. If ‘strand’ is present, its values must be either ‘+’ or ‘-‘. Negative stranded regions will be reverse complemented during sequence and/or track reconstruction.

  • variants (str | Path | VCF | PGEN | SparseVar | None, default: None) – A genoray VCF or PGEN instance (genoray is a GVL dependency so it will be import-able). All variants must be left-aligned, bi-allelic, and atomized. Multi-allelic variants can be included by splitting them into bi-allelic half-calls. For VCFs, the bcftools norm command can do all of this normalization. Likewise, see the PLINK2 documentation for PGEN files. Commands of interest include --make-bpgen for splitting variants, --normalize for left-aligning and atomizing overlapping variants, and --ref-from-fa for REF allele correction.

  • tracks (IntervalTrack | Sequence[IntervalTrack] | None, default: None) – An IntervalTrack (e.g. BigWigs, Table) or a sequence of them. Each track must have a unique name; the on-disk layout writes to <path>/intervals/<track.name>/.

  • samples (list[str] | None, default: None) – Samples to include in the dataset

  • max_jitter (int | None, default: None) – Maximum jitter to add to the regions

  • overwrite (bool, default: False) – Whether to overwrite an existing dataset

  • max_mem (int | str, default: '4g') – Approximate maximum total memory to use, including the genoray variant index. The reader’s index is loaded eagerly at the start of write() (for VCF and PGEN) so that nbytes reflects its true size; that value is subtracted from max_mem to determine the budget available for genotype chunking. A ValueError is raised if the remaining budget is too small to fit even a single variant chunk. Otherwise max_mem is a soft limit on overall usage and may be exceeded by a small amount.

  • extend_to_length (bool, default: True) – Whether to continue reading/writing variants until all haplotypes have a length at least as long as the intervals in bed. Otherwise, deletions can cause the length of haplotypes to be less than the intervals in bed. This can be disabled if having haplotypes shorter than the intervals is acceptable, in which case they will be padded with reference bases when appropriate. Disabling this also reduces the amount of data read/written and is faster to run.

Notes

The dataset directory is built atomically: all data is written to a private sibling temp directory and published via os.replace(). A best-effort filelock prevents redundant parallel rebuilds, but correctness relies on the atomic rename — the lock is advisory only.

Out of scope: genoray .gvi index files and pysam .fai/.gzi index files are created by those libraries and are not covered by gvl’s atomic/locked creation. Concurrent jobs that trigger index creation for those files depend on the upstream libraries’ behavior.

genvarloader.get_splice_bed(gtf, contigs=None, transcript_support_level='1', require_multiple_of_3=True)[source]#

Process a GTF into a BED-compatible DataFrame for splicing datasets.

The result has columns chrom, chromStart (0-based), chromEnd, strand, gene_name, transcript_id, and exon_number, sorted by chromosome (natural order) and chromStart. Pass it directly to gvl.write() for splicing datasets.

Parameters:
  • gtf (str | Path) – Path to a GTF file (gzipped or plain) accepted by seqpro.gtf.scan().

  • contigs (list[str] | None, default: None) – If provided, keep only rows whose seqname is in this list.

  • transcript_support_level (str | None, default: '1') – If a string, require the GTF transcript_support_level attribute to equal it. None disables the filter.

  • require_multiple_of_3 (bool, default: True) – If True, keep only transcripts whose summed CDS length is a multiple of 3.

Return type:

DataFrame

genvarloader.read_bedlike(path)#

Reads a bed-like (BED3+) file as a pandas DataFrame. The file type is inferred from the file extension and supports .bed, .narrowPeak, and .broadPeak.

Parameters:

path (str | Path) – Path to the bed-like file.

Returns:

BED-like DataFrame with typed columns and zero-based coordinate metadata.

Return type:

pl.DataFrame

genvarloader.with_length(bed, length)#

Set the length of regions in a BED-like DataFrame to a fixed length by expanding or shrinking relative to the center (or peak) of the window. If the original region size + length is odd, the center will be 1 position closer the right end.

Parameters:
  • bed (TypeVar(FrameT, DataFrame[Any], LazyFrame[Any])) – BED-like DataFrame with at least the columns “chromStart” and “chromEnd”.

  • length (int) – Desired length of the windows. Must be non-negative.

Returns:

DataFrame of the same type as the input with updated “chromStart” and “chromEnd” columns.

Return type:

FrameT

class genvarloader.BigWigs[source]#
__init__(name, paths)[source]#

Read data from bigWig files.

Parameters:
  • name (str) – Name of the reader, for example ‘signal’.

  • paths (dict[str, str]) – Dictionary of sample names and paths to bigWig files for those samples.

Return type:

None

classmethod from_table(name, table)[source]#

Read data from bigWig files.

Parameters:
  • name (str) – Name of the reader, for example ‘signal’.

  • table (str | Path | DataFrame) – Path to a table or a DataFrame containing sample names and paths to bigWig files for those samples. It must have columns “sample” and “path”.

class genvarloader.Table[source]#

Long-form interval track keyed by (sample_id, chrom, start, end, value).

Warning

Temporarily disabled. Constructing a Table raises NotImplementedError while the polars-bio segfault (#395) is unresolved. Use BigWigs for track input in the meantime.

__init__(name, data, column_map=None)[source]#
Parameters:
Return type:

None

Reading#

Personalized data#

class genvarloader.Dataset[source]#

A dataset of genotypes, reference sequences, and intervals.

Note

This class is not meant to be instantiated directly. Use the Dataset.open() method to open a dataset after writing the data with genvarloader.write() or the GenVarLoader CLI.

GVL Datasets act like a collection of lazy ragged arrays that can be lazily subset or eagerly indexed as a 2D NumPy array. They have an effective shape of (n_regions, n_samples, [tracks], [ploidy], output_length), but only the region and sample dimensions can be indexed directly since the return value is generally a tuple of arrays.

Eager indexing

dataset[0, 9]  # first region, 10th sample
dataset[:10]  # first 10 regions and all samples
dataset[:10, :5]  # first 10 regions and 5 samples
dataset[[2, 2], [0, 1]]  # 3rd region, 1st and 2nd samples

Lazy indexing

See Dataset.subset_to(). This is useful, for example, to create splits for training, validation, and testing, or filter out regions or samples after writing a full dataset. This is also necessary if you intend to create a Pytorch DataLoader from the Dataset using Dataset.to_dataloader().

Return values

The return value depends on the Dataset state, namely sequence_type, active_tracks, and output_length. These can all be modified after opening a Dataset using the following methods: - Dataset.with_seqs() - Dataset.with_tracks() - Dataset.with_len()

static open(path, reference=None, jitter=0, rng=False, deterministic=True, rc_neg=True, min_af=None, max_af=None, var_fields=None, region_names=None, splice_info=None, var_filter=None, *, svar=None)[source]#

Open a dataset from a path. If no reference genome is provided, the dataset cannot yield sequences. Will initialize the dataset such that it will return tracks and haplotypes (reference sequences if no genotypes) if possible. If tracks are available, they will be set to be returned in alphabetical order.

Parameters:
  • path (str | Path) – Path to a dataset.

  • reference (str | Path | Reference | None, default: None) – Path to a reference genome.

  • jitter (int, default: 0) – Amount of jitter to use, cannot be more than the maximum jitter of the dataset.

  • rng (int | Generator | None, default: False) – Random seed or np.random.Generator for any stochastic operations.

  • deterministic (bool, default: True) – Whether to use randomized or deterministic algorithms. If set to True, this will disable random shifting of longer-than-requested haplotypes.

  • rc_neg (bool, default: True) – Whether to reverse-complement sequences and reverse tracks on negative strands.

  • min_af (float | None, default: None) – The minimum allele frequency to include in the dataset. If dataset is not backed by SVAR genotypes, this will raise an error.

  • max_af (float | None, default: None) – The maximum allele frequency to include in the dataset. If dataset is not backed by SVAR genotypes, this will raise an error.

  • var_fields (list[str] | None, default: None) – The variant fields to include in the dataset. Defaults to the minimum useful set ["alt", "ilen", "start"]. Pass additional field names (e.g. "ref", "dosage", or any info column present in the source variants table) to load them eagerly at open time. Must be a subset of available_var_fields.

  • splice_info (str | tuple[str, str] | None, default: None) – A string or tuple of strings representing the splice information to use. If a string, it will be used as the transcript ID and the exons are expected to be in order. If a tuple of strings, the first string will be used as the transcript ID and the second string will be used as the exon number. If a dictionary, the keys will be used as the transcript ID and the values should be the row number for each exon, in order. If False, splicing will be disabled.

  • var_filter (Optional[Literal['exonic']], default: None) – Whether to filter variants. If set to "exonic", only exonic variants will be applied.

  • svar (str | Path | None, default: None) – Override the recorded SVAR location. Use when the original SVAR has moved and the dataset cannot find it via the stored relative/absolute path or by sibling discovery.

  • region_names (str | None)

Return type:

RaggedDataset[TypeVar(MaybeRSEQ, None, RaggedSeqs, RaggedAnnotatedHaps, RaggedVariants), TypeVar(MaybeRTRK, None, Ragged[float32], RaggedIntervals)]

with_settings(jitter=None, rng=None, deterministic=None, rc_neg=None, min_af=None, max_af=None, var_fields=None, splice_info=None, var_filter=None)[source]#

Modify settings of the dataset, returning a new dataset without modifying the old one.

Parameters:
  • jitter (int | None, default: None) – How much jitter to use. Must be non-negative and <= the max_jitter of the dataset.

  • rng (int | Generator | None, default: None) – Random seed or np.random.Generator for non-deterministic operations e.g. jittering and shifting longer-than-requested haplotypes.

  • deterministic (bool | None, default: None) – Whether to use randomized or deterministic algorithms. If set to True, this will disable random shifting of longer-than-requested haplotypes and, for unphased variants, will enable deterministic variant assignment and always apply the highest CCF group. Note that for unphased variants, this will mean not all possible haplotypes can be returned.

  • rc_neg (bool | None, default: None) – Whether to reverse-complement sequences and reverse tracks on negative strands.

  • min_af (Union[float, Literal[False], None], default: None) – The minimum allele frequency to include in the dataset. If set to False, disables this filter. If dataset is not backed by SVAR genotypes, this will raise an error.

  • max_af (Union[float, Literal[False], None], default: None) – The maximum allele frequency to include in the dataset. If set to False, disables this filter. If dataset is not backed by SVAR genotypes, this will raise an error.

  • var_fields (list[str] | None, default: None) – The variant fields to include in the dataset.

  • splice_info (Union[str, tuple[str, str], Literal[False], None], default: None) – A string or tuple of strings representing the splice information to use. If a string, it will be used as the transcript ID and the exons are expected to be in order. If a tuple of strings, the first string will be used as the transcript ID and the second string will be used as the exon number. If a dictionary, the keys will be used as the transcript ID and the values should be the row number for each exon, in order. If False, splicing will be disabled.

  • var_filter (Optional[Literal[False, 'exonic']], default: None) – Whether to filter variants. If set to "exonic", only exonic variants will be applied.

Return type:

Self

with_len(output_length)[source]#

Modify the output length of the dataset, returning a new dataset without modifying the old one.

Parameters:

output_length (Union[Literal['ragged', 'variable'], int]) – The output length. Can be set to "ragged" or "variable" to allow for variable length sequences. If set to an integer, all sequences will be padded or truncated to this length. See the online documentation for more information.

Return type:

ArrayDataset | RaggedDataset

with_seqs(kind)[source]#

Return a new dataset with the specified sequence type. The sequence type can be one of the following:

  • "reference": reference sequences.

  • "haplotypes": personalized haplotype sequences.

  • "annotated": annotated haplotype sequences, which includes personalized haplotypes along with annotations.

  • "variants": no sequences, just variants as RaggedVariants

Annotated haplotypes are returned as the AnnotatedHaps class which is roughly:

class AnnotatedHaps:
    haps: NDArray[np.bytes_]
    var_idxs: NDArray[np.int32]
    ref_coords: NDArray[np.int32]

where haps are the haplotypes as bytes/S1, and var_idxs and ref_coords are arrays with the same shape as haps that annotate every nucleotide with the variant index and reference coordinate it corresponds to. A variant index of -1 corresponds to a reference nucleotide, and a reference coordinate of -1 corresponds to padded nucleotides that were added for regions beyond the bounds of the reference genome. i.e. if the region’s start position is negative or the end position is beyond the end of the reference genome.

For example, a toy result for chr1:1-10 could be:

haps:        A C G  T ...  T T  A ...
var_idxs:   -1 3 3 -1 ... -1 4 -1 ...
ref_coords:  1 2 2  3 ...  6 7  9 ...

where variant 3 is a 1 bp CG insertion and variant 4 is a 1 bp deletion T-. Note that the first nucleotide of every indel maps to a reference position since gvl.write() expects that variants are all left-aligned.

Important

The var_idxs are numbered with respect to the full set of variants even if the variants were extracted from per-chromosome VCFs/PGENs. So a variant index of 0 corresponds to the first variant across all chromosomes. Thus, if you want to map the variant index to per-chromosome VCFs/PGENs, you will need to subtract the number of variants on all other chromosomes before the variant index to get the correct variant index in the VCF/PGEN. Relevant values can be obtained by instantiating a gvl.Variants class from the VCFs/PGENs and accessing the Variants.records.contig_offsets attribute.

If the Dataset’s output length is "ragged", then annotated haplotypes will be RaggedAnnotatedHaps where each field is a Ragged array instead of NumPy arrays.

Parameters:

kind (Optional[Literal['haplotypes', 'reference', 'annotated', 'variants']]) – The type of sequences to return. Can be one of "reference", "haplotypes", "annotated", "variants", or None to return no sequences.

with_tracks(tracks=None, kind=None)[source]#

Modify which tracks to return, returning a new dataset without modifying the old one.

Parameters:
  • tracks (Union[str, list[str], Literal[False], None], default: None) – The tracks to return. Can be a (list of) track names or False to return no tracks.

  • kind (Literal['tracks', 'intervals'] | None)

with_insertion_fill(fill)[source]#

Configure how track values are filled at insertion sites.

Only meaningful when the dataset returns haplotypes and tracks (i.e. when the reconstructor is HapsTracks). Pure-reference and pure-haplotype datasets have no insertion fill to configure.

Parameters:

fill (InsertionFill | Mapping[str, InsertionFill]) – Either a single InsertionFill strategy applied to every active track, or a dict mapping track name to strategy. Tracks not in the dict fall back to Repeat5p.

Return type:

Self

path: Path#

Path to the dataset.

output_length: Literal['ragged', 'variable'] | int#

The output length. Can be set to "ragged" or "variable" to allow for variable length sequences. If set to an integer, all sequences will be padded or truncated to this length. See the online documentation for more information.

max_jitter: int#

Maximum jitter.

return_indices: bool#

Whether to return row and sample indices corresponding to the full dataset (no subsetting).

contigs: list[str]#

List of unique contigs.

jitter: int#

How much jitter to use.

deterministic: bool#

Whether to use randomized or deterministic algorithms. If set to False, this will enable random shifting of longer-than-requested haplotypes and, for unphased variants, enable choosing sets of compatible variants proportional to their CCF; otherwise the dataset will always apply compatible sets with the highest CCF.

Note

This setting is independent of jitter, if you want no jitter you should set it to 0.

rc_neg: bool#

Whether to reverse-complement the sequences on negative strands.

property is_subset: bool#

Whether the dataset is a subset.

property is_spliced: bool#

Whether the dataset is spliced.

property has_reference: bool#

Whether the dataset was provided a reference genome.

property reference: Reference | None#

The reference genome.

property has_genotypes#

Whether the dataset has genotypes.

property has_intervals: bool#

Whether the dataset has intervals.

property samples: list[str]#

The samples in the dataset.

property regions: DataFrame#

The input regions in the dataset as they were provided to gvl.write() i.e. with all BED columns plus any extra columns that were present.

property n_regions: int#

The number of (spliced) regions in the dataset.

property spliced_regions: DataFrame | None#

The spliced regions in the dataset.

property n_samples: int#

The number of samples in the dataset.

property ploidy: int | None#

The ploidy of the dataset.

property shape: tuple[int, int]#

Return the shape of the dataset. (n_rows, n_samples)

property full_shape: tuple[int, int]#

Return the full shape of the dataset, ignoring any subsetting. (n_rows, n_samples)

property available_var_fields: list[str]#

Available variant fields.

property active_var_fields: list[str]#

Active variant fields.

property available_tracks: list[str] | None#

The available tracks in the dataset.

property active_tracks: list[str] | None#

The active tracks in the dataset.

property sequence_type: Literal['haplotypes', 'reference', 'annotated', 'variants'] | None#

The type of sequences in the dataset.

subset_to(regions=None, samples=None)[source]#

Subset the dataset to specific regions and/or samples by index or a boolean mask. If regions or samples are not provided, the corresponding dimension will not be subset.

Parameters:
Return type:

Self

Examples

Subsetting to the first 10 regions:

ds.subset_to(slice(10))

Subsetting to the 2nd and 4th samples:

ds.subset_to(samples=[1, 3])

Subsetting to chromosome 1, assuming it’s labeled "chr1":

r_idx = ds.regions["chrom"] == "chr1"
ds.subset_to(regions=r_idx)

Subsetting to regions labeled by a column “split”, assuming “split” existed in the input regions:

r_idx = ds.regions["split"] == "train"
ds.subset_to(regions=r_idx)

Subsetting to the intersection with another set of regions:

import seqpro as sp

regions = gvl.read_bedlike("regions.bed")
regions_pr = sp.bed.to_pyr(regions)
ds_regions_pr = sp.bed.to_pyr(ds.regions.with_row_index())
r_idx = ds_regions_pr.overlap(regions_pr).df["index"].to_numpy()
ds.subset_to(regions=r_idx)
to_full_dataset()[source]#

Return a full sized dataset, undoing any subsetting.

Return type:

Self

haplotype_lengths(regions=None, samples=None)[source]#

The lengths of jitter-extended haplotypes for specified regions and samples. If the dataset is not phased or not deterministic, this will return None because the haplotypes are not guaranteed to be a consistent length due to randomness in what variants are used.

Note

Currently not implemented for spliced datasets.

Parameters:
Return type:

ndarray[tuple[Any, ...], dtype[int32]] | None

n_variants(regions=None, samples=None)[source]#

The number of variants in the dataset for specified regions and samples.

Parameters:
Return type:

ndarray[tuple[Any, ...], dtype[int32]]

Returns:

Array with shape (…, ploidy). The number of variants in the dataset for the specified regions and samples. If the dataset does not have genotypes, this will return None.

n_intervals(regions=None, samples=None)[source]#

The number of intervals in the dataset for specified regions and samples.

Parameters:
Return type:

ndarray[tuple[Any, ...], dtype[int32]]

Returns:

Array with shape (…, tracks). The number of intervals in the dataset for the specified regions and samples. If the dataset does not have intervals, this will return None.

write_transformed_track(new_track, existing_track, transform, max_mem=1073741824, overwrite=False)[source]#

Write transformed tracks to the dataset.

Parameters:
  • new_track (str) – The name of the new track.

  • existing_track (str) – The name of the existing track to transform.

  • transform (Callable[[ndarray[tuple[Any, ...], dtype[int64]], ndarray[tuple[Any, ...], dtype[int64]], Ragged[float32]], Ragged[float32]]) – A function to apply to the existing track to get a new, transformed track. This will be done in chunks such that the tracks provided will not exceed max_mem. The arguments given to the transform will be the region and sample indices as numpy arrays and the tracks themselves as a Ragged array with shape (regions, samples). The tracks must be a Ragged array since regions may be different lengths to accommodate indels. This function should then return the transformed tracks as a Ragged array with the same shape and lengths.

  • max_mem (int, default: 1073741824) – The maximum memory to use in bytes, by default 1 GiB (2**30 bytes)

  • overwrite (bool, default: False) – Whether to overwrite the existing track, by default False

Return type:

ArrayDataset | RaggedDataset

write_annot_tracks(tracks, overwrite=False)[source]#

Write annotation tracks to the dataset. Returns a new dataset with the tracks available. Activate them with with_tracks().

Parameters:
  • tracks (dict[str, str | Path | DataFrame]) –

    Paths to the annotation tracks (or literal tables) in BED-like format. Keys should be the track names and values should be the paths to the BED files or polars.DataFrames.

    Note

    Only supports BED files for now.

  • overwrite (bool, default: False) – Whether to overwrite the existing tracks, by default False

Return type:

Self

to_torch_dataset(return_indices, transform)[source]#

Convert the dataset to a PyTorch Dataset. Requires PyTorch to be installed.

Parameters:
  • return_indices (bool) – Whether to append arrays of row and sample indices of the non-subset dataset to each batch.

  • transform (Callable | None) –

    The transform to apply to each batch of data. The transform should take input matching the output of the dataset and can return anything that can be converted to a PyTorch tensor. In combination with indices, this allows you to combine arbitrary row- and sample-specific data with dataset output on-the-fly.

    Note

    Depending on how transforms are implemented, they can easily introduce a dataloading bottleneck. If you find dataloading is slow, it’s often a good idea to try disabling your transform to see if it’s impacting throughput.

Return type:

TorchDataset

to_dataloader(batch_size=1, shuffle=False, sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, generator=None, *, prefetch_factor=None, persistent_workers=False, pin_memory_device='', return_indices=False, transform=None, mode=None, buffer_bytes=2147483648, copy=True, heartbeat_seconds=60.0)[source]#

Convert the dataset to a PyTorch DataLoader. The parameters are the same as a DataLoader with a few omissions e.g. batch_sampler. Requires PyTorch to be installed.

Parameters:
  • batch_size (int, default: 1) – How many samples per batch to load.

  • shuffle (bool, default: False) – Set to True to have the data reshuffled at every epoch.

  • sampler (Sampler | Iterable | None, default: None) –

    Defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented. If specified, shuffle must not be specified.

    Important

    Do not provide a BatchSampler here. GVL Datasets use multithreading when indexed with batches of indices to avoid the overhead of multi-processing. To leverage this, GVL will automatically wrap the sampler with a BatchSampler so that lists of indices are given to the GVL Dataset instead of one index at a time. See this post for more information.

  • num_workers (int, default: 0) –

    How many subprocesses to use for dataloading. 0 means that the data will be loaded in the main process.

    Tip

    For GenVarLoader, it is generally best to set this to 0 or 1 since almost everything in GVL is multithreaded. However, if you are using a transform that is compute intensive and single threaded, there may be a benefit to setting this > 1.

  • collate_fn (Callable | None, default: None) – Merges a list of samples to form a mini-batch of Tensor(s).

  • pin_memory (bool, default: False) – If True, the data loader will copy Tensors into device/CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the example below.

  • drop_last (bool, default: False) – Set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller.

  • timeout (float, default: 0) – If positive, the timeout value for collecting a batch from workers. Should always be non-negative.

  • worker_init_fn (Callable | None, default: None) – If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading.

  • multiprocessing_context (Callable | None, default: None) – If None, the default multiprocessing context of your operating system will be used.

  • generator (Generator | None, default: None) – If not None, this RNG will be used by RandomSampler to generate random indexes and multiprocessing to generate base_seed for workers.

  • prefetch_factor (int | None, default: None) – Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers. (default value depends on the set value for num_workers. If value of num_workers=0 default is None. Otherwise, if value of num_workers > 0 default is 2).

  • persistent_workers (bool, default: False) – If True, the data loader will not shut down the worker processes after a dataset has been consumed once. This allows to maintain the workers Dataset instances alive.

  • pin_memory_device (str, default: '') – The device to pin_memory to if pin_memory is True.

  • return_indices (bool, default: False) – Whether to append arrays of row and sample indices of the non-subset dataset to each batch.

  • transform (Callable | None, default: None) –

    The transform to apply to each batch of data. The transform should take input matching the output of the dataset and can return anything that can be converted to a PyTorch tensor. In combination with indices, this allows you to combine arbitrary row- and sample-specific data with dataset output on-the-fly.

    Note

    Depending on how transforms are implemented, they can easily introduce a dataloading bottleneck. If you find dataloading is slow, it’s often a good idea to try disabling your transform to see if it’s impacting throughput.

  • mode (str | None)

  • buffer_bytes (int)

  • copy (bool)

  • heartbeat_seconds (float)

Return type:

DataLoader

genvarloader.get_dummy_dataset(spliced=False)[source]#

Return a dummy Dataset with 4 regions, 4 samples, max jitter of 2, a reference genome of all "N", genotypes, and 1 track “read-depth” where each track is [1, 2, 3, 4, 5, 6] in the reference coordinate system, where 3 is aligned with each region’s start coordinate. Is initialized to return ragged haplotypes and tracks with no jitter and deterministic reconstruction algorithms.

Parameters:

spliced (bool, default: False) –

If True, the dataset will be setup for splicing with all regions moved to chromosome 1 and a splice indexer with 2 genes, “tp53” and “shh”, corresponding to regions:

{
    "tp53": [3, 0, 2],
    "shh": [1],
}

class genvarloader.RaggedDataset[source]#

Only for type checking purposes, you should never instantiate this class directly.

class genvarloader.ArrayDataset[source]#

Only for type checking purposes, you should never instantiate this class directly.

Reference genome(s)#

class genvarloader.Reference[source]#

A reference genome kept in-memory. Typically this is only instantiated to be passed to Dataset.open and avoid data duplication.

Note

Do not instantiate this class directly. Use Reference.from_path() instead.

path: Path#

The path to the reference genome.

reference: ndarray[tuple[Any, ...], dtype[uint8]]#

The reference genome as a numpy array, with contigs concatenated.

offsets: ndarray[tuple[Any, ...], dtype[int64]]#

(n_contigs + 1)

Type:

The offsets of the contigs in the reference genome. Shape

pad_char: int#

The padding character used in the reference genome.

classmethod from_path(fasta, contigs=None, in_memory=True)[source]#

Load a reference genome from a FASTA file.

Parameters:
  • fasta (str | Path) – Path to a .fa/.fa.bgz FASTA file or an existing .gvlfa cache directory. When a FASTA path is given, a sibling .gvlfa cache is built on first use and reused on subsequent calls; a legacy .fa.gvl flat cache is automatically migrated to the new format.

  • contigs (list[str] | None, default: None) – List of contig names to load. If None, all contigs in the FASTA file are loaded. Can be either UCSC or Ensembl style (i.e. with or without the “chr” prefix) and will be handled appropriately to match the underlying FASTA.

  • in_memory (bool, default: True) – Whether to load the reference genome into memory. If True, the reference genome is loaded into memory. If False, the reference genome is read on-demand from a memory mapped array. This will still be much faster than reading from FASTA but slower than keeping it in memory. This is useful if you need to work with many reference genomes or have very limited RAM.

class genvarloader.RefDataset[source]#

A reference dataset for pulling out sequences from a reference genome.

When splice_info is provided, the dataset returns per-transcript concatenated reference sequence, with one row per splice group instead of one row per BED region. Same semantics as Dataset.open(splice_info=...).

reference: Reference#

The reference genome.

full_bed: DataFrame#

A table of regions to extract from the reference genome. The table must have the following columns: - chrom: The name of the contig (e.g. “chr1”, “chr2”, etc.) - chromStart: The start position of the region (0-based). - chromEnd: The end position of the region (0-based). A strand column can also be included, in which case the regions will be reverse complemented if the strand is -1 and the rc_neg parameter is set to True.

jitter: int#

The maximum length for randomly shifting start positions.

output_length: Literal['ragged', 'variable'] | int#

The output length of the dataset. Same meaning as Dataset.output_length.

deterministic: bool#

If true, fixed length sequences will be right truncated from their full length to the output length. If false, fixed length sequences will be randomly shifted to be within the output length.

rc_neg: bool#

Whether to reverse complement the regions that are on the negative strand.

region_names: str | None#

The name of the column in the full_bed table to use as the region names.

splice_info: str | tuple[str, str] | None#

If set, the dataset is spliced. Either the column name with rows already in splice order or a (group_col, sort_col) pair applied against full_bed.

property is_spliced: bool#

Whether the dataset is spliced.

property spliced_regions: DataFrame#

The spliced BED, subset to the current row subset.

property shape: tuple[int]#

Shape of the dataset.

subset_to(regions)[source]#

Subset the dataset to a subset of regions (or transcripts, when spliced).

Parameters:

regions (int | integer | ndarray[tuple[Any, ...], dtype[integer]] | slice | Sequence[int] | Sequence[bool] | ndarray[tuple[Any, ...], dtype[bool]] | Series | str | Sequence[str] | ndarray[tuple[Any, ...], dtype[str_]] | ndarray[tuple[Any, ...], dtype[object_]])

to_full_dataset()[source]#

Reset the dataset to the full dataset.

Return type:

Self

__init__(reference, full_bed, jitter=0, output_length='ragged', deterministic=True, rc_neg=True, seed=None, region_names=None, splice_info=None)#
Parameters:
Return type:

None

to_torch_dataset(return_indices=False, transform=None)[source]#

Convert the dataset to a PyTorch dataset.

Parameters:
  • return_indices (bool, default: False) – If True, the dataset will return the indices of the regions in the reference genome.

  • transform (Callable | None, default: None) – A function to transform the data. Should accept a numpy array of S1 with shape (batch_size, length). If return_indices is true, the function should accept a tuple of (sequences, indices).

Return type:

TorchDataset

to_dataloader(batch_size=1, shuffle=False, sampler=None, num_workers=0, collate_fn=None, pin_memory=False, drop_last=False, timeout=0, worker_init_fn=None, multiprocessing_context=None, generator=None, *, prefetch_factor=None, persistent_workers=False, pin_memory_device='', return_indices=False, transform=None)[source]#

Convert the dataset to a PyTorch DataLoader. The parameters are the same as a DataLoader with a few omissions e.g. batch_sampler. Requires PyTorch to be installed.

Parameters:
  • batch_size (int, default: 1) – How many samples per batch to load.

  • shuffle (bool, default: False) – Set to True to have the data reshuffled at every epoch.

  • sampler (Sampler | Iterable | None, default: None) –

    Defines the strategy to draw samples from the dataset. Can be any Iterable with __len__ implemented. If specified, shuffle must not be specified.

    Important

    Do not provide a BatchSampler here. GVL Datasets use multithreading when indexed with batches of indices to avoid the overhead of multi-processing. To leverage this, GVL will automatically wrap the sampler with a BatchSampler so that lists of indices are given to the GVL Dataset instead of one index at a time. See this post for more information.

  • num_workers (int, default: 0) –

    How many subprocesses to use for dataloading. 0 means that the data will be loaded in the main process.

    Tip

    For GenVarLoader, it is generally best to set this to 0 or 1 since almost everything in GVL is multithreaded. However, if you are using a transform that is compute intensive and single threaded, there may be a benefit to setting this > 1.

  • collate_fn (Callable | None, default: None) – Merges a list of samples to form a mini-batch of Tensor(s).

  • pin_memory (bool, default: False) – If True, the data loader will copy Tensors into device/CUDA pinned memory before returning them. If your data elements are a custom type, or your collate_fn returns a batch that is a custom type, see the example below.

  • drop_last (bool, default: False) – Set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of dataset is not divisible by the batch size, then the last batch will be smaller.

  • timeout (float, default: 0) – If positive, the timeout value for collecting a batch from workers. Should always be non-negative.

  • worker_init_fn (Callable | None, default: None) – If not None, this will be called on each worker subprocess with the worker id (an int in [0, num_workers - 1]) as input, after seeding and before data loading.

  • multiprocessing_context (Callable | None, default: None) – If None, the default multiprocessing context of your operating system will be used.

  • generator (Generator | None, default: None) – If not None, this RNG will be used by RandomSampler to generate random indexes and multiprocessing to generate base_seed for workers.

  • prefetch_factor (int | None, default: None) – Number of batches loaded in advance by each worker. 2 means there will be a total of 2 * num_workers batches prefetched across all workers. (default value depends on the set value for num_workers. If value of num_workers=0 default is None. Otherwise, if value of num_workers > 0 default is 2).

  • persistent_workers (bool, default: False) – If True, the data loader will not shut down the worker processes after a dataset has been consumed once. This allows to maintain the workers Dataset instances alive.

  • pin_memory_device (str, default: '') – The device to pin_memory to if pin_memory is True.

  • return_indices (bool, default: False) – If True, the dataset will return the indices of the regions in the reference genome.

  • transform (Callable | None, default: None) – A function to transform the data. Should accept a numpy array of S1 with shape (batch_size, length). If return_indices is true, the function should accept a tuple of (sequences, indices).

Return type:

DataLoader

Non-personal/site-only variants#

class genvarloader.DatasetWithSites[source]#
__init__(dataset, sites, max_variants_per_region=1)[source]#

Dataset with variant sites, used to apply site-only variants e.g. from ClinVar to a Dataset of haplotypes. Currently only supports bi-allelic SNPs. Takes the intersection of the dataset regions and the sites, and applies the site-only variants to the Dataset’s haplotypes.

Accessed just like a Dataset, but where the rows are combinations of dataset regions and sites. Will return two AnnotatedHaps with variants applied and flags indicating whether the variant was applied, deleted, or existed. The flags are 0 for applied, 1 for deleted, and 2 for existed. If the dataset has tracks, they will be returned as well and reflect any site-only variants. The first AnnotatedHaps is the wildtype haplotypes and the second is the mutated haplotypes. The mutant haplotypes will also have their variant indices and reference coordinates updated to reflect the applied variants. Locations where a site-only variant was applied will have a variant index of -2.

Parameters:

Examples

import genvarloader as gvl
sites = gvl.sites_vcf_to_table("path/to/variants.vcf")

ds = gvl.Dataset.open("path/to/dataset.gvl", "path/to/reference.fasta")
ds_sites = gvl.DatasetWithSites(ds, sites)
wt_haps, mut_haps, flags = ds_sites[0, 0]
# flags is a np.uint8 (or an array of np.uint8 when accessing multiple rows/samples)

ds_sites.dataset = ds_sites.dataset.with_tracks("read-depth")
wt_haps, mut_haps, flags, tracks = ds_sites[0, 0]
sites: DataFrame#

Table of variant site information.

dataset: ArrayDataset[AnnotatedHaps, MaybeTRK]#

Dataset of haplotypes and potentially tracks.

rows: DataFrame#

Rows of this object, where each row is a combination of a dataset region and a site.

genvarloader.sites_vcf_to_table(vcf, attributes=None, info_fields=None)[source]#

Extract a table of variant site info from a VCF. All sites must be bi-allelic.

Parameters:
  • vcf (str | Path | VCF) – Path to a VCF or a genoray.VCF instance. Note that genoray.VCF can accept a filter function.

  • attributes (list[str] | None, default: None) – A list of attributes to include in the output table. Note that “CHROM”, “POS”, “REF”, and “ALT” are always included even if not in this list.

  • info_fields (list[str] | None, default: None) – A list of INFO fields to include in the output table.

Return type:

DataFrame

genvarloader.SitesSchema = SitesSchema[source]#

Schema to validate a table of variant sites.

Return type:

DataFrameBase[Self]

Data registry#

genvarloader.data_registry.fetch(name)[source]#

Download and cache data for constructing/opening a GVL dataset. Files are cached in the user’s home directory under ~/.cache/genvarloader.

Parameters:

name (Literal['geuvadis_ebi', '1kgp']) –

The name of the dataset to fetch. Can be one of:

  • ”geuvadis_ebi”: Geuvadis data for the original analyses by Lappalainen et al. 2013. Phased, normalized, and split into biallelic variants.

  • ”1kgp”: 1000 Genomes Project, all 3,202 individuals. Phased, normalized, and split into biallelic variants.

Return type:

dict[str, Path]

Returns:

A dictionary of paths to the fetched data.

Containers#

Classes that GVL Datasets may return.

class genvarloader.AnnotatedHaps[source]#

AnnotatedHaps(haps: ‘NDArray[np.bytes_]’, var_idxs: ‘NDArray[np.int32]’, ref_coords: ‘NDArray[np.int32]’)

haps: ndarray[tuple[Any, ...], dtype[bytes_]]#

Haplotypes with dtype S1.

var_idxs: ndarray[tuple[Any, ...], dtype[int32]]#

Variant indices for each position in the haplotypes. A value of -1 indicates no variant was applied at the position.

ref_coords: ndarray[tuple[Any, ...], dtype[int32]]#

Reference coordinates for each position in haplotypes.

property shape#

Shape of the haplotypes and all annotations.

reshape(shape)[source]#

Reshape the haplotypes and all annotations.

Parameters:

shape (int | tuple[int, ...]) – New shape for the haplotypes and all annotations. The total number of elements must remain the same.

squeeze(axis=None)[source]#

Squeeze the haplotypes and all annotations along the specified axis.

Parameters:

axis (int | tuple[int, ...] | None, default: None) – Axis or axes to squeeze. If None, all axes of length 1 will be squeezed.

Return type:

AnnotatedHaps

class genvarloader.Ragged[source]#

An awkward array with exactly 1 ragged dimension. The ragged dimension is None in its shape tuple.

!!! warning

Ragged arrays only support a subset of Awkward array features.

  • Strings are not supported since ASCII is sufficient for the bioinformatics domain.

  • Bytestrings count as a ragged dimension, and we break from the Awkward convention to not include a “var” in the type string.

  • Record-layout Ragged arrays (produced by ak.zip of Ragged inputs or by passing a record-layout ak.Array) return field-keyed dicts from dtype, data, and parts. Use rag[“field”] for zero-copy single-field access. view, apply, and to_numpy are not defined on record layouts; access individual fields. Union types remain unsupported.

static from_offsets(data, shape, offsets)[source]#

Create a Ragged array from data, offsets, and shape.

Parameters:
Return type:

Ragged[DTYPE_co]

static from_lengths(data, lengths)[source]#

Create a Ragged array from data and lengths.

Parameters:
Return type:

Ragged[DTYPE_co]

parts#

The parts of the Ragged array. For record layouts, a dict of field name -> RagParts; all share the same offsets ndarray.

data#

The data of the Ragged array. For record layouts, a dict of field name -> zero-copy ndarray view, in awkward field order.

property offsets: ndarray[tuple[Any, ...], dtype[int64]]#

The offsets of the Ragged array. May have shape (n_ragged + 1) or (2, n_ragged).

Return type:

NDArray[np.int64]

property shape: tuple[int | None, ...]#

The shape of the Ragged array. The ragged dimension is None.

Return type:

tuple[int | None, ]

property dtype: dtype[RDTYPE_co]#

The dtype of the Ragged array.

For non-record layouts, returns the numpy dtype of the flat data buffer (e.g. np.dtype('int32')).

For record layouts, returns a numpy structured dtype whose field names and per-field dtypes match the Ragged record fields — for example:

np.dtype([("seq", "S1"), ("score", "f4")])

Note

Memory layout is SoA, not AoS. A numpy structured dtype normally implies Array-of-Structs packing, but here each field lives in its own contiguous buffer (Structure of Arrays). The structured dtype is used purely as a convenient, numpy-compatible descriptor: it carries all field/dtype information in a single object without inventing a new type.

Return type:

np.dtype[RDTYPE_co]

property rag_dim: int#

The index of the ragged dimension.

Return type:

int

property lengths: ndarray[tuple[Any, ...], dtype[integer]]#

The lengths of the segments.

Return type:

NDArray[np.integer]

view(dtype)[source]#

Return a view of the data with the given dtype.

Parameters:

dtype (type[TypeVar(DTYPE_co, bound= number | bytes_ | datetime64 | timedelta64 | bool | void, covariant=True)] | str) – Target dtype.

Returns:

Zero-copy view with reinterpreted dtype.

Return type:

Ragged[DTYPE_co]

classmethod empty(shape, dtype)[source]#

Create an empty Ragged array with the given shape and dtype.

Parameters:
Return type:

Ragged[DTYPE_co]

property is_empty: bool#

Whether the Ragged array is empty.

Return type:

bool

property is_contiguous: bool#

Whether the Ragged array is contiguous.

Return type:

bool

property is_base: bool#

Whether the Ragged array is a base array (owns its data, contiguous, no offset).

Return type:

bool

to_numpy(allow_missing=False)[source]#

Convert to a dense NumPy array. Not zero-copy if offsets or data are non-contiguous.

Parameters:

allow_missing (bool, default: False) – Passed through to ak.Array.to_numpy.

Return type:

NDArray[RDTYPE_co]

to_packed(*, copy=True)[source]#

Pack into a fresh contiguous, zero-based Ragged (1-D offsets).

Numba-parallelized replacement for Ragged(ak.to_packed(self)). See seqpro.rag.to_packed() for the copy semantics.

Parameters:

copy (bool, default: True) – When True (default), return a freshly allocated owned array. When False, return zero-copy if already packed, else raise.

Return type:

Ragged[RDTYPE_co]

squeeze(axis=None)[source]#

Squeeze the ragged array along the given non-ragged axis. If squeezing would result in a 1D array, return the data as a numpy array. For record layouts, dispatches per-field; if fields collapse to 1D ndarrays, returns a dict of ndarrays, otherwise returns a record Ragged.

Parameters:

axis (int | tuple[int, ...] | None, default: None) – Axis or axes to squeeze. Must have size 1. If None, squeeze all size-1 axes.

Return type:

Self | NDArray[RDTYPE_co] | dict[str, NDArray[RDTYPE_co]]

reshape(*shape)[source]#

Reshape non-ragged axes.

Parameters:

*shape (int | None | tuple[int | None, ...]) – New shape including exactly one None for the ragged dimension.

Return type:

Self

to_ak()[source]#

Convert to a plain Awkward array, stripping the Ragged behavior.

Return type:

ak.Array

class genvarloader.RaggedAnnotatedHaps[source]#

Ragged version of AnnotatedHaps.

haps: Ragged[bytes_]#

Haplotypes with dtype S1.

var_idxs: Ragged[int32]#

Variant indices for each position in the haplotypes. A value of -1 indicates no variant was applied at the position.

ref_coords: Ragged[int32]#

Reference coordinates for each position in haplotypes.

property shape#

Shape of the haplotypes and all annotations.

to_padded()[source]#

Convert this Ragged array to a rectilinear array by right-padding each entry with appropriate values. The final axis will have the maximum length across all entries.

Return type:

AnnotatedHaps

reshape(shape)[source]#

Reshape the haplotypes and all annotations.

Parameters:

shape (int | tuple[int, ...]) – New shape for the haplotypes and all annotations. The total number of elements must remain the same.

Return type:

RaggedAnnotatedHaps

squeeze(axis=None)[source]#

Squeeze the haplotypes and all annotations along the specified axis.

Parameters:

axis (int | tuple[int, ...] | None, default: None) – Axis or axes to squeeze. If None, all axes of length 1 are squeezed.

Return type:

RaggedAnnotatedHaps

to_numpy()[source]#

If all entries in the ragged array have the same shape, convert to a rectilinear shape.

Parameters:

shape – Shape to convert to, including the length axis. The total number of elements must remain the same.

Return type:

AnnotatedHaps

class genvarloader.RaggedVariants[source]#

An awkward record array with shape (batch, ploidy, ~variants, [~length]). Guaranteed to at least have the field "alt" and "start" and one of "ref" or "ilen".

classmethod from_ak(arr)[source]#

Create a RaggedVariants object from an awkward array.

Parameters:

arr (Array) – The awkward array to create a RaggedVariants object from.

Return type:

RaggedVariants

property alt: Array#

Alternative alleles.

property start: Ragged[int32]#

0-based start positions.

property ilen: Ragged[int32]#

Indel lengths. Infallible.

property end: Ragged[int32]#

0-based, exclusive end positions.

reshape(shape)[source]#

Reshape leading, regular axes. Assumes no trailing regular axes.

Return type:

Self

Parameters:

shape (tuple[int | None, ...])

squeeze(axis=None, **kwargs)[source]#

Squeeze first axis.

Return type:

Self

Parameters:

axis (int | None)

infer_germline_ccfs_(ccf_field='dosages', max_ccf=1.0)[source]#

Infer germline CCFs in-place.

Germline variants are identified by having missing CCFs i.e. they have a variant index but missing CCFs. Missing CCFs are inferred to be max_ccf - sum(overlapping CCFs).

Parameters:
  • max_ccf (float, default: 1.0) – Maximum CCF value.

  • ccf_field (str)

Return type:

Self

to_packed()[source]#

Pack all fields into contiguous, zero-based arrays.

Replaces the previous ak.to_packed() call with field-wise packing: seqpro to_packed() for numeric Ragged fields, and an allele-level seqpro pack + group-offset rebase + _build_allele_layout() rebuild for the doubly-nested alt/ref fields.

Return type:

Self

rc_(to_rc=None)[source]#

Reverse complement the alleles. This is an in-place operation.

Parameters:

to_rc (ndarray[tuple[Any, ...], dtype[bool]] | None, default: None) – A boolean mask of the same shape as the variant dimension. If True, the alternative allele will be reverse complemented. If None, will reverse complement all alternative alleles.

Return type:

Self

Returns:

The RaggedVariants object with the alleles reverse complemented.

to_nested_tensor_batch(device='cpu', tokenizer=None)[source]#

Convert a RaggedVariants object to a dictionary of nested tensors. Will flatten across the ploidy dimension for attributes ILEN, starts, and dosages such that their shapes are (batch * ploidy, ~variants). For the alternative alleles, will flatten across both the ploidy and variant dimensions such that the shape is (batch * ploidy * ~variants, ~alt_len).

Important

This function assumes all variant data is packed (see ak.to_packed()).

Parameters:
  • device (str | device, default: 'cpu') – The device to move the tensors to.

  • tokenizer (Union[Literal['seqpro'], Callable[[ndarray[tuple[Any, ...], dtype[bytes_]]], ndarray[tuple[Any, ...], dtype[integer]]], None], default: None) –

    The tokenizer to use for the alternative alleles.

    • If "seqpro", will use seqpro.tokenize() to convert ACGTN -> 0 1 2 3 4.

    • If None, will use the integer ASCII value of each character i.e. ACGTN -> 65 67 71 84 78.

    • Otherwise, will use the provided callable to convert the alternative alleles to a tensor of integers.

Return type:

dict[str, NestedTensor | int]

Returns:

Dictionary of nested tensors and integers with the following keys:

  • "alts" with shape (batch * ploidy * ~variants, ~alt_len)

  • "ilens" with shape (batch * ploidy, ~variants)

  • "starts" with shape (batch * ploidy, ~variants)

  • "dosages" with shape (batch * ploidy, ~variants)

  • "max_n_vars": int, maximum number of variants

  • "max_alt_len": int, maximum length of an alternative allele

  • "max_ref_len": int, maximum length of a reference allele

pad(allele=b'N', ilen=0, start=-1, dosage=0.0, **pad_values)[source]#

Append a pad variant so that every group is guaranteed to have at least 1 variant. If the group has variants, no variant is appended.

Parameters:
  • allele (str | bytes, default: b'N') – The allele to use for ALTs and REFs

  • ilen (int, default: 0)

  • start (int, default: -1) – The start position to use for the pad variant

  • dosage (float, default: 0.0) – The dosage to use for the pad variant

  • **pad_values (Any) – Additional values to use for each field. Raises a ValueError if any field does not have a pad value.

Return type:

Self

Returns:

The RaggedVariants object with the pad variant appended to each group that has no variants.

class genvarloader.RaggedIntervals[source]#

RaggedIntervals(starts: ‘Ragged[np.int32]’, ends: ‘Ragged[np.int32]’, values: ‘Ragged[np.float32]’)

property shape#

Shape of the haplotypes and all annotations.

to_padded(start, end, value)[source]#

Convert this RaggedIntervals to a tuple of rectilinear arrays by right-padding each entry with appropriate values. The final axis will have the maximum length across all entries.

Return type:

tuple[ndarray[tuple[Any, ...], dtype[int32]], ndarray[tuple[Any, ...], dtype[int32]], ndarray[tuple[Any, ...], dtype[float32]]]

Parameters:
reshape(shape)[source]#

Reshape the haplotypes and all annotations.

Parameters:

shape (int | tuple[int, ...]) – New shape for the haplotypes and all annotations. The total number of elements must remain the same.

Return type:

RaggedIntervals

squeeze(axis=None)[source]#

Squeeze the haplotypes and all annotations along the specified axis.

Parameters:

axis (int | tuple[int, ...] | None, default: None) – Axis or axes to squeeze. If None, all axes of length 1 are squeezed.

Return type:

RaggedIntervals

to_fixed_shape(shape)[source]#

If all entries in the ragged array have the same shape, convert to a rectilinear shape.

Parameters:

shape (tuple[int, ...]) – Shape to convert to, including the length axis. The total number of elements must remain the same.

Return type:

tuple[ndarray[tuple[Any, ...], dtype[int32]], ndarray[tuple[Any, ...], dtype[int32]], ndarray[tuple[Any, ...], dtype[float32]]]

to_packed()[source]#

Apply ak.to_packed() to all arrays.

Return type:

RaggedIntervals

prepend_pad_itv(start=-1, end=-1, value=0.0)[source]#

Prepend a pad interval so that every group is guaranteed to have at least 1 interval.

Parameters:
  • start (int, default: -1) – The start position to use for the pad interval

  • end (int, default: -1) – The end position to use for the pad interval

  • value (float, default: 0.0) – The value to use for the pad interval

Return type:

RaggedIntervals