`biocantor.gene.collections`

Collection classes. The data model is structured into two general categories, transcripts and features. Transcripts are wrapped into genes, and features are wrapped into feature collections. These are then wrapped up into one AnnotationCollection.

An AnnotationCollection is the topmost class. It holds all possible annotations for a given interval, as well as the place to find their sequence information.

It is useful to think of transcripts/genes as transcriptional units, which means these data structures model transcribed sequence. In contrast, features are non-transcribed, and are meant to model things such as promoters or transcription factor binding sites.

Each object is capable of exporting itself to BED and GFF3.

Module Contents

Classes

AnnotationCollection

An AnnotationCollection is a container to contain GeneInterval, FeatureIntervalCollection and VariantIntervalCollection.

Attributes

HAS_CGRANGES

biocantor.gene.collections.HAS_CGRANGES = False

class biocantor.gene.collections.AnnotationCollection(feature_collections: Optional[List[inscripta.biocantor.gene.feature.FeatureIntervalCollection]] = None, genes: Optional[List[inscripta.biocantor.gene.gene.GeneInterval]] = None, variant_collections: Optional[List[inscripta.biocantor.gene.variants.VariantIntervalCollection]] = None, name: Optional[str] = None, id: Optional[str] = None, sequence_name: Optional[str] = None, sequence_guid: Optional[uuid.UUID] = None, sequence_path: Optional[str] = None, qualifiers: Optional[Dict[Hashable, inscripta.biocantor.gene.interval.QualifierValue]] = None, start: Optional[int] = None, end: Optional[int] = None, completely_within: Optional[bool] = None, parent_or_seq_chunk_parent: Optional[inscripta.biocantor.parent.Parent] = None)

Bases: inscripta.biocantor.gene.interval.AbstractFeatureIntervalCollection

An AnnotationCollection is a container to contain GeneInterval, FeatureIntervalCollection and VariantIntervalCollection.

Encapsulates all possible annotations for a given interval on a specific source.

If no start/end points are provided, the interval for this collection is the min/max of the data it contains. The interval for an AnnotationCollection is always on the plus strand.

An AnnotationCollection can be empty (feature_collections, genes, and variant_collections can be None).

The object provided to parent_or_seq_chunk_parent must have a chromosome sequence-type in its ancestry, and there must be associated sequence. This object should look like the object produced by the function biocantor.io.parser.seq_to_parent(), and represent a full chromosome sequence. This will be automatically instantiated if you use the constructor method in biocantor.io.parser.ParsedAnnotationRecord, which will import the sequence from a BioPython SeqRecord object.

If you are using file parsers and the associated file types carry sequence information (GenBank or GFF3+FASTA), the sequences will also be included automatically when the ParsedAnnotationRecord is returned.

Object Bounds: If start is provided, end must be provided, and vice versa. If neither is provided, and a parent_or_seq_chunk_parent is provided, then the bounds of this collection will be inferred from that object, if possible. If not possible, the bounds of the collection will be the bounds of the associated child objects.
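The precedence described above can be sketched in plain Python. This is only an illustration of the stated rules, not BioCantor code: resolve_bounds, the Span tuple, and the None result for an empty collection are all assumptions made for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for a location: a plain (start, end) tuple.
Span = Tuple[int, int]


def resolve_bounds(
    start: Optional[int],
    end: Optional[int],
    parent_span: Optional[Span],
    child_spans: List[Span],
) -> Optional[Span]:
    """Pick collection bounds using the precedence described in the text."""
    # start and end must be given together.
    if (start is None) != (end is None):
        raise ValueError("start and end must be provided together")
    # 1. Explicit bounds win.
    if start is not None and end is not None:
        return (start, end)
    # 2. Otherwise infer from the parent, when one is available.
    if parent_span is not None:
        return parent_span
    # 3. Otherwise fall back to the min/max of the children.
    if child_spans:
        return (min(s for s, _ in child_spans), max(e for _, e in child_spans))
    # Assumption of this sketch: nothing to infer from, so no bounds.
    return None


print(resolve_bounds(None, None, None, [(12, 20), (35, 40)]))
```

With no explicit bounds and no parent, the two children above yield the span from the smallest start to the largest end.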

It is also possible to instantiate an AnnotationCollection with a sequence_chunk. A sequence_chunk is a slice of a chromosomal sequence that allows operations without loading an entire chromosome into memory. The easiest way to produce the parental relationship required for this object to operate on a sequence_chunk is to instantiate via the constructor biocantor.io.parser.seq_chunk_to_parent(). You provide the slice of sequence, the chromosomal start/end positions of that slice, and a sequence name; the returned Parent object is suitable for passing to this class.
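To make the idea of a sequence chunk concrete, here is a minimal sketch of the coordinate bookkeeping a chunk implies. SequenceChunk and its methods are hypothetical names invented for this illustration, and the 0-based half-open convention is an assumption; the real Parent object returned by seq_chunk_to_parent() is considerably richer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SequenceChunk:
    """Hypothetical stand-in for a slice of a chromosome (not a BioCantor class)."""

    sequence: str
    chromosome_start: int  # 0-based, inclusive (assumption of this sketch)
    chromosome_end: int  # exclusive

    def __post_init__(self) -> None:
        if len(self.sequence) != self.chromosome_end - self.chromosome_start:
            raise ValueError("sequence length must match the chromosomal span")

    def to_chunk_relative(self, chromosome_position: int) -> int:
        """Convert a chromosome coordinate into an index into the chunk."""
        if not self.chromosome_start <= chromosome_position < self.chromosome_end:
            raise ValueError("position is outside this chunk")
        return chromosome_position - self.chromosome_start

    def to_chromosome_relative(self, chunk_position: int) -> int:
        """Convert an index into the chunk back into a chromosome coordinate."""
        if not 0 <= chunk_position < len(self.sequence):
            raise ValueError("position is outside this chunk")
        return chunk_position + self.chromosome_start


# Only ten bases are held in memory, yet chromosome coordinates still resolve.
chunk = SequenceChunk("ACGTACGTAC", chromosome_start=100, chromosome_end=110)
offset = chunk.to_chunk_relative(104)
print(offset, chunk.sequence[offset])
```

The point of the sketch is that only the slice is stored, while positions can still be expressed in either coordinate system.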

property is_empty: bool: Is this an empty collection?

property children_guids: set

Get all of the GUIDs for children.

Returns: A set of UUIDs

property hierarchical_children_guids: Dict[uuid.UUID, Set[uuid.UUID]]: Returns children GUIDs in their hierarchical structure.

property interval_guids_to_collections: Dict[uuid.UUID, Union[inscripta.biocantor.gene.gene.GeneInterval, inscripta.biocantor.gene.feature.FeatureIntervalCollection]]

For example, if this collection had a gene with GUID XYZ, and that gene had two transcripts with GUIDs ABC and 123, this would return:

{
  "ABC": GeneInterval(guid=XYZ),
  "123": GeneInterval(guid=XYZ)
}

Returns: A map of sub-feature GUIDs to their containing elements.
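The shape of this mapping, and of the hierarchical mapping returned by hierarchical_children_guids, can be sketched with stand-in classes. Interval, Container, and both helper functions below are hypothetical and exist only for illustration; they are not the BioCantor implementation.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Interval:
    """Hypothetical stand-in for a transcript or feature interval."""

    guid: uuid.UUID = field(default_factory=uuid.uuid4)


@dataclass
class Container:
    """Hypothetical stand-in for a gene or feature collection."""

    name: str
    intervals: List[Interval]
    guid: uuid.UUID = field(default_factory=uuid.uuid4)


def interval_guids_to_containers(containers: List[Container]) -> Dict[uuid.UUID, Container]:
    """Map each sub-feature GUID to the element that contains it."""
    return {
        interval.guid: container
        for container in containers
        for interval in container.intervals
    }


def hierarchical_guids(containers: List[Container]) -> Dict[uuid.UUID, Set[uuid.UUID]]:
    """Map each container GUID to the set of GUIDs of its intervals."""
    return {c.guid: {i.guid for i in c.intervals} for c in containers}


gene = Container("XYZ", [Interval(), Interval()])
lookup = interval_guids_to_containers([gene])
for interval in gene.intervals:
    # Both transcripts resolve to the same containing gene.
    print(interval.guid, "->", lookup[interval.guid].name)
```

The first helper answers "which gene or feature collection owns this interval", while the second goes in the opposite direction, from parent to children.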

property id: str: Returns the ID of this collection. Provides a shared API across genes/transcripts and features.

property name: str: Returns the name of this collection. Provides a shared API across genes/transcripts and features.

property children: List[Union[inscripta.biocantor.gene.gene.GeneInterval, inscripta.biocantor.gene.feature.FeatureIntervalCollection, inscripta.biocantor.gene.variants.VariantIntervalCollection]]: Sorted list of all children. Cached.

property non_variant_children: List[Union[inscripta.biocantor.gene.gene.GeneInterval, inscripta.biocantor.gene.feature.FeatureIntervalCollection, inscripta.biocantor.gene.variants.VariantIntervalCollection]]: Sorted list of all non-variant children. Cached.

property _child_interval_guid_map: Dict[uuid.UUID, Tuple[Union[inscripta.biocantor.gene.gene.GeneInterval, inscripta.biocantor.gene.feature.FeatureIntervalCollection, inscripta.biocantor.gene.variants.VariantIntervalCollection], Union[inscripta.biocantor.gene.transcript.TranscriptInterval, inscripta.biocantor.gene.feature.FeatureInterval, inscripta.biocantor.gene.variants.VariantInterval]]]: Construct a dictionary mapping grandchildren (interval GUIDs) to the children themselves.

_identifiers = ['name']

__repr__(): Return repr(self).

__len__()

__getstate__()

__setstate__(state)

_associate_intervals_with_variant_intervals()

If the constructor for this AnnotationCollection was passed one or more VariantIntervalCollections, then construct a mapping that associates them together. This produces new GeneInterval/FeatureIntervalCollection objects whose Parent are the alternative haplotype defined by the VariantIntervalCollection.

If a GeneInterval or FeatureIntervalCollection overlap multiple VariantIntervalCollections, then they will exist on sequence chunks that define the sub-fraction of the interval that the haplotype represents.

This is different from incorporate_variants() because it is applying the variants within this collection, enabling comparison of haplotypes rather than generating an entirely new AnnotationCollection centered on the alternative haplotypes. incorporate_variants() can only apply one VariantIntervalCollection at a time to an entire Interval, whereas this will instead group them by haplotype.

iter_children() → Iterator[Union[inscripta.biocantor.gene.gene.GeneInterval, inscripta.biocantor.gene.feature.FeatureIntervalCollection, inscripta.biocantor.gene.variants.VariantIntervalCollection]]: Iterate over all intervals in this collection, in sorted order.

iter_non_variant_children() → Iterator[Union[inscripta.biocantor.gene.gene.GeneInterval, inscripta.biocantor.gene.feature.FeatureIntervalCollection]]: Iterate over all non-variant intervals in this collection, in sorted order.

to_dict(chromosome_relative_coordinates: bool = True, export_parent: bool = False) → Dict[str, Any]

Convert to a dict usable by AnnotationCollectionModel.

Allows export of the parent object as well, which allows for sequence information to be serialized to disk.

It is not currently possible to export the parent in chunk-relative coordinates.

Raises: NotImplementedError – If chromosome_relative_coordinates is False and export_parent is True.

static from_dict(vals: Dict[str, Any], parent_or_seq_chunk_parent: Optional[inscripta.biocantor.parent.Parent] = None) → AnnotationCollection

Build an AnnotationCollection from a dictionary representation.

Will use the parent_or_seq_chunk_parent value encoded in the dict if it exists, but this will be overridden by anything passed to the parameter.

_subset_parent(start: int, end: int) → Optional[inscripta.biocantor.parent.Parent]

Subset the Parent of this collection to a new interval, building a chunk parent.

Parameters

start – Genome relative start position.
end – Genome relative end position.

Returns

A parent, or None if this location has no parent, or if start == end (empty interval).

_build_new_collection_from_query(genes_to_keep: List[inscripta.biocantor.gene.gene.GeneInterval], features_collections_to_keep: List[inscripta.biocantor.gene.feature.FeatureIntervalCollection], variants_to_keep: List[inscripta.biocantor.gene.variants.VariantIntervalCollection], start: Optional[int], end: Optional[int], completely_within: Optional[bool]) → AnnotationCollection: Convenience function that wraps functionality to build new collections

query_by_position(start: Optional[int] = None, end: Optional[int] = None, coding_only: Optional[bool] = False, completely_within: Optional[bool] = True, expand_location_to_children: Optional[bool] = False) → AnnotationCollection

Filter this annotation collection object based on positions, sequence, and boolean flags.

In all cases, the comparisons are made without considering strand. Intronic queries are still valid. In other words, a query from [10,20] would still return a transcript whose intervals were [0,9], [21, 30].

The resulting AnnotationCollection returned will have a ._location member whose bounds exactly match the query. If expand_location_to_children is True, then the child genes/feature collections will potentially extend beyond this range, in order to encapsulate their full length. The resulting gene/feature collections will potentially have a reduced set of transcripts/features, if those transcripts/features are outside the query range. However, if expand_location_to_children is False, then the child genes/feature collections will have location objects that represent the exact bounds of the query, which means that they may be sliced down. If the sliced down coordinates are entirely intronic for any isoform, then this isoform will have an EmptyLocation chunk_relative_location member, because it is no longer possible to have a relationship to the location object associated with this collection.

Here is an example (equals are exons, dashes are introns):

              10      15      20      25      30      35      40
Gene1: Tx1:     12============20
       Tx2:     12======16-17=20--22==25
Fc1:    F1:     12====15
        F2:     12======16-17=20--22==25
        F3:                                           35======40

Results:

start	end	completely_within	result
21	22	True	EmptyCollection
21	22	False	Tx1,Tx2,F2
28	35	False	EmptyCollection
28	36	False	F3
27	36	False	Tx1,F3
24	36	False	Tx1,Tx2,F2,F3

Parameters

start – Genome relative start position. If not set, will be 0.
end – Genome relative end position. If not set, will be unbounded.
coding_only – Filter for coding genes only?
completely_within – Strict query boundaries? If False, features that partially overlap will be included in the output. Bins optimization cannot be used, so these queries are slower.
expand_location_to_children – Should the underlying location objects be expanded so that no child gene/transcripts get sliced? If this is False, then the constituent objects may not actually represent their full lengths, although the original position information is retained.

Returns

AnnotationCollection that may be empty, and otherwise will contain new copies of every constituent member.

Raises

InvalidQueryError – If the start/end bounds are not valid. This could be because they exceed the bounds of the current interval. It could also happen if expand_location_to_children is True and the new expanded range would exceed the range of an associated sequence chunk.
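The difference between completely_within=True and completely_within=False can be sketched as a comparison of outer spans. This is an illustration under stated assumptions, not the library's implementation: query_spans is a hypothetical helper, children are reduced to plain (start, end) tuples, and coordinates are treated as half-open. The F1 and F3 spans are taken from the example above.

```python
from typing import Dict, List, Tuple

# Half-open (start, end) spans; an assumption of this sketch.
Span = Tuple[int, int]


def query_spans(
    children: Dict[str, Span],
    start: int,
    end: int,
    completely_within: bool = True,
) -> List[str]:
    """Return names of children whose outer span matches the query."""
    hits = []
    for name, (child_start, child_end) in children.items():
        if completely_within:
            # The whole child must sit inside the query range.
            keep = start <= child_start and child_end <= end
        else:
            # Any overlap between the query and the child span is enough.
            keep = child_start < end and start < child_end
        if keep:
            hits.append(name)
    return hits


features = {"F1": (12, 15), "F3": (35, 40)}
print(query_spans(features, 28, 35, completely_within=False))
print(query_spans(features, 28, 36, completely_within=False))

# Exons at [0, 9] and [21, 30] give an outer span of (0, 30), so a purely
# intronic query still matches when only overlap is required.
transcripts = {"Tx": (0, 30)}
print(query_spans(transcripts, 10, 20, completely_within=False))
```

Because only the outer span is compared, the intronic query still returns the transcript, which matches the behaviour described for query_by_position.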

_build_position_interval_tree(): Build a position tree of every child interval.

_optimized_query_by_position(start: int, end: int, completely_within: bool, coding_only: bool) → Tuple[List[inscripta.biocantor.gene.gene.GeneInterval], List[inscripta.biocantor.gene.feature.FeatureIntervalCollection], List[inscripta.biocantor.gene.variants.VariantIntervalCollection]]: Optimized implementation of position query. Used when cgranges is installed. The tree is cached if this is the first time it is being built.

_query_by_position(start: int, end: int, completely_within: bool, coding_only: bool) → Tuple[List[inscripta.biocantor.gene.gene.GeneInterval], List[inscripta.biocantor.gene.feature.FeatureIntervalCollection], List[inscripta.biocantor.gene.variants.VariantIntervalCollection]]: Non-optimized implementation of position query. Used when cgranges is not installed.

_return_collection_for_id_queries(genes_to_keep: List[inscripta.biocantor.gene.gene.GeneInterval], features_collections_to_keep: List[inscripta.biocantor.gene.feature.FeatureIntervalCollection], variant_collections_to_keep: List[inscripta.biocantor.gene.variants.VariantIntervalCollection]) → AnnotationCollection: Convenience function shared by all functions that query by identifiers or GUIDs.

query_by_guids(id_or_ids: Union[uuid.UUID, List[uuid.UUID]]) → AnnotationCollection

Filter this annotation collection object by a list of unique IDs.

Parameters: id_or_ids – List of GUIDs, or unique IDs. Can also be a single ID.

NOTE: If the children of this collection have GUID collisions, either across genes or features or within genes and features, this function will return all members with the matching GUID.

Returns: AnnotationCollection that may be empty.
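The two behaviours noted above, accepting either a single GUID or a list, and returning every member when GUIDs collide, can be sketched as follows. Child and filter_by_guids are hypothetical names used only for this illustration.

```python
import uuid
from typing import List, NamedTuple, Union


class Child(NamedTuple):
    """Hypothetical stand-in for a gene or feature collection."""

    label: str
    guid: uuid.UUID


def filter_by_guids(
    children: List[Child],
    id_or_ids: Union[uuid.UUID, List[uuid.UUID]],
) -> List[Child]:
    """Keep every child whose GUID is among the requested GUIDs."""
    # Normalise a single GUID into a one-element set.
    wanted = {id_or_ids} if isinstance(id_or_ids, uuid.UUID) else set(id_or_ids)
    # Colliding GUIDs are not deduplicated: all matching members are kept.
    return [child for child in children if child.guid in wanted]


shared = uuid.uuid4()
children = [
    Child("gene_a", shared),
    Child("feature_b", shared),  # deliberate GUID collision
    Child("gene_c", uuid.uuid4()),
]
print([child.label for child in filter_by_guids(children, shared)])
```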

query_by_interval_guids(id_or_ids: Union[uuid.UUID, List[uuid.UUID]]) → AnnotationCollection

Filter this annotation collection object by a list of unique interval IDs.

NOTE: If the children of this collection have GUID collisions, either across genes or features or within genes and features, this function will return all members with the matching GUID.

Parameters: id_or_ids – List of GUIDs, or unique IDs. Can also be a single ID.
Returns: AnnotationCollection that may be empty.

query_by_transcript_interval_guids(id_or_ids: Union[uuid.UUID, List[uuid.UUID]]) → AnnotationCollection

Filter this annotation collection object by a list of unique TranscriptInterval IDs.

This function wraps the query_by_guid function of child GeneInterval objects.

NOTE: If the children of this collection have GUID collisions, either across genes or features or within genes and features, this function will return all members with the matching GUID.

Parameters: id_or_ids – List of GUIDs, or unique IDs. Can also be a single ID.
Returns: AnnotationCollection that may be empty.

query_by_feature_interval_guids(id_or_ids: Union[uuid.UUID, List[uuid.UUID]]) → AnnotationCollection

Filter this annotation collection object by a list of unique FeatureInterval IDs.

This function wraps the query_by_guid function of child FeatureIntervalCollection objects.

NOTE: If the children of this collection have GUID collisions, either across genes or features or within genes and features, this function will return all members with the matching GUID.

Parameters: id_or_ids – List of GUIDs, or unique IDs. Can also be a single ID.
Returns: AnnotationCollection that may be empty.

query_by_feature_identifiers(id_or_ids: Union[str, List[str]]) → AnnotationCollection

Filter this annotation collection object by a list of identifiers, or a single identifier.

Identifiers are not necessarily unique; if your identifier matches more than one interval, all matching intervals will be returned. These ambiguous results will be adjacent in the resulting collection, but are not grouped or signified in any way.

This method is O(n_ids * m_identifiers).

Parameters: id_or_ids – List of identifiers, or a single identifier.
Returns: AnnotationCollection that may be empty.
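The stated O(n_ids * m_identifiers) cost corresponds to comparing each requested identifier against each identifier carried by the intervals. A minimal sketch of that nested comparison is shown below; filter_by_identifiers and the (label, identifiers) tuples are hypothetical and are not the library's data structures.

```python
from typing import List, Set, Tuple, Union

# Hypothetical stand-in: (label, identifiers attached to that interval).
Annotated = Tuple[str, Set[str]]


def filter_by_identifiers(
    children: List[Annotated],
    id_or_ids: Union[str, List[str]],
) -> List[str]:
    """Return labels of every child that carries any requested identifier."""
    # A bare string is treated as a single identifier, not as characters.
    requested = [id_or_ids] if isinstance(id_or_ids, str) else list(id_or_ids)
    hits = []
    for label, identifiers in children:
        # Identifiers need not be unique, so several children may match.
        if any(wanted in identifiers for wanted in requested):
            hits.append(label)
    return hits


children = [
    ("tx1", {"geneA", "ENST0001"}),
    ("tx2", {"geneA", "ENST0002"}),
    ("feat1", {"promoter1"}),
]
print(filter_by_identifiers(children, "geneA"))
print(filter_by_identifiers(children, ["promoter1", "ENST0002"]))
```

The first call shows an ambiguous identifier returning two intervals, in the order in which they were stored.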

get_children_by_type(child_type: str) → Union[List[inscripta.biocantor.gene.gene.GeneInterval], List[inscripta.biocantor.gene.feature.FeatureIntervalCollection], List[inscripta.biocantor.gene.variants.VariantIntervalCollection]]

_unsorted_gff_iter(chromosome_relative_coordinates: bool = True, raise_on_reserved_attributes: bool = True) → Iterator[inscripta.biocantor.io.gff3.rows.GFFRow]

Produces iterable of GFFRow for this annotation collection and its children.

The positions of the genes will be ordered by genomic position, but may not be globally position sorted because it could be the case that children gene/features will overlap. This private function exists to provide an iterator to sort in the main to_gff() function.

Parameters

chromosome_relative_coordinates – Output GFF in chromosome-relative coordinates? Will raise an exception if there is not a sequence_chunk ancestor type.
raise_on_reserved_attributes – If True, then GFF3 reserved attributes such as ID and Name present in the qualifiers will lead to an exception and not a warning.

Yields

GFFRow

to_gff(chromosome_relative_coordinates: bool = True, raise_on_reserved_attributes: Optional[bool] = True) → Iterator[inscripta.biocantor.io.gff3.rows.GFFRow]

Produces iterable of GFFRow for this annotation collection and its children.

Parameters

chromosome_relative_coordinates – Output GFF in chromosome-relative coordinates? Will raise an exception if there is not a sequence_chunk ancestor type.
raise_on_reserved_attributes – If True, then GFF3 reserved attributes such as ID and Name present in the qualifiers will lead to an exception and not a warning.

Yields

GFFRow

Raises

NoSuchAncestorException – If chromosome_relative_coordinates is False but there is no sequence_chunk ancestor type.
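The relationship between the unsorted iterator and the sorted output can be sketched as follows. Row is a hypothetical stand-in for GFFRow, and sorting on (start, end) is an assumption of this sketch; the key the library actually sorts on is not stated here.

```python
from itertools import chain
from typing import Iterator, List, NamedTuple


class Row(NamedTuple):
    """Hypothetical stand-in for a GFF row; not the library's GFFRow."""

    start: int
    end: int
    feature_id: str


def unsorted_rows(children: List[List[Row]]) -> Iterator[Row]:
    """Emit each child's rows in turn; overlapping children interleave badly."""
    return chain.from_iterable(children)


def sorted_rows(children: List[List[Row]]) -> Iterator[Row]:
    """Collect every row and emit them in global positional order."""
    return iter(sorted(unsorted_rows(children), key=lambda row: (row.start, row.end)))


gene1 = [Row(10, 50, "gene1"), Row(40, 50, "tx1")]
gene2 = [Row(20, 30, "gene2"), Row(20, 30, "tx2")]

print([row.feature_id for row in unsorted_rows([gene1, gene2])])
print([row.feature_id for row in sorted_rows([gene1, gene2])])
```

Although gene1 starts before gene2, its transcript starts after gene2, so simply concatenating the per-child rows is not globally sorted. Python's sort is stable, so rows with equal keys keep their parent-before-child order.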

incorporate_variants(variants: Union[inscripta.biocantor.gene.variants.VariantInterval, inscripta.biocantor.gene.variants.VariantIntervalCollection]) → AnnotationCollection: Incorporate all of the variant(s) for an input VariantInterval or VariantIntervalCollection, producing a new AnnotationCollection with those changes incorporated on every child.
