biocantor.gene.collections
Collection classes. The data model is structured into two general categories, transcripts and features. These are wrapped into genes and feature collections, respectively, which are in turn wrapped into one AnnotationIntervalCollection. AnnotationIntervalCollections are the topmost class and hold all possible annotations for a given interval, as well as the place to find their sequence information.
It is useful to think of transcripts/genes as transcriptional units, which means these data structures model transcribed sequence. In contrast, features are non-transcribed, and are meant to model things such as promoters or transcription factor binding sites.
Each object is capable of exporting itself to BED and GFF3.
Module Contents
Classes
AnnotationCollection: a container for GeneInterval, FeatureIntervalCollection, and VariantIntervalCollection objects.
Attributes
- biocantor.gene.collections.HAS_CGRANGES = False
- class biocantor.gene.collections.AnnotationCollection(feature_collections: Optional[List[inscripta.biocantor.gene.feature.FeatureIntervalCollection]] = None, genes: Optional[List[inscripta.biocantor.gene.gene.GeneInterval]] = None, variant_collections: Optional[List[inscripta.biocantor.gene.variants.VariantIntervalCollection]] = None, name: Optional[str] = None, id: Optional[str] = None, sequence_name: Optional[str] = None, sequence_guid: Optional[uuid.UUID] = None, sequence_path: Optional[str] = None, qualifiers: Optional[Dict[Hashable, inscripta.biocantor.gene.interval.QualifierValue]] = None, start: Optional[int] = None, end: Optional[int] = None, completely_within: Optional[bool] = None, parent_or_seq_chunk_parent: Optional[inscripta.biocantor.parent.Parent] = None)
Bases: inscripta.biocantor.gene.interval.AbstractFeatureIntervalCollection
An AnnotationCollection is a container for GeneInterval, FeatureIntervalCollection, and VariantIntervalCollection objects. It encapsulates all possible annotations for a given interval on a specific source.
If no start/end points are provided, the interval for this collection is the min/max of the data it contains. The interval for an AnnotationCollection is always on the plus strand.
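The bounds rules for this class (the paragraph above, together with the Object Bounds note later in this docstring) can be read as a precedence order. The sketch below is only an illustration of that order in plain Python; `resolve_bounds` and its arguments are hypothetical names, not part of BioCantor, and returning `None` for an empty collection without a parent is an assumption.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end), half-open, always on the plus strand


def resolve_bounds(
    start: Optional[int],
    end: Optional[int],
    parent_span: Optional[Span],
    child_spans: List[Span],
) -> Optional[Span]:
    """Hypothetical helper mirroring the documented precedence of bounds."""
    # start and end must be given together or not at all.
    if (start is None) != (end is None):
        raise ValueError("start and end must both be provided, or neither")
    if start is not None and end is not None:
        return (start, end)
    # Next, fall back to the bounds of the parent, when one is available.
    if parent_span is not None:
        return parent_span
    # Finally, use the min/max of the contained data.
    if child_spans:
        return (min(s for s, _ in child_spans), max(e for _, e in child_spans))
    # Assumption: an empty collection with no parent has nothing to infer from.
    return None


print(resolve_bounds(None, None, None, [(12, 28), (35, 40)]))  # (12, 40)
```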
An AnnotationCollection can be empty (feature_collections, genes, and variant_collections can be None).
The object provided to parent_or_seq_chunk_parent must have a chromosome sequence-type in its ancestry, and there must be associated sequence. This object should look like the object produced by the function biocantor.io.parser.seq_to_parent(), and represent a full chromosome sequence. This will be automatically instantiated if you use the constructor method in biocantor.io.parser.ParsedAnnotationRecord, which will import the sequence from a BioPython SeqRecord object.
If you are using file parsers and the associated file types have sequence information (GenBank or GFF3+FASTA), then the sequences will also be automatically included when the ParsedAnnotationRecord is returned.
Object Bounds: If start is provided, end must be provided, and vice versa. If neither is provided, and a parent_or_seq_chunk_parent is provided, then the bounds of this collection will be inferred from that object, if possible. If not possible, the bounds of the collection will be the bounds of the associated child objects.
It is possible to instantiate an AnnotationCollection with a sequence_chunk as well. A sequence_chunk is a slice of a chromosomal sequence that allows operations without loading an entire chromosome into memory. The easiest way to produce the parental relationship required for this object to operate on a sequence_chunk is to instantiate via the constructor biocantor.io.parser.seq_chunk_to_parent(), to which you provide the slice of sequence, the chromosomal start/end positions of that slice, and a sequence name. The returned Parent object will be suitable for passing to this class.
- property hierarchical_children_guids: Dict[uuid.UUID, Set[uuid.UUID]]
Returns children GUIDs in their hierarchical structure.
- property interval_guids_to_collections: Dict[uuid.UUID, Union[inscripta.biocantor.gene.gene.GeneInterval, inscripta.biocantor.gene.feature.FeatureIntervalCollection]]
For example, if this collection had a gene with two transcripts with GUID ABC and 123, and the gene had GUID XYZ, this would return:
{ "ABC": GeneInterval(guid=XYZ), "123": GeneInterval(guid=XYZ) }
- Returns
A map of sub-feature GUIDs to their containing elements.
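The shape of this mapping can be sketched with toy stand-ins. `ToyGene` and `ToyTranscript` below are hypothetical classes used only for illustration; they are not the BioCantor `GeneInterval` or `TranscriptInterval` types, and the real property may be built differently.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyTranscript:
    guid: uuid.UUID


@dataclass
class ToyGene:
    guid: uuid.UUID
    transcripts: List[ToyTranscript] = field(default_factory=list)


def interval_guids_to_collections(genes: List[ToyGene]) -> Dict[uuid.UUID, ToyGene]:
    """Map each transcript GUID to the gene that contains it."""
    return {tx.guid: gene for gene in genes for tx in gene.transcripts}


# One gene ("XYZ") holding two transcripts ("ABC" and "123").
tx_abc, tx_123 = ToyTranscript(uuid.uuid4()), ToyTranscript(uuid.uuid4())
gene_xyz = ToyGene(uuid.uuid4(), [tx_abc, tx_123])
mapping = interval_guids_to_collections([gene_xyz])

# Both transcript GUIDs resolve to the same containing gene object.
print(mapping[tx_abc.guid] is gene_xyz, mapping[tx_123.guid] is gene_xyz)
```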
- property id: str
Returns the ID of this collection. Provides a shared API across genes/transcripts and features.
- property name: str
Returns the name of this collection. Provides a shared API across genes/transcripts and features.
- property children: List[Union[inscripta.biocantor.gene.gene.GeneInterval, inscripta.biocantor.gene.feature.FeatureIntervalCollection, inscripta.biocantor.gene.variants.VariantIntervalCollection]]
Sorted list of all children. Cached.
- property non_variant_children: List[Union[inscripta.biocantor.gene.gene.GeneInterval, inscripta.biocantor.gene.feature.FeatureIntervalCollection, inscripta.biocantor.gene.variants.VariantIntervalCollection]]
Sorted list of all non-variant children. Cached.
- property _child_interval_guid_map: Dict[uuid.UUID, Tuple[Union[inscripta.biocantor.gene.gene.GeneInterval, inscripta.biocantor.gene.feature.FeatureIntervalCollection, inscripta.biocantor.gene.variants.VariantIntervalCollection], Union[inscripta.biocantor.gene.transcript.TranscriptInterval, inscripta.biocantor.gene.feature.FeatureInterval, inscripta.biocantor.gene.variants.VariantInterval]]]
Construct a dictionary mapping grandchildren (interval GUIDs) to the children themselves.
- _identifiers = ['name']
- __repr__()
Return repr(self).
- __len__()
- __getstate__()
- __setstate__(state)
- _associate_intervals_with_variant_intervals()
If the constructor for this AnnotationCollection was passed one or more VariantIntervalCollections, then construct a mapping that associates them together. This produces new GeneInterval/FeatureIntervalCollection objects whose Parent is the alternative haplotype defined by the VariantIntervalCollection.
If a GeneInterval or FeatureIntervalCollection overlaps multiple VariantIntervalCollections, then it will exist on sequence chunks that define the sub-fraction of the interval that the haplotype represents.
This is different from incorporate_variants() because it applies the variants within this collection, enabling comparison of haplotypes rather than generating an entirely new AnnotationCollection centered on the alternative haplotypes. incorporate_variants() can only apply one VariantIntervalCollection at a time to an entire Interval, whereas this will instead group them by haplotype.
- iter_children() Iterator[Union[inscripta.biocantor.gene.gene.GeneInterval, inscripta.biocantor.gene.feature.FeatureIntervalCollection, inscripta.biocantor.gene.variants.VariantIntervalCollection]]
Iterate over all intervals in this collection, in sorted order.
- iter_non_variant_children() Iterator[Union[inscripta.biocantor.gene.gene.GeneInterval, inscripta.biocantor.gene.feature.FeatureIntervalCollection]]
Iterate over all non-variant intervals in this collection, in sorted order.
- to_dict(chromosome_relative_coordinates: bool = True, export_parent: bool = False) Dict[str, Any]
Convert to a dict usable by AnnotationCollectionModel.
Allows export of the parent object as well, which allows for sequence information to be serialized to disk.
It is not currently possible to export the parent in chunk-relative coordinates.
- Raises
NotImplementedError – If chromosome_relative_coordinates is False and export_parent is True.
- static from_dict(vals: Dict[str, Any], parent_or_seq_chunk_parent: Optional[inscripta.biocantor.parent.Parent] = None) AnnotationCollection
Build an AnnotationCollection from a dictionary representation.
Will use the parent_or_seq_chunk_parent value encoded in the dict if it exists, but this will be overridden by anything passed to the parameter.
- _subset_parent(start: int, end: int) Optional[inscripta.biocantor.parent.Parent]
Subset the Parent of this collection to a new interval, building a chunk parent.
- Parameters
start – Genome relative start position.
end – Genome relative end position.
- Returns
A parent, or None if this location has no parent, or if start == end (empty interval).
- _build_new_collection_from_query(genes_to_keep: List[inscripta.biocantor.gene.gene.GeneInterval], features_collections_to_keep: List[inscripta.biocantor.gene.feature.FeatureIntervalCollection], variants_to_keep: List[inscripta.biocantor.gene.variants.VariantIntervalCollection], start: Optional[int], end: Optional[int], completely_within: Optional[bool]) AnnotationCollection
Convenience function that wraps functionality to build new collections.
- query_by_position(start: Optional[int] = None, end: Optional[int] = None, coding_only: Optional[bool] = False, completely_within: Optional[bool] = True, expand_location_to_children: Optional[bool] = False) AnnotationCollection
Filter this annotation collection object based on positions, sequence, and boolean flags.
In all cases, the comparisons are made without considering strand. Intronic queries are still valid. In other words, a query from [10,20] would still return a transcript whose intervals were [0,9], [21, 30].
The resulting AnnotationCollection will have a ._location member whose bounds exactly match the query. If expand_location_to_children is True, then the child genes/feature collections will potentially extend beyond this range, in order to encapsulate their full length. The resulting gene/feature collections will potentially have a reduced set of transcripts/features, if those transcripts/features are outside the query range. However, if expand_location_to_children is False, then the child genes/feature collections will have location objects that represent the exact bounds of the query, which means that they may be sliced down. If the sliced down coordinates are entirely intronic for any isoform, then this isoform will have an EmptyLocation chunk_relative_location member, because it is no longer possible to have a relationship to the location object associated with this collection.
Here is an example (equals are exons, dashes are introns):
        10      15      20      25      30      35      40
Gene1:
  Tx1:     12============20
  Tx2:     12======16-17=20--22==25
Fc1:
  F1:      12====15
  F2:      12======16-17=20--22==25
  F3:                                           35======40
Results:
start   end   completely_within   result
21      22    True                EmptyCollection
21      22    False               Tx1,Tx2,F2
28      35    False               EmptyCollection
28      36    False               F3
27      36    False               Tx1,F3
24      36    False               Tx1,Tx2,F2,F3
- Parameters
start – Genome relative start position. If not set, will be 0.
end – Genome relative end position. If not set, will be unbounded.
coding_only – Filter for coding genes only?
completely_within – Strict query boundaries? If False, features that partially overlap will be included in the output. Bins optimization cannot be used, so these queries are slower.
expand_location_to_children – Should the underlying location objects be expanded so that no child gene/transcripts get sliced? If this is False, then the constituent objects may not actually represent their full lengths, although the original position information is retained.
- Returns
AnnotationCollection that may be empty, and otherwise will contain new copies of every constituent member.
- Raises
InvalidQueryError – If the start/end bounds are not valid. This could be because they exceed the bounds of the current interval. It could also happen if expand_location_to_children is True and the new expanded range would exceed the range of an associated sequence chunk.
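The difference between a strict and an overlapping position query can be sketched on a few of the spans from the example above (Tx2, F1, F2, F3). This is a plain-Python illustration, not the BioCantor implementation: `query_spans` is a hypothetical function, and it assumes half-open coordinates and compares only the outer span of each record, which is why a purely intronic query still matches.

```python
from typing import Dict, List, Tuple

# Outer (start, end) spans of a few records from the example, half-open.
SPANS: Dict[str, Tuple[int, int]] = {
    "Tx2": (12, 25),
    "F1": (12, 15),
    "F2": (12, 25),
    "F3": (35, 40),
}


def query_spans(start: int, end: int, completely_within: bool) -> List[str]:
    """Return names whose span matches the query; introns and strand are ignored."""
    hits = []
    for name, (s, e) in SPANS.items():
        if completely_within:
            # Strict: the record must lie entirely inside the query.
            matched = start <= s and e <= end
        else:
            # Relaxed: any overlap between record span and query counts.
            matched = s < end and e > start
        if matched:
            hits.append(name)
    return hits


print(query_spans(21, 22, True))   # []
print(query_spans(21, 22, False))  # ['Tx2', 'F2']
print(query_spans(28, 36, False))  # ['F3']
```

The query 21-22 falls inside an intron of Tx2 and F2, yet both are returned in the relaxed mode because only the outer bounds are compared.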
- _build_position_interval_tree()
Build a position tree of every child interval.
- _optimized_query_by_position(start: int, end: int, completely_within: bool, coding_only: bool) Tuple[List[inscripta.biocantor.gene.gene.GeneInterval], List[inscripta.biocantor.gene.feature.FeatureIntervalCollection], List[inscripta.biocantor.gene.variants.VariantIntervalCollection]]
Optimized implementation of position query. Used when cgranges is installed. The tree is built the first time it is needed and cached for later queries.
- _query_by_position(start: int, end: int, completely_within: bool, coding_only: bool) Tuple[List[inscripta.biocantor.gene.gene.GeneInterval], List[inscripta.biocantor.gene.feature.FeatureIntervalCollection], List[inscripta.biocantor.gene.variants.VariantIntervalCollection]]
Non-optimized implementation of position query. Used when cgranges is not installed.
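The split between an optimized and a non-optimized query path, selected by a module-level flag such as HAS_CGRANGES, follows a common optional-dependency pattern. The sketch below illustrates that pattern only: `ToyCollection` is hypothetical, cgranges is probed but never called, and a sorted list of start positions stands in for the interval tree, so none of this reflects the cgranges API or the BioCantor internals.

```python
from bisect import bisect_left
from typing import List, Optional, Tuple

try:
    import cgranges  # noqa: F401  (only probed here, never called)
    HAS_CGRANGES = True
except ImportError:
    HAS_CGRANGES = False

Span = Tuple[int, int]  # half-open (start, end)


class ToyCollection:
    def __init__(self, spans: List[Span]):
        self.spans = sorted(spans)
        self._index: Optional[List[int]] = None  # built lazily, then reused

    def _build_index(self) -> List[int]:
        # Stand-in for an interval tree: the sorted list of start positions.
        if self._index is None:
            self._index = [s for s, _ in self.spans]
        return self._index

    def _indexed_query(self, start: int, end: int) -> List[Span]:
        starts = self._build_index()
        # Only spans that start before the query end can overlap it.
        candidates = self.spans[: bisect_left(starts, end)]
        return [(s, e) for s, e in candidates if e > start]

    def _linear_query(self, start: int, end: int) -> List[Span]:
        # Fallback: test every span for overlap.
        return [(s, e) for s, e in self.spans if s < end and e > start]

    def query(self, start: int, end: int, use_index: bool = HAS_CGRANGES) -> List[Span]:
        if use_index:
            return self._indexed_query(start, end)
        return self._linear_query(start, end)


collection = ToyCollection([(12, 28), (35, 40), (12, 15)])
# Both paths must agree; only their cost differs.
print(collection.query(14, 36, use_index=True) == collection.query(14, 36, use_index=False))
```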
- _return_collection_for_id_queries(genes_to_keep: List[inscripta.biocantor.gene.gene.GeneInterval], features_collections_to_keep: List[inscripta.biocantor.gene.feature.FeatureIntervalCollection], variant_collections_to_keep: List[inscripta.biocantor.gene.variants.VariantIntervalCollection]) AnnotationCollection
Convenience function shared by all functions that query by identifiers or GUIDs.
- query_by_guids(id_or_ids: Union[uuid.UUID, List[uuid.UUID]]) AnnotationCollection
Filter this annotation collection object by a list of unique IDs.
- Parameters
id_or_ids – List of GUIDs, or unique IDs. Can also be a single ID.
NOTE: If the children of this collection have GUID collisions, either across genes or features or within genes and features, this function will return all members with the matching GUID.
- Returns
AnnotationCollection that may be empty.
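Two behaviours described for this method, accepting either one GUID or a list and returning every member when GUIDs collide, can be sketched as follows. `select_by_guids` and the dict-based children are hypothetical stand-ins used for illustration, not BioCantor objects.

```python
import uuid
from typing import Dict, List, Union

GuidOrGuids = Union[uuid.UUID, List[uuid.UUID]]


def select_by_guids(children: List[Dict], id_or_ids: GuidOrGuids) -> List[Dict]:
    """Return every child whose 'guid' is among the requested GUIDs."""
    # Accept a single GUID as shorthand for a one-element list.
    wanted = {id_or_ids} if isinstance(id_or_ids, uuid.UUID) else set(id_or_ids)
    # No de-duplication: colliding GUIDs all come back, in original order.
    return [child for child in children if child["guid"] in wanted]


shared = uuid.uuid4()
children = [
    {"name": "gene1", "guid": shared},
    {"name": "feature1", "guid": shared},  # deliberate GUID collision
    {"name": "gene2", "guid": uuid.uuid4()},
]
print([c["name"] for c in select_by_guids(children, shared)])  # ['gene1', 'feature1']
```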
- query_by_interval_guids(id_or_ids: Union[uuid.UUID, List[uuid.UUID]]) AnnotationCollection
Filter this annotation collection object by a list of unique interval IDs.
NOTE: If the children of this collection have GUID collisions, either across genes or features or within genes and features, this function will return all members with the matching GUID.
- Parameters
id_or_ids – List of GUIDs, or unique IDs. Can also be a single ID.
- Returns
AnnotationCollection that may be empty.
- query_by_transcript_interval_guids(id_or_ids: Union[uuid.UUID, List[uuid.UUID]]) AnnotationCollection
Filter this annotation collection object by a list of unique TranscriptInterval IDs.
This function wraps the query_by_guid function of child GeneInterval objects.
NOTE: If the children of this collection have GUID collisions, either across genes or features or within genes and features, this function will return all members with the matching GUID.
- Parameters
id_or_ids – List of GUIDs, or unique IDs. Can also be a single ID.
- Returns
AnnotationCollection that may be empty.
- query_by_feature_interval_guids(id_or_ids: Union[uuid.UUID, List[uuid.UUID]]) AnnotationCollection
Filter this annotation collection object by a list of unique FeatureInterval IDs.
This function wraps the query_by_guid function of child FeatureIntervalCollection objects.
NOTE: If the children of this collection have GUID collisions, either across genes or features or within genes and features, this function will return all members with the matching GUID.
- Parameters
id_or_ids – List of GUIDs, or unique IDs. Can also be a single ID.
- Returns
AnnotationCollection that may be empty.
- query_by_feature_identifiers(id_or_ids: Union[str, List[str]]) AnnotationCollection
Filter this annotation collection object by a list of identifiers, or a single identifier.
Identifiers are not necessarily unique; if your identifier matches more than one interval, all matching intervals will be returned. These ambiguous results will be adjacent in the resulting collection, but are not grouped or signified in any way.
This method is O(n_ids * m_identifiers).
- Parameters
id_or_ids – List of identifiers, or a single identifier.
- Returns
AnnotationCollection that may be empty.
- get_children_by_type(child_type: str) Union[List[inscripta.biocantor.gene.gene.GeneInterval], List[inscripta.biocantor.gene.feature.FeatureIntervalCollection], List[inscripta.biocantor.gene.variants.VariantIntervalCollection]]
- _unsorted_gff_iter(chromosome_relative_coordinates: bool = True, raise_on_reserved_attributes: bool = True) Iterator[inscripta.biocantor.io.gff3.rows.GFFRow]
Produces iterable of GFFRow for this annotation collection and its children.
The positions of the genes will be ordered by genomic position, but may not be globally position sorted because child genes/features may overlap. This private function exists to provide an iterator to sort in the main to_gff() function.
- Parameters
chromosome_relative_coordinates – Output GFF in chromosome-relative coordinates? Will raise an exception if there is not a sequence_chunk ancestor type.
raise_on_reserved_attributes – If True, then GFF3 reserved attributes such as ID and Name present in the qualifiers will lead to an exception and not a warning.
- Yields
- to_gff(chromosome_relative_coordinates: bool = True, raise_on_reserved_attributes: Optional[bool] = True) Iterator[inscripta.biocantor.io.gff3.rows.GFFRow]
Produces iterable of GFFRow for this annotation collection and its children.
- Parameters
chromosome_relative_coordinates – Output GFF in chromosome-relative coordinates? Will raise an exception if there is not a sequence_chunk ancestor type.
raise_on_reserved_attributes – If True, then GFF3 reserved attributes such as ID and Name present in the qualifiers will lead to an exception and not a warning.
- Yields
- Raises
NoSuchAncestorException – If chromosome_relative_coordinates is False but there is no sequence_chunk ancestor type.
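The relationship between the unsorted private iterator and the sorted public one can be sketched with toy rows. `ToyRow`, `unsorted_rows`, and `sorted_rows` are hypothetical names, not the BioCantor GFFRow type or methods, and sorting on start position alone with a stable sort is an assumption made for this illustration.

```python
from typing import Iterator, List, NamedTuple


class ToyRow(NamedTuple):
    start: int
    end: int
    feature_type: str


def unsorted_rows(children: List[List[ToyRow]]) -> Iterator[ToyRow]:
    """Yield each child's rows in turn; overlapping children interleave positions."""
    for rows in children:
        yield from rows


def sorted_rows(children: List[List[ToyRow]]) -> Iterator[ToyRow]:
    """Globally position-sort the flattened rows before yielding them."""
    # sorted() is stable, so rows sharing a start keep their emitted order
    # (a parent row emitted before its sub-rows stays ahead of them).
    yield from sorted(unsorted_rows(children), key=lambda row: row.start)


gene_rows = [ToyRow(12, 28, "gene"), ToyRow(12, 20, "exon"), ToyRow(22, 28, "exon")]
feature_rows = [ToyRow(15, 18, "feature")]  # overlaps the gene

for row in sorted_rows([gene_rows, feature_rows]):
    print(row.start, row.end, row.feature_type)
```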
- incorporate_variants(variants: Union[inscripta.biocantor.gene.variants.VariantInterval, inscripta.biocantor.gene.variants.VariantIntervalCollection]) AnnotationCollection
Incorporate all of the variant(s) for an input VariantInterval or VariantIntervalCollection, producing a new AnnotationCollection with those changes incorporated on every child.