biocantor.gene.interval
This module contains abstract base classes for interval types and interval collection types.
Module Contents
Classes
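- IntervalType: This enum differentiates the three main types of Intervals: Features, Transcripts and Variants.
- AbstractInterval: This is a wrapper over Location that adds metadata and coordinate transformation functions.
- AbstractFeatureInterval: This is a wrapper over AbstractInterval that adds functions shared across TranscriptInterval, FeatureInterval, and VariantInterval.
- AbstractFeatureIntervalCollection: Abstract class for holding groups of feature intervals. The two implementations of this class model Genes or non-transcribed FeatureCollections.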
Attributes
- biocantor.gene.interval.QualifierValue
- class biocantor.gene.interval.IntervalType
This enum differentiates the three main types of Intervals: Features, Transcripts and Variants.
- FEATURE = feature
- TRANSCRIPT = transcript
- VARIANT = variant
- class biocantor.gene.interval.AbstractInterval
Bases: abc.ABC
This is a wrapper over Location that adds metadata and coordinate transformation quality-of-life (QOL) functions.
All operations on coordinates are assumed to operate in chromosome-relative coordinates unless otherwise specified. All constructors use chromosome-relative coordinates as well. If you want to operate on coordinate systems that are a subset of a chromosome, you must instantiate a Parent object that provides the coordinate relationship.
A function to help you build these relationships can be found at biocantor.io.parser.seq_chunk_to_parent().
- abstract property id: str
Returns the ID of this feature. Provides a shared API across genes/transcripts and features.
- abstract property name: str
Returns the name of this feature. Provides a shared API across genes/transcripts and features.
- property chromosome_location: inscripta.biocantor.location.Location
Returns the Location of this Interval in chromosome coordinates.
If the coordinate system is unknown, this will return the same coordinate system as chunk_relative_location, that is, the true underlying _location member.
This Location object will always have the full span of the Interval in chromosome coordinates, even if this feature exists in chunk-relative coordinates. As a result, if this Interval was built on chunk-relative coordinates, the sequence information will not be present.
- property _chunk_relative_bounded_chromosome_location: inscripta.biocantor.location.Location
Returns the Location of this Interval in chromosome coordinates.
This function is different from chromosome_location in that it will return a Location bounded by the chunk-relative location of this Interval, if it exists.
This accessor is private because using it may lead to unexpected behavior. However, it is necessary for things like slicing CDSFrames in a chunk-relative CDSInterval.
- property chunk_relative_location: inscripta.biocantor.location.Location
Returns the Location of this Interval in chunk-relative coordinates.
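To make the relationship between the two coordinate systems concrete, here is a minimal pure-Python sketch that does not use BioCantor itself. It assumes 0-based, half-open coordinates and a chunk that is a contiguous plus-strand window of the chromosome. The function names and the `chunk_start`/`chunk_end` parameters are hypothetical.

```python
def chromosome_to_chunk(pos: int, chunk_start: int, chunk_end: int) -> int:
    """Convert a chromosome position to a chunk-relative position."""
    if not chunk_start <= pos < chunk_end:
        raise ValueError(f"position {pos} is outside chunk [{chunk_start}, {chunk_end})")
    return pos - chunk_start


def chunk_to_chromosome(pos: int, chunk_start: int, chunk_end: int) -> int:
    """Convert a chunk-relative position back to a chromosome position."""
    if not 0 <= pos < chunk_end - chunk_start:
        raise ValueError(f"position {pos} is outside the chunk")
    return pos + chunk_start


# A chunk covering chromosome positions [100, 200) (assumed values).
chunk_pos = chromosome_to_chunk(150, 100, 200)
chrom_pos = chunk_to_chromosome(chunk_pos, 100, 200)
```

In this toy model the chunk offset plays the role that the Parent object plays in the library: without it, a chunk-relative number cannot be mapped back to the chromosome.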
- property blocks: List[inscripta.biocantor.location.Location]
Returns the blocks of this location
- property num_chunk_relative_blocks: int
Returns the number of chunk-relative blocks of this location. Could be less than num_blocks if this interval is a slice of the full-length interval.
- property chunk_relative_blocks: List[inscripta.biocantor.location.Location]
Returns the chunk relative blocks of this location
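The reason a sliced interval can have fewer chunk-relative blocks than chromosome blocks can be sketched as follows. This is an illustration in plain Python rather than BioCantor code: blocks are assumed to be 0-based, half-open `(start, end)` tuples, and `clip_blocks_to_chunk` is a hypothetical helper.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # (start, end), 0-based half-open


def clip_blocks_to_chunk(blocks: List[Block], chunk_start: int, chunk_end: int) -> List[Block]:
    """Return the parts of each block inside the chunk, in chunk-relative coordinates."""
    clipped = []
    for start, end in blocks:
        new_start = max(start, chunk_start)
        new_end = min(end, chunk_end)
        if new_start < new_end:  # keep only blocks that overlap the chunk
            clipped.append((new_start - chunk_start, new_end - chunk_start))
    return clipped


chromosome_blocks = [(10, 20), (30, 40), (50, 60)]
# The chunk [15, 45) truncates the first block and excludes the third entirely.
chunk_blocks = clip_blocks_to_chunk(chromosome_blocks, 15, 45)
```

Here three chromosome blocks become two chunk-relative blocks, which mirrors the difference between the two block counts described above.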
- property strand: inscripta.biocantor.location.Strand
Returns strand of location.
- property chunk_relative_strand: inscripta.biocantor.location.Strand
Returns strand of location.
- property identifiers: Set[Union[str, uuid.UUID]]
Returns the identifiers for this FeatureInterval, if they exist
- property identifiers_dict: Dict[str, Union[str, uuid.UUID]]
Returns the identifiers and their keys for this FeatureInterval, if they exist
- _location :inscripta.biocantor.location.Location
- _identifiers :List[Union[str, uuid.UUID]]
- qualifiers :Dict[Hashable, Set[str]]
- guid :uuid.UUID
- sequence_guid :Optional[uuid.UUID]
- sequence_name :Optional[str]
- bin :int
- start :int
- end :int
- _parent_or_seq_chunk_parent :Optional[inscripta.biocantor.parent.Parent]
- __len__()
- __eq__(other)
Return self==value.
- __hash__()
Produces a hash, which is the GUID.
- abstract to_dict(chromosome_relative_coordinates: bool = True) Dict[str, Any]
Dictionary to build Model representation. Defaults to always exporting in original chromosome relative coordinates, but this can be disabled to export in sequence-chunk relative coordinates.
If you have exported to sequence-chunk relative coordinates, and then try to re-instantiate, the subsequent object will now consider these new coordinates to be the original chromosome coordinates, and the relationship back to the true coordinates will be lost.
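The loss of the coordinate relationship on re-instantiation can be illustrated with a toy model. `ToyInterval` below is a hypothetical stand-in written for this example, not a BioCantor class, and it models the chunk as a single integer offset.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class ToyInterval:
    """Hypothetical stand-in for an interval; not a BioCantor class."""

    start: int  # chromosome-relative start
    end: int  # chromosome-relative end
    chunk_start: int = 0  # chromosome offset of the sequence chunk

    def to_dict(self, chromosome_relative_coordinates: bool = True) -> Dict[str, Any]:
        offset = 0 if chromosome_relative_coordinates else self.chunk_start
        # The chunk offset itself is not exported in this toy model.
        return {"start": self.start - offset, "end": self.end - offset}

    @staticmethod
    def from_dict(vals: Dict[str, Any]) -> "ToyInterval":
        # Whatever numbers arrive are taken to be chromosome coordinates.
        return ToyInterval(vals["start"], vals["end"])


original = ToyInterval(start=150, end=160, chunk_start=100)
safe = ToyInterval.from_dict(original.to_dict())
lossy = ToyInterval.from_dict(original.to_dict(chromosome_relative_coordinates=False))
```

After the lossy round trip the interval reports positions 50 to 60 as if they were chromosome coordinates, and nothing in the exported dictionary allows the original 150 to 160 to be recovered.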
- _parent_to_dict(chromosome_relative_coordinates: bool = True) Optional[Dict[str, Any]]
Converts the _parent_or_seq_chunk_parent member of this Interval to a JSON-serializable representation.
- Raises
NotImplementedError – If chromosome_relative_coordinates is False and self._parent_or_seq_chunk_parent is not None.
NoSuchAncestorException – If self._parent_or_seq_chunk_parent is chunk-relative but lacks sequence information.
- abstract static from_dict(vals: Dict[str, Any], parent_or_seq_chunk_parent: Optional[inscripta.biocantor.parent.Parent] = None) AbstractInterval
Build an interval from a dictionary representation
- abstract to_gff(chromosome_relative_coordinates: bool = True) Iterator[inscripta.biocantor.io.gff3.rows.GFFRow]
Writes a GFF format list of lists for this feature.
- Parameters
chromosome_relative_coordinates – Output GFF in chromosome-relative coordinates? Will raise an exception if there is not a sequence_chunk ancestor type.
- Yields
GFFRow objects.
- Raises
NoSuchAncestorException – If chromosome_relative_coordinates is False but there is no sequence_chunk ancestor type.
- static initialize_location(starts: List[int], ends: List[int], strand: inscripta.biocantor.location.Strand, parent_or_seq_chunk_parent: Optional[inscripta.biocantor.parent.Parent] = None) inscripta.biocantor.location.Location
Initialize the Location object for this interval.
- Parameters
starts – Start positions relative to the chromosome.
ends – End positions relative to the chromosome.
strand – Strand relative to the chromosome.
parent_or_seq_chunk_parent – An optional parent, either as a full chromosome or as a sequence chunk.
- static liftover_location_to_seq_chunk_parent(location: inscripta.biocantor.location.Location, parent_or_seq_chunk_parent: Optional[inscripta.biocantor.parent.Parent] = None) inscripta.biocantor.location.Location
BioCantor supports constructing any of the interval classes from a subset of the chromosome. In order to set up the coordinate relationship and successfully pull down sequence, this function lifts the coordinates from the original annotation object on to this new coordinate system.
A parent that represents a chunk of a chromosome looks like this:
    parent_1_15 = Parent(
        sequence=Sequence(
            genome2[1:15],
            Alphabet.NT_EXTENDED_GAPPED,
            type=SequenceType.SEQUENCE_CHUNK,
            parent=Parent(
                location=SingleInterval(
                    1,
                    15,
                    Strand.PLUS,
                    parent=Parent(id="genome_1_15", sequence_type=SequenceType.CHROMOSOME),
                )
            ),
        )
    )
Alternatively, if the sequence is coming straight from a file, it will be a Parent with a Sequence attached:
    parent = Parent(id="chr1", sequence=Sequence(genome, Alphabet.NT_STRICT, type=SequenceType.CHROMOSOME))
This convenience function detects which kind of parent is given, and sets up the appropriate location.
This function also handles the case where the location argument is already chunk-relative. If this is the case, the location object is first lifted back to its chromosomal coordinates, then lifted back down on to this new chunk.
- Parameters
location – A location object, likely produced by initialize_location(). Could also be the location of an existing AbstractInterval subclass, such as when the method liftover_interval_to_parent_or_seq_chunk_parent() is called.
parent_or_seq_chunk_parent – An optional parent, either as a full chromosome or as a sequence chunk. If not provided, this function is a no-op.
- Returns
A Location object.
- Raises
ValidationException – If parent_or_seq_chunk_parent has no ancestor of type chromosome or sequence_chunk.
NullSequenceException – If parent_or_seq_chunk_parent has no usable sequence ancestor.
NoSuchAncestorException – If location has a sequence_chunk ancestor, but no chromosome ancestor. Such a relationship is required to lift from one chunk to a new chunk.
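The two-step lift described above (chunk to chromosome, then chromosome to the new chunk) can be sketched in plain Python. This is not the library's implementation: intervals are assumed to be 0-based, half-open tuples, `lift_between_chunks` is a hypothetical name, and returning `None` when nothing overlaps the new chunk is a choice made for this example only.

```python
from typing import Optional, Tuple

Interval = Tuple[int, int]  # (start, end), 0-based half-open


def lift_between_chunks(
    location: Interval, old_chunk_start: int, new_chunk: Interval
) -> Optional[Interval]:
    """Lift a chunk-relative interval up to the chromosome, then down to a new chunk."""
    # Step 1: back to chromosome coordinates.
    chrom_start = location[0] + old_chunk_start
    chrom_end = location[1] + old_chunk_start
    # Step 2: keep only the part covered by the new chunk.
    start = max(chrom_start, new_chunk[0])
    end = min(chrom_end, new_chunk[1])
    if start >= end:
        return None  # no overlap with the new chunk (assumed behavior for this sketch)
    # Step 3: down to the new chunk's coordinates.
    return (start - new_chunk[0], end - new_chunk[0])


# [5, 15) on a chunk starting at chromosome position 100 is [105, 115) on the chromosome.
lifted = lift_between_chunks((5, 15), old_chunk_start=100, new_chunk=(110, 200))
```

Step 1 is only possible because the old chunk's offset is known, which corresponds to the requirement that a chunk-relative location also has a chromosome ancestor.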
- liftover_to_parent_or_seq_chunk_parent(parent_or_seq_chunk_parent: inscripta.biocantor.parent.Parent) AbstractInterval
This function returns a copy of this interval lifted over to a new coordinate system. If this interval is already in chunk-relative coordinates, it is first lifted back up to chromosome coordinates before the liftover occurs. This means that there must be a Parent somewhere in the ancestry with type “chromosome”, and that Parent must match the supplied parent except for location information.
Validation has to happen here in addition to in liftover_location_to_seq_chunk_parent(), because at this point the parent of this current interval is still known. Once the to_dict() operation is performed, this information is lost, and the new parent is applied under the assumption that it is valid.
- _liftover_this_location_to_seq_chunk_parent(seq_chunk_parent: inscripta.biocantor.parent.Parent)
Lift this interval to a new subset.
This could happen as the result of a subsetting operation.
This will introduce chunk-relative coordinates to this interval, or reduce the size of existing chunk-relative coordinates.
This function calls the parent static method AbstractInterval.liftover_location_to_seq_chunk_parent(), but differs in two key ways:
1. It acts on an instantiated subclass of this abstract class, modifying the location.
2. It handles the case where a subclass is already a slice, by first lifting up to genomic coordinates.
For these reasons, and particularly #1, this is a private method that is intended to be used during construction of a subclass. Modifying the locations in place is generally a bad idea after initial construction of an interval class.
- _reset_parent(parent: Optional[inscripta.biocantor.parent.Parent] = None) None
Convenience function that wraps location.reset_parent().
NOTE: This function modifies this interval in-place, and does not return a new copy. This is different behavior than the base function, and is this way because this function is called recursively from collection objects.
NOTE: Using this function presents the risk that you will change the sequence of this interval. There are no checks that the new parent provides the same sequence basis as the original parent.
- has_ancestor_of_type(ancestor_type: Union[str, inscripta.biocantor.parent.SequenceType]) bool
Convenience function that wraps location.has_ancestor_of_type().
- first_ancestor_of_type(ancestor_type: Union[str, inscripta.biocantor.parent.SequenceType]) inscripta.biocantor.parent.Parent
Convenience function that returns the first ancestor of this type.
- lift_over_to_first_ancestor_of_type(sequence_type: Optional[Union[str, inscripta.biocantor.parent.SequenceType]] = SequenceType.CHROMOSOME) inscripta.biocantor.location.Location
Lifts the location member to another coordinate system. Is a no-op if there is no parent assigned.
- Returns
The lifted Location.
- _import_qualifiers_from_list(qualifiers: Optional[Dict[Hashable, List[Hashable]]] = None)
Import input qualifiers to sets and store.
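As an illustration of turning list-valued qualifiers into the `Dict[Hashable, Set[str]]` shape of the `qualifiers` attribute, here is a small standalone sketch. It is not the library's code: `import_qualifiers` is a hypothetical function, and converting every value with `str()` is an assumption based only on the type annotations shown on this page.

```python
from typing import Dict, Hashable, List, Optional, Set


def import_qualifiers(
    qualifiers: Optional[Dict[Hashable, List[Hashable]]] = None
) -> Dict[Hashable, Set[str]]:
    """Convert list-valued qualifiers to sets of strings, dropping duplicates."""
    if not qualifiers:
        return {}
    # Assumption: values are stringified so that the result matches Set[str].
    return {key: {str(value) for value in values} for key, values in qualifiers.items()}


stored = import_qualifiers({"gene": ["abc", "abc"], "codon_start": [1]})
```

Storing sets means that repeated values collapse to one entry and that the order of values in the input lists is not preserved.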
- class biocantor.gene.interval.AbstractFeatureInterval
Bases: AbstractInterval, abc.ABC
This is a wrapper over AbstractInterval that adds functions shared across TranscriptInterval, FeatureInterval, and VariantInterval.
- property chromosome_location: inscripta.biocantor.location.Location
Returns the Location of this Interval in chromosome coordinates.
If the coordinate system is unknown, this will return the same coordinate system as chunk_relative_location, that is, the true underlying _location member.
This Location object will always have the full span of the Interval in chromosome coordinates, even if this feature exists in chunk-relative coordinates. As a result, if this Interval was built on chunk-relative coordinates, the sequence information will not be present.
- property _chunk_relative_bounded_chromosome_location: inscripta.biocantor.location.Location
Returns the Location of this Interval in chromosome coordinates.
This function is different from chromosome_location in that it will return a Location bounded by the chunk-relative location of this Interval, if it exists.
- property chromosome_span: inscripta.biocantor.location.Location
Returns the full span of this Interval in chromosome coordinates.
- property chromosome_gaps_location: inscripta.biocantor.location.Location
Returns the Location of the gaps of this Interval in chromosome coordinates. This is analogous to returning the intron coordinates.
- property chunk_relative_span: inscripta.biocantor.location.Location
Returns the full span of this Interval in chunk-relative coordinates.
- property chunk_relative_gaps_location: inscripta.biocantor.location.Location
Returns the Location of the gaps of this Interval in chunk-relative coordinates. This is analogous to returning the intron coordinates.
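The relationship between blocks, span, and gaps can be sketched without BioCantor. The helpers below are hypothetical and assume sorted, non-overlapping, 0-based half-open `(start, end)` blocks.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # (start, end), 0-based half-open, sorted by start


def span(blocks: List[Block]) -> Block:
    """Single interval from the first block's start to the last block's end."""
    return (blocks[0][0], blocks[-1][1])


def gaps(blocks: List[Block]) -> List[Block]:
    """Intervals between consecutive blocks (the intron-like regions)."""
    return [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(blocks, blocks[1:])
        if prev_end < next_start  # skip directly adjacent blocks
    ]


exon_blocks = [(10, 20), (30, 40), (50, 60)]
intron_like = gaps(exon_blocks)
full_span = span(exon_blocks)
```

A single-block interval has no gaps, so the gaps list is empty in that case.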
- property blocks: Iterable[inscripta.biocantor.location.location_impl.SingleInterval]
Wrapper for blocks function that reports blocks in chromosome coordinates
- property relative_blocks: Iterable[inscripta.biocantor.location.location_impl.SingleInterval]
Wrapper for blocks function that reports blocks in chunk-relative coordinates
- _genomic_ends :List[int]
- _genomic_starts :List[int]
- _strand :inscripta.biocantor.location.Strand
- _is_primary_feature :Optional[bool]
- __len__()
- abstract export_qualifiers(parent_qualifiers: Optional[Dict[Hashable, Set[Hashable]]] = None) Dict[Hashable, Set[str]]
Exports qualifiers for GFF3 or GenBank export. This merges top level keys with the arbitrary values
- abstract to_bed12(score: Optional[int] = 0, rgb: Optional[inscripta.biocantor.io.bed.RGB] = RGB(0, 0, 0), name: Optional[str] = 'feature_name', chromosome_relative_coordinates: bool = True) inscripta.biocantor.io.bed.BED12
Write a BED12 format representation of this AbstractFeatureInterval.
The score and rgb optional arguments are specific to the BED12 format.
- Parameters
score – An optional score associated with an interval. UCSC requires an integer between 0 and 1000.
rgb – An optional RGB string for visualization on a browser. This allows you to have multiple colors on a single UCSC track.
name – Which identifier in this record to use as ‘name’. Defaults to feature_name. If the supplied string is not a valid attribute, it is used directly.
chromosome_relative_coordinates – Output BED12 in chromosome-relative coordinates? Will raise an exception if there is not a sequence_chunk ancestor type.
- Returns
A BED12 object.
- Raises
NoSuchAncestorException – If chromosome_relative_coordinates is False but there is no sequence_chunk ancestor type.
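For readers unfamiliar with the BED12 layout, the following standalone sketch shows how the twelve columns can be derived from a list of blocks. It does not use the library's BED12 or RGB classes. `blocks_to_bed12` is a hypothetical helper, blocks are assumed to be sorted 0-based half-open tuples, and setting thickStart and thickEnd to the start position is the usual convention for a feature with no coding region.

```python
from typing import List, Tuple


def blocks_to_bed12(
    chrom: str,
    blocks: List[Tuple[int, int]],
    strand: str,
    name: str = "feature",
    score: int = 0,
    rgb: Tuple[int, int, int] = (0, 0, 0),
) -> str:
    """Build one tab-separated BED12 line from sorted (start, end) blocks."""
    if not 0 <= score <= 1000:
        raise ValueError("score must be between 0 and 1000")
    chrom_start = blocks[0][0]
    chrom_end = blocks[-1][1]
    block_sizes = ",".join(str(end - start) for start, end in blocks)
    # Block starts are relative to chromStart, not to the chromosome.
    block_starts = ",".join(str(start - chrom_start) for start, _ in blocks)
    fields = [
        chrom, chrom_start, chrom_end, name, score, strand,
        chrom_start, chrom_start,  # thickStart, thickEnd: no thick region
        ",".join(str(channel) for channel in rgb),
        len(blocks), block_sizes, block_starts,
    ]
    return "\t".join(str(field) for field in fields)


bed_line = blocks_to_bed12("chr1", [(10, 20), (30, 40)], "+", name="feat")
```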
- abstract to_gff(parent: Optional[str] = None, parent_qualifiers: Optional[Dict] = None, chromosome_relative_coordinates: bool = True) Iterator[inscripta.biocantor.io.gff3.rows.GFFRow]
Writes a GFF format list of lists for this feature.
The additional qualifiers are used when writing a hierarchical relationship back to files. GFF files are easier to work with if the children features have the qualifiers of their parents.
- Parameters
parent – ID of the Parent of this transcript.
parent_qualifiers – Directly pull qualifiers in from this dictionary.
chromosome_relative_coordinates – Output GFF in chromosome-relative coordinates? Will raise an exception if there is not a sequence_chunk ancestor type.
- Yields
GFFRow objects.
- Raises
NoSuchAncestorException – If chromosome_relative_coordinates is False but there is no sequence_chunk ancestor type.
- sequence_pos_to_feature(pos: int) int
Converts sequence position to relative position along this feature.
- sequence_interval_to_feature(chr_start: int, chr_end: int, chr_strand: inscripta.biocantor.location.Strand) inscripta.biocantor.location.Location
Converts a contiguous interval on the sequence to a relative location within this feature.
- feature_pos_to_sequence(pos: int) int
Converts a relative position along this feature to sequence coordinate.
- feature_interval_to_sequence(rel_start: int, rel_end: int, rel_strand: inscripta.biocantor.location.Strand) inscripta.biocantor.location.Location
Converts a contiguous interval relative to this feature to a spliced location on the sequence.
- chunk_relative_pos_to_feature(pos: int) int
Converts chunk-relative sequence position to relative position along this feature.
- chunk_relative_interval_to_feature(chr_start: int, chr_end: int, chr_strand: inscripta.biocantor.location.Strand) inscripta.biocantor.location.Location
Converts a contiguous chunk-relative interval on the sequence to a relative location within this feature.
- feature_pos_to_chunk_relative(pos: int) int
Converts a relative position along this feature to chunk-relative sequence coordinate.
- feature_interval_to_chunk_relative(rel_start: int, rel_end: int, rel_strand: inscripta.biocantor.location.Strand) inscripta.biocantor.location.Location
Converts a contiguous interval relative to this feature to a chunk-relative spliced location on the sequence.
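The position conversions in this group can be illustrated with a small pure-Python sketch of the underlying arithmetic. It is not BioCantor code: blocks are assumed to be sorted 0-based half-open tuples, the strand is a plain `"+"` or `"-"` string, and raising `ValueError` for positions in a gap is a choice made for this example. The two function names are reused purely as labels.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # (start, end), 0-based half-open, sorted by start


def sequence_pos_to_feature(pos: int, blocks: List[Block], strand: str) -> int:
    """Map a sequence position to a position along the spliced feature."""
    offset = 0  # number of feature bases before the current block
    for start, end in blocks:
        if start <= pos < end:
            plus_relative = offset + (pos - start)
            length = sum(e - s for s, e in blocks)
            # On the minus strand the feature is read from its highest coordinate.
            return plus_relative if strand == "+" else length - 1 - plus_relative
        offset += end - start
    raise ValueError(f"position {pos} falls in a gap or outside the feature")


def feature_pos_to_sequence(pos: int, blocks: List[Block], strand: str) -> int:
    """Map a position along the spliced feature back to the sequence."""
    length = sum(e - s for s, e in blocks)
    if not 0 <= pos < length:
        raise ValueError(f"position {pos} is outside the feature")
    plus_relative = pos if strand == "+" else length - 1 - pos
    for start, end in blocks:
        if plus_relative < end - start:
            return start + plus_relative
        plus_relative -= end - start
    raise AssertionError("unreachable: position was validated above")


exons = [(10, 20), (30, 40)]
first_base_of_second_exon = sequence_pos_to_feature(30, exons, "+")
```

The chunk-relative variants listed above follow the same arithmetic with blocks expressed in chunk-relative coordinates.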
- get_spliced_sequence() inscripta.biocantor.sequence.Sequence
Returns the feature’s spliced, stranded sequence.
- get_reference_sequence() inscripta.biocantor.sequence.Sequence
Returns the feature’s unspliced, positive strand genomic sequence.
- get_genomic_sequence() inscripta.biocantor.sequence.Sequence
Returns the feature’s unspliced, stranded (transcription orientation) genomic sequence.
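The difference between the three sequence accessors can be shown on a toy chromosome. This sketch uses plain strings instead of the library's Sequence class, and the helper names are hypothetical. Blocks are assumed to be sorted 0-based half-open tuples and the alphabet is limited to A, C, G and T.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # (start, end), 0-based half-open, sorted by start
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def reference_sequence(chrom: str, blocks: List[Block]) -> str:
    """Unspliced, plus-strand sequence over the full span."""
    return chrom[blocks[0][0]:blocks[-1][1]]


def genomic_sequence(chrom: str, blocks: List[Block], strand: str) -> str:
    """Unspliced sequence in the orientation of the feature's strand."""
    ref = reference_sequence(chrom, blocks)
    return ref if strand == "+" else reverse_complement(ref)


def spliced_sequence(chrom: str, blocks: List[Block], strand: str) -> str:
    """Concatenated block sequence in the orientation of the feature's strand."""
    joined = "".join(chrom[start:end] for start, end in blocks)
    return joined if strand == "+" else reverse_complement(joined)


toy_chrom = "AACCGGTTAACC"
toy_blocks = [(2, 4), (6, 8)]
minus_spliced = spliced_sequence(toy_chrom, toy_blocks, "-")
```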
- class biocantor.gene.interval.AbstractFeatureIntervalCollection
Bases: AbstractInterval, abc.ABC
Abstract class for holding groups of feature intervals. The two implementations of this class model Genes or non-transcribed FeatureCollections.
These are always on the same sequence, but can be on different strands.
- property strand: inscripta.biocantor.location.Strand
Returns strand of location.
- __iter__()
Iterate over all children of this collection
- abstract iter_children() Iterable[AbstractInterval]
Iterate over the children
- abstract children_guids() Set[uuid.UUID]
Get all of the GUIDs for children.
Returns: A set of UUIDs
- abstract query_by_guids(id_or_ids: Union[uuid.UUID, List[uuid.UUID]]) AbstractFeatureIntervalCollection
Filter this collection object by a list of unique IDs.
- Parameters
id_or_ids – List of GUIDs, or unique IDs. Can also be a single ID.
- _reset_parent(parent: Optional[inscripta.biocantor.parent.Parent] = None) None
Reset parent of this collection, and all of its children.
THIS FUNCTION IS ONLY INTENDED TO BE USED DURING INITIALIZATION OF A NEW INTERVAL OBJECT. USING THIS FUNCTION AFTER THAT POINT RUNS THE RISK OF THE PARENT OF THE OBJECT NOT BEING REFLECTED BY METHODS ON THIS FUNCTION THAT USE RESULT CACHING!
NOTE: This function modifies this collection in-place, and does not return a new copy. This is different behavior than the base function, and is this way because all of the children of this collection are also recursively modified.
NOTE: Using this function presents the risk that you will change the sequence of this interval. There are no checks that the new parent provides the same sequence basis as the original parent.
This overrides reset_parent(). The original function will remain applied on the leaf nodes.
- _initialize_location(start: int, end: int, parent_or_seq_chunk_parent: Optional[inscripta.biocantor.parent.Parent] = None)
Initialize the location for this collection. Assumes that the start/end coordinates are genome-relative, and builds a chunk-relative location for this.
- Parameters
start – genome-relative start
end – genome-relative end
parent_or_seq_chunk_parent – A parent that could be null, genome relative, or sequence chunk relative.
- get_reference_sequence() inscripta.biocantor.sequence.Sequence
Returns the plus strand sequence for this interval
- static _find_primary_feature(intervals: Union[List[inscripta.biocantor.gene.transcript.TranscriptInterval], List[inscripta.biocantor.gene.feature.FeatureInterval]]) Optional[Union[inscripta.biocantor.gene.transcript.TranscriptInterval, inscripta.biocantor.gene.feature.FeatureInterval]]
Used in object construction to find the primary feature. Shared between GeneInterval and FeatureIntervalCollection.
If not specified by the data source, primary features are determined by:
1. If the feature is coding, then its CDS size.
2. The (spliced) feature size.
3. The position of the feature within the ordered list of features.
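One way to read the three criteria above is as a lexicographic ranking, sketched below. This is an interpretation and not the library's implementation: it assumes that larger CDS and feature sizes are preferred and that the earliest feature in the list wins ties. `ToyFeature` and `find_primary_feature` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyFeature:
    """Hypothetical stand-in for a transcript or feature interval."""

    name: str
    spliced_size: int
    cds_size: int = 0  # 0 means non-coding


def find_primary_feature(features: List[ToyFeature]) -> Optional[ToyFeature]:
    """Pick the feature with the largest CDS, then the largest spliced size."""
    if not features:
        return None
    # max() returns the first maximal element, so ties fall back to list order.
    return max(features, key=lambda f: (f.cds_size, f.spliced_size))


candidates = [
    ToyFeature("a", spliced_size=900, cds_size=300),
    ToyFeature("b", spliced_size=1200, cds_size=300),
    ToyFeature("c", spliced_size=1200, cds_size=300),
    ToyFeature("d", spliced_size=5000),  # non-coding
]
primary = find_primary_feature(candidates)
```

In this example "b" is chosen: it ties with "a" and "c" on CDS size, beats "a" on spliced size, and precedes "c" in the list.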
- _liftover_this_location_to_seq_chunk_parent(parent_or_seq_chunk_parent: inscripta.biocantor.parent.Parent)
Lift over this collection and all of its children