biocantor.gene.interval

This module contains abstract base classes for interval types and interval collection types.

Module Contents

Classes

IntervalType

This enum differentiates the three main types of Intervals -- Features, Transcripts and Variants

AbstractInterval

This is a wrapper over Location that adds metadata and coordinate transformation quality-of-life functions.

AbstractFeatureInterval

This is a wrapper over AbstractInterval that adds functions shared across TranscriptInterval, FeatureInterval, and VariantInterval.

AbstractFeatureIntervalCollection

Abstract class for holding groups of feature intervals. The two implementations of this class model Genes or non-transcribed FeatureCollections.

Attributes

QualifierValue

biocantor.gene.interval.QualifierValue
class biocantor.gene.interval.IntervalType

Bases: str, enum.Enum

This enum differentiates the three main types of Intervals – Features, Transcripts and Variants

FEATURE = feature
TRANSCRIPT = transcript
VARIANT = variant
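
Because the enum mixes in str, its members compare equal to their string values. The sketch below re-declares the enum locally for illustration (it is not imported from BioCantor) and assumes the member values are the lowercase strings listed above.

```python
from enum import Enum


class IntervalType(str, Enum):
    # Local re-declaration for illustration; mirrors the members listed above.
    FEATURE = "feature"
    TRANSCRIPT = "transcript"
    VARIANT = "variant"


# Members can be looked up by value and compare equal to plain strings.
looked_up = IntervalType("transcript")
is_plain_string_equal = IntervalType.FEATURE == "feature"
```
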
class biocantor.gene.interval.AbstractInterval

Bases: abc.ABC

This is a wrapper over Location that adds metadata and coordinate transformation quality-of-life functions.

All operations on coordinates are assumed to operate in chromosome-relative coordinates unless otherwise specified. All constructors use chromosome-relative coordinates as well. If you want to operate on coordinate systems that are a subset of a chromosome, you must instantiate a Parent object that provides the coordinate relationship.

A function to help you build these relationships can be found at biocantor.io.parser.seq_chunk_to_parent().
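
The relationship between chromosome and chunk coordinates amounts to an offset by the chunk start. The following is a plain-Python sketch of that arithmetic only; the helper names chromosome_to_chunk and chunk_to_chromosome are hypothetical, it does not use BioCantor's Parent machinery, and it assumes 0-based, half-open coordinates.

```python
def chromosome_to_chunk(pos: int, chunk_start: int, chunk_end: int) -> int:
    """Convert a chromosome position to a position on the chunk [chunk_start, chunk_end)."""
    if not chunk_start <= pos < chunk_end:
        raise ValueError(f"position {pos} is outside the chunk")
    return pos - chunk_start


def chunk_to_chromosome(pos: int, chunk_start: int, chunk_end: int) -> int:
    """Convert a chunk position back to its chromosome position."""
    if not 0 <= pos < chunk_end - chunk_start:
        raise ValueError(f"position {pos} is outside the chunk")
    return pos + chunk_start


# Chromosome position 5 on a chunk spanning [1, 15) is chunk position 4.
chunk_pos = chromosome_to_chunk(5, 1, 15)
round_trip = chunk_to_chromosome(chunk_pos, 1, 15)
```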

property is_chunk_relative: bool

Does this Interval object exist on a sequence chunk?

property chunk_relative_size: int
property has_sequence: bool

Returns true if this Interval has an associated sequence of any type

abstract property id: str

Returns the ID of this feature. Provides a shared API across genes/transcripts and features.

abstract property name: str

Returns the name of this feature. Provides a shared API across genes/transcripts and features.

property chunk_relative_start: int

Returns chunk relative start position.

property chunk_relative_end: int

Returns chunk relative end position.

property chromosome_location: inscripta.biocantor.location.Location

Returns the Location of this in chromosome coordinates.

If the coordinate system is unknown, this will return the same coordinate system as chunk_relative_location, that is the true underlying _location member.

This Location object will always have the full span of the Interval in chromosome coordinates, even if this feature exists in chunk relative coordinates. As a result of this, if this Interval was built on chunk relative coordinates, the sequence information will not be present.

property _chunk_relative_bounded_chromosome_location: inscripta.biocantor.location.Location

Returns the Location of this in chromosome coordinates.

This function is different from chromosome_location in that it will return a Location bounded by the chunk relative location of this Interval, if it exists.

This accessor is private because using it may lead to weird behavior. However, it is necessary for things like slicing CDSFrames in a chunk relative CDSInterval.

property chunk_relative_location: inscripta.biocantor.location.Location

Returns the Location of this in chunk relative coordinates

property blocks: List[inscripta.biocantor.location.Location]

Returns the blocks of this location

property num_blocks: int

Returns the number of blocks of this location.

property num_chunk_relative_blocks: int

Returns the number of chunk-relative blocks of this location. Could be less than num_blocks if this interval is a slice of the full length interval.
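
To see why the chunk-relative block count can be smaller, consider clipping blocks to a chunk. This sketch models blocks as (start, end) tuples in 0-based, half-open coordinates; clip_blocks_to_chunk is a hypothetical helper, not a BioCantor function.

```python
from typing import List, Tuple

Block = Tuple[int, int]


def clip_blocks_to_chunk(blocks: List[Block], chunk_start: int, chunk_end: int) -> List[Block]:
    """Keep only the parts of each block inside the chunk, in chunk-relative coordinates."""
    clipped = []
    for start, end in blocks:
        new_start, new_end = max(start, chunk_start), min(end, chunk_end)
        if new_start < new_end:  # drop blocks that fall entirely outside the chunk
            clipped.append((new_start - chunk_start, new_end - chunk_start))
    return clipped


blocks = [(2, 5), (8, 12), (20, 25)]
chunk_blocks = clip_blocks_to_chunk(blocks, chunk_start=4, chunk_end=15)
```

Here the third block lies outside the chunk, so the chunk-relative view has two blocks while the full interval has three.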

property chunk_relative_blocks: List[inscripta.biocantor.location.Location]

Returns the chunk relative blocks of this location

property strand: inscripta.biocantor.location.Strand

Returns strand of location.

property chunk_relative_strand: inscripta.biocantor.location.Strand

Returns the strand of the chunk-relative location.

property identifiers: Set[Union[str, uuid.UUID]]

Returns the identifiers for this FeatureInterval, if they exist

property identifiers_dict: Dict[str, Union[str, uuid.UUID]]

Returns the identifiers and their keys for this FeatureInterval, if they exist

_location :inscripta.biocantor.location.Location
_identifiers :List[Union[str, uuid.UUID]]
qualifiers :Dict[Hashable, Set[str]]
guid :uuid.UUID
sequence_guid :Optional[uuid.UUID]
sequence_name :Optional[str]
bin :int
start :int
end :int
_parent_or_seq_chunk_parent :Optional[inscripta.biocantor.parent.Parent]
__len__()
__eq__(other)

Return self==value.

__hash__()

Produces a hash, which is the GUID.

abstract to_dict(chromosome_relative_coordinates: bool = True) Dict[str, Any]

Dictionary to build Model representation. Defaults to always exporting in original chromosome relative coordinates, but this can be disabled to export in sequence-chunk relative coordinates.

If you have exported to sequence-chunk relative coordinates, and then try to re-instantiate, the subsequent object will now consider these new coordinates to be the original chromosome coordinates, and the relationship back to the true coordinates will be lost.
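
The loss of the coordinate relationship can be illustrated with a toy export. In this sketch, export_interval is a hypothetical stand-in for to_dict() that models only start and end; it is not the real serialization.

```python
from typing import Dict


def export_interval(start: int, end: int, chunk_start: int,
                    chromosome_relative_coordinates: bool = True) -> Dict[str, int]:
    """Hypothetical stand-in for to_dict(); only start/end are modeled."""
    if chromosome_relative_coordinates:
        return {"start": start, "end": end}
    # Chunk-relative export: the chunk offset itself is not stored in the output.
    return {"start": start - chunk_start, "end": end - chunk_start}


# An interval at chromosome positions [5, 9) that lives on a chunk starting at position 1.
chromosome_export = export_interval(5, 9, chunk_start=1)
chunk_export = export_interval(5, 9, chunk_start=1, chromosome_relative_coordinates=False)
```

Nothing in chunk_export records that the chunk began at position 1, so an object rebuilt from it would treat 4 and 8 as chromosome coordinates.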

_parent_to_dict(chromosome_relative_coordinates: bool = True) Optional[Dict[str, Any]]

Converts the _parent_or_seq_chunk_parent member of this Interval to a JSON-serializable representation.

Raises
  • NotImplementedError – If chromosome_relative_coordinates is False and self._parent_or_seq_chunk_parent is not None.

  • NoSuchAncestorException – If self._parent_or_seq_chunk_parent is chunk-relative but lacks sequence information.

abstract static from_dict(vals: Dict[str, Any], parent_or_seq_chunk_parent: Optional[inscripta.biocantor.parent.Parent] = None) AbstractInterval

Build an interval from a dictionary representation

abstract to_gff(chromosome_relative_coordinates: bool = True) Iterator[inscripta.biocantor.io.gff3.rows.GFFRow]

Writes a GFF format list of lists for this feature.

Parameters

chromosome_relative_coordinates – Output GFF in chromosome-relative coordinates? If False, an exception is raised when there is no sequence_chunk ancestor type.

Yields

GFFRow

Raises
  • NoSuchAncestorException – If chromosome_relative_coordinates is False but there is no sequence_chunk ancestor type.

static initialize_location(starts: List[int], ends: List[int], strand: inscripta.biocantor.location.Strand, parent_or_seq_chunk_parent: Optional[inscripta.biocantor.parent.Parent] = None) inscripta.biocantor.location.Location

Initialize the Location object for this interval.

Parameters
  • starts – Start positions relative to the chromosome.

  • ends – End positions relative to the chromosome.

  • strand – Strand relative to the chromosome.

  • parent_or_seq_chunk_parent – An optional parent, either as a full chromosome or as a sequence chunk.

static liftover_location_to_seq_chunk_parent(location: inscripta.biocantor.location.Location, parent_or_seq_chunk_parent: Optional[inscripta.biocantor.parent.Parent] = None) inscripta.biocantor.location.Location

BioCantor supports constructing any of the interval classes from a subset of the chromosome. In order to be able to set up the coordinate relationship and successfully pull down sequence, this function lifts the coordinates from the original annotation object on to this new coordinate system.

A parent that represents a sequence chunk carries both the chunk sequence and the location of that chunk on the chromosome, for example:

parent_1_15 = Parent(
    sequence=Sequence(
        genome2[1:15],
        Alphabet.NT_EXTENDED_GAPPED,
        type=SequenceType.SEQUENCE_CHUNK,
        parent=Parent(
            location=SingleInterval(1, 15, Strand.PLUS,
                                   parent=Parent(id="genome_1_15", sequence_type=SequenceType.CHROMOSOME))
        ),
    )
)

Alternatively, if the sequence is coming straight from a file, it will be a Parent with a Sequence attached:

parent = Parent(id="chr1", sequence=Sequence(genome, Alphabet.NT_STRICT, type=SequenceType.CHROMOSOME))

This convenience function detects which kind of parent is given, and sets up the appropriate location.

This function also handles the case where the location argument is already chunk-relative. If this is the case, the location object is first lifted back to its chromosomal coordinates, then lifted back down on to this new chunk.

Parameters
  • location – A location object, likely produced by initialize_location(). Could also be the location of an existing AbstractInterval subclass, such as when the method liftover_interval_to_parent_or_seq_chunk_parent() is called.

  • parent_or_seq_chunk_parent – An optional parent, either as a full chromosome or as a sequence chunk. If not provided, this function is a no-op.

Returns

A Location object.

Raises
  • ValidationException – If parent_or_seq_chunk_parent has no ancestor of type chromosome or sequence_chunk.

  • NullSequenceException – If parent_or_seq_chunk_parent has no usable sequence ancestor.

  • NoSuchAncestorException – If location has a sequence_chunk ancestor, but no chromosome ancestor. Such a relationship is required to lift from one chunk to a new chunk.
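
The chunk-to-chunk case described above (lift up to the chromosome, then down to the new chunk) can be sketched with plain offsets. The helper lift_between_chunks is hypothetical, uses 0-based, half-open (start, end) tuples, and raises when the interval does not fit the new chunk, which is an assumption made for this sketch rather than documented BioCantor behavior.

```python
from typing import Tuple


def lift_between_chunks(interval: Tuple[int, int], old_chunk_start: int,
                        new_chunk: Tuple[int, int]) -> Tuple[int, int]:
    """Lift a chunk-relative interval onto a different chunk of the same chromosome."""
    # Step 1: lift from the old chunk back up to chromosome coordinates.
    chrom_start = interval[0] + old_chunk_start
    chrom_end = interval[1] + old_chunk_start
    new_start, new_end = new_chunk
    # Step 2: lift back down onto the new chunk, refusing intervals that do not fit.
    if chrom_start < new_start or chrom_end > new_end:
        raise ValueError("interval does not lie within the new chunk")
    return chrom_start - new_start, chrom_end - new_start


# Interval [4, 8) on a chunk starting at 1 is [5, 9) on the chromosome,
# which is [2, 6) on a new chunk spanning [3, 10).
lifted = lift_between_chunks((4, 8), old_chunk_start=1, new_chunk=(3, 10))
```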

liftover_to_parent_or_seq_chunk_parent(parent_or_seq_chunk_parent: inscripta.biocantor.parent.Parent) AbstractInterval

This function returns a copy of this interval lifted over to a new coordinate system. If this interval is already in chunk-relative coordinates, it is first lifted back up to chromosome coordinates before the liftover occurs. This means that there must be a Parent somewhere in the ancestry with type “chromosome”, and that Parent must match the supplied parent except for location information.

Validation has to happen here in addition to in liftover_location_to_seq_chunk_parent(), because at this point the parent of this current interval is still known. Once the to_dict() operation is performed, this information is lost, and the new parent is applied under the assumption that it is valid.

_liftover_this_location_to_seq_chunk_parent(seq_chunk_parent: inscripta.biocantor.parent.Parent)

Lift this interval to a new subset.

This could happen as the result of a subsetting operation.

This will introduce chunk-relative coordinates to this interval, or reduce the size of existing chunk-relative coordinates.

This function calls the parent static method AbstractInterval.liftover_location_to_seq_chunk_parent(), but differs in two key ways:

  1. It acts on an instantiated subclass of this abstract class, modifying the location.

  2. It handles the case where a subclass is already a slice, by first lifting up to genomic coordinates.

For these reasons, and particularly the first, this is a private method that is intended to be used during construction of a subclass. Modifying locations in-place is generally a bad idea after initial construction of an interval class.

_reset_parent(parent: Optional[inscripta.biocantor.parent.Parent] = None) None

Convenience function that wraps location.reset_parent().

NOTE: This function modifies this interval in-place, and does not return a new copy. This differs from the behavior of the base function, because this function is called recursively from collection objects.

NOTE: Using this function presents the risk that you will change the sequence of this interval. There are no checks that the new parent provides the same sequence basis as the original parent.

has_ancestor_of_type(ancestor_type: Union[str, inscripta.biocantor.parent.SequenceType]) bool

Convenience function that wraps location.has_ancestor_of_type().

first_ancestor_of_type(ancestor_type: Union[str, inscripta.biocantor.parent.SequenceType]) inscripta.biocantor.parent.Parent

Convenience function that returns the first ancestor of this type.

lift_over_to_first_ancestor_of_type(sequence_type: Optional[Union[str, inscripta.biocantor.parent.SequenceType]] = SequenceType.CHROMOSOME) inscripta.biocantor.location.Location

Lifts the location member to another coordinate system. Is a no-op if there is no parent assigned.

Returns

The lifted Location.

_import_qualifiers_from_list(qualifiers: Optional[Dict[Hashable, List[Hashable]]] = None)

Import input qualifiers to sets and store.

_export_qualifiers_to_list() Optional[Dict[Hashable, List[str]]]

Export qualifiers back to lists. This is used when exporting to dictionary / converting back to marshmallow schemas.
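
The list-to-set round trip performed by these two helpers can be sketched as follows. The function names below are hypothetical, and the details (values coerced with str, sorted output, None returned for empty qualifiers) are assumptions for illustration rather than confirmed BioCantor behavior.

```python
from typing import Dict, Hashable, List, Optional, Set


def import_qualifiers(
    qualifiers: Optional[Dict[Hashable, List[Hashable]]] = None,
) -> Dict[Hashable, Set[str]]:
    """Store each qualifier's values as a set of strings, dropping duplicates."""
    return {key: {str(v) for v in values} for key, values in (qualifiers or {}).items()}


def export_qualifiers(
    qualifiers: Dict[Hashable, Set[str]],
) -> Optional[Dict[Hashable, List[str]]]:
    """Convert sets back to lists; sorted here so the output is deterministic."""
    if not qualifiers:
        return None
    return {key: sorted(values) for key, values in qualifiers.items()}


stored = import_qualifiers({"note": ["b", "a", "a"], "db_xref": [123]})
exported = export_qualifiers(stored)
```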

class biocantor.gene.interval.AbstractFeatureInterval

Bases: AbstractInterval, abc.ABC

This is a wrapper over AbstractInterval that adds functions shared across TranscriptInterval, FeatureInterval, and VariantInterval.

property chromosome_location: inscripta.biocantor.location.Location

Returns the Location of this in chromosome coordinates.

If the coordinate system is unknown, this will return the same coordinate system as chunk_relative_location, that is the true underlying _location member.

This Location object will always have the full span of the Interval in chromosome coordinates, even if this feature exists in chunk relative coordinates. As a result of this, if this Interval was built on chunk relative coordinates, the sequence information will not be present.

property _chunk_relative_bounded_chromosome_location: inscripta.biocantor.location.Location

Returns the Location of this in chromosome coordinates.

This function is different from chromosome_location in that it will return a Location bounded by the chunk relative location of this Interval, if it exists.

property chromosome_span: inscripta.biocantor.location.Location

Returns the full span of this Interval in chromosome coordinates.

property chromosome_gaps_location: inscripta.biocantor.location.Location

Returns the Location of the gaps of this Interval in chromosome coordinates. This is analogous to returning the intron coordinates.
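
Conceptually, the gaps are the stretches between consecutive blocks. A minimal sketch with (start, end) tuples in 0-based, half-open coordinates; gaps_between_blocks is a hypothetical helper and does not return a BioCantor Location.

```python
from typing import List, Tuple

Block = Tuple[int, int]


def gaps_between_blocks(blocks: List[Block]) -> List[Block]:
    """Return the intervals between consecutive blocks (the intron-like regions)."""
    ordered = sorted(blocks)
    return [
        (prev_end, next_start)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
        if prev_end < next_start  # adjacent blocks leave no gap
    ]


gaps = gaps_between_blocks([(2, 5), (8, 12), (20, 25)])
```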

property chunk_relative_span: inscripta.biocantor.location.Location

Returns the full span of this Interval in chunk-relative coordinates.

property chunk_relative_gaps_location: inscripta.biocantor.location.Location

Returns the Location of the gaps of this Interval in chunk-relative coordinates. This is analogous to returning the intron coordinates.

property is_primary_feature: bool

Is this the primary feature?

property blocks: Iterable[inscripta.biocantor.location.location_impl.SingleInterval]

Wrapper for blocks function that reports blocks in chromosome coordinates

property relative_blocks: Iterable[inscripta.biocantor.location.location_impl.SingleInterval]

Wrapper for blocks function that reports blocks in chunk-relative coordinates

_genomic_ends :List[int]
_genomic_starts :List[int]
_strand :inscripta.biocantor.location.Strand
_is_primary_feature :Optional[bool]
__len__()
abstract export_qualifiers(parent_qualifiers: Optional[Dict[Hashable, Set[Hashable]]] = None) Dict[Hashable, Set[str]]

Exports qualifiers for GFF3 or GenBank export. This merges top level keys with the arbitrary values

abstract to_bed12(score: Optional[int] = 0, rgb: Optional[inscripta.biocantor.io.bed.RGB] = RGB(0, 0, 0), name: Optional[str] = 'feature_name', chromosome_relative_coordinates: bool = True) inscripta.biocantor.io.bed.BED12

Write a BED12 format representation of this AbstractFeatureInterval.

The score and rgb arguments are both optional and specific to the BED12 format.

Parameters
  • score – An optional score associated with an interval. UCSC requires an integer between 0 and 1000.

  • rgb – An optional RGB string for visualization on a browser. This allows you to have multiple colors on a single UCSC track.

  • name – Which identifier in this record to use as ‘name’. Defaults to feature_name. If the supplied string is not a valid attribute, it is used directly.

  • chromosome_relative_coordinates – Output BED12 in chromosome-relative coordinates? If False, an exception is raised when there is no sequence_chunk ancestor type.

Returns

A BED12 object.

Raises
  • NoSuchAncestorException – If chromosome_relative_coordinates is False but there is no sequence_chunk ancestor type.
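
For orientation, the block-related BED12 columns are derived from an interval's blocks as shown in this sketch. It follows the general BED convention (block starts are relative to chromStart) and uses a hypothetical helper, bed12_block_fields; it does not construct BioCantor's BED12 object.

```python
from typing import List, Tuple

Block = Tuple[int, int]


def bed12_block_fields(blocks: List[Block]) -> Tuple[int, int, int, List[int], List[int]]:
    """Compute chromStart, chromEnd, blockCount, blockSizes and blockStarts."""
    ordered = sorted(blocks)
    chrom_start = ordered[0][0]
    chrom_end = ordered[-1][1]
    block_sizes = [end - start for start, end in ordered]
    # BED block starts are offsets from chromStart, not absolute positions.
    block_starts = [start - chrom_start for start, _ in ordered]
    return chrom_start, chrom_end, len(ordered), block_sizes, block_starts


fields = bed12_block_fields([(2, 5), (8, 12)])
```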

abstract to_gff(parent: Optional[str] = None, parent_qualifiers: Optional[Dict] = None, chromosome_relative_coordinates: bool = True) Iterator[inscripta.biocantor.io.gff3.rows.GFFRow]

Writes a GFF format list of lists for this feature.

The additional qualifiers are used when writing a hierarchical relationship back to files. GFF files are easier to work with if the children features have the qualifiers of their parents.

Parameters
  • parent – ID of the Parent of this transcript.

  • parent_qualifiers – Directly pull qualifiers in from this dictionary.

  • chromosome_relative_coordinates – Output GFF in chromosome-relative coordinates? If False, an exception is raised when there is no sequence_chunk ancestor type.

Yields

GFFRow

Raises
  • NoSuchAncestorException – If chromosome_relative_coordinates is False but there is no sequence_chunk ancestor type.

sequence_pos_to_feature(pos: int) int

Converts sequence position to relative position along this feature.

sequence_interval_to_feature(chr_start: int, chr_end: int, chr_strand: inscripta.biocantor.location.Strand) inscripta.biocantor.location.Location

Converts a contiguous interval on the sequence to a relative location within this feature.

feature_pos_to_sequence(pos: int) int

Converts a relative position along this feature to sequence coordinate.

feature_interval_to_sequence(rel_start: int, rel_end: int, rel_strand: inscripta.biocantor.location.Strand) inscripta.biocantor.location.Location

Converts a contiguous interval relative to this feature to a spliced location on the sequence.
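
The position conversions above can be sketched in plain Python for a multi-block feature. These functions share names with the methods for readability but are standalone illustrations: they take the blocks explicitly as 0-based, half-open (start, end) tuples and a plus_strand flag, both of which are assumptions of this sketch.

```python
from typing import List, Tuple

Block = Tuple[int, int]


def sequence_pos_to_feature(pos: int, blocks: List[Block], plus_strand: bool = True) -> int:
    """Map a sequence position to its offset along the spliced feature."""
    total = sum(end - start for start, end in blocks)
    offset = 0
    for start, end in sorted(blocks):
        if start <= pos < end:
            rel = offset + (pos - start)
            # On the minus strand the feature is read from the other end.
            return rel if plus_strand else total - 1 - rel
        offset += end - start
    raise ValueError("position falls in a gap or outside the feature")


def feature_pos_to_sequence(pos: int, blocks: List[Block], plus_strand: bool = True) -> int:
    """Map an offset along the spliced feature back to a sequence position."""
    total = sum(end - start for start, end in blocks)
    if not 0 <= pos < total:
        raise ValueError("position is outside the feature")
    rel = pos if plus_strand else total - 1 - pos
    for start, end in sorted(blocks):
        size = end - start
        if rel < size:
            return start + rel
        rel -= size
    raise AssertionError("unreachable: rel was validated against the total size")


blocks = [(2, 5), (8, 12)]
plus_rel = sequence_pos_to_feature(9, blocks)
minus_rel = sequence_pos_to_feature(9, blocks, plus_strand=False)
```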

chunk_relative_pos_to_feature(pos: int) int

Converts chunk-relative sequence position to relative position along this feature.

chunk_relative_interval_to_feature(chr_start: int, chr_end: int, chr_strand: inscripta.biocantor.location.Strand) inscripta.biocantor.location.Location

Converts a contiguous chunk-relative interval on the sequence to a relative location within this feature.

feature_pos_to_chunk_relative(pos: int) int

Converts a relative position along this feature to chunk-relative sequence coordinate.

feature_interval_to_chunk_relative(rel_start: int, rel_end: int, rel_strand: inscripta.biocantor.location.Strand) inscripta.biocantor.location.Location

Converts a contiguous interval relative to this feature to a chunk-relative spliced location on the sequence.

get_spliced_sequence() inscripta.biocantor.sequence.Sequence

Returns the feature’s spliced, stranded sequence.

get_reference_sequence() inscripta.biocantor.sequence.Sequence

Returns the feature’s unspliced, positive strand genomic sequence.

get_genomic_sequence() inscripta.biocantor.sequence.Sequence

Returns the feature’s unspliced, stranded (transcription orientation) genomic sequence.
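
The difference between the three sequence accessors can be shown with string slicing. This sketch uses hypothetical standalone functions on a plain string genome with 0-based, half-open blocks; it does not return BioCantor Sequence objects.

```python
from typing import List, Tuple

Block = Tuple[int, int]
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def reference_sequence(genome: str, blocks: List[Block]) -> str:
    """Unspliced, plus-strand sequence spanning the whole feature."""
    return genome[min(s for s, _ in blocks):max(e for _, e in blocks)]


def genomic_sequence(genome: str, blocks: List[Block], plus_strand: bool) -> str:
    """Unspliced sequence in the orientation of the feature's strand."""
    ref = reference_sequence(genome, blocks)
    return ref if plus_strand else reverse_complement(ref)


def spliced_sequence(genome: str, blocks: List[Block], plus_strand: bool) -> str:
    """Blocks joined together (gaps removed), in the orientation of the strand."""
    joined = "".join(genome[s:e] for s, e in sorted(blocks))
    return joined if plus_strand else reverse_complement(joined)


genome = "AACCGGTTAACC"
blocks = [(1, 3), (5, 8)]
reference = reference_sequence(genome, blocks)
genomic_minus = genomic_sequence(genome, blocks, plus_strand=False)
spliced_plus = spliced_sequence(genome, blocks, plus_strand=True)
spliced_minus = spliced_sequence(genome, blocks, plus_strand=False)
```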

_merge_qualifiers(other_qualifiers: Optional[Dict[Hashable, Set[str]]] = None) Dict[Hashable, Set[str]]

Merges this Interval’s qualifiers dictionary with a new one, removing redundancy.
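
Because qualifier values are stored as sets, merging removes redundancy naturally. A minimal sketch, assuming a key-wise set union; merge_qualifiers is a hypothetical standalone function rather than the private method itself.

```python
from typing import Dict, Hashable, Optional, Set


def merge_qualifiers(
    own: Dict[Hashable, Set[str]],
    other: Optional[Dict[Hashable, Set[str]]] = None,
) -> Dict[Hashable, Set[str]]:
    """Union the values for each key without mutating either input."""
    merged = {key: set(values) for key, values in own.items()}
    for key, values in (other or {}).items():
        merged.setdefault(key, set()).update(values)
    return merged


own = {"note": {"a", "b"}}
merged = merge_qualifiers(own, {"note": {"b", "c"}, "gene": {"abc1"}})
```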

class biocantor.gene.interval.AbstractFeatureIntervalCollection

Bases: AbstractInterval, abc.ABC

Abstract class for holding groups of feature intervals. The two implementations of this class model Genes or non-transcribed FeatureCollections.

These are always on the same sequence, but can be on different strands.

property strand: inscripta.biocantor.location.Strand

Returns strand of location.

__iter__()

Iterate over all children of this collection

abstract iter_children() Iterable[AbstractInterval]

Iterate over the children

abstract children_guids() Set[uuid.UUID]

Get all of the GUIDs for children.

Returns: A set of UUIDs

abstract query_by_guids(id_or_ids: Union[uuid.UUID, List[uuid.UUID]]) AbstractFeatureIntervalCollection

Filter this collection object by a list of unique IDs.

Parameters

id_or_ids – List of GUIDs, or unique IDs. Can also be a single ID.

_reset_parent(parent: Optional[inscripta.biocantor.parent.Parent] = None) None

Reset parent of this collection, and all of its children.

THIS FUNCTION IS ONLY INTENDED TO BE USED DURING INITIALIZATION OF A NEW INTERVAL OBJECT. USING THIS FUNCTION AFTER THAT POINT RUNS THE RISK OF THE PARENT OF THE OBJECT NOT BEING REFLECTED BY METHODS ON THIS OBJECT THAT USE RESULT CACHING!

NOTE: This function modifies this collection in-place, and does not return a new copy. This differs from the behavior of the base function, because all of the children of this collection are also recursively modified.

NOTE: Using this function presents the risk that you will change the sequence of this interval. There are no checks that the new parent provides the same sequence basis as the original parent.

This overrides AbstractInterval._reset_parent(). The original function still applies to the leaf nodes.

_initialize_location(start: int, end: int, parent_or_seq_chunk_parent: Optional[inscripta.biocantor.parent.Parent] = None)

Initialize the location for this collection. Assumes that the start/end coordinates are genome-relative, and builds a chunk-relative location for this.

Parameters
  • start – genome-relative start

  • end – genome-relative end

  • parent_or_seq_chunk_parent – A parent that could be null, genome relative, or sequence chunk relative.

get_reference_sequence() inscripta.biocantor.sequence.Sequence

Returns the plus strand sequence for this interval

static _find_primary_feature(intervals: Union[List[inscripta.biocantor.gene.transcript.TranscriptInterval], List[inscripta.biocantor.gene.feature.FeatureInterval]]) Optional[Union[inscripta.biocantor.gene.transcript.TranscriptInterval, inscripta.biocantor.gene.feature.FeatureInterval]]

Used in object construction to find the primary feature. Shared between GeneInterval and FeatureIntervalCollection.

If not specified by the data source, primary features are determined by:

  1. If the feature is coding, then its CDS size

  2. The (spliced) feature size.

  3. The position of the feature within the ordered list of features.
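
The ordering above can be read as a sort key. The sketch below is one possible interpretation, not the library's implementation: it assumes larger sizes win, that non-coding features have a CDS size of 0, and that the earliest feature in the list wins ties. Candidate and find_primary are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    spliced_size: int
    cds_size: int = 0  # assumed to be 0 for non-coding features
    is_primary: bool = False  # set when the data source specifies the primary feature


def find_primary(candidates: List[Candidate]) -> Optional[Candidate]:
    if not candidates:
        return None
    flagged = [c for c in candidates if c.is_primary]
    if flagged:
        return flagged[0]
    # max() returns the first maximal element, so list order breaks remaining ties.
    return max(candidates, key=lambda c: (c.cds_size, c.spliced_size))


features = [
    Candidate("A", spliced_size=100, cds_size=30),
    Candidate("B", spliced_size=90, cds_size=60),
    Candidate("C", spliced_size=200),
]
primary = find_primary(features)
```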

_liftover_this_location_to_seq_chunk_parent(parent_or_seq_chunk_parent: inscripta.biocantor.parent.Parent)

Lift over this collection and all of its children