biocantor.gene.cds

Module Contents

Classes

CDSInterval

This class represents a CDS interval, or an interval with coding potential.

class biocantor.gene.cds.CDSInterval(cds_starts: List[int], cds_ends: List[int], strand: inscripta.biocantor.location.Strand, frames_or_phases: List[Union[inscripta.biocantor.gene.cds_frame.CDSFrame, inscripta.biocantor.gene.cds_frame.CDSPhase]], sequence_guid: Optional[uuid.UUID] = None, sequence_name: Optional[str] = None, protein_id: Optional[str] = None, product: Optional[str] = None, qualifiers: Optional[Dict[Hashable, inscripta.biocantor.gene.interval.QualifierValue]] = None, guid: Optional[uuid.UUID] = None, parent_or_seq_chunk_parent: Optional[inscripta.biocantor.parent.Parent] = None)

Bases: inscripta.biocantor.gene.interval.AbstractFeatureInterval

This class represents a CDS interval, or an interval with coding potential. This is generally only used as a member of a TranscriptInterval. This class adds metadata and frame information to a Location object, and adds an understanding of codons, codon iteration, and translation.

property id: str

Returns the ID of this feature. Provides a shared API across genes/transcripts and features.

property name: str

Returns the name of this feature. Provides a shared API across genes/transcripts and features.

property chunk_relative_frames: List[inscripta.biocantor.gene.cds_frame.CDSFrame]

It may be the case that the chunk relative location of this CDSInterval object is a subset of the full chromosomal location. In this case, the frames list needs to be appropriately subsetted to the correct set of frame entries.

However, subsetting frames in chunk context is far from trivial, because frames are calculated based on the full transcript length. Therefore, this property assumes that everything is in frame within the interval of the chunk. In other words, if you are modeling a programmed frameshift using the frames list, that information will be lost. The property loops over the frames in transcription orientation until it finds the first exon that is within the chunk, then uses that frame to parameterize the frame-generating function CDSInterval.construct_frames_from_location().

property has_canonical_start_codon: bool

Does this CDS have a canonical valid start? Requires a sequence be associated.

property has_valid_stop: bool

Does this CDS have a valid stop? Requires a sequence be associated.

property num_codons: int

Returns the total number of codons. This will reflect the true number of codons, even if this CDSInterval is parented on a sequence chunk.

property num_chunk_relative_codons: int

Returns the number of codons.

NOTE: If this CDS is a subset of the original sequence, this number will represent the subset, not the original size!

Any leading or trailing bases that are annotated as CDS but cannot form a full codon are excluded. Additionally, any internal codons that are incomplete are excluded.

Incomplete internal codons are determined by comparing the CDSFrame of each exon as annotated, to the expected value of the CDSFrame. This allows for an annotation to model things like programmed frameshifts and indels that may be assembly errors.
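
The exclusion of partial leading and trailing codons can be illustrated with plain arithmetic. This is only a sketch for a single contiguous CDS, not BioCantor's implementation; the function name and the `leading_offset` parameter are hypothetical.

```python
def count_complete_codons(cds_length: int, leading_offset: int = 0) -> int:
    """Count full codons in a contiguous CDS of `cds_length` bases.

    `leading_offset` is a hypothetical parameter: the number of leading bases
    (0, 1 or 2) that cannot form a full codon and are skipped.
    """
    if leading_offset not in (0, 1, 2):
        raise ValueError("leading_offset must be 0, 1 or 2")
    usable = max(cds_length - leading_offset, 0)
    # Floor division drops any trailing partial codon.
    return usable // 3


# A 15 bp CDS holds 5 codons; skipping one leading base leaves 14 bp,
# which is 4 full codons plus 2 trailing bases that are excluded.
print(count_complete_codons(15), count_complete_codons(15, leading_offset=1))
```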

property chunk_relative_codon_locations: Tuple[inscripta.biocantor.location.Location]

Returns a tuple of codon locations in chunk relative coordinates.

This function calls scan_codon_locations and stores the full result as a cached tuple.

property chromosome_codon_locations: Tuple[inscripta.biocantor.location.Location]

Returns a tuple of codon locations in chromosome coordinates.

If this is a chunk-relative CDS, the returned locations will not have sequence information.

This function calls scan_codon_locations and stores the full result as a cached tuple.

property has_in_frame_stop: bool

Does this CDS have an in-frame stop codon?

frames = []

_identifiers = ['protein_id', 'product']

__str__()

Return str(self).

__repr__()

Return repr(self).

__len__()

to_dict(chromosome_relative_coordinates: bool = True) Dict[str, Any]

Convert this CDS to a dictionary representation.

If chromosome_relative_coordinates is False, then the Frames list that comes out of this will lose programmed frameshift information.

Parameters

chromosome_relative_coordinates – Optional flag to export the interval in chromosome relative or chunk-relative coordinates.

Returns

A dictionary representation that can be passed to CDSInterval.from_dict()

static from_dict(vals: Dict[str, Any], parent_or_seq_chunk_parent: Optional[inscripta.biocantor.parent.Parent] = None) CDSInterval

Construct a CDSInterval from a dictionary representation such as one produced by CDSInterval.to_dict().

Parameters
  • vals – A dictionary representation.

  • parent_or_seq_chunk_parent – An optional Parent to associate with this new interval.

static from_location(location: inscripta.biocantor.location.Location, cds_frames: List[Union[inscripta.biocantor.gene.cds_frame.CDSFrame, inscripta.biocantor.gene.cds_frame.CDSPhase]], sequence_guid: Optional[uuid.UUID] = None, sequence_name: Optional[str] = None, protein_id: Optional[str] = None, product: Optional[str] = None, qualifiers: Optional[Dict[Hashable, inscripta.biocantor.gene.interval.QualifierValue]] = None, guid: Optional[uuid.UUID] = None) CDSInterval

A convenience function that allows for construction of a CDSInterval from a location object, a list of CDSFrames or CDSPhase, and optional metadata.

static from_chunk_relative_location(location: inscripta.biocantor.location.Location, cds_frames: List[Union[inscripta.biocantor.gene.cds_frame.CDSFrame, inscripta.biocantor.gene.cds_frame.CDSPhase]], sequence_guid: Optional[uuid.UUID] = None, sequence_name: Optional[str] = None, protein_id: Optional[str] = None, product: Optional[str] = None, qualifiers: Optional[Dict[Hashable, inscripta.biocantor.gene.interval.QualifierValue]] = None, guid: Optional[uuid.UUID] = None) CDSInterval

Allows construction of a CDSInterval from a chunk-relative location. This is a location present on a sequence chunk, which should be built by the convenience function seq_chunk_to_parent:

from inscripta.biocantor.io.parser import seq_chunk_to_parent
from inscripta.biocantor.location import SingleInterval, Strand

# The 28 bp chunk covers chromosome positions 222213 to 222241
parent = seq_chunk_to_parent('AANAAATGGCGAGCACCTAACCCCCNCC', "NC_000913.3", 222213, 222241)
loc = SingleInterval(5, 20, Strand.PLUS, parent=parent)

The location can then be lifted back to chromosomal coordinates as follows:

loc.lift_over_to_first_ancestor_of_type("chromosome")
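
The coordinate shift behind this lift-over can be checked by hand. The sketch below uses plain integers rather than BioCantor objects and assumes a plus-strand chunk; `lift_to_chromosome` is a hypothetical helper, not part of the library.

```python
from typing import Tuple


def lift_to_chromosome(chunk_start: int, rel_start: int, rel_end: int) -> Tuple[int, int]:
    """Shift a chunk-relative, half-open interval by the chunk's chromosome start."""
    return chunk_start + rel_start, chunk_start + rel_end


chunk_seq = "AANAAATGGCGAGCACCTAACCCCCNCC"
chunk_start, chunk_end = 222213, 222241

# The chunk sequence length must match the span it claims to cover.
assert len(chunk_seq) == chunk_end - chunk_start

# Chunk-relative 5-20 corresponds to chromosome 222218-222233.
print(lift_to_chromosome(chunk_start, 5, 20))
# The bases under the chunk-relative interval.
print(chunk_seq[5:20])
```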

export_qualifiers(parent_qualifiers: Optional[Dict[Hashable, Set[str]]] = None) Dict[Hashable, Set[Hashable]]

Exports qualifiers for GFF3/GenBank export

to_gff(parent: Optional[str] = None, parent_qualifiers: Optional[Dict[Hashable, Set[str]]] = None, chromosome_relative_coordinates: bool = True, raise_on_reserved_attributes: Optional[bool] = True) Iterator[inscripta.biocantor.io.gff3.rows.GFFRow]

Writes a GFF format list of lists for this CDS.

The additional qualifiers are used when writing a hierarchical relationship back to files. GFF files are easier to work with if the children features have the qualifiers of their parents.

Parameters
  • parent – ID of the Parent of this CDS.

  • parent_qualifiers – Directly pull qualifiers in from this dictionary.

  • chromosome_relative_coordinates – Output GFF in chromosome-relative coordinates? Will raise an exception if there is not a sequence_chunk ancestor type.

  • raise_on_reserved_attributes – If True, then GFF3 reserved attributes such as ID and Name present in the qualifiers will lead to an exception and not a warning.

Yields

GFFRow

Raises

has_start_codon_in_specific_translation_table(translation_table: Optional[inscripta.biocantor.gene.codon.TranslationTable] = TranslationTable.DEFAULT) bool

Does this CDS have a valid start in a provided translation table? Requires a sequence be associated.

Defaults to the DEFAULT table, which is just ATG.

_frame_iter(chunk_relative_frames: bool = True) Iterator[inscripta.biocantor.gene.cds_frame.CDSFrame]

Iterate over frames taking strand into account.

If chunk_relative_frames is True, then this iterator will only iterate over frames that overlap the relative chunk. These frames will potentially be reduced in quantity, and also shifted to handle exons that are now partial exons.

_exon_iter(chunk_relative_exon: bool = True) Iterator[inscripta.biocantor.location.SingleInterval]

Iterate over exons in transcription direction

extract_sequence() inscripta.biocantor.sequence.Sequence

Returns a continuous CDS sequence that is in frame and always a multiple of 3.

Any leading or trailing bases that are annotated as CDS but cannot form a full codon are removed. Additionally, any internal codons that are incomplete are removed.

Incomplete internal codons are determined by comparing the CDSFrame of each exon as annotated, to the expected value of the CDSFrame. This allows for an annotation to model things like programmed frameshifts and indels that may be assembly errors.

This function has been optimized to run as fast as possible. The original implementation iterated over every codon, but this is slower because it has a lot of object instantiation overhead. However, if those objects have already been instantiated and cached, then it is faster to just re-use them.

scan_codons(truncate_at_in_frame_stop: Optional[bool] = False) Iterator[inscripta.biocantor.gene.codon.Codon]

Iterator along codons. If truncate_at_in_frame_stop is True, this will stop iteration at the first in-frame stop.
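
As a rough illustration of this iteration, the sketch below walks a plain string in steps of three. It is not the library's code: it yields strings rather than Codon objects, and it assumes that truncation excludes the stop codon itself, which the description above does not specify.

```python
from typing import Iterator

STOP_CODONS = {"TAA", "TAG", "TGA"}


def scan_codons(seq: str, truncate_at_in_frame_stop: bool = False) -> Iterator[str]:
    """Yield successive in-frame codons from an already in-frame sequence."""
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        codon = seq[i:i + 3].upper()
        # Assumption: iteration ends before the stop codon is yielded.
        if truncate_at_in_frame_stop and codon in STOP_CODONS:
            return
        yield codon


print(list(scan_codons("ATGGCGTAAACC")))
print(list(scan_codons("ATGGCGTAAACC", truncate_at_in_frame_stop=True)))
```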

_convert_chromosome_start_end_to_relative_window(chromosome_start: Optional[int] = None, chromosome_end: Optional[int] = None, expand_window_to_partial_codons: bool = False) Optional[inscripta.biocantor.location.SingleInterval]

Converts possibly null chromosomal start/end positions to a SingleInterval representing the genomic span of that window. Null start/end values default to the start/end of this CDS.

_expand_coordinates_to_codons(chromosome_start: int, chromosome_end: int) Tuple[int, int]

Convenience function to take a pair of chromosome coordinates and return new coordinates that contain only full codons.

scan_chunk_relative_codon_locations(chromosome_start: Optional[int] = None, chromosome_end: Optional[int] = None, expand_window_to_partial_codons: bool = False) Iterator[inscripta.biocantor.location.Location]

Returns an iterator over codon locations in chunk relative coordinates.

Any leading or trailing bases that are annotated as CDS but cannot form a full codon are excluded. Additionally, any internal codons that are incomplete are excluded.

Incomplete internal codons are determined by comparing the CDSFrame of each exon as annotated, to the expected value of the CDSFrame. This allows for an annotation to model things like programmed frameshifts and indels that may be assembly errors.

Parameters
  • chromosome_start – An optional chromosome position at which to start the iteration. The resulting codons will maintain frame.

  • chromosome_end – An optional chromosome position at which to end the iteration. The resulting codons will maintain frame. This number can be larger than the chromosome end position.

  • expand_window_to_partial_codons – If True, and the chromosome_start or chromosome_end parameters are set to values within a codon, the full codon will be retained. If False, partial codons will be eliminated.
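
The effect of expand_window_to_partial_codons can be sketched for the simplest case: a single-exon, plus-strand CDS whose first base is codon position 0. The helper below is hypothetical and works on plain integers; the real method also handles strands, multiple exons and frame offsets.

```python
from typing import Optional, Tuple


def snap_window_to_codons(
    cds_start: int, cds_end: int, window_start: int, window_end: int, expand: bool = False
) -> Optional[Tuple[int, int]]:
    """Snap a half-open chromosome window to codon boundaries of the CDS."""
    # Clamp the window to the CDS, then work in CDS-relative coordinates.
    lo = max(window_start, cds_start) - cds_start
    hi = min(window_end, cds_end) - cds_start
    if expand:
        lo = (lo // 3) * 3       # round start down to a codon boundary
        hi = -(-hi // 3) * 3     # round end up to a codon boundary
    else:
        lo = -(-lo // 3) * 3     # round start up, dropping a partial codon
        hi = (hi // 3) * 3       # round end down, dropping a partial codon
    # Never extend past the last full codon of the CDS.
    hi = min(hi, ((cds_end - cds_start) // 3) * 3)
    if hi <= lo:
        return None
    return cds_start + lo, cds_start + hi


# CDS at 100-115 (five codons), window 101-110.
print(snap_window_to_codons(100, 115, 101, 110))               # partial codons dropped
print(snap_window_to_codons(100, 115, 101, 110, expand=True))  # partial codons kept whole
```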

scan_chromosome_codon_locations(chromosome_start: Optional[int] = None, chromosome_end: Optional[int] = None, expand_window_to_partial_codons: bool = False) Iterator[inscripta.biocantor.location.Location]

Returns an iterator over codon locations in chromosome coordinates.

If this is a chunk-relative CDS, the returned locations will not have sequence information.

Any leading or trailing bases that are annotated as CDS but cannot form a full codon are excluded. Additionally, any internal codons that are incomplete are excluded.

Incomplete internal codons are determined by comparing the CDSFrame of each exon as annotated, to the expected value of the CDSFrame. This allows for an annotation to model things like programmed frameshifts and indels that may be assembly errors.

Parameters
  • chromosome_start – An optional chromosome position at which to start the iteration. The resulting codons will maintain frame.

  • chromosome_end – An optional chromosome position at which to end the iteration. The resulting codons will maintain frame. This number can be larger than the chromosome end position.

  • expand_window_to_partial_codons – If True, and the chromosome_start or chromosome_end parameters are set to values within a codon, the full codon will be retained. If False, partial codons will be eliminated.

scan_codon_locations() Iterator[inscripta.biocantor.location.Location]

Scan codon locations in chunk-relative coordinates. This function exists for backwards compatibility and is deprecated. However, it retains the speedup optimization introduced in BioCantor 0.10.0.

_scan_codon_locations(relative_window: Optional[inscripta.biocantor.location.SingleInterval] = None, chunk_relative_coordinates: bool = True) Iterator[inscripta.biocantor.location.Location]

Returns an iterator over codon locations.

relative_window must be in chromosome coordinates, ideally built by _convert_chromosome_start_end_to_relative_window.

Any leading or trailing bases that are annotated as CDS but cannot form a full codon are excluded.

_prepare_single_exon_window_for_scan_codon_locations(relative_window: Optional[inscripta.biocantor.location.SingleInterval] = None, chunk_relative_coordinates: bool = True) Tuple[inscripta.biocantor.location.Location, int]

This function exists to prepare a Location object to pass to the iterator _scan_codon_locations. By placing the logic in this function, the result can be cached, as you cannot cache iterators.

relative_window must be in chromosome coordinates, ideally built by _convert_chromosome_start_end_to_relative_window.

Returns a tuple of the Location to be iterated over, and its offset.

_prepare_multi_exon_window_for_scan_codon_locations(relative_window: Optional[inscripta.biocantor.location.SingleInterval] = None, chunk_relative_coordinates: bool = True) Tuple[inscripta.biocantor.location.Location, int]

This function exists to prepare a Location object to pass to the iterator _scan_codon_locations. By placing the logic in this function, the result can be cached, as you cannot cache iterators.

Returns a tuple of the Location to be iterated over, and its offset.

_calculate_frame_offset(cleaned_location: inscripta.biocantor.location.Location, loc_on_chrom: inscripta.biocantor.location.Location) int

In either single-exon or multi-exon codon iteration, if this CDSInterval exists on chunk-relative coordinates that slice down the CDS, then the initial offset provided by the CDSFrame field must be adjusted to maintain frame.

translate(truncate_at_in_frame_stop: Optional[bool] = False, translation_table: Optional[inscripta.biocantor.gene.codon.TranslationTable] = TranslationTable.DEFAULT, strict: bool = True) inscripta.biocantor.sequence.Sequence

Returns amino acid sequence of this CDS.

Parameters
  • truncate_at_in_frame_stop – If truncate_at_in_frame_stop is True, this will stop at the first in-frame stop.

  • translation_table – Currently the translation_table field only controls the start codon. Using non-standard translation tables will change the set of start codons that code for Methionine, and will not change any other codons.

  • strict – If False, allows untranslatable codons to be represented by an X. Otherwise, throws ValueError.

Returns

The translated amino acid sequence

Return type

Sequence

Raises

ValueError – If a codon is untranslatable and strict is True.
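
The interaction of strict and truncate_at_in_frame_stop can be sketched on a plain string using the standard genetic code. This is an illustration only: it returns a str rather than a Sequence, ignores translation_table, and assumes the stop is rendered as `*` when it is not truncated.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq: str, truncate_at_in_frame_stop: bool = False, strict: bool = True) -> str:
    """Translate an in-frame nucleotide string into amino acids."""
    protein = []
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3].upper()
        aa = CODON_TABLE.get(codon)
        if aa is None:
            if strict:
                raise ValueError(f"Codon {codon} is untranslatable")
            aa = "X"  # non-strict mode marks unknown codons with X
        if truncate_at_in_frame_stop and aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCGAGCACCTAA"))
print(translate("ATGNNNTAA", strict=False))
```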

static construct_frames_from_location(location: inscripta.biocantor.location.Location, starting_frame: Optional[inscripta.biocantor.gene.cds_frame.CDSFrame] = CDSFrame.ZERO) List[inscripta.biocantor.gene.cds_frame.CDSFrame]

Construct a list of CDSFrames from a Location. This is intended to construct frames in situations where the frames are not known. One example of such a case is when parsing GenBank files, which have only a codon_start field to measure the offset at the start of translation.

The frame calculation is hard to follow, so the examples below may help:

  1. Plus strand:

CompoundInterval([0, 7, 12], [5, 11, 18], Strand.PLUS)
Index:      0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
Sequence:   A A A C A A A A G G G  T  A  C  C  C  A  A  A  A  A  A
Exons:      A A A C A     A G G G     A  C  C  C  A  A
Zero Frame: 0 1 2 0 1     2 0 1 2     0  1  2  0  1  2
One Frame:  - 0 1 2 0     1 2 0 1     2  0  1  2  0  1
Two Frame:  - - 0 1 2     0 1 2 0     1  2  0  1  2  0

In the non-zero case, the [0, 1, 2] cycle is offset by 1 or 2 bases.

So, for this test case we expect the frames to be:

    Zero Frame: [0, 2, 0]
    One Frame:  [1, 1, 2]
    Two Frame:  [2, 0, 1]


  2. Minus strand:

CompoundInterval([0, 7, 12], [5, 11, 18], Strand.MINUS)
Index:      0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
Sequence:   A A A C A A A A G G G  T  A  C  C  C  A  A  A  A  A  A
Exons:      A A A C A     A G G G     A  C  C  C  A  A
Zero Frame: 2 1 0 2 1     0 2 1 0     2  1  0  2  1  0
One Frame:  1 0 2 1 0     2 1 0 2     1  0  2  1  0  -
Two Frame:  0 2 1 0 2     1 0 2 1     0  2  1  0  -  -

Now, for negative strand CDS intervals, the frame list is still in plus strand orientation.

So, for this test case we expect the frames to be:

    Zero Frame: [1, 0, 0]
    One Frame:  [0, 2, 1]
    Two Frame:  [2, 1, 2]

Parameters
  • location – An interval of the CDS.

  • starting_frame – Frame to start iteration with. If codon_start was the source of this value, then you would subtract one before converting to CDSFrame.

Returns

A list of CDSFrame that could be combined with the input Location to build a CDSInterval.
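
The expected frame lists in both examples can be reproduced with a short sketch. It uses plain tuples and integers instead of Location and CDSFrame objects, and the function name is hypothetical; it is one way to arrive at the documented values, not necessarily how the library computes them.

```python
from typing import List, Tuple


def construct_frames(blocks: List[Tuple[int, int]], strand: str = "+", starting_frame: int = 0) -> List[int]:
    """Return one frame per block, listed in plus-strand block order.

    `blocks` are half-open (start, end) pairs in plus-strand order.
    """
    if len(blocks) == 1:
        return [starting_frame]
    # Walk blocks in transcription order.
    ordered = blocks if strand == "+" else blocks[::-1]
    # Sizes of every block except the last one.
    sizes = [end - start for start, end in ordered[:-1]]
    # The first block loses the bases skipped by the starting frame.
    sizes[0] -= starting_frame
    frames = [0]
    for size in sizes:
        frames.append((frames[-1] + size) % 3)
    # Restore the caller's starting frame for the first transcribed block.
    frames[0] = starting_frame
    # The frame list is always reported in plus-strand orientation.
    return frames if strand == "+" else frames[::-1]


blocks = [(0, 5), (7, 11), (12, 18)]
for strand in ("+", "-"):
    for frame in (0, 1, 2):
        print(strand, frame, construct_frames(blocks, strand, frame))
```

When the starting frame comes from a GenBank codon_start value, it would be passed here as codon_start - 1, following the parameter description above.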

optimize_blocks() CDSInterval

Combine the blocks of this CDS interval, preserving overlapping blocks.

Once this operation is performed, internal frameshifts modeled by 0bp gaps will be lost, and the resulting translation will be out of frame downstream.

Returns

A new CDSInterval that has been merged.

optimize_and_combine_blocks() CDSInterval

Combine the blocks of this CDS interval, including removing overlapping blocks.

Once this operation is performed, internal frameshifts modeled by 0bp gaps will be lost, as well as programmed frameshifts modeled by overlapping blocks. The resulting translations will be out of frame downstream.

Returns

A new CDSInterval that has been merged.
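
The difference between the two merge operations can be sketched on plain (start, end) tuples. The helper and its `merge_overlapping` flag are hypothetical, and the sketch assumes blocks are sorted by start; it does not touch frames, which the real methods must also handle.

```python
from typing import List, Tuple


def merge_blocks(blocks: List[Tuple[int, int]], merge_overlapping: bool = False) -> List[Tuple[int, int]]:
    """Merge blocks separated by a 0 bp gap; optionally merge overlapping blocks too."""
    merged = [blocks[0]]
    for start, end in blocks[1:]:
        prev_start, prev_end = merged[-1]
        adjacent = start == prev_end      # 0 bp gap between blocks
        overlapping = start < prev_end    # blocks share at least one base
        if adjacent or (merge_overlapping and overlapping):
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


blocks = [(0, 5), (5, 10), (9, 15), (20, 25)]
print(merge_blocks(blocks))                          # overlapping block preserved
print(merge_blocks(blocks, merge_overlapping=True))  # overlapping block merged away
```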

abstract to_bed12(score: Optional[int] = 0, rgb: Optional[inscripta.biocantor.io.bed.RGB] = RGB(0, 0, 0), name: Optional[str] = 'feature_name', chromosome_relative_coordinates: bool = True) inscripta.biocantor.io.bed.BED12

Write a BED12 format representation of this AbstractFeatureInterval.

The score and rgb arguments are specific to the BED12 format.

Parameters
  • score – An optional score associated with an interval. UCSC requires an integer between 0 and 1000.

  • rgb – An optional RGB string for visualization on a browser. This allows you to have multiple colors on a single UCSC track.

  • name – Which identifier in this record to use as ‘name’. Defaults to feature_name. If the supplied string is not a valid attribute, it is used directly.

  • chromosome_relative_coordinates – Output BED12 in chromosome-relative coordinates? Will raise an exception if there is not a sequence_chunk ancestor type.

Returns

A BED12 object.

Raises
  • NoSuchAncestorException – If chromosome_relative_coordinates is False but there is no sequence_chunk ancestor type.

cds_pos_to_sequence(pos: int) int

Converts a relative position along the CDS to sequence coordinate.
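
A minimal sketch of this mapping on plain exon tuples follows. It assumes half-open blocks in plus-strand order and a CDS that starts in frame zero; the function mirrors the method name for readability but is not the library's implementation.

```python
from typing import List, Tuple


def cds_pos_to_sequence(blocks: List[Tuple[int, int]], strand: str, pos: int) -> int:
    """Map a 0-based position along the spliced CDS to a sequence coordinate."""
    # Walk exons in transcription order, consuming `pos` as we go.
    ordered = blocks if strand == "+" else blocks[::-1]
    remaining = pos
    for start, end in ordered:
        size = end - start
        if remaining < size:
            # On the minus strand, CDS position 0 is the last base of the last block.
            return start + remaining if strand == "+" else end - 1 - remaining
        remaining -= size
    raise ValueError(f"position {pos} is beyond the end of the CDS")


blocks = [(0, 5), (7, 11), (12, 18)]
print(cds_pos_to_sequence(blocks, "+", 5))  # first base of the second exon
print(cds_pos_to_sequence(blocks, "-", 0))  # first transcribed base on the minus strand
```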

cds_pos_to_chunk_relative(pos: int) int

Converts a relative position along the CDS to chunk-relative sequence coordinate.

cds_interval_to_sequence(rel_start: int, rel_end: int, rel_strand: inscripta.biocantor.location.Strand) inscripta.biocantor.location.Location

Converts a contiguous interval relative to the CDS to a spliced location on the sequence.

cds_interval_to_chunk_relative(rel_start: int, rel_end: int, rel_strand: inscripta.biocantor.location.Strand) inscripta.biocantor.location.Location

Converts a contiguous interval relative to the CDS to a spliced location on the chunk-relative sequence.

sequence_pos_to_cds(pos: int) int

Converts a sequence relative position to a CDS position. This is the distance from the translation start in CDS coordinates.

Returns

An integer position in CDS coordinates.

Raises

InvalidPositionException – If the position provided is not part of this CDSInterval.

chunk_relative_pos_to_cds(pos: int) int

Converts chunk-relative sequence position to relative position along the CDS.

sequence_interval_to_cds(chr_start: int, chr_end: int, chr_strand: inscripta.biocantor.location.Strand) inscripta.biocantor.location.Location

Converts a contiguous interval on the sequence to a relative location within the CDS.

chunk_relative_interval_to_cds(chr_start: int, chr_end: int, chr_strand: inscripta.biocantor.location.Strand) inscripta.biocantor.location.Location

Converts a contiguous interval on the chunk-relative sequence to a relative location within the CDS.

sequence_pos_to_amino_acid(pos: int) int

Converts a sequence relative position to amino acid. The resulting value is always left-aligned.

Returns

A zero based integer position on the amino acid sequence, left aligned.

Raises

InvalidPositionException – If the position provided is not part of this CDSInterval.
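
The left-aligned amino acid index can be read as the CDS position floor-divided by three. The sketch below shows this on plain exon tuples, assuming a frame-zero CDS; it raises ValueError where the library raises InvalidPositionException, and both helper functions are illustrative stand-ins.

```python
from typing import List, Tuple


def sequence_pos_to_cds(blocks: List[Tuple[int, int]], strand: str, pos: int) -> int:
    """Map a sequence coordinate to a 0-based position along the spliced CDS."""
    ordered = blocks if strand == "+" else blocks[::-1]
    offset = 0  # CDS bases consumed by earlier exons in transcription order
    for start, end in ordered:
        if start <= pos < end:
            within = pos - start if strand == "+" else end - 1 - pos
            return offset + within
        offset += end - start
    raise ValueError(f"position {pos} is not part of this CDS")


def sequence_pos_to_amino_acid(blocks: List[Tuple[int, int]], strand: str, pos: int) -> int:
    """Return the 0-based amino acid index whose codon contains `pos`."""
    return sequence_pos_to_cds(blocks, strand, pos) // 3


blocks = [(0, 5), (7, 11), (12, 18)]
print(sequence_pos_to_amino_acid(blocks, "+", 7))  # CDS position 5 falls in codon 1
print(sequence_pos_to_amino_acid(blocks, "-", 4))  # CDS position 10 falls in codon 3
```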

incorporate_variants(variants: Union[inscripta.biocantor.gene.variants.VariantInterval, inscripta.biocantor.gene.variants.VariantIntervalCollection]) CDSInterval

Incorporate all of the variant(s) for an input VariantInterval or VariantIntervalCollection, producing a new CDSInterval with those changes incorporated.