biocantor.gene.variants

This module contains VariantInterval, which models diploid sequence variation.

This model is intended to be as simple as possible, and represent a single alternative haplotype. Variants are always represented on the positive strand, and is loosely modeled after VCF files.

Variants must be represented by intervals of at least length 1 in all cases. It is better to left-pad indels, but it is not required.

Here are examples of how variants can be represented.

A SNV:

# ref: ACTCTCTCTATCTCATCCAC
# alt: AGTCTCTCTATCTCATCCAC
snp_1 = VariantInterval(start=1, end=2, sequence="G", variant_type="SNV")

A GG insertion at position 5

# ref: ACTCT  CTCTATCTCATCCAC
# alt: ACTCTGGCTCTATCTCATCCAC
insertion_5 = VariantInterval(start=5, end=6, sequence="GGC", variant_type="insertion")

A left-padded deletion from 10 to 13:

# ref: ACTCTCTCTATCTCATCCAC
# alt: ACTCTCTCTAT  CATCCAC
deletion_11_13 = VariantInterval(start=10, end=13, sequence="T", variant_type="deletion")

A deletion from 13 to 15 without padding:

# ref: ACTCTCTCTATCTCATCCAC
# alt: ACTCTCTCTATCT  TCCAC
deletion_13_15 = VariantInterval(start=13, end=15, sequence="", variant_type="deletion")

Module Contents

Classes

VariantInterval

This is a wrapper over AbstractInterval that adds functions shared across

VariantIntervalCollection

A container for many VariantInterval. Assumes that the variants are all on the same haplotype.

class biocantor.gene.variants.VariantInterval(start: int, end: int, sequence: str, variant_type: str, phase_block: Optional[int] = None, guid: Optional[uuid.UUID] = None, variant_guid: Optional[uuid.UUID] = None, variant_name: Optional[str] = None, variant_id: Optional[str] = None, qualifiers: Optional[Dict[Hashable, inscripta.biocantor.gene.interval.QualifierValue]] = None, parent_or_seq_chunk_parent: Optional[inscripta.biocantor.location.Parent] = None)

Bases: inscripta.biocantor.gene.interval.AbstractFeatureInterval

This is a wrapper over AbstractInterval that adds functions shared across TranscriptInterval, FeatureInterval, and VariantInterval.

property id: str

Returns the ID of this feature. Provides a shared API across genes/transcripts and features.

property name: str

Returns the name of this feature. Provides a shared API across genes/transcripts and features.

property alternative_genomic_sequence: inscripta.biocantor.sequence.sequence.Sequence

Edited version of the original genomic sequence

property parent_with_alternative_sequence: inscripta.biocantor.location.Parent
property length_difference
_identifiers = ['variant_name', 'variant_id']
__str__()

Return str(self).

__repr__()

Return repr(self).

export_qualifiers(parent_qualifiers: Optional[Dict[Hashable, Set[Hashable]]] = None) Dict[Hashable, Set[str]]

Exports qualifiers for GFF3 or GenBank export. This merges top level keys with the arbitrary values

abstract to_bed12(score: Optional[int] = 0, rgb: Optional[inscripta.biocantor.io.bed.RGB] = RGB(0, 0, 0), name: Optional[str] = 'feature_name', chromosome_relative_coordinates: bool = True) inscripta.biocantor.io.bed.BED12

Write a BED12 format representation of this AbstractFeatureInterval.

Both of these optional arguments are specific to the BED12 format.

Parameters
  • score – An optional score associated with a interval. UCSC requires an integer between 0 and 1000.

  • rgb – An optional RGB string for visualization on a browser. This allows you to have multiple colors on a single UCSC track.

  • name – Which identifier in this record to use as ‘name’. feature_name to guid. If the supplied string is not a valid attribute, it is used directly.

  • chromosome_relative_coordinates – Output GFF in chromosome-relative coordinates? Will raise an exception if there is not a sequence_chunk ancestor type.

Returns

A BED12 object.

Raises
  • NoSuchAncestorException – If chromosome_relative_coordinates is False but there is no

  • sequence_chunk` ancestor type

abstract to_gff(parent: Optional[str] = None, parent_qualifiers: Optional[Dict] = None, chromosome_relative_coordinates: bool = True) Iterator[inscripta.biocantor.io.gff3.rows.GFFRow]

Writes a GFF format list of lists for this feature.

The additional qualifiers are used when writing a hierarchical relationship back to files. GFF files are easier to work with if the children features have the qualifiers of their parents.

Parameters
  • parent – ID of the Parent of this transcript.

  • parent_qualifiers – Directly pull qualifiers in from this dictionary.

  • chromosome_relative_coordinates – Output GFF in chromosome-relative coordinates? Will raise an exception if there is not a sequence_chunk ancestor type.

Yields

GFFRow

Raises
  • NoSuchAncestorException – If chromosome_relative_coordinates is False but there is no

  • sequence_chunk` ancestor type

abstract to_vcf()
to_dict(chromosome_relative_coordinates: bool = True) Dict[str, Any]

Dictionary to build Model representation. Defaults to always exporting in original chromosome relative coordinates, but this can be disabled to export in sequence-chunk relative coordinates.

If you have exported to sequence-chunk relative coordinates, and then try to re-instantiate, the subsequent object will now consider these new coordinates to be the original chromosome coordinates, and the relationship back to the true coordinates will be lost.

static from_dict(vals: Dict[str, Any], parent_or_seq_chunk_parent: Optional[inscripta.biocantor.location.Parent] = None) VariantInterval

Build an interval from a dictionary representation

lift_over_location(location: inscripta.biocantor.location.Location) inscripta.biocantor.location.Location

Construct a new Location that takes the alternative sequence defined by this VariantInterval into account.

The Location can be chunk-relative or chromosome-relative. It will be returned relative to the coordinate system of this VariantInterval.

_lift_over_chromosome_location_single_interval(location: inscripta.biocantor.location.SingleInterval) inscripta.biocantor.location.Location
_lift_over_chromosome_location_compound_interval(location: inscripta.biocantor.location.CompoundInterval) inscripta.biocantor.location.Location
class biocantor.gene.variants.VariantIntervalCollection(variant_intervals: List[VariantInterval], variant_collection_name: Optional[str] = None, variant_collection_id: Optional[str] = None, sequence_name: Optional[str] = None, sequence_guid: Optional[uuid.UUID] = None, guid: Optional[uuid.UUID] = None, qualifiers: Optional[Dict[Hashable, List[inscripta.biocantor.gene.interval.QualifierValue]]] = None, parent_or_seq_chunk_parent: Optional[inscripta.biocantor.location.Parent] = None)

Bases: inscripta.biocantor.gene.interval.AbstractFeatureIntervalCollection

A container for many VariantInterval. Assumes that the variants are all on the same haplotype.

property children_guids: Set[uuid.UUID]

Get all of the GUIDs for children.

Returns: A set of UUIDs

property id: str

Returns the ID of this feature. Provides a shared API across genes/transcripts and features.

property name: str

Returns the name of this feature. Provides a shared API across genes/transcripts and features.

property alternative_genomic_sequence: inscripta.biocantor.sequence.sequence.Sequence

Edited version of the original sequence

property parent_with_alternative_sequence: inscripta.biocantor.location.Parent
interval_type
_identifiers = ['variant_collection_name', 'variant_collection_id']
__repr__()

Return repr(self).

iter_children() Iterable[inscripta.biocantor.gene.interval.AbstractInterval]

Iterate over the children

query_by_guids(id_or_ids: Union[uuid.UUID, List[uuid.UUID]]) VariantIntervalCollection

Filter this collection object by a list of unique IDs.

Parameters

id_or_ids – List of GUIDs, or unique IDs. Can also be a single ID.

to_dict(chromosome_relative_coordinates: bool = True) Dict[str, Any]

Convert to a dict usable by VariantIntervalCollectionModel.

static from_dict(vals: Dict[str, Any], parent_or_seq_chunk_parent: Optional[inscripta.biocantor.location.Parent] = None) VariantIntervalCollection

Build a VariantIntervalCollection from a dictionary representation

abstract to_gff(chromosome_relative_coordinates: bool = True) Iterator[inscripta.biocantor.io.gff3.rows.GFFRow]

Writes a GFF format list of lists for this feature.

Parameters

chromosome_relative_coordinates – Output GFF in chromosome-relative coordinates? Will raise an exception if there is not a sequence_chunk ancestor type.

Yields

GFFRow

Raises
  • NoSuchAncestorException – If chromosome_relative_coordinates is False but there is no

  • sequence_chunk` ancestor type

lift_over_location(location: inscripta.biocantor.location.Location) inscripta.biocantor.location.Location

Construct a new Location that takes the alternative sequence defined by this VariantIntervalCollection into account.

The Location can be chunk-relative or chromosome-relative. It will be returned relative to the coordinate system of this VariantInterval.