Vignettes

Several longer examples demonstrate use cases for BioCantor.

Map variants to their relative position within overlapping transcripts

This example simulates working with exome variant annotations. BioCantor can be used to map chromosome-relative variants to their relative position along transcripts, for example permitting RNA-seq-based genotyping to proceed by mapping reads directly to transcript sequences.

[ ]:
from inscripta.biocantor.location.location_impl import SingleInterval, CompoundInterval, Strand
from inscripta.biocantor.parent import Parent
from inscripta.biocantor.gene.cds import CDSInterval, CDSFrame
from inscripta.biocantor.sequence import Sequence, Alphabet

# Define a set of transcript annotations
# Parents are optional when not needed to establish a hierarchy,
# so we omit them here
transcripts = [
    SingleInterval(200, 250, Strand.PLUS),
    SingleInterval(600, 800, Strand.PLUS),
    CompoundInterval([150, 220], [170, 250], Strand.MINUS),
    CompoundInterval([150, 220, 300], [170, 250, 350], Strand.MINUS),
]

# Define a SNP position
snp = SingleInterval(230, 231, Strand.PLUS)

# Identify transcripts overlapping the SNP
# For overlapping transcripts, report the relative position of the SNP
# along the spliced transcript
for transcript in transcripts:
    if snp.has_overlap(transcript, match_strand=False):
        relative_pos = transcript.parent_to_relative_pos(snp.start)
        print("{}\tRelative position: {}".format(transcript, relative_pos))
200-250:+       Relative position: 30
150-170:-, 220-250:-    Relative position: 19
150-170:-, 220-250:-, 300-350:- Relative position: 69

Memory-efficient database extraction of sequence and annotations

This example simulates working with a genome database that contains chromosome sequences and annotations relative to those chromosomes.

We demonstrate how a database client could retrieve sequence and annotation data without having to hold entire chromosomes in memory.

BioCantor is used to map annotations back and forth between smaller chromosome slices and the full chromosome, so that a user can always see chromosome coordinates and correct transcript sequences, while the database client works with smaller slices under the hood.

[ ]:
# This string represents the sequence of chr1 living in a database.
# The entire chromosome sequence is not loaded into memory in the code below;
# it will only be used to validate this example code.
chr1_seq_in_db = "GCTAGGGTCGTGAAGTCGATTCCTTCGATGGTTAAAAATCAAAGGCTCAGAGTGCAGACTGGAGCGCCCA"

# A lightweight Parent object that will be reused to link various elements with chr1
chr1_parent = Parent(id="chr1")

# Simulate querying the database for a small chunk of chr1
# Link the chunk to chr1 by specifying its position in its parent attribute
seq_chunk_from_db = Sequence(chr1_seq_in_db[30:60], Alphabet.NT_STRICT,
                             parent=Parent(location=SingleInterval(30, 60, Strand.PLUS,
                                                                   parent=chr1_parent)))

# Simulate querying the database for the coordinates of a transcript relative to chr1
transcript_rel_to_chr1 = SingleInterval(45, 52, Strand.PLUS, parent=chr1_parent)

# Now, map the transcript to the sequence chunk instead of chr1
transcript_location_rel_to_chunk = seq_chunk_from_db.\
                                   location_on_parent.\
                                   parent_to_relative_location(transcript_rel_to_chr1)

# Check the location of the transcript relative to the sequence chunk
transcript_location_rel_to_chunk
<SingleInterval 15-22:+>
[ ]:
# Link the transcript to the sequence chunk as parent
# This object holds a pointer to the sequence chunk and may be passed to downstream code
transcript_rel_to_chunk = transcript_location_rel_to_chunk.\
                          reset_parent(Parent(sequence=seq_chunk_from_db))
[ ]:
# Extract the transcript sequence without having chr1 in memory
transcript_rel_to_chunk.extract_sequence()
<Sequence=CTCAGAG;
  Alphabet=NT_STRICT;
  Length=7;
  Parent=None;
  Type=None>
[ ]:
# Check against the full chromosome sequence
chr1_seq_in_db[45:52]
'CTCAGAG'

Working with a chain of parent relationships

This example demonstrates a long chain of parent relationships.

The hierarchy established here is:

  1. Chromosome

  2. Slice of chromosome

  3. Transcript

  4. CDS

  5. Exon

In the example:

  • The chromosome slice is given an explicit location relative to the chromosome.

  • The transcript is given an explicit location relative to the chromosome slice.

  • The CDS is given an explicit location relative to the transcript.

  • The exon”s place in the hierarchy is established implicitly by the code.

From these, coordinates can be automatically converted up and down the hierarchy.

[ ]:
# Chromosome slice sequence, maybe from a database
# Has a Parent object representing chromosome 1, which does not hold sequence data
chromosome_slice = Sequence("AAGGCTCACGGAGAGAAGAGATCTCTCGGG",
                            Alphabet.NT_STRICT,
                            type="chromosome_slice",
                            parent=Parent(
                                id="chr1",  # Slice is part of chr1
                                location=SingleInterval(60, 90, Strand.PLUS),  # Slice location chr1:60-90:+
                                sequence_type="chromosome"))

# A spliced transcript with coordinates relative to the chromosome slice, maybe from the database
transcript = CompoundInterval([5, 13], [9, 20], Strand.MINUS, parent=chromosome_slice)

# An exon of the transcript
exon1 = transcript.blocks[0]

# Get coordinates of the exon relative to the chromosome slice
# Two ways to do this.
# First, pass the chromosome slice directly
exon1_on_slice_direct = exon1.lift_over_to_sequence(chromosome_slice)
exon1_on_slice_direct
<SingleInterval <Parent: id=None, type=chromosome_slice, strand=-, location=<SingleInterval 5-9:->, sequence=<Sequence;
  Alphabet=NT_STRICT;
  Length=30;
  Parent=<Parent: id=chr1, type=chromosome, strand=+, location=<SingleInterval 60-90:+>, sequence=None, parent=None>;
  Type=chromosome_slice>, parent=<Parent: id=chr1, type=chromosome, strand=+, location=<SingleInterval 60-90:+>, sequence=None, parent=None>>:5-9:->
[ ]:
# Second, pass the sequence type of the chromosome slice, useful for lifting up to an ancestor by
# type without having to pass a sequence around, or if there is no sequence data
exon1.lift_over_to_first_ancestor_of_type("chromosome_slice")
<SingleInterval <Parent: id=None, type=chromosome_slice, strand=-, location=<SingleInterval 5-9:->, sequence=<Sequence;
  Alphabet=NT_STRICT;
  Length=30;
  Parent=<Parent: id=chr1, type=chromosome, strand=+, location=<SingleInterval 60-90:+>, sequence=None, parent=None>;
  Type=chromosome_slice>, parent=<Parent: id=chr1, type=chromosome, strand=+, location=<SingleInterval 60-90:+>, sequence=None, parent=None>>:5-9:->
[ ]:
# Get exon coordinates relative to chr1
exon1.lift_over_to_first_ancestor_of_type("chromosome")
<SingleInterval <Parent: id=chr1, type=chromosome, strand=-, location=<SingleInterval 65-69:->, sequence=None, parent=None>:65-69:->
[ ]:
# Get exon coordinates relative to transcript
# Accomplish by pushing slice-relative coordinates "down" through transcript
transcript.parent_to_relative_location(exon1_on_slice_direct)
<SingleInterval 7-11:+>
[ ]: