Parsing GenBank files

Without specification, the default GenBank parsing function will be used. This function relies on the locus_tag field present on every child of a gene feature.

[ ]:
import os
os.chdir("/Users/ian.fiddes/repos/biocantor/")
[ ]:
from inscripta.biocantor.io.genbank.parser import parse_genbank
[ ]:
gbk = "tests/data/INSC1006_chrI.gbff"
[ ]:
with open(gbk, "r") as fh:
    parsed = list(parse_genbank(fh))
/Users/ian.fiddes/repos/biocantor/inscripta/biocantor/io/genbank/parser.py:844: GenBankEmptyGeneWarning: Gene SeqFeature(FeatureLocation(BeforePosition(42579), ExactPosition(43218), strand=1), type='gene') has no valid children features
  warnings.warn(GenBankEmptyGeneWarning(f"Gene {gene} has no valid children features"))

After parsing, there will be one ParsedAnnotationRecord built for every sequence in the GenBank file. This container class holds the original BioPython SeqRecord object, as well as one AnnotationCollectionModel for the parsed understanding of the annotations. These model objects are marshmallow_dataclass objects, and so can be dumped to and loaded directly from JSON.

[ ]:
annotation_collection_model = parsed[0].annotation
annotation_collection_model.Schema().dump(annotation_collection_model)
OrderedDict([('feature_collections', None),
             ('genes',
              [OrderedDict([('transcripts',
                             [OrderedDict([('exon_starts', [16174]),
                                           ('exon_ends', [18079]),
                                           ('strand', 'MINUS'),
                                           ('cds_starts', None),
                                           ('cds_ends', None),
                                           ('cds_frames', None),
                                           ('qualifiers',
                                            {'locus_tag': ['GI526_G0000001'],
                                             'ncRNA_class': ['other'],
                                             'note': ['CAT transcript id: T0000001; CAT alignment id: IsoSeq-PB.2586.1; CAT novel prediction: IsoSeq'],
                                             'product': ['CAT novel prediction: IsoSeq']}),
                                           ('is_primary_tx', False),
                                           ('transcript_id', None),
                                           ('protein_id', None),
                                           ('product',
                                            'CAT novel prediction: IsoSeq'),
                                           ('transcript_symbol', None),
                                           ('transcript_type', 'ncRNA'),
                                           ('sequence_name', 'CM021111.1'),
                                           ('sequence_guid', None),
                                           ('transcript_interval_guid',
                                            '6f3bfe10-08f8-3455-5ead-5ec2bae0c939'),
                                           ('transcript_guid', None)])]),
                            ('gene_id', None),
                            ('gene_symbol', None),
                            ('gene_type', 'ncRNA'),
                            ('locus_tag', 'GI526_G0000001'),
                            ('qualifiers', {'locus_tag': ['GI526_G0000001']}),
                            ('sequence_name', 'CM021111.1'),
                            ('sequence_guid', None),
                            ('gene_guid',
                             '8ad3f444-384e-35e0-e560-aef88bd2863f')]),
               OrderedDict([('transcripts',
                             [OrderedDict([('exon_starts', [37461]),
                                           ('exon_ends', [39103]),
                                           ('strand', 'PLUS'),
                                           ('cds_starts', [37637]),
                                           ('cds_ends', [39011]),
                                           ('cds_frames', ['ZERO']),
                                           ('qualifiers',
                                            {'codon_start': ['1'],
                                             'gene': ['GDH3'],
                                             'locus_tag': ['GI526_G0000002'],
                                             'note': ['CAT transcript id: T0000002; CAT alignment id: NM_001178204.1-0; CAT source transcript id: NM_001178204.1; CAT source transcript biotype: protein_coding'],
                                             'product': ['GDH3 isoform 1'],
                                             'protein_id': ['KAF1903245.1'],
                                             'translation': ['MTSEPEFQQAYDEIVSSVEDSKIFEKFPQYKKVLPIVSVPERIIQFRVTWENDNGEQEVAQGYRVQFNSAKGPYKGGLRFHPSVNLSILKFLGFEQIFKNALTGLDMGGGKGGLCVDLKGKSDNEIRRICYAFMRELSRHIGKDTDVPAGDIGVGGREIGYLFGAYRSYKNSWEGVLTGKGLNWGGSLIRPEATGFGLVYYTQAMIDYATNGKESFEGKRVTISGSGNVAQYAALKVIELGGIVVSLSDSKGCIISETGITSEQIHDIASAKIRFKSLEEIVDEYSTFSESKMKYVAGARPWTHVSNVDIALPCATQNEVSGDEAKALVASGVKFVAEGANMGSTPEAISVFETARSTATNAKDAVWFGPPKAANLGGVAVSGLEMAQNSQKVTWTAERVDQELKKIMINCFNDCIQAAQEYSTEKNTNTLPSLVKGANIASFVMVADAMLDQGDVF']}),
                                           ('is_primary_tx', False),
                                           ('transcript_id', None),
                                           ('protein_id', 'KAF1903245.1'),
                                           ('product', 'GDH3 isoform 1'),
                                           ('transcript_symbol', 'GDH3'),
                                           ('transcript_type',
                                            'protein_coding'),
                                           ('sequence_name', 'CM021111.1'),
                                           ('sequence_guid', None),
                                           ('transcript_interval_guid',
                                            'b9b580fe-80d9-b12b-3ff8-dfdac8b87b13'),
                                           ('transcript_guid', None)])]),
                            ('gene_id', None),
                            ('gene_symbol', 'GDH3'),
                            ('gene_type', 'protein_coding'),
                            ('locus_tag', 'GI526_G0000002'),
                            ('qualifiers',
                             {'gene': ['GDH3'],
                              'locus_tag': ['GI526_G0000002']}),
                            ('sequence_name', 'CM021111.1'),
                            ('sequence_guid', None),
                            ('gene_guid',
                             'a1b669f1-57f6-ae9b-8f4f-a27a6e84d15a')]),
               OrderedDict([('transcripts',
                             [OrderedDict([('exon_starts', [39518]),
                                           ('exon_ends', [40772]),
                                           ('strand', 'PLUS'),
                                           ('cds_starts', [39518]),
                                           ('cds_ends', [40772]),
                                           ('cds_frames', ['ZERO']),
                                           ('qualifiers',
                                            {'codon_start': ['1'],
                                             'gene': ['BDH2'],
                                             'locus_tag': ['GI526_G0000003'],
                                             'product': ['BDH2 isoform 1'],
                                             'protein_id': ['KAF1903246.1'],
                                             'translation': ['MRALAYFGKGNIRFTNHLKEPHIVAPDELVIDIAWCGICGTDLHEYTDGPIFFPEDGHTHEISHNPLPQAMGHEMAGTVLEVGPSVKNLKVGDKVVVEPTGTCRDRYRWPLSPKVDKEWCAACKKGYYNICSYLGLCGAGVQSGGFAEGVVMNESHCYKVPDFVPLDVAALIQPLAVCWHAIRVCEFKAGSTALIIGAGPIGLGTILALNAAGCKDIVVSEPAKVRRELAEKMGARVYDPTAHAAKESIDYLRSIADGGDGFDYTFDCSGLEVTLNAAIQCLTFRGTAVNLAMWGHHKIQFSPMDITLHERKYTGSMCYTHHDFETVIEALEEGRIDIDRARHMITGRVNIEDGLDGAIMKLINEKESTIKIILTPNNHGELNREADNEKKEISELSSRKDQERLRESINEAKLRHT']}),
                                           ('is_primary_tx', False),
                                           ('transcript_id', None),
                                           ('protein_id', 'KAF1903246.1'),
                                           ('product', 'BDH2 isoform 1'),
                                           ('transcript_symbol', 'BDH2'),
                                           ('transcript_type',
                                            'protein_coding'),
                                           ('sequence_name', 'CM021111.1'),
                                           ('sequence_guid', None),
                                           ('transcript_interval_guid',
                                            '8feef6bd-4e47-2893-218d-fc13e2f6f0ba'),
                                           ('transcript_guid', None)])]),
                            ('gene_id', None),
                            ('gene_symbol', 'BDH2'),
                            ('gene_type', 'protein_coding'),
                            ('locus_tag', 'GI526_G0000003'),
                            ('qualifiers',
                             {'gene': ['BDH2'],
                              'locus_tag': ['GI526_G0000003']}),
                            ('sequence_name', 'CM021111.1'),
                            ('sequence_guid', None),
                            ('gene_guid',
                             '4967ade5-6d91-faeb-79ed-e57093e4e5f2')]),
               OrderedDict([('transcripts',
                             [OrderedDict([('exon_starts', [41085]),
                                           ('exon_ends', [42503]),
                                           ('strand', 'PLUS'),
                                           ('cds_starts', None),
                                           ('cds_ends', None),
                                           ('cds_frames', None),
                                           ('qualifiers',
                                            {'gene': ['BDH1'],
                                             'locus_tag': ['GI526_G0000004'],
                                             'note': ['CAT transcript id: T0000004; CAT alignment id: NM_001178202.2-0; CAT source transcript id: NM_001178202.2; CAT source transcript biotype: protein_coding'],
                                             'product': ['BDH1 isoform 1']}),
                                           ('is_primary_tx', False),
                                           ('transcript_id', None),
                                           ('protein_id', None),
                                           ('product', 'BDH1 isoform 1'),
                                           ('transcript_symbol', 'BDH1'),
                                           ('transcript_type',
                                            'protein_coding'),
                                           ('sequence_name', 'CM021111.1'),
                                           ('sequence_guid', None),
                                           ('transcript_interval_guid',
                                            'ec17480a-c3d5-3b55-9cf9-81aafcecd9f9'),
                                           ('transcript_guid', None)])]),
                            ('gene_id', None),
                            ('gene_symbol', 'BDH1'),
                            ('gene_type', 'protein_coding'),
                            ('locus_tag', 'GI526_G0000004'),
                            ('qualifiers',
                             {'gene': ['BDH1'],
                              'locus_tag': ['GI526_G0000004']}),
                            ('sequence_name', 'CM021111.1'),
                            ('sequence_guid', None),
                            ('gene_guid',
                             '278a932c-a0a7-e31b-3156-5860ca4a4021')]),
               OrderedDict([('transcripts',
                             [OrderedDict([('exon_starts', [42579]),
                                           ('exon_ends', [43218]),
                                           ('strand', 'PLUS'),
                                           ('cds_starts', None),
                                           ('cds_ends', None),
                                           ('cds_frames', None),
                                           ('qualifiers',
                                            {'gene': ['ECM1'],
                                             'locus_tag': ['GI526_G0000005']}),
                                           ('is_primary_tx', False),
                                           ('transcript_id', None),
                                           ('protein_id', None),
                                           ('product', None),
                                           ('transcript_symbol', 'ECM1'),
                                           ('transcript_type', 'ncRNA'),
                                           ('sequence_name', 'CM021111.1'),
                                           ('sequence_guid', None),
                                           ('transcript_interval_guid',
                                            '77c36639-a68d-72b8-3877-d7c5b4f7b3dd'),
                                           ('transcript_guid', None)])]),
                            ('gene_id', None),
                            ('gene_symbol', 'ECM1'),
                            ('gene_type', 'ncRNA'),
                            ('locus_tag', 'GI526_G0000005'),
                            ('qualifiers',
                             {'gene': ['ECM1'],
                              'locus_tag': ['GI526_G0000005']}),
                            ('sequence_name', 'CM021111.1'),
                            ('sequence_guid', None),
                            ('gene_guid',
                             'e8b37537-588b-43a2-eb26-c88883f06014')])]),
             ('name', 'CM021111.1'),
             ('id', None),
             ('sequence_name', 'CM021111.1'),
             ('sequence_guid', None),
             ('sequence_path', None),
             ('qualifiers',
              {'chromosome': ['I'],
               'country': ['USA: Boulder, CO'],
               'db_xref': ['taxon:4932'],
               'mol_type': ['genomic DNA'],
               'organism': ['Saccharomyces cerevisiae'],
               'strain': ['INSC1006']}),
             ('start', 0),
             ('end', 50040),
             ('completely_within', None),
             ('parent_or_seq_chunk_parent', None)])

Converting models to BioCantor data structures

After loading an AnnotationCollectionModel, this object can be directly converted in to an AnnotationCollection with sequence information.

AnnotationCollection objects are the core data structure, and contain a set of genes and features as children.

[ ]:
annotation_collection = parsed[0].to_annotation_collection()
[ ]:
# this example dataset has 4 genes and 0 features
for child in annotation_collection:
    print(child)
GeneInterval(identifiers={'GI526_G0000001'}, Intervals:TranscriptInterval((16174-18079:-), cds=[None], symbol=None))
GeneInterval(identifiers={'GI526_G0000002', 'GDH3'}, Intervals:TranscriptInterval((37461-39103:+), cds=[CDS((37637-39011:+), (CDSFrame.ZERO)], symbol=GDH3))
GeneInterval(identifiers={'BDH2', 'GI526_G0000003'}, Intervals:TranscriptInterval((39518-40772:+), cds=[CDS((39518-40772:+), (CDSFrame.ZERO)], symbol=BDH2))
GeneInterval(identifiers={'GI526_G0000004', 'BDH1'}, Intervals:TranscriptInterval((41085-42503:+), cds=[None], symbol=BDH1))
GeneInterval(identifiers={'GI526_G0000005', 'ECM1'}, Intervals:TranscriptInterval((42579-43218:+), cds=[None], symbol=ECM1))
[ ]:
gene1 = annotation_collection.genes[0]
tx1 = gene1.transcripts[0]
[ ]:
# convert mRNA coordinates to genomic coordinates
tx1.transcript_pos_to_sequence(0)
18078
[ ]:
# NoncodingTranscriptError is raised when trying to convert CDS coordinates on a non-coding transcript
tx1.cds_pos_to_sequence(0)
---------------------------------------------------------------------------
NoncodingTranscriptError                  Traceback (most recent call last)
<ipython-input-10-0276456c0756> in <module>()
      1 # NoncodingTranscriptError is raised when trying to convert CDS coordinates on a non-coding transcript
----> 2 tx1.cds_pos_to_sequence(0)

/Users/ian.fiddes/repos/biocantor/inscripta/biocantor/gene/transcript.py in cds_pos_to_sequence(self, pos)
    549         """Converts a relative position along the CDS to sequence coordinate."""
    550         if not self.is_coding:
--> 551             raise NoncodingTranscriptError("No CDS positions on non-coding transcript")
    552         return self.cds.chromosome_location.relative_to_parent_pos(pos)
    553

NoncodingTranscriptError: No CDS positions on non-coding transcript

Primary transcripts

It is often useful to have an understanding of what isoform of a gene is the ‘most important’. An input dataset can provide this information based on the parser implementation used. If this information is not provided, then this value is inferred by the simple heuristic of:

  1. Longest CDS isoform.

  2. Longest isoform (if no coding isoforms).

[ ]:
gene1.get_primary_transcript() == tx1
True

Incorporating sequence information

By default, the instantiation call ParsedAnnotationRecord.to_annotation_collection incorporated the sequence information on the objects. This allows for extraction of various types of sequences, including amino acid and spliced transcripts.

[ ]:
for gene in annotation_collection:
    for tx in gene.transcripts:
        if tx.is_coding:
            print(f"{tx.transcript_symbol}: mRNA: {tx.get_spliced_sequence()[:10]}... protein: {tx.get_protein_sequence()[:10]}...")
        else:
            print(f"{tx.transcript_symbol}: mRNA: {tx.get_spliced_sequence()[:10]}...")
None: mRNA: GCGGCGCTCT...
GDH3: mRNA: AAACAGTTAA... protein: MTSEPEFQQA...
BDH2: mRNA: ATGAGAGCCT... protein: MRALAYFGKG...
BDH1: mRNA: GGGGCAGATA...
ECM1: mRNA: ATGTGGGAAC...

Querying the collection

AnnotationCollections have the ability to be subsetted. These range queries can be performed in two modes, controlled by the flag completely_within. When completely_within = True, the positions in the query are exact bounds. When completely_within = False, any constituent object that overlaps the range query will be retained.

start and end are not required to be set, and are inferred to be 0 and len(sequence) respectively if not used.

[ ]:
# remove GI526_G0000001 by moving the start position to within its bounds, when strict boundaries are required
subset1 = annotation_collection.query_by_position(start=16175, completely_within=True)
print([x.locus_tag for x in subset1])
['GI526_G0000002', 'GI526_G0000003', 'GI526_G0000004', 'GI526_G0000005']
[ ]:
# the information on the current range of the object is retained
print(subset1.start, subset1.end, subset1.completely_within)
16175 50040 True
[ ]:
# select BDH1 and BDH2
subset2 = annotation_collection.query_by_position(start=40000, end=42000, completely_within=False)
print([x.gene_symbol for x in subset2])
['BDH2', 'BDH1']
[ ]: