Parsing GFF3 files

Without specification, the default GFF3 parsing function will be used. This function will call gffutils and instruct it to build a new database in memory, with the flag merge_strategy="create_unique". These can be adjusted by modfying the arguments to GffutilsParseArgs.

[ ]:
from inscripta.biocantor.io.gff3.parser import parse_standard_gff3
[ ]:
gff3 = "tests/data/INSC1006_chrI.gff3"
[ ]:
# GFF3 parsing requires that the file be passed directly instead as an open handle
annotation_records = list(parse_standard_gff3(gff3))

After parsing, there will be one AnnotationRecord built for every sequence in the GFF3 file. This dataclass is a wrapper that contains two objects, a SeqRecord object and a AnnotationCollectionModel. Because this parser is not given any FASTA information, the SeqRecord is null. The AnnotationCollectionModel objects are marshmallow_dataclass objects, and so can be dumped to and loaded directly from JSON.

[ ]:
annotation_collection_model = annotation_records[0].annotation
annotation_collection_model.Schema().dump(annotation_collection_model)
OrderedDict([('feature_collections', None),
             ('genes',
              [OrderedDict([('transcripts',
                             [OrderedDict([('exon_starts', [16174]),
                                           ('exon_ends', [18079]),
                                           ('strand', 'MINUS'),
                                           ('cds_starts', None),
                                           ('cds_ends', None),
                                           ('cds_frames', None),
                                           ('qualifiers',
                                            {'ncrna_class': ['other'],
                                             'note': ['CAT transcript id: T0000001; CAT alignment id: IsoSeq-PB.2586.1; CAT novel prediction: IsoSeq']}),
                                           ('is_primary_tx', False),
                                           ('transcript_id', 'GI526_G0000001'),
                                           ('protein_id', None),
                                           ('product', None),
                                           ('transcript_symbol',
                                            'GI526_G0000001'),
                                           ('transcript_type', 'ncRNA'),
                                           ('sequence_name', 'CM021111.1'),
                                           ('sequence_guid', None),
                                           ('transcript_interval_guid', None),
                                           ('transcript_guid', None)])]),
                            ('gene_id',
                             '8ad3f444-384e-35e0-e560-aef88bd2863f'),
                            ('gene_symbol', None),
                            ('gene_type', 'ncRNA'),
                            ('locus_tag', 'GI526_G0000001'),
                            ('qualifiers', None),
                            ('sequence_name', 'CM021111.1'),
                            ('sequence_guid', None),
                            ('gene_guid', None)]),
               OrderedDict([('transcripts',
                             [OrderedDict([('exon_starts', [37461]),
                                           ('exon_ends', [39103]),
                                           ('strand', 'PLUS'),
                                           ('cds_starts', [37637]),
                                           ('cds_ends', [39011]),
                                           ('cds_frames', ['ZERO']),
                                           ('qualifiers',
                                            {'codon_start': ['1'],
                                             'gene': ['GDH3'],
                                             'note': ['CAT transcript id: T0000002; CAT alignment id: NM_001178204.1-0; CAT source transcript id: NM_001178204.1; CAT source transcript biotype: protein_coding'],
                                             'translation': ['MTSEPEFQQAYDEIVSSVEDSKIFEKFPQYKKVLPIVSVPERIIQFRVTWENDNGEQEVAQGYRVQFNSAKGPYKGGLRFHPSVNLSILKFLGFEQIFKNALTGLDMGGGKGGLCVDLKGKSDNEIRRICYAFMRELSRHIGKDTDVPAGDIGVGGREIGYLFGAYRSYKNSWEGVLTGKGLNWGGSLIRPEATGFGLVYYTQAMIDYATNGKESFEGKRVTISGSGNVAQYAALKVIELGGIVVSLSDSKGCIISETGITSEQIHDIASAKIRFKSLEEIVDEYSTFSESKMKYVAGARPWTHVSNVDIALPCATQNEVSGDEAKALVASGVKFVAEGANMGSTPEAISVFETARSTATNAKDAVWFGPPKAANLGGVAVSGLEMAQNSQKVTWTAERVDQELKKIMINCFNDCIQAAQEYSTEKNTNTLPSLVKGANIASFVMVADAMLDQGDVF']}),
                                           ('is_primary_tx', False),
                                           ('transcript_id', 'GI526_G0000002'),
                                           ('protein_id', 'KAF1903245.1'),
                                           ('product', 'GDH3 isoform 1'),
                                           ('transcript_symbol', 'GDH3'),
                                           ('transcript_type',
                                            'protein_coding'),
                                           ('sequence_name', 'CM021111.1'),
                                           ('sequence_guid', None),
                                           ('transcript_interval_guid', None),
                                           ('transcript_guid', None)])]),
                            ('gene_id',
                             'a1b669f1-57f6-ae9b-8f4f-a27a6e84d15a'),
                            ('gene_symbol', 'GDH3'),
                            ('gene_type', 'protein_coding'),
                            ('locus_tag', 'GI526_G0000002'),
                            ('qualifiers', {'gene': ['GDH3']}),
                            ('sequence_name', 'CM021111.1'),
                            ('sequence_guid', None),
                            ('gene_guid', None)]),
               OrderedDict([('transcripts',
                             [OrderedDict([('exon_starts', [39518]),
                                           ('exon_ends', [40772]),
                                           ('strand', 'PLUS'),
                                           ('cds_starts', [39518]),
                                           ('cds_ends', [40772]),
                                           ('cds_frames', ['ZERO']),
                                           ('qualifiers',
                                            {'codon_start': ['1'],
                                             'gene': ['BDH2'],
                                             'translation': ['MRALAYFGKGNIRFTNHLKEPHIVAPDELVIDIAWCGICGTDLHEYTDGPIFFPEDGHTHEISHNPLPQAMGHEMAGTVLEVGPSVKNLKVGDKVVVEPTGTCRDRYRWPLSPKVDKEWCAACKKGYYNICSYLGLCGAGVQSGGFAEGVVMNESHCYKVPDFVPLDVAALIQPLAVCWHAIRVCEFKAGSTALIIGAGPIGLGTILALNAAGCKDIVVSEPAKVRRELAEKMGARVYDPTAHAAKESIDYLRSIADGGDGFDYTFDCSGLEVTLNAAIQCLTFRGTAVNLAMWGHHKIQFSPMDITLHERKYTGSMCYTHHDFETVIEALEEGRIDIDRARHMITGRVNIEDGLDGAIMKLINEKESTIKIILTPNNHGELNREADNEKKEISELSSRKDQERLRESINEAKLRHT']}),
                                           ('is_primary_tx', False),
                                           ('transcript_id', 'GI526_G0000003'),
                                           ('protein_id', 'KAF1903246.1'),
                                           ('product', 'BDH2 isoform 1'),
                                           ('transcript_symbol', 'BDH2'),
                                           ('transcript_type',
                                            'protein_coding'),
                                           ('sequence_name', 'CM021111.1'),
                                           ('sequence_guid', None),
                                           ('transcript_interval_guid', None),
                                           ('transcript_guid', None)])]),
                            ('gene_id',
                             '4967ade5-6d91-faeb-79ed-e57093e4e5f2'),
                            ('gene_symbol', 'BDH2'),
                            ('gene_type', 'protein_coding'),
                            ('locus_tag', 'GI526_G0000003'),
                            ('qualifiers', {'gene': ['BDH2']}),
                            ('sequence_name', 'CM021111.1'),
                            ('sequence_guid', None),
                            ('gene_guid', None)]),
               OrderedDict([('transcripts',
                             [OrderedDict([('exon_starts', [41085]),
                                           ('exon_ends', [42503]),
                                           ('strand', 'PLUS'),
                                           ('cds_starts', None),
                                           ('cds_ends', None),
                                           ('cds_frames', None),
                                           ('qualifiers',
                                            {'gene': ['BDH1'],
                                             'note': ['CAT transcript id: T0000004; CAT alignment id: NM_001178202.2-0; CAT source transcript id: NM_001178202.2; CAT source transcript biotype: protein_coding']}),
                                           ('is_primary_tx', False),
                                           ('transcript_id', 'GI526_G0000004'),
                                           ('protein_id', None),
                                           ('product', None),
                                           ('transcript_symbol', 'BDH1'),
                                           ('transcript_type',
                                            'protein_coding'),
                                           ('sequence_name', 'CM021111.1'),
                                           ('sequence_guid', None),
                                           ('transcript_interval_guid', None),
                                           ('transcript_guid', None)])]),
                            ('gene_id',
                             '278a932c-a0a7-e31b-3156-5860ca4a4021'),
                            ('gene_symbol', 'BDH1'),
                            ('gene_type', 'protein_coding'),
                            ('locus_tag', 'GI526_G0000004'),
                            ('qualifiers', {'gene': ['BDH1']}),
                            ('sequence_name', 'CM021111.1'),
                            ('sequence_guid', None),
                            ('gene_guid', None)]),
               OrderedDict([('transcripts',
                             [OrderedDict([('exon_starts', [42579]),
                                           ('exon_ends', [43218]),
                                           ('strand', 'PLUS'),
                                           ('cds_starts', None),
                                           ('cds_ends', None),
                                           ('cds_frames', None),
                                           ('qualifiers', {'gene': ['ECM1']}),
                                           ('is_primary_tx', False),
                                           ('transcript_id', 'GI526_G0000005'),
                                           ('protein_id', None),
                                           ('product', None),
                                           ('transcript_symbol', 'ECM1'),
                                           ('transcript_type', 'ncRNA'),
                                           ('sequence_name', 'CM021111.1'),
                                           ('sequence_guid', None),
                                           ('transcript_interval_guid', None),
                                           ('transcript_guid', None)])]),
                            ('gene_id',
                             'e8b37537-588b-43a2-eb26-c88883f06014'),
                            ('gene_symbol', 'ECM1'),
                            ('gene_type', 'ncRNA'),
                            ('locus_tag', 'GI526_G0000005'),
                            ('qualifiers', {'gene': ['ECM1']}),
                            ('sequence_name', 'CM021111.1'),
                            ('sequence_guid', None),
                            ('gene_guid', None)])]),
             ('name', None),
             ('id', None),
             ('sequence_name', 'CM021111.1'),
             ('sequence_guid', None),
             ('sequence_path', None),
             ('qualifiers', None),
             ('start', None),
             ('end', None),
             ('completely_within', None),
             ('parent_or_seq_chunk_parent', None)])

Converting models to BioCantor data structures

After loading an AnnotationCollectionModel, this object can be directly converted in to an AnnotationCollection with sequence information.

AnnotationCollection objects are the core data structure, and contain a set of genes and features as children.

[ ]:
annotation_collection = annotation_collection_model.to_annotation_collection()
[ ]:
# this example dataset has 4 genes and 0 features
for child in annotation_collection:
    print(child)
GeneInterval(identifiers={'8ad3f444-384e-35e0-e560-aef88bd2863f', 'GI526_G0000001'}, Intervals:TranscriptInterval((16174-18079:-), cds=[None], symbol=GI526_G0000001))
GeneInterval(identifiers={'GI526_G0000002', 'a1b669f1-57f6-ae9b-8f4f-a27a6e84d15a', 'GDH3'}, Intervals:TranscriptInterval((37461-39103:+), cds=[CDS((37637-39011:+), (CDSFrame.ZERO)], symbol=GDH3))
GeneInterval(identifiers={'4967ade5-6d91-faeb-79ed-e57093e4e5f2', 'BDH2', 'GI526_G0000003'}, Intervals:TranscriptInterval((39518-40772:+), cds=[CDS((39518-40772:+), (CDSFrame.ZERO)], symbol=BDH2))
GeneInterval(identifiers={'BDH1', 'GI526_G0000004', '278a932c-a0a7-e31b-3156-5860ca4a4021'}, Intervals:TranscriptInterval((41085-42503:+), cds=[None], symbol=BDH1))
GeneInterval(identifiers={'ECM1', 'e8b37537-588b-43a2-eb26-c88883f06014', 'GI526_G0000005'}, Intervals:TranscriptInterval((42579-43218:+), cds=[None], symbol=ECM1))
[ ]:
gene1 = annotation_collection.genes[0]
tx1 = gene1.transcripts[0]
[ ]:
# convert mRNA coordinates to genomic coordinates
tx1.transcript_pos_to_sequence(0)
18078
[ ]:
# NoncodingTranscriptError is raised when trying to convert CDS coordinates on a non-coding transcript
tx1.cds_pos_to_sequence(0)
---------------------------------------------------------------------------
NoncodingTranscriptError                  Traceback (most recent call last)
<ipython-input-12-0276456c0756> in <module>()
      1 # NoncodingTranscriptError is raised when trying to convert CDS coordinates on a non-coding transcript
----> 2 tx1.cds_pos_to_sequence(0)

/Users/ian.fiddes/repos/inscripta_content_store/dependencies/biocantor/inscripta/biocantor/gene/transcript.py in cds_pos_to_sequence(self, pos)
    549         """Converts a relative position along the CDS to sequence coordinate."""
    550         if not self.is_coding:
--> 551             raise NoncodingTranscriptError("No CDS positions on non-coding transcript")
    552         return self.cds.chromosome_location.relative_to_parent_pos(pos)
    553

NoncodingTranscriptError: No CDS positions on non-coding transcript

Primary transcripts

It is often useful to have an understanding of what isoform of a gene is the ‘most important’. An input dataset can provide this information based on the parser implementation used. If this information is not provided, then this value is inferred by the simple heuristic of:

  1. Longest CDS isoform.

  2. Longest isoform (if no coding isoforms).

[ ]:
gene1.get_primary_transcript() == tx1
True

Incorporating sequence information

The above parsing call treated the input GFF3 as a standard GFF3 without sequence. However, this file actually has a FASTA suffix, and so the sequence information can be loaded either from that file, or from a separate FASTA file.

When parsers are used that return sequence information, the returned ParsedAnnotationRecord now has a non-null seqrecord member. This object can then be used to instantiate an AnnotationCollection with sequence information.

[ ]:
from inscripta.biocantor.io.gff3.parser import parse_gff3_embedded_fasta

parsed_with_sequence = list(parse_gff3_embedded_fasta(gff3))
annotation_collection_with_sequence = parsed_with_sequence[0].to_annotation_collection()
annotation_collection_with_sequence.get_reference_sequence()
<Sequence;
  Alphabet=NT_EXTENDED_GAPPED;
  Length=50040;
  Parent=None;
  Type=None>
[ ]:
for gene in annotation_collection_with_sequence:
    for tx in gene.transcripts:
        if tx.is_coding:
            print(f"{tx.transcript_symbol}: mRNA: {tx.get_spliced_sequence()[:10]}... protein: {tx.get_protein_sequence()[:10]}...")
        else:
            print(f"{tx.transcript_symbol}: mRNA: {tx.get_spliced_sequence()[:10]}...")
GI526_G0000001: mRNA: GCGGCGCTCT...
GDH3: mRNA: AAACAGTTAA... protein: MTSEPEFQQA...
BDH2: mRNA: ATGAGAGCCT... protein: MRALAYFGKG...
BDH1: mRNA: GGGGCAGATA...
ECM1: mRNA: ATGTGGGAAC...

Sequence information can also come from a separate FASTA file. In this case, a different function call is used, and it gets provided the FASTA directly.

[ ]:
from inscripta.biocantor.io.gff3.parser import parse_gff3_fasta

fasta = "tests/data/INSC1006_chrI.fasta"
parsed_with_sequence_from_fasta = list(parse_gff3_embedded_fasta(gff3))
annotation_collection_with_sequence_from_fasta = parsed_with_sequence_from_fasta[0].to_annotation_collection()
[ ]:
annotation_collection_with_sequence_from_fasta.get_reference_sequence()
<Sequence;
  Alphabet=NT_EXTENDED_GAPPED;
  Length=50040;
  Parent=None;
  Type=None>

Querying the collection

AnnotationCollections have the ability to be subsetted. These range queries can be performed in two modes, controlled by the flag completely_within. When completely_within = True, the positions in the query are exact bounds. When completely_within = False, any constituent object that overlaps the range query will be retained.

start and end are not required to be set, and are inferred to be 0 and len(sequence) respectively if not used.

[ ]:
# remove GI526_G0000001 by moving the start position to within its bounds, when strict boundaries are required
subset1 = annotation_collection.query_by_position(start=16175, completely_within=True)
print([x.gene_symbol for x in subset1])
['GDH3', 'BDH2', 'BDH1', 'ECM1']
[ ]:
# select BDH1 and BDH2
subset2 = annotation_collection.query_by_position(start=40000, end=42000, completely_within=False)
print([x.gene_symbol for x in subset2])
['BDH2', 'BDH1']

Modifying the parser

Each of the GFF3 wrapper parser functions, parse_standard_gff3, parse_gff3_embedded_fasta, and parse_gff3_fasta, all have arguments that allow you to modify how the GFF3 file is interpreted. These include:

  1. gffutil_parse_args=GffutilsParseArgs(): This dataclass contains arguments that are passed to gffutils.

  2. parse_func: The default implementation of this function can be written bespoke to your annotation data.

  3. gffutil_transform_func: This function is passed straight to the same argument in gffutils.

  4. db_fn: Change this value to retain the gffutils sqlite database on disk.