Parsing annotation files with BioCantor
BioCantor supports two common annotation file formats, GenBank and GFF3. GenBank parsing is backed by the parser
present in BioPython, while GFF3 parsing relies on the library gffutils.
Because annotation file format specifications are often interpreted differently, BioCantor is designed to allow custom parsing functions to be written for all input formats. The library provides default implementations, but these can be replaced with your own functions in order to best map the data in your annotation file onto the values in the BioCantor annotation model.
GFF3
BioCantor supports import and export of GFF3 with or without an embedded FASTA file. Embedded FASTA files are a method
of combining sequence and annotation information. The annotation section is standard GFF3 and comes first, beginning with the
header line ##gff-version 3. The FASTA section comes at the end, and is delimited by the break line ##FASTA.
A GFF3 file with embedded FASTA can be constructed from two separate files with the command:
(cat "${GFF3}"; echo "##FASTA"; cat "${FASTA}") > gff3_with_fasta.gff3
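The reverse operation, separating the annotation section from the embedded FASTA section, can be sketched in plain Python. This is a minimal illustration of the file layout described above, not BioCantor's own implementation; the function name split_gff3_with_fasta and the example records are hypothetical.

```python
import io

FASTA_DIRECTIVE = "##FASTA"


def split_gff3_with_fasta(handle):
    """Split a GFF3 stream into (annotation_lines, fasta_lines).

    Everything before the first ##FASTA line is treated as annotation;
    everything after it is treated as FASTA. The directive itself is dropped.
    """
    annotation, fasta = [], []
    in_fasta = False
    for line in handle:
        line = line.rstrip("\n")
        if not in_fasta and line.strip() == FASTA_DIRECTIVE:
            in_fasta = True
            continue
        (fasta if in_fasta else annotation).append(line)
    return annotation, fasta


# Hypothetical in-memory file with one gene and one sequence.
example = io.StringIO(
    "##gff-version 3\n"
    "chr1\tsrc\tgene\t1\t9\t.\t+\t.\tID=gene1\n"
    "##FASTA\n"
    ">chr1\n"
    "ATGAAATAA\n"
)
annotation_lines, fasta_lines = split_gff3_with_fasta(example)
```

If the input has no ##FASTA line, the second list is simply empty, which corresponds to a GFF3 file without embedded sequence.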
To enable GFF3 parsing, BioCantor leverages the library gffutils.
This library builds a SQLite database from the input file, and is flexible enough to handle
the many ways that the GFF3 specification is interpreted in practice. BioCantor takes the resulting database and translates it into
the data model. Users can pass arguments down to gffutils in order to adjust how it interprets the files.
GenBank
BioCantor relies on BioPython to perform the core parsing of GenBank files. The resulting data structures are then
run through the default BioCantor parsing function to build the BioCantor annotation model.
Because GenBank files do not have an explicit hierarchical structure to annotations like
GFF3 files do, the hierarchy must be inferred. BioCantor offers two ways to do this.
The first is to assume that the file is sorted, and the second
is to use the key qualifier locus_tag to group together objects that belong to the
same gene or feature.
Parser Implementations
LocusTag (default parser)
The LocusTag parser implementation groups features in the GenBank file based on the value
of the locus_tag qualifier.
Within each locus_tag group, genes are identified by looking for the presence of a gene
feature. If a locus_tag group exists with no gene features, an exception will be raised.
Any features without a locus_tag will be interpreted as a FeatureIntervalCollection with
one FeatureInterval.
While the GenBank parser defaults to the LocusTag parser, two other implementations
exist and can be selected by passing a different GenBankParserType to the
gbk_type keyword argument of the parser function.
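The grouping rules of the LocusTag strategy can be illustrated with a small standalone sketch. It does not call BioCantor or BioPython; features are modeled as hypothetical (feature_type, qualifiers) tuples, and the function name group_by_locus_tag is an assumption made for this example.

```python
from collections import defaultdict

# Hypothetical, simplified features: (feature_type, qualifiers) tuples.
features = [
    ("gene", {"locus_tag": "b0001"}),
    ("CDS", {"locus_tag": "b0001"}),
    ("gene", {"locus_tag": "b0002"}),
    ("tRNA", {"locus_tag": "b0002"}),
    ("misc_feature", {}),
]


def group_by_locus_tag(features):
    """Group features by locus_tag; collect untagged features separately.

    Raises ValueError if a locus_tag group contains no gene feature,
    mirroring the behavior described in the text.
    """
    tagged = defaultdict(list)
    untagged = []
    for ftype, quals in features:
        tag = quals.get("locus_tag")
        if tag is None:
            untagged.append((ftype, quals))
        else:
            tagged[tag].append((ftype, quals))
    for tag, members in tagged.items():
        if not any(ftype == "gene" for ftype, _ in members):
            raise ValueError(f"locus_tag {tag!r} has no gene feature")
    return dict(tagged), untagged


genes, standalone = group_by_locus_tag(features)
```

In this sketch, each entry of standalone plays the role of a feature that would become its own single-interval collection, while each key of genes corresponds to one gene.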
Sorted
This parser implementation relies entirely on the sort order of the GenBank file. Genes are identified by looking
at the ordering of features within the file, starting a new group every time a gene feature is
found. This parser also handles isolated ncRNA or CDS features that have no gene feature;
in that case, a gene feature is inferred.
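A rough sketch of sort-order grouping follows. It is not BioCantor's implementation: features are hypothetical (type, start, end) tuples, and the rule used here to decide that a feature is isolated (it does not fall within the span of the most recent gene) is an assumption made only for illustration.

```python
# Feature types for which this sketch infers a parent gene when none is present.
INFERABLE = {"CDS", "ncRNA"}


def group_sorted(features):
    """Group (type, start, end) features by file order.

    A gene feature opens a new group. Later features join the open group
    if they lie within its span (an assumption of this sketch). An ncRNA or
    CDS outside any open gene gets an inferred gene with the same span.
    Anything else is returned in a separate list.
    """
    groups, ungrouped = [], []
    span = None  # (start, end) of the gene that opened the current group
    for ftype, start, end in features:
        if ftype == "gene":
            groups.append([(ftype, start, end)])
            span = (start, end)
        elif span is not None and span[0] <= start and end <= span[1]:
            groups[-1].append((ftype, start, end))
        elif ftype in INFERABLE:
            groups.append([("gene", start, end), (ftype, start, end)])
            span = (start, end)
        else:
            ungrouped.append((ftype, start, end))
    return groups, ungrouped


# Hypothetical sorted feature table; the ncRNA has no gene feature of its own.
features = [
    ("gene", 1, 100),
    ("mRNA", 1, 100),
    ("CDS", 10, 90),
    ("ncRNA", 200, 300),
    ("gene", 400, 500),
    ("CDS", 400, 500),
]
groups, ungrouped = group_sorted(features)
```

Because this approach depends only on ordering and position, it produces wrong groups if the file is not sorted, which is the trade-off against grouping by locus_tag.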
Hybrid
The hybrid parser combines LocusTag and Sorted parsing. The initial parsing pass uses LocusTag
parsing to group features together by locus_tag. After this step, the remaining features are passed to the
Sorted parser.
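The two-pass idea can be sketched as follows. This is a simplified illustration rather than BioCantor's code: features are hypothetical dictionaries with a type and an optional locus_tag, the function name hybrid_group is invented for the example, and gene inference for isolated features is omitted.

```python
from collections import defaultdict


def hybrid_group(features):
    """Two-pass grouping of features given in file order.

    Pass 1: features with a locus_tag are grouped by that tag.
    Pass 2: the remaining features are grouped by sort order, with a new
    group starting at every gene feature.
    """
    by_tag = defaultdict(list)
    remaining = []
    for feat in features:
        tag = feat.get("locus_tag")
        if tag is not None:
            by_tag[tag].append(feat)
        else:
            remaining.append(feat)

    groups = list(by_tag.values())

    current = None
    for feat in remaining:
        # A leading non-gene feature also opens a group in this sketch.
        if feat["type"] == "gene" or current is None:
            current = []
            groups.append(current)
        current.append(feat)
    return groups


# Hypothetical mixed input: some features carry a locus_tag, some do not.
features = [
    {"type": "gene", "locus_tag": "A"},
    {"type": "CDS", "locus_tag": "A"},
    {"type": "gene"},
    {"type": "CDS"},
    {"type": "gene", "locus_tag": "B"},
    {"type": "gene"},
    {"type": "tRNA"},
]
groups = hybrid_group(features)
group_types = [[f["type"] for f in g] for g in groups]
```

Tagged groups come first in the result because the tag-based pass runs before the sort-order pass; the original file order is preserved only within each group.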