Parsing annotation files with BioCantor
BioCantor supports two common annotation file formats, GenBank and GFF3. GenBank parsing is backed by the parser
present in BioPython, while GFF3 parsing relies on the library gffutils.
Because annotation file format specifications are often interpreted differently, BioCantor allows custom parsing functions to be written for every input format. The library ships default implementations, but you can replace them with your own to best map the data in your annotation file onto the values in the BioCantor annotation model.
GFF3
BioCantor supports import and export of GFF3 with or without an embedded FASTA file. An embedded FASTA combines
sequence and annotation information in a single file. The annotation section comes first and is standard GFF3,
beginning with the header line ##gff-version 3. The FASTA section comes at the end and is introduced by the line ##FASTA.
A GFF3 file with an embedded FASTA can be constructed from two separate files with the command:
(cat "${GFF3}"; printf '##FASTA\n'; cat "${FASTA}") > gff3_with_fasta.gff3
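To make the layout concrete, here is a minimal sketch in plain Python that splits such a combined file back into its annotation lines and its sequences. The function name split_gff3_with_fasta is hypothetical and is not part of BioCantor; the sketch only assumes the layout described above and tolerates blank lines around the ##FASTA line.

```python
from typing import Dict, List, Tuple


def split_gff3_with_fasta(text: str) -> Tuple[List[str], Dict[str, str]]:
    """Split GFF3 text with an embedded FASTA into annotation lines and sequences.

    Hypothetical helper for illustration only; not a BioCantor function.
    """
    annotation_lines: List[str] = []
    sequences: Dict[str, List[str]] = {}
    in_fasta = False
    current_id = None
    for raw_line in text.splitlines():
        line = raw_line.rstrip()
        if not line:
            # Blank lines carry no information in either section.
            continue
        if not in_fasta:
            if line == "##FASTA":
                # Everything after this line is FASTA.
                in_fasta = True
            else:
                annotation_lines.append(line)
        elif line.startswith(">"):
            # The sequence identifier is the first whitespace-delimited token.
            parts = line[1:].split()
            current_id = parts[0] if parts else ""
            sequences[current_id] = []
        elif current_id is not None:
            sequences[current_id].append(line)
    return annotation_lines, {name: "".join(chunks) for name, chunks in sequences.items()}


example = "\n".join(
    [
        "##gff-version 3",
        "chr1\t.\tgene\t1\t9\t.\t+\t.\tID=gene1",
        "##FASTA",
        "",
        ">chr1 example sequence",
        "ATGAAA",
        "TAA",
    ]
)
annotation, seqs = split_gff3_with_fasta(example)
```

In practice the BioCantor GFF3 parser handles this for you; the sketch is only meant to show where the two sections begin and end.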
To parse GFF3, BioCantor uses the library gffutils.
This library builds a SQLite database from the input file and is flexible enough to handle
the many ways the GFF3 spec is interpreted in practice. BioCantor then translates the resulting database into
its data model. Users can pass arguments through to gffutils to adjust how it interprets a file.
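As a rough illustration of why a database-backed representation is convenient, the sketch below loads a few GFF3 rows into an in-memory SQLite table using only the Python standard library and then walks the Parent relationships. The table layout, column names, and the children helper are assumptions made for this example; they are not the schema or API that gffutils uses.

```python
import sqlite3

GFF3_LINES = [
    "chr1\t.\tgene\t1\t100\t.\t+\t.\tID=gene1",
    "chr1\t.\tmRNA\t1\t100\t.\t+\t.\tID=tx1;Parent=gene1",
    "chr1\t.\tCDS\t10\t90\t.\t+\t0\tID=cds1;Parent=tx1",
]


def parse_attributes(column: str) -> dict:
    """Parse the ninth GFF3 column (key=value pairs separated by ';')."""
    return dict(pair.split("=", 1) for pair in column.split(";") if pair)


# Toy schema for illustration only; gffutils defines its own, richer schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE features ("
    "id TEXT PRIMARY KEY, parent TEXT, featuretype TEXT, "
    "start_pos INTEGER, end_pos INTEGER)"
)
for line in GFF3_LINES:
    cols = line.split("\t")
    attrs = parse_attributes(cols[8])
    conn.execute(
        "INSERT INTO features VALUES (?, ?, ?, ?, ?)",
        (attrs["ID"], attrs.get("Parent"), cols[2], int(cols[3]), int(cols[4])),
    )


def children(feature_id: str) -> list:
    """Return (id, featuretype) for the direct children of a feature."""
    rows = conn.execute(
        "SELECT id, featuretype FROM features WHERE parent = ? ORDER BY start_pos",
        (feature_id,),
    )
    return rows.fetchall()


transcripts = children("gene1")
coding = children("tx1")
```

Once the features are in a database, walking from a gene to its transcripts and from a transcript to its CDS intervals becomes a simple query, which is the kind of traversal needed to populate a hierarchical annotation model.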
GenBank
BioCantor relies on BioPython to perform the core parsing of GenBank files. The resulting data structures are then
run through the default BioCantor parsing function to build the BioCantor annotation model.
Because GenBank files, unlike GFF3 files, have no explicit hierarchical structure
for annotations, the hierarchy must be inferred. BioCantor offers two ways to do this.
The first is to assume that the file is sorted, and the second
is to use the locus_tag qualifier to group together objects that belong to the
same gene or feature.
Parser Implementations
LocusTag (default parser)
The LocusTag parser implementation groups features in the GenBank file based on the value
of the locus_tag qualifier.
Within each locus_tag group, genes are identified by looking for the presence of a gene
feature. If a locus_tag group exists with no gene features, an exception will be raised.
Any features without a locus_tag will be interpreted as a FeatureIntervalCollection with
one FeatureInterval.
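The grouping rules above can be sketched with a toy model in which each feature is just a (feature type, locus_tag) pair. This is not BioCantor's implementation: the Feature alias, the group_by_locus_tag function, and the use of ValueError are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Toy feature record: (feature type, locus_tag or None). Hypothetical model.
Feature = Tuple[str, Optional[str]]


def group_by_locus_tag(
    features: List[Feature],
) -> Tuple[Dict[str, List[Feature]], List[Feature]]:
    """Group features by locus_tag; return (groups, features without a tag)."""
    groups: Dict[str, List[Feature]] = defaultdict(list)
    untagged: List[Feature] = []
    for feature in features:
        _, locus_tag = feature
        if locus_tag is None:
            # Features without a locus_tag are kept aside as standalone features.
            untagged.append(feature)
        else:
            groups[locus_tag].append(feature)
    # Every locus_tag group must contain a gene feature.
    for locus_tag, members in groups.items():
        if not any(feature_type == "gene" for feature_type, _ in members):
            raise ValueError(f"locus_tag {locus_tag!r} has no gene feature")
    return dict(groups), untagged


features = [
    ("gene", "A"),
    ("mRNA", "A"),
    ("CDS", "A"),
    ("misc_feature", None),
    ("gene", "B"),
    ("tRNA", "B"),
]
groups, untagged = group_by_locus_tag(features)
```

In this toy model the untagged list plays the role of the features that BioCantor would turn into a FeatureIntervalCollection with one FeatureInterval each.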
While the GenBank parser defaults to the LocusTag parser, two other implementations
exist. Either can be selected by passing a different GenBankParserType as the
gbk_type keyword argument of the parser function.
Sorted
This parser implementation relies entirely on the sort order of the GenBank file. Genes are identified by looking
at the ordering of features within the file, starting a new group every time a gene feature is
found. This parser also handles isolated ncRNA or CDS features that have no gene feature by inferring a gene
feature for them.
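A minimal sketch of order-based grouping follows, using toy (feature type, start, end) tuples. To decide whether a CDS or ncRNA is isolated, the sketch assumes that a feature belongs to the most recent gene only if it lies within that gene's coordinates; this containment check, the Feature alias, and the group_sorted function are assumptions made for this example and may differ from what the Sorted parser actually does.

```python
from typing import List, Optional, Tuple

# Toy feature record: (feature type, start, end). Hypothetical model.
Feature = Tuple[str, int, int]

# Feature types for which a gene is inferred when none precedes them.
INFERABLE_TYPES = ("CDS", "ncRNA")


def group_sorted(features: List[Feature]) -> Tuple[List[List[Feature]], List[Feature]]:
    """Group features by file order; return (gene groups, ungrouped features)."""
    groups: List[List[Feature]] = []
    ungrouped: List[Feature] = []
    current_gene: Optional[Feature] = None
    for feature in features:
        feature_type, start, end = feature
        if feature_type == "gene":
            # Every gene feature starts a new group.
            current_gene = feature
            groups.append([feature])
        elif current_gene is not None and current_gene[1] <= start and end <= current_gene[2]:
            # Assumption: the feature belongs to the gene that contains it.
            groups[-1].append(feature)
        elif feature_type in INFERABLE_TYPES:
            # Isolated CDS or ncRNA: infer a gene with the same coordinates.
            current_gene = ("gene", start, end)
            groups.append([current_gene, feature])
        else:
            ungrouped.append(feature)
    return groups, ungrouped


features = [
    ("gene", 1, 100),
    ("mRNA", 1, 100),
    ("CDS", 10, 90),
    ("ncRNA", 200, 300),
    ("gene", 400, 500),
    ("CDS", 400, 500),
]
groups, ungrouped = group_sorted(features)
```

Because the grouping depends only on order, shuffling the features in the input would change the result, which is why this approach is only appropriate for files known to be sorted.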
Hybrid
The Hybrid parser combines LocusTag and Sorted parsing. The initial parsing pass uses LocusTag
parsing to group features together by locus_tag. After this step, the remaining features are passed to the
Sorted parser.
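The two passes can be sketched as follows, again with toy (feature type, locus_tag) pairs. The hybrid_group function is hypothetical, and its second pass is a deliberately simplified order-based grouping that starts a new group at every gene feature; it does not reproduce the full behavior of the Sorted parser.

```python
from typing import Dict, List, Optional, Tuple

# Toy feature record: (feature type, locus_tag or None). Hypothetical model.
Feature = Tuple[str, Optional[str]]


def hybrid_group(
    features: List[Feature],
) -> Tuple[Dict[str, List[Feature]], List[List[Feature]]]:
    """Return (groups keyed by locus_tag, order-based groups of the rest)."""
    tagged: Dict[str, List[Feature]] = {}
    remaining: List[Feature] = []
    # Pass 1: group every feature that carries a locus_tag.
    for feature in features:
        locus_tag = feature[1]
        if locus_tag is not None:
            tagged.setdefault(locus_tag, []).append(feature)
        else:
            remaining.append(feature)
    # Pass 2: group the leftover features by order, one group per gene feature.
    # Simplification: a leading non-gene feature starts its own group here,
    # where a real parser might instead infer a gene for it.
    sorted_groups: List[List[Feature]] = []
    for feature in remaining:
        if feature[0] == "gene" or not sorted_groups:
            sorted_groups.append([feature])
        else:
            sorted_groups[-1].append(feature)
    return tagged, sorted_groups


features = [
    ("gene", "A"),
    ("CDS", "A"),
    ("gene", None),
    ("mRNA", None),
    ("CDS", None),
    ("gene", "B"),
    ("gene", None),
    ("tRNA", None),
]
tagged, sorted_groups = hybrid_group(features)
```

This approach suits files in which only some features carry a locus_tag, since tagged features are grouped reliably and the rest fall back to file order.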