biocantor.io.genbank.parser

Parse GenBank files. Biopython provides the core parsing functionality, but it cannot produce a hierarchical model. This module builds that hierarchy by relying on the ordering of features in the GenBank file.

There are two ways to infer hierarchy from GenBank files, neither of which is followed by every file.

The first (Model 1A) is sort order, in which features always appear as:

gene -> {mRNA, tRNA, rRNA} -> CDS (for coding genes only)

Each transcript feature can repeat. Each mRNA feature must be followed by a CDS feature. The presence of a new gene feature is the divider between genes.

In some genomes (often prokaryotic), there is no transcript-level feature for coding genes. That is, the file goes directly from gene -> CDS. This is Model 1B.
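The two sort-order models can be illustrated with a small check on the feature types of a single gene block. This is a minimal sketch that works on plain type strings rather than Biopython objects; the function name classify_gene_block is hypothetical and is not part of this module.

```python
from typing import List, Optional

NON_CODING_TRANSCRIPTS = {"tRNA", "rRNA"}


def classify_gene_block(types: List[str]) -> Optional[str]:
    """Return "1A", "1B", or None for the feature types of one gene block.

    Hypothetical helper for illustration only; it is not part of the parser.
    """
    if not types or types[0] != "gene":
        return None
    children = types[1:]
    if not children:
        return None
    # Model 1B: the gene is followed directly by CDS features only.
    if all(t == "CDS" for t in children):
        return "1B"
    # Model 1A: transcripts may repeat, and each mRNA must be followed by a CDS.
    i = 0
    while i < len(children):
        t = children[i]
        if t == "mRNA":
            if i + 1 >= len(children) or children[i + 1] != "CDS":
                return None
            i += 2
        elif t in NON_CODING_TRANSCRIPTS:
            i += 1
        else:
            return None
    return "1A"
```

For example, ["gene", "mRNA", "CDS"] is classified as Model 1A, and ["gene", "CDS"] as Model 1B.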

The second way that a GenBank file can be grouped is via the locus_tag qualifiers. The default parsing mode (Hybrid) uses this method preferentially and falls back to sort order for features that lack a usable locus tag.

The generic parsing function that converts the Biopython results to BioCantor data models is implemented in GeneFeature.to_gene_model(). This function can be overridden to provide custom parsing implementations.
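One way to start a custom parsing implementation is to wrap the default function instead of replacing it. The sketch below assumes only that a parse function takes one gene object and returns a dict. The name recording_parse_func is hypothetical.

```python
from typing import Any, Callable, Dict, List


def recording_parse_func(
    default_parse_func: Callable[[Any], Dict[str, Any]],
    seen: List[Dict[str, Any]],
) -> Callable[[Any], Dict[str, Any]]:
    """Wrap a parse function so every model it returns is also recorded.

    Hypothetical helper: the returned dict is passed through unchanged, so
    the wrapper only adds a place to inspect or post-process the models.
    """

    def parse(gene: Any) -> Dict[str, Any]:
        model = default_parse_func(gene)
        seen.append(model)
        return model

    return parse
```

A wrapper built this way around GeneFeature.to_gene_model could then be passed as the gene_parse_func argument of parse_genbank().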

Module Contents

Classes

Feature

Generic feature.

FeatureIntervalGenBankCollection

A collection of generic (non-transcribed) feature intervals.

GeneFeature

A gene.

TranscriptFeature

A transcript

CDSFeature

A CDS interval

GroupedGeneFeatures

Container class for a grouping of gene-like SeqFeatures on their associated SeqRecord.

BaseGenBankParser

Base class for GenBank parsing.

SortedGenBankParser

The Sorted GenBank parser relies entirely on increasing genomic position to partition features into genes or feature groups.

LocusTagGenBankParser

The LocusTag parser expects that every gene feature in the GenBank file contains a /locus_tag qualifier.

HybridGenBankParser

The Hybrid parsing mode combines both LocusTag and Sorted parsing. LocusTag is preferentially used, and features that lack a locus tag or have duplicate tags are sent to the Sorted parser.

Functions

parse_genbank(...)

This is the main GenBank parsing function. The parse function implemented in GeneFeature can be overridden to provide a custom implementation.

class biocantor.io.genbank.parser.Feature(feature: Bio.SeqFeature.SeqFeature, record: Bio.SeqRecord.SeqRecord)

Bases: abc.ABC

Generic feature.

property type: str
property strand: int
property start: int
types
class biocantor.io.genbank.parser.FeatureIntervalGenBankCollection(features: List[Bio.SeqFeature.SeqFeature], record: Bio.SeqRecord.SeqRecord)

A collection of generic (non-transcribed) feature intervals.

property start: int
static to_feature_model(cls: FeatureIntervalGenBankCollection) Dict[str, Any]

Convert to a Dict representation of a biocantor.gene.collections.FeatureIntervalCollection that can be used for analyses.

This is the default function, which can be overridden by specific implementations.

Looks for identifiers in the hierarchy defined by the Enum biocantor.io.genbank.constants.FeatureIntervalIdentifierKeys.

The feature collection produced is named after the locus tag if one is provided. Otherwise, by definition of the parser, the collection contains only one feature, so the first name found is chosen.

class biocantor.io.genbank.parser.GeneFeature(feature: Bio.SeqFeature.SeqFeature, record: Bio.SeqRecord.SeqRecord)

Bases: Feature

A gene.

property type: str
property has_children: bool
types
__str__()

Return str(self).

__repr__()

Return repr(self).

static from_transcript_or_cds_feature(feature: Bio.SeqFeature.SeqFeature, seqrecord: Bio.SeqRecord.SeqRecord) GeneFeature

Some GenBank files lack a gene-level feature, but have transcript-level features or CDS-level features only.

Construct a GeneFeature from such records.

add_child(feature: Bio.SeqFeature.SeqFeature, cds_feature: Optional[Bio.SeqFeature.SeqFeature] = None)

Add a new feature as a child. Infer Transcripts if this child is a CDS feature.

infer_child()

If this is an isolated gene feature, then construct a child transcript feature that is a copy

static to_gene_model(cls: GeneFeature) Dict[str, Any]

Convert to a Dict representation of a biocantor.gene.collections.GeneInterval that can be used for analyses.

This is the default function, which can be overridden by specific implementations.

Looks for /transcript_id, /protein_id, and /gene on the transcript level, and looks for /gene_id, /gene, and /locus_tag on the gene level.
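The qualifier lookup described above can be sketched as a first-match search over a list of keys. The sketch assumes that qualifiers follow the Biopython convention of mapping each key to a list of strings, and that the order in which the keys are listed is the priority order, which the text does not state explicitly. The helper name first_qualifier is hypothetical.

```python
from typing import Dict, List, Optional, Sequence

# Assumed priority orders, taken from the order the qualifiers are listed in.
TRANSCRIPT_KEYS = ("transcript_id", "protein_id", "gene")
GENE_KEYS = ("gene_id", "gene", "locus_tag")


def first_qualifier(
    qualifiers: Dict[str, List[str]], keys: Sequence[str]
) -> Optional[str]:
    """Return the first value of the first key that is present and non-empty."""
    for key in keys:
        values = qualifiers.get(key)
        if values:
            return values[0]
    return None
```

With qualifiers {"gene": ["thrA"], "locus_tag": ["b0002"]}, the gene-level lookup returns "thrA" because /gene is listed before /locus_tag.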

class biocantor.io.genbank.parser.TranscriptFeature(feature: Bio.SeqFeature.SeqFeature, record: Bio.SeqRecord.SeqRecord, cds_feature: Optional[Bio.SeqFeature.SeqFeature] = None)

Bases: Feature

A transcript

types
_exon_interval
_cds_interval
__str__()

Return str(self).

construct_frames(cds_interval: inscripta.biocantor.location.Location) List[str]

Build the frames for the CDS. GenBank files do not record frame information, so the frames are inferred on a best-effort basis.
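As an illustration of what inferring frames can look like, the sketch below uses one common convention: the frame of each coding block is the number of coding bases that precede it, modulo 3. This is an assumption made for the example, not a description of what construct_frames() does. The function frames_for_blocks is hypothetical and returns integers rather than the strings the method returns.

```python
from typing import List


def frames_for_blocks(block_lengths: List[int], starting_frame: int = 0) -> List[int]:
    """Compute a frame (0, 1, or 2) for each coding block.

    block_lengths must be given in translation order. The frame of a block
    is the count of coding bases before it, modulo 3 (assumed convention).
    """
    frames = []
    offset = starting_frame
    for length in block_lengths:
        frames.append(offset % 3)
        offset += length
    return frames
```

For coding blocks of lengths 4, 5, and 3, this yields frames 0, 1, and 0.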

get_qualifier_from_tx_or_cds_features(qualifier: str) Optional[str]

Get a specific qualifier, if it exists. The transcript feature is checked first, then its child features.

find_exon_interval() inscripta.biocantor.location.CompoundInterval

Finds the Location of the Exons.

find_transcript_interval() inscripta.biocantor.location.Location

Finds the Location that spans the full length of the Transcript

find_cds_interval() inscripta.biocantor.location.Location

Finds the Location of the CDS.

Handle edge cases from tools like Geneious where the CDS Interval exceeds the bounds of the Exons.

merge_cds_qualifiers_to_transcript() Dict[str, List[str]]

If there were distinct transcript-level features, the qualifiers on the CDS feature will be lost when converting to the BioCantor data model unless those qualifiers are rolled into the qualifiers on the transcript feature.
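Rolling CDS qualifiers into the transcript qualifiers can be sketched as a merge of two Dict[str, List[str]] mappings. The sketch assumes that transcript values come first and that duplicate values are dropped; the text does not specify how this method resolves either point. The function name merge_qualifiers is hypothetical.

```python
from typing import Dict, List


def merge_qualifiers(
    transcript: Dict[str, List[str]], cds: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Merge CDS qualifiers into a copy of the transcript qualifiers."""
    # Copy the lists so the inputs are never mutated.
    merged = {key: list(values) for key, values in transcript.items()}
    for key, values in cds.items():
        target = merged.setdefault(key, [])
        for value in values:
            # Assumed behavior: keep each distinct value once.
            if value not in target:
                target.append(value)
    return merged
```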

class biocantor.io.genbank.parser.CDSFeature(feature: Bio.SeqFeature.SeqFeature, record: Bio.SeqRecord.SeqRecord)

Bases: Feature

A CDS interval

types
__str__()

Return str(self).

class biocantor.io.genbank.parser.GroupedGeneFeatures

Container class for a grouping of gene-like SeqFeatures on their associated SeqRecord.

This class is used by implementations of the BaseGenBankParser to store groupings of features that are considered to be part of a gene unit together with the SeqRecord they came from.

Due to the various flavors of GenBank files out there, any of the gene, transcript or CDS features might not be present. The downstream usage of this class will infer the missing feature types.

seqrecord :Bio.SeqRecord.SeqRecord
gene_feature :Optional[Bio.SeqFeature.SeqFeature]
transcript_features :Optional[List[Bio.SeqFeature.SeqFeature]]
cds_features :Optional[List[Bio.SeqFeature.SeqFeature]]
class biocantor.io.genbank.parser.BaseGenBankParser(seq_records: List[Bio.SeqRecord.SeqRecord], parsed_variants: Dict[str, List[inscripta.biocantor.io.vcf.parser.VariantIntervalCollectionModel]], gene_parse_func: Callable[[GeneFeature], Dict[str, Any]], feature_parse_func: Callable[[FeatureIntervalGenBankCollection], Dict[str, Any]])

Bases: abc.ABC

Base class for GenBank parsing.

property num_genes: int
property num_feature_collections
genbank_parser_type :inscripta.biocantor.io.genbank.constants.GenBankParserType
abstract parse() Iterator[inscripta.biocantor.io.parser.ParsedAnnotationRecord]

Parse features

static validate_seqfeature(feature: Bio.SeqFeature.SeqFeature) bool

Perform validation checks on a SeqFeature. If there are issues with the feature, warnings are raised and this function returns False.

static _construct_gene_from_feature(feature: Bio.SeqFeature.SeqFeature, seqrecord: Bio.SeqRecord.SeqRecord) Optional[GeneFeature]

Convenience function for deciding which function to use when converting a feature to a gene

static _sort_features_by_position_and_type(features: List[Bio.SeqFeature.SeqFeature]) List[Bio.SeqFeature.SeqFeature]

Sort features first by position, then by the expected order of features for a coding gene. As a result, all non-coding transcripts end up at the end of the group.
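A two-level sort of this kind can be expressed with a tuple key. The sketch below represents each feature as a (start, type) tuple and assumes the expected coding order is gene, mRNA, CDS, as in Model 1A. It is an illustration and not the code of this method.

```python
from typing import List, Tuple

# Assumed rank of each coding feature type; anything else sorts after these.
TYPE_RANK = {"gene": 0, "mRNA": 1, "CDS": 2}


def sort_by_position_and_type(
    features: List[Tuple[int, str]]
) -> List[Tuple[int, str]]:
    """Sort (start, type) tuples by start, then by coding-gene type order."""
    return sorted(
        features,
        key=lambda f: (f[0], TYPE_RANK.get(f[1], len(TYPE_RANK))),
    )
```

Features that share a start position come out as gene, mRNA, CDS, followed by any non-coding transcript types.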

static _group_sorted_features_by_type(features: List[Bio.SeqFeature.SeqFeature]) Iterator[List[Bio.SeqFeature.SeqFeature]]

Iterator that groups sorted features, starting a new group each time a gene feature appears.

This function identifies groups of canonical genes, as well as interspersed non-coding genes, while also trying to identify isolated CDS records.

As an example, take this set of ordered features:

gene
mRNA
CDS
tRNA
rRNA
gene
CDS

This would be grouped as:

[gene, mRNA, CDS]
[tRNA]
[rRNA]
[gene, CDS]

However, sometimes non-coding genes still have the gene feature. Take this example:

gene
mRNA
CDS
gene
ncRNA
gene
CDS

This would be grouped as:

[gene, mRNA, CDS]
[gene, ncRNA]
[gene, CDS]

Now consider this problematic ordering, where a non-coding object interrupts a coding object:

gene
tRNA
CDS

This is ambiguous – is the gene meant to go with the tRNA, or the CDS? This would be grouped as:

[gene, tRNA]
[CDS]

A new gene object will then be inferred for the CDS, regardless of whether the gene was supposed to go with it.

This is an example of why the Sorted parser is inherently more fragile than the LocusTag parser.
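The three groupings above can be reproduced with a short generator over feature type strings. This is a simplified sketch written to match those examples only. The set of non-coding types and the rule for closing a group are assumptions, and the real method operates on SeqFeature objects.

```python
from typing import Iterator, List

# Assumed set of non-coding transcript types for this illustration.
NON_CODING = {"tRNA", "rRNA", "ncRNA"}


def group_sorted_types(types: List[str]) -> Iterator[List[str]]:
    """Yield groups of feature types from a position-sorted list of types."""
    current: List[str] = []
    for t in types:
        if t == "gene":
            # A gene feature always closes the open group and starts a new one.
            if current:
                yield current
            current = ["gene"]
        elif t in NON_CODING:
            if current == ["gene"]:
                # A childless gene claims the non-coding transcript.
                yield current + [t]
            else:
                if current:
                    yield current
                yield [t]
            current = []
        elif current and current[0] in ("gene", "mRNA"):
            # mRNA and CDS features join the open coding group.
            current.append(t)
        else:
            # Isolated mRNA or CDS: start a group without a gene feature.
            if current:
                yield current
            current = [t]
    if current:
        yield current
```

Applied to ["gene", "tRNA", "CDS"], the generator yields ["gene", "tRNA"] and then ["CDS"], which is the ambiguous case described above.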

_group_features_by_position(features: List[Bio.SeqFeature.SeqFeature], seqrecord: Bio.SeqRecord.SeqRecord, idx: int)

Group features by position. Since this function is always called in seqrecord order, self.grouped_gene_features is incremented here.

_group_features_by_locus_tag(features: List[Bio.SeqFeature.SeqFeature], seqrecord: Bio.SeqRecord.SeqRecord, idx: int)

Group features by locus tag. Since this function is always called in seqrecord order, self.grouped_gene_features is incremented here.

_parse_features()

Extract all generic features from a SeqRecord. These are anything that did not qualify as a gene, based on the feature type being one of the known members of biocantor.io.genbank.constants.GenBankFeatures.

Feature collections are inferred through the locus_tag field. Any items without such a tag are treated separately.
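Grouping generic features by locus tag can be sketched as follows. Each feature is represented as a plain dict with a "qualifiers" mapping in the Biopython style (key to list of strings). Placing each untagged feature in its own single-item group is an assumption about what "treated separately" means. The function name group_by_locus_tag is hypothetical.

```python
from typing import Any, Dict, List, Tuple

Feature = Dict[str, Any]


def group_by_locus_tag(
    features: List[Feature],
) -> Tuple[Dict[str, List[Feature]], List[List[Feature]]]:
    """Split features into locus-tag groups and single-feature groups."""
    tagged: Dict[str, List[Feature]] = {}
    untagged: List[List[Feature]] = []
    for feature in features:
        tags = feature.get("qualifiers", {}).get("locus_tag")
        if tags:
            tagged.setdefault(tags[0], []).append(feature)
        else:
            # Assumed behavior: an untagged feature forms its own group.
            untagged.append([feature])
    return tagged, untagged
```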

_convert_seqfeature_to_gene(grouped_gene_features: GroupedGeneFeatures, seqrecord: Bio.SeqRecord.SeqRecord) GeneFeature
_convert_seqfeatures_to_genes()

After the gene-like features have been grouped by either position, locus tag, or both, the groups are evaluated and converted into GeneFeature objects.

Gene-level objects with no children (transcripts or CDSes) are skipped.

_export_annotation_collections() Iterator[inscripta.biocantor.io.parser.ParsedAnnotationRecord]

The final step of GenBank parsing is exporting the annotations as ParsedAnnotationRecord.

These objects contain both the annotations as an AnnotationCollectionModel and the associated SeqRecord, which allows for construction of an AnnotationCollection with sequence information.

class biocantor.io.genbank.parser.SortedGenBankParser(seq_records: List[Bio.SeqRecord.SeqRecord], parsed_variants: Dict[str, List[inscripta.biocantor.io.vcf.parser.VariantIntervalCollectionModel]], gene_parse_func: Callable[[GeneFeature], Dict[str, Any]], feature_parse_func: Callable[[FeatureIntervalGenBankCollection], Dict[str, Any]])

Bases: BaseGenBankParser

The Sorted GenBank parser relies entirely on increasing genomic position to partition features into genes or feature groups. This is inherently challenging because of issues like overlapping genes or multiple isoforms.

genbank_parser_type
parse() Iterator[inscripta.biocantor.io.parser.ParsedAnnotationRecord]

Parse features

_extract_seqfeatures_from_seqrecords()
_group_gene_features_by_position()

Sorted parser groups features using BaseGenBankParser._group_features_by_position().

class biocantor.io.genbank.parser.LocusTagGenBankParser(seq_records: List[Bio.SeqRecord.SeqRecord], parsed_variants: Dict[str, List[inscripta.biocantor.io.vcf.parser.VariantIntervalCollectionModel]], gene_parse_func: Callable[[GeneFeature], Dict[str, Any]], feature_parse_func: Callable[[FeatureIntervalGenBankCollection], Dict[str, Any]])

Bases: BaseGenBankParser

The LocusTag parser expects that every gene feature in the GenBank file contains a /locus_tag qualifier.

Gene-type features without a locus tag qualifier are ignored, and an exception is raised if multiple gene features have the same locus tag.
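The two rules above (skip untagged genes, fail on duplicates) can be sketched with a dictionary index. Gene features are represented here by their qualifier dicts, and ValueError stands in for whatever exception type the parser actually raises, which the text does not name. The function name is hypothetical.

```python
from typing import Dict, List

Qualifiers = Dict[str, List[str]]


def index_genes_by_locus_tag(gene_qualifiers: List[Qualifiers]) -> Dict[str, Qualifiers]:
    """Index gene qualifier dicts by locus tag.

    Genes without a locus tag are ignored. A repeated locus tag raises
    ValueError (a stand-in for the parser's own exception type).
    """
    index: Dict[str, Qualifiers] = {}
    for qualifiers in gene_qualifiers:
        tags = qualifiers.get("locus_tag")
        if not tags:
            continue
        tag = tags[0]
        if tag in index:
            raise ValueError(f"Duplicate locus tag: {tag}")
        index[tag] = qualifiers
    return index
```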

genbank_parser_type
parse() Iterator[inscripta.biocantor.io.parser.ParsedAnnotationRecord]

Parse features

_extract_seqfeatures_from_seqrecords()
_group_gene_features_by_locus_tag()

Locus tag parser groups features using BaseGenBankParser._group_features_by_locus_tag().

class biocantor.io.genbank.parser.HybridGenBankParser(seq_records: List[Bio.SeqRecord.SeqRecord], parsed_variants: Dict[str, List[inscripta.biocantor.io.vcf.parser.VariantIntervalCollectionModel]], gene_parse_func: Callable[[GeneFeature], Dict[str, Any]], feature_parse_func: Callable[[FeatureIntervalGenBankCollection], Dict[str, Any]])

Bases: LocusTagGenBankParser, SortedGenBankParser

The Hybrid parsing mode combines both LocusTag and Sorted parsing. LocusTag is preferentially used; features that either lack a locus tag or have duplicate tags are sent to the Sorted parser.

genbank_parser_type
parse() Iterator[inscripta.biocantor.io.parser.ParsedAnnotationRecord]

Parse features

_extract_seqfeatures_from_seqrecords()

Hybrid parser partitions all features by locus tag, if they exist. Any features without a locus tag or with duplicate tags will be sent to the Sorted parser.

_identify_locus_tag_collisions()

Identify duplicated locus tags on gene features across all of the SeqRecords. Raise a warning for each one, then reassign it to the Sorted parser.

Also sorts the objects in gene_filtered_features_without_locus_tag to prepare for sorted parsing.
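The partitioning step can be sketched by counting locus tags first and then routing each gene. Gene features are again represented by their qualifier dicts, the function name partition_for_hybrid is hypothetical, and the final sorting of the fallback list is omitted.

```python
import warnings
from collections import Counter
from typing import Dict, List, Tuple

Qualifiers = Dict[str, List[str]]


def partition_for_hybrid(
    gene_qualifiers: List[Qualifiers],
) -> Tuple[List[Qualifiers], List[Qualifiers]]:
    """Split genes into (locus-tag parsed, position parsed) lists."""
    counts = Counter(
        q["locus_tag"][0] for q in gene_qualifiers if q.get("locus_tag")
    )
    # Warn once for every locus tag that appears on more than one gene.
    for tag, count in counts.items():
        if count > 1:
            warnings.warn(f"Locus tag {tag} is duplicated; using sorted parsing")
    by_locus_tag: List[Qualifiers] = []
    by_position: List[Qualifiers] = []
    for q in gene_qualifiers:
        tags = q.get("locus_tag")
        if tags and counts[tags[0]] == 1:
            by_locus_tag.append(q)
        else:
            # Missing or duplicated tags fall back to position-based grouping.
            by_position.append(q)
    return by_locus_tag, by_position
```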

_group_gene_features_by_locus_tag_and_position()

Hybrid parser groups genes by both position and locus tag, after they were partitioned by HybridGenBankParser._extract_seqfeatures_from_seqrecords().

biocantor.io.genbank.parser.parse_genbank(genbank_handle_or_path: Union[TextIO, str, pathlib.Path], variant_handle_or_path: Optional[Union[TextIO, str, pathlib.Path]] = None, parsed_variants: Optional[Dict[str, List[inscripta.biocantor.io.vcf.parser.VariantIntervalCollectionModel]]] = None, gene_parse_func: Callable[[GeneFeature], Dict[str, Any]] = GeneFeature.to_gene_model, feature_parse_func: Callable[[FeatureIntervalGenBankCollection], Dict[str, Any]] = FeatureIntervalGenBankCollection.to_feature_model, gbk_type: inscripta.biocantor.io.genbank.constants.GenBankParserType = GenBankParserType.HYBRID, allow_duplicate_sequence_identifiers: bool = False) Iterator[inscripta.biocantor.io.parser.ParsedAnnotationRecord]

This is the main GenBank parsing function. The parse function implemented in GeneFeature can be overridden to provide a custom implementation.

Parameters
  • genbank_handle_or_path – An open GenBank file or a path to a locally stored GenBank file.

  • variant_handle_or_path – Optional open handle or path to a VCF file. Mutually exclusive with parsed_variants.

  • parsed_variants – Optional parsed variants. Mutually exclusive with variant_handle_or_path.

  • gene_parse_func – Optional gene parse function implementation. Defaults to GeneFeature.to_gene_model() implemented in this module.

  • feature_parse_func – Optional feature interval parse function implementation. Defaults to FeatureIntervalGenBankCollection.to_feature_model() implemented in this module.

  • gbk_type – Use Hybrid, Sorted or LocusTag based parsing? Defaults to Hybrid.

  • allow_duplicate_sequence_identifiers – If False, the parser raises an exception when the same sequence identifier is seen twice. Defaults to False.

Yields

ParsedAnnotationRecord.
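A minimal usage sketch, relying only on the signature documented above: parse_genbank() accepts a path and yields ParsedAnnotationRecord objects. The import path inscripta.biocantor.io.genbank.parser is an assumption based on the type annotations in this page, and the helper name count_parsed_records is hypothetical.

```python
from pathlib import Path
from typing import Union


def count_parsed_records(genbank_path: Union[str, Path]) -> int:
    """Count the ParsedAnnotationRecord objects yielded for one GenBank file."""
    # Imported lazily so this sketch can be loaded without BioCantor installed.
    # The package path is an assumption; adjust it to match the installation.
    from inscripta.biocantor.io.genbank.parser import parse_genbank

    # Default arguments are used, so parsing runs in Hybrid mode.
    return sum(1 for _ in parse_genbank(Path(genbank_path)))
```

Because parse_genbank() returns an iterator, records are consumed one at a time here and are not held in memory together.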