biocantor.io.genbank.parser
Parse GenBank files. Biopython provides the core parsing functionality, but it cannot produce a hierarchical model. This module builds that hierarchy itself, by relying on the ordering of features in the GenBank file or on their locus_tag qualifiers.
There are two ways to infer hierarchy from GenBank files, and neither convention is followed by every file.
The first (Model 1A) is sort order, in which features always appear as
gene -> {mRNA, tRNA, rRNA} -> CDS (for coding genes only)
Each transcript feature can repeat. Each mRNA feature must be followed by a CDS feature. The presence of a new gene feature is the divider between genes.
In some genomes (often prokaryotic), there is no transcript-level feature for coding genes. That is, the order goes directly from gene -> CDS. This is Model 1B.
The second way that a GenBank file can be grouped is via the locus_tag qualifiers. This is the preferred method in this parsing module: the default Hybrid parser uses locus tags first and falls back to sort order.
The generic parsing function that converts the BioPython results into BioCantor data models is implemented in GeneFeature.to_gene_model(). This function can be overridden to provide custom parsing implementations.
Module Contents
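Classes
Feature | Generic feature.
FeatureIntervalGenBankCollection | A collection of generic (non-transcribed) feature intervals.
GeneFeature | A gene.
TranscriptFeature | A transcript.
CDSFeature | A CDS interval.
GroupedGeneFeatures | Container class for a grouping of gene-like SeqFeatures on their associated SeqRecord.
BaseGenBankParser | Base class for GenBank parsing.
SortedGenBankParser | The Sorted GenBank parser relies entirely on increasing genomic position to partition features into genes or feature groups.
LocusTagGenBankParser | The LocusTag parser expects that every gene feature in the GenBank file contains a /locus_tag qualifier.
HybridGenBankParser | The Hybrid parsing mode combines both LocusTag and Sorted parsing. LocusTag is preferentially used; features that lack a locus tag or have duplicate tags are sent to the Sorted parser.
Functions
parse_genbank | This is the main GenBank parsing function. The parse function implemented in GeneFeature can be overridden to provide a custom implementation.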
- class biocantor.io.genbank.parser.Feature(feature: Bio.SeqFeature.SeqFeature, record: Bio.SeqRecord.SeqRecord)
Bases:
abc.ABC
Generic feature.
- types
- class biocantor.io.genbank.parser.FeatureIntervalGenBankCollection(features: List[Bio.SeqFeature.SeqFeature], record: Bio.SeqRecord.SeqRecord)
A collection of generic (non-transcribed) feature intervals.
- static to_feature_model(cls: FeatureIntervalGenBankCollection) Dict[str, Any]
Convert to a Dict representation of a biocantor.gene.collections.FeatureIntervalCollection that can be used for analyses. This is the default function, which can be overridden by specific implementations.
Looks for identifiers in the hierarchy defined by the Enum biocantor.io.genbank.constants.FeatureIntervalIdentifierKeys. The feature collection produced is named after the locus tag if one is provided. Otherwise, by definition of the parser, the collection contains only one feature, so the first name found is chosen.
- class biocantor.io.genbank.parser.GeneFeature(feature: Bio.SeqFeature.SeqFeature, record: Bio.SeqRecord.SeqRecord)
Bases:
Feature
A gene.
- types
- __str__()
Return str(self).
- __repr__()
Return repr(self).
- static from_transcript_or_cds_feature(feature: Bio.SeqFeature.SeqFeature, seqrecord: Bio.SeqRecord.SeqRecord) GeneFeature
Some GenBank files lack a gene-level feature, but have transcript-level features or CDS-level features only.
Construct a GeneFeature from such records.
- add_child(feature: Bio.SeqFeature.SeqFeature, cds_feature: Optional[Bio.SeqFeature.SeqFeature] = None)
Add a new feature as a child. Infer Transcripts if this child is a CDS feature.
- infer_child()
If this is an isolated gene feature, then construct a child transcript feature that is a copy
- static to_gene_model(cls: GeneFeature) Dict[str, Any]
Convert to a Dict representation of a biocantor.gene.collections.GeneInterval that can be used for analyses. This is the default function, which can be overridden by specific implementations.
Looks for /transcript_id, /protein_id, and /gene on the transcript level, and looks for /gene_id, /gene, and /locus_tag on the gene level.
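The qualifier lookup described above can be illustrated with a minimal sketch. It assumes qualifiers are stored as a dict mapping qualifier names to lists of strings, as Biopython does for SeqFeature.qualifiers. The helper name first_qualifier, the two key tuples, and the order of precedence (first key listed wins) are assumptions made for illustration; they are not taken from the BioCantor implementation.

```python
from typing import Dict, List, Optional, Sequence

# Assumed precedence: keys are tried in the order the documentation lists them.
GENE_KEYS = ("gene_id", "gene", "locus_tag")
TRANSCRIPT_KEYS = ("transcript_id", "protein_id", "gene")


def first_qualifier(qualifiers: Dict[str, List[str]], keys: Sequence[str]) -> Optional[str]:
    """Return the first value of the first key that is present and non-empty."""
    for key in keys:
        values = qualifiers.get(key)
        if values:
            return values[0]
    return None


gene_qualifiers = {"gene": ["abc"], "locus_tag": ["X1"]}
print(first_qualifier(gene_qualifiers, GENE_KEYS))
```

A custom gene_parse_func could apply a different key order in the same way, which is the kind of customization that overriding to_gene_model() allows.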
- class biocantor.io.genbank.parser.TranscriptFeature(feature: Bio.SeqFeature.SeqFeature, record: Bio.SeqRecord.SeqRecord, cds_feature: Optional[Bio.SeqFeature.SeqFeature] = None)
Bases:
Feature
A transcript
- types
- _exon_interval
- _cds_interval
- __str__()
Return str(self).
- construct_frames(cds_interval: inscripta.biocantor.location.Location) List[str]
Build the frames for the CDS. GenBank files do not carry frame information, so this is a best-effort inference.
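One common way to infer frames when a file does not record them is to accumulate coding length across CDS blocks. The sketch below shows that general technique under stated assumptions: blocks are half-open (start, end) pairs given in 5' to 3' order, and the CDS begins in frame 0. The function name frames_for_cds_blocks is hypothetical, and it returns integers, whereas construct_frames() is documented as returning List[str]. It is not a reproduction of the BioCantor method.

```python
from typing import List, Tuple


def frames_for_cds_blocks(blocks: List[Tuple[int, int]]) -> List[int]:
    """Return the frame (0, 1 or 2) of the first base of each CDS block.

    Blocks are half-open (start, end) pairs listed in 5' to 3' order, and the
    first block is assumed to start at the first base of a codon.
    """
    frames = []
    coding_bases_so_far = 0
    for start, end in blocks:
        # The frame is the position within a codon of the block's first base.
        frames.append(coding_bases_so_far % 3)
        coding_bases_so_far += end - start
    return frames


print(frames_for_cds_blocks([(0, 4), (10, 15), (20, 23)]))  # [0, 1, 0]
```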
- get_qualifier_from_tx_or_cds_features(qualifier: str) Optional[str]
Get a specific qualifier, if it exists. Look at the transcript first, then at its children.
- find_exon_interval() inscripta.biocantor.location.CompoundInterval
Finds the Location of the Exons.
- find_transcript_interval() inscripta.biocantor.location.Location
Finds the Location that spans the full length of the Transcript
- find_cds_interval() inscripta.biocantor.location.Location
Finds the Location of the CDS.
Handle edge cases from tools like Geneious where the CDS Interval exceeds the bounds of the Exons.
- class biocantor.io.genbank.parser.CDSFeature(feature: Bio.SeqFeature.SeqFeature, record: Bio.SeqRecord.SeqRecord)
Bases:
Feature
A CDS interval
- types
- __str__()
Return str(self).
- class biocantor.io.genbank.parser.GroupedGeneFeatures
Container class for a grouping of gene-like SeqFeatures on their associated SeqRecord.
This class is used by implementations of the BaseGenBankParser to store groupings of features that are considered to be part of a gene unit, together with the SeqRecord they came from. Due to the various flavors of GenBank files out there, any of the gene, transcript or CDS features might not be present. The downstream usage of this class will infer the missing feature types.
- seqrecord :Bio.SeqRecord.SeqRecord
- gene_feature :Optional[Bio.SeqFeature.SeqFeature]
- transcript_features :Optional[List[Bio.SeqFeature.SeqFeature]]
- cds_features :Optional[List[Bio.SeqFeature.SeqFeature]]
- class biocantor.io.genbank.parser.BaseGenBankParser(seq_records: List[Bio.SeqRecord.SeqRecord], parsed_variants: Dict[str, List[inscripta.biocantor.io.vcf.parser.VariantIntervalCollectionModel]], gene_parse_func: Callable[[GeneFeature], Dict[str, Any]], feature_parse_func: Callable[[FeatureIntervalGenBankCollection], Dict[str, Any]])
Bases:
abc.ABC
Base class for GenBank parsing.
- property num_feature_collections
- genbank_parser_type :inscripta.biocantor.io.genbank.constants.GenBankParserType
- abstract parse() Iterator[inscripta.biocantor.io.parser.ParsedAnnotationRecord]
Parse features
- static validate_seqfeature(feature: Bio.SeqFeature.SeqFeature) bool
Perform validation checks on a SeqFeature. If there are issues with the feature, warnings are raised and this function returns False.
- static _construct_gene_from_feature(feature: Bio.SeqFeature.SeqFeature, seqrecord: Bio.SeqRecord.SeqRecord) Optional[GeneFeature]
Convenience function for deciding which function to use when converting a feature to a gene
- static _sort_features_by_position_and_type(features: List[Bio.SeqFeature.SeqFeature]) List[Bio.SeqFeature.SeqFeature]
Sort features first by position, then by the expected order of features for a coding gene. As a result, all non-coding transcripts end up at the end of the group.
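A two-part sort key is one way to get this ordering. In the sketch below, SimpleFeature, CODING_ORDER and sort_by_position_and_type are hypothetical stand-ins for illustration; the real method operates on Bio.SeqFeature.SeqFeature objects, and its exact key is not shown in this documentation.

```python
from typing import List, NamedTuple


class SimpleFeature(NamedTuple):
    """Hypothetical stand-in for a SeqFeature: a type and a start position."""
    type: str
    start: int


# Assumed expected order for a coding gene; any other type sorts after these.
CODING_ORDER = {"gene": 0, "mRNA": 1, "CDS": 2}


def sort_by_position_and_type(features: List[SimpleFeature]) -> List[SimpleFeature]:
    """Sort by start position, then by coding-gene order within a position."""
    return sorted(
        features,
        key=lambda f: (f.start, CODING_ORDER.get(f.type, len(CODING_ORDER))),
    )


unsorted = [
    SimpleFeature("CDS", 100),
    SimpleFeature("tRNA", 100),
    SimpleFeature("gene", 100),
    SimpleFeature("mRNA", 100),
    SimpleFeature("gene", 50),
]
for feature in sort_by_position_and_type(unsorted):
    print(feature.start, feature.type)
```

Because unknown types share the largest rank, non-coding transcripts that start at the same position as a coding gene land after its gene, mRNA and CDS features.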
- static _group_sorted_features_by_type(features: List[Bio.SeqFeature.SeqFeature]) Iterator[List[Bio.SeqFeature.SeqFeature]]
Iterator for grouping sorted features, starting a new group each time a gene feature appears. This function identifies groups of canonical genes, as well as interspersed non-coding genes, while also trying to identify isolated CDS records.
As an example, take this set of ordered features:
gene mRNA CDS tRNA rRNA gene CDS
This would be grouped as:
[gene, mRNA, CDS] [tRNA] [rRNA] [gene, CDS]
However, sometimes non-coding genes still have the gene feature. Take this example:
gene mRNA CDS gene ncRNA gene CDS
This would be grouped as:
[gene, mRNA, CDS] [gene, ncRNA] [gene, CDS]
Now consider this problematic ordering, where a non-coding object interrupts a coding object:
gene tRNA CDS
This is ambiguous – is the gene meant to go with the tRNA, or the CDS? This would be grouped as:
[gene, tRNA] [CDS]
And there will be a new gene object inferred for the CDS, regardless of whether the gene was supposed to go with it or not. This is an example of why the Sorted parser is inherently more fragile than the LocusTag parser.
- _group_features_by_position(features: List[Bio.SeqFeature.SeqFeature], seqrecord: Bio.SeqRecord.SeqRecord, idx: int)
Group features by position. Since this function is always called in seqrecord order, self.grouped_gene_features is incremented here.
- _group_features_by_locus_tag(features: List[Bio.SeqFeature.SeqFeature], seqrecord: Bio.SeqRecord.SeqRecord, idx: int)
Group features by locus tag. Since this function is always called in seqrecord order, self.grouped_gene_features is incremented here.
- _parse_features()
Extract all generic features from a SeqRecord. These are anything that did not qualify as a gene, based on the feature type being one of the known members of biocantor.io.genbank.constants.GenBankFeatures. Feature collections are inferred through the locus_tag field. Any items without such a tag are treated separately.
- _convert_seqfeature_to_gene(grouped_gene_features: GroupedGeneFeatures, seqrecord: Bio.SeqRecord.SeqRecord) GeneFeature
- _convert_seqfeatures_to_genes()
After the gene-like features have been grouped by either position, locus tag, or both, the groups are evaluated and converted into GeneFeature objects. Gene-level objects with no children (transcripts or CDSes) are skipped.
- _export_annotation_collections() Iterator[inscripta.biocantor.io.parser.ParsedAnnotationRecord]
The final step of GenBank parsing is exporting the annotations as ParsedAnnotationRecord objects. These objects contain both the annotations, as an AnnotationCollectionModel, and the associated SeqRecord, which allows for construction of an AnnotationCollection with sequence information.
- class biocantor.io.genbank.parser.SortedGenBankParser(seq_records: List[Bio.SeqRecord.SeqRecord], parsed_variants: Dict[str, List[inscripta.biocantor.io.vcf.parser.VariantIntervalCollectionModel]], gene_parse_func: Callable[[GeneFeature], Dict[str, Any]], feature_parse_func: Callable[[FeatureIntervalGenBankCollection], Dict[str, Any]])
Bases:
BaseGenBankParser
The Sorted GenBank parser relies entirely on increasing genomic position to partition features into genes or feature groups. This is inherently challenging because of issues like overlapping genes or multiple isoforms.
- genbank_parser_type
- parse() Iterator[inscripta.biocantor.io.parser.ParsedAnnotationRecord]
Parse features
- _extract_seqfeatures_from_seqrecords()
- _group_gene_features_by_position()
The Sorted parser groups features using BaseGenBankParser._group_features_by_position().
- class biocantor.io.genbank.parser.LocusTagGenBankParser(seq_records: List[Bio.SeqRecord.SeqRecord], parsed_variants: Dict[str, List[inscripta.biocantor.io.vcf.parser.VariantIntervalCollectionModel]], gene_parse_func: Callable[[GeneFeature], Dict[str, Any]], feature_parse_func: Callable[[FeatureIntervalGenBankCollection], Dict[str, Any]])
Bases:
BaseGenBankParser
The LocusTag parser expects that every gene feature in the GenBank file contains a /locus_tag qualifier.
Gene-type features without a locus tag qualifier are ignored, and an exception is raised if multiple gene features have the same locus tag.
- genbank_parser_type
- parse() Iterator[inscripta.biocantor.io.parser.ParsedAnnotationRecord]
Parse features
- _extract_seqfeatures_from_seqrecords()
- _group_gene_features_by_locus_tag()
The LocusTag parser groups features using BaseGenBankParser._group_features_by_locus_tag().
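The LocusTag behavior described for this class (features without a tag are ignored, duplicate gene tags raise an exception) can be sketched with plain tuples. The function name group_by_locus_tag, the (feature type, locus tag) tuple layout, and the choice of ValueError are assumptions for illustration. The exception type that BioCantor raises is not stated in this documentation.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for a SeqFeature: (feature type, locus_tag or None).
TaggedFeature = Tuple[str, Optional[str]]


def group_by_locus_tag(features: List[TaggedFeature]) -> Dict[str, List[str]]:
    """Group feature types by locus tag, rejecting duplicated gene tags."""
    seen_gene_tags = set()
    for feature_type, tag in features:
        if feature_type == "gene" and tag is not None:
            if tag in seen_gene_tags:
                raise ValueError(f"Duplicate locus tag on gene features: {tag}")
            seen_gene_tags.add(tag)

    groups: Dict[str, List[str]] = {}
    for feature_type, tag in features:
        if tag is None:
            # Features without a locus tag are ignored by this mode.
            continue
        groups.setdefault(tag, []).append(feature_type)
    return groups


features = [
    ("gene", "A"), ("mRNA", "A"), ("CDS", "A"),
    ("gene", None),
    ("gene", "B"), ("CDS", "B"),
]
print(group_by_locus_tag(features))
# {'A': ['gene', 'mRNA', 'CDS'], 'B': ['gene', 'CDS']}
```

Unlike the sorted approach, the grouping here does not depend on feature order at all, only on the tags.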
- class biocantor.io.genbank.parser.HybridGenBankParser(seq_records: List[Bio.SeqRecord.SeqRecord], parsed_variants: Dict[str, List[inscripta.biocantor.io.vcf.parser.VariantIntervalCollectionModel]], gene_parse_func: Callable[[GeneFeature], Dict[str, Any]], feature_parse_func: Callable[[FeatureIntervalGenBankCollection], Dict[str, Any]])
Bases:
LocusTagGenBankParser, SortedGenBankParser
The Hybrid parsing mode combines both LocusTag and Sorted parsing. LocusTag is preferentially used; features that either lack a locus tag or have duplicate tags are sent to the Sorted parser.
- genbank_parser_type
- parse() Iterator[inscripta.biocantor.io.parser.ParsedAnnotationRecord]
Parse features
- _extract_seqfeatures_from_seqrecords()
The Hybrid parser partitions all features by locus tag, if one exists. Any features without a locus tag or with duplicate tags are sent to the Sorted parser.
- _identify_locus_tag_collisions()
Identify duplicated locus tags on gene features across all of the SeqRecords. Raise a warning for each one, then reassign it to the Sorted parser.
Also sorts the objects in gene_filtered_features_without_locus_tag to prepare for sorted parsing.
- _group_gene_features_by_locus_tag_and_position()
The Hybrid parser groups genes by both position and locus tag, after they were partitioned by HybridGenBankParser._extract_seqfeatures_from_seqrecords().
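The partitioning step of the Hybrid mode can be sketched as follows: count locus tags on gene features, warn on collisions, and route untagged or collided features to a position-sorted list. SimpleFeature and partition_for_hybrid are hypothetical names, and the sketch uses simple tuples in place of SeqFeature objects; it illustrates the described behavior rather than the library's code.

```python
import warnings
from collections import Counter
from typing import Dict, List, NamedTuple, Optional, Tuple


class SimpleFeature(NamedTuple):
    """Hypothetical stand-in for a SeqFeature."""
    type: str
    start: int
    locus_tag: Optional[str]


def partition_for_hybrid(
    features: List[SimpleFeature],
) -> Tuple[Dict[str, List[SimpleFeature]], List[SimpleFeature]]:
    """Split features into locus-tag groups and a list for sorted parsing."""
    gene_tag_counts = Counter(
        f.locus_tag for f in features if f.type == "gene" and f.locus_tag is not None
    )
    collided = {tag for tag, count in gene_tag_counts.items() if count > 1}
    for tag in sorted(collided):
        warnings.warn(f"Locus tag {tag} is duplicated; using sorted parsing for it")

    by_locus_tag: Dict[str, List[SimpleFeature]] = {}
    for_sorted_parser: List[SimpleFeature] = []
    for feature in features:
        if feature.locus_tag is None or feature.locus_tag in collided:
            for_sorted_parser.append(feature)
        else:
            by_locus_tag.setdefault(feature.locus_tag, []).append(feature)

    # The fallback list is ordered by position to prepare for sorted parsing.
    for_sorted_parser.sort(key=lambda f: f.start)
    return by_locus_tag, for_sorted_parser
```

Features that share a collided tag are all sent to the fallback list, so the sorted logic sees the gene features together with their children.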
- biocantor.io.genbank.parser.parse_genbank(genbank_handle_or_path: Union[TextIO, str, pathlib.Path], variant_handle_or_path: Optional[Union[TextIO, str, pathlib.Path]] = None, parsed_variants: Optional[Dict[str, List[inscripta.biocantor.io.vcf.parser.VariantIntervalCollectionModel]]] = None, gene_parse_func: Callable[[GeneFeature], Dict[str, Any]] = GeneFeature.to_gene_model, feature_parse_func: Callable[[FeatureIntervalGenBankCollection], Dict[str, Any]] = FeatureIntervalGenBankCollection.to_feature_model, gbk_type: inscripta.biocantor.io.genbank.constants.GenBankParserType = GenBankParserType.HYBRID, allow_duplicate_sequence_identifiers: bool = False) Iterator[inscripta.biocantor.io.parser.ParsedAnnotationRecord]
This is the main GenBank parsing function. The parse function implemented in GeneFeature can be overridden to provide a custom implementation.
- Parameters
genbank_handle_or_path – An open GenBank file or a path to a locally stored GenBank file.
variant_handle_or_path – Optional open handle or path to a VCF file. Mutually exclusive with parsed_variants.
parsed_variants – Optional parsed variants. Mutually exclusive with variant_handle_or_path.
gene_parse_func – Optional gene parse function implementation. Defaults to GeneFeature.to_gene_model() implemented in this module.
feature_parse_func – Optional feature interval parse function implementation. Defaults to FeatureIntervalGenBankCollection.to_feature_model() implemented in this module.
gbk_type – Use Hybrid, Sorted or LocusTag based parsing? Defaults to Hybrid.
allow_duplicate_sequence_identifiers – If False, an exception is raised when the same sequence identifier is seen twice. Defaults to False.
- Yields
ParsedAnnotationRecord.
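A minimal usage sketch follows, assuming BioCantor is installed and importable as inscripta.biocantor (the module path used in the type annotations above). The wrapper name load_parsed_records is hypothetical. The import is done inside the function so that the sketch can be defined even where the package is absent.

```python
from pathlib import Path
from typing import Any, List, Union


def load_parsed_records(genbank_path: Union[str, Path]) -> List[Any]:
    """Parse a GenBank file with the default settings and return all records."""
    # Assumption: the package is importable under this name.
    from inscripta.biocantor.io.genbank.parser import parse_genbank

    with open(genbank_path, "r") as handle:
        # parse_genbank is documented as yielding records, so the iterator is
        # consumed here, while the file handle is still open.
        return list(parse_genbank(handle))
```

Passing a custom callable as gene_parse_func or feature_parse_func replaces the default conversion for every gene or feature collection in the file.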