biocantor.io.genbank.writer

Writes GenBank-formatted files from one or more AnnotationCollection objects.

These models must have sequences to retrieve.

There are two flavors of GenBank supported here:

1. Prokaryotic – each gene has a direct descendant, which is its intervals. In other words, all annotations come in pairs where you have gene followed by [CDS, tRNA, rRNA, ...].

2. Eukaryotic – a more standard gene model, where the top level is always called gene, then there is a child [mRNA, tRNA, ...], and when the child is mRNA, there are also CDS features.

Module Contents

Functions

collection_to_genbank(collections, ...[, ...])

Take an instantiated AnnotationCollection and produce a GenBank file.

gene_to_feature(→ Iterable[Bio.SeqFeature.SeqFeature])

Converts either a GeneInterval or a FeatureIntervalCollection to a Bio.SeqFeature.SeqFeature.

transcripts_to_feature(...)

Converts a TranscriptInterval to a Bio.SeqFeature.SeqFeature.

add_cds_feature(→ Bio.SeqFeature.SeqFeature)

Converts a TranscriptInterval that has a CDS to a Bio.SeqFeature.SeqFeature that represents the spliced CDS interval.

feature_intervals_to_features(...)

Converts a FeatureInterval to a Bio.SeqFeature.SeqFeature.

biocantor.io.genbank.writer.collection_to_genbank(collections: List[inscripta.biocantor.gene.AnnotationCollection], genbank_file_handle_or_path: Union[TextIO, str, pathlib.Path], genbank_type: Optional[inscripta.biocantor.io.genbank.constants.GenbankFlavor] = GenbankFlavor.PROKARYOTIC, force_strand: Optional[bool] = True, organism: Optional[str] = None, source: Optional[str] = None, seqrecord_annotations: Optional[List[Dict[str, Any]]] = None, update_translations: bool = False)

Take an instantiated AnnotationCollection and produce a GenBank file.

Parameters
  • collections – Iterable of AnnotationCollections. They must have sequences associated with them.

  • genbank_file_handle_or_path – Open file handle or path to write GenBank file to.

  • genbank_type – Are we writing a prokaryotic or eukaryotic style GenBank file?

  • force_strand – Boolean flag; if True, the strand is forced on children; if False, children with improper strands are skipped instead.

  • organism – What string to put in the ORGANISM field? If not set, will be a period.

  • source – What string to put in the SOURCE field? If not set, will be the basename of the GenBank path.

  • seqrecord_annotations – A list of dictionaries of arbitrary annotations to include, one dictionary per collection; the list must be the same length as collections. If organism or source is set both in this function call and in a dictionary, they will be overwritten.

  • update_translations – Should the /translation tag be calculated or re-calculated? This is a time-consuming process.
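The documented fallbacks for organism, source, and seqrecord_annotations can be illustrated with a small standalone sketch. It does not call BioCantor: resolve_header_fields is a hypothetical helper written for illustration only, and it assumes a path rather than an open file handle was supplied.

```python
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, Union


def resolve_header_fields(
    genbank_path: Union[str, Path],
    n_collections: int,
    organism: Optional[str] = None,
    source: Optional[str] = None,
    seqrecord_annotations: Optional[List[Dict[str, Any]]] = None,
) -> Tuple[str, str]:
    """Hypothetical helper mirroring the documented defaults; not BioCantor code."""
    # seqrecord_annotations is documented as one dictionary per collection.
    if seqrecord_annotations is not None and len(seqrecord_annotations) != n_collections:
        raise ValueError("seqrecord_annotations must have one entry per collection")
    # ORGANISM falls back to a period when not set.
    resolved_organism = organism if organism is not None else "."
    # SOURCE falls back to the basename of the GenBank path when not set.
    resolved_source = source if source is not None else Path(genbank_path).name
    return resolved_organism, resolved_source
```

For example, resolve_header_fields("/tmp/out.gbk", 1) gives a period for ORGANISM and "out.gbk" for SOURCE.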

biocantor.io.genbank.writer.gene_to_feature(gene_or_feature: Union[inscripta.biocantor.gene.GeneInterval, inscripta.biocantor.gene.FeatureIntervalCollection], genbank_type: inscripta.biocantor.io.genbank.constants.GenbankFlavor, force_strand: bool, translation_table: inscripta.biocantor.gene.TranslationTable, update_translations: bool) → Iterable[Bio.SeqFeature.SeqFeature]

Converts either a GeneInterval or a FeatureIntervalCollection to a Bio.SeqFeature.SeqFeature.

Bio.SeqFeature.SeqFeature are BioPython objects that will then be used to write to a GenBank file. There is one Bio.SeqFeature.SeqFeature for every feature, or row group, in the output file. There will be one contiguous interval at the Gene level.

While GeneInterval always has its interval on the plus strand, GenBank files assume that a Gene has an explicit strand. Therefore, this function picks the most common strand and forces it on all of its children.

Parameters
  • gene_or_feature – A GeneInterval or FeatureIntervalCollection.

  • genbank_type – Are we writing a prokaryotic or eukaryotic style GenBank file?

  • force_strand – Boolean flag; if True, the strand is forced on children; if False, children with improper strands are skipped instead.

  • translation_table – Translation table to use.

  • update_translations – Should the /translation tag be calculated or re-calculated? This is a time-consuming process.

Yields
``SeqFeature``s, one for the gene, one for each child transcript, and one for each transcript’s CDS if it exists.
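The strand selection described for gene_to_feature (pick the most common strand among the children) can be sketched with the standard library. This is an illustration, not BioCantor's implementation: the Strand enum below is a stand-in defined only so the example runs alone, most_common_strand is a hypothetical name, and the tie-breaking rule is an assumption.

```python
from collections import Counter
from enum import Enum
from typing import Iterable


class Strand(Enum):
    """Stand-in for BioCantor's Strand, redefined here only for illustration."""

    PLUS = 1
    MINUS = -1


def most_common_strand(child_strands: Iterable[Strand]) -> Strand:
    """Return the strand shared by the largest number of children."""
    counts = Counter(child_strands)
    if not counts:
        raise ValueError("a gene needs at least one child to pick a strand")
    # On a tie, the strand seen first wins; this is an assumption of the
    # sketch, not documented BioCantor behavior.
    return counts.most_common(1)[0][0]
```

A gene with children on plus, minus, and minus strands would be written on the minus strand under this rule.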

biocantor.io.genbank.writer.transcripts_to_feature(transcripts: List[inscripta.biocantor.gene.TranscriptInterval], strand: inscripta.biocantor.location.strand.Strand, genbank_type: inscripta.biocantor.io.genbank.constants.GenbankFlavor, force_strand: bool, translation_table: inscripta.biocantor.gene.TranslationTable, gene_symbol: Optional[str] = None, locus_tag: Optional[str] = None, update_translations: bool = False) → Iterable[Bio.SeqFeature.SeqFeature]

Converts a TranscriptInterval to a Bio.SeqFeature.SeqFeature.

Bio.SeqFeature.SeqFeature are BioPython objects that will then be used to write to a GenBank file. There is one Bio.SeqFeature.SeqFeature for every feature, or row group, in the output file. There will be one joined interval at the transcript level representing the exonic structure.

While transcript members of a gene can have different strands, for GenBank files that is not allowed. This function will explicitly force the strand and provide a warning that this is happening.

In eukaryotic mode, this function will create mRNA features for coding genes, and biotype features for non-coding. Coding genes are then passed on to create CDS features.

In prokaryotic mode, this function will only create biotype features for non-coding genes.

Parameters
  • transcripts – A list of TranscriptInterval.

  • strand – Strand that this gene lives on.

  • genbank_type – Are we writing a prokaryotic or eukaryotic style GenBank file?

  • force_strand – Boolean flag; if True, the strand is forced; if False, improper strands are skipped instead.

  • gene_symbol – An optional gene symbol.

  • locus_tag – An optional locus tag.

  • translation_table – Translation table to use.

  • update_translations – Should the /translation tag be calculated or re-calculated? This is a time-consuming process.

Yields

``SeqFeature``s, one for each transcript and then one for each CDS of the transcript, if it exists.
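The mode-dependent choice of feature keys described for transcripts_to_feature can be summarized in a short sketch. It is illustrative only: GenbankFlavor is redefined locally so the example is self-contained, and child_feature_keys is a hypothetical helper, not part of BioCantor.

```python
from enum import Enum
from typing import List


class GenbankFlavor(Enum):
    """Stand-in for BioCantor's GenbankFlavor, redefined so the sketch runs alone."""

    PROKARYOTIC = 1
    EUKARYOTIC = 2


def child_feature_keys(flavor: GenbankFlavor, is_coding: bool, biotype: str) -> List[str]:
    """Feature keys written below the gene feature for one transcript."""
    if is_coding:
        # Eukaryotic: an mRNA feature followed by its CDS.
        # Prokaryotic: the CDS sits directly under the gene.
        return ["mRNA", "CDS"] if flavor is GenbankFlavor.EUKARYOTIC else ["CDS"]
    # Non-coding transcripts get a single feature named after their biotype.
    return [biotype]
```

Under these assumptions a coding transcript yields mRNA and CDS in eukaryotic mode but only CDS in prokaryotic mode, while a tRNA transcript yields a single tRNA feature in either mode.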

biocantor.io.genbank.writer.add_cds_feature(transcript: inscripta.biocantor.gene.TranscriptInterval, transcript_qualifiers: Dict[Hashable, List[Hashable]], strand: inscripta.biocantor.location.strand.Strand, translation_table: inscripta.biocantor.gene.TranslationTable, update_translations: bool) → Bio.SeqFeature.SeqFeature

Converts a TranscriptInterval that has a CDS to a Bio.SeqFeature.SeqFeature that represents the spliced CDS interval.

Parameters
  • transcript – A TranscriptInterval.

  • strand – Strand that this transcript lives on.

  • transcript_qualifiers – Qualifiers dictionary from the transcript level feature.

  • translation_table – Translation table to use.

  • update_translations – Should the /translation tag be calculated or re-calculated? This is a time-consuming process.

Returns

SeqFeature for the CDS of this transcript.
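To make "spliced CDS interval" concrete, the sketch below clips a transcript's exons to the CDS boundaries and renders the result as a GenBank location string. It is a standalone illustration rather than what add_cds_feature does internally: spliced_cds_location is a hypothetical helper, and the input coordinates are assumed to be 0-based and half-open.

```python
from typing import List, Tuple


def spliced_cds_location(
    exons: List[Tuple[int, int]],
    cds_start: int,
    cds_end: int,
    minus_strand: bool = False,
) -> str:
    """Build a GenBank location string for the coding part of a transcript.

    Inputs are 0-based, half-open (an assumption of this sketch); GenBank
    locations are 1-based and inclusive.
    """
    blocks = []
    for start, end in sorted(exons):
        # Clip each exon to the CDS; exons that are entirely UTR drop out.
        clipped_start, clipped_end = max(start, cds_start), min(end, cds_end)
        if clipped_start >= clipped_end:
            continue
        if clipped_end - clipped_start == 1:
            # A single base is written as one number.
            blocks.append(str(clipped_end))
        else:
            blocks.append(f"{clipped_start + 1}..{clipped_end}")
    if not blocks:
        raise ValueError("CDS does not overlap any exon")
    location = blocks[0] if len(blocks) == 1 else "join(" + ",".join(blocks) + ")"
    return f"complement({location})" if minus_strand else location
```

With exons (0, 10), (20, 30), (40, 50) and a CDS spanning 5 to 45, the helper returns join(6..10,21..30,41..45); on the minus strand the same string is wrapped in complement(...).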

biocantor.io.genbank.writer.feature_intervals_to_features(features: List[inscripta.biocantor.gene.FeatureInterval], strand: inscripta.biocantor.location.strand.Strand, force_strand: bool, feature_name: Optional[str] = None, locus_tag: Optional[str] = None) → Iterable[Bio.SeqFeature.SeqFeature]

Converts a FeatureInterval to a Bio.SeqFeature.SeqFeature.

Bio.SeqFeature.SeqFeature are BioPython objects that will then be used to write to a GenBank file. There is one Bio.SeqFeature.SeqFeature for every feature, or row group, in the output file. There will be one joined interval at the transcript level representing the exonic structure.

While transcript members of a gene can have different strands, for GenBank files that is not allowed. This function will explicitly force the strand and provide a warning that this is happening.

Parameters
  • features – A list of FeatureInterval.

  • strand – Strand that this gene lives on.

  • force_strand – Boolean flag; if True, the strand is forced; if False, improper strands are skipped instead.

  • feature_name – An optional feature name.

  • locus_tag – An optional locus tag.

Yields

A ``SeqFeature`` for each feature.
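The force_strand behavior (force a mismatched strand with a warning, or skip the child) can be sketched as a small generator. This is an assumption-laden illustration: apply_strand_policy is a hypothetical name, children are modeled as plain (name, strand) tuples rather than FeatureInterval objects, and the warning text is invented.

```python
import warnings
from typing import Iterable, Iterator, Tuple


def apply_strand_policy(
    children: Iterable[Tuple[str, str]],
    gene_strand: str,
    force_strand: bool,
) -> Iterator[Tuple[str, str]]:
    """Yield (name, strand) pairs that agree with the gene's strand."""
    for name, strand in children:
        if strand == gene_strand:
            yield name, strand
        elif force_strand:
            # Mismatched child: override its strand and warn about it.
            warnings.warn(f"Forcing {name} from strand {strand} to {gene_strand}")
            yield name, gene_strand
        # With force_strand=False the mismatched child is skipped.
```

Given children on "+" and "-" under a "+" gene, force_strand=True yields both on "+" and emits one warning, while force_strand=False yields only the child already on "+".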