biocantor.io.ncbi.tbl_writer

Write BioCantor data models to the NCBI .tbl format.

The .tbl format is used for NCBI genome submission, and can be validated with the tool tbl2asn.

Module Contents

Classes

TblFeature

Models one feature in a tbl file.

GeneTblFeature

A gene feature should have a single interval only. It also should have a fairly limited set of qualifiers.

MRNATblFeature

An mRNA feature.

CDSTblFeature

A CDS feature.

NcRNATblFeature

A more specific ncRNA feature. Has an ncrna_class that must also be determined.

MiscRNATblFeature

A generic ncRNA feature. Also provides a shared constructor for all non-coding transcripts.

TRNATblFeature

A tRNA feature. Applies tRNA-Xxx to products if they are not correct.

RRNATblFeature

An rRNA feature. rRNA features must have a product, so these are

TblGene

Container class that holds a gene and its descendant features.

Functions

random_uppercase_str(→ str)

Generates a random uppercase string of size size.

collection_to_tbl(collections, tbl_file_handle[, ...])

Take an iterable of AnnotationCollection and produce a TBL file.

biocantor.io.ncbi.tbl_writer.random_uppercase_str(size=10) → str

Generates a random uppercase string of size size.

Parameters
  • size – Size of string to produce.

Returns

A random string of size size.
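As a rough illustration of the behaviour described above, the following sketch draws characters from A-Z with an optionally seeded generator. It is not the library's implementation: the function name random_uppercase_sketch and its seed parameter are hypothetical and exist only for this example.

```python
import random
import string
from typing import Optional


def random_uppercase_sketch(size: int = 10, seed: Optional[int] = None) -> str:
    """Return `size` random characters drawn from A-Z (illustrative only)."""
    # A private generator keeps the example independent of global random state;
    # seed=None gives a different string on each run.
    rng = random.Random(seed)
    return "".join(rng.choices(string.ascii_uppercase, k=size))


label = random_uppercase_sketch(10, seed=42)
same_label = random_uppercase_sketch(10, seed=42)
print(label)
```

Passing the same seed twice yields the same string, which is the property a seeded run relies on.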

class biocantor.io.ncbi.tbl_writer.TblFeature(location: inscripta.biocantor.location.Location, start_is_incomplete: bool, end_is_incomplete: bool, is_pseudo: bool, qualifiers: Dict[str, List[str]], children: Optional[List[TblFeature]] = None)

Bases: abc.ABC

Models one feature in a tbl file.

tbl is a funky format with five tab delimited columns, separated with FASTA-like headers. It is probably best to show an example:

>Feature gb|CM021127.1|
<14406  14026   gene
                        gene    TDA8
                        locus_tag       GI527_G0000001
                        gene_synonym    YAL064C-A
                        gene_synonym    YAL065C-A
                        db_xref GeneID:851234
<14406  14393   mRNA
14390   14382
14380   14026
                        product Tda8p
                        note    R64_transcript_id: NM_001180041.1
                        exception       low-quality sequence region
                        protein_id      gnl|WGS:JAAEAL|T0000001_1_prot
                        transcript_id   gnl|WGS:JAAEAL|T0000001
                        gene    TDA8
                        locus_tag       GI527_G0000001
                        gene_synonym    NM_001180041.1
                        gene_synonym    YAL065C-A
                        gene_synonym    YAL064C-A
                        db_xref GeneID:851234

The rows that define intervals have the first 3 tab-delimited columns populated, and the last 2 are empty. Conversely, rows that define qualifiers have the first 3 tab-delimited columns empty, and the last 2 contain the key-value pairs.

The < symbol on the first mRNA row above marks the region as incomplete at that end. An end must be marked incomplete if the start codon or the stop codon is invalid; which end is affected depends on the direction of translation.

The example gene here is on the negative strand, which is signified by each interval starting with a larger number than it ends with. Positions are 1-based and inclusive. Another way to think of the intervals is that they are always written 5’ to 3’.
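The coordinate conventions above can be sketched in a few lines. This is a minimal illustration, not the code behind _location_to_str: it assumes the input blocks are 0-based, half-open (start, end) pairs, and the function name and the five_prime_partial / three_prime_partial parameters are hypothetical. The > prefix for an incomplete 3' end follows the NCBI feature table convention and does not appear in the example above.

```python
from typing import List, Tuple


def blocks_to_tbl_rows(
    blocks: List[Tuple[int, int]],
    strand: str,
    feature_type: str,
    five_prime_partial: bool = False,
    three_prime_partial: bool = False,
) -> List[str]:
    """Render 0-based, half-open (start, end) blocks as tbl interval rows."""
    # 0-based half-open -> 1-based inclusive: only the start shifts by one.
    one_based = [(start + 1, end) for start, end in sorted(blocks)]
    if strand == "-":
        # Minus strand: write blocks 5' to 3', so reverse the block order
        # and put the larger coordinate first in each row.
        one_based = [(end, start) for start, end in reversed(one_based)]
    rows = []
    last = len(one_based) - 1
    for i, (five_prime, three_prime) in enumerate(one_based):
        left, right = str(five_prime), str(three_prime)
        if i == 0 and five_prime_partial:
            left = "<" + left
        if i == last and three_prime_partial:
            right = ">" + right
        # Only the first row of a feature names the feature type in column 3.
        columns = [left, right, feature_type] if i == 0 else [left, right]
        rows.append("\t".join(columns))
    return rows


# The three exons of the mRNA in the example above, as 0-based half-open blocks.
mrna_rows = blocks_to_tbl_rows(
    [(14025, 14380), (14381, 14390), (14392, 14406)],
    strand="-",
    feature_type="mRNA",
    five_prime_partial=True,
)
print("\n".join(mrna_rows))
```

Run on the exon blocks of the example transcript, the sketch reproduces the three interval rows shown for the mRNA feature.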

VALID_KEYS :Optional[Set]
FEATURE_TYPE :Optional[Union[inscripta.biocantor.io.genbank.constants.TranscriptFeatures, inscripta.biocantor.io.genbank.constants.GeneFeatures, inscripta.biocantor.io.genbank.constants.GeneIntervalFeatures]]
children :Optional[List[TblFeature]]
chars_to_remove
__str__()

Return str(self).

__iter__()
iter_children()
_location_to_str() → str

Extract location blocks. Handle strand.

_qualifiers_to_str() → str

Converts a qualifiers dictionary to TBL representation.

Qualifiers are encoded as key-value pairs in the 4th and 5th columns of TBL rows, after the rows that represent the genomic interval of the feature. Keys with multiple values are represented by repeating the key on another line.

NCBI does not like parentheses, brackets or semicolons in the values of a qualifier, so these are removed.

pseudo is a special qualifier key with no value, that marks the feature as being a pseudogene. This turns off many of the classifiers that tbl2asn applies to the gene, and so should be used carefully.
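The qualifier encoding described above might look roughly like the following. This is a hedged sketch rather than the actual _qualifiers_to_str method: the function name, the CHARS_TO_REMOVE table, and the exact set of stripped characters are assumptions made for illustration.

```python
from typing import Dict, List

# Assumed from the prose above: parentheses, square brackets and semicolons
# are stripped from qualifier values.
CHARS_TO_REMOVE = str.maketrans("", "", "()[];")


def qualifiers_to_tbl_rows(
    qualifiers: Dict[str, List[str]], is_pseudo: bool = False
) -> List[str]:
    """Render a qualifiers dictionary as tbl qualifier rows."""
    rows = []
    for key, values in qualifiers.items():
        # A key with several values is repeated, one row per value.
        for value in values:
            cleaned = value.translate(CHARS_TO_REMOVE)
            # Columns 1-3 are empty, so each row starts with three tabs.
            rows.append(f"\t\t\t{key}\t{cleaned}")
    if is_pseudo:
        # pseudo is a key with no value in column 5.
        rows.append("\t\t\tpseudo")
    return rows


qualifier_rows = qualifiers_to_tbl_rows(
    {
        "gene": ["TDA8"],
        "gene_synonym": ["YAL064C-A", "YAL065C-A"],
        "note": ["a (b); c"],
    },
    is_pseudo=True,
)
print("\n".join(qualifier_rows))
```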

static extract_dbxref_synonyms(parsed_qualifiers: Dict[Hashable, Set[str]], tbl_qualifiers: Dict[str, List[str]], gene_symbol: Optional[str] = None)

Update tbl_qualifiers with values from parsed_qualifiers if they are xrefs or synonyms.

class biocantor.io.ncbi.tbl_writer.GeneTblFeature(gene: inscripta.biocantor.gene.GeneInterval, locus_tag)

Bases: TblFeature

A gene feature should have a single interval only. It also should have a fairly limited set of qualifiers.

The BioCantor model has GeneIntervals always on the + strand to account for the possibility of mixed-strand children. This is not possible under the NCBI model, so we override the constructor of a generic table feature to set the strand.

The signifiers for starts and ends being incomplete are not applied at the gene level.

FEATURE_TYPE
VALID_KEYS
class biocantor.io.ncbi.tbl_writer.MRNATblFeature(transcript: inscripta.biocantor.gene.transcript.TranscriptInterval, cds_feature: CDSTblFeature)

Bases: TblFeature

An mRNA feature.

FEATURE_TYPE
VALID_KEYS
class biocantor.io.ncbi.tbl_writer.CDSTblFeature(transcript: inscripta.biocantor.gene.transcript.TranscriptInterval, gene_feature: GeneTblFeature, submitter_lab_name: str, translation_table: inscripta.biocantor.gene.codon.TranslationTable)

Bases: TblFeature

A CDS feature.

FEATURE_TYPE
VALID_KEYS
class biocantor.io.ncbi.tbl_writer.NcRNATblFeature(transcript: inscripta.biocantor.gene.transcript.TranscriptInterval, gene_feature: GeneTblFeature)

Bases: TblFeature

A more specific ncRNA feature. Has an ncrna_class that must also be determined.

FEATURE_TYPE
VALID_KEYS
class biocantor.io.ncbi.tbl_writer.MiscRNATblFeature(transcript: inscripta.biocantor.gene.transcript.TranscriptInterval, gene_feature: GeneTblFeature)

Bases: TblFeature

A generic ncRNA feature. Also provides a shared constructor for all non-coding transcripts.

FEATURE_TYPE
VALID_KEYS
class biocantor.io.ncbi.tbl_writer.TRNATblFeature(transcript: inscripta.biocantor.gene.transcript.TranscriptInterval, gene_feature: GeneTblFeature)

Bases: TblFeature

A tRNA feature. Applies tRNA-Xxx to products if they are not correct.

FEATURE_TYPE
VALID_KEYS
class biocantor.io.ncbi.tbl_writer.RRNATblFeature(transcript: inscripta.biocantor.gene.transcript.TranscriptInterval, gene_feature: GeneTblFeature)

Bases: TblFeature

An rRNA feature. rRNA features must have a product, so these are

FEATURE_TYPE
VALID_KEYS
class biocantor.io.ncbi.tbl_writer.TblGene(gene: inscripta.biocantor.gene.GeneInterval, submitter_lab_name: str, locus_tag: Optional[str] = None, translation_table: Optional[inscripta.biocantor.gene.codon.TranslationTable] = TranslationTable.DEFAULT)

Container class that holds a gene and its descendant features.

__iter__()
biocantor.io.ncbi.tbl_writer.collection_to_tbl(collections: Iterable[inscripta.biocantor.gene.collections.AnnotationCollection], tbl_file_handle: TextIO, translation_table: Optional[inscripta.biocantor.gene.codon.TranslationTable] = TranslationTable.DEFAULT, locus_tag_prefix: Optional[str] = None, genbank_flavor: Optional[inscripta.biocantor.io.genbank.constants.GenbankFlavor] = GenbankFlavor.EUKARYOTIC, locus_tag_jump_size: Optional[int] = 5, submitter_lab_name: Optional[str] = None, random_seed: Optional[int] = None)

Take an iterable of AnnotationCollection and produce a TBL file.

Parameters
  • collections – Iterable of AnnotationCollections. They must have sequences associated with them.

  • tbl_file_handle – Open text file handle to write the TBL file to.

  • translation_table – Translation table of this species. Used to determine if the ends of coding genes are considered complete or not.

  • locus_tag_prefix – Locus tag prefix. If not set, the locus tag field present in the annotation set will be used. An exception will be raised if there is no locus tag on any GeneInterval.

  • genbank_flavor – NCBI treats TBL files similarly to Genbank files. The primary distinction is that for prokaryotic genomes, the mRNA feature is not produced, and instead a CDS is a direct child of the gene feature.

  • locus_tag_jump_size – Step between consecutive locus tag numbers. NCBI prefers gaps between locus tags so that genes added in subsequent genome iterations can be numbered in sort order.

  • submitter_lab_name – Name to put in the dbname section of unique identifiers. This is intended to be a string that uniquely labels the submitter lab. If not set, will be a random string.

  • random_seed – A seed value for random string generation. Useful for reproducible runs of this function.
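To make the effect of locus_tag_jump_size concrete, here is a small sketch of a generator that numbers locus tags with a fixed step. It is only an illustration under stated assumptions: the function name, the zero-padded width, the starting number, and the PREFIX_NNNNN layout are all hypothetical and may differ from what collection_to_tbl actually writes.

```python
from itertools import count, islice
from typing import Iterator


def locus_tags(prefix: str, jump_size: int = 5, width: int = 5) -> Iterator[str]:
    """Yield locus tags whose numeric part advances by `jump_size`."""
    # Starting at jump_size (an assumption) leaves room before the first gene too.
    for number in count(start=jump_size, step=jump_size):
        yield f"{prefix}_{number:0{width}d}"


first_three = list(islice(locus_tags("ABC"), 3))
print(first_three)  # ['ABC_00005', 'ABC_00010', 'ABC_00015']
```

With a step of 5, up to four new genes can later be slotted between any two existing tags without breaking sort order.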