biocantor.io.ncbi.tbl_writer
Write BioCantor data models to the NCBI .tbl format.
The .tbl format is used for NCBI genome submission, and can be validated with the tool tbl2asn.
Module Contents
Classes
- TblFeature: Models one feature in a tbl file.
- GeneTblFeature: A gene feature should have a single interval only. It also should have a fairly limited set of qualifiers.
- MRNATblFeature: A mRNA feature.
- CDSTblFeature: A CDS feature.
- NcRNATblFeature: A more specific ncRNA feature. Has a ncrna_class that must also be discerned.
- MiscRNATblFeature: A generic ncRNA feature. Also provides a shared constructor for all non-coding transcripts.
- TRNATblFeature: A tRNA feature. Applies tRNA-Xxx to products if they are not correct.
- RRNATblFeature: A rRNA feature. rRNA features must have a product, so these are
- TblGene: Container class that holds a gene and its descendant features.
Functions
- random_uppercase_str: Generates a random uppercase string of size size.
- collection_to_tbl: Take an iterable of AnnotationCollection and produce a TBL file.
- biocantor.io.ncbi.tbl_writer.random_uppercase_str(size=10) -> str
Generates a random uppercase string of length size.
- Parameters
size – Size of string to produce.
- Returns
A random string of length size.
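As an illustration of what such a helper can look like, here is a minimal sketch built only on the standard library. It is not the BioCantor implementation: the name random_uppercase_sketch and the seed argument are assumptions added for the example (the documented function takes only size).

```python
import random
import string
from typing import Optional


def random_uppercase_sketch(size: int = 10, seed: Optional[int] = None) -> str:
    """Return `size` characters drawn at random from A-Z.

    Hypothetical stand-in for random_uppercase_str; `seed` is an
    assumption of this sketch, not a documented parameter.
    """
    # A private generator keeps the global random state untouched;
    # passing a seed makes the result repeatable.
    rng = random.Random(seed)
    return "".join(rng.choice(string.ascii_uppercase) for _ in range(size))
```

Calling the sketch twice with the same seed returns the same string, which is the property that the random_seed parameter of collection_to_tbl is described as providing.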
- class biocantor.io.ncbi.tbl_writer.TblFeature(location: inscripta.biocantor.location.Location, start_is_incomplete: bool, end_is_complete: bool, is_pseudo: bool, qualifiers: Dict[str, List[str]], children: Optional[List[TblFeature]] = None)
Bases:
abc.ABC
Models one feature in a tbl file.
tbl is a funky format with five tab delimited columns, separated with FASTA-like headers. It is probably best to show an example:
>Feature gb|CM021127.1|
<14406	14026	gene
			gene	TDA8
			locus_tag	GI527_G0000001
			gene_synonym	YAL064C-A
			gene_synonym	YAL065C-A
			db_xref	GeneID:851234
<14406	14393	mRNA
14390	14382
14380	14026
			product	Tda8p
			note	R64_transcript_id: NM_001180041.1
			exception	low-quality sequence region
			protein_id	gnl|WGS:JAAEAL|T0000001_1_prot
			transcript_id	gnl|WGS:JAAEAL|T0000001
			gene	TDA8
			locus_tag	GI527_G0000001
			gene_synonym	NM_001180041.1
			gene_synonym	YAL065C-A
			gene_synonym	YAL064C-A
			db_xref	GeneID:851234
Rows that define intervals populate the first 3 tab-delimited columns (start, end and feature type, with the feature type given only on the first interval row of a feature) and leave the last 2 empty. Rows that define qualifiers leave the first 3 tab-delimited columns empty, and the last 2 contain the key-value pair.
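The row layout described here can be checked mechanically. The following is a minimal sketch, not part of BioCantor: classify_tbl_line is a hypothetical helper that labels a single line of a tbl file by looking at which of the five columns are filled.

```python
def classify_tbl_line(line: str) -> str:
    """Label one tbl line as 'header', 'interval', 'qualifier' or 'unknown'.

    Hypothetical helper for illustration only.
    """
    line = line.rstrip("\n")
    if line.startswith(">Feature"):
        return "header"
    cols = line.split("\t")
    # Pad short rows so that all five columns can be inspected.
    cols += [""] * (5 - len(cols))
    if cols[0] and cols[1]:
        # Start and end are present; the feature type in column 3 is
        # only required on the first interval row of a feature.
        return "interval"
    if not any(cols[:3]) and cols[3]:
        # Columns 1-3 empty, column 4 holds the qualifier key.
        return "qualifier"
    return "unknown"
```

For example, the line holding 14390 and 14382 is classified as an interval even though it has no feature type, and a row with three leading tabs followed by a key and value is classified as a qualifier.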
The < symbol on the mRNA region above says that the region is incomplete. This must be set if the start codon is invalid or the stop codon is invalid, depending on which direction we are translating in.
The example gene here is on the negative strand, and this is signified with the intervals starting with a larger number than the end. The positions are 1 based inclusive. Another way to think of the intervals is that they are always 5’ to 3’.
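The coordinate convention can be sketched as a small conversion function. This is an illustration under stated assumptions, not the library's code: to_tbl_interval is a hypothetical name, the input is assumed to be a 0-based half-open interval, and the use of > for an incomplete 3' end follows the NCBI feature table convention rather than anything stated in this module's documentation.

```python
from typing import Tuple


def to_tbl_interval(
    start: int,
    end: int,
    strand: str,
    five_prime_incomplete: bool = False,
    three_prime_incomplete: bool = False,
) -> Tuple[str, str]:
    """Convert a 0-based half-open interval to the first two tbl columns.

    Hypothetical helper: returns (column 1, column 2) as strings.
    """
    # 0-based half-open [start, end) becomes 1-based inclusive.
    first, last = start + 1, end
    if strand == "-":
        # Intervals are written 5' to 3', so the larger number comes
        # first on the negative strand.
        first, last = last, first
    left = f"<{first}" if five_prime_incomplete else str(first)
    right = f">{last}" if three_prime_incomplete else str(last)
    return left, right
```

With these assumptions, to_tbl_interval(14025, 14406, "-", five_prime_incomplete=True) yields the pair <14406 and 14026, matching the gene row of the example.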
- VALID_KEYS :Optional[Set]
- FEATURE_TYPE :Optional[Union[inscripta.biocantor.io.genbank.constants.TranscriptFeatures, inscripta.biocantor.io.genbank.constants.GeneFeatures, inscripta.biocantor.io.genbank.constants.GeneIntervalFeatures]]
- children :Optional[List[TblFeature]]
- chars_to_remove
- __str__()
Return str(self).
- __iter__()
- iter_children()
- _qualifiers_to_str() str
Converts a qualifiers dictionary to TBL representation.
Qualifiers are encoded as key-value pairs in the 4th and 5th columns of TBL rows, after the rows that represent the genomic interval of the feature. Keys with multiple values are represented by repeating the key on another line.
NCBI does not like parentheses, brackets or semicolons in the values of a qualifier, so these are removed.
pseudo is a special qualifier key with no value that marks the feature as being a pseudogene. This turns off many of the classifiers that tbl2asn applies to the gene, and so should be used carefully.
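The encoding rules above can be sketched in a few lines. This is a minimal illustration and not the body of _qualifiers_to_str: the function name qualifiers_to_rows, the is_pseudo argument, and the reading of "brackets" as square brackets are all assumptions of the sketch.

```python
import re
from typing import Dict, List

# Characters stripped from qualifier values. Treating "brackets" as
# square brackets only is an assumption of this sketch.
_CHARS_TO_REMOVE = re.compile(r"[()\[\];]")


def qualifiers_to_rows(qualifiers: Dict[str, List[str]], is_pseudo: bool = False) -> str:
    """Render qualifiers as tbl rows: three empty columns, then key and value."""
    rows = []
    for key, values in qualifiers.items():
        # A key with several values is repeated, one row per value.
        for value in values:
            cleaned = _CHARS_TO_REMOVE.sub("", value)
            rows.append(f"\t\t\t{key}\t{cleaned}")
    if is_pseudo:
        # pseudo carries no value, so only the key column is written.
        rows.append("\t\t\tpseudo")
    return "\n".join(rows)
```

Given a gene_synonym key with two values, the sketch emits two gene_synonym rows, as in the example feature table.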
- class biocantor.io.ncbi.tbl_writer.GeneTblFeature(gene: inscripta.biocantor.gene.GeneInterval, locus_tag)
Bases:
TblFeature
A gene feature should have a single interval only. It also should have a fairly limited set of qualifiers.
The BioCantor model has GeneIntervals always on the + strand to account for the possibility of mixed-strand children. This is not possible under the NCBI model, so we override the constructor of a generic table feature to set the strand.
The signifiers for starts and ends being incomplete are not applied at the gene level.
- FEATURE_TYPE
- VALID_KEYS
- class biocantor.io.ncbi.tbl_writer.MRNATblFeature(transcript: inscripta.biocantor.gene.transcript.TranscriptInterval, cds_feature: CDSTblFeature)
Bases:
TblFeature
A mRNA feature.
- FEATURE_TYPE
- VALID_KEYS
- class biocantor.io.ncbi.tbl_writer.CDSTblFeature(transcript: inscripta.biocantor.gene.transcript.TranscriptInterval, gene_feature: GeneTblFeature, submitter_lab_name: str, translation_table: inscripta.biocantor.gene.codon.TranslationTable)
Bases:
TblFeature
A CDS feature.
- FEATURE_TYPE
- VALID_KEYS
- class biocantor.io.ncbi.tbl_writer.NcRNATblFeature(transcript: inscripta.biocantor.gene.transcript.TranscriptInterval, gene_feature: GeneTblFeature)
Bases:
TblFeature
A more specific ncRNA feature. Has a ncrna_class that must also be discerned.
- FEATURE_TYPE
- VALID_KEYS
- class biocantor.io.ncbi.tbl_writer.MiscRNATblFeature(transcript: inscripta.biocantor.gene.transcript.TranscriptInterval, gene_feature: GeneTblFeature)
Bases:
TblFeature
A generic ncRNA feature. Also provides a shared constructor for all non-coding transcripts.
- FEATURE_TYPE
- VALID_KEYS
- class biocantor.io.ncbi.tbl_writer.TRNATblFeature(transcript: inscripta.biocantor.gene.transcript.TranscriptInterval, gene_feature: GeneTblFeature)
Bases:
TblFeature
A tRNA feature. Applies tRNA-Xxx to products if they are not correct.
- FEATURE_TYPE
- VALID_KEYS
- class biocantor.io.ncbi.tbl_writer.RRNATblFeature(transcript: inscripta.biocantor.gene.transcript.TranscriptInterval, gene_feature: GeneTblFeature)
Bases:
TblFeature
A rRNA feature. rRNA features must have a product, so these are
- FEATURE_TYPE
- VALID_KEYS
- class biocantor.io.ncbi.tbl_writer.TblGene(gene: inscripta.biocantor.gene.GeneInterval, submitter_lab_name: str, locus_tag: Optional[str] = None, translation_table: Optional[inscripta.biocantor.gene.codon.TranslationTable] = TranslationTable.DEFAULT)
Container class that holds a gene and its descendant features.
- __iter__()
- biocantor.io.ncbi.tbl_writer.collection_to_tbl(collections: Iterable[inscripta.biocantor.gene.collections.AnnotationCollection], tbl_file_handle: TextIO, translation_table: Optional[inscripta.biocantor.gene.codon.TranslationTable] = TranslationTable.DEFAULT, locus_tag_prefix: Optional[str] = None, genbank_flavor: Optional[inscripta.biocantor.io.genbank.constants.GenbankFlavor] = GenbankFlavor.EUKARYOTIC, locus_tag_jump_size: Optional[int] = 5, submitter_lab_name: Optional[str] = None, random_seed: Optional[int] = None)
Take an iterable of AnnotationCollection and produce a TBL file.
- Parameters
collections – Iterable of AnnotationCollections. They must have sequences associated with them.
tbl_file_handle – Open text file handle to write the TBL file to.
translation_table – Translation table of this species. Used to determine if the ends of coding genes are considered complete or not.
locus_tag_prefix – Locus tag prefix. If not set, the locus tag field present in the annotation set will be used. An exception will be raised if there is no locus tag on any GeneInterval.
genbank_flavor – NCBI treats TBL files similarly to GenBank files. The primary distinction is that for prokaryotic genomes, the mRNA feature is not produced, and instead a CDS is a direct child of the gene feature.
locus_tag_jump_size – NCBI likes to have locus tags jump so that they can be in sort order in subsequent genome iterations.
submitter_lab_name – Name to put in the dbname section of unique identifiers. This is intended to be a string that uniquely labels the submitter lab. If not set, will be a random string.
random_seed – A seed value for random string generation. Useful for reproducible runs of this function.
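To make the idea behind locus_tag_jump_size concrete, here is a small sketch of numbering with gaps. It is only an illustration: iter_locus_tags is a hypothetical helper, and the tag layout (prefix, underscore, zero-padded 7-digit number starting at 1) is an assumption, not the format that collection_to_tbl is documented to produce.

```python
from typing import Iterator


def iter_locus_tags(prefix: str, count: int, jump_size: int = 5) -> Iterator[str]:
    """Yield `count` locus tags whose numbers increase by `jump_size`.

    Hypothetical helper; the tag format is an assumption of this sketch.
    """
    for i in range(count):
        # Gaps between consecutive numbers leave room for genes added
        # in a later annotation to be numbered in sort order.
        yield f"{prefix}_{1 + i * jump_size:07d}"
```

With the default jump of 5, three genes would be numbered 1, 6 and 11, so a gene discovered later between the first two could take a number from 2 to 5 without renumbering its neighbors.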