biocantor.io.gff3.parser
Parse GFF3 files by wrapping the library gffutils.
Functions that call gffutils directly require that the file paths be local paths that exist, because gffutils cannot handle remote streams. Other functions accept any type of open file handle.
This module contains the default parser function default_parse_func(). This function can be overridden to write custom parsers.
Additionally, the lower-level interface to gffutils can be tweaked by adjusting the GffutilsParseArgs dataclass, which controls the arguments passed to gffutils.
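Module Contents
Classes
GffutilsParseArgs | These arguments are passed to gffutils directly.
Functions
filter_and_sort_qualifiers() | Filter out the qualifiers for any terms we have extracted as BioCantor identifiers as well as any GFF3 special terms.
_convert_features_to_transcript() | Wrapper function for conversion of exon/CDS features into a Transcript object.
_parse_genes() | Parse canonical genes from this database.
_find_all_top_level_non_gene_features() | Find all top-level non-gene features. GFFutils lacks a way to do this directly, so we just iterate over everything.
_parse_child_features_to_feature_interval() | Extract values from a list of child features and produce a dictionary to build a FeatureIntervalModel from.
_parse_features() | Parse generic features from this database. These are anything that cannot be interpreted as a gene.
_find_non_gene_feature_types() | Non-gene feature types are those that are not either a member of Biotype or GFF3GeneFeatureTypes.
default_parse_func() | This is the default parser function.
extract_seqrecords_from_gff3_fasta() | This function is NOT a function to apply FASTA information to a GFF3. This function is purely intended to extract the FASTA from a combined file and produce SeqRecords.
parse_standard_gff3() | Parses a GFF3 file using gffutils.
_produce_empty_records() | Convenience function shared by parse_gff3_embedded_fasta() and parse_gff3_fasta() that appends empty ParsedAnnotationRecord objects to the end.
parse_gff3_embedded_fasta() | Parses a GFF3 with an embedded FASTA.
parse_gff3_fasta() | Parses a GFF3 with a separate FASTA.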
Attributes
- biocantor.io.gff3.parser.logger
- class biocantor.io.gff3.parser.GffutilsParseArgs
These arguments are passed to gffutils directly.
- id_spec :Optional[dict]
- merge_strategy :Optional[str] = create_unique
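As a rough illustration of how a dataclass like this can carry keyword arguments through to a lower-level library, the sketch below mirrors the two documented fields. `ParseArgsSketch` and `to_kwargs` are hypothetical names, the `None` default for `id_spec` is an assumption, and dropping unset values is a choice made for this sketch rather than documented behavior.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ParseArgsSketch:
    """Hypothetical stand-in mirroring the two documented fields."""

    id_spec: Optional[dict] = None  # assumed default; the page above does not show one
    merge_strategy: Optional[str] = "create_unique"


def to_kwargs(args: ParseArgsSketch) -> dict:
    """Turn the dataclass into keyword arguments, dropping unset (None) values."""
    return {key: value for key, value in asdict(args).items() if value is not None}


default_kwargs = to_kwargs(ParseArgsSketch())
custom_kwargs = to_kwargs(
    ParseArgsSketch(id_spec={"gene": "gene_id"}, merge_strategy="error")
)
```

Keeping the arguments in a dataclass means a caller can change one field without restating the others.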
- biocantor.io.gff3.parser.filter_and_sort_qualifiers(qualifiers: Dict[str, List[str]]) -> Optional[Dict[str, List[str]]]
Filter out the qualifiers for any terms we have extracted as BioCantor identifiers as well as any GFF3 special terms.
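The filtering described here might look roughly like the following. The two key sets are hypothetical placeholders, and both the sorting of values and the `None` return for an empty result are assumptions inferred from the function name and its `Optional` return type.

```python
from typing import Dict, List, Optional

# Hypothetical key sets; the real module would derive these from its own constants.
EXTRACTED_KEYS = {"gene_id", "gene_name", "transcript_id"}
GFF3_SPECIAL_KEYS = {"ID", "Parent", "Name"}


def filter_and_sort_sketch(
    qualifiers: Dict[str, List[str]]
) -> Optional[Dict[str, List[str]]]:
    """Drop excluded keys, sort the rest by key, and return None if nothing remains."""
    excluded = EXTRACTED_KEYS | GFF3_SPECIAL_KEYS
    kept = {
        key: sorted(values)
        for key, values in sorted(qualifiers.items())
        if key not in excluded
    }
    return kept or None


result = filter_and_sort_sketch(
    {"product": ["kinase"], "note": ["b", "a"], "ID": ["x"], "gene_id": ["g1"]}
)
```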
- biocantor.io.gff3.parser._convert_features_to_transcript(exons: List[gffutils.feature.Feature], cds: List[gffutils.feature.Feature], strand: str, chrom: str, transcript_qualifiers: Dict[str, List[str]], transcript_id: Optional[str], transcript_biotype: Optional[inscripta.biocantor.gene.Biotype], transcript_symbol: Optional[str])
Wrapper function for conversion of exon/CDS features into a Transcript object.
- biocantor.io.gff3.parser._parse_genes(chrom: str, db: gffutils.interface.FeatureDB) -> List[Dict]
Parse canonical genes from this database.
- Parameters
chrom – A chromosome to parse.
db – Database from gffutils.
- Returns
A list of nested dictionaries representing all genes on this chromosome.
- biocantor.io.gff3.parser._find_all_top_level_non_gene_features(chrom: str, db: gffutils.interface.FeatureDB, feature_types: List[str]) -> Iterable[gffutils.feature.Feature]
Find all top-level non-gene features. GFFutils lacks a way to do this directly, so we just iterate over everything.
- Parameters
chrom – A chromosome to parse.
db – Database from gffutils.
feature_types – A set of feature types that are in the database that are not genic.
- Yields
Iterable of Feature objects that are top-level.
- biocantor.io.gff3.parser._parse_child_features_to_feature_interval(features: List[gffutils.feature.Feature], locus_tag: Optional[str] = None) -> Dict[str, Any]
Extract values from a list of child features and produce a dictionary to build a FeatureIntervalModel from.
Can also be provided a locus_tag value from a parent, if applicable.
This function combines all child features of a top-level non-gene feature
- biocantor.io.gff3.parser._parse_features(chrom: str, db: gffutils.interface.FeatureDB, feature_types: List[str]) -> List[Dict]
Parse generic features from this database. These are anything that cannot be interpreted as a gene.
If a feature is a top-level feature with no children, then infer a collection wrapper for it.
- Parameters
chrom – A chromosome to parse.
db – Database from gffutils.
feature_types – A set of feature types that are in the database that are not genic.
- Returns
A list of nested dictionaries representing non-gene features.
- biocantor.io.gff3.parser._find_non_gene_feature_types(db: gffutils.interface.FeatureDB, feature_types_to_ignore: Optional[Set[str]] = None) -> List[str]
Non-gene feature types are those that are not either a member of Biotype or GFF3GeneFeatureTypes. This combination of filters prevents genes being inadvertently pulled in from either of the two main styles of representing them.
NCBI Style: {gene,pseudogene} -> {mRNA, tRNA, etc} -> {exon, cds}
Ensembl/GENCODE Style: gene -> transcript -> {exon, cds}
- Parameters
db – Database from gffutils.
feature_types_to_ignore – Feature types to ignore, if chosen. This is often used to ignore pointless features like chromosome representations.
- Returns
A list of strings representing the non-gene feature types found in the database.
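The combined filter can be pictured as a set difference. In this sketch the two constant sets are small hypothetical subsets standing in for the members of Biotype and GFF3GeneFeatureTypes, and the sorted output is an assumption made for determinism.

```python
from typing import Iterable, List, Optional, Set

# Hypothetical subsets standing in for Biotype members and GFF3GeneFeatureTypes.
BIOTYPE_NAMES = {"mRNA", "tRNA", "lncRNA", "protein_coding"}
GENE_FEATURE_TYPES = {"gene", "pseudogene", "transcript", "exon", "CDS"}


def non_gene_feature_types_sketch(
    all_types: Iterable[str], ignore: Optional[Set[str]] = None
) -> List[str]:
    """Keep only types that are neither genic (in either style) nor explicitly ignored."""
    ignore = ignore or set()
    genic = BIOTYPE_NAMES | GENE_FEATURE_TYPES
    return sorted(t for t in all_types if t not in genic and t not in ignore)


types_in_db = ["gene", "mRNA", "exon", "CDS", "region", "promoter"]
non_genic = non_gene_feature_types_sketch(types_in_db, ignore={"region"})
```

Filtering against both sets at once is what keeps NCBI-style mid-level types such as mRNA and Ensembl-style types such as transcript out of the result.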
- biocantor.io.gff3.parser.default_parse_func(db: gffutils.interface.FeatureDB, chroms: List[str]) -> Iterable[inscripta.biocantor.io.models.AnnotationCollectionModel]
This is the default parser function. Mappings include:
gene_id -> gene_id
gene_name or if missing gene_symbol -> gene_symbol
gene_biotype or if missing gene_type -> gene_biotype
transcript_id -> transcript_id
transcript_name or if missing transcript_name -> transcript_symbol
transcript_biotype or if missing transcript_type -> transcript_biotype
if no transcript_biotype or transcript_type, then gene result is used
A list of chromosomes is required in order to allow there to be a specified order of data, otherwise they come back unordered from the database.
- Parameters
db – Database from gffutils.
chroms – List of sequence names to iterate over.
- Yields
Iterable of AnnotationCollectionModel objects.
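The fallback mappings listed for this function can be sketched with a small helper that returns the first value found among several candidate keys. All function names here are hypothetical, and taking only the first value of each qualifier list is an assumption of the sketch.

```python
from typing import Dict, List, Optional


def first_value(qualifiers: Dict[str, List[str]], *keys: str) -> Optional[str]:
    """Return the first value of the first key that is present and non-empty."""
    for key in keys:
        values = qualifiers.get(key)
        if values:
            return values[0]
    return None


def map_gene_sketch(q: Dict[str, List[str]]) -> Dict[str, Optional[str]]:
    return {
        "gene_id": first_value(q, "gene_id"),
        "gene_symbol": first_value(q, "gene_name", "gene_symbol"),
        "gene_biotype": first_value(q, "gene_biotype", "gene_type"),
    }


def map_transcript_sketch(
    q: Dict[str, List[str]], gene: Dict[str, Optional[str]]
) -> Dict[str, Optional[str]]:
    return {
        "transcript_id": first_value(q, "transcript_id"),
        "transcript_symbol": first_value(q, "transcript_name"),
        # Fall back to the gene-level biotype when the transcript has none.
        "transcript_biotype": first_value(q, "transcript_biotype", "transcript_type")
        or gene["gene_biotype"],
    }


gene = map_gene_sketch(
    {"gene_id": ["g1"], "gene_symbol": ["abc"], "gene_type": ["protein_coding"]}
)
tx = map_transcript_sketch(
    {"transcript_id": ["t1"], "transcript_name": ["abc-201"]}, gene
)
```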
- biocantor.io.gff3.parser.extract_seqrecords_from_gff3_fasta(gff3_with_fasta_handle: TextIO) -> List[Bio.SeqRecord.SeqRecord]
This function is NOT a function to apply FASTA information to a GFF3. This function is purely intended to extract the FASTA from a combined file and produce SeqRecords.
Parses a GFF3 with a FASTA suffix. Will raise an exception if such a suffix is not found.
- Parameters
gff3_with_fasta_handle – Open file handle in text mode to a GFF3 file with a FASTA suffix.
- Raises
GFF3FastaException – if the GFF3 lacks a FASTA suffix.
- Returns
List of SeqRecord objects.
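To make the extraction step concrete, here is a standard-library sketch that assumes the "FASTA suffix" is the section following the GFF3 `##FASTA` directive. It returns plain strings instead of SeqRecord objects, and `MissingFastaError` is a hypothetical stand-in for GFF3FastaException.

```python
import io
from typing import Dict, Optional, TextIO


class MissingFastaError(ValueError):
    """Hypothetical stand-in for GFF3FastaException."""


def extract_fasta_sketch(handle: TextIO) -> Dict[str, str]:
    """Return {sequence name: sequence} from the section that follows ##FASTA."""
    # Skip annotation lines until the ##FASTA directive is reached.
    for line in handle:
        if line.strip() == "##FASTA":
            break
    else:
        raise MissingFastaError("no ##FASTA directive found")

    records: Dict[str, str] = {}
    name: Optional[str] = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The sequence name is the first whitespace-delimited token of the header.
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


combined = (
    "##gff-version 3\n"
    "chr1\t.\tgene\t1\t8\t.\t+\t.\tID=g1\n"
    "##FASTA\n"
    ">chr1 test\nACGT\nACGT\n>chr2\nGG\n"
)
sequences = extract_fasta_sketch(io.StringIO(combined))
```

Because the function only needs an open text handle, it works on in-memory buffers as well as files.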
- biocantor.io.gff3.parser.parse_standard_gff3(gff: Optional[pathlib.Path] = None, gffutil_parse_args: Optional[GffutilsParseArgs] = GffutilsParseArgs(), parse_func: Optional[Callable[[gffutils.interface.FeatureDB, List[str]], Iterable[inscripta.biocantor.io.models.AnnotationCollectionModel]]] = default_parse_func, gffutil_transform_func: Optional[Callable[[gffutils.feature.Feature], gffutils.feature.Feature]] = None, db_fn: str = ':memory:') -> Iterable[inscripta.biocantor.io.parser.ParsedAnnotationRecord]
Parses a GFF3 file using gffutils.
The parameters parse_func and gffutil_parse_args are implemented separately for each data source. A default implementation exists in this module.
- Parameters
gff – Path to a GFF. Must be local or HTTPS. Optional only if db_fn is a pre-built GFFutils database.
parse_func – Function that actually converts gffutils to BioCantor representation.
gffutil_transform_func – Function that transforms feature keys. Can be necessary in cases where IDs are not unique.
gffutil_parse_args – Parsing arguments to pass to gffutils.
db_fn – Location to write a gffutils database. Defaults to :memory:, which means the database will be built transiently. If this value is not :memory:, and the file path exists, then it will be assumed to be a GFFutils database that was built externally and that database will be used. This value can be set to a file location if memory is a concern, or if you want to retain the gffutils database. It will not be cleaned up.
- Yields
Iterable of ParsedAnnotationRecord objects.
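The db_fn rules described above reduce to a three-way decision. The sketch below only models that decision; `choose_db_action` and its return labels are hypothetical, and it does not call gffutils.

```python
import tempfile
from pathlib import Path


def choose_db_action(db_fn: str = ":memory:") -> str:
    """Classify what the documented db_fn rules imply for a given value."""
    if db_fn == ":memory:":
        return "build-in-memory"  # transient database, nothing written to disk
    if Path(db_fn).exists():
        return "reuse-existing"  # assumed to be a database that was built externally
    return "build-on-disk"  # new database written to db_fn and left in place


with tempfile.TemporaryDirectory() as tmp:
    existing = Path(tmp) / "annotations.db"
    existing.touch()
    actions = [
        choose_db_action(),
        choose_db_action(str(existing)),
        choose_db_action(str(Path(tmp) / "new.db")),
    ]
```

Note that under these rules any existing file at db_fn is treated as a pre-built database, so pointing db_fn at an unrelated file would be a mistake.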
- biocantor.io.gff3.parser._produce_empty_records(seqrecords_dict: Dict[str, Bio.SeqRecord.SeqRecord], seen_seqs: Set[str]) -> Iterable[inscripta.biocantor.io.parser.ParsedAnnotationRecord]
Convenience function shared by parse_gff3_embedded_fasta() and parse_gff3_fasta() that appends empty ParsedAnnotationRecord objects to the end. This ensures that every sequence in the FASTA is still represented in the final object set, even if it has zero annotations.
- Parameters
seqrecords_dict – Dictionary mapping sequence names to SeqRecord objects.
seen_seqs – Set of sequences that were found when parsing the GFF3.
- Yields
Iterable of ParsedAnnotationRecord objects with empty annotations.
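The padding step can be sketched as a generator over the sequences that were never seen while parsing annotations. `RecordSketch` is a hypothetical simplification of ParsedAnnotationRecord, with plain strings in place of SeqRecord objects.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Set


@dataclass
class RecordSketch:
    """Hypothetical simplification of a parsed record: a sequence plus its annotations."""

    name: str
    sequence: str
    annotations: List[dict] = field(default_factory=list)


def empty_records_sketch(
    seqrecords: Dict[str, str], seen: Set[str]
) -> Iterator[RecordSketch]:
    """Yield an annotation-free record for every sequence not seen in the GFF3."""
    for name, sequence in seqrecords.items():
        if name not in seen:
            yield RecordSketch(name=name, sequence=sequence)


padding = list(
    empty_records_sketch({"chr1": "ACGT", "chr2": "GG", "chrM": "TT"}, seen={"chr1"})
)
```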
- biocantor.io.gff3.parser.parse_gff3_embedded_fasta(gff3_with_fasta: pathlib.Path, gffutil_parse_args: Optional[GffutilsParseArgs] = GffutilsParseArgs(), parse_func: Optional[Callable[[gffutils.interface.FeatureDB, List[str]], Iterable[inscripta.biocantor.io.models.AnnotationCollectionModel]]] = default_parse_func, gffutil_transform_func: Optional[Callable[[gffutils.feature.Feature], gffutils.feature.Feature]] = None, db_fn: Optional[str] = ':memory:') -> Iterable[inscripta.biocantor.io.parser.ParsedAnnotationRecord]
Parses a GFF3 with an embedded FASTA. Wraps parse_standard_gff3() to produce ParsedAnnotationRecord.
- Parameters
gff3_with_fasta – Path to a GFF3 file with a FASTA suffix. Must be local or HTTPS. Is not optional because GFFUtils databases do not contain sequence information.
parse_func – Function that actually converts gffutils to BioCantor representation.
gffutil_transform_func – Function that transforms feature keys. Can be necessary in cases where IDs are not unique.
gffutil_parse_args – Parsing arguments to pass to gffutils.
db_fn – Location to write a gffutils database. Defaults to :memory:, which means the database will be built transiently. If this value is not :memory:, and the file path exists, then it will be assumed to be a GFFutils database that was built externally and that database will be used. This value can be set to a file location if memory is a concern, or if you want to retain the gffutils database. It will not be cleaned up.
- Raises
GFF3FastaException – if the GFF3 lacks a FASTA suffix.
DuplicateSequenceException – If the FASTA file contains duplicate sequences.
- Yields
Iterable of ParsedAnnotationRecord objects.
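The duplicate-sequence check named under Raises could be implemented along these lines. `DuplicateNameError` is a hypothetical stand-in for DuplicateSequenceException, and comparing sequence names only (not sequence content) is an assumption.

```python
from collections import Counter
from typing import Iterable, List


class DuplicateNameError(ValueError):
    """Hypothetical stand-in for DuplicateSequenceException."""


def check_unique_names(names: Iterable[str]) -> List[str]:
    """Return the names unchanged, or raise if any name occurs more than once."""
    names = list(names)
    duplicates = sorted(name for name, count in Counter(names).items() if count > 1)
    if duplicates:
        raise DuplicateNameError(f"duplicate sequence names: {', '.join(duplicates)}")
    return names


unique = check_unique_names(["chr1", "chr2"])
```

Failing early matters here because the sequences are keyed by name, so a duplicate would otherwise silently overwrite an earlier record.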
- biocantor.io.gff3.parser.parse_gff3_fasta(gff3: pathlib.Path, fasta: pathlib.Path, gffutil_parse_args: Optional[GffutilsParseArgs] = GffutilsParseArgs(), parse_func: Optional[Callable[[gffutils.interface.FeatureDB, List[str]], Iterable[inscripta.biocantor.io.models.AnnotationCollectionModel]]] = default_parse_func, gffutil_transform_func: Optional[Callable[[gffutils.feature.Feature], gffutils.feature.Feature]] = None, db_fn: Optional[str] = ':memory:') -> Iterable[inscripta.biocantor.io.parser.ParsedAnnotationRecord]
Parses a GFF3 with a separate FASTA. Wraps parse_standard_gff3() to produce ParsedAnnotationRecord.
- Parameters
gff3 – Path to a GFF3 file. Must be local or HTTPS.
fasta – Path to a FASTA file. Must be local or HTTPS.
parse_func – Function that actually converts gffutils to BioCantor representation.
gffutil_transform_func – Function that transforms feature keys. Can be necessary in cases where IDs are not unique.
gffutil_parse_args – Parsing arguments to pass to gffutils.
db_fn – Location to write a gffutils database. Defaults to :memory:, which means the database will be built transiently. If this value is not :memory:, and the file path exists, then it will be assumed to be a GFFutils database that was built externally and that database will be used. This value can be set to a file location if memory is a concern, or if you want to retain the gffutils database. It will not be cleaned up.
- Raises
GFF3FastaException – if the GFF3 lacks a FASTA suffix.
DuplicateSequenceException – If the FASTA file contains duplicate sequences.
- Yields
Iterable of ParsedAnnotationRecord objects.