`biocantor.io.gff3.parser`

Parses GFF3 files by wrapping the gffutils library.

Functions that call gffutils directly require that the filepaths be local paths that exist, because gffutils cannot handle remote streams. Other functions accept any type of open file handle.

This module contains the default parser function, default_parse_func(). Custom parsers can be written by replacing this function, for example through the parse_func parameter of the parsing functions.

Additionally, the lower-level interface to gffutils can be tuned through the GffutilsParseArgs dataclass, which holds the arguments passed to gffutils.
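
As a rough illustration of how a dataclass of parse arguments can be adjusted and then expanded into keyword arguments, the sketch below uses a stand-in dataclass that mirrors only the two fields documented for GffutilsParseArgs. The class name ParseArgsSketch, the None default for id_spec, and the example id_spec value are assumptions made for illustration; the real class may define different defaults.

```python
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass
class ParseArgsSketch:
    """Stand-in for GffutilsParseArgs (hypothetical; mirrors the documented fields only)."""

    id_spec: Optional[dict] = None  # assumed default, not taken from the library
    merge_strategy: Optional[str] = "create_unique"  # default shown in this reference


# Adjust one argument without touching the other default.
custom_args = replace(ParseArgsSketch(), id_spec={"gene": "gene_id"})

# A dataclass converts cleanly into keyword arguments for a downstream call.
kwargs = asdict(custom_args)
print(kwargs)
```

In real use, an adjusted GffutilsParseArgs instance would be passed as gffutil_parse_args to one of the parsing functions documented below.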

Module Contents

Classes

GffutilsParseArgs

These arguments are passed to gffutils directly.

Functions

`filter_and_sort_qualifiers`(→ Optional[Dict[str, ...)	Filter out the qualifiers for any terms we have extracted as BioCantor identifiers as well as any GFF3 special terms.
`_convert_features_to_transcript`(exons, cds, strand, ...)	Wrapper function for conversion of exon/CDS features into a Transcript object.
`_parse_genes`(→ List[Dict])	Parse canonical genes from this database.
`_find_all_top_level_non_gene_features`(...)	Find all top-level non gene features. GFFutils lacks a way to do this directly, so we just iterate over everything.
`_parse_child_features_to_feature_interval`(→ Dict[str, Any])	Extract values from a list of child features and produce a dictionary to build a FeatureIntervalModel from.
`_parse_features`(→ List[Dict])	Parse generic features from this database. These are anything that cannot be interpreted as a gene.
`_find_non_gene_feature_types`(→ List[str])	Non-gene feature types are those that are not either a member of `Biotype` or `GFF3GeneFeatureTypes`.
`default_parse_func`(...)	This is the default parser function. Mappings include:
`extract_seqrecords_from_gff3_fasta`(...)	This function is NOT a function to apply FASTA information to a GFF3. This function is purely intended to extract the FASTA from a combined file and produce SeqRecords.
`parse_standard_gff3`(...)	Parses a GFF3 file using gffutils.
`_produce_empty_records`(...)	Convenience function shared by `parse_gff3_embedded_fasta()` and `parse_gff3_fasta()` that appends empty `ParsedAnnotationRecord` objects to the end.
`parse_gff3_embedded_fasta`(...)	Parses a GFF3 with an embedded FASTA. Wraps `parse_gff()` to produce `ParsedAnnotationRecord`.
`parse_gff3_fasta`(...)	Parses a GFF3 with a separate FASTA. Wraps `parse_gff()` to produce `ParsedAnnotationRecord`.

Attributes

logger

biocantor.io.gff3.parser.logger

class biocantor.io.gff3.parser.GffutilsParseArgs

These arguments are passed to gffutils directly.

id_spec :Optional[dict]

merge_strategy :Optional[str] = create_unique

biocantor.io.gff3.parser.filter_and_sort_qualifiers(qualifiers: Dict[str, List[str]]) → Optional[Dict[str, List[str]]]

Filter out the qualifiers for any terms we have extracted as BioCantor identifiers as well as any GFF3 special terms.
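
The general idea of dropping already-extracted keys and returning the remainder in a stable order can be sketched with plain dictionaries. This is not the module's implementation: the KEYS_TO_DROP set, the sorting of values, and the choice to return None when nothing remains are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical keys to drop; the real module defines its own identifier and special-term sets.
KEYS_TO_DROP = {"ID", "Parent", "gene_id", "transcript_id"}


def filter_and_sort_sketch(qualifiers: Dict[str, List[str]]) -> Optional[Dict[str, List[str]]]:
    """Drop the listed keys and return the rest sorted by key, or None if nothing is left."""
    kept = {
        key: sorted(values)  # sorting the values is an assumption of this sketch
        for key, values in sorted(qualifiers.items())
        if key not in KEYS_TO_DROP
    }
    return kept or None


example = {"product": ["kinase"], "note": ["b", "a"], "ID": ["gene1"]}
filtered = filter_and_sort_sketch(example)
print(filtered)
```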

biocantor.io.gff3.parser._convert_features_to_transcript(exons: List[gffutils.feature.Feature], cds: List[gffutils.feature.Feature], strand: str, chrom: str, transcript_qualifiers: Dict[str, List[str]], transcript_id: Optional[str], transcript_biotype: Optional[inscripta.biocantor.gene.Biotype], transcript_symbol: Optional[str])

Wrapper function for conversion of exon/CDS features into a Transcript object.

biocantor.io.gff3.parser._parse_genes(chrom: str, db: gffutils.interface.FeatureDB) → List[Dict]

Parse canonical genes from this database.

Parameters

chrom – A chromosome to parse.
db – Database from gffutils.

Returns

A list of nested dictionaries representing all genes on this chromosome.

biocantor.io.gff3.parser._find_all_top_level_non_gene_features(chrom: str, db: gffutils.interface.FeatureDB, feature_types: List[str]) → Iterable[gffutils.feature.Feature]

Find all top-level non gene features. GFFutils lacks a way to do this directly, so we just iterate over everything.

Parameters

chrom – A chromosome to parse.
db – Database from gffutils.
feature_types – A set of feature types that are in the database that are not genic.

Yields

Iterable of Feature objects that are top-level.

biocantor.io.gff3.parser._parse_child_features_to_feature_interval(features: List[gffutils.feature.Feature], locus_tag: Optional[str] = None) → Dict[str, Any]

Extract values from a list of child features and produce a dictionary to build a FeatureIntervalModel from.

Can also be provided a locus_tag value from a parent, if applicable.

This function combines all child features of a top-level non-gene feature

biocantor.io.gff3.parser._parse_features(chrom: str, db: gffutils.interface.FeatureDB, feature_types: List[str]) → List[Dict]

Parse generic features from this database. These are anything that cannot be interpreted as a gene.

If a feature is a top-level feature with no children, then infer a collection wrapper for it.

Parameters

chrom – A chromosome to parse.
db – Database from gffutils.
feature_types – A set of feature types that are in the database that are not genic.

Returns

A list of nested dictionaries representing non-gene features.

biocantor.io.gff3.parser._find_non_gene_feature_types(db: gffutils.interface.FeatureDB, feature_types_to_ignore: Optional[Set[str]] = None) → List[str]

Non-gene feature types are those that are not either a member of Biotype or GFF3GeneFeatureTypes. This combination of filters prevents genes being inadvertently pulled in from either of the two main styles of representing them.

NCBI style: {gene, pseudogene} -> {mRNA, tRNA, etc.} -> {exon, cds}
Ensembl/GENCODE style: gene -> transcript -> {exon, cds}

Parameters

db – Database from gffutils.
feature_types_to_ignore – Feature types to ignore, if chosen. This is often used to ignore pointless features like chromosome representations.

Returns

A list of strings representing the non-gene feature types found in the database.
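
The filtering described above amounts to excluding known gene-related types, plus any ignored types, from the feature types present in the database. The sketch below shows that idea with plain sets. BIOTYPE_NAMES and GENE_FEATURE_TYPES are hypothetical stand-ins for the members of Biotype and GFF3GeneFeatureTypes, and the input list stands in for the feature types a gffutils database would report.

```python
from typing import Iterable, List, Optional, Set

# Hypothetical stand-ins for the members of Biotype and GFF3GeneFeatureTypes.
BIOTYPE_NAMES = {"mRNA", "tRNA", "rRNA", "transcript"}
GENE_FEATURE_TYPES = {"gene", "pseudogene", "exon", "CDS"}


def non_gene_feature_types_sketch(
    all_types: Iterable[str], to_ignore: Optional[Set[str]] = None
) -> List[str]:
    """Keep only the types that are neither gene-related nor explicitly ignored."""
    excluded = BIOTYPE_NAMES | GENE_FEATURE_TYPES | (to_ignore or set())
    return [feature_type for feature_type in all_types if feature_type not in excluded]


found = non_gene_feature_types_sketch(
    ["gene", "mRNA", "exon", "CDS", "region", "repeat_region", "chromosome"],
    to_ignore={"chromosome"},
)
print(found)
```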

biocantor.io.gff3.parser.default_parse_func(db: gffutils.interface.FeatureDB, chroms: List[str]) → Iterable[inscripta.biocantor.io.models.AnnotationCollectionModel]

This is the default parser function. Mappings include:

gene_id -> gene_id
gene_name or if missing gene_symbol -> gene_symbol
gene_biotype or if missing gene_type -> gene_biotype
transcript_id -> transcript_id
transcript_name or if missing transcript_name -> transcript_symbol
transcript_biotype or if missing transcript_type -> transcript_biotype
if no transcript_biotype or transcript_type, then gene result is used
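
The "or if missing" fallbacks in these mappings can be pictured as a first-match lookup over qualifier keys. The following is a minimal sketch of that pattern using plain dictionaries; the helper first_value and the example qualifier values are hypothetical and are not part of the module.

```python
from typing import Dict, List, Optional


def first_value(qualifiers: Dict[str, List[str]], *keys: str) -> Optional[str]:
    """Return the first value of the first key that is present and non-empty."""
    for key in keys:
        values = qualifiers.get(key)
        if values:
            return values[0]
    return None


# Hypothetical qualifiers for one gene and one of its transcripts.
gene_q = {"gene_id": ["G1"], "gene_symbol": ["abc"], "gene_type": ["protein_coding"]}
tx_q = {"transcript_id": ["T1"], "transcript_name": ["abc-201"]}

gene_symbol = first_value(gene_q, "gene_name", "gene_symbol")
gene_biotype = first_value(gene_q, "gene_biotype", "gene_type")
# With no transcript-level biotype, fall back to the gene result.
transcript_biotype = first_value(tx_q, "transcript_biotype", "transcript_type") or gene_biotype
print(gene_symbol, gene_biotype, transcript_biotype)
```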

A list of chromosomes is required in order to allow there to be a specified order of data, otherwise they come back unordered from the database.

Parameters

db – Database from gffutils.
chroms – List of sequence names to iterate over.

Yields

AnnotationCollectionModel

biocantor.io.gff3.parser.extract_seqrecords_from_gff3_fasta(gff3_with_fasta_handle: TextIO) → List[Bio.SeqRecord.SeqRecord]

This function is NOT a function to apply FASTA information to a GFF3. This function is purely intended to extract the FASTA from a combined file and produce SeqRecords.

Parses a GFF3 with a FASTA suffix. Will raise an exception if such a suffix is not found.

Parameters

gff3_with_fasta_handle – Open file handle in text mode to a GFF3 file with a FASTA suffix.

Raises

GFF3FastaException – If the GFF3 lacks a FASTA suffix.

Returns

List of SeqRecord objects.
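
In the GFF3 format, an embedded FASTA section begins after a `##FASTA` directive line. The standard-library sketch below shows the general shape of extracting sequences from such a combined handle. It is not the module's implementation: it returns plain strings rather than SeqRecord objects, raises ValueError where the module raises GFF3FastaException, and the function name is hypothetical.

```python
import io
from typing import Dict, Optional, TextIO


def split_embedded_fasta_sketch(handle: TextIO) -> Dict[str, str]:
    """Return {sequence_id: sequence} for the FASTA section after the ##FASTA directive."""
    sequences: Dict[str, str] = {}
    in_fasta = False
    current: Optional[str] = None
    for line in handle:
        line = line.rstrip("\n")
        if not in_fasta:
            # Skip annotation lines until the directive is seen.
            in_fasta = line.startswith("##FASTA")
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]  # the identifier is the first token of the header
            sequences[current] = ""
        elif current is not None:
            sequences[current] += line.strip()
    if not in_fasta:
        raise ValueError("No ##FASTA directive found")  # stand-in for GFF3FastaException
    return sequences


gff3_text = "##gff-version 3\nchr1\t.\tgene\t1\t4\t.\t+\t.\tID=g1\n##FASTA\n>chr1 test\nACGT\nAC\n"
records = split_embedded_fasta_sketch(io.StringIO(gff3_text))
print(records)
```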

biocantor.io.gff3.parser.parse_standard_gff3(gff: Optional[pathlib.Path] = None, gffutil_parse_args: Optional[GffutilsParseArgs] = GffutilsParseArgs(), parse_func: Optional[Callable[[gffutils.interface.FeatureDB, List[str]], Iterable[inscripta.biocantor.io.models.AnnotationCollectionModel]]] = default_parse_func, gffutil_transform_func: Optional[Callable[[gffutils.feature.Feature], gffutils.feature.Feature]] = None, db_fn: str = ':memory:') → Iterable[inscripta.biocantor.io.parser.ParsedAnnotationRecord]

Parses a GFF3 file using gffutils.

The parameters parse_func and gffutil_parse_args are implemented separately for each data source. A default implementation of each exists in this module.

Parameters

gff – Path to a GFF. Must be local or HTTPS. Optional only if db_fn is a pre-built GFFutils database.
parse_func – Function that actually converts gffutils to BioCantor representation.
gffutil_transform_func – Function that transforms feature keys. Can be necessary in cases where IDs are not unique.
gffutil_parse_args – Parsing arguments to pass to gffutils.
db_fn – Location to write a gffutils database. Defaults to :memory:, which means the database will be built transiently. If this value is not :memory:, and the file path exists, then it will be assumed to be a GFFutils database that was built externally and that database will be used. This value can be set to a file location if memory is a concern, or if you want to retain the gffutils database. It will not be cleaned up.

Yields

Iterable of ParsedAnnotationRecord objects.
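
The db_fn rule described above (build transiently for :memory:, reuse an existing file, otherwise build at the given path) can be expressed as a small decision function. This is a sketch of the documented behavior under the assumption that only path existence is checked; the function name is hypothetical.

```python
import os
import tempfile


def should_reuse_database(db_fn: str = ":memory:") -> bool:
    """True when db_fn names an existing file that would be treated as a pre-built database."""
    return db_fn != ":memory:" and os.path.exists(db_fn)


with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "prebuilt.db")
    open(existing, "w").close()  # placeholder file standing in for a pre-built database
    missing = os.path.join(tmp, "new.db")
    decisions = (
        should_reuse_database(),  # default: build in memory
        should_reuse_database(existing),  # existing path: reuse
        should_reuse_database(missing),  # new path: build and keep on disk
    )
print(decisions)
```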

biocantor.io.gff3.parser._produce_empty_records(seqrecords_dict: Dict[str, Bio.SeqRecord.SeqRecord], seen_seqs: Set[str]) → Iterable[inscripta.biocantor.io.parser.ParsedAnnotationRecord]

Convenience function shared by parse_gff3_embedded_fasta() and parse_gff3_fasta() that appends empty ParsedAnnotationRecord objects to the end. This ensures that every sequence in the FASTA is still represented in the final object set, even if it has zero annotations.

Parameters

seqrecords_dict – Dictionary mapping sequence names to SeqRecord objects.
seen_seqs – Set of sequences that were found when parsing the GFF3.

Yields

Iterable of ParsedAnnotationRecord objects with empty annotations.
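
The padding step can be sketched as a generator over the sequences that never appeared while parsing annotations. The sketch below uses plain strings and tuples in place of SeqRecord and ParsedAnnotationRecord objects, so the function name and return shape are assumptions for illustration only.

```python
from typing import Dict, Iterator, List, Set, Tuple


def empty_records_sketch(seqrecords: Dict[str, str], seen: Set[str]) -> Iterator[Tuple[str, List]]:
    """Yield (sequence_name, empty annotation list) for every sequence not yet seen."""
    for name in seqrecords:
        if name not in seen:
            yield name, []


sequences = {"chr1": "ACGT", "chr2": "GGCC", "chrM": "TTAA"}
leftover = list(empty_records_sketch(sequences, seen={"chr1"}))
print(leftover)
```

Because dictionaries preserve insertion order, the unannotated sequences come out in the order they appeared in the FASTA.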

biocantor.io.gff3.parser.parse_gff3_embedded_fasta(gff3_with_fasta: pathlib.Path, gffutil_parse_args: Optional[GffutilsParseArgs] = GffutilsParseArgs(), parse_func: Optional[Callable[[gffutils.interface.FeatureDB, List[str]], Iterable[inscripta.biocantor.io.models.AnnotationCollectionModel]]] = default_parse_func, gffutil_transform_func: Optional[Callable[[gffutils.feature.Feature], gffutils.feature.Feature]] = None, db_fn: Optional[str] = ':memory:') → Iterable[inscripta.biocantor.io.parser.ParsedAnnotationRecord]

Parses a GFF3 with an embedded FASTA. Wraps parse_gff() to produce ParsedAnnotationRecord.

Parameters

gff3_with_fasta – Path to a GFF3 file with a FASTA suffix. Must be local or HTTPS. Is not optional because GFFUtils databases do not contain sequence information.
parse_func – Function that actually converts gffutils to BioCantor representation.
gffutil_transform_func – Function that transforms feature keys. Can be necessary in cases where IDs are not unique.
gffutil_parse_args – Parsing arguments to pass to gffutils.
db_fn – Location to write a gffutils database. Defaults to :memory:, which means the database will be built transiently. If this value is not :memory:, and the file path exists, then it will be assumed to be a GFFutils database that was built externally and that database will be used. This value can be set to a file location if memory is a concern, or if you want to retain the gffutils database. It will not be cleaned up.

Raises

GFF3FastaException – if the GFF3 lacks a FASTA suffix.
DuplicateSequenceException – If the FASTA file contains duplicate sequences.

Yields

Iterable of ParsedAnnotationRecord objects.

biocantor.io.gff3.parser.parse_gff3_fasta(gff3: pathlib.Path, fasta: pathlib.Path, gffutil_parse_args: Optional[GffutilsParseArgs] = GffutilsParseArgs(), parse_func: Optional[Callable[[gffutils.interface.FeatureDB, List[str]], Iterable[inscripta.biocantor.io.models.AnnotationCollectionModel]]] = default_parse_func, gffutil_transform_func: Optional[Callable[[gffutils.feature.Feature], gffutils.feature.Feature]] = None, db_fn: Optional[str] = ':memory:') → Iterable[inscripta.biocantor.io.parser.ParsedAnnotationRecord]

Parses a GFF3 with a separate FASTA. Wraps parse_gff() to produce ParsedAnnotationRecord.

Parameters

gff3 – Path to a GFF3 file. Must be local or HTTPS.
fasta – Path to a FASTA file. Must be local or HTTPS.
parse_func – Function that actually converts gffutils to BioCantor representation.
gffutil_transform_func – Function that transforms feature keys. Can be necessary in cases where IDs are not unique.
gffutil_parse_args – Parsing arguments to pass to gffutils.
db_fn – Location to write a gffutils database. Defaults to :memory:, which means the database will be built transiently. If this value is not :memory:, and the file path exists, then it will be assumed to be a GFFutils database that was built externally and that database will be used. This value can be set to a file location if memory is a concern, or if you want to retain the gffutils database. It will not be cleaned up.

Raises

GFF3FastaException – if the GFF3 lacks a FASTA suffix.
DuplicateSequenceException – If the FASTA file contains duplicate sequences.

Yields

Iterable of ParsedAnnotationRecord objects.
