Understanding Incorporating Variation
Requirements
The VCF file must contain non-overlapping variants.
The variants can be heterozygous.
Phase set information is inferred from the
PS
tag.
The Concepts
VariantInterval
A VariantInterval
represents a single alternative haplotype. If a given variant has more than one alternative haplotype, a VariantInterval
is constructed for each haplotype.
VariantIntervalCollection
A VariantIntervalCollection
is a group of variants on the same phase group. If variants are unphased, there is a 1:1 relationship between VariantInterval
and VariantIntervalCollection
.
Therefore, each VariantIntervalCollection
represents all of the changes associated with one known haplotype. It is possible for there to be multiple VariantIntervalCollection
objects associated with one gene/feature if there are phase block breaks within a window, or if there are multiple haplotypes.
Associating With an AnnotationCollection
Each parser module accepts an optional set of variants parsed from the included VCF parser. Upon instantiation, all of the haplotypes are associated with all of the genes/features parsed from the GFF3 or GenBank file.
A mapping of the VariantIntervalCollection
GUID is maintained to all of the modified genes/features associated with that collection.
[2]:
from inscripta.biocantor.io.genbank.parser import parse_genbank, ParsedAnnotationRecord
from inscripta.biocantor.io.vcf.parser import parse_vcf_file
from uuid import UUID
[3]:
# 3 example variants in 2 phase blocks:
# a homozygous SNP in the 2nd codon
# a homozygous GGG in-frame codon insertion
# a heterozygous SNP
!cat tests/data/INSC1003.example_variants.vcf
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT INSC1003
FEPOIHMA_1 338 1 C G . . . GT:PS 1|1:1
FEPOIHMA_1 350 2 T TGGG . . . GT:PS 1|1:1
FEPOIHMA_1 450 3 C A . . . GT:PS 0|1:2
[4]:
parsed_variants = parse_vcf_file("tests/data/INSC1003.example_variants.vcf")
rec = list(
ParsedAnnotationRecord.parsed_annotation_records_to_model(
parse_genbank("tests/data/INSC1003.gbk",
parsed_variants=parsed_variants)
)
)[0]
[5]:
# 7 genes, 2 example VariantIntervalCollection
len(rec.genes), len(rec.feature_collections), len(rec.variant_collections)
[5]:
(7, 0, 2)
[6]:
# 2 VariantIntervalCollection each overlap the gene thrA
# note that the version of thrA associated with
# VariantIntervalCollection UUID('8a423eeb-d142-ff1a-da3c-3edd74810c97')
# now exists on 334-2800 due to the 3bp insertion
rec.alternative_haplotype_mapping
[6]:
{UUID('8a423eeb-d142-ff1a-da3c-3edd74810c97'): [GeneInterval(identifiers={'thrA', 'FEPOIHMA_00001'}, Intervals:TranscriptInterval((334-2800:+), cds=[CDS((334-2800:+), (CDSFrame.ZERO)], symbol=thrA))],
UUID('c68c9fa2-a55c-db48-3b2f-23bc61ae3d8d'): [GeneInterval(identifiers={'thrA', 'FEPOIHMA_00001'}, Intervals:TranscriptInterval((334-2797:+), cds=[CDS((334-2797:+), (CDSFrame.ZERO)], symbol=thrA))]}
[7]:
# the primary GeneInterval record retains the reference sequence, even if the variant is homozygous
str(rec.genes[0].get_primary_protein())[:20]
[7]:
'MRVLKFGGTSVANAERFLRV'
[8]:
# R -> G at 2nd codon
# G codon insertion at 5th codon
str(
rec.alternative_haplotype_mapping[UUID('8a423eeb-d142-ff1a-da3c-3edd74810c97')][0].get_primary_protein()
)[:20]
[8]:
'MGVLKLGGGTSVANAERFLR'
[12]:
# the AnnotationCollection can be sliced by position, and still retain the alternative haplotypes
# if they overlap
sliced_rec = rec.query_by_position(330, 400, completely_within=False)
[13]:
# the transcripts now exist on the sliced interval in the original coordinate system
# therefore, the haplotype with the 3bp insertion is now 334-403
# the other VariantIntervalCollection is now lost because it is outside of the window
sliced_rec.alternative_haplotype_mapping
[13]:
{UUID('8a423eeb-d142-ff1a-da3c-3edd74810c97'): [GeneInterval(identifiers={'thrA', 'FEPOIHMA_00001'}, Intervals:TranscriptInterval((334-403:+), cds=[CDS((334-403:+), (CDSFrame.ZERO)], symbol=thrA))]}
[14]:
# alternative haplotype translation now exists, but only within the window
str(
rec.alternative_haplotype_mapping[UUID('8a423eeb-d142-ff1a-da3c-3edd74810c97')][0].get_primary_protein()
)
[14]:
'MGVLKLGGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDALPNISDAERIFAELLTGLAAAQPGFPLAQLKTFVDQEFAQIKHVLHGISLLGQCPDSINAALICRGEKMSIAIMAGVLEARGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASRIPADHMVLMAGFTAGNEKGELVVLGRNGSDYSAAVLAACLRADCCEIWTDVDGVYTCDPRQVPDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPCLIKNTGNPQAPGTLIGASRDEDELPVKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLITQSSSEYSISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGMRTLRGISAKFFAALARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGVGGALLEQLKRQQSWLKNKHIDLRVCGVANSKALLTNVHGLNLENWQEELAQAKEPFNLGRLIRLVKEYHLLNPVIVDCTSSQAVADQYADFLREGFHVVTPNKKANTSSMDYYHLLRHAAEKSRRKFLYDTNVGAGLPVIENLQNLLNAGDELMKFSGILSGSLSYIFGKLDEGMSFSEATTLAREMGYTEPDPRDDLSGMDVARKLLILARETGRELELADIEIEPVLPAEFNAEGDVAAFMANLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGACRVKIAEVDGNDPLFKVKNGENALAFYSHYYQPLPLVLRGYGAGNDVTAAGVFADLLRTLSWKLGV*'
[ ]: