Understanding Incorporating Variation

Requirements

  1. The VCF file must contain non-overlapping variants.

  2. The variants can be heterozygous.

  3. Phase set information is inferred from the PS tag.

The Concepts

VariantInterval

A VariantInterval represents a single alternative haplotype. If a given variant has more than one alternative haplotype, a VariantInterval is constructed for each haplotype.

VariantIntervalCollection

A VariantIntervalCollection is a group of variants on the same phase group. If variants are unphased, there is a 1:1 relationship between VariantInterval and VariantIntervalCollection.

Therefore, each VariantIntervalCollection represents all of the changes associated with one known haplotype. It is possible for there to be multiple VariantIntervalCollection objects associated with one gene/feature if there are phase block breaks within a window, or if there are multiple haplotypes.

Associating With an AnnotationCollection

Each parser module accepts an optional set of variants parsed from the included VCF parser. Upon instantiation, all of the haplotypes are associated with all of the genes/features parsed from the GFF3 or GenBank file.

A mapping of the VariantIntervalCollection GUID is maintained to all of the modified genes/features associated with that collection.

[2]:
from inscripta.biocantor.io.genbank.parser import parse_genbank, ParsedAnnotationRecord
from inscripta.biocantor.io.vcf.parser import parse_vcf_file
from uuid import UUID
[3]:
# 3 example variants in 2 phase blocks:
# a homozygous SNP in the 2nd codon
# a homozygous GGG in-frame codon insertion
# a heterozygous SNP
!cat tests/data/INSC1003.example_variants.vcf
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  INSC1003
FEPOIHMA_1      338     1       C       G       .       .       .       GT:PS   1|1:1
FEPOIHMA_1      350     2       T       TGGG    .       .       .       GT:PS   1|1:1
FEPOIHMA_1      450     3       C       A       .       .       .       GT:PS   0|1:2
[4]:
parsed_variants = parse_vcf_file("tests/data/INSC1003.example_variants.vcf")
rec = list(
    ParsedAnnotationRecord.parsed_annotation_records_to_model(
        parse_genbank("tests/data/INSC1003.gbk",
                     parsed_variants=parsed_variants)
    )
)[0]
[5]:
# 7 genes, 2 example VariantIntervalCollection
len(rec.genes), len(rec.feature_collections), len(rec.variant_collections)
[5]:
(7, 0, 2)
[6]:
# 2 VariantIntervalCollection each overlap the gene thrA
# note that the version of thrA associated with
# VariantIntervalCollection UUID('8a423eeb-d142-ff1a-da3c-3edd74810c97')
# now exists on 334-2800 due to the 3bp insertion
rec.alternative_haplotype_mapping
[6]:
{UUID('8a423eeb-d142-ff1a-da3c-3edd74810c97'): [GeneInterval(identifiers={'thrA', 'FEPOIHMA_00001'}, Intervals:TranscriptInterval((334-2800:+), cds=[CDS((334-2800:+), (CDSFrame.ZERO)], symbol=thrA))],
 UUID('c68c9fa2-a55c-db48-3b2f-23bc61ae3d8d'): [GeneInterval(identifiers={'thrA', 'FEPOIHMA_00001'}, Intervals:TranscriptInterval((334-2797:+), cds=[CDS((334-2797:+), (CDSFrame.ZERO)], symbol=thrA))]}
[7]:
# the primary GeneInterval record retains the reference sequence, even if the variant is homozygous
str(rec.genes[0].get_primary_protein())[:20]
[7]:
'MRVLKFGGTSVANAERFLRV'
[8]:
# R -> G at 2nd codon
# G codon insertion at 5th codon
str(
    rec.alternative_haplotype_mapping[UUID('8a423eeb-d142-ff1a-da3c-3edd74810c97')][0].get_primary_protein()
)[:20]
[8]:
'MGVLKLGGGTSVANAERFLR'
[12]:
# the AnnotationCollection can be sliced by position, and still retain the alternative haplotypes
# if they overlap
sliced_rec = rec.query_by_position(330, 400, completely_within=False)
[13]:
# the transcripts now exist on the sliced interval in the original coordinate system
# therefore, the haplotype with the 3bp insertion is now 334-403
# the other VariantIntervalCollection is now lost because it is outside of the window
sliced_rec.alternative_haplotype_mapping
[13]:
{UUID('8a423eeb-d142-ff1a-da3c-3edd74810c97'): [GeneInterval(identifiers={'thrA', 'FEPOIHMA_00001'}, Intervals:TranscriptInterval((334-403:+), cds=[CDS((334-403:+), (CDSFrame.ZERO)], symbol=thrA))]}
[14]:
# alternative haplotype translation now exists, but only within the window
str(
    rec.alternative_haplotype_mapping[UUID('8a423eeb-d142-ff1a-da3c-3edd74810c97')][0].get_primary_protein()
)
[14]:
'MGVLKLGGGTSVANAERFLRVADILESNARQGQVATVLSAPAKITNHLVAMIEKTISGQDALPNISDAERIFAELLTGLAAAQPGFPLAQLKTFVDQEFAQIKHVLHGISLLGQCPDSINAALICRGEKMSIAIMAGVLEARGHNVTVIDPVEKLLAVGHYLESTVDIAESTRRIAASRIPADHMVLMAGFTAGNEKGELVVLGRNGSDYSAAVLAACLRADCCEIWTDVDGVYTCDPRQVPDARLLKSMSYQEAMELSYFGAKVLHPRTITPIAQFQIPCLIKNTGNPQAPGTLIGASRDEDELPVKGISNLNNMAMFSVSGPGMKGMVGMAARVFAAMSRARISVVLITQSSSEYSISFCVPQSDCVRAERAMQEEFYLELKEGLLEPLAVTERLAIISVVGDGMRTLRGISAKFFAALARANINIVAIAQGSSERSISVVVNNDDATTGVRVTHQMLFNTDQVIEVFVIGVGGVGGALLEQLKRQQSWLKNKHIDLRVCGVANSKALLTNVHGLNLENWQEELAQAKEPFNLGRLIRLVKEYHLLNPVIVDCTSSQAVADQYADFLREGFHVVTPNKKANTSSMDYYHLLRHAAEKSRRKFLYDTNVGAGLPVIENLQNLLNAGDELMKFSGILSGSLSYIFGKLDEGMSFSEATTLAREMGYTEPDPRDDLSGMDVARKLLILARETGRELELADIEIEPVLPAEFNAEGDVAAFMANLSQLDDLFAARVAKARDEGKVLRYVGNIDEDGACRVKIAEVDGNDPLFKVKNGENALAFYSHYYQPLPLVLRGYGAGNDVTAAGVFADLLRTLSWKLGV*'
[ ]: