AnnotationCollection Operations

AnnotationCollection is the top-level data structure for interacting with annotations imported from GFF3 or GenBank.

The class provides methods for iterating over, looking up, querying, and exporting its child objects (GeneInterval and FeatureIntervalCollection).

[ ]:

from inscripta.biocantor.io.genbank.parser import parse_genbank, ParsedAnnotationRecord
from inscripta.biocantor.gene.collections import AnnotationCollection, GeneInterval, FeatureIntervalCollection, SequenceType
from uuid import UUID

[ ]:

gbk = "tests/data/INSC1006_chrI_with_features.gbff"
with open(gbk, "r") as fh:
    parsed = list(ParsedAnnotationRecord.parsed_annotation_records_to_model(parse_genbank(fh)))

rec = parsed[0]
rec

AnnotationCollection(FeatureIntervalCollection(identifiers={'site1'}, Intervals:FeatureInterval((303-337:+), name=site1)),FeatureIntervalCollection(identifiers={'tag123', 'abc123'}, Intervals:FeatureInterval((16099-16175:-), name=abc123),FeatureInterval((42502-42600:-), name=abc123)),GeneInterval(identifiers={'GI526_G0000001'}, Intervals:TranscriptInterval((16174-18079:-), cds=[None], symbol=None)),GeneInterval(identifiers={'GDH3', 'GI526_G0000002'}, Intervals:TranscriptInterval((37461-39103:+), cds=[CDS((37637-39011:+), (CDSFrame.ZERO)], symbol=GDH3)),GeneInterval(identifiers={'GI526_G0000003', 'BDH2'}, Intervals:TranscriptInterval((39518-40772:+), cds=[CDS((39518-40772:+), (CDSFrame.ZERO)], symbol=BDH2)),GeneInterval(identifiers={'BDH1', 'GI526_G0000004'}, Intervals:TranscriptInterval((41085-42503:+), cds=[None], symbol=BDH1)))

[ ]:

# The identifiers set of an AnnotationCollection includes its sequence_name and sequence_guid if applicable
rec.identifiers

{'CM021111.1'}

[ ]:

rec.identifiers_dict

{'name': 'CM021111.1'}

GUIDs

All child objects have hash-based GUID values assigned to them on instantiation if they are not provided already. These values represent a digest of the underlying data, and as a result uniquely identify any interval.

[ ]:

# this is the set of guids associated with this AnnotationCollection
rec.children_guids

{UUID('278a932c-a0a7-e31b-3156-5860ca4a4021'),
 UUID('4967ade5-6d91-faeb-79ed-e57093e4e5f2'),
 UUID('6e0c7c7c-671b-042f-7f9f-bb6f8bfc0538'),
 UUID('8ad3f444-384e-35e0-e560-aef88bd2863f'),
 UUID('a1b669f1-57f6-ae9b-8f4f-a27a6e84d15a'),
 UUID('b8fde3d6-7218-e2f4-f81d-2bcb41e6aec0')}
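To illustrate what a digest-based GUID means, here is a minimal standard-library sketch. It is not BioCantor's implementation: the helper name `digest_to_guid`, the use of MD5, and the choice of fields being hashed are all assumptions made for illustration. The point is only that identical underlying data yields the identical GUID, and any change in the data yields a different one.

```python
import hashlib
from uuid import UUID


def digest_to_guid(*fields) -> UUID:
    """Hypothetical helper: derive a deterministic UUID from the given fields."""
    # Join with a separator so that ("ab", "c") and ("a", "bc") hash differently.
    payload = "\x1f".join(str(f) for f in fields).encode("utf-8")
    # An MD5 digest is 16 bytes, which is exactly the size of a UUID.
    return UUID(bytes=hashlib.md5(payload).digest())


a = digest_to_guid(303, 337, "+", "site1")
b = digest_to_guid(303, 337, "+", "site1")  # same data as a
c = digest_to_guid(303, 338, "+", "site1")  # end coordinate differs by one

print(a == b, a == c)  # True False
```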

[ ]:

# This is the set of GeneInterval or FeatureIntervalCollection GUIDs mapped on to their
# respective TranscriptInterval or FeatureInterval GUIDs.
rec.hierarchical_children_guids

{UUID('278a932c-a0a7-e31b-3156-5860ca4a4021'): {UUID('ec17480a-c3d5-3b55-9cf9-81aafcecd9f9')},
 UUID('4967ade5-6d91-faeb-79ed-e57093e4e5f2'): {UUID('8feef6bd-4e47-2893-218d-fc13e2f6f0ba')},
 UUID('6e0c7c7c-671b-042f-7f9f-bb6f8bfc0538'): {UUID('78d4f63b-555f-e556-efba-fa7c6a8aad1b'),
  UUID('cf23b45d-e202-740c-08c6-dc0458e707ff')},
 UUID('8ad3f444-384e-35e0-e560-aef88bd2863f'): {UUID('6f3bfe10-08f8-3455-5ead-5ec2bae0c939')},
 UUID('a1b669f1-57f6-ae9b-8f4f-a27a6e84d15a'): {UUID('b9b580fe-80d9-b12b-3ff8-dfdac8b87b13')},
 UUID('b8fde3d6-7218-e2f4-f81d-2bcb41e6aec0'): {UUID('f8a3106f-523c-779a-44a6-f09a2195960c')}}
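A mapping of this shape can be inverted to find which parent a given interval GUID belongs to. The sketch below uses only the standard library and two entries copied from the output above; the name `parent_of` is hypothetical, and the inversion assumes each interval GUID appears under exactly one parent.

```python
from uuid import UUID

# Two entries copied from the hierarchical_children_guids output above.
hierarchical = {
    UUID("b8fde3d6-7218-e2f4-f81d-2bcb41e6aec0"): {
        UUID("f8a3106f-523c-779a-44a6-f09a2195960c"),
    },
    UUID("6e0c7c7c-671b-042f-7f9f-bb6f8bfc0538"): {
        UUID("78d4f63b-555f-e556-efba-fa7c6a8aad1b"),
        UUID("cf23b45d-e202-740c-08c6-dc0458e707ff"),
    },
}

# Invert the mapping: interval GUID -> GUID of the parent that contains it.
parent_of = {
    interval_guid: parent_guid
    for parent_guid, interval_guids in hierarchical.items()
    for interval_guid in interval_guids
}

print(parent_of[UUID("f8a3106f-523c-779a-44a6-f09a2195960c")])
# b8fde3d6-7218-e2f4-f81d-2bcb41e6aec0
```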

[ ]:

# All Collection objects are iterable, in a depth-first fashion
for child in rec:
    print(child)

FeatureIntervalCollection(identifiers={'site1'}, Intervals:FeatureInterval((303-337:+), name=site1))
FeatureIntervalCollection(identifiers={'tag123', 'abc123'}, Intervals:FeatureInterval((16099-16175:-), name=abc123),FeatureInterval((42502-42600:-), name=abc123))
GeneInterval(identifiers={'GI526_G0000001'}, Intervals:TranscriptInterval((16174-18079:-), cds=[None], symbol=None))
GeneInterval(identifiers={'GDH3', 'GI526_G0000002'}, Intervals:TranscriptInterval((37461-39103:+), cds=[CDS((37637-39011:+), (CDSFrame.ZERO)], symbol=GDH3))
GeneInterval(identifiers={'GI526_G0000003', 'BDH2'}, Intervals:TranscriptInterval((39518-40772:+), cds=[CDS((39518-40772:+), (CDSFrame.ZERO)], symbol=BDH2))
GeneInterval(identifiers={'BDH1', 'GI526_G0000004'}, Intervals:TranscriptInterval((41085-42503:+), cds=[None], symbol=BDH1))

[ ]:

# The property guid_map maps a child GUID to the child object
rec.guid_map

{UUID('278a932c-a0a7-e31b-3156-5860ca4a4021'): GeneInterval(identifiers={'BDH1', 'GI526_G0000004'}, Intervals:TranscriptInterval((41085-42503:+), cds=[None], symbol=BDH1)),
 UUID('4967ade5-6d91-faeb-79ed-e57093e4e5f2'): GeneInterval(identifiers={'GI526_G0000003', 'BDH2'}, Intervals:TranscriptInterval((39518-40772:+), cds=[CDS((39518-40772:+), (CDSFrame.ZERO)], symbol=BDH2)),
 UUID('6e0c7c7c-671b-042f-7f9f-bb6f8bfc0538'): FeatureIntervalCollection(identifiers={'tag123', 'abc123'}, Intervals:FeatureInterval((16099-16175:-), name=abc123),FeatureInterval((42502-42600:-), name=abc123)),
 UUID('8ad3f444-384e-35e0-e560-aef88bd2863f'): GeneInterval(identifiers={'GI526_G0000001'}, Intervals:TranscriptInterval((16174-18079:-), cds=[None], symbol=None)),
 UUID('a1b669f1-57f6-ae9b-8f4f-a27a6e84d15a'): GeneInterval(identifiers={'GDH3', 'GI526_G0000002'}, Intervals:TranscriptInterval((37461-39103:+), cds=[CDS((37637-39011:+), (CDSFrame.ZERO)], symbol=GDH3)),
 UUID('b8fde3d6-7218-e2f4-f81d-2bcb41e6aec0'): FeatureIntervalCollection(identifiers={'site1'}, Intervals:FeatureInterval((303-337:+), name=site1))}

Querying AnnotationCollection objects

AnnotationCollection objects can be queried by GUID, identifier or position. The resulting object is always a newly instantiated AnnotationCollection.

All identifier-based query functions support either a single value or a list of values.

Querying by identifiers

[ ]:

# query by child (GeneInterval/FeatureIntervalCollection) GUID
rec.query_by_guids(UUID("b8fde3d6-7218-e2f4-f81d-2bcb41e6aec0"))

AnnotationCollection(FeatureIntervalCollection(identifiers={'site1'}, Intervals:FeatureInterval((303-337:+), name=site1)))

An AnnotationCollection can also be queried directly by the GUIDs of its TranscriptInterval or FeatureInterval sub-children. The resulting object will include the parent GeneInterval or FeatureIntervalCollection. If there are multiple transcripts for a gene, only the queried transcripts will be retained.

[ ]:

rec.query_by_interval_guids(UUID("f8a3106f-523c-779a-44a6-f09a2195960c"))

AnnotationCollection(FeatureIntervalCollection(identifiers={'site1'}, Intervals:FeatureInterval((303-337:+), name=site1)))

[ ]:

# query GUIDs looking specifically for either TranscriptInterval or FeatureInterval children
# this GUID belongs to a FeatureIntervalCollection, not a TranscriptInterval, so the resulting collection is empty
queried = rec.query_by_transcript_interval_guids(UUID("b8fde3d6-7218-e2f4-f81d-2bcb41e6aec0"))
queried.is_empty

True

Querying by common identifiers is also possible. These include anything that is considered an identifier of the children, including names and locus tags.

[ ]:

queried = rec.query_by_feature_identifiers(["site1", "tag123", "GI526_G0000002", "BDH1"])
len(queried.feature_collections), len(queried.genes)

(2, 2)

Querying by position

AnnotationCollection objects can be queried by position. This query takes two forms depending on the completely_within flag.

If completely_within is True, then the returned AnnotationCollection will have child objects that are a strict subset of the provided interval. If completely_within is False, then the child objects can have any overlap with the provided interval.

If one sub-feature of a child (TranscriptInterval or FeatureInterval) does not fit the query, but another sub-feature does, then only the sub-feature(s) that fit the query will remain in the resulting collection.

See the docstring for query_by_position for more detailed information.
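The two modes can be pictured as two interval predicates. The following is a simplified sketch, not the library's code: the function name `matches` is hypothetical, coordinates are treated as half-open, and the exact handling of features that touch the query boundary is an assumption. The values are the site1 feature (303-337) and the query windows used in the cells that follow.

```python
def matches(feature_start, feature_end, start, end, completely_within=True):
    """Hypothetical predicate: does a feature satisfy a position query?"""
    if completely_within:
        # The feature must lie entirely inside the query window.
        return start <= feature_start and feature_end <= end
    # Otherwise any overlap with the query window is enough.
    return feature_start < end and start < feature_end


site1 = (303, 337)

print(matches(*site1, start=0, end=1000))                            # True
print(matches(*site1, start=0, end=330))                             # False
print(matches(*site1, start=305, end=330, completely_within=False))  # True
```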

[ ]:

# completely_within defaults to True
rec.query_by_position(start=0, end=1000)

AnnotationCollection(FeatureIntervalCollection(identifiers={'site1'}, Intervals:FeatureInterval((303-337:+), name=site1)))

[ ]:

rec.query_by_position(start=0, end=330)

AnnotationCollection()

[ ]:

rec.query_by_position(start=305, end=330, completely_within=False)

AnnotationCollection(FeatureIntervalCollection(identifiers={'site1'}, Intervals:FeatureInterval((303-337:+), name=site1)))

Representing AnnotationCollections as JSON/dictionaries

All BioCantor annotation data structures can be exported as dictionary representations, including optional sequence information. These dictionaries can be exported to JSON via marshmallow, or used to re-construct a new AnnotationCollection.

[ ]:

# exporting without Parent information generates a new collection without sequence information
rec_as_dict = rec.to_dict()
new_rec = AnnotationCollection.from_dict(rec_as_dict)
new_rec.chunk_relative_location.parent is None

True

[ ]:

# exporting with sequence allows for reconstruction of sequence information
rec_as_dict = rec.to_dict(export_parent=True)
new_rec = AnnotationCollection.from_dict(rec_as_dict)
new_rec.chunk_relative_location.parent

<Parent: id=CM021111.1, type=chromosome, strand=+, location=<SingleInterval 0-50040:+>, sequence=<CM021111.1;
  Alphabet=NT_EXTENDED_GAPPED;
  Length=50040;
  Parent=None;
  Type=chromosome>, parent=None>
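Dictionaries produced by to_dict() hold uuid.UUID values (for example under the gene_guid key), which the standard json module does not serialize by default. The text above mentions marshmallow for JSON export; as a standard-library alternative, the sketch below stringifies UUIDs with `default=str`. The small `excerpt` dictionary is a hand-written stand-in for a real export, and converting strings back to UUID on load is left to the caller.

```python
import json
from uuid import UUID

# Hand-written stand-in for a few fields of an exported gene dictionary.
excerpt = {
    "gene_guid": UUID("a1b669f1-57f6-ae9b-8f4f-a27a6e84d15a"),
    "gene_symbol": "GDH3",
    "locus_tag": "GI526_G0000002",
    "sequence_guid": None,
}

# default=str is called for any object json cannot encode, here the UUID.
as_json = json.dumps(excerpt, default=str, sort_keys=True)
round_tripped = json.loads(as_json)

# After loading, the GUID is a plain string and must be converted explicitly.
restored = UUID(round_tripped["gene_guid"])
print(restored == excerpt["gene_guid"])  # True
```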

Chunk-relative coordinates

Querying by position introduces the concept of chunk-relative Parent objects. When an AnnotationCollection is queried by position, by default the returned object will have its sequence information truncated to the window of the query.

If the query is performed with the expand_location_to_children flag set to True, then the resulting Parent object will still be reduced from the original sequence, but the bounds will be expanded to the bounds of the union of all children that satisfied the position query.
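As a rough picture of what expanding to the children's bounds means, the sketch below computes the union bounds of the intervals that overlap a query window. This is an illustration only: the function name `expanded_window` is hypothetical, overlap-style matching is assumed, and whether the library's expanded window also includes the query interval itself is not stated here.

```python
def expanded_window(children, start, end):
    """Hypothetical helper: bounds of the union of children overlapping [start, end)."""
    hits = [(s, e) for s, e in children if s < end and start < e]
    if not hits:
        return None
    return min(s for s, _ in hits), max(e for _, e in hits)


# (start, end) pairs for a few children of the record used in this notebook.
children = [(303, 337), (37461, 39103), (39518, 40772)]

print(expanded_window(children, start=37640, end=38900))  # (37461, 39103)
```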

Chunk-relative coordinates are the way in which BioCantor can hold subsets of the genome sequence in memory while still representing annotations in their original coordinate space.

When any interval object exists on chunk-relative coordinates, the actual sequence information lives on the chunk-relative Location object, and the chromosome Location object represents a lift from that coordinate space to chromosome coordinate space.
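The lift between the two coordinate spaces is, for a plus-strand chunk, a simple offset. The sketch below is a standalone illustration with a hypothetical `Chunk` class, not BioCantor's Location or Parent classes. It uses the 37000-39200 window queried in the next cell and the GDH3 transcript bounds (37461-39103).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical plus-strand sequence chunk cut from a chromosome."""

    chrom_start: int
    chrom_end: int

    def __len__(self) -> int:
        return self.chrom_end - self.chrom_start

    def to_chunk(self, chrom_pos: int) -> int:
        # Chromosome coordinate -> coordinate relative to the chunk start.
        if not self.chrom_start <= chrom_pos <= self.chrom_end:
            raise ValueError(f"{chrom_pos} is outside the chunk")
        return chrom_pos - self.chrom_start

    def to_chromosome(self, chunk_pos: int) -> int:
        # Chunk-relative coordinate -> chromosome coordinate (the "lift").
        if not 0 <= chunk_pos <= len(self):
            raise ValueError(f"{chunk_pos} is outside the chunk")
        return chunk_pos + self.chrom_start


chunk = Chunk(37000, 39200)
rel_start, rel_end = chunk.to_chunk(37461), chunk.to_chunk(39103)

print(len(chunk), rel_start, rel_end)  # 2200 461 2103
print(chunk.to_chromosome(rel_start))  # 37461
```

Only the 2200 bases of the chunk need to be held in memory, while every coordinate can still be reported in chromosome space.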

[ ]:

queried = rec.query_by_position(start=37000, end=39200, completely_within=True)

[ ]:

# original record is a full length chromosome
rec.chromosome_location

<SingleInterval <Parent: id=CM021111.1, type=chromosome, strand=+, location=<SingleInterval 0-50040:+>, sequence=<CM021111.1;
  Alphabet=NT_EXTENDED_GAPPED;
  Length=50040;
  Parent=None;
  Type=chromosome>, parent=None>:0-50040:+>

[ ]:

# queried chromosome location now represents the sub-region of the genome queried
queried.chromosome_location

<SingleInterval <Parent: id=CM021111.1, type=chromosome, strand=+, location=<SingleInterval 37000-39200:+>, sequence=None, parent=None>:37000-39200:+>

[ ]:

# the chunk relative location shows the full hierarchical relationship from
# the sequence chunk to the chromosome
queried.chunk_relative_location

<SingleInterval <Parent: id=CM021111.1:37000-39200, type=sequence_chunk, strand=+, location=<SingleInterval 0-2200:+>, sequence=<CM021111.1:37000-39200;
  Alphabet=NT_EXTENDED_GAPPED;
  Length=2200;
  Parent=<Parent: id=CM021111.1, type=chromosome, strand=+, location=<SingleInterval <Parent: id=CM021111.1, type=chromosome, strand=+, location=<SingleInterval 37000-39200:+>, sequence=None, parent=None>:37000-39200:+>, sequence=None, parent=None>;
  Type=sequence_chunk>, parent=<Parent: id=CM021111.1, type=chromosome, strand=+, location=<SingleInterval <Parent: id=CM021111.1, type=chromosome, strand=+, location=<SingleInterval 37000-39200:+>, sequence=None, parent=None>:37000-39200:+>, sequence=None, parent=None>>:0-2200:+>

[ ]:

# boolean flags let you know that you are working with something that has a chunk-relative Parent
rec.is_chunk_relative, queried.is_chunk_relative

(False, True)

[ ]:

# the resulting AnnotationCollection can be exported to a dictionary/JSON representation with
# the sub-selected sequence.
chunk_as_dict = queried.to_dict(export_parent=True)
chunk_as_dict

{'completely_within': True,
 'end': 39200,
 'feature_collections': [],
 'genes': [{'gene_guid': UUID('a1b669f1-57f6-ae9b-8f4f-a27a6e84d15a'),
   'gene_id': None,
   'gene_symbol': 'GDH3',
   'gene_type': 'protein_coding',
   'locus_tag': 'GI526_G0000002',
   'qualifiers': {'gene': ['GDH3'], 'locus_tag': ['GI526_G0000002']},
   'sequence_guid': None,
   'sequence_name': 'CM021111.1',
   'transcripts': [{'cds_ends': [39011],
     'cds_frames': ['ZERO'],
     'cds_starts': [37637],
     'exon_ends': [39103],
     'exon_starts': [37461],
     'is_primary_tx': False,
     'product': 'GDH3 isoform 1',
     'protein_id': 'KAF1903245.1',
     'qualifiers': {'codon_start': ['1'],
      'gene': ['GDH3'],
      'locus_tag': ['GI526_G0000002'],
      'note': ['CAT transcript id: T0000002; CAT alignment id: NM_001178204.1-0; CAT source transcript id: NM_001178204.1; CAT source transcript biotype: protein_coding'],
      'product': ['GDH3 isoform 1'],
      'protein_id': ['KAF1903245.1'],
      'translation': ['MTSEPEFQQAYDEIVSSVEDSKIFEKFPQYKKVLPIVSVPERIIQFRVTWENDNGEQEVAQGYRVQFNSAKGPYKGGLRFHPSVNLSILKFLGFEQIFKNALTGLDMGGGKGGLCVDLKGKSDNEIRRICYAFMRELSRHIGKDTDVPAGDIGVGGREIGYLFGAYRSYKNSWEGVLTGKGLNWGGSLIRPEATGFGLVYYTQAMIDYATNGKESFEGKRVTISGSGNVAQYAALKVIELGGIVVSLSDSKGCIISETGITSEQIHDIASAKIRFKSLEEIVDEYSTFSESKMKYVAGARPWTHVSNVDIALPCATQNEVSGDEAKALVASGVKFVAEGANMGSTPEAISVFETARSTATNAKDAVWFGPPKAANLGGVAVSGLEMAQNSQKVTWTAERVDQELKKIMINCFNDCIQAAQEYSTEKNTNTLPSLVKGANIASFVMVADAMLDQGDVF']},
     'sequence_guid': None,
     'sequence_name': 'CM021111.1',
     'strand': 'PLUS',
     'transcript_guid': None,
     'transcript_id': None,
     'transcript_interval_guid': UUID('b9b580fe-80d9-b12b-3ff8-dfdac8b87b13'),
     'transcript_symbol': 'GDH3',
     'transcript_type': 'protein_coding'}]}],
 'id': None,
 'name': 'CM021111.1',
 'parent_or_seq_chunk_parent': {'alphabet': 'NT_EXTENDED_GAPPED',
  'end': 39200,
  'seq': 'AAAACGCTTTCAAAGTTTTCTCTATAAACATACTTGTAGCAGCTGGTTTTTTTTGTTTTATTTTTAAGTTTTGTTAGGTCTCTCAGAACTTTCAAAAAAAGAAAAAGTAAAGTATAATAAAACGGAGCACTTGCCAAAGTAATTAACGCCCATTAAAAAGAAGGCATAGGAGGCATATATATATATATATGGCTGTTAACAGATATTCTGCGCTTAAAAGCTAAAAATATTATACCAACTTTTCTTTTTCTTCCCATTCAGTTTGCTTGATTGGCCCAGCTCTTTGAAGAAAGGAAAAATGCGGAGAGGGAGCCAATGAGATTTTAAAGGGTATATTACTTATCTTATCGATAAGCAGTATTGATATTAAAGGGACAGTTTTATCGTTGGTTAATATGGAAAAAGTGATGACCATGATGCCTTTCTTAAAAAGGGTATTTCTTTTAATTTCACTTTCACATAAACAGTTAATGACTTCTGACTTTGAGCCGTTCGAACTCAGTTATATAAAGGTACATACATAGGCCACACACACACACACACACACACACACACACACATATATATATATATATATATATATATATATATAGGGAAGTAGCAACAGTCACCGAAAAGAAAAGGTAAAAAGTAAAAAATGACAAGCGAACCAGAGTTTCAGCAGGCTTACGATGAGATCGTTTCTTCTGTGGAGGATTCCAAAATCTTTGAAAAATTCCCACAGTATAAAAAAGTGTTACCTATTGTTTCTGTCCCGGAGAGGATCATTCAATTCAGGGTCACGTGGGAAAATGATAATGGCGAGCAAGAAGTGGCTCAAGGATATAGGGTGCAGTTCAATTCAGCCAAGGGCCCTTACAAGGGTGGCCTACGCTTCCACCCATCAGTGAATCTGTCTATCCTAAAATTTTTGGGTTTCGAACAGATCTTCAAGAATGCGCTCACTGGGCTAGATATGGGCGGTGGTAAGGGTGGCCTGTGTGTGGACTTGAAAGGCAAGTCTGACAACGAGATCAGAAGGATTTGTTATGCGTTCATGAGAGAATTGAGCAGGCATATCGGTAAGGACACAGACGTGCCCGCAGGAGATATTGGTGTCGGTGGCCGTGAAATTGGCTACCTATTCGGCGCTTACAGATCATACAAGAACTCCTGGGAAGGTGTGTTGACTGGTAAGGGTTTAAACTGGGGTGGCTCACTTATCAGGCCGGAGGCTACCGGGTTCGGCCTAGTTTACTATACGCAAGCAATGATCGATTATGCAACAAACGGCAAGGAGTCGTTTGAGGGCAAACGTGTGACAATCTCCGGAAGTGGCAATGTTGCGCAATATGCAGCTTTAAAAGTGATCGAGCTGGGTGGTATTGTGGTGTCTTTATCCGATTCGAAGGGGTGCATCATCTCTGAGACGGGCATTACTTCTGAGCAAATTCACGATATCGCTTCCGCCAAGATCCGTTTCAAGTCGTTAGAGGAAATCGTTGATGAATACTCTACTTTCAGCGAAAGTAAGATGAAGTACGTTGCAGGAGCACGCCCATGGACGCATGTGAGCAACGTCGACATTGCCTTGCCCTGTGCTACCCAAAACGAGGTCAGTGGTGACGAAGCCAAGGCCCTAGTGGCATCTGGCGTTAAGTTCGTTGCCGAAGGTGCTAACATGGGTTCTACACCCGAGGCTATTTCTGTTTTCGAAACAGCGCGTAGCACTGCAACCAATGCCAAGGATGCAGTTTGGTTTGGGCCACCAAAGGCAGCTAACCTGGGCGGCGTGGCAGTATCCGGTCTGGAAATGGCTCAGAATTCTCAAAAAGTAACTTGGACTGCCGAGCGGGTCGATCAAGAACTAAAGAAGATAATGATCAACTGCTTCAACGACTGCATACAGGCCGCACAAGAGTACTCTACGGAAAAAAATACAAACACCTTGCCATCATTGGTCAAGGGGGCCAATATTGCCAGCTTCGTCATGGTGGCTGACGCAATGCTT
GACCAGGGAGACGTTTTTTAGCCGTAAGCGCTATTTTCTTTTTGTTCGTAACTATCTGTGTATGTATTAATGTAATCTACTTTTAATTTACTATGCAAATAGGGTTCAGCATTACGGAAGGAACTGAACTCCCTTCCGCGGAAGTTTCTTTGTAGTGACCGTGCGGGGTGAGGAGATTACATGTCGGTAATTAGATGATTAACCTAGGCA',
  'sequence_name': 'CM021111.1',
  'start': 37000,
  'strand': 'PLUS',
  'type': 'SEQUENCE_CHUNK'},
 'qualifiers': {'chromosome': ['I'],
  'country': ['USA: Boulder, CO'],
  'db_xref': ['taxon:4932'],
  'mol_type': ['genomic DNA'],
  'organism': ['Saccharomyces cerevisiae'],
  'strain': ['INSC1006']},
 'sequence_guid': None,
 'sequence_name': 'CM021111.1',
 'sequence_path': None,
 'start': 37000}

[ ]:

# this can be used to reconstitute the collection with the chunk intact.
new_queried = AnnotationCollection.from_dict(chunk_as_dict)

[ ]:

# because the query window covers the full length of the gene, the full-length ORF can be extracted from the queried collection
str(queried.genes[0].get_primary_protein())

'MTSEPEFQQAYDEIVSSVEDSKIFEKFPQYKKVLPIVSVPERIIQFRVTWENDNGEQEVAQGYRVQFNSAKGPYKGGLRFHPSVNLSILKFLGFEQIFKNALTGLDMGGGKGGLCVDLKGKSDNEIRRICYAFMRELSRHIGKDTDVPAGDIGVGGREIGYLFGAYRSYKNSWEGVLTGKGLNWGGSLIRPEATGFGLVYYTQAMIDYATNGKESFEGKRVTISGSGNVAQYAALKVIELGGIVVSLSDSKGCIISETGITSEQIHDIASAKIRFKSLEEIVDEYSTFSESKMKYVAGARPWTHVSNVDIALPCATQNEVSGDEAKALVASGVKFVAEGANMGSTPEAISVFETARSTATNAKDAVWFGPPKAANLGGVAVSGLEMAQNSQKVTWTAERVDQELKKIMINCFNDCIQAAQEYSTEKNTNTLPSLVKGANIASFVMVADAMLDQGDVF*'

[ ]:

# and from the reconstituted collection
str(new_queried.genes[0].get_primary_protein())

'MTSEPEFQQAYDEIVSSVEDSKIFEKFPQYKKVLPIVSVPERIIQFRVTWENDNGEQEVAQGYRVQFNSAKGPYKGGLRFHPSVNLSILKFLGFEQIFKNALTGLDMGGGKGGLCVDLKGKSDNEIRRICYAFMRELSRHIGKDTDVPAGDIGVGGREIGYLFGAYRSYKNSWEGVLTGKGLNWGGSLIRPEATGFGLVYYTQAMIDYATNGKESFEGKRVTISGSGNVAQYAALKVIELGGIVVSLSDSKGCIISETGITSEQIHDIASAKIRFKSLEEIVDEYSTFSESKMKYVAGARPWTHVSNVDIALPCATQNEVSGDEAKALVASGVKFVAEGANMGSTPEAISVFETARSTATNAKDAVWFGPPKAANLGGVAVSGLEMAQNSQKVTWTAERVDQELKKIMINCFNDCIQAAQEYSTEKNTNTLPSLVKGANIASFVMVADAMLDQGDVF*'

[ ]:

# The __repr__ of the intervals themselves still represents the original coordinate system
queried.genes[0]

GeneInterval(identifiers={'GDH3', 'GI526_G0000002'}, Intervals:TranscriptInterval((37461-39103:+), cds=[CDS((37637-39011:+), (CDSFrame.ZERO)], symbol=GDH3))

[ ]:

# Query a subset of the gene
queried = rec.query_by_position(start=37640, end=38900, completely_within=False)

# the resulting translation is now bounded by the window of the query
str(queried.genes[0].get_primary_protein())

'TSEPEFQQAYDEIVSSVEDSKIFEKFPQYKKVLPIVSVPERIIQFRVTWENDNGEQEVAQGYRVQFNSAKGPYKGGLRFHPSVNLSILKFLGFEQIFKNALTGLDMGGGKGGLCVDLKGKSDNEIRRICYAFMRELSRHIGKDTDVPAGDIGVGGREIGYLFGAYRSYKNSWEGVLTGKGLNWGGSLIRPEATGFGLVYYTQAMIDYATNGKESFEGKRVTISGSGNVAQYAALKVIELGGIVVSLSDSKGCIISETGITSEQIHDIASAKIRFKSLEEIVDEYSTFSESKMKYVAGARPWTHVSNVDIALPCATQNEVSGDEAKALVASGVKFVAEGANMGSTPEAISVFETARSTATNAKDAVWFGPPKAANLGGVAVSGLEMAQNSQKVTWTAERVDQELKKIMINCFNDCIQAAQE'
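The truncation can be checked with a little arithmetic. The sketch below is not how BioCantor computes the translation. It assumes a single-exon, plus-strand CDS in frame ZERO, and the helper name `codons_in_window` is hypothetical. The window starts 3 bases after the CDS start (37637), so the initial M codon is lost, and it ends 111 bases (37 codons, including the stop) before the CDS end (39011).

```python
def codons_in_window(cds_start, cds_end, start, end):
    """Hypothetical helper: whole codons of a frame-zero CDS inside [start, end)."""
    lo = max(cds_start, start)
    hi = min(cds_end, end)
    if lo >= hi:
        return 0
    # Skip forward to the next codon boundary if the window starts mid-codon.
    lo += (-(lo - cds_start)) % 3
    return max(hi - lo, 0) // 3


cds_start, cds_end = 37637, 39011

full = codons_in_window(cds_start, cds_end, 37000, 39200)
truncated = codons_in_window(cds_start, cds_end, 37640, 38900)

print(full)       # 458 (457 residues plus the stop codon)
print(truncated)  # 420
```

The 420 codons are consistent with the output above, which lacks the leading M and the C-terminal residues of the full-length protein.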