SEQUENCE ANALYSIS EXERCISES

II. Pairwise alignment and database searching

Perform a self-comparison of the human haptoglobin sequence with dottup, an EMBOSS interface at the German Research Center for Biotechnology. What do the results show?

>Human haptoglobin alpha(2FS)-beta protein
MSALGAVIALLLWGQLFAVDSGNDVTDIADDGCPKPPEIAHGYVEHSVRYQ
CKNYYKLRTEGDGVYTLNDKKQWINKAVGDKLPECEADDGCPKPPEIAHGY
VEHSVRYQCKNYYKLRTEGDGVYTLNNEKQWINKAVGDKLPECEAVCGKPK
NPANPVQRILGGHLDAKGSFPWQAKMVSHHNLTTGATLINEQWLLTTAKNL
FLNHSENATAKDIAPTLTLYVGKKQLVEIEKVVLHPNYSQVDIGLIKLKQK
VSVNERVMPICLPSKDYAEVGRVGYVSGWGRNANFKFTDHLKYVMLPVADQ
DQCIRHYEGSTVPEKKTPKSPVGVQPILNEHTFCAGMSKYQEDTCYGDAGS
AFAVHDLEEDTWYATGILSFDKSCAVAEYGVYVKVTSIQDWVQKTIAEN
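
To get a feel for what a word-match dot plot reports, here is a minimal Python sketch. It is not the dottup implementation: the function names, the word size, and the toy sequence are all invented for illustration. It lists exact word matches between a sequence and itself and counts them per diagonal offset, so a repeated segment shows up as many hits at one offset away from the main diagonal.

```python
def word_matches(seq_a, seq_b, word_size=10):
    """Return (i, j) pairs where seq_a[i:i+word_size] == seq_b[j:j+word_size]."""
    # Index every word of seq_b by the positions where it starts.
    index = {}
    for j in range(len(seq_b) - word_size + 1):
        index.setdefault(seq_b[j:j + word_size], []).append(j)
    hits = []
    for i in range(len(seq_a) - word_size + 1):
        for j in index.get(seq_a[i:i + word_size], []):
            hits.append((i, j))
    return hits


def off_diagonal_offsets(seq, word_size=10):
    """Count self-comparison hits per diagonal offset (j - i), skipping the main diagonal."""
    counts = {}
    for i, j in word_matches(seq, seq, word_size):
        if j > i:  # the plot is symmetric, so one half is enough
            counts[j - i] = counts.get(j - i, 0) + 1
    return counts


# Toy sequence (hypothetical, not from the exercise) with a repeated 12-residue block.
toy = "MKT" + "ACDEFGHIKLMN" + "PQRS" + "ACDEFGHIKLMN" + "WY"
print(off_diagonal_offsets(toy, word_size=6))
```

Running the same function on the haptoglobin sequence above, with the FASTA header and line breaks removed, gives a text-only counterpart to the dottup plot.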

Compare the alignment scores obtained with small and large gap penalties in the following example.

>Drosophila melanogaster Odorant receptor 85e (Or85e)   
MASLQFHGNVDADIRYDISLDPARESNLFRLLMGLQLANGTKPSPRLPKW
WPKRLEMIGKVLPKAYCSMVIFTSLHLGVLFTKTTLDVLPTGELQAITDA
LTMTIIYFFTGYGTIYWCLRSRRLLAYMEHMNREYRHHSLAGVTFVSSHA
AFRMSRNFTVVWIMSCLLGVISWGVSPLMLGIRMLPLQCWYPFDALGPGT
YTAVYATQLFGQIMVGMTFGFGGSLFVTLSLLLLGQFDVLYCSLKNLDAH
TKLLGGESVNGLSSLQEELLLGDSKRELNQYVLLQEHPTDLLRLSAGRKC
PDQGNAFHNALVECIRLHRFILHCSQELENLFSPYCLVKSLQITFQLCLL
VFVGVSGTREVLRIVNQLQYLGLTIFELLMFTYCGELLSRHSIRSGDAFW
RGAWWKHAHFIRQDILIFLVNSRRAVHVTAGKFYVMDVNRLRSVITQAFS
FLTLLQKLAAKKTESEL


>Drosophila melanogaster Odorant receptor 23a (Or23a)
MKLSETLKIDYFRVQLNAWRICGALDLSEGRYWSWSMLLCILVYLPTPMLL
RGVYSFEDPVENNFSLSLTVTSLSNLMKFCMYVAQLTKMVEVQSLIGQLDA
RVSGESQSERHRNMTEHLLRMSKLFQITYAVVFIIAAVPFVFETELSLPMP
MWFPFDWKNSMVAYIGALVFQEIGYVFQIMQCFAADSFPPLVLYLISEQCQ
LLILRISEIGYGYKTLEENEQDLVNCIRDQNALYRLLDVTKSLVSYPMMVQ
FMVIGINIAITLFVLIFYVETLYDRIYYLCFLLGITVQTYPLCYYGTMVQE
SFAELHYAVFCSNWVDQSASYRGHMLILAERTKRMQLLLAGNLVPIHLSTY
VACWKGAYSFFTLMADRDGLGS

Align the two sequences above with Stretcher, an interface for global alignment at the German Research Center for Biotechnology. For help on this program, click the large ? and then, on the next page, the Go button.
Repeat the alignment with Water, an interface for Smith-Waterman local alignment at the German Research Center for Biotechnology. Choose the BLOSUM62 matrix for the comparison.
What happens to the local alignment when you lower the gap penalty to 1?
Compare the % identity, % similarity, and score for the three alignments. What can you conclude?
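
The effect of the gap penalty on a local alignment score can be sketched in a few lines of Python. This is a simplified illustration, not what Water computes: it assumes a flat match/mismatch score instead of BLOSUM62 and a single linear gap penalty instead of separate opening and extension penalties, and the two short sequences are invented.

```python
def local_score(a, b, match=2, mismatch=-1, gap=10):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] - gap, curr[j - 1] - gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical toy sequences: y is x with two extra residues inserted.
x = "ACDEFGHIKL"
y = "ACDEFWWGHIKL"
for gap in (10, 1):
    print("gap penalty", gap, "score", local_score(x, y, gap=gap))
```

With the large penalty the best local alignment stays ungapped and covers only one of the two matching blocks; with the small penalty it pays for two gap positions and joins both blocks into one longer, higher-scoring alignment.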

Janet cloned the human mitogen-activated protein kinase-activated protein kinase 3 (MAPKAPK3) gene last year (accession number NM_004635.2). Recently she found that NCBI had updated this gene. How could she find out how similar the new version is to the old one?
- Go to the NCBI home page and change the pull-down menu to Nucleotide. If you type MAPKAPK3 or NM_004635, the website will lead you to the current version of the gene, NM_004635.3; if you type NM_004635.2 into the text box and click Go, it will lead you to the history. Click on revision history, then click the link at Jul 3 2001 1:46. The page you see is in GenBank format.
- Change from GenBank format to FASTA format (a common format for bioinformatics programs). Try these two different ways of converting the files:
  1. On the NCBI website, choose 'FASTA' next to 'Display'. Select 'Text' and click the 'Send to' button.
  2. Use the READSEQ program at Baylor College of Medicine. Copy the GenBank format of NM_004635.2 and NM_004635.3 (from the line beginning with 'LOCUS' through the line beginning with '//') and paste them into READSEQ to get the nucleotide sequences in FASTA format.
  3. Is there any difference between the FASTA files produced by the two methods?
- Compare the sequence of NM_004635.3 with that of NM_004635.2. Click on NCBI BLAST 2 Sequences. Copy and paste your NM_004635.2 and NM_004635.3 sequences in FASTA format into the sequence boxes, and click Align. Which parts of the sequences are identical? Compare the results with and without the filter.
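
The GenBank-to-FASTA conversion described above can also be sketched by hand. The Python below is a minimal illustration, not a replacement for READSEQ: the function name and the toy record are invented, it handles one record at a time, and it keeps only the first line of a multi-line DEFINITION.

```python
def genbank_to_fasta(record_text, width=70):
    """Convert one GenBank flat-file record (LOCUS ... //) to FASTA text."""
    name = "unknown"
    definition = ""
    bases = []
    in_origin = False
    for line in record_text.splitlines():
        if line.startswith("VERSION"):
            fields = line.split()
            if len(fields) > 1:
                name = fields[1]  # accession with version, e.g. NM_004635.2
        elif line.startswith("DEFINITION"):
            definition = line[len("DEFINITION"):].strip()
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            break
        elif in_origin:
            # Sequence lines carry a position number and blocks of 10 bases;
            # keep the letters and drop the digits and spaces.
            bases.append("".join(ch for ch in line if ch.isalpha()))
    seq = "".join(bases)
    header = (">" + name + " " + definition).rstrip()
    wrapped = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + wrapped) + "\n"


# Hypothetical miniature record, for illustration only.
toy_record = (
    "LOCUS       TOY1   12 bp\n"
    "DEFINITION  Toy record.\n"
    "VERSION     TOY1.1\n"
    "ORIGIN\n"
    "        1 acgtacgtac gt\n"
    "//\n"
)
print(genbank_to_fasta(toy_record))
```

Header text and line width are conventions rather than part of the sequence, so two converters can produce FASTA files that look different while holding the same bases.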

How could you find the genomic location of NM_004635.2?

We can use the UCSC BLAT tool. BLAT can quickly find genomic sequences of 95% or greater similarity by keeping an index of the entire genome in memory. Go to the UCSC Genome Bioinformatics website and choose Blat from the left frame to open the BLAT Browser. Paste the raw sequence or FASTA-formatted sequence obtained in the last question into the large text box, choose the human genome, the July 2003 assembly, and DNA as the query type, then press the submit button.

There are 3 hits for NM_004635.2. The first one, on chromosome 3, is the best of the three because of the dramatic differences in the SCORE, the length of the alignment (only 10 bases are missed, as shown by comparing the query START and END with QSIZE), and the percent IDENTITY. To obtain more information on the first hit, click the details link. This page has three parts: the NM_004635 sequence, the genomic sequence, and the alignment of NM_004635 to the genomic sequence. MATCHING BASES between the cDNA and the genomic sequence are in upper case and darker blue; gaps are in lower case and black. Light blue upper-case letters mark the BOUNDARIES of the aligned regions on either side of a gap and are often splice sites.
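
The "missed bases" figure comes from simple arithmetic on the query START, END, and QSIZE columns. A small Python sketch follows; it assumes START and END are 1-based and inclusive, and the numbers in the example are hypothetical, not copied from the real BLAT output.

```python
def unaligned_query_bases(q_start, q_end, q_size):
    """Query bases that fall outside the aligned span START..END."""
    before = q_start - 1      # bases ahead of the first aligned base
    after = q_size - q_end    # bases beyond the last aligned base
    return before + after


# Hypothetical values chosen for illustration only.
print(unaligned_query_bases(q_start=1, q_end=2490, q_size=2500))
```

A hit that starts at base 1 and ends 10 bases short of QSIZE therefore leaves 10 query bases unaligned, all of them at the 3' end.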

The following sequence was published in Michael Crichton's book The Lost World. The sequence was generated by Mark Boguski, a bioinformaticist (then at NCBI) who was a consultant for Mr. Crichton. Mark played a little joke in creating this sequence. Do a blastx search and look carefully at the alignment to see the hidden message. (Hint: Pay particular attention to the gaps.) What message did you find?

>LostWorld DinoDNA from the book The Lost World
gaattccgga agcgagcaag agataagtcc tggcatcaga tacagttgga gataaggacg
gacgtgtggc agctcccgca gaggattcac tggaagtgca ttacctatcc catgggagcc
atggagttcg tggcgctggg ggggccggat gcgggctccc ccactccgtt ccctgatgaa
gccggagcct tcctggggct gggggggggc gagaggacgg aggcgggggg gctgctggcc
tcctaccccc cctcaggccg cgtgtccctg gtgccgtggg cagacacggg tactttgggg
accccccagt gggtgccgcc cgccacccaa atggagcccc cccactacct ggagctgctg
caaccccccc ggggcagccc cccccatccc tcctccgggc ccctactgcc actcagcagc
gggcccccac cctgcgaggc ccgtgagtgc gtcatggcca ggaagaactg cggagcgacg
gcaacgccgc tgtggcgccg ggacggcacc gggcattacc tgtgcaactg ggcctcagcc
tgcgggctct accaccgcct caacggccag aaccgcccgc tcatccgccc caaaaagcgc
ctgcgggtga gtaagcgcgc aggcacagtg tgcagccacg agcgtgaaaa ctgccagaca
tccaccacca ctctgtggcg tcgcagcccc atgggggacc ccgtctgcaa caacattcac
gcctgcggcc tctactacaa actgcaccaa gtgaaccgcc ccctcacgat gcgcaaagac
ggaatccaaa cccgaaaccg caaagtttcc tccaagggta aaaagcggcg ccccccgggg
gggggaaacc cctccgccac cgcgggaggg ggcgctccta tggggggagg gggggacccc
tctatgcccc ccccgccgcc ccccccggcc gccgcccccc ctcaaagcga cgctctgtac
gctctcggcc ccgtggtcct ttcgggccat tttctgccct ttggaaactc cggagggttt
tttggggggg gggcgggggg ttacacggcc cccccggggc tgagcccgca gatttaaata
ataactctga cgtgggcaag tgggccttgc tgagaagaca gtgtaacata ataatttgca
cctcggcaat tgcagagggt cgatctccac tttggacaca acagggctac tcggtaggac
cagataagca ctttgctccc tggactgaaa aagaaaggat ttatctgttt gcttcttgct
gacaaatccc tgtgaaaggt aaaagtcgga cacagcaatc gattatttct cgcctgtgtg
aaattactgt gaatattgta aatatatata tatatatata tatatctgta tagaacagcc
tcggaggcgg catggaccca gcgtagatca tgctggattt gtactgccgg aattc
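
A blastx search translates the nucleotide query in all six reading frames before comparing it with protein sequences. The Python sketch below shows only that translation step, using the standard genetic code. The function names are invented for illustration, and any codon containing a letter other than A, C, G, or T is written as X.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the first base; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate both strands in all three frames, keyed '+1'..'+3' and '-1'..'-3'."""
    dna = "".join(dna.split())  # drop the spaces and line breaks used in the listing
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[strand + str(offset + 1)] = translate(seq[offset:])
    return frames


print(six_frames("atggcc taa"))
```

Passing the DinoDNA sequence (without its header line) to `six_frames` shows the six protein strings that blastx has to work with.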

If you are familiar with Unix, log on to your hebrides account and try these exercises from the command line. You will first need to create files containing the sequences. Then type the following commands at the Unix prompt:
Problem 1. dottup
Problem 2. stretcher and then water
Problem 3. bl2seq -i filename1 -j filename2 -p blastn
Problem 4. N/A
Problem 5. blastall -p blastx -i dino.txt -d nr -o dino.out
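
If you prefer to prepare the sequence files with a script rather than a text editor, a short Python sketch such as the following can format a sequence as FASTA text. The function name and the 60-character line width are arbitrary choices made for this illustration.

```python
def format_fasta(name, sequence, width=60):
    """Return FASTA text: a '>' header line followed by the sequence in fixed-width lines."""
    seq = "".join(sequence.split())  # remove spaces and line breaks
    lines = [">" + name]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"


print(format_fasta("toy", "ACGT ACGT AC", width=4))
```

Writing the returned text to a file, for example dino.txt for Problem 5, gives a file you can pass to the command-line programs.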