HOMEWORK 2

Mark cloned the human mitogen-activated protein kinase-activated protein kinase 3 (MAPKAPK3) gene last year (accession number NM_004635.2). Recently he found that NCBI had updated this gene. How could he find out the similarities between the new version and his old one?
1. Go to the NCBI home page and change the pull-down menu to Nucleotide. If you type MAPKAPK3 or NM_004635, the website will lead you to the current version of the gene, NM_004635.3. If you type NM_004635.2, it will lead you to the revision history; click on the link dated Jul 3 2001 1:46. The page you see is in GenBank format.
2. Convert from GenBank format to FASTA format (a common format for bioinformatics programs). Here are two different ways to convert the files:
  1. On the NCBI website, choose 'FASTA' next to 'Display' and click on the 'Display' button.
  2. Use the READSEQ program as demonstrated in class. READSEQ can be reached at Baylor College of Medicine. Copy the GenBank format of NM_004635.2 and NM_004635.3 (from the line beginning with 'LOCUS' to the line beginning with '//') and paste them into READSEQ to get the nucleotide sequences in FASTA format.
  Is there any difference between the FASTA files produced by the two methods above?
3. Compare the sequence of NM_004635.3 with that of NM_004635.2. Open NCBI BLAST 2 Sequences, copy and paste your NM_004635.2 and NM_004635.3 sequences in FASTA format into the sequence boxes, and click on Align. Which parts of the sequences are identical? Compare the results with and without the filter.
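
To make the GenBank-to-FASTA conversion in step 2 concrete, here is a minimal sketch in Python. It is not how NCBI or READSEQ do the conversion; it only assumes that the record has a 'VERSION' line, an 'ORIGIN' line followed by numbered sequence lines, and a closing '//'. The function name genbank_to_fasta and the record TOY_RECORD are made up for illustration, and the header it writes (just the versioned accession) is a simplification of what real tools produce.

```python
import re

def genbank_to_fasta(record_text, line_width=70):
    """Convert one GenBank flat-file record (LOCUS ... //) to FASTA text.

    Assumption: the header is taken from the VERSION line only; real
    converters usually add more description to the header.
    """
    header = "unknown"
    in_origin = False
    chunks = []
    for line in record_text.splitlines():
        if line.startswith("//"):
            break  # end of record
        if in_origin:
            # Sequence lines look like "   1 acgtacgtac gtacgtacgt";
            # drop the position numbers and the spaces.
            chunks.append(re.sub(r"[^A-Za-z]", "", line))
        elif line.startswith("VERSION"):
            fields = line.split()
            if len(fields) > 1:
                header = fields[1]
        elif line.startswith("ORIGIN"):
            in_origin = True
    sequence = "".join(chunks).upper()
    wrapped = [sequence[i:i + line_width]
               for i in range(0, len(sequence), line_width)]
    return ">" + header + "\n" + "\n".join(wrapped) + "\n"

# A made-up record, NOT a real GenBank entry.
TOY_RECORD = """LOCUS       TOY_0001                  24 bp    mRNA    linear
DEFINITION  Made-up record for illustration only.
VERSION     TOY_0001.1
ORIGIN
        1 acgtacgtac gtacgtacgt acgt
//
"""

print(genbank_to_fasta(TOY_RECORD, line_width=10), end="")
```

Converting both versions the same way gives two FASTA records that can be pasted into the sequence boxes in step 3.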

How could you find the genomic location of NM_004635.2?

We can use the UCSC BLAT tool. BLAT can quickly find genomic sequences of 95% or greater similarity by keeping an index of the entire genome in memory. Go to the UCSC Genome Bioinformatics website and choose Blat from the left frame to open the BLAT browser. Paste the raw sequence or the FASTA-formatted sequence obtained in the last question into the large text box, choose the Human genome, the July 2003 assembly, and DNA as the query type, then press the submit button.
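
The idea of keeping an index of the genome in memory can be illustrated with a toy k-mer lookup in Python. This is only a sketch of the general technique, not BLAT's actual implementation: the function names build_index and find_seed_hits, the tiny "genome" string, and the choice k = 4 are all assumptions made for illustration.

```python
from collections import defaultdict

def build_index(genome, k):
    """Map each non-overlapping k-mer of the genome to its start positions."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index

def find_seed_hits(index, query, k):
    """Look up every overlapping k-mer of the query in the genome index.

    Returns (query_position, genome_position) pairs; a real aligner would
    then extend and chain these seeds into full alignments.
    """
    hits = []
    for qpos in range(len(query) - k + 1):
        for gpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, gpos))
    return hits

# Made-up sequences for illustration only.
genome = "AAACCCGGGTTTACGTACGT"
query = "CCCGGGT"
k = 4

index = build_index(genome, k)
for qpos, gpos in find_seed_hits(index, query, k):
    # gpos - qpos is where the query would start on the genome
    # if the match extends without gaps.
    print("query offset", qpos, "matches genome offset", gpos,
          "-> candidate start", gpos - qpos)
```

Because the index is built once and held in memory, each query only needs dictionary lookups rather than a scan of the whole genome, which is why this kind of search is fast for near-identical sequences.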

There are 3 hits for NM_004635.2. The first one, on chromosome 3, is the best of the three because it is dramatically better in SCORE, in the length of the alignment (it misses only 10 bases, as shown by comparing the query START, END and QSIZE), and in percent IDENTITY. To obtain more information on the first hit, click on the details link. This page has three parts: the NM_004635 sequence, the genomic sequence, and the alignment of NM_004635 to the genomic sequence. The MATCHING BASES between the cDNA and the genomic sequence are in upper case and darker blue; gaps are in lower case and black. Light blue upper-case letters indicate the BOUNDARIES of the aligned regions on either side of a gap and are often splice sites.
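
The comparison of query START, END and QSIZE is simple arithmetic, sketched below in Python. The function name query_coverage is made up, the sketch assumes that START and END are 1-based and inclusive as displayed on the results page, and the numbers in the example are hypothetical values chosen to give 10 missed bases, not the actual BLAT output for NM_004635.2.

```python
def query_coverage(q_start, q_end, q_size):
    """Return (aligned bases, missed bases, fraction of query aligned).

    Assumes q_start and q_end are 1-based, inclusive query coordinates.
    """
    aligned = q_end - q_start + 1
    missed = q_size - aligned
    fraction = aligned / q_size
    return aligned, missed, fraction

# Hypothetical values, for illustration only.
aligned, missed, fraction = query_coverage(q_start=1, q_end=2490, q_size=2500)
print(aligned, "bases aligned,", missed, "bases missed")
```

A hit that covers nearly the whole query with a high percent IDENTITY is a much stronger candidate for the true genomic location than a short partial match.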