Solutions to sequence analysis exercises - Lecture 3

WIBR Sequence Analysis Course 2004

EXERCISES

Part I. Browsing for genomic information

2. Human PIK4CA is on the (-) strand of chromosome 22.

4. According to RefSeq, there are 2 transcripts. According to Ensembl, there is 1 transcript.

5. The longest transcript of PIK4CA has 54 exons. This is actually shown most clearly on the Ensembl ExonView page for ENST00000255882.

6. The longest intron appears to contain the gene SERPIND1 on the positive strand.

7. Human mRNAs support some of the transcripts, including the longest one. Human spliced ESTs support most exons, especially at the 3' end of the gene, but only one EST supports the first two exons. The longest mRNA (AF012872) links to a 1998 article called "Phosphatidylinositol 4-kinases" (Can you find the link?), so you may believe the data more than if it had come from an automated sequencing project. Interestingly, the L36151 mRNA (corresponding to the shorter RefSeq sequence) was described in 1994 as being a complete sequence. Were they correct?

8. What the links do:
Ensembl: links to the Ensembl Browser for the same genomic region
Map View: links to the NCBI Browser for the same genomic region
PS/PDF: links to high-resolution graphics of the browser

Part II. Extracting annotated genomic sequence

2. The genomic coordinates of NM_058004 are chr22:19,386,545-19,517,555.

3. The length of NM_058004 (including introns) is 131,011 bp (as displayed in the top center of the browser).
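The length shown in the browser can be checked by hand from the coordinates in the previous answer. The sketch below assumes the position string uses 1-based coordinates with both endpoints included; the helper names `parse_region` and `region_length` are made up for this illustration.

```python
import re

def parse_region(region):
    """Parse 'chrom:start-end' (commas allowed) into (chrom, start, end)."""
    match = re.fullmatch(r"(\w+):([\d,]+)-([\d,]+)", region)
    if match is None:
        raise ValueError(f"unrecognized region: {region!r}")
    chrom, start, end = match.groups()
    return chrom, int(start.replace(",", "")), int(end.replace(",", ""))

def region_length(region):
    """Length in bp, assuming 1-based coordinates with both ends included."""
    _, start, end = parse_region(region)
    # Both endpoints are counted, hence the +1.
    return end - start + 1

print(region_length("chr22:19,386,545-19,517,555"))  # 131011
```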

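5. The genomic coordinates of NM_058004, including 5 kb upstream and 1 kb downstream, are chr22:19,385,545-19,522,555. Because the gene is on the (-) strand, the 5 kb of upstream sequence extends the higher coordinate and the 1 kb of downstream sequence extends the lower one.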
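The padded coordinates depend on the strand of the gene, which is easy to get backwards. A minimal sketch of the arithmetic follows; the function name `add_flanks` is hypothetical, and the interval is assumed to be 1-based and inclusive.

```python
def add_flanks(start, end, strand, upstream, downstream):
    """Extend a 1-based inclusive interval by strand-aware flanks (in bp)."""
    if strand == "+":
        new_start, new_end = start - upstream, end + downstream
    elif strand == "-":
        # A minus-strand gene reads from high to low coordinates,
        # so its upstream side lies at the higher coordinates.
        new_start, new_end = start - downstream, end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Coordinates cannot go below the first base of the chromosome.
    return max(1, new_start), new_end

# NM_058004 on the (-) strand, 5 kb upstream and 1 kb downstream
print(add_flanks(19_386_545, 19_517_555, "-", 5_000, 1_000))
# (19385545, 19522555)
```

Applying the same flanks to a (+) strand gene would instead subtract 5 kb from the start and add 1 kb to the end.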

9. Colors for annotated genomic sequence of NM_058004:
Black bases: introns or adjacent genomic DNA
Red bases: exons supported by RefSeq data alone
Blue bases: parts of exons supported by Ensembl data alone
Underlined bases: exons supported by Spliced EST data
Magenta (violet-ish) bases: exons supported by both RefSeq and Ensembl data
So the exact gene structure is not obvious; the disagreements could be explained in part by additional alternative splicing or by mismapping of ESTs.
Knowing more about the protein sequence of PIK4CA could help resolve at least some of these disagreements.
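One way to read the color legend is as a small decision rule: the color reflects RefSeq and Ensembl support, and the underline independently reflects spliced EST support. The sketch below only encodes that reading of the legend; it is not the browser's own code, and the function name `base_style` is made up for illustration.

```python
def base_style(in_refseq_exon, in_ensembl_exon, in_spliced_est):
    """Return (color, underlined) for one base, following the legend above."""
    if in_refseq_exon and in_ensembl_exon:
        color = "magenta"   # supported by both RefSeq and Ensembl
    elif in_refseq_exon:
        color = "red"       # supported by RefSeq alone
    elif in_ensembl_exon:
        color = "blue"      # supported by Ensembl alone
    else:
        color = "black"     # intron or adjacent genomic DNA
    # Underlining is independent of the color.
    return color, bool(in_spliced_est)

print(base_style(True, True, True))     # ('magenta', True)
print(base_style(False, True, False))   # ('blue', False)
print(base_style(False, False, False))  # ('black', False)
```

Bases that come out red or blue, or underlined black, are the ones where the annotation sources disagree.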

Part III. Gene-finding with comparative mammalian genomics

1. The location of human LOC51149 (TCBP) is chr5:179196882-179201626 on the negative strand.

2. Clicking on the gene leads to a page of gene information and links to databases.

3. NM_016175 has 3 exons. From this view, it looks like a full-length transcript.

4. From this perspective, even though NM_016175 could be a real transcript, there are much longer transcripts of this gene. A transcript like BC069051 (under Known Genes and MGC) or either of the Ensembl transcripts has a transcription start site far upstream of that of NM_016175.

6. A BLAT search against the mouse genome should bring you to chr11:49,816,572-49,822,321, which contains two exons. Following the "details" link of the "BLAT Search Results" page (or just looking at the BLAT statistics), however, shows that BLAT maps only a small part of the human gene to the mouse genome. As a result, we'd guess that the mouse region we've mapped to isn't the whole gene.

7. On the mouse browser, zooming out 10x shows that the full-length ortholog appears to be in the RefSeq, MGC, and Ensembl databases. "Downstream" of these mouse transcripts (relative to their direction, which is towards the right) is a gene called Sqstm1 (sequestosome 1). Zooming out in the human browser and looking downstream (in this case, to the left) of TCBP (our original gene) also shows a gene called Sqstm1. If you're still skeptical, zoom out both browsers further and compare the order of named genes. This phenomenon is generally referred to as conserved synteny.
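The "downstream" comparison can be sketched as a lookup of the next gene in the direction of transcription. In the example below the positions are toy numbers chosen only to reproduce the left/right layout described in the answer, not real coordinates, and the name `Tcbp_ortholog` and the function `downstream_neighbor` are placeholders.

```python
def downstream_neighbor(genes, name, strand):
    """Return the gene next to `name` in its direction of transcription.

    genes: list of (gene_name, start_position) on one chromosome.
    strand: '+' (reads left to right) or '-' (reads right to left).
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    ordered = [g for g, _ in sorted(genes, key=lambda item: item[1])]
    i = ordered.index(name)
    # Downstream is to the right on '+', to the left on '-'.
    j = i + 1 if strand == "+" else i - 1
    return ordered[j] if 0 <= j < len(ordered) else None

# Toy layouts: only the relative order matters here.
human = [("Sqstm1", 100), ("TCBP", 200)]          # TCBP is on the (-) strand
mouse = [("Tcbp_ortholog", 100), ("Sqstm1", 200)]  # mouse transcripts point right

print(downstream_neighbor(human, "TCBP", "-"))           # Sqstm1
print(downstream_neighbor(mouse, "Tcbp_ortholog", "+"))  # Sqstm1
```

Repeating the lookup for several neighbors on each side is the programmatic equivalent of zooming out both browsers and comparing the order of named genes.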

Part IV. Gene and genome analysis through annotation

1. According to Vega, there are 2 transcripts. According to Ensembl, there is only 1.

5. 135 human genes are classified as showing "growth factor activity" (GO:0008083).
