Solutions to sequence analysis exercises - Lecture 3

WIBR Sequence Analysis Course 2004

EXERCISES

Part I. Browsing for genomic information

 

6. Human PIK4CA is on the (-) strand of chromosome 22.

7. According to RefSeq, there are 2 transcripts. According to Ensembl, there are 4 transcripts.

9. The longest transcript of PIK4CA has 54 exons. This is actually shown most clearly on the Ensembl ExonView page for ENST00000255882.

10. The longest intron appears to contain the gene SERPIND1 on the positive strand.

11. Human mRNAs show evidence of some of the transcripts, including the longest one. Human spliced ESTs show evidence of most exons, especially at the 3' end of the gene, but there's no EST evidence for the first two exons. The longest mRNA (AF012872) links to a 1998 article called "Phosphatidylinositol 4-kinases" (Can you find the link?), so you may believe the data more than if it had come from an automated sequencing project. Interestingly, the L36151 mRNA (corresponding to the shorter RefSeq sequence) was described in 1994 as being a complete sequence. Were they correct?

12. What the links do:
Ensembl: links to the Ensembl Browser for the same genomic region
Map View: links to the NCBI Browser for the same genomic region
PS/PDF: links to high-resolution graphics of the browser

Part II. Extracting annotated genomic sequence

 

2. The genomic coordinates of NM_058004 are chr22:19,386,545-19,517,555.

3. The length of NM_058004 (including introns) is 131,011 bp (as displayed in the top center of the browser).

5. The genomic coordinates of NM_058004, including 5 kb upstream and 1 kb downstream, are chr22:19,385,545-19,522,555.
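
The arithmetic behind answers 3 and 5 can be checked with a short Python sketch. The function names (span_length, add_flanks) are illustrative only and are not part of any browser tool; the sketch assumes the browser's displayed coordinates are 1-based and inclusive, which is consistent with the 131,011 bp length quoted above. Because PIK4CA is on the (-) strand, "upstream" extends the higher coordinate and "downstream" extends the lower one.

```python
def span_length(start, end):
    """Length of a 1-based, inclusive genomic interval (assumed convention)."""
    return end - start + 1


def add_flanks(start, end, strand, upstream, downstream):
    """Extend an interval by flanks defined relative to the transcript's strand."""
    if strand == "+":
        return start - upstream, end + downstream
    if strand == "-":
        # On the minus strand the 5' (upstream) side is the higher coordinate.
        return start - downstream, end + upstream
    raise ValueError("strand must be '+' or '-'")


# Coordinates of NM_058004 from answer 2.
start, end = 19_386_545, 19_517_555

print(span_length(start, end))  # 131011
print(add_flanks(start, end, "-", upstream=5_000, downstream=1_000))
# (19385545, 19522555)
```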

9. Colors for annotated genomic sequence of NM_058004:
Black bases: introns or adjacent genomic DNA
Red bases: exons supported by RefSeq data alone
Blue bases: parts of exons supported by Ensembl data alone
Underlined bases: exons supported by Spliced EST data
Magenta (violet-ish) bases: exons supported by both RefSeq and Ensembl data
So the exact gene structure is not obvious; disagreements could be explained in part by more alternative splicing or by mismapping of ESTs.
Knowing more about the protein sequence of PIK4CA could help resolve at least some of these disagreements.
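
One way to read the legend above is as a small decision rule applied to each base. The Python sketch below is only an illustration of that reading; the function name base_style is hypothetical, and it assumes that underlining (spliced EST support) is applied independently of the color.

```python
def base_style(refseq, ensembl, spliced_est):
    """Return (color, underlined) for one base, following the legend above.

    Assumption: underlining for spliced EST support is independent of color.
    """
    if refseq and ensembl:
        color = "magenta"  # exon supported by both RefSeq and Ensembl
    elif refseq:
        color = "red"      # exon supported by RefSeq alone
    elif ensembl:
        color = "blue"     # exon supported by Ensembl alone
    else:
        color = "black"    # intron or adjacent genomic DNA
    return color, spliced_est


print(base_style(refseq=True, ensembl=True, spliced_est=True))
# ('magenta', True)
print(base_style(refseq=False, ensembl=False, spliced_est=True))
# ('black', True)
```

A base that comes out black but underlined is one where only spliced ESTs suggest an exon, which is the kind of disagreement mentioned above.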

Part III. Gene-finding with comparative mammalian genomics

 

1. The location of mouse PIK4CA is chr16:16,884,836-16,899,848 on the negative strand.

5. Sequence BC049252 has 16 exons. It looks like it corresponds to the 3' end of the full-length gene but is missing many exons at the 5' end (of some longer transcripts, at least).

7. There's lots of EST evidence that could help extend mouse PIK4CA in the 5' direction.

9. A BLAT alignment of human PIK4CA to the mouse genome aligns to the coordinates chr16:16,884,843-16,993,897.
This looks much more like a full-length transcript (or at least the longest transcript for this gene).
There is EST evidence to support most of the exons, although some of it is contradictory.
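
To put numbers on the difference between answers 1 and 9, the two mouse spans can be compared directly. This is a minimal Python sketch using only the coordinates quoted above; the helper name span_length is illustrative, and 1-based inclusive coordinates are assumed. Since the gene is on the negative strand, its 5' end is the higher coordinate.

```python
def span_length(start, end):
    """Length of a 1-based, inclusive genomic interval (assumed convention)."""
    return end - start + 1


annotated = (16_884_836, 16_899_848)  # mouse PIK4CA location, answer 1
blat = (16_884_843, 16_993_897)       # BLAT alignment of human PIK4CA, answer 9

print(span_length(*annotated))  # 15013
print(span_length(*blat))       # 109055

# Negative strand: the 5' end is the higher coordinate, so this is how much
# farther the BLAT alignment reaches in the 5' direction.
print(blat[1] - annotated[1])   # 94049
```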

10. The longest intron contains SERPIND1 on the other strand. This is not really unexpected, since one often finds conserved synteny between human and mouse. In fact, you can find the same gene-within-a-gene in the Rat Browser too.

Part IV. Gene and genome analysis through annotation

 

4. 7 genes are classified as "transforming growth factor-beta receptor binding" (GO:0005160).
114 genes are classified as showing "growth factor activity" (GO:0008083).

