SEQUENCE ANALYSIS EXERCISES - Lecture 3

WIBR Sequence Analysis Course 2004


Part I. Browsing for genomic information

  1. Find the human gene PIK4CA in the UCSC Genome Browser:
  2. Relative to the reference chromosome, what strand is the gene on?
  3. Try clicking on the gene structures to see what information is linked.
  4. How many transcripts are there, according to (a) RefSeq, or (b) Ensembl?
  5. How many exons does the longest transcript have?
  6. Look at the longest intron. What do you see there?
  7. Under "mRNA and EST Tracks", set the "Spliced ESTs" track to squish, pack, or full. Can you find expression evidence for these transcripts?
  8. At the top of the page, what do the "Ensembl", "Map View", and "PDF/PS" links do?
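
The exon and intron questions above (items 5 and 6) can also be checked by hand once you have a transcript's exon coordinates. The following is a minimal Python sketch, not part of the original exercise: the function name intron_lengths and the exon coordinates are hypothetical and do not describe PIK4CA. It assumes exons are given as 1-based, inclusive (start, end) pairs.

```python
def intron_lengths(exons):
    """Return intron lengths, in genomic order, for a list of exons.

    exons: list of (start, end) pairs, assumed 1-based and inclusive.
    An intron is the gap between the end of one exon and the start of the next.
    """
    ordered = sorted(exons)
    return [next_start - prev_end - 1
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])]


# Hypothetical exon coordinates for illustration only (not a real transcript).
exons = [(100, 200), (1200, 1300), (1500, 1600)]
lengths = intron_lengths(exons)

print(len(exons))     # number of exons: 3
print(lengths)        # intron lengths: [999, 199]
print(max(lengths))   # longest intron: 999
```

A transcript with n exons has n - 1 introns, so the longest intron is simply the largest gap between consecutive exons.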

Part II. Extracting annotated genomic sequence

  1. Enter NM_058004 (the RefSeq ID of the longer of the two transcripts) into the position box and click on "jump" so that the browser shows the full width of the gene.
  2. What are the genomic coordinates of this transcript?
  3. How long is the gene (in genomic context, rather than in cDNA context)?
  4. To extract the genomic sequence of the PIK4CA gene, including 5 kb upstream and 1 kb downstream of NM_058004, adjust the coordinates in the position window.
    • Since the gene is on the negative strand, its 5' end lies at the higher coordinate, so adding 5000 to the second coordinate (y, where the position is chr22:x-y) will expand the window to include 5 kb upstream.
    • Subtracting 1000 from the first coordinate (x) will extend the view 1 kb downstream, past the 3' end.
  5. What are the expanded coordinates?
  6. At the top of the page, click on the "DNA" link and note that you could adjust the coordinates at this time too.
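
The coordinate arithmetic in items 3 to 5 can be sketched in Python as below. This is an illustration only, under two assumptions: positions are treated as 1-based and inclusive (so a gene's length is end - start + 1), and the coordinates used are hypothetical, not the real position of NM_058004. The function names gene_length and expand_window are made up for this sketch.

```python
def gene_length(start, end):
    """Length in bases of an interval start..end, assumed 1-based and inclusive."""
    return end - start + 1


def expand_window(start, end, strand, upstream=5000, downstream=1000):
    """Widen start..end by flanks measured relative to the gene's strand.

    On the '+' strand, upstream is toward lower coordinates.
    On the '-' strand, upstream is toward higher coordinates.
    """
    if strand == "+":
        new_start, new_end = start - upstream, end + downstream
    elif strand == "-":
        new_start, new_end = start - downstream, end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Chromosome coordinates cannot go below 1.
    return max(1, new_start), new_end


# Hypothetical coordinates for illustration only.
start, end = 2_000_000, 2_100_000

print(gene_length(start, end))          # 100001
print(expand_window(start, end, "-"))   # (1999000, 2105000)
print(expand_window(start, end, "+"))   # (1995000, 2101000)
```

Note how the same 5 kb upstream flank is added to the second coordinate for a negative-strand gene but subtracted from the first coordinate for a positive-strand gene.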

Part III. Gene-finding with comparative mammalian genomics

  1. Find the mouse gene PIK4CA in the UCSC Genome Browser using the latest assembly (Oct 2003).
  2. Once you do the search, select the choice under "Known Genes".
  3. As before, change Known Genes, RefSeq Genes, and MGC Genes to "full" and click on "refresh" or "jump".
  4. Note that the Ensembl track is missing: Ensembl's annotation pipeline for this assembly is still in progress.
  5. How many exons does the BC049252 transcript have? Do you think it's the whole gene?
  6. Zoom out 3X and turn on the "Spliced ESTs" tracks to full.
  7. Can you convince yourself whether or not BC049252 is a full-length transcript? If it is not, how could you identify any longer transcripts?
  8. Using the sequence of the longest transcript of HUMAN PIK4CA (NM_058004), search the latest mouse genome with BLAT.
  9. Does this look like the longest transcript of mouse PIK4CA? Why or why not?
  10. Look at the longest intron of the human sequence (shown in the browser as "BLAT sequence"). What do you see there? Should this finding be unexpected?

Part IV (supplementary). Gene and genome analysis through annotation

  1. Find the human gene TGFB3 (Transforming growth factor beta 3) in the Ensembl project:
  2. Follow the link under Genomic Location to view the gene in its genomic location. This presentation of data should look somewhat familiar.
  3. Go back to the GeneView page.
  4. Peruse the GeneView page, noting the information provided under Orthologue Prediction, Similarity Matches, and SNP information.
  5. How many other human genes are classified as binding to TGF-beta receptors? To answer this question,
  6. Get sequences for all human proteins with growth factor activity:
  7. Get orthologs for all human proteins with growth factor activity
