SEQUENCE ANALYSIS EXERCISES - Lecture 3
Part I. Browsing for genomic information
- Find the human gene PIK4CA in the UCSC Genome Browser:
- Select the human genome, July 2003 assembly and enter "PIK4CA" as the position.
- Once you do the search, select any of the choices under "Known Genes" or "RefSeq genes".
- Under "Genes and Gene Prediction Tracks" at the bottom half of the page, select "full" for Known Genes, RefSeq Genes, MGC Genes, and Ensembl Genes.
- Click on "refresh" or "jump".
- Zoom out until you can see all the transcripts named PIK4CA.
- Relative to the reference chromosome, what strand is the gene on?
- Try clicking on the gene structures to see what information is linked.
- How many transcripts are there, according to (a) RefSeq, or (b) Ensembl?
- How many exons does the longest transcript have?
- Look at the longest intron. What do you see there?
- Under "mRNA and EST Tracks", set the "Spliced ESTs" track to squish, pack, or full. Can you find expression evidence for these transcripts?
- At the top of the page, what do the "Ensembl", "Map View", and "PDF/PS" links do?
Part II. Extracting annotated genomic sequence
- Enter NM_058004 (the RefSeq ID of the longer of the transcripts) into the position box and click on "jump" to get the browser to show the full width of the gene.
- What are the genomic coordinates of this transcript?
- How long is the gene (in genomic context, rather than in cDNA context)?
- To extract the genomic sequence of the PIK4CA gene, including 5 kb upstream and 1 kb downstream of NM_058004, adjust the coordinates in the position window.
- Since the gene is on the negative strand, adding 5000 to the second coordinate (y, where the position is chr22:x-y) will expand the window to include 5 kb upstream.
- Subtracting 1000 from the first coordinate (x) will extend the window 1 kb downstream, past the 3' end of the gene.
- What are the expanded coordinates?
- At the top of the page, click on the "DNA" link and note that you could adjust the coordinates at this time too.
- Check "Reverse complement" since the gene is on the negative strand, and click on "Extended case/color options."
- To capture some of the gene and EST mapping data with your genomic sequence,
- enter 255 under the Red box for RefSeq genes,
- enter 255 under the Blue box for Ensembl Genes,
- check the Underline box for Spliced ESTs, and
- click on Submit.
- What's the significance of the formatting of the output file?
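The coordinate arithmetic and the reverse-complement step in this part can be sketched in a few lines of Python. This is only an illustration, not part of the browser: the function names are made up for this handout, the coordinates are placeholders rather than the real position of NM_058004, and the length formula assumes the position box uses 1-based, inclusive coordinates.

```python
def gene_length(start, end):
    """Length in bases of an interval, assuming 1-based inclusive coordinates."""
    return end - start + 1


def expand_window(start, end, strand, upstream, downstream):
    """Widen chrN:start-end by `upstream` and `downstream` bases.

    On the negative strand the 5' end of the gene is at the larger
    coordinate, so the upstream padding is added to `end`.
    """
    if strand == "-":
        return start - downstream, end + upstream
    return start - upstream, end + downstream


# Translation table for complementing bases; case is preserved.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


# Placeholder coordinates (NOT the real location of NM_058004).
x, y = 1_000_000, 1_100_000
print(gene_length(x, y))                     # 100001
print(expand_window(x, y, "-", 5000, 1000))  # (999000, 1105000)
print(reverse_complement("ATGCcg"))          # cgGCAT
```

Substituting the coordinates you recorded for NM_058004 should reproduce the expanded window you worked out by hand.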
Part III. Gene-finding with comparative mammalian genomics
- Find the mouse gene PIK4CA in the UCSC Genome Browser using the latest assembly (Oct 2003).
- Once you do the search, select the choice under "Known Genes".
- As before, change Known Genes, RefSeq Genes, and MGC Genes to "full" and click on "refresh" or "jump".
- Note that the Ensembl track is missing: Ensembl's annotation pipeline for this assembly is still in progress.
- How many exons does the BC049252 transcript have? Do you think it's the whole gene?
- Zoom out 3X and set the "Spliced ESTs" track to full.
- Can you convince yourself whether BC049252 is a full-length transcript? If not, how could you identify any longer transcripts?
- Using the sequence of the longest transcript of HUMAN PIK4CA (NM_058004), search the latest mouse genome with BLAT.
- When you get back the BLAT Search Results, follow the "browser" link to get the genome browser view.
- Does this look like the longest transcript of mouse PIK4CA? Why or why not?
- Look at the longest intron of the human sequence (shown in the browser as "BLAT sequence"). What do you see there? Should this finding be unexpected?
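If you save the BLAT results in PSL format instead of following the "browser" link, the intron question above can also be approached numerically. The sketch below assumes the standard 21-column tab-separated PSL layout (blockSizes in column 19, tStarts in column 21) and a nucleotide alignment whose target block starts are listed in ascending order. The function name and the example line are hypothetical, not real BLAT output.

```python
def psl_target_gaps(psl_line):
    """Return the gap sizes between consecutive aligned blocks on the target.

    For an mRNA aligned to a genome, large gaps on the target (genome)
    side roughly correspond to introns.
    """
    fields = psl_line.rstrip("\n").split("\t")
    # Both columns are comma-separated lists with a trailing comma.
    block_sizes = [int(v) for v in fields[18].split(",") if v]
    t_starts = [int(v) for v in fields[20].split(",") if v]
    gaps = []
    for i in range(len(t_starts) - 1):
        block_end = t_starts[i] + block_sizes[i]
        gaps.append(t_starts[i + 1] - block_end)
    return gaps


# Made-up PSL record with three 100-base blocks (hypothetical values).
example_line = "\t".join([
    "300", "0", "0", "0",      # matches, misMatches, repMatches, nCount
    "0", "0", "2", "5500",     # qNumInsert, qBaseInsert, tNumInsert, tBaseInsert
    "+", "query", "300", "0", "300",    # strand, qName, qSize, qStart, qEnd
    "chrN", "100000", "1000", "6800",   # tName, tSize, tStart, tEnd
    "3", "100,100,100,", "0,100,200,", "1000,1600,6700,",
])

gaps = psl_target_gaps(example_line)
print(gaps)       # [500, 5000]
print(max(gaps))  # 5000
```

The largest gap is only a candidate for the longest intron. The browser view is still needed to see what lies inside it.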
Part IV (supplementary). Gene and genome analysis through annotation
- Find the human gene TGFB3 (Transforming growth factor beta 3) in the Ensembl project:
- Click on human.
- Enter TGFB3 and Search for "Anything".
- After clicking on Lookup, select a match to the Gene index that refers to the gene.
- Follow the link under Genomic Location to view the gene in its genomic location. This presentation of data should look somewhat familiar.
- Go back to the GeneView page.
- Peruse the GeneView page, noting the information provided under Orthologue Prediction, Similarity Matches, and SNP information.
- How many other human genes are classified as binding to TGF-beta receptors? To answer this question,
- Follow the GO (Gene Ontology) ID link to the GO term that refers to proteins that bind TGF-beta receptors ("transforming growth factor-beta receptor binding").
- Noting that this term appears in multiple locations in the molecular function hierarchy, how many genes are classified into this group?
- While you're viewing the GO tree, look up one level to "growth factor activity". How many genes are in this category? Get a list of these genes.
- Write down the GO ID for "growth factor activity".
- Get sequences for all human proteins with growth factor activity:
- Go back to the Ensembl home page.
- Go to EnsMart and click on START.
- Select Ensembl Genes as the Focus and Homo sapiens as the species, and click "next".
- Under REGION, uncheck "Limit to" to search the entire genome.
- Under GENE ONTOLOGY, check Evidence code for mapping and enter the GO ID for "growth factor activity" next to Molecular Function.
- Click on "next" and then note the tabs for Features and Sequences.
- Select the Sequences tab.
- Select Transcripts/proteins, then Peptide, choose your preferred output format, and click on Export.
- Save the fasta-format file (if you want).
- Get orthologs for all human proteins with growth factor activity:
- Go back one step in your browser (or start EnsMart over again as in the first six steps above).
- This time select the Features tab.
- Under GENE: Ensembl Attributes, check Ensembl Gene ID and Description.
- Under MULTI SPECIES COMPARISONS: Mouse Homolog Attributes and Rat Homolog Attributes, select Ensembl Gene ID.
- Select your preferred output format and click on Export.
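If you saved the peptide sequences in fasta format earlier in this part, a quick way to check the export is to count the records and look at their lengths. The following is a minimal sketch in plain Python. The function name and the two example records are invented for illustration, and the header layout of a real EnsMart export may differ.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a dict of {header: sequence}."""
    records = {}
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Two invented peptide records standing in for an exported file.
example = """>pep1 hypothetical
MKTAY
IAKQR
>pep2 hypothetical
MHH
""".splitlines()

records = read_fasta(example)
print(len(records))  # 2
for name, seq in records.items():
    print(name, len(seq))
```

To run it on a saved export, pass an open file instead of the example lines, for instance `read_fasta(open("growth_factors.fa"))`, where the file name is a placeholder for whatever you called your file.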
WIBR Sequence Analysis Course 2004