SEQUENCE ANALYSIS EXERCISES II

Pattern searching and Gene Finding

  1. Fuzznuc is an EMBOSS pattern matcher that searches nucleotide sequences for instances of a user-supplied pattern. To search for the "promoter" pattern (TATAN(15,100)ATG) in the human genomic sequence, copy and paste both the pattern and the sequence into the fuzznuc interface at the German Research Centre for Biotechnology. How many hits did you get?

  2. Mask the above human genomic sequence with RepeatMasker. This will return a masked sequence. Copy and paste the masked sequence and rerun the above pattern search. Is the result different from that of the unmasked search?
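To see what the pattern in exercises 1 and 2 asks for, it can be written as a regular expression. The following is a minimal Python sketch, not fuzznuc itself. It assumes that N in the pattern matches only A, C, G or T, so bases masked to N cannot take part in a hit, and that soft-masked (lowercase) bases are treated as unmasked. The sequences `toy` and `masked` are made up for illustration.

```python
import re

# TATAN(15,100)ATG: TATA, then 15 to 100 bases, then ATG.
# Assumption: N in the pattern matches only A, C, G or T.
# The lookahead lets hits that start at different positions overlap.
PROMOTER = re.compile(r"(?=(TATA[ACGT]{15,100}ATG))")


def find_hits(seq):
    """Return (start, end) pairs, 1-based inclusive, one per start position."""
    seq = seq.upper()  # lowercase (soft-masked) bases are treated as unmasked
    return [(m.start() + 1, m.start() + len(m.group(1)))
            for m in PROMOTER.finditer(seq)]


# Made-up sequences: the second has part of the spacer masked to N.
toy = "GG" + "TATA" + "ACGTACGTACGTACGTAC" + "ATG" + "CC"
masked = "GG" + "TATA" + "ACGTA" + "N" * 8 + "CGTAC" + "ATG" + "CC"

print(find_hits(toy))
print(find_hits(masked))
```

Each start position is reported once, with the longest match from that position. fuzznuc may count and report hits differently, so compare the numbers with care.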

  3. Find coding regions in the above human genomic sequence.

    a. Search for coding regions with GenomeScan using your masked genomic sequence. GenomeScan also requires protein sequences, which you can find by running BLAST and/or Genscan. Directions for finding protein sequences are in the fourth paragraph of the GenomeScan website. For the BLAST search, you can search against the "swissprot" database. From the BLAST results, choose the human hits with an E-value of 0.

    b. Search for coding regions with MZEF using your masked genomic sequence. Compare the locations of the exons predicted by MZEF with those from GenomeScan.

    c. To check the performance of GenomeScan, compare the sequence it predicts with the experimentally determined sequence. You can do the pairwise alignment with the NCBI BLAST 2 program.
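Comparing exon locations from two predictors comes down to checking which coordinate ranges overlap. The sketch below is one way to do that in Python once you have copied the exon start and end positions out of each program's output by hand. The coordinates shown are hypothetical placeholders, not real GenomeScan or MZEF results.

```python
def overlap(a, b):
    """Number of bases shared by two inclusive (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def compare_exons(set_a, set_b):
    """For each exon in set_a, list the exons in set_b that it overlaps."""
    report = []
    for a in set_a:
        hits = [(b, overlap(a, b)) for b in set_b if overlap(a, b) > 0]
        report.append((a, hits))
    return report


# Hypothetical coordinates, for illustration only.
genomescan_exons = [(100, 250), (400, 520), (900, 1000)]
mzef_exons = [(100, 250), (410, 520)]

for exon, hits in compare_exons(mzef_exons, genomescan_exons):
    print(exon, "overlaps", hits)
```

Exons with identical boundaries show an overlap equal to their full length. A shorter overlap indicates that the two programs disagree on a splice site.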

  4. Run a PSI-BLAST search and extract sequences in batch from the BLAST results

    a. Run PSI-BLAST with the Drosophila olfactory receptor 85e (accession number NP_524283). You can put the accession number into the "search" field, and search only the odorant receptors in "Drosophila melanogaster". How many hits with significant alignments are in your results? Which matrix is used by default?

    b. Run PSI-BLAST until it converges, using the default matrix and limiting the search to Drosophila melanogaster. For each iteration, include only the sequences belonging to "odorant receptor". Because olfactory receptors share very low similarity (How can you prove this statement?), we need to include the odorant receptors with an E-value WORSE than the threshold. After it converges, how many odorant receptors do you get? How many iterations did it take for no new hits to be found?

    c. Extract multiple sequences from the BLAST results. Click on the "Select all" button inside the alignment field, and press the "Get selected sequences" button. On the next page, choose the genes of interest (or all the odorant receptor sequences). Replace "Summary" with "FASTA", and click on the "Send to" button.

    d. Repeat the above search with PAM30 and BLOSUM80. How many hits are there with significant alignments? Is the number different from the one with the default matrix? Why?
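One way to approach the question in 4b about low similarity is to compute the percent identity of a pairwise alignment between two receptors. The sketch below assumes you already have two aligned sequences of equal length with "-" for gaps, for example copied from a pairwise BLAST result. The two short strings are made up and are not real odorant receptor sequences. Identity is counted over columns without gaps, which is only one of several common conventions.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identical residues over the columns that contain no gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where neither sequence has a gap.
    pairs = [(x, y) for x, y in zip(aligned_a, aligned_b)
             if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


# Made-up aligned fragments, for illustration only.
print(round(percent_identity("MKT-LLV", "MRTALIV"), 1))
```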

  5. On hebrides, run the program patscan. You will need to make one text file with your pattern and another with the sequences you will be searching. You could use a browser to download sequences of your choice in batch. If you prefer, you can copy the files as described below. The three pattern files are similar to the ones described in the lecture.

    a. cp /home/lewitter/msh2.fa .

    b. cp /home/lewitter/pat1.dat .

    c. cp /home/lewitter/pat2.dat .

    d. cp /home/lewitter/pat3.dat .

    e. scan_for_matches pat1.dat < msh2.fa > out1

    f. scan_for_matches pat2.dat < msh2.fa > out2

    g. scan_for_matches pat3.dat < msh2.fa > out3

    h. more out1

    i. more out2

    j. more out3

    What do your results look like? What happens when you allow mismatches, etc.?
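To build intuition for the last question, the sketch below shows in Python what allowing mismatches means for a fixed pattern: every window of the sequence is compared with the pattern and kept if it differs at no more than a given number of positions. This is not how scan_for_matches is implemented, and it counts substitutions only, not insertions or deletions. The pattern and sequence are made up for illustration.

```python
def find_with_mismatches(pattern, seq, max_mismatches):
    """Return (start, matched_text, mismatches) for each window that is
    within max_mismatches substitutions of pattern. Start is 1-based."""
    hits = []
    for i in range(len(seq) - len(pattern) + 1):
        window = seq[i:i + len(pattern)]
        mismatches = sum(1 for p, s in zip(pattern, window) if p != s)
        if mismatches <= max_mismatches:
            hits.append((i + 1, window, mismatches))
    return hits


# Made-up example sequence.
seq = "AAGGATCCTTGGTTCCAA"
print(find_with_mismatches("GGATCC", seq, 0))
print(find_with_mismatches("GGATCC", seq, 1))
```

Raising the allowed number of mismatches can only add hits, never remove them, so expect the output files to grow as the pattern is relaxed.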