HOMEWORK 8

The purpose of this assignment is to familiarize you with techniques used to identify patterns and profiles, as well as how to use patterns and profiles to search databases.
  1. Build a pattern and search a sequence database.

    Perform a multiple sequence alignment on the file sequences.fasta using clustalx (or your favorite MSA application) and save it as sequences.aln. Build a pattern of the first 30 positions within the alignment using a sequence-driven method, as shown on slide 9 from lecture 8. Simply list the commonly occurring amino acids (those that appear three or more times in a column) for each column, then convert this list to patscan syntax (hints: slide 10 of lecture 8 and http://web.wi.mit.edu/bio/pub/patscan.html). Here is an example: pattern.gif. Once you have created the pattern syntax, put it into a file named pattern_file in your directory on fladda.wi.mit.edu. Then issue the following command:

    scan_for_matches -p pattern_file < /usr/people/latek/smalldb.fasta > pattern.out
    
    Can you categorize the results of your pattern search? What biological properties do they have in common? (You can find out the descriptions of the hits on NCBI entrez.)
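
    The column-by-column listing described in this problem can be sketched in Python as follows. This is only an illustration under stated assumptions, not the method from the lecture slides: the helper names (parse_clustal, column_residues, to_patscan) are hypothetical, the alignment is assumed to be in Clustal format, a column with several common residues is written as any(...), and a column with no common residue is written as the one-character wildcard 1...1. Check the resulting syntax against the patscan documentation before using it.

```python
from collections import Counter


def parse_clustal(text):
    """Collect aligned sequences from Clustal-format text, keyed by sequence name."""
    seqs = {}
    for line in text.splitlines():
        # Skip blank lines, the CLUSTAL header, and conservation lines
        # (conservation lines start with whitespace).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        parts = line.split()
        if len(parts) < 2:
            continue
        name, chunk = parts[0], parts[1]
        seqs[name] = seqs.get(name, "") + chunk
    return seqs


def column_residues(aligned, n_cols=30, min_count=3):
    """For each of the first n_cols columns, list residues seen at least min_count times."""
    columns = []
    for i in range(n_cols):
        counts = Counter(
            s[i].upper() for s in aligned if i < len(s) and s[i] not in "-."
        )
        columns.append(sorted(r for r, c in counts.items() if c >= min_count))
    return columns


def to_patscan(columns):
    """Turn per-column residue lists into a pattern string (assumed syntax)."""
    units = []
    for residues in columns:
        if residues:
            units.append("any(" + "".join(residues) + ")")
        else:
            units.append("1...1")  # assumption: no common residue, match any one character
    return " ".join(units)


# Tiny made-up alignment, used only to exercise the functions.
demo = """CLUSTAL X (1.83) multiple sequence alignment

seq1      MKA
seq2      MKC
seq3      MRD
seq4      LRE
          ::
"""
aligned = list(parse_clustal(demo).values())
print(to_patscan(column_residues(aligned, n_cols=3, min_count=3)))
```

    To apply this to your own alignment, read sequences.aln into a string, pass it to parse_clustal, and keep the defaults n_cols=30 and min_count=3.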

  2. Build a profile and use it to search a sequence database.

    Build a profile of the alignment from problem 1. Here is the command to use on fladda:

    hmmbuild sequences.prf sequences.aln
    
    This will build a profile (sequences.prf) for the sequences aligned in sequences.aln. Remember to calibrate your profile with the command:
    hmmcalibrate sequences.prf
    
    Finally, search a small database for sequences that match your profile, and consider only the hits whose E-values are below 1:
    hmmsearch -E 1 sequences.prf /usr/people/latek/smalldb.fasta
    
    How are the results of your profile search related? How do they compare to your patscan results from problem #1?
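
    One way to compare the two result sets is to reduce each to a set of sequence identifiers and look at the overlap. The sketch below is a rough aid, not part of the assignment: it assumes that each hit in pattern.out begins with a header line of the form >name:[start,end], and it takes the profile hits as a list of identifiers that you copy by hand from the hmmsearch report. The function names and example identifiers are hypothetical.

```python
import re


def patscan_hit_ids(text):
    """Extract sequence names from header lines assumed to look like '>name:[start,end]'."""
    ids = set()
    for line in text.splitlines():
        match = re.match(r">(\S+?):\[\d+,\d+\]", line)
        if match:
            ids.add(match.group(1))
    return ids


def compare_hits(pattern_ids, profile_ids):
    """Split the identifiers into those found by both searches or by only one."""
    pattern_ids, profile_ids = set(pattern_ids), set(profile_ids)
    return {
        "both": sorted(pattern_ids & profile_ids),
        "pattern_only": sorted(pattern_ids - profile_ids),
        "profile_only": sorted(profile_ids - pattern_ids),
    }


# Made-up example data; replace with the contents of pattern.out and
# the identifiers listed in your hmmsearch output.
example_pattern_out = """>seqA:[3,32]
MKTAYIAKQRQISFVKSHFSRQLEERLGLI
>seqB:[10,39]
MKTAYLAKQRQLSFVKSHFSRQIEERLGLI
>seqA:[50,79]
MRTAYIAKQRQISFVKSHFSRQLEERLGLI
"""
example_profile_ids = ["seqA", "seqC"]

result = compare_hits(patscan_hit_ids(example_pattern_out), example_profile_ids)
for group, names in result.items():
    print(group, names)
```

    Identifiers that appear in only one of the two lists are a good starting point for discussing how the pattern and the profile differ in sensitivity and specificity.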