Perform a multiple sequence alignment on the file sequences.fasta using clustalx (or your favorite MSA application) and save it as sequences.aln. Build a pattern of the first 30 positions of the alignment using a sequence-driven method, as shown on slide 9 of lecture 8. For each column, list the commonly occurring amino acids (those that appear three or more times in the column), then convert this list to patscan syntax (hints: slide 10 of lecture 8 and http://web.wi.mit.edu/bio/pub/patscan.html). Here is an example: pattern.gif. Once you have created the pattern, put it into a file named pattern_file in your directory on fladda.wi.mit.edu. Then issue the following command:
scan_for_matches -p pattern_file < /usr/people/latek/smalldb.fasta > pattern.out
Can you categorize the results of your pattern search? What biological properties do they have in common? (You can find the descriptions of the hits on NCBI Entrez.)
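The column-by-column step described above can be sketched in Python. This is only an illustration under stated assumptions: the function name `column_pattern`, the toy alignment, and the choice to ignore gap characters are all hypothetical, and the output uses `any(...)` for a set of residues and `1...1` for a position with no common residue, which you should check against the patscan documentation linked above before relying on it.

```python
from collections import Counter


def column_pattern(aligned_seqs, n_columns=30, min_count=3):
    """Build a patscan-style pattern from the first n_columns of an alignment.

    aligned_seqs: list of equal-length aligned sequences (gaps as '-' or '.').
    """
    units = []
    for col in range(n_columns):
        # Count residues in this column, ignoring gap characters (a simplification).
        counts = Counter(seq[col] for seq in aligned_seqs if seq[col] not in "-.")
        # Keep residues that occur at least min_count times.
        common = sorted(aa for aa, n in counts.items() if n >= min_count)
        if common:
            units.append("any(" + "".join(common) + ")")
        else:
            # No residue is common enough: allow exactly one arbitrary residue.
            units.append("1...1")
    return " ".join(units)


# Hypothetical toy alignment, not taken from sequences.aln.
toy = ["MKV-A", "MKLTA", "MRLTG", "MKI-S"]
pattern = column_pattern(toy, n_columns=5)
print(pattern)  # any(M) any(K) 1...1 1...1 1...1
```

For the real exercise you would read the aligned rows from your own alignment file and keep the default of 30 columns; columns that are mostly gaps may deserve a variable-length unit instead of `1...1`.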
Build a profile of the alignment from problem 1. Here is the command to use on fladda:
hmmbuild sequences.prf sequences.aln
This will build a profile (sequences.prf) for the sequences aligned in sequences.aln. Remember to calibrate your profile with the command:
hmmcalibrate sequences.prf
Finally, search a small database for sequences that match your profile, and consider only the hits whose E-values are below 1:
hmmsearch -E 1 sequences.prf /usr/people/latek/smalldb.fasta
How are the results of your profile search related? How do they compare to your patscan results from problem 1?
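One way to compare the two searches is to collect the hit identifiers from each and look at the overlap. The sketch below is a rough illustration: it assumes that header lines in pattern.out look like `>id:[start,end]`, and the sequence names (`seqA`, `seqB`, `seqC`) are made up. The profile hits are entered by hand here, since parsing hmmsearch output is not attempted.

```python
import re


def patscan_hit_ids(text):
    """Collect sequence IDs from header lines assumed to look like '>id:[start,end]'."""
    ids = set()
    for line in text.splitlines():
        m = re.match(r">(.+):\[\d+,\d+\]", line)
        if m:
            ids.add(m.group(1))
    return ids


# Hypothetical data standing in for pattern.out and for the hmmsearch hit list.
example_out = ">seqA:[5,34]\nMKLTA\n>seqB:[1,30]\nMKVTG\n"
pattern_hits = patscan_hit_ids(example_out)
profile_hits = {"seqB", "seqC"}  # IDs copied by hand from the hmmsearch report

both = pattern_hits & profile_hits          # found by both methods
pattern_only = pattern_hits - profile_hits  # found only by the pattern
profile_only = profile_hits - pattern_hits  # found only by the profile

print("both:", sorted(both))
print("pattern only:", sorted(pattern_only))
print("profile only:", sorted(profile_only))
```

Sequences found only by the profile are a natural place to start when asking why the two methods differ in sensitivity.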