Getting To Know Your Protein

Exercise II

Bioinformatics for Biologists 2005

In this exercise, you will identify protein domains within an unknown sequence and use a protein domain pattern and a profile to search a database for related sequences. By the end of the exercise, you should be comfortable browsing and searching domain databases. Work through the steps below in order, using either the applications installed on your computer or the web-based tools. If you have difficulty with any step, please ask for assistance.

(For these exercises, use each application's default settings.)

Step 1 – Identify protein domains

I. Use PFAM to identify domains within the following sequence:

http://pfam.wustl.edu/

MESEMLQSPLLGLGEEDEADLTDWNLPLAFMKKRHCEKIEGSKSLAQSWRMKDRMKTVSVALVLCLNVGVDP

PDVVKTTPCARLECWIDPLSMGPQKALETIGANLQKQYENWQPRARYKQSLDPTVDEVKKLCTSLRRNAKEE

RVLFHYNGHGVPRPTVNGEVWVFNKNYTQYIPLSIYDLQTWMGSPSIFVYDCSNAGLIVKSFKQFALQREQE

LEVAAINPNHPLAQMPLPPSMKNCIQLAACEATELLPMIPDLPADLFTSCLTTPIKIALRWFCMQKCVSLVP

GVTLDLIEKIPGRLNDRRTPLGELNWIFTAITDTIAWNVLPRDLFQKLFRQDLLVASLFRNFLLAERIMRSY

NCTPVSSPRLPPTYMHAMWQAWDLAVDICLSQLPTIIEEGTAFRHSPFFAEQLTAFQVWLTMGVENRNPPEQ

LPIVLQVLLSQVHRLRALDLLGRFLDLGPWAVSLALSVGIFPYVLKLLQSSARELRPLLVFIWAKILAVDSS

CQADLVKDNGHKYFLSVLADPYMPAEHRTMTAFILAVIVNSYHTGQEACLQGNLIAICLEQLNDPHPLLRQW

VAICLGRIWQNFDSARWCGVRDSAHEKLYSLLSDPIPEVRCAAVFALGTFVGNSAERTDHSTTIDHNVAMML

AQLVSDGSPMVRKELVVALSHLVVQYESNFCTVALQFIEEEKNYALPSPATTEGGSLTPVRDSPCTPRLRSV

SSYGNIRAVATARSLNKSLQNLSLTEESGGAVAFSPGNLSTSSSASSTLGSPENEEHILSFETIDKMRRASS

YSSLNSLIGVSFNSVYTQIWRVLLHLAADPYPEVSDVAMKVLNSIAYKATVNARPQRVLDTSSLTQSAPASP

TNKGVHIHQAGGSPPASSTSSSSLTNDVAKQPVSRDLPSGRPGTTGPAGAQYTPHSHQFPRTRKMFDKGPEQ

TADDADDAAGHKSFISATVQTGFCDWSARYFAQPVMKIPEEHDLESQIRKEREWRFLRNSRVRRQAQQVIQK

GITRLDDQIFLNRNPGVPSVVKFHPFTPCIAVADKDSICFWDWEKGEKLDYFHNGNPRYTRVTAMEYLNGQD

CSLLLTATDDGAIRVWKNFADLEKNPEMVTAWQGLSDMLPTTRGAGMVVDWEQETGLLMSSGDVRIVRIWDT

DREMKVQDIPTGADSCVTSLSCDSHRSLIVAGLGDGSIRVYDRRMALSECRVMTYREHTAWVVKASLQKRPD

GHIVSVSVNGDVRIFDPRMPESVNVLQIVKGLTALDIHPQADLIACGSVNQFTAIYNSSGELINNIKYYDGF

MGQRVGAISCLAFHPHWPHLAVGSNDYYISVYSVEKRVR

II. What are these domains? Can you identify other proteins that contain these domains? What is interesting about the domain architecture for these domains?
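
The query sequence above is wrapped across many lines, so it may help to join it into a single string before pasting it into a search form. The following is a minimal sketch, assuming the sequence was copied into a Python string; the helper names clean_sequence and to_fasta and the record name "query" are hypothetical and not part of PFAM or any other tool used in this exercise.

```python
# The 20 standard amino acid one-letter codes.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw: str) -> str:
    """Remove all whitespace from a pasted sequence and check its letters."""
    seq = "".join(raw.split()).upper()
    unexpected = sorted(set(seq) - VALID_RESIDUES)
    if unexpected:
        raise ValueError(f"unexpected characters in sequence: {unexpected}")
    return seq


def to_fasta(name: str, seq: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Only the first few residues of the exercise sequence, for illustration.
    pasted = """
    MESEMLQSPL

    LGLGEEDEAD
    """
    print(to_fasta("query", clean_sequence(pasted)))
```

The check against the 20 standard residues is a deliberate simplification; it rejects ambiguity codes such as X or B, which some databases do use.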

Step 2 – Create a pattern (consensus) for the domain in Step 1

I. This time, search ProSite for domains in the sequence from Step 1. Write down the ProSite identifier (PSxxxxx) for this domain.

http://www.expasy.org/tools/scanprosite/

II. Use the search box at the top of the ProSite page to find information about the domain identified in the previous question. Locate the consensus pattern representing this domain and copy it to a text file.

Step 3 – Search a database with your sequence pattern

I. Convert the ScanProsite pattern to PatScan syntax and save it as a text file. (For simplicity, you don't have to convert the whole pattern.)

II. Use the command-line version of PatScan on Hebrides to search the nr database with the pattern file from the previous step. You will need a Hebrides login.

% scan_for_matches pattern.file < /cluster/db0/Data/nr > out.file

[ Alternatively, copy and paste the ProSite pattern into the "PROSITE pattern(s)/profile(s) to scan for:" box at http://www.expasy.org/tools/scanprosite/ ]
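
Converting a pattern between syntaxes is mostly a matter of translating each element in turn. As an illustration of the idea, the sketch below translates ProSite pattern syntax into a Python regular expression rather than into PatScan syntax, so it is an analogy for the conversion in this step and not a substitute for it. The function name prosite_to_regex and the example pattern are made up for illustration; the example is not the ProSite entry you found in Step 2.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple ProSite pattern into a Python regular expression.

    Handles: x (any residue), [ABC] (any of), {ABC} (none of),
    e(n) and e(n,m) repeats, '-' separators, and '<' / '>' anchors
    at the ends of the pattern. Anchors inside brackets are not handled.
    """
    pieces = []
    for element in pattern.strip().rstrip(".").split("-"):
        at_start = element.startswith("<")
        at_end = element.endswith(">")
        element = element.strip("<>")

        # Split the element into its core and an optional (n) or (n,m) repeat.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()

        if core == "x":
            regex = "."
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"
        else:
            regex = core  # a literal residue or a [..] set carries over as is

        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        pieces.append(("^" if at_start else "") + regex + ("$" if at_end else ""))
    return "".join(pieces)


if __name__ == "__main__":
    toy_pattern = "[LIV]-x(2)-{P}-W-[DEN]"  # hypothetical pattern
    regex = prosite_to_regex(toy_pattern)
    print(regex)
    hit = re.search(regex, "AAALKKGWDAA")
    print(hit.start(), hit.group())
```

The same element-by-element approach applies when writing the PatScan version by hand: translate each set, exclusion, and repeat separately, then check the result against a sequence you already know contains the domain.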

Step 4 – Use a profile to search a database

I. Use PSI-BLAST to build a profile of sequences related to the following sequence by searching the Swiss-Prot or nr database.

http://www.ncbi.nlm.nih.gov/BLAST/

MDTDKLISEAESHFSQGNHAEAVAKLTSAAQSNPNDEQMSTIESLIQKIAGYVMDNRSGGSDASQDRAAGGGSSFMN

TLMADSKGSSQTQLGKLALLATVMTHSSNKGSSNRGFDVGTVMSMLSGSGGGSQSMGASGLAALASQFFKSGNNSQG

QGQGQGQGQGQGQGQGQGSFTALASLASSFMNSNNNNQQGQNQSSGGSSFGALASMASSFMHSNNNQNSNNSQQGYN

QSYQNGNQNSQGYNNQQYQGGNGGYQQQQGQSGGAFSSLASMAQSYLGGGQTQSNQQQYNQQGQNNQQQYQQQGQNY

QHQQQGQQQQQGHSSSFSALASMASSYLGNNSNSNSSYGGQQQANEYGRPQHNGQQQSNEYGRPQYGGNQNSNGQHE

SFNFSGNFSQQNNNGNQNRY

II. What types of proteins do you find? Re-run the search for 4 iterations, including only sequences with E-values below 0.0001.

• Now what kinds of sequences do you retrieve?

Step 5 – Use a multiple sequence alignment to build a sequence profile

I. Return to PFAM and browse for your favorite domain.

II. Find the corresponding seed alignment in MSF format, then copy and paste it into a text file. Open the file in ClustalX. Remove all but 10 of the sequences using EDIT->CUT SEQUENCES. Re-align the remaining sequences. Make a figure with your new alignment.

III. BONUS: Using Hebrides, build a profile for this alignment, calibrate it, and search the yeast.aa database with it.

• Build: hmmbuild hmmfile.hmm alignment.msf

• Calibrate: hmmcalibrate hmmfile.hmm

• Search: hmmsearch hmmfile.hmm /cluster/db0/Data/nr > results.out
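
To get a feel for what a profile built from an alignment contains, the sketch below builds a simple position-specific scoring matrix from a tiny ungapped alignment and slides it along a sequence. This is a simplified stand-in: the profile HMMs built by hmmbuild also model insertions and deletions, which this sketch ignores. The function names, the uniform background frequencies, the pseudocount of 1, and the toy alignment are all assumptions made for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores (in bits) per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(ALPHABET)  # assumed uniform background
    profile = []
    for col in range(length):
        # Gap characters and other non-residue symbols are simply skipped.
        counts = Counter(seq[col] for seq in alignment if seq[col] in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def score(profile, segment):
    """Sum the column scores for a segment as long as the profile."""
    return sum(column[aa] for column, aa in zip(profile, segment))


def best_hit(profile, sequence):
    """Return (score, start) of the best-scoring window in the sequence."""
    width = len(profile)
    return max(
        (score(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


if __name__ == "__main__":
    toy_alignment = ["GDWS", "GDWT", "ADWS"]  # hypothetical seed alignment
    profile = build_profile(toy_alignment)
    best_score, start = best_hit(profile, "KKKGDWSKKK")
    print(start, round(best_score, 2))
```

Residues that are conserved in a column receive high positive scores and residues never seen there receive negative ones, which is why a profile can find distant relatives that a single query sequence would miss.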