Relational Databases for Biologists

Relational Databases for Biologists

Session 2: SQL to data mine a database

Exercises

Goals: Learn how to query a MySQL database to extract selected data relationships. Learn how to convert a question (in English) into a SQL query and vice versa.

See /education/bioinfo2005/db4bio/ for the course page.

See http://www.mysql.com/documentation/ or MySQL documentation.

See http://www.mysql.com/doc/en/Tutorial.html for an SQL tutorial.

mysql –u username –D database -p

with the following options:

-u username (replace with your MySQL username)

-D database (“db4bio”; optional – you can select the database later)

-p password (you’ll be prompted for it)

Try some queries in increasing level of complexity, starting with some easy examples to show correct syntax. Note: MySQL keywords (SELECT, FROM, etc.), data fields, and search strings are case-insensitive but other words (table names) ARE case sensitive.

A query can always be saved as a file, e.g., myquery.sql. Then run it from the MySQL prompt by typing :

SOURCE myquery.sql

Convert each of the following SQL queries into a question, and convert each of the following questions into SQL. They are listed in order of increasing difficulty, and some are quite difficult.

1. SELECT * from Sources;

2. What are some example GenBank IDs (accessions) and corresponding sequence descriptions?

3. SELECT count(*) FROM LocusLinks; (to get number of tuples in table)

4. How many different Affy IDs are in the expression data?

5. SELECT description, species FROM LocusDescr

WHERE linkId = 655;

Note that 655 is of type integer (int), so it doesn’t need quotes.

6. What’s the expression level of Affy ID 33659_at in experiment hs-liv-1?

7. SELECT * FROM LocusDescr WHERE description LIKE “phosphatase%”;

Note that the “%” (wildcard) matches any character(s).

8. What are all the gene descriptions, along with their GenBank IDs, containing the phrase “growth factor”?

9. SELECT * from Data WHERE exptId = “mm-hrt-1” and level > 6000

ORDER BY level DESC;

10. What Gene Ontology terms (and corresponding accessions) contain the phrase “transcription factor”? Put your answer in alphabetical order of terms.

11. SELECT Data.affyId, Targets.gbId, Data.level

FROM Data, Targets

WHERE Data.affyId = Targets.affyId

ORDER BY Data.level DESC

LIMIT 10;

12. What Affy probes correspond to target sequences with the phrase “interleukin” in their description?

13. SELECT Data.affyId, Targets.gbId,

UniSeqs.uId,

Data.level

FROM Data, Targets, UniSeqs

WHERE Data.affyId = Targets.affyId

AND Targets.gbId = UniSeqs.gbId

ORDER BY level DESC LIMIT 10;

14. Make a table of ten affyId, UnigeneID, and Unigene descriptions in revserse alphabetical order of Unigene descriptions.

15. SELECT affyId, AVG(level) AS average FROM Data

GROUP BY AffyId ORDER BY level DESC LIMIT 10;

Note that AVG() is an aggregate [like COUNT(), MIN(), MAX(), and SUM()] and “AS” creates a field with the requested name. Since you’re combining rows, you have to use GROUP to show how to do it.

16. What is the average expression level of each of the six experiments?

17. SELECT Data.affyId, Data.level, Data.exptId, DataCopy.affyId, DataCopy.level, DataCopy.exptId

FROM Data, Data DataCopy

WHERE Data.level > 10 * DataCopy.level

AND Data.affyId=DataCopy.affyId

AND Data.affyId NOT LIKE "AFFX%"

LIMIT 10;

(To compare rows of the same table, we copy the table and compare the table to its copy.)

18. List ten non-control human Affy IDs with higher expression in liver than in heart.

19. SELECT Data.affyId,

Data.level AS Heart_level, DataCopy.level AS Brain_

level,

Data.level - DataCopy.level AS Difference

FROM Data, Data DataCopy

WHERE Data.affyId= DataCopy.affyId

AND Data.exptId = “hs-hrt-1”

AND DataCopy.exptId = “hs-cer-1”

ORDER BY Difference DESC

LIMIT 10;

20. SELECT Data.affyId, Targets.gbId,

LocusLinks.

linkId, LocusDescr.description,

Data.level as Heart_level, DataCopy.level as Brain_

level,

Data.level - DataCopy.level AS Difference

FROM Data, Data DataCopy, LocusLinks, LocusDescr,

Targets

WHERE Data.affyId= DataCopy.affyId

AND Data.exptId = “hs-hrt-1”

AND DataCopy.exptId = “hs-cer-1”

AND Data.affyId = Targets.affyId

AND Targets.gbId = LocusLinks.gbId

AND LocusLinks.linkId = LocusDescr.linkId

ORDER BY Difference DESC

LIMIT 10;

Brief descriptions of data in db4bio

Note: Most tables contain only partial data (but should contain enough data to link them together).

Sources – description of source of mRNA for each microarray experiment. This data are for six experiments, only part of a study in which investigators looked at the expression level of genes in a variety of human and mouse organs and tissues.

Data – actual experimental data for a series of microarray experiments. Each affyId represents a set of probes designed to measure expression of a specific gene (known or predicted). The level represents the relative expression level of a gene in a “source” sample of cells. Data are normalized so the sum of all levels on each chip should be the same.

Targets – data linking each piece of microarray data (affyId) to a GenBank sequence ID (gbId), which is the sequence of a gene or part of a gene (an EST, or “expressed sequence tag”). Species are mouse or human, except for some control genes (with Affy IDs starting with AFFX-).

Descriptions – descriptions of DNA sequences in the GenBank repository, each represented by a gbId (a unique accession ID).

UniSeqs – Every EST (part of a gene sequence) in the GenBank repository is compared to every other one, and ESTs are clustered together if enough of their sequences overlap, presumably because they are all part of the sequence of the same gene. Each Unigene cluster of ESTs has a Unique ID, in which the first two characters refer to the species of origin.

UniDescr – description of a gene represented by a Unigene cluster. The description may be vague or completely uninformative if the function of the gene is unknown.

LocusLinks – GenBank annotators attempt to assign most sequences (except for ESTs) to a Locus, representing a gene. As sequencing and annotation progresses, the number of LocusLink IDs should approach the total number of genes in an organism. Just as every EST is generally assigned to a Unigene cluster, each RNA (actually, cDNA) sequence is assigned to a Locus.

LocusDescr – description of a gene represented by a LocusLink ID. The description may be vague or completely uninformative if the function of the gene is unknown.

Unigenes – data linking a Unigene cluster to a LocusLink ID. The number of Unigene clusters is much greater than the predicted number of genes in human and mouse. This may be mostly due to more than one cluster representing different parts of the same gene, with no EST overlapping them both.

RefSeqs – “reference sequences” for LocusLink IDs, which annotators assign to the full length sequence of a gene (a cDNA) and the protein which it encodes. A LocusLink ID with alternative splicing may have more than one cDNA or protein reference sequence.

GO_Descr – Gene Ontology is a big project that has created three detailed hierarchies describing molecular function (ex: enzyme), biological process (ex: reproduction), and localization (ex: nucleus) to systematically describe all proteins in these three ways.

Ontologies – Gene Ontology annotators systematically assigns proteins to the three GO hierarchies (if the function of the protein is known). This list links LocusLink IDs to GO accessions. This annotation is currently quite incomplete.