Relational Databases for Biologists

 

Session 2: SQL to data mine a database

 

Exercises

 

 

Goals: Learn how to query a mySQL database to extract selected data relationships. Learn how to convert a question (in English) into a SQL query and vice versa.

 

See /education/bioinfo2005/db4bio/ for the course page.

See http://www.mysql.com/documentation/                                      for mySQL documentation.

See http://www.mysql.com/doc/en/Tutorial.html                          for an SQL tutorial.

 

Log in to the “db4bio” mySQL database on iona:

 

mysql –u username –D db4bio –p

 

with the following options:

-h host (the computer with the mySQL database) = hebrides.wi.mit.edu

-u username (replace with your mySQL username)

-D database (“db4bio”; optional – you can select the database later)

-p password (you’ll be prompted for it)

 

Try some queries in increasing level of complexity, starting with some easy examples to show correct syntax. Note: mySQL keywords (SELECT, FROM, etc.), data fields, and search strings are case-insensitive but other words (table names) ARE case sensitive.

A query can always be saved as a file, e.g., myquery.sql. Then run it from the mySQL prompt by typing

 

SOURCE myquery.sql

 

Convert each of the following SQL queries into a question, and convert each of the following questions into SQL. They are listed in order of increasing difficulty, and some are quite difficult.

 

1.             SELECT * from Sources;

 

Get every attribute from each tuple of the Sources table.

 

2. What are some example GenBank IDs (accessions) and corresponding sequence descriptions?

 

                  SELECT * FROM Descriptions LIMIT 5;

 

3.             SELECT count(*) FROM LocusLinks; (to get number of tuples in table)

 

                  Count = 70153

 

4. How many different Affy IDs are in the expression data?

 

                  SELECT COUNT(affyId) FROM Data;

 

5.             SELECT description, species FROM LocusDescr

WHERE linkId = 655;

Note that 655 is type integer (int), so it doesn’t need quotes.

 

+------------------------------+---------+

| description                  | species |

+------------------------------+---------+

| bone morphogenetic protein 7 | Hs      |

+-----------------------------------------

1 row in set (0.04 sec)

 

 

6. What’s the expression level of Affy ID 33659_at in experiment hs-liv-1?

 

                  SELECT level FROM Data WHERE affyId=’33659_at’

AND exptId=’hs-liv-1’;

 

7.             SELECT * FROM LocusDescr WHERE description LIKE “phosphatase%”;

Note that the “%” (wildcard) matches any character(s).

 

+--------+-------------+---------+

| linkId | description   species |

+--------+-------------+---------+

6 rows returned

 

8. What are all the gene descriptions, along with their GenBank IDs, containing the phrase “growth factor”?

 

                  SELECT * FROM Descriptions

WHERE description LIKE “%growth factor%”;

 

9.             SELECT * from Data WHERE exptId = “mm-hrt-1” and level > 6000

ORDER BY level DESC;

 

+-----------------------------+----------+-------+

| affyId                      | exptId   | level |

+-----------------------------+----------+-------+

73 rows selected

 

10. What Gene Ontology terms (and corresponding accessions) contain the phrase “transcription factor”? Put your answer in alphabetical order of terms.

 

                  SELECT * FROM GO_Descr

WHERE description LIKE “%transcription factor%”

ORDER BY description;

 

11.          SELECT Data.affyId, Targets.gbId, Data.level

FROM Data, Targets

WHERE Data.affyId = Targets.affyId

ORDER BY Data.level DESC

LIMIT 10;

 

+------------+----------+-------+

| affyId     | gbId     | level |

+------------+----------+-------+

| 36780_at   | M25915   | 47914 |

| 39106_at   | X02162   | 42001 |

| 33377_at   | X03168   | 41726 |

| 36781_at   | X01683   | 40854 |

| 35083_at   | AL031670 | 38720 |

| 32252_at   | D00096   | 35603 |

| 36621_at   | M16961   | 34556 |

| 36984_f_at | X89214   | 34311 |

| 39775_at   | X54486   | 33765 |

| 36995_at   | M88249   | 33553 |

+------------+----------+-------+

10 rows in set (0.52 sec)

 

 

12. What Affy probes correspond to target sequences with the phrase “interleukin” in their description?

 

                  SELECT affyId FROM Targets, Descriptions

WHERE Targets.gbId=Descriptions.gbId

AND Descriptions.description LIKE “%interleukin%”;

 

13.          SELECT Data.affyId, Targets.gbId, UniSeqs.uId, Data.level

FROM Data, Targets, UniSeqs

WHERE Data.affyId = Targets.affyId

AND Targets.gbId = UniSeqs.gbId

ORDER BY level DESC LIMIT 10;

 

+------------+--------+-----------+-------+

| affyId     | gbId   | uId       | level |

+------------+--------+-----------+-------+

| 36780_at   | M25915 | Hs.75106  | 47914 |

| 39106_at   | X02162 | Hs.93194  | 42001 |

| 33377_at   | X03168 | Hs.2257   | 41726 |

| 36781_at   | X01683 | Hs.297681 | 40854 |

| 32252_at   | D00096 | Hs.194366 | 35603 |

| 36621_at   | M16961 | Hs.324746 | 34556 |

| 36984_f_at | X89214 | Hs.328822 | 34311 |

| 93109_f_at | M75718 | Mm.89843  | 33112 |

| 33455_at   | X02747 | Hs.234234 | 31473 |

| 35905_s_at | U34995 | Hs.169476 | 31109 |

+------------+--------+-----------+-------+

10 rows in set (0.52 sec)

 

 

14. Make a table of ten affyId, UnigeneID, and Unigene descriptions in reverse alphabetical order of Unigene descriptions.

 

                  SELECT Targets.affyid, UniDescr.uID, UniDescr.description

FROM UniDescr, UniSeqs, Targets WHERE Targets.gbId=UniSeqs.gbId AND UniSeqs.uId=UniDescr.uID ORDER BY UniDescr.description DESC LIMIT 10;

 

15.          SELECT affyId, AVG(level) AS average FROM Data

GROUP BY AffyId ORDER BY level DESC LIMIT 10;

 

Note that AVG() is an aggregate [like COUNT(), MIN(), MAX(), and SUM()] and “AS” creates a field with the requested name. Since you’re combining rows, you have to use GROUP to show how to do it.

 

+---------------------------+------------+

| affyId                    | average    |

+---------------------------+------------+

| 93050_at                  |  8176.3333 |

| 101869_s_at               | 10145.3333 |

| 101029_f_at               |  7895.3333 |

| 101071_at                 |  7165.6667 |

| AFFX-MURINE_B2_at         | 22705.0000 |

| 93514_at                  |  6720.6667 |

| 100921_at                 |  6424.0000 |

| AFFX-GapdhMur/M32599_M_at | 12907.0000 |

| 100614_at                 |  5847.0000 |

| 101028_i_at               |  5732.3333 |

+---------------------------+------------+

10 rows in set (3.05 sec)

 

16. What is the average expression level of each of the six experiments?

 

                  SELECT exptId, AVG(level) AS average From Data GROUP BY exptId;

 

17.          SELECT Data.affyId, Data.level, Data.exptId, DataCopy.affyId, DataCopy.level, DataCopy.exptId

FROM Data, Data DataCopy

WHERE Data.level > 10 * DataCopy.level

AND Data.affyId=DataCopy.affyId

AND Data.affyId NOT LIKE "AFFX%"

LIMIT 10;

 

(To compare rows of the same table, we copy the table and compare the table to its copy.)

 

+------------+-------+----------+------------+-------+----------+

| affyId     | level | exptId   | affyId     | level | exptId   |

+------------+-------+----------+------------+-------+----------+

| 31333_at   |    16 | hs-cer-1 | 31333_at   |     1 | hs-liv-1 |

| 31337_at   |    20 | hs-cer-1 | 31337_at   |     1 | hs-hrt-1 |

| 31401_r_at |    13 | hs-cer-1 | 31401_r_at |     1 | hs-hrt-1 |

| 31442_at   |   519 | hs-cer-1 | 31442_at   |    20 | hs-liv-1 |

| 31491_s_at |    20 | hs-cer-1 | 31491_s_at |     1 | hs-hrt-1 |

| 31526_f_at |   612 | hs-cer-1 | 31526_f_at |     2 | hs-hrt-1 |

| 31526_f_at |   612 | hs-cer-1 | 31526_f_at |    20 | hs-liv-1 |

| 31536_at   |  1914 | hs-cer-1 | 31536_at   |   191 | hs-hrt-1 |

| 31601_s_at |    20 | hs-cer-1 | 31601_s_at |     1 | hs-hrt-1 |

| 31606_at   |    56 | hs-cer-1 | 31606_at   |     4 | hs-hrt-1 |

+------------+-------+----------+------------+-------+----------+

10 rows in set (0.10 sec)

 

18. List ten non-control human Affy IDs with higher expression in liver than in heart.

 

                  SELECT Data.affId FROM Data.Data DataCopy

WHERE Data.level> Datacopy.level

AND Data.affId=DataCopy.affyId AND Data.exprId=”hs-liv-1”

AND DataCopy.exptId=”hs-hrt-1”

AND Data.affyId NOT LIKE “AFFX%” LIMIT 10;

 

19.          SELECT Data.affyId, Data.level Heart_level, DataCopy.level Brain_level,

Data.level - DataCopy.level Difference

FROM Data, Data DataCopy

WHERE Data.affyId= DataCopy.affyId

AND Data.exptId = “hs-hrt-1”

AND DataCopy.exptId = “hs-cer-1”

ORDER BY Difference DESC

LIMIT 10;

 

+------------+-------------+-------------+------------+

| affyId     | Heart_level | Brain_level | Difference |

+------------+-------------+-------------+------------+

| 32485_at   |       26961 |          20 |      26941 |

| 31737_at   |       26965 |          57 |      26908 |

| 41731_g_at |       26860 |          20 |      26840 |

| 39063_at   |       26574 |          61 |      26513 |

| 36640_at   |       25330 |         236 |      25094 |

| 38660_at   |       23964 |         115 |      23849 |

| 39031_at   |       24492 |         823 |      23669 |

| 31687_f_at |       24020 |        3035 |      20985 |

| 970_r_at   |       23390 |        2890 |      20500 |

| 41730_at   |       19780 |          43 |      19737 |

+------------+-------------+-------------+------------+

10 rows in set (1.66 sec)

 

20.  SELECT Data.affyId, Targets.gbId,

LocusLinks.

linkId, LocusDescr.description,

Data.level Heart_level, DataCopy.level Brain_level,

Data.level - DataCopy.level Difference

FROM Data, Data DataCopy, LocusLinks, LocusDescr,

Targets

WHERE Data.affyId= DataCopy.affyId

AND Data.exptId = ‘hs-hrt-1’

AND DataCopy.exptId = ‘hs-cer-1’

AND Data.affyId = Targets.affyId

AND Targets.gbId = LocusLinks.gbId

AND LocusLinks.linkId = LocusDescr.linkId

ORDER BY Difference DESC

LIMIT 10;

 

| affyId | gbId | linkId | description | Heart_level | Brain_level | Difference |