Relational Databases
for Biologists
Session 2: SQL to data mine a database
Exercises
Goals: Learn how to query a MySQL database to
extract selected data relationships. Learn how to convert a question (in
English) into a SQL query and vice versa.
See /education/bioinfo2005/db4bio/
for the course page.
See http://www.mysql.com/documentation/
or
MySQL documentation.
See http://www.mysql.com/doc/en/Tutorial.html for
an SQL tutorial.
Log in to the
Òdb4bioÓ MySQL database on hebrides:
mysql
–u username –D database -p
with the
following options:
-u
username (replace with your MySQL username)
-D
database (Òdb4bioÓ; optional – you can select the database later)
-p
password (youÕll be prompted for it)
Try some queries
in increasing level of complexity, starting with some easy examples to show
correct syntax. Note: MySQL keywords (SELECT, FROM, etc.), data fields, and
search strings are case-insensitive but other words (table names) ARE case
sensitive.
A query can
always be saved as a file, e.g., myquery.sql. Then run it from the MySQL prompt
by typing :
SOURCE
myquery.sql
Convert each
of the following SQL queries into a question, and convert each of the following
questions into SQL. They are listed in order of increasing difficulty, and some
are quite difficult.
1. SELECT
* from Sources;
2. What are some
example GenBank IDs (accessions) and corresponding sequence descriptions?
3. SELECT
count(*) FROM LocusLinks; (to get number of tuples in table)
4. How many
different Affy IDs are in the expression data?
5. SELECT
description, species FROM LocusDescr
WHERE
linkId = 655;
Note that 655 is
of type integer (int), so it doesnÕt need quotes.
6. WhatÕs the
expression level of Affy ID 33659_at in experiment hs-liv-1?
7. SELECT
* FROM LocusDescr WHERE description LIKE Òphosphatase%Ó;
Note that the
Ò%Ó (wildcard) matches any character(s).
8. What are all
the gene descriptions, along with their GenBank IDs, containing the phrase
Ògrowth factorÓ?
9. SELECT
* from Data WHERE exptId = Òmm-hrt-1Ó and level > 6000
ORDER
BY level DESC;
10. What Gene
Ontology terms (and corresponding accessions) contain the phrase Òtranscription
factorÓ? Put your answer in alphabetical order of terms.
11. SELECT
Data.affyId, Targets.gbId, Data.level
FROM
Data, Targets
WHERE
Data.affyId = Targets.affyId
ORDER
BY Data.level DESC
LIMIT
10;
12. What Affy
probes correspond to target sequences with the phrase ÒinterleukinÓ in their
description?
13. SELECT
Data.affyId, Targets.gbId,
UniSeqs.uId,
Data.level
FROM
Data, Targets, UniSeqs
WHERE
Data.affyId = Targets.affyId
AND Targets.gbId = UniSeqs.gbId
ORDER
BY level DESC LIMIT 10;
14. Make a table
of ten affyId, UnigeneID, and Unigene descriptions in revserse alphabetical
order of Unigene descriptions.
15. SELECT
affyId, AVG(level) AS average FROM Data
GROUP
BY AffyId ORDER BY level DESC LIMIT 10;
Note that AVG()
is an aggregate [like COUNT(), MIN(), MAX(), and SUM()] and ÒASÓ creates a
field with the requested name. Since youÕre combining rows, you have to use
GROUP to show how to do it.
16. What is the
average expression level of each of the six experiments?
17. SELECT
Data.affyId, Data.level, Data.exptId, DataCopy.affyId, DataCopy.level,
DataCopy.exptId
FROM
Data, Data DataCopy
WHERE
Data.level > 10 * DataCopy.level
AND Data.affyId=DataCopy.affyId
AND Data.affyId NOT LIKE
"AFFX%"
LIMIT
10;
(To compare rows
of the same table, we copy the table and compare the table to its copy.)
18. List ten
non-control human Affy IDs with higher expression in liver than in heart.
19. SELECT
Data.affyId,
Data.level AS Heart_level, DataCopy.level
AS Brain_
level,
Data.level - DataCopy.level AS Difference
FROM
Data, Data DataCopy
WHERE
Data.affyId= DataCopy.affyId
AND Data.exptId = Òhs-hrt-1Ó
AND DataCopy.exptId = Òhs-cer-1Ó
ORDER
BY Difference DESC
LIMIT
10;
20. SELECT
Data.affyId, Targets.gbId,
LocusLinks.
linkId, LocusDescr.description,
Data.level as Heart_level, DataCopy.level
as Brain_
level,
Data.level - DataCopy.level AS Difference
FROM
Data, Data DataCopy, LocusLinks, LocusDescr,
Targets
WHERE
Data.affyId= DataCopy.affyId
AND Data.exptId = Òhs-hrt-1Ó
AND DataCopy.exptId = Òhs-cer-1Ó
AND Data.affyId = Targets.affyId
AND Targets.gbId = LocusLinks.gbId
AND LocusLinks.linkId = LocusDescr.linkId
ORDER
BY Difference DESC
LIMIT
10;
Brief descriptions of
data in db4bio
Note: Most
tables contain only partial data (but should contain enough data to link them
together).
Sources –
description of source of mRNA for each microarray experiment. This data are for
six experiments, only part of a study in which investigators looked at the
expression level of genes in a variety of human and mouse organs and tissues.
Data –
actual experimental data for a series of microarray experiments. Each affyId
represents a set of probes designed to measure expression of a specific gene
(known or predicted). The level represents the relative expression level of a
gene in a ÒsourceÓ sample of cells. Data are normalized so the sum of all
levels on each chip should be the same.
Targets –
data linking each piece of microarray data (affyId) to a GenBank sequence ID
(gbId), which is the sequence of a gene or part of a gene (an EST, or
Òexpressed sequence tagÓ). Species are mouse or human, except for some control
genes (with Affy IDs starting with AFFX-).
Descriptions
– descriptions of DNA sequences in the GenBank repository, each
represented by a gbId (a unique accession ID).
UniSeqs –
Every EST (part of a gene sequence) in the GenBank repository is compared to
every other one, and ESTs are clustered together if enough of their sequences
overlap, presumably because they are all part of the sequence of the same gene.
Each Unigene cluster of ESTs has a Unique ID, in which the first two characters
refer to the species of origin.
UniDescr –
description of a gene represented by a Unigene cluster. The description may be
vague or completely uninformative if the function of the gene is unknown.
LocusLinks
– GenBank annotators attempt to assign most sequences (except for ESTs)
to a Locus, representing a gene. As sequencing and annotation progresses, the
number of LocusLink IDs should approach the total number of genes in an
organism. Just as every EST is generally assigned to a Unigene cluster, each
RNA (actually, cDNA) sequence is assigned to a Locus.
LocusDescr
– description of a gene represented by a LocusLink ID. The description
may be vague or completely uninformative if the function of the gene is
unknown.
Unigenes –
data linking a Unigene cluster to a LocusLink ID. The number of Unigene
clusters is much greater than the predicted number of genes in human and mouse.
This may be mostly due to more than one cluster representing different parts of
the same gene, with no EST overlapping them both.
RefSeqs –
Òreference sequencesÓ for LocusLink IDs, which annotators assign to the full
length sequence of a gene (a cDNA) and the protein which it encodes. A
LocusLink ID with alternative splicing may have more than one cDNA or protein
reference sequence.
GO_Descr –
Gene Ontology is a big project that has created three detailed hierarchies
describing molecular function (ex: enzyme), biological process (ex:
reproduction), and localization (ex: nucleus) to systematically describe all
proteins in these three ways.
Ontologies
– Gene Ontology annotators systematically assigns proteins to the three
GO hierarchies (if the function of the protein is known). This list links
LocusLink IDs to GO accessions. This annotation is currently quite incomplete.