Relational Databases for Biologists

Brief descriptions of data in db4bio

Sources
- description of the source of mRNA for each microarray experiment. These data cover six experiments, only part of a study in which investigators looked at the expression level of genes in a variety of human and mouse organs and tissues.

Data
- actual experimental data for a series of microarray experiments. Each affyId represents a set of probes designed to measure expression of a specific gene (known or predicted). The level represents the relative expression level of a gene in a "source" sample of cells. Data are normalized so the sum of all levels on each chip should be the same.
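
The normalization property described above can be checked with a grouped sum. This is a minimal sketch using Python's built-in sqlite3 module and an in-memory toy table: the column name exptId (identifying the chip/experiment) and all values are assumptions for illustration, not the actual db4bio schema or data.

```python
import sqlite3

# In-memory toy table; column names other than affyId and level are assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Data (affyId TEXT, exptId INTEGER, level REAL)")
conn.executemany(
    "INSERT INTO Data VALUES (?, ?, ?)",
    [
        ("probe_a", 1, 100.0),  # made-up values, not real measurements
        ("probe_b", 1, 300.0),
        ("probe_a", 2, 250.0),
        ("probe_b", 2, 150.0),
    ],
)

# Sum the levels on each chip; with normalized data the sums should agree.
totals = dict(
    conn.execute("SELECT exptId, SUM(level) FROM Data GROUP BY exptId")
)
print(totals)
```

If the sums differ noticeably between experiments, the levels are probably not comparable across chips without further normalization.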

Targets
- data linking each piece of microarray data (affyId) to a GenBank sequence ID (gbId), which is the sequence of a gene or part of a gene (an EST, or "expressed sequence tag"). Species are mouse or human, except for some control genes (with Affy IDs starting with AFFX-).
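
As an illustration of how Targets links expression data to GenBank, the sketch below joins Data to Targets on affyId and skips the control probes by their AFFX- prefix. The table layouts beyond affyId, gbId, and level, and every value shown, are hypothetical stand-ins rather than the real db4bio contents.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Data (affyId TEXT, level REAL);
    CREATE TABLE Targets (affyId TEXT, gbId TEXT);
    -- Toy rows; the IDs are placeholders, not real accessions.
    INSERT INTO Data VALUES ('probe_a', 120.0), ('AFFX-ctrl', 50.0);
    INSERT INTO Targets VALUES ('probe_a', 'GB_0001'), ('AFFX-ctrl', 'GB_0002');
    """
)

# Join each measurement to its GenBank ID, excluding control probes.
rows = conn.execute(
    """
    SELECT d.affyId, t.gbId, d.level
    FROM Data d
    JOIN Targets t ON t.affyId = d.affyId
    WHERE d.affyId NOT LIKE 'AFFX-%'
    ORDER BY d.affyId
    """
).fetchall()
print(rows)
```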

Descriptions
- descriptions of DNA sequences in the GenBank repository, each represented by a gbId (a unique accession ID).

UniSeqs
- Every EST (part of a gene sequence) in the GenBank repository is compared to every other one, and ESTs are clustered together if enough of their sequences overlap, presumably because they are all part of the sequence of the same gene. Each Unigene cluster of ESTs has a unique ID, in which the first two characters refer to the species of origin.
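
The species prefix can be read directly off the cluster ID. A small sketch, assuming the usual Unigene convention of Hs for human (Homo sapiens) and Mm for mouse (Mus musculus); the function name and the example IDs are made up for illustration.

```python
# Map the two-character Unigene prefix to a species name.
SPECIES_BY_PREFIX = {"Hs": "human", "Mm": "mouse"}


def species_of(unigene_id: str) -> str:
    """Return the species implied by the first two characters of the ID."""
    return SPECIES_BY_PREFIX.get(unigene_id[:2], "unknown")


# Example IDs are placeholders, not real clusters.
for uid in ("Hs.1", "Mm.2", "Xx.3"):
    print(uid, species_of(uid))
```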

UniDescr
- description of a gene represented by a Unigene cluster. The description may be vague or completely uninformative if the function of the gene is unknown.

LocusLinks
- GenBank annotators attempt to assign most sequences (except for ESTs) to a Locus, representing a gene. As sequencing and annotation progress, the number of LocusLink IDs should approach the total number of genes in an organism. Just as every EST is generally assigned to a Unigene cluster, each RNA (actually, cDNA) sequence is assigned to a Locus.

LocusDescr
- description of a gene represented by a LocusLink ID. The description may be vague or completely uninformative if the function of the gene is unknown.

Unigenes
- data linking a Unigene cluster to a LocusLink ID. The number of Unigene clusters is much greater than the predicted number of genes in human and mouse. This may be mostly due to more than one cluster representing different parts of the same gene, with no EST overlapping them both.
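
One way to look for genes represented by more than one cluster is to group the linking table by LocusLink ID. The sketch below assumes hypothetical column names uId (Unigene cluster) and linkId (LocusLink ID) and uses invented rows; the real Unigenes table may be laid out differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Unigenes (uId TEXT, linkId INTEGER)")
conn.executemany(
    "INSERT INTO Unigenes VALUES (?, ?)",
    [
        ("Hs.1", 10),  # toy rows: two clusters point at locus 10
        ("Hs.2", 10),
        ("Hs.3", 11),
    ],
)

# LocusLink IDs that are linked to more than one Unigene cluster.
multi_cluster = conn.execute(
    """
    SELECT linkId, COUNT(*) AS n_clusters
    FROM Unigenes
    GROUP BY linkId
    HAVING COUNT(*) > 1
    ORDER BY linkId
    """
).fetchall()
print(multi_cluster)
```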

RefSeqs
- Transcript "reference sequences" for LocusLink IDs, which annotators assign to the full-length sequence of a gene (a cDNA) and the protein that it encodes. A LocusLink ID with alternative splicing may have more than one cDNA or protein reference sequence.

GO_Descr
- Gene Ontology is a large project that has created three detailed hierarchies, covering molecular function (e.g., enzyme), biological process (e.g., reproduction), and localization (e.g., nucleus), to systematically describe all proteins in these three ways.

Ontologies
- Gene Ontology annotators systematically assign proteins to the three GO hierarchies (if the function of the protein is known). This list links LocusLink IDs to GO accessions. This annotation is currently quite incomplete.
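
Because the annotation is incomplete, a query that starts from the genes and looks for GO assignments should use an outer join so that unannotated genes are not silently dropped. A minimal sketch, with assumed column names linkId and goAcc and toy rows that are not real accessions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE LocusDescr (linkId INTEGER, description TEXT);
    CREATE TABLE Ontologies (linkId INTEGER, goAcc TEXT);
    -- Toy rows: locus 2 has no GO annotation.
    INSERT INTO LocusDescr VALUES (1, 'toy gene one'), (2, 'toy gene two');
    INSERT INTO Ontologies VALUES (1, 'GO:toy1');
    """
)

# LEFT JOIN keeps every locus; a NULL goAcc marks a gene with no annotation.
unannotated = [
    row[0]
    for row in conn.execute(
        """
        SELECT l.linkId
        FROM LocusDescr l
        LEFT JOIN Ontologies o ON o.linkId = l.linkId
        WHERE o.goAcc IS NULL
        ORDER BY l.linkId
        """
    )
]
print(unannotated)
```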

Note: Most tables contain only partial data (but should contain enough data to link them together).

Questions or comments? gbell@wi.mit.edu
Bioinformatics and Research Computing at Whitehead Institute