SEQUENCE ANALYSIS UNIX COMMANDS - Lecture 3
Getting genome sequence
- Your hebrides account can't hold this much data, so only do this if you want the genome on your own computer.
- The UCSC FTP site has the same directory hierarchy as the UCSC Genome Bioinformatics downloads page.
- FTP to a repository like UCSC
- ftp genome.ucsc.edu
- log in as 'anonymous' and give your email address when prompted
- cd goldenPath
- ls -F (and select a directory like hg16 [for the human July 2003 assembly])
- cd bigZips (for everything in a big file) or cd chromosomes (for separate files for each chromosome)
- get chromFa.zip (or whatever file you want)
- [continue for every file you want]
- quit
- unzip chromFa.zip (or whatever file you want to unzip)
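The interactive FTP session above can also be scripted. The following is a minimal sketch using Python's standard ftplib module, assuming the server still accepts anonymous logins and still uses the goldenPath layout described above. The helper names remote_dir and fetch are made up for this illustration.

```python
from ftplib import FTP

HOST = "genome.ucsc.edu"  # host named in the steps above


def remote_dir(assembly, subdir="bigZips"):
    """Build the directory reached by the 'cd' steps above (hypothetical helper)."""
    return f"goldenPath/{assembly}/{subdir}"


def fetch(assembly, filename, email, subdir="bigZips"):
    """Download one file anonymously; the email address is sent as the password."""
    with FTP(HOST) as ftp:
        ftp.login(user="anonymous", passwd=email)
        ftp.cwd(remote_dir(assembly, subdir))
        with open(filename, "wb") as out:
            # retrbinary transfers in binary mode, which compressed files need
            ftp.retrbinary(f"RETR {filename}", out.write)
    return filename
```

Calling fetch("hg16", "chromFa.zip", "you@example.org") would correspond to the 'get chromFa.zip' step. The email address here is a placeholder, and the file still has to be unzipped afterwards.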
Mapping and extracting genomic sequence
- First: format the genome using the faToNib command.
- To BLAT search the genome with a multi-sequence FASTA file:
- first index the genome (load it into memory) with the gfServer command,
- then search the genome with the gfClient command.
- Extract a region of a chromosome using the nibFrag command.
- See "Using BLAT on hebrides" for all the details of these commands.
- There is also a command on hebrides called 'blat', but it only maps one sequence to another, not to the entire genome.
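To make the idea of extracting a region concrete, here is a small Python sketch that pulls a subsequence out of a single-record FASTA file. It is not nibFrag and does not read .nib files. It assumes 0-based coordinates with the end position excluded, and the function names are invented for this example.

```python
def read_fasta_sequence(lines):
    """Join the sequence lines of a single-record FASTA input into one string.

    `lines` can be an open file handle or any iterable of text lines.
    """
    parts = []
    for line in lines:
        if not line.startswith(">"):  # skip the header line
            parts.append(line.strip())
    return "".join(parts)


# Translation table used to complement bases on the minus strand
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_region(sequence, start, end, strand="+"):
    """Return sequence[start:end]; reverse-complement it if strand is '-'."""
    region = sequence[start:end]
    if strand == "-":
        region = region.translate(_COMPLEMENT)[::-1]
    return region
```

For a chromosome file from the chromosomes directory, one would open the file, pass the handle to read_fasta_sequence, and then call extract_region with the coordinates of interest. Whole chromosomes are large, so this approach holds the entire sequence in memory.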
Downloading the Ensembl annotation
- FTP to Ensembl
- ftp ftp.ensembl.org
- log in as 'anonymous' and give your email address when prompted
- cd pub
- ls -F
- cd current_mart (or the directory you want)
- cd data/mysql (if you want EnsMart)
- ls -F
- get ensembl_mart_19_2.sql.gz (or whatever version you want) to get a description of all the fields in all the files in this directory
- get hsapiens_ensemblgene_main.txt.table.gz (or whatever file you want)
- [continue for every file you want]
- quit
- gunzip hsapiens_ensemblgene_main.txt.table.gz (or whatever .gz file you want to gunzip)
- Since the file is tab-delimited, it can be parsed with Perl
- The file can also be imported into a MySQL database
- using the CREATE TABLE command for the table(s) in the SQL file shown above, and
- the LOAD DATA LOCAL INFILE command (see the MySQL documentation for details).
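The notes suggest Perl for parsing the tab-delimited file. The sketch below shows the same idea in Python instead, using the standard csv module. It assumes the file has already been gunzipped and has no header row. The function names and the column index used in the usage note are hypothetical, so check the .sql file for the real field order.

```python
import csv


def read_rows(lines):
    """Yield each non-empty tab-delimited line as a list of field strings."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:  # skip blank lines
            yield row


def rows_matching(lines, column, text):
    """Yield rows whose field at index `column` contains `text`, ignoring case."""
    needle = text.lower()
    for row in read_rows(lines):
        if column < len(row) and needle in row[column].lower():
            yield row
```

A typical use would be to open hsapiens_ensemblgene_main.txt.table with open(path, newline="") and loop over rows_matching(handle, column, "growth factor"), where column is the position of the description field in that table.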
Querying the Ensembl database
- Connect: mysql -u anonymous -h kaka.sanger.ac.uk (to get a MySQL prompt)
- show databases;
- use ensembl_mart_19_1; (or the database you want)
- show tables;
- describe hsapiens_ensemblgene_main; (or the table you want)
- show tables;
- enter a query:
- ex1: select * from hsapiens_ensemblgene_main limit 10;
- ex2: select * from hsapiens_ensemblgene_main where description like "%growth factor%";
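The same kind of query can be issued from a script. The sketch below runs an ex2-style LIKE query with a row limit, but against an in-memory SQLite table as a stand-in, since the remote MySQL server may not be reachable. The table and description column names come from the examples above. The gene_id column and the two rows are made up, and a MySQL driver would typically use a different placeholder style than the '?' shown here.

```python
import sqlite3

TABLE = "hsapiens_ensemblgene_main"  # table name from the examples above


def growth_factor_genes(conn, pattern="%growth factor%", limit=10):
    """Run a LIKE query on the description column with bound parameters."""
    sql = f"SELECT * FROM {TABLE} WHERE description LIKE ? LIMIT ?"
    return conn.execute(sql, (pattern, limit)).fetchall()


# Stand-in data: an in-memory SQLite table with invented rows.
conn = sqlite3.connect(":memory:")
# gene_id is a hypothetical column; the real table has many more fields
conn.execute(f"CREATE TABLE {TABLE} (gene_id TEXT, description TEXT)")
conn.executemany(
    f"INSERT INTO {TABLE} VALUES (?, ?)",
    [
        ("geneA", "fibroblast growth factor 2"),
        ("geneB", "zinc finger protein"),
    ],
)
rows = growth_factor_genes(conn)
```

Binding the pattern and the limit as parameters, rather than pasting them into the SQL string, keeps the query safe if the search text ever comes from user input.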
WIBR Sequence Analysis Course 2004