Using BLAT on fladda

Finding everything on fladda | Indexing the genome | Commands | BLAT output |
BLATing GenBank sequences | Description of BLAT | Extracting genomic sequence | Software credits


Finding everything on fladda

BLAT and the BLAT-formatted human genome are available on fladda. Multiple versions of the human genome are in directories in /cluster/db0/:

  • human UCSC Golden Path (Nov 2002 from NCBI Build 31) at /cluster/db0/human_gp_jun_02
  • human UCSC Golden Path (Jun 2002 from NCBI Build 30) at /cluster/db0/human_gp_jun_02
  • human UCSC Golden Path (Apr 2002 from NCBI Build 29) at /cluster/db0/human_gp_29
  • human NCBI Build 28 (NCBI files) at /cluster/db0/human_ncbi28
  • human UCSC Golden Path (Dec 2001 from NCBI Build 28) at /cluster/db0/human_gp_dec_01
  • human UCSC Golden Path (Aug 2001) at /cluster/db0/human_gp_aug_01
  • mouse UCSC Golden Path (Feb 2002) at /cluster/db0/mouse_gp_feb_02 (from Ensembl)

For the time being, all of the executables described below (faToNib, gfServer, gfClient, etc.) are found in /usr/people/gbell/bin.

    Indexing the genome

    The genome needs to be indexed in "nib" format before using BLAT.
    All of the genomes above have already been indexed.
    If you need to index your own genome, use the faToNib command.
    When indexing a directory of FASTA files, it may be easiest to use a
    shell script, named something like "makeNibFiles.csh" and made executable.
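
    As an alternative to a csh script, the same loop can be sketched in Python. This is a minimal sketch, assuming faToNib is called as "faToNib in.fa out.nib" and that the FASTA files end in ".fa"; the function names are illustrative and not part of the BLAT suite.

```python
from pathlib import Path
import subprocess


def nib_commands(fasta_dir, nib_dir, exe="faToNib", ext=".fa"):
    """Build one faToNib command per FASTA file in fasta_dir.

    Assumes the usage "faToNib in.fa out.nib"; run faToNib with no
    arguments to confirm the syntax of the installed version.
    """
    commands = []
    for fasta in sorted(Path(fasta_dir).glob("*" + ext)):
        nib = Path(nib_dir) / (fasta.stem + ".nib")
        commands.append([exe, str(fasta), str(nib)])
    return commands


def make_nib_files(fasta_dir, nib_dir, exe="faToNib"):
    """Run the commands one by one, stopping at the first failure."""
    Path(nib_dir).mkdir(parents=True, exist_ok=True)
    for cmd in nib_commands(fasta_dir, nib_dir, exe):
        subprocess.run(cmd, check=True)
```

    Calling nib_commands() first and printing the result is a cheap way to check the file names before anything is written.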

    Commands

    Before starting, index the genome by going to the directory of nib files and issuing the gfServer command:

    cd /path/to/nib/dir
    gfServer start fladda.wi.mit.edu portNum *.nib

    where portNum is some 4-5 digit number greater than 1024. The gf (genomic finding) indexing performed with the gfServer command usually takes on the order of 15 minutes. You can monitor progress through chromosomes as tiles are counted and then added. When the process is complete, you'll get the message

    Done adding
    Server ready for queries!

    To run blat, use the gfClient command:

    gfClient [-out=pslx, etc.][-nohead] fladda.wi.mit.edu portNum /full/path/to/nib/dir seqFileToBlat outFile

    where portNum is the same as you used with the gfServer command.

    When you're finished with all of your BLATing, stop the gfServer:

    gfServer stop fladda.wi.mit.edu portNum

    where portNum is the same as you used to start the gfServer. To analyze multiple sequences with BLAT, use a multiple sequence file as input and you'll get one big output file.

    To get help with command syntax and options, run any of these commands with no arguments.

    BLAT output

    Sample BLAT output for three input sequences shows the default output format. This is slightly different from the format of the web version of BLAT.

    One key point is that command-line BLAT doesn't prioritize hits from best to worst. Web BLAT does this by ordering by "SCORE", which is calculated as SCORE = matches - mismatches. In other words, you need to sort any multiple-hit results to find the best one, which isn't necessarily the first. Another key point is that there may be no obvious "best" hit: several alignments may produce similar scores, and one needs to decide how many of these hits (if any) are biologically meaningful.

    The output is tab-delimited, which may help when importing it into another application for sorting. With the option '-nohead', the 5-line header is not printed. Other output options:

    	pslx - Tab separated format with sequence
    	axt - blastz-associated axt format
    	maf - multiz-associated maf format
    	wublast - similar to wublast format
    	blast - similar to NCBI blast format
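
    Because the default output is not ranked, sorting has to be done afterwards. The following is a minimal sketch of that step, assuming the default tab-delimited output and the simplified SCORE = matches - mismatches described above; the function names are illustrative, and the column positions (match first, mismatch second, Q name tenth) follow the field list for the default output.

```python
from collections import defaultdict


def psl_rows(lines):
    """Yield tab-split data rows, skipping header and blank lines."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Data rows start with the integer "match" column; header lines do not.
        if len(fields) >= 21 and fields[0].isdigit():
            yield fields


def score(fields):
    """SCORE = matches - mismatches, the simplified formula given above."""
    return int(fields[0]) - int(fields[1])


def rank_hits(lines):
    """Group hits by query name (column 10) and sort each group best-first."""
    by_query = defaultdict(list)
    for fields in psl_rows(lines):
        by_query[fields[9]].append(fields)
    return {q: sorted(hits, key=score, reverse=True)
            for q, hits in by_query.items()}
```

    With a multiple-sequence input, rank_hits(open("outFile")) returns one best-first list per query name. Hits with similar scores still need to be judged by eye.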
    
    The fields for the default output are:

    match - number of matching nucleotides in alignment that aren't part of repeats
    mismatch - number of nucleotides in alignment that don't match
    rep. match - number of nucleotides in alignment that are part of repeats
    N's - number of N's in alignment
    Q gap count - number of inserts in query sequence
    Q gap bases - number of nucleotides inserted in query sequence
    T gap count - number of inserts in target sequence (chromosome)
    T gap bases - number of nucleotides inserted in target sequence (chromosome)
    strand - chromosome strand (+ or -)
    Q name - name of query sequence
    Q size - length of query sequence
    Q start - start of query sequence in alignment
    Q end - end of query sequence in alignment
    T name - matching chromosome ("target", e.g. chr13)
    T size - length of target sequence (chromosome)
    T start - start of target sequence (chromosome) comprising alignment
    T end - end of target sequence (chromosome) comprising alignment
    block count - number of blocks (may be exons for cDNA) of matching regions
    blockSizes - sizes of blocks (may be exons for cDNA) of matching regions (delimited by commas)
    qStarts - list of query nts at starts of blocks (delimited by commas)
    tStarts - list of chromosome nts at starts of blocks (delimited by commas)
    qSeqs* - list of query sequence blocks (delimited by commas)
    *qSeqs are printed only if gfClient is run with the option -out=pslx
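
    To work with these fields by name rather than by column number, one output row can be turned into a dictionary. This is a minimal sketch, assuming the default tab-delimited output with the columns in the order listed above; the key names are illustrative spellings of those field names.

```python
PSL_FIELDS = [
    "match", "mismatch", "rep_match", "n_count",
    "q_gap_count", "q_gap_bases", "t_gap_count", "t_gap_bases",
    "strand", "q_name", "q_size", "q_start", "q_end",
    "t_name", "t_size", "t_start", "t_end",
    "block_count", "block_sizes", "q_starts", "t_starts",
]
TEXT_FIELDS = {"strand", "q_name", "t_name"}
LIST_FIELDS = {"block_sizes", "q_starts", "t_starts"}


def parse_psl_line(line):
    """Turn one tab-delimited output row into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) < len(PSL_FIELDS):
        raise ValueError("expected at least %d columns" % len(PSL_FIELDS))
    hit = {}
    for name, value in zip(PSL_FIELDS, values):
        if name in TEXT_FIELDS:
            hit[name] = value
        elif name in LIST_FIELDS:
            # Comma-delimited lists end with a trailing comma, e.g. "50,72,"
            hit[name] = [int(x) for x in value.split(",") if x]
        else:
            hit[name] = int(value)
    # Any extra columns (such as qSeqs with -out=pslx) are kept as raw text.
    hit["extra"] = values[len(PSL_FIELDS):]
    return hit
```

    Header lines are not handled here, so either run gfClient with -nohead or skip the first five lines before calling parse_psl_line().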

    BLATing GenBank sequences

    BLAT has already been performed on all GenBank sequences (including ESTs) by the Golden Path group at UCSC. Tab-delimited files containing all data can be downloaded from the UCSC Genome Annotation Database. The files and their data fields are described in the Genome Browser Database. Data fields for the files all_mrna.txt and all_est.txt are created directly by combining multiple BLAT output files. Please contact George Bell if you would like some help processing and analyzing the (very large) annotation files. Note that this annotation applies to only one specific draft assembly.

    Description of BLAT (from UCSC's blat page)

    BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect sequence matches of 33 bases, and sometimes find them down to 22 bases. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more. In practice DNA BLAT works well on primates, and protein BLAT on land vertebrates.

    BLAT is not BLAST. DNA BLAT works by keeping an index of the entire genome in memory. The index consists of all non-overlapping 11-mers except for those heavily involved in repeats. . . The genome itself is not kept in memory. . . The index is used to find areas of probable homology, which are then loaded into memory for a detailed alignment. Protein BLAT works in a similar manner, except with 4-mers rather than 11-mers.
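
    The indexing idea can be illustrated with a toy example. This is only a sketch of the general technique (non-overlapping k-mers from the target, every overlapping k-mer from the query), not BLAT's actual implementation; it leaves out the repeat filter and the detailed alignment step, and the function names are hypothetical.

```python
from collections import defaultdict


def build_index(targets, k=11):
    """Index the non-overlapping k-mers of each target sequence.

    targets maps a name (e.g. "chr13") to its sequence string.
    Returns {kmer: [(name, position), ...]}.
    """
    index = defaultdict(list)
    for name, seq in targets.items():
        seq = seq.upper()
        for pos in range(0, len(seq) - k + 1, k):  # step k: no overlap
            index[seq[pos:pos + k]].append((name, pos))
    return index


def seed_hits(index, query, k=11):
    """Look up every overlapping k-mer of the query in the index.

    Returns (query_pos, target_name, target_pos) seeds; a real aligner
    would cluster these and align the surrounding region in detail.
    """
    query = query.upper()
    hits = []
    for qpos in range(len(query) - k + 1):
        for name, tpos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, name, tpos))
    return hits
```

    Stepping through the target k bases at a time keeps the index roughly k times smaller than indexing every position, while sliding along the query one base at a time still lets a matching region line up with an indexed k-mer.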

    As of October 2002, BLAT can map DNA sequences as short as 21-mers.

    Extracting genomic sequence

    Once one has mapped a sequence to the genome, adjacent sequence can be easily extracted using the nibFrag command:

    nibFrag chrFile.nib startNT endNT strand outFile

    Unlike BLAT, nibFrag doesn't require any indexes in memory, and it's much faster than EMBOSS's extractseq. It only works, however, on nib-formatted sequence files. For the time being, nibFrag is found in /usr/people/gbell/bin.
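
    To pull out a hit together with some adjacent sequence, the start and end can be widened before calling nibFrag. This is a minimal sketch, assuming nibFrag uses the same coordinate convention as the T start and T end columns of the BLAT output; the function names and the flank parameter are illustrative, and the accepted strand values should be checked against nibFrag's own usage message.

```python
def flank_region(t_start, t_end, t_size, flank):
    """Extend a hit by `flank` bases on each side, clamped to the chromosome."""
    return max(0, t_start - flank), min(t_size, t_end + flank)


def nibfrag_command(nib_file, t_start, t_end, t_size, strand, out_file,
                    flank=0, exe="nibFrag"):
    """Build the argument list for: nibFrag chrFile.nib startNT endNT strand outFile

    The coordinate convention is an assumption; check nibFrag's help
    text before relying on exact boundaries.
    """
    start, end = flank_region(t_start, t_end, t_size, flank)
    return [exe, nib_file, str(start), str(end), strand, out_file]
```

    The returned list can be passed to subprocess.run(), or joined with spaces and pasted into a shell.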

    Software credits

    BLAT and the associated genomic finding software are courtesy of Jim Kent; see "Source Code" or "Executables."



