
Perl scripts for Bioinformatics

Script name, description, sample input, sample output, download

Test Perl on your system (download)
Reverse and complement a fasta sequence using EMBOSS's 'revseq' command (download)
Extract oligos from a sequence and analyze them (download)
Run patscan (to search for a pattern) on every sequence in a directory (download)
puzzle_helper.html: Web-based interface for the puzzle.cgi script (download: NA)
Simple GenBank nucleotide report parser using regular expressions (input, output, download)
Use LWP to automate web file access (input, output, download)
Draw a PNG figure using the GD module (input, output, download)
Sequence conversion with BioPerl (input, output, download)
Split a file of multiple sequences into separate files and modify the format (download)
Parse GenBank sequence features with BioPerl (input, output, download)
Manipulate a sequence with BioPerl (input, output, download)
Parse BLAST output files with BioPerl's SearchIO (input, output, download)
Sort BLAT output to select only the best hit(s) for each query sequence (input, output, download)
Merge lines of BLAT output to one line for each query sequence (input, output, download)
Align a list of pairs of sequences using different algorithms (input, outputs 1 2 3, download)
Extract data from a set of Excel files in a directory (input, output, download)

Unix commands for Bioinformatics

Script and description
Count the number of fasta sequences in a multiple-sequence fasta file:
grep ">" mySeqs.fa | wc -l
Extract one sequence (with ID 'myAcc') from a multiple-sequence fasta file ('multSeqFile'):
sed -n '/myAcc/, />/p' multSeqFile | sed '$d' > oneSeqFile
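The sed pipeline above matches 'myAcc' anywhere on a line and, when the target is the last record in the file, the trailing `sed '$d'` removes its final sequence line. A minimal awk sketch that avoids both cases is shown below; the sample records and the helper name `extract_seq` are made up for illustration, and it assumes the ID is the first word of the header.

```shell
#!/bin/sh
# Hypothetical sample data; only the file name comes from the example above.
tmpdir=$(mktemp -d)
cat > "$tmpdir/multSeqFile" <<'EOF'
>seqA first record
ACGT
>myAcc target record
GGCC
TTAA
EOF

# extract_seq ID FILE (hypothetical helper):
# print the record whose header starts with ">ID" as its first word.
extract_seq() {
    awk -v id="$1" '
        /^>/ { keep = (substr($1, 2) == id) }  # on each header, decide whether this record is wanted
        keep                                   # print header and sequence lines while keep is true
    ' "$2"
}

one_seq=$(extract_seq myAcc "$tmpdir/multSeqFile")
printf '%s\n' "$one_seq"
rm -rf "$tmpdir"
```

Because the comparison is an exact string match on the first header word, an ID such as 'myAcc2' would not be picked up by a search for 'myAcc'.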
Sort the lines of a comma-delimited file (by the 6th field in text order, then by the 1st field in reverse numerical order):
sort -t, -k 6,6 -k 1,1nr fileToSort
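To see how the two keys interact, here is a small worked example. The four input lines are invented for illustration, and `LC_ALL=C` is an added assumption to make the text ordering independent of the local locale.

```shell
#!/bin/sh
# Hypothetical comma-delimited input: a number in field 1, a label in field 6.
tmpdir=$(mktemp -d)
cat > "$tmpdir/fileToSort" <<'EOF'
2,a,b,c,d,mouse
10,a,b,c,d,human
1,a,b,c,d,mouse
3,a,b,c,d,human
EOF

# Primary key: field 6 as text. Secondary key: field 1, numeric, descending.
sorted=$(LC_ALL=C sort -t, -k 6,6 -k 1,1nr "$tmpdir/fileToSort")
printf '%s\n' "$sorted"
rm -rf "$tmpdir"
```

The 'human' lines come first (10, then 3), followed by the 'mouse' lines (2, then 1). Without the `n` flag on the second key, 10 would be compared as text rather than as a number.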
Print lines that match a pattern ('myPattern'):
grep myPattern myFile
Print lines that don't match a pattern ('myPattern'):
grep -v myPattern myFile
Print each line of a tab-delimited file whose 8th field is 10090:
awk -F "\t" '$8 == 10090 { print $0 }' myFile
Print fields 1, 2, 3 from a tab-delimited file where the 4th field contains a '99':
awk -F "\t" '$4 ~ /99/ {print $1"\t"$2"\t"$3}' myFile
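The two awk commands above use different tests: `==` compares the whole field, while `~ /99/` matches '99' anywhere inside it. A short sketch of the difference, using three made-up rows:

```shell
#!/bin/sh
# Hypothetical tab-delimited input with the value of interest in field 4.
tmpdir=$(mktemp -d)
printf 'g1\tchr1\t100\t99\n'  >  "$tmpdir/myFile"
printf 'g2\tchr2\t200\t199\n' >> "$tmpdir/myFile"
printf 'g3\tchr3\t300\t7\n'   >> "$tmpdir/myFile"

# Regex match: field 4 contains "99" somewhere (so 199 also qualifies).
contains=$(awk -F "\t" '$4 ~ /99/ {print $1}' "$tmpdir/myFile")

# Equality test: field 4 is exactly 99.
exact=$(awk -F "\t" '$4 == 99 {print $1}' "$tmpdir/myFile")

printf 'contains 99: %s\n' "$contains"
printf 'equals 99:   %s\n' "$exact"
rm -rf "$tmpdir"
```

Here the regex form selects g1 and g2, whereas the equality form selects only g1.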
Add text ('lcl|') after the ">" to format a fasta file for BLAST indexing:
sed 's/>/>lcl|/' mySeqs.fa
Find all files ending in .pl and copy them to the 'Perl_archive' directory:
find . -name \*.pl -exec cp {} Perl_archive/ \;
Remove HTML tags:
sed -e :a -e 's/<[^>]*>//g;/</N;//ba' myFile.html
Print lines, from 2 lines before to 3 lines after, when a word ("ABC99") is matched:
grep -B2 -A3 "ABC99" myFile
Convert lowercase letters (a, c, t, g) into 'n' using the 'tr' command (note that this also changes those letters in the header lines):
tr actg nnnn < softmasked_sequence.fa > hardmasked_sequence.fa
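Since tr translates every line, lowercase a, c, g and t in the fasta header lines are replaced as well. One possible alternative, sketched here with a made-up two-line file, is sed's `y` command restricted to lines that do not start with '>'.

```shell
#!/bin/sh
# Hypothetical soft-masked input; file names follow the example above.
tmpdir=$(mktemp -d)
cat > "$tmpdir/softmasked_sequence.fa" <<'EOF'
>chr1_test softmasked
ACGTacgtNNacAC
EOF

# On every non-header line, map a, c, g, t to n.
# The y command requires source and target lists of equal length.
sed '/^>/!y/acgt/nnnn/' "$tmpdir/softmasked_sequence.fa" > "$tmpdir/hardmasked_sequence.fa"

hardmasked=$(cat "$tmpdir/hardmasked_sequence.fa")
printf '%s\n' "$hardmasked"
rm -rf "$tmpdir"
```

The header line is passed through unchanged, and only the lowercase bases in the sequence line become 'n'.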
Remove all version numbers (ex: '.1') from the end of a list of sequence accessions:
sed 's/\.[0-9][0-9]*$//' accsWithVersion > accsOnly