BlastClust

Using BLASTClust to Make Non-redundant Sequence Sets

BLASTClust is a program within the standalone BLAST package used to cluster either protein or nucleotide sequences. The program begins with pairwise matches and places a sequence in a cluster if the sequence matches at least one sequence already in the cluster. In the case of proteins, the blastp algorithm is used to compute the pairwise matches; in the case of nucleotide sequences, the Megablast algorithm is used.
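The rule described above (a sequence joins a cluster if it matches at least one member) is what is commonly called single-linkage clustering. The following is a minimal Python sketch of that idea, not BLASTClust's actual implementation. It assumes the pairwise matches have already been computed; the function name `single_linkage_clusters` and the sample identifiers are hypothetical.

```python
def single_linkage_clusters(ids, matches):
    """Group ids so that any two linked by a chain of matches share a cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # One match between two clusters is enough to merge them.
            parent[root_b] = root_a

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    # Largest cluster first.
    return sorted(clusters.values(), key=len, reverse=True)


# Hypothetical input: "a" matches "b" and "b" matches "c", so all three
# end up together even though "a" and "c" were never matched directly.
example = single_linkage_clusters(
    ["a", "b", "c", "d", "e"],
    [("a", "b"), ("b", "c")],
)
```

One consequence of this rule is chaining: two sequences can land in the same cluster without matching each other, as long as a chain of pairwise matches connects them.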

In the simplest case, BLASTClust takes as input a file containing concatenated FASTA-format sequences, each with a unique identifier at the start of the definition line. BLASTClust formats the input sequences to produce a temporary BLAST database, performs the clustering, and removes the database at completion. Hence, there is no need to run formatdb in advance to use BLASTClust. The output of BLASTClust consists of a file, one cluster to a line, of sequence identifiers separated by spaces. The clusters are sorted from the largest cluster to the smallest.
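Because the output is one cluster per line with space-separated identifiers, it is easy to post-process. The sketch below parses text in that layout and keeps one identifier per cluster. Taking the first identifier on each line as the representative is an arbitrary choice made for illustration, and the function names and sample identifiers are hypothetical.

```python
def read_clusters(text):
    """Parse cluster output: one cluster per line, identifiers separated by spaces."""
    return [line.split() for line in text.splitlines() if line.strip()]


def representatives(clusters):
    # Illustrative assumption: keep the first identifier listed in each cluster.
    return [cluster[0] for cluster in clusters]


# Hypothetical output text with three clusters, largest first.
sample_output = "seq1 seq4 seq7\nseq2 seq3\nseq5\n"
clusters = read_clusters(sample_output)
reps = representatives(clusters)
```

To apply this to a real run, the text could be read with `open("outfile").read()`, where "outfile" is whatever name was passed to `-o`.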

BLASTClust accepts a number of parameters that can be used to control the stringency of clustering, including thresholds for score density, percent identity, and alignment length. The BLASTClust program has a number of applications, the simplest of which is to create a non-redundant set of sequences from a source database. As an example, one might have a library of a few thousand short nucleotide sequence reads and wish to replace these with a non-redundant set. To produce the non-redundant set, one might use:

blastclust -i infile -o outfile -p F -L .9 -b T -S 95

The sequences in "infile" will be clustered and the results will be written to "outfile". The input sequences are identified as nucleotide (-p F); "-p T", or protein, is the default. To register a pairwise match, two sequences will need to be 95% identical (-S 95) over an area covering 90% of the length (-L .9) of each sequence (-b T). Using "-b F" instead of "-b T" would enforce the alignment length threshold on only one member of a sequence pair. The parameter "-S", used here to specify the percent identity, can also be used to specify, instead, a "score density." The latter is equivalent to the BLAST score divided by the alignment length. If "-S" is given as a number between 0 and 3, it is interpreted as a score density threshold; otherwise it is interpreted as a percent identity threshold.
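The way the three parameters combine can be sketched in Python as below. This is an illustration of the rules as described here, not BLASTClust's source code. Two points are assumptions: coverage is computed as alignment length divided by sequence length, and a value of exactly 3 is treated as a percent identity. All function and argument names are hypothetical.

```python
def passes_similarity(s, identity_pct, score, align_len):
    """Apply the -S threshold as either a score density or a percent identity."""
    if s < 3:
        # Assumption: values below 3 are score densities (BLAST score / alignment length).
        return score / align_len >= s
    return identity_pct >= s


def passes_coverage(align_len, len_a, len_b, min_cov, both):
    """Apply the -L threshold to both sequences (-b T) or to at least one (-b F)."""
    cov_a = align_len / len_a  # assumption: coverage = alignment length / sequence length
    cov_b = align_len / len_b
    if both:
        return cov_a >= min_cov and cov_b >= min_cov
    return cov_a >= min_cov or cov_b >= min_cov


def is_match(identity_pct, score, align_len, len_a, len_b, s, min_cov, both):
    return (passes_similarity(s, identity_pct, score, align_len)
            and passes_coverage(align_len, len_a, len_b, min_cov, both))


# Hypothetical alignment: 96% identical over 95 positions of two 100-base reads,
# checked against the equivalent of "-S 95 -L .9 -b T".
same_length = is_match(96, 180, 95, 100, 100, s=95, min_cov=0.9, both=True)
# The same alignment against a 200-base partner covers under half of that partner.
strict = is_match(96, 180, 95, 100, 200, s=95, min_cov=0.9, both=True)
relaxed = is_match(96, 180, 95, 100, 200, s=95, min_cov=0.9, both=False)
```

In this sketch the short read paired with a much longer sequence fails under the `-b T` behaviour but passes under `-b F`, which is the practical difference between the two settings.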

To create a stringent non-redundant protein sequence set, use the following command line:

blastclust -i infile -o outfile -p T -L 1 -b T -S 100

In this case, only sequences which are identical will be clustered together. The “blastclust.txt” file in the standalone BLAST package details the full range of BLASTClust parameters.
Creative Commons License

Posted by gwlee

2008/08/25 02:29