[EMBOSS] distmat manual


http://ccb.ucmerced.edu/cgi-bin/app/emboss/index/help/distmat


Function

Creates a distance matrix from multiple alignments

Description

distmat calculates the evolutionary distance between every pair of sequences in a multiple alignment. The sequences must be aligned before running this program, and the quality of the alignment is of paramount importance in obtaining meaningful information from this analysis. The program outputs a distance matrix for the set of sequences in the alignment, with distances expressed as the number of substitutions per 100 bases or amino acids.

As sequences diverge, the probability of multiple substitutions at any one site in the alignment increases. The observed distance will then be an underestimate of the true evolutionary distance between the sequences. For this reason, there are a number of methods for correcting the observed substitution rate for the occurrence of multiple substitutions.

For nucleotides, the "-position" flag allows the user to choose which base positions to analyse in each codon, i.e. 123 (all bases), 12 (the first two bases), or 1, 2, or 3 (individual bases).
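
The codon-position selection described above can be sketched as follows. This is only an illustration, not distmat's own code: the function name `select_codon_positions` is hypothetical, and it assumes the aligned sequence starts in frame at codon position 1.

```python
def select_codon_positions(seq: str, positions: str = "123") -> str:
    """Keep only the bases at the requested codon positions.

    `positions` is a string such as "123", "12" or "3", mirroring the
    values accepted by the -position flag. Assumes (hypothetically) that
    the first base of `seq` is codon position 1.
    """
    wanted = {int(p) - 1 for p in positions}  # zero-based offsets within a codon
    return "".join(base for i, base in enumerate(seq) if i % 3 in wanted)


print(select_codon_positions("ATGCCA", "12"))
```

Applying the same selection to every sequence in the alignment keeps the columns aligned, because the filter depends only on the column index.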

Uncorrected distances

This method does not make any corrections for multiple substitutions. Therefore, the score will be an underestimate of the distance between the sequences. The underestimate is less significant for highly similar sets of sequences.
S = m/(npos + gaps*gap_penalty) (1)

m - score of matches (1 for an exact match, a fraction for partial
matches and 0 for no match)
npos - number of positions included in m
gaps - number of gaps in the sequences
gap_penalty - the score given to a gapped position
D = uncorrected distance = p-distance = 1-S (2)

The match score includes all exact matches. For nucleotides, if the flag "-ambiguous" is used then partial matches are included in the score. For example, a match of M (A or C) with A will increment m by 0.5 (0.5*1.0). Gaps are not included in the calculation unless a non-zero value is given with "-gapweight". Note that end gaps and internal gaps are weighted by the same amount, so it is recommended that "-gapweight" be used together with "-sbegin" and "-send" to specify the start and end of the region over which the distance is calculated.
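
Equations (1) and (2) can be sketched in Python as below. This is a hedged illustration, not the distmat implementation. The names `match_score` and `p_distance` are hypothetical. Two assumptions are made that the text does not confirm: a partial match is scored as the number of shared bases divided by the product of the two code sizes (which reproduces the M versus A = 0.5 example), and a column counts as one gap when either sequence has "-" there.

```python
# IUPAC nucleotide codes and the bases each one can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def match_score(x: str, y: str, ambiguous: bool = False) -> float:
    """Score one aligned pair of symbols (1 = exact match, 0 = mismatch)."""
    if not ambiguous:
        return 1.0 if x == y else 0.0
    bx, by = IUPAC.get(x, ""), IUPAC.get(y, "")
    shared = set(bx) & set(by)
    if not shared:
        return 0.0
    # Assumed partial-match rule: M (A or C) against A gives 1 / (2 * 1) = 0.5.
    return len(shared) / (len(bx) * len(by))


def p_distance(seq1: str, seq2: str, ambiguous: bool = False,
               gap_weight: float = 0.0) -> float:
    """Uncorrected distance D = 1 - m / (npos + gaps * gap_weight)."""
    m, npos, gaps = 0.0, 0, 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x == "-" or y == "-":
            gaps += 1  # assumed: a gap in either sequence counts once
            continue
        npos += 1
        m += match_score(x, y, ambiguous)
    denom = npos + gaps * gap_weight
    if denom == 0:
        raise ValueError("no positions to score")
    return 1.0 - m / denom


print(p_distance("ACGT", "ACGA"))
```

The function returns a per-site fraction; multiply by 100 to express it on the per-100-positions scale used in the description.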

Multiple Substitution correction algorithms

Jukes-Cantor

This can be used for nucleotide and protein sequences.
distance = -b ln (1-D/b)

D - uncorrected distance
b - constant. b= 3/4 for nucleotides and 19/20 for proteins.

Partial matches and gap positions can be taken into account in the calculation of D by setting the "-ambiguous" and "-gapweight" flags (see the "Uncorrected distances" method).
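
As a worked illustration of the Jukes-Cantor formula, the sketch below applies the correction to an uncorrected distance D given as a per-site fraction. The function name `jukes_cantor` is hypothetical, and the error raised when D reaches b is a choice made here, not documented distmat behaviour.

```python
import math


def jukes_cantor(d: float, protein: bool = False) -> float:
    """Jukes-Cantor distance: -b * ln(1 - D/b).

    b = 3/4 for nucleotides and 19/20 for proteins.
    """
    b = 19 / 20 if protein else 3 / 4
    if d >= b:
        # The logarithm is undefined once D reaches the saturation value b.
        raise ValueError("uncorrected distance too large to correct")
    return -b * math.log(1 - d / b)


print(jukes_cantor(0.25))                 # nucleotides
print(jukes_cantor(0.25, protein=True))   # proteins
```

For D = 0.25 on nucleotides this evaluates 0.75 * ln(1.5), which is larger than D itself, as expected for a multiple-substitution correction.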

Reference:
"Phylogenetic Inference", Swofford, Olsen, Waddell, and Hillis, in Molecular Systematics, 2nd ed., Sinauer Ass., Inc., 1996, Ch. 11.

Tajima-Nei

This method is only for nucleotide sequences. It uses the same equation as Jukes-Cantor, but the b-parameter is not constant. Also, only exact matches are considered in the calculation of the match score, and gap positions are ignored.
A = 1, T = 2, C = 3, G = 4

b = 0.5 * (1 - Sum(i=A,G)(fraction[i]^2) + D^2/h)

h = Sum(i=A,C)Sum(k=i+1,G) (0.5 * pair_frequency[i,k]^2 / (fraction[i]*fraction[k]))

distance = -b ln(1 - D/b)

pair_frequency[i,k] - frequency of the i and k base pair at sites in
the alignment of the pair of sequences.
fraction[i] - average content of the base i in both sequences
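
The Tajima-Nei quantities can be sketched as follows. This is an illustrative reading of the formulas, not distmat's source: the function name `tajima_nei` is hypothetical, h is summed over the six unordered pairs of distinct bases, and columns containing anything other than A, T, C or G in either sequence are skipped.

```python
import math
from itertools import combinations

BASES = "ATCG"


def tajima_nei(seq1: str, seq2: str) -> float:
    """Tajima-Nei distance for two aligned nucleotide sequences."""
    # Keep only columns where both symbols are unambiguous bases.
    pairs = [(x, y) for x, y in zip(seq1.upper(), seq2.upper())
             if x in BASES and y in BASES]
    n = len(pairs)
    if n == 0:
        raise ValueError("no positions to score")
    d = sum(x != y for x, y in pairs) / n  # uncorrected distance D
    if d == 0:
        return 0.0
    # fraction[i]: average content of base i over both sequences.
    fraction = {b: sum((x == b) + (y == b) for x, y in pairs) / (2 * n)
                for b in BASES}
    # h: sum over unordered base pairs {i, k} of 0.5 * x_ik^2 / (f_i * f_k).
    h = 0.0
    for i, k in combinations(BASES, 2):
        x_ik = sum({x, y} == {i, k} for x, y in pairs) / n
        if x_ik > 0:
            h += 0.5 * x_ik ** 2 / (fraction[i] * fraction[k])
    b = 0.5 * (1 - sum(f ** 2 for f in fraction.values()) + d ** 2 / h)
    if d >= b:
        raise ValueError("uncorrected distance too large to correct")
    return -b * math.log(1 - d / b)


print(tajima_nei("AACCGGTT", "AACCGGTA"))
```

Because b depends on the base composition of the two sequences being compared, it is recomputed for every pair rather than fixed at 3/4.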

Reference:
F. Tajima and M. Nei, Mol. Biol. Evol. 1984, 1, 269.

Kimura Two-Parameter distance

This method is only for nucleotide sequences. It uses the principle that transition substitutions (purine-purine and pyrimidine-pyrimidine) are more likely than transversion substitutions (purine-pyrimidine). The purines are the bases A and G, and the pyrimidines are the bases C, T and U. Gaps are ignored, and ambiguous symbols other than R (purine) and Y (pyrimidine) are ignored.
P = transitions/npos
Q = transversions/npos

npos - number of positions scored

distance = -0.5 ln[ (1-2P-Q)*sqrt(1-2Q)]
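
A minimal sketch of the two-parameter calculation is given below. It is not the distmat implementation: the function name `kimura_2p` is hypothetical, and for simplicity it scores only columns where both symbols are A, C, G or T (U is treated as T), so the R and Y handling mentioned above is not modelled.

```python
import math

PURINES = set("AG")


def kimura_2p(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance from transition/transversion counts."""
    transitions = transversions = npos = 0
    for x, y in zip(seq1.upper().replace("U", "T"),
                    seq2.upper().replace("U", "T")):
        if x not in "ACGT" or y not in "ACGT":
            continue  # gaps and ambiguity codes are skipped in this sketch
        npos += 1
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1    # purine-purine or pyrimidine-pyrimidine
        else:
            transversions += 1  # purine-pyrimidine
    if npos == 0:
        raise ValueError("no positions to score")
    p = transitions / npos
    q = transversions / npos
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


print(kimura_2p("AAAAAAAAAA", "GAAAAAAAAA"))
```

In the example the single A to G change is a transition, so P = 0.1 and Q = 0, and the distance reduces to -0.5 ln(0.8).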

Reference:
M. Kimura, J. Mol. Evol. 1980, 16, 111.

Tamura

This method is only for nucleotide sequences. It uses transition and transversion rates and takes into account the deviation of GC content from the expected value of 50%. Gap and ambiguous positions are ignored.
P = transitions/npos
Q = transversions/npos

npos - number of positions scored

GC1 = GC fraction in sequence 1
GC2 = GC fraction in sequence 2
C = GC1 + GC2 - 2*GC1*GC2

distance = -C ln(1-P/C-Q) - 0.5(1-C) ln(1-2Q)
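
The Tamura formula can be checked numerically with the short sketch below, which takes P, Q and the two GC fractions as inputs. The function name `tamura` is hypothetical. When both GC fractions are 0.5, C is 0.5 and the expression reduces to the Kimura two-parameter distance, which makes a convenient sanity check.

```python
import math


def tamura(p: float, q: float, gc1: float, gc2: float) -> float:
    """Tamura distance from transition rate P, transversion rate Q and GC content."""
    c = gc1 + gc2 - 2 * gc1 * gc2
    if c == 0:
        raise ValueError("C is zero; the distance is undefined")
    return -c * math.log(1 - p / c - q) - 0.5 * (1 - c) * math.log(1 - 2 * q)


# With GC1 = GC2 = 0.5 the result matches the Kimura two-parameter formula.
print(tamura(0.1, 0.05, 0.5, 0.5))
print(-0.5 * math.log((1 - 2 * 0.1 - 0.05) * math.sqrt(1 - 2 * 0.05)))
```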

Reference:
K. Tamura, Mol. Biol. Evol. 1992, 9, 678.

Jin-Nei Gamma distance

This method applies to nucleotides only. It again uses transition and transversion rates. As with the Kimura two-parameter method, gaps and ambiguous symbols other than R and Y are not included in the score. The shape parameter, i.e. "a", is the square of the inverse of the coefficient of variation of the average substitution,
L = average substitution = transition_rate + 2 * transversion_rate
a = (average L)^2/(variance of L)

P = transitions/npos
Q = transversions/npos

npos - number of positions scored

distance = 0.5 * a ((1-2P-Q)^(-1/a) + 0.5 (1-2Q)^(-1/a) -3/2)

It is suggested [Jin et al.], in general, that the distance be calculated with an a-value of 1. However, the user can specify their own value using the "-parametera" option, or have it calculated for each pair of sequences using "-calculatea".
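
The effect of the shape parameter can be explored with the sketch below, which evaluates the Jin-Nei formula for given P and Q. The function name `jin_nei_gamma` is hypothetical, and the per-pair estimation of "a" performed by "-calculatea" is not reproduced here. As "a" grows large, the result approaches the Kimura two-parameter distance.

```python
import math


def jin_nei_gamma(p: float, q: float, a: float = 1.0) -> float:
    """Jin-Nei gamma distance for transition rate P, transversion rate Q, shape a."""
    return 0.5 * a * ((1 - 2 * p - q) ** (-1 / a)
                      + 0.5 * (1 - 2 * q) ** (-1 / a)
                      - 1.5)


print(jin_nei_gamma(0.1, 0.05))          # suggested default a = 1
print(jin_nei_gamma(0.1, 0.05, a=1e6))   # close to the Kimura two-parameter value
```

With a = 1 the distance is noticeably larger than the two-parameter value for the same P and Q, reflecting the stronger correction applied when rates vary among sites.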

Reference:
L. Jin and M. Nei, Mol. Biol. Evol. 1990, 7, 82.

Kimura Protein distance

This method is used for proteins only. Gaps are ignored and only exact matches and ambiguity codes contribute to the match score.
S = m/npos

m - exact match
npos - number of positions scored

D = 1-S
distance = -ln(1 - D - 0.2D^2)
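
The Kimura protein correction is a one-line formula, sketched below for an uncorrected distance D given as a per-site fraction. The function name `kimura_protein` is hypothetical, and the error for very divergent pairs is a choice made in this sketch rather than documented distmat behaviour.

```python
import math


def kimura_protein(d: float) -> float:
    """Kimura protein distance: -ln(1 - D - 0.2 * D^2)."""
    arg = 1 - d - 0.2 * d * d
    if arg <= 0:
        # The logarithm is undefined for very divergent sequence pairs.
        raise ValueError("uncorrected distance too large to correct")
    return -math.log(arg)


print(kimura_protein(0.1))
```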

Reference:
M. Kimura, The Neutral Theory of Molecular Evolution, Camb. Uni. Press, Camb., 1983.

Usage

Here is a sample session with distmat
% distmat pax.align 
Creates a distance matrix from multiple alignments
Multiple substitution correction methods for proteins
0 : Uncorrected
1 : Jukes-Cantor
2 : Kimura Protein
Method to use [0]: 2
Output file [pax.distmat]:


Command line arguments

Standard (Mandatory) qualifiers (* if not always prompted):
  [-sequence]          seqset     File containing a sequence alignment.
*  -nucmethod          menu       Multiple substitution correction methods for
                                  nucleotides.
*  -protmethod         menu       Multiple substitution correction methods for
                                  proteins.
  [-outfile]           outfile    Output file name

Additional (Optional) qualifiers (* if not always prompted):
*  -ambiguous          boolean    Option to use the ambiguous codes in the
                                  calculation of the Jukes-Cantor method or if
                                  the sequences are proteins.
*  -gapweight          float      Option to weight gaps in the uncorrected
                                  (nucleotide) and Jukes-Cantor distance
                                  methods.
*  -position           integer    Choose base positions to analyse in each
                                  codon i.e. 123 (all bases), 12 (the first
                                  two bases), 1, 2, or 3 individual bases.
*  -calculatea         boolean    This will force the calculation of parameter
                                  'a' in the Jin-Nei Gamma distance
                                  calculation, otherwise the default is 1.0
                                  (see -parametera option).
*  -parametera         float      User defined parameter 'a' to be used in the
                                  Jin-Nei Gamma distance calculation. The
                                  suggested value to be used is 1.0 (Jin et
                                  al.) and this is the default.

Advanced (Unprompted) qualifiers: (none)
Associated qualifiers:

"-sequence" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

"-outfile" associated qualifiers
   -odirectory2        string     Output directory

General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write standard output
   -filter             boolean    Read standard input, write standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report deaths


Standard (Mandatory) qualifiers

[-sequence] (Parameter 1)
    File containing a sequence alignment.
    Allowed values: Readable set of sequences
    Default: Required

-nucmethod
    Multiple substitution correction methods for nucleotides.
    Allowed values: 0 (Uncorrected), 1 (Jukes-Cantor), 2 (Kimura), 3 (Tamura), 4 (Tajima-Nei), 5 (Jin-Nei Gamma)
    Default: 0

-protmethod
    Multiple substitution correction methods for proteins.
    Allowed values: 0 (Uncorrected), 1 (Jukes-Cantor), 2 (Kimura Protein)
    Default: 0

[-outfile] (Parameter 2)
    Output file name
    Allowed values: Output file
    Default: <sequence>.distmat

Additional (Optional) qualifiers

-ambiguous
    Option to use the ambiguous codes in the calculation of the Jukes-Cantor method or if the sequences are proteins.
    Allowed values: Boolean value Yes/No
    Default: No

-gapweight
    Option to weight gaps in the uncorrected (nucleotide) and Jukes-Cantor distance methods.
    Allowed values: Any numeric value
    Default: 0.

-position
    Choose base positions to analyse in each codon i.e. 123 (all bases), 12 (the first two bases), 1, 2, or 3 individual bases.
    Allowed values: Any integer value
    Default: 123

-calculatea
    This will force the calculation of parameter 'a' in the Jin-Nei Gamma distance calculation, otherwise the default is 1.0 (see -parametera option).
    Allowed values: Boolean value Yes/No
    Default: No

-parametera
    User defined parameter 'a' to be used in the Jin-Nei Gamma distance calculation. The suggested value to be used is 1.0 (Jin et al.) and this is the default.
    Allowed values: Any numeric value
    Default: 1.0

Advanced (Unprompted) qualifiers

(none)

Input file format

It reads in a normal multiple sequence alignment file.

The quality of the alignment is of paramount importance in obtaining meaningful information from this analysis.


References

See the following for details of the methods used:
  1. "Phylogenetic Inference", Swofford, Olsen, Waddell, and Hillis, in Molecular Systematics, 2nd ed., Sinauer Ass., Inc., 1996, Ch. 11.
  2. F. Tajima and M. Nei, Mol. Biol. Evol. 1984, 1, 269.
  3. M. Kimura, J. Mol. Evol. 1980, 16, 111.
  4. K. Tamura, Mol. Biol. Evol. 1992, 9, 678.
  5. L. Jin and M. Nei, Mol. Biol. Evol. 1990, 7, 82.
  6. M. Kimura, The Neutral Theory of Molecular Evolution, Camb. Uni. Press, Camb., 1983.

!!Caution!!

In the alignment file used as input, convert every dot (.) to a dash (-) first; otherwise distmat runs into problems.
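
One way to do this conversion is sketched below. It assumes the alignment is in FASTA format, which the note does not state, so header lines starting with ">" are left untouched and only sequence lines are changed. The function name `dots_to_dashes_fasta` is hypothetical.

```python
def dots_to_dashes_fasta(text: str) -> str:
    """Replace '.' gap characters with '-' in the sequence lines of FASTA text."""
    converted = []
    for line in text.splitlines():
        if line.startswith(">"):
            converted.append(line)  # keep header lines exactly as they are
        else:
            converted.append(line.replace(".", "-"))
    return "\n".join(converted)


example = ">seq.1\nAC..GT\n>seq.2\nACTTGT"
print(dots_to_dashes_fasta(example))
```

For other alignment formats the same idea applies, but the lines that hold sequence data would need to be identified differently.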

Creative Commons License

Posted by gwlee

2008/08/25 02:26