http://ccb.ucmerced.edu/cgi-bin/app/emboss/index/help/distmat
Function
Creates a distance matrix from multiple alignments
Description
distmat calculates the evolutionary distance between every pair of sequences in a multiple alignment. The sequences must be aligned before running this program, and the quality of the alignment is of paramount importance in obtaining meaningful information from the analysis. The distances are expressed as the number of substitutions per 100 bases or amino acids.
As sequences diverge, the probability of multiple substitutions at any one site in the alignment increases. The observed distance is then an underestimate of the true evolutionary distance between the sequences. Several methods are therefore available for correcting the observed substitution rate for the occurrence of multiple substitutions.
For nucleotides, the "-position" flag allows the user to choose which base positions to analyse in each codon, i.e. 123 (all bases), 12 (the first two bases), or 1, 2, or 3 (individual bases).
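As a rough illustration of what choosing codon positions means, the sketch below filters a nucleotide string down to the requested positions. The function name `select_codon_positions` is hypothetical, and the sketch assumes codons are counted from the first column of the sequence; distmat's own handling of the reading frame may differ.

```python
def select_codon_positions(seq: str, positions: str = "123") -> str:
    """Keep only the requested positions (1, 2 or 3) of every codon.

    Assumption: the first character of `seq` is codon position 1.
    """
    wanted = {int(ch) - 1 for ch in positions}  # "12" -> {0, 1}
    if not wanted <= {0, 1, 2}:
        raise ValueError("positions may only contain the digits 1, 2 and 3")
    return "".join(base for i, base in enumerate(seq) if i % 3 in wanted)


# Codons ATG GCC TAA, keeping the first two bases of each.
print(select_codon_positions("ATGGCCTAA", "12"))
```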
Uncorrected distances
This method makes no correction for multiple substitutions, so the score underestimates the distance between the sequences. The underestimate is less significant for highly similar sets of sequences.
S = m/(npos + gaps*gap_penalty) (1)
m - score of matches (1 for an exact match, a fraction for a partial match and 0 for no match)
npos - number of positions included in m
gaps - number of gaps in the sequences
gap_penalty - the score given to a gapped position
D = uncorrected distance = p-distance = 1-S (2)
The match score m includes all exact matches. For nucleotides, if the "-ambiguous" flag is used then partial matches are also included in the score. For example, a match of M (A or C) with A increments m by 0.5 (0.5*1.0). Gaps are not included in the calculation unless a non-zero value is given with "-gapweight". Note that end gaps and internal gaps are weighted by the same amount, so it is recommended that "-gapweight" be used together with "-sbegin" and "-send" to specify the start and end of the region over which the distance is calculated.
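Equations (1) and (2) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not distmat's actual code: the names `WEIGHTS`, `match_score` and `uncorrected_distance` are hypothetical, only a few ambiguity codes are listed, a partial match is scored as the sum of products of base weights (which reproduces the M versus A example above), and columns where both sequences have a gap are skipped.

```python
# Assumed base weights for a few IUPAC nucleotide codes (not exhaustive).
WEIGHTS = {
    "A": {"A": 1.0}, "C": {"C": 1.0}, "G": {"G": 1.0}, "T": {"T": 1.0},
    "M": {"A": 0.5, "C": 0.5},
    "R": {"A": 0.5, "G": 0.5},
    "Y": {"C": 0.5, "T": 0.5},
}


def match_score(x: str, y: str, ambiguous: bool) -> float:
    """Score one column: exact matches only, or weighted partial matches."""
    if not ambiguous:
        return 1.0 if x == y else 0.0
    wx, wy = WEIGHTS.get(x, {}), WEIGHTS.get(y, {})
    return sum(w * wy.get(base, 0.0) for base, w in wx.items())


def uncorrected_distance(seq1, seq2, gap_weight=0.0, ambiguous=False):
    """Return D = 1 - S, with S = m / (npos + gaps * gap_weight)."""
    m, npos, gaps = 0.0, 0, 0
    for x, y in zip(seq1.upper(), seq2.upper()):
        if x == "-" and y == "-":
            continue            # assumption: shared gap columns are skipped
        if x == "-" or y == "-":
            gaps += 1
            continue
        npos += 1
        m += match_score(x, y, ambiguous)
    denom = npos + gaps * gap_weight
    if denom == 0:
        raise ValueError("no scorable positions")
    return 1.0 - m / denom


print(uncorrected_distance("ACGTACGT", "ACGTACGA"))                  # 7 of 8 match
print(uncorrected_distance("ACGT-CGT", "ACGTACGA", gap_weight=1.0))  # gap counted
```

The function returns a fraction; multiplying by 100 gives the "per 100 positions" scale described in the Description.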
Multiple Substitution correction algorithms
Jukes-Cantor
This can be used for nucleotide and protein sequences.
distance = -b ln(1-D/b)
D - uncorrected distance
b - constant: b = 3/4 for nucleotides and 19/20 for proteins
Partial matches and gap positions can be taken into account in the calculation of D by setting the "-ambiguous" and "-gapweight" flags (see the "uncorrected distance" method).
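The Jukes-Cantor correction is a one-line formula once D is known. A minimal sketch, assuming D is supplied as a fraction between 0 and 1 (the function name `jukes_cantor` is hypothetical):

```python
import math


def jukes_cantor(D: float, protein: bool = False) -> float:
    """Apply distance = -b ln(1 - D/b) to an uncorrected distance D."""
    b = 19 / 20 if protein else 3 / 4
    if D >= b:
        # The logarithm is undefined once D reaches the saturation value b.
        raise ValueError("D is too large for the Jukes-Cantor correction")
    return -b * math.log(1 - D / b)


print(jukes_cantor(0.125))                # nucleotide sequences
print(jukes_cantor(0.125, protein=True))  # protein sequences
```

The corrected value is always at least as large as D, which reflects the underestimate described above.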
Reference:
"Phylogenetic Inference", Swofford, Olsen, Waddell, and Hillis, in Molecular Systematics, 2nd ed., Sinauer Ass., Inc., 1996, Ch. 11.
Tajima-Nei
This method is only for nucleotide sequences. It uses the same equation as Jukes-Cantor, but the b parameter is not constant. Also, only exact matches are considered in the calculation of the match score, and gap positions are ignored.
A = 1, T = 2, C = 3, G = 4
b = 0.5 * (1 - Sum(i=A,G)(fraction[i]^2) + D^2/h)
h = Sum(i=A,C) Sum(k=i+1,G) (0.5 * pair_frequency[i,k]^2/(fraction[i]*fraction[k]))
distance = -b ln(1-D/b)
D - uncorrected distance
pair_frequency[i,k] - frequency of the i and k base pair at sites in the alignment of the pair of sequences
fraction[i] - average content of the base i in both sequences
Reference:
F. Tajima and M. Nei, Mol. Biol. Evol. 1984, 1, 269.
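The Tajima-Nei quantities can be computed directly from a pair of aligned sequences. The sketch below follows the definitions above under some assumptions: the function name `tajima_nei` is hypothetical, only columns where both symbols are one of A, T, C, G are scored, and the double sum for h runs over each unordered pair of different bases once.

```python
import math
from itertools import combinations

BASES = "ATCG"


def tajima_nei(seq1: str, seq2: str) -> float:
    pairs = [(x, y) for x, y in zip(seq1.upper(), seq2.upper())
             if x in BASES and y in BASES]
    n = len(pairs)
    if n == 0:
        raise ValueError("no scorable positions")
    D = sum(x != y for x, y in pairs) / n
    if D == 0:
        return 0.0
    # fraction[i]: average content of base i over both sequences.
    fraction = {b: sum((x == b) + (y == b) for x, y in pairs) / (2 * n)
                for b in BASES}
    # h: sum over unordered base pairs {i, k} of their squared site frequency.
    h = 0.0
    for i, k in combinations(BASES, 2):
        freq = sum({x, y} == {i, k} for x, y in pairs) / n
        if freq:
            h += 0.5 * freq ** 2 / (fraction[i] * fraction[k])
    b = 0.5 * (1 - sum(f * f for f in fraction.values()) + D * D / h)
    if D >= b:
        raise ValueError("sequences too divergent for this correction")
    return -b * math.log(1 - D / b)


print(tajima_nei("ACGTACGTAC", "ACGTACGTAT"))
```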
Kimura Two-Parameter distance
This method is only for nucleotide sequences. It uses the principle that transition substitutions (purine-purine and pyrimidine-pyrimidine) are more likely than transversion substitutions (purine-pyrimidine). The purines are the bases A and G; the pyrimidines are the bases C, T and U. Gaps are ignored, and ambiguous symbols other than R (purine) and Y (pyrimidine) are ignored.
P = transitions/npos
Q = transversions/npos
npos - number of positions scored
distance = -0.5 ln[(1-2P-Q)*sqrt(1-2Q)]
Reference:
M. Kimura, J. Mol. Evol. 1980, 16, 111.
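A minimal sketch of the Kimura two-parameter calculation is shown below. The function name `kimura_2p` is hypothetical, and for brevity the sketch scores only A, C, G and T/U and skips every other symbol, including R and Y, which distmat itself does score.

```python
import math

# Unordered base pairs that count as transitions.
TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def kimura_2p(seq1: str, seq2: str) -> float:
    s1 = seq1.upper().replace("U", "T")
    s2 = seq2.upper().replace("U", "T")
    pairs = [(x, y) for x, y in zip(s1, s2) if x in "ACGT" and y in "ACGT"]
    npos = len(pairs)
    if npos == 0:
        raise ValueError("no scorable positions")
    transitions = sum(frozenset((x, y)) in TRANSITIONS for x, y in pairs)
    transversions = sum(x != y for x, y in pairs) - transitions
    P, Q = transitions / npos, transversions / npos
    inner = (1 - 2 * P - Q) * math.sqrt(1 - 2 * Q) if 2 * Q < 1 else 0.0
    if inner <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -0.5 * math.log(inner)


# One transition (A/G) and one transversion (C/A) in ten positions.
print(kimura_2p("ACGTACGTAC", "GCGTACGTAA"))
```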
Tamura
This method is only for nucleotide sequences. It uses transition and transversion rates and takes into account the deviation of GC content from the expected value of 50%. Gap and ambiguous positions are ignored.
P = transitions/npos
Q = transversions/npos
npos - number of positions scored
GC1 = GC fraction in sequence 1
GC2 = GC fraction in sequence 2
C = GC1 + GC2 - 2*GC1*GC2
distance = -C ln(1-P/C-Q) - 0.5(1-C) ln(1-2Q)
Reference:
K. Tamura, Mol. Biol. Evol. 1992, 9, 678.
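The Tamura formula can be sketched in the same way. The function name `tamura_distance` is hypothetical, and the sketch assumes the GC fractions are measured over the scored positions only; distmat may compute them over a different set of positions.

```python
import math

TRANSITIONS = {frozenset("AG"), frozenset("CT")}


def tamura_distance(seq1: str, seq2: str) -> float:
    s1 = seq1.upper().replace("U", "T")
    s2 = seq2.upper().replace("U", "T")
    pairs = [(x, y) for x, y in zip(s1, s2) if x in "ACGT" and y in "ACGT"]
    npos = len(pairs)
    if npos == 0:
        raise ValueError("no scorable positions")
    transitions = sum(frozenset((x, y)) in TRANSITIONS for x, y in pairs)
    transversions = sum(x != y for x, y in pairs) - transitions
    P, Q = transitions / npos, transversions / npos
    gc1 = sum(x in "GC" for x, _ in pairs) / npos  # GC fraction, sequence 1
    gc2 = sum(y in "GC" for _, y in pairs) / npos  # GC fraction, sequence 2
    C = gc1 + gc2 - 2 * gc1 * gc2
    if C == 0:
        raise ValueError("C is zero; the Tamura formula is undefined")
    arg1, arg2 = 1 - P / C - Q, 1 - 2 * Q
    if arg1 <= 0 or arg2 <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -C * math.log(arg1) - 0.5 * (1 - C) * math.log(arg2)


print(tamura_distance("ACGTACGTAC", "GCGTACGTAA"))
```

When both sequences have a GC fraction of 0.5, C is 0.5 and the formula reduces to the Kimura two-parameter distance, which is what the example pair demonstrates.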
Jin-Nei Gamma distance
This method applies to nucleotides only. It again uses transition and transversion rates. As with the Kimura two-parameter method, gaps and ambiguous symbols other than R and Y are not included in the score. The shape parameter "a" is the square of the inverse of the coefficient of variation of the average substitution:
L = average substitution = transition_rate + 2 * transversion_rate
a = (average L)^2/(variance of L)
P = transitions/npos
Q = transversions/npos
npos - number of positions scored
distance = 0.5 * a ((1-2P-Q)^(-1/a) + 0.5 (1-2Q)^(-1/a) -3/2)
It is suggested [Jin et al.] that, in general, the distance be calculated with an a-value of 1. However, the user can specify their own value using the "-parametera" option, or have it calculated for each pair of sequences using "-calculatea".
Reference:
L. Jin and M. Nei, Mol. Biol. Evol. 1990, 7, 82.
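Given P, Q and a shape parameter, the Jin-Nei gamma distance is a direct evaluation of the formula. In this sketch the function name `jin_nei_gamma` is hypothetical, and the estimation of "a" from the data (the "-calculatea" case) is left out because the text does not spell out how the mean and variance of L are obtained.

```python
def jin_nei_gamma(P: float, Q: float, a: float = 1.0) -> float:
    """distance = 0.5 * a * ((1-2P-Q)^(-1/a) + 0.5 * (1-2Q)^(-1/a) - 3/2)."""
    term1, term2 = 1 - 2 * P - Q, 1 - 2 * Q
    if a <= 0 or term1 <= 0 or term2 <= 0:
        raise ValueError("parameters outside the range of the formula")
    return 0.5 * a * (term1 ** (-1 / a) + 0.5 * term2 ** (-1 / a) - 1.5)


print(jin_nei_gamma(0.1, 0.1))         # suggested default a = 1
print(jin_nei_gamma(0.1, 0.1, a=1e6))  # very large a
```

For very large "a" (little rate variation among sites) the value approaches the Kimura two-parameter distance for the same P and Q.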
Kimura Protein distance
This method is used for proteins only. Gaps are ignored, and only exact matches and ambiguity codes contribute to the match score.
S = m/npos
m - exact match
npos - number of positions scored
D = 1-S
distance = -ln(1 - D - 0.2D^2)
Reference:
M. Kimura, The Neutral Theory of Molecular Evolution, Camb. Uni. Press, Camb., 1983.
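A minimal sketch of the Kimura protein distance follows. The function name `kimura_protein` is hypothetical, and the sketch counts exact matches only, so the contribution of ambiguity codes mentioned above is not modelled.

```python
import math


def kimura_protein(seq1: str, seq2: str) -> float:
    """distance = -ln(1 - D - 0.2 D^2), with gapped columns ignored."""
    pairs = [(x, y) for x, y in zip(seq1.upper(), seq2.upper())
             if x != "-" and y != "-"]
    npos = len(pairs)
    if npos == 0:
        raise ValueError("no scorable positions")
    S = sum(x == y for x, y in pairs) / npos  # exact matches only
    D = 1 - S
    inner = 1 - D - 0.2 * D * D
    if inner <= 0:
        raise ValueError("sequences too divergent for this correction")
    return -math.log(inner)


# One mismatch in ten scored positions; the gapped column is skipped.
print(kimura_protein("MKTAYIAKQR-", "MKTAYIAKQLA"))
```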
Usage
Here is a sample session with distmat:
% distmat pax.align
Command line arguments
Standard (Mandatory) qualifiers (* if not always prompted):

| Qualifier | Description | Allowed values | Default |
|---|---|---|---|
| [-sequence] (Parameter 1) | File containing a sequence alignment. | Readable set of sequences | Required |
| -nucmethod | Multiple substitution correction methods for nucleotides. | | 0 |
| -protmethod | Multiple substitution correction methods for proteins. | | 0 |
| [-outfile] (Parameter 2) | Output file name | Output file | <sequence>.distmat |

Additional (Optional) qualifiers:

| Qualifier | Description | Allowed values | Default |
|---|---|---|---|
| -ambiguous | Option to use the ambiguous codes in the calculation of the Jukes-Cantor method or if the sequences are proteins. | Boolean value Yes/No | No |
| -gapweight | Option to weight gaps in the uncorrected (nucleotide) and Jukes-Cantor distance methods. | Any numeric value | 0.0 |
| -position | Choose base positions to analyse in each codon, i.e. 123 (all bases), 12 (the first two bases), 1, 2, or 3 individual bases. | Any integer value | 123 |
| -calculatea | This will force the calculation of parameter 'a' in the Jin-Nei Gamma distance calculation; otherwise the default is 1.0 (see the -parametera option). | Boolean value Yes/No | No |
| -parametera | User-defined parameter 'a' to be used in the Jin-Nei Gamma distance calculation. The suggested value is 1.0 (Jin et al.) and this is the default. | Any numeric value | 1.0 |

Advanced (Unprompted) qualifiers: (none)
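The qualifiers listed above can be assembled into a command line from a script. The helper below only builds the argument list and does not run anything; `build_distmat_command` and the file names `dna.align` and `dna.distmat` are hypothetical, and the sketch covers the nucleotide case only.

```python
import shlex


def build_distmat_command(alignment, outfile, nucmethod=0,
                          gapweight=None, position=None):
    """Return an argument list for distmat using the qualifiers listed above."""
    cmd = ["distmat", "-sequence", alignment,
           "-nucmethod", str(nucmethod), "-outfile", outfile]
    if gapweight is not None:
        cmd += ["-gapweight", str(gapweight)]
    if position is not None:
        cmd += ["-position", str(position)]
    return cmd


cmd = build_distmat_command("dna.align", "dna.distmat", nucmethod=0, position=12)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run to execute it
```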
Input file format
It reads in a normal multiple sequence alignment file. The quality of the alignment is of paramount importance in obtaining meaningful information from this analysis.
References
See the following for details of the methods used:
- "Phylogenetic Inference", Swofford, Olsen, Waddell, and Hillis, in Molecular Systematics, 2nd ed., Sinauer Ass., Inc., 1996, Ch. 11.
- F. Tajima and M. Nei, Mol. Biol. Evol. 1984, 1, 269.
- M. Kimura, J. Mol. Evol. 1980, 16, 111.
- K. Tamura, Mol. Biol. Evol. 1992, 9, 678.
- L. Jin and M. Nei, Mol. Biol. Evol. 1990, 7, 82.
- M. Kimura, The Neutral Theory of Molecular Evolution, Camb. Uni. Press, Camb., 1983.
!!Caution!!
In the alignment file used as input, convert every dot (.) to a dash (-) first; otherwise distmat runs into problems.
Posted by gwlee