Practicalities of analyzing genetic population structure

Assessing genetic structure using codominant, allelic markers
within and among populations (WAAP analysis), meaning of A_E
and tips on software-based analyses.


Download WebSoftware.doc list of web software resources

We have looked at the derivations for a number of population genetic parameters (variance-based and distance measures of population structure) and their strengths and weaknesses in the face of various complexities of natural populations (e.g., small and fluctuating population size, variation in the breeding sex ratio). We will now focus on the practicalities of assessing the genetic structure within and among populations -- what measures are essential for any sort of reasonably comprehensive assessment of genetic structure, what programs are available for computing those measures and how do we organize data for analysis?

Here are some of the essential components (adapted from a checklist developed by Jim Hamrick at U. GA):

I. Total variation over the entire set of populations:

P: Polymorphism (% loci polymorphic -- microsatellites most often have P = 1.0)

A: Alleles per locus; A_P (alleles/polymorphic locus for studies with P <1.0)

A_E = effective number of alleles

H: heterozygosity -- observed (H_O) and expected (H_E or D, gene diversity).

(Estimates of variation in mutation rates across loci)

II. Within population variation:

Mean P, A, A_E, H. The values in Part I are calculated with all the samples considered to constitute a single group.
The values here are instead calculated population by population, then averaged over the set of populations.
Differences among populations in the above. Does one or more populations have unusually high or low values for any of the above?
Deviations from Hardy-Weinberg expectations (per locus and population)
Assessment of linkage disequilibrium
Estimates of N_e, effective population size (4*N_e*m)
Relatedness or allele-sharing matrices
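The difference between the Part I (pooled) and Part II (population-by-population, then averaged) calculations can be illustrated with a minimal sketch for gene diversity at a single locus. The population names and allele sizes are made up for illustration, and the function applies no small-sample correction, so treat it as a sketch rather than a replacement for a dedicated program.

```python
from collections import Counter

def gene_diversity(alleles):
    """Expected heterozygosity, 1 - sum(p_i^2), from a list of allele copies.
    No small-sample correction is applied in this sketch."""
    n = len(alleles)
    return 1.0 - sum((count / n) ** 2 for count in Counter(alleles).values())

# Hypothetical allele copies (bp sizes) at one locus in two populations.
pops = {
    "PopA": [120, 120, 120, 124],
    "PopB": [124, 128, 128, 128],
}

# Part II style: compute within each population, then average.
h_within = sum(gene_diversity(a) for a in pops.values()) / len(pops)

# Part I style: pool all samples into a single group first.
pooled = [allele for alleles in pops.values() for allele in alleles]
h_total = gene_diversity(pooled)

print(f"mean within-population gene diversity: {h_within:.3f}")
print(f"gene diversity of the pooled sample:   {h_total:.3f}")
```

When populations differ in allele frequencies, the pooled value exceeds the within-population mean; that gap is what the among-population measures in Part III quantify.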

III. Among population variation:

F_ST, G_ST, R_ST -- variance measures. Hierarchical, if appropriate.

Major differences in allele frequencies among populations

Patterns of variation: clinal, ecotypic, latitudinal etc.

Assignment tests (how well do individuals match the population in which they were sampled?)

Genetic distances (Cavalli-Sforza, Nei's 1978 et al.)

Correlation between genetic distance and geographic distance (Mantel tests).
Essentially we are testing for an effect of isolation by distance (IBD effect). Are populations that are further apart geographically more genetically different?

Estimates of gene flow, effective population size (4*N_e*m)

Tree-building, phylogenetic approaches

Assessment of whether partitions exist in the data (Bayesian approaches, tree-building analyses)
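The Mantel test mentioned in the list above can be sketched as a simple permutation test: correlate the off-diagonal elements of the genetic and geographic distance matrices, then shuffle the population labels of one matrix many times to see how often a correlation at least as large arises by chance. The sampling positions and the scaling factor below are hypothetical, and the function names (`pearson`, `lower_triangle`, `mantel`) are illustrative rather than taken from any of the packages named in these notes.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def lower_triangle(matrix, order):
    """Below-diagonal elements after relabeling rows and columns by `order`."""
    return [matrix[order[i]][order[j]]
            for i in range(len(order)) for j in range(i)]

def mantel(gen, geo, permutations=999, seed=1):
    """One-tailed Mantel test; returns (observed r, permutation p-value)."""
    identity = list(range(len(gen)))
    geo_vec = lower_triangle(geo, identity)
    r_obs = pearson(lower_triangle(gen, identity), geo_vec)
    rng = random.Random(seed)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)
        if pearson(lower_triangle(gen, order), geo_vec) >= r_obs - 1e-12:
            hits += 1
    return r_obs, hits / (permutations + 1)

# Hypothetical example: five populations along a transect (positions in km),
# with genetic distance made exactly proportional to geographic distance.
positions = [0, 10, 25, 45, 70]
geo = [[abs(a - b) for b in positions] for a in positions]
gen = [[0.002 * d for d in row] for row in geo]

r_obs, p_value = mantel(gen, geo)
print(f"r = {r_obs:.3f}, p = {p_value:.3f}")
```

Rows and columns are permuted together because the pairwise distances are not independent observations, which is why an ordinary correlation test is not appropriate here.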

A note on the calculation and uses of A_E (effective number of alleles)

A_E is the effective number of alleles (at the level of the OTUs we are examining). Verbally, this measure is the number of equally frequent alleles it would take to achieve a given level of gene diversity. That is, it allows us to compare populations where the number and distributions of alleles differ drastically. The formula is:

$$A_E = \frac{1}{r} \sum_{j=1}^{r} \frac{1}{1 - D_j} \qquad \text{(Eqn 1)}$$

where D_j is the gene diversity of the j^th of r loci. Note that we calculate the OTU-level A_E by averaging over the A_E calculated locus-by-locus rather than by calculating a mean gene diversity and then calculating A_E from that. The graph below shows why: A_E is a nonlinear function of the gene diversity (H_exp), which brings into play Jensen’s inequality [the expectation of a function ≠ the function of the expectations for nonlinear curves; see Ruel and Ayres (1999)]. Here, because the curve is concave up, the A_E we compute will be greater than if we calculated it from the overall gene diversity.

Fig. 1. Effective number of alleles, A_E, as a function of the gene diversity (D or H_exp). The nonlinear relationship brings into play Jensen’s inequality. Note that most of the "action" happens for D in the range 0.5 to 0.9 (A_E goes from 2 to 10).
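A small numerical sketch makes the Jensen's inequality point concrete. The three per-locus gene diversities are made-up values; the code compares the mean of the per-locus A_E values (the approach described above) with the A_E computed from the mean gene diversity.

```python
def effective_alleles(d):
    """A_E = 1 / (1 - D) for one locus with gene diversity D."""
    return 1.0 / (1.0 - d)

# Hypothetical gene diversities at r = 3 loci.
gene_diversity = [0.50, 0.70, 0.90]
r = len(gene_diversity)

# Locus-by-locus A_E, then averaged.
mean_of_ae = sum(effective_alleles(d) for d in gene_diversity) / r

# A_E computed once from the mean gene diversity.
ae_of_mean = effective_alleles(sum(gene_diversity) / r)

print(f"mean of per-locus A_E:     {mean_of_ae:.2f}")
print(f"A_E of mean gene diversity: {ae_of_mean:.2f}")
```

With these values the locus-by-locus average is about 5.11, while the A_E of the mean gene diversity is about 3.33. The gap comes almost entirely from the one locus with D = 0.9, where the curve is steepest.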

The meaning of A_E. Say we have two populations (or species, or whatever our OTU is) with the same number of total alleles, but with very different distributions of allele frequencies. We would like to be able to assess the effective number of alleles as a corollary to the expected heterozygosity. Remember that, for any given number of alleles, the expected heterozygosity (gene diversity) is highest when all the allele frequencies are equal (look at Fig. 5.1 in the web notes). Simply reverse the logic. When the heterozygosity is high (the peak of the curve in Fig. 5.1) we will have the highest effective number of alleles. For a heterozygosity of 0.85 we will have, effectively, 6.7 alleles [the formula is A_E = 1/(1 - H_exp)]. If a locus has 8 total alleles (meaning a maximum possible H_exp of 0.875), but the H_exp is only 0.6, the effective number of alleles will be only 2.5. This tells us that we have a set of alleles with very different frequencies. Alleles with frequencies away from the even “average” contribute very little to the effective number of alleles. When will the effective number of alleles be the same as the actual number of alleles? At the maximum gene diversity (peak of the curve). When will it be at a minimum (near 1)? When one allele (the only real contributor to the effective allele number) dominates the allele frequencies and all the others are very rare. Imagine that one OTU has 10 total alleles, another just 4; they could have the same effective number of alleles, if the allele frequencies are very unbalanced in the first case and much more balanced in the second case. Because of the reciprocal nature of the formula, if the OTUs have the same A_E, they will have the same H_exp. That is, if A_E1 = A_E2, then H_exp1 = H_exp2.
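The 10-allele versus 4-allele comparison above can be checked with a short sketch. The allele counts are invented so that the arithmetic works out exactly. The code uses the fact that 1/(1 - H_exp) equals 1 divided by the sum of squared allele frequencies.

```python
def gene_diversity(freqs):
    """H_exp = 1 - sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)

def effective_alleles(freqs):
    """A_E = 1 / (1 - H_exp), i.e. 1 / sum of squared frequencies."""
    return 1.0 / sum(p * p for p in freqs)

# Hypothetical OTU 1: 10 alleles, very unbalanced (counts out of 40 gene copies).
counts_otu1 = [18, 6, 4, 3, 2, 2, 2, 1, 1, 1]
freqs_otu1 = [c / sum(counts_otu1) for c in counts_otu1]

# Hypothetical OTU 2: 4 alleles, perfectly balanced.
freqs_otu2 = [0.25, 0.25, 0.25, 0.25]

# A locus with 8 equally frequent alleles, for the maximum-diversity case.
freqs_even8 = [1 / 8] * 8

ae1, ae2 = effective_alleles(freqs_otu1), effective_alleles(freqs_otu2)
print(f"OTU 1: {len(freqs_otu1)} alleles, A_E = {ae1:.2f}, H_exp = {gene_diversity(freqs_otu1):.3f}")
print(f"OTU 2: {len(freqs_otu2)} alleles, A_E = {ae2:.2f}, H_exp = {gene_diversity(freqs_otu2):.3f}")
print(f"8 even alleles: A_E = {effective_alleles(freqs_even8):.2f}, H_exp = {gene_diversity(freqs_even8):.3f}")
```

Both OTUs come out at A_E = 4 and H_exp = 0.75, even though one has two and a half times as many alleles as the other.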

Source: Weir (1990) pp. 124-125

Jensen’s inequality is described in Ruel, J.J., and M.P. Ayres. 1999. Jensen’s inequality predicts effects of environmental variation. TREE 14: 361-366. For nonlinear curves, the mean of the function values is not the same as the function of the mean, which means that variance affects the expected outcome or return. This has implications for mean-variance tradeoffs, optimal foraging, and risk-prone vs. risk-averse strategies. (Ruel and Ayres overlook a fairly extensive literature on the subject, including classics by Gillespie 1977 and MacArthur 1967.)

Some useful software and tips on running it

1) MS Tools is an Excel-based toolkit (works only on those clunky Windows machines) http://acer.gen.tcd.ie/~sdepark/ms-toolkit/
It is most useful in its role of creating input files for other programs (such as FSTAT, GenePop and Arlequin and thereby Genetix, Identix et al.). I also use its allele-sharing matrix as a basis for building trees in PHYLIP, with individuals as the OTUs.
The updated version allows you to:
work with haploid data as well as diploid data
access help using the new help file
choose which loci and populations in your data set to work with
calculate the allele-sharing index of Chakraborty and Jin (1993).

I begin with a worksheet that has the data in whatever format I find most useful for genotypes, individual IDs, population labels, etc. I then create a "starter" two column-format MS Tools worksheet.

          Locus1Name                    Locus2Name                    Locus3...
PopA01    Allele1 (bp)   Allele2 (bp)   Allele1 (bp)   Allele2 (bp)   ...
PopA02    Allele1 (bp)   ...
...
PopB01    ...

First line has Locus names in 2^nd, 4^th, 6^th etc. columns
Second line has Pop. name and individual's number in 1^st column (you may want to paste this concatenated code into your "original" sheet so that you can easily check correspondence between the ID label used for MS Tools and whatever you actually use).
The next columns are the bp size of the first allele of the first locus, then the 2^nd allele of the first locus, then the two alleles of the 2^nd locus, etc.
Once you have this set up, go to "Tools" menu bar and scroll down to "Microsatellites". Note that you first need to select the kind of data in the top dialog box. (For the example above, we would have diploid, two-column format).
You can then check for errors, use the Help file, create other data input formats etc. If you don't get a dialog box for output choices, it means that Toolkit isn't recognizing the data format.
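If you want to check the two-column layout by hand, the following minimal sketch tallies allele frequencies per population and locus from rows laid out as described above. It is not part of MS Tools. The IDs, locus names and allele sizes are hypothetical, the population code is assumed to be the first four characters of each ID, and missing data are not handled.

```python
from collections import Counter, defaultdict

# Hypothetical two-column data: ID (population code + individual number),
# followed by two allele sizes (bp) for each locus in turn.
loci = ["Locus1", "Locus2"]
rows = [
    ("PopA01", 120, 124, 201, 201),
    ("PopA02", 120, 120, 201, 205),
    ("PopB01", 124, 128, 205, 205),
    ("PopB02", 128, 128, 201, 205),
]

def allele_frequencies(rows, loci, pop_code_length=4):
    """Return {(population, locus): {allele: frequency}} from two-column rows."""
    counts = defaultdict(Counter)
    for row in rows:
        pop = row[0][:pop_code_length]   # assumed ID convention, e.g. "PopA01"
        alleles = row[1:]
        for k, locus in enumerate(loci):
            # Columns 2k and 2k+1 hold the two alleles of locus k.
            counts[(pop, locus)].update(alleles[2 * k: 2 * k + 2])
    freqs = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        freqs[key] = {a: n / total for a, n in sorted(counter.items())}
    return freqs

freqs = allele_frequencies(rows, loci)
for (pop, locus), table in sorted(freqs.items()):
    print(pop, locus, table)
```

Frequencies computed this way are the kind of gene-frequency input that GenDist expects.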

2) PHYLIP. http://evolution.genetics.washington.edu/
At least two essential subprograms.
Neighbor (which creates NJ and UPGMA trees from distance matrix inputs)
Go to details on running Neighbor
GenDist (which creates Cavalli-Sforza, Nei's 1972 and Reynolds' distance matrices from gene frequency inputs).
Go to details on running GenDist
I use GenDist primarily for the Cavalli-Sforza distances. Input and interface are both slightly clunky, but they work.
SeqBoot allows you to bootstrap the gene frequencies. That then allows you to create a bootstrapped input file and eventually a tree with branch support (based on percentage bootstrap support).

Hints:
a) Remember that pop. names must be padded with blanks to fill the 10-character name field.
b) For entering gene freqs (a long string of numbers), it is definitely best to set it up in a labeled Excel worksheet (using borders around sets of alleles per locus and labels above, etc.) and then save the bare-bones data parts as a text-only file. Edit the file a little more in Word, if necessary.
c) Toggle among choices by typing in the appropriate characters. (Is your input matrix square or lower-diagonal?).
d) When in doubt, try reading the documentation!

3) FSTAT, GenePop, GDA, TFPGA, Genetix, Arlequin, Identix, Structure, Partition et al. Lots of choices for software that will calculate a wide array of pop. gen. measures. See my "Web genetic software" sheet for a reasonably current list of descriptions and URLs.

4) TreeView. http://taxonomy.zoology.gla.ac.uk/rod/treeview.html
Nice little utility for producing output graphics of trees. You may have a few problems with transitions to graphics output (such as jpegs) but perseverance furthers. (I have even resorted to tracing over with a drawing program to get proportions and text the way I want them).

Steps for computing genetic distances using PHYLIP's GenDist subroutine:

We will start by seeing how to build the input file for computing genetic distance measures from gene frequency data. The software will be the GenDist subroutine of J. Felsenstein's PHYLIP package. The outputs will be distance matrices between each of the 8, 5 or full 13 populations and each of the others. The three possible distance measures are:

Cavalli-Sforza chord distance. [Cavalli-Sforza, L.L. and A.W.F. Edwards. 1967. Phylogenetic analysis: models and estimation procedures. Am. J. Hum. Gen. 19: 233-257].

Nei's genetic distance. [Nei, M. 1972. Genetic distance between populations. Am. Nat. 106: 283-292]

Reynolds' distance. [Reynolds, J., B.S. Weir, and C.C. Cockerham. 1983. Estimation of the coancestry coefficient: Basis for a short-term genetic distance. Genetics 105: 767-779].

The assumptions underlying these models differ. For example, Reynolds' distance assumes that drift alone is responsible for the differentiation among populations. Takezaki and Nei found that Cavalli-Sforza chord distance may perform well with microsatellite data, so that will be our first try [Takezaki, N., and M. Nei. 1996. Genetic distances and reconstruction of phylogenetic trees from microsatellite DNA. Genetics 144: 389-399]. We will compute only the Cavalli-Sforza distance.

i) Copy then rename the formatted file as "infile", which you will put in the same folder (directory) as the GenDist application icon (GenDist.exe file). "Infile" should contain the following information (the text below is taken from the file GenDist.Doc, which is in the documentation folder in the folder for PHYLIP):

The input to this program is standard and is as described in the Gene Frequencies and Continuous Characters Programs documentation files. It consists of the number of populations (or species), the number of loci, and after that a line containing the numbers of alleles at each of the loci. Then the gene frequencies follow in standard format. {If I don't provide the file you could adapt an Excel genotype file to this format}.
So, for example, we would take out the "Locus" and "Allele" column headers from the Excel file, so that the file contains only the information listed in the paragraph above.
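As a rough sketch of the layout just described, the function below assembles the text of a gene-frequency "infile" from a dictionary of frequencies. The function name `gendist_infile`, the population names and the frequencies are all hypothetical. The layout follows my reading of the description quoted above, with every allele listed at each locus (so the "A" option must be toggled), and should be checked against the PHYLIP documentation before use.

```python
def gendist_infile(freqs_by_pop, alleles_per_locus):
    """Build infile text: counts line, alleles-per-locus line, then one
    line per population (10-character name field followed by frequencies)."""
    lines = [
        f"{len(freqs_by_pop):5d}{len(alleles_per_locus):5d}",
        " ".join(str(n) for n in alleles_per_locus),
    ]
    expected = sum(alleles_per_locus)
    for name, freqs in freqs_by_pop.items():
        if len(freqs) != expected:
            raise ValueError(f"{name}: expected {expected} frequencies, got {len(freqs)}")
        # Names are truncated or blank-padded to fill the 10-character field.
        lines.append(name[:10].ljust(10) + " " + " ".join(f"{f:.4f}" for f in freqs))
    return "\n".join(lines) + "\n"

# Hypothetical data: 2 populations, 2 loci with 3 and 2 alleles respectively.
alleles_per_locus = [3, 2]
freqs_by_pop = {
    "PopA": [0.75, 0.25, 0.00, 0.75, 0.25],
    "PopB": [0.00, 0.25, 0.75, 0.25, 0.75],
}

infile_text = gendist_infile(freqs_by_pop, alleles_per_locus)
print(infile_text)
# To use it, save the text as a plain-text file named "infile", e.g.:
# with open("infile", "w") as handle:
#     handle.write(infile_text)
```

Building the file programmatically avoids the hand-editing step and guarantees that the population names fill the name field consistently.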

ii) Double-click the GenDist application icon (ask what folder it is in, if necessary). You will see the following screen:

Genetic Distance Matrix program, version 3.5p

Settings for this run:
A Input file contains all alleles at each locus? One omitted at each locus
N Use Nei genetic distance? Yes
C Use Cavalli-Sforza chord measure? No
R Use Reynolds genetic distance? No
L Form of distance matrix? Square
M Analyze multiple data sets? No
O Terminal type (IBM PC, VT52, ANSI)? ANSI {This is irrelevant}
1 Print indications of progress of run? Yes

Are these settings correct? (type Y or the letter for one to change)
The A option is described in the Gene Frequencies and Continuous Characters Programs documentation file. As with CONTML, it denotes whether all alleles are represented in the gene frequency input, or whether one allele frequency has been left out per locus.
C, N, and R denote whether to calculate the Cavalli-Sforza, Nei, or Reynolds et al. genetic distances respectively. The Nei distance is the default, and it will be computed if neither C nor R is explicitly invoked.
The L option denotes whether the distance matrix is to be written out in Lower triangular, Upper or Square form.
The M option is the usual Multiple Data Sets option, useful for doing bootstrap analyses with the distance matrix programs.

iii) Once you have the settings correct (e.g., toggle the "A" choice because our data will NOT have a missing allele frequency, hit the "C" toggle to get Cavalli-Sforza chord distance), then type "Y" and carriage return.

iv) You will generate an output file named "outfile", which will be a text-only file that can be opened by a word-processing program.

The output file's first line is simply the number of species (or populations). Each species (or population) starts a new line, with its name printed out first, and then the genetic distances (number of distances per line depends on which of the matrix output options you chose). This output is in the standard format used as input by the distance matrix programs. That is, the output, in its default form, is ready to be used in the distance matrix programs.
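If you want to pull the distances into another program, a square-form outfile of the kind just described can be read with a short parser. This is a minimal sketch, not a PHYLIP utility: it assumes the square matrix option, a 10-character name field at the start of each population's first line, and that long rows may wrap onto continuation lines. The sample text and its distances are invented.

```python
def read_square_matrix(text):
    """Parse a square distance matrix: count on line 1, then for each
    population a 10-character name followed by its row of distances."""
    lines = text.splitlines()
    n = int(lines[0].split()[0])
    names, matrix = [], []
    row = None
    for line in lines[1:]:
        if not line.strip():
            continue
        if row is None or len(row) == n:
            # Previous row is complete, so this line starts a new population.
            names.append(line[:10].strip())
            row = [float(x) for x in line[10:].split()]
            matrix.append(row)
        else:
            # Continuation line: keep filling the current row.
            row.extend(float(x) for x in line.split())
    return names, matrix

# Hypothetical outfile contents, with each row wrapped after two values.
sample = (
    "    3\n"
    "PopA        0.0000  0.0412\n"
    "  0.0977\n"
    "PopB        0.0412  0.0000\n"
    "  0.0655\n"
    "PopC        0.0977  0.0655\n"
    "  0.0000\n"
)

names, matrix = read_square_matrix(sample)
print(names)
for name, row in zip(names, matrix):
    print(name, row)
```

A quick symmetry check (matrix[i][j] equal to matrix[j][i]) is a cheap way to confirm that the file was read correctly.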
Open the file with Word, add any comments you think appropriate and save it under a different name in a folder of your outputs. Then rename the original as "infile" for use with the next step.

v) ALSO... the outfile (distance matrix) can be used directly as "infile" for Neighbor, one of PHYLIP's tree-building routines.
[See just below for "Steps for computing NJ and UPGMA trees from genetic distance matrices"].

vi) You can use the subprogram SeqBoot with the original gene frequency infile to generate 1000 bootstrapped gene frequency data sets. Then do all the normal steps for Cavalli-Sforza distance matrices and neighbor-joining/UPGMA trees, remembering to toggle M for multiple data sets.

Steps for computing NJ and UPGMA trees from genetic distance matrices

i) Generate or obtain a text-only distance matrix file [see "Steps for computing genetic distances using PHYLIP's GenDist subroutine" above, or use some other method such as obtaining a published distance matrix]. If you did the above, you can begin with a text-only file that is the output from PHYLIP's GenDist computation. Default (easiest way) is to have the input file distance matrix labeled "infile". Proceed as follows:

ii) Double-click the Neighbor application icon (ask where it is, if necessary).

Here's some explanation of the options:

OPTIONS

Here are some of the options available in all three programs (Kitsch, Fitch and Neighbor). They are selected using the menu of options.

- indicates that negative segment lengths are to be allowed in the tree (default is to require that all branch lengths be nonnegative). This option is not available in NEIGHBOR.

L indicates that the distance matrix is input in Lower-triangular form (the lower-left half of the distance matrix only, without the zero diagonal elements).
R indicates that the distance matrix is input in uppeR-triangular form (the upper-right half of the distance matrix only, without the zero diagonal elements). {No for both of these means a square matrix}

M is the usual Multiple data sets option, available in all of these programs. It allows us (when the output tree file is analyzed in CONSENSE) to do a bootstrap (or delete-half-jackknife) analysis with the distance matrix programs. {Toggle this when going the bootstrap route.}

The numerical options are the usual ones and should be clear from the menu.

iii) Once you have set the options to your satisfaction, type "Y" and carriage return and PHYLIP will produce an "outfile" and (if you so specified) a "treefile."

iv) Open your "outfile" with a word processor and save under another name somewhere else. You can then use the topology and branch lengths to construct a more polished-looking tree in some other program (e.g., PAUP, MacClade, plus a drawing/painting program, as described below). Even better, use TreeView to produce a nice tree from your "treefile".

Some introductory references and primers:

Brookfield, J.F.Y. 1996. Population genetics. Curr. Biol. 6: 354-357.

Gillespie, J. H. 1998. Population Genetics: A Concise Guide. The Johns Hopkins University Press, Baltimore, Md.

Hall, B.G. 2001. Phylogenetic Trees Made Easy: A How-to Manual for Molecular Biologists. Sinauer, Sunderland, MA.

Hartl, D.L. 1999. A Primer of Population Genetics (3^rd ed.). Sinauer Associates, Sunderland, MA

A few references on choices in analysis and comparisons among different measures:

Amos, W. 1999. Two problems with the measurement of genetic diversity and genetic distance. Pp. 75-100 In Genetics and the Extinction of Species (L.F. Landweber, and A.P. Dobson, eds.). Princeton Univ. Press, Princeton.

Gaggiotti, O.E., O. Lange, K. Rassmann, and C. Gliddon. 1999. A comparison of two indirect methods for estimating average levels of gene flow using microsatellite data. Mol. Ecol. 8: 1513-1519.

Kalinowski, S.T. 2002. Evolutionary and statistical properties of three genetic distances. Mol. Ecol. 11: 1263

Luikart, G., and P.R. England. 1999. Statistical analysis of microsatellite data. TREE 14: 253-256.

Neigel, J.E. 2002. Is F_ST obsolete? Conservation Genetics 3: 167-173. (critique of Whitlock and McCauley, 1999).

Paetkau, D., L.P. Waits, P.L. Clarkson, L. Craighead, and C. Strobeck. 1997. An empirical evaluation of genetic distance statistics using microsatellite data from bear (Ursidae) populations. Genetics 147:1943-1957.

Ruzzante, D.E. 1998. A comparison of several measures of genetic distance and population structure with microsatellite data - bias and sampling variance. Can. J. Fish. Aquat. Sci. 55: 1-14.

Takezaki, N., and M. Nei. 1996. Genetic distances and reconstruction of phylogenetic trees from microsatellite DNA. Genetics 144: 389-399.

Tomiuk, J., B. Guldbrandtsen, and V. Loeschcke. 1998. Population differentiation through mutation and drift: a comparison of genetic identity measures. Genetica 102-103: 545-558.

Whitlock, M.C., and D.E. McCauley. 1999. Indirect measures of gene flow and migration: F_ST not equal to 1/(4Nm + 1). Heredity 82: 117-25. (see critique by Neigel, 2002, of the high Nm = 50 used in their simulation).
