INFERENCE OF POPULATION STRUCTURE USING ALLELIC DATA
In this class we are going to use some of the software packages that are used both in classical population genetics and in more modern approaches: GENETIX, GENEPOP, FSTAT and STRUCTURE.
1. Prior to any analyses, open the file phylogenetix.xls (located in the folder P1>Files) in Excel and take a moment to study this table to understand the type of data we are going to use:
- The first line gives the names of the markers used;
- Each line after the first corresponds to an individual;
- The first column indicates the name of the population of origin. As additional information, the populations are ordered from North to South:
* Salas – Valongo are located north of the Douro river;
* Montemuro – Várzea are located between the Mondego and the Douro;
* Lfiscal – Muradal are located south of the Mondego.
- The second column indicates a code that is specific to each individual;
- Columns C to I represent the diploid genotype of each individual at each locus. E.g. “147147” indicates that the individual in question is homozygous for allele “147” at this particular locus.
- The set of genotypes of one individual across the different loci is called a “multilocus genotype”.
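The genotype coding described above can be sketched in a few lines of Python. This is only an illustration: the function name split_genotype and the example row (population, individual code, genotypes) are hypothetical and are not taken from phylogenetix.xls.

```python
def split_genotype(code, digits=3):
    """Split a diploid genotype string such as '147147' into its two alleles."""
    if len(code) != 2 * digits:
        raise ValueError(f"expected {2 * digits} characters, got {code!r}")
    return code[:digits], code[digits:]


# Hypothetical row: population, individual code, then one genotype per locus.
row = ["Salas", "S01", "147147", "152156", "120124"]
population, individual = row[0], row[1]

# The multilocus genotype is the list of per-locus allele pairs.
multilocus_genotype = [split_genotype(g) for g in row[2:]]

for alleles in multilocus_genotype:
    status = "homozygous" if alleles[0] == alleles[1] else "heterozygous"
    print(alleles, status)
```

Each six-character cell is read as two three-digit alleles, which is the same coding we will later declare to GENETIX (alleles coded by 3 digits).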
Now go to File>Save as and save this file as “Text (Tab delimited) (*.txt)”. Keep the same base name: phylogenetix.txt.
2. We are now going to use GENETIX to compute some statistics regarding our data.
2.1 Open GENETIX.
2.2 Go to Fichier>Importer. Choose the file phylogenetix.txt (which should also be located in the folder P1>Files) and choose the option Texte avec séparateur. Click Open. Now indicate the format of the file: in Séparateur, choose Tabulation; indicate that the alleles are coded by 3 digits (3 chiffres). Make sure all three boxes below these menus are checked (in this case, all of these options regarding the first line and the first two columns are true). Click OK. If all went well you are now ready to start working with GENETIX.
2.3 We will start by computing allele frequencies and diversity measures. Go to Variabilité, choose the only option (Fis, H, P et fréquences) and click OK. If all went well, you should now have the file phylogenetix.res open. We will now look at the results in this file in more detail together.
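To make the quantities in phylogenetix.res more concrete, here is a minimal sketch of how allele frequencies, observed heterozygosity (Ho), expected heterozygosity (He) and Fis can be computed for one locus in one population. The genotypes are invented for illustration, the function name diversity is hypothetical, and the He used here is the simple uncorrected estimator; GENETIX may additionally apply sample-size corrections, so its numbers need not match this sketch exactly.

```python
from collections import Counter


def diversity(genotypes):
    """Return allele frequencies, Ho, He and Fis for one locus in one sample.

    genotypes: list of (allele_a, allele_b) tuples, one per individual.
    """
    n = len(genotypes)
    counts = Counter(allele for g in genotypes for allele in g)
    freqs = {allele: c / (2 * n) for allele, c in counts.items()}
    h_obs = sum(a != b for a, b in genotypes) / n      # share of heterozygotes
    h_exp = 1.0 - sum(p * p for p in freqs.values())   # uncorrected gene diversity
    f_is = 1.0 - h_obs / h_exp if h_exp > 0 else float("nan")
    return freqs, h_obs, h_exp, f_is


# Hypothetical sample of four individuals at one locus.
sample = [("147", "147"), ("147", "147"), ("151", "151"), ("147", "151")]
freqs, h_obs, h_exp, f_is = diversity(sample)
print(freqs, h_obs, h_exp, round(f_is, 3))
```

A positive Fis, as in this toy sample, means fewer heterozygotes were observed than expected from the allele frequencies.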
DISCUSSION:
- What can you say about diversity measures? Are they uniform in space?
- Can you think of a hypothesis to explain this pattern?
2.4 Although GENETIX can also be used for a variety of other analyses (you can explore the menus and see this for yourself!), we will use other software packages to continue our analyses. Before we do this, we need to convert our file to other formats. Go to Outils>Conversion>FSTAT and save your file as phylofstat.dat (it is important that you type the “.dat” along with the name). Now go to the same menu and choose Outils>Conversion>Genepop. Save your file as phylogenepop.txt (it is also important that you type the “.txt”).
3. We will now use GENEPOP. Although there is a version of this program we can download to our computers, it is easier to use GENEPOP on the web. Go to
3.1 We will begin by testing for HW equilibrium. Choose option 1 (Hardy-Weinberg exact tests) and then suboption 3 (For each locus in each population, Probability test). We will leave the default options as they are (as recommended). Choose HTML – plain text delivery file format. Now upload the file we just saved (phylogenepop.txt) using the Browse button and press Submit data.
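The probability test used by GENEPOP is an exact test, which is not reproduced here. The sketch below only shows the Hardy-Weinberg expectation against which observed genotype counts are compared, using an invented sample of ten individuals; the function name hw_expected_counts is hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement


def hw_expected_counts(genotypes):
    """Expected genotype counts under Hardy-Weinberg for one locus."""
    n = len(genotypes)
    allele_counts = Counter(allele for g in genotypes for allele in g)
    freqs = {a: c / (2 * n) for a, c in allele_counts.items()}
    expected = {}
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        # p^2 for homozygotes, 2pq for heterozygotes
        prob = freqs[a] ** 2 if a == b else 2 * freqs[a] * freqs[b]
        expected[(a, b)] = prob * n
    return expected


# Hypothetical sample: 5 + 2 + 3 individuals.
sample = [("147", "147")] * 5 + [("147", "151")] * 2 + [("151", "151")] * 3
observed = Counter(tuple(sorted(g)) for g in sample)
expected = hw_expected_counts(sample)

for genotype in sorted(expected):
    print(genotype, "observed:", observed[genotype],
          "expected:", round(expected[genotype], 2))
```

In this toy sample the heterozygotes are fewer than expected (2 observed against 4.8 expected), which is the kind of departure the exact test evaluates.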
DISCUSSION:
- Is any locus in Hardy-Weinberg disequilibrium?
- Does this still hold after applying the correction for multiple tests (to be explained!)?
3.2 Now try to compute F statistics for all populations (Option 6.1). This menu seems complex because some of its other options require additional information, but that information is irrelevant for our analysis, so all you need to do is upload the file and submit the data!
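As a rough illustration of what Fst measures, the following sketch computes a simple Nei-style Fst = (Ht - Hs) / Ht for one locus from per-population allele frequencies. This is a simplification chosen for clarity: GENEPOP reports Weir and Cockerham's estimators, which also account for sample sizes, so the values will not be identical. The allele frequencies and the function names are hypothetical.

```python
def gene_diversity(freqs):
    """Expected heterozygosity from a dict of allele -> frequency."""
    return 1.0 - sum(p * p for p in freqs.values())


def nei_fst(pop_freqs):
    """Nei-style Fst for one locus; populations are weighted equally.

    pop_freqs: list of dicts (allele -> frequency), one per population.
    """
    alleles = set().union(*pop_freqs)
    k = len(pop_freqs)
    mean_freqs = {a: sum(f.get(a, 0.0) for f in pop_freqs) / k for a in alleles}
    h_s = sum(gene_diversity(f) for f in pop_freqs) / k  # mean within-population diversity
    h_t = gene_diversity(mean_freqs)                     # diversity of the pooled frequencies
    return (h_t - h_s) / h_t if h_t > 0 else float("nan")


# Hypothetical frequencies: two strongly differentiated populations.
differentiated = [{"147": 0.9, "151": 0.1}, {"147": 0.1, "151": 0.9}]
# Hypothetical frequencies: two identical populations.
identical = [{"147": 0.5, "151": 0.5}, {"147": 0.5, "151": 0.5}]

print(round(nei_fst(differentiated), 2), round(nei_fst(identical), 2))
```

Fst is high when most of the diversity lies between populations and zero when the populations share the same allele frequencies.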
DISCUSSION:
- Does Fst vary between loci? Can you think of explanations for this?
4. We are now ready to try FSTAT. This software can be used for the same purposes as GENETIX and GENEPOP. We are going to use it for two different analyses. Open FSTAT.
4.1 Go to File>Open and choose the file you saved earlier (phylofstat.dat).
4.2 We want to see whether or not our loci are in linkage equilibrium. Therefore, to avoid unnecessary analyses, we will uncheck all the options and choose only Genotypic Disequilibrium>Tests between all pairs of loci. You can also check the box Nominal level for multiple tests>5/100 (this is the value that will be divided by the number of tests for the Bonferroni correction).
4.3 Check your results by opening the file phylofstat.out that now appears in your folder.
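The Bonferroni correction mentioned in step 4.2 amounts to simple arithmetic, sketched below for our 7 loci. The p-values in the example are invented, and the way FSTAT counts tests in its own output is not verified here, so treat the threshold as an illustration of the principle.

```python
from math import comb

n_loci = 7                       # number of loci in our data set
n_pairs = comb(n_loci, 2)        # number of tests between all pairs of loci
nominal_level = 5 / 100          # the 5/100 nominal level chosen in FSTAT
adjusted_level = nominal_level / n_pairs

print(n_pairs, round(adjusted_level, 5))

# Hypothetical p-values for three pairs of loci (not real results).
p_values = {("locus1", "locus2"): 0.0400,
            ("locus1", "locus3"): 0.0010,
            ("locus2", "locus3"): 0.6000}

# A pair is significant after correction only if p is below the adjusted level.
significant = [pair for pair, p in p_values.items() if p < adjusted_level]
print(significant)
```

Note that a p-value of 0.04 would look significant at the nominal 5% level but no longer is once the 21 tests are taken into account.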
DISCUSSION:
- Are there loci that appear to be linked?
4.4 Now go to the panel Comp. among groups of samples. We want to statistically test the hypothesis that populations north of the Douro have lower levels of diversity than populations from the south, with diversity measured both by gene diversity (expected heterozygosity) and by allelic richness. To do this:
4.4.1 Define the Number of groups as 2.
4.4.2 Now we need to Define groups. Since we want to test the hypothesis that one group has a value larger than the other, our test is “one sided”. In this case the program only allows us to test the hypothesis that populations from “Group 1” have larger values than those from “Group 2” and not the other way around.
For this reason, we will need to define populations of the south (5-13) as group 1 and populations of the north (1-4) as group 2. In the pop-up menu “Define group 1”, select populations 5 to 13 and click on . Click OK. Do the same with populations 1-4 in the menu “Define group 2”.
4.4.3 Check the box Type of test for two groups>One sided.
4.4.4 Now check the boxes Allelic richness and Gene diversity.
4.4.5 Choose the Number of permutations (let’s try 2000, for example). Note that the number of distinct permutations is bounded by the number of groups and populations: however many times populations are swapped between groups, we will keep testing repeated configurations, since only 715 different arrangements of 4+9 populations are possible with this data set.
4.4.6 Press Run.
4.4.7 After the calculations are done, check results in file phylofstat_test.out.
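The figure of 715 arrangements quoted in step 4.4.5 is the number of ways of choosing which 4 of the 13 populations form the smaller group. The sketch below checks this and, because the number is so small, enumerates every arrangement to obtain an exact one-sided p-value. The gene diversity values are invented, and the statistic used (difference between group means) is an assumption made for illustration; it is not necessarily the statistic FSTAT computes.

```python
from itertools import combinations
from math import comb

# Hypothetical gene diversity values for populations 1-13 (not real results).
diversity = [0.52, 0.55, 0.50, 0.54,                               # populations 1-4 (north)
             0.66, 0.70, 0.68, 0.71, 0.65, 0.69, 0.72, 0.67, 0.70]  # populations 5-13 (south)
north = (0, 1, 2, 3)  # zero-based indices of group 2


def mean_difference(group2_idx):
    """Mean of group 1 minus mean of group 2 for a given choice of group 2."""
    group2 = [diversity[i] for i in group2_idx]
    group1 = [d for i, d in enumerate(diversity) if i not in group2_idx]
    return sum(group1) / len(group1) - sum(group2) / len(group2)


observed = mean_difference(north)

# Every distinct way of assigning 4 of the 13 populations to group 2.
arrangements = list(combinations(range(13), 4))

# One-sided test: how many arrangements give a difference at least as large?
as_extreme = sum(mean_difference(a) >= observed - 1e-12 for a in arrangements)
p_value = as_extreme / len(arrangements)

print(len(arrangements), comb(13, 4), as_extreme, round(p_value, 4))
```

With only 715 distinct arrangements, the smallest p-value that any permutation scheme can reach for this design is 1/715, whatever number of permutations is requested.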
DISCUSSION:
- What do you conclude?
5. We are now done with calculating summary statistics. We will use STRUCTURE, which implements a Bayesian MCMC algorithm to search for clusters of individuals that minimize Hardy-Weinberg disequilibrium (HWD) and linkage disequilibrium (LD). Note the big differences between the previous methods and this one:
a) we are now going to consider INDIVIDUALS, not POPULATIONS, as our working unit;
b) this is a SEARCH method, not the mere application of formulas to calculate statistics. As in the permutations to obtain p-values in the previous methods, different users will get different results because it is unlikely that two independent simulations will converge on exactly the same result. If everything goes well, however, we expect these independent trials to be VERY SIMILAR (if they are not, then convergence has not been reached).
5.1 There are programs that convert the GENEPOP file format to the STRUCTURE file format (e.g. FORMATOMATIC), but we will not use them here. We have prepared a ready-to-go input file for STRUCTURE (phylostructure.txt). Open it and take a look at how it differs from the other files:
- each individual is now represented in two lines, each with an allele of the diploid genotype for each locus (this is actually not mandatory, but it is so in the data set we are going to analyse).
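The two-lines-per-individual layout can be illustrated with a short conversion sketch. The exact column order and separator of phylostructure.txt are assumptions here, as are the function name to_structure_rows and the example individual; the missing genotype “000000” is written as 0 to match the Missing data value we will declare in STRUCTURE.

```python
def to_structure_rows(individual, population, genotypes, digits=3):
    """Turn six-digit genotypes into two rows, one allele per locus in each row."""
    # int() drops leading zeros, so a missing allele "000" becomes 0.
    first = [str(int(g[:digits])) for g in genotypes]
    second = [str(int(g[digits:])) for g in genotypes]
    prefix = [individual, str(population)]
    return ["\t".join(prefix + first), "\t".join(prefix + second)]


# Hypothetical individual from population 1, typed at three loci (one missing).
rows = to_structure_rows("S01", 1, ["147147", "152156", "000000"])
for line in rows:
    print(line)
```

Both rows repeat the individual code and the population number; only the allele columns differ between them.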
5.2 Open STRUCTURE.
5.3 Go to File>New project.
5.4 Follow the steps in the project wizard menu. 1. Name the project as you prefer (e.g. chiomsats), choose the directory where you want the results to be stored, and choose the data file (phylostructure.txt). Click next. 2. Number of individuals=287, Number of loci=7, Missing data value=0. Click next. 3. All we need is to check Row of marker names. 4. Check Individual ID for each individual and Putative population of origin (this is not taken into account for the analyses). Finish. Check that everything is OK and press Proceed. The data set should now appear on your screen.
5.5 Now we need to define a parameter set for the MCMC. Go to Parameter set>New.
Run length: Choose 10000 as the length of burn-in period (the initial, pre-convergence steps in the MCMC that will be discarded) and 90000 as the number of steps after burn-in.
Ancestry model: Check Use admixture model. The results are often similar among different models, but choosing this means we will estimate the proportion of the genome of each individual originating in a given cluster (which is appropriate if we want to know if there are hybrids).
Allele frequency model: Check Allele frequencies independent. Again, the different models give only slightly different results, but this one is more appropriate since it assumes that mutation, and not only drift, has occurred in our sample (since we are dealing with microsatellites, which have a fast mutation rate, we should assume mutation!).
Advanced: Check Compute probability of the data and Print credible regions.
Click OK and name the parameter set (e.g. test, since we will do a very preliminary analysis).
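The role of the burn-in period set under Run length can be illustrated with a toy example. The sketch below is a simple Metropolis sampler for a standard normal distribution, deliberately started far from the region where it should end up; it is not the sampler implemented in STRUCTURE, and the step counts are scaled down (1000 + 9000 instead of 10000 + 90000) to keep it fast. All names are hypothetical.

```python
import math
import random


def metropolis_chain(n_steps, start, seed=1):
    """Toy Metropolis sampler whose target is a standard normal distribution."""
    rng = random.Random(seed)
    x = start
    chain = []
    for _ in range(n_steps):
        proposal = x + rng.uniform(-1.0, 1.0)
        # Log of the ratio of target densities: proposal versus current value.
        log_ratio = (x * x - proposal * proposal) / 2.0
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        chain.append(x)
    return chain


burn_in, kept_steps = 1000, 9000
chain = metropolis_chain(burn_in + kept_steps, start=50.0)

# Discard the pre-convergence steps and summarise only what comes after.
kept = chain[burn_in:]
mean_all = sum(chain) / len(chain)
mean_kept = sum(kept) / len(kept)
print("mean using all steps:", mean_all)
print("mean after discarding burn-in:", mean_kept)
```

The early steps still reflect the arbitrary starting value, so including them biases the summary; discarding them is what the burn-in setting does.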
5.6 We are ready to start the analysis. Go to Project>Start a job. Select the parameter set, and set K (the number of clusters to test) from 2 to 5. Because many analyses will be conducted in parallel by different users, we can use only 1 iteration (but usually we should do replicate analyses!). Now press Start.
DISCUSSION:
- Is there any obvious population structure in this species?
- Are the results similar to those obtained based on allozymes?
- Can you think of possible explanations for the discrepancies?