Supplementary material: Different models to describe the relationship between allele number and population size

A Core Collection and Mini Core Collection of Oryza sativa L. in China

Hongliang Zhang1*, Dongling Zhang1*, Meixing Wang1, Junli Sun1, Yongwen Qi1, Jinjie Li1, Xinghua Wei2, Longzhi Han3, Zongen Qiu3, Shengxiang Tang2, Zichao Li1#

1 Key Laboratory of Crop Genomics & Genetic Improvement of Ministry of Agriculture, and Beijing Key Laboratory of Crop Genetic Improvement, China Agricultural University, Beijing 100094, China; 2 China National Rice Research Institute, Hangzhou, Zhejiang 310006, China; 3 Institute of Crop Science, Chinese Academy of Agricultural Sciences, Beijing 100081, China; *These authors contributed equally to this work

#Corresponding author:

Zichao Li, College of Agronomy and Biotechnology, China Agricultural University, Beijing 100094, China; Tel: 86-10-62731414, Fax: 86-10-62731414, E-mail:

Crossa et al. (1993) estimated the probability of sampling an allele from the binomial distribution of the ith allele, given its frequency in the population ($p_i$). Accordingly, the expected allele number ($n_A$) per locus in a sample of n accessions is

$$n_A = \sum_{i=1}^{a}\left[1-(1-p_i)^{n}\right] \tag{S1}$$

where a is the number of alleles in the population. Here we call it ‘allele frequency based estimation’.
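As a rough illustration of the allele frequency based estimation, the sketch below assumes that each of the n accessions is an independent binomial draw, so that allele i is captured with probability 1 - (1 - p_i)^n, and sums this probability over all alleles at the locus. The function name and the example frequencies are hypothetical and are not taken from the study.

```python
def expected_alleles_allele_based(freqs, n):
    """Expected number of distinct alleles in a sample of n accessions.

    Assumption: each accession contributes one independent draw, so
    allele i (population frequency p_i) is missed with probability
    (1 - p_i) ** n.
    """
    return sum(1.0 - (1.0 - p) ** n for p in freqs)


# Hypothetical locus with four alleles (frequencies sum to 1).
example_freqs = [0.70, 0.20, 0.09, 0.01]

for n in (10, 100, 1000):
    estimate = expected_alleles_allele_based(example_freqs, n)
    print(f"n = {n:4d}: expected alleles = {estimate:.3f}")
```

Rare alleles dominate the result: the common alleles are captured almost surely even in small samples, while the allele at frequency 0.01 needs a few hundred accessions before it is likely to appear.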

However, we sampled rice varieties (genotypes), each carrying two alleles at a locus rather than one, so we developed a ‘genotype frequency based estimation’ based on the binomial distribution of the genotype at a locus. For an allele ($A_i$) with particular frequencies of homozygotes ($p_{ii}$) and heterozygotes ($p_{i.}$), the probability that the ith allele is absent from a sample is

$$\left(1-p_{ii}-p_{i.}\right)^{n} \tag{S2}$$

and the probability that at least one copy of the ith allele is included in the sample is

$$1-\left(1-p_{ii}-p_{i.}\right)^{n} \tag{S3}$$

Considering a locus with a alleles in the population, let Hom be the number of alleles having only homozygous genotypes, Het be the number of those having only heterozygous genotypes, and HH be the number of those having both homozygous and heterozygous genotypes, such that a = Hom + Het + HH. Thus, the expected allele number of the locus in a sample of n accessions is

$$n_A = \sum_{i=1}^{Hom}\left[1-(1-p_{ii})^{n}\right] + \sum_{j=1}^{Het}\left[1-(1-p_{j.})^{n}\right] + \sum_{k=1}^{HH}\left[1-(1-p_{kk}-p_{k.})^{n}\right] \tag{S4}$$
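The genotype frequency based estimation can be sketched in the same way. The code below assumes that an allele is absent from a sample only if none of the n sampled accessions carries it, that is, with probability (1 - p_ii - p_i.)^n, where p_ii and p_i. are the frequencies of homozygous and heterozygous carriers. For comparison it also computes an allele frequency based value using the standard relation p_i = p_ii + p_i./2. All function names and genotype frequencies are hypothetical.

```python
def prob_allele_absent(p_hom, p_het, n):
    """Probability that none of n sampled accessions carries the allele."""
    return (1.0 - p_hom - p_het) ** n


def expected_alleles_genotype_based(carriers, n):
    """Sum over alleles of the probability of at least one carrier.

    carriers: list of (p_hom, p_het) pairs, one per allele. Alleles seen
    only in homozygotes have p_het = 0; alleles seen only in
    heterozygotes have p_hom = 0.
    """
    return sum(1.0 - prob_allele_absent(ph, pt, n) for ph, pt in carriers)


def expected_alleles_from_allele_freqs(carriers, n):
    """Allele frequency based value for the same locus (p = p_hom + p_het / 2)."""
    return sum(1.0 - (1.0 - (ph + pt / 2.0)) ** n for ph, pt in carriers)


# Hypothetical locus: genotype frequencies A1A1 = 0.5, A2A2 = 0.2,
# A1A2 = 0.1, A1A3 = 0.1, A2A3 = 0.1.
carriers = [
    (0.5, 0.2),  # A1: homozygotes plus two heterozygote classes
    (0.2, 0.2),  # A2
    (0.0, 0.2),  # A3: found only in heterozygotes
]

for n in (5, 10, 20):
    g = expected_alleles_genotype_based(carriers, n)
    a = expected_alleles_from_allele_freqs(carriers, n)
    print(f"n = {n:2d}: genotype based = {g:.3f}, allele based = {a:.3f}")
```

Because the carrier frequency p_ii + p_i. is never smaller than the allele frequency p_ii + p_i./2, the genotype based value is never below the allele based one under these assumptions, and the gap widens as heterozygosity increases.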

Using Crossa’s allele frequency based estimation (Equation S1), the genotype frequency based estimation (Equation S4) and the MMF-based estimation (Equation 3), we estimated the allele numbers (Fig. M1a) at 36 SSR loci and at loci with higher LD in populations of various sizes. We simultaneously determined the actual allele numbers at the 36 SSR loci in the respective sampled populations, and at loci with higher LD through five random samplings. Differences between the estimated and actual allele numbers in samples of the same size showed that MMF-based estimation was the most stable and accurate measure, and that both allele frequency based and genotype frequency based estimations were biased downward, especially for loci with higher heterozygosity (Fig. M1a, b). Allele frequency based estimates were lower than genotype frequency based estimates, especially for loci with higher heterozygosity and higher degrees of LD (Fig. M1a, b). For samples of fewer than 2000 accessions, the consistency of allele numbers between the allele frequency based and genotype frequency based estimations could exceed 99% if heterozygosity was lower than 10%, but was only 50% if heterozygosity was higher than 90% (Fig. M2). This comparison among the allele frequency based, genotype frequency based and MMF-based models indicated that MMF-based estimation could precisely predict the allele number in a population of a given size, or the population size required to retain a given number of alleles.
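The comparison against real random sampling can be mimicked on toy data. The sketch below builds a hypothetical population of diploid genotypes, draws five random samples of each size, counts the distinct alleles observed, and expresses each binomial estimate as a percent difference from the mean observed count. Sampling without replacement, the population composition and all function names are assumptions made for illustration; the MMF-based model is not reproduced here.

```python
import random


def count_alleles(sample):
    """Number of distinct alleles among the sampled genotypes."""
    return len({allele for genotype in sample for allele in genotype})


def mean_actual_alleles(population, n, draws=5, seed=0):
    """Mean observed allele number over several random samples of size n."""
    rng = random.Random(seed)
    counts = [count_alleles(rng.sample(population, n)) for _ in range(draws)]
    return sum(counts) / draws


def allele_frequencies(population):
    """Allele frequency: copies of the allele / (2 * number of accessions)."""
    counts = {}
    for genotype in population:
        for allele in genotype:
            counts[allele] = counts.get(allele, 0) + 1
    return {a: c / (2 * len(population)) for a, c in counts.items()}


def carrier_frequencies(population):
    """Carrier frequency: accessions with at least one copy / accessions."""
    counts = {}
    for genotype in population:
        for allele in set(genotype):
            counts[allele] = counts.get(allele, 0) + 1
    return {a: c / len(population) for a, c in counts.items()}


def binomial_estimate(freqs, n):
    """Sum over alleles of 1 - (1 - frequency) ** n."""
    return sum(1.0 - (1.0 - p) ** n for p in freqs.values())


def percent_difference(estimate, actual):
    """Positive: estimate is above the observed value; negative: below."""
    return 100.0 * (estimate - actual) / actual


# Hypothetical population of 1000 accessions with five alleles.
population = (
    [("A1", "A1")] * 600 + [("A2", "A2")] * 250 + [("A1", "A2")] * 100
    + [("A3", "A3")] * 30 + [("A2", "A4")] * 15 + [("A5", "A5")] * 5
)

for n in (20, 50, 100):
    actual = mean_actual_alleles(population, n)
    a_based = binomial_estimate(allele_frequencies(population), n)
    g_based = binomial_estimate(carrier_frequencies(population), n)
    print(
        f"n = {n:3d}: actual = {actual:.2f}, "
        f"A-based = {percent_difference(a_based, actual):+.1f}%, "
        f"G-based = {percent_difference(g_based, actual):+.1f}%"
    )
```

With only five draws the observed mean is noisy, so the sign and size of the percent differences in this toy run should not be read as confirming or contradicting the pattern reported for the real collection.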

Fig. M1 Consistency of allele frequency based, genotype frequency based and MMF-based estimations with real random sampling for SSR markers with different degrees of heterozygosity and linkage disequilibrium (measured by R²). The y-axis in (b) is the percent difference of each estimation relative to real random sampling: positive values represent overestimation relative to real random sampling, and negative values represent underestimation.

Fig. M2 Impact of heterozygosity on consistency of allelic numbers between allele frequency based (A-based) and genotype frequency based (G-based) estimations in samples of different sizes. The y-axis is the percentage of G-based estimation higher than A-based estimation. The x-axis is the percentage heterozygosity in the primary core collection. For each interval of heterozygosity, each bar represents one of the sample sizes from 100 to 2000 accessions, in steps of 100.