SUPPORTING INFORMATION

Complete mitochondrial genomes and a novel spatial genetic method reveal cryptic phylogeographical structure and migration patterns among brown bears in north-western Eurasia

Marju Keis, Jaanus Remm,Simon Y. W. Ho, John Davison, Egle Tammeleht, Igor L. Tumanov, Alexander P. Saveljev, Peep Männil, Ilpo Kojola, Alexei V. Abramov,

Tõnu Margus and Urmas Saarma

Journal of Biogeography

Appendix S2: Step-by-step guide to the DResD procedure: a novel spatially explicit, individual-based analysis to identify migration corridors or barriers

The analysis is performed using R 2.14, with packages utils, base and stats needed to run the script (R Development Core Team, 2011). Note that the images have only illustrative purpose, and do not represent real sampling. We are grateful for A. Kaasik for supporting with computational and scripting advice. For additional explanations and discussion please contact with Jaanus Remm ().

1) Import data and prepare the analysis.

• Load a matrix of genetic dissimilarity. In principle, any kind of difference or similarity matrix would be suitable. Only the lower-left triangle should be present, and the first row and first column should contain sample names.

We used the Jukes–Cantor index calculated using MEGA 5 (Tamura et al., 2011). This index was used because it provides similar results to the widely used Kimura and γ-distances, but its calculation is simpler and it generally generates lower variance estimates than the other indices (Schöniger & von Haeseler 1993; Tajima & Takezaki, 1994). We exported the dissimilarity matrix from MEGA 5 in CSV format and manually added sample names into the first row of the table using a spreadsheet application. If using Mega 5, the technical caption in the end of the table should be removed.

M.gen.diss <- read.table(file = 'file path', sep = ',', header = T, row.names = 1, fill = T)

M.gen.diss <- as.dist(M.gen.diss)

• Load the Cartesian coordinates of sample locations and generate a matrix of spatial Euclidean distance (again our input took the form of a CSV file). Polar coordinates should be avoided because simple Euclidean distances generated from those would have weak proportionality with the true spatial separation of points, due to the convergence of meridians near the poles. For our analysis, the coordinates were in Lambert’s conformal conic projection with the origin at 60°N/35°E, and standard parallels 55°N and 65°N; the units of coordinates were meters. In the input file, sample names should be present in the first column and variable names ('x' and 'y') in first row of the table. If the column names in the CSV file are correct, the command names{base} in the next script section, is not necessary.

The example images represent sampling of four genotypes at known geographical positions (A-D); landscape structure in and around the samples is unknown. We use a simplified landscape model, assuming three discrete landscape subtypes with respect to gene flow (high, moderate and low resistance to gene flow). Genetic distance between pairs of samples is expected to reflect both geographic distance and landscape structure.

sample.points <- read.table(file = 'file path', sep = ',', header = T, row.names = 1, fill = T)

names(sample.points) <- c('x', 'y')

M.geo.dist <- dist(sample.points[,1:2])

2) Calculation of an asymptotic regression curve of isolation by distance (IBD) between sample pairs, and derivation of residual values. The IBD residuals represent the difference in genetic dissimilarity between the true value of a particular sample pair and the IBD curve, i.e. the value expected as a result of geographic distance. Using IBD residuals allows the effect of geographic distance on genetic dissimilarity to be accounted for when analyzing pairwise differences between samples.

• Data preparation.

IBD.data <- data.frame(cbind(as.vector(M.geo.dist), as.vector(M.gen.diss)))

names(IBD.data) <- c('geo.dist', 'gen.diss')

• A preliminary estimate of the IBD curve parameters is calculated using a subset of sample pairs with the function NLSstAsymptotic{stats}; note that the whole data set was too large to model using this function.

IBD.model.prelim <- NLSstAsymptotic(sortedXyData(x = IBD.data[1:1000,1], y = IBD.data[1:1000,2]))

• The final parameters were estimated using the function nls{stats}, which can analyze the whole dataset, but which requires preliminary starting estimates for model parameters (the values calculated in the previous step).

IBD.model <- nls(formula = gen.diss ~ b0 + b1 * (1 - exp(-exp(lrc) * geo.dist)), data = IBD.data, start = list(b0 = IBD.model.prelim [1], b1 = IBD.model.prelim [2], lrc = IBD.model.prelim [3]))

• Calculation of IBD residual matrix.

M.IBD.res <- M.gen.diss - (coef(IBD.model)[1] + coef(IBD.model)[2] * (1 - exp(-exp(coef(IBD.model)[3]) * M.geo.dist)))

3) Generate a rectangular grid covering the geographic area of sampling.

• Define the grid step (10 km in our analysis) and margin width (100 km) from the outermost sample locations.

step <- 10000

margin <- 100000

• Define the border-coordinates of the grid area.

x.min <- min(sample.points$x) - margin

x.max <- max(sample.points$x) + margin

y.min <- min(sample.points$y) - margin

y.max <- max(sample.points$y) + margin

• Generate table with grid point coordinates.

grid <- expand.grid(x = seq(from = x.min, to = x.max, by = step), y = seq(from = y.min, to = y.max, by = step))

names(grid) <- c('x', 'y')

4) The location of midpoints of sample pairs is taken as the best possible point location for pair-based parameters. Subsequently, the analysis will be based only on coordinates of midpoints; while the coordinates of sample points are not used.

• Calculation of the coordinates of all pair midpoints using row-wise and column-wise matrices of all sample point coordinates.

n <- dim(sample.points)[1]

M.x1 <- as.dist(matrix(data = sample.points$x, nrow = n, ncol = n, byrow = F))

M.y1 <- as.dist(matrix(data = sample.points$y, nrow = n, ncol = n, byrow = F))

M.x2 <- as.dist(matrix(data = sample.points$x, nrow = n, ncol = n, byrow = T))

M.y2 <- as.dist(matrix(data = sample.points$y, nrow = n, ncol = n, byrow = T))

M.mid.x <- (M.x1 + M.x2) / 2

M.mid.y <- (M.y1 + M.y2) / 2

• The midpoints of spatially close pairs (≤ 50 km) were excluded because of the geographical imprecision of some sample locations and the potential influence of animal mobility within a home range. Midpoints of long distance pairs (≥ 500 km) were also excluded because of the uncertainty associated with a single midpoint between such points. Note that only the pair midpoints that did not match these criteria were excluded, not the actual samples. Other connections involving the same sample were retained in the analysis.

min.dist <- 50000

max.dist <- 500000

midpoints <- data.frame(as.vector(M.mid.x[(M.geo.dist > min.dist) & (M.geo.dist < max.dist)]),

as.vector(M.mid.y[(M.geo.dist > min.dist) & (M.geo.dist < max.dist)]),

as.vector(M.IBD.res[(M.geo.dist > min.dist) & (M.geo.dist < max.dist)]))

names(midpoints) <- c('x', 'y', 'z')

5) The empirical inverse distance weighted average and standard deviation (SD) of all IBD residuals at every grid point is calculated using all pair midpoints. The result of the inverse distance weighting is that the midpoints that are closer to the grid point under calculation have a stronger influence than those located farther from the grid point.

• Definition of function weighted.var to calculate weighted variance (see the origin in https://stat.ethz.ch/pipermail/r-help/2008-July/168762.html).

weighted.var <- function(x, w, na.rm = FALSE){

if (na.rm){

w <- w[i <- !is.na(x)]

x <- x[i]

}

sum.w <- sum(w)

sum.w2 <- sum(w ** 2)

mean.w <- sum(x * w) / sum(w)

(sum.w / (sum.w ** 2 - sum.w2)) * sum(w * (x - mean.w) ** 2, na.rm = na.rm)

}

• A value of 25 km was added to all distances between midpoints and grid points in the weight calculation. As the midpoint does not represent any precise location, it is necessary to avoid distance values close to 0 that would give a weight value approaching infinity for a single midpoint. This also helps to prevent single grid points surrounded by a low number of sample points from gaining excessive importance. The choice of the exact value was determined following preliminary investigation using a range of values, but the chosen value relates neatly to half of the minimum pair-wise distance used in our analysis. We have not found an exact algorithm to estimate the value, but we recommend using a value that is larger than the grid step to avoid dependence on the starting coordinates of grid generation.

shift <- 25000

• Definition of function calculation, which finds the inverse square distance weights and calculates weighted mean and weighted SD of IBD residuals from of all midpoints for a single grid point.

calculation <- function(gridx, gridy, midpoints, shift){

w <- (shift + sqrt((gridx - midpoints[,1]) ** 2 + (gridy - midpoints[,2]) ** 2)) ** -2

return(c(weighted.mean(midpoints[,3], w), weighted.var(midpoints[,3], w) ** 0.5))

}

• Calculate the weighted mean and weighted SD for all grid points.

Different colors on the image represent values of distance-weighted mean of IBD residual. Red and yellow denote higher values; blue tones denote lower values; and green represents values close to 0 (i.e. close to the IBD curve).

grid <- cbind(grid, t(mapply(calculation, grid[,1], grid[,2], MoreArgs = list(midpoints, shift))))

names(grid) <- c('x', 'y', 'z', 'SD')

6) Using randomization to estimate areas where the estimated average IBD residual value is significantly higher or lower than would be expected if IBD residuals were randomly distributed in space (IBD alone, the null-model; Manly 2007).

For further analysis or illustration, it is possible to select only the grid points with estimated value of IBD residual significantly different from randomness, i.e. values within the upper or lower 2.5% tail of the randomized distribution.

• Definition of number of iterations (1000 in our analysis), and addition of columns into the grid table to record the proportion of the randomized distribution larger than estimated value.

n.resamp <- 1000

grid <- cbind(grid, 0, 0, 0)

names(grid) <- c(names(grid)[1:4], 'prob.z>H0', 'H0.cl1', 'H0.cl2')

H0.lower.tail <- H0.upper.tail <- matrix(nrow = dim(grid)[1], ncol = floor(n.resamp * 0.025))

• Definition of function perm.rowscols, which randomly resamples a matrix by rows and columns, using the same sequence for both.

perm.rowscols = function (m) {

d <- dim(m)

s <- sample(1:d[1])

m[s,s]

}

• Randomization procedure that returns the proportion of the randomized distribution larger than the estimated value, and 95% confidence limits of the expected sampling distribution for each grid point. Only the matrix of IBD residuals was randomized; the spatial structure of sampling and the shape of IBD relationship was retained.

Note, that the subsequent for{base} loops may take several hours depending on the sample and grid size, the number of iterations, and the computing power.

for(i in 1:n.resamp) {

M.IBD.res.resamp <- as.dist(perm.rowscols(as.matrix(M.IBD.res)))

midpoints.resamp <- data.frame(midpoints[,1:2], as.vector(M.IBD.res.resamp[(M.geo.dist > min.dist) & (M.geo.dist < max.dist)]))

resamp <- t(mapply(calculation, grid[,1], grid[,2], MoreArgs = list(midpoints.resamp, step)))

grid[,5] <- grid[,5] + (grid[,3] > resamp[,1]) / n.resamp

if(i < n.resamp * 0.025) {

H0.lower.tail[,i] <- resamp[,1]

H0.upper.tail[,i] <- resamp[,1]

}

else {

H0.lower.tail <- t(apply(cbind(resamp, H0.lower.tail), MARGIN = 1 , FUN = sort))[,1:dim(H0.lower.tail)[2]]

H0.upper.tail <- t(apply(cbind(resamp, H0.upper.tail), MARGIN = 1 , FUN =sort))[,2:(dim(H0.lower.tail)[2]+1)]

}

next

}

grid[,6] <- apply(H0.lower.tail, MARGIN = 1, FUN = max)

grid[,7] <- apply(H0.upper.tail, MARGIN = 1, FUN = min)

7) Bootstrapping the empirical distribution of distance-weighted average IBD residuals at every grid point to estimate the post hoc statistical power (i.e. 1 - probability of type II error) of the randomization test (Aberson 2010).

As calculated for each grid point, the proportion of the randomized distribution (null-model) larger than the empirical average IBD residual value indicates the statistical significance (P-value) of the difference between the test value and the null-model. The proportion of the bootstrap-distribution larger than the confidence limit of the expected null-model indicates the statistical power of the test. If the IBD residual has a negative value at a particular grid point, the respective lower tails of the distributions are used.

• Definition of the number of bootstrap iterations (250 in our analysis), and addition of a column into the grid table to record the proportion of the bootstrap-distribution outside the confidence interval of the null model.

n.bts <- 250

grid <- cbind(grid, 1)

names(grid) <- c(names(grid)[1:7], 'power')

• Bootstrap procedure that returns the proportion of the empirical sampling distribution (H1) outside the confidence interval of the expected sampling distribution (H0) for each grid point. For each bootstrap sample, a random selection was taken from all the midpoints included in the analysis (distance range 50–500 km in our analysis). Repeated sampling of same midpoint was permitted; the total number of midpoints was held constant (Manly 2007).

for(i in 1:n.bts) {

s <- sample(1:dim(midpoints)[1], replace = T)

midpoints.bts <- midpoints[s,]

bts <- t(mapply(calculation, grid[,1], grid[,2], MoreArgs = list(midpoints.bts,step)))

grid[,8] <- grid[,8] - (grid[,6] < bts[,1] & grid[,7] > bts[,1]) / n.bts

next

}

8) Save the resulting tables of midpoints and grid in CSV files.

write.csv(midpoints, file ='file path')

write.csv(grid, file = 'file path')

REFERENCES

Aberson, C.L. (2010) Applied power analysis for the behavioral sciences. Taylor and Francis Group, LLC, New York.

Manly, B.F.J. (2007) Randomization, bootstrap and Monte Carlo methods in biology. Chapman & Hall/CRC, Boca Raton, FL.

R Development Core Team (2011) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org/.

Schöniger, M. & von Haeseler, A. (1993) A simple method to improve the reliability of tree reconstruction. Molecular Biology and Evolution, 10, 471–483.

Tajima, F. & Takezaki, N. (1994) Estimation of evolutionary distance and reconstructing molecular phylogenetic trees. Molecular Biology and Evolution, 11, 278–286.

Tamura, K., Peterson, D., Peterson, N., Stecher, G., Nei, M. & Kumar, S. (2011) MEGA5: Molecular Evolutionary Genetics Analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Molecular Biology and Evolution, 28, 2731–2739.