**MaxOcc: a web portal for Maximum Occurrence Analysis**

Ivano Bertini1,2*, Lucio Ferella1, Claudio Luchinat1,2*, Giacomo Parigi1,2, Maxim V. Petoukhov3, Enrico Ravera1,2, Antonio Rosato1,2, Dmitri I. Svergun3.

1Magnetic Resonance Center (CERM), University of Florence, Via L. Sacconi 6, 50019 Sesto Fiorentino (FI), Italy

2Department of Chemistry, University of Florence, Via della Lastruccia 3, 50019 Sesto Fiorentino (FI), Italy

3EMBL, Hamburg Outstation, Notkestrasse 85, D-22603 Hamburg, Germany

**Supporting information**

A pool of plausible protein conformations wasgenerated randomly using the program RanCh(Bernado et al., 2007). The size of the poolshould be defined depending on the extent of mobility of the system (i.e., on the number of flexible backbone dihedral angles). In the case of the two-domain protein calmodulin, 50,000 conformations were sufficient (Bertini et al., 2010). RanCh uses as an input the coordinates of the rigid domains in PDB format and the sequence of the protein; residues that appear in the sequence but are not present in the PDB files are considered flexible. Flexible linkers are substituted with chains of dummy residues centered at the Cα positions,whereby the linker conformation is varied inside allowed areas of a quasi-Ramachandran plot (Kleywegt, 1997). The scattering profiles of the single conformations were contextually calculated with the CRYSOL approach(Svergun et al., 1995). The pcs and rdc corresponding to the provided magnetic susceptibility anisotropy tensors were calculated with CALCPARA (Bertini et al., 2002). The storage of these data requires about 1.3 GB of disk space. The input to MaxOccconstitutes of this entire dataset plus a number of parameters defining the calculation setup, summarized in Table S1(Bertini et al., 2010).

In the MaxOcc runs, ensembles of conformations are selected in order to minimize the discrepancy to the experimental data, according to the following target function:

where the weighting coefficients aimust be optimized by the user in such a way as to have a balanced contribution by each type of restraint,

are the experimental pcs/rdc/pre values,arethe pcs/rdc/pre values calculated for the jthconformation, with weight wj, k is the number of pcs/rdc/pre,n is the number of conformationsin the family, h is the number of points in the SAXS curve, Icalc is the average scattering intensity, calculated as:

andc is a scaling coefficient calculated as

.

To allow for an easier minimization through the analytical conjugate gradient, the normalization of the weights of the nconformationsis imposed as a harmonic restraint:

.

The MaxOcc calculation proceeds as described below and summarized in the flow-chart shown in Figure S1 (the symbols are provided in Table S1).

1)The number m of ensembles that are initially calculated is set to M (400 in the present work; this number can be specified by the user). Each ensemble comprises one selected conformation with a fixed weight and n other conformationsrandomly selected from the entire RanCh pool. This number n is calculated from the maximum number of structures that are required in the final ensemble, N, the number of steps ns, and the number of structures added at each step,na (all these parameters can beadjusted by the user; in the present work n=6).

2)For every ensemble, the discrepancy to the experimental data is minimized by changing the weights of the randomly selected conformations through a conjugate gradient minimization (400 steps in the present work, adjustable by the user). The ensembles are thus sorted according to their target function.

3)The best fitting ensembleis replicated m times and structures with low weight are randomly replaced with structures belonging to the pool generated with RanCh. This is done twenty times or until no improvement is achieved. The best fitting ensemble is the starting point form stepsof heuristic steepest descent, during which the n completing structures composing the ensemble are randomly replaced.

4)The minimization proceeds with a simulated annealing schedule of mi (10 in the present calculations) steps comprising mt(2000) random moves each,during which the structuresof the completing ensemble are again randomly replaced with structures belonging to the RanCh pool. Differently from the previously published version, if the ensemble does not improve over all the random moves taken at the same temperature, the program switches to an “informed” version of the algorithm in which the moves are confined to the 10 nearest neighbors of the conformations having higher weights.

5)When the minimum temperature is reached, the ensemble is replicated in order to have atotal number of ensembles decreased by ms (11) with respect to step 1) to reduce the computational burden and a numberna (4) of newconformations are included to each ensemble. The whole schedule is repeated again (from step 2), until the number of conformations within the ensembles matches the maximum number Nallowed by the user (50 for the presented calculations).

Reference List

Bernado P, Mylonas E, Petoukhov MV, Blackledge M and Svergun DI (2007) Structural characterization of flexible proteins using small-angle X-ray scattering. J Am Chem Soc 129:5656-5664

Bertini I, Giachetti A, Luchinat C, Parigi G, Petoukhov MV, Pierattelli R, Ravera E and Svergun DI (2010) Conformational space of flexible biological macromolecules from average data. J Am Chem Soc 132:13553-13558

Bertini I, Luchinat C and Parigi G (2002) Magnetic susceptibility in paramagnetic NMR. Progr NMR Spectrosc 40:249-273

Kleywegt GJ (1997) Validation of protein models from Ca coordinates alone. J Mol Biol 273:371-376

Svergun DI, Barberato C and Koch MHJ (1995) CRYSOL - a program to evaluate X-ray solution scattering of biological macromolecules from atomic coordinates. J Appl Crystallogr 28:768-773

Table S1. Parameters for the MO calculation with the MaxOcc portal (see also Figure 2 in the main text)

Parameter / Description / Default value / SymbolRanCh

Sequence filename / Aminoacid residue sequence in one-letter code

Number of domains / Number of rigid domains in the protein (the missing residues will be replaced by a flexible linker made of Cs only)

PDB filename / Atom coordinates of each rigid domain in PDB format

Total number of structures to generate / Number of structures in the pool generated by RanCh / 50,000

Reference domain / The coordinates of this domain are fixed to those provided in the pdb file / 1

Type of linker

0: native like

1: random coil / The angles in the linker are constrained to belong to a quasi-Ramachandran plot. Selection of native like forces the angles to adopt values found in secondary structures, selection of random coil allows a wider range of angles. / 0

Order of harmonics / Maximum order of harmonics (min = 1, max =50) for the calculation of the SAXS profile. It defines the resolution of the calculated curve. Default value should be sufficient in most of the cases. For large particles high orders could improve the results, but more CPU time is required. Fractional values are not allowed. / 15

Maximum s value / Maximum scattering vector (max = 1.0 Å-1) / 0.35

Number of points / Number of points in the calculated SAXS profiles / 51

CALCPARA

Rdc filename / Filename of the experimental rdc data

Pcs filename / Filename of the experimental pcs data

Pre filename / Filename of the experimental pre data

Tensor filename / Name of the file containing the coordinates of the paramagnetic metal ion and the anisotropy tensors for pcs and rdccalculation

Field / Proton Larmor frequency (in MHz) corresponding to the magnetic field at which the rdc have been measured

Temperature / Temperature at which the rdc data have been recorded

Coordinates of the paramagnetic center / Coordinates of the paramagnetic nucleus for PRE calculation

MaxOcc minimization

Max number of structures in the completing ensemble / Maximum number of structures that will be used to define the completing ensemble. It may range from 10 (faster) to 200 (more accurate). The value may be fine tuned to balance the gain in terms of TF with the length of the calculation. / 50 / N

Number of structures added at each step / Initially, an ensemble is built out of a limited number of structures. At each step (see the following parameter) the program adds the indicated number of structures until the maximum number is achieved / 4 / na

Number of steps in which structures are added / Number of steps that are performed to increase the number of structures in the completing ensemble from the initial value to the maximum value. / 11 / ns

Initial number of ensembles / When the ensembles are small, a large number of ensembles can be calculated, so as to ensure a large variability. Should not be larger than 480. / 400 / M

Ensembles removed at each step / At every step, when the number of structures in the ensembles is increased, some ensembles are removed to keep the computational time at a bay / 11 / ms

Starting temperature / Highest temperature for the Simulated Annealing (SA) / 0.07 / tstart

Target mutation rate / Maximum fraction of structures that are substituted by SA / 0.10 / mr

Weight minimization / Not effective (for the moment only minimization with conjugate gradients is supported). / 1

Maximum number of conjugate gradients iterations / Number of conjugate gradients iterations that are performed to minimize the TF. / 400

Conjugate gradients tolerance / The conjugate gradient minimization ends when this tolerance is reached. / 0.00001

Simulated Annealing tolerance / Minimum decrease expected in the TF at each temperature. If such decrease is not reached, then an informed version of SA is performed. / 0.01 / tol

Lower bound for minimization / The minimization stops if a TF smaller than this value is found / 0.20

Maximum number of iterations / Maximum number of random moves for each temperature step. The total number of moves throughout the whole calculation is given by this number times the number of steps during which the temperature decreases in the SA, times the number of steps needed to achieve the maximum number of structures in the ensemble. / 2000 / mi

Simulated Annealing temperature steps / Number of steps in which the temperature in the SA is decreased of a factor 0.9 from the starting value. / 10 / mt

Weight MMC criterium for Simulated Annealing / Throughout the calculation, the structures in the completing ensemble are changed with a probability that depends on this number and on their weight after the conjugate gradient minimization. The values of this number (as well as of the three numbers below) are chosen so that the higher minimization efficiency is ensured for TF ranging between 0.2 and 1. / 0.005

Weight MMC criterium for initial replacement / See above. / 0.02

Weight MMC criterium when adding structures / See above. / 0.0005

Weight MMC criterium for heuristic steepest descent / See above. / 0.0000005

Iteration between restart writings / Not effective in the web-portal version. / 500

PCS multiplier / The weight of the different classes of restraints must be adjusted so that their final contribution to the TF is balanced, taking also into account their accuracy and precision. The overall TF should not be smaller than 0.2. / 1.0 / apcs

RDC multiplier / See above. / 1.0 / ardc

PRE multiplier / See above. / 1.0 / apre

SAXS multiplier / See above. / 0.1 / aSAXS

Multiplier of the harmonic restraint for weights normalization / The multiplier for the harmonic constraint on the weights of the conformations in the ensemble (their sum should be 1) is recommended to be chosen so that the maximum allowed discrepancy is of the order of a few points per thousand. It can be increased but it slows down the calculation. / 1000 / aweight

SAXSfile / Filename of the experimental SAXS data

Number of blind points / Number of points in the SAXS profile that are missing because obscured by the beam stopper

Constant for transverse paramagnetic relaxation enhancement / Constant between paramagnetic relaxation rates and distances() / k

Figure S1. Flow-chart of the MaxOcc program

Python script for generating the figure 8 from the “mo.val” and “strutture” files, obtained as output files of the program MaxOcc, and the PDB file of the domain of the protein kept fixed (“structure.pdb”) . The output of the script is a file called “tensors.pdb” that can be read in pymol and color coded typing “spectrum b” in the command line.

#!/usr/bin/env python

fromScientific.Geometry.Quaternion import Quaternion

fromScientific.Geometry import Transformation

fromBio.PDB.PDBParser import PDBParser

from Bio.PDB import PDBIO, Select

import math

importshutil

file1 = open('mo.val','r')

molist = file1.readlines()

file1.close()

file1 = open('strutture','r')

strlist = file1.readlines()

file1.close()

parser = PDBParser()

structure = parser.get_structure('unknownstruct','structure.pdb')

xm=0.

ym=0.

zm=0.

nat=0

maxres=0

for r in structure.get_residues():

ifint(r.id[1]) > maxres:

maxres=int(r.id[1])

for a in structure.get_atoms():

a.bfactor=0.

if a.id != 'H':

nat+=1

xm+=a.coord[0]

ym+=a.coord[1]

zm+=a.coord[2]

xm =xm/float(nat)

ym =ym/float(nat)

zm =zm/float(nat)

objwriter = PDBIO()

objwriter.set_structure(structure)

objwriter.save('tensors.pdb')

file6 = open('tensors.pdb','a')

file7 = open('mo.inf','w')

fori in range(6,len(molist)):

mo = float(molist[i].split()[1])

stru = int(molist[i].split('.')[1].split(' ')[0])

linestr = strlist[stru-1][0:62]

file7.write(linestr)

file7.write(' ')

file7.write(str(mo))

file7.write('\n')

quat=[]

for k in range(0,4):

sv = float(strlist[stru-1].split()[k])

quat.append(sv)

q = Quaternion.normalized(Quaternion(quat))

a0x=float(strlist[stru-1].split()[4])+xm

a0y=float(strlist[stru-1].split()[5])+ym

a0z=float(strlist[stru-1].split()[6])+zm

axx=Quaternion.asRotation(q).tensor.array[0][0]+a0x

axy=Quaternion.asRotation(q).tensor.array[0][1]+a0y

axz=Quaternion.asRotation(q).tensor.array[0][2]+a0z

ayx=Quaternion.asRotation(q).tensor.array[1][0]+a0x

ayy=Quaternion.asRotation(q).tensor.array[1][1]+a0y

ayz=Quaternion.asRotation(q).tensor.array[1][2]+a0z

azx=Quaternion.asRotation(q).tensor.array[2][0]+a0x

azy=Quaternion.asRotation(q).tensor.array[2][1]+a0y

azz=Quaternion.asRotation(q).tensor.array[2][2]+a0z

nres = maxres+int(i)-6

file6.write('MODEL %s\n' %nres)

file6.write('ATOM 1 N NH3 %3i %8.3f%8.3f%8.3f 1.00%6.2f\n' %(nres,a0x,a0y,a0z,mo))

file6.write('ATOM 2 H NH3 %3i %8.3f%8.3f%8.3f 1.00%6.2f\n' %(nres,axx,axy,axz,mo))

file6.write('ATOM 3 H NH3 %3i %8.3f%8.3f%8.3f 1.00%6.2f\n' %(nres,ayx,ayy,ayz,mo))

file6.write('ATOM 4 H NH3 %3i %8.3f%8.3f%8.3f 1.00%6.2f\n' %(nres,azx,azy,azz,mo))

file6.write('ENDMDL\n')

file6.close()

file7.close()