Structuring a PDB Entry

A PDB Entry may contain 0 or more PART – PRTEND sub-components.

Either a PDB Entry or a PART may contain 0 or more STRUCT – STREND sub-components.

Either a PDB Entry or a PART – PRTEND sub-component may contain 0 or more MODEL – ENDMDL sub-components.

Either a PDB Entry, a PART – PRTEND or a STRUCT – STREND may contain 2 or more ALTPRT – ALTEND sub-components.
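
The containment rules above can be sketched as a small nesting check. This is only an illustration of the rules as stated here, not part of the proposal: the function name check_nesting and the top-level label "ENTRY" are hypothetical, and the "2 or more ALTPRT" count rule is not checked.

```python
# Allowed parent -> child nesting, taken from the rules stated above.
# "ENTRY" is a hypothetical label for the top level of the file.
ALLOWED_CHILDREN = {
    "ENTRY": {"PART", "STRUCT", "MODEL", "ALTPRT"},
    "PART": {"STRUCT", "MODEL", "ALTPRT"},
    "STRUCT": {"ALTPRT"},
    "MODEL": set(),
    "ALTPRT": set(),
}
OPENERS = {"PART", "STRUCT", "MODEL", "ALTPRT"}
CLOSERS = {"PRTEND": "PART", "STREND": "STRUCT",
           "ENDMDL": "MODEL", "ALTEND": "ALTPRT"}


def check_nesting(lines):
    """Return a list of (line_number, message) nesting problems."""
    problems = []
    stack = ["ENTRY"]
    for number, line in enumerate(lines, start=1):
        name = line[:6].strip()  # record name, columns 1-6
        if name in OPENERS:
            if name not in ALLOWED_CHILDREN[stack[-1]]:
                problems.append((number, f"{name} not allowed inside {stack[-1]}"))
            stack.append(name)
        elif name in CLOSERS:
            if stack[-1] == CLOSERS[name]:
                stack.pop()
            else:
                problems.append((number, f"{name} does not close {stack[-1]}"))
    if len(stack) > 1:
        problems.append((len(lines), f"unclosed {stack[-1]}"))
    return problems
```

An empty result means every block opened and closed in an allowed position; any other record type is ignored by the check.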

Note: MODEL format does not change

COLUMNS      DATA TYPE      FIELD      DEFINITION
---------------------------------------------------------
 1 -  6      Record name    "MODEL "
11 - 14      Integer        serial     Model serial number.

ENDMDL

 1 -  6      Record name    "ENDMDL"

The PDB formats for PART, STRUCT and ALTPRT are:

COLUMNS      DATA TYPE      FIELD      DEFINITION
---------------------------------------------------------
 1 -  6      Record name    "PART  "
11 - 14      Integer        serial     Part serial number.
21 - 80      String(60)                Description/Title (Optional)

PRTEND

 1 -  6      Record name    "PRTEND"

COLUMNS      DATA TYPE      FIELD      DEFINITION
---------------------------------------------------------
 1 -  6      Record name    "STRUCT"
11 - 14      Integer        serial     Struct serial number.
21 - 80      String(60)                Description/Title (Optional)

STREND

 1 -  6      Record name    "STREND"

COLUMNS      DATA TYPE      FIELD      DEFINITION
---------------------------------------------------------
 1 -  6      Record name    "ALTPRT"
11           Character      ALT-ATOM code for set of atoms to be used
                            in annotation records such as REMARK 500
21 - 80      String(60)                Description/Title (Optional)

ALTEND

 1 -  6      Record name    "ALTEND"
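
As an illustration of the column layouts above, a PART, STRUCT or ALTPRT record could be read as follows. This is a minimal sketch: the function name parse_block_header and the returned dictionary keys are hypothetical, and the column positions are taken directly from the tables.

```python
def parse_block_header(line):
    """Parse a PART, STRUCT or ALTPRT record using the column layout above."""
    line = line.rstrip("\n").ljust(80)   # pad so every column slice exists
    name = line[0:6].strip()             # columns 1-6, record name
    if name in ("PART", "STRUCT"):
        ident = int(line[10:14])         # columns 11-14, integer serial
    elif name == "ALTPRT":
        ident = line[10]                 # column 11, single character code
    else:
        raise ValueError(f"not a block header record: {name!r}")
    title = line[20:80].strip() or None  # columns 21-80, optional title
    return {"record": name, "id": ident, "title": title}
```

The title is returned as None when columns 21 - 80 are blank, reflecting that the description is optional.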

PART – PRTEND is fundamentally a sub-division of a PDB entry, for use where the structure determined in a single experiment contains more chains than the PDB format can uniquely identify and/or more than 99,999 atoms. By convention a PART – PRTEND section should contain a logical sub-division of the structure and should try not to duplicate Entities (as expressed by unique DBREF database pointers) across PARTs. In some cases this cannot be avoided, for example in complex polymer structures that do not have crystallographic symmetry, where the division into PARTs may be best reflected in a hetero-unit, as in muscle and tubulin studies.

Currently the obvious candidates for PART/PRTEND are the ribosome structures, where the experimental coordinates (ATOM records) are distributed over more than one PDB entry. For example, the 70S ribosome from Thermus thermophilus has a structure containing 42 proteins and 3 ribosomal RNAs (rRNA). The 70S ribosome comprises two subunits: a large 50S subunit and a small 30S subunit. The 50S subunit contains a 23S and a 5S rRNA plus over 30 proteins. The 30S subunit contains a 16S rRNA plus 20 proteins.

A PDB entry for this structure could be divided by PART as:
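
The two size criteria that call for PART (too many atoms, too many chains) can be written as a simple test. This is a sketch only: the 99,999 atom limit comes from the text, while the set of usable chain identifiers (one character from A-Z, a-z and 0-9) is an assumption that this document does not state.

```python
import string

MAX_ATOM_SERIAL = 99_999  # atom limit quoted in the text

# Assumption, not stated in this document: a chain ID is a single
# character drawn from A-Z, a-z and 0-9.
CHAIN_ID_ALPHABET = (string.ascii_uppercase
                     + string.ascii_lowercase
                     + string.digits)


def needs_parts(n_atoms, n_chains, max_chains=len(CHAIN_ID_ALPHABET)):
    """True if an entry exceeds the atom or chain limits and needs PARTs."""
    return n_atoms > MAX_ATOM_SERIAL or n_chains > max_chains
```

A structure with 45 chains but more than 99,999 atoms therefore still qualifies, because either condition alone is sufficient.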

PART 1 Ribosome 50S Subunit (including 23S and a 5S rRNA)

ATOM 1 records

TER

CONECT

PRTEND

PART 2 Ribosome 30S Subunit (including 16S rRNA)

ATOM 1 records

TER

CONECT

PRTEND

MASTER (QUERY: how many MASTER records are required?)

END

This division allows some validation to be included in the PART sub-sections: the REMARK 500 series of records and the SITE records (including REMARK 800) can then express close contacts between the protein and RNA, as well as the binding interactions.

The major problem is that REMARK 500 validation annotation between atoms in different PARTs cannot be represented in the PDB entry, nor can SITE annotations indicate binding interactions between protein chains distributed across PARTs. During deposition these can be calculated and delivered to the depositor as required, and they can readily be displayed in the mmCIF and XML formats, but they will be missing from the PDB format. For example, for this structure the PDB entry cannot annotate that the ribosome binds 3 tRNAs, each in a distinctive binding site made from structural elements contributed by both the 50S subunit and the 30S subunit.

A PART/PRTEND section will include all of the PDB record types that reference specific structural elements of the molecule. These records include COMPND, SOURCE, SEQRES, HELIX, SHEET, TURN, SSBOND, MODRES, HET, HETNAM, FORMUL, ATOM, HETATM, TER, and REMARKs 300, 350, 400, 450, 470, 500, 900, and 999. CONECT records are unique to each PART, and atom serial numbers start at 1 for each PART. These records will be repeated in each PART/PRTEND section, thereby allowing the reuse of atom, residue, and chain nomenclature. PDB records that do not define or reference specific elements of molecular structure will appear once at the beginning of the multipart PDB file (e.g. HEADER, TITLE, KEYWDS, EXPDTA, REVDAT, OBSLTE, SPRSDE, AUTHOR, JRNL, CRYST1, ORIGX#, SCALE#, and REMARKs 1, 2, 3, 4, 100, 105, 200, 280, and 290).
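
The split between records shared by the whole file and records repeated per PART could be handled along these lines. This is a sketch under assumptions: split_parts and first_atom_serials are hypothetical helper names, the PART serial is read from columns 11 - 14 as defined earlier, and the atom serial is read from columns 7 - 11 of ATOM/HETATM records as in the existing PDB format.

```python
def split_parts(lines):
    """Separate records outside any PART from the records of each PART.

    Returns (shared, parts), where parts maps PART serial -> list of lines.
    """
    shared, parts, current = [], {}, None
    for line in lines:
        name = line[:6].strip()
        if name == "PART":
            current = int(line[10:14])   # columns 11-14, PART serial
            parts[current] = []
        elif name == "PRTEND":
            current = None
        elif current is None:
            shared.append(line)          # e.g. HEADER, TITLE, END
        else:
            parts[current].append(line)
    return shared, parts


def first_atom_serials(parts):
    """Map each PART serial to the serial of its first ATOM/HETATM record."""
    firsts = {}
    for serial, records in parts.items():
        for line in records:
            if line[:6].strip() in ("ATOM", "HETATM"):
                firsts[serial] = int(line[6:11])  # columns 7-11
                break
    return firsts
```

Checking that every value returned by first_atom_serials is 1 is one way to confirm that atom numbering restarts in each PART.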

If the structure requires symmetry to build a biological unit (REMARK 300, 350), the same BIOLOGICAL UNIT number may be used in more than one PART. To build a given BIOLOGICAL UNIT, the chains listed in each relevant REMARK 350, together with the BIOMT symmetry operations given there, must be accumulated across every PART that references that number.

The PART / PRTEND records will be required for the larger structures expected in future, including structures studied by various EM methods that may have fitted coordinates in the near future. An example is the nuclear pore complex. Molecular trafficking between the nucleus and the cytoplasm of interphase cells occurs via the nuclear pore complexes (NPCs), large supramolecular assemblies that are embedded in the double-membraned nuclear envelope (NE). The NPC has a molecular mass of ~125 MDa in vertebrates and contains about 50 or more different proteins (Nakielny S, Dreyfuss G. (1999) Transport of proteins and RNAs in and out of the nucleus. Cell 99, 677-690; Lyman SK, Gerace L. (2001) Nuclear pore complexes: dynamics in unexpected places. J. Cell Biol. 154, 17-20; Adam SA. (2001) The nuclear pore complex. Genome Biology 2, 0007.1-0007.6). The structure of the NPC has been extensively investigated by electron microscopy (EM), and a consensus model of its central framework has emerged (see below).

Accordingly, the vertebrate NPC exhibits an 8-fold symmetric (i.e., about an axis perpendicular to the plane of the NE) tripartite architecture with a total mass of ~125 MDa. Its ~55 MDa central framework is a ring-like assembly built of eight multi-domain spokes, each consisting of two roughly identical halves, so that its asymmetric unit (i.e., one half-spoke) represents one 16th of its mass, or roughly the size of a ribosome. This central framework is sandwiched between a ~32 MDa cytoplasmic ring and a ~21 MDa nuclear ring. From the cytoplasmic ring eight short, kinky fibrils emanate, whereas the nuclear ring anchors a basket (or fishtrap) assembled from eight thin, ~50 nm long filaments joined distally by a 30- to 50-nm-diameter ring. The ring-like, 822-symmetric central framework embraces the central pore of the NPC, which acts as a gated channel. The yeast NPC has been well characterized and is thought to contain an upper limit of 30 distinct types of proteins (termed nucleoporins, or nups). This number is low and rather surprising considering that the ribosome, another macromolecular protein complex, contains 75 different types of proteins and weighs only 4 MDa. The body of the vertebrate NPC is approximately 145 nm in diameter and 80 nm in length across the nuclear envelope. The yeast NPC is smaller, at approximately 96 nm in diameter and 35 nm in length.

Using PART, a logical division could be:
PART 1 Cytoplasmic fibrils
PART 2 Cytoplasmic ring
PART 3 Nuclear ring
PART 4 Nuclear basket
PART 5 Distal ring

MODEL / ENDMDL

The MODEL record specifies the model serial number when multiple structures are presented in a single coordinate entry, as is often the case with structures determined by NMR. The chemical connectivity should be the same for each model. ATOM, HETATM, SIGATM, SIGUIJ, ANISOU, and TER records for each model structure are interspersed as needed between MODEL and ENDMDL records. MODELs represent sets of ATOMs that have identical chemical composition, where each MODEL is a valid solution to the experimental data used in the structure determination AND where the ensemble of MODELs has a distinct meaning when viewed together and in some manner represents the dynamics of the molecule under the experimental conditions. Currently MODELs can be used for experiments carried out in solution, such as NMR techniques and combinations of X-ray solution scattering with NMR and/or theoretical modelling work.

A MODEL cannot contain either a STRUCT/STREND or an ALTPRT/ALTEND set.

A MODEL may be nested inside a PART/PRTEND.

Conventionally, PDB entries containing MODELs carry no per-MODEL statistics in REMARK 3. Although the MODELs have different minimisation energies and other refinement characteristics, best-practice methods do not consider the differences significant enough to record. In the REMARK 500 annotation/validation records and other such records, the MODEL number is used. With MODEL entries the HELIX/SHEET/SITE records refer only to MODEL number 1.

The PDB now requires that the 1st MODEL be the representative one, i.e. the model that the author states is the best.

STRUCT / STREND

STRUCT is similar to MODEL; however, the contents of each STRUCT / STREND are not necessarily chemically identical. Each STRUCT / STREND contains a set of ATOM records that satisfies the experimental data, e.g. in X-ray methods the deposited structure factors. STRUCT is used in cases where (1) each STRUCT is an independent solution but the authors cannot determine which solution is the correct one (for X-ray experiments the majority of ATOMs have an occupancy of 1.00 or have ALTPRTs that add up to an occupancy of 1.00); or (2) multiple refinement methods were used, where each STRUCT corresponds to a separate refinement method and/or set of conditions. STRUCTs may not have meaning as an ensemble (unlike MODEL, where the ensemble of MODELs carries significant scientific meaning).

STRUCT requires a new REMARK 3 to give details per STRUCT since, for example, the R-factor is different for each STRUCT set of ATOM records. In the REMARK 500 annotation/validation records and other such records, the STRUCT number is used in the same manner as the MODEL number. As with MODEL, the HELIX/SHEET/SITE records refer only to STRUCT number 1.

A PDB entry containing STRUCT / STREND may not contain MODEL / ENDMDL.

Examples of STRUCT / STREND are (1) PDB ID 2D6B.

Figure 1

This structure of lysozyme C was determined to 1.25 Angstroms by Ondracek and Mesters and was refined using HipHop and SHELXL. Each model was fitted one at a time into the electron density map. There are 10 models, each with occupancy 1.00. It is not yet published but was deposited and released in November 2005 (Figure 1).

ALTPRT – ALTEND (QUERY: is the ID the altLoc or an integer?)

ALTPRT is used if more than 50% of a single chain in a protein structure is in alternate conformations; the structure will then be treated as ALTPRT A, ALTPRT B, etc. To satisfy the experimental data such as structure factors, all ALTPRTs contained in a STRUCT must be used. However, for viewing, the 1st ALTPRT is suitable. The HELIX/SHEET/SITE records all refer to the 1st ALTPRT. The ALTPRT identifier is a character ALT-ATOM code that is used in other annotation records, BUT note that no REMARK 500 records should contain any contacts between ALTPRTs, as these are not physical contacts but an artefact of the crystallography, usually a result of statistical disorder. The identifier is the altLoc character (alternate location indicator) that would otherwise have been on the ATOM record, for linking with annotation records.

QUERIES

HELIX, SHEET and TURN do not contain an altLoc term, nor do they contain a MODEL number.

SSBOND, CISPEP and SITE do not contain the altLoc term. In different ALTPRTs, is it possible that one may have the CYS-CYS bond while a second does not?

LINK does contain the altLoc term; however, is this repeated for all LINKs in all ALTPRTs?

These can only refer to the 1st MODEL, STRUCT or ALTPRT, as they currently have no model ID:

REMARK 650 HELIX

REMARK 700 SHEET

REMARK 750 TURN

REMARK 800 SITE

For all STRUCT, MODEL and ALTPRT entries, what do we do with these records, which currently have neither a model ID nor an altLoc?

REMARK 101 RESIDUE G A 4 HAS CH3 BONDED TO O6

REMARK 102 BASES A B NN AND X Y ZZ ARE MISPAIRED

REMARK 103 AB I X N AND AB Z X NN

REMARK 295 NON-CRYSTALLOGRAPHIC SYMMETRY

REMARK 375 HOH 301 LIES ON A SPECIAL POSITION

REMARK 470 M RES CSSEQI ATOMS

These have neither a model ID nor an altLoc:

REMARK 500 ATM1 RES C SSEQI ATM2 RES C SSEQI SSYMOP DISTANCE

REMARK 500 CB LEU D 68 - CE LYS E 76 1656 2.10

REMARK 500 ATM1 RES C SSEQI ATM2 RES C SSEQI DISTANCE

REMARK 500 O HOH 761 - O ARG 17 1.89

These do have the model ID but no altLoc ID:

REMARK 500 MODEL OMEGA

REMARK 500 VAL A 123 GLN A 124 0 221.48

REMARK 500 M RES CSSEQI

REMARK 500 0 GLU 1 ALPHA-CARBON

REMARK 500 M RES CSSEQI ATM1 ATM2 ATM3

REMARK 500 0 ASP 3 C-1 - N - CA ANGL. DEV. = 21.7 DEGREES

REMARK 500 M RES CSSEQI PSI PHI

REMARK 500 0 VAL 26 -174.85 -134.80

REMARK 525 M RES CSSEQI

REMARK 525 0 HOH 561 DISTANCE = 5.07 ANGSTROMS

General: If a nucleic acid chain of 10 or more residues is in alternate conformations, and the protein chain is not in alternate conformations, then the structure will be treated as ALTPRT A, ALTPRT B, etc. It should be noted that the 10-residue limit for nucleotides is based on the length of the nucleotide chain used in the experiment (as reported in the sequence) and not on the content of the coordinates. Ten nucleotides was chosen as the limit because one turn of the helix is composed of 10 nucleotides. If the entry contains nucleic acid only, and more than 50% of a single chain is in alternate conformations, then the entry will be treated as ALTPRT A, B, C.
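
The thresholds for choosing ALTPRT (the 50% rule for protein and nucleic-acid-only entries, and the 10-residue rule for nucleic acid chains alongside protein) can be sketched as a decision function. This is one possible reading, not a definitive rule: the function name use_altprt and the dictionary keys are hypothetical, fractions are computed against the sequence length, and "is in alternate conformations" is interpreted as "has at least one residue with alternates", which the text does not spell out.

```python
def use_altprt(chains):
    """Decide whether an entry should be split into ALTPRTs.

    chains: list of dicts with keys
      'kind'         - 'protein' or 'nucleic'
      'seq_length'   - chain length as reported in the sequence (> 0)
      'alt_residues' - number of residues in alternate conformations
    """
    proteins = [c for c in chains if c["kind"] == "protein"]
    nucleics = [c for c in chains if c["kind"] == "nucleic"]

    def fraction(chain):
        return chain["alt_residues"] / chain["seq_length"]

    # Protein rule: more than 50% of a single chain has alternates.
    if any(fraction(c) > 0.5 for c in proteins):
        return True
    # Nucleic-acid-only entries follow the same 50% rule.
    if not proteins:
        return any(fraction(c) > 0.5 for c in nucleics)
    # Mixed entries: no protein chain has alternates (assumed reading),
    # but a nucleic acid chain of 10 or more residues does.
    if all(c["alt_residues"] == 0 for c in proteins):
        return any(c["seq_length"] >= 10 and c["alt_residues"] > 0
                   for c in nucleics)
    return False
```

Note that the 10-residue test uses seq_length rather than the number of residues present in the coordinates, following the text.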

It is generally understood that alternate conformations in the coordinates represent alternate positions of residues, side chains etc., possibly resulting from multiple positions occupied by residues in a crystal. In such cases the use of the alternate ID alone is appropriate. Cases where a full chain, or a large proportion of a macromolecular chain, is in multiple conformations are likely to represent multiple populations of asymmetric units found in the crystal. Representing these conformations as different models in the structure may therefore be useful for the general scientific community. Additionally, within the framework of the PDB, using only alternate IDs to represent these cases results in all alternate positions of each atom being listed consecutively, which makes the resulting file rather challenging to read. The PDB has concluded that the best way to represent such structures, within the limitations imposed by the PDB format, is to split the entry into multiple ALTPRTs. In this format, each model represents one population of the structure containing one conformation. The advantage of this method to the non-expert user is the ability to pick one or more populations for their analysis and research. This will make the structure more intuitive, easier to use and easier to parse by secondary structure databases.

Figure 2

An example where a molecule is represented as multiple conformations, and the author submitted an average ensemble, is PDB ID 1HTQ. This structure of glutamine synthetase was determined to 2.4 Angstroms by Gill, Pfluegl, and Eisenberg. It was refined using X-PLOR and published in Biochemistry in 2002. This structure contains 10 models, each with 0.10 occupancy. There are 24 chains in the asymmetric unit and 12 chains in the biological unit (Figure 2).

For entries where the models are fit as a group into the electron density map: The chain(s) that is/are in alternate conformations is/are assigned the same chain ID for each model. The coordinates for the chain(s) that is/are not in alternate conformations are copied into each model. The occupancies of these chains are changed to match the occupancies of the chains that are in alternate conformations.
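
The copying and occupancy adjustment described above might look like this in outline. This is a sketch under assumptions: build_models is a hypothetical name, atoms are represented as plain dictionaries with an 'occupancy' key, every conformation is assumed to be non-empty, and all chains in alternate conformations within one model are assumed to share the same occupancy.

```python
from copy import deepcopy


def build_models(fixed_chains, alt_chains):
    """Build one model per conformation.

    fixed_chains: {chain_id: [atom, ...]} for chains without alternates
    alt_chains:   {chain_id: [[atoms of conformation 1], [atoms of 2], ...]}
    Each atom is a dict with at least an 'occupancy' key.
    """
    n_models = max(len(confs) for confs in alt_chains.values())
    models = []
    for k in range(n_models):
        model = {}
        occupancy = None
        # The alternate chains keep the same chain ID in every model.
        for chain_id, confs in alt_chains.items():
            if k < len(confs):
                model[chain_id] = deepcopy(confs[k])
                occupancy = confs[k][0]["occupancy"]
        # Chains without alternates are copied into each model, with
        # their occupancy changed to match the alternate chains.
        for chain_id, atoms in fixed_chains.items():
            copied = deepcopy(atoms)
            for atom in copied:
                atom["occupancy"] = occupancy
            model[chain_id] = copied
        models.append(model)
    return models
```

The input data are left unmodified; each model receives its own copies of the atoms.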

The residues in ALTPRT A are NOT assigned altLoc A in their ATOM records; rather, the ALTPRT is a global assignment for those ATOMs that have partial occupancy. Where not all the atoms are disordered in the same manner, so that an ALTPRT contains a mixture of ordered and disordered atoms, and the ordered atoms have individual multiple conformations (such as side chains), the ATOM records do carry an altLoc. These codes are assigned from the next available character, A to Z, after the last ALTPRT identifier. Sometimes the waters and chemical groups are already assigned to an ALTPRT by the author, and those assignments are kept. However, if these groups are not already assigned, then the waters and chemical groups are duplicated between ALTPRTs. The occupancies of the waters and chemical groups will be changed to match the occupancies of the chains in alternate conformations. (QUERY: this may not be valid. If waters are at full occupancy, is it true that they belong to both conformations? I think not.) The model with the maximum number of chains and chemical groups should be listed first as ALTPRT A when possible.
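
The assignment of residue-level altLoc codes after the ALTPRT identifiers can be illustrated as follows. This is a sketch only: next_altloc_codes is a hypothetical name, and it assumes the ALTPRT identifiers are upper-case letters.

```python
import string


def next_altloc_codes(altprt_ids, count):
    """Return `count` altLoc codes following the last ALTPRT identifier.

    altprt_ids: identifiers already used by ALTPRT records, e.g. ["A", "B"]
    """
    letters = string.ascii_uppercase
    start = letters.index(max(altprt_ids)) + 1  # first letter still free
    codes = list(letters[start:start + count])
    if len(codes) < count:
        raise ValueError("not enough characters left between A and Z")
    return codes
```

For an entry with ALTPRT A and ALTPRT B, a side chain with two individual conformations would therefore use altLoc C and D.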

Identification of multiple model structures: A new REMARK 450 will be added to the header of the PDB file to indicate that a structure uses STRUCT and/or ALTPRT: “This entry contains multiple models because a significant portion of the structure is present in more than one conformation”. A specific cif token will be created to indicate this situation, making it easily identifiable in searches, such as multiple_model, Y/N, as well as a details token. (QUERY: we already have a REMARK 450!)

REMARK 450, Source

Further details on the biological source of the macromolecular contents of the entry

Example. An entry where multiple conformations exist for almost all peptide chains but not for the protein chains.

Figure 3 PDB entry 1ZY8. This structure of dihydrolipoyl dehydrogenase was determined to 2.59 Angstroms by Ciszak, Makal, Hong, Vettaikkorumakankauv, Korotchkina and Patel. It was refined using CNS and published in J. Biol. Chem. in 2005. In this structure, the protein chains A, B, C, D, E, F, G, H, I and J are not in alternate conformations. The peptides K, L, M, and N are in alternate conformations. Peptide chain O does not have an alternate conformation. One biological unit consists of two protein chains and one peptide chain. However, the biological unit involving chains I, J, and O is different from the other biological units because chain O does not have alternate conformations. The entry was processed as two models. The first model contains all protein chains at half occupancy and the first conformation of peptide chains K, L, M, N. Model 1 also contains peptide chain O. The second model contains all protein chains at half occupancy and the second conformation of peptide chains K, L, M, N, but does not contain O.