Using coronavirus structures in downstream computational projects

Computational modelling and three-dimensional bioinformatics require good data to start from. As in any other scientific endeavour, the rule “garbage in, garbage out” also holds for experimental macromolecular structures. As we have received numerous requests to point out the best experimental coronavirus structures to start from, here is some guidance.

Experimentally determined coronavirus structures come from one of three methods: X-ray crystallography, cryo-electron microscopy (cryo-EM) and solution NMR. (For more information, see https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/methods-for-determining-structure)

‘Representative structure’ is a difficult term. The selection of a good model to start from depends on the question you seek to answer: for a study on the dynamics of a large complex, a 3.5 Å cryo-EM structure may suffice. However, if you want to dock a ligand, that may not be good enough, as side-chain conformations are not reliably visible at this resolution.

Here are some general indicators of structural quality that might be of interest to all downstream users:

Everything there?

You should check if the experimental structure contains all domains relevant to your project, and if it has been mutated in any way. Not all of a given structure can necessarily be found in the coordinates. If the person modelling the experimental data could not identify atomic positions, because, for example, the respective side chain/loop/tail was flexible or disordered, these atoms may be missing from the file, or their occupancies may have been set to 0. Beware: In the latter case, these atomic positions are then meaningless.
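
A first completeness check can be scripted. The following is a minimal sketch, assuming a legacy fixed-column PDB file (not mmCIF); the function name `scan_pdb_completeness` is ours and purely illustrative. It lists atoms whose occupancy is 0 and reports jumps in residue numbering within a chain. A numbering jump is only a hint: it can also reflect the numbering scheme, so compare it with the missing-residue records of the entry (REMARK 465 in PDB-format files).

```python
def scan_pdb_completeness(pdb_text):
    """Return (zero_occupancy_atoms, numbering_gaps) for the first model.

    Assumes legacy PDB fixed-column ATOM records; hypothetical helper,
    not part of any published tool.
    """
    zero_occupancy = []
    residues = {}  # chain ID -> set of residue numbers seen
    for line in pdb_text.splitlines():
        if line.startswith("ENDMDL"):
            break  # only look at the first model of an ensemble
        if not line.startswith("ATOM"):
            continue
        atom_name = line[12:16].strip()
        res_name = line[17:20].strip()
        chain = line[21]
        res_seq = int(line[22:26])
        occupancy = float(line[54:60])
        residues.setdefault(chain, set()).add(res_seq)
        if occupancy == 0.0:
            zero_occupancy.append((chain, res_seq, res_name, atom_name))

    gaps = []
    for chain, numbers in residues.items():
        ordered = sorted(numbers)
        for previous, following in zip(ordered, ordered[1:]):
            if following - previous > 1:
                # residues between `previous` and `following` are not modelled
                gaps.append((chain, previous, following))
    return zero_occupancy, gaps
```

Calling the function on the text of a coordinate file returns two lists; any zero-occupancy atoms it reports should be treated as having no experimentally supported position.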

Crystal structures have one more peculiarity: the structure may not be complete in the coordinate file because the biological assembly had a crystallographic symmetry element in it. Or there may be more than one biological assembly in the file because the so-called asymmetric unit contained more than one copy. The information about the biological assembly is contained in the coordinate file.

See also: https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/biological-assemblies
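
To see which of the two situations applies, the assembly records can be read directly. This is a minimal sketch, assuming a legacy PDB-format file in which assemblies are described in REMARK 350 (mmCIF files store the same information in other categories and are not handled here). The function name `summarize_assemblies` is hypothetical.

```python
def summarize_assemblies(pdb_text):
    """Map each BIOMOLECULE label in REMARK 350 to its chains and the
    number of symmetry operators (counted as BIOMT1 rows)."""
    assemblies = {}
    current = None
    for line in pdb_text.splitlines():
        if not line.startswith("REMARK 350"):
            continue
        body = line[10:].strip()
        if body.startswith("BIOMOLECULE:"):
            current = body.split(":", 1)[1].strip()
            assemblies[current] = {"chains": [], "operators": 0}
        elif current is None:
            continue  # header text before the first BIOMOLECULE record
        elif "CHAINS:" in body:
            # covers "APPLY THE FOLLOWING TO CHAINS:" and "AND CHAINS:"
            names = body.split("CHAINS:", 1)[1].split(",")
            assemblies[current]["chains"].extend(
                name.strip() for name in names if name.strip())
        elif body.startswith("BIOMT1"):
            # each operator is written as three rows BIOMT1..BIOMT3
            assemblies[current]["operators"] += 1
    return assemblies
```

More than one operator for an assembly suggests that symmetry copies must be generated to obtain the complete biological assembly; more than one BIOMOLECULE entry suggests that the file contains several copies.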

In some low-resolution cases, albeit not for any coronavirus structures, only Cα positions have been deposited.
Hydrogen atoms are usually omitted in macromolecular crystallography, and can be added in ideal positions. MolProbity does this automatically during evaluation.

Other molecules?

Is there a ligand or co-factor bound? Are there any other non-water molecules in addition to the macromolecule? Have these been correctly identified? Pay close attention to the physical environment, particularly for monoatomic ions. Considering just the coronavirus structure cohort, the following errors have been found in some models: water misidentified as magnesium; chloride misidentified as zinc; zinc misidentified as a disulphide bond; zinc misidentified as poly(ethylene glycol). It might also be worthwhile to check at which pH and under which conditions the structure has been determined. This information is often (but not always) in the PDB deposition.
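
One way to look at the physical environment of an ion is to list the atoms that surround it. The sketch below assumes a legacy fixed-column PDB file; the function name `ion_environment`, the default list of ion residue names and the 3.0 Å cutoff are our illustrative choices, not recommendations from any validation tool. It does not decide whether an ion is correctly identified; it only gathers the neighbours and distances you would inspect.

```python
import math


def ion_environment(pdb_text,
                    ion_residues=("ZN", "MG", "CL", "NA", "CA", "K"),
                    cutoff=3.0):
    """For each HETATM whose residue name is in `ion_residues`, list
    neighbouring atoms within `cutoff` Å, nearest first."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atoms.append({
                "het": line.startswith("HETATM"),
                "name": line[12:16].strip(),
                "res": line[17:20].strip(),
                "chain": line[21],
                "seq": int(line[22:26]),
                "xyz": (float(line[30:38]),
                        float(line[38:46]),
                        float(line[46:54])),
            })

    report = {}
    for ion in atoms:
        if not (ion["het"] and ion["res"] in ion_residues):
            continue
        neighbours = []
        for other in atoms:
            if other is ion:
                continue
            distance = math.dist(ion["xyz"], other["xyz"])
            if distance <= cutoff:
                neighbours.append(
                    (other["res"], other["seq"], other["name"], distance))
        neighbours.sort(key=lambda item: item[3])
        report[(ion["res"], ion["chain"], ion["seq"])] = neighbours
    return report
```

An ion with no plausible coordinating atoms nearby, or with neighbours of an unexpected chemical type, is a candidate for closer inspection against the density map.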

Molecular geometry

Depending on the available data, some geometric information has been used to build the model. If the structure deviates from these restraints without a good (chemical) reason, this is always suspicious. For crystallographic structures, these are, for example, bond length and angle deviations. Other criteria are more often used for validation, such as Ramachandran outliers, other torsion angles or van der Waals clashes. Beware that if the Ramachandran plot is perfect at resolutions worse than 2.5 Å, it probably means outliers were artificially refined away, so it does not show whether the backbone is good. The MolProbity score and, for lower-resolution structures, the percentage of CaBLAM outliers (preferably below about 2%) give a good first indication. MolProbity outputs are also given as percentiles, relating the quality of the structure to others in the same resolution range. Another useful tool is the output of Whatcheck for crystallographic structures, which includes a few more sanity checks. Choosing a geometrically sound structure can also avoid hot spots in MD calculations and limit RMSD shifts per frame. You can find CaBLAM, Whatcheck and MolProbity outputs in our database.

Resolution: How much experimental information was there to begin with?

Resolution is the most common indicator of data quality. (However, it is not the only one. I would strongly encourage you to learn a bit more about the data quality and experiment that you are using as the foundation for your calculations, and form your opinion on several indicators.) The better the resolution (that is, the lower the number in Å), the more information was contained in the data that were used to solve the structure. In crystallography, the resolution can be directly determined from the experimental data. In cryo-EM, where it is determined by Fourier Shell Correlation of two half data sets, the resolution is locally variable, and the given number is an average over the whole structure. In NMR, usually, no overall resolution is given.

In addition, the uncertainty in the position of an individual atom is indicated by the atomic B factors. High B factors indicate that the position is poorly defined. Small differences in resolution (say, 0.2 Å) are negligible when picking a structure.

As a general guideline:

  • At resolutions < 1.7 Å, individual atomic positions (except hydrogens) can be determined; deviations from ideal bonds and angles are often chemically meaningful. Disorder is visible and can be modelled as alternative conformations.
  • From 1.7 to 2.6 Å, rotamers and conformations will mostly be correct, but much information on ideal bonds and angles has been used and the model may adhere to these values. Disorder can still be modelled, but the occupancy can no longer be refined.
  • From 2.6 to 3.7 Å, the fold is almost always correct, but expect many side chains and the occasional peptide bond to have been modelled wrongly. Many rotamers will be in positions corresponding to the rotamer libraries, i.e. ideal conformations. If there were no higher-resolution homologues available to act as a starting point for modelling, there is a risk that some regions will be “out of register” (that is, with amino acids shifted one or more positions forward or backward along the chain). If the protein is glycosylated, be particularly sceptical about the sugar conformations: it is quite common for them to be modelled “backwards” (that is, flipped ~180° around the asparagine-sugar bond). Essentially, here be dragons: extensive and careful checking should be applied before using a structure of this resolution for docking or dynamics simulations.
  • At > 3.7 Å, individual atomic coordinates are meaningless, but the overall fold can perhaps be determined.

You can find the resolutions for individual coronavirus structures in the table linked below.

See also:
https://proteopedia.org/wiki/index.php/Resolution
https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/resolution
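
The resolution guideline above can be encoded as a simple lookup, for example to annotate a list of candidate structures. This is only a sketch: the function name `resolution_guidance` and the short labels are ours, and how the exact boundary values (1.7, 2.6 and 3.7 Å) are assigned is an assumption, since the guideline itself says that small differences in resolution are negligible.

```python
def resolution_guidance(resolution_angstrom):
    """Return (label, expectation) for a resolution given in Å.

    Boundary values are assigned to the lower-resolution-number bracket
    except 1.7 Å itself; this is an arbitrary choice for illustration.
    """
    if resolution_angstrom <= 0:
        raise ValueError("resolution must be a positive number in Å")
    if resolution_angstrom < 1.7:
        return ("atomic",
                "individual non-hydrogen atom positions determined; "
                "disorder can be modelled as alternative conformations")
    if resolution_angstrom <= 2.6:
        return ("rotamer",
                "rotamers and conformations mostly correct; "
                "geometry relies heavily on ideal values")
    if resolution_angstrom <= 3.7:
        return ("fold",
                "fold almost always correct; check side chains, peptide "
                "bonds, register and glycans before docking or dynamics")
    return ("fold-only",
            "individual atomic coordinates are not meaningful; "
            "the overall fold can perhaps be determined")
```

For example, `resolution_guidance(3.5)` returns the "fold" bracket, which matches the earlier remark that a 3.5 Å structure may suffice for large-scale dynamics but not for ligand docking.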

r.m.s.d.: Precision of NMR structures

For NMR structures, the NOE couplings can be seen as a number of constraints that the macromolecular fold has to adhere to; hence, the model is a number of different structures rather than one. From these, a consensus structure with r.m.s.d. values for the different coordinates is then calculated, but it may be worthwhile to look at the entire ensemble; regions which vary a lot were either highly mobile or there was not enough information for them. In general, a well-defined NMR structure should have a backbone r.m.s.d. < 0.5 Å and a non-hydrogen r.m.s.d. < 1.0 Å, measured over the structured part of the protein. Also important for the quality of an NMR structure is the number of restraints per residue, which should be between 10 and 18.
See also: https://febs.onlinelibrary.wiley.com/doi/10.1111/j.1742-4658.2011.08004.x
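
To find the regions of an ensemble that vary a lot, the spread of each atom around its mean position can be computed. The sketch below assumes that the models have already been superimposed and that every model lists the same atoms in the same order; it performs no superposition itself. The function name `rmsd_to_mean` is hypothetical, and pooling over all models is only one of several conventions for reporting an ensemble r.m.s.d.

```python
import math


def rmsd_to_mean(models):
    """Return (per_atom_rmsd, overall_rmsd) relative to mean coordinates.

    `models` is a list of models; each model is a list of (x, y, z)
    tuples in Å, pre-superimposed and in identical atom order.
    """
    if not models:
        raise ValueError("at least one model is required")
    n_models = len(models)
    n_atoms = len(models[0])
    if any(len(model) != n_atoms for model in models):
        raise ValueError("all models must contain the same atoms")

    per_atom = []
    for i in range(n_atoms):
        # mean position of atom i across the ensemble
        mean = tuple(sum(model[i][axis] for model in models) / n_models
                     for axis in range(3))
        mean_square = sum(math.dist(model[i], mean) ** 2
                          for model in models) / n_models
        per_atom.append(math.sqrt(mean_square))

    # pooled value over all atoms and models
    overall = math.sqrt(sum(value ** 2 for value in per_atom) / n_atoms)
    return per_atom, overall
```

Atoms with a large per-atom value belong to the regions that were either mobile or under-restrained; restricting the input to backbone atoms of the structured part gives a number comparable in spirit to the backbone r.m.s.d. mentioned above.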

Crystallographic R-value: How well does the model fit the data?

The R-value (R is for residual) gives the discrepancy between diffraction data and model, so the lower the better. As there is a model bias in the modelling of crystallographic structures, the R(free) value serves as a semi-independent criterion, and should be higher than the R-value. Typical R-values are about 0.24, or 24%, at 2 to 3 Å resolution. R-values are generally lower when the resolution is better, but there are some caveats: if the crystals were twinned and a twinning model has been used in structure refinement, the R-values may look better, because there were more parameters in the model, and more parameters generally lead to a better fit to the data. Take care when interpreting R-values: a model with poor geometric validation statistics is still a poor model even if the R-values are very low! R-values are also called R factors. You can find R and R(free) in the table linked below.

See also:
https://pdb101.rcsb.org/learn/guide-to-understanding-pdb-data/r-value-and-r-free
https://strucbio.biologie.uni-konstanz.de/ccp4wiki/index.php/R-factors
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4465431/
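
For readers who have not met the definition, the R-value is the sum of the absolute differences between observed and calculated structure-factor amplitudes, divided by the sum of the observed amplitudes. R(free) uses the same formula on a small test set of reflections that was kept out of refinement. The sketch below illustrates this with plain lists; the function names and the optional `scale` parameter are ours, and real refinement programs apply more elaborate scaling.

```python
def r_factor(f_obs, f_calc, scale=1.0):
    """R = sum(| |Fobs| - scale*|Fcalc| |) / sum(|Fobs|)."""
    if len(f_obs) != len(f_calc):
        raise ValueError("f_obs and f_calc must have the same length")
    denominator = sum(abs(fo) for fo in f_obs)
    if denominator == 0:
        raise ValueError("no observed amplitudes to compare against")
    numerator = sum(abs(abs(fo) - scale * abs(fc))
                    for fo, fc in zip(f_obs, f_calc))
    return numerator / denominator


def r_work_and_free(f_obs, f_calc, is_free):
    """Split reflections by the free-set flag and return (R_work, R_free)."""
    work = [(fo, fc) for fo, fc, free in zip(f_obs, f_calc, is_free)
            if not free]
    free = [(fo, fc) for fo, fc, free in zip(f_obs, f_calc, is_free)
            if free]
    r_work = r_factor([fo for fo, _ in work], [fc for _, fc in work])
    r_free = r_factor([fo for fo, _ in free], [fc for _, fc in free])
    return r_work, r_free
```

Because the free reflections were not used to fit the model, R(free) is less affected by over-fitting than R, which is why a large gap between the two deserves attention.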

Experimental data available?

One beauty of the Protein Data Bank is that in recent years, crystallographic and cryo-EM structural models have been deposited alongside the processed data used to generate them. This allows anyone with the tools to compare these models with the density maps and even to improve the models themselves. One such tool is ISOLDE (https://isolde.cimr.cam.ac.uk). This has already been used to check and, where necessary, correct a subset of the current structures (look for folders named “isolde” in the repository). Remember: while automated tools can take you a long way, for greatest peace of mind there is still no substitute for checking the model yourself. At a minimum, you should check (1) each individual site highlighted as an outlier in the MolProbity and/or Whatcheck report and (2) any residues with particular relevance to your project. While some outliers are real (and often functionally relevant), the majority are indicators of error with the potential to cause unreliable results.
It is even better if raw data are available: this is often a sign that the depositor has worked with methods developers and is aware of the problems data processing can pose. You can find this information in the table linked below.

Additional tips

  • If there are several coordinate files of the same macromolecule that are candidates for your project, you can superimpose them (for example with the programs Coot, PyMOL, Chimera or ChimeraX) to get an idea of conformational flexibility.
  • Because modelling methods are still improving every year, it might be prudent to choose a newer model over an older one. PDB-REDO re-refines all PDB entries every week for this very reason, and you can find their results in our database as well. Also note that SARS-CoV-2 structures are being actively assessed and sometimes re-versioned at the wwPDB, so check back occasionally.
  • Check out the PDBe Knowledgebase for all observed ligand binding sites and protein-protein interaction residues for a given protein: https://www.ebi.ac.uk/pdbe/covid-19

If you prefer the following table as a file, go to:
https://github.com/thorn-lab/coronavirus_structural_task_force/blob/master/utils/stats.json
