Scoring functions are fast approximate mathematical methods used to predict the strength of the
interaction (or binding affinity) between two or more molecules. Four aspects should be
considered when assessing the reliability of a scoring function [27]:
(1) scoring power: the ability to produce scores which linearly correlate with experimental
binding affinity data,
(2) ranking power: the ability to correctly rank a given set of ligands that bind to a common
target protein by their binding affinities when their binding poses are known,
(3) docking power: the ability to identify the native binding pose of a ligand as the one with the
best score among a large set of generated decoy poses, and
(4) screening power: the ability to identify the true binders to a given target protein among a
library of random molecules.
Ideally, an accurate scoring function would perform equally well on all four tasks; however,
each existing scoring function performs well on only one or two of them at the same time.
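The four criteria above are usually quantified with simple statistics. The following is a minimal sketch, assuming that scoring power is measured with the Pearson correlation coefficient, ranking power with the Spearman rank correlation, docking power with a success rate at an RMSD cutoff (2.0 Å is an assumed, commonly used value), and screening power with an enrichment factor. All function names and numerical values are illustrative and are not taken from a specific benchmark.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def docking_success_rate(top_pose_rmsds, cutoff=2.0):
    """Fraction of complexes whose best-scored pose is within `cutoff` Å of the native pose."""
    return sum(1 for r in top_pose_rmsds if r <= cutoff) / len(top_pose_rmsds)


def enrichment_factor(scores, is_active, fraction=0.1):
    """Hit rate in the top-scored fraction divided by the hit rate in the whole library.

    Assumes that a higher score means a better predicted binder.
    """
    n = len(scores)
    n_top = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    hits_top = sum(1 for i in order[:n_top] if is_active[i])
    return (hits_top / n_top) / (sum(is_active) / n)


# Made-up predicted and experimental affinities (e.g. pKd units) for five ligands
predicted = [5.1, 6.3, 7.0, 8.2, 9.1]
experimental = [4.8, 6.9, 6.5, 8.0, 9.5]

scoring_power = pearson(predicted, experimental)
ranking_power = spearman(predicted, experimental)
docking_power = docking_success_rate([0.8, 1.5, 3.2, 1.9])
screening_power = enrichment_factor(
    scores=[9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
    is_active=[True, False, False, True, False, False, False, False, False, False],
)
```

Because the four statistics measure different things, a function that maximises one of them need not do well on the others, which is consistent with the observation above.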
Scoring functions can be grouped into four main classes: physics-based, empirical,
knowledge-based and machine learning-based scoring functions [28]. The first three types are commonly
referred to as ‘classical’ scoring functions and are based on the assumptions that the change in
free energy upon binding of a ligand to its target can be decomposed into a sum of individual
energy contributions, and that all these energy contributions are linearly combined. In reality,
such a linear relationship may not always hold [29].
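The additivity assumption can be illustrated with a minimal sketch of a linear scoring function. The descriptor names, their values, and the weights below are hypothetical and are not taken from any published scoring function; the point is only that each term contributes independently, with no cross terms.

```python
def linear_score(terms, weights, intercept=0.0):
    """Weighted sum of individual energy-like contributions for one complex."""
    return intercept + sum(weights[name] * value for name, value in terms.items())


# Hypothetical descriptor values for one protein-ligand complex
terms = {"vdw": -30.0, "hbond": 4.0, "hydrophobic": 12.0, "rotatable_bonds": 5.0}

# Hypothetical weights, of the kind an empirical function would fit by regression
weights = {"vdw": 0.1, "hbond": -0.5, "hydrophobic": -0.05, "rotatable_bonds": 0.2}

score = linear_score(terms, weights)
```

In such a model the effect of, say, one additional hydrogen bond is the same regardless of the other terms, which is exactly the assumption that may fail for real complexes.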
Two major limitations of classical scoring functions are their minimal description of protein
flexibility and the implicit treatment of solvent. Machine learning-based scoring functions
instead use more sophisticated techniques, such as random forests (RF), support vector machines
(SVM), and deep learning (DL), to approximate non-linear problems (Fig. 3).
Physics-based, or force-field-based, scoring functions compute the binding energy by summing the
contributions of the bonded interactions (bond stretching, angle bending and torsional rotation) and
non-bonded interactions (van der Waals and electrostatic interactions) within the protein-ligand
complex. This sum accounts for the enthalpic contribution to the binding energy. Hydrogen bonds are usually
considered by adding an additional term to the binding energy. Alternatively, they can be
included implicitly in the electrostatic energy term.
Parameters for this type of scoring function are usually derived from both experimental data and
ab initio quantum mechanical calculations. The major challenge for physics-based scoring
functions is the treatment of the solvent in ligand binding. To overcome this limitation, implicit
solvent approaches like Poisson–Boltzmann (PB) or Generalised-Born (GB) continuum solvation
models have been widely used [30]. However, more computationally expensive approaches that
treat water molecules explicitly are also available (such as free energy perturbation (FEP) and
thermodynamic integration (TI) techniques) [31] (Eq. 1).
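As an illustration of the non-bonded part of a physics-based scoring function, the sketch below sums a Lennard-Jones 12-6 term and a Coulomb term over all protein-ligand atom pairs. It is a simplified example under stated assumptions: the atom dictionaries, the Lorentz-Berthelot combining rules, and the constant dielectric are choices made here for illustration, bonded terms and solvation are omitted, and the parameter values are hypothetical rather than taken from a real force field.

```python
import math

# Coulomb constant in kcal·Å/(mol·e^2), for charges in units of e and distances in Å
COULOMB_CONSTANT = 332.0637


def lennard_jones(r, epsilon, sigma):
    """Lennard-Jones 12-6 energy for a pair at distance r (van der Waals term)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb(r, q1, q2, dielectric=1.0):
    """Coulomb energy between two point charges in a uniform dielectric."""
    return COULOMB_CONSTANT * q1 * q2 / (dielectric * r)


def nonbonded_energy(protein_atoms, ligand_atoms, dielectric=1.0):
    """Sum of van der Waals and electrostatic energies over all protein-ligand pairs."""
    total = 0.0
    for p in protein_atoms:
        for l in ligand_atoms:
            r = math.dist(p["xyz"], l["xyz"])
            # Lorentz-Berthelot combining rules (an assumption of this sketch)
            epsilon = math.sqrt(p["epsilon"] * l["epsilon"])
            sigma = 0.5 * (p["sigma"] + l["sigma"])
            total += lennard_jones(r, epsilon, sigma)
            total += coulomb(r, p["charge"], l["charge"], dielectric)
    return total


# Hypothetical two-atom system: one protein atom and one ligand atom
protein = [{"xyz": (0.0, 0.0, 0.0), "charge": -0.5, "epsilon": 0.2, "sigma": 3.0}]
ligand = [{"xyz": (3.5, 0.0, 0.0), "charge": 0.3, "epsilon": 0.1, "sigma": 3.4}]
energy = nonbonded_energy(protein, ligand)
```

The double loop scales with the product of the atom counts, which is one reason why docking programs typically precompute such terms on a grid.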
Protein preparation
Once the 3D structure of the protein target has been obtained (either downloaded from the PDB
or generated using protein structure prediction methods), there are several protein and ligand preparation
steps that should be followed before starting a docking run (Fig. 5). Here we will discuss the
protein preparation procedure while the procedure for ligand preparation will be covered later in
the chapter (see Section 2.3.2). Due to insufficient resolution, most of the entries in the PDB only
contain coordinates of non-hydrogen atoms. To work with these entries, the most common
protein preparation task is the placement of the missing hydrogen atoms. This is not trivial, as it
should account for the important ambiguities of protein structures, such as rotatable hydrogens,
tautomers and protonation states of particular amino acids, alternative water orientations, and
terminal side chain flips. In addition, during protein preparation it is important to ensure that
missing side chains are added, missing bonds are detected and fixed, bond orders are assigned,
and, where alternate locations are present, the atoms with the highest occupancies are selected. Other,
more complex, procedures in protein preparation include prediction of protonation states and
identification of which water molecules (if any) should be retained in the protein target structure.
The following subsections will look at the methods used to predict the protonation/tautomer
states and at how to identify structural water molecules. Such waters are known to be vital in
mediating hydrogen-bonding interactions, and in some cases are key to tight binding, and hence
should be considered part of the protein target structure.
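Two of the simpler checks described above, resolving alternate locations and detecting whether hydrogens are present, can be sketched directly on PDB-format text. The example below assumes the standard fixed-column ATOM/HETATM layout of the PDB format; the function names and the small `atom_line` helper are hypothetical and exist only to make the example self-contained. Selecting per atom, as done here, can mix conformers within one residue, so dedicated preparation tools usually make this choice more carefully.

```python
def atom_line(serial, name, altloc, resname, chain, resseq, xyz, occupancy, element):
    """Format one fixed-column PDB ATOM record (helper for this example only)."""
    x, y, z = xyz
    return (
        f"ATOM  {serial:>5} {name:<4}{altloc:1}{resname:>3} {chain:1}{resseq:>4}"
        + " " * 4
        + f"{x:8.3f}{y:8.3f}{z:8.3f}{occupancy:6.2f}{0.0:6.2f}"
        + " " * 10
        + f"{element:>2}"
    )


def select_highest_occupancy(pdb_lines):
    """Keep one record per atom: the alternate location with the highest occupancy."""
    best = {}
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue
        # Key: chain ID, residue number + insertion code, atom name
        key = (line[21], line[22:27], line[12:16])
        occupancy = float(line[54:60])
        if key not in best or occupancy > float(best[key][54:60]):
            best[key] = line
    return list(best.values())


def count_hydrogens(pdb_lines):
    """Count atoms whose element column (columns 77-78) is hydrogen or deuterium."""
    return sum(
        1
        for line in pdb_lines
        if line.startswith(("ATOM", "HETATM")) and line[76:78].strip() in ("H", "D")
    )


# Made-up serine residue whose side-chain oxygen has two alternate locations
lines = [
    atom_line(1, " N  ", " ", "SER", "A", 10, (1.0, 2.0, 3.0), 1.00, "N"),
    atom_line(2, " OG ", "A", "SER", "A", 10, (2.0, 3.0, 4.0), 0.35, "O"),
    atom_line(3, " OG ", "B", "SER", "A", 10, (2.5, 3.1, 4.2), 0.65, "O"),
]

selected = select_highest_occupancy(lines)
needs_hydrogens = count_hydrogens(selected) == 0
```

Here the `B` conformer of the side-chain oxygen is retained, and the absence of any hydrogen records indicates that hydrogens still have to be placed.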
2.3.2 Ligand preparation
Ligand preparation consists of generation, optimisation, and validation of the ligand's 3D structure. 3D
structures of ligands can be obtained experimentally, for example from protein-ligand co-crystal
complexes, or they can be generated using software able to convert 1D and 2D structures (e.g.
SMILES, SMARTS, InChI) into 3D molecular structures (Fig. 9). The 3D structure of the ligand
must have realistic bond lengths and angles as these will not usually change during docking.
Optimisation of the starting ligand geometry is sometimes required for particularly complex
molecules. Several programs exist to generate and optimise the 3D structure of a ligand (e.g.
CSD Conformer Generator [126,127], Omega [128,129], Confab [130], Confect [131], RDKit
[132]). They differ in the algorithm used; some systems use force fields to infer intramolecular
geometries, whereas others, including the CSD Conformer Generator, rely directly on crystal
structure data derived from the Cambridge Structural Database (CSD) [133] to produce realistic
ensembles of high-probability ligand structures. As for the protein, hydrogens and formal charges
must be added to the 3D structure of the ligand. The protonation state should be set according to
the physiological pH or the pH of the simulation, and tautomeric states of the ligand should also
be defined. In some cases, it may be worth generating multiple possible protonation states for a
given ligand, with a view to docking all forms (in particular, if the pKa of a given proton
dissociation is close in numerical value to the physiological pH).
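The relationship between pKa, pH, and the populated protonation states follows from the Henderson-Hasselbalch equation. The sketch below is a minimal illustration for a single ionisable site; the default pH of 7.4 and the window of one pH unit used to decide when both states should be docked are assumptions made here, not values prescribed by the text.

```python
def fraction_protonated(pka, ph=7.4):
    """Fraction of a single ionisable site that is protonated at the given pH.

    Follows from the Henderson-Hasselbalch equation: pH = pKa + log10([A-]/[HA]).
    """
    return 1.0 / (1.0 + 10.0 ** (ph - pka))


def states_to_dock(pka, ph=7.4, window=1.0):
    """Protonation states of one site worth docking (hypothetical decision rule)."""
    if abs(ph - pka) <= window:
        # pKa close to pH: both forms are significantly populated
        return ["protonated", "deprotonated"]
    return ["protonated"] if ph < pka else ["deprotonated"]


# Made-up pKa values for three kinds of site
for label, pka in [("carboxylic acid", 4.0), ("imidazole", 7.0), ("aliphatic amine", 10.5)]:
    print(label, round(fraction_protonated(pka), 3), states_to_dock(pka))
```

When the pKa equals the pH the two forms are equally populated, and one pH unit away the minor form still accounts for roughly 9% of the population, which is why docking both forms can be worthwhile in that range.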
Tautomers are isomers differing only in the positions of hydrogen atoms and electrons; therefore,
even a simple molecule can have several different tautomeric forms. Moreover, the protonation
and deprotonation of the ionisable sites in the molecule produce additional forms called
protomers. Tautomers and protomers differ in shape, functional groups, surface, and hydrogen
bonding. Therefore, tautomerism and protonation may result in alternative binding modes that
can affect the efficiency of docking and virtual screening [64,84].
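The number of forms to consider grows quickly with the number of ionisable sites. The toy sketch below enumerates protomers as combinations of per-site states; the site labels and states are hypothetical, and real enumeration tools operate on the molecular graph and also account for tautomers, which this example does not attempt.

```python
from itertools import product


def enumerate_protomers(sites):
    """Return every combination of per-site protonation states.

    `sites` maps a site label to a tuple of allowed states for that site.
    """
    labels = list(sites)
    return [dict(zip(labels, combo)) for combo in product(*(sites[k] for k in labels))]


# Hypothetical ligand with two ionisable sites
sites = {
    "carboxylic_acid": ("COOH", "COO-"),
    "amine": ("NH2", "NH3+"),
}
protomers = enumerate_protomers(sites)  # 2 sites with 2 states each -> 4 protomers
```

With n independent two-state sites there are 2^n candidate protomers, so in practice the list is usually pruned to the forms that are significantly populated at the pH of interest before docking.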