DNA Structure Design
DNA Structure Design
Introduction
Natural and designed nucleic acids serve a number of roles in vitro and in cells.
In nature, DNA is largely an information carrier, while RNA is an information
carrier, a catalyst,1 an agent for recognition of complementary sequences, and an
aptamer for metabolite binding.2−4 New binding and catalytic roles can be evolved
in vitro for RNA and DNA using systematic evolution of ligands by exponential
enrichment (SELEX).5 RNA and DNA are also convenient scaffolds for molecular
arrays6,7 and molecular machines.8 Almost all of these roles for nucleic acids
involve folding into secondary and tertiary structures beyond simple duplexes.
One approach is to guarantee that the sequence folds with minimum (most negative)
free energy change from the random coil to give the target structure.11 This
ensures that the most populated structure at equilibrium will be the target
structure, although it does not guarantee that the probability of the desired
structure will be high in the Boltzmann ensemble.
Another approach is to minimize the normalized ensemble defect (NED), the average
probability that a nucleotide will be in the wrong conformation in the ensemble.12
This improves upon assessing solely the minimum free energy structure in that
optimizing the NED also ensures that the helix and loop components of the desired
structure will also occur with high probability.13 It can conveniently be
calculated from a partition function expression14 for base pairing probabilities:
The normalized ensemble defect quantifies the extent to which the desired
components of the structure dominate the Boltzmann ensemble. It can be difficult to
achieve a low ensemble defect, because many competing structures are often possible
for a sequence. The desired structure and its close analogs must have a
substantially lower folding free energy change than all of the competing structures
in order to fold with high probability.
Figure 1.
Figure 1
To use P and Z in inverse folding designs, the thermodynamics of folding from the
random coil need to be known in order to evaluate the design quality. Prior work
established that P and Z prefer pairing with each other above all single
mismatches, except that G–Z pairs are roughly as stable as A–T pairs (Figure
1).16,17 A subsequent study fit the helical nearest neighbor thermodynamic
parameters for P–Z as part of the eight base “Hachimoji” system.15 In this work, we
determined nearest neighbor parameters for a full P–Z extended alphabet, including
fitting stacks with G–Z wobbles, better defining the P–Z stack parameter values,
and extrapolating loop parameters to include P and Z bases. The RNAstructure
software package was previously extended to accommodate folding alphabets of any
size, and therefore it is able to handle designs using the six base alphabet.18
We demonstrate that in silico designs are improved by using the additional P–Z base
pair. We chose to use the Eterna 100 benchmark structures19 and the design program
from RNAstructure,20 which implements a version of the NUPACK algorithm for NED
minimization.12 94 of 100 benchmark structures could be designed by RNAstructure
using both standard DNA and the extended P–Z base pairs. The sequences with P–Z
outperformed the sequences with standard DNA, as quantitated by showing a
significantly lower NED. Both types of designs had significantly lower NED than the
standard solutions provided by Eterna.
Results
For optimal designs, it is essential to have as complete a set of nearest neighbor
thermodynamic parameters as possible to estimate folding stability from the random
coil. Previous work showed that G–Z pairs stabilize helix formation16 and secondary
structure designs have loop regions that connect helices. Therefore, we extended
the existing data on P and Z thermodynamics to loops and also fit additional
nearest neighbor parameters.
Stacking Nearest Neighbor Parameters for P–Z and G–Z Base Pairs
To fit nearest neighbor stacking parameters for P–Z pairs, we used the existing
data set of optical melting data for duplexes with P–Z pairs15,16 and an additional
13 optical melting experiments from this work.21−24 We refined the data analysis
procedure for optical melting experiments to simultaneously fit absorbance vs
temperature curves at multiple concentrations in order to minimize the error with
respect to all available experimental data (see Supporting Information Figure S1
and Methods section for details). Table S1 provides the optical melting stabilities
determined for the 13 all-helix duplexes studied in this work.
We examined the residuals for the fit when the P–Z end penalty was included and
observed a large residual of 2.22 kcal/mol for duplex (ZGCATGCP)2. This single
experimental value drove much of the magnitude of the P–Z penalty, and the residual
for (ZGCATGCP)2 was the largest across all duplexes used in the fit. Therefore, we
removed this sequence from the fit as an outlier and repeated the regression. In
this fit, absent the outlier, the terminal P–Z parameter is smaller in magnitude
than the standard error of the regression (0.03 ± 0.19). Given this result, we do
not include a terminal P–Z penalty in the fit of the stacking nearest neighbor
parameters. The additional experimental data available in this study informed this
decision, which differs from previous work that did include the terminal P–Z
penalty.15
The P–Z stack nearest neighbor parameters for the free energy change at 37 °C are
provided in Table 1, where they are compared to G–C nearest neighbor parameters. On
average, a P–Z substitution for a G–C pair is stabilizing by −0.2 ± 0.4 kcal/mol.
This varies across the stacks; there are two cases where P–Z substitutions of G–C
pairs were unfavorable. Therefore, the full nearest neighbor treatment of P–Z pairs
is required to fit the set of oligonucleotide thermodynamic data; a simple
increment for P–Z substitution of G–C pairs would not suffice.
Table 1. P–Z Stack Nearest Neighbor Parameters, as Compared to Analogous G–C Stacks
(Replacing P–Z with G–C)a.
parameter ΔG°37 (kcal/mol) parameter ΔG°37 (kcal/mol) ΔΔG°37 (kcal/mol) per P–
Z substitution
AZ –1.44 ± 0.17 AC –1.4 0.0
TP TG
Figure 2.
Figure 2
Our prior work studying the stability of P–Z pairs and mismatches with P or Z
demonstrated that G–Z pairs are more stabilizing that A–T pairs, suggesting that
they should be included as a wobble pair and included in secondary structure
prediction, like G–U wobbles in RNA.27 Here we call these interactions G–Z pairs.
In many applications designed to use P–Z pairs, G–Z pairs would be mismatches
(mispairs) that occur as off-target structures. Other applications might choose to
use these pairs as part of a design strategy, and in these uses, the G–Z pairs
would be desired pairs. Overall, given the relative stability of G–Z pairs relative
to A–T pairs, we choose to call them pairs.
Figure 1 shows two possible G–Z pairs. The first is a two hydrogen-bond wobble. The
G–Z pair could also be a deprotonated three hydrogen-bond pair, as observed at
higher pH; the thermodynamics alone at pH 7 cannot distinguish structures.17,28 To
fit the 15 nearest neighbor parameters for G–Z stacking on A–T, G–C, P–Z, or G–Z
pairs, we performed an additional 11 melts of duplexes containing G–Z pairs and
combined this with 4 prior experiments.16Table S4 provides the results of these
optical melting experiments.
Table 2 provides the G–Z stack nearest neighbor parameters for the free energy
change at 37 °C, which were fit by linear regression (see Methods section) using
fixed values of stacks with P–Z pairs (Table 1). Table 2 also compares the G–Z pair
stacks to the A–T pair stacks. The number of sequences studied by optical melting
is equal to the number of G–Z stack parameters, and therefore the error estimates,
which are the standard errors of the regression, are underestimates. Table S5
provides the residuals from the fit of the parameters.
Table 2. G–Z Stack Nearest Neighbor Parameters, as Compared to Analogous A–T Stacks
(Replacing G–Z with A–T)a.
parameter ΔG°37 (kcal/mol) parameter ΔG°37 (kcal/mol) ΔΔG°37 (kcal/mol) per G–
Z substitution
GZ 1.70 ± 0.48 AT –0.9 1.3
ZG TA
Table S6 shows the stacking enthalpy and entropy changes for stacks with P–Z pairs.
The error estimates are larger for the enthalpy parameters than for the free energy
change parameters, as quantified by the fraction of the parameter value. This is
expected; in fits to experimental data, the enthalpy and entropy changes are highly
correlated and therefore free energy changes are determined with more precision.26
Table S7A provides the duplexes, the experimentally determined enthalpy changes,
the fit enthalpy changes attributed to the P–Z, the fit values for comparison to
the experiment, and the residuals. Table S7B provides the equivalent information
for the fit of the entropy changes. Given the larger average uncertainty in
enthalpy and entropy changes from optical melting experiments and the lack of
redundancy in experiments for G–Z pair stacks, we chose to not fit enthalpy or
entropy change nearest neighbor parameters for G–Z pair stacks.
Loop Stabilities
In addition to the helical stack parameters, the full set of nearest neighbor
parameters required for secondary structure prediction includes parameters for
estimating loop stabilities. To extrapolate a set of loop stabilities, we performed
optical melting experiments on 32 duplexes that include dangling ends (Table S8),
terminal mismatches (Table S9), single mismatches (from Wang et al., 2017; Table
S10),16 and tandem mismatches (Table S10). Twenty-three new optical experiments for
loop systems were performed for this study; the other 9 had been reported
previously.16 These four types of loops were chosen because sensitivity analyses
for RNA secondary structure prediction identified these parameters as important for
the precision of secondary structure prediction.30,31
Dangling ends (single, unpaired nucleotides at the end of a helix) stabilize helix
formation because the dangling nucleobase stacks on the terminal base pair.32,33
For B-form DNA helices, the 5′ dangling ends stabilize more than the 3′ dangling
ends. First, we studied the stability of 5′ dangling nucleotides on a P–Z pair
(Table S11A and Figure 3A). On average, for dangling A, C, G, and T, the stability
of the dangling end is 0.58 ± 0.07 kcal/mol less stable on a P–Z terminal pair than
on a C–G terminal pair.
Figure 3.
Figure 3
We also studied the stability provided by P and Z dangling ends (Table S11B), which
we found to be highly sequence dependent. Figure 3B shows the stability of the P
and Z dangling ends adjacent to Watson–Crick–Franklin pairs. A 5′ dangling Z
adjacent to an A–T pair (−0.8 kcal/mol) is substantially more stable than either a
C or T in the same context (−0.2 or −0.3 kcal/mol, respectively). In other cases,
such as the 5′ dangling P on a C-G pair (−0.37 kcal/mol), the stability is less
than analogous A or G dangling bases (−0.9 or −0.7 kcal/mol, respectively). 3′
dangling P destabilized helix formation in the two contexts studied.
Terminal mismatches are single noncanonical pairs at the end of helices. These also
stabilize helix formation because of stacking and hydrogen bonding. We studied the
stability of P–P and Z–Z terminal mismatches adjacent to G–C and T–A terminal base
pairs (Figure 4 and Table S12). The P–P mismatch next to a G–C pair is less
stabilizing than the analogous purine mismatches G–G or A–A by 0.56 kcal/mol. The
Z–Z mismatch next to a G–C pair, however, is similar in stability (at −0.76
kcal/mol) to the analogous pyrimidine mismatches C–C and T–T (with mean stability
increment of −0.75 kcal/mol). The P–P mismatch next to a T–A pair is also similar
in stability (−0.51 kcal/mol) to the analogous G–G and A–A mismatches (with mean
stability increment of −0.5 kcal/mol). This contrasts to the Z–Z mismatch next to a
T–A pair, which is more stabilizing than the analogous pyrimidine mismatches. The
Z–Z mismatch next to T–A is −0.66 kcal/mol more stable (at −0.91 kcal/mol) than the
mean of C–C and T–T next to T–A (at −0.25 kcal/mol). The general increased overall
stability of dangling Z bases and terminal Z–Z mismatches is consistent with
improved stacking of Z due to the nitro group.
Figure 4.
Figure 4
Figure 5.
Figure 5
Figure 6.
Figure 6
We also observed this for control experiments with G–T and C–A mismatches. The
existing database of optical melting experiments of model systems with natural DNA
base mismatches focuses on mismatches with three canonical base pairs exterior to
the duplex.35−39 Therefore, we developed single mismatch parameters for mismatches
with P or Z by using the two total experiments with P–C and P–T mismatches in the
center of the duplex.
Figure 7.
Figure 7
To find sequences that fold to a specified secondary structure, we used the Design
program in RNAstructure.20 The default parameters were used with one exception: we
allowed isolated base pairs, i.e., base pairs that do not stack on adjacent base
pairs in a helix, because the Eterna 100 structures contain isolated base pairs.
Each structure was attempted five times (using the computer clock to seed the
random number generator), and a maximum of 5 days for each of wall time was allowed
using a single processor core. RNAstructure is capable of multithreading across CPU
cores using OpenMP, but we used single threading to simplify the interpretation of
the benchmarks. We considered success for a given target to be finding at least one
sequence design that completed in the allowed 5 days.
Design was able to design sequences for 94 of the Eterna 100 structures using
canonical DNA bases only. When P–Z pairs were added to the design, 99 structures
were successfully designed. When P and Z are used, we select sequences that include
P–Z pairs in the target structure, but we do not choose P or Z to occur in loops.
We used P and Z only for pairs to take advantage of their intrinsic pairing
stability, although designs could have been chosen to use P and Z elsewhere. P–Z
pairs were selected at a rate of 20%, G–C pairs are selected at a rate of 50%, and
A–T pairs are selected with the remaining 30% (see Methods section).
We then analyzed the results for the lowest NED for each alphabet of nucleotides.
We focused on the lowest NED across calculations because Design uses stochastic
refinement (by which random changes and those that improve the NED are kept and
others are rejected), and the NED of the final sequence therefore varies across
calculations for the same structure.20Figure 8 shows the lowest NED designed
including P–Z pairs versus the lowest NED designed using only canonical nucleotides
for the 94 test structures for which Design was able to solve both problems. The
NED of sequences using P–Z pairs are lower in 89 cases, leaving only 5 cases where
using P–Z is worse than canonical DNA nucleotides alone, and all of these 5 cases
are in the lower left corner, where the NED using both alphabets is excellent (NED
< 0.05).
Figure 8.
Figure 8
A one-sided paired t test indicated that the improvement in NED when incorporating
P–Z pairs is significant (P = 2.7 × 10–13). The average NED for the designs
including P–Z is 0.036, and the average for NED for designs with canonical DNA is
0.073 for these 94 test cases solved with both alphabets. Figure S2 shows the
analogous plot of the mean NED (rather than the minimum) using P–Z versus the mean
NED using only canonical pairs. The same trend for mean is observed as with lowest
NED, and the improvement in mean performance is again significant (P = 1.0 × 10–
13). We provide all the designed sequences in Supporting Information Data set 1,
including the NED, the time used, and the random number seeds (to enable
reproducibility).
We observed the most substantial improvement by incorporating P–Z pairs for the
“Iron Cross” test (number 35), which improved the best NED from 0.283 to 0.011.
Figure 9 shows the target structure with the best DNA sequence and the best
sequence for DNA with P–Z. The structures are color-annotated by the probabilities
that the nucleotides are in the desired structure. Like many of the other Eterna
100 test structures, this structure has prominent symmetry with four branches from
a central multibranch loop, where each branch has a three-way multibranch loop with
each helix of three base pairs. The canonical DNA design uses mostly G–C base
pairs, but these can be promiscuous across the intended helices. The design that
incorporates P–Z pairs has 23 P–Z pairs with the remaining 13 pairs as G–C. The
target structure is stabilized by the incorporation of P–Z pairs, and alternative
structures are preventable because there are additional possible combinations of P–
Z and G–C pairs that form helices of three base pairs. Interestingly, P–Z pairs,
although randomly chosen at a rate of 20% of pairs, became enriched above 20%
during stochastic refinement against G–C and A–U pairs. The predominance of P–Z
pairs after refinement in this case shows that P–Z are preferable in designed
sequences that minimize NED.
Figure 9.
Figure 9
We previously noted that there is a trade-off between time cost and NED quality
when using stochastic refinement.20 In these calculations with Design, we used the
default parameters, which limit the number of sequence refinements. In addition to
observing an improvement in the quality of the designs (lower NED) when using P–Z
pairs in the refinement, we also observe a decreased cost in the average
computational time. Figure S3 shows the average time across the five calculations
that incorporate P–Z pairs as a function of the average time across the five
calculations that use only canonical DNA nucleotides for each Eterna 100 structure.
In the calculation of average time, we use 5 days (the maximum allowed time) for
calculations that failed to return a sequence. The use of P–Z nucleotides
significantly improves the time cost (P = 1.7 × 10–3).
99 of the Eterna 100 benchmark structures have Eterna player solutions provided,
with two RNA sequences being provided for each. We compared the NED of our designs,
including P–Z pairs, to the Eterna solutions. Eterna designs were originally
evaluated computationally as to whether the sequence is predicted to fold with the
lowest free energy to the target structure. This can be achieved whether or not the
NED is relatively low.13,44,45 Therefore, we expected that many of the designs
would not provide low NED, despite performing well by minimum free energy folding.
We compared the lowest NED for our solutions using P–Z to the lowest NED using
Eterna player RNA sequences (Figure S4). Our designs, including P–Z, have
significantly lower NED than the Eterna RNA designs (P = 4.28 × 10–19).
We additionally tested including P–Z base pairs in designing DNA with the same
secondary structure as a DNAzyme:substrate complex, a unimolecular version of the
CT10.3.29.M5 sequence bound to its target.43 The catalytic core of the DNAzyme,
which includes several conserved or essential nucleotides, was specified as
unpaired; the current implementation of the Design algorithm does not allow
constraints on sequence. We designed 10 sequences, including P–Z pairs and 10
sequences using only canonical DNA nucleotides. Table S16 reports the results in
terms of the NED and time. The NEDs for the DNAzyme designs were excellent (i.e.,
low) for either alphabet. Using P–Z pairs the lowest NED solution was 0.031, and
using canonical DNA the best NED solution was 0.041. The average time performance
using P–Z pairs was better and more consistent than when only using canonical
nucleotides. The mean time with P–Z pairs was 1594 ± 134 s and the mean time for
canonical nucleotides was 3232 ± 1770 s. The improvements when using P–Z pairs are
significant for NED (P = 1.90 × 10–3) and time (P = 8.46 × 10–3). Within each set
of designs, there was no correlation between the time required and the quality of
the design.
Discussion
In this work, we tested the hypothesis that synthetic base pairs can improve the
quality of DNA nanostructure designs. We used optical melting to study the folding
stability of DNA containing P and Z Hachimoji nucleotides. These data informed a
new nearest neighbor model for folding, including P–Z pairs, G–Z wobble pairs, and
loops containing P or Z. Using the Design program in RNAstructure,20,46 which was
recently enhanced to allow prediction with nucleotide alphabets beyond the
canonical nucleotides,18 we demonstrated significant improvements for in silico
sequence design quality.
Prior work studying RNA nearest neighbor parameters demonstrated that a subset of
parameters needs to be determined with high precision to precisely estimate base
pairing probabilities. That work focused our efforts here on measuring folding
stability for model systems that could inform those important parameters. This
approach was recently validated for the derivation of folding free energy
parameters for RNA including m6A.18 A set of 98 duplexes with or without N6-
methylation, including helices, bulge loops, internal loops, dangling ends, and
terminal mismatches, was studied and this demonstrated that the parameters
extrapolated for m6A were approximately as accurate for estimating folding
stability as those for canonical bases only.47
There are limitations to the accuracy of our parameters. For example, single
mismatch parameters for canonical DNA do not account for the positional dependence
of the mismatch stability.33 This is because the current experimental database of
DNA mismatches is limited to mostly mismatches distant from helix ends and because
current structure prediction packages are not capable of accounting for this
dependence. For single mismatches containing either P or Z, our parameters are
based on the most destabilizing values that are consistent with being away from
helix ends. This is currently a limitation in parametrization for both DNA and
DNA+PZ folding alphabets. Both DNA parameters and DNA+PZ parameters would benefit
from additional experiments to guide the determination of helix-position-dependent
rules.
Figure 10.
Figure 10
There is a notable effect of the position of the Z base. For the P–Z data sets,
stacks with Z at the 5′ side (e.g., XP/YZ, including ZP/PZ and PP/ZZ) have an
average ΔG°37 = −2.34 kcal/mol whereas stacks with Z at the 3′ side (XZ/YP,
including PP/ZZ and PZ/ZP) have ΔG° = −1.78 kcal/mol. For G–Z pairs, the average
ΔG°37 for 5′ Z is −1.19 kcal/mol, and for 3′ Z it is −0.65 kcal/mol: the absolute
difference is very similar to P–Z but the relative difference is much larger.
Examination of X-ray crystal structures (PDB accessions: B-form 4XO0, 6MIG, and
6MIH and A-form 4XNO) does not suggest an obvious reason for this dramatic
preference.48 One might guess is that the potential stacking of the Z-NO2 group
could be particularly stabilizing, but the Z base appears to stack much better on
its 5′ neighbor than on its 3′ neighbor. The 5′ Z is solvent exposed, and it is
surrounded by ordered water in 4XNO.
Our results highlight a case where the use of P–Z pairs is crucial for high quality
(i.e., low NED) designs. Structures with extensive symmetry (such as the “Iron
Cross” example in Figure 9) benefit from the additional space of sequence options
that are available when P–Z pairs are included. In the absence of P–Z pairs, the
three base pair helices, which need the stabilization of G–C pairs, exhaust the
possible sequences, and then alternative structures form when the helix sequences
pair incorrectly. We hypothesize that this important diversification of sequence
effect plays a smaller, but beneficial, role in other sequence designs that are not
as symmetrical as “Iron Cross”.
Figure 8 demonstrates cases when the NED of the best sequence designed including P–
Z pairs is worse than that of the best sequence designed with DNA only. Given
enough search time or enough search attempts, this would not happen because the DNA
designs including P–Z pairs could find the better DNA-only solution. We observe
these cases because Design stochastically refines the sequences and returns
solutions when it has achieved a specified threshold NED or when it has exhausted a
specified number of trials. For this work, we chose to not refine the default
parameters for the search, although it might be possible to improve the best NED by
changing these settings.
As alphabets expand, the number of parameters needed to describe even the simplest
nearest-neighbor models expands combinatorially. Comprehensive description of
mismatch and internal loop thermodynamics is out of reach for low-throughput
measurements such as those described here. Systematic evaluation of which
parameters are most critical,30,31 high throughput measurements of enthalpy and
entropy as well as free energy changes,49 position-dependent rules, and eventually
integration of 3-D motif stability50,51 will be necessary to tackle more demanding
design problems like structures that change their folds upon ligand binding or as
conditions change.
Methods
Oligonucleotide Synthesis, DNA Melting, and Analysis of Melting Curves
P/Z containing DNA oligonucleotides were synthesized as described,16 and unmodified
oligonucleotides were purchased from IDT (Coralville, Iowa). The purity of all
oligonucleotides was evaluated by denaturing polyacrylamide gel electrophoresis
(PAGE), staining with SYBR Gold, and imaging at 312 nm. We note that Z-containing
oligonucleotides quench SYBR fluorescence in the gel; Z has a broad absorption peak
centered at about 380 nm.
Absorbance melting curves were performed as described for the P–Z pairs and loop
energy oligos.16,57 For G–Z melts, we improved the standard procedure in several
ways. We acquired absorbance vs temperature curves for each of the single strands
individually, both to determine their diluted stock concentrations directly and to
use actual data instead of a linearly extrapolated ssDNA trendline. We carried out
simultaneous dsDNA melts at two concentrations, nominally 1 μM and 5 μM total
strand concentration, CT, which enabled rapid identification of pipetting errors
and set constraints on the concentrations of the two individual strands in the
dsDNA melt cuvettes. We used equations for predicting total absorbance that
explicitly consider unequal concentrations of the single strands and the absorbance
of the remaining excess single strand.
We carried out global fits to both of the absorbance vs temperature curves with a
set of six fit parameters: two ssDNA concentrations for each melt (the other two
being determined by the total absorbance), a common % hypochromicity for the
duplex, and a common slope for the dsDNA absorbance vs T, ΔH°, and ΔS°. The fit was
repeated four times to consider each pair of possibilities in each melt for which
the strand is in excess. The fit with the lowest residual that also had minimal
errors in concentrations relative to nominal values was selected. This procedure
allows explicit correction for pipetting errors, which are typically in the range
of 5%, and the global fitting to dsDNA absorbance decreases the total number of
parameters by two for each melt. We observe that typical estimated errors in ΔH°
and ΔS° are reduced from 8 to 10% to about 5% by following this procedure. It could
readily be extended to melts at additional CT, as is done with the alternative
ln(CT) vs 1/Tm procedure for determining ΔH° and ΔS°.24 An example of the output
from our Matlab routines is provided as Figure S1 and the code is available at
GitHub at [Link] We note that other recent work is
available for improved fitting of optical melting data.58
P–Z and G–Z Stacking Free Energy Change Nearest Neighbor Fits
A custom Python program was written to fit the P–Z nearest neighbor parameters. We
used all available data across prior papers (with the exception of (ZGCATGCP)2 as
explained above) to fit these terms.15,16 We expect that optical melting
experiments are well reproduced (within approximately 3% for ΔG°37) across
laboratories and the residuals in Table S3 support this.59 First, the sum of the
Watson–Crick–Franklin stacks, the intermolecular initiation, and the symmetry
correction (if needed for self-complementary duplexes) was subtracted for each
duplex. For compatibility with RNAstructure, we used the parameters that are
provided with the RNAstructure package.46 Then, the nearest neighbor stacks were
fit by linear regression to the remaining stabilities with the statsmodels ordinary
least-squares class (OLS).60 This fixes the stability of the Watson–Crick–Franklin
stacks with their existing values. Similarly, the stacks with G–Z pairs were fit
with a separate custom Python program, fixing the values of the Watson–Crick–
Franklin and P–Z stacks. Error estimates for parameters are the standard error of
the regression.
P–Z and G–Z Stacking Enthalpy and Entropy Change Nearest Neighbor Fits
To fit the enthalpy change nearest neighbor parameters, a custom Python program was
written. The Watson–Crick–Franklin stack enthalpy changes and intermolecular
initiation enthalpy changes were subtracted from the experimentally determined
enthalpies for each duplex. The Watson–Crick–Franklin enthalpy change parameters
were those reported by SantaLucia and Hicks.33 Then the enthalpy change nearest
neighbor parameters were fit by linear regression using the statsmodels ordinary
least-squares class (OLS).60 A second custom Python program was used to fit the G–Z
enthalpy change stacking parameters.
The enthalpy changes of the Watson–Crick–Franklin and P–Z stacking enthalpy changes
and initiation were subtracted from the total experimentally determined enthalpy
changes. The G–Z enthalpy change terms were then fit to this remaining stability by
linear regression. Error estimates for the enthalpy change parameters are the
standard errors of the regression. Following the same methods used to fit the
enthalpy change nearest neighbor parameters, two additional Python programs were
written to fit the entropy change nearest neighbor parameters for stacks with P–Z
pairs and for stacks with G–Z pairs.
For internal loops, the stability is the total stability of the duplex minus the
helical stacks (estimated with nearest neighbor parameters):
For the DNA+PZ alphabet parameters, we removed GT pairs, which were traditionally
included as part of the DNA secondary structures. GT pairs are less stable than AT
pairs and GZ pairs,35 and they are treated as mismatches in the DNA+PZ alphabet.
For 5′ dangling ends on terminal Z–P pairs, we used the experimental values (Table
S11A) for A, C, G, and T. The dangling end parameters for DNA are not very sequence
dependent. The largest differences are for 3′ dangling ends on terminal CG pairs,
which range from −0.4 to −1.1 kcal/mol in stability. For other terminal pairs, the
dangling end stabilities (3′ or 5′) are in the range of 0.4 kcal/mol.
For terminal mismatches, we used our measured values when they were available. For
P–P terminal mismatches that were not measured, we used the mean from other P–P
mismatches (Table S12; −0.5 kcal/mol). Likewise, for Z–Z terminal mismatches not
measured, we used the mean from other Z–Z mismatches (Table S12; −0.8 kcal/mol).
For terminal mismatches on P–Z, Z–G, or G–Z pairs, including mismatches with P or Z
(but not PP or ZZ), we use the mean value of the terminal mismatches on Z–P pairs
(Table S13; −0.2 ± 0.2 kcal/mol). Terminal mismatch parameters are used in
secondary structure prediction for mismatches in exterior loops, hairpin loops,
internal loops (other than 1 × 1, 1 × 2, 2 × 2, or 1 × n, where n > 2), and
multibranch loops. The terminal stack parameters also provide one of the terms
needed to estimate the stability of mismatch-mediated coaxial stacking.61−63
For single mismatches, also known as 1 × 1 internal loops, parameters were informed
by experiments reported in Table S14. We use mismatches in the center of adjacent
canonical helices to be consistent with prior studies of DNA single mismatches. The
single mismatch parameter table in RNAstructure contains all sequences (including
the mismatch and two closing base pairs). When an experimental value has been
measured, we use that value. The following extrapolations were used to determine
parameters involving P or Z: We map P–Z closure to entries for G–C closure and we
map G–Z closure to entries for A–T. In the loops, we map Z to C and P to G. If the
mapping results in a Watson–Crick–Franklin pair, we remap to an A–C mismatch,
preserving the purine-pyrimidine orientation. We then lookup the entry as mapped to
canonical nucleotides and stabilize mismatches with P or Z by −0.6 kcal/mol, which
is the mean additional stabilization observed for the measured mismatches (Table
S14) compared to analogous AC mismatches.
For 2 × 1 internal loops, RNAstructure also uses a table with all sequence entries.
Similar to single mismatches, we map entries with P or Z back to entries with
canonical nucleotides only. For loops that contain either P or Z, we stabilize the
loop by −0.6 kcal/mol, the average stabilization observed per mismatch with P or Z
in single mismatches (Table S14).
Following the nearest neighbor parameters for DNA, the PZ alphabet tables for first
mismatches in hairpin loops, mismatches at helix ends in internal loops (larger
than 2 × 2 internal loops), and for mismatches at helix ends in coaxial stacks are
set equal to the terminal mismatch values.
Design Calculations
The Design program from RNAstructure was used for the calculations.20,46 We
modified the program to allow isolated base pairs, i.e., base pairs that are not
stacked on adjacent base pairs, and this revision will be available in RNAstructure
version 6.5. The other parameters were set to defaults. The software was compiled
by using GCC version 4.8.5. All calculations were run on one core on machines
equipped with two Intel Xeon CPU E52695v4 (2.10 GHz) processors and 124 GB of RAM.
The CPU time was extracted from a Slurm queuing report.
Design calculations were all run by using the thermodynamic parameters that include
P and Z nucleotides. Design randomly selected pairs and unpaired nucleotides were
designed for initial sequences and then subsequently for the refinement of the
sequence. We extended Design to allow user-specified biases for the selection of
pairs and unpaired nucleotides. These biases are used to initialize the sequence
and also to select revised sequences during the iterative refinement stage. For
calculations with only canonical nucleotides, G–C and A–T pairs were selected with
equal probability. For designs that included P–Z pairs, P–Z was selected with a
probability of 20%, G–C with a probability of 50%, and A–T with a probability of
30%. For loop sequences, A, C, G, and T were selected with probabilities of 60%,
10%, 10%, and 20%, respectively. This reflects a bias for A occurring in loops.64 P
and Z were not selected for incorporation in loops.
Statistical Tests
For analysis of the Eterna 100 set designs, we used SciPy 1.7.3 to perform paired t
tests for NED or time.65 Our null hypothesis is that the quality of designs with PZ
pairs is not better (NED or time) than the designs with canonical nucleotides only.
Similarly, for the DNAzyme calculations, our null hypothesis is that the quality of
designs with PZ pairs is not better (NED or time) than the designs with canonical
nucleotides only. We tested this null hypothesis using Welch’s t tests for the NED
or time. For all tests, we used a one-sided test with the type I error rate, α, set
to 0.05.
Acknowledgments
Sequence design calculations were performed using resources provided by the
University of Rochester Center for Integrated Research Computing. We thank our two
anonymous reviewers for providing helpful and detailed feedback.
Glossary
Abbreviations
e.u.
entropy units (cal mol–1 K–1)
NED
normalized ensemble defect.
Supporting Information Available
The Supporting Information is available free of charge at
[Link]
Author Contributions
T.M.P. and M.Z. carried out folding and design calculations. H.S. and D.H.M.
performed regression analysis. T.M., K.K.S., and X.W. performed absorbance melting
experiments. S.H. synthesized, purified, and characterized all of the
oligonucleotides containing nonstandard nucleotides. R.J.P. and S.A.B. consulted on
all aspects of the work, and reviewed the manuscript. J.D.K. wrote analysis
programs and supervised experimental work. D.H.M. supervised the computational
work. T.M.P., J.D.K., and D.H.M. wrote the manuscript with input from all of the
authors.
Author Contributions
# T.M.P., T.M., and H.S. contributed equally to this paper.
This work was supported by NIH grant R35GM145283 to D.H.M. and an SBIR award from
the U.S. Army (W911NF-12-C-0060) to Celadon Laboratories (later DNA Analytics).
This work was also supported by R01GM141391-01A1 to S.A.B.
The authors declare the following competing financial interest(s): DNA Analytics
applied for a U.S. patent, PCT/US23/64288 (Generating Parameters to Predict
Hybridization Strength of Nucleic Acid Sequences), to protect the applicable
optical melt curve fit methods.
Supplementary Material
sb3c00358_si_001.pdf (1MB, pdf)
sb3c00358_si_002.xlsx (115.6KB, xlsx)