I often need to display or record output of floating-point values, in a human-readable decimal format. This is easy enough in most programming languages… unless you need the representation to be one-to-one, so that distinct finite values always produce distinct output. More precisely, we would like to be able to (1) convert a floating-point value to a string, then (2) convert the string back to floating-point, recovering exactly the original value.
For example, I use MATLAB more than I care to, and despite its primary application being numeric scientific computing, none of its built-in output formats support this. The simplest way that I know of to do this is actually via MATLAB’s interface to Java:
>> java.lang.Double.toString(1.1 + 2.2 - 3.3)
In C++, you can set the precision of an output stream to a desired number of decimal digits. In C++11, the constant max_digits10 is defined just for this purpose:
#include <sstream> // ostringstream
#include <iomanip> // setprecision
#include <limits>  // numeric_limits
#include <string>  // string

template<typename T>
std::string to_string(T x)
{
    std::ostringstream os;
    os << std::setprecision(std::numeric_limits<T>::max_digits10) << x;
    return os.str();
}
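As a quick sanity check of the round-trip property, we can format a value with max_digits10 significant digits, parse it back, and compare for exact equality. The roundtrips helper below is my own illustration, not part of the original post:

```cpp
#include <iomanip>  // setprecision
#include <limits>   // numeric_limits
#include <sstream>  // ostringstream
#include <string>   // stod

// Illustrative helper (not from the original post): format x using
// max_digits10 significant digits, parse the string back with std::stod,
// and report whether we recovered exactly the original value.
inline bool roundtrips(double x)
{
    std::ostringstream os;
    os << std::setprecision(std::numeric_limits<double>::max_digits10) << x;
    return std::stod(os.str()) == x;
}
```

For example, roundtrips(1.1 + 2.2 - 3.3) holds with 17 digits, whereas the stream default of 6 significant digits would format that value (roughly 4.44e-16) as a string that parses back to a different double.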
Unfortunately, I learned the hard way that Visual Studio 2010 gets this wrong, at least for single-precision float, defining max_digits10 in that case to be 8 instead of 9 as it should be.
For compatibility with older compilers (either with bugs like VS 2010, or simply pre-C++11), the good news is that we can compute the required number of digits as a function of the number of significant bits b in the underlying floating-point representation (which is in turn defined in the standard header <limits> as numeric_limits<T>::digits):

max_digits10 = ⌈b log10(2)⌉ + 1

where b = 24 for float and b = 53 for double, yielding 9 and 17, respectively. (Since I can’t seem to write a post without including an obligatory puzzle: why is the +1 needed in the above equation?)
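Plugging in the two standard precisions, the ceiling formula ⌈b log10(2)⌉ + 1 can be evaluated directly; a short sketch (the helper name is mine):

```cpp
#include <cmath>  // ceil, log10

// Evaluate ceil(b * log10(2)) + 1: the number of decimal digits needed
// for a round-trip, given b significant bits of binary precision.
int required_digits(int b)
{
    return static_cast<int>(std::ceil(b * std::log10(2.0))) + 1;
}
```

With b = 24 (float) this yields 9, and with b = 53 (double) it yields 17, matching a conforming max_digits10.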
Interestingly, the following alternative formula also works, with the added benefit that it can be evaluated at compile time:

max_digits10 = 2 + std::numeric_limits<T>::digits * 3010 / 10000;

(Note that integer division effectively replaces the ceiling with a floor, so the constant offset becomes 2 rather than 1.) I’m not sure why this particular approximation of log10(2) is used, since 301/1000 works equally well; both are correct for any number of bits less than 196.
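As a compile-time sanity check (the static_asserts below are my own addition), the integer approximation, with its constant offset of 2 replacing the ceiling-plus-one, reproduces the expected digit counts on a conforming C++11 compiler:

```cpp
#include <limits>

// Compile-time check (my addition): digits * 3010 / 10000, plus the
// constant offset of 2, matches max_digits10 for both precisions.
static_assert(2 + std::numeric_limits<float>::digits * 3010 / 10000
                  == std::numeric_limits<float>::max_digits10,
              "float: 2 + 24 * 3010 / 10000 == 9");
static_assert(2 + std::numeric_limits<double>::digits * 3010 / 10000
                  == std::numeric_limits<double>::max_digits10,
              "double: 2 + 53 * 3010 / 10000 == 17");
```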
(Edit: The last time I wrote about floating-point issues, I mentioned Bruce Dawson’s Random ASCII blog, in particular his excellent series of articles about floating point. I should have done so here as well; the specific relevant article is here. That article explains the “+1” in the formula above, and how often it is actually needed, by brute-force enumeration. My motivation for this post, in the form of the “puzzle,” was instead to suggest a pencil-and-paper derivation of that formula, which would in turn yield a direct calculation of how often that +1 is needed.)