Federico Mengozzi


Representing real numbers

Given a number $x \in \mathbb{R}$, $x \neq 0$, and a base $B \in \mathbb{N}$, $B \geq 2$, there exists a unique representation of $x$

$x = \pm 0.d_1 d_2 \dots \cdot B^p = \pm(\sum\limits_{i = 1}^\infty d_i B^{-i})B^p = \pm m B^p$

where $\dfrac{1}{B} \leq m \lt 1$ is the mantissa, $p \in \mathbb{Z}$ is the exponent and $0 \leq d_i \lt B \land d_1 \neq 0$.

A computer works in binary ($B = 2$) and stores a real number as a floating point number, reserving bits for the sign, the exponent and the mantissa. With $t$ digits for the mantissa the formula above becomes

$x = \pm 0.d_1 d_2 \dots d_t \cdot B^p = \pm(\sum\limits_{i = 1}^t d_i B^{-i})B^p = \pm m B^p$

It’s possible to define the set of floating point numbers that can be represented this way

$F(B, t, L, U) = \{ x \in \mathbb{R} : x = \pm(\sum\limits_{i = 1}^t d_i B^{-i})B^p \} \cup \{0\}$

where $L \leq p \leq U$ and usually $L = -U$.

Such a representation is far from perfect and several errors may occur. Those errors concern

  • the exponent: $ p \not\in [L, U] \implies \begin{cases} p \gt U & \textbf{overflow}\\ p \lt L & \textbf{underflow} \end{cases}$
  • the mantissa: more than $t$ mantissa digits $\implies \textbf{truncation/rounding}$

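For that reason the digits after the $t$-th are dropped, and the $t$-th digit is either truncated or rounded. Using truncation $\widetilde{d_t} = d_t$, whereas rounding is defined as $\widetilde{d_t} = \begin{cases} d_t & \text{if } d_{t+1} \lt \dfrac{B}{2}\\ d_t+1 & \text{if } d_{t+1} \geq \dfrac{B}{2} \end{cases}$, where a carry propagates to the preceding digits when $d_t + 1 = B$.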
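The difference between truncation and rounding can be tried out on small examples. The following is a minimal sketch, not a reference implementation: the function name `fl` and its parameters are chosen here for illustration, exact rational arithmetic stands in for real numbers, and rounding is assumed to go up whenever the discarded part is at least half a unit of the last kept digit (for an even base this is the same as $d_{t+1} \geq B/2$).

```python
from fractions import Fraction
import math

def fl(x, B, t, rounding=True):
    """Return the t-digit base-B approximation of x as an exact Fraction."""
    x = Fraction(x)
    if x == 0:
        return x
    sign = 1 if x > 0 else -1
    m, p = abs(x), 0
    # Normalise so that 1/B <= m < 1 and |x| = m * B**p.
    while m >= 1:
        m /= B
        p += 1
    while m < Fraction(1, B):
        m *= B
        p -= 1
    scaled = m * B ** t        # the first t digits now form the integer part
    n = math.floor(scaled)     # truncation: drop everything after digit t
    if rounding and scaled - n >= Fraction(1, 2):
        n += 1                 # round up; a carry is handled by the integer sum
    return sign * Fraction(n, B ** t) * Fraction(B) ** p

x = Fraction(23, 32)                   # 0.10111 in base 2
print(fl(x, 2, 3, rounding=False))     # truncation keeps 0.101
print(fl(x, 2, 3, rounding=True))      # rounding gives 0.110
```

With $x = 0.10111_2$ and $t = 3$, truncation keeps $0.101_2 = 5/8$, while rounding adds one to the last digit and the carry turns the mantissa into $0.110_2 = 3/4$.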

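Finally $\lvert F(B, t, L, U) \rvert = 2(U-L+1)(B-1)B^{t-1} + 1$: two signs, $U-L+1$ exponents, $B-1$ choices for $d_1$ and $B$ choices for each of the remaining $t-1$ digits, plus the zero. From now on $fl(x)$ will denote the floating point representation of $x$.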
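For small parameters the set can be enumerated explicitly and its size compared with the formula. The sketch below assumes a toy system with $B = 2$, $t = 3$, $L = -1$, $U = 2$; the helper name `floating_point_set` is made up for this example.

```python
from fractions import Fraction
from itertools import product

def floating_point_set(B, t, L, U):
    """Enumerate F(B, t, L, U) exactly, using rational arithmetic."""
    values = {Fraction(0)}
    for p in range(L, U + 1):
        # d_1 must be nonzero, the remaining t - 1 digits are free.
        for digits in product(range(1, B), *[range(B)] * (t - 1)):
            m = sum(Fraction(d, B ** (i + 1)) for i, d in enumerate(digits))
            values.add(m * Fraction(B) ** p)
            values.add(-m * Fraction(B) ** p)
    return values

B, t, L, U = 2, 3, -1, 2
F = floating_point_set(B, t, L, U)
expected = 2 * (U - L + 1) * (B - 1) * B ** (t - 1) + 1
print(len(F), expected)
print([str(v) for v in sorted(F) if v > 0])
```

Printing the positive elements also shows that the numbers are not evenly spaced: the gap between neighbours grows by a factor $B$ each time the exponent increases.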

Representation errors

Given $x = mB^p$ and $\bar{x} = fl(x) = \bar{m}B^p$ the absolute error is defined as $\rvert x - fl(x) \rvert$ and the relative error is defined as $\dfrac{\rvert x - fl(x) \rvert}{\rvert x \rvert}$.

Using rounding it follows that

  • $\rvert x - fl(x) \rvert \le \dfrac{B^{p-t}}{2}$
  • $\dfrac{\rvert x - fl(x) \rvert}{\rvert x \rvert} \le \dfrac{B^{1-t}}{2}$
  • $\rvert m - \bar{m} \rvert \leq \dfrac{B^{-t}}{2}$

The second bound does not depend on $p$, that is, on the magnitude of the value represented: the relative error is limited by the number of mantissa digits alone, and for that reason it provides more useful information than the absolute error.
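The relative bound can be checked on an actual machine. The sketch below assumes that Python floats are IEEE 754 double precision numbers, which in the $0.d_1 d_2 \dots d_t$ convention used here corresponds to $B = 2$ and $t = 53$; the sample values are arbitrary.

```python
from fractions import Fraction

# Assumption: Python floats are IEEE 754 doubles (B = 2, t = 53).
B, t = 2, 53
bound = Fraction(B) ** (1 - t) / 2      # B^(1-t) / 2

def relative_error(x):
    """Exact relative error committed when the rational x is stored as a float."""
    stored = Fraction(float(x))         # exact value of fl(x)
    return abs(stored - x) / abs(x)

samples = [Fraction(1, 10), Fraction(1, 3),
           Fraction(2, 3) * 10**20, Fraction(7, 10**30)]
for x in samples:
    err = relative_error(x)
    print(f"x = {float(x):.3e}  relative error = {float(err):.3e}  "
          f"within bound: {err <= bound}")
```

The samples span about fifty orders of magnitude, yet every relative error should stay below the same bound, while the absolute errors differ enormously.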

Machine precision

Now it’s possible to define the machine precision $eps$ as the smallest positive number such that $fl(1 + eps) \gt 1$. With rounding, $eps = \dfrac{B^{1-t}}{2}$, and the bound on the relative error reads $\dfrac{\lvert x - fl(x) \rvert}{\lvert x \rvert} \leq eps$.

The relative error would then be $err = \dfrac{fl(x) - x}{x}$, so that $fl(x) = x(1 + err)$ with $\rvert err \rvert \leq eps$.
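A common experiment is to look for the smallest power of two that still changes $1$ when added to it. The sketch below assumes IEEE 754 double precision floats and compares the result with `sys.float_info.epsilon` from the standard library; the function name `machine_epsilon` is illustrative.

```python
import sys

def machine_epsilon():
    """Halve eps until adding half of it to 1.0 no longer changes 1.0."""
    eps = 1.0
    while 1.0 + eps / 2 > 1.0:
        eps /= 2
    return eps

eps = machine_epsilon()
print(eps, eps == sys.float_info.epsilon)
```

On IEEE doubles the loop is expected to stop at $2^{-52}$, the distance between $1$ and the next larger float. The rounding bound $\dfrac{B^{1-t}}{2}$ with $t = 53$ is half of that, because IEEE arithmetic resolves exact ties towards the even digit instead of always rounding up.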

Error propagation

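On a computer every operation $op$ is replaced by a machine operation $\widetilde{op}$ that can be described as $\bar{x}\ \widetilde{op}\ \bar{y} = fl(\bar{x}\ op\ \bar{y} ) = (\bar{x}\ op\ \bar{y})(1 + \epsilon)$ with $\lvert \epsilon \rvert \leq eps$. In the expressions below $\epsilon_1$ and $\epsilon_2$ are the representation errors of $x$ and $y$, $\epsilon_3$ is the error of the operation, and products of errors are neglected.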

  • Multiplication $ \dfrac{fl(fl(x)fl(y)) - xy}{xy}$ = $\dfrac{(x(1 + \epsilon_1) y(1 + \epsilon_2))(1 + \epsilon_3) - xy}{xy}$ = $(1 + \epsilon_1)(1 + \epsilon_2)(1 + \epsilon_3) - 1$ $\widetilde{=}$ $\epsilon_1 + \epsilon_2 + \epsilon_3$.

  • Addition $ \dfrac{fl(fl(x) + fl(y)) - (x + y)}{x + y}$ = $\dfrac{(x(1 + \epsilon_1) + y(1 + \epsilon_2))(1 + \epsilon_3) - (x + y)}{x + y}$ = $\dfrac{x\epsilon_1 + y\epsilon_2 + x\epsilon_3 + y\epsilon_3 + x\epsilon_1\epsilon_3 + y\epsilon_2\epsilon_3}{x+y}$ $\widetilde{=}$ $\dfrac{x}{x+y}\epsilon_1$ + $\dfrac{y}{x+y}\epsilon_2$ + $\epsilon_3$

  • Subtraction $ \dfrac{fl(fl(x) - fl(y)) - (x - y)}{x - y}$ = $\dfrac{(x(1 + \epsilon_1) - y(1 + \epsilon_2))(1 + \epsilon_3) - (x - y)}{x - y}$ $\widetilde{=}$ $\dfrac{x}{x-y}\epsilon_1$ - $\dfrac{y}{x-y}\epsilon_2$ + $\epsilon_3$

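It’s important to note that, in the subtraction, if $x$ and $y$ have nearly the same value the factors that multiply the input errors grow without bound: $\lim_{x \rightarrow{y}} \left\lvert \dfrac{x}{x-y} \right\rvert = \infty$ for $y \neq 0$. The relative error of the result can then be so large that the result has no meaning, even though $x$ and $y$ are both represented accurately.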
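The effect is easy to reproduce. The sketch below assumes IEEE 754 double precision floats and uses the arbitrary values $x = 1 + 10^{-15}$ and $y = 1$; exact rational arithmetic is used only to measure the errors.

```python
from fractions import Fraction

x_exact = 1 + Fraction(1, 10**15)   # not representable exactly as a float
y_exact = Fraction(1)               # representable exactly

x_bar, y_bar = float(x_exact), float(y_exact)      # fl(x), fl(y)

# Relative error made when storing x.
input_error = abs(Fraction(x_bar) - x_exact) / x_exact

# Relative error of the computed difference.
diff_exact = x_exact - y_exact
diff_computed = Fraction(x_bar - y_bar)
output_error = abs(diff_computed - diff_exact) / diff_exact

# The factor x / (x - y) that multiplies the input error.
amplification = x_exact / diff_exact

print(f"relative error of fl(x):        {float(input_error):.3e}")
print(f"amplification factor x/(x - y): {float(amplification):.3e}")
print(f"relative error of the result:   {float(output_error):.3e}")
```

The stored value of $x$ is accurate to about sixteen digits, but the amplification factor is about $10^{15}$, so the computed difference is off by roughly 11%. The subtraction itself introduces no new error here; it only exposes the representation error of $x$.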
