# Introduction

## Representing real numbers

Given a number $x \in \mathbb{R}$ and a base $B \in \mathbb{N}$, $B \geq 2$, there exists a unique representation for $x$

$x = \pm 0.d_1 d_2 \dots \cdot B^p = \pm(\sum\limits_{i = 1}^\infty d_i B^{-i})B^p = \pm m B^p$

where $\dfrac{1}{B} \leq m \lt 1$ is the mantissa, $p \in \mathbb{Z}$ is the exponent and $0 \leq d_i \leq B - 1$ with $d_1 \neq 0$.

Binary machines ($B = 2$) represent real numbers as floating-point numbers, reserving bits for the sign, the exponent and the mantissa. With $t$ digits for the mantissa the formula above becomes

$x = \pm 0.d_1 d_2 \dots d_t \cdot B^p = \pm(\sum\limits_{i = 1}^t d_i B^{-i})B^p = \pm m B^p$
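For $B = 2$ this normalized decomposition is exactly what Python's `math.frexp` computes; a minimal sketch (the sample value $6.25$ is arbitrary):

```python
import math

# math.frexp decomposes a float x into (m, p) with x = m * 2**p
# and 1/2 <= |m| < 1, i.e. the normalized form above with B = 2
m, p = math.frexp(6.25)
print(m, p)  # 0.78125 3, since 6.25 = 0.78125 * 2**3

# The decomposition round-trips exactly
assert math.ldexp(m, p) == 6.25
```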

It’s possible to define the set of floating-point numbers as

$F(B, t, L, U) = \{ x \in \mathbb{R} : x = \pm(\sum\limits_{i = 1}^t d_i B^{-i})B^p \} \cup \{0\}$

where $L \leq p \leq U$ and usually $L = -U$.

Such a representation is far from perfect: several errors may occur. They concern

• the exponent: $p \not\in [L, U] \implies \begin{cases} p \gt U & \textbf{overflow}\\ p \lt L & \textbf{underflow} \end{cases}$
• the mantissa: more than $t$ significant digits $\implies$ **truncation/rounding**

For that reason the digits beyond the $t$-th are truncated or rounded. Truncation simply discards $d_{t+1}, d_{t+2}, \dots$, whereas rounding is defined as $\widetilde{d_t} = \begin{cases} d_t & \text{if } d_{t+1} \lt \dfrac{B}{2}\\ d_t+1 & \text{if } d_{t+1} \geq \dfrac{B}{2} \end{cases}$
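As a sketch, truncation and rounding to $t$ significant digits can be simulated with Python's `decimal` module (here with $B = 10$; the function name `fl` and the sample values are illustrative):

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

def fl(x, t, mode):
    """Keep t significant base-10 digits of x: ROUND_DOWN truncates,
    ROUND_HALF_UP rounds up iff the first discarded digit is >= B/2 = 5."""
    d = Decimal(repr(x))
    # adjusted() gives the exponent of the leading digit, so this
    # quantum keeps exactly t significant digits
    quantum = Decimal(1).scaleb(d.adjusted() - t + 1)
    return d.quantize(quantum, rounding=mode)

print(fl(3.14159, 3, ROUND_DOWN))     # 3.14 (truncated)
print(fl(2.71828, 3, ROUND_HALF_UP))  # 2.72 (rounded up)
```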

Finally $\rvert F(B, t, L, U) \rvert = 2(U-L+1)(B-1)B^{t-1} + 1$. From now on $fl(x)$ will denote the floating-point representation of $x$.
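The cardinality formula can be checked by brute-force enumeration of a tiny system; a sketch for $F(2, 3, -1, 1)$ (the parameter values are arbitrary):

```python
from itertools import product

B, t, L, U = 2, 3, -1, 1
F = {0.0}
for sign in (1, -1):
    for p in range(L, U + 1):
        for digits in product(range(B), repeat=t):
            if digits[0] == 0:
                continue  # normalization requires d_1 != 0
            m = sum(d * B ** -(i + 1) for i, d in enumerate(digits))
            F.add(sign * m * B ** p)

# 2(U-L+1)(B-1)B^(t-1) + 1 = 2*3*1*4 + 1 = 25
print(len(F))  # 25
```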

## Representation errors

Given $x = mB^p$ and $\bar{x} = fl(x) = \bar{m}B^p$, the absolute error is defined as $\rvert x - fl(x) \rvert$ and the relative error as $\dfrac{\rvert x - fl(x) \rvert}{\rvert x \rvert}$. With rounding the following bounds hold:

• $\rvert x - fl(x) \rvert \le \dfrac{B^{p-t}}{2}$
• $\dfrac{\rvert x - fl(x) \rvert}{\rvert x \rvert} \le \dfrac{B^{1-t}}{2}$
• $\rvert m - \bar{m} \rvert \leq \dfrac{B^{-t}}{2}$

From the second bound it’s possible to see that the relative error is independent from the magnitude of the value represented, and for that reason it provides the more useful information.
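This independence can be illustrated by measuring the relative error of the double-precision representation of numbers spanning many orders of magnitude, using exact rational arithmetic (the sample values are arbitrary):

```python
from decimal import Decimal
from fractions import Fraction

rels = []
for s in ("0.1", "123456.789", "1e-20", "9.87e15"):
    exact = Fraction(Decimal(s))   # the real number x, exactly
    stored = Fraction(float(s))    # the exact value of fl(x)
    rels.append(abs(stored - exact) / exact)

# Despite ~35 orders of magnitude, every relative error is <= 2**-53
print(max(rels) <= Fraction(1, 2 ** 53))  # True
```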

### Calculator precision

Now it’s possible to define the machine precision $eps$ as the smallest positive number such that $fl(1 + eps) \gt 1$. It bounds the relative representation error: $\dfrac{\rvert x - fl(x) \rvert}{\rvert x \rvert} \leq eps$.

The relative error of the representation is then $err = \dfrac{fl(x) - x}{x}$, with $\rvert err \rvert \leq eps$.
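A classical sketch for finding $eps$ on an IEEE-754 double-precision machine: halve a candidate until adding it to $1.0$ is no longer detected.

```python
import sys

eps = 1.0
while 1.0 + eps / 2.0 > 1.0:
    eps /= 2.0  # halve while the sum is still distinguishable from 1

print(eps)                             # 2.220446049250313e-16 = 2**-52
print(eps == sys.float_info.epsilon)   # True on IEEE-754 doubles
```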

## Error propagation

On a calculator every operation introduces a rounding error, described by $fl(\bar{x}\ op\ \bar{y}) = (\bar{x}\ op\ \bar{y})(1 + \epsilon)$ with $\rvert \epsilon \rvert \leq eps$. Expanding each operation to first order in the $\epsilon_i$ gives

• Multiplication $\dfrac{fl(fl(x)fl(y)) - xy}{xy}$ = $\dfrac{x(1 + \epsilon_1)\,y(1 + \epsilon_2)(1 + \epsilon_3) - xy}{xy}$ = $(1 + \epsilon_1)(1 + \epsilon_2)(1 + \epsilon_3) - 1$ $\approx$ $\epsilon_1 + \epsilon_2 + \epsilon_3$

• Addition $\dfrac{fl(fl(x) + fl(y)) - (x + y)}{x + y}$ = $\dfrac{(x(1 + \epsilon_1) + y(1 + \epsilon_2))(1 + \epsilon_3) - (x + y)}{x + y}$ = $\dfrac{x\epsilon_1 + y\epsilon_2 + (x + y)\epsilon_3 + x\epsilon_1\epsilon_3 + y\epsilon_2\epsilon_3}{x+y}$ $\approx$ $\dfrac{x}{x+y}\epsilon_1 + \dfrac{y}{x+y}\epsilon_2 + \epsilon_3$

• Subtraction $\dfrac{fl(fl(x) - fl(y)) - (x - y)}{x - y}$ = $\dfrac{(x(1 + \epsilon_1) - y(1 + \epsilon_2))(1 + \epsilon_3) - (x - y)}{x - y}$ $\approx$ $\dfrac{x}{x-y}\epsilon_1 - \dfrac{y}{x-y}\epsilon_2 + \epsilon_3$
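These first-order formulas can be checked numerically by simulating a lower-precision machine; a sketch that rounds doubles to IEEE single precision ($B = 2$, $t = 24$, so $eps = 2^{-24}$ for this simulated $fl$; the sample values are arbitrary):

```python
import struct

def fl(x):
    # Round a double to the nearest IEEE single-precision value
    return struct.unpack('f', struct.pack('f', x))[0]

eps = 2.0 ** -24
x, y = 0.1, 0.3
rel = abs(fl(fl(x) * fl(y)) - x * y) / abs(x * y)

# |eps_1 + eps_2 + eps_3| <= 3*eps, up to second-order terms
print(rel <= 3 * eps + 1e-10)  # True
```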

It’s important to note that, in the subtraction, if $x$ and $y$ have nearly the same absolute value then $\lim_{y \rightarrow x} \dfrac{x\epsilon_1}{x-y} = \infty$: the amplification factors grow without bound. This phenomenon, known as cancellation, can make the error so large that the result loses all meaning.
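A minimal demonstration of cancellation in double precision: recovering $10^{-15}$ from $(1 + 10^{-15}) - 1$ gives only about one correct digit, because the representation error of $1 + 10^{-15}$ is amplified by the factor $x/(x-y) \approx 10^{15}$.

```python
a = (1.0 + 1e-15) - 1.0
print(a)  # 1.1102230246251565e-15, not 1e-15

rel_err = abs(a - 1e-15) / 1e-15
print(rel_err)  # about 0.11: an 11% relative error from cancellation
```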