Number Representation

In computers, numbers are represented with a finite amount of memory. The smallest unit of memory is a bit, which can take one of two values, 0 or 1. Because memory is built from bits, it is natural to represent numbers in the binary system.

Most computers use the binary system for representing numbers. In contrast, most handheld calculators use the decimal (base-10) system.

Computers represent integers and real numbers in different ways. In this article, we will discuss how real numbers are represented in computers.

There are two ways in which real numbers are represented. One is fixed point representation and the other is floating point representation.

Fixed Point Representation

In fixed point representation, the number of digits after the decimal point is fixed. This is achieved by allocating a fixed amount of memory to the fractional part. For example, if we allocate 3 digits to the fractional part, the numbers we can represent include:

1.002,\ 2.333,\ 200.444 \text{ etc.}

Fixed point representation minimizes the absolute error of representation, but it cannot represent a large range of numbers. It is used where values always have a fixed number of fractional digits, such as monetary amounts. Fixed point numbers are generally not available at the hardware level, but many libraries provide them in software. Python's decimal module, for example, can be used to work with a fixed number of fractional digits.
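As a rough illustration, the sketch below pins values to three fractional digits using Python's decimal module. The helper name to_fixed and the choice of ROUND_HALF_UP are assumptions made for this example; quantize is used here only to fix the number of fractional digits.

```python
from decimal import Decimal, ROUND_HALF_UP

# Three fractional digits, matching the example above.
STEP = Decimal("0.001")

def to_fixed(value: str) -> Decimal:
    """Round a decimal string to exactly three fractional digits (hypothetical helper)."""
    return Decimal(value).quantize(STEP, rounding=ROUND_HALF_UP)

a = to_fixed("1.0024")     # stored as 1.002
b = to_fixed("200.4436")   # stored as 200.444
tiny = to_fixed("0.0004")  # stored as 0.000: the value is lost entirely
print(a, b, tiny)
```

The last value shows the limited range: anything smaller than half of the last fractional digit collapses to zero.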

Importance of significant digits

One of the major flaws of fixed point representation is that it does not take the importance of significant digits into account. For example, the fractional part of 200.001 matters very little: dropping 0.001 changes the number by only about 0.0005%. For the number 0.0001, however, the fractional digits are all that matter, and a small change causes a large relative error.
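The difference between the two cases can be made concrete by computing absolute and relative errors. This is a minimal sketch; the function name errors is hypothetical, and the second case assumes a format with 3 fractional digits, in which 0.0001 is stored as 0.000.

```python
def errors(true_value: float, stored_value: float):
    """Return (absolute error, relative error) of a stored approximation."""
    abs_err = abs(true_value - stored_value)
    return abs_err, abs_err / abs(true_value)

# Dropping 0.001 from 200.001 barely matters in relative terms.
abs_a, rel_a = errors(200.001, 200.0)
# Storing 0.0001 with 3 fractional digits gives 0.000: the value is lost.
abs_b, rel_b = errors(0.0001, 0.0)

print(f"{rel_a:.6%}")  # about 0.0005%
print(f"{rel_b:.0%}")  # 100%
```

Both absolute errors are at most 0.001, yet the relative errors differ by several orders of magnitude.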

Floating Point Representation

In floating point representation, numbers are represented in scientific notation with a fixed number of significant digits and an exponent. It properly accounts for the importance of significant digits and can represent very large or very small numbers with small relative errors. A number represented in floating point notation is called a float. A real number $x$ is represented in floating point notation as:

x = f \times r^e

where $f$ is the fractional part, also called the mantissa, $r$ is the radix ($2$ for the binary system and $10$ for the decimal system), and $e$ is the exponent. Both the mantissa and the exponent can be positive or negative. Some numbers in floating point representation are:

1.22\times10^{-2},\ 43.2\times10^{3} \text{ etc.}
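For the binary case ($r=2$), Python's standard math.frexp splits a float into exactly this kind of pair. The sketch below is only an illustration of the decomposition $x = f \times 2^e$; the sample values are arbitrary.

```python
import math

for x in (6.5, -0.15625, 1e-3):
    f, e = math.frexp(x)            # x == f * 2**e, with 0.5 <= |f| < 1
    assert math.ldexp(f, e) == x    # ldexp rebuilds x from the pair
    print(x, "=", f, "* 2 **", e)
```

For example, 6.5 decomposes into the mantissa 0.8125 and the exponent 3, since $0.8125\times2^3=6.5$.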

The floating point representation can only represent a discrete subset of the real numbers. The representable numbers are dense near zero and sparse at the extreme ends of the range. For example, with a 1-digit mantissa and a 1-digit exponent, the numbers that can be represented are:

-0.9\times10^9,-0.8\times10^9, ..., \\ -0.1\times10^1, -0.9\times10^0, ...., 0, ..., \\ 0.9\times10^0, 0.1\times10^1, ..., \\ 0.8\times10^9,0.9\times10^9

We can see that $0.99\times10^9$ cannot be represented: rounding it up to one digit would require $0.1\times10^{10}$, which overflows the 1-digit exponent. Numbers like $0.12\times10^0$ cannot be represented exactly either, because the mantissa has only 1 digit.

Similarly, the absolute error of representation varies with magnitude, while the relative error is always less than 100% (not considering overflow). For example, when we round $0.89\times10^9$ to $0.9\times10^9$, the absolute error is $10^7$ and the relative error is about 1.1%. But when we round $0.82\times10^2$ to $0.8\times10^2$, the absolute error is only $2$ while the relative error is about 2.4%.
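The toy format is small enough to enumerate. The following sketch lists its positive values, assuming the 1-digit exponent ranges from -9 to 9, and rounds the two sample inputs to the nearest representable value. The helper name nearest is hypothetical, and exact fractions are used so that the computed errors are not distorted by the machine's own floating point.

```python
from fractions import Fraction

# Toy format: mantissa 0.d with one non-zero digit d, exponent -9..9 (assumed).
values = sorted({Fraction(d, 10) * Fraction(10) ** e
                 for d in range(1, 10) for e in range(-9, 10)})

def nearest(x: Fraction) -> Fraction:
    """Closest representable positive value to x (hypothetical helper)."""
    return min(values, key=lambda v: abs(v - x))

for x in (Fraction(89, 100) * 10**9, Fraction(82, 100) * 10**2):
    r = nearest(x)
    abs_err = abs(x - r)
    rel_err = abs_err / x
    print(float(x), "->", float(r), float(abs_err), f"{float(rel_err):.1%}")
```

The absolute errors differ by many orders of magnitude between the two inputs, while the relative errors stay within a few percent of each other.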

Characteristics of Floating Point Representation

1. Non Unique

In FP representation, a single number may be written in many different ways. For example, the number $1.2\times10^2$ can be written as $12\times10^1$, $120\times10^0$, or $0.12\times10^3$. All of these are valid representations. To avoid ambiguity, floating point numbers are stored in a normalized form such as $0.12\times10^3$. One way of normalizing is to write the mantissa in the form $0.xaa$, where the first digit $x$ is non-zero, and then adjust the exponent so that the value is unchanged.
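A minimal sketch of this normalization for radix 10 is shown below. The function name normalize is hypothetical, and Decimal is used only so that the shifts by powers of ten are exact.

```python
from decimal import Decimal

def normalize(f: Decimal, e: int):
    """Return (f, e) with 0.1 <= |f| < 1, keeping the value f * 10**e unchanged."""
    if f == 0:
        return f, 0                  # zero has no normalized form; use exponent 0
    while abs(f) >= 1:               # mantissa too large: shift right
        f /= 10
        e += 1
    while abs(f) < Decimal("0.1"):   # mantissa too small: shift left
        f *= 10
        e -= 1
    return f, e

# All spellings of 120 collapse to the same normalized pair.
for f, e in (("1.2", 2), ("12", 1), ("120", 0), ("0.012", 4)):
    print(normalize(Decimal(f), e))
```

Each of the four inputs yields a mantissa equal to 0.12 and an exponent of 3.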

2. Asymmetric Memory Usage

In FP representation, roughly half of the representable values lie between $-1$ and $1$, and the other half cover the rest of the range.

To understand this, assume that the numbers are always normalized. In normalized form, positive exponents represent numbers with a magnitude of at least $1$, while zero and negative exponents represent numbers with a magnitude smaller than $1$. About half of all exponent values are non-positive, so about half of the representable numbers lie between $-1$ and $1$.
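This can be checked by counting in a small decimal format. The sketch assumes a normalized 1-digit mantissa ($0.1$ to $0.9$) and exponents from -9 to 9, and counts only the positive values, since the negative ones mirror them.

```python
from fractions import Fraction

# Positive normalized values 0.d * 10**e with d in 1..9 and e in -9..9 (assumed ranges).
positives = [Fraction(d, 10) * Fraction(10) ** e
             for d in range(1, 10) for e in range(-9, 10)]

below_one = sum(1 for v in positives if v < 1)
print(below_one, len(positives))          # 90 171
print(f"{below_one / len(positives):.1%}")
```

Here 90 of the 171 positive values are smaller than 1, which is slightly more than half because the exponent 0 also falls on that side.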

3. Minimizes Relative Error

Floating point representation minimizes the relative error. This is because of the fixed-size mantissa and the use of an exponent. The exponent tracks the magnitude of the number, while the fixed-size mantissa limits the number of significant digits that are kept.

Floating Point Arithmetic

Floating point numbers need to be stored in normalized exponent form. After any arithmetic operation, the result should also be in normalized form. We take two numbers $x=f_x\times r^{e_x}$ and $y=f_y\times r^{e_y}$ for demonstrating the algorithms.

Addition Algorithm

Our goal is to find $z=x+y$ .

Find exponent: $e_z=\max(e_x,e_y)$
Shift: $f_x$ towards right by $e_z-e_x$ digits and $f_y$ towards right by $e_z-e_y$ digits.
Add Mantissa: Set $f_z=f_x+f_y$ .
Normalize: if $|f_z|\geq1$ , then right-shift $f_z$ by 1 digit and increase $e_z$ by 1.
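The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: a mantissa $0.d_1d_2d_3$ is stored as the integer $d_1d_2d_3$, right shifts truncate the dropped digits, and the names shift_right and fp_add are hypothetical. Exponent overflow is not checked.

```python
def shift_right(m: int, k: int) -> int:
    """Drop the k lowest decimal digits of m, truncating toward zero."""
    sign = -1 if m < 0 else 1
    return sign * (abs(m) // 10 ** k)

def fp_add(mx: int, ex: int, my: int, ey: int, digits: int = 3):
    """Add 0.mx * 10**ex and 0.my * 10**ey, returning (mantissa, exponent)."""
    ez = max(ex, ey)                  # find exponent
    mx = shift_right(mx, ez - ex)     # align both mantissas to ez
    my = shift_right(my, ez - ey)
    mz = mx + my                      # add mantissas
    if abs(mz) >= 10 ** digits:       # |f_z| >= 1: normalize
        mz = shift_right(mz, 1)
        ez += 1
    return mz, ez

print(fp_add(512, 2, 731, 1))   # (585, 2), i.e. 0.585 * 10**2
print(fp_add(640, 1, 570, 1))   # (121, 2), i.e. 0.121 * 10**2
```

In the first call the exact sum is 58.51, so the digit lost when aligning the smaller operand shows up as a small error in the result.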

Addition Example

Given $x=0.923\times 10^2$ and $y=0.824\times 10^1$, find $z=x+y$ using a 3-digit mantissa and a 1-digit exponent.

Solution
Given: $f_x=0.923,e_x=2$ and $f_y=0.824,e_y=1$
Find exponent: $e_z=\max(2,1)=2$
Shift: $f_x = f_x >> (e_z-e_x)=0.923>>0=0.923$
$f_y=f_y>>(e_z-e_y)=0.824>>1=0.082$
Add Mantissa: $f_z=f_x+f_y=0.923+0.082=1.005$
Normalize: Since $|f_z|\geq1$, $f_z=f_z>>1=0.100$ and $e_z=e_z+1=2+1=3$.

Therefore, $z = 0.100\times 10^3$. The exact sum is $100.54$; the difference comes from the digits dropped by the two right shifts.

Subtraction Algorithm

Subtraction is similar to addition, but the normalization differs. Instead of the magnitude of the mantissa reaching 1 or more, it can fall below 0.1, so normalization shifts in the opposite direction.

Our goal is to find $z=x-y$ .

Find exponent: $e_z=\max(e_x,e_y)$
Shift: $f_x$ towards right by $e_z-e_x$ digits and $f_y$ towards right by $e_z-e_y$ digits.
Subtract mantissa: Set $f_z=f_x-f_y$ .
Normalize: while $f_z\neq0$ and $|f_z|< 0.1$, left-shift $f_z$ by 1 digit and decrease $e_z$ by 1.
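A possible Python sketch of subtraction is given below, again storing a mantissa $0.d_1d_2d_3$ as the integer $d_1d_2d_3$. The names shift_right and fp_sub are hypothetical, and both operands are assumed to be positive and normalized. The loop repeats the left shift as often as needed, because cancellation can remove more than one leading digit.

```python
def shift_right(m: int, k: int) -> int:
    """Drop the k lowest decimal digits of m, truncating toward zero."""
    sign = -1 if m < 0 else 1
    return sign * (abs(m) // 10 ** k)

def fp_sub(mx: int, ex: int, my: int, ey: int, digits: int = 3):
    """Subtract 0.my * 10**ey from 0.mx * 10**ex, returning (mantissa, exponent)."""
    ez = max(ex, ey)                      # find exponent
    mx = shift_right(mx, ez - ex)         # align both mantissas to ez
    my = shift_right(my, ez - ey)
    mz = mx - my                          # subtract mantissas
    if mz == 0:
        return 0, 0                       # exact cancellation: nothing to normalize
    while abs(mz) < 10 ** (digits - 1):   # |f_z| < 0.1: shift left
        mz *= 10
        ez -= 1
    return mz, ez

print(fp_sub(500, 1, 493, 1))   # (700, -1), i.e. 0.700 * 10**-1
```

Here $5.00-4.93=0.07$ needs two left shifts, and the zeros shifted in on the right carry no real information about the operands.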

Subtraction Example

Given $x=0.881\times 10^2$ and $y=0.845\times 10^2$, find $z=x-y$ using a 3-digit mantissa and a 1-digit exponent.

Solution
Given: $f_x=0.881,e_x=2$ and $f_y=0.845,e_y=2$
Find exponent: $e_z=\max(2,2)=2$
Shift: $f_x = f_x >> (e_z-e_x)=0.881>>0=0.881$
$f_y=f_y>>(e_z-e_y)=0.845>>0=0.845$
Subtract mantissa: $f_z=f_x-f_y=0.881-0.845=0.036$
Normalize: Since $|f_z|< 0.1$, $f_z=f_z << 1 = 0.360$ and $e_z=e_z-1=2-1=1$.

Therefore, $z = 0.360 \times 10^1$

Multiplication Algorithm

Our goal is to find $z=x \times y$

Find exponent: $e_z=e_x+e_y$
Multiply mantissa: $f_z=f_x*f_y$
Normalize: if $|f_z|< 0.1$ , then left-shift $f_z$ by 1 and decrease $e_z$ by 1
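The multiplication steps can be sketched in the same style. The function name fp_mul is hypothetical; mantissas are stored as 3-digit integers, the operands are assumed to be normalized, and the product is truncated (not rounded) back to 3 digits. For normalized operands the product of the mantissas is at least $0.01$, so one left shift is always enough.

```python
def fp_mul(mx: int, ex: int, my: int, ey: int, digits: int = 3):
    """Multiply 0.mx * 10**ex by 0.my * 10**ey, returning (mantissa, exponent)."""
    ez = ex + ey                              # find exponent
    prod = mx * my                            # mantissa product, scaled by 10**(2*digits)
    if prod == 0:
        return 0, 0
    if abs(prod) < 10 ** (2 * digits - 1):    # |f_z| < 0.1: shift left once
        prod *= 10
        ez -= 1
    sign = -1 if prod < 0 else 1
    mz = sign * (abs(prod) // 10 ** digits)   # keep the leading `digits` digits
    return mz, ez

print(fp_mul(124, 2, 702, -1))   # (870, 0), i.e. 0.870 * 10**0
```

The exact product of $12.4$ and $0.0702$ is $0.87048$, so the result is correct to the three digits that the format can hold.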

Multiplication Example

Given $x=0.124\times 10^2$ and $y=0.702\times 10^{-1}$, find $z=x\times y$ using a 3-digit mantissa and a 1-digit exponent.

Solution
Given: $f_x=0.124,e_x=2$ and $f_y=0.702,e_y=-1$
Find exponent: $e_z=2+(-1)=1$
Multiply mantissa: $f_z=f_x*f_y=0.124*0.702=0.087048$
Normalize: Since, $|f_z|< 0.1$ , $f_z=f_z << 1 = 0.870$ and $e_z=e_z-1=1-1=0$ .

Therefore, $z = 0.870\times 10^0$

Division Algorithm

Our goal is to find $z=x/y$

Find exponent: $e_z=e_x-e_y$
Divide mantissa: $f_z=f_x/f_y$
Normalize: if $|f_z|\geq 1$ , then right-shift $f_z$ by 1 and increase $e_z$ by 1
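Division follows the same pattern. In this sketch the name fp_div is hypothetical, mantissas are stored as 3-digit integers, the operands are assumed to be normalized, and the quotient is truncated to 3 digits. For normalized operands the quotient of the mantissas is below $10$, so one right shift is always enough.

```python
def fp_div(mx: int, ex: int, my: int, ey: int, digits: int = 3):
    """Divide 0.mx * 10**ex by 0.my * 10**ey, returning (mantissa, exponent)."""
    if my == 0:
        raise ZeroDivisionError("divisor mantissa is zero")
    ez = ex - ey                                # find exponent
    sign = -1 if (mx < 0) != (my < 0) else 1
    q = abs(mx) * 10 ** digits // abs(my)       # f_x / f_y, scaled by 10**digits
    if q >= 10 ** digits:                       # |f_z| >= 1: shift right once
        q //= 10
        ez += 1
    return sign * q, ez

print(fp_div(628, 2, 271, -1))   # (231, 4), i.e. 0.231 * 10**4
```

The exact quotient of $62.8$ and $0.0271$ is about $2317.3$, so truncating to three digits gives $0.231\times10^4$.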

Division Example

Given $x=0.628\times 10^2$ and $y=0.271\times 10^{-1}$, find $z=x/y$ using a 3-digit mantissa and a 1-digit exponent.

Solution
Given: $f_x=0.628,e_x=2$ and $f_y=0.271,e_y=-1$
Find exponent: $e_z=2-(-1)=3$
Divide mantissa: $f_z=f_x/f_y=0.628/0.271\approx2.317$
Normalize: Since $|f_z|\geq 1$, $f_z=f_z >> 1 = 0.231$ and $e_z=e_z+1=3+1=4$.

Therefore, $z = 0.231\times 10^4$

Fixed Point vs Floating Point

The following table summarizes the differences between fixed-point and floating-point representations:

Fixed Point	Floating Point
1. Fixed number of digits is used to represent the decimal part of a real number.	1. Fixed number of significant digits plus an exponent is used to represent a real number.
2. It minimizes the absolute error of representation.	2. It minimizes the relative error of representation.
3. The absolute error of representation is bounded by a constant, while the relative error is variable and can reach 100% for values too small to be represented.	3. Both absolute and relative errors of representation are variable, but the relative error is always less than 100% (not considering overflow).
4. It can't represent very large or very small numbers.	4. It can represent both very large and very small numbers.
5. Used for financial and accounting purposes.	5. Used for scientific calculations.
6. Generally not available in hardware implementation.	6. Generally available in hardware implementation with standards such as IEEE 754.
7. Example: 1.002, 7.888, 0.001 etc	7. Example: $0.122\times 10^2$ , $-0.38\times 10^{-8}$ etc.
