Number Representation

In computers, numbers are represented with a finite amount of memory. The smallest unit of memory is a bit, which can take one of two values: 0 or 1. Because memory is built from bits, it is natural for computers to represent numbers in the binary system.

In contrast, most handheld calculators use the decimal (base-10) system to represent numbers.

Computers represent integers and real numbers in different ways. In this article, we will discuss how real numbers are represented in computers.

There are two ways in which real numbers are represented. One is fixed point representation and the other is floating point representation.

Fixed Point Representation

In fixed point representation, the number of digits after the decimal point is fixed. This is achieved by allocating a fixed amount of memory to the fractional part. For example, if we allocate 3 digits to the fractional part, then the numbers we can represent include:

$$1.002,\ 2.333,\ 200.444 \text{ etc.}$$

Fixed point representation minimizes the absolute error of representation, but it can't represent a large range of numbers. It is used where numbers always have a fixed number of fractional digits, such as when representing money. Generally, fixed point numbers are not available at the hardware level, but plenty of libraries provide fixed-point support in software, such as the decimal module for Python.
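As a minimal sketch of the idea, the snippet below uses `Decimal.quantize` from Python's standard `decimal` module to force every value to exactly three fractional digits. The helper name `to_fixed`, the step `0.001`, and the rounding mode are assumptions made for illustration.

```python
from decimal import Decimal, ROUND_HALF_EVEN

# Three fractional digits, matching the example in the text (an assumption).
STEP = Decimal("0.001")

def to_fixed(value: str) -> Decimal:
    """Round a decimal string to exactly three fractional digits."""
    return Decimal(value).quantize(STEP, rounding=ROUND_HALF_EVEN)

print(to_fixed("200.4444"))  # 200.444
print(to_fixed("0.0004"))    # 0.000
```

Every stored value is within 0.0005 of the original, regardless of its magnitude, which is what a bounded absolute error means here. The second call also shows the cost: a small number can lose all of its information.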

Importance of significant digits

One of the major flaws of fixed point representation is that it doesn't take the importance of significant digits into account. For example, the fractional part of 200.001 matters very little: if we omit 0.001 from 200.001, the error is tiny relative to the number. But for the number 0.0001, the fractional digits are all that matter, and a small change in them causes a large relative error.
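The difference can be made concrete by computing relative errors. This is a small sketch; the helper `relative_error` is a name introduced here for illustration, and the second case assumes the value is stored as 0 because only three fractional digits are kept.

```python
from decimal import Decimal

def relative_error(exact: Decimal, approx: Decimal) -> Decimal:
    """Relative error |exact - approx| / |exact|."""
    return abs(exact - approx) / abs(exact)

# Dropping 0.001 from 200.001 barely matters.
big = relative_error(Decimal("200.001"), Decimal("200"))
# Storing 0.0001 as 0 (assumed: three fractional digits) loses everything.
small = relative_error(Decimal("0.0001"), Decimal("0"))

print(f"{float(big):.6%}")    # a few millionths
print(f"{float(small):.0%}")  # 100%
```

The absolute error is at most 0.001 in both cases, yet the relative error ranges from negligible to total.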

Floating Point Representation

In floating point representation, numbers are represented in scientific notation with a fixed number of significant digits and an exponent. It properly accounts for the importance of significant digits and can represent very large or very small numbers with small relative errors. A number represented in floating point notation is called a float. A real number $x$ is represented in floating point notation as:

$$x = f \times r^e$$

where $f$ is the fractional part, also called the mantissa; $r$ is the radix, which is $2$ for the binary system and $10$ for the decimal system; and $e$ is the exponent. Both the mantissa and the exponent can take positive as well as negative values. Some numbers in floating point representation are:

$$1.22\times10^{-2},\ 43.2\times10^{3} \text{ etc.}$$
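Python's built-in floats are binary, so the same decomposition can be inspected with radix 2. The short sketch below uses `math.frexp` from the standard library, which returns a mantissa `f` with `0.5 <= |f| < 1` and an integer exponent `e` such that `x == f * 2**e`.

```python
import math

# Split a binary float into mantissa and exponent (radix r = 2).
f, e = math.frexp(120.0)
print(f, e)  # 0.9375 7

# Recombining the parts gives back the original number exactly.
assert f * 2**e == 120.0
```

Here 120 is stored as 0.9375 × 2^7, the binary counterpart of writing it as 0.12 × 10^3 in decimal.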

The floating point representation can only represent a discrete subset of the real numbers. The representable numbers are dense in the region from $-1$ to $1$ and sparse at the extreme ends. For example, if the mantissa is 1 digit long and the exponent is also 1 digit long, then the numbers that can be represented are:

$$-0.9\times10^9, -0.8\times10^9, \ldots, -0.1\times10^1, -0.9\times10^0, \ldots, 0, \ldots, 0.9\times10^0, 0.1\times10^1, \ldots, 0.8\times10^9, 0.9\times10^9$$

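We can see that it is impossible to represent $0.99\times10^9$: rounding it to one mantissa digit gives $1.0\times10^9 = 0.1\times10^{10}$, whose exponent no longer fits in 1 digit, so it causes overflow. Numbers like $0.12\times10^0$ can't be represented exactly either, because of the 1-digit mantissa.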
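The toy system is small enough to enumerate. The sketch below lists every value with a 1-digit mantissa and a 1-digit exponent, assuming the exponent runs from -9 to 9 (the text does not state the exact range). It uses `fractions.Fraction` so that every value is exact; the function name `representable` is introduced here for illustration.

```python
from fractions import Fraction

def representable(exp_min: int = -9, exp_max: int = 9) -> list[Fraction]:
    """All values +/-0.d x 10^e with d in 1..9, plus zero (assumed exponent range)."""
    values = {Fraction(0)}
    for e in range(exp_min, exp_max + 1):
        for d in range(1, 10):
            v = Fraction(d, 10) * Fraction(10) ** e
            values.update((v, -v))
    return sorted(values)

vals = representable()
print(len(vals))                           # 343
print(max(vals))                           # 900000000
print(Fraction(99, 100) * 10**9 in vals)   # False: 0.99 x 10^9 is not representable
print(Fraction(12, 100) in vals)           # False: 0.12 x 10^0 needs 2 mantissa digits
```

Under these assumptions only 343 distinct real numbers exist in the system; everything else has to be rounded to one of them or reported as overflow.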

Similarly, the absolute error of representation is variable, while the relative error is always less than 100% (not considering overflow). For example, when we round off $0.89\times10^9$ to $0.9\times10^9$, the absolute error is $10^7$ and the relative error is about 1.1%. But when we round off $0.82\times10^2$ to $0.8\times10^2$, the absolute error is only $2$ while the relative error is about 2.4%.
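These error figures can be checked with a short sketch. It uses a `decimal.Context` with a precision of one digit to mimic the 1-digit mantissa; the round-half-even mode and the helper name `errors` are assumptions made for illustration.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# A context that keeps one significant digit, like the 1-digit mantissa.
ONE_DIGIT = Context(prec=1, rounding=ROUND_HALF_EVEN)

def errors(text: str) -> tuple[Decimal, Decimal, Decimal]:
    """Return (stored value, absolute error, relative error) for a decimal string."""
    exact = Decimal(text)
    approx = ONE_DIGIT.create_decimal(exact)  # rounds to one significant digit
    abs_err = abs(exact - approx)
    return approx, abs_err, abs_err / abs(exact)

for text in ["0.89E9", "0.82E2"]:
    approx, abs_err, rel_err = errors(text)
    print(f"{text}: stored as {approx}, abs error {abs_err}, rel error {float(rel_err):.2%}")
```

For the two values above, the absolute errors differ by a factor of five million ($10^7$ versus $2$), while the relative errors stay within the same order of magnitude, at roughly 1.1% and 2.4%.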

Characteristics of Floating Point Representation

1. Non Unique

In FP representation, a single number may be represented in many different ways. For example, the number $1.2\times10^2$ can be represented as $12\times10^1$ or $120\times10^0$ or $0.12\times10^3$. All of these are valid representations. To avoid confusion, floating point numbers are represented in normalized form like $0.12\times10^3$. One way of normalization is to represent the mantissa in the form $0.xaa$, where $x$ is non-zero, and then adjust the exponent for accurate representation.
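A minimal sketch of this normalization, using the digit tuple exposed by Python's `decimal.Decimal`. The function name `normalize` and the string output format are choices made for illustration; special values such as NaN and infinity are not handled.

```python
from decimal import Decimal

def normalize(text: str) -> tuple[str, int]:
    """Return (mantissa, exponent) with the mantissa written as 0.d1d2... and d1 != 0."""
    # Decimal.normalize() strips trailing zeros, e.g. 120 -> 1.2E+2.
    sign, digits, exp = Decimal(text).normalize().as_tuple()
    if not any(digits):
        return "0.0", 0  # zero has no non-zero leading digit
    mantissa = ("-" if sign else "") + "0." + "".join(map(str, digits))
    # Moving the point in front of all digits raises the exponent by len(digits).
    return mantissa, len(digits) + exp

for s in ["1.2E2", "12E1", "120E0", "0.12E3"]:
    print(s, "->", normalize(s))  # every line ends in ('0.12', 3)
```

All four spellings of the same number collapse to the single pair (0.12, 3), which is the point of normalizing.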

2. Asymmetric Memory Usage

In FP representation, about half of the memory is used to represent numbers from $-1$ to $1$, and the other half is used to represent the rest of the range.

To understand this, let's assume that the numbers are always normalized. In normalized form, positive exponents represent numbers with a magnitude of at least $1$, while zero and negative exponents represent numbers with a magnitude smaller than $1$. We also know that roughly half of all possible exponents are non-positive. This directly implies that the count of numbers between $-1$ and $1$ is about half the count of numbers in the full range.
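The claim can be checked by counting in a toy decimal system. The sketch below assumes a normalized 1-digit mantissa and exponents from -9 to 9; both are assumptions chosen to keep the enumeration small.

```python
from fractions import Fraction

# Enumerate all normalized values +/-0.d x 10^e, plus zero (assumed toy system).
values = [Fraction(0)]
for e in range(-9, 10):
    for d in range(1, 10):
        v = Fraction(d, 10) * Fraction(10) ** e
        values += [v, -v]

inside = sum(1 for v in values if abs(v) < 1)
print(inside, len(values))            # 181 343
print(f"{inside / len(values):.1%}")  # 52.8%
```

Slightly more than half of the values fall strictly between -1 and 1, because exponent 0 and the value zero also land in that interval.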

3. Minimizes Relative Error

Floating point representation minimizes the relative error. This is because of fixed-sized mantissa and the use of exponent. The exponent helps maintain the magnitude of the number, while the fixed-sized mantissa truncates the number of significant digits.

Floating Point Arithmetic

Floating point numbers need to be stored in normalized exponent form. After any arithmetic operation, the result should also be in normalized form. We take two numbers $x=f_x\times r^{e_x}$ and $y=f_y\times r^{e_y}$ to demonstrate the algorithms.

Addition Algorithm

Our goal is to find $z=x+y$.

  1. Find exponent: $e_z=\max(e_x,e_y)$
  2. Shift: $f_x$ towards right by $e_z-e_x$ digits and $f_y$ towards right by $e_z-e_y$ digits.
  3. Add Mantissa: Set $f_z=f_x+f_y$.
  4. Normalize: if $|f_z|\geq1$, then right-shift $f_z$ by 1 digit and increase $e_z$ by 1.

Addition Example

Given $x=0.923\times 10^2$ and $y=0.824\times 10^1$, find $z=x+y$, provided the number of digits in the mantissa is 3 and the number of digits in the exponent is 1.

Solution
Given: $f_x=0.923,\ e_x=2$ and $f_y=0.824,\ e_y=1$
Find exponent: $e_z=\max(2,1)=2$
Shift: $f_x = f_x >> (e_z-e_x)=0.923>>0=0.923$
$f_y=f_y>>(e_z-e_y)=0.824>>1=0.082$
Add Mantissa: $f_z=f_x+f_y=0.923+0.082=1.005$
Normalize: Since $|f_z|\geq1$, $f_z=f_z>>1=0.100$ and $e_z=e_z+1=2+1=3$.

Therefore, $z = 0.100\times 10^3$
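The addition steps can be sketched in Python. To keep the digit arithmetic exact, this sketch stores a 3-digit mantissa as an integer, so `(923, 2)` stands for $0.923\times10^2$. The names `shift_right` and `fp_add` are introduced here for illustration, shifted-out digits are truncated as in the worked example, and exponent overflow is not checked.

```python
DIGITS = 3  # mantissa digits, as in the example

def shift_right(m: int, k: int) -> int:
    """Drop the last k digits of mantissa m (truncation toward zero)."""
    sign = -1 if m < 0 else 1
    return sign * (abs(m) // 10**k)

def fp_add(mx: int, ex: int, my: int, ey: int) -> tuple[int, int]:
    """Add 0.mx x 10^ex and 0.my x 10^ey; mantissas are 3-digit integers."""
    ez = max(ex, ey)                                          # 1. find exponent
    fz = shift_right(mx, ez - ex) + shift_right(my, ez - ey)  # 2-3. shift, then add
    if abs(fz) >= 10**DIGITS:                                 # 4. |f_z| >= 1
        fz, ez = shift_right(fz, 1), ez + 1
    return fz, ez

print(fp_add(923, 2, 824, 1))  # (100, 3), i.e. 0.100 x 10^3
```

With truncation the aligned mantissas are 0.923 and 0.082, their sum is 1.005, and one right shift gives 0.100 with exponent 3. The exact sum is 100.54, so the last digits are lost to the 3-digit mantissa.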

Subtraction Algorithm

Subtraction is similar to addition but the normalization will be different. Instead of the magnitude of mantissa getting bigger than 1, it can be smaller than 0.1 and thus normalization will be the opposite.

Our goal is to find $z=x-y$.

  1. Find exponent: $e_z=\max(e_x,e_y)$
  2. Shift: $f_x$ towards right by $e_z-e_x$ digits and $f_y$ towards right by $e_z-e_y$ digits.
  3. Subtract mantissa: Set $f_z=f_x-f_y$.
  4. Normalize: while $f_z\neq0$ and $|f_z|< 0.1$, left-shift $f_z$ by 1 digit and decrease $e_z$ by 1.

Subtraction Example

Given $x=0.881\times 10^2$ and $y=0.845\times 10^2$, find $z=x-y$, provided the number of digits in the mantissa is 3 and the number of digits in the exponent is 1.

Solution
Given: $f_x=0.881,\ e_x=2$ and $f_y=0.845,\ e_y=2$
Find exponent: $e_z=\max(2,2)=2$
Shift: $f_x = f_x >> (e_z-e_x)=0.881>>0=0.881$
$f_y=f_y>>(e_z-e_y)=0.845>>0=0.845$
Subtract mantissa: $f_z=f_x-f_y=0.881-0.845=0.036$
Normalize: Since $|f_z|< 0.1$, $f_z=f_z << 1 = 0.360$ and $e_z=e_z-1=2-1=1$.

Therefore, $z = 0.360 \times 10^1$
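A sketch of the subtraction steps follows, again storing a 3-digit mantissa as an integer so that `(881, 2)` stands for $0.881\times10^2$. The names `shift_right` and `fp_sub` are illustrative. The sketch repeats the left shift until the mantissa is normalized, which matters when more than one leading digit cancels, and it assumes both operands are positive so that no right shift is needed.

```python
DIGITS = 3  # mantissa digits, as in the example

def shift_right(m: int, k: int) -> int:
    """Drop the last k digits of mantissa m (truncation toward zero)."""
    sign = -1 if m < 0 else 1
    return sign * (abs(m) // 10**k)

def fp_sub(mx: int, ex: int, my: int, ey: int) -> tuple[int, int]:
    """Subtract 0.my x 10^ey from 0.mx x 10^ex; mantissas are 3-digit integers."""
    ez = max(ex, ey)                                          # 1. find exponent
    fz = shift_right(mx, ez - ex) - shift_right(my, ez - ey)  # 2-3. shift, then subtract
    while fz != 0 and abs(fz) < 10**(DIGITS - 1):             # 4. |f_z| < 0.1
        fz, ez = fz * 10, ez - 1                              # left shift by one digit
    return fz, ez

print(fp_sub(881, 2, 845, 2))  # (360, 1), i.e. 0.360 x 10^1
print(fp_sub(881, 2, 880, 2))  # (100, 0): two left shifts were needed
```

The zeros shifted in on the right carry no information, which is why subtracting nearly equal numbers loses significant digits.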

Multiplication Algorithm

Our goal is to find $z=x \times y$.

  1. Find exponent: $e_z=e_x+e_y$
  2. Multiply mantissa: $f_z=f_x*f_y$
  3. Normalize: if $|f_z|< 0.1$, then left-shift $f_z$ by 1 digit and decrease $e_z$ by 1.

Multiplication Example

Given $x=0.124\times 10^2$ and $y=0.702\times 10^{-1}$, find $z=x\times y$, provided the number of digits in the mantissa is 3 and the number of digits in the exponent is 1.

Solution
Given: $f_x=0.124,\ e_x=2$ and $f_y=0.702,\ e_y=-1$
Find exponent: $e_z=2+(-1)=1$
Multiply mantissa: $f_z=f_x*f_y=0.124*0.702=0.087048$
Normalize: Since $|f_z|< 0.1$, $f_z=f_z << 1 = 0.870$ and $e_z=e_z-1=1-1=0$.

Therefore, $z = 0.870\times 10^0$
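The multiplication steps can be sketched the same way, with a 3-digit mantissa stored as an integer so that `(124, 2)` stands for $0.124\times10^2$. The name `fp_mul` is illustrative, extra product digits are truncated as in the worked example, and exponent overflow is not checked.

```python
DIGITS = 3  # mantissa digits, as in the example

def fp_mul(mx: int, ex: int, my: int, ey: int) -> tuple[int, int]:
    """Multiply 0.mx x 10^ex by 0.my x 10^ey; mantissas are 3-digit integers."""
    ez = ex + ey                                     # 1. find exponent
    prod = mx * my                                   # 2. product has 2*DIGITS fractional digits
    if prod != 0 and abs(prod) < 10**(2 * DIGITS - 1):  # 3. |f_z| < 0.1
        prod, ez = prod * 10, ez - 1                 # left shift by one digit
    sign = -1 if prod < 0 else 1
    fz = sign * (abs(prod) // 10**DIGITS)            # keep the leading 3 digits (truncate)
    return fz, ez

print(fp_mul(124, 2, 702, -1))  # (870, 0), i.e. 0.870 x 10^0
```

Because both mantissas are at least 0.1, their product is at least 0.01, so a single left shift is always enough to normalize the result.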

Division Algorithm

Our goal is to find $z=x/y$.

  1. Find exponent: $e_z=e_x-e_y$
  2. Divide mantissa: $f_z=f_x/f_y$
  3. Normalize: if $|f_z|\geq 1$, then right-shift $f_z$ by 1 digit and increase $e_z$ by 1.

Division Example

Given $x=0.628\times 10^2$ and $y=0.271\times 10^{-1}$, find $z=x/y$, provided the number of digits in the mantissa is 3 and the number of digits in the exponent is 1.

Solution
Given: $f_x=0.628,\ e_x=2$ and $f_y=0.271,\ e_y=-1$
Find exponent: $e_z=2-(-1)=3$
Divide mantissa: $f_z=f_x/f_y=0.628/0.271=2.317$
Normalize: Since $|f_z|\geq 1$, $f_z=f_z >> 1 = 0.231$ and $e_z=e_z+1=3+1=4$.

Therefore, $z = 0.231\times 10^4$
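A sketch of the division steps, with a 3-digit mantissa stored as an integer so that `(628, 2)` stands for $0.628\times10^2$. The name `fp_div` is illustrative. The quotient is computed with integer division so that extra digits are truncated as in the worked example, and exponent overflow is not checked.

```python
DIGITS = 3  # mantissa digits, as in the example

def fp_div(mx: int, ex: int, my: int, ey: int) -> tuple[int, int]:
    """Divide 0.mx x 10^ex by 0.my x 10^ey; mantissas are 3-digit integers."""
    if my == 0:
        raise ZeroDivisionError("division by a zero mantissa")
    sign = -1 if (mx < 0) != (my < 0) else 1
    ax, ay = abs(mx), abs(my)
    ez = ex - ey                               # 1. find exponent
    if ax >= ay:                               # 3. |f_x / f_y| >= 1: right shift
        fz = ax * 10**(DIGITS - 1) // ay       # quotient with one digit shifted out
        ez += 1
    else:                                      # quotient is already in [0.1, 1)
        fz = ax * 10**DIGITS // ay
    return sign * fz, ez

print(fp_div(628, 2, 271, -1))  # (231, 4), i.e. 0.231 x 10^4
```

Since both mantissas lie in $[0.1, 1)$, their quotient lies between 0.1 and 10, so at most one right shift is ever needed.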

Fixed Point vs Floating Point

The following table summarizes the differences between fixed-point and floating-point representations:

| Fixed Point | Floating Point |
| --- | --- |
| 1. A fixed number of digits is used to represent the fractional part of a real number. | 1. A fixed number of significant digits plus an exponent is used to represent a real number. |
| 2. It minimizes the absolute error of representation. | 2. It minimizes the relative error of representation. |
| 3. The absolute error of representation is constant, while the relative error is variable and can exceed 100%. | 3. Both the absolute and relative errors of representation are variable, but the relative error is always less than 100%. |
| 4. It can't represent very large or very small numbers. | 4. It can represent both very large and very small numbers. |
| 5. Used for financial and accounting purposes. | 5. Used for scientific calculations. |
| 6. Generally not available in hardware. | 6. Generally available in hardware, with standards such as IEEE 754. |
| 7. Example: $1.002$, $7.888$, $0.001$ etc. | 7. Example: $0.122\times 10^2$, $-0.38\times 10^{-8}$ etc. |