# Fixed-point quantizer for video coding

A quantizer employs a scaled integral inverse ratio division for quantization of an input T by a quantization step Q. The quantizer forms an integral approximation q of 2r/Q by either trunc(2r/Q) or round(2r/Q). A multiplier multiplies the absolute value of T by the q. An adjustment factor is added alternatively to the absolute value of T prior to multiplication or to the product after multiplication. This adjustment factor minimizes errors near transition points in the quantization. This invention is applicable to both trunc(T/Q) and round(T/Q).

## Latest Texas Instruments Incorporated Patents:

**Description**

**TECHNICAL FIELD OF THE INVENTION**

The technical field of this invention is data quantizers particularly those used in video data coding.

**BACKGROUND OF THE INVENTION**

Quantization reduces the precision of input values. The lower precision outputs can be represented with fewer bits, thereby achieving data compression. Quantization is an important step in all video coding standards.

**100**. Quantization divides an integer input F by an integer quantization step Q. The quotient is an integer and a fractional part. The resultant A includes rounding the fractional part by a method based on the coding standard. The rounding conventions for several standards are shown in Table 1.

*a*illustrates the relationship between the input F, the quantization step Q and the output A using the truncation toward zero (A=trunc(F/Q)).

*b*illustrates the quantization error tr_err(F/Q). The quantization error tr_err(F/Q)=F/Q−trunc(F/Q).

Division takes a lot of computation compared to addition, subtraction and multiplication. Division is more complicated than these other operations when embodied in hardware circuits. The amount of computation is important in video coding, because a lot of data must be quantized. Thus quantization is often achieved indirectly. A common method replaces division by multiplication between the numerator and the reciprocal of the denominator. Moreover, the reciprocal of denominator is not represented as a floating point number but as a fixed-point number. In fixed-point representation the decimal point is implicitly placed between bits of the binary representation of a number. The decimal point position is selected depending on the required precision and range. With fixed-point representation, all arithmetic operations use integer arithmetic. This improves computation speed and reduces complexity.

Table 2 lists definitions of some functions used in this application for reference.

The program code listing below shows a common fixed-point implementation of the truncation quantizer trunc(F/Q). This implements integer division with truncation towards zero.

This code listing is valid for any integer T and positive integer Q. Table 3 shows a comparison between direct division and this fixed-point algorithm.

The fixed-point implementation is considerably faster than direct division. This fixed-point implementation has been used in many products and video end-equipment. This fixed-point implementation of quantizer will be used as the baseline quantizer in this application.

**SUMMARY OF THE INVENTION**

The baseline quantizer can cause deviation no matter how accurate the implemention. This deviation is the difference between direct division and multiplication by the scaled integer inverse of the quantization step. The fixed-point design of this invention computes an adjustment factor for use near transition points in the quantization output. This adjustment factor can completely eliminate the deviation. The improved quantizer of this invention may require slightly more computation power or slightly more complex hardware than the prior art implementation. The quantizer of this invention improves the picture quality compared to the baseline quantizer when used for H.263 (MPEG-4 type 2) quantization.

**BRIEF DESCRIPTION OF THE DRAWINGS**

These and other aspects of this invention are illustrated in the drawings, in which:

*a *illustrates the relationship between the input F, the quantization step Q and the output A using truncation toward zero;

*b *illustrates the quantization error;

**DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS**

**200** used to compute trunc(T/Q). This function is used in quantizing H.263 Intra AC and Inter DCT coefficients. Absolute value block **201** forms the absolute value of the input T. The result is a positive integer F. Inverting block **202** receives the quantization step Q and forms the inverse quantization step q**1**. This is scaled by a factor r. Thus the inverse quantization step q**1** is q**1**=trunc(2^{r}/Q), when truncation is used or by q**1**=round(2^{r}/Q), when rounding is used. Inverter block **202** calculates the fixed-point representation of 1/Q using r+1 bits to represent q**1** assuming Q≧1. Only r bits are needed if Q>1. This occurs because when Q=1, the output is always F and the quantizer merely passes through the input. Multiplier **203** forms the intermediate product a of F and q**1**. Scaling block **204** right shifts the product by r places forming A**2**. This recovers the scaling of inverting block **202**. Multiplier **205** forms the product of scaled quantity A**2** and the sign of the input T. The product result is the quantized input.

With F=abs(T), then A**1**=trunc(F/Q) is the quantization output using direct division. The purpose of quantizer **200** is to compute trunc(F/Q), but q**1**, the fixed-point representation of 1/Q, can be calculated by rounding or truncation. The following description investigates the error between A**1** and A**2** for all F and Q. Intuitively, any error should reduce for increases in r. This error cannot be completely eliminated because baseline quantizer **200** is sensitive to even tiny errors in intermediate results at certain transition points.

Let e**1** be the error in inverting block **202**. This is:

*e*1=2^{r}*/Q*−trunc(2^{r}*/Q*)

where: 0≦e**1**<1 when using truncation, and

*e*1=2^{r}*/Q*−round(2^{r}*/Q*)

where: −0.5≦e**1**<0.5 when using rounding. Absolute value block **201** introduces no error. Likewise, the integer multiplication of multiplier **203** introduces no error. Scaling block **204** may cause error. Let the error in scaling block **204** be e**2**. This error in scaling block **204** is the truncation error due to bits lost in the shift operation. Thus e**2**=tr_err(a/2^{r}). The baseline quantizer **200** computation can be represented as:

*q*1=2^{r}*/Q−e*1

*a=F*q*1

*A*2*=a/*2^{r}*−e*2=*a/*2^{r}*−tr*_{—}*err*(*a/*2^{r})

By substitution, we have:

The following derivation assumes that (abs(e**1**)*F)/2^{r}<1 or F<2^{r}, and abs(e**1**)<1:

*e*2*=tr*_{—}*err*(*a/*2^{r})

*a=F*(2^{r}*/Q−e*1)

Substituting for a in the equation for e**2**:

If e**1**<0, then

*e*2=*tr*_{—}*err*(*F/Q+−F*e*1/2^{r})

=*tr*_{—}*err*(*F/Q*)+−*F*e*1/2^{r}−1

when *tr*_{—}*err*(*F/Q*)+−*F*e*1/2^{r}≧1, or

=*tr*_{—}*err*(*F/Q*)+−*F*e*1/2^{r}, otherwise.

Thus, if e**1**<1, then:

*A*2=trunc(*F/Q*)*−k*, where

*k=*1 when *tr*_{—}*err*(*F/Q*)≧1*−F·e*1/2^{r}, or

k=0, otherwise.

Similarly, if e**1**>0, then

*A*2=trunc(*F/Q*)*−k*, where

*k=*1, when *tr*_{—}*err*(*F/Q*)<*e*1**F/*2^{r}

k=0, otherwise.

Determining the quantization error depends upon trunc(F/Q). According to the above equations, deviations occur when:

*e*1<0 and *tr*_{—}*err*(*F/Q*)≧1−(−*e*1·*F/*2^{r}), or condition I

*e*1>0 and *tr*_{—}*err*(*F/Q*)<*e*1·*F/*2^{r}. condition II

The resulting deviation is ±1. To prevent condition I, r can be selected such that:

*tr*_{—}*err*(*F/Q*)<1−(−*e*1**F/*2^{r})

It can be shown that tr_err(F/Q)≦1−1/Q. Thus to prevent condition I, r must be selected to make sure that:

2^{r}*>−e*1·*F·Q*, for all *F, Q.*

Thus the number of scaling bits r is selected equal to R, so that:

2^{R}*>F*_MAX·*Q*_MAX/2

where: F_MAX and Q_MAX are the maximums of F and Q, respectively. This eliminates any error due to condition I.

To eliminate condition II, select r such that:

*tr*_{—}*err*(*F/Q*)≧2*e*1**F/*2^{r}, when *e*1>0.

This equation implies that when e**1**>0, tr_err(F/Q)=0 (i.e. F=NQ for some integer N>0) and F≠0, the output A**2** of the baseline quantizer will not be equal to trunc(F/Q) however large is r. For example, suppose F=12, Q=6 and r=32.

*q*1=trunc(2^{r}*/Q*)=715827882, and

*A*2=12·715827882>>32=1

(since 12·715827882/2^{32}=1.999999998).

Thus A**2**=trunc(F/Q)−1 as indicated above.

However, r can be selected so that the probability of deviation is minimum but never zero. If tr_err(F/Q)≠0, then tr_err(F/Q)≧1/Q. Note that Q>1 for e1≠0. If r is selected so that:

2^{r}*≧F*e*1**Q*, for all *F, Q,*

then deviations occur only when F=NQ. This is the best we can achieve, and the probability of deviation is minimum.

In summary, if q**1** is calculated by rounding, then e**1**<0.5 and selecting r equal to R such that:

2^{R}*≧F*_MAX·*Q*_MAX/2

achieves the minimum probability of deviation. Similarly, if q**1** is calculated by truncation, then selecting r equal to R such that:

2^{R}*≧F*_MAX·(*Q*_MAX−1)

achieves the minimum probability of deviation.

If we pick an r satisfying the above equations, then deviations occur only when e**1**>0, tr_err(F/Q)=0 (i.e. F=N·Q for some N) and F≠0. In such cases the baseline output A**2** is trunc(F/Q)−1.

This invention proposes an improved fixed-point implementation of quantizer. A first embodiment of this invention uses truncation to calculate q**1**, so e**1**≧0. This first embodiment selects r satisfying the above conditions. Furthermore, this first embodiment sets:

*F′=F+m*, where

m=0, when e**1**=0, i.e. Q=2^{p }for some integer p≧0,

m=1, otherwise.

When e**1**=0, F′=F, then A**2**=trunc(F′/Q)=trunc(F/Q). When e**1**>0, F′=F+1, then:

It can be shown that:

*A*2=*F/Q+*1/*Q−*(*tr*_{—}*err*(*F/Q*)+1/*Q−*1)−*k,*

when *tr*_{—}*err*(*F/Q*)+1/*Q≧*1, and

*A*2=*F/Q+*1/*Q*−(*tr*_{—}*err*(*F/Q*)+1/*Q*)−*k,*

otherwise.

Note that tr_err(F/Q)+1/Q≧1 is equivalent to F=NQ−1 for some N. Since k=1, the above equation becomes:

When tr_err(F/Q)+1/Q≧1 is not true, then F≠NQ−1 and hence k=0. The above equation becomes:

Adding an adjustment m to F, yields trunc(F/Q) for all F and Q and eliminates all deviations.

**500**. Absolute value block **201** forms the absolute value of the input T resulting in positive integer F. Inverting block **202** receives the quantization step Q and forms the inverse quantization step q**1** scaled by a factor r. Inverting block **202** forms q**1**=trunc(2^{r}/Q) using truncation. Adjustment unit **503** forms the above derived adjustment m responsive to the quantization step Q. As described above: m=0 when e1=0, i.e. Q=2^{p }for some integer p≧0; and m=1 otherwise. Adder **504** adds m to F yielding F′. Multiplier **505** forms the intermediate product a of F′ and q**1**. Scaling block **204** right shifts the product by r places forming A**3**. Multiplier **205** forms the product of scaled quantity A**3** and the sign of the input T. The product result is the quantized input. The output A**3** of quantizer **500** is always the same as trunc(F/Q) with integer division of F by Q.

The addition of adjustment m and checking to determine if Q=2^{p }requires little overhead. For H.263 and MPEG-4 type 2 quantization such as for Intra AC or Inter coefficients, this test is required only once per macroblock so the overhead is negligible. For MPEG-1/2 and MPEG-4 type 1 quantization, Q=QP*W[i,j] for coefficient at location (i,j), where W[i,j] is the 8-by-8 quantization matrix. If W[i,j] is different at each location (i,j), then this test is required for 64 different values per macroblock when QP changes. In this case the improvement in picture quality may not justify the extra overhead. However, it is uncommon for W[i,j] to be different at each location. For example, the MPEG-1 default inter quantization matrix is W[i,j]=16 for all (i,j). So Q needs to be checked only once per macroblock and the overhead is negligible.

**600** which is an alternative design that eliminates all the deviation. Absolute value block **201** forms the absolute value of the input T resulting in positive integer F. Inverting block **202** receives the quantization step Q and forms the inverse quantization step q**1** scaled by a factor r. Inverting block **202** forms q**1**=trunc(2^{r}/Q) using truncation. Multiplier **203** forms the intermediate product a of F and q**1**. Adjustment unit **603** forms the above derived adjustment m**2** responsive to the quantization step Q. As described above: m**2**=0, when Q=2^{p }for some integer p≧0; and m**2**=q**1**, otherwise. Adder **605** sums the intermediate product a and the adjustment m**2**. Scaling block **204** right shifts the product by r places forming A**3**. Multiplier **205** forms the product of scaled quantity A**3** and the sign of the input T. The product result is the quantized input. The design illustrated in **200**.

The previous sections regarding **700**, which is a common fixed-point realization of the quantizer round(T/Q). Absolute value block **701** forms the absolute value of the input T. The result is a positive integer F. Inverting block **702** receives the quantization step Q and forms the inverse quantization step q**1**, which is q**1**=trunc(2^{r}/Q), if truncation is used or q**1**=round(2^{r}/Q), if rounding is used. Multiplier **703** forms the intermediate product b of F and q**1**. Scaling block **704** right shifts the product by r−1 places. Adder **705** adds one to the scaled resultant from scaling block **704**. Shift block **706** right shifts the sum by one bit. This completes recovery the scaling of inverting block **702**. The shift by r−1 bits, add one and shift by one bit forms a rounded quantity B**2** rather than a truncated quantity. Multiplier **707** forms the product of scaled quantity B**2** and the sign of the input T. The product result is the quantized input.

Let:

*B*1=round(*F/Q*)

that is, B**1** is the output from the implementation using division directly. It can be shown that the output from the baseline rounding quantizer B**2** is:

*B*2=round(*F/Q*)+*k*, where:

*k=*1 when *e*1<0, *M≦F/Q*<(*M+*1/2), and

*tr*_{—}*err*(2*·F/Q*)+(*F/*2^{r})*(−*e*1)>=1;

*k=−*1 when *e*1>0, (*M+*1/2)≦*F/Q<M+*1, and

*tr*_{—}*err*(2*·F/Q*)−(*F/*2^{r})**e*1<0; and

k=0 otherwise

for some integer M≧0. Error e**1** is due to fixed-point representation of 1/Q, i.e. e**1**=tr_err(2^{r}/Q) or rd_err(2^{r}/Q). Hence deviations occur when:

*e***1**<0, *M≦F/Q*<(*M+*1/2), and

*tr*_{—}*err*(2*·F/Q*)+(*F/*2^{r}) *(−*e*1)≧1 holds, or condition I

*e***1**>0, (*M+*1/2)≦*F/Q<M+*1, and

*tr*_{—}*err*(2·*F/Q*)−(*F/*2^{r})**e*1<0 holds. conidtion II

A sufficiently large r can prevent condition I. However, condition II can occur no matter how large r is. In particular, when e1>0, (M+1/2)≦F/Q<M+1 and tr_err(2·F/Q)=0, thus F/Q=(M+1/2), condition II will hold and deviation will occur for whatever r. Nevertheless, we can achieve the minimum probability of deviation by setting r=R such that:

2^{R}*≧F*_MAX**Q*_MAX

In this case deviations occur only when e1>0 and F/Q=(M+1/2).

An improved design can completely eliminate the deviation. Using truncation to calculate q**1** and selecting r to satisfy the equation above, set:

*F′=F+m*3, where

*m*3=0, when *e*1=0, thus *Q=*2^{p }for some integer *p≧*0, and

*m*3=1, otherwise.

This eliminates the deviation.

**800** for rounding. Absolute value block **701** forms the absolute value of the input T. The result is a positive integer F. Inverting block **702** receives the quantization step Q and forms the inverse quantization step q**1**, which is q**1**=trunc(2^{r}/Q) or q**1**=round(2^{r}/Q). Adjustment unit **803** forms the above derived adjustment m**3** responsive to the quantization step Q. As described above: m**3**=0, when e**1**=0, thus Q=2^{p }for some integer p≧0, and m**3**=1, otherwise. Adder **804** adds m**3** to F yielding F′. Multiplier **805** forms the intermediate product b of F′ and q**1**. Scaling block **704** right shifts the product by r−1 places. Adder **705** adds one to the scaled resultant from scaling block **704**. Shift block **706** right shifts the sum by one bit. Multiplier **707** forms the product of scaled quantity B**2** and the sign of the input T. The product result is the quantized input.

**900**, which is an alternative design that eliminates all the deviation. Absolute value block **701** forms the absolute value of the input T resulting in positive integer F. Inverting block **702** receives the quantization step Q and forms the inverse quantization step q**1** scaled by a factor r. Inverting block **702** forms q**1**, which is q**1**=trunc(2^{r}/Q) or q**1**=round(2^{r}/Q). Multiplier **805** forms the intermediate product result of F and q**1**. Adjustment unit **903** forms the above derived adjustment m**4** responsive to the quantization step Q. As described above: m**4**=0, when e**1**=0, thus Q=2^{p }for some integer p≧0, and m**4**=q**1**, otherwise. Adder **905** sums the intermediate product result and the adjustment m**4**. Scaling block **704** right shifts the product by r−1 places. Adder **705** adds one to the scaled resultant from scaling quantity B**2** and the sign of the input T. The product result is the quantized input. This alternative embodiment requires the same computation power as the baseline design **700**.

## Claims

1. A quantizer receiving an input T and a quantization step Q and forming a quantization output A in the form trunc(T/Q), the quantizer unit comprising:

- an absolute value unit receiving the input T and forming F the absolute value of T;

- an adjustment unit receiving the the quantization step Q and forming a correction factor m equal to 0 where Q=2p for some integer p≧0 and m equal to 1 otherwise;

- an adder connected to said absolute value unit and said adjustment unit and forming F′ the sum of F and m;

- an inverting unit receiving the quantization step Q and forming q an integer approximation of (2r/Q), where r is a positive integer;

- a first multiplier connected to said absolute value unit and said inverting unit and forming the product of F′ and q;

- a shifter receiving the product of F and q from said first product unit and forming A3 the product of F and q right shifted by r bits; and

- a second multiplier receiving the input T and connected to said shifter forming A the product of the sign of T and A3.

2. The quantizer of claim 1, wherein:

- said inverting unit selects r where 2r≧F_MAX*(Q_MAX−1), where F_MAX is the maximum possible value of F and Q_MAX is the maximum possible value of Q.

3. A quantizer receiving an input T and a quantization step Q and forming a quantization output A in the form trunc(T/Q), the quantizer unit comprising:

- an absolute value unit receiving the input T and forming F the absolute value of T;

- an inverting unit receiving the quantization step Q and forming q an integer quantity trunc(2r/Q), where r is a positive integer;

- a first multiplier connected to said absolute value unit and said inverting unit and forming the product of F and q;

- an adjustment unit receiving and the quantization step Q and forming a correction factor m2 equal to 0 where Q=2p for some integer p≧0 and m2 equal to q otherwise;

- an adder connected to said first multiplier and said adjustment unit and forming F′ the sum of F and m2;

- a shifter receiving the F′ from said first product unit and forming A3 as F′ right shifted by r bits; and

- a second multiplier receiving the input T and connected to said shifter forming A the product of the sign of T and A3.

4. The quantizer of claim 3, wherein:

- said inverting unit selects r where 2r≧F_MAX, where F_MAX is the maximum possible value of F and Q_MAX is the maximum possible value of Q.

5. A quantizer receiving an input T and a quantization step Q and forming a quantization output B in the form round (T/Q), the quantizer unit comprising:

- an absolute value unit receiving the input T and forming F the absolute value of T;

- an adjustment unit receiving the quantization step Q and forming a correction factor m3 equal to 0 where Q=2p for some integer p≧0 and m3 equal to q otherwise;

- a first adder connected to said absolute value unit and said adjustment unit and forming F′ the sum of F and m3;

- an inverting unit receiving the quantization step Q and forming q an integer quantity trunc(2r/Q), where r is a positive integer;

- a first multiplier connected to said absolute value unit and said inverting unit and forming the product of F′ and q;

- a first shifter receiving the product of F and q from said first product unit and forming A3 the product of F and q right shifted by r−1 bits;

- a second adder connected to said first shifter forming the sum of 1 and the product of F and q right shifted by r−1 bits;

- a second shifter connected to said second adder for right shifting the sum output of the second adder by one bit; and

- a second multiplier receiving the input T and connected to said shifter forming B the product of the sign of T and B3.

6. The quantizer of claim 5, wherein:

- said inverting unit selects r where 2r≧F_MAX*Q_MAX, where F_MAX is the maximum possible value of F and Q_MAX is the maximum possible value of Q.

7. A quantizer receiving an input T and a quantization step Q and forming a quantization output B in the form round(T/Q), the quantizer unit comprising:

- an absolute value unit receiving the input T and forming F the absolute value of T;

- an adjustment unit receiving the quantization step Q and forming a correction factor m4 equal to 0 where Q=2p for some integer p≧0 and m4 equal to q otherwise;

- an inverting unit receiving the quantization step Q and forming q an integer quantity trunc(2r/Q), where r is a positive integer;

- a first multiplier connected to said absolute value unit and said inverting unit and forming the product of F and q;

- a first adder connected to said first multiplier and said adjustment unit and forming F′ the sum of F and m4;

- a first shifter receiving the F′ from said first adder unit and forming A3 as F′ right shifted by r−1 bits;

- a second adder connected to said first shifter forming the sum of 1 and forming F′ right shifted by r−1 bits;

- a second shifter connected to said second adder for right shifting the sum output of the second adder by one bit; and

- a second multiplier receiving the input T and connected to said shifter forming B the product of the sign of T and B3.

8. The quantizer of claim 7, wherein:

- said inverting unit selects r where 2rF_MAX*Q_MAX, where F_MAX is the maximum possible value of F and Q_MAX is the maximum possible value of Q.

**Referenced Cited**

**U.S. Patent Documents**

6874007 | March 29, 2005 | Denk et al. |

20040086192 | May 6, 2004 | Togashi et al. |

**Patent History**

**Patent number**: 7155472

**Type:**Grant

**Filed**: Feb 11, 2003

**Date of Patent**: Dec 26, 2006

**Patent Publication Number**: 20040158595

**Assignee**: Texas Instruments Incorporated (Dallas, TX)

**Inventors**: Ngai-Man Cheung (Los Angeles, CA), Ho-Cheon Wey (Ibarki), Yuji Itoh (Ibaraki)

**Primary Examiner**: Tan V. Mai

**Attorney**: Robert D. Marshall, Jr.

**Application Number**: 10/364,636

**Classifications**

**Current U.S. Class**:

**Round Off Or Truncation (708/551)**

**International Classification**: G06F 7/38 (20060101);