
GPGPU-Assisted Nonlinear Denoising Filter Generation for Video Coding

Seishi Takamura and Atsushi Shimizu

NTT Corporation, Japan

Summary

State-of-the-art video coding technologies such as H.265/HEVC employ in-loop denoising filters.
We have developed a new type of in-loop denoising filter using Genetic Programming (GP); the resulting filter is heavily nonlinear and content-specific.
To speed up the evolution, GPGPU is used in the filter evaluation process.
The proposed method yielded a better denoising filter in roughly 1/100 of the time.
A bit rate reduction of 1.492-2.569% was achieved against the H.265/HEVC reference software.



Copyright©2014 NTT corp. All Rights Reserved.

Video Coding Block Diagram

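[Figure: block diagram of the video encoder. Blocks shown: Video Input, subtraction (−), Transform, Quantization, Entropy Coding, Compressed Bitstream, Inverse Quantization, Inverse Transform, addition (＋), Denoising Filter (DF, SAO, ALF, etc.), Reconstructed Videos, Inter-frame Prediction, and Intra-frame Prediction. The Denoising Filter block is marked as the target of evolution.]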



A Leap from Linear Denoising Filter

[Figure: two restoration pipelines compared.
Decoded Frame (large distortion) -> Linear filter -> Restored Frame (less distortion)
Decoded Frame (large distortion) -> Nonlinear filter (exp, cos, tan, sinh, log) -> Restored Frame (much less distortion)]



Denoising Filter Support

[Figure: filter support around the target pixel p. The neighbouring pixels are labelled p00 to p29 on one side of p and q00 to q29 on the other. In the figure's reading order the q labels appear in exactly the reverse order of the p labels, which is consistent with each qxx lying point-symmetric to pxx about p.]



Nodes used by our Filter

Terminal nodes


I: pixel value of p.

Ixx: (pxx + qxx) / 2.

Dxx: (pxx − qxx) / 2.

Ils: least-squares restored value, a linear combination of I, I00… I11 with an offset.

x, y: horizontal and vertical coordinates of the pixel.

value: immediate values such as “0.3”.
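
As an illustration of the paired terminals, the following minimal sketch computes Ixx and Dxx from one pixel's support. The Support struct, the function names, and the count of 30 pairs (taken from the labels p00 to p29 on the support slide) are assumptions made for this example, not the authors' data structures.

```c
#include <stddef.h>

enum { NUM_PAIRS = 30 };  /* assumed: pairs p00..p29 / q00..q29 */

/* Hypothetical container for one target pixel and its neighbour pairs. */
typedef struct {
    float center;        /* I: value of the target pixel p */
    float p[NUM_PAIRS];  /* pxx */
    float q[NUM_PAIRS];  /* qxx */
} Support;

/* Ixx = (pxx + qxx) / 2: mean of the pair */
float terminal_I(const Support *s, size_t xx) {
    return (s->p[xx] + s->q[xx]) / 2.0f;
}

/* Dxx = (pxx - qxx) / 2: half-difference of the pair */
float terminal_D(const Support *s, size_t xx) {
    return (s->p[xx] - s->q[xx]) / 2.0f;
}
```

Since Ixx + Dxx = pxx and Ixx − Dxx = qxx, the pair (Ixx, Dxx) carries the same information as (pxx, qxx).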

Functional nodes


min, max, average, abs, /, *, +, −,

exp, pow, log, sqrt, sin, cos, tan, asin, acos, atan,

sinh, cosh, tanh, conditional branch

In addition, the following are defined:

and(a, b):= (a>=0 && b>=0) ? (a+b)/2 : −(|a|+|b|)/2,

or(a, b):= (a>=0 || b>=0) ? (|a|+|b|)/2 : −(|a|+|b|)/2,

xor(a, b):= (ab<=0) ? (|a|+|b|)/2 : −(|a|+|b|)/2.
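
The three sign-based logic nodes above can be written directly in C. This is a sketch only: the names soft_and, soft_or, soft_xor and the helper absf are chosen for this example (and, or, xor are not usable as plain identifiers in every C/C++ setting), and single-precision floats are assumed.

```c
/* |v| without depending on libm (helper name is an assumption) */
static float absf(float v) { return v < 0.0f ? -v : v; }

/* and(a, b): positive mean only when both inputs are non-negative */
float soft_and(float a, float b) {
    return (a >= 0.0f && b >= 0.0f) ? (a + b) / 2.0f
                                    : -(absf(a) + absf(b)) / 2.0f;
}

/* or(a, b): positive when at least one input is non-negative */
float soft_or(float a, float b) {
    return (a >= 0.0f || b >= 0.0f) ? (absf(a) + absf(b)) / 2.0f
                                    : -(absf(a) + absf(b)) / 2.0f;
}

/* xor(a, b): positive when the signs differ (or either input is zero) */
float soft_xor(float a, float b) {
    return (a * b <= 0.0f) ? (absf(a) + absf(b)) / 2.0f
                           : -(absf(a) + absf(b)) / 2.0f;
}
```

In all three, the sign of the result plays the role of the truth value and the magnitude is the mean of the input magnitudes.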



Serializations of a Tree

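[Figure: expression tree. The root is div, with children add and 2.0. add has children sin and max; sin has child I20; max has children I01 and log; log has child 0.5.]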

Normal expression (or infix notation):


(sin(I20) + max(I01, log(0.5))) / 2

Lisp S-expression (or prefix notation):


(div (add (sin (I20 ))(max (I01 )(log 0.5))) 2)

Reverse Polish notation (or postfix notation):


I20 sin I01 0.5 log max add 2.0 div
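
The postfix form is a post-order traversal of the tree: children first, then the node itself. The sketch below shows this for the example tree; the Node struct and the function names to_rpn and example_rpn are hypothetical and only serve the illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical tree node: a label and up to two children. */
typedef struct Node {
    const char *label;
    const struct Node *child[2];
    int num_children;
} Node;

/* Appends the post-order (RPN) serialization of n to out.
   out must hold a NUL-terminated string and cap must be > 0. */
void to_rpn(const Node *n, char *out, size_t cap) {
    for (int i = 0; i < n->num_children; i++)
        to_rpn(n->child[i], out, cap);
    if (out[0] != '\0')
        strncat(out, " ", cap - strlen(out) - 1);
    strncat(out, n->label, cap - strlen(out) - 1);
}

/* Builds the example tree (sin(I20) + max(I01, log(0.5))) / 2.0
   and writes its RPN form into out. */
void example_rpn(char *out, size_t cap) {
    const Node i20  = {"I20", {NULL, NULL}, 0};
    const Node i01  = {"I01", {NULL, NULL}, 0};
    const Node half = {"0.5", {NULL, NULL}, 0};
    const Node two  = {"2.0", {NULL, NULL}, 0};
    const Node sin_ = {"sin", {&i20, NULL}, 1};
    const Node log_ = {"log", {&half, NULL}, 1};
    const Node max_ = {"max", {&i01, &log_}, 2};
    const Node add_ = {"add", {&sin_, &max_}, 2};
    const Node div_ = {"div", {&add_, &two}, 2};
    out[0] = '\0';
    to_rpn(&div_, out, cap);
}
```

Calling example_rpn with a 128-byte buffer yields the same token sequence as the Reverse Polish line above.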

We used Reverse Polish notation (as described later).
The fitness function in the evolution is D + λR, where
D is the sum of squared errors between the filtered image and the original image,
R is the amount of tree information that represents the filter algorithm, and
λ is the same Lagrange multiplier that the encoder uses during the rate-distortion optimization process.
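
A minimal sketch of this cost, assuming 8-bit samples stored in flat arrays; the function names and the sample format are assumptions for illustration. As is conventional in rate-distortion optimization, the sketch treats a smaller value as better.

```c
#include <stddef.h>

/* D: sum of squared errors between the filtered and the original samples. */
double sum_squared_error(const unsigned char *filtered,
                         const unsigned char *original, size_t n) {
    double d = 0.0;
    for (size_t i = 0; i < n; i++) {
        double e = (double)filtered[i] - (double)original[i];
        d += e * e;
    }
    return d;
}

/* Fitness D + lambda * R, with R given in bits of filter (tree) information. */
double rd_cost(double d, double lambda, double rate_bits) {
    return d + lambda * rate_bits;
}
```

For example, with D = 5, λ = 0.5 and R = 626 bits, the cost is 5 + 0.5 × 626 = 318. The value of λ here is arbitrary; the real one comes from the encoder.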



GPGPU implementation

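[Figure: memory layout of one individual in an array labelled 1024 bytes, drawn from the beginning of the array to its end. Function IDs occupy 1 byte each; the cells shown read, in the order drawn, div, imm, add, max, log, imm, I01, sin, I20, i.e. the RPN sequence in reverse. (a) marks the initial index position, (b) the end of the individual (index = 0), and (c) the immediate values, stored as 4-byte floats ((float)0.5, (float)2.0) at the end of the array.]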

We convert the tree into Reverse Polish Notation (RPN) prior to the evaluation.
Linearized instructions are stuffed from the middle of the array (a) toward the beginning.
Immediate values are picked out and stuffed from the end (c).
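
The packing described above might look like the following sketch. Several details are assumptions rather than facts from the slides: the start index (a) is taken to be exactly the middle of a 1024-byte array, the function ID values are invented, and the struct and function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

enum { ARRAY_BYTES = 1024, START_INDEX = ARRAY_BYTES / 2 };  /* (a) assumed */
enum { OP_IMM = 1, OP_ADD = 2 };  /* hypothetical 1-byte function IDs */

typedef struct {
    uint8_t bytes[ARRAY_BYTES];
    int next_op;   /* next free instruction slot; moves toward index 0 */
    int next_imm;  /* byte offset of the next free immediate; moves down from the end */
} Packed;

void packed_init(Packed *p) {
    memset(p->bytes, 0, sizeof p->bytes);
    p->next_op = START_INDEX;
    p->next_imm = ARRAY_BYTES - (int)sizeof(float);
}

/* Stores one instruction; returns 0 on success, -1 if the area is full. */
int push_op(Packed *p, uint8_t id) {
    if (p->next_op < 0) return -1;
    p->bytes[p->next_op--] = id;
    return 0;
}

/* Stores the value at the end of the array, then emits an imm instruction. */
int push_imm(Packed *p, float value) {
    if (p->next_imm <= START_INDEX || p->next_op < 0) return -1;
    memcpy(&p->bytes[p->next_imm], &value, sizeof value);
    p->next_imm -= (int)sizeof value;
    return push_op(p, OP_IMM);
}
```

Keeping the 1-byte IDs and the 4-byte immediates in separate regions of one array avoids unaligned float reads in the middle of the instruction stream; whether that was the authors' motivation is not stated.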


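The filter evaluation procedure is as follows:

    for (index = 0; index < array_length; index++) {
        switch (funcIDs[index]) {
        /* binary node: pop two operands, push the result */
        case add: a = pop(); b = pop(); push(a + b); break;
        /* unary node: pop one operand, push the result */
        case sin: a = pop(); push(sin(a)); break;
        /* immediate: push the next stored constant */
        case imm: push(<the value>); break;
        /* terminals: push the corresponding pixel-derived value */
        case I:   push(I);   break;
        case I00: push(I00); break;
        …
        }
    }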
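
For reference, a self-contained version of such a stack evaluator is sketched below. It is not the authors' kernel: the function IDs, the convention that IDs at or above FN_TERM0 denote terminals, the fixed stack depth, and the forward RPN ordering of the input are all assumptions, and only arithmetic nodes are included to keep the sketch dependency-free (sin, log and the like would be further cases calling <math.h>).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical function IDs; IDs >= FN_TERM0 push terminals[id - FN_TERM0]. */
enum { FN_IMM, FN_ADD, FN_SUB, FN_MUL, FN_DIV, FN_MAX, FN_TERM0 };
enum { STACK_DEPTH = 64 };  /* assumed upper bound on expression depth */

/* Evaluates n function IDs given in RPN order. Immediates are consumed from
   imm in order of appearance. The program is assumed to be well formed. */
float eval_rpn(const uint8_t *ids, size_t n,
               const float *imm, const float *terminals) {
    float stack[STACK_DEPTH];
    size_t sp = 0, next_imm = 0;
    for (size_t i = 0; i < n; i++) {
        uint8_t id = ids[i];
        if (id == FN_IMM) {
            stack[sp++] = imm[next_imm++];
        } else if (id >= FN_TERM0) {
            stack[sp++] = terminals[id - FN_TERM0];
        } else {
            float b = stack[--sp];  /* right operand: pushed last */
            float a = stack[--sp];  /* left operand */
            float r;
            switch (id) {
            case FN_ADD: r = a + b; break;
            case FN_SUB: r = a - b; break;
            case FN_MUL: r = a * b; break;
            case FN_DIV: r = a / b; break;
            default:     r = a > b ? a : b; break;  /* FN_MAX */
            }
            stack[sp++] = r;
        }
    }
    return stack[0];
}
```

As a usage example, the expression (I01 + max(I20, 0.5)) / 2.0 becomes the RPN sequence I01 I20 imm max add imm div; with I01 = 3.0 and I20 = 0.25 it evaluates to 1.75. Note that the pop order matters for non-commutative nodes such as sub and div: the value popped first is the right operand.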



Simulation Conditions

CPU: Intel Core i7-3960X Extreme Edition, C2 stepping
Clock rate: 3.3 GHz
Cores: 6 (one core is used for the CPU experiment)
Hyper-Threading: on
Memory: 64 GB
OS: Ubuntu Linux 12.04.2 LTS x86_64 Desktop Edition
GPU: NVIDIA GeForce GTX 690
CUDA capability: 3.0
CUDA cores: 1536
GPU clock rate: 1.020 GHz
Global memory: 2048 MB
L2 cache size: 512 KB
CUDA: Driver version: 5.0.35, x86_64
SDK/Toolkit version: 5.0.35
C++ compiler (as the backend for nvcc):
Intel C++ Compiler version: 12.1.5 20120612

Video sequences used

BQTerrace (1920x1080)
RaceHorses (416x240)
BQMall (832x480)



CPU vs. GPU Comparison

Filter (of 121 nodes) evaluation time over BQMall (832x480)

                Time [sec]    Speed-up (vs. CPU)
CPU (1 core)    0.336489
GPU             0.002674      125.8x
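
The speed-up figure is simply the ratio of the two evaluation times. A one-line sketch (the function name is an assumption):

```c
/* Speed-up of an accelerated run relative to a baseline run. */
double speedup(double baseline_sec, double accelerated_sec) {
    return baseline_sec / accelerated_sec;
}
```

With the measured times, 0.336489 / 0.002674 is approximately 125.8.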


[Figure: Filter evolution speed for BQMall (832x480). Horizontal axis: evolution time [sec], logarithmic, from 10 to 1e+06. Vertical axis: Lagrangian, from 42.6 to 43.1. Curves: CPU 1, CPU 2, GPU 1, GPU 2. Annotations: "Better fitness" and "100x time difference".]



Coding Performance Comparison (vs. original H.265/HEVC)

                         HM-7.2-3164 / ALF*          LS filter**       Proposal
Sequence             QP  rate (a)  Y-PSNR  BD-rate   Y-PSNR  BD-rate   filter info (R)  total rate (a+R)  Y-PSNR  BD-rate
                         [bits]    [dB]    vs. HM    [dB]    vs. HM    [bits]           [bits]            [dB]    vs. HM
BQSquare (ALF off)   22  210,720   41.53             41.54   0.135%    626              211,346           41.71   -1.492%
                     27  138,152   37.16             37.17             315              138,467           37.27
                     32  88,288    33.30             33.33             329              88,617            33.46
                     37  55,048    29.65             29.70             418              55,466            29.93
BQSquare (ALF on)    22  210,944   41.53   -0.022%   41.54   0.28%     520              211,464           41.69   -1.437% (vs. ALF on)
                     27  138,352   37.16             37.17             445              138,797           37.30
                     32  88,504    33.33             33.35             279              88,783            33.48   -1.455% (vs. ALF off)
                     37  55,392    29.71             29.72             315              55,707            29.95
RaceHorses (ALF off) 22  174,448   42.19             42.30   -1.202%   1195             175,643           42.47   -2.569%
                     27  109,264   37.97             38.10             698              109,962           38.18
                     32  63,848    34.08             34.21             750              64,598            34.35
                     37  34,696    30.57             30.71             536              35,232            30.86
RaceHorses (ALF on)  22  174,936   42.26   -1.755%   42.29   0.428%    321              175,257           42.36   -0.843% (vs. ALF on)
                     27  109,536   38.12             38.14             36               109,572           38.13
                     32  64,128    34.26             34.26             376              64,504            34.39   -2.580% (vs. ALF off)
                     37  34,992    30.73             30.74             236              35,228            30.85

Each BD-rate is computed over the four QPs of its block and is listed once per block. For the ALF-on blocks, the proposal has two BD-rates, one against ALF on and one against ALF off.
Negative values mean better performance.

HM: H.265/HEVC reference software (used as an anchor)

*ALF: adaptive loop filter (state-of-the-art loop filter)

**LS filter: least-squares filter. Filter info (R) = 448 bits



Example of Generated Filter

RaceHorses, QP=22, ALF-off, filter information (R) = 1,195 bits

(add (add (add (add (mul(I ) 0.932803332806 )(mul(I01 ) 0.087968140841 ))(add (mul(I02 ) −0.051799394190 )(mul(I00 ) 0.095137931406 )))(add (add (mul(I03 ) 
−0.050682399422 )(mul(I04 ) −0.040202748030 ))(add (mul(I05 ) −0.052293013781 ) 
(mul(ave(I02 )(tan (I12 ))) 0.017782183364 ))))(add (add (add (mul(I07 ) 
0.025515399873 ) (mul(I08 ) 0.025515399873 ))(sub (mul(sin (atan(and (I09 )(I21 )))) 
0.016251996160 )(mul(tanh(tanh(tanh(mul(I02 )(asin(log (sinh(sqr(div (mul(I05 ) 
(sqr(div (atan(mul(mul(asin(asin(sqr(I ))))(sqr(sqr(div (I05 ) (I13 )))))(sqr(div (sin 
(I19 )) (I01 )))))(sqr(I01 )))))(I03 )))))))))) 0.005235218443 )))(mul(I29 ) 
−0.005818639882 )))

Conclusion

A novel method for generating a denoising filter that enhances coding performance is proposed.
GPGPU made the evolution around 100 times faster than on the CPU.
The generated filters outperformed both the least-squares filter and the state-of-the-art filter, i.e., ALF.