GPGPU-Assisted Nonlinear Denoising Filter Generation for Video Coding

Seishi Takamura and Atsushi Shimizu
NTT Corporation, Japan

Summary

State-of-the-art video coding technologies such as H.265/HEVC employ in-loop denoising filters. We have developed a new type of in-loop denoising filter using Genetic Programming (GP); the resulting filter is heavily nonlinear and content-specific. To speed up the evolution, GPGPU is used in the filter evaluation process. The proposed method yielded a better denoising filter in about one-hundredth of the time. A bit rate reduction of 1.492-2.569% was achieved against the H.265/HEVC reference software.

Copyright©2014 NTT corp. All Rights Reserved.

Video Coding Block Diagram

[Figure: encoder block diagram. Labels: Video Input, subtraction (−), Transform, Quantization, Entropy Coding, Compressed Bitstream, Inverse Quantization, Inverse Transform, addition (+), Denoising Filter (DF, SAO, ALF, etc.), Reconstructed Videos, Intra-frame Prediction, Inter-frame Prediction. Annotation: "Target of evolution".]

A Leap from Linear Denoising Filter

Linear filter: Decoded Frame (large distortion) → Restored Frame (less distortion)
Nonlinear filter (exp, cos, tan, sinh, log): Decoded Frame (large distortion) → Restored Frame (much less distortion)

Denoising Filter Support

[Figure: filter support around the target pixel p, row by row:]

                    p21
                p24 p19 p25
            p28 p15 p13 p16 p29
        p27 p10 p07 p05 p08 p11 p26
    p23 p14 p06 p02 p01 p03 p09 p17 p22
p20 p18 p12 p04 p00  p  q00 q04 q12 q18 q20
    q22 q17 q09 q03 q01 q02 q06 q14 q23
        q26 q11 q08 q05 q07 q10 q27
            q29 q16 q13 q15 q28
                q25 q19 q24
                    q21

Nodes used by our Filter

Terminal nodes
I: pixel value of p
Ixx: (pxx + qxx) / 2, Dxx: (pxx − qxx) / 2
Ils: least-square restored value, a linear combination of I, I00… I11 with an offset
x, y: horizontal and vertical coordinates of the pixel
value: immediate values such as "0.3"

Functional nodes
min, max, average, abs, /, *, +, −, exp, pow, log, sqrt, sin, cos, tan, asin, acos, atan, sinh, cosh, tanh, conditional branch

In addition, the following are defined:
and(a, b) := (a>=0 && b>=0) ? (a+b)/2 : −(|a|+|b|)/2
or(a, b) := (a>=0 || b>=0) ? (|a|+|b|)/2 : −(|a|+|b|)/2
xor(a, b) := (ab<=0) ? (|a|+|b|)/2 : −(|a|+|b|)/2

Serializations of a Tree

[Figure: expression tree. The root div has children add and 2.0; add has children sin and max; sin has child I20; max has children I01 and log; log has child 0.5.]

Normal expression (infix notation): (sin(I20) + max(I01, log(0.5))) / 2
Lisp S-expression (prefix notation): (div (add (sin (I20 ))(max (I01 )(log 0.5))) 2)
Reverse Polish notation (postfix notation): I20 sin I01 0.5 log max add 2.0 div

We used Reverse Polish notation (as described later).

The fitness function in the evolution is D + λR, where
D is the sum of squared errors between the filtered image and the original image,
R is the amount of tree information that represents the filter algorithm,
λ is the same Lagrange multiplier that the encoder uses during rate-distortion optimization.

GPGPU implementation

[Figure: layout of a 1024-byte array, from "Beginning of array" to "End of array". 1-byte instruction codes (div, imm, add, max, log, imm, I01, sin, I20) with markers (a) "Initial index position" and (b) "End of individual (Index=0)"; 4-byte immediate values ((float)0.5, (float)2.0) in region (c) "Immediate Values".]

We convert the tree into Reverse Polish Notation (RPN) prior to the evaluation. Linearized instructions are stuffed from the middle of the array (a) toward the beginning. Immediate values are extracted and stuffed from the end (c).
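The packing described above can be sketched in C as follows. This is only an illustration under stated assumptions, not the authors' implementation: the instruction codes (`OP_*`), the `Packed` struct, the helper names, the choice of `ARRAY_BYTES / 2` as position (a), and the rule that the first immediate occupies the last 4 bytes are all hypothetical. The slides specify only the 1024-byte array, 1-byte instructions, 4-byte immediates, and the two filling directions.

```c
#include <stdint.h>
#include <string.h>

#define ARRAY_BYTES 1024                 /* array size shown in the figure */
#define MID (ARRAY_BYTES / 2)            /* assumed position (a); the slide only says "the middle" */
#define IMM_BYTES ((int)sizeof(float))   /* 4 bytes per immediate in the figure */

/* Hypothetical 1-byte instruction codes; the actual encoding is not given. */
enum { OP_NONE = 0, OP_IMM, OP_ADD, OP_DIV, OP_MAX, OP_SIN, OP_LOG, OP_I01, OP_I20 };

typedef struct {
    uint8_t bytes[ARRAY_BYTES];
    int op_pos;    /* index of the last stored instruction (MID + 1 when empty) */
    int imm_pos;   /* offset of the last stored immediate (ARRAY_BYTES when empty) */
} Packed;

static void packed_init(Packed *p) {
    memset(p->bytes, OP_NONE, sizeof p->bytes);
    p->op_pos = MID + 1;        /* the first instruction will land at MID */
    p->imm_pos = ARRAY_BYTES;   /* the first immediate will land in the last 4 bytes */
}

/* Store one instruction, moving one byte toward the beginning of the array. */
static int put_op(Packed *p, uint8_t op) {
    if (p->op_pos == 0) return -1;                   /* instruction region is full */
    p->bytes[--p->op_pos] = op;
    return 0;
}

/* Store one immediate: the value goes to the end region, and an OP_IMM
   marker goes to the instruction region. */
static int put_imm(Packed *p, float v) {
    if (p->imm_pos - IMM_BYTES <= MID) return -1;    /* immediate region is full */
    if (p->op_pos == 0) return -1;                   /* no room for the marker */
    p->imm_pos -= IMM_BYTES;
    memcpy(&p->bytes[p->imm_pos], &v, sizeof v);
    return put_op(p, OP_IMM);
}

/* Pack the example tree in RPN order: I20 sin I01 0.5 log max add 2.0 div */
static int pack_example(Packed *p) {
    packed_init(p);
    if (put_op(p, OP_I20) || put_op(p, OP_SIN) || put_op(p, OP_I01) ||
        put_imm(p, 0.5f)  || put_op(p, OP_LOG) || put_op(p, OP_MAX) ||
        put_op(p, OP_ADD) || put_imm(p, 2.0f)  || put_op(p, OP_DIV))
        return -1;
    return 0;
}
```

Keeping instructions and immediates in separate regions of one fixed-size buffer means each individual occupies a single contiguous block, which is convenient when copying many individuals to GPU memory.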
The filter evaluation procedure is as follows:

    for (index = 0; index < array_length; index++) {
        switch (funcIDs[index]) {
        case add: a = pop(); b = pop(); push(a + b); break;
        case sin: a = pop(); push(sin(a)); break;
        case imm: push(imm); break;   /* imm: the next immediate value taken from region (c) */
        case I:   push(I); break;
        case I00: push(I00); break;
        …
        }
    }

Simulation Conditions

CPU: Intel Core i7-3960X Extreme Edition, C2 stepping
  Clock rate: 3.3 GHz
  Cores: 6 (one core is used for the CPU experiment)
  Hyperthreading: on
  Memory: 64 GB
OS: Ubuntu Linux 12.04.2 LTS x86_64 Desktop Edition
GPU: NVIDIA GeForce GTX 690
  CUDA capability: 3.0
  CUDA Cores: 1536
  GPU Clock rate: 1.020 GHz
  Global memory: 2048 MB
  L2 Cache Size: 512 KB
CUDA: Driver version: 5.0.35, x86_64
  SDK/Toolkit version: 5.0.35
C++ Compiler (as the backend for nvcc): Intel C++ Compiler version: 12.1.5 20120612

Video sequences used: BQTerrace (1920x1080), RaceHorses (416x240), BQMall (832x480)

CPU vs. GPU Comparison

Filter (of 121 nodes) evaluation time over BQMall (832x480)

|              | Time [sec] | Speed-up (vs. CPU) |
|--------------|------------|--------------------|
| CPU (1 core) | 0.336489   |                    |
| GPU          | 0.002674   | 125.8x             |

[Figure: Filter evolution speed for BQMall (832x480). Axes: Lagrangian (ticks 42.6 to 43.1) and evolution time [sec] (ticks 10 to 1e+06). Curves: CPU 1, CPU 2, GPU 1, GPU 2. Annotations: "Better fitness", "100x time difference".]

Coding Performance Comparison (vs. original H.265/HEVC)

BD-rate values apply to a whole sequence (all four QPs) and are listed in the QP 22 row.

| Sequence | QP | HM-7.2-3164 / ALF*: rate (a) [bits] | HM-7.2-3164 / ALF*: Y-PSNR [dB] | HM-7.2-3164 / ALF*: BD-rate vs. HM | LS filter**: Y-PSNR [dB] | LS filter**: BD-rate vs. HM | Proposal: filter info (R) [bits] | Proposal: total rate (a+R) [bits] | Proposal: Y-PSNR [dB] | Proposal: BD-rate vs. HM |
|---|---|---|---|---|---|---|---|---|---|---|
| BQSquare (ALF off) | 22 | 210,720 | 41.53 | | 41.54 | 0.135% | 626 | 211,346 | 41.71 | -1.492% |
| | 27 | 138,152 | 37.16 | | 37.17 | | 315 | 138,467 | 37.27 | |
| | 32 | 88,288 | 33.30 | | 33.33 | | 329 | 88,617 | 33.46 | |
| | 37 | 55,048 | 29.65 | | 29.70 | | 418 | 55,466 | 29.93 | |
| BQSquare (ALF on) | 22 | 210,944 | 41.53 | -0.022% | 41.54 | 0.28% | 520 | 211,464 | 41.69 | -1.437% (vs. ALF on), -1.455% (vs. ALF off) |
| | 27 | 138,352 | 37.16 | | 37.17 | | 445 | 138,797 | 37.30 | |
| | 32 | 88,504 | 33.33 | | 33.35 | | 279 | 88,783 | 33.48 | |
| | 37 | 55,392 | 29.71 | | 29.72 | | 315 | 55,707 | 29.95 | |
| RaceHorses (ALF off) | 22 | 174,448 | 42.19 | | 42.30 | -1.202% | 1195 | 175,643 | 42.47 | -2.569% |
| | 27 | 109,264 | 37.97 | | 38.10 | | 698 | 109,962 | 38.18 | |
| | 32 | 63,848 | 34.08 | | 34.21 | | 750 | 64,598 | 34.35 | |
| | 37 | 34,696 | 30.57 | | 30.71 | | 536 | 35,232 | 30.86 | |
| RaceHorses (ALF on) | 22 | 174,936 | 42.26 | -1.755% | 42.29 | 0.428% | 321 | 175,257 | 42.36 | -0.843% (vs. ALF on), -2.580% (vs. ALF off) |
| | 27 | 109,536 | 38.12 | | 38.14 | | 36 | 109,572 | 38.13 | |
| | 32 | 64,128 | 34.26 | | 34.26 | | 376 | 64,504 | 34.39 | |
| | 37 | 34,992 | 30.73 | | 30.74 | | 236 | 35,228 | 30.85 | |

Negative values mean better performance.
HM: H.265/HEVC reference software (used as an anchor)
*ALF: adaptive loop filter (state-of-the-art loop filter)
**LS filter: least-square filter. Filter info (R) = 448 bits
Example of Generated Filter

RaceHorses, QP=22, ALF-off, filter information (R) = 1,195 bits

(add (add (add (add (mul(I ) 0.932803332806 )(mul(I01 ) 0.087968140841 ))(add (mul(I02 ) −0.051799394190 )(mul(I00 ) 0.095137931406 )))(add (add (mul(I03 ) −0.050682399422 )(mul(I04 ) −0.040202748030 ))(add (mul(I05 ) −0.052293013781 ) (mul(ave(I02 )(tan (I12 ))) 0.017782183364 ))))(add (add (add (mul(I07 ) 0.025515399873 ) (mul(I08 ) 0.025515399873 ))(sub (mul(sin (atan(and (I09 )(I21 )))) 0.016251996160 )(mul(tanh(tanh(tanh(mul(I02 )(asin(log (sinh(sqr(div (mul(I05 ) (sqr(div (atan(mul(mul(asin(asin(sqr(I ))))(sqr(sqr(div (I05 ) (I13 )))))(sqr(div (sin (I19 )) (I01 )))))(sqr(I01 )))))(I03 )))))))))) 0.005235218443 )))(mul(I29 ) −0.005818639882 )))

Conclusion

A novel method for generating a denoising filter that enhances coding performance is proposed.
GPGPU accelerated the evolution by around 100 times compared with the CPU.
The generated filters outperformed both the least-square filter and the state-of-the-art filter, i.e., ALF.