DaalaMeeting20140422

# Meeting 2014-04-22

Mumble: mf4.xiph.org:64738

# Agenda

- reviews
- gpu status
- exhaustive BSS

# Attending

unlord, gmaxwell, jmspeex, derf

# prefilter work

- g: i restarted the training for leakage focused training and left it crunching. it finally got something that was better than what we had before. but very small differences. i haven't tried it in the codec yet.
- jm: just the prefilter, not the intra right?
- g: correct.
- d: when you say something better do you mean CG or CG + your metric?
- g: both. but it's only very slightly better. 0.001dB or something like that. the values aren't near the prior values, so maybe it will look different. it's been running for 2 weeks now on a 24 core system.
- jm: the leakage is just slightly better?
- g: correct. i'm not sure what i did before; i may have optimized for leakage or the derivative of the window before.

# reviews

assigned 253 to monty for investigation, but not committing.

# exhaustive BSS

- g: the goal was to determine how much our lack of improvement might have been due to bad BSS decisions by doing the decisions in a different way. i implemented a dumb brute force method that tries coding blocks in every size, measures the rate and distortion, and uses that to make the decisions. to make that tractable it can't be using the lapping for the decisions. because i can't use the lapping i can't use the intra prediction.
- jm: the intra sucks so much (it doesn't take into account TF and the magnitude).
- g: my experiment makes terrible decisions and does worse. it's not making any better decisions than having it turned off.
- jm: do the decisions look like the ones we have now?
- g: yes, but i would expect that.
- jm: when you start measuring RD curves, one thing i noticed is that with the existing code I've tuned it to be more aggressive at high rates, but while tuning i also noticed that in terms of RD curves one of the two (more aggressive or less aggressive) is preferred all the time. so you don't want to pay too much attention to the RD curves.
- g: how do you suggest comparing them if not the RD curves?
- jm: i don't know.
- g: part of the problem is ???. you end up only looking at a single image or two images at a single rate.
- jm: look at the rd curves and then look at the images just to validate. did you do rd curves yet?
- g: no, not yet. i was commenting to jack on how this wasn't looking particularly promising and thus lowering my estimation that BSS is responsible for the difference in coding performance. so while talking about that, one of the other candidates for explaining this was that maybe there is an anomaly in the tokenizer's behavior. we're trying a 16-ary huffman coding for the coefficients.
- jm: if you're going to do an order-0 tokenizer, why not just measure the entropy?
- g: we've done that, and i wanted something in the codec that i can use to catch other interactions.
- j: the goal was to make entropy coding more like jpeg to eliminate that as a source of error, just like we did with the rest of the codec.
- g: the problem is that the model ends up being rather large and may be overtrained. i'm not through doing it yet, so not sure if it's a real problem. i wanted some suggestions on how to pool if it is overtrained.
- jm: what do you mean by overtrained?
- g: if i train a different order-0 model for every coeff position, that ends up being a lot of potential values.
- jm: then you end up with exactly the order-0 entropy that i was measuring. if you do it on subset1 you won't be overtrained because i ran it on subset1 and it was giving me the same results. you'll only overtrain if you start modeling conditional entropy. if you're training each position individually you'll be fine. why huffman and not our entropy coder directly?
- g: i am using our entropy coder directly. I'm breaking it into a 6-way huffman table. huffman works just as well if your branching factor is not 2.
- jm: you're going to train them at what rate?
- g: at whatever rate i test them on.
- jm: for each rate you're going to retrain?
- g: sure. mostly i'm looking for separation result between different block sizes. if we don't see a big difference on different block sizes, then that's not very interesting at answering our question.
- jm: if you find something, which i doubt, i wouldn't commit this, but just figure out what is wrong and fix it.
- g: this is not something we'd use, just want to figure out where things are potentially wrong.
- j: where do we go next if we get a negative result from this?
- jm: move on to something else?
- g: do we take out larger blocks since we can't get them to work better?
- jm: they do work better but not in terms of our metrics.
- g: you are claiming you get visual improvements from them.
- jm: not huge but it's there.
- g: i haven't looked in a long time. we had some degradation in moving video with larger blocks, but i should look at that.
- jm: the BSS code was completely broken on inter.
- g: that might contribute.
- jm: i think the main improvement of 16x over 8x are in the cases where it's mostly just DC.
- u: other codecs use larger block sizes and they perform well with our metrics too.
- g: it makes a great big difference in H.264 if you turn on large blocks.
- d: large means 8x8. nathan is correct. when you turn on 16x and 32x and 64x in HEVC it makes a big difference in metrics. it diminishes as you get to larger blocksizes.
- jm: does it improve at low or high rates?
- d: i don't remember, but i think it was low rates.
- g: i wonder if our entropy model for DC is eating up the benefit.
- d: also we don't have a second order transform for DC yet.
- jm: that would go the other way around though
- d: that's further explanation for why we should be seeing something different than we are seeing.
- jm: i was trying to see if there were any gains to be had by doing AM and RDO on DC. i was also thinking that we need some kind of transform on the DC, but i wasn't quite sure how to do that.
- d: there are a number of things we can do. the obvious one is to TF DC up so we have one real DCT coeff per superblock. it's not always clear you want to do that. VP8 in particular has a second level transform for DC that did that (walsh-hadamard) on the DC coeffs for a macroblock. but it did not run when that macroblock was using 4x4 level prediction modes for intra or inter.
- jm: they could have it off?
- d: you couldn't signal it, but if you signaled that the macroblock was using one of those modes, then it would turn it off. presumably under the theory that there's enough stuff going on in this macroblock that it's not a good idea to merge all this stuff together.
- jm: mostly my thought is that it's not so much about getting the CG itself from it, but doing the coding with some look ahead. right now i suspect our DC doesn't have proper noise shaping. for example, if there's a gradient we might be deciding to switch too early or too late. we should be making more global decisions instead of one at a time.
- d: we certainly have lookahead to the end of the current superblock right?
- jm: not in the current code. it becomes more complicated because you have inter in the loop.
- d: fair enough, then we'd have to delay quantization decisions.
- jm: right now we have DC such that we don't have to do things lots of times, etc.
- d: HEVC had lots of similar proposals and none of them made it in.
- jm: one thing i was thinking of, but it still has pieces missing, is to code the DC first (without intra) and once you've done that attempt to reuse that information in the intra predictor.
- d: i have no idea how that would work.
- g: that's interesting. you can learn about the consistency of the image that way from the DC data. one of the challenges in all of this...
- d: the biggest gains for intra right now are on DC
- jm: yeah, but we don't have a transform.
- d: the biggest gains for intra in VP8 are in DC and it *does* have a transform.
- jm: how do they do it? they transform the DC residual?
- d: yep
- jm: how do they handle that it can change? do they do two searches?
- d: the only time they do the transform is when the intra is over the whole macroblock. there is no coded residual.
- jm: oh, that is simple, but not really enough for us.
- d: i'll let you chew on that for a bit.
- jm: i think our intra sucking is not helping.
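
The brute force decision g describes (code each block at every size, measure rate and distortion, pick the cheapest) can be sketched roughly as below. This is only an illustration under stated assumptions, not the code from the experiment: the toy coder `code_block`, its rate proxy, the one-bit split flag, the quadtree search `best_split`, and the `step` and `lam` parameters are all hypothetical. As in the experiment, there is no lapping and no intra prediction.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n list of rows."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def code_block(block, step):
    """Toy coder (hypothetical): DCT, uniform quantization, reconstruct.

    Returns (rate_bits, sse). The rate is a crude proxy, not a real
    entropy coder: 1 bit per coefficient plus 2*log2(1 + |q|).
    """
    n = len(block)
    c = dct_matrix(n)
    ct = [list(r) for r in zip(*c)]
    coeffs = matmul(matmul(c, block), ct)
    q = [[round(v / step) for v in row] for row in coeffs]
    recon = matmul(matmul(ct, [[v * step for v in row] for row in q]), c)
    sse = sum((a - b) ** 2
              for ra, rb in zip(block, recon) for a, b in zip(ra, rb))
    bits = sum(1.0 + 2.0 * math.log2(1 + abs(v)) for row in q for v in row)
    return bits, sse

def sub_block(img, x, y, n):
    """Extract the n x n block whose top-left corner is (x, y)."""
    return [row[x:x + n] for row in img[y:y + n]]

def best_split(img, x, y, n, step, lam, min_n=4):
    """Exhaustive quadtree search minimizing D + lam * R.

    Returns (cost, tree), where tree is n if the block is coded whole,
    or a list of four subtrees if it is split.
    """
    bits, sse = code_block(sub_block(img, x, y, n), step)
    if n <= min_n:
        return sse + lam * bits, n
    whole = sse + lam * (bits + 1.0)  # +1 bit for an assumed split flag
    h = n // 2
    parts = [best_split(img, x + dx, y + dy, h, step, lam, min_n)
             for dy in (0, h) for dx in (0, h)]
    split = sum(cost for cost, _ in parts) + lam * 1.0
    if whole <= split:
        return whole, n
    return split, [tree for _, tree in parts]

# Usage: a flat 16x16 block, where one large transform should be cheapest.
flat = [[100.0] * 16 for _ in range(16)]
cost, tree = best_split(flat, 0, 0, 16, step=8.0, lam=1.0)
print(tree)
```

The search codes every block at every level of the quadtree, which is why it is only tractable when each block can be coded independently of its neighbors.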

# gpu work

- u: i have code that does everything but the subset sum right now. i got stuck for a bit measuring how quickly i can move data from cpu to gpu. apparently there are faster ways of doing it than loading textures, but we do it that way because we want to support v1.3 fragment shaders. i don't think we'll have the speed other people have gotten on jpeg decode.
- d: are you de-zigzagging on the gpu yet? are you uploading tokens or coeffs?
- u: coeffs. I found a paper where someone was uploading jpeg in a packed format instead of RLE tokens. you could pack 3 coeffs in 24 bits.
- d: i haven't spent a long time thinking about what you would do there. my thought was that you'd want to do a minimal transformation on the cpu because that's your bottleneck. you can upload data at the speed of the bus, which has a pretty decent throughput. you don't have a lot of cycles to do data transformation during upload.
- u: they are using something crazy with opencl. i thought maybe it would get rid of the problem i was seeing with near lossless jpeg, where you have many coeffs that are not zero, so i thought that was interesting.
- d: perhaps. i'll look it over.
- u: i do not have a deblocking filter yet. i have a filter that converts ycbcr to rgb, but i don't have filters after that.
- d: so you have transforms and color conversion, but not deblocking, dc prediction, and dezigzag.
- u: yes. having never done opencl, will webcl be something that exists soon?
- d: probably not.
- u: one data point i got out of this paper is they were getting 5.5k HD frames per second using their jpeg codec. i'm not getting close to that at hd sizes.
- d: how not close?
- u: i think i'm in the hundreds.
- d: are they using a comparable gpu?
- u: probably not. that's why i was trying to measure bus speed.
- d: pci-e bus speed you should be able to look up. i think the first order of business is to get more stuff over to the gpu, then worry about making it fast.
- jm: what's the advantage if you can offload 50% of the codec to the gpu?
- d: the advantage is that we can do the stuff in the gpu in webgl and it's still fast. and we can take advantage of gpu hardware that exists in mobile.
- jm: if all you can offload is 50% is that worth it?
- d: at 50% probably not, but i don't know. if i can only offload 50% and run both at full speed, then i've doubled my processing power. i've probably done bad things to my performance footprint but maybe it can play in some places you couldn't before.
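
For reference, the de-zigzag step d asks about can be sketched on the CPU as below. This assumes the standard 8x8 JPEG zigzag scan, which the notes do not state explicitly. It is Python rather than shader code, and the function names `zigzag_order` and `dezigzag` are made up for illustration. On the GPU the same mapping would presumably become a small lookup table indexed per coefficient.

```python
def zigzag_order(n=8):
    """Raster index visited at each zigzag scan position of an n x n block."""
    order = []
    for s in range(2 * n - 1):  # s = row + col selects one anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:
            rows = reversed(rows)  # even diagonals run from bottom-left up
        order.extend(r * n + (s - r) for r in rows)
    return order

def dezigzag(scan, n=8):
    """Place coefficients given in scan order back into an n x n raster block."""
    order = zigzag_order(n)
    block = [0] * (n * n)
    for pos, value in enumerate(scan):
        block[order[pos]] = value
    return [block[r * n:(r + 1) * n] for r in range(n)]

# Usage: label each coefficient with its scan position and view the raster layout.
for row in dezigzag(list(range(64))):
    print(row)
```

Because the table depends only on the block size, it can be computed once and uploaded alongside the coefficients, which keeps the per-block CPU work minimal.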