DaalaMeeting20141007

# Meeting 2014-10-07

Mumble:  mf4.xiph.org:64738

# Agenda

- reviews
- Reduce PVQ<->entropy coding dependencies on intra
- Decide the fate of deringing
- Scalar's gone, let's adapt to PVQ
- intern project brainstorm

# Attending

Nathan, Jean-Marc, Thomas, Tim

# Reviews

- t: We owe JM two, 478 and 479. I wanted you to look at 478, Nathan
- n: So, yeah, what do we want to do about intra?
- j: This is the simplest thing that I could make work
- n: It has some interesting properties, like it's all in the frequency domain but I don't know if we want to do something else that's more complicated
- t: You guys know more about what stuff works at this point than I do
- n: There was stuff I wanted to try, like adding TF. I also wanted to look at how much noref is used
- j: Oh, quite a lot
- n: Yeah, even when you pick H or V you still copy stuff from the other direction
- j: For the larger block sizes, you have several pure H or pure V bands that you can select with noref. But for the smallest 15-coeff band, you can't pick which one to use with noref
- j: for the lower-frequency quadrant, the reasoning is that if I copy both the horizontal energy and the vertical energy from the block on the left, one of the two components will hurt the other one, it is going to be useless, and I code a noref
- n: what if the diagonal had more energy than the horizontal and vertical, should you not code either one of the bands?
- j: maybe, but it's only going to cost you a noref
- j: you should probably try these other experiments because at the rate I did the experiments I am not willing to say that none of them are better.
- j: The reason I submitted this patch was that it filled the most bands, and the reason for that is because of the interactions between PVQ and entropy decoding
- j: right now, what's in master has CfL, which means that with CfL or even the simple intra predictors, our decoder is essentially single threaded.  If you run it on a machine that has an infinite number of CPUs you could probably get something like a 10-15% speedup, assuming you're not using slices or stuff.
- j: if we were to use robust bit stream, we would be able to have entropy decoding in a thread and have all the pvq running in parallel
- d: it looks like that was costing us a percent and a half of rate
- j: it may be possible to do better, this is the really early version I have
- d: but does that mean we could use *any* intra prediction strategy we wanted and it would not affect entropy coding?
- j: you could use any intra prediction you want
- d: okay good
- j: whatever you attempt to do with the entropy decoder does not depend on prediction, so you would be able to do whatever you want with prediction without messing up the entropy loop
- d: so we've made tradeoffs on the order of this before, the entropy encoder costs you about a percent just to avoid having divisions in it
- u: so the robust stream must cost us something?
- j: most of the cost from the robust bitstream is that if you code a theta, you cannot use it to change the probabilities of the gain
- d: I keep thinking there should be some way to model this distribution, something like the distribution for theta falls off a cliff, even if you don't know where the cliff is
- j: if you have a whole bunch of blocks that have a gain of 1 or gain of 2
- u: is there a theoretical limit as to how much you lose by removing adaptivity in the entropy coding? 
- j: I would assume hardware would be an issue too, you would have these two pieces of silicon that would need to communicate between each other
- d: yes it would be an issue
- d: you essentially have the issue that the entropy decoder is serial while everything else is not, e.g. the iDCTs are massively parallel but only if you can feed them data fast enough
- j: although if we were to do something like spatial domain intra prediction, the iDCT needs to run in the coding loop
- d: so if we do something like the unlapping, that sits in a pretty similar place as HEVC
- j: in the current state, we already have CfL, so our blocks are not completely independent: you can run the DCT and the lapping filter fully parallel (n-squared parallel), but you cannot run the chroma in parallel
- d: I mean you can do the luma plane parallel
- j: as far as I can tell we're coding a superblock at a time, oh wait hold on, if we were to have a robust bitstream we would have to do the entropy coding for everything and yeah we could run luma PVQ and chroma PVQ completely parallel
- u: what do we think about entropy coding in wave-front order?
- d: I need to do more research on this before we make major changes to the bitstream
- d: so if we're going to talk about this on an actual CPU, the rough division of workload you want is 10,000 cycles, so you are not going to be able to do this at a superblock-by-superblock level.
- j: yeah I guess not, the way I see it you do not even need to have an actual synchronization point you just shove data in there
- d: the problem is you still have to bounce a cache line from one CPU to the other to know that the number of superblocks that have been decoded has just gone up
- d: yeah one line of superblocks at a time, now you are talking about something reasonable
- d: I just want to get a picture of what people have done in this space, because people have tried to do this
- j: I think the way we're structured, if you use the robust bitstream, you can make use of 3 to 4 CPUs: you would have 1 doing the entropy decoding.... I guess you would be bound by the entropy decoding.  If the entropy decoding is 1/4th of the total time you can get a 4x speedup
- d: this is something that depends on the bit rate, and unfortunately on the streams that you want to do this the most the bitstream decode is the largest component
- j: right now interframes, even if we don't enable the robustness should be fully parallelized
- d: yeah yeah yeah, parallelized the same way, you need the entropy decoder to finish
- j: the only stuff we have that beats
- d: I'm talking about inter frames
- j: I'm saying that if we had intra prediction that worked, it would work both on key frames and inter frames
- d: there are two different problems here, one is actually having an intra predictor, the second is how to use intra in an inter frame.  My point is we have to do something in the encoder search
- j: what I'm saying is, if the answer to the first one is noref, then it's all fully parallel
- u: the problem is with intra
- j: so now you are on item 4 of the agenda
- d: fast is what you want for motion search, SAD is good because it is fast and you can consider more candidates
- d: that's sort of what Theora did, right, Theora had a motion search that is similar to what we do in Daala.  It uses SAD to get some candidates and then uses some
- j: but like Theora didn't have intra prediction
- d: sure it did
- j: it just copied the DC
- d: some weighted sum of the DCs
- j: which I assume isn't much better than picking whatever was your last motion vector and coding whatever was your last DC
- d: I mean we did that on a block by block basis
- d: oh absolutely, when you do this motion vector stuff with a motion vector that makes no sense you mess up the A/C and then have to correct it all
- j: the current problem you have with no-ref, it won't go away
- n: On a different topic, as you descend down in Haar DC, sometimes you have both UR and BL
- j: No, you have a quadrant every time. There's no BL or UR or whatever, you have 4 inputs to your transform and 4 outputs. Once you're done encoding at the superblock level, it's just a quadtree.
- n: And you descend and do it again, and you carry along this horizontal and vertical gradient that use the U and L DCs to predict with, and in some cases you have UR and BL as well
- j: When I code the 8x8 blocks, the gradients I'm using are based on the differences _within_ the superblock. Every time I code something, I update the gradient for the lower levels.
- j: so de-ringing, I'm at least at a point where we need to make a decision on this.  I shouldn't be spending any more time on this until we decide if we are actually going to use it
- d: okay, so as I understand the major blocker is actually making the painting part performant.
- j: oh it's reasonable now
- d: so did you get the square roots out of it?
- j: no they are still there
- d: so I think you and I have different definitions of performant
- j: it's now being done only once or twice per pixel instead of 32x
- d: that is not fast enough
- j: whatever the actual speed I am getting is not actually slowing down the codec.  If that is the problem then I can remove them
- d: I need to sit down and actually have a look at it and get more familiar with it
- j: the bottom line is I'm not going to be spending any more time on this if whatever I do is not going to have any impact on if we use it
- d: ok
- j: I think I mentioned that the last problem I had was the strength thing, I think the best that I could come up with is having a per super block strength that changes per factor of two.
- j: the problem is that the block size decision isn't taking chroma into account
- d: is there some way to validate that?
- j: http://jmvalin.ca/video/dering/clovis0.png (without deringing)
- j: http://jmvalin.ca/video/dering/clovis1.png (with deringing, still some vertical ringing)
- d: we're now way over time
- j: I am just wondering how good it is to be more parallel
- d: it's pretty useful
- j: if we come out and say we are much more parallel than, say, 265, is that going to help us gain acceptance?
- d: it will probably have a small impact
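
The speedup estimate from the parallelism discussion (entropy decoding taking 1/4 of the total time, giving roughly a 4x bound with 3 to 4 CPUs) can be checked with a small back-of-the-envelope model. This is only a sketch under assumptions that were not stated in the meeting: one CPU is dedicated to the serial entropy decoder, the remaining work splits perfectly across the other CPUs, the two stages overlap as a pipeline, and synchronization cost is ignored. The function name `pipelined_speedup` is hypothetical.

```python
def pipelined_speedup(serial_fraction, n_cpus):
    """Idealized speedup bound for a decoder with one serial stage.

    serial_fraction: share of single-threaded decode time spent in the
        serial stage (e.g. entropy decoding), strictly between 0 and 1.
    n_cpus: total CPUs; one is assumed to be dedicated to the serial stage.
    """
    if not 0.0 < serial_fraction < 1.0:
        raise ValueError("serial_fraction must be strictly between 0 and 1")
    if n_cpus < 1:
        raise ValueError("n_cpus must be at least 1")
    if n_cpus == 1:
        return 1.0  # nothing to overlap on a single CPU
    # Time per frame (normalized) for the parallel work on the other CPUs.
    parallel_time = (1.0 - serial_fraction) / (n_cpus - 1)
    # In a pipeline, throughput is limited by the slower of the two stages.
    return 1.0 / max(serial_fraction, parallel_time)


if __name__ == "__main__":
    for cpus in (1, 2, 4, 16):
        print(cpus, pipelined_speedup(0.25, cpus))
```

Under this model the bound saturates at `1 / serial_fraction` once the parallel stage is no longer the bottleneck, which is why a larger serial share at low bit rates or on hard streams reduces the benefit of extra CPUs.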