# Meeting 2013-09-24

## Agenda

- grant
- unlord's training results
- getting some other eyes on gmaxwell's 32x32 code ( https://review.xiph.org/65 )
- monty's latest 'I'm so confused', but at least it probably explains unlord's TF2 training results

## Attending

derf, gmaxwell, jack, jmspeex, unlord, xiphmont

## grant

(discussion of research grant)

## unlord's training results

- unlord: looking at the graphs, it was dramatically better than subset1. when it got to the end it went right to zero.
- derf: we didn't save any of the intermediate steps did we?
- unlord: i don't believe so. a good approach, though, would be to start from the intermediate steps and work backwards.
- gmaxwell: my copy of the training tool saves intermediate steps.
- derf: why isn't that in git?
- gmaxwell: i thought i was the only one who thought this was useful.
- jack: is it bad that everything went to zero, and do we have to start over?
- jm: you probably need most of subset3 for 16x16.
- derf: the whole point of this was we needed lots of data to prevent overfitting. as for going to zero, we knew that would happen at some point, but weren't sure where that point was. now we know where the cliff is.
- jm: how far is it from a reasonable number of coefficients?
- derf: look at the graph: http://people.xiph.org/~unlord/subset3_16x32_modes.png
- derf: it had dropped most of the coefficients before it fell off the cliff.
- jm: is it dropping them linearly?
- unlord: yes
- derf: that's much worse then.
- unlord: I considered dropping them weighted by the variance. if we run it again we may see something better.
- derf: we talked about switching at the end to something that is computationally tractable by doing something smarter.
- jm: maybe we can save a lot of cpu time by assigning modes upfront. doing a bit of training without decimating, and then freezing that. it will be a lot faster.
- gmaxwell: it does do that. it doesn't retrain after every drop; it batches (see the sketch at the end of this section).
- jm: you drop one coeff at a time?
- unlord: no, 256 at a time i think.
- jm: but you drop one at a time in terms of least squares fit?
- unlord: yes
- jm: i think you can safely drop a whole lot up front and it's only near the end that you need to be really careful.
- g: i don't think that's true but it's certainly something we can test.
- jm: how many coeffs do you have? 4*256^2?
- unlord: no. 16x16x4. it's 1024 multiplies at the end
- jm: you're trying to predict 256 coeffs from 1024 inputs each. so that's 4x256^2=262k.
- derf: that's where we start yes
- jm: i think if you pick the first 10k that had the least impact, it's quite unlikely that any of them should have been kept to the end.
- g: the problem with that is that the coefficients are collinear, so they start with small magnitudes, but if you drop some of them the rest end up with large magnitudes.
- jm: i think the first half would have very little impact.
- g: We should be cautious about taking this as fact because it's clearly not true when taken to an extreme
- unlord: one thing to do is to compare with 8x16 (tf from 4x8) and see if this is any better.
- jm: if you do all sorts of approximations, like dropping more coeffs at a time, these shortcuts will make training worse, but i expect they won't interact with the other stuff you'll try. if it takes 3 weeks you can't run enough experiments.
- x: are you searching on quantized coeffs?
- u: these are doubles
- x: the final target is 6 bits? 7 bits?
- derf: yeah. the question is how much do we need to not sacrifice too much performance
- g: if there are fewer mults it's easier to have higher precision.
- x: i had some thoughts on how not to get hung up on corners for low precision training. i'll try it on my stuff first. i have some neat ideas about machine search.
- derf: can you write them down
- x: i can't code things unless i write them down carefully first. the only reason i haven't done so yet is because it's not what i'm supposed to be working on. i'm supposed to be working on demo4.
- jm: what is demo4?
- x: chroma from luma. we had some demo stuff and i wasn't going into a hell of a lot of detail. we've got a pretty good conceptual handle on the other demos, and this one we aren't quite as specific about. my search thoughts are about how to constrain exhaustive search. it's all straightforward; i just wanted to see if it worked first.
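
A minimal sketch of the batched decimation loop described above: drop the least-useful multiplies in batches, refit the least-squares predictor, repeat. Everything here is illustrative rather than the actual training tool; the sizes are toy-sized, the data is synthetic, and the drop cost is a crude proxy for the true least-squares impact.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, n_obs = 64, 16, 5000        # toy sizes; the real case is ~1024 inputs -> 256 outputs
X = rng.standard_normal((n_obs, n_in))   # stand-in for causal-neighbor coefficients
W_true = rng.standard_normal((n_in, n_out)) * (rng.random((n_in, n_out)) < 0.3)
Y = X @ W_true + 0.1 * rng.standard_normal((n_obs, n_out))  # coefficients to predict

active = np.ones((n_in, n_out), dtype=bool)   # which multiplies are still in use

def refit(X, Y, active):
    """Least-squares fit of each output, restricted to its active inputs."""
    W = np.zeros((X.shape[1], Y.shape[1]))
    for j in range(Y.shape[1]):
        idx = np.flatnonzero(active[:, j])
        if idx.size:
            W[idx, j], *_ = np.linalg.lstsq(X[:, idx], Y[:, j], rcond=None)
    return W

batch = 32                           # drop this many multiplies per pass, then refit
while active.sum() > n_out:
    W = refit(X, Y, active)
    # rank remaining multiplies by a crude proxy for how much the squared error
    # grows if they are removed: the prediction energy each one contributes
    cost = np.where(active, W ** 2 * (X ** 2).mean(axis=0)[:, None], np.inf)
    active.flat[np.argsort(cost, axis=None)[:batch]] = False
    mse = ((X @ refit(X, Y, active) - Y) ** 2).mean()
    print("multiplies left: %5d  mse: %.4f" % (active.sum(), mse))
```

Dumping `active` and the refit weights after every pass is also the cheap way to keep the intermediate steps asked about above.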

## greg's 32x32 code

- g: i have a patch for 32x32 in the review tool and it's working fine except that it hurts performance. there are no obvious bugs, but i can't figure out why it doesn't help.
- derf: what is it now?
- g: maybe 10% larger files for same quality. the images look fine and it runs fine in lossless.
- derf: but 10% is a lot.
- jm: compared to what?
- g: compared to code in git on a few images.
- x: what kind of measurements?
- g: psnr and bi-i?
- jm: how do you adjust the quant step size for 32x32? are you using mixed block sizes or just everything 16 or 32?
- g: mixed block size
- jm: could it be an issue of 32x32 not having the right scale factors?
- g: i tried changing around the scaling and wasn't able to make it work better
- derf: could you make it work worse?
- g: yes. this wasn't an instance of running git stash before compiles :)
- g: i'd appreciate it if someone could look over the patch.
- x: how much intermediate stuff does it output?
- g: i cut the patch back so it's implementing only the simplest form, no prediction or anything
- jm: one test you could try is to take a synthetic gradient, something really smooth, and encode it with 16x16 forced and with 32x32 forced (see the sketch at the end of this section).
- g: that's a good idea, i can try that. i've been comparing with air force which has really smooth backgrounds.
- jm: for example making sure it's not the block switching making the wrong decisions
- derf: the point of using a smooth gradient is that you can just compute the right answer
- g: i don't expect completely flat to help because it's already predicted???
- jm: if you end up coding all zeros, you'd have 4x fewer zeros so it still should be smaller
- g: i hadn't tried it because i expected 16x to be smaller or tied in any case.
- jm: you could even take an actual image and blur it.
- derf: why were you setting runpvq to zero in your patch?
- g: to remove pvq as a source of problems
- derf: so you're testing both ways?
- g: when i'm speaking here i'm talking about all scalar comparison.
- j: who's going to review the patch?
- derf: i'm looking at it.
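
A quick way to make the synthetic test input suggested above: a smooth gradient written out as an 8-bit PGM (a heavily blurred photo would work just as well). This only generates the image; how the encoder is forced to all-16x16 or all-32x32 blocks depends on the build being tested.

```python
import numpy as np

w, h = 1024, 1024
x = np.linspace(0.0, 1.0, w)
y = np.linspace(0.0, 1.0, h)
img = 255.0 * 0.5 * (x[None, :] + y[:, None])        # smooth diagonal ramp
img = np.clip(np.rint(img), 0, 255).astype(np.uint8)

with open("gradient.pgm", "wb") as f:                # greyscale PGM, easy to convert and feed in
    f.write(b"P5\n%d %d\n255\n" % (w, h))
    f.write(img.tobytes())
```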

## monty's confusion

- x: last week i was supposed to look at the coding gain of dct + tf + 2nd stage tf to see how it compared to the dct of the native block size. i wanted to see how much more efficient 2nd stage tf was, and whether we could ditch bigger dcts entirely and just tf 4x4 and 8x8. i quantified just what the efficiency difference was. one of the unexpected things: if i took a 16x16 block and transformed it with 4x4 dcts and tfed up to 16x16, that had reasonable efficiency. however, if the 4x4s were lapped and tfed up to 16x16, the 2nd stage tf decreased coding gain. one thing i noticed: if i did 4x4s that were lapped as 16 and then tfed up, it was nearly identical. i have not chased this to its logical conclusion. i wonder if my 2nd stage filter is doing in the frequency domain what lapping does in the spatial domain. it would be useful to know that because it's an additional tool in the toolbox, but i didn't expect it to be a liability, and i didn't expect it to be similar to spatial lapping.
- derf: i'm confused about what you said was identical
- x: if i take 4x4 blocks that have not been lapped at 4x4 and tf them up to 16x16, my second stage TF filter produces coding gain X. if i take the 4x4 blocks and lap them as 4x4 and TF them up to 16x16 using single-stage TF, the coding gain and output turn out to be identical.
- d: they can't be because of blocking artifacts
- g: do you have artifacts?
- d: yes, he's not lapping outside, so at least 16x16 blocking artifacts.
- jm: but if it works inside, then you can apply the same reasoning to the lapping with outside blocks
- x: it appears the 2nd stage TF does something similar to 4x4 internal lapping. if we leave aside external lapping, looking only at a 16x16 block internally and at whether lapping is done within the block, it appears that the 2nd TF stage is doing something similar to 4x4 lapping.
- d: at this point i would spend some time with malvar's original papers on ???
- g: i recall there was a diagram in those papers that looks a lot like your improved TF thing. i'll see if i can find the paper i'm thinking of.
- x: i was going to chase that a little farther because it would be an interesting connection between those two things
- jm: one thing that could be interesting: couldn't you compute the matrix that's perfectly equivalent to the lapping in the spatial domain?
- x: yes. as well as the matrix that would compute the DCT.
- jm: maybe the most interesting thing to look at is the basis functions for the lapping and the TF 2nd stage. what they look like in the spatial domain and how close they are.
- x: i generated graphs to look at that but didn't look at it from this angle. it's also the case that if i want to look for a 2nd stage TF filter that improves the lapped transform, i'm going to have to do that directly. we know now the second stage TF is not going to improve the lapped transform as we currently have it. it's going to improve the transform when we aren't lapped. i'm not sure which is better, lapping or 2nd stage TF.
- jm: you mean more efficient computationally?
- x: yes. coding gain appears identical.
- jack: when do we not lap?
- x: we always lap. i've been exploring ???. if we could dispense with the DCT and use TF we'd gain some computational margin, or prediction could work from the exact transform instead of an approximation.
- un: i saw that at high rates it was a slight improvement, but at low rates it was a liability.
- x: when you were TFing to retrain the predictors, were you going 4x4 to 16x16? were they fully lapped?
- u: what i did was go between 4x4 and 8x8 and when i did the training I trained on having 8x8 lapped with 4x4.
- x: the second stage TF would have been a liability in this case. i did not realize that until the end of last week.
- u: we're not going to be lapping 4x4s with 8x8.
- x: unless we decided that variable block size was using variable lapping. that was one of the things i was exploring. it might not be a good idea.
- d: i don't see that being worth it. that starts to matter when we start to get beyond 16x16. that's why greg's doing 32x32 the way he's doing it. but i don't see it being useful at smaller sizes.
- x: how does the complexity of the lapping filter scale?
- d: linearly
- x: what about register pressure?
- d: sub-linearly
- g: way sub-linearly.
- x: there was the other observation that if it really is equivalent then does that enable any interesting hack where we can reuse traditional architecture?
- d: the thing that makes everything complicated is the lapping, not the block transform.
- x: does the support only matter to low frequencies? i had a theory that it mattered very little to high freqs because even textures are not so regular that you were going to get extra compaction in HF. that was false. the HF benefits from larger supports as well.
- d: jm has me wondering whether that's the right measure.
- jm: which?
- d: coding gain.
- jm: it's not.
- x: going by coding gain, it was the correct measure in this case because it's a theoretical upper bound. it's the most optimistic assessment of what we're getting out of it. the additional compaction from greater lapping is not primarily a benefit to DC and near-DC. assuming no bugs in my code.
- d: at least your result confirms what i suspected to be true.
- x: it's not the transform, it's the support. i was looking at whether we can reuse traditional architecture if ???. we can't reuse traditional dct infrastructure that way.
- x: it's also nice to know why JPEG-XR didn't work.
- d: begs the question of why they did that
- x: maybe they didn't test with anything meaningful. the visual output seems to confirm what you think is going to happen. DC has much larger support. it eliminates a lot of the blocking. but at the same time, it's not actually more efficient coding-wise. there's still the thing that the flat gradients in the background are a lot smoother. you haven't improved coding gain but have perhaps improved visual stuff. if you've decided this is better and build a test to confirm it, then you confirm it.
- jm: another thought i had: instead of designing the 2nd stage TF matrix, did you try actually using PCA?
- x: PCA?
- jm: KLT
- x: i didn't try KLT
- jm: i'd be curious if the KLT looks like the 2nd stage you've designed. i suspect it might.
- x: how would i compare those?
- jm: you look at the resulting basis functions.
- x: the KLT basis functions will be different for every input
- jm: you train a KLT
- d: you can't train a KLT on one input you need a collection
- x: if you train on a bunch of inputs, then that bunch is your input
- d: but you're assuming it's representative of some population
- jm: if you compute the KLT for spatial-domain coefficients you find it's close to the DCT (see the sketch below).
- x: i have not done that but i have no shortage of things to try next.
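
A sketch of the last two threads: the coding-gain measure being debated and the KLT-vs-DCT comparison. Instead of real training data it uses a 1-D AR(1) covariance with correlation `rho` as the image model (an assumption, not Daala data), builds the orthonormal DCT-II matrix directly, and checks how closely the KLT basis functions line up with it.

```python
import numpy as np

N, rho = 16, 0.95
k = np.arange(N)
R = rho ** np.abs(np.subtract.outer(k, k))   # AR(1) source covariance, a stand-in for training data

# KLT basis: eigenvectors of the covariance, rows ordered by decreasing variance.
eigval, eigvec = np.linalg.eigh(R)
klt = eigvec[:, ::-1].T

# Orthonormal DCT-II matrix; each row is a basis function.
dct = np.sqrt(2.0 / N) * np.cos(np.pi * (k[None, :] + 0.5) * k[:, None] / N)
dct[0, :] /= np.sqrt(2.0)

# How close are the two bases? Each KLT row should match some DCT row up to sign.
match = np.abs(klt @ dct.T)
print(np.round(match.max(axis=1), 3))        # values near 1.0 mean nearly identical shapes

def coding_gain_db(T, R):
    """Transform coding gain for a transform T with unit-norm rows and source covariance R."""
    var = np.diag(T @ R @ T.T)               # per-coefficient variances
    return 10.0 * np.log10(var.mean() / np.exp(np.log(var).mean()))

print("DCT coding gain: %.3f dB" % coding_gain_db(dct, R))
print("KLT coding gain: %.3f dB" % coding_gain_db(klt, R))
```

The same `coding_gain_db` helper can be pointed at any other candidate basis (for example a spatial-domain matrix equivalent to lapping + DCT, or DCT + 2nd-stage TF), as long as its rows are kept close to unit norm.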