Daala on Wheels
Daala is the current working name of a next-generation video codec, to be renamed once someone insists on something better. So far the best proposed alternative is PatentCake.
For now, the purpose of this page is to collect notes on topics raised in informal public IRC discussions about the next-generation initiative. Participants in these discussions have included Timothy Terriberry, Jason Garrett-Glaser, Loren Merritt, Ben Schwartz, Greg Maxwell, and others.
See also: https://xiph.org/daala/
Work in progress
We've been having weekly progress meetings on Mumble.
- 2012 June 4 minutes (actually a work week)
- 2012 June 22 minutes
- 2012 June 29 minutes recording
- 2012 July 6 minutes
- 2012 July 13 minutes
- 2012 July 20 minutes
- 2012 July 27 minutes
- 2012 August 3 minutes recording
- 2012 August 10 minutes
- 2012 August 17 - no meeting
- 2012 August 24 - minutes recording
- 2012 August 31 - no meeting
- 2012 September 7 - no meeting
- 2012 September 14 - no meeting
- 2012 September 21 - minutes
- 2012 September 28 - minutes recording
- 2012 October 5 - recording
- 2012 October 26 -
- 2012 November 2 - no meeting
- 2012 December 7 - minutes
- 2013 September 11 - logistics, status updates, coding party agenda
- 2013 September 17 - summit plans, research talks, coding party agenda, monty's TF ideas
- 2013 September 24 - training results, 32x32 code, and 2nd stage TF
- 2013 October 8 - pvq status, training, chroma from luma
- 2013 October 22 - even/odd quantizer, determinant 1, IETF, chroma from luma
- 2013 October 29 - gstreamer conf, ietf, chroma from luma
- 2013 November 12 - chroma from luma, pvq
- 2013 December 3 - chroma from luma, pvq, misc
The overall structure discussed so far is a block-based codec using a variable-size lapped DCT. Lapping is done via pre/post filtering with a specially structured (lifting) linear-phase transform along the block edges, combined with overlapped block motion compensation and the expected trimmings. The lapping can be optimized for energy compaction and other useful properties, including invertibility, and yields excellent results with efficient finite-precision math.
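The invertibility property of lifting structures can be illustrated with a generic two-step example. This is only a sketch: the function names and the coefficients (3/4 and 5/8) are hypothetical and are not Daala's actual pre/post filter. It shows why integer lifting stays exactly reversible even though each step rounds: every step changes one sample by a function of the other, and the inverse recomputes and subtracts the same rounded value.

```python
def lift_forward(a, b):
    """Two integer lifting steps (hypothetical coefficients, not Daala's)."""
    # Step 1: update b from a. The shift rounds, but the inverse
    # recomputes the identical rounded value, so nothing is lost.
    b = b - ((a * 3) >> 2)
    # Step 2: update a from the new b.
    a = a + ((b * 5) >> 3)
    return a, b


def lift_inverse(a, b):
    """Undo lift_forward: same steps in reverse order with opposite signs."""
    a = a - ((b * 5) >> 3)
    b = b + ((a * 3) >> 2)
    return a, b


# Round-trip check over a range of signed inputs.
for x in range(-64, 65):
    for y in range(-64, 65):
        assert lift_inverse(*lift_forward(x, y)) == (x, y)
```

Because reversibility comes from the structure rather than from the coefficient values, the coefficients can be tuned for energy compaction without giving up exact reconstruction.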
Other components which have been discussed include:
Techniques applicable to all frame types
- Multisymbol arithmetic coding
- Mode prediction using the previously decoded data, e.g. coding the mode using a probability function derived from trained predictors on the surrounding blocks.
- Explore Legendre polynomial basis transforms instead of DCT
- May have better perceptual properties and/or result in 'less compromised' efficient implementations.
- Coefficient domain prediction to allow efficient energy preserving quantization.
- Variable partition size/shape and the use of good predictors appear to remove most of the benefit of directional transforms.
- Perhaps 45deg is still useful?
- Transform-post filtering to allow merging smaller transform blocks (like TF merging in CELT) may allow more flexible partitioning than outright using mixed block sizes.
- Perturbed quantization mode-signalling has been discussed but mostly laughed at. ;)
- Special block modes well suited to solid color/cartoon like content— avoiding ringing.
- Are pixel prediction modes too slow?
- In general, which Markov random field techniques can be applied with acceptable performance? Any?
- Designed for parallel encode and decode within each frame
- Important because
- the proposed techniques need a lot more CPU than H.264 and VP8 for both encode and decode
- Moore's law for single-threaded throughput is dead. Future hardware is all multicore/GPU.
- Getting the order of application right for the lapping filters.
- Important because
- Using PVQ and energy conservation: see http://jmvalin.ca/video/video_pvq_v3.pdf
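The PVQ idea in the last item can be sketched as gain-shape quantization: the vector's gain (its norm) is coded separately, and its direction is approximated by an integer vector whose absolute values sum to K. The code below is a minimal illustration under stated assumptions. The names `pvq_quantize` and `pvq_reconstruct` are hypothetical, the greedy pulse search is one common approach and not necessarily the one used in the linked note, and the gain is left unquantized for simplicity.

```python
import math


def pvq_quantize(x, k):
    """Greedy search for an integer vector y with sum(|y_i|) == k
    that points in roughly the same direction as x."""
    n = len(x)
    mag = [abs(v) for v in x]
    y = [0] * n
    xy = 0.0  # running dot product <|x|, y>
    yy = 0.0  # running squared norm <y, y>
    for _ in range(k):
        best_j, best_score = 0, -1.0
        for j in range(n):
            # Squared correlation over energy if one more pulse goes at j.
            num = (xy + mag[j]) ** 2
            den = yy + 2 * y[j] + 1
            score = num / den
            if score > best_score:
                best_j, best_score = j, score
        xy += mag[best_j]
        yy += 2 * y[best_j] + 1
        y[best_j] += 1
    # Restore the signs of the input.
    return [-p if v < 0 else p for p, v in zip(y, x)]


def pvq_reconstruct(y, gain):
    """Scale the pulse vector so the output norm equals the coded gain."""
    norm = math.sqrt(sum(p * p for p in y))
    return [gain * p / norm for p in y] if norm > 0 else [0.0] * len(y)


# Hypothetical coefficient vector.
coeffs = [4.0, -3.0, 0.5, 0.0]
gain = math.sqrt(sum(c * c for c in coeffs))
pulses = pvq_quantize(coeffs, 4)
decoded = pvq_reconstruct(pulses, gain)
```

Since the decoder rescales the shape to the coded gain, the energy of `decoded` matches the energy of `coeffs` regardless of how coarse the shape quantization is, which is the energy-conservation property the item refers to.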
Techniques applicable to inter frames
- Using x264 as a test bed, Jason and Loren demonstrated 15% rate/distortion improvements from using 10-bit intermediaries and references, estimated as 1/3 from quality calculation in the 10-bit space, 1/3 from the higher-precision references, and 1/3 from higher intermediate precision in calculations (e.g. MC filter processing).
- Super-resolution techniques for motion-compensation references have been discussed— in particular it appears that the half-pel location is where intelligent filtering matters the most so staged computation could be effectively used to allow more expensive filtering at that level.
- Timothy has an example code base for a variable partition size blocking-free motion compensation scheme which merges OBMC (overlapped block motion compensation) and CGI (control-grid interpolation) with an interesting prediction/sub-division scheme and whole-frame trellis optimization of motion vectors. (daala-exp)
- YUV 4:4:4, 4:2:2, 4:2:0 subsamplings; 8-bit, 10-bit.
- Alpha channel — need testing material!
- 8-bit RGB compatible mode? (e.g. YCoCg, internally or at least flagging for it)
- Efficient 3D? — need testing material!
- Good support for decode side droppable frames?
- Optionally storing a checksum of the expected decoded frame for decoder/encoder mismatch detection.
- Expose the number of referential descendants of a given frame (or even the whole reference DAG) for most efficient allocation of FEC.
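The descendant count mentioned in the last item can be derived from the reference DAG with a simple traversal. The sketch below assumes a hypothetical representation in which each frame id maps to the list of frames it references, and every referenced frame also appears as a key. It is only an illustration of the bookkeeping, not a proposed bitstream or API design.

```python
def count_descendants(refs):
    """refs maps each frame id to the list of frame ids it references.

    Returns a dict mapping each frame id to the number of distinct frames
    that depend on it, directly or indirectly.
    Assumes every referenced frame is itself a key of refs.
    """
    # Invert the edges: for each frame, which frames reference it?
    dependents = {f: set() for f in refs}
    for frame, parents in refs.items():
        for p in parents:
            dependents[p].add(frame)

    memo = {}

    def reach(f):
        # Set of all frames that transitively depend on f.
        if f not in memo:
            out = set()
            for d in dependents[f]:
                out.add(d)
                out |= reach(d)
            memo[f] = out
        return memo[f]

    return {f: len(reach(f)) for f in refs}


# Hypothetical GOP: P1 references I0, P2 references P1,
# and B3 references both P1 and P2.
gop = {"I0": [], "P1": ["I0"], "P2": ["P1"], "B3": ["P1", "P2"]}
counts = count_descendants(gop)
```

A transport layer could then give more FEC to frames with higher counts, since losing them damages more of the stream. The recursive traversal is fine for short reference chains; a long chain would call for an iterative version.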
Crazy crap that might be interesting or at least fun to make fun of...
- Possible compromise: the video reference structure contains a backbone that can be decoded at only N bits of depth (e.g. 10), and higher precisions are only supported outside of this reference chain.
- Some high-end digital cameras operate JPEG derivatives in a special mode that keeps the image in the native linear RGB Bayer format in order to avoid lossy/slow demosaicing on the camera. In particular, this allows white balancing in post without excessive loss. Probably out of scope for Daala itself.
- Lossless intra-ability: The ability to losslessly rewrite any frame as an intra frame (perhaps with significant bitrate overhead) in order to make frame accurate cuts possible.
- Internal overlays which could be swapped without re-encoding? (e.g. advertising, station ID). Could also be automatically generated by a Sufficiently Advanced™ encoder to improve efficiencies for static sprites over moving backgrounds.
- Could be done externally to the video codec, but if so it's not likely to be useful for anyone ever.
- A secondary reference implementation in OpenCL, maintained throughout development, to make sure that the codec is GPU-friendly and can be done efficiently using OpenCL primitives.
- SWAR-friendly arithmetic. For example, choosing transform coefficients so that no intermediate product overflows 16 bits (tricky for signed values) can sometimes enable (e.g.) 4 parallel operations in one uint64_t. This can allow a pure C reference implementation to run faster, which is valuable for initial adoption and ports to new platforms.
- Parametric decode-side blur.
- Fancy block property prediction. (Not clear how these predictions interact with intra pred)
- Using kurtosis for detecting text in a frame
- The idea was to detect a Bernoulli distribution, but it's not robust and too noisy
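The kurtosis idea in the last item rests on a known property: a two-valued distribution with both values equally likely has a kurtosis of exactly 1, the lowest possible value, so sharply bimodal blocks (such as black text on white) stand out from smooth content. The sketch below uses hypothetical sample rows and makes no attempt to reproduce whatever detector was actually tried.

```python
def kurtosis(samples):
    """Population kurtosis (fourth standardized moment, not excess)."""
    n = len(samples)
    mean = sum(samples) / n
    m2 = sum((s - mean) ** 2 for s in samples) / n
    m4 = sum((s - mean) ** 4 for s in samples) / n
    if m2 == 0:
        return float("nan")  # flat block: kurtosis is undefined
    return m4 / (m2 * m2)


# Hypothetical 8-pixel rows: a two-level "text-like" row and a smooth ramp.
text_like = [16, 235, 16, 235, 235, 16, 16, 235]
ramp = [10, 20, 30, 40, 50, 60, 70, 80]

k_text = kurtosis(text_like)
k_ramp = kurtosis(ramp)
```

Here `k_text` sits at the minimum while `k_ramp` is noticeably higher. The statistic only reaches 1 when the two levels are balanced, and a fourth-moment estimate on a small block is sensitive to noise, which fits the observation above that the approach was not robust.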