Daala on Wheels
Daala is the current working name of a next-generation video codec, to be renamed once someone insists on something better. So far the best proposed alternative is PatentCake.
For now, the purpose of this page is to collect notes on topics raised in informal public IRC discussions about the next-generation initiative. Participants in these discussions have included Timothy Terriberry, Jason Garrett-Glaser, Loren Merritt, Ben Schwartz, Greg Maxwell, and others.
See also: https://xiph.org/daala/
We've been having weekly progress meetings on Mumble.
- 2012 June 4 minutes (actually a work week)
- 2012 June 22 minutes
- 2012 June 29 minutes recording
- 2012 July 6 minutes
- 2012 July 13 minutes
- 2012 July 20 minutes
- 2012 July 27 minutes
- 2012 August 3 minutes recording
- 2012 August 10 minutes
- 2012 August 17 - no meeting
- 2012 August 24 - minutes recording
- 2012 August 31 - no meeting
The overall structure discussed so far is a variable-size, lapped-DCT, block-based codec. Lapping is done via pre/post filtering with a specially structured (lifting) linear-phase transform along the block edges, combined with overlapped block motion compensation and the expected trimmings. The lapping can be optimized for energy compaction and other useful properties, including invertibility, and yields excellent results with efficient finite-precision math.
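The invertibility property of a lifting structure can be illustrated with a minimal sketch. The function names and the multipliers 13/32 and 11/16 below are hypothetical, chosen only for illustration; they are not Daala's actual lapping filter coefficients. Each step adds a rounded function of one sample to the other, so the inverse simply subtracts the same value in reverse order, and reconstruction is exact in integer arithmetic no matter how the rounding behaves.

```c
#include <assert.h>
#include <stdint.h>

/* One lifting increment, roughly x * num / 2^shift with rounding.
 * Division is used instead of >> so the result is well defined for
 * negative x. Any deterministic function would do: exact inversion
 * does not depend on how this value is rounded. */
static int32_t lift_term(int32_t x, int32_t num, int shift) {
    assert(shift > 0 && shift < 16);
    return (x * num + (1 << (shift - 1))) / (1 << shift);
}

/* Forward transform on a pair of samples: three lifting steps.
 * The multipliers are illustrative assumptions, not Daala's. */
static void lift_forward(int32_t *a, int32_t *b) {
    *b -= lift_term(*a, 13, 5);
    *a += lift_term(*b, 11, 4);
    *b -= lift_term(*a, 13, 5);
}

/* Inverse transform: the same steps in reverse order with flipped signs. */
static void lift_inverse(int32_t *a, int32_t *b) {
    *b += lift_term(*a, 13, 5);
    *a -= lift_term(*b, 11, 4);
    *b += lift_term(*a, 13, 5);
}
```

Calling lift_forward and then lift_inverse on any pair of samples small enough to avoid 32-bit overflow returns the original values exactly, which is the property that makes lossless operation possible with such a structure.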
Other components which have been discussed include:
Techniques applicable to all frame types
- Multisymbol arithmetic coding
- Timothy has some trial code showing speed-up proportional to the number of bits coded at once. (ec_test.c)
- Mode prediction using the previously decoded data, e.g. coding the mode using a probability function derived from trained predictors on the surrounding blocks.
- This will be terrible for robustness but may significantly reduce signalling overhead, allowing many more modes, and provide continuous adaptation between signalling free and fully signalled modes.
- Explore Legendre polynomial basis transforms instead of the DCT
- May have better perceptual properties and/or result in 'less compromised' efficient implementations.
- Coefficient domain prediction to allow efficient energy preserving quantization.
- Variable partition size/shape and the use of good predictors appears to remove most of the benefit of directional transforms.
- Perhaps 45deg is still useful?
- How does this change with partition sizes? Directional transforms are clearly not that useful with 4x4.
- Transform-post filtering to allow merging smaller transform blocks (like TF merging in CELT) may allow more flexible partitioning than outright using mixed block sizes.
- Perturbed quantization mode-signalling has been discussed but mostly laughed at. ;)
- Special block modes well suited to solid-color/cartoon-like content, avoiding ringing.
- Are pixel prediction modes too slow?
- In general, which Markov random field techniques can be applied with acceptable performance? Any?
- Designed for parallel encode and decode within each frame
- Important because
- the proposed techniques need a lot more CPU than H.264 and VP8 for both encode and decode
- Moore's law for single-threaded throughput is dead. Future hardware is all multicore/GPU.
- Getting the order of application right for the lapping filters.
- Mandatory slicing? Maybe some kind of multilevel entropy coding to reduce redundancy between slices while minimizing the single-threaded portion of decode.
- Important because
- Using PVQ and energy conservation: see http://jmvalin.ca/video/video_pvq_v3.pdf
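As a rough illustration of the PVQ item above: pyramid vector quantization codes the shape of a vector as integer pulses y with sum(|y[i]|) == k, while the gain (energy) is coded separately so that it can be preserved. The sketch below is a generic greedy pulse search, not necessarily the method in the linked document; the function name pvq_search and its interface are assumptions made for this example.

```c
#include <assert.h>

/* Absolute value without pulling in math.h. */
static float absf(float v) { return v < 0.0f ? -v : v; }

/* Quantize x[0..n-1] to integer pulses y with sum(|y[i]|) == k.
 * Returns 0 on success, -1 if x is all zero (no shape to code).
 * Hypothetical interface; a real codec would also code the gain. */
static int pvq_search(const float *x, int n, int k, int *y) {
    float l1 = 0.0f, xy = 0.0f, yy = 0.0f;
    int i, placed = 0;
    assert(n > 0 && k > 0);
    for (i = 0; i < n; i++) l1 += absf(x[i]);
    if (l1 == 0.0f) return -1;

    /* Initial projection of |x| onto the pyramid, rounding down. */
    for (i = 0; i < n; i++) {
        y[i] = (int)(k * absf(x[i]) / l1);
        placed += y[i];
        xy += absf(x[i]) * y[i];
        yy += (float)(y[i] * y[i]);
    }

    /* Place remaining pulses one at a time where they maximize
     * (x.y)^2 / (y.y), i.e. the squared cosine between |x| and y. */
    while (placed < k) {
        int best = 0;
        float best_num = -1.0f, best_den = 1.0f;
        for (i = 0; i < n; i++) {
            float num = (xy + absf(x[i])) * (xy + absf(x[i]));
            float den = yy + 2.0f * y[i] + 1.0f;
            if (num * best_den > best_num * den) {
                best = i; best_num = num; best_den = den;
            }
        }
        xy += absf(x[best]);
        yy += 2.0f * y[best] + 1.0f;
        y[best]++;
        placed++;
    }

    /* Restore the signs of the input. */
    for (i = 0; i < n; i++) if (x[i] < 0.0f) y[i] = -y[i];
    return 0;
}
```

For example, with x = {4, -3, 0, 1} and k = 4 this search yields y = {2, -2, 0, 0}. A decoder would normalize y to unit length and scale it by the separately coded gain.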
Techniques applicable to inter frames
- Using x264 as a test bed, Jason and Loren demonstrated 15% rate/distortion improvements from using 10-bit intermediaries and references, estimated as being 1/3rd from quality calculation in the 10-bit space, 1/3rd from the higher-precision references, and 1/3rd from higher intermediate precision in calculations (e.g. MC filter processing).
- Increased reference precision competes for memory with increased number of references. The improvements demonstrated appear to be a greater win than increasing the reference count once there are four references or so.
- Super-resolution techniques for motion-compensation references have been discussed. In particular, it appears that the half-pel location is where intelligent filtering matters the most, so staged computation could be used to allow more expensive filtering at that level.
- Edge-directed interpolation techniques might be effectively applied to increase motion compensation accuracy, but most of the techniques known to be very effective are too slow.
- Speculation has been offered that a significant part of MC inaccuracy may be due to blending in a physically incorrect (gamma-corrected) space, though no real conclusions were made. Academic papers on motion compensation accuracy seem to have ignored this issue.
- Timothy has an example code base for a variable partition size blocking-free motion compensation scheme which merges OBMC (overlapped block motion compensation) and CGI (control-grid interpolation) with an interesting prediction/sub-division scheme and whole-frame trellis optimization of motion vectors. (daala-exp)
- YUV 4:4:4, 4:2:2, and 4:2:0 subsamplings; 8-bit and 10-bit.
- Alpha channel — need testing material!
- 8-bit RGB compatible mode? (e.g. YCoCg, internally or at least flagging for it)
- Efficient 3D? — need testing material!
- The value of this is disputable. If nothing else it's arguable that stuffing lossless into a lossy format may be the only way to get lossless into many people's hands. Also, see below
- Good support for decode side droppable frames?
- Hopefully the referencing structure will be flexible enough to enable this even if it's not an intentional feature.
- Optionally storing a checksum of the expected decoded frame for decoder/encoder mismatch detection.
- Expose the number of referential descendants of a given frame (or even the whole reference DAG) for most efficient allocation of FEC.
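The descendant count mentioned in the last item can be derived from the reference DAG with a short pass over the frames. This is a minimal sketch under stated assumptions: the frame_refs record, the MAX_REFS limit, and the function name are hypothetical, and frames are assumed to be listed in decode order so that every reference points to an earlier index.

```c
#include <assert.h>

#define MAX_REFS 4  /* illustrative limit, not a Daala constant */

/* Hypothetical per-frame record: decode-order indices of referenced frames. */
typedef struct {
    int nrefs;
    int refs[MAX_REFS];
} frame_refs;

/* Fill count[f] with the number of later frames that depend on frame f,
 * directly or transitively. dep is caller-provided scratch of n ints. */
static void count_descendants(const frame_refs *fr, int n,
                              int *count, int *dep) {
    int f, j, r;
    for (f = 0; f < n; f++) {
        count[f] = 0;
        for (j = f + 1; j < n; j++) {
            dep[j] = 0;
            for (r = 0; r < fr[j].nrefs; r++) {
                int ref = fr[j].refs[r];
                assert(ref >= 0 && ref < j);
                /* j depends on f if it references f itself or any
                 * frame between f and j already known to depend on f. */
                if (ref == f || (ref > f && dep[ref])) dep[j] = 1;
            }
            count[f] += dep[j];
        }
    }
}
```

Frames whose count is zero are droppable on the decode side, while frames with many descendants are the ones where extra FEC would pay off most.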
Crazy crap that might be interesting or at least fun to make fun of...
- Use cases don't seem well enough defined yet. Significant complexity. Any prospective hardware developer may hire assassins.
- Possible compromise: the video reference structure contains a backbone that can be decoded at only N bits of depth (e.g. 10), and higher precisions are only supported outside of this reference chain.
- Precision by truncation: decode is performed twice on each frame, identically, at low and high precision. The only difference between them is the bit-depth of the transform, or possibly of the transform and MC filters. Only low-precision outputs can be referenced by subsequent frames. Useful if high-precision content is still worth watching at low precision.
- Precision by gamma: decode is performed once at low precision as normal. Then the output frame is converted to linear-light at high precision, after which another layer of residuals is added. The second layer can be permitted to reference previous high-precision frames... tricky to use both sets of references though. Useful if high precision is used for storing linear data, but people still want to watch it on "low-end" hardware.
- Some high-end digital cameras operate JPEG derivatives in a special mode that keeps the image in the native linear RGB Bayer format in order to avoid lossy/slow demosaicing on the camera. In particular this allows white balancing in post without excessive loss. Probably out of scope for Daala itself.
- Bayer, 4:2:0, 4:2:2, and interlacing are all special cases of a more general pattern in which the output frames are decimated/subsampled in a regular fashion. All such subsamplings could be supported by a unified framework in which the video is always stored with all planes fully sampled, with a header indicating the recommended subsampling for display. In such cases, the encoder can regard the transform as highly overcomplete, and simply ignore unneeded coefficients (presumably by leaving high-frequency residuals coded as zero). This structure would in effect turn the codec into a motion-compensated interpolating/deinterlacing filter. Whether this approach is sensible presumably depends in part on how the transform is structured. It would be especially easy if the transform's highest frequencies were coded by a wavelet-like layer.
- Lossless intra-ability: The ability to losslessly rewrite any frame as an intra frame (perhaps with significant bitrate overhead) in order to make frame accurate cuts possible.
- Or best handled by making sure that containers have working pre-roll, but presumably common GOP sizes will be greater than the number of references so even if losslessly reencoding the references is expensive it may be cheaper than pre-roll. Do both?
- Can be had for 'free' if lossless is supported, plus the right header flags to restuff the references from lossless copies in a packed hidden frame.
- Use of explicit (rather than staged) super-resolution and/or deeper references may make this functionality unattractive due to increased overhead.
- Internal overlays which could be swapped without re-encoding? (e.g. advertising, station ID). Could also be automatically generated by a Sufficiently Advanced™ encoder to improve efficiencies for static sprites over moving backgrounds.
- Complicates making the complexity bounded. No Sufficiently Advanced™ encoder is likely to ever exist. But perhaps the station ID/advertising uses fully justify this.
- Could be done externally to the video codec, but if so it's not likely to be useful for anyone ever.
- A secondary reference implementation in OpenCL, maintained throughout development, to make sure that the codec is GPU-friendly and can be done efficiently using OpenCL primitives.
- SWAR-friendly arithmetic. For example, choosing transform coefficients so that no intermediate product overflows 16 bits (tricky for signed values) can sometimes enable (e.g.) 4 parallel operations in one uint64_t. This can allow a pure C reference implementation to run faster, which is valuable for initial adoption and ports to new platforms.
- Parametric decode-side blur.
- Symmetrical blur in regions that are smooth on scales longer than the block size. Could be signaled or derived from observed DC values.
- Motion blur so that moving objects are blurred along the motion vector. May require coding a shutter speed parameter (0..1 as a fraction of the inter-frame interval).
- Fancy block property prediction. (Not clear how these predictions interact with intra pred)
- Predict block properties (quantizer, energy, etc.) from MV. (0,0) probably means small delta. Larger MVs may correspond to larger deltas ... although at low shutter speeds large MVs may correlate with reduced overall HF energy.
- Predict delta spectral shape from source block spectral shape. HF/LF ratio of the delta may be correlated with the same ratio in its source blocks. Works well with decode-side fDCT.
- Using kurtosis for detecting text in a frame
- The idea was to detect a Bernoulli distribution, but it's not robust and too noisy.
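The SWAR-friendly arithmetic item above can be made concrete with a small sketch of four 16-bit additions carried out in one uint64_t. The function names here are hypothetical. When the codec design guarantees that no lane overflows 16 bits, a plain 64-bit addition is already enough; the masked form below additionally keeps a wrapping carry from leaking into the neighboring lane.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four unsigned 16-bit values into one word, lane 0 in the low bits. */
static uint64_t pack16x4(uint16_t l0, uint16_t l1, uint16_t l2, uint16_t l3) {
    return (uint64_t)l0 | ((uint64_t)l1 << 16) |
           ((uint64_t)l2 << 32) | ((uint64_t)l3 << 48);
}

/* Extract lane i (0..3) from a packed word. */
static uint16_t lane16(uint64_t v, int i) {
    assert(i >= 0 && i < 4);
    return (uint16_t)(v >> (16 * i));
}

/* Add four 16-bit lanes at once, each wrapping modulo 2^16.
 * The low 15 bits of every lane are added normally; their sum fits in
 * 16 bits, so no carry leaves the lane. The top bit of each lane is
 * then fixed up with XOR, which is addition without carry-out. */
static uint64_t swar_add16x4(uint64_t a, uint64_t b) {
    const uint64_t high = UINT64_C(0x8000800080008000);
    uint64_t low_sum = (a & ~high) + (b & ~high);
    return low_sum ^ ((a ^ b) & high);
}
```

Signed lanes are trickier, as the item notes, because sign extension and products need extra headroom; this sketch covers only unsigned addition.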