Notes on testing Theora

This page is still in development, and it's something of a rant at this point in time.

On testing

Testing correctly is enormously difficult. I've yet to see a video codec comparison that didn't have at least one material flaw which could be expected to influence the results. This includes my own testing.

Much of this difficulty comes from the fact that we're trying to measure highly dimensional and highly flexible things in a fair manner. Achieving real fairness is often not possible, because the operating spaces of two codecs often don't completely overlap. How you combine these orthogonal dimensions into one metric of "better" or "worse" is necessarily subjective.

Testing is so difficult that experts in testing, codec developers, and academics have all gotten it wrong. If you're not willing to spend the time and make a serious effort to understand exactly what is happening, then you should probably abandon any hope of performing a rigorous test and instead admit that your test is just a casual one which might only be applicable to your own usage.

All that said, I'm tired of seeing the same mistakes repeated over and over again. So I thought I'd make this list so that when you fail, as you inevitably will, you can at least fail in new and interesting ways. --Gmaxwell 21:17, 10 April 2010 (UTC)

Comparable operating modes

By default libtheora provides a constant QI encode (target quality), a one-pass rate-controlled encode (target bitrate), or a two-pass rate-controlled encode (target bitrate).

Libtheora's default one-pass rate controlled mode is a strictly buffer constrained (constant bit-rate over a small buffering window) mode with "zero" latency. The hard CBR requirement significantly hurts quality at a given bitrate but it makes streaming more reliable and lower latency. Encoders for many other formats only provide this kind of behaviour if you explicitly ask for it.

Libtheora's default two-pass does not impose a hard CBR constraint. It's also able to look ahead at the stream to aid its rate control decisions. This is similar to the behaviour of many encoders for many other formats.

Profiles

Some formats offer multiple profiles, and some devices can only decode some profiles and not others. Arguably this means that they aren't a single format, but are instead a collection of related formats under a common name. Theora has only a single profile.

It isn't reasonable to compare Theora's "one setting works everywhere" to the best characteristics of different profiles of a competing encoder.

If you compare Theora against a high complexity profile that restricts which devices the other format can play on, understand that you've added a subjective factor to the comparison. There isn't anything wrong with this, and it can be impossible to avoid, but it should be understood.

Encoder speed levels

Although it is fairly fast, libtheora's encoder is not extensively optimized for performance. It's basically been made fast enough to encode streaming resolutions on normal hardware. So far, development efforts have gone into improving other parts of the codec. There is a lot of low-hanging fruit in this area for further development.

In particular, speed level 2 is provided as an emergency "make it go fast without regard to bitrate" knob. It disables half of the format's features as a quick measure to make the encoder faster for some real-time applications. The same kinds of speed are possible without sacrificing quality/bitrate, but no one has bothered to write the code for them yet.

For a comparison you should probably test libtheora at the default speed. It's completely fair to note that other encoders currently offer additional speed knobs that make them much faster than libtheora. It's not really fair to run libtheora at speed level 2 then measure the quality at low bitrates. It's poor. It's supposed to be. Unless you want to measure it against MJPEG.

General points on "objective" measures like PSNR and SSIM

Objective measurements like PSNR and SSIM are tools for measuring "quality" which don't require a human to judge the quality; the computer does it for you. As such, they can enormously reduce the cost of testing.

Unfortunately, much of the art in codec development comes from the fact that computers are not particularly good judges of quality. So a tool that measures PSNR or SSIM isn't measuring quality, it's measuring PSNR or SSIM. Under particular circumstances these metrics can be shown to correlate well with human quality judgements. But "particular circumstances" does not mean "all circumstances": it isn't too difficult to construct modified images that get a great objective measure but look like crap, and it is VERY easy to modify an image in a way which is almost imperceptible but which ruins the objective measures. I'm not aware of any study demonstrating that any of the available objective measures are useful across significantly different compression techniques, but if used correctly they are probably usable as a rough yardstick.

The key there is "used correctly". There seems to be a problem with that, since a majority of comparisons seem to screw this up.

If you are comparing PSNR or SSIM and the scores for all your test cases do not converge to a similar very high quality at very high bitrates then you are almost certainly doing something wrong.

... or you've found some kind of gross bug in the encoder. Depending on the nature of the bug, your objective measure may not predict human results at all.
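The convergence rule above is easy to automate as a sanity check before you publish anything. Here is a minimal sketch in Python. The function name, the thresholds, and the example curves are all made up for illustration; they are not measurements and not part of any real tool. Pick thresholds that make sense for your metric and material.

```python
def suspicious_cases(curves, min_top_score, max_spread):
    """Return the names of test cases whose highest-bitrate score looks wrong.

    curves maps a case name to a list of (bitrate, score) pairs.
    A case is flagged when its score at the highest bitrate is below
    min_top_score, or more than max_spread below the best such score.
    Both thresholds are hypothetical knobs, not standard values.
    """
    # max() on (bitrate, score) tuples picks the highest-bitrate point.
    top = {name: max(points)[1] for name, points in curves.items()}
    best = max(top.values())
    return sorted(
        name for name, score in top.items()
        if score < min_top_score or best - score > max_spread
    )


# Illustrative numbers only: (kbit/s, PSNR in dB) for two hypothetical cases.
curves = {
    "encoder_a": [(500, 33.0), (2000, 39.5), (10000, 47.8)],
    "encoder_b": [(500, 31.0), (2000, 31.4), (10000, 31.5)],
}

print(suspicious_cases(curves, min_top_score=45.0, max_spread=3.0))
```

A case that gets flagged is not evidence that the codec is bad. It is a prompt to go and look at the decoded video and at your measurement pipeline.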

Offsets

The primary cause of getting strange and invalid results from PSNR or SSIM is offset handling.

If you shift a video over a couple of pixels a human observer will hardly notice. But your SSIM or PSNR will become very bad.

If you see this kind of behaviour, _look_ at the results and see if they agree with the numbers. If the PSNR/SSIM says the video looks no better at 10 Mbit/s than it did at 1 Mbit/s, but to your eyes it clearly does, then your test is broken and useless.
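To get a feel for how hard a small shift hits the numbers, here is a minimal sketch in plain Python. It uses the usual PSNR definition for 8-bit samples (10·log10(255² / MSE)) on a synthetic striped plane. The test content, the stand-in "coding error", and the helper names are assumptions made for this illustration only.

```python
import math


def psnr(ref, test, peak=255):
    """PSNR in dB between two equally sized planes (lists of rows)."""
    if len(ref) != len(test) or any(len(a) != len(b) for a, b in zip(ref, test)):
        raise ValueError("planes must have identical dimensions")
    count = sum(len(row) for row in ref)
    sq_err = sum((a - b) ** 2
                 for row_r, row_t in zip(ref, test)
                 for a, b in zip(row_r, row_t))
    if sq_err == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / (sq_err / count))


def shift_right(plane, pixels):
    """Shift every row right by `pixels`, repeating the left edge sample."""
    return [[row[0]] * pixels + row[:len(row) - pixels] for row in plane]


# Synthetic luma plane: vertical stripes, 4 pixels wide.
width, height = 64, 16
original = [[235 if (x // 4) % 2 else 16 for x in range(width)]
            for _ in range(height)]

# Stand-in for a mild coding error: every sample is off by 2 levels.
coded = [[v - 2 if v > 128 else v + 2 for v in row] for row in original]

print(f"aligned: {psnr(original, coded):.1f} dB")
print(f"shifted: {psnr(original, shift_right(coded, 2)):.1f} dB")
```

The two outputs contain exactly the same picture content; only the alignment differs. Stripes are a worst case, but detailed real material behaves the same way to a lesser degree.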

Different formats place the sub-sampled chroma pixels at different locations. Theora _always_ uses the MPEG-1/MJPEG convention. MPEG-2 has another convention, which is used by most other codecs. Most PC player software, even software focused on modern MPEG formats, always decodes in the MJPEG style, because it is computationally cheaper and getting the chroma location wrong doesn't produce visible harm on most material. But it does screw up PSNR or SSIM. If you measure the objective measures in the original YUV space, differences in handling the chroma offset should be limited to damaging the chroma PSNR/SSIM scores, which makes this error more likely.

If you are planning on comparing your PSNR or SSIM scores with someone else's, you probably shouldn't bother. There is too much inconsistency between tools and too much likelihood of using somewhat different input clips. This is why it's important that you publish your input material: without it people have no hope of reproducing your results, and without reproducing your results the public should have zero trust that you didn't make a simple measurement error.

Frame drops

If Theora runs out of bits in its hard-bitrate constraint or determines that a frame has almost no motion, it will output a zero-byte packet. Some decoders simply ignore the zero-byte packets, and depend on their normal AV sync mechanisms to keep things well timed. However, if you are dumping the output to a file for analysis with an SSIM/PSNR tool, these 'lost' frames will result in an offset and frames will be compared with the wrong source frames. This will have somewhere between a big and a huge negative impact on the scores.

If your output doesn't have the same number of frames as the input, stop. You've done something wrong.

This case can also cause the "never gets good at high bitrates" behaviour.
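One way to guard against this when dumping frames yourself is sketched below in Python. It assumes that a zero-byte packet should be treated as "show the previous frame again", and it refuses to go on if the frame counts don't match. The `decode` callable and both function names are hypothetical placeholders for whatever your own pipeline uses; this is not libtheora's API.

```python
def expand_zero_byte_packets(packets, decode):
    """Return one decoded frame per packet.

    Assumption: a zero-byte packet means "repeat the previous frame".
    `decode` is a hypothetical callable turning one packet into one frame.
    """
    frames = []
    for packet in packets:
        if len(packet) == 0:
            if not frames:
                raise ValueError("stream starts with a zero-byte packet")
            frames.append(frames[-1])  # repeat instead of silently skipping
        else:
            frames.append(decode(packet))
    return frames


def require_same_length(source_frames, decoded_frames):
    """Refuse to compute metrics on sequences that can't be aligned 1:1."""
    if len(source_frames) != len(decoded_frames):
        raise ValueError(
            f"{len(source_frames)} source frames vs "
            f"{len(decoded_frames)} decoded frames: fix the dump first"
        )


# Toy stream: byte strings stand in for packets, text for decoded frames.
source = ["f0", "f1", "f2", "f3"]
packets = [b"f0", b"", b"f2", b"f3"]

naive = [p.decode() for p in packets if p]  # ignores the empty packet
fixed = expand_zero_byte_packets(packets, bytes.decode)

print(naive)  # one frame short: everything after the drop is off by one
print(fixed)  # same length as the source again
require_same_length(source, fixed)
```

In the naive list every frame after the drop would be compared against the wrong source frame, which is exactly the offset problem described above.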

Levels

Objective measures are also very sensitive to changes in luma levels (brightening / darkening). The best way to avoid this case is to measure in the colorspace that the codec worked in, without conversions. Do not convert to RGB before using objective metrics on codecs that operate in YUV, because you will probably screw it up in a non-obvious way.
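A bit of arithmetic shows how sensitive PSNR is to levels. If every sample of an otherwise perfect decode is off by a constant d, the mean squared error is d², so for 8-bit video the PSNR is 20·log10(255/d) no matter how many bits you spend. The sketch below just evaluates that formula; the function name is mine, and it ignores clipping at 0 and 255.

```python
import math


def psnr_with_uniform_offset(offset, peak=255):
    """PSNR in dB of an otherwise perfect decode whose samples are all
    off by `offset` levels (clipping at the range limits is ignored)."""
    if offset == 0:
        return math.inf
    return 20 * math.log10(peak / abs(offset))


for offset in (1, 4, 16):
    print(f"luma offset {offset:2d}: {psnr_with_uniform_offset(offset):.1f} dB")
```

An offset of 16 is the sort of error you could plausibly get from confusing full-range and limited-range luma somewhere in a conversion chain, and it pins the score at a level where no amount of bitrate will help.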

General points on subjective testing

Codec X has "better colors"

Modern lossy codecs don't do anything which should grossly change the overall brightness, hue, or saturation of an image. If the colours look different in your comparison, your software is probably mishandling colorspace conversions. Figure out what is broken and try again.