Notes on testing theora


This page is still in development, and it's something of a rant at this point in time.


On testing

Testing correctly is enormously difficult. I've yet to see a video codec comparison that didn't have at least one material flaw which could be expected to influence the results. This includes my own testing.

Much of this difficulty comes from the fact that we're trying to measure highly dimensional, highly flexible things in a fair manner. Achieving real fairness is often impossible, because the operating spaces of two codecs rarely overlap completely. How you combine these orthogonal dimensions into one metric of "better" or "worse" is necessarily subjective.

Testing is so difficult that experts in testing, codec developers, and academics have all gotten it wrong. If you're not willing to spend the time to make a serious effort to understand exactly what is happening, then you should probably abandon any hope of performing a rigorous test and instead admit that your test is just a casual one which might only be applicable to your own usage.

All that said, I'm tired of seeing the same mistakes repeated over and over again. So I thought I'd make this list so that when you fail, as you inevitably will, you can at least fail in new and interesting ways. --Gmaxwell 21:17, 10 April 2010 (UTC)

Comparable operating modes

Libtheora provides a constant-QI encode (target quality), a one-pass rate-controlled encode (target bitrate), and a two-pass rate-controlled encode (target bitrate). The defaults for each are designed for the situations in which they're expected to be used, not for consistent codec testing.

Libtheora's one-pass modes are designed for live streaming, which was our original target use case. Thus, they are "zero" latency and use a short keyframe interval. The one-pass rate-controlled mode is strictly buffer constrained (constant bit-rate over a small buffering window). This hard CBR requirement significantly hurts quality at a given bitrate but it makes streaming more reliable and lower latency. Encoders for many other formats provide this kind of behavior only if you explicitly ask for it. You can relax, but not disable, this constraint in libtheora if it is not useful for your application, or use a target quality if your application does not require particular bitrates. The frequent keyframes also require a lot more bits, but improve loss robustness and startup times.

Libtheora's two-pass mode does not impose a hard CBR constraint by default, and uses a higher keyframe interval. It's also able to look ahead at the stream to aid its rate control decisions. This is similar to the default behavior of many encoders for other formats. You can still request a hard CBR constraint if it is useful for your application.

You should not compare libtheora's one-pass, one frame in/one frame out, hard CBR mode against other codecs' one-pass VBR mode with several seconds of lookahead. Even though they are both one pass, you are comparing holodecks to oranges. You should also explicitly set the keyframe interval to the same value for every codec you use. Every codec has a different default, and even relatively similar values may give statistically significant differences in the results, depending on how well they allow keyframes to align with scene changes.
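
For example, here is one way to pin down a matched configuration using ffmpeg's encoder wrappers from a small Python script. This is only a sketch: the file names, bitrate, and keyframe interval are placeholders, and you should verify that your ffmpeg build actually maps "-g" to the keyframe interval for every wrapper you use.

 # Sketch: encode the same source with two codecs at the same target bitrate
 # and the same explicit keyframe interval, via ffmpeg's encoder wrappers.
 import subprocess
 
 SOURCE = "input.y4m"   # placeholder raw source clip
 KEYINT = "64"          # same keyframe interval for every codec under test
 BITRATE = "500k"       # same target bitrate for every codec under test
 
 for codec, outfile in [("libtheora", "theora.ogv"), ("libx264", "h264.mkv")]:
     subprocess.run(["ffmpeg", "-y", "-i", SOURCE,
                     "-codec:v", codec,
                     "-b:v", BITRATE,
                     "-g", KEYINT,   # explicit keyframe interval, not the default
                     "-an",          # drop audio; only video is being compared
                     outfile], check=True)

Note that matching the bitrate and keyframe interval still does not equalize lookahead, buffering constraints, or rate-control strictness; those have to be checked per encoder, as described above.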

Profiles

Some formats offer multiple profiles, and some devices can only decode some profiles and not others. Arguably this means that they aren't a single format, but are instead a collection of related formats under a common name. Theora has only a single profile.

It isn't reasonable to compare Theora's "one setting works everywhere" behavior against the best characteristics of each of a competing format's profiles.

If you compare Theora against a high-complexity profile that restricts which devices the other format can play on, understand that you've added a subjective factor to the comparison. There isn't anything wrong with this, and it can be impossible to avoid, but it should be understood. You cannot simultaneously compare Theora's quality/bitrate performance against the most complex profile another codec offers and compare its device support to the simplest.

Encoder speed levels

Although it is fairly fast, libtheora's encoder is not extensively optimized for performance. It's basically been made fast enough to encode streaming resolutions on normal hardware. So far, development effort has mostly gone into improving other parts of the codec, so there is a lot of low-hanging fruit in this area.

In particular, speed level 2 is provided as an emergency "make it go fast without regard to bitrate" knob. It disables half of the format's features as a quick measure to make the encoder faster for some real-time applications. The same kinds of speedups are possible without sacrificing quality/bitrate, but no one has bothered to develop that code yet.

For a comparison you should probably test libtheora at the default speed. It's completely fair to note that other encoders currently offer additional speed knobs that make them much faster than libtheora. It's not really fair to run libtheora at speed level 2 and then measure the quality at low bitrates, unless you want to measure it against MJPEG. The quality/bitrate at speed level 2 is poor. It's supposed to be. It constrains the format to behave much like MJPEG, with limited delta frames.

General points on "objective" measures like PSNR and SSIM

Objective measurements like PSNR and SSIM are tools for measuring "quality" which don't require a human to judge the quality; the computer does it for you. As such, they can enormously reduce the cost of testing.

Unfortunately, much of the art in codec development comes from the fact that computers are not particularly good judges of quality. So a tool that measures PSNR or SSIM isn't measuring quality, it's measuring PSNR or SSIM. Under particular circumstances these metrics can be shown to correlate well with human quality judgements. But "particular circumstances" does not mean "all circumstances": it isn't too difficult to construct modified images that get a great objective measure but look like crap, and it is VERY easy to modify an image in a way which is almost imperceptible but which ruins the objective measures. I'm not aware of any study demonstrating that any of the available objective measures are useful across significantly different compression techniques, but if used correctly they are probably usable as a rough yardstick.

The key phrase there is "used correctly". That seems to be a problem, since a majority of published comparisons manage to screw it up.

If you are comparing PSNR or SSIM and the scores for all your test cases do not all converge to a similar, very high quality at very high bitrates, then you are almost certainly doing something wrong.

... or you've found some kind of gross bug in the encoder. Depending on the nature of the bug, your objective measure may not predict human results at all.
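
A cheap way to automate that sanity check is to look at the top of each rate/quality curve before drawing any conclusions. In the sketch below the numbers are placeholders for the mean scores you actually measured, and the 45 dB floor and 0.5 dB tolerance are arbitrary round figures, not standards.

 # Sketch: sanity-check that objective scores converge at high bitrates.
 # Replace the placeholder numbers with the mean luma PSNR (or SSIM) you
 # measured for each codec at each rate point.
 results = {
     "codec_a": {500: 34.1, 1000: 37.8, 4000: 44.9, 8000: 46.2},
     "codec_b": {500: 33.2, 1000: 37.0, 4000: 44.5, 8000: 46.0},
 }
 
 for name, curve in results.items():
     rates = sorted(curve)
     top, second = curve[rates[-1]], curve[rates[-2]]
     if top < 45.0:
         print(name, "never reaches very high quality;"
               " check for offsets, dropped frames, or level shifts")
     if top < second - 0.5:
         print(name, "gets *worse* at the highest bitrate; the test is broken")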

Offsets

The primary cause of getting strange and invalid results from PSNR or SSIM is offset handling.

If you shift a video over by a couple of pixels, a human observer will hardly notice, but your SSIM or PSNR scores will become very bad.

If you see this kind of behaviour, look at the actual output and see if it agrees with the numbers. If the PSNR/SSIM says the video looks no better at 10 Mbit/s than it did at 1 Mbit/s, but to your eyes it clearly does, then your test is broken and useless.
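
You can see how fragile the numbers are with nothing more than a synthetic frame and a two-pixel shift. This is only an illustration; the exact figure depends on the content, but the collapse in PSNR for a visually near-identical picture is the point.

 # Sketch: a 2-pixel horizontal shift is nearly invisible to a viewer but
 # wrecks PSNR. Synthetic luma plane; real video shows the same effect.
 import numpy as np
 
 def psnr(a, b, peak=255.0):
     mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
     return float("inf") if mse == 0 else 10.0 * np.log10(peak * peak / mse)
 
 y, x = np.mgrid[0:288, 0:352]
 noise = 8 * np.random.default_rng(0).standard_normal((288, 352))
 frame = (96 + 64 * np.sin(x / 40.0) + noise).clip(0, 255).astype(np.uint8)
 
 shifted = np.roll(frame, 2, axis=1)   # shift the whole frame right by 2 px
 print("PSNR of a 2-pixel shift:", round(psnr(frame, shifted), 2), "dB")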

Additionally, different formats place the sub-sampled chroma pixels at different locations. Theora _always_ uses the MPEG-1/MJPEG convention. MPEG-2 has another convention, which is used by default by most other codecs. Getting the chroma location wrong doesn't produce visible harm on most material, but it does screw up PSNR or SSIM. Many media formats, frameworks, and libraries don't even have the ability to specify which siting should be used, and thus do whatever that particular programmer expected would be "normal". Most PC player software, even software focused on modern MPEG formats, always decodes in the MJPEG style, because it is computationally cheaper. Other software pedantically tries to correct for chroma tagged with one convention when it is set up to handle a different one, even though these tags are often wrong. Unless you know exactly what your tools are doing, you are at risk of one of them screwing this up. If you apply objective measures in the original YUV space, differences in handling the chroma offset should be limited to damaging the chroma PSNR/SSIM scores, which makes this error more obvious.
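
One way to make this class of mistake jump out, along the lines of the last sentence above, is to report PSNR per plane rather than a single combined figure. A rough sketch (the planes are random stand-ins; in a real test they come from your source and decoded frames):

 # Sketch: per-plane PSNR. Chroma-offset or chroma-siting errors show up
 # as Cb/Cr scores that are far worse than the Y score.
 import numpy as np
 
 def plane_psnr(a, b, peak=255.0):
     mse = np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)
     return float("inf") if mse == 0 else 10.0 * np.log10(peak * peak / mse)
 
 rng = np.random.default_rng(1)
 ref = {"Y":  rng.integers(0, 256, (288, 352), dtype=np.uint8),
        "Cb": rng.integers(0, 256, (144, 176), dtype=np.uint8),
        "Cr": rng.integers(0, 256, (144, 176), dtype=np.uint8)}
 # Stand-in "decoded" planes: luma differs by one code value everywhere,
 # chroma planes are shifted by one pixel to mimic a chroma-handling bug.
 test = {"Y":  ref["Y"] ^ 1,
         "Cb": np.roll(ref["Cb"], 1, axis=1),
         "Cr": np.roll(ref["Cr"], 1, axis=1)}
 
 for plane in ("Y", "Cb", "Cr"):
     print(plane, round(plane_psnr(ref[plane], test[plane]), 2), "dB")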

If you are planning on comparing your PSNR or SSIM scores with someone else's, you probably shouldn't bother. There is too much inconsistency between tools, and too much likelihood of using somewhat different input clips. This is why it's important that you publish your input material: without it people have no hope of reproducing your results, and without reproducing your results the public should have zero trust that you didn't make a simple measurement error.

Frame drops

If Theora runs out of bits under its hard bitrate constraint, or determines that a frame has almost no motion, it will output a zero-byte packet. Some decoders simply ignore the zero-byte packets and depend on their normal A/V sync mechanisms to keep things well timed. However, if you are dumping the output to a file for analysis with an SSIM/PSNR tool, these 'lost' frames will result in a time offset, and frames will be compared with the wrong source frames. This will have somewhere between a big and a huge negative impact on the scores.

If your output doesn't have the same number of frames as the input, stop. You've done something wrong.

This case can also cause the "never gets good at high bitrates" behaviour.
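
A cheap guard is to count frames on both sides before computing any scores. For raw planar 4:2:0 (I420) dumps the frame count is just the file size divided by the frame size; the resolution and file names below are placeholders for your own material.

 # Sketch: refuse to compute PSNR/SSIM when the frame counts differ.
 import os
 
 WIDTH, HEIGHT = 352, 288                # placeholders: your clip's resolution
 FRAME_BYTES = WIDTH * HEIGHT * 3 // 2   # full-size Y plus quarter-size Cb, Cr
 
 def frame_count(path):
     size = os.path.getsize(path)
     if size % FRAME_BYTES:
         raise ValueError(path + ": not a whole number of frames")
     return size // FRAME_BYTES
 
 src, out = frame_count("source.yuv"), frame_count("decoded.yuv")
 if src != out:
     raise SystemExit("frame count mismatch (%d in, %d out);"
                      " fix the pipeline before measuring anything" % (src, out))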

Levels

Objective measures are also very sensitive to changes in luma levels (brightening or darkening). The best way to avoid this problem is to measure in the colorspace the codec worked in, without conversions. Do not convert to RGB before applying objective metrics to codecs that operate in YUV, because you will probably screw it up in a non-obvious way.
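
The arithmetic makes the sensitivity obvious: a uniform +4 shift on 8-bit luma gives an MSE of 16, which caps PSNR at 10*log10(255^2/16), roughly 36 dB, no matter how good the encode otherwise is. A minimal illustration:

 # Sketch: a small uniform brightness shift puts a hard ceiling on PSNR.
 import numpy as np
 
 frame = np.random.default_rng(0).integers(16, 236, (288, 352), dtype=np.uint8)
 brighter = (frame + 4).astype(np.uint8)                # uniform +4 luma shift
 mse = np.mean((brighter.astype(float) - frame) ** 2)
 print(round(10 * np.log10(255 ** 2 / mse), 2), "dB")   # about 36.1 dB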

Input files

It's obvious to most people attempting testing that you must begin with the same input for all of the codecs under test. It's less obvious that you may run into problems if you begin with RGB input: current codecs all operate in some YUV colorspace rather than RGB. If you use RGB files for your inputs, differences in the conversion from RGB to YUV between tools may reduce the objective scores of some formats relative to others.

These differences can even create visible quality differences in corner cases.
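
As a tiny illustration of where those differences come from, the BT.601 and BT.709 luma weights give different Y values for the exact same RGB pixel, so two conversion tools that disagree about the matrix (or about range scaling, which is ignored here) hand the encoders different "reference" pictures.

 # Sketch: the same RGB pixel converted with BT.601 vs BT.709 luma weights.
 r, g, b = 200, 60, 40                            # an arbitrary RGB value
 
 y_601 = 0.299  * r + 0.587  * g + 0.114  * b     # ITU-R BT.601 weights
 y_709 = 0.2126 * r + 0.7152 * g + 0.0722 * b     # ITU-R BT.709 weights
 
 print(round(y_601, 1), round(y_709, 1))          # 99.6 vs 88.3: not the same Y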

General points on subjective testing

Codec X has "better colors"

Modern lossy codecs don't do anything which should grossly change the overall brightness, hue, or saturation of an image. If the colours look different in your comparison, your software is probably mishandling colorspace conversions. Figure out what is broken and try again.