OggOpus

From XiphWiki
Revision as of 12:09, 28 July 2011 by Rillian (talk | contribs) (→‎Draft spec: Document the meaning/motivation for the first few fields)

Ogg mapping for Opus

The IETF Opus codec is a low-latency audio codec optimized for both voice and general-purpose audio. See the spec (tools.ietf.org/html/draft-ietf-codec-opus) for technical details.

Almost everything about this codec is either fixed or dynamically switchable, so the usual id and setup header parameters in the header packets of an Ogg encapsulation aren't useful. In particular, bitrate, frame size, mono/stereo, and coding modes are all dynamically switchable from packet to packet. A one-byte header on each data packet defines the parameters for that particular packet.

Remaining parameters we need to signal are:

  • magic number for stream identification
  • comment/metadata tags

Additionally, there's been a desire to support some kind of channel bonding for surround, and some kind of option signalling for "Opus Custom", in particular the granule rate.

Draft spec

Granulepos is the count of decodable samples at a fixed rate of 48 kHz.
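
As a rough illustration of the arithmetic, a granulepos can be turned into a stream time by dividing by the fixed 48 kHz rate. The function and constant names below are hypothetical, and the sketch ignores any pre-gap adjustment, since this draft doesn't yet say how the pre-gap interacts with granulepos.

```python
GRANULE_RATE_HZ = 48000  # fixed granule rate stated in the draft


def granulepos_to_seconds(granulepos: int) -> float:
    """Convert a count of samples at 48 kHz into seconds."""
    if granulepos < 0:
        raise ValueError("granulepos must be non-negative")
    return granulepos / GRANULE_RATE_HZ
```

For example, a granulepos of 48000 corresponds to one second of decodable audio.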

Two headers: id, comment

Id header:

- 8 byte magic signature 'OpusHead' (64 bits)
- 4 byte Input sample rate (32 bits, max 192 kHz)
- 1 byte channel mapping flags (bool in byte)
- 1 byte channel count (8 bits)
- 2 byte pre-gap (16 bits)
- <optional channel mapping?>
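
A minimal sketch of reading the fixed 16-byte part of this id header might look like the following. The draft doesn't state a byte order or signedness, so little-endian unsigned fields are an assumption here, and the names `OpusIdHeader` and `parse_id_header` are hypothetical. Any bytes after the fixed part (the still-undefined optional channel mapping) are simply ignored.

```python
import struct
from typing import NamedTuple


class OpusIdHeader(NamedTuple):
    input_sample_rate: int      # rate of the original input, not the playback rate
    has_channel_mapping: bool   # the one-byte boolean flag
    channel_count: int
    pre_gap: int


# Assumption: little-endian, unsigned fields, no padding.
# 8-byte magic, 4-byte rate, 1-byte flag, 1-byte channel count, 2-byte pre-gap.
_ID_LAYOUT = struct.Struct("<8sIBBH")


def parse_id_header(packet: bytes) -> OpusIdHeader:
    """Parse the fixed fields of an id header packet."""
    if len(packet) < _ID_LAYOUT.size:
        raise ValueError("packet too short for an id header")
    magic, rate, mapping_flag, channels, pre_gap = _ID_LAYOUT.unpack_from(packet)
    if magic != b"OpusHead":
        raise ValueError("missing 'OpusHead' signature")
    if rate > 192000:
        raise ValueError("input sample rate exceeds the 192 kHz maximum")
    return OpusIdHeader(rate, bool(mapping_flag), channels, pre_gap)
```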

Comment header:

- 8 byte magic signature 'OpusTags' (64 bits)
- rest follows the vorbis-comment header design used in OggVorbis, OggTheora, and Speex:
    - Vendor string (always present)
    - tag=value metadata strings (zero or more)
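
For illustration, a comment header along these lines could be read as sketched below. This assumes the usual vorbis-comment layout (a 32-bit little-endian length before each UTF-8 string, and a 32-bit little-endian count of tag strings) placed directly after the signature, with no trailing framing bit. The function name `parse_comment_header` is hypothetical.

```python
import struct


def parse_comment_header(packet: bytes):
    """Return (vendor, [(name, value), ...]) from a comment header packet."""
    if packet[:8] != b"OpusTags":
        raise ValueError("missing 'OpusTags' signature")

    def read_u32(pos):
        # 32-bit little-endian unsigned integer, as in vorbis-comment
        if pos + 4 > len(packet):
            raise ValueError("truncated length or count field")
        (value,) = struct.unpack_from("<I", packet, pos)
        return value, pos + 4

    def read_string(pos):
        length, pos = read_u32(pos)
        if pos + length > len(packet):
            raise ValueError("truncated string")
        return packet[pos:pos + length].decode("utf-8"), pos + length

    vendor, pos = read_string(8)      # vendor string is always present
    count, pos = read_u32(pos)        # zero or more tag=value strings follow
    tags = []
    for _ in range(count):
        entry, pos = read_string(pos)
        name, sep, value = entry.partition("=")
        if not sep:
            raise ValueError("comment entry lacks '='")
        tags.append((name, value))
    return vendor, tags
```

Tags are returned as a list of pairs rather than a dict because the same tag name may appear more than once.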

Some discussion is in order.

magic signature

The signature magic values allow codec identification and have the advantage of also being human readable. Starting with 'Op' helps distinguish them from data packets.
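
A demuxer could use the signatures roughly as follows. This is only a sketch with a hypothetical function name; it looks at nothing but the leading bytes, whereas a real implementation would presumably also rely on the two headers coming first in the stream.

```python
def classify_packet(packet: bytes) -> str:
    """Classify a packet as 'id', 'comment', or 'data' by its signature."""
    if packet.startswith(b"OpusHead"):
        return "id"
    if packet.startswith(b"OpusTags"):
        return "comment"
    # Anything without a header signature is treated as an audio data packet.
    return "data"
```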

input rate

This is *not* the sample rate for playback of the encoded data.

Opus has a handful of coding modes, supporting 8, 12, 16, 24, and 48 kHz signals. The mode can be switched dynamically from packet to packet in the stream, but the reference decoder can generate output at any of those sample rates from the compressed data. The lossy compression does not preserve fidelity to the original sample rate of the encoder input. Therefore, if the playback system supports one of those modes natively, *the best option for quality is to not resample* but to play back directly at 48 kHz, regardless of the value of this field.

However, the Ogg mapping allows the encoder to pass the sample rate of the original input stream as metadata. We felt this could be useful downstream and that, as something intended for machine consumption, it didn't belong in the tag header. For example, a decoder writing PCM format to disk might choose to resample the output audio back to the original input rate to reduce surprise.
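
As a small worked example of that last case, a file-writing decoder that decodes at 48 kHz and then resamples to the input rate could estimate the resulting length like this. The function name is hypothetical, the rounding choice is an assumption, and the resampling itself is out of scope here.

```python
from fractions import Fraction

DECODE_RATE_HZ = 48000  # assumed decoder output rate before resampling


def resampled_length(decoded_samples: int, input_rate: int) -> int:
    """Approximate per-channel sample count after resampling to input_rate."""
    if decoded_samples < 0 or input_rate <= 0:
        raise ValueError("sample count must be >= 0 and rate must be > 0")
    # Exact rational arithmetic avoids floating-point drift on long streams.
    return round(Fraction(decoded_samples * input_rate, DECODE_RATE_HZ))
```

One second of decoded audio (48000 samples) from a 44100 Hz original would come out as 44100 samples.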