MatroskaOpus - XiphWiki

The following is a draft!

It is at best incomplete and at worst completely broken. In any case, it is not an “official” Xiph spec or codec, so use with care.

This is an encapsulation spec for muxing the Opus codec within the Matroska container. There are a number of outstanding functional issues with muxing Opus within Matroska, so until those are resolved, use of this spec is not recommended.

Header Info

CodecID is A_OPUS
SampleFrequency is 48000 (Hz)
Channels is the number of output PCM channels
SeekPreRoll is set to 80000000 (nanoseconds, i.e. 80 milliseconds)
CodecPrivate consists of the OpusHead packet

The OpusHead packet's format is defined by the Ogg Opus mapping. In particular, it includes pre-skip, gain, and the channel mapping table required for correct surround output.
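As a rough illustration of what a demuxer finds in CodecPrivate, the following sketch parses an OpusHead packet using the field layout from the Ogg Opus mapping (RFC 7845). The class and function names are hypothetical and not part of this draft.

```python
import struct
from typing import NamedTuple, Optional


class OpusHead(NamedTuple):
    version: int
    channels: int
    pre_skip: int            # samples at 48 kHz to drop from the start
    input_sample_rate: int   # informational only
    output_gain: int         # signed Q7.8 dB value
    mapping_family: int
    stream_count: Optional[int]
    coupled_count: Optional[int]
    channel_mapping: Optional[bytes]


def parse_opus_head(codec_private: bytes) -> OpusHead:
    """Parse the OpusHead packet stored in CodecPrivate (hypothetical helper)."""
    if len(codec_private) < 19 or codec_private[:8] != b"OpusHead":
        raise ValueError("CodecPrivate does not hold an OpusHead packet")
    # Fixed little-endian fields that follow the 8-byte magic signature.
    version, channels, pre_skip, rate, gain, family = struct.unpack_from(
        "<BBHIhB", codec_private, 8)
    if family == 0:
        # Family 0 (mono or stereo) carries no channel mapping table.
        return OpusHead(version, channels, pre_skip, rate, gain, family,
                        None, None, None)
    if len(codec_private) < 21 + channels:
        raise ValueError("truncated channel mapping table")
    streams, coupled = codec_private[19], codec_private[20]
    mapping = codec_private[21:21 + channels]
    return OpusHead(version, channels, pre_skip, rate, gain, family,
                    streams, coupled, mapping)
```

For channel mapping family 0 the table is absent, so the last three fields are returned as None in this sketch.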

The second OpusTags header packet from Ogg Opus is not used in the Matroska encapsulation. Matroska has its own system for tag metadata, which avoids duplication and the need for sub-framing to index multiple packets within the CodecPrivate element.

Element Additions

CodecDelay [56][AA] is a new unsigned integer element added to the TrackEntry element. The value is the number of nanoseconds that must be discarded, for that stream, from the start of that stream. The value is also the number of nanoseconds that all encoded timestamps for that stream must be shifted to get the presentation timestamp. (This will fix Vorbis encoding as well.)
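A minimal sketch of the CodecDelay arithmetic, assuming the value is derived from the OpusHead pre-skip (samples at 48 kHz) and that the player subtracts it from encoded timestamps. The function names and the round-to-nearest choice are assumptions, not part of this draft.

```python
SAMPLE_RATE = 48_000           # Opus timestamps are defined at 48 kHz
NS_PER_SECOND = 1_000_000_000


def pre_skip_to_codec_delay_ns(pre_skip_samples: int) -> int:
    """Convert a pre-skip in 48 kHz samples to nanoseconds, rounding to nearest."""
    return (pre_skip_samples * NS_PER_SECOND + SAMPLE_RATE // 2) // SAMPLE_RATE


def presentation_timestamp_ns(encoded_timestamp_ns: int, codec_delay_ns: int) -> int:
    """Shift an encoded timestamp back by CodecDelay to get the presentation time."""
    return encoded_timestamp_ns - codec_delay_ns


codec_delay = pre_skip_to_codec_delay_ns(312)   # 312 samples is a common pre-skip
first_audible = presentation_timestamp_ns(codec_delay, codec_delay)
```

With a pre-skip of 312 samples, the first sample that survives the discard carries an encoded timestamp equal to CodecDelay and is presented at time 0.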

SeekPreRoll [56][BB] is a new unsigned integer element added to the TrackEntry element. The value is the number of nanoseconds that must be discarded after a seek for that stream, until the decoded data is valid to render.
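The following sketch shows one way a player might apply SeekPreRoll when seeking to a target time: start decoding earlier and drop the decoded output up to the target. The function name is hypothetical, and the interaction with CodecDelay is deliberately left out.

```python
SAMPLE_RATE = 48_000
NS_PER_SECOND = 1_000_000_000


def seek_plan(target_ns: int, seek_pre_roll_ns: int = 80_000_000) -> tuple[int, int]:
    """Return (time to start decoding from, decoded samples to discard).

    Hypothetical helper: decoding starts SeekPreRoll before the target,
    clamped to the start of the stream for targets closer than SeekPreRoll.
    """
    decode_from_ns = max(0, target_ns - seek_pre_roll_ns)
    discard_ns = target_ns - decode_from_ns
    discard_samples = discard_ns * SAMPLE_RATE // NS_PER_SECOND
    return decode_from_ns, discard_samples
```

For a seek to 5 seconds this yields a decode start of 4.92 seconds and 3840 discarded samples. A target earlier than SeekPreRoll simply decodes from the start of the stream.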

DiscardPadding [75][A2] is a new signed integer element added to the BlockGroup element. DiscardPadding is the duration in nanoseconds of the silent data added to the Block (padding at the end of the block). The duration of DiscardPadding is not calculated in the duration of the Track and should be discarded during playback. (This will fix Vorbis encoding as well.)
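As an illustration of end trimming, this sketch computes how many decoded samples of a Block to keep given its DiscardPadding. The function name is hypothetical, and because this draft does not define what a negative value means, the sketch rejects negative values as an assumption.

```python
SAMPLE_RATE = 48_000
NS_PER_SECOND = 1_000_000_000


def samples_to_keep(decoded_samples: int, discard_padding_ns: int) -> int:
    """Number of decoded samples to render after trimming the end padding."""
    if discard_padding_ns < 0:
        # Assumption: negative values are not handled by this sketch.
        raise ValueError("negative DiscardPadding is not handled here")
    # Round the nanosecond duration to the nearest whole sample.
    padding_samples = (discard_padding_ns * SAMPLE_RATE + NS_PER_SECOND // 2) // NS_PER_SECOND
    return max(0, decoded_samples - padding_samples)
```

For example, a 20 ms packet decodes to 960 samples. With a DiscardPadding of 5000000 ns (240 samples), 720 samples are rendered.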

Muxing Recommendations

To spare players that want to start playback at exactly time T from parsing extra muxed content, we recommend that muxers start an additional Cluster at T-SeekPreRoll, inside what would otherwise be Cluster N-1, where T is the start time of Cluster N. Muxers should then add CuePoints for all the new T-SeekPreRoll Clusters with a CueTrack of the audio stream. The CuePoints for the video stream will not change.

For example, if a file is a muxed MKV with the following characteristics:

5 second interval between video keyframes
Each video keyframe begins a new Cluster
Cues will contain video keyframe CuePoints
For each video keyframe at time T there will be a new Cluster at T-SeekPreRoll
Cues will contain audio CuePoints for T-SeekPreRoll Clusters
Audio and video are interleaved in monotonically increasing order

Assuming SeekPreRoll is 80 milliseconds:

the 1st Cluster starts at 0 milliseconds with a video keyframe Block and has a duration of 4920 milliseconds
the 2nd Cluster starts at 4920 milliseconds with an audio Block and has a duration of 80 milliseconds (the 2nd Cluster can contain Blocks from all streams)
the 3rd Cluster starts at 5000 milliseconds with a video keyframe Block and has a duration of 4920 milliseconds
the 4th Cluster starts at 9920 milliseconds with an audio Block and has a duration of 80 milliseconds.
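The layout above can be reproduced with a short sketch. The function name and the tuple shape (start in ms, duration in ms, first Block type) are illustrative assumptions.

```python
def cluster_layout(keyframe_interval_ms: int, seek_pre_roll_ms: int,
                   keyframe_count: int) -> list[tuple[int, int, str]]:
    """List (start_ms, duration_ms, first_block) for each Cluster."""
    clusters = []
    for i in range(keyframe_count):
        start = i * keyframe_interval_ms
        # Cluster opened by the video keyframe, cut short by SeekPreRoll.
        clusters.append((start, keyframe_interval_ms - seek_pre_roll_ms,
                         "video keyframe"))
        # Extra Cluster starting SeekPreRoll before the next keyframe.
        clusters.append((start + keyframe_interval_ms - seek_pre_roll_ms,
                         seek_pre_roll_ms, "audio"))
    return clusters


for start_ms, duration_ms, first_block in cluster_layout(5000, 80, 2):
    print(f"{start_ms:>5} ms  {duration_ms:>4} ms  {first_block}")
```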

With this recommendation, players that want audio and video to start playback at time T can seek to Cluster T-SeekPreRoll and start decoding the audio stream. This will work the same for both local and HTTP playback.
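A minimal sketch of the player-side lookup, assuming the audio CuePoint times have already been read from Cues into a sorted list of milliseconds. The function name and the fallback to the start of the stream are assumptions.

```python
from bisect import bisect_right


def audio_cue_for(target_ms: int, audio_cue_times_ms: list[int],
                  seek_pre_roll_ms: int = 80) -> int:
    """Pick the latest audio CuePoint at or before target - SeekPreRoll."""
    wanted = target_ms - seek_pre_roll_ms
    index = bisect_right(audio_cue_times_ms, wanted)
    if index == 0:
        # No cue early enough: fall back to the start of the stream.
        return 0
    return audio_cue_times_ms[index - 1]
```

With audio CuePoints at 4920 and 9920 milliseconds, a seek to 5000 milliseconds lands on the Cluster at 4920 milliseconds, so only the pre-roll audio is decoded before playback starts.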

Open Questions

Should we say muxers MAY or SHOULD NOT produce simple streams without filling in CodecPrivate?

If the CodecPrivate is empty or not present and Channels is 1 or 2, players MAY treat it as a sane set of defaults, I guess. e.g. channel mapping family 0, no pre-skip or gain.
For Channels > 2 the track MUST be rejected, since there's no way to map the encoded substreams to channels.
We would also have to decide on a default value for OutputGain.
Version must be 1.
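To make the question concrete, here is a sketch of the fallback a player could apply if this behaviour were adopted. It is not a settled rule: the function name is hypothetical, and the output gain of 0 and input sample rate of 48000 are assumptions, since no default has been decided.

```python
import struct


def effective_opus_head(codec_private: bytes, channels: int) -> bytes:
    """Return CodecPrivate, or a synthesized default OpusHead for mono/stereo."""
    if codec_private:
        return codec_private
    if channels not in (1, 2):
        # Without a mapping table the substreams cannot be mapped to channels.
        raise ValueError("CodecPrivate is required when Channels > 2")
    # version=1, pre-skip=0, input rate=48000 (assumed), gain=0 (assumed),
    # channel mapping family 0.
    return b"OpusHead" + struct.pack("<BBHIhB", 1, channels, 0, 48000, 0, 0)
```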

How can sample-accurate end-time trimming work in Matroska?

We defined a new element added to a BlockGroup, DiscardPadding (previously PostPadding), which is defined as the number of nanoseconds to discard from the Block.
Currently all software encapsulating Vorbis in Matroska is broken in this regard, and muxing a Vorbis file in Matroska causes it to get longer (i.e., produce more audio output than the original Ogg file). It would be unfortunate to repeat this disaster for Opus. This needs a new element specifying the number of samples to trim, perhaps a new BlockGroup child.
- This has been addressed with DiscardPadding for Opus. DiscardPadding was specified to fix Vorbis (as well as other codecs) too.

If new elements are required, can they be defined so as to enable correct seeking in rolling intra (a.k.a. intra refresh) video as well?

SeekPreRoll should work for rolling intra video.

Handling Pre-skip data

On Matroska-dev we decided to implement Proposal 1 (ref).

Use Cases

UC1: Playback starts from the beginning of the stream. Source stream time starts at 0.
UC2: Playback starts from the beginning of the stream. Pre-skip data ends in middle of compressed packet.
UC3: Playback starts from the middle of the stream > SeekPreRoll time.
UC4: Playback starts from the middle of the stream < SeekPreRoll time.
UC5: Encode source stream to Opus, mux to Matroska, then decode the Opus stream; the output must have the same number of samples as the source stream.

Proposal 1: Timeshift the timestamps by pre-skip data

The Opus audio stream pre-skip data starts from time 0 and adds the pre-skip time to the normal audio time, as when Opus streams are muxed into Ogg files. We would add a new element to the TrackEntry element, CodecDelay, and the player would adjust the timestamps of the decoded samples by subtracting CodecDelay. All use cases should be covered.

Cons:

The timestamp of the Block does not match the timestamp of the playback position.
Does not generalize known "decode, but not render" data.
Forces the player, rather than the decoder, to handle the pre-skip samples.
Because CodecPrivate already includes a full OpusHead packet, it contains a redundant pre-skip field. To avoid confusion, decoders should ensure the two fields match (if they do not, this indicates a bug, as in [1]), but since one is specified in nanoseconds and the other in samples at 48 kHz, we need to define what's sufficient to be considered a match.
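One possible definition of a match, offered only as an assumption and not something this draft has settled, is that CodecDelay rounds to the same number of 48 kHz samples as the OpusHead pre-skip. The function name below is hypothetical.

```python
SAMPLE_RATE = 48_000
NS_PER_SECOND = 1_000_000_000


def pre_skip_matches(pre_skip_samples: int, codec_delay_ns: int) -> bool:
    """True if CodecDelay rounds to the same 48 kHz sample count as pre-skip."""
    delay_samples = (codec_delay_ns * SAMPLE_RATE + NS_PER_SECOND // 2) // NS_PER_SECOND
    return delay_samples == pre_skip_samples
```

Under this rule a CodecDelay within about half a sample period (roughly 10.4 microseconds) of the exact value is accepted, which absorbs rounding by the muxer.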

Proposal 2: Use pre-skip data from CodecPrivate

On every discontinuity the decoder would need to decode and throw away the pre-skip data.

Cons:

UC2 will throw away valid data and the AV sync will be off.
UC3 will redundantly decode the pre-skip data.

Proposal 3: Add TimeToDiscard to Block

Add an element to the Block element, TimeToDiscard in nanoseconds. A value of -1 would not render the whole Block, which would have the same effect as setting the invisible bit. How would this affect the Block timestamp? Maybe the new element should be SamplesToDiscard or DataToDiscard?

Cons:

Proposal 4: Blocks that contain pre-skip data will set invisible flag

Blocks that contain pre-skip data have timestamps from the beginning of the stream. Blocks that only contain normal data have timestamps from the playback position.

Cons:

Forces the player, rather than the decoder, to handle the pre-skip samples.
UC2 will throw away valid data and the AV sync will be off. Other use cases should be fine.

Proposal 5: Force pre-skip packets to be prepended to the first normal packet in the first Block

The first Block's timestamp will be set to the start time of the source playback position. We would add a new element to the TrackEntry element, CodecDelay. All use cases should be covered.

Cons:

Does not generalize known "decode, but not render" data.
Forces the player, rather than the decoder, to handle the pre-skip samples.

Proposal 6: Create a new codec, OPUS_MKV

Basically the codec will wrap Opus packets with data telling the decoder what type of Opus packet it contains. Essentially we would be creating a new codec to handle pre-skip data within the decoder.

Cons:

There will be two types of Opus data streams!
Does not generalize known "decode, but not render" data.

Proposal 7: Negative timestamps

The SimpleBlock timestamp is signed 16 bits, so the format can signal about half of the pre-skip if playback timestamps are to start at zero.
One could set an incorrect timestamp on the skipped blocks, and rely on the decoder to drop them based on the OpusHead pre-skip value. As long as the initial blocks are timestamped <= start of output this shouldn't affect seeking.

Cons:

Moritz suggests this won't work because the resolution of the timestamps is controlled by the muxer, so the SimpleBlock timestamp offset isn't sample accurate anyway (ref).

Proposal 8: It's Crazy, But It Might Just Work

The Ogg format uses granule positions which are converted to presentation timecodes using codec specific information on a per logical stream basis.
The Matroska format uses absolute timecodes with an arbitrary per-segment accuracy for all tracks in the segment.
It is the belief of this tikiman that using a timecode offset of any kind in MKV is unholy.
The pre-skip is communicated to the media software via the Opus header in the codec private data. At the beginning of the track, the track timecode is not increased until pre-skip samples are in track frames.
From then on audio is muxed as normal, however the audio should be muxed >= 3840 samples behind video frames. For example:
- Cluster Timecode: 5.000 seconds
- Video Track Key Frame 5.000 seconds
- Opus Track Frame 4.920 seconds
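The numbers in this example follow from 3840 samples at 48 kHz being 80 milliseconds. A small sketch of that arithmetic, with a hypothetical function name:

```python
SAMPLE_RATE = 48_000


def latest_audio_timecode_ms(video_timecode_ms: int, lag_samples: int = 3840) -> int:
    """Latest Opus frame timecode allowed next to a video frame, in ms."""
    lag_ms = lag_samples * 1000 // SAMPLE_RATE   # 3840 samples -> 80 ms
    return video_timecode_ms - lag_ms


print(latest_audio_timecode_ms(5000))
```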