MatroskaOpus

DRAFT

This is an encapsulation spec for the Opus codec in [Matroska]. There are a number of outstanding functional issues with muxing Opus in Matroska, and until those are resolved, use of this spec is NOT RECOMMENDED.

- CodecID is A_OPUS
- SamplingFrequency is 48000
- Channels is the number of output PCM channels
- SeekPreRoll is set to 80000000 (80 milliseconds, expressed in nanoseconds)
- CodecPrivate starts with the 'OpusHead' packet, identical to the Ogg mapping, followed by the pre-skip data.

The 'OpusHead' format is defined by the [Ogg Opus] mapping. In particular it includes pre-skip, gain, and the channel mapping table required for correct surround output.
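
As an illustration of the fields mentioned above, the following Python sketch parses an 'OpusHead' packet using the field layout from the Ogg Opus mapping (RFC 7845). The names `OpusHead` and `parse_opus_head` are hypothetical. Since the CodecPrivate layout is still a TODO in this draft, the sketch assumes the packet sits at the start of the buffer and ignores any bytes after it.

```python
import struct
from typing import NamedTuple, Tuple


class OpusHead(NamedTuple):
    version: int
    channels: int
    pre_skip: int            # samples at 48 kHz to discard at stream start
    input_sample_rate: int   # informational only, not the decode rate
    output_gain_db: float    # stored in the packet as signed Q7.8 dB
    mapping_family: int
    stream_count: int
    coupled_count: int
    channel_mapping: Tuple[int, ...]


def parse_opus_head(data: bytes) -> OpusHead:
    """Parse the leading OpusHead packet of a buffer (hypothetical helper)."""
    if len(data) < 19 or data[:8] != b"OpusHead":
        raise ValueError("not an OpusHead packet")
    # All multi-byte fields are little-endian.
    version, channels, pre_skip, rate, gain_q8, family = struct.unpack_from(
        "<BBHIhB", data, 8
    )
    if version & 0xF0:
        raise ValueError("unsupported OpusHead major version")
    if channels == 0:
        raise ValueError("channel count must be at least 1")
    if family == 0:
        # Family 0 has no mapping table: mono or stereo, one stream.
        if channels > 2:
            raise ValueError("mapping family 0 allows only 1 or 2 channels")
        streams, coupled = 1, channels - 1
        mapping = tuple(range(channels))
    else:
        if len(data) < 21 + channels:
            raise ValueError("truncated channel mapping table")
        streams, coupled = data[19], data[20]
        mapping = tuple(data[21:21 + channels])
    return OpusHead(version, channels, pre_skip, rate, gain_q8 / 256.0,
                    family, streams, coupled, mapping)
```

A player would call `parse_opus_head(codec_private)` once per track and keep `pre_skip` and `channel_mapping` for decoder setup.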

The second 'OpusTags' header packet from Ogg Opus is not used in the Matroska encapsulation. Matroska has its own system for tag metadata, and this avoids duplicating it and the need for sub-framing to index multiple packets within the CodecPrivate element.

SeekPreRoll is a new unsigned integer element added to the TrackEntry element. The value is the number of nanoseconds that must be discarded, for that stream, after a seek until the decoded data is valid to render.
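
To make the units concrete, here is a minimal Python sketch of how a player might use SeekPreRoll. The function names are hypothetical, and the sketch assumes the fixed 48000 Hz rate given above. It converts the nanosecond value to a sample count and computes where decoding would have to begin for a given seek target.

```python
OPUS_RATE_HZ = 48000
NS_PER_SECOND = 1_000_000_000


def ns_to_samples(ns: int, rate_hz: int = OPUS_RATE_HZ) -> int:
    """Convert nanoseconds to samples, rounding up so that a player
    never discards less than the requested pre-roll."""
    return -(-ns * rate_hz // NS_PER_SECOND)


def decode_start_ns(seek_target_ns: int, seek_pre_roll_ns: int) -> int:
    """Earliest stream time that must be decoded (and discarded) so that
    output is valid at seek_target_ns. Clamped at the start of the stream."""
    return max(0, seek_target_ns - seek_pre_roll_ns)


# The value recommended in this draft corresponds to 3840 samples.
print(ns_to_samples(80_000_000))
```

With a SeekPreRoll of 80000000, a seek to 5 seconds begins decoding at 4.92 seconds and discards everything decoded before the 5 second mark.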

Block timestamps are handled as for all other codecs, i.e. the Block timestamp is the start time, in nanoseconds, of the first PCM sample in the Block.

(TODO) Define layout of CodecPrivate.

Muxing Recommendations

To spare players that want to start playback at exactly time T from parsing extra muxed content, we recommend that muxers start an additional Cluster at T-SeekPreRoll, inside the span otherwise covered by Cluster N-1, where T is the start time of Cluster N. Muxers should then add CuePoints for all the new T-SeekPreRoll Clusters, with a CueTrack of the audio stream. The CuePoints for the video stream do not change.

For example, consider a muxed MKV file with the following characteristics:
- 5 second interval between video keyframes
- Each video keyframe begins a new Cluster
- Cues will contain video keyframe CuePoints
- For each video keyframe at time T there will be a new Cluster at T-SeekPreRoll
- Cues will contain audio CuePoints for T-SeekPreRoll Clusters
- Audio and video are interleaved in monotonically increasing order

Assume SeekPreRoll is 80 milliseconds. The first Cluster starts at 0 milliseconds with a video keyframe Block and has a duration of 4920 milliseconds. The second Cluster starts at 4920 milliseconds with an audio Block and has a duration of 80 milliseconds. To be clear, the second Cluster can contain Blocks from all streams. The third Cluster starts at 5000 milliseconds with a video keyframe Block and has a duration of 4920 milliseconds. The fourth Cluster starts at 9920 milliseconds with an audio Block and has a duration of 80 milliseconds.

With this recommendation players that want audio and video to start playback at time T can seek to Cluster T-SeekPreRoll and start decoding the audio stream. This will work the same for both local and HTTP playback.
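
The Cluster layout recommended above can be sketched in Python as follows. This is only an illustration under stated assumptions: `plan_clusters` and `audio_seek_cluster` are hypothetical names, times are in milliseconds, the first keyframe is assumed to be at time 0, and the extra Cluster is skipped whenever T-SeekPreRoll would not fall after the previous keyframe.

```python
from typing import List, Tuple


def plan_clusters(keyframes_ms: List[int],
                  seek_pre_roll_ms: int) -> List[Tuple[int, str]]:
    """Return (start time, first Block kind) for each Cluster.

    Every video keyframe at time T starts a Cluster, and an extra
    Cluster starting with an audio Block is placed at T - SeekPreRoll.
    """
    clusters = []
    prev = None
    for t in keyframes_ms:
        pre = t - seek_pre_roll_ms
        # Only add the pre-roll Cluster if it lies after the previous keyframe.
        if prev is not None and pre > prev:
            clusters.append((pre, "audio"))
        clusters.append((t, "video"))
        prev = t
    return clusters


def audio_seek_cluster(clusters: List[Tuple[int, str]],
                       target_ms: int, seek_pre_roll_ms: int) -> int:
    """Start time of the Cluster a player would seek to for audio:
    the latest Cluster starting at or before target - SeekPreRoll."""
    wanted = max(0, target_ms - seek_pre_roll_ms)
    return max(start for start, _ in clusters if start <= wanted)


layout = plan_clusters([0, 5000, 10000], 80)
print(layout)
print(audio_seek_cluster(layout, 5000, 80))
```

For the example file, the layout starts Clusters at 0, 4920, 5000, 9920 and 10000 milliseconds, and a player seeking to the 5000 millisecond keyframe begins audio decoding at the 4920 millisecond Cluster.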

Open Questions

Should we say muxers MAY or SHOULD NOT produce simple streams without filling in CodecPrivate?

If the CodecPrivate is empty or not present and Channels is 1 or 2, players MAY presumably fall back to a sane set of defaults, e.g. channel mapping family 0 with no pre-skip or gain. For Channels > 2 the track MUST be rejected, since there is no way to map the encoded substreams to channels.
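
A player-side fallback along these lines might look like the Python sketch below. It only illustrates the behaviour floated in this open question, which is not settled. The function name and the dictionary keys are hypothetical.

```python
def default_opus_config(channels: int) -> dict:
    """Defaults a player might assume when CodecPrivate is empty or absent.

    This mirrors the tentative rule discussed here, not a settled one:
    mono or stereo gets channel mapping family 0 with no pre-skip or
    gain, and anything else is rejected.
    """
    if channels not in (1, 2):
        # Without a channel mapping table the substreams cannot be
        # assigned to output channels.
        raise ValueError("CodecPrivate is required for more than 2 channels")
    return {
        "mapping_family": 0,
        "pre_skip": 0,
        "output_gain_db": 0.0,
        "stream_count": 1,
        "coupled_count": channels - 1,  # one coupled (stereo) stream if 2 channels
    }
```

The stream and coupled counts follow from mapping family 0, which always carries a single stream that is coupled only in the stereo case.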

How does the OpusHead pre-skip field interact with the timestamps?

The SimpleBlock timestamp is signed 16 bits, so the format can signal about half of the pre-skip if playback timestamps are to start at zero. Moritz suggests this won't work because the resolution of the timestamps is controlled by the muxer, so the SimpleBlock timestamp offset isn't sample accurate anyway.[ref]
One could set an incorrect timestamp on the skipped blocks, and rely on the decoder to drop them based on the OpusHead preskip value. As long as the initial blocks are timestamped <= start of output this shouldn't affect seeking.
The SimpleBlock structure also has an 'invisible' bit, which tells the player to decode, but not display, the contained frames. This lets the muxer signal the pre-skip semantics with frame accuracy, but not sample accuracy. If players implement this it will at least help with sync. Libav does not appear to support the invisible bit.

How important is it that timestamps start at zero in a Matroska file?
How can sample-accurate end-time trimming work in Matroska?

Currently all software encapsulating Vorbis in Matroska is broken in this regard, and muxing a Vorbis file in Matroska causes it to get longer (i.e., produce more audio output than the original Ogg file). It would be unfortunate to repeat this disaster for Opus. This needs a new element specifying the number of samples to trim, perhaps a new BlockGroup child.

If new elements are required, can they be defined so as to enable correct seeking in rolling intra (a.k.a intra refresh) video as well?

SeekPreRoll should work for rolling intra video.

Handling Pre-skip data

Use Cases:

UC1: Playback starts from the beginning of the stream. Source stream time starts at 0.
UC2: Playback starts from the beginning of the stream. Pre-skip data ends in middle of compressed packet.
UC3: Playback starts from the middle of the stream > SeekPreRoll time.
UC4: Playback starts from the middle of the stream < SeekPreRoll time.

one: Timeshift the timestamps by pre-skip data

The Opus audio stream's pre-skip data starts at time 0, and the pre-skip time is added to the normal audio time, as is done when Opus is muxed into Ogg files. We would add a new element, PreSkip, to the TrackEntry element, and the player would adjust the timestamps of the decoded samples by subtracting PreSkip. All use cases should be covered.
Cons:

two: Add pre-skip data to CodecPrivate.

On every discontinuity the decoder would need to decode and throw away the pre-skip data.
Cons:

UC2 will throw away valid data and the AV sync will be off.
UC3 will redundantly decode the pre-skip data.

three: Add TimeToDiscard to Block.

Add a TimeToDiscard element, in nanoseconds, to the Block element. A value of -1 would mean that none of the Block is rendered, which would have the same effect as setting the invisible bit. How would this affect the Block timestamp? Maybe the new element should be SamplesToDiscard or DataToDiscard?
Cons:

four: Blocks that contain pre-skip data will set invisible flag.

Blocks that contain pre-skip data have timestamps from the beginning of the stream. Blocks that only contain normal data have timestamps from the playback position.
Cons:

Forces the player, rather than the decoder, to handle the pre-skip samples.
UC2 will throw away valid data and the AV sync will be off. Other use cases should be fine.

five: Force pre-skip packets to be prepended to the first normal packet in the first Block.

The first Block's timestamp will be set to the start time of the source playback position. We would add a new element to the TrackEntry element, PreSkip. All use cases should be covered.
Cons:

Does not generalize known "decode, but not render" data.
Forces the player, rather than the decoder, to handle the pre-skip samples.

six: Create a new codec, OPUS_MKV.

Basically the codec will wrap Opus packets with data telling the decoder what type of Opus packet it contains. Essentially we would be creating a new codec to handle pre-skip data within the decoder.
Cons:

There will be two types of Opus data streams!
Does not generalize known "decode, but not render" data.

Proposal 7 (Like five++)

The pre-skip is communicated to the media software via the Opus header in the codec private data. At the beginning of the track, the track timecode is not increased until pre-skip samples have been written to track frames.

From then on audio is muxed as normal, however the audio should be muxed >= 3840 samples behind video frames.

i.e.
- Cluster Timecode: 5.000 seconds
- Video Track Key Frame: 5.000 seconds
- Opus Track Frame: 4.920 seconds