This is an encapsulation spec for the Opus codec in [Matroska]. There are a number of outstanding functional issues with muxing Opus in Matroska, and until those are resolved, use of this spec is NOT RECOMMENDED.
- CodecID is A_OPUS - SampleFrequecy is 48000 - Channels is number of output PCM channels - SeekPreRoll is set to 80000000 - CodecPrivate starts with the 'OpusHead' packet, identical to the Ogg mapping, followed by the pre-skip data.
The 'OpusHead' format is defined by the [Ogg Opus] mapping. In particular it includes pre-skip, gain, and the channel mapping table required for correct surround output.
The second 'OpusTags' header packet from Ogg Opus is not used in the Matroska encapsulation. Matroska has its own system for tag metadata, and this avoids duplicating it and the need for sub-framing to index multiple packets within the CodecPrivate element.
SeekPreRoll is a new unsigned integer element added to the TrackEntry element. The value is the number of nanoseconds that must be discarded, for that stream, after a seek until the decoded data is valid to render.
Block timestamps will match how all other Codecs are handled. I.e. The Block timestamp is the starting time of the first PCM sample position in nanoseconds.
If the CodecPrivate is empty or not present and Channels is 1 or 2, players MAY treat it as a sane set of defaults, I guess. e.g. channel mapping family 0, no pre-skip or gain. For Channels > 2 the track MUST be rejected, since there's no way to map the encoded substreams to channels.
In order to prevent extraneous parsing of muxed content for the players that want to start playback at exactly time T, we will recommend muxers create files with another Cluster within N-1 at T-SeekPreRoll, where T is the start time of Cluster N. Then add CuePoints for all the new T-SeekPreRoll Clusters with a CueTrack of the audio stream. The CuePoints for the video stream will not change.
For example, a file is a muxed MKV with the following characteristics: - 5 second interval between video keyframes - Each video keyframe begins a new Cluster - Cues will contain video keyframe CuePoints - For each video keyframe at time T there will be new Cluster at T-SeekPreRoll - Cues will contain audio CuePoints for T-SeekPreRoll Clusters - Audio and video are interleaved in monotonically increasing order
Assume SeekPreRoll is 80 milliseconds, the first Cluster starts at 0 milliseconds with a video keyframe Block and has a duration of 4920 milliseconds. The second Cluster starts at 4920 milliseconds with an audio Block and has a duration of 80 milliseconds. Just to be clear, the second Cluster can contain Blocks from all streams. The third Cluster starts at 5000 milliseconds with a video keyframe Block and has a duration of 4920 milliseconds. The fourth Cluster starts at 9920 milliseconds with an audio Block and has a duration of 80 milliseconds.
With this recommendation players that want audio and video to start playback at time T can seek to Cluster T-SeekPreRoll and start decoding the audio stream. This will work the same for both local and HTTP playback.
Should we say muxers MAY or SHOULD NOT produce simple streams without filling in CodecPrivate?
How does the OpusHead pre-skip field interact with the timestamps? The SimpleBlock timestamp is signed 16 bits, so the format can signal about half of the pre-skip if playback timestamps are to start at zero. Moritz suggests this won't work because the resolution of the timestamps is controlled by the muxer, so the SimpleBlock timestamp offset isn't sample accurate anyway.[ref]
One could set an incorrect timestamp on the skipped blocks, and rely on the decoder to drop them based on the OpusHead preskip value. As long as the initial blocks are timestamped <= start of output this shouldn't affect seeking.
How important is it that timestamps start at zero in a Matroska file?
The SimpleBlock structure also has an 'invisible' bit, which tells the player to decode, but not display, the contained frames. This lets the muxer signal the pre-skip semantics with frame accuracy, but not sample accuracy. If players implement this it will at least help with sync. Libav does not appear to support the invisible bit.
How can sample-accurate end-time trimming work in Matroska? Currently all software encapsulating Vorbis in Matroska is broken in this regard, and muxing a Vorbis file in Matroska causes it to get longer (i.e., produce more audio output than the original Ogg file). It would be unfortunate to repeat this disaster for Opus. This needs a new element specifying the number of samples to trim, perhaps a new BlockGroup child.
If new elements are required, can they be defined so as to enable correct seeking in rolling intra (a.k.a intra refresh) video as well?