This is a proposed draft of how to encapsulate the Dirac video codec bitstream within an Ogg container.
This document is intended to reflect current thinking about how to do this; something will be finalized after feedback from implementation. It is based on the description of Dirac in the 0.9 specification, updated based on more recent discussion.
An Ogg Dirac logical bitstream encapsulates a single Dirac sequence, beginning with the Stream Identifier, continuing with one or more Access Units, and ending with the Stop Sequence. Any additional following sequences (which may have different decode or display parameters) must be placed in separate Ogg logical bitstreams and concatenated to make a chained Ogg physical bitstream.
The Dirac bitstream is divided into Ogg packets as follows:
- the beginning-of-stream (bos) packet consists solely of the 8-octet Stream Identifier "KW-DIRAC" and no other data
- Each 'Parse Unit' (either RAP "header" or Frame Data Unit) is packed into an individual Ogg packet. These constitute the body of the stream.
- The beginning-of-stream (bos) packet must be a RAP parse unit and must occur on a page by itself to assist in determining the stream type and parameters. Muxers may use the initial 5 bytes of the RAP ('BBCD', 0x00) for codec identification.
- the end-of-stream (eos) packet consists solely of the 4-octet stop sequence (which can be viewed as an empty Parse Unit) and no other data
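The codec identification check mentioned in the list above can be sketched as follows. This is only an illustration: the function name `looks_like_dirac_rap` and the constant name are hypothetical, and the sketch assumes nothing beyond what the text states, namely that a RAP begins with the four characters 'BBCD' followed by a 0x00 octet.

```python
# Hypothetical helper; these names do not come from any Dirac or Ogg library.
DIRAC_RAP_PREFIX = b"BBCD\x00"  # 'BBCD' followed by 0x00, as described in the text


def looks_like_dirac_rap(packet: bytes) -> bool:
    """Return True if the packet starts with the 5-byte RAP prefix."""
    return packet[:5] == DIRAC_RAP_PREFIX
```

A demuxer could apply such a check to the first packet of each logical bitstream before handing the stream to a Dirac decoder.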
The Ogg granulepos field uses the "keyframe granule shift" mechanism (in common with Theora and CMML) to point to the beginning of the current access unit, and otherwise counts video frames, since Dirac has a fixed frame rate.
The granulepos field for each packet is derived from a count of the number of frames decodable in continuous playback, including the data in that packet. Thus, if the frame data is sent in display order, the first Frame Data Unit (with FRAME_NUMBER_OFFSET=0) would have a count of 1, the second a count of 2, and so on. However, Dirac Frame Data Units are in general stored in the bitstream in coded order; that is, frames used for prediction may be placed ahead of their display position, before other frames that depend on them. In these cases the granulepos does not count the reordered frame until the packet holding the frame displayed immediately before it has been included. This maintains consistency with the "count of decodable frames" definition and also minimizes buffering requirements in interleaved streams.
For example, suppose we have (in display order) an Intra frame, followed by two bi-predictive frames, followed by a back-predictive frame which is also a reference for the two bi-predictive frames.
Display order: 1I 2B 3B 4P
Coded order:   1I 4P 2B 3B
granulepos:    1  1  2  4
So the intra frame receives a granulepos of 1 (or n+1 if we're in the middle of a stream) since it advances the count of decodable frames. The next packet is the P frame which will be displayed as the 4th frame. Since we do not yet have enough information to decode up to the 4th frame, granulepos does not advance and the second packet receives the same granulepos as the first. The third packet is the first bi-predictive frame. It will be displayed second, after being decoded with reference to the 1st and 4th frames, and so the granulepos count will be set to 2. The fourth packet is the second bi-predictive frame. It is to be displayed 3rd, but once it is received, the decoder is also ready to display the 4th frame which has already been processed. Therefore the granulepos count advances twice; once for the current decodable frame, and once for the stored frame.
Note that this scheme means that the encoder or muxer must keep a list of out-of-order frames so it knows when to apply their increments to the count of decodable frames. Reordered frames can be detected by comparing the frame number in the Parse Unit header with the current count.
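The bookkeeping described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `decodable_counts` is hypothetical, frames are identified by their 1-based display position (as in the example above) rather than by the raw frame number in the Parse Unit header, and only Frame Data Units are modelled.

```python
def decodable_counts(coded_order, start=0):
    """Return the decodable-frame count after each packet.

    coded_order: display positions of the frames (the first frame of the
                 sequence is 1), listed in the order the packets are stored.
    start:       count of decodable frames before the first packet given.
    """
    pending = set()  # frames received but not yet contiguous in display order
    count = start
    counts = []
    for display_position in coded_order:
        pending.add(display_position)
        # Advance while the next frame in display order has been received.
        while count + 1 in pending:
            pending.remove(count + 1)
            count += 1
        counts.append(count)
    return counts


# The example from the text: coded order 1I 4P 2B 3B.
print(decodable_counts([1, 4, 2, 3]))  # [1, 1, 2, 4]
```

The `pending` set plays the role of the muxer's list of out-of-order frames: a frame stays in it until every frame displayed before it has arrived.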
The above describes how a decodable frame count is determined for each Frame Data Unit. Random Access Point packets contain only decode parameters and so also do not advance the count of decodable frames.
Dirac defines 'Access Units' (AUs) to assist with beginning playback after a seek. Each begins with a Random Access Point (RAP) Parse Unit which contains a copy of the decode and display parameters for the stream. This is followed immediately by an Intra frame, and the encoder guarantees that the stream will be decodable starting there. That is, no frame to be displayed after the start of a new AU may depend on frames previous to the start of that AU. Because of the FDU reordering described above, frames whose display time belongs to the previous Access Unit may actually be placed in the bitstream after the beginning of the next; however such frames can simply be discarded when beginning playback.
This mechanism is provided because the dependencies upon previous frames can be complicated, so simply beginning playback at the previous Intra frame (as in Theora) does not suffice. Likewise, because AU boundaries where decoding can be restarted can be quite far apart, it is expedient to provide a mechanism to find them during seeking. To do so, Ogg Dirac uses the "keyframe granule shift" mechanism to store the count of decoded frames in the Ogg granulepos field in a way that also represents an offset to the start of the Access Unit.
The 64 bit granulepos field for an Ogg Dirac packet is divided into two fields. The higher-order field stores the decodable frame count (as described above) 'as it was at the start of the current access unit'. The lower order field stores the difference between the current count and the value in the higher-order field. Thus, the two fields must be added to get the current count of decodable frames, while in seeking, the value in the lower-order field provides a rough offset (in frames, which can be divided by the framerate to calculate a temporal seek point) which can be used to seek again for the closest previous point where decoding can be restarted.
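The arithmetic on the two fields can be sketched as below. The function names are hypothetical, and the sketch assumes the two fields have already been extracted from the granulepos; the frame rate is passed as a `Fraction` so that rates such as 30000/1001 stay exact.

```python
from fractions import Fraction


def current_frame_count(au_start_count: int, offset_in_au: int) -> int:
    """Add the two fields to get the current count of decodable frames."""
    return au_start_count + offset_in_au


def seconds_back_to_au_start(offset_in_au: int, frame_rate: Fraction) -> Fraction:
    """Convert the lower-order field into a rough temporal seek offset."""
    return Fraction(offset_in_au) / frame_rate
```

For instance, a lower-order value of 25 at 25 frames per second suggests seeking roughly one second earlier to find the point where decoding can be restarted.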
Dirac allows a maximum of 2^32 - 1 frames in a given access unit, and there is no way to mark a lower limit (outside of the currently undefined profile and level specs) so the lower-order field of the granulepos is always 32 bits, even though in practice most streams will not need that much. With (-1) having a special meaning in Ogg, this effectively leaves 31 bits to hold the total frame count, which at 60 fps is a little over a year before a rollover. Not wonderful, but workable.
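A sketch of the resulting bit layout, together with the rollover estimate, might look like this. All names are hypothetical. The 32-bit width of the lower-order field comes from the text; limiting the higher-order field to 31 bits is an assumption based on the remark about (-1) being reserved in Ogg.

```python
GRANULE_SHIFT = 32                   # width of the lower-order field, per the text
LOW_MASK = (1 << GRANULE_SHIFT) - 1
HIGH_BITS = 31                       # assumption: sign bit left clear so -1 stays special


def pack_granulepos(au_start_count: int, offset_in_au: int) -> int:
    """Combine the count at the start of the AU with the offset inside it."""
    if not 0 <= offset_in_au <= LOW_MASK:
        raise ValueError("offset does not fit in the lower-order field")
    if not 0 <= au_start_count < (1 << HIGH_BITS):
        raise ValueError("frame count does not fit in the higher-order field")
    return (au_start_count << GRANULE_SHIFT) | offset_in_au


def split_granulepos(granulepos: int):
    """Return (count at start of AU, offset within AU)."""
    return granulepos >> GRANULE_SHIFT, granulepos & LOW_MASK


def days_until_rollover(frames_per_second: float, count_bits: int = HIGH_BITS) -> float:
    """Days of continuous video before a count of `count_bits` bits wraps."""
    return (1 << count_bits) / frames_per_second / 86400
```

At 60 fps `days_until_rollover(60)` comes to between 414 and 415 days, which matches the "little over a year" figure above.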
Note that the RAP packet at the beginning of the access unit doesn't itself advance decoding, so except for the first one in a sequence (or Ogg chain segment) which gets a granulepos of 0:0, applying the 'count of decodable frames' across an AU boundary prevents us from setting a unique granulepos on either the RAP or an I frame that is followed by predicted frames from the previous AU. Thus the shift will point to a bit before the AU boundary. This appears to be acceptable.
Ogg packets are packed into Ogg pages to make a logical (or degenerate physical) Ogg bitstream. Following best practices, the following rules apply:
- The Ogg Dirac bos packet MUST occur on a page by itself
- Each Access Unit MUST begin at the start of a page (i.e. there must be a page flush before it) to simplify seeking and editing.
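The two paging rules above can be illustrated with a toy model. This is a sketch under simplifying assumptions: packets are represented only by a role label ("bos", "rap", "frame", "eos", all hypothetical names), the first packet is taken to be the bos packet, and page size limits are ignored, so pages are split only where the rules demand a flush.

```python
def pages_for_packets(roles):
    """Group packet roles into pages, flushing where the rules require it.

    roles: one label per packet, in stream order; the first is the bos packet.
    """
    pages = []
    current = []
    for index, role in enumerate(roles):
        starts_access_unit = role == "rap"
        follows_bos = index == 1  # the bos packet must sit on a page by itself
        if current and (starts_access_unit or follows_bos):
            pages.append(current)  # page flush
            current = []
        current.append(role)
    if current:
        pages.append(current)
    return pages


print(pages_for_packets(["bos", "rap", "frame", "frame", "rap", "frame", "eos"]))
# [['bos'], ['rap', 'frame', 'frame'], ['rap', 'frame', 'eos']]
```

A real muxer would additionally close a page whenever it reaches its chosen page size.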
An implementation might want to choose an Ogg Page size near the average size of the Dirac packets to best balance overhead and interleave granularity.
When multiplexing an Ogg Dirac logical bitstream the usual rule of ordering Ogg pages by the temporal equivalent of their marked granulepos is followed. At the beginning of a chain segment, all the bos pages must occur together before any non-bos pages. eos pages may occur anywhere in the stream, but after an eos page, further pages with that serial number must not occur.
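The bos and eos ordering rules lend themselves to a simple check. The sketch below is illustrative only: the function name and the `(serial, is_bos, is_eos)` tuple representation of a page are assumptions made for the example, and it covers a single chain segment without checking granulepos ordering.

```python
def check_page_order(pages):
    """Return a list of rule violations for one chain segment.

    pages: iterable of (serial, is_bos, is_eos) tuples in file order.
    """
    problems = []
    seen_non_bos = False
    ended = set()  # serial numbers whose eos page has been seen
    for position, (serial, is_bos, is_eos) in enumerate(pages):
        if serial in ended:
            problems.append(f"page {position}: serial {serial} continues after its eos page")
        if is_bos and seen_non_bos:
            problems.append(f"page {position}: bos page after a non-bos page")
        if not is_bos:
            seen_non_bos = True
        if is_eos:
            ended.add(serial)
    return problems
```

An empty result means the page sequence satisfies both rules; each string in a non-empty result names the offending page position.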
If Dirac video is the primary media track in the file, the bos page for the Dirac logical bitstream should occur first in the file. If there are multiple Dirac streams multiplexed together, the muxer should place the "default" choice for naive playback as the first bos page.
It is recommended that an Ogg Dirac file include an OggSkeleton stream to describe its components and incorporate metadata. If it does so, the OggSkeleton bos page should occur first, and the other bos pages may occur in any order.
The initial RAP packet is equivalent to the "codec setup header" packets used by the Vorbis and Theora codecs, but all the information is in a single packet. Unlike Vorbis and Theora, this packet may repeat throughout the bitstream, flagging points where decoding can be restarted. Since one always appears (with granulepos 0|0) as the first packet in the Ogg stream, the usual "Icecast-style" Ogg streaming server logic will work; however the decoder (a) may not be able to decode all frames until the next in-band RAP packet occurs and (b) will technically be receiving an invalid Dirac stream if the server prepended the RAP at an arbitrary point in the stream. Therefore Dirac still requires special-case handling in an Ogg-based streaming server, either editing the stream after sending a cached RAP, or not caching the RAP at all.
Dirac also has "auxiliary" parse units which may contain opaque application-specific data. Some applications may require proper handling of these packets by a streaming server.