This is a proposed draft of how to encapsulate the Dirac video codec bitstream within an Ogg container.
This document is intended to reflect current thinking about how to do this; the mapping will be finalized after feedback from implementations. It is based on the description of Dirac in the 0.9 specification, updated based on more recent discussion.
An Ogg Dirac logical bitstream encapsulates a single Dirac sequence, beginning with the Stream Identifier, continuing with one or more Access Units, and ending with the Stop Sequence. Any additional following sequences (which may have different decode or display parameters) must be placed in separate Ogg logical bitstreams and concatenated to make a chained Ogg physical bitstream.
The Dirac bitstream is divided into Ogg packets as follows:
- the beginning-of-stream (bos) packet consists solely of the 8-octet Stream Identifier "KW-DIRAC" and no other data
- following the Stream Identifier/bos packet, each 'Parse Unit' (either RAP "header" or Frame Data Unit) is packed into an individual Ogg packet. These constitute the body of the stream.
- the end-of-stream (eos) packet consists solely of the 4-octet stop sequence (which can be viewed as an empty Parse Unit) and no other data
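The packet division listed above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the tuple layout and the idea of receiving already-split Parse Units are hypothetical, and the 4-octet stop sequence is passed in by the caller rather than spelled out here.

```python
STREAM_IDENTIFIER = b"KW-DIRAC"  # the 8-octet Stream Identifier named in the text


def dirac_to_ogg_packets(parse_units, stop_sequence):
    """Return a list of (payload, is_bos, is_eos) tuples, one per Ogg packet.

    parse_units   -- iterable of bytes objects, one per Parse Unit
                     (hypothetical input; splitting the Dirac stream into
                     Parse Units is not shown)
    stop_sequence -- the 4-octet stop sequence, supplied by the caller
    """
    if len(stop_sequence) != 4:
        raise ValueError("stop sequence must be exactly 4 octets")
    # bos packet: the Stream Identifier and nothing else.
    packets = [(STREAM_IDENTIFIER, True, False)]
    # Body: exactly one Parse Unit per Ogg packet.
    for unit in parse_units:
        packets.append((bytes(unit), False, False))
    # eos packet: the stop sequence and nothing else.
    packets.append((bytes(stop_sequence), False, True))
    return packets
```

The sketch deliberately stops at the packet level; granulepos assignment and page layout are separate steps.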
The Ogg granulepos field uses the "keyframe granule shift" mechanism (in common with Theora and CMML) to point to the beginning of the current access unit, and otherwise counts video frames, since Dirac has a fixed frame rate.
The granulepos field for each packet is derived from a count of the number of frames decodable in continuous playback, including the data in that packet. Thus, if the frame data is sent in display order, the first Frame Data Unit (with FRAME_NUMBER_OFFSET=0) would have a count of 1, the second a count of 2, and so on. However, Dirac Frame Data Units are in general stored in the bitstream in coded order; that is, frames used for prediction may be placed ahead of their display position so that they precede the frames that depend on them. In these cases the count does not advance for the reordered frame until the packet for the frame displayed directly before it has been included. This maintains consistency with the "count of decodable frames" definition and also minimizes buffering requirements in interleaved streams.
For example, suppose we have (in display order) an Intra frame, followed by two bi-predictive frames, followed by a back-predictive frame which is also a reference for the two bi-predictive frames.
Display order:  1I  2B  3B  4P
Coded order:    1I  4P  2B  3B
granulepos:      1   1   2   4
So the intra frame receives a granulepos of 1 (or n+1 if we're in the middle of a stream) since it advances the count of decodable frames. The next packet is the P frame which will be displayed as the 4th frame. Since we do not yet have enough information to decode up to the 4th frame, granulepos does not advance and the second packet receives the same granulepos as the first. The third packet is the first bi-predictive frame. It will be displayed second, after being decoded with reference to the 1st and 4th frames, and so the granulepos count will be set to 2. The fourth packet is the second bi-predictive frame. It is to be displayed 3rd, but once it is received, the decoder is also ready to display the 4th frame which has already been processed. Therefore the granulepos count advances twice; once for the current decodable frame, and once for the stored frame.
Note that this scheme means that the encoder or muxer must keep a list of out-of-order frames so it knows when to apply their increments to the count of decodable frames. Reordered frames can be detected by comparing the frame number in the Parse Unit header with the current count.
The above describes how a decodable frame count is determined for each Frame Data Unit. Random Access Point packets contain only decode parameters and so also do not advance the count of decodable frames.
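The counting rule for Frame Data Units can be sketched as below. This is a minimal illustration, not a reference implementation: the function name and the representation of frames as 1-based display positions are assumptions made here, and a real muxer would derive those positions from the frame numbers in the Parse Unit headers.

```python
def decodable_frame_counts(display_numbers, start_count=0):
    """Return the decodable-frame count after each Frame Data Unit.

    display_numbers -- 1-based display positions of the frames, listed in
                       coded order (hypothetical input format)
    start_count     -- count of decodable frames before the first unit
    """
    count = start_count
    pending = set()  # frames received but not yet reachable in display order
    counts = []
    for number in display_numbers:
        pending.add(number)
        # Advance once for every held frame that is now next in display order.
        while count + 1 in pending:
            pending.remove(count + 1)
            count += 1
        counts.append(count)
    return counts
```

For the example above, the coded order 1I 4P 2B 3B corresponds to `decodable_frame_counts([1, 4, 2, 3])`, which gives the counts 1, 1, 2, 4. RAP packets are not passed to this function, since they do not advance the count.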
Dirac defines 'Access Units' (AUs) to assist with beginning playback after a seek. Each begins with a Random Access Point (RAP) Parse Unit which contains a copy of the decode and display parameters for the stream. This is followed immediately by an Intra frame, and the encoder guarantees that the stream will be decodable starting there. That is, no frame to be displayed after the start of a new AU may depend on frames previous to the start of that AU. Because of the FDU reordering described above, frames whose display time belongs to the previous Access Unit may actually be placed in the bitstream after the beginning of the next; however such frames can simply be discarded when beginning playback.
This mechanism is provided because the dependencies upon previous frames can be complicated, so simply beginning playback at the previous Intra frame (as in Theora) does not suffice. Likewise, because AU boundaries where decoding can be restarted can be quite far apart, it is expedient to provide a mechanism to find them during seeking. To do so Ogg Dirac uses the "keyframe granule shift" mechanism to store the count of decoded frames in the Ogg granulepos field in a way that also represents an offset to the start of the Access Unit.
The 64-bit granulepos field for an Ogg Dirac packet is divided into two fields. The higher-order field stores the decodable frame count (as described above) 'as it was at the start of the current access unit'. The lower-order field stores the difference between the current count and the value in the higher-order field. Thus, the two fields must be added to get the current count of decodable frames, while in seeking, the value in the lower-order field provides a rough offset (in frames, which can be divided by the framerate to calculate a temporal seek point) which can be used to seek again for the closest previous point where decoding can be restarted.
Dirac defines a maximum of 2^30 frames in a given access unit, and there is no way to mark a lower limit (outside of the currently undefined profile and level specs), so the lower-order field of the granulepos is always 30 bits, even though in practice most streams will not need that much. With (-1) having a special meaning in Ogg, this effectively leaves 33 bits to hold the total frame count, which at 60 fps is about four and a half years before a rollover. Not wonderful, but workable.
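The two-field layout can be sketched in Python as follows. This is an illustration only: the function names and the `GRANULE_SHIFT` constant are assumptions introduced here, and the sketch says nothing about which granulepos a RAP packet should carry.

```python
GRANULE_SHIFT = 30  # assumed width of the lower-order field, per the text
OFFSET_MASK = (1 << GRANULE_SHIFT) - 1


def pack_granulepos(au_start_count, current_count):
    """Combine the count at the start of the access unit with the current count."""
    offset = current_count - au_start_count
    if au_start_count < 0 or not 0 <= offset <= OFFSET_MASK:
        raise ValueError("counts do not fit the assumed field layout")
    return (au_start_count << GRANULE_SHIFT) | offset


def unpack_granulepos(granulepos):
    """Return (count at start of access unit, offset since then)."""
    return granulepos >> GRANULE_SHIFT, granulepos & OFFSET_MASK


def decodable_frames(granulepos):
    """Current count of decodable frames: the two fields added together."""
    au_start_count, offset = unpack_granulepos(granulepos)
    return au_start_count + offset


def offset_seconds(granulepos, frame_rate):
    """Rough distance back to the start of the access unit, in seconds."""
    _, offset = unpack_granulepos(granulepos)
    return offset / frame_rate
```

For instance, `pack_granulepos(100, 112)` stores 100 in the higher-order field and 12 in the lower-order field; at 24 frames per second that offset corresponds to 0.5 seconds before the current position.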
Needs clarification: what exact granulepos does the RAP packet at the beginning of the access unit get, and where exactly does the offset described above point? If we apply the 'count of decodable frames' across an AU boundary, neither the RAP nor an I frame that is followed by predicted frames from the previous AU actually advances the granulepos, and so the shift will point to a bit after the AU boundary.
Ogg packets are packed into Ogg pages to make a logical (or degenerate physical) Ogg bitstream. In keeping with best practices, the following rules apply:
- The Ogg Dirac bos packet MUST occur on a page by itself
- Each Access Unit MUST begin at the start of a page (i.e. there must be a page flush before it) to simplify seeking and editing
An implementation might want to choose an Ogg Page size near the average size of the Dirac packets to best balance overhead and interleave granularity.
Needs clarification: The RAP header is considered part of the Access Unit it heads and thus we might not require a page break after it. However, doing so for the first one places the setup data for other interleaved codecs closer to the beginning of the stream, so it might be a good idea to require it for that instance?
When multiplexing an Ogg Dirac logical bitstream the usual rule of ordering Ogg pages by the temporal equivalent of their marked granulepos is followed. At the beginning of a chain segment, all the bos pages must occur together before any non-bos pages. eos pages may occur anywhere in the stream, but after an eos page, further pages with that serial number must not occur.
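A rough sketch of this ordering rule is given below. It assumes a hypothetical page representation, a `(time_seconds, serial, is_bos)` tuple where `time_seconds` is the temporal equivalent of the page's granulepos, and it assumes each stream's pages are already in non-decreasing time order. It does not decide which bos page comes first.

```python
import heapq


def interleave_pages(streams):
    """Merge the pages of several logical bitstreams into one page sequence.

    streams -- list of per-stream page lists; each page is a hypothetical
               (time_seconds, serial, is_bos) tuple, in stream order
    """
    # All bos pages are grouped together ahead of any non-bos page.
    bos_pages = [page for pages in streams for page in pages if page[2]]
    bodies = [[page for page in pages if not page[2]] for pages in streams]
    # Remaining pages are ordered by the time derived from their granulepos.
    merged = heapq.merge(*bodies, key=lambda page: page[0])
    return bos_pages + list(merged)
```

Here the bos pages are simply emitted in the order the streams were supplied, so the caller remains responsible for listing the streams in the desired order.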
If Dirac video is the primary media track in the file, the bos page for the Dirac logical bitstream should occur first in the file. If there are multiple Dirac streams multiplexed together, the muxer should place the "default" choice for naive playback as the first bos page.
It is recommended that an Ogg Dirac file include an OggSkeleton stream to describe its components and incorporate metadata. If it does so, the OggSkeleton bos page should occur first, and the other bos pages may occur in any order.
This mapping is a little unusual in that the bos packet contains only a codec identification magic number and no actual decode parameters. However, the next packet in the stream must be the start of an 'Access Unit', which is always a RAP Parse Unit, and that packet does contain such details. The RAP packet is equivalent to the "codec setup header" packets used by the Vorbis and Theora codecs, but all the information is in a single packet. Unlike Vorbis and Theora, this packet may repeat throughout the bitstream, flagging points where decoding can be restarted. Since one also appears (with granulepos 0|0) as the second packet in the Ogg stream, the usual "Icecast-style" Ogg streaming server logic will work; however, the decoder (a) may not be able to decode all frames until the next in-band RAP packet occurs and (b) will technically be receiving an invalid Dirac stream if the server prepended the RAP at an arbitrary point in the stream. Therefore Dirac still requires special-case handling in an Ogg-based streaming server, either editing the stream after sending a cached RAP, or not caching the RAP at all.