From XiphWiki
Revision as of 11:13, 15 November 2008 by Rillian (talk | contribs) (→‎The format of the ident header: Clarify granulerate is 32/32 bits)
Jump to navigation Jump to search


This page describes a generic media mapping (i.e. rules for multiplexing) of "text codecs" into Ogg.

Text codecs are sequences of text chunks that have a timing relationship to an audio or video stream.

Prominent examples of such text codecs are:

  • CC: closed captions (for the deaf)
  • SUB: subtitles
  • TAD: textual audio descriptions (for the blind)
  • KTV: karaoke
  • TIK: ticker text
  • AR: active regions
  • NB: metadata & semantic annotations
  • TRX: transcripts / scripts
  • LRC: lyrics
  • LIN: linguistic markup
  • CUE: cue points, DVD style chapter markers and similar navigational landmarks

[It is possible to also add images as a stream type, but then we are going beyond text codecs...]

There are a multitude of existing open formats for specifying some of these - in particular for specifying closed captions and subtitles. They come in different complexities - some being simply a time stamp and a text, others providing for extensive styling, graphics, and motion of the text blocks over time.

No matter what the differences - when multiplexing such codecs into Ogg, they all have to solve the same problems. This is why this page describes generically how to multiplex text codecs into Ogg.

Codecs with existing mappings are:

  • CMML
  • Kate

Bitstream Format

Ogg codecs consist of a sequence of header packets and data packets.

Header packets contain information necessary to identify and set up the codec. Data packets contain the actual codec data, in this case the time-aligned text.

When these packets are multiplexed into Ogg, they are mapped to Ogg pages. For text codecs, there is a sequence of header pages, a sequence of data pages, and an EOS page, which finishes the stream. The pages have to be ordered non-decreasing with time. No data can come before any of the header pages or after the EOS page.

Header pages

Header packets are a sequence of:

  • one ident header, which identifies the codec
  • one (optional) vorbis-comment header
  • one or more secondary header packets that are codec specific

Any text codec has to map its header information into these header packets.

Header packets must appear in order and all header packets must appear before any data packet. Each header packet is encapsulated in one Ogg page.

A codec that wants to use this specification

The format of the ident header

 0               1               2               3              |
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
| packtype      | Identifier char[3]: 'txt'                     | 0-3
| Identifier char[4]: codec identifier                          | 4-7
| Version majorT| Version minorT| Version major | Version minor | 8-11
| Offset to message header fields                               | 12-15
| Number of header packets                                      | 16-19
| Granulerate numerator                                         | 20-23
| Granulerate denominator                                       | 24-27
| Granuleshift  | Padding / future use                          | 28-31
| Text type code                                                | 32-35
| Screen width                                                  | 36-39
| Screen height                                                 | 40-43
| Message header fields: Content-type & Content-language        | 44-
| ...

Fields with more than one byte length are encoded LSB (least significant byte) first.

As per the Ogg specification, granule positions of pages must be non decreasing within the stream. Header pages have granule position 0.

Description of the fields:

  • packtype:

Each text codec page starts with a one byte type, just like this ident header. Similar to Kate, a type with the MSB set (eg, between 0x80 and 0xff) indicates a header packet, while a type with the MSB cleared (eg, between 0x00 and 0x7f) indicates a data packet. We use the packtype field in order to distinguish between the different header and data packet types.

The following packtypes are distinguished:

0x80 ID header (BOS page)
0x81 vorbis-comment header (optional)
0x82-0xff secondary header pages of the codec in an order defined by the codec (optional)
0x00 text data (including optional motions and overrides; eos is a special text data page)
0x01 keepalive (the repetition of a text data packet with only a changed data type)

  • Text codec framework magic:

In all header pages, the packtype is followed by the text codec framework magic from byte offset 1 to byte offset 3 ("txt").

  • Text codec magic:

The succeeding four bytes are used by the text codec to identify itself.

For example, when CMML moves to using this generic text codec mapping approach rather than its own, the first eight bytes of the ident header will identify the track as a CMML codec track through a signature string of "\200txtCMML".

Or as another example take a signature string of "\200txtsrt\0" which will identify srt being mapped.

  • Versions:

Version majorT & minorT are fields that define the version of this text framework mapping. Right now, it is majorT = 1, minorT = 0. The Version major & minor fields are used by the text codec to define the version of its mapping.

  • Offset to message header fields:

A 4 Byte unsigned integer that contains the number of Bytes used in this packet before the message header fields.

  • Number of header packets:

A 4 Byte unsigned integer that contains the number of header packets of that particular logical bitstream consisting of the bos page and the secondary header pages.

  • Granulerate numerator & denominator

A 4 byte unsigned integer each. They represent the temporal resolution of the logical bitstream in Hz given as a rational number. The default granule rate for text codecs is: 1/1000.

  • Granuleshift

A 1 Byte unsigned integer describing whether to partition the granule_position into two for that logical bitstream, and how many of the lower bits to use for the partitioning. The upper bits signify a time-continuous granule position for an independently decodeable and presentable data granule. The lower bits are generally used to specify the relative offset of dependent packets, such as predicted frames of a video. Hence these can be addressed, though not decoded without tracing back to the last fully decodeable data granule. This is the case with Ogg Theora; the general procedure is given in section 3.2.

  • Padding/future use

3 Bytes padding data that may be used for future requirements and are mandated to zero in this revision.

  • Text type code:

A 4 Byte string signifying one of the text codec types as listed above. This provides information to the media player as to what kind of data to expect in the sequel.

  • Screen width / height:

A 4 Byte unsigned integer each. They represent the width and height in pixels of the area onto which the text is meant to be displayed. This allows a render engine to adapt the screen layout of the text that can be provided in relative terms (percentage of the screen width / height). If 0 width and height are provided, the width and height default to the size of the first video stream

  • Message header fields

Message header fields follow the generic Internet Message Format defined in RFC 2822. Each header field consists of a name followed by a colon (":") and the field value. Field names are case-insensitive. The field value MAY be preceded by any amount of LWS, though a single SP is preferred. Header fields can be extended over multiple lines by preceding each extra line with at least one SP or HT. Message header fields are encoded in UTF-8, preferrably using US-ASCII code points.

Two message header fields are defined:

  • Content-Type message header field, e.g. text/x-cmml, text/x-srt according to RFC 2045
    • if no charset parameter is given, the character encoding defaults to UTF-8
    • directionality of this text is implicitly given by the charset
  • Content-Language, e.g. en-AU, de-DE according to RFC 4646 and http://www.iana.org/assignments/language-tags

The format of the Skeleton Fisbone

Text codecs are expected to be used in conjunction with Ogg Skeleton.

The following message header fields should be encoded in the Fisbone header for a text codec:

  • Content-Type message header field, e.g. text/x-cmml, text/x-srt according to rfc2045
    • if no charset parameter is given, it defaults to UTF-8
    • directionality of this text is implicitly given by the charset
  • Content-Language, e.g. en-AU, de-DE according to rfc4646 and http://www.iana.org/assignments/language-tags
  • Text-Type, e.g. CC, SUB, etc (see abbreviations as defined above)

The granule position format

Text codecs are discontinuous codecs. The data packets are placed into the Ogg stream at the time where the data starts. Their duration is given inside the data packet. Thus, when seeking to a specific time offset in a Ogg file that has a text codec, it can be quite difficult to determine what text data should still be on screen.

The solution is to encode the time of the first text codec page that is still active into each text codec page.

Then, as you seek to the current time in the Ogg stream, you find the previous text codec page in the stream and from it the first still active text codec page. From that one, you have to gather all those text codec pages that are still active at the current time.

For this, the granulepos is being segmented into two parts, where one part signifies the insertion time of the first still active text codec page and the second part contains the offset since then.

| granulepos prev | offset          |

This is the same scheme as the one used in Kate and similar to the one used in CMML.

The size of this segmentation is stored in the 1 Byte integer granuleshift number of the Skeleton Fisbone. It describes how many of the lower bits to use for the partitioning. The upper bits then still signify a time-continuous granule position for a directly decodable and presentable data granule. The lower bits allow for specification of the granule position of a previous still active text codec page.

The default granule shift used for text codecs is 32, which halfs the granule position to allow for the backwards pointer.

Data pages

Data packets are generally the text data that is encapsulated into Ogg at a specific time.

For text codecs, large data packets, after being mapped into one or more Ogg pages, should be flushed. Small data packets that start at the same time, should be consolidated into one page. The insertion start time is encoded in the granule_pos of the Ogg page.

Since with text codecs we are talking about discontinuous codecs, there may be a long time between codec pages in a multiplexed stream. Therefore, optionally, the inclusion of keep-alive pages to be sent at regular intervals in the data stream is encouraged. This helps a decoder's seeking code to find a currently active text packet more easily.

Thus, the following data pages can be distinguished:

  • ordinary data pages
  • keep-alive pages

The format of a Data packet

 0               1               2               3              |
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
| packtype      | filler / future use - 0                       | 0-3
| Start Time                                                    | 4-7
| Start Time (continued)                                        | 8-11
| End Time                                                      | 12-15
| End Time (continued)                                          | 16-19
| Data size                                                     | 20-23
| Codec Data ...                                                | 24-...

Fields with more than one byte length are encoded LSB (least significant byte) first.

  • start / end time:

The granulepos of a data page contains the start time encoded as a granule position in the higher bytes (see above). The lower bytes are used for seeking.

Therefore, the duration that a data page is on screen is not encoded in the encapsulation format. In CMML, the duration was set by creating pages that would end previously active pages. Here, instead, the duration is encoded directly into the data page.

For Ogg, only the end time would be required to be inserted into the data page, since the granulepos encodes the start time. However, to be encapsulation independent, both start and end time are included. Both are specified in seconds from the start of the video and require 8 Bytes.

  • data size:

A 4 Byte unsigned integer specifying the size of the codec data in bytes.

  • codec data:

After that, the data of the text codec is included. This is preferably be without a repetition of the start and end time. For example, for srt, it makes sense to include just the subtitle text as codec data.

EOS page

The EOS page ends a text codec stream. It can be an empty page, or contain the last data packet(s) of the text codec stream.