From XiphWiki


This page describes a generic media mapping (i.e. rules for multiplexing) of "text codecs" into Ogg.

Text codecs are sequences of text chunks that have a timing relationship to an audio or video stream. A text codec carries data that is logically regarded as text, in particular data of content-type text/*, but is not restricted to it. The format is meant in particular for subtitles and closed captions. Non-text representations of subtitles (e.g. bitmaps), open captions, and audible audio descriptions (i.e. sound files) are not meant to go into OggText, since they provide less flexibility for display mechanisms, search indexing, and general processing.

A text codec should not contain binary data, in particular images. Where images are needed, they should be referenced. The reference can be a URL, which keeps the image external to the media data. It can also point to an image embedded in the Ogg stream in a different track from the text codec, although that is outside the scope of this specification.

There are a multitude of existing open formats for specifying text codecs - in particular for specifying closed captions and subtitles. They come in different complexities - some being simply a time stamp and a text, others providing for extensive styling, graphics, and motion of the text blocks over time.

Whatever their differences, all of these codecs have to solve the same problems when they are multiplexed into Ogg. This is why this page describes generically how to multiplex text codecs into Ogg.

Text Codecs with existing, differing mappings into Ogg are:

  • CMML
  • Kate

These have been used as inspiration for the specification here.

Also, in the stricter sense of "text codecs" as specified above, Kate is not a text codec, since it has the ability to encapsulate binary data inside Kate's data packets. If just the text part of Kate is used, with binary parts replaced by links, one could define a text codec that can be mapped into OggText. However, Kate already has an Ogg mapping, and it is a perfectly valid alternative means of specifying text codecs where all the related data is encapsulated inside the same track.

Categories of Text Codecs

In a requirements study undertaken by Mozilla, a substantial number of text codec categories have been identified.

Prominent examples are:

  • CC: closed captions (for the deaf)
  • SUB: subtitles
  • TAD: textual audio descriptions (for the blind; to be used as braille or through TTS)
  • KTV: karaoke
  • TIK: ticker text
  • AR: active regions
  • NB: semantic annotations, including speech bubbles and director comments
  • META: metadata, mostly machine-readable
  • TRX: transcripts / scripts
  • LRC: lyrics
  • LIN: linguistic markup
  • CUE: cue points, DVD style chapter markers and similar navigational landmarks

They are distinguished based on their use cases and based on the typical way in which they may be represented on screen (or off screen for that matter).

In this specification, a logical bitstream in Ogg has to be identified as representing one of these codec categories. Should you require a codec category other than the ones specified here, please discuss it on ogg-dev@lists.xiph.org. The characters of codec category codes are to be given in ASCII.

Bitstream Format

Ogg codecs consist of a sequence of header packets and data packets.

Header packets contain information necessary to identify and set up the codec. Data packets contain the actual codec data, in this case the time-aligned text.

When these packets are multiplexed into Ogg, they are mapped to Ogg pages. In Ogg, there is a sequence of header pages, a sequence of data pages, and an EOS page, which finishes the stream. The pages have to be ordered non-decreasing with time. No data can come before any of the header pages or after the EOS page.

Text codecs have to take complete care of their layout. E.g. if srt is encapsulated into Ogg with this mapping, it makes sense to specify the relative screen region into which the srt text segments are to be rendered in an srt header page. Otherwise the media player has to make assumptions based on the codec category.

[Possibly it makes sense to define some default layout options for the different codec categories.]

Header pages

Header packets are a sequence of:

  • one ident header, which identifies the codec
  • one (optional) vorbis-comment header
  • one or more secondary header packets that are codec specific

Any text codec has to map its header information into these header packets.

Header packets must appear in order and all header packets must appear before any data packet. Each header packet is encapsulated in one Ogg page.

The format of the ident header

 0               1               2               3              |
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
| packtype      | Identifier char[3]: 'txt'                     | 0-3
| Identifier char[4]: codec identifier                          | 4-7
| Version majorT| Version minorT| Version major | Version minor | 8-11
| Offset to message header fields                               | 12-15
| Offset to codec-specific headers                              | 16-19
| Number of header packets                                      | 20-23
| Granulerate numerator                                         | 24-27
| Granulerate denominator                                       | 28-31
| Granuleshift  | Padding / future use                          | 32-35
| Text category code                                            | 36-39
| Message header fields: Content-type & Content-language        | 40-
| Zero or more bytes of codec specific header data              | ...

Fields with more than one byte length are encoded LSB (least significant byte) first.
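
The layout above can be sketched with Python's struct module. This is only an illustration under stated assumptions, not a reference encoder: the function name build_ident_header is hypothetical, short codec identifiers and category codes are padded with NUL bytes (mirroring the "srt\0" signature used on this page), and the granule rate of 1000/1 assumes that the default is meant as millisecond resolution.

```python
import struct

# Fixed 40-byte part of the ident header. "<" selects little-endian (LSB first)
# without alignment padding; "3x" is the 3 reserved bytes after the granuleshift.
IDENT_FIXED = struct.Struct("<B3s4sBBBBIIIIIB3x4s")


def build_ident_header(codec_id, category, message_headers, codec_data=b"",
                       codec_version=(1, 0), num_headers=1,
                       granulerate=(1000, 1), granuleshift=24):
    """Assemble an ident header packet (hypothetical helper, not a real API)."""
    msg = message_headers.encode("utf-8")
    offset_msg = IDENT_FIXED.size          # bytes before the message header fields
    offset_codec = offset_msg + len(msg)   # bytes before codec-specific header data
    fixed = IDENT_FIXED.pack(
        0x80, b"txt",
        codec_id.ljust(4, b"\0"),          # assumption: NUL padding for short ids
        1, 0,                              # framework version majorT, minorT
        codec_version[0], codec_version[1],
        offset_msg, offset_codec, num_headers,
        granulerate[0], granulerate[1], granuleshift,
        category.ljust(4, b"\0"),          # assumption: NUL padding for short codes
    )
    return fixed + msg + codec_data


packet = build_ident_header(
    b"srt", b"SUB",
    "Content-Type: text/x-srt\r\nContent-Language: en-AU\r\n")
```

Storing the two offsets explicitly, rather than relying on the fixed size of 40 bytes, is what allows later revisions to insert further fixed fields without breaking older decoders.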

As per the Ogg specification, granule positions of pages must be non decreasing within the stream. Header pages have granule position 0.

Description of the fields:

  • packtype:

Each text codec page starts with a one-byte type, just like this ident header. Similar to Kate, a type with the MSB set (i.e. between 0x80 and 0xff) indicates a header packet, while a type with the MSB cleared (i.e. between 0x00 and 0x7f) indicates a data packet. We use the packtype field to distinguish between the different header and data packet types.

The following packtypes are distinguished:

0x80 ID header (BOS page)
0x81 vorbis-comment header (optional)
0x82-0xff secondary header pages of the codec in an order defined by the codec (optional)
0x00 text data
0x01 keepalive
0x02 repeat
0x03-0x7f special data pages of the codec (optional)

  • Text codec framework magic:

In all header pages, the packtype is followed by the text codec framework magic from byte offset 1 to byte offset 3 ("txt").

  • Text codec magic:

The succeeding four bytes are used by the text codec to identify itself.

For example, when CMML moves to using this generic text codec mapping approach rather than its own, the first eight bytes of the ident header will identify the track as a CMML codec track through a signature string of "\200txtCMML".

Or as another example take a signature string of "\200txtsrt\0" which will identify srt being mapped.

  • Versions:

Version majorT & minorT are fields that define the version of this text framework mapping. Right now, it is majorT = 1, minorT = 0. The Version major & minor fields are used by the text codec to define the version of its mapping.

  • Offset to message header fields:

A 4 Byte unsigned integer that contains the number of Bytes used in this packet before the message header fields. This is to make the decoding somewhat future-proof and allow the insertion of further header bytes into the page without destroying the decodability of the variable length message header fields.

  • Offset to codec-specific headers:

A 4 Byte unsigned integer that contains the number of Bytes used in this packet before codec-specific headers such that a codec-specific encoding can also be somewhat future-proof.

  • Number of header packets:

A 4 Byte unsigned integer that contains the number of header packets of that particular logical bitstream consisting of the bos page and the secondary header pages.

  • Granulerate numerator & denominator

A 4 byte unsigned integer each. They represent the temporal resolution of the logical bitstream in Hz given as a rational number. The default temporal resolution for text codecs is 1/1000 second, i.e. a granule rate of 1000 Hz (numerator 1000, denominator 1).

  • Granuleshift

A 1 Byte unsigned integer describing whether to partition the granule_position into two for that logical bitstream, and how many of the lower bits to use for the partitioning. The upper bits signify a time-continuous granule position for an independently decodeable and presentable data granule. The lower bits are generally used to specify the relative offset of dependent packets, such as predicted frames of a video. Hence these can be addressed, though not decoded without tracing back to the last fully decodeable data granule. This is the case with Ogg Theora; the general procedure is given in section 3.2.

  • Padding/future use

3 Bytes padding data that may be used for future requirements and are mandated to zero in this revision.

  • Text category code:

A 4 Byte string signifying one of the text codec types as listed above. This provides information to the media player as to what kind of data to expect in the sequel.

  • Message header fields

Message header fields follow the generic Internet Message Format defined in RFC 2822. Each header field consists of a name followed by a colon (":") and the field value. Field names are case-insensitive. The field value MAY be preceded by any amount of LWS, though a single SP is preferred. Header fields can be extended over multiple lines by preceding each extra line with at least one SP or HT. Message header fields are encoded in UTF-8, preferably using US-ASCII code points.

Two message header fields are defined:

  • Content-Type message header field, e.g. text/x-cmml, text/x-srt according to RFC 2045
    • if no charset parameter is given, the character encoding defaults to UTF-8
    • the default directionality of this text is implicitly given by the charset; text codecs are free to use additional LTR/RTL information
  • Content-Language, e.g. en-AU, de-DE according to RFC 4646 and http://www.iana.org/assignments/language-tags
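
The field descriptions above can be sketched as a small decoder. It is a minimal illustration, not a reference parser: the function name parse_ident_header is hypothetical, and it assumes that the message header fields occupy exactly the bytes between the two stored offsets. The standard library's email.parser.HeaderParser is used because it already implements RFC 2822 style folding and case-insensitive field names.

```python
import struct
from email.parser import HeaderParser


def parse_ident_header(packet):
    """Decode the parts of an ident header a player needs (hypothetical helper)."""
    packtype = packet[0]
    if not packtype & 0x80:
        raise ValueError("MSB cleared: this is a data packet, not a header")
    if packet[1:4] != b"txt":
        raise ValueError("missing 'txt' framework magic")
    codec_id = packet[4:8].rstrip(b"\0").decode("ascii")
    # Use the stored offsets instead of a hard-coded 40 so that packets with
    # additional fixed fields (future revisions) still decode.
    offset_msg, offset_codec = struct.unpack_from("<II", packet, 12)
    category = packet[36:40].rstrip(b"\0").decode("ascii")
    fields = HeaderParser().parsestr(packet[offset_msg:offset_codec].decode("utf-8"))
    return {
        "codec": codec_id,
        "category": category,
        "content_type": fields.get_content_type(),
        # No charset parameter means UTF-8, as stated above.
        "charset": fields.get_content_charset() or "utf-8",
        "language": fields["Content-Language"],
        "codec_data": packet[offset_codec:],
    }


# Example packet: fixed part, two message header fields, two bytes of codec data.
_fields = b"Content-Type: text/x-srt\r\ncontent-language: en-AU\r\n"
_fixed = struct.pack("<B3s4sBBBBIIIIIB3x4s", 0x80, b"txt", b"srt", 1, 0, 1, 0,
                     40, 40 + len(_fields), 1, 1000, 1, 24, b"SUB")
info = parse_ident_header(_fixed + _fields + b"\x01\x02")
```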

Use of Skeleton

Text codecs must be used in conjunction with Ogg Skeleton.

Skeleton records information about a logical bitstream in a header called "fisbone". This specification defines that the granuleshift of the text codec is read from the Skeleton fisbone. Additionally, the following fisbone message header fields should be used:

  • Content-Type, e.g. text/x-cmml, text/x-srt according to RFC 2045
    • if no charset parameter is given, it defaults to UTF-8
    • the default directionality of this text is implicitly given by the charset; text codecs are free to use additional LTR/RTL information
  • Content-Language, e.g. en-AU, de-DE according to RFC 4646 / BCP 47 and http://www.iana.org/assignments/language-tags
  • Text-Type, e.g. CC, SUB, etc (see abbreviations as defined above)

Note that duplication of this information inside the text codec header and here is intentional, since the text codec can then be used also with other encapsulation formats. The additional exposure of this information in the skeleton header allows this information to be available without any text codec libraries (e.g. in a Web proxy).

Data pages

Data packets are generally the text data that is encapsulated into Ogg at a specific time.

For text codecs, large data packets should be flushed after being mapped into one or more Ogg pages. Small data packets that start at the same time should be consolidated into one page. The insertion start time is encoded in the granule_pos of the Ogg page.

Since text codecs are discontinuous codecs, there may be a long time between codec pages in a multiplexed stream. Therefore, the optional inclusion of keep-alive and/or repeat pages at regular intervals in the data stream is encouraged. This helps a decoder's seeking code to find a currently active text packet more easily.

Thus, the following data pages can be distinguished:

  • ordinary data pages
  • keep-alive pages
  • repeat pages

The format of a Data packet

 0               1               2               3              |
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
| packtype      | filler / future use - 0                       | 0-3
| Start Time                                                    | 4-7
| Start Time (continued)                                        | 8-11
| End Time                                                      | 12-15
| End Time (continued)                                          | 16-19
| Offset to actual text data                                    | 20-23
| Offset to other codec-specific data                           | 24-27
| Fields for future use ...                                     | 28-...
| Text Codec Data ...                                           | ...
| Other Codec Data ...                                          | ...

Fields with more than one byte length are encoded LSB (least significant byte) first.

  • start / end time:

The Ogg framing header of a data page contains the start time encoded as a granule position in the higher bytes (see above). The lower bytes are used for seeking.

Therefore, the duration that a data page is on screen is not encoded in the encapsulation format. In CMML, the duration was set by creating pages that would end previously active pages. Here, instead, the duration is encoded directly into the data page.

For Ogg, only the end time would need to be stored in the data page, since the granule position encodes the start time. However, to be independent of the encapsulation, both start and end time are included. Both are specified in seconds from the start of the video, and each occupies 8 Bytes.

  • Offset to actual text data:

A 4 Byte unsigned integer that contains the number of Bytes used in this packet before the actual text codec data. This enables a backwards compatible extension of the data packet in the future with further fields before the actual text starts.

  • Offset to other codec-specific data:

A 4 Byte unsigned integer that contains the number of Bytes used in this packet before other codec-specific data such that the text part can always be identified. This enables codecs to use data outside the core text if necessary.

  • Fields for future use:

Optional. May be used by a future OggText specification, or a text codec if needed.

  • Text codec data:

After that, the actual data of the text codec is included, preferably without a repetition of the start and end time. For example, for srt, it makes sense to include just the subtitle text as codec data.

  • Other codec data:

And finally, optionally, there can be other codec data that is not regarded as core text.
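
The data packet layout can be sketched in the same way. One loud assumption: this page says the start and end times are "in seconds" and take 8 Bytes each, but does not state their binary representation, so the sketch treats them as little-endian IEEE 754 doubles purely for illustration. The helper names are hypothetical, and no "fields for future use" are emitted.

```python
import struct

# packtype, 3 filler bytes, start, end, offset to text, offset to other data.
# ASSUMPTION: times are stored as 64-bit floats ("d"); the page does not say.
DATA_FIXED = struct.Struct("<B3xddII")
PACKTYPES = {0x00: "text", 0x01: "keepalive", 0x02: "repeat"}


def build_data_packet(start, end, text, other=b"", packtype=0x00):
    body = text.encode("utf-8")
    offset_text = DATA_FIXED.size            # 28: no future-use fields present
    offset_other = offset_text + len(body)   # other codec data follows the text
    return DATA_FIXED.pack(packtype, start, end, offset_text, offset_other) + body + other


def parse_data_packet(packet):
    packtype, start, end, offset_text, offset_other = DATA_FIXED.unpack_from(packet)
    if packtype & 0x80:
        raise ValueError("MSB set: this is a header packet, not a data packet")
    return {
        "kind": PACKTYPES.get(packtype, "codec-specific"),
        "start": start,
        "end": end,
        # Slicing by the stored offsets skips any future-use fields.
        "text": packet[offset_text:offset_other].decode("utf-8"),
        "other": packet[offset_other:],
    }


decoded = parse_data_packet(build_data_packet(1.5, 4.0, "Hello, world", other=b"\x07"))
```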

EOS page

The EOS page ends a text codec stream. It can be an empty page, or contain the last data packet(s) of the text codec stream.

Seeking

Text codecs are discontinuous codecs. The data packets are placed into the Ogg stream at the time where the data starts. Their duration is given inside the data packet. Thus, when seeking to a specific time offset in an Ogg file that has a text codec, it can be quite difficult to determine what text data should still be on screen.

The solution is to encode the time of the first text codec page that is still active into each text codec page.

Then, as you seek to a target time in the Ogg stream, you find the previous text codec page in the stream and from it the first still active text codec page. From that one, you have to gather all those text codec pages that are still active at the current time.
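
The seek procedure just described can be sketched over an in-memory list of packets. This is a simplified model rather than real Ogg page handling: the Packet class and active_at function are hypothetical, all times are expressed in granule units instead of seconds, and repeat packets are ignored.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    curr_granule: int   # insertion time of this packet
    prev_granule: int   # insertion time of the first packet that is still active
    start: int          # start time (granule units, for simplicity)
    end: int            # end time (granule units, for simplicity)
    text: str


def active_at(packets, target):
    """Return the packets on screen at `target`; `packets` is in stream order."""
    before = [p for p in packets if p.curr_granule <= target]
    if not before:
        return []
    # Step 1: the previous packet tells us where the first still-active one is.
    first_active = before[-1].prev_granule
    # Step 2: gather everything from there onwards that is active at `target`.
    return [p for p in before
            if p.curr_granule >= first_active and p.start <= target < p.end]


stream = [
    Packet(0, 0, 0, 10000, "logo"),
    Packet(2000, 0, 2000, 3000, "hi"),
    Packet(5000, 0, 5000, 6000, "there"),
]
on_screen = [p.text for p in active_at(stream, 5500)]
```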


The granulepos is separated into two parts, where one part signifies the insertion time of an earlier text codec page that is still active, and the second part contains the offset since then.

| prev_granule    | offset          |

This is the same scheme as the one used in Kate and CMML, and similar to that used in Theora to reference keyframes. It is explained in more detail in the page GranulePosAndSeeking.

The size of this segmentation is stored in the 1 Byte integer granuleshift field of the Skeleton Fisbone. It describes how many of the lower bits to use for the partitioning.
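
Splitting a granulepos back into its two parts is a shift and a mask. The sketch below uses hypothetical function names, and the conversion to seconds assumes a granule rate of 1000/1 (millisecond granules) unless other values are passed in.

```python
from fractions import Fraction


def split_granulepos(granulepos, granuleshift):
    """Return (prev_granule, offset) for a combined granulepos."""
    prev_granule = granulepos >> granuleshift
    offset = granulepos & ((1 << granuleshift) - 1)
    return prev_granule, offset


def insertion_time_seconds(granulepos, granuleshift, rate_num=1000, rate_den=1):
    """Insertion time of the page: (prev_granule + offset) / granule rate."""
    prev_granule, offset = split_granulepos(granulepos, granuleshift)
    return Fraction((prev_granule + offset) * rate_den, rate_num)


example = (5000 << 24) | 1500   # prev_granule 5000, offset 1500
parts = split_granulepos(example, 24)
seconds = insertion_time_seconds(example, 24)
```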

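The recommended granuleshift for text codecs is 24 bits. At millisecond granularity, the remaining 40 upper bits cover a stream duration of about 34 years, and the 24 lower bits allow an offset of more than 4 hours between a page and the earlier page it refers to, which should be sufficient for any long-form video annotation.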
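
The two figures can be checked with a few lines of arithmetic. The sketch assumes millisecond granules (1000 granules per second) and that all 40 upper bits of the 64-bit granulepos are available for prev_granule, which is what the 34-year figure corresponds to.

```python
GRANULESHIFT = 24
GRANULES_PER_SECOND = 1000            # assumption: millisecond granules
UPPER_BITS = 64 - GRANULESHIFT        # 40 bits left for prev_granule

SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Largest representable prev_granule, expressed as a stream duration in years.
max_duration_years = (1 << UPPER_BITS) / GRANULES_PER_SECOND / SECONDS_PER_YEAR

# Largest representable offset, expressed as a gap between pages in hours.
max_gap_hours = (1 << GRANULESHIFT) / GRANULES_PER_SECOND / 3600

print(f"{max_duration_years:.1f} years, {max_gap_hours:.2f} hours")
```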

Selection of prev_granule

This algorithm clarifies the selection of "first still active packet" for prev_granule. It must be used by an encoder that is creating an OggText stream.

In the case of a text codec which does not allow overlapping data, such as CMML, the prev_granule is simply the insertion time of the previous text codec packet.

In the case where the active times of different data packets may overlap, an arbitrary number of packets may be active at the time of inserting a new data packet. The following specifies the encoder algorithm for choosing prev_granule for a packet's insertion at time t:

  • Identify all packets that are still active at time t
  • For each of these: select only the most recent repeat, or the original in the case of no repeats. This identifies the most recent representative for each currently-active packet
  • Choose the earliest curr_granule of those representatives

The result of that procedure is then used for the prev_granule field of the new packet's granulepos, and the difference between that and the curr_granule is encoded in the offset of the granulepos:

 granulepos = (prev_granule << granuleshift) | (curr_granule - prev_granule)
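
The three selection steps and the granulepos formula can be sketched as follows. The Entry class and both function names are hypothetical, end times are expressed in granule units, and falling back to prev_granule = curr_granule when nothing is active is an assumption of this sketch (the page states that rule only for repeat packets and keepalives).

```python
from dataclasses import dataclass


@dataclass
class Entry:
    event: str          # identifies the original packet this entry carries
    curr_granule: int   # insertion time of the original or of a repeat
    end: int            # end time of the event, in granule units


def select_prev_granule(stream, t):
    """Choose prev_granule for a packet inserted at time t (stream in order)."""
    latest = {}
    for entry in stream:
        if entry.curr_granule <= t < entry.end:        # step 1: still active at t
            latest[entry.event] = entry.curr_granule   # step 2: newest representative
    return min(latest.values(), default=t)             # step 3: earliest of those


def encode_granulepos(prev_granule, curr_granule, granuleshift=24):
    offset = curr_granule - prev_granule
    if not 0 <= offset < (1 << granuleshift):
        raise ValueError("offset does not fit into the lower granuleshift bits")
    return (prev_granule << granuleshift) | offset


stream = [
    Entry("A", 0, 10000),
    Entry("B", 2000, 3000),
    Entry("A", 4000, 10000),   # a repeat of A inserted at 4000
]
prev = select_prev_granule(stream, 5000)    # only A is active; its repeat is newest
granulepos = encode_granulepos(prev, 5000)
```

If the offset check fails, the gap to the earliest representative is too large for the chosen granuleshift, which is one of the situations repeat packets are meant to prevent.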

Repeat packets and Keepalives

Discontinuous codecs can be sparse. This means that "finding the previous text codec page in the stream" may require a lot of seeking backwards - it's essentially unbounded. Similarly, the gathering of "all those text codec pages that are still active at the current time" may take quite some time if, e.g. one page is constantly visible, such as a logo.

We bound the distance that needs to be searched in either direction through the use of keep-alives and repeat packets.

Repeat packets

Repeat packets repeat all the data of an earlier packet which is still active at a certain point in the data stream. This avoids having to continue seeking to gain the information.

With the following rules we can achieve the goal of bounding the seek distance while accessing all data with the existing double-seek algorithm:

  • Repeat packets are part of the stream structure; they must not be arbitrarily removed without re-encoding the granulepos of other packets in the stream.
  • A repeat packet repeats the data of one specific earlier packet; we refer to it as a repeat of that packet
  • In order to signify that multiple packets are still active, we use multiple repeat packets. They need not occur at the same time.
  • Granulepos encoding: the curr_granule=(prev_granule+offset) of a packet is its insertion time. For a repeat packet, this is *not* its start_time. (However the start and end times are copied verbatim in the data header).
  • Any packet may refer to a repeat packet in its prev_granule (including that repeat packets may refer to other repeat packets).
  • The prev_granule of a repeat packet is the curr_granule of the earliest of the packets we've chosen to represent currently active events. This may not necessarily be the event with the earliest start_time.
  • Repeat packets must not have a prev_granule of the packet they are repeating. If that is the only packet that is currently active, they must set prev_granule=curr_granule and offset=0.

The above rules bound the scrollback by the repeat frequency.
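
The last two rules can be sketched as a small helper. The function name and the dictionary of representatives are hypothetical: the dictionary maps each currently active event to the curr_granule of its most recent representative (the original packet or its latest repeat).

```python
def prev_granule_for_repeat(representatives, repeated_event, curr_granule):
    """Choose prev_granule for a repeat of `repeated_event` inserted at curr_granule."""
    # A repeat must not point back at the packet it repeats, so leave it out.
    others = [granule for event, granule in representatives.items()
              if event != repeated_event]
    # If nothing else is active: prev_granule = curr_granule, i.e. offset = 0.
    return min(others, default=curr_granule)


# A and B are both active; a repeat of A inserted at 30000 points back at B.
with_b = prev_granule_for_repeat({"A": 0, "B": 2000}, "A", 30000)

# Only A is active; its repeat points at itself with an offset of 0.
alone = prev_granule_for_repeat({"A": 0}, "A", 30000)
```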

We consider an example of how this ensures that all active packets are found in the case where two packets A and B are active at a given time. After seeking to the desired time point, you scan forward from there and find a repeat packet for A. Its prev_granule must refer to packet B, or, if packet B has been repeated, to the most recent repeat of packet B.

Keepalives

A keep-alive is essentially a packet, on a standalone page, that links back to an earlier still active text codec packet. It is inserted frequently into the stream, e.g. at 30-second intervals, and thus provides a quicker way of seeking back to the required position. An optimisation is to add the positions of all still active text codec pages into the keep-alive packet to allow direct back-linking to all of them to gather the required information.

In order to bound the forward scan, we use keepalives that specify that no packets are active. They must have prev_granule = curr_granule and offset = 0, and must be inserted in the stream at the same repeat frequency.
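
A minimal sketch of placing such "nothing active" keepalives follows. The function names are hypothetical, times are in granule units, and the sketch only covers idle stretches; wherever a packet is active, a repeat packet would be used instead.

```python
def keepalive_granulepos(curr_granule, granuleshift=24):
    # prev_granule = curr_granule and offset = 0
    return curr_granule << granuleshift


def idle_keepalives(active_spans, stream_end, interval, granuleshift=24):
    """Return (insertion time, granulepos) pairs for idle keepalives.

    active_spans: (start, end) pairs, in granules, of all text packets.
    """
    result = []
    for t in range(interval, stream_end, interval):
        if not any(start <= t < end for start, end in active_spans):
            result.append((t, keepalive_granulepos(t, granuleshift)))
    return result


# One caption active for the first 45 s of a 120 s stream, 30 s interval,
# millisecond granules: keepalives are needed at 60 s and 90 s.
keepalives = idle_keepalives([(0, 45000)], stream_end=120000, interval=30000)
```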


Inserting keepalives or repeat packets into an existing stream will improve seekability by reducing the forward scan. If an update is also made to the granulepos of other packets which are byte-wise later in the file than the newly inserted keepalive or repeat packet, the scrollback of those packets will also be reduced.