OggText

From XiphWiki


This page describes a generic media mapping (i.e. rules for multiplexing) of "text codecs" into Ogg.

Text codecs are sequences of text chunks that have a timing relationship to an audio or video stream.

Prominent examples of such text codecs are:

  • closed captions (for the deaf)
  • subtitles
  • textual audio descriptions (for the blind)
  • karaoke
  • ticker text
  • active regions
  • metadata & semantic annotations
  • transcripts
  • lyrics
  • titles / credits

There is a multitude of existing open formats for some of these, in particular for closed captions and subtitles. They vary in complexity: some consist simply of a time stamp and a text, while others provide for extensive styling, graphics, and motion of the text blocks over time.

Whatever their differences, all of these codecs have to solve the same problems when they are multiplexed into Ogg. This page therefore describes generically how to multiplex text codecs into Ogg.

Codecs with existing mappings are:

  • CMML
  • Kate


Bitstream Format

Ogg codecs consist of a sequence of header packets and data packets.

Header packets contain information necessary to identify and set up the codec. Data packets contain the actual codec data, in this case the time-aligned text.

When these packets are multiplexed into Ogg, they are mapped to Ogg pages. For text codecs, there is a sequence of header pages, a sequence of data pages, and an EOS page, which finishes the stream. Pages must be ordered so that time is non-decreasing. No data can come after the EOS page.


Header pages

Header packets are a sequence of:

  • one ident header, which identifies the codec
  • one (optional) vorbis-comment header
  • one or more secondary header packets that are codec specific

Any text codec has to map its header information into these header packets.

Header packets must appear in order and all header packets must appear before any data packet. Each header packet is encapsulated in one Ogg page.

Each text codec page starts with a one-byte type, the packtype. As in Kate, a packtype with the MSB set (i.e. between 0x80 and 0xff) indicates a header packet, while a packtype with the MSB cleared (i.e. between 0x00 and 0x7f) indicates a data packet. The packtype field also distinguishes between the different header and data packet types.

The following packtypes are distinguished:

headers:
  0x80       ID header (BOS page)
  0x81       Vorbis comment header (optional)
  0x82-0xff  secondary header pages of the codec, in an order defined by the codec (optional)
data:
  0x00       text data (including optional motions and overrides)
  0x01       keepalive
  0x7f       end page (EOS page)
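
The MSB rule and the packtype table can be sketched in a few lines of Python. The function and dictionary names are hypothetical and chosen only for illustration, and the label "unassigned data type" for 0x02-0x7e is an assumption, since the table does not list those values.

```python
# Packtype values copied from the table above.
PACKTYPE_NAMES = {
    0x80: "ID header",
    0x81: "Vorbis comment header",
    0x00: "text data",
    0x01: "keepalive",
    0x7F: "end page",
}


def is_header(packtype: int) -> bool:
    """Return True when the MSB of the one-byte packtype is set."""
    if not 0 <= packtype <= 0xFF:
        raise ValueError("packtype must fit in one byte")
    return bool(packtype & 0x80)


def describe(packtype: int) -> str:
    """Map a packtype byte to a human-readable name."""
    if packtype in PACKTYPE_NAMES:
        return PACKTYPE_NAMES[packtype]
    if is_header(packtype):
        return "secondary header"      # 0x82-0xff, codec specific
    return "unassigned data type"      # assumption: 0x02-0x7e are not listed
```

A demultiplexer would typically call `is_header(page_body[0])` on the first byte of each page body to decide whether it is still reading headers.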

As per the Ogg specification, granule positions of pages must be non-decreasing within the stream. Header pages have granule position 0.
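
The ordering rules stated so far (ID header first, headers before data, header granule position 0, non-decreasing granule positions, nothing after EOS) can be checked with a small validator. This is only a sketch: it assumes the pages have already been reduced to hypothetical (packtype, granulepos) tuples, and it does not check the codec-defined order of secondary headers.

```python
ID_HEADER = 0x80
EOS = 0x7F


def check_page_order(pages):
    """Check a list of (packtype, granulepos) tuples given in stream order.

    Returns a list of problem descriptions; an empty list means that none
    of the checked rules is violated.
    """
    problems = []
    if not pages or pages[0][0] != ID_HEADER:
        problems.append("stream does not start with the ID header")
    seen_data = False
    seen_eos = False
    last_granulepos = 0
    for index, (packtype, granulepos) in enumerate(pages):
        if seen_eos:
            problems.append(f"page {index}: found after the EOS page")
        if packtype & 0x80:  # MSB set: header page
            if seen_data:
                problems.append(f"page {index}: header page after a data page")
            if granulepos != 0:
                problems.append(f"page {index}: header granule position is not 0")
        else:                # MSB cleared: data page
            seen_data = True
            if packtype == EOS:
                seen_eos = True
        if granulepos < last_granulepos:
            problems.append(f"page {index}: granule position decreases")
        last_granulepos = granulepos
    return problems
```

For a well-formed stream such as `[(0x80, 0), (0x82, 0), (0x00, 1000), (0x7f, 1000)]` the function returns an empty list.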

In all header pages, the packtype is followed by the text codec magic "txt" at byte offsets 1 to 3. The following four bytes (offsets 4 to 7) are used by the text codec to identify itself.

For example, when CMML moves to using this generic text codec mapping approach rather than its own, the first eight bytes of the ident header will identify the track as a CMML codec track through a signature string of "\200txtCMML".

As another example, a signature string of "\200txtsrt\0" identifies srt being mapped; the three-character identifier is padded with a NUL byte to fill the four-byte field.
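
The two signature strings can be reproduced with a short Python sketch. The helper names are hypothetical; the NUL padding of short identifiers follows the srt example, and "\200" is the octal escape for the byte 0x80.

```python
MAGIC = b"txt"  # text codec magic at byte offsets 1 to 3


def make_signature(codec_id: bytes, packtype: int = 0x80) -> bytes:
    """Build the first eight bytes of a header packet."""
    if not 1 <= len(codec_id) <= 4:
        raise ValueError("codec identifier must be 1 to 4 bytes long")
    if not 0x80 <= packtype <= 0xFF:
        raise ValueError("header packtypes have the MSB set")
    # Pad short identifiers with NUL bytes, as in the srt example.
    return bytes([packtype]) + MAGIC + codec_id.ljust(4, b"\0")


def parse_signature(packet: bytes):
    """Return (packtype, codec identifier), or None if this is not a text codec header."""
    if len(packet) < 8 or not packet[0] & 0x80 or packet[1:4] != MAGIC:
        return None
    return packet[0], packet[4:8]
```

Calling `make_signature(b"CMML")` yields the CMML signature, and `parse_signature` performs the reverse check on the first eight bytes of a header packet.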

The text codecs are expected to be used in conjunction with Ogg Skeleton and therefore don't require granulerate and granuleshift to be defined.

In addition, message header fields of Skeleton, encoded in UTF-8, are used to carry the following protocol-level header fields:

  • Text-Track-Type
  • Character-Set
  • Directionality


This is the format of the ident header:

 0               1               2               3              |
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| packtype      | Identifier char[3]: 'txt'                     | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Identifier char[4]: codec identifier                          | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Version major                 | Version minor                 | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Granulerate numerator                                         | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Granulerate denominator                                       | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Granuleshift  | Num headers   | Text encoding | Directionality| 28-31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
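
The 32-byte layout in the diagram can be read with Python's struct module. This is a sketch under stated assumptions: the page does not specify byte order or signedness, so little-endian signed 64-bit granulerate fields and little-endian 16-bit version fields are assumed here, and the field names are hypothetical.

```python
import struct
from collections import namedtuple

# Assumption: little-endian fields, signed 64-bit granulerate values.
# "<" disables alignment padding, so the format covers exactly 32 bytes.
IDENT_FORMAT = "<B3s4sHHqqBBBB"

IdentHeader = namedtuple(
    "IdentHeader",
    "packtype magic codec version_major version_minor "
    "granulerate_numerator granulerate_denominator "
    "granuleshift num_headers text_encoding directionality",
)


def parse_ident(packet: bytes) -> IdentHeader:
    """Unpack the first 32 bytes of an ident header packet."""
    size = struct.calcsize(IDENT_FORMAT)
    if len(packet) < size:
        raise ValueError(f"ident header needs at least {size} bytes")
    header = IdentHeader._make(struct.unpack_from(IDENT_FORMAT, packet))
    if header.packtype != 0x80 or header.magic != b"txt":
        raise ValueError("not a text codec ident header")
    return header
```

With illustrative values, `parse_ident(struct.pack(IDENT_FORMAT, 0x80, b"txt", b"CMML", 2, 0, 1000, 1, 32, 3, 0, 0))` returns a header whose `codec` field is `b"CMML"`.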

Version major and version minor are used by the text codec to define the version of its mapping.

The granulerate represents the temporal resolution of the logical bitstream in Hz. It is given as a rational number, in the same way as the Ogg Skeleton fisbone secondary header specifies granulerate. It enables mapping the granule position of a data page to a time in seconds by calculating "granulepos / granulerate".
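
The "granulepos / granulerate" calculation can be sketched with exact rational arithmetic. The function name is hypothetical, and the rate of 1000/1 Hz used in the usage note is an illustrative value, not one taken from this page.

```python
from fractions import Fraction


def granulepos_to_seconds(granulepos: int, rate_numerator: int, rate_denominator: int) -> Fraction:
    """Convert a granule position to seconds: granulepos / granulerate."""
    if rate_numerator == 0 or rate_denominator == 0:
        raise ValueError("granulerate numerator and denominator must be non-zero")
    granulerate = Fraction(rate_numerator, rate_denominator)  # in Hz
    return Fraction(granulepos) / granulerate
```

For example, with an assumed granulerate of 1000/1 Hz, a granule position of 1500 maps to `Fraction(3, 2)`, i.e. 1.5 seconds.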

The default granule rate for CMML is: 1/1000.

The granuleshift is a one-byte integer that describes whether the granule_position of the CMML logical bitstream is partitioned into two parts and, if so, how many of the lower bits are used for the partitioning. The upper bits still signify a time-continuous granule position for a directly decodable and presentable data granule. The lower bits allow specifying the granule position of a previous CMML data packet (i.e. a "clip" element), which helps to identify how far back to seek to reach the last, still active "clip" element of the given track. The granuleshift is therefore the base-2 logarithm of the maximum possible clip spacing.
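
The partitioning itself is plain bit arithmetic and can be sketched as follows. The function names are hypothetical, and the code only splits and joins the two parts; how a codec interprets the upper and lower part is left to the description above.

```python
def split_granulepos(granulepos: int, granuleshift: int):
    """Split a granule position into (upper bits, lower bits).

    A granuleshift of 0 means no partitioning: the lower part is always 0.
    """
    if not 0 <= granuleshift < 64:
        raise ValueError("granuleshift must be in the range 0..63")
    upper = granulepos >> granuleshift
    lower = granulepos & ((1 << granuleshift) - 1)
    return upper, lower


def join_granulepos(upper: int, lower: int, granuleshift: int) -> int:
    """Combine the two parts back into a single granule position."""
    if lower >= (1 << granuleshift) and lower != 0:
        raise ValueError("lower part does not fit into granuleshift bits")
    return (upper << granuleshift) | lower
```

With a granuleshift of 32, the value `(5000 << 32) | 1200` splits into the pair `(5000, 1200)`, and joining that pair restores the original value.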


Language and category are NUL-terminated ASCII strings. Language follows RFC 3066, though the field will not accommodate language tags with many subtags.

Data pages

Data packets generally carry the text data that is encapsulated into Ogg at a specific time. Each data packet is mapped, with all its content, onto a single Ogg data page. This is possible because text codec packets are generally rather small. The insertion time is encoded in the granule_pos of the Ogg page.

Since text codecs are discontinuous codecs, there may be a long time between codec pages in a multiplexed stream. Including keep-alive pages at regular intervals in the data stream is therefore optional but encouraged. This helps a decoder's seeking code to find a currently active text packet more easily.
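
One possible way to schedule keep-alive pages is sketched below. The policy (a keep-alive every fixed interval after the previous text data page, only inside gaps) is an assumption made for illustration; this page does not prescribe the interval or the placement. The function name and the tuple representation are hypothetical.

```python
TEXT_DATA = 0x00
KEEPALIVE = 0x01


def with_keepalives(data_positions, interval):
    """Interleave keep-alive pages into the gaps between text data pages.

    data_positions: sorted granule positions of the text data pages.
    interval: spacing of keep-alive pages, in granule position units.
    Returns a list of (packtype, granulepos) tuples in stream order.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    pages = []
    previous = None
    for position in data_positions:
        if previous is not None:
            tick = previous + interval
            while tick < position:  # fill the gap before the next data page
                pages.append((KEEPALIVE, tick))
                tick += interval
        pages.append((TEXT_DATA, position))
        previous = position
    return pages
```

The resulting granule positions are non-decreasing, so the output order satisfies the page ordering requirement of the stream.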


EOS page

The EOS page ends a text codec stream. It carries an empty packet, because all the information of the codec is encapsulated in the earlier data pages.