OggKate

Revision as of 06:43, 18 February 2008

The following is a draft!

It is at best incomplete and at worst completely broken. In any case, it is not an “official” Xiph spec or codec, so use with care.

Disclaimer

This is not a Xiph codec, but I was asked to post information about Ogg/Kate on this wiki. As such, please do not assume that Xiph has anything to do with this, much less responsibility.

What is Kate?

Kate is a codec for karaoke and text encapsulation for Ogg. Most of the time, this would be multiplexed with audio/video to carry subtitles, song lyrics (with or without karaoke data), etc, but doesn't have to be. A possible use of a lone Kate stream would be an e-book. Moreover, the motion feature gives Kate a powerful means to describe arbitrary curves, so hand drawing of shapes can be achieved. This was originally meant for karaoke use, but can be used for any purpose. Motions can be attached to various semantics, like position, color, etc, so scrolling or fading text can be defined.

Why a new codec?

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to the headers, one can't add them in the stream as they are sung, so another multiplexed stream would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

Writ is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on - it would have been quicker for me to write Kate from scratch anyway.
CMML is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this - besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex.
OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats, but none were designed for embedding inside an Ogg container.

Overview of the Kate bitstream format

I've taken much inspiration from Vorbis and Theora here. Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see Format specification) is:

Header packets:

ID header [BOS]: magic, version, granule fraction, encoding, language, etc
Comment header: Vorbis comments, as per Vorbis/Theora streams
Style definitions header: a list of predefined styles to be referred to by data packets
Region definitions header: a list of predefined regions to be referred to by data packets
Curves definitions header: a list of predefined curves to be referred to by data packets
Motion definitions header: a list of predefined motions to be referred to by data packets
Palette definitions header: a list of predefined palettes to be referred to by data packets
Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:

text data: text and optional motions, accompanied by optional overrides for style, region, language, etc
end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

A possible addition to the data packets is a "keepalive" packet, which would be sent at regular intervals when no other packet has been emitted for a while. This would be to help seeking code find a kate page more easily.

Things of note:

Kate is a discontinuous codec, as defined in ogg-multiplex.html in the Ogg documentation, which means it's timed by start granule, not by end granule as Theora and Vorbis are.
All data packets are on their own page, for two reasons:
- Ogg keeps track of granules at the page level, not the packet level
- if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

The granule encoding is not a direct time/granule correspondence, see the granule encoding section.
The EOS packet should have a granule pos higher than the end time of all events.
User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).

Format specification

At the moment, only the ID header has settled down enough to be specified here. It is still subject to change, though likely to stay as is.

In any case, all fields up to and including version minor are frozen (ID header packet type (0x80), kate magic, and major and minor version numbers).

Furthermore, the num headers, text encoding, directionality, granule shift, granule rate numerator, and granule rate denominator fields are almost cold enough to be frozen (in both their size and offset in the header packet), pending a decision regarding a more complex granule encoding.

This works out to a 57 byte ID header.

 0               1               2               3              |
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| packtype      | Identifier char[8]: 'kate\0\0\0\0'            | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| kate magic continued                                          | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| kate magic    | version major | version minor | reserved - 0  | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| num headers   | text encoding | directionality| granule shift | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate numerator                                        | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate denominator                                      | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (NUL terminated)                                     | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued)                                          | 28-31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued)                                          | 32-35
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued)                                          | 36-39
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (NUL terminated)                                     | 40-43
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued)                                          | 44-47
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued)                                          | 48-51
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued)                                          | 52-55
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|1|                                                               56-59
+-+

Language and category are NUL terminated ASCII strings. Language follows RFC 3066, though the fixed size field obviously will not accommodate language tags with lots of subtags.

Category is currently being defined, and I haven't yet found a nice way to present it in a generic way, but it is meant for automatic classification of the various multiplexed Kate streams (eg, to recognize that some streams are subtitles (in a set of languages), and some others are commentary (in a possibly different set of languages), etc).
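As a rough illustration of the layout in the diagram above, the first 56 bytes of the ID header could be decoded as follows. This is only a sketch, not libkate code: the little-endian byte order of the two 32 bit rate fields is an assumption (the diagram does not state it), the function and key names are made up for the example, and whatever follows byte 55 is left uninterpreted.

```python
import struct

# Field order follows the diagram above. Little-endian integers are an
# ASSUMPTION; the draft does not state the byte order.
# packtype, magic[8], major, minor, reserved, num headers, text encoding,
# directionality, granule shift, rate numerator, rate denominator,
# language[16], category[16]
ID_HEADER = struct.Struct("<B8sBBBBBBBII16s16s")  # 56 bytes


def _cstring(raw: bytes) -> str:
    """Return the ASCII text before the first NUL byte."""
    return raw.split(b"\0", 1)[0].decode("ascii")


def parse_id_header(packet: bytes) -> dict:
    """Decode the first 56 bytes of a Kate ID header packet (sketch only)."""
    if len(packet) < ID_HEADER.size:
        raise ValueError("packet too short for a Kate ID header")
    (packtype, magic, major, minor, _reserved, num_headers, text_encoding,
     directionality, granule_shift, rate_num, rate_den,
     language, category) = ID_HEADER.unpack_from(packet)
    # The packet type and magic are the frozen part of the header.
    if packtype != 0x80 or magic != b"kate\0\0\0\0":
        raise ValueError("not a Kate ID header")
    return {
        "version": (major, minor),
        "num_headers": num_headers,
        "text_encoding": text_encoding,
        "directionality": directionality,
        "granule_shift": granule_shift,
        "granule_rate": (rate_num, rate_den),
        "language": _cstring(language),
        "category": _cstring(category),
    }
```

Checking the magic before trusting any other field keeps a demuxer from misreading another codec's BOS packet as Kate.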

Support

I have patches for the following with Kate support:

oggmerge (it also adds Theora and Speex support)
file(1)
MPlayer (for multiplexed per-language subtitles - all region/style info is ignored)
xine (everything kate supports, as xine is my testbed)
liboggz
ogginfo

Most of those are not released yet, since the Kate bitstream format is still a work in progress.

Granule encoding

At the moment, the granules are split in two: the high bits represent a time (scaled by a fractional speed defined in the ID header), and the low bits are an increasing counter used when several events happen at the same time. At the moment, 5 bits are taken for that counter. This is totally arbitrary and subject to change. The granule shift of a stream is included in the ID header. See also the problems to solve section, about seeking, for a possible three-way split, where the high bits would be further split.
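The two-way split described above can be sketched as below. The helper names are hypothetical, and the conversion to seconds assumes that the granule rate is used the way Theora uses its frame rate (time = base * denominator / numerator); the draft does not spell this out.

```python
from fractions import Fraction


def make_granulepos(base: int, counter: int, granule_shift: int = 5) -> int:
    """Combine a time base (high bits) and a same-time counter (low bits)."""
    if not 0 <= counter < (1 << granule_shift):
        raise ValueError("counter does not fit in the low bits")
    return (base << granule_shift) | counter


def split_granulepos(granulepos: int, granule_shift: int = 5):
    """Return (base, counter) for a granule position."""
    base = granulepos >> granule_shift
    counter = granulepos & ((1 << granule_shift) - 1)
    return base, counter


def granule_time(granulepos: int, rate_num: int, rate_den: int,
                 granule_shift: int = 5) -> Fraction:
    """Time in seconds of a granule position.

    ASSUMPTION: seconds = base * rate_den / rate_num.
    """
    base, _ = split_granulepos(granulepos, granule_shift)
    return Fraction(base * rate_den, rate_num)
```

With a 5 bit counter, up to 32 events can share the same time base before the counter overflows.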

I'm now considering changing this to a system closer to what Theora and CMML do (for the sake of simplicity from the point of view of a demuxer/seeker), in which a granulepos would hold the time of the earliest still active event in the high bits, and the offset from that time to the current one in the middle bits. There would probably be a few low bits reserved for the same-time counter.

Additionally, it should be possible to recover some bits by decreasing the precision of the backlink and accepting that the second bisection might seek to a previous packet, but if not too many bits are taken, the majority of the cases should still seek to the same page.

This needs some more thought.

I've now posted a description of a method that is a superset of the Theora/CMML method, which allows a larger time span with the same precision, as well as allowing the mapping of more than one adjacent granule to the same time.

Motion

The Kate bitstream format includes motion definitions, originally for karaoke purposes, but which can be used for more general purposes, such as line based drawing, or animation of the text (position, color, etc).

Motions are defined by means of a series of curves (static points, segments, and splines (Catmull-Rom, Bezier, and B-splines)). A 2D point can be obtained from a motion for any timestamp during the lifetime of a text. This can be used for moving a marker in 2D above the text for karaoke, or to use the x coordinate to color text when the motion position passes each letter or word, etc. Motions have attached semantics so the client code knows how to use a particular motion. Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have an arbitrary number of control points, complex motions can be achieved. If the motion is the main object of an event, it is even possible to have an empty text, and use the motion as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could be done this way, though this would require a lot of control points, and would not be able to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and the shapes are turned to b-splines and sent as a kate motion to be displayed on the other person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type. While the timestamp lies within such a curve, no 2D point will be generated. This can be used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting, at the right time and for the right duration, a simple linear interpolation curve with only two equal points, both equal to the position the motion is supposed to pause at.
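To make the last two points concrete, here is a minimal sketch of sampling a motion made of consecutive curves. It is not the libkate API: the Curve class, its field names and the kind strings are invented for the example, each curve is assumed to carry its own duration, and only static, linear and 'none' curves are handled (splines are left out).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Curve:
    kind: str            # "static", "linear" or "none" (hypothetical names)
    duration: float      # seconds this curve lasts (assumed per-curve)
    points: List[Point]  # 1 point for static, 2 for linear, none for "none"


def sample_motion(curves: List[Curve], t: float) -> Optional[Point]:
    """Return the 2D point of the motion at time t, or None if there is none."""
    start = 0.0
    for curve in curves:
        end = start + curve.duration
        if start <= t < end:
            if curve.kind == "none":
                return None  # discontinuity: no point is generated
            if curve.kind == "static":
                return curve.points[0]
            if curve.kind == "linear":
                (x0, y0), (x1, y1) = curve.points
                u = (t - start) / curve.duration
                return (x0 + u * (x1 - x0), y0 + u * (y1 - y0))
            raise ValueError("unsupported curve kind: " + curve.kind)
        start = end
    return None  # t lies outside the lifetime of the motion


motion = [
    Curve("linear", 2.0, [(0.0, 0.0), (1.0, 0.0)]),  # move right
    Curve("linear", 1.0, [(1.0, 0.0), (1.0, 0.0)]),  # pause: two equal points
    Curve("none", 1.0, []),                          # hide the marker
    Curve("linear", 2.0, [(1.0, 0.0), (1.0, 1.0)]),  # move down
]
```

Sampling this motion at 2.5 seconds returns the paused position (1.0, 0.0), and sampling it at 3.5 seconds returns None because the timestamp falls inside the 'none' curve.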

Kate defines a set of predefined mappings so that each decoder user interprets a motion in the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of the current video frame), or region, to scale 0-1 to the current region. This allows curves to be defined without knowing in advance the pixel size of the area it should cover.
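The frame and region mappings could be applied roughly as follows. The numeric codes below are placeholders, not the values used in the bitstream, and the assumption that a region mapping is offset by the region's origin is mine; only the 0-127 / 128-255 split comes from the text above.

```python
from typing import NamedTuple, Tuple


class Rect(NamedTuple):
    x: float
    y: float
    width: float
    height: float


# Placeholder codes for illustration only; the real values are not given here.
MAPPING_NONE = 0
MAPPING_FRAME = 1
MAPPING_REGION = 2
FIRST_APPLICATION_MAPPING = 128  # codes 128-255 are application specific


def apply_mapping(point: Tuple[float, float], mapping: int,
                  frame: Rect, region: Rect) -> Tuple[float, float]:
    """Scale a 0-1 motion point to pixels according to its mapping."""
    if not 0 <= mapping <= 255:
        raise ValueError("a mapping is coded on 8 bits")
    if mapping >= FIRST_APPLICATION_MAPPING:
        raise NotImplementedError("application specific mapping")
    x, y = point
    if mapping == MAPPING_NONE:
        return (x, y)
    if mapping == MAPPING_FRAME:
        target = frame
    elif mapping == MAPPING_REGION:
        target = region
    else:
        raise ValueError("mapping not handled by this sketch")
    # ASSUMPTION: mapped points are offset by the target's origin.
    return (target.x + x * target.width, target.y + y * target.height)
```

The same curve can then be reused for a 640x480 frame or a small subtitle region without being redefined.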

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined karaoke effects. More are planned to be added in the future.

Trackers

Since attaching motions to text position, etc, makes it hard for the client to keep track of everything, doing interpolation, etc, the library supplies a tracker object, which handles the interpolation of the relevant properties. Once initialized with a text and a set of motions, the client code can give the tracker a new timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary if one wants to use the motions directly, or just ignore them, but it makes life easier, especially when considering that the order in which motions are applied does matter (to be defined formally, but the current source code is informative at this point).

The Kate file format

Though this is not a feature of the bitstream format, I have created a text file format to describe a series of events to be turned into a Kate bitstream. At its minimum, the following is a valid input to the encoder:

kate {

event { 00:00:05 --> 00:00:10 "This is a text" }

}

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into the track, lasting 5 seconds to an offset of 10 seconds.

Motions, regions, styles can be declared in a definitions block to be reused by events, or can be defined inline. Defining those in the definitions block places them in a header so they can be reused later, saving space. However, they can also be defined in each event, so they will be sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The difference between the two is similar to the difference between a C source file and the resulting object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I tend to dislike very much editing XML by hand, as it's really hard to read. XML is really meant for machines to parse generically text data in a shared syntax but with possibly unknown semantics, and I need those text representations to be editable easily.

This also implies that there could be an XML representation of a Kate stream, which would be useful if one were to make an editor that worked on a higher level than the current all-text representation, and it is something that might very well happen in the future, in parallel with the current format.

Karaoke

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

kate {

simple_timed_glyph_style_morph {

from style "start_style" to style "end_style"

"Let " at 1.0

"us " at 1.2

"sing " at 1.4

"to" at 2.0

"ge" at 2.5

"ther" at 3.0

}

}

The syllables will change from a style to another as time passes. The definition of the start_style and end_style styles is omitted for brevity.

Problems to solve

There are a few things to solve before the Kate bitstream format can be considered good enough to be frozen:

Seeking and memory

When seeking to a particular time in a movie with subtitles, we may end up at a place where a subtitle has been started, but is not removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:

each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
- this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earlier still active packet is less than the original seeked granule. This implies support code on players to do the double seek.

use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
- this requires reissuing packets, and it doesn't feel right (and wastes space).
- it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
- Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, of which I am not certain.
  - Well, it seems it can't do a one phase seek anyway.
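The first option (each data packet carrying the granule of the earliest still active packet) can be sketched as follows. The Event class and both function names are hypothetical, granules are treated as plain integers, and the stream is modelled as a list of events sorted by start granule rather than as real Ogg pages.

```python
import bisect
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    start: int  # granule at which the event's packet is emitted
    end: int    # granule at which the event stops being active


def earliest_active(events: List[Event]) -> List[int]:
    """For each packet, the start granule of the earliest still active event.

    If no earlier event is still active, this is the packet's own granule.
    Events must be sorted by start granule.
    """
    links = []
    for i, event in enumerate(events):
        active = [e.start for e in events[:i] if e.end > event.start]
        links.append(min(active, default=event.start))
    return links


def two_phase_seek(events: List[Event], links: List[int], target: int) -> int:
    """Return the granule from which decoding should start for a seek."""
    starts = [e.start for e in events]
    # Phase 1: seek to the target and find the next Kate packet.
    i = bisect.bisect_left(starts, target)
    if i == len(events):
        return target  # no later packet, so nothing more can be learned
    # Phase 2: seek again only if that packet points before the target.
    return links[i] if links[i] < target else target
```

In this sketch an event that starts before the target and ends before the next packet is not recovered, since the next packet's link no longer mentions it.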

Text encoding

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is supported, for simplicity.

Note that strings included in the header (language, category, etc) are not affected by that text encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML markup (eg, <br>, <em>, etc). It is also possible to ask libkate to remove this markup if the client prefers to receive plain text without the markup.

Language encoding

A header field defines the language (if any) used in the stream (this can be overridden in a data packet, but this is not relevant to this point). At the moment, my test code uses ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching a language to a user selection may be simpler for user code if the language encoding is kept simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

Alternatively, I might use only RFC 1766 tags, which are essentially the subset I considered above, but this RFC has been deprecated by RFC 3066, and I'm not sure of the wisdom of basing a new format on a deprecated RFC.

If a stream contains more than one language, there usually is a predominant language, which can be set as the default language for the stream. Each event can then have a language override. If there is no predominant language, and it is not possible to split the stream into multiple substreams, each with its own language, then it is possible to use the "mul" language tag, as a last resort.

Bitstream format for floating point values

Floating point values are turned into a 16.16 fixed point format, then stored in a bitpacked format, storing the number of zero bits at the head and tail of the fixed point values once per stream, and the remainder bits for all values in the stream. This seems to yield good results (typically a 50% reduction over 32 bit raw writes, and 70% over the snprintf based storage), and has the big advantage of being portable (eg, independent of any IEEE format). However, this means reduced precision due to the quantization to 16.16. I may add support for variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less space savings, though these are likely to be insignificant when Kate streams are interleaved with a video.
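The idea can be illustrated with the sketch below, which only computes what would be written (head and tail zero counts, then the remainder bits of each value) and does not reproduce the actual bit layout used by the library. Handling of negative values is not described above, so this sketch assumes non-negative values only; all names are hypothetical.

```python
from typing import List, Tuple


def to_fixed_16_16(value: float) -> int:
    """Quantize a non-negative float to 16.16 fixed point."""
    fixed = round(value * 65536)
    if not 0 <= fixed < (1 << 32):
        raise ValueError("value out of range for this sketch")
    return fixed


def pack_plan(values: List[float]) -> Tuple[int, int, List[int]]:
    """Return (head zero bits, tail zero bits, remainders) for a set of values."""
    fixed = [to_fixed_16_16(v) for v in values]
    combined = 0
    for f in fixed:
        combined |= f  # a bit is needed if any value has it set
    if combined == 0:
        return 32, 0, [0] * len(fixed)
    head = 32 - combined.bit_length()
    tail = (combined & -combined).bit_length() - 1  # trailing zero bits
    return head, tail, [f >> tail for f in fixed]


def unpack(tail: int, remainders: List[int]) -> List[float]:
    """Rebuild the quantized floats from the remainders."""
    return [(r << tail) / 65536 for r in remainders]
```

For the values 1.5, 2.25 and 100.0 this gives 9 head zeros and 14 tail zeros, so each value needs 32 - 9 - 14 = 9 bits instead of 32.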

Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

Text to speech

One of the goals of the Kate bitstream format is that text data can be easily parsed by the user of the decoder, so any additional information, such as style, placement, karaoke data, etc, should be able to be stripped to leave only the bare text. This is in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap way of conveying speech data, and could also allow things like e-books which can be either read or listened to from the same bitstream. I have seen no reference to this being used anywhere, but I see no reason why the granule progression should be temporal, and not user controlled, such as by using a "next" button which would bump a granule position by a preset amount, simulating turning a page. This would be close to necessary for text-to-speech, as the wall time duration of the spoken speech is not known in advance to the Kate encoder, and can't be mapped to a time based granule progression. All text strings triggered consecutively between the two granule positions would then be read in order.

Possible additions

Embedded binary data

Images and font mappings can be included within a Kate stream.

Images

Though this could be misused to interfere with the ability to render as text-to-speech, Kate can use images as well as text. The same caveat as for fonts applies with regard to data duplication.

Complex images might however be best left to a multiplexed OggSpots stream, unless the images mesh with the text (eg, graphical exclamation points, custom fonts (see below), etc).

There is support for simple paletted bitmap images, with a variable length palette of up to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as many bits per pixel as can address the palette. Palettes and images are stored separately, so can be used with one another with no fixed assignment.
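As an illustration of "as many bits per pixel as can address the palette", the sketch below derives the pixel size from the palette size and packs palette indices into bytes. The function names are hypothetical and the bit order (least significant bits first) is an assumption, not something stated above.

```python
from typing import Sequence


def bits_per_pixel(palette_size: int) -> int:
    """Bits needed to address a palette whose size is a power of 2 up to 256."""
    is_power_of_two = palette_size > 0 and palette_size & (palette_size - 1) == 0
    if not is_power_of_two or palette_size > 256:
        raise ValueError("palette size must be a power of 2 up to 256")
    return palette_size.bit_length() - 1


def pack_pixels(indices: Sequence[int], palette_size: int) -> bytes:
    """Pack palette indices using the minimum number of bits per pixel.

    ASSUMPTION: pixels are packed least significant bits first.
    """
    bpp = bits_per_pixel(palette_size)
    out = bytearray()
    acc = 0    # bits waiting to be written
    nbits = 0  # number of valid bits in acc
    for index in indices:
        if not 0 <= index < palette_size:
            raise ValueError("pixel index outside the palette")
        acc |= index << nbits
        nbits += bpp
        while nbits >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        out.append(acc & 0xFF)  # flush the last partial byte
    return bytes(out)
```

Because the pixel data only stores indices, the same bitmap can be drawn with any palette of at least the same size.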

Palettes and bitmaps are put in two separate headers for later use by reference, but will also be able to be placed in data packets, as with motions, etc.

This can be used to have custom fonts, so that raw text is still available if the stream creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data, would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on whether this is going too far, however.

A possible solution to the duplication issue is to have another stream in the container stream, which would hold the shared data (eg, fonts), which the user program could load, and which could then be used by any Kate (and other) stream. Typically, this type of stream would be a degenerate stream with only header packets (so it is fully processed before any other stream presents data packets that might make use of that shared data), and all payload such as fonts being contained within the headers. Thinking about it, it has parallels with the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the list of styles within a header packet.

Fonts

Fonts are merely a set of ranges mapping unicode code points to bitmaps. As this implies, fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client, may not look as good as with a vector font.
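A font mapping of this kind can be pictured as a small lookup table. The sketch below is not the bitstream layout: the FontRange fields and the assumption that a range maps to consecutive bitmap indices are made up for the example.

```python
from typing import NamedTuple, Optional, Sequence


class FontRange(NamedTuple):
    first_code_point: int  # first Unicode code point covered by the range
    last_code_point: int   # last code point covered (inclusive)
    first_bitmap: int      # index of the bitmap for first_code_point


def bitmap_for(ranges: Sequence[FontRange], char: str) -> Optional[int]:
    """Return the bitmap index for a character, or None if it is unmapped."""
    code_point = ord(char)
    for r in ranges:
        if r.first_code_point <= code_point <= r.last_code_point:
            # ASSUMPTION: bitmaps within a range are stored consecutively.
            return r.first_bitmap + (code_point - r.first_code_point)
    return None


font = [
    FontRange(0x41, 0x5A, 0),   # 'A'-'Z' use bitmaps 0-25
    FontRange(0x30, 0x39, 26),  # '0'-'9' use bitmaps 26-35
]
```

A renderer would fall back to its own font (or to plain text) for characters where the lookup returns None.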

Reference encoder/decoder

An encoder and a decoder are included in the tools directory. The encoder pulls its input from a custom text based file format (see The Kate file format), which is by no means meant to be part of the Kate bitstream specification itself, or from a SubRip (.srt) format file (the most common subtitle format I found, and a very basic one).

The Kate bitstreams encoded and decoded by those tools, however, are (supposed to be) correct for this specification, provided their input is correct.

Things I need to get feedback on

Wisdom of having several smaller headers rather than a large one (packet loss...)
Granule "back link" encoding - I don't like the 32+32 split, it loses too much granule space
Empty packets - possible ? A good idea ? Could they become "invisible" if no header (oggless) ?
Wisdom of relying on bitwise.c from libogg - will it be ripped out (I can still take it internal)
language/category: variable length ?
size field in data (and header) packets for easier "parsing" by decoders that don't grok kate ?
is it a good idea to avoid floating point usage altogether ?