Difference between revisions of "OggKate"

From XiphWiki
(Preliminary presentation and description of the Kate bitstream format)
m (Wikified content)
Line 1: Line 1:
As a disclaimer, this is not a Xiph codec, but I was asked to post information
+
== Disclaimer ==
 +
This is not a Xiph codec, but I was asked to post information
 
about Ogg/Kate on this wiki. As such, please do not assume that Xiph has anything
 
about Ogg/Kate on this wiki. As such, please do not assume that Xiph has anything
 
to do with this, much less responsibility.
 
to do with this, much less responsibility.
  
 
+
==  What is Kate? ==
 
 
----
 
 
 
 
 
0 - Table of contents
 
      0 - Table of contents
 
      1 - What is Kate ?
 
      2 - Why a new codec ?
 
      3 - Overview of the Kate bitstream format
 
      4 - Support
 
      5 - Granule encoding
 
      6 - Motion
 
      7 - Problems to solve
 
      8 - Text to speech
 
      9 - Possible additions
 
    10 - Reference encoder/decoder
 
 
 
 
 
 
 
1 - What is Kate ?
 
  
 
Kate is a codec for karaoke and text encapsulation for Ogg. Most of the time, this
 
Kate is a codec for karaoke and text encapsulation for Ogg. Most of the time, this
Line 33: Line 14:
 
can be used for any purpose.
 
can be used for any purpose.
  
 
+
== Why a new codec? ==
 
 
2 - Why a new codec ?
 
  
 
As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
 
As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
Line 44: Line 23:
 
The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.
 
The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.
  
- Writ is an unmaintained start at an implementation of a very basic design, though I did find
+
*Writ is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-pgg2 later on - I'd been quicker to write Kate from scratch anyway.
    an encoder/decoder in py-pgg2 later on - I'd been quicker to write Kate from scratch anyway.
+
*CMML is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seems complex for a simple use - I don't really want *full* HTML/XML with links, etc - besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex
- CMML is more geared towards encapsulating metadata about an accompanying stream, rather than being
+
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)
    a data stream itself, and seems complex for a simple use - I don't really want *full* HTML/XML
 
    with links, etc - besides, it seems designed for Annodex (which I haven't had a look at), though
 
    it does seems relatively generic for use outwith Annodex
 
- OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data
 
    formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I
 
    haven't looked at this one in detail, since I'd already had a working Kate implementation by
 
    that time)
 
  
 
I then decided to roll my own, not least because it's a fun thing to do.
 
I then decided to roll my own, not least because it's a fun thing to do.
Line 60: Line 32:
 
but none were designed for embedding inside an Ogg container.
 
but none were designed for embedding inside an Ogg container.
  
 
+
== Overview of the Kate bitstream format ==
 
 
3 - Overview of the Kate bitstream format
 
  
 
I've taken much inspiration from Vorbis and Theora here.
 
I've taken much inspiration from Vorbis and Theora here.
Line 71: Line 41:
  
 
Headers packets:
 
Headers packets:
- ID header [BOS]: magic, version, granule fraction, language, etc
+
*ID header [BOS]: magic, version, granule fraction, language, etc
- Comment header: Vorbis comments
+
*Comment header: Vorbis comments
- Style definitions header: a list of predefined styles to be referred to by data packets
+
*Style definitions header: a list of predefined styles to be referred to by data packets
- Region definitions header: a list of predefined regions to be referred to by data packets
+
*Region definitions header: a list of predefined regions to be referred to by data packets
  
 
Other header packets are ignored, and left for future expansion. In particular, there will
 
Other header packets are ignored, and left for future expansion. In particular, there will
Line 81: Line 51:
  
 
Data packets:
 
Data packets:
- text data: text and optional motion, accompanied by optional overrides for style, region,
+
*text data: text and optional motion, accompanied by optional overrides for style, region, language, etc
  language, etc
+
*end data [EOS]: marks the end of the stream, it doesn't have any payload
- end data [EOS]: marks the end of the stream, it doesn't have any payload
 
  
 
Other data packets are ignored, and left for future expansion.
 
Other data packets are ignored, and left for future expansion.
  
 
Things of note:
 
Things of note:
- Kate is a discontinuous codec, as defined in ogg-multiplex.html in the Ogg documentation,
+
*Kate is a discontinuous codec, as defined in ogg-multiplex.html in the Ogg documentation, which means it's timed by start granule, not end granule (as Theora and Vorbis). Also, all data packets are on their own page, for two reasons:
  which means it's timed by start granule, not end granule (as Theora and Vorbis). Also,
+
**Ogg keeps track of granules at the page level, not the packet level
  all data packets are on their own page, for two reasons:
+
**if no text event happens for a while after a particular text event, we don't want to delay it so a fuller page can be issued
    - Ogg keeps track of granules at the page level, not the packet level
 
    - if no text event happens for a while after a particular text event, we don't want to
 
      delay it so a fuller page can be issued
 
  See also the problems to solve section, about seeking.
 
- The granule encoding is not a direct time/granule correspondance, see the granule encoding
 
  section.
 
- The EOS packet should have a granule pos higher than the end time of all events.
 
- User code doesn't have to know the number of headers to expect, this is moved inside the
 
  library code.
 
  
 +
See also [[#Seeking and memory|Problems to solve: Seeking and memory]].
  
 +
*The granule encoding is not a direct time/granule correspondence; see the granule encoding section.
 +
*The EOS packet should have a granule pos higher than the end time of all events.
 +
*User code doesn't have to know the number of headers to expect, this is moved inside the library code.
  
4 - Support
+
== Support ==
  
 
I have patches for the following with Kate support:
 
I have patches for the following with Kate support:
 
+
*oggmerge (it also adds Theora support)
  - oggmerge (it also adds Theora support)
+
*file(1)
  - file(1)
+
*MPlayer (for multiplexed per-language subtitles - all region/style info is ignored)
  - MPlayer (for multiplexed per-language subtitles - all region/style info is ignored)
 
  
 
None of those are released yet, since the Kate bitstream format is still a work in progress.
 
None of those are released yet, since the Kate bitstream format is still a work in progress.
  
 
+
== Granule encoding ==
 
 
5 - Granule encoding
 
  
 
At the moment, the granules are split in two: the high bits represent a time (scaled by a
 
At the moment, the granules are split in two: the high bits represent a time (scaled by a
Line 125: Line 86:
 
the high bits would be further split.
 
the high bits would be further split.
  
 
+
== Motion ==
 
 
6 - Motion
 
  
 
The Kate bitstream format includes motion definition, primarily for karaoke purposes, but
 
The Kate bitstream format includes motion definition, primarily for karaoke purposes, but
Line 163: Line 122:
 
and the Kate specification does not attempt to codify the use of extra motions.
 
and the Kate specification does not attempt to codify the use of extra motions.
  
 
+
== Problems to solve ==
 
 
7 - Problems to solve
 
  
 
There are a few things to solve before the Kate bitstream format can be considered good
 
There are a few things to solve before the Kate bitstream format can be considered good
 
enough to be frozen:
 
enough to be frozen:
  
- Seeking and memory
+
=== Seeking and memory ===
  
    When seeking to a particular time in a movie with subtitles, we may end up at a place
+
When seeking to a particular time in a movie with subtitles, we may end up at a place when a subtitle has been started, but is not removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.
    when a subtitle has been started, but is not removed yet. Pure streaming doesn't have
 
    this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis,
 
    for which all data valid now is decoded from the last packet). With Kate, a text string
 
    valid now may have been issued long ago.
 
  
    I see three possible ways to solve this:
+
I see three possible ways to solve this:
 +
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
 +
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earlier still active packet is less than the original seeked granule. This implies support code on players to do the double seek.
  
    - each data packet includes the granule of the earliest still active packet (if none,
+
*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
      this will be the granule of this very packet)
+
**this requires reissuing packets, and it doesn't feel right (and wastes space).
      -> this means seeks are two phased: first seek, find the next Kate packet, and seek
+
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".
          again if the granule of the earlier still active packet is less than the original
 
          seeked granule. This implies support code on players to do the double seek.
 
  
    - use "reference frames", a bit like Theora does, where the granule position is split
+
*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
      in several fields: the higher bits represent a position for the reference frame,
+
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, which I am not certain.
      and the lowest bits a delta time to the current position. When seeking to a granule
 
      position, the lower bits are cleared off, yielding the granule position of the previous
 
      reference frame, so the seek ends up at the reference frame. The reference frame is
 
      a sync point where any active strings are issued again. This is a variant of the method
 
      described in the Writ wiki page, but the granule splitting avoids any "downtime".
 
      -> this requires reissuing packets, and it doesn't feel right (and wastes space).
 
      -> it also requires "dummy" decoding of Kate data from the reference frame to the actual
 
          seek point to fully refresh the state "memory".
 
  
    - A variant of the two-granules-in-one system used by libcmml, where the "back link" points
+
=== Text encoding ===
      to the earliest still active string, rather than the previous one (this allows a two
 
      phase seek, rather than a multiphase seek, hopping back from event to event, with no
 
      real way to know if there is or not a previous event which is still active - I suppose
 
      CMML has no need to know this, if their "clips" do not overlap - mine can do).
 
      -> Such a system considerably shortens the usable granule space, though it can do a one
 
          phase seek, if I understand the system correctly, which I am not certain.
 
  
- Text encoding
+
A header field declares the text encoding used in the stream (this can be overridden in a
 +
data packet, but this is not relevant to this point). At the moment, only UTF-8 is supported,
 +
for simplicity, and I have not yet decided whether or not the Kate specification will allow
 +
for other encodings, such as UTF-16 or UTF-32. The reason for this is that, if these were to be supported, either:
 +
*users of the decoder would have to be ready to face text in any one of these encodings
 +
*the decoder would have to convert encodings to one selected by the user of the decoder
  
    A header field declares the text encoding used in the stream (this can be overridden in a
+
The first option may be asking a lot of users, while the second one brings complexity to the
    data packet, but this is not relevant to this point). At the moment, only UTF-8 is supported,
+
decoder, and kind of defeats the purpose of supporting the encoding in the first place.
    for simplicity, and I have not yet decided whether or not the Kate specification will allow
 
    for other encodings, such as UTF-16 of UTF-32. The reason for this is that, if these were to
 
    be supported, either:
 
      - users of the decoder would have to be ready to face text in any one of these encodings
 
      - the decoder would have to convert encodings to one selected by the user of the decoder
 
    The first option may be asking a lot of users, while the second one brings complexity to the
 
    decoder, and kind of defeats the purpose of supporting the encoding in the first place.
 
  
    Note that strings included in the header (language, category, etc) are not affected by that
+
Note that strings included in the header (language, category, etc) are not affected by that
    language encoding (rather obviously for language itself). These are ASCII.
+
language encoding (rather obviously for language itself). These are ASCII.
  
    An argument in favor of UTF-8 only text is that it is the format of Vorbis comments, which
+
An argument in favor of UTF-8 only text is that it is the format of Vorbis comments, which
    are part of the Kate bitstream format.
+
are part of the Kate bitstream format.
  
- Language encoding
+
=== Language encoding ===
  
    A header field defines the language (if any) used in the stream (this can be overridden in a
+
A header field defines the language (if any) used in the stream (this can be overridden in a
    data packet, but this is not relevant to this point). At the moment, my test code uses
+
data packet, but this is not relevant to this point). At the moment, my test code uses
    ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
+
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
    a language to a user selection may be simpler for user code if the language encoding is kept
+
a language to a user selection may be simpler for user code if the language encoding is kept
    simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
+
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
    tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.
+
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.
  
    Alternatively, I might use only RFC 1766 tags, which are essentially the subset I considered
+
Alternatively, I might use only RFC 1766 tags, which are essentially the subset I considered
    above, but this RFC has been deprecated by RFC 3066, and I'm not sure of the wisdom of basing
+
above, but this RFC has been deprecated by RFC 3066, and I'm not sure of the wisdom of basing
    a new format on a deprecated RFC.
+
a new format on a deprecated RFC.
  
    Also, it might be possible for the language field to be a list of such encodings, for streams
+
Also, it might be possible for the language field to be a list of such encodings, for streams
    that contain several languages (though the usual way to present several languages is to have
+
that contain several languages (though the usual way to present several languages is to have
    several bitstreams multiplexed with one another (as opposed to Writ, which has all languages
+
several bitstreams multiplexed with one another (as opposed to Writ, which has all languages
    included in a single bitstream)).
+
included in a single bitstream)).
  
    A disadvantage of having multiple languages is that text-to-speech typically needs to know
+
A disadvantage of having multiple languages is that text-to-speech typically needs to know
    the current language to function properly, and that having, say, two current languages, would
+
the current language to function properly, and that having, say, two current languages, would
    make it more difficult to deal with such a stream.
+
make it more difficult to deal with such a stream.
  
- Bitstream format for floating point values
+
=== Bitstream format for floating point values ===
  
  At the moment, floating point values (for splines) are stored as their textual representation,
+
At the moment, floating point values (for splines) are stored as their textual representation, and converted back and forth using snprintf and sscanf. We could quantize them and store as
  and converted back and forth using snprintf and sscanf. We could quantize them and store as
+
integers, since precision isn't that important here.
  integers, since precision isn't that important here.
 
  
- Though this is not a Kate issue per se, the motion feature is very difficult to use without a
+
*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.
  curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle
 
  formats, it is not certain it will be easy to find a good authoring tool for a series of curves.
 
  That said, it's not exactly difficult to do if you know a widget set.
 
  
- Since motions may be repeated, I may add predefined motions in an extra header packet, to be
+
*Since motions may be repeated, I may add predefined motions in an extra header packet, to be referenced as styles and regions are. This would depend on whether motions are likely to be exactly repeated often, and I don't know if this will likely be the case. Complex motion definitions can take a lot of space, especially with the current floating point value encoding. After some thought, I will almost certainly place predefined curves in a header, and allow motions to refer to them. Fully defined curves will also be able to be placed in data packets, as it's likely some curves will be used only once, and it would constrain future uses to allow them only in headers (eg, if one were to stream handwriting using Kate).
  referenced as styles and regions are. This would depend on whether motions are likely to be
 
  exactly repeated often, and I don't know if this will likely be the case. Complex motion
 
  definitions can take a lot of space, especially with the current floating point value encoding.
 
  After some thought, I will almost certainly place predefined curves in a header, and allow
 
  motions to refer to them. Fully defined curves will also be able to be placed in data packets,
 
  as it's likely some curves will be used only once, and it would constrain future uses to allow
 
  them only in headers (eg, if one were to stream handwriting using Kate).
 
  
 
+
== Text to speech ==
 
 
8 - Text to speech
 
  
 
One of the goals of the Kate bitstream format is that text data can be easily parsed
 
One of the goals of the Kate bitstream format is that text data can be easily parsed
Line 283: Line 207:
 
be read in order.
 
be read in order.
  
 +
== Possible additions ==
 +
 +
=== HTML (or similar) text content ===
  
 +
At the moment, free utf-8 text is included in the data packets. Kate doesn't care about
 +
the actual contents of that text. Allowing a subset of HTML allows an easy way to define
 +
extra style elements within the body of the text, at the glyph level. Despite originally
 +
not wanting to add in-band markup, I am more and more thinking about making this change.
 +
In this case, Kate would have a way to give a scrubbed text to the client. Since these
 +
markup tags can't be nested, that scrubbing is easy to do so that users do not have to
 +
understand those tags (or scrub them themselves).
 +
Subset to be defined, and fallback for plain text to be added.
 +
This is an argument to keep all in utf-8, isn't it ? I don't know how one would go about
 +
having UTF-16 HTML code.
  
9 - Possible additions
+
=== Embedded binary data ===
  
- HTML (or similar) text content
+
Various types of binary data could be embedded within a Kate stream:
    At the moment, free utf-8 text is included in the data packets. Kate doesn't care about
 
    the actual contents of that text. Allowing a subset of HTML allows an easy way to define
 
    extra style elements within the body of the text, at the glyph level. Despite originally
 
    not wanting to add in-band markup, I am more and more thinking about making this change.
 
    In this case, Kate would have a way to give a scrubbed text to the client. Since these
 
    markup tags can't be nested, that scrubbing is easy to do so that users do not have to
 
    understand those tags (or scrub them themselves).
 
    Subset to be defined, and fallback for plain text to be added.
 
    This is an argument to keep all in utf-8, isn't it ? I don't know how one would go about
 
    having UTF-16 HTML code.
 
  
- Embedded binary data
+
==== Fonts ====
    Various types of binary data could be embedded within a Kate stream:
 
  
  - Fonts
+
Font selection is the first thing that came to mind, due to the discrepancy of font
    Font selection is the first thing that came to mind, due to the discrepancy of font
+
naming in platforms (eg, the *-*-* X system, and the...  hmm, not sure, filename ?
    naming in platforms (eg, the *-*-* X system, and the...  hmm, not sure, filename ?
+
in Windows). A potential problem, however, is that there might be several multiplexed
    in Windows). A potential problem, however, is that there might be several multiplexed
+
Kate streams in an Ogg bitstream, so a custom font might be included several times
    Kate streams in an Ogg bitstream, so a custom font might be included several times
+
in the container stream. On the other hand, it would allow for per-language fonts.
    in the container stream. On the other hand, it would allow for per-language fonts.
 
  
  - Images
+
==== Images ====
    Though this could interfere with ability to render as text-to-speech, images could be
+
 
    mixed with text. The same caveat as for fonts applies with regard to data duplication.
+
Though this could interfere with ability to render as text-to-speech, images could be
    This might however be best left to a multiplexed OggSpots stream, unless the images
+
mixed with text. The same caveat as for fonts applies with regard to data duplication.
    mesh with the text (eg, graphical exclamation points, etc).
+
This might however be best left to a multiplexed OggSpots stream, unless the images
 +
mesh with the text (eg, graphical exclamation points, etc).
  
 
A possible solution to the duplication issue is to have another stream in the container
 
A possible solution to the duplication issue is to have another stream in the container
Line 324: Line 250:
 
list of styles within a header packet.
 
list of styles within a header packet.
  
 
+
== Reference encoder/decoder ==
 
 
10 - Reference encoder/decoder
 
  
 
An encoder and a decoder are included in the tools directory. Note that they are very rough
 
An encoder and a decoder are included in the tools directory. Note that they are very rough

Revision as of 13:35, 15 January 2008

Disclaimer

This is not a Xiph codec, but I was asked to post information about Ogg/Kate on this wiki. As such, please do not assume that Xiph has anything to do with this, much less responsibility.

What is Kate?

Kate is a codec for karaoke and text encapsulation for Ogg. Most of the time, this would be multiplexed with audio/video to carry subtitles, song lyrics (with or without karaoke data), etc, but doesn't have to be. A possible use of a lone Kate stream would be an e-book. Moreover, the motion feature gives Kate a powerful means to describe arbitrary curves, so hand drawing of shapes can be achieved. This was originally meant for karaoke use, but can be used for any purpose.

Why a new codec?

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to the headers, one can't add them in the stream as they are sung, so another multiplexed stream would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

  • Writ is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-pgg2 later on - it was quicker for me to write Kate from scratch anyway.
  • CMML is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seems complex for a simple use - I don't really want *full* HTML/XML with links, etc - besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex
  • OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats, but none were designed for embedding inside an Ogg container.

Overview of the Kate bitstream format

I've taken much inspiration from Vorbis and Theora here. Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (detailed description is available below (no, it's not, it will be available later when the format has settled down a bit more)) is:

Header packets:

  • ID header [BOS]: magic, version, granule fraction, language, etc
  • Comment header: Vorbis comments
  • Style definitions header: a list of predefined styles to be referred to by data packets
  • Region definitions header: a list of predefined regions to be referred to by data packets

Other header packets are ignored, and left for future expansion. In particular, there will likely be a motions definition header, where motions which are to be used repeatedly will be stored for reference in text packets.

Data packets:

  • text data: text and optional motion, accompanied by optional overrides for style, region, language, etc
  • end data [EOS]: marks the end of the stream; it doesn't have any payload

Other data packets are ignored, and left for future expansion.

Things of note:

  • Kate is a discontinuous codec, as defined in ogg-multiplex.html in the Ogg documentation, which means it's timed by start granule, not end granule (as Theora and Vorbis are). Also, all data packets are on their own page, for two reasons:
    • Ogg keeps track of granules at the page level, not the packet level
    • if no text event happens for a while after a particular text event, we don't want to delay it so a fuller page can be issued

See also Problems to solve: Seeking and memory.

  • The granule encoding is not a direct time/granule correspondence; see the granule encoding section.
  • The EOS packet should have a granule pos higher than the end time of all events.
  • User code doesn't have to know the number of headers to expect; this is handled inside the library code.

Support

I have patches for the following with Kate support:

  • oggmerge (it also adds Theora support)
  • file(1)
  • MPlayer (for multiplexed per-language subtitles - all region/style info is ignored)

None of those are released yet, since the Kate bitstream format is still a work in progress.

Granule encoding

At the moment, the granules are split in two: the high bits represent a time (scaled by a fractional speed defined in the ID header), and the low bits are an increasing counter used when several events happen at the same time. At the moment, 5 bits are taken for that counter. This is totally arbitrary and subject to change. See also the problems to solve section, about seeking, for a possible three-way split, where the high bits would be further split.
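
As a rough illustration of the split described above, here is a minimal Python sketch. It is not the reference implementation: the function names are made up for this example, the granule rate is assumed to be a numerator/denominator pair giving time units per second, and the 5 bit counter width is simply the current value mentioned in the text.

```python
from fractions import Fraction

COUNTER_BITS = 5  # current value per the text; arbitrary and subject to change


def encode_granule(time_units, counter, counter_bits=COUNTER_BITS):
    """Pack a scaled time (high bits) and a same-time event counter (low bits)."""
    if time_units < 0:
        raise ValueError("time must be non-negative")
    if not 0 <= counter < (1 << counter_bits):
        raise ValueError("counter does not fit in the low bits")
    granule = (time_units << counter_bits) | counter
    if granule >= 1 << 63:
        # Ogg granule positions are signed 64-bit values.
        raise ValueError("granule does not fit in an Ogg granule position")
    return granule


def decode_granule(granule, counter_bits=COUNTER_BITS):
    """Return (time_units, counter) from a packed granule."""
    return granule >> counter_bits, granule & ((1 << counter_bits) - 1)


def granule_to_seconds(granule, rate_num, rate_den, counter_bits=COUNTER_BITS):
    """Convert to seconds, assuming rate_num / rate_den time units per second.

    rate_num and rate_den are hypothetical names for the fraction stored
    in the ID header.
    """
    time_units, _ = decode_granule(granule, counter_bits)
    return Fraction(time_units * rate_den, rate_num)


# Two events at the same time still get distinct, increasing granules.
first = encode_granule(1000, 0)
second = encode_granule(1000, 1)
```

With this layout, events sharing a timestamp sort by their counter, and any event at a later time sorts after all of them, which is what a monotonically increasing granule position needs.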

Motion

The Kate bitstream format includes motion definition, primarily for karaoke purposes, but which can be used for more general purposes, such as line based drawing.

Motions are defined by the means of a series of curves (for now, segments and splines). A 2D point can be obtained for any timestamp during the lifetime of a text. This can be used for moving a marker in 2D above the text for karaoke, or to use the x coordinate to color text when the motion position passes each letter or word, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have an arbitrary number of control points, complex motions can be achieved. If the motion is the main object of an event, it is even possible to have an empty text, and use the motion as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could be done this way, though this would require a lot of control points.

It is worth mentioning that pauses in the motion can be trivially included by inserting at the right time and for the right duration a simple linear interpolation curve with only two equal points, equal to the position the motion is supposed to pause at.
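
The pause trick can be sketched as follows. This is only an illustration in Python under simplifying assumptions: every curve is a two-point linear segment with its own duration, curves play back to back, and the names (lerp_segment, motion_position) are invented for the example rather than taken from the Kate library.

```python
def lerp_segment(p0, p1, t):
    """Linear interpolation between two 2D points, with t in [0, 1]."""
    return (p0[0] + (p1[0] - p0[0]) * t, p0[1] + (p1[1] - p0[1]) * t)


def motion_position(curves, time):
    """Return the 2D point of a motion at the given time.

    curves is a list of (duration, p0, p1) tuples played back to back;
    durations must be positive. Past the end, the last point is held.
    """
    if not curves:
        raise ValueError("a motion needs at least one curve")
    start = 0.0
    for duration, p0, p1 in curves:
        if time < start + duration:
            return lerp_segment(p0, p1, (time - start) / duration)
        start += duration
    return curves[-1][2]


# Move right for 1s, pause for 2s (two equal points), then move right again.
karaoke = [
    (1.0, (0.0, 0.0), (10.0, 0.0)),
    (2.0, (10.0, 0.0), (10.0, 0.0)),  # the pause: a degenerate segment
    (1.0, (10.0, 0.0), (20.0, 0.0)),
]
```

The pause needs no special case in the evaluator: interpolating between two equal points yields the same point for the whole duration of that curve.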

I could also let an event have an indefinite number of attached motions. If so, a motion might be made 1D only, and a karaoke moving pointer system would attach two of them. Thus, if one needed N coordinates, one would attach N motions. They wouldn't have to have the same curves at all. This needs more thinking about.

Kate defines a set of predefined mappings so that each decoder user interprets a motion in the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses of that feature.

If an application wishes to have a motion in more than 2 dimensions (eg, to have four extra dimensions which would be interpreted as, say, the RGBA components of a marker color whose position is controlled by the first two coordinates of the motion), it is possible to add two empty texts, each with its own 2D motion. This, however, is entirely an application issue and the Kate specification does not attempt to codify the use of extra motions.

Problems to solve

There are a few things to solve before the Kate bitstream format can be considered good enough to be frozen:

Seeking and memory

When seeking to a particular time in a movie with subtitles, we may end up at a place where a subtitle has been started but has not been removed yet. Pure streaming doesn't have this problem, as the decoder remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:

  • each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
    • this means seeks are two-phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the granule originally sought. This implies support code in players to do the double seek.
  • use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
    • this requires reissuing packets, and it doesn't feel right (and wastes space).
    • it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".
  • A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string rather than the previous one. This allows a two-phase seek rather than a multiphase seek that hops back from event to event with no real way to know whether a previous event is still active. I suppose CMML has no need to know this if its "clips" do not overlap; mine can.
    • Such a system considerably shortens the usable granule space, though it can do a one-phase seek, if I understand the system correctly, of which I am not certain.
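The first two options above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory list of (granule, earliest active granule) tuples stands in for a real Ogg seek, the 16-bit delta width is invented, and neither function is part of any existing player or library.

```python
from bisect import bisect_left

DELTA_BITS = 16  # hypothetical width of the "delta" field


def two_phase_seek(packets, target):
    """First option: each packet carries the granule of the earliest
    still active packet.

    packets is a list of (granule, earliest_active_granule) tuples sorted
    by granule (a hypothetical in-memory index). Returns the index of the
    packet from which decoding should start.
    """
    granules = [granule for granule, _ in packets]
    # Phase 1: find the next Kate packet at or after the target.
    index = bisect_left(granules, target)
    if index == len(packets):
        return index  # no packet at or after the target
    earliest_active = packets[index][1]
    if earliest_active >= target:
        return index  # nothing older is still active
    # Phase 2: seek again, back to the earliest still active packet.
    return bisect_left(granules, earliest_active)


def reference_granule(granule):
    """Second option: clear the low (delta) bits to get the granule of
    the previous reference frame."""
    return granule & ~((1 << DELTA_BITS) - 1)
```

The sketch shows the trade-off: the first option needs a second lookup but no repeated data, while the second is a single mask operation but relies on active strings being reissued at every reference frame.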

Text encoding

A header field declares the text encoding used in the stream (this can be overridden in a data packet, but this is not relevant to this point). At the moment, only UTF-8 is supported, for simplicity, and I have not yet decided whether or not the Kate specification will allow for other encodings, such as UTF-16 or UTF-32. The reason for this is that, if these were to be supported, either:

  • users of the decoder would have to be ready to face text in any one of these encodings
  • the decoder would have to convert encodings to one selected by the user of the decoder

The first option may be asking a lot of users, while the second one brings complexity to the decoder, and kind of defeats the purpose of supporting the encoding in the first place.

Note that strings included in the header (language, category, etc) are not affected by that text encoding (rather obviously for the language itself). These are ASCII.

An argument in favor of UTF-8 only text is that it is the format of Vorbis comments, which are part of the Kate bitstream format.

Language encoding

A header field defines the language (if any) used in the stream (this can be overridden in a data packet, but this is not relevant to this point). At the moment, my test code uses ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching a language to a user selection may be simpler for user code if the language encoding is kept simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

Alternatively, I might use only RFC 1766 tags, which are essentially the subset I considered above, but this RFC has been deprecated by RFC 3066, and I'm not sure of the wisdom of basing a new format on a deprecated RFC.

Also, it might be possible for the language field to be a list of such language tags, for streams that contain several languages, though the usual way to present several languages is to have several bitstreams multiplexed with one another (as opposed to Writ, which has all languages included in a single bitstream).

A disadvantage of having multiple languages is that text-to-speech typically needs to know the current language to function properly, and that having, say, two current languages, would make it more difficult to deal with such a stream.
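To show why the simpler two-level scheme keeps user code short, here is a hedged Python sketch of matching a stream language against a user selection. The matching rule (a missing secondary tag acts as a wildcard) and the function names are my own assumptions for illustration; nothing here is mandated by the Kate format or by RFC 3066.

```python
def normalize(tag):
    """Lower-case a tag and split it into (primary, secondary or None).

    Accepts both "_" and "-" as separators.
    """
    parts = tag.replace("-", "_").lower().split("_", 1)
    return parts[0], (parts[1] if len(parts) > 1 else None)


def language_matches(stream_tag, wanted_tag):
    """Return True if the stream language satisfies the user's selection."""
    s_primary, s_secondary = normalize(stream_tag)
    w_primary, w_secondary = normalize(wanted_tag)
    if s_primary != w_primary:
        return False
    # A missing secondary tag on either side acts as a wildcard.
    if s_secondary is None or w_secondary is None:
        return True
    return s_secondary == w_secondary
```

Full RFC 3066 tags can carry more subtags than this, which is where the matching logic would start to grow.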

Bitstream format for floating point values

At the moment, floating point values (for splines) are stored as their textual representation, and converted back and forth using snprintf and sscanf. We could quantize them and store as integers, since precision isn't that important here.
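A minimal Python sketch of the quantization idea follows, using fixed-point values with a fixed number of fractional bits. The choice of 16 fractional bits and the function names are assumptions for illustration only; the actual precision would have to be decided for the format.

```python
FRACTION_BITS = 16  # hypothetical precision
SCALE = 1 << FRACTION_BITS


def quantize(value):
    """Convert a float to a fixed-point integer (rounded to nearest)."""
    return int(round(value * SCALE))


def dequantize(quantized):
    """Convert a fixed-point integer back to a float."""
    return quantized / SCALE
```

The round trip loses at most half a quantization step, which should be harmless for curve control points, and the integer has a fixed, compact size, unlike a textual representation.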

  • Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.
  • Since motions may be repeated, I may add predefined motions in an extra header packet, to be referenced as styles and regions are. This would depend on whether motions are likely to be exactly repeated often, and I don't know if this will likely be the case. Complex motion definitions can take a lot of space, especially with the current floating point value encoding. After some thought, I will almost certainly place predefined curves in a header, and allow motions to refer to them. Fully defined curves will also be able to be placed in data packets, as it's likely some curves will be used only once, and it would constrain future uses to allow them only in headers (eg, if one were to stream handwriting using Kate).

Text to speech

One of the goals of the Kate bitstream format is that text data can be easily parsed by the user of the decoder, so any additional information, such as style, placement, karaoke data, etc, should be able to be stripped to leave only the bare text. This is with a view to allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap way of conveying speech data. It could also allow things like e-books which can be either read or listened to from the same bitstream. I have seen no reference to this being used anywhere, but I see no reason why the granule progression should be temporal rather than user controlled, such as by using a "next" button which would bump the granule position by a preset amount, simulating turning a page. This would be close to necessary for text-to-speech, as the wall time duration of the spoken speech is not known in advance to the Kate encoder, and can't be mapped to a time-based granule progression. All text strings triggered consecutively between the two granule positions would then be read in order.
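The "next button" idea can be sketched as follows in Python. The list of (granule, text) pairs and the function name are hypothetical stand-ins for whatever a decoder would hand to the application; the sketch only shows how the strings falling between two granule positions would be collected in order.

```python
def page_texts(events, position, step):
    """Return the texts triggered in [position, position + step), in order.

    events is a list of (granule, text) pairs with the markup already
    stripped (a hypothetical layout).
    """
    selected = [(g, t) for g, t in events if position <= g < position + step]
    # sorted() is stable, so texts sharing a granule keep their issue order.
    selected = sorted(selected, key=lambda event: event[0])
    return [text for _, text in selected]


events = [(0, "Chapter one."), (40, "It was a dark night."), (100, "Chapter two.")]
```

A reader application would call this once per page turn, advancing position by step each time, and pass the result either to the display or to the speech synthesizer.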

Possible additions

HTML (or similar) text content

At the moment, free UTF-8 text is included in the data packets. Kate doesn't care about the actual contents of that text. Allowing a subset of HTML would provide an easy way to define extra style elements within the body of the text, at the glyph level. Despite originally not wanting to add in-band markup, I am more and more thinking about making this change. In this case, Kate would have a way to give a scrubbed text to the client. Since these markup tags can't be nested, that scrubbing is easy to do, so that users do not have to understand those tags (or scrub them themselves). Subset to be defined, and fallback for plain text to be added. This is an argument to keep everything in UTF-8, isn't it? I don't know how one would go about having UTF-16 HTML code.
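Here is a minimal Python sketch of what such scrubbing could look like. Since the markup subset is not defined yet, the tag used in the usage note is purely an example, and the sketch deliberately ignores escaping (such as a literal "<" in the text), which the real subset would have to specify.

```python
import re

# Anything between "<" and the next ">" is treated as a tag.
TAG = re.compile(r"<[^<>]*>")


def scrub(text):
    """Remove markup tags, leaving only the bare text."""
    return TAG.sub("", text)
```

For instance, scrub("Hello <b>world</b>!") gives "Hello world!", which is what a text-to-speech client would want to receive.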

Embedded binary data

Various types of binary data could be embedded within a Kate stream:

Fonts

Font selection is the first thing that came to mind, due to the discrepancy in font naming across platforms (eg, the *-*-* X system, and the... hmm, not sure, filename? in Windows). A potential problem, however, is that there might be several multiplexed Kate streams in an Ogg bitstream, so a custom font might be included several times in the container stream. On the other hand, it would allow for per-language fonts.

Images

Though this could interfere with the ability to render as text-to-speech, images could be mixed with text. The same caveat as for fonts applies with regard to data duplication. This might however be best left to a multiplexed OggSpots stream, unless the images mesh with the text (eg, graphical exclamation points, etc).

A possible solution to the duplication issue is to have another stream in the container stream, which would hold the shared data (eg, fonts), which the user program could load, and which could then be used by any Kate (and other) stream. Typically, this type of stream would be a degenerate stream with only header packets (so it is fully processed before any other stream presents data packets that might make use of that shared data), and all payload such as fonts being contained within the headers. Thinking about it, it has parallels with the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the list of styles within a header packet.

Reference encoder/decoder

An encoder and a decoder are included in the tools directory. Note that they are very rough and do not perform much error checking at all. The encoder pulls its input from a custom text-based file format, which is by no means meant to be part of the Kate specification. It is just used as a quick way to define data to create a Kate bitstream. Tools might be created to create a Kate bitstream from various data formats, such as existing subtitle formats (SSA, etc). The Kate bitstreams encoded and decoded by those tools, however, are (supposed to be) correct for this specification, provided their input is correct.