XiphWiki - User contributions [en]

Icecast Server

2012-01-15T10:56:17Z

Ogg.k.ogg.k: Reverted edits by Tarjetis (talk) to last revision by Basilgohar

'''Icecast''' is an open source multi-platform streaming server. It supports [[Ogg]] [[Vorbis]], Ogg [[Theora]], and [[MP3]].

== External links ==

* [http://www.icecast.org/ Icecast homepage]
* [http://dir.xiph.org/index.php Stream directory]
* [http://www.nabble.com/Icecast-f2880.html Icecast archive / forum] - an Icecast mailing list archive that combines both user and dev lists. It is hosted by [http://www.nabble.com/ Nabble]. You can search or browse Icecast discussions here.

== Development ==

*trunk http://svn.xiph.org/icecast/trunk/icecast
*kh-branch http://svn.xiph.org/icecast/branches/kh/icecast
**diff to trunk
***fast pre-buffering aka burst-on-connect. <br>State a burst size in bytes to indicate how much should be sent at listener connect.
***MP3 accepts artist and title separately in the URL.
***program invocation at stream start and end, on a per-mount basis.
***on-demand relays, activated on first listener, disconnected when the listener count falls to 0. <br>Available for master relays as well.
***multiple Ogg codec streaming. Current codecs handled are Theora, Vorbis, Speex, Writ.
***Clients are started at a Theora key frame if Theora is being streamed.
***Added URL and command based listener authentication
***server XML reload and log reopening available via admin URL
***slave startup re-organised so that relays are more independent
***on XML reload, active sources are updated as well
***When max-listeners is reached, an HTTP 302 code can be sent to redirect clients to alternative slave hosts.
***authenticated relays, those that match the relay user/pass, bypass the max-listener check

== Wish List ==

As good ideas are never a waste, and for tracking purposes, please list here all the features you're missing in icecast trunk.

Note: please check that the feature you request is not already in trunk before posting!

* WebM streaming
* OggOpus streaming
* PUT method support
* Ponies

[[Category:Xiph-related Software]]

Talk:PortablePlayers

2011-12-16T15:05:49Z

Ogg.k.ogg.k: Reverted edits by Zilking75 (talk) to last revision by Martin.leese

== Discontinued players ==
I think that discontinued players should be moved into a different section. Not necessarily removed, as many are still available on clearance/refurbished/used.

The question is, how to make the split? I see three possibilities:
# Discontinued players on a separate page
# same page, but separate top-level section. e.g.:
#* Current devices
#** Flash-memory devices
#** HD devices
#** CD/DVD devices
#* Discontinued devices
#** Flash-memory devices
#** HD devices
#** CD/DVD devices
# make a subsection within each section. e.g.:
#* Flash-memory devices
#**Current devices
#**Discontinued devices
#* HD devices
#**Current devices
#**Discontinued devices
#* CD/DVD devices
#**Current devices
#**Discontinued devices

I'll hold off any updates right now, but I'll check back in a few days/weeks/whenever and see if there are any opinions here. If there's no disagreement by then, and no one has beaten me to it, I'll take the initiative. [[User:Bsammon|Bsammon]] 03:43, 30 May 2010 (UTC)

:Having sections for discontinued devices (on each subpage) is fine - but I suggest to avoid explicit ''Current devices'' sections. With a ''discontinued devices'' section on a page it is immediately clear that everything listed before is current.--[[User:Gsauthof|Gsauthof]] 10:09, 30 September 2011 (PDT)

== List of top five players ==
It would be a good idea to have a few (five?) players at the top with images that are considered to be the best *recent* devices. I don't think any of the MP3 using masses will use this page to choose their next music player unless it lists recent devices, and presents a choice of five or six at the top, with images, and links to sites that they can buy them from. Also, could someone put up a notice to remind people it's not OGG, or Ogg! It's Ogg Vorbis, or if you must, Vorbis. - thehumanerror 25th December 2006

I totally agree with the above. This page was next to useless for me when I was shopping for a Vorbis player since I was overwhelmed with choices. Add to that the fact that many products have been discontinued or cannot be bought new and there's a recipe for disaster. - erpo41 October 17th, 2007

I also agree with the above; the primary reason I am not using Ogg Vorbis (I keep a parallel collection of mp3 and flac files) is I cannot easily find a portable player. I don't know that reorganizing this wiki page will help. I did comb through this page; basically all of the listed hard disk players are from one off manufacturers or not being manufactured any more. There are plenty of nice flash storage based devices and cell phones (from Samsung and others), but that is not what I am looking for. Also, I'm not interested in hacking my iPod. (I do embedded linux development enough at work; I'll pay someone else to get my media player working). Until this is addressed, Ogg Vorbis is going to remain out of use; which is a shame because for every other reason it is the best (in my opinion). --Kevin Holzer, January 10, 2009

:Yes. This is a good idea. Create a section at the top. Polish it well. And perhaps add a free-licensed photo. Anyone up for it?--[[User:Saoshyant|Ivo]] 06:41, 17 October 2007 (PDT)

I would rather see just a simple feature matrix (sorted so that unavailable devices are listed at the bottom, or just not listed at all). See talk below. Maybe preferred choices could be raised to the top, though! I agree that the current list is quite unusable.

:As of now, the page contains a feature matrix, which only lists current and available devices. Thus, when you check this page out for buying advice it should be more useful now, i.e. it has no overwhelming effect anymore. I consider the raised issues as '''done'''. --[[User:Gsauthof|Gsauthof]] 10:04, 30 September 2011 (PDT)

== Recording in Vorbis ==

I would like to know which Players can '''record''' in Vorbis?! -- [[User:217.186.150.213|217.186.150.213]] 17:03, 26 Dec 2004 (PST)

:Ditto. Absolutely vital information. Do any of the players listed also record in Vorbis? If anyone has experience with a player, please state specifically whether it does or does not record in Vorbis. [[User:Nickhill|Nickhill]] 15:04, 4 June 2006 (PDT)

::Never heard of one that does, and there isn't a fixed point reference encoder, which makes it unlikely.

== Pretec Allegro may need firmware update ==

I recently purchased a Pretec Allegro, but was unable to play Oggs for three months, until the firmware update was made available on 14 or 15 March 2005. Now it works well! (So far, listening to -q3 Oggs). I'd hope that units purchased after this date already have the firmware update, but you never know. Installing the update is as simple as placing the .rom on the USB storage media (e.g. a flash disk), starting up the unit, and pressing the play button. -- Hugo van der Merwe
: How much battery runtime do you get playing Oggs compared with playing mp3? [[User:Phr|Phr]] 02:05, 27 Aug 2005 (PDT)

== Any player with Removable Memory Cards ==

The NexBlack (see [[PortablePlayers]] ) has removable compact flash and batteries.

Every single Vorbis-capable portable player out there seems to come with built-in flash memory. Which is stupid, because I don't want to fire up my computer and plug in the player every time I get tired of the tracks on my player. Plus flash memory has a limited lifetime (write cycles) and so does your player with built-in memory. The same applies to built-in rechargeable batteries.

Now when would you ever need to buy your second device without any moving parts if you could just change flash memory and batteries? OK, that's the industry's point of view but not mine. I want to go on vacation with music and batteries for one week of non-stop music, without a power source or computer nearby.

So, any hint to where I might find a portable audio player that can play back ogg vorbis files and uses SD flash cards (and preferably AAA-batteries) would be greatly appreciated.
* Me too! If the [http://enox.co.kr/2004/eng/product/product_830_01.asp Enox EMX-830] took SD cards it'd be perfect. --[[User:Rgm|rgm]] 14:41, 7 Nov 2005 (PST)

* SanDisk Sansa e250/e260/e270/e280 has a microSD-card slot. With ROCKbox it plays Ogg/Vorbis and more.[[User:Nostromo|Nostromo]] 15:26, 29 October 2007 (PDT)

----

The Pretec Allegro is not the slickest player out there; its LCD backlight seems to give off a high-pitched whine, which not everyone can hear (it kind of screams in my ears though, so I put the backlight timer on 1 second so it doesn't scream too long). It is, however, the only one I now know of that can play Oggs and uses removable media. If you want a nicely portable device, you have to use Pretec's "iDisk tiny" USB flash disk, the only thing that will fit inside. You can also, however, connect a USB SD-card reader with its cable, then listen to Oggs off of SD. A little unwieldy, but it works, and is the only thing *I* know of. (I stopped following developments in December though, when I bought it...)

== Samsung / Yepp ==

Moved to [[Talk:PortablePlayersSamsungYepp]]

== UniBrain iZak ==

Apologies if this is the wrong place for this; I'm new to wikis.

The UniBrain iZak was added, then removed recently, with the comment that it doesn't claim to play Ogg Vorbis.

The FAQ is available here: [http://www.unibrain.com/support/FAQ_iZak.htm iZak FAQ] and Question/Answer 24 says:

'22. Can iZak™ support OGG audio files?

Yes, iZak™ fully supports OGG playback using the latest firmware.'

:I was the one that removed it. In their specs linked from the main page, I saw that they listed only MP3 and WMA support for music formats. Obviously they need to update their promotional material! I went ahead and added the iZak back in, making a point to mention that the most current version of the firmware now supports Ogg Vorbis and linking to their FAQ as evidence. [[User:Saxifrage|Saxifrage]] 02:36, 5 May 2005 (PDT)

:Splendid. I didn't want to just stick it back after it had been taken out.--[[User:Ipl|Ipl]] 05:14, 5 May 2005 (PDT)

== Entempo Spirit ==

This inexpensive player from Entempo had listed Vorbis as a "Supported Audio Format", but the device will not index Vorbis files into its menus, let alone play them. Tested with both the stock and most recent firmware, May 29, 2005. The vendor was contacted and removed the Vorbis support claims from its website, but has not provided any resolution to customers who purchased the product expecting this support. The company's webpage has disappeared as of Feb 2006.

== Lexar LDP-800 dropped ==
It seems that Lexar have abandoned the LDP-800. The following was posted by a user on [http://www.dapreview.net/comment.php?comment.news.1055 dapreview.net]
" Unfortunately, lexar will not offer the LDP-800, but will focus instead
on its existing LDP Players that already offer appealing features and
benefits to meet a variety of consumer needs."
Shame.--[[User:Ipl|Ipl]] 06:15, 22 Jul 2005 (PDT)

There's more info on that dapreview thread that indicates some confusion within Lexar. Currently, it looks like the release is going to happen in early September.

Update 2005-11-11: after inquiries to Lexar's "new products" personnel, I received a telephone message that the LDP-800 definitely "is not going to see the light of day." Ask me if you want details. I agree that it's a shame since this looked to be an outstanding product. --[[User:dfavro|dfavro]]

== Hong Kong Dream-tech Electronic DT-202, works? please confirm ==
http://hkdream-tech.com
An eBay seller says that it can play Vorbis. This is unconfirmed. The manufacturer's website lists: MP3, WMA, WAV, DMV, etc.

Some webpage also says that it works on Windows, Mac and Linux. Also unconfirmed.
Further investigation required.

== Trekstor i.Beat Cube ==
This player seems to be very similar to the Samsung Yepp YP-T6, possibly with the [[#Yepp_MT-6X|same problems]] regarding Vorbis playback. Trekstor has moved [http://www.trekstor.de/en/produkte/mp3-player/ibeat-cube.html info about this player] from "MP3-Player" to the "Archive" section, which probably means that it is not produced anymore.

== The Muzio jm300 / jm-300 does NOT play Vorbis ==

NB this is the jm-300 (not 100 or 200)

I bought this a month ago. I've been unable to play Vorbis files on
it. It simply shows these as 'etc' files and skips over them.

Pity really, this was the main reason I chose this player.

I've seen lots of discussion about the Muzio playing Oggs. Is there
anybody there who owns a jm300 and is actually playing Oggs? I can't
help thinking I've just missed something basic.

== Layout of the PortablePlayers list and Feature matrix ==
It's gone! I've moved this discussion to [[Talk:PortablePlayersv2]]. [[User:Imalone|Imalone]] 10:55, 18 November 2006 (PST)

:Is there something very wrong with those proposals? I mean, is there any reason why (even a simple) feature matrix just could not be applied right now? It would probably solve the 'list of top 5 players' problem above too. Just list something basic from the main features: name, size, weight, price, battery (internal, AA, AAA, ...), capacity, flash card type (SD, microSD, ...), availability (current or discontinued), supported formats, charging (USB, proprietary, or none). Link to the longer comments. No complicated sorting or anything too fancy. No icons. The name can be an abbreviation to save space; use it as a link to the current comments.

::As of now I consider this as '''done'''. --[[User:Gsauthof|Gsauthof]] 10:14, 30 September 2011 (PDT)

== NEXBlack out ==

I got my NEXBlack player today from Frontier Labs. It is a nice gadget with a sleek design. They have corrected the occasional snapping sounds between tracks, and it is more usable overall now. Vorbis files also play fine, but the current firmware doesn't have a Vorbis tag reader, which is a fairly major drawback. Music selection works through MP3 tags, and you can select by album, artist, genre and playlist, but since Vorbis tags don't work you have to select "unordered" to play them. Vorbis files are all listed in one big list. I hope they either implement a Vorbis tag reader or revert to the old Nex IIe system where you could select by folder on the flash disk. But for the cheap price ($89), it is a good player... waiting for new firmware.

== Sumvision M18/S1 ==

I've just got the 2GB Sumvision and it plays the OGG files I've tested so far. Should I add it to the list? [[User:Steevc|Steevc]] 04:05, 19 April 2007 (PDT)
:Yes, [http://en.wikipedia.org/wiki/Wikipedia:Be_bold just do it :)]. --[[User:Gsauthof|Gsauthof]] 10:18, 30 September 2011 (PDT)

== Humble A2 Review ==

Just a blog link [http://www.personal.psu.edu/gsc127/blogs/2007/10/happiness-with-cowon-a2.html to my review of the Cowon A2]. Thanks, [[User:GChriss|GChriss]] 13:23, 6 November 2007 (PST)

== iRiver e100 ==

[http://reviews.cnet.com/4566-6490_7-0.html?filter=1000036_5260177_ CNet] and [http://www.amazon.com/iRiver-E100-Multimedia-Player-White/dp/B00171UYYS/ref=pd_bbs_sr_5?ie=UTF8&s=electronics&qid=1208253617&sr=8-5 Amazon] are saying the iRiver e100 supports Vorbis. I haven't tested it myself. [[User:Mattflaschen|Mattflaschen]] 03:09, 15 April 2008 (PDT)

== Bought a Vorbis-enabled player recently? Tell us where! ==

I have started a page that should make it easier for people to purchase Vorbis-enabled players: [[PortablePlayers_per_Place]]

Everyone who bought a Vorbis-enabled player recently should update the page with place and model.

== Move Flash/HD-sections to dedicated pages ==

Hi,

IMHO the PortablePlayers page is too long. I want to split it into several pages, one for each main section, like [[PortablePlayers/Flash]], [[PortablePlayers/Harddisk]] etc. Sure, one has to fix some links then, but I am convinced this step would increase the usability a lot.
What do you think about that? --[[User:Gsauthof|Gsauthof]] 01:20, 31 March 2009 (PDT)

:Since there were no objections I restructured the page as planned. --[[User:Gsauthof|Gsauthof]] 10:07, 27 June 2010 (UTC)

OggOpus

2011-11-21T17:36:18Z

Ogg.k.ogg.k:

== Ogg mapping for Opus ==

The IETF Opus codec is a low-latency audio codec optimized for both voice and general-purpose audio. See [http://tools.ietf.org/html/draft-ietf-codec-opus the spec] for technical details.

Almost everything about this codec is either fixed or dynamically switchable, so the usual id and setup header parameters in the header packets of an Ogg encapsulation aren't useful. In particular, bitrate, frame size, mono/stereo, and coding modes are all dynamically switchable from packet to packet. A one-byte header on each data packet defines the parameters for that particular packet.

Remaining parameters we need to signal are:

* magic number for stream identification
* comment/metadata tags

Additionally there's been a desire to support some kind of channel bonding for surround, and some kind of option signalling for "Opus Custom", in particular the granulerate.

=== Draft spec ===

Granulepos is the count of decodable samples at a fixed rate of 48 kHz.

Two headers: id, comment

==== Id header ====

* Magic signature: "OpusHead" (64 bits)
* Version number (8 bits): zero for this spec
* Channel count 'c' (8 bits unsigned): MUST be > 0
* Pre-skip (16 bits unsigned)
* Input sample rate (32 bits, little endian): informational only
* Output gain (16 bits, little endian, signed Q7.8 in dB) to apply when decoding
* Channel mapping family (8 bits)
** 0 = one stream, RTP order
** 1 = channels in Vorbis spec order
** 2..254 = reserved (treat as 255)
** 255 = no defined channel meaning
If channel mapping family > 0:
* Stream count 'N' (8 bits unsigned): MUST be > 0
* Two-channel stream count 'M' (8 bits unsigned): MUST satisfy M <= N, M+N <= 255
* Channel mapping (8*c bits)
** one stream index (8 bits unsigned) per channel (255 means silent throughout the file)
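
The field layout above can be sketched as a small parser. This is only an illustration of the draft layout, not reference code: the function name <code>parse_opus_head</code> and the returned dictionary keys are made up for this example, and the pre-skip field is assumed to be little endian like the other multi-byte fields (the list above does not state its byte order).

```python
import struct


def parse_opus_head(packet: bytes) -> dict:
    """Parse an OpusHead id header laid out as in the draft above.

    Assumption: pre-skip is little endian, like the other multi-byte fields.
    """
    if len(packet) < 19 or packet[:8] != b"OpusHead":
        raise ValueError("not an OpusHead packet")
    # version, channel count, pre-skip, input rate, output gain, mapping family
    version, channels, pre_skip, input_rate, gain_q8, family = struct.unpack_from(
        "<BBHIhB", packet, 8
    )
    if version != 0:
        raise ValueError("unsupported version")
    if channels == 0:
        raise ValueError("channel count must be > 0")

    if family == 0:
        # Family 0: no further fields are coded; defaults apply.
        if channels > 2:
            raise ValueError("family 0 allows only 1 or 2 channels")
        streams, coupled = 1, channels - 1
        mapping = list(range(channels))
    else:
        if len(packet) < 21 + channels:
            raise ValueError("truncated channel mapping")
        streams, coupled = packet[19], packet[20]
        if streams == 0 or coupled > streams or streams + coupled > 255:
            raise ValueError("invalid stream counts")
        mapping = list(packet[21:21 + channels])

    return {
        "channels": channels,
        "pre_skip": pre_skip,
        "input_sample_rate": input_rate,
        "output_gain_db": gain_q8 / 256.0,  # Q7.8 fixed point, in dB
        "mapping_family": family,
        "streams": streams,
        "coupled_streams": coupled,
        "mapping": mapping,
    }
```

A stereo family 0 header is 19 bytes long; any family above 0 adds two count bytes plus one mapping byte per channel.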

Some discussion is in order.

* '''Magic signature'''
The magic signature "OpusHead" allows codec identification and is human readable. Starting with 'Op' helps distinguish it from data packets, as this is an invalid TOC sequence.

* '''Version'''
The version number must always be zero for this version of the encapsulation spec. We do not plan to revise the spec, but this also acts as a null terminator for the signature bytes and helps align the rest of the fields.

* '''Channel count''' 'c'
The number of channels byte specifies the number of output channels (1...255) for this Ogg Opus stream.

* '''Pre-skip'''
This is the number of samples (at 48 kHz) to discard from the decoder output before starting playback.

The purpose of pre-skip is to allow a time-segment of an existing Opus stream to be saved as an independent Ogg file, with single-sample time granularity, without re-encoding. Opus is an asymptotically convergent predictive codec, so the decoded contents of each frame depend on the recent history of decoder inputs. Pre-skip can be used to provide sufficient history to the decoder so that it has already converged before the stream's output begins.

Because more than one page can be needed for re-convergence, the Vorbis scheme for signaling pre-skip is not used for Opus.

The granule corresponding to the end time of an Ogg Opus page can be determined by subtracting the pre-skip from the page's granulepos value. For example, if the page's granulepos is 59970, and the pre-skip is 11971, then the last sample decoded from the page is sample 47999, i.e. the last sample from the first second of absolute time.
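
The subtraction in this example can be written out as a short sketch. The helper names below are hypothetical and only restate the arithmetic from the paragraph above; they are not part of any Opus or Ogg library.

```python
GRANULE_RATE_HZ = 48000  # granulepos always counts samples at 48 kHz


def page_end_granule(granulepos: int, pre_skip: int) -> int:
    """Granule for the end time of a page: granulepos minus pre-skip."""
    return granulepos - pre_skip


def granule_to_seconds(granule: int) -> float:
    """Convert a 48 kHz granule value to seconds."""
    return granule / GRANULE_RATE_HZ


end = page_end_granule(59970, 11971)
print(end)  # 47999
```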

When constructing cropped Ogg Opus streams, we recommend a pre-skip of at least '''FIXME''' samples to ensure complete convergence.

* '''Input sample rate'''
This is ''not'' the sample rate to use for playback of the encoded data.

Opus has a handful of coding modes, with internal sample rates of 8, 12, 16, 24, and 48 kHz. Each packet in the stream may have a different internal sample rate. Regardless of the internal sample rate, the reference decoder supports decoding any stream to any of these sample rates. The original sample rate of the encoder input is not preserved by the lossy compression.

An Ogg Opus player SHOULD select the playback sample rate according to the following procedure:
** If the hardware supports 48 kHz playback, decode at 48 kHz
** else if the hardware's highest available sample rate is a supported rate, decode at this sample rate
** else if the hardware's highest available sample rate is less than 48 kHz, decode at the next higher supported rate and resample
** else decode at 48 kHz and resample.
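
The selection procedure above can be sketched as follows. The function name and the <code>(decode_rate, needs_resample)</code> return shape are assumptions made for this example; the list of hardware rates is expected to be non-empty.

```python
# Sample rates the reference decoder can decode to, per the text above.
SUPPORTED_RATES = (8000, 12000, 16000, 24000, 48000)


def choose_decode_rate(hardware_rates):
    """Return (decode_rate, needs_resample) for the given hardware rates in Hz."""
    if 48000 in hardware_rates:
        return 48000, False
    highest = max(hardware_rates)
    if highest in SUPPORTED_RATES:
        return highest, False
    if highest < 48000:
        # Decode at the next higher supported rate, then resample down.
        next_rate = min(r for r in SUPPORTED_RATES if r > highest)
        return next_rate, True
    # Hardware only offers rates above 48 kHz: decode at 48 kHz and resample.
    return 48000, True
```

For instance, hardware limited to 22050 Hz would decode at 24 kHz and resample, while hardware limited to 16 kHz would decode at 16 kHz directly.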

However, the Ogg mapping allows the encoder to pass the sample rate of the original input stream as metadata. This may be useful when the user requires the output sample rate to match the input sample rate. For example, a non-player decoder writing PCM format to disk might choose to resample the output audio back to the original input rate to reduce surprise to the user, who might reasonably expect to get back a file with the same sample rate as the one they fed to the encoder.

A value of zero indicates 'unspecified'. Implementations which do something with this field should take care to behave sanely if given crazy values (e.g. don't actually upsample the output to 10MHz) and encoders should write the actual input rate or zero.

* '''Output gain'''
This is a gain to be applied by the decoder. Virtually all players and media frameworks should apply it by default. If a player chooses to apply any volume adjustment or gain modification, such as the R128_TRACK_GAIN or a user-facing volume knob, the adjustment MUST be applied ''in addition'' to this output gain in order to achieve playback at the desired volume.

An encoder SHOULD set the output gain to zero, and instead apply any gain prior to encoding, when this is possible and does not conflict with the user's wishes. The output gain should only be nonzero when the gain is adjusted after encoding, or when the user wishes to adjust the gain for playback while preserving the ability to recover the original signal amplitude.

Note that although the output gain has enormous range (+/- 128 dB, enough to amplify inaudible sounds to the threshold of physical pain), most applications can only reasonably use a small portion of this range around zero. The large range serves in part to ensure that gain can always be losslessly transferred between OpusHead and R128_TRACK_GAIN (see below) without saturating.
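
As a rough illustration of how a decoder might turn these Q7.8 values into a scale factor, here is a minimal sketch. The function names are hypothetical, and it assumes the gain is applied to sample amplitudes, so the usual 10^(dB/20) conversion is used.

```python
def q7_8_to_db(raw: int) -> float:
    """Convert a signed 16-bit Q7.8 value to dB (8 fractional bits)."""
    return raw / 256.0


def db_to_linear(db: float) -> float:
    """Convert an amplitude gain in dB to a linear scale factor."""
    return 10.0 ** (db / 20.0)


def apply_gain(samples, output_gain_q8: int, extra_gain_q8: int = 0):
    """Scale samples by the header output gain plus any additional gain.

    The extra gain (e.g. a track gain or volume setting, also in Q7.8 here)
    is added to the output gain in the dB domain, i.e. applied in addition.
    """
    scale = db_to_linear(q7_8_to_db(output_gain_q8 + extra_gain_q8))
    return [s * scale for s in samples]
```

Adding the raw Q7.8 values before converting is equivalent to multiplying the two linear scale factors.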

* '''Channel mapping family'''
This byte indicates the order and semantic meaning of the various channels encoded in each Opus packet.

Each possible value of this byte indicates a ''mapping family'', which defines a set of allowed numbers of channels, and the ordered set of channel names for each allowed number of channels. Currently there are three defined mapping families, although more may be added:

** Family 0 (RTP mapping)
*** Allowed numbers of channels: 1 or 2
*** 1 channel: monophonic (mono)
*** 2 channels: stereo (left, right)
*** '''Special mapping''': this channel mapping value also indicates that the content consists of a single Opus stream that is stereo if and only if c==2, with stream index 0 mapped to channel 0, and (if stereo) stream index 1 mapped to channel 1. When the channel mapping byte has this value, no further fields are present in OpusHead.
** Family 1 ([http://www.xiph.org/vorbis/doc/Vorbis_I_spec.html#x1-800004.3.9 Vorbis mapping])
*** Allowed numbers of channels: 1 ... 8
*** Channel meanings depend on the number of channels, see the Vorbis mapping for details.
** Family 255 (no defined channel meaning)
*** Allowed numbers of channels: 1...255
*** Channels are unidentified. General-purpose players SHOULD NOT attempt to play these streams, and offline decoders MAY deinterleave the output into separate PCM files, one per channel. Decoders SHOULD NOT produce output for channels mapped to stream index 255 (pure silence) unless they have no other way to indicate the index of non-silent channels.

The remaining channel mapping families (2...254) are reserved. A decoder encountering a reserved mapping byte should act as though the mapping byte is 255.

An Ogg Opus player MUST play any Ogg Opus stream with a channel mapping family of 0 or 1, even if the number of channels does not match the physically connected audio hardware. Players SHOULD perform channel mixing to increase or reduce the number of channels as needed.

* '''Stream count''' 'N'
This field indicates the total number of streams so the decoder can correctly parse the packed Opus packets inside the Ogg packet.

For channel mapping family 0, this value defaults to 1, and is not coded.

A multi-channel Opus file is composed of one or more individual Opus streams, each of which produces one or two channels of decoded data. Each Ogg packet contains one Opus packet from each stream. The first N-1 Opus packets are packed using the self-delimiting framing from Appendix B of the Opus specification. The remaining Opus packet is packed using the regular, undelimited framing from Section 3 of the Opus specification. All the Opus packets in a single Ogg packet are constrained to produce the same number of decoded samples.

* '''Two-channel stream count''' 'M'
Describes the number of streams whose decoders should be configured to produce two channels. This must be no larger than the number of total streams.

For channel mapping family 0, this value defaults to c-1 (i.e., 0 for mono and 1 for stereo), and is not coded.

Each packet in an Opus stream has an internal channel count of 1 or 2, which can change from packet to packet. This is selected by the encoder depending on the bitrate and the contents being encoded. The original channel count of the encoder input is not preserved by the lossy compression.

Regardless of the internal channel count, any Opus stream may be decoded as mono (single channel) or stereo (two channels) by appropriate initialization of the decoder. The "two-channel stream count" field indicates that the first M Opus decoders should be initialized in stereo mode, and the remaining N-M decoders should be initialized in mono mode. The total number of decoded channels (M+N) must be no larger than 255, as there is no way to index more channels than that in the channel mapping.

* '''Channel mapping'''
Contains one index per output channel indicating which decoded channel should be used. If the index is less than 2*M, the output MUST be taken from decoding stream (index/2) as stereo and selecting the left channel if index is even, and the right channel if index is odd. If the index is 2*M or larger, the output MUST be taken from decoding stream (index-M) as mono. As a special case, an index of 255 means that the corresponding output channel MUST contain pure silence.

For channel mapping family 0, the first index defaults to 0, and if c==2, the second index defaults to 1. Neither index is coded.

The number of output channels (c) is not constrained to match the number of decoded channels (M+N). A single index MAY appear multiple times, i.e., the same decoded channel may be mapped to multiple output channels. Some decoded channels might not be assigned to any output channel, as well.
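
The index rule for the channel mapping can be sketched like this. The function name and the tuple it returns are invented for the example; <code>coupled_streams</code> is the two-channel stream count M.

```python
def resolve_channel(index: int, coupled_streams: int):
    """Map one channel-mapping index to (stream, channel) or None for silence.

    coupled_streams is M, the number of streams decoded as stereo.
    """
    if index == 255:
        return None  # output channel is pure silence
    if index < 2 * coupled_streams:
        # Stereo stream: even index is the left channel, odd is the right.
        side = "left" if index % 2 == 0 else "right"
        return index // 2, side
    # Mono stream: skip past the M stereo streams.
    return index - coupled_streams, "mono"


# Example: N = 4 streams, M = 2 of them stereo, six output channels.
mapping = [0, 4, 1, 2, 3, 5]
for out_channel, idx in enumerate(mapping):
    print(out_channel, resolve_channel(idx, 2))
```

With M = 2, indices 0 to 3 select the left and right channels of streams 0 and 1, and indices 4 and 5 select mono streams 2 and 3.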

==== Comment header ====

* 8 byte 'OpusTags' magic signature (64 bits)
* rest follows the vorbis-comment header design used in OggVorbis (without the "framing-bit"), OggTheora, and Speex.
** Vendor string (always present)
** tag=value metadata strings (zero or more)

One new comment field is introduced for Ogg Opus:
R128_TRACK_GAIN=-573
representing the volume shift needed to normalize the track's volume. The gain is a Q7.8 fixed point number in dB, as in the OpusHead "output gain" field. This field acts similarly to the [[VorbisComment#Replay_Gain|REPLAYGAIN_TRACK_GAIN field in Vorbis]], although the normal volume reference is the [http://tech.ebu.ch/loudness EBU-R128] standard.

An Ogg Opus file MUST NOT have more than one such field, and if present its value MUST be an integer from -32768 to +32767 inclusive, represented in ASCII with no whitespace. If present it MUST correctly represent the R128 normalization gain (relative to the OpusHead output gain). If a player chooses to make use of the TRACK_GAIN, it MUST be applied ''in addition'' to the OpusHead output gain. If an encoder populates the TRACK_GAIN field, and the output gain is not otherwise constrained or specified, the encoder SHOULD write the R128 gain into the OpusHead output gain and write "R128_TRACK_GAIN=0". If a tool modifies the OpusHead "output gain" field, it MUST also update or remove the R128_TRACK_GAIN comment field.
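
A reader of the comment header might check these constraints roughly as follows. This is a sketch with a hypothetical function name; it assumes the comments have already been split into "NAME=value" strings, matches the field name case-insensitively as Vorbis comments do, and accepts an optional leading sign, which the text above does not spell out.

```python
import re


def parse_r128_track_gain(comments):
    """Return the R128 track gain in dB, or None if the field is absent.

    Raises ValueError if the field appears more than once or is malformed.
    """
    values = []
    for comment in comments:
        name, sep, value = comment.partition("=")
        if sep and name.upper() == "R128_TRACK_GAIN":
            values.append(value)
    if not values:
        return None
    if len(values) > 1:
        raise ValueError("more than one R128_TRACK_GAIN field")
    value = values[0]
    # ASCII digits only, no whitespace; the optional sign is an assumption.
    if not re.fullmatch(r"[+-]?[0-9]+", value):
        raise ValueError("R128_TRACK_GAIN is not a plain ASCII integer")
    raw = int(value)
    if not -32768 <= raw <= 32767:
        raise ValueError("R128_TRACK_GAIN out of range")
    return raw / 256.0  # Q7.8 fixed point, in dB
```

The example value R128_TRACK_GAIN=-573 corresponds to -573/256, or about -2.24 dB.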

There is no comment field corresponding to Replaygain's ALBUM_GAIN; that information should instead be stored in the OpusHead "output gain" field.

To avoid confusion with multiple normalization schemes, an OpusTags packet SHOULD NOT contain any of the REPLAYGAIN_TRACK_GAIN, REPLAYGAIN_TRACK_PEAK, REPLAYGAIN_ALBUM_GAIN, or REPLAYGAIN_ALBUM_PEAK fields.

== Other implementation notes ==
As [http://www.xiph.org/vorbis/doc/Vorbis_I_spec.html#x1-130000A.2 in Ogg Vorbis], a granule position on the final page in a stream that indicates less audio data than the final packet would normally return is used to end the stream on other than even frame boundaries. The difference between the actual available data returned and the declared amount indicates how many trailing samples to discard from the decoding process.

When seeking within an Ogg Opus stream, the decoder should start decoding (and discarding the output) at least '''FIXME''' samples prior to the seek point in order to ensure that the output audio is correct at the seek point.

== Test vectors ==

* [[OggOpus/testvectors|Planned test vectors for OggOpus]]
* Opus test vectors

Ghost

2011-09-23T10:58:22Z

Ogg.k.ogg.k: I must with the greatest regret consign this valuable information to the bin of history.

This page is meant to track ideas about low-delay, high-quality audio coding. The work has just started, so don't expect anything in the near future (or at all for that matter).

== Signal types ==

There are many signal types that can be found:
* Sinusoids
** A few pure (or nearly pure) tones
* Harmonic
** Periodic waveforms (e.g. voice)
** Many (sometimes closely spaced) harmonics
* Shaped noise
** Signals that are (or are indistinguishable from) filtered (coloured) white noise
* Transients
** Whatever doesn’t fit above I guess

== Signal analysis ==

=== Sinusoidal ===

Good when most of the energy is contained in a few sinusoids. May be problematic for very harmonic signals, e.g. a male voice may have close to a hundred harmonics in the full audio band.

=== Pitch ===

Good for harmonic signals. Hard to estimate and code when extra sinusoids and noise are present. At 48 kHz, no need for fractional pitch or anything like that, but sub-band pitch analysis or multi-tap gain is a good idea. Also, there needs to be a way to remove the effect of sinusoids and noise. Even then removing the "noise" also means removing all excitation to the pitch predictor, so that's a problem.

=== MDCT ===

Very general. Can code anything, but not very good at anything. High delay (2x frame size). Could put several "MDCT frames" in each codec frame to make latency smaller.

=== Wavelets ===

Just a fancy name for sub-bands with non-uniform width. Probably similar to having an MDCT with few sub-bands, except that the sub-bands could follow (roughly) the critical bands.

=== LPC + stochastic cb ===

Like CELP with no pitch. Could be used to code the noisy part of the signal with low bit-rate. Would need to figure out how to preserve the energy of the noise when going with 1/2 bit per sample and less.

== Codec Structure Ideas ==

=== Sinusoidal + wavelet ===

* Preemphasis
* Extract as many sinusoids as possible
* Wavelet transform
* Code wavelet coefs using VQ

=== Sinusoidal, pitch and noise ===

* Preemphasis
* Joint pitch + sinusoidal estimation
* LPC analysis
* CELP-like coding of the residual (mainly noise)

== Estimation Ideas ==

=== Sinusoid Estimation ===

Very hard to do properly, especially with reasonable complexity and low delay. Some ideas:

==== Least-square type matching ====

Step one: estimate sinusoid frequencies.

Tried so far:
* MUSIC fails on non-trivial signals and is very complex, although there's an AES paper that recommends first whitening the noise part of the signal before applying the algorithm. Haven't tried that so far.
* ESPRIT fails on non-trivial signals and is very complex (see above for a possible solution).
* LPC would probably work, but requires an insane order, which makes it impractical; it also tends to be numerically unstable anyway.
* FFT has poor resolution, but that's all we have left so far. There's an AES paper that describes a sort of time-domain phase unwrapping that could help.

Step two: what to match

Step three: solving

Looks like it's possible to solve an NxM least-squares problem in O(N*M) time per iteration using an iterative algorithm, as long as the system matrix is near-orthogonal. If we want to solve '''Ax'''='''b''' and '''A'''^h*'''A''' ~= I, then we start with '''x'''(0)='''A'''^h*'''b''' and then, for iteration index k:

:'''x'''(k+1) = '''x'''(k) + '''A'''^h*('''b'''-'''A'''*'''x'''(k))
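
A minimal sketch of that iteration in plain Python, assuming a real-valued matrix (so '''A'''^h is just the transpose). The function names and the small 3x2 test matrix are made up for illustration; the point is only that each pass costs two matrix-vector products and that the result approaches the solution of the normal equations when '''A'''^h*'''A''' is close to I.

```python
def mat_vec(A, x):
    """Return A*x for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def mat_t_vec(A, r):
    """Return A^T*r (A^h for a real matrix)."""
    cols = len(A[0])
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(cols)]


def iterative_lstsq(A, b, iterations=50):
    """Iterate x <- x + A^T (b - A x), starting from x = A^T b."""
    x = mat_t_vec(A, b)
    for _ in range(iterations):
        residual = [bi - yi for bi, yi in zip(b, mat_vec(A, x))]
        correction = mat_t_vec(A, residual)
        x = [xi + ci for xi, ci in zip(x, correction)]
    return x


# Hypothetical near-orthogonal system: A^T A = [[1.01, 0.1], [0.1, 1.01]].
A = [[1.0, 0.1], [0.0, 1.0], [0.1, 0.0]]
b = [1.0, 2.0, 3.0]
x = iterative_lstsq(A, b)
```

The iteration only converges when the eigenvalues of I - '''A'''^h*'''A''' have magnitude below 1, which is what "near-orthogonal" has to mean in practice.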

==== Phase lock loop (PLL) ====

== Quantization Ideas ==
After the sinusoids have been extracted they have to be quantized. Possible approaches:
* Sort the sinusoids according to energy and transmit only a finite number or only ones with a specific energy or above. The indices of the sinusoids before rearranging will have to be sent.
** I think it's worth checking which is most efficient. Sorting the sinusoids will help quantizing the amplitude, but make it harder to encode frequency. [[User:Jmspeex|Jmspeex]] 05:45, 28 June 2006 (PDT)
* Use psychoacoustic properties to remove all the sinusoids that will be masked by other tones.
** Of course, we don't want to encode perceptually irrelevant sinusoids. Actually, we want the resolution (in amplitude, phase and probably frequency) to scale with the amplitude-to-mask ratio or something like that. [[User:Jmspeex|Jmspeex]] 05:45, 28 June 2006 (PDT)
* After removing perceptually irrelevant and low-energy tones, the energy in each critical band has to be adjusted to match the initial energy.
** Possibly -- I don't know much on that topic. Monty probably has valuable experience. [[User:Jmspeex|Jmspeex]] 05:45, 28 June 2006 (PDT)
* Time-differential coding of sinusoids across frames can be used
** Definitely. This is very important if we plan on using short frames. It would be important to minimize inter-frame redundancy, but still make it possible to recover from packet loss. For that, we could either use a leaky predictor (like the pitch in CELP) or use key-frames (like a video codec). [[User:Jmspeex|Jmspeex]] 05:45, 28 June 2006 (PDT)
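
To make the leaky predictor idea from the comment above concrete, here is a rough sketch for a single parameter track (say, one sinusoid's amplitude across frames). The leak factor, the quantizer step and the function names are all assumptions picked for the example, not values anyone has proposed. The decoder mismatch caused by a lost frame shrinks by the leak factor on every following frame.

```python
LEAK = 0.8   # assumed leak factor; 1.0 would be a plain (non-leaky) predictor
STEP = 0.25  # assumed uniform quantizer step


def encode(values):
    """Quantize the difference between each value and the leaky prediction."""
    recon = 0.0
    indices = []
    for value in values:
        prediction = LEAK * recon
        index = round((value - prediction) / STEP)
        indices.append(index)
        recon = prediction + index * STEP  # track what the decoder will see
    return indices


def decode(indices, lost=()):
    """Rebuild the track; frames listed in `lost` fall back to the prediction."""
    recon = 0.0
    out = []
    for n, index in enumerate(indices):
        prediction = LEAK * recon
        recon = prediction if n in lost else prediction + index * STEP
        out.append(recon)
    return out


values = [4.0] * 40
indices = encode(values)
clean = decode(indices)
damaged = decode(indices, lost={5})
```

With a leak factor of 1.0 the mismatch after the lost frame would never decay, which is the case key-frames would have to handle instead.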

==== Quantization of frequencies ====
* Quantize frequencies of a few selected sinusoids and recreate other values using interpolation.
** How would you do that? (maybe I'm not following here) [[User:Jmspeex|Jmspeex]] 05:45, 28 June 2006 (PDT)
==== Quantization of Amplitudes ====
* Model the energy curve of the sinusoids – for instance using an exponential curve
** Exponential decay might be a good way to do inter-frame prediction. [[User:Jmspeex|Jmspeex]] 05:45, 28 June 2006 (PDT)
* Quantize amplitudes of a few selected sinusoids and recreate other values using interpolation.
** Possibly, but probably not at first (hard problem). [[User:Jmspeex|Jmspeex]] 05:45, 28 June 2006 (PDT)
==== Quantization of phase and modulation parameters ====
* Can be scalar quantized with the number of bits allocated being proportional to the energy of the sinusoid
** Yes. Also, this is something that can be predicted very well across frames. It's not even necessary to make that one robust to losses, because as long as the phase is continuous, no one will notice [[User:Jmspeex|Jmspeex]] 05:45, 28 June 2006 (PDT)
==== Quantization of indices ====
==== Quantization of energy gains in critical bands ====

=== Excitation similarity weighting ===
The idea behind the ESW technique is to select sinusoids such that each new sinusoid added will provide a maximum incremental gain in matching between the auditory excitation pattern associated with the original signal and the auditory excitation pattern associated with the modeled signal. In order to accomplish this goal, an iterative process is proposed in which each sinusoid extracted during conventional analysis is assigned an excitation similarity weight. During each iteration, the sinusoid having the largest weight is added to the modeled representation. New sinusoids are accumulated until some constraint is exhausted, for example, a bit budget. The algorithm tends to converge as the number of modeled sinusoids increases.

-- Not sure I understand here. Any reference? [[User:Jmspeex|Jmspeex]] 05:45, 28 June 2006 (PDT)

=== Trajectory tracking ===
Once the meaningful sinusoidal peaks and their parameters have been estimated, the peaks are tracked together into inter-frame trajectories. At each frame, a peak continuation algorithm tries to connect each sinusoidal peak to the trajectories already existing at the previous frame, resulting in smooth curves of frequencies and amplitudes. The continuation was tested with two algorithms: the traditional one, which uses only the parameters of the sinusoids to obtain smooth trajectories, and one original method, which synthesizes the possible continuations inside certain deviation limits and compares them to the original signal. There are also other systems which use more advanced methods, for example Hidden Markov Models, to track the trajectories.
Sinusoidal trajectories contain all the information needed for the reconstruction of the harmonic parts of input signals: amplitudes, frequencies and phases of each trajectory at each frame. To avoid discontinuities at frame boundaries, the amplitudes, frequencies and phases are interpolated from frame to frame.
* Amplitudes are linearly interpolated
* Phases are interpolated with cubic polynomials
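
A small sketch of the "traditional" kind of continuation, using only peak parameters, plus the linear amplitude interpolation. Greedy nearest-frequency matching is used here as a stand-in, since the text does not say which matching rule was tested; the peak representation as (frequency in Hz, amplitude) pairs and the function names are also assumptions. Cubic phase interpolation is left out.

```python
def continue_trajectories(prev_peaks, new_peaks, max_deviation):
    """Link peaks of two consecutive frames by closest frequency.

    Peaks are (frequency_hz, amplitude) pairs. Returns sorted
    (prev_index, new_index) pairs; unmatched peaks start or end a trajectory.
    """
    candidates = sorted(
        (abs(pf - nf), i, j)
        for i, (pf, _) in enumerate(prev_peaks)
        for j, (nf, _) in enumerate(new_peaks)
        if abs(pf - nf) <= max_deviation
    )
    used_prev, used_new, links = set(), set(), []
    for _, i, j in candidates:
        if i not in used_prev and j not in used_new:
            used_prev.add(i)
            used_new.add(j)
            links.append((i, j))
    return sorted(links)


def interpolate_amplitude(a0, a1, frame_len):
    """Linear amplitude ramp from a0 towards a1 over one frame."""
    return [a0 + (a1 - a0) * n / frame_len for n in range(frame_len)]


prev_peaks = [(440.0, 1.0), (880.0, 0.5)]
new_peaks = [(885.0, 0.4), (442.0, 0.9), (3000.0, 0.1)]
links = continue_trajectories(prev_peaks, new_peaks, max_deviation=20.0)
ramp = interpolate_amplitude(1.0, 0.5, 4)
```

Here the 3000 Hz peak has no partner within the deviation limit, so it would start a new trajectory.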

-- Any reference? [[User:Jmspeex|Jmspeex]] 05:45, 28 June 2006 (PDT)

OggKate

2011-09-07T21:38:05Z

Ogg.k.ogg.k: Add something about how to add metadata

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is an overlay codec, originally designed for karaoke and text, that can be
multiplexed in Ogg. Text and images can be carried by a Kate stream, and animated.
Most of the time, this would be multiplexed with audio/video to carry subtitles,
song lyrics (with or without karaoke data), etc, but doesn't have to be.

Series of curves (splines, segments, etc) may be attached to various properties
(text position, font size, etc) to create animated overlays. This allows scrolling
or fading text to be defined. This can even be used to draw arbitrary shapes, so
hand drawing can also be represented by a Kate stream.

Example uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed), and lyrics,
as created by oggenc, from vorbis-tools.

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on - I would have been quicker writing Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this - besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex - though it is being "repurposed" as timed text now, bringing it closer to what I'm doing.
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time).

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Region definitions header: a list of predefined regions to be referred to by data packets
*Style definitions header: a list of predefined styles to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*repeats: a verbatim repeat of a text packet's payload, in order to bound any backward seeking needed when starting to play a stream partway through. These are also optional.
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This would be to help seeking code
find a kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it's timed by start granule, not by end granule (as Theora and Vorbis are).
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see the [[#Granule encoding|Granule encoding]] section.
*The EOS packet should have a granule pos greater than or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (old decoders will parse such streams correctly, but ignore the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(ie, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
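
As an illustration of the two rules above, a short sketch that checks for the signature and classifies a packet by the MSB of its type byte. The function names are made up for this example and are not part of libkate; the byte values come straight from the text (octal \200 is 0x80).

```python
KATE_ID_SIGNATURE = b"\x80kate\x00\x00\x00"  # "\200kate\0\0\0", eight bytes


def is_kate_id_header(packet):
    """True if the packet starts with the Kate ID header signature."""
    return packet[:8] == KATE_ID_SIGNATURE


def classify_packet(packet):
    """Return "header" if the MSB of the type byte is set, else "data"."""
    if not packet:
        raise ValueError("empty packet")
    return "header" if packet[0] & 0x80 else "data"


first_packet = KATE_ID_SIGNATURE + bytes(56)  # dummy 64 byte ID header
kinds = [classify_packet(bytes([t])) for t in (0x80, 0x88, 0x00, 0x7F)]
```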

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the ID header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x02 repeat
::0x7f end packet (EOS)

The format described here is for bitstream version 0.x.
As of 19 December 2008, the latest bitstream version is 0.4.

For more detailed information, refer to the format documentation
in libkate (see URL below in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

  0                   1                   2                   3  |
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | packtype      | Identifier char[7]: 'kate\0\0\0'              | 0-3
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | kate magic continued                                          | 4-7
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | reserved - 0  | version major | version minor | num headers   | 8-11
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | text encoding | directionality| reserved - 0  | granule shift | 12-15
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | cw sh | canvas width          | ch sh | canvas height         | 16-19
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | reserved - 0                                                  | 20-23
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | granule rate numerator                                        | 24-27
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | granule rate denominator                                      | 28-31
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (NUL terminated)                                     | 32-35
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (continued)                                          | 36-39
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (continued)                                          | 40-43
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (continued)                                          | 44-47
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (NUL terminated)                                     | 48-51
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (continued)                                          | 52-55
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (continued)                                          | 56-59
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (continued)                                          | 60-63
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously will not accommodate language tags
with lots of subtags.

Category is currently loosely defined, and I haven't found yet a nice way to
present it in a generic way, but is meant for automatic classifying of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).
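
For illustration, a sketch of reading the 64 byte ID header laid out above with Python's struct module. Two things here are assumptions not stated on this page: the 32 bit granule rate fields are read as little-endian, and the canvas fields are kept as raw bytes because the exact bit packing of shift and size is not described here. The function and key names are invented for the example; the libkate format documentation is the reference.

```python
import struct

# type, magic, 8 single-byte fields, 2+2 raw canvas bytes, 4 reserved bytes,
# two 32 bit integers (little-endian assumed), language[16], category[16]
ID_HEADER = struct.Struct("<B7s8B2s2s4xII16s16s")


def _c_string(raw):
    """Decode a NUL terminated ASCII field."""
    return raw.split(b"\x00", 1)[0].decode("ascii")


def parse_id_header(packet):
    if len(packet) < ID_HEADER.size:
        raise ValueError("ID header needs 64 bytes")
    (packtype, magic,
     _reserved0, major, minor, num_headers,
     text_encoding, directionality, _reserved1, granule_shift,
     canvas_width_raw, canvas_height_raw,
     gps_numerator, gps_denominator,
     language, category) = ID_HEADER.unpack_from(packet)
    if packtype != 0x80 or magic != b"kate\x00\x00\x00":
        raise ValueError("not a Kate ID header")
    return {
        "version": (major, minor),
        "num_headers": num_headers,
        "text_encoding": text_encoding,
        "directionality": directionality,
        "granule_shift": granule_shift,
        "canvas_raw": (canvas_width_raw, canvas_height_raw),
        "granule_rate": (gps_numerator, gps_denominator),
        "language": _c_string(language),
        "category": _c_string(category),
    }


# Hand-built sample header with made-up field values.
sample = ID_HEADER.pack(0x80, b"kate\x00\x00\x00",
                        0, 0, 4, 9, 0, 0, 0, 32,
                        b"\x00\x00", b"\x00\x00",
                        1000, 1, b"en", b"SUB")
info = parse_id_header(sample)
```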

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay
*Cortado (wikimedia version)
*vorbis-tools

I have patches for the following with Kate support:
*MPlayer
*xine
*GStreamer
*Thoggen
*Audacious
*and more...

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo,
though it is not quite yet API stable (though no major changes are expected).

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
- a base granule, in the high bits
- a granule offset, in the low bits

 +----------------+----------------+
 |      base      |     offset     |
 +----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The timestamp T of a given Kate packet is split into a base B and
offset O, and these are stored in the granulepos of that packet.
The split is done such that the B is the time of the earliest event
still active at the time, and the O is the time elapsed between B
and T. Thus, T = B + O. This mimics the way Theora stores its own
timestamps in granulepos, where the base identifies a keyframe, and
the offset gives the position of an inter frame relative to the previous
keyframe. Since Kate allows time overlapping events, however, the
choice of the base to use is slightly more complex, as it may not
be the starting time of the previous event, if the stream contains
time overlapping events.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
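
A short sketch of the split described above, assuming the granule rate is expressed in granules per second (so a time in seconds is granules * gps_denominator / gps_numerator). The function names are invented for the example; only the three kate_info fields come from the text.

```python
from fractions import Fraction


def make_granulepos(base, offset, granule_shift):
    """Pack base (high bits) and offset (low granule_shift bits)."""
    if offset < 0 or offset >= (1 << granule_shift):
        raise ValueError("offset does not fit in granule_shift bits")
    return (base << granule_shift) | offset


def split_granulepos(granulepos, granule_shift):
    """Recover (base, offset) from a granulepos."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset


def granulepos_to_seconds(granulepos, granule_shift,
                          gps_numerator, gps_denominator):
    """T = B + O, converted from granule units to seconds."""
    base, offset = split_granulepos(granulepos, granule_shift)
    return Fraction((base + offset) * gps_denominator, gps_numerator)


# Example: 1000 granules per second, 32 bits for the offset.
# The earliest still active event started at 5 s, this packet is at 7.5 s.
gp = make_granulepos(5000, 2500, 32)
seconds = granulepos_to_seconds(gp, 32, 1000, 1)
```

The ValueError branch is where the granule_shift space limitation shows up: an offset too large for the chosen number of bits cannot be represented.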

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit values at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definition, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by the means of a series of curves (static points, segments, splines (catmull-rom, bezier, and b-splines)).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have attached semantics so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting
at the right time and for the right duration a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.
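
To illustrate how a 2D point falls out of a motion for a given timestamp, here is a simplified sketch. The data layout (a list of (kind, duration, points) tuples) and the kind names "static" and "linear" are invented for this example and are not the libkate structures; splines are not handled. It shows a pause built from a linear curve with two equal points, and a 'none' curve producing no point.

```python
def evaluate_motion(curves, t):
    """Return the (x, y) point of the motion at time t, or None."""
    if t < 0:
        return None
    start = 0.0
    for kind, duration, points in curves:
        if t < start + duration:
            if kind == "none":
                return None  # discontinuity: no point is generated
            if kind == "static":
                return points[0]
            if kind == "linear":
                u = (t - start) / duration
                (x0, y0), (x1, y1) = points
                return (x0 + (x1 - x0) * u, y0 + (y1 - y0) * u)
            raise ValueError("unsupported curve kind: " + kind)
        start += duration
    return None  # past the end of the motion


motion = [
    ("linear", 2.0, [(0.0, 0.0), (1.0, 0.0)]),  # move right over 2 s
    ("linear", 1.0, [(1.0, 0.0), (1.0, 0.0)]),  # pause for 1 s
    ("none", 1.0, []),                          # marker hidden for 1 s
    ("static", 1.0, [(0.5, 0.5)]),              # fixed point for 1 s
]
samples = [evaluate_motion(motion, t) for t in (1.0, 2.5, 3.5, 4.5, 10.0)]
```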

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary, if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially when considering that the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.
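
As a sketch of how such a description could be generated from a list of timed texts (for instance when converting from another subtitle format), the following only produces the minimal form shown above. It sticks to whole seconds in hh:mm:ss form and rejects texts containing double quotes, since neither sub-second timestamps nor escaping rules are covered here; the function names are made up.

```python
def hms(seconds):
    """Format a whole number of seconds as hh:mm:ss."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def kate_description(events):
    """Build a minimal description from (start, end, text) tuples."""
    lines = ["kate {"]
    for start, end, text in events:
        if '"' in text:
            raise ValueError("escaping of quotes is not handled in this sketch")
        lines.append(f'  event {{ {hms(start)} --> {hms(end)} "{text}" }}')
    lines.append("}")
    return "\n".join(lines)


description = kate_description([(5, 10, "This is a text")])
```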

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I tend to dislike very
much editing XML by hand, as it's really hard to read. XML is really meant for machines to parse
text data generically, in a shared syntax but with possibly unknown semantics, and I need those
text representations to be editable easily.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from a style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place where a subtitle has been started, but has not been removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the granule originally sought. This implies support code in players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, of which I am not certain.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
text encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.
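
To show why a simple tag scheme keeps user code simple, here is a minimal sketch of matching stream languages against a user selection. It assumes tags are either a two letter code ("en") or a code with one secondary tag ("en_GB"); the function names and the scoring are made up for this example and are not part of libkate.

```python
from typing import List, Optional


def normalize(tag: str) -> str:
    """Lowercase a tag and accept both '-' and '_' as the separator."""
    return tag.strip().replace("-", "_").lower()


def match_score(stream_tag: str, user_tag: str) -> int:
    """2 for an exact match, 1 if only the primary tag matches, else 0."""
    stream, user = normalize(stream_tag), normalize(user_tag)
    if not stream or not user:
        return 0
    if stream == user:
        return 2
    if stream.split("_")[0] == user.split("_")[0]:
        return 1
    return 0


def pick_stream(stream_tags: List[str], user_tag: str) -> Optional[int]:
    """Return the index of the best matching stream, or None if none match."""
    best_index, best_score = None, 0
    for index, tag in enumerate(stream_tags):
        score = match_score(tag, user_tag)
        if score > best_score:
            best_index, best_score = index, score
    return best_index


if __name__ == "__main__":
    print(pick_stream(["ja", "cy", "en_GB"], "en_US"))
```

A stream tagged "mul" never matches a specific language here, so a player would need its own policy for such streams.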

=== Bitstream format for floating point values ===

Floating point values are turned to a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the floating point values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over 32 bits raw writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
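
The idea can be sketched as follows. This is only an illustration of the scheme described above, not the actual libkate bit layout: the function names are made up, and the sketch returns the shared head/tail zero bit counts plus one small integer per value instead of writing a real bitstream.

```python
from typing import List, Tuple

WORD_BITS = 32  # a 16.16 fixed point value fits in one 32 bit word


def to_fixed_16_16(value: float) -> int:
    """Quantize to 16.16 fixed point, as an unsigned two's complement word."""
    scaled = int(round(value * 65536.0))
    if not -(1 << 31) <= scaled < (1 << 31):
        raise ValueError("value does not fit in 16.16 fixed point")
    return scaled & 0xFFFFFFFF


def from_fixed_16_16(word: int) -> float:
    """Inverse of to_fixed_16_16 (up to the quantization error)."""
    if word & 0x80000000:
        word -= 1 << 32
    return word / 65536.0


def pack(values: List[float]) -> Tuple[int, int, List[int]]:
    """Return (head, tail, payload): zero bits shared by all values at the
    head and tail, and the remaining middle bits of each value."""
    words = [to_fixed_16_16(v) for v in values]
    merged = 0
    for word in words:
        merged |= word
    if merged == 0:
        return WORD_BITS, 0, [0] * len(words)
    head = WORD_BITS - merged.bit_length()
    tail = (merged & -merged).bit_length() - 1
    # The top `head` bits are zero in every word, so a shift is enough.
    return head, tail, [word >> tail for word in words]


def unpack(head: int, tail: int, payload: List[int]) -> List[float]:
    """Rebuild the values; `head` is only needed to know the bit width."""
    return [from_fixed_16_16(bits << tail) for bits in payload]


if __name__ == "__main__":
    head, tail, payload = pack([0.5, 1.25, 3.0])
    print(head, tail, payload, unpack(head, tail, payload))
```

Each value then costs 32 - head - tail bits, plus the two counts stored once. Negative values have their top bit set, so they leave no head zero bits to drop in this simple version.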

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly to let them choose which streams to turn on and off.

Since this category is meant primarily for a machine to parse, it will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics

Please remember the 15 character limit if proposing other categories.

Note that the list of categories is subject to change, and will likely
be replaced by new, more "identifier like" ones. The three ones above,
however, would be kept for backward compatibility as they're already used.
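
The field constraints above (ASCII, at most 15 characters, zero terminated in a 16 byte field) can be checked with a few lines of code. This is a minimal sketch with made-up function names, not the libkate API, and it assumes the unused bytes are filled with zeros.

```python
CATEGORY_FIELD_SIZE = 16  # bytes in the BOS packet, including the terminator


def encode_category(name: str) -> bytes:
    """Return the 16 byte, zero padded field for an ASCII category name."""
    raw = name.encode("ascii")  # raises UnicodeEncodeError if not ASCII
    if len(raw) > CATEGORY_FIELD_SIZE - 1:
        raise ValueError("category names are limited to 15 characters")
    if b"\0" in raw:
        raise ValueError("category names cannot contain a zero byte")
    return raw.ljust(CATEGORY_FIELD_SIZE, b"\0")


def decode_category(field: bytes) -> str:
    """Return the category name stored in a 16 byte field."""
    if len(field) != CATEGORY_FIELD_SIZE:
        raise ValueError("the category field is exactly 16 bytes")
    return field.split(b"\0", 1)[0].decode("ascii")


if __name__ == "__main__":
    field = encode_category("spu-subtitles")
    print(field, decode_category(field))
```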

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data, and could also allow things like e-books which can be
either read or listened to from the same bitstream (I have seen no reference to this
being used anywhere, but I see no reason why the granule progression should be temporal,
and not user controlled, such as by using a "next" button which would bump a granule
position by a preset amount, simulating turning a page (this would be close to necessary
for text-to-speech, as the wall time duration of the spoken speech is not known in
advance to the Kate encoder, and can't be mapped to a time based granule progression)).
All text strings triggered consecutively between the two granule positions would then
be read in order.

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts, (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.
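
As a small worked example of the relation between palette size and pixel storage, the sketch below computes the bits per pixel needed to address a palette, and the resulting size of the pixel data. The function names are made up, and restricting the palette to sizes from 2 to 256 is an assumption of this example.

```python
def bits_per_pixel(palette_size: int) -> int:
    """Bits needed to address a palette whose size is a power of 2."""
    is_power_of_two = palette_size & (palette_size - 1) == 0
    if not 2 <= palette_size <= 256 or not is_power_of_two:
        raise ValueError("palette size must be a power of 2 from 2 to 256")
    return (palette_size - 1).bit_length()


def pixel_data_bits(width: int, height: int, palette_size: int) -> int:
    """Size in bits of the pixel data for a paletted bitmap."""
    return width * height * bits_per_pixel(palette_size)


if __name__ == "__main__":
    # A 32x32 bitmap with a 16 color palette needs 4 bits per pixel.
    print(bits_per_pixel(16), pixel_data_bits(32, 32, 16))
```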

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

I am also investigating SVG images. These allow for very small footprint images for simple
vector drawings, and could be very useful for things like background gradients below text.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.
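
A font lookup along these lines could be sketched as below. The structure and field names are hypothetical, and the sketch assumes that the bitmaps for a range are stored at consecutive indices starting from the bitmap of the first code point, which may not match the actual bitstream layout.

```python
from typing import List, NamedTuple, Optional


class FontRange(NamedTuple):
    """One range of a bitmap font (hypothetical layout for this sketch)."""
    first_code_point: int
    last_code_point: int
    first_bitmap: int  # index of the bitmap for first_code_point


def find_bitmap(ranges: List[FontRange], char: str) -> Optional[int]:
    """Return the bitmap index for a character, or None if unmapped."""
    code_point = ord(char)
    for font_range in ranges:
        if font_range.first_code_point <= code_point <= font_range.last_code_point:
            return font_range.first_bitmap + (code_point - font_range.first_code_point)
    return None


if __name__ == "__main__":
    ranges = [FontRange(0x41, 0x5A, 0), FontRange(0x61, 0x7A, 26)]
    print(find_bitmap(ranges, "C"), find_bitmap(ranges, "b"), find_bitmap(ranges, "!"))
```

A renderer would typically fall back to a system font for characters where the lookup returns None.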

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder (kateenc) and a decoder (katedec) are included in the tools directory.
The encoder supports input from several different formats:
* a custom text based file format (see [[#The Kate file format|The Kate file format]]), which is by no means meant to be part of the Kate bitstream specification itself
* SubRip (.srt), the most common subtitle format I found
* LRC lyrics format.

As an example for the widely used SRT subtitles format, the following command line
creates a Kate subtitles stream from an SRT file:

kateenc -l en -c subtitles -t srt -o subtitles.ogg subtitles.srt

The reverse is possible, to recover an SRT file from a Kate stream, with katedec.

Note that the subtitles.ogg file should then be multiplexed into the A/V stream,
using either ogg-tools or oggz-tools.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, what motions may be applied to events may not be known in advance (for instance, for a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I was going that way. Still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

=== Packaging ===

It would be really nice to have packages for libkate/libtiger for many distros.

If you're a packager for a distro which doesn't yet have packages for libkate
or libtiger, please consider helping :)

In particular, packages for Debian would be grand.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the data packets themselves concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.
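
The layout above can be sketched in a few lines of code. This is an illustration only, with made-up function names; it follows the description given here, where each length is coded in Xiph style lacing as a run of 255 valued bytes followed by one byte below 255.

```python
from typing import List


def lace_headers(packets: List[bytes]) -> bytes:
    """Build Matroska private data from a list of header packets."""
    if not 1 <= len(packets) <= 256:
        raise ValueError("between 1 and 256 packets are required")
    out = bytearray([len(packets) - 1])  # byte 0: NP
    for packet in packets[:-1]:          # the last length is not stored
        size = len(packet)
        while size >= 255:
            out.append(255)
            size -= 255
        out.append(size)
    for packet in packets:
        out += packet
    return bytes(out)


def unlace_headers(data: bytes) -> List[bytes]:
    """Split Matroska private data back into header packets."""
    count = data[0] + 1
    pos = 1
    sizes = []
    for _ in range(count - 1):
        size = 0
        while True:
            byte = data[pos]
            pos += 1
            size += byte
            if byte != 255:
                break
        sizes.append(size)
    last_size = len(data) - pos - sum(sizes)  # deduced, not stored
    if last_size < 0:
        raise ValueError("private data is too short for the coded lengths")
    sizes.append(last_size)
    packets = []
    for size in sizes:
        packets.append(data[pos:pos + size])
        pos += size
    return packets


if __name__ == "__main__":
    headers = [b"a" * 300, b"bc", b"xyz"]
    private_data = lace_headers(headers)
    print(list(private_data[:4]), unlace_headers(private_data) == headers)
```

Since the number of Kate headers is not fixed, a reader should rely on the count in byte 0 rather than assume three packets as is common with Vorbis and Theora.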

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== HOWTOs ==

These paragraphs describe a few ways to use Kate streams:

=== Text movie subtitles ===

Kate streams can carry Unicode text (that is, text that can represent
pretty much any existing language/script). If several Kate streams are
multiplexed along with a video, subtitles in various languages can be
made for that movie.

An easy way to create such subtitles is to use ffmpeg2theora, which
can create Kate streams from SubRip (.srt) format files, a simple but
common text subtitles format. ffmpeg2theora 0.21 or later is needed.

At its simplest:

ffmpeg2theora -o video-with-subtitles.ogg --subtitles subtitles.srt
video-without-subtitles.avi

Several languages may be created and tagged with their language code
for easy selection in a media player:

ffmpeg2theora -o video-with-subtitles.ogg video-without-subtitles.avi
--subtitles japanese-subtitles.srt --subtitles-language ja
--subtitles welsh-subtitles.srt --subtitles-language cy
--subtitles english-subtitles.srt --subtitles-language en_GB

Alternatively, kateenc (which comes with the libkate distribution) can
create Kate streams from SubRip files as well. These can then be merged
with a video with oggz-tools:

kateenc -t srt -c SUB -l it -o subtitles.ogg italian-subtitles.srt
oggz merge -o movie-with-subtitles.ogg movie-without-subtitles.ogg subtitles.ogg

This second method can also be used to add subtitles to a video which
is already encoded to Theora, as it will not transcode the video again.
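
When several subtitle languages are involved, the two steps above can be scripted. The sketch below only builds the command lines, using the same kateenc and oggz merge options as shown above. The function name and the intermediate file names are made up, and it assumes that oggz merge accepts more than two input files.

```python
from typing import Dict, List


def subtitle_commands(video: str, subtitles: Dict[str, str], output: str) -> List[List[str]]:
    """Build the kateenc and oggz merge command lines.

    `subtitles` maps a language code (eg, "it") to a SubRip file name.
    """
    commands = []
    kate_files = []
    for language, srt_file in subtitles.items():
        kate_file = "subtitles-" + language + ".ogg"  # made-up naming scheme
        commands.append(["kateenc", "-t", "srt", "-c", "SUB", "-l", language,
                         "-o", kate_file, srt_file])
        kate_files.append(kate_file)
    # Assumption: oggz merge takes any number of input files.
    commands.append(["oggz", "merge", "-o", output, video] + kate_files)
    return commands


if __name__ == "__main__":
    for command in subtitle_commands("movie-without-subtitles.ogg",
                                     {"it": "italian-subtitles.srt",
                                      "cy": "welsh-subtitles.srt"},
                                     "movie-with-subtitles.ogg"):
        print(" ".join(command))
```

Each command list can then be passed to subprocess.run(command, check=True) once the printed commands look right.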

=== DVD subtitles ===

DVD subtitles are not text, but images. Thoggen, a DVD ripper program,
can convert these subtitles to Kate streams (at the time of writing,
Thoggen and GStreamer have not applied the necessary patches for this
to be possible out of the box, so patching them will be required).

When configuring how to rip DVD tracks, any subtitles will be detected
by Thoggen, and selecting them in the GUI will cause them to be saved as
Kate tracks along with the movie.

=== Song lyrics ===

Kate streams carrying song lyrics can be embedded in an Ogg file. The
oggenc Vorbis encoding tool from the Xiph.Org Vorbis tools allows lyrics
to be loaded from a LRC or SRT text file and converted to a Kate stream
multiplexed with the resulting Vorbis audio. At the time of writing,
the patch to oggenc was not applied yet, so it will have to be patched
manually with the patch found in the diffs directory.

oggenc -o song-with-lyrics.ogg --lyrics lyrics.lrc --lyrics-language en_US song.wav

So called 'enhanced LRC' files (containing extra karaoke timing information)
are supported, and a simple karaoke color change scheme will be saved
out for these files. For more complex karaoke effects (such as more
complex style changes, or sprite animation), kateenc should be used with
a Kate description file to create a separate Kate stream, which can then
be merged with a Vorbis only song with oggz-tools:

oggenc -o song.ogg song.wav
kateenc -t kate -c LRC -l en_US -o lyrics.ogg lyrics-with-karaoke.kate
oggz merge -o song-with-karaoke.ogg lyrics.ogg song.ogg

This latter method may also be used if you already have an encoded Vorbis song
with no lyrics, and just want to add the lyrics without reencoding.

=== Metadata ===

Metadata can be attached to events, or to styles, bitmaps, regions, etc.
Metadata are free form tag/value pairs, and can be used to enrich their
attached data with extra information. However, how this information is
interpreted is up to the application layer.

It is worth noting that an event may not have attached text, so it is
possible to create an empty timed event with attached metadata.

For instance, let's say we have a documentary, with footage from various
places, as well as short interviews, and we want two things:
* tag footage with metadata about the location and date that footage was shot
* subtitle the interviews and tag those subtitles with information about the speaker

You can then create an empty Kate event for each footage part, synchronized
with the footage, and attach a new metadata item called GEO_LOCATION, filled
with latitude and longitude of the place the footage was shot at.
Similarly, for each subtitle event, a metadata item called SPEAKER can be
attached.

An empty event to tag a 4:20 long piece of footage shot in Tokyo on 2011/08/12, and
inserted at 18:30 in the documentary, could look like:

event {
00:18:30,000 --> 00:22:50,000
meta "GEO_LOCATION" = "35.42; 139.42"
meta "DATE" = "2011-08-12"
}

Here's an example for a line spoken by Dr Joe Bloggs at 18:30 into the documentary:

event {
00:18:30,000 --> 00:18:32,000
"Notice how the subtitles for my words have metadata attached to them"
meta "SPEAKER" = "Dr Joe Bloggs"
meta "URL" = "http://www.example.com/biography?name=Joe+Bloggs"
}

Notice how another metadata item, URL, is also present. The application
will have to be aware of those metadata items in order to do something with them,
though. Since those are free form, it is up to you to think of what
metadata you want, and make use of it.
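
If such events are generated from a database rather than written by hand, a small script can emit them. The sketch below writes events in the same form as the two examples above; the function names are made up, and it does not escape quotes inside the text or the metadata values.

```python
from typing import Dict, Optional


def timestamp(milliseconds: int) -> str:
    """Format a time in milliseconds as hh:mm:ss,mmm."""
    seconds, ms = divmod(milliseconds, 1000)
    minutes, seconds = divmod(seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return "%02d:%02d:%02d,%03d" % (hours, minutes, seconds, ms)


def format_event(start_ms: int, end_ms: int, text: Optional[str] = None,
                 meta: Optional[Dict[str, str]] = None) -> str:
    """Return an event block, optionally with text and metadata items."""
    lines = ["event {", timestamp(start_ms) + " --> " + timestamp(end_ms)]
    if text is not None:
        lines.append('"' + text + '"')
    for tag, value in (meta or {}).items():
        lines.append('meta "' + tag + '" = "' + value + '"')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # The empty Tokyo footage event from above: 18:30 to 22:50.
    print(format_event(1110000, 1370000,
                       meta={"GEO_LOCATION": "35.42; 139.42",
                             "DATE": "2011-08-12"}))
```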

Note that metadata may be attached to other objects, such as regions.
This way, you can for example create a region tagged with a name, and
track a person's movements with that region. Or you can tag a bitmap
with a copyright and a URL to a larger version of the image.

=== Changing a Kate stream embedded in an Ogg stream ===

If you need to change a Kate stream already embedded in an Ogg stream (eg, you have a movie with subtitles, and you want to fix a spelling mistake, or want to bring one of the subtitles forward in time, etc), you can do this easily with KateDJ, a tool that will extract Kate streams, decode them to a temporary location, and rebuild the original stream after you've made whatever changes you want.

KateDJ (included with the libkate distribution) is a GUI program using wxPython, a Python module for the wxWidgets GUI library, and the oggz tools (both needing installing separately if they are not already).

The procedure consists of:

* Run KateDJ
* Click 'Load Ogg stream' and select the file to load
* Click 'Demux file' to decode Kate streams in a temporary location
* Edit the Kate streams (a message box tells you where they are placed)
* When done, click 'Remux file from parts'
* If any errors are reported, continue editing until the remux step succeeds

== Frequently Asked Questions ==

=== Does libkate work on other platforms than Linux ? ===

Yes, libkate is not Linux specific in any way. It optionally relies on libogg
and libpng, two libraries widely ported to various platforms.
It has been reported to work on Windows and MacOS X as well as UNIX platforms.

However, libtiger, a rendering library for Kate streams, relies on Pango and Cairo,
which are not easy to build on Windows, though they can be.
The Tiger renderer is however completely separate from libkate, and is not needed
for full encoding and decoding of Kate streams.

=== Where can I find some example files ? ===

The libkate distribution can generate various examples, but already built files
can be found here:
[http://people.xiph.org/~oggk/elephants_dream/elephantsdream-with-subtitles.ogg]
[http://stallman.org/fry/Stephen_Fry-Happy_Birthday_GNU-nq_600px_425kbit.ogv]

These files use raw text only.

[[Category:Drafts]]
[[Category:Ogg Mappings]]

Bounties

2011-09-05T20:27:22Z

Ogg.k.ogg.k: spamicide

These are proposed bounty projects, similar to http://gnome.org/bounties/
or the [http://ghostscript.com/article/58.html Ghostscript bug bounty] program.
We don't have the same level of funding but could start a pot with $10-$100 and
let people contribute to specific bounties through paypal.

=== Xiph Quicktime Plugin ===
[http://www.xiph.org/quicktime/ QuickTime Components] is now a project hosted on xiph.org.

You have to write a Quicktime Plugin for the Ogg container and the Xiph Codec Family.
[http://qtcomponents.sf.net qtcomponents] provides support for Ogg Vorbis and MNG. This could be used as a start.
Xiph Quicktime Plugin has to support encoding/decoding for:
* Ogg Media container
**[http://qtcomponents.sf.net qtcomponents] ''has an operational pluggable API for import, it needs some work to be long term supportable. It does not have a pluggable API for exporting at this time.''
* Support for Chained Ogg Streams
**[http://qtcomponents.sf.net qtcomponents] ''imports chained files as multiple tracks in QuickTime. It does not create chained files during export.''
* Support for Icecast Streams (sending is optional)
**[http://qtcomponents.sf.net qtcomponents] ''implements nothing towards this item. First up is a reverse-engineering effort, as the specifications for a streaming media handler have not been published.''
* Support for Xiph Codec Family: Vorbis, Theora, FLAC, Speex, Writ
**[http://qtcomponents.sf.net qtcomponents] ''has code for Vorbis and Speex (not working at the moment) and there is code at [http://damien.drix.free.fr/qtflac/ Damien Drix's site] for FLAC (decode only).''
It must also be possible to use the Xiph codecs in .mov files in combination with other quicktime codecs.
*[http://qtcomponents.sf.net qtcomponents] ''supports embedding media encoded with Xiph codecs into .mov files.''
The plugin should work with at least QuickTime 6.x and 7.x on Mac OS X and Windows. (Mac OS 9 would be nice but probably isn't as important.)

All work must be released under the GPL.

Proposed bounty: 100€

=== Aggressive low-bitrate libvorbis encoding improvements for Vorbis I ===
libvorbis has a lot of room for improvement in all quality/bitrate departments, particularly at the lower quality levels / bitrates. There are many directions from which to approach this problem.

To claim this bounty, the following criteria would have to be met:
* A 25%-or-better reduction in bitrate for quality levels -1, 0, 1 on a reasonable testsuite while maintaining qualitative equivalence (or improvement) in community testing.
* No overall qualitative/bitrate regressions in quality levels 2 upwards
* Output ogg files compatible with Vorbis I spec
* Changes under suitable license for re-integration with Xiph.Org libvorbis

Proposed bounty: 200€

=== iPod playback support ===
The [http://ipodlinux.sourceforge.net/ Linux on iPod] project has vorbis decode working (with alternate firmware) at a good fraction of realtime. It should be a small matter of optimization to get it working
for useful playback.

Proposed bounty: 100€

=== Ogg Vorbis Bitrate Peeling ===
:Note: a bounty for this project has been posted on [https://launchpad.net/ launchpad.net]: [https://launchpad.net/bounties/ogg-vorbis-bitrate-peeling Add bitrate peeling to the standard libvorbis encoding library].
<p>Ogg Vorbis bitrate peeling has been a topic brought up time and again to combat MP3 enthusiasts. But this feature does not actually exist, only the mere possibility abounds. This bounty is set to change that.</p>
The peeler must meet the following criteria:
* Any Vorbis stream can be converted (not transcoded) to a lower quality setting
* Resulting streams would be identical or nearly identical to a stream generated by encoding the original source to the selected quality
* This process is reasonably fast (that is, significantly faster than re-encoding from source)
The following must also be accomplished to claim this bounty:
* The encoding libraries must be updated to create <em>peelable</em> Vorbis streams natively
* Old Vorbis streams must be <em>peelable</em> already, or convertible with a utility in order to be made <em>peelable</em>
* If older streams are not natively <em>peelable</em>, old <em>unpeelable</em> Vorbis streams must be identifiable and discernible from <em>peelable</em> streams in such a way as to facilitate transcoding streams from the old format
* All work submitted must be licenced under a BSD style licence (excepting circumstances where other licences may conflict)

Proposed bounty: 100€

FLAC

2010-09-29T13:24:15Z

Ogg.k.ogg.k: spamicide

'''FLAC''' stands for '''Free Lossless Audio Codec'''. FLAC is an [[wikipedia:audio compression|audio compression]] [[wikipedia:codec|codec]] that is [[wikipedia:lossless data compression|lossless]]. Unlike [[wikipedia:lossy data compression|lossy]] codecs such as [[Vorbis]] and [[wikipedia:MP3|MP3]], it does not remove any information from the audio stream.

On 2003 January 29th, the [[Xiph.Org Foundation]] announced the incorporation of FLAC under their flag, to go along with Vorbis, [[Theora]], and [[Speex]].

== The Project ==

The FLAC project consists of:
* the stream format
* libFLAC, a library of reference encoders and decoders, and a metadata interface
* libFLAC++, an object wrapper around libFLAC
* flac, a command-line wrapper around libFLAC to encode and decode .flac files
* metaflac, a command-line metadata editor for .flac files
* input plugins for various music players ([[wikipedia:Winamp|Winamp]], [[wikipedia:XMMS|XMMS]], [[wikipedia:Foobar2000|foobar2000]], and more in the works)

"Free" means that the specification of the stream format is in the [[wikipedia:public domain|public domain]] (the FLAC project reserves the right to set the FLAC specification and certify compliance), and that neither the FLAC format nor any of the implemented encoding/decoding methods are covered by any patent. It also means that the sources for libFLAC and libFLAC++ are available under The New BSD license and the sources for flac and metaflac applications, and the plugins are available under the [[wikipedia:GPL|GPL]].

== Comparisons ==

FLAC is distinguished from general lossless algorithms such as ZIP and gzip in that it is specifically designed for the efficient packing of audio data; while ZIP may compress a CD-quality audio file 20–40%, FLAC achieves compression rates of 30–70%.

While lossy codecs can achieve ratios of 80–90+%, they do this at the expense of discarding data from the original stream. Though FLAC uses a similar technique in its encoding process, it also adds "residual" data to allow the decoder to restore the original waveform flawlessly.

FLAC has become the preferred lossless format for trading live music online. It has a smaller file size than Shorten, and unlike MP3, it's lossless, which ensures the highest fidelity to the source material, which is important to live music traders. It has recently become a favorite trading format of non-live lossless audio traders as well.

There are other lossless audio codecs, however: WavPack (marginally better compression, slower), TAK, Monkey's Audio and some more.

FLAC compiles on many platforms: most Unices (including Linux, *BSD, Solaris, and Mac OS X), DOS, Windows, BeOS, and OS/2. There are build systems for autoconf/automake, MSVC, Watcom C, and Project Builder.

== More information ==

* [[FLACDecoders]]: List of decoders
* [[FLACEncoders]]: List of encoders

== Non-PC playback support ==

FLAC is supported by a wide range of devices. The [[PortablePlayers#Portable Vorbis Native Support Table|portable players Vorbis support matrix]] also contains information about FLAC support. Other examples of FLAC supporting devices are:

* [[PortablePlayers/Flash#Cowon.2FiAudio_D2.2C_F2.2C_T2.2C_U3.2C_U2.2C_G3.2C_5.2C_G2.2C_U5.2C_7|iAudio]]: http://www.iaudio.com
* Kenwood Music Keg
* Naim HDX: http://www.naim-audio.com/products/hdx.html
* PhatNoise Home Media Player
* PhatNoise Phatbox
* [[PortablePlayers/Harddisk#Rio Karma|Rio Karma]]: http://www.digitalnetworksna.com/rioaudio/
* [[StaticPlayers#Slim_Devices_Squeezebox.2C_Squeezebox2.2C_Squeezebox3.2C_Transporter|SlimDevices Squeezebox]]: http://www.slimdevices.com

FLAC is supported by the following chips and/or chipsets:

* VLSI Solution OY's [http://www.vlsi.fi/en/products/vs1053.html VS1053b] decodes FLAC

== External links ==

*[http://flac.sourceforge.net/ Project homepage]
*[http://mikewren.com/flac/ Unofficial FLAC installer for Windows]
*[http://www.danrules.com/macflac/ MacFLAC] [[GUI]] frontend to encode/decode FLAC on [[Mac OS X]]
*[[Wikipedia: FLAC]]
*[http://www.losslessaudioblog.com/ The Lossless Audio Blog] Lossless Audio News & Information Site.

[[Category:Xiph core projects]]

Talk:Videos/A Digital Media Primer For Geeks

2010-09-25T10:24:29Z

Ogg.k.ogg.k: /* Video vegetables (they're good for you!) */

Welcome to the discussion.

To discuss the video, make an account and hit edit. Please feel free to point out errata, suggested additional resources, or just ask questions!

==Introduction==

==Analog vs Digital==

==Raw (digital audio) meat==
Don't forget when talking about higher sampling rates that frequency and temporal response are inherently linked. One often overlooked aspect of this is the value of higher sampling rates in presenting subtle differences in multi-channel timing (e.g. the stereo field). Even fairly uncritical listeners presented sample audio blind can notice this. --Chaboud

:They aren't merely "technically linked". They're mathematically indistinguishable. If a system doesn't have a response beyond some frequency it also lacks time resolution beyond some point.
:To the best of my knowledge a perceptually justified need for higher rates is not supported by the available science on the subject. Not only is there no real physiological mechanism proposed for this kind of sensitivity, well controlled blind listening tests don't support it— well controlled being key, loudspeakers can suffer from considerable non-linear effects including intermodulation, and having a lot of otherwise inaudible ultrasonics can produce audible distortion at lower frequencies. Another common error is running the DAC at different frequencies— with the obvious interactions with the reconstruction and analog filters. A correct test for determining the audibility differences of higher sample rates needs to use a single DAC stage at the highest frequency, re-sampling digitally to create the bandpass... etc. I'm not aware of any such test supporting a need for information beyond 24kHz.
:I normally suggest to people looking for increased to look into acoustic holography techniques like higher-order ambisonics and wavefield synthesis.
:The beyond 48kHz sampling subject has been [http://www.google.com/custom?domains=hydrogenaudio.org&q=96khz&sa=Google+Search&sitesearch=hydrogenaudio.org&client=pub-4544327213918729&forid=1&channel=7051718642&ie=ISO-8859-1&oe=ISO-8859-1&flav=0000&sig=6_g3ghDcS6bRpfcd&cof=GALT%3A%23008000%3BGL%3A1%3BDIV%3A%23336699%3BVLC%3A663399%3BAH%3Acenter%3BBGC%3AFFFFFF%3BLBGC%3AFFFFFF%3BALC%3A0000FF%3BLC%3A0000FF%3BT%3A000000%3BGFNT%3A0000FF%3BGIMP%3A0000FF%3BLH%3A50%3BLW%3A262%3BL%3Ahttp%3A%2F%2Fwww.hydrogenaudio.org%2Fforums%2Flogo50.png%3BS%3Ahttp%3A%2F%2Fwww.hydrogenaudio.org%3BFORID%3A1&hl=en discussed a number of times on hydrogen audio], I recommend reading the threads there. They are quite informative. Most audio groups out there online and off are not very scientifically oriented (e.g. evidence based); HA is special because it is one of the few that are.--[[User:Gmaxwell|Gmaxwell]] 06:00, 24 September 2010 (UTC)

==Video vegetables (they're good for you!)==

An interesting point is that the discussion of the linear segment in the normal display responses (e.g. sRGB) is incorrect, or at best incomplete, though I've come up short on good citations for this, so Wikipedia remains uncorrected at this time.--[[User:Gmaxwell|Gmaxwell]] 05:15, 22 September 2010 (UTC)

Hi there, great tutorial, but in fact the most common DVD standard is 720 pixels by 480 pixels, with a pixel ratio of 0.9, yielding a device aspect ratio of 1.35. I understand that you're trying to simplify the lecture to 4:3 aspect (1.333) for newbies, I think this is ultimately misleading, since the vast majority of DVDs are not sampled at 704x480. --Dryo

: Sort of-- the most common encoding is 720x480, but with the crop area set to 704x480; that's what the standard calls for (I was being sneaky when I said 'display resolution of 704x480'). Many software players ignore the crop rectangle and also display the horizontal overscan area. Many software encoders also just blindly encode 720x480 without setting the crop area. It is a source of *much* confusion. --[[User:Xiphmont|Monty]]

::"The standard" here being— Rec. 601? Is there anything else? We should probably at least link [[Wikipedia:overscan]]. --[[User:Gmaxwell|Gmaxwell]] 13:13, 24 September 2010 (UTC)

::OK, thanks for the clarification Monty... I did not even know that the horizontal crop area existed.
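
:::For anyone following the arithmetic in this thread, here is a rough sketch of the two calculations. The 0.9 pixel ratio and the 720 and 704 widths come from the comments above; the 10/11 pixel aspect ratio is an assumption (the value usually quoted for 4:3 NTSC material) and is not stated in the thread.

```python
from fractions import Fraction

def display_aspect(width, height, pixel_aspect):
    """Display aspect ratio = (width * pixel aspect ratio) / height."""
    return Fraction(width) * pixel_aspect / height

# Dryo's figures: the full 720-wide frame with the rounded 0.9 pixel ratio.
full_frame = display_aspect(720, 480, Fraction(9, 10))

# Monty's figures: the 704-wide crop area.
# Assumption: a pixel aspect ratio of 10/11 (not given in the thread).
cropped = display_aspect(704, 480, Fraction(10, 11))

print(float(full_frame))  # 1.35
print(cropped)            # 4/3
```

Under those assumptions the 704-wide crop area comes out at exactly 4:3, while the full 720-wide frame is slightly wider.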

"''[...] most displays use [RGB] colors [...]''". Doesn't that sentence contradict this one : "''[...] video usually is represented as a [...] luma channel along with additional [...] chroma channels, the color''". I don't understand what "''position the chroma pixels''" means exactly. Are we talking of real points on a display ? Thanks, great video ! --[[User:Ledahulevogyre|Ledahulevogyre]] 13:59, 24 September 2010 (UTC)

:Display devices use RGB. Most video is actually encoded as YUV, luma plus two color "difference" channels. This reduces the bandwidth of raw video by cleverly exploiting limitations in human perception. Additionally, color samples need not be as frequent as luminance samples. So "chroma pixels" are the color data samples, not the pixels on a real display. --Dryo

::Thanks Dryo ! that's what I thought. Then I don't quite understand what this chroma samples positioning/siting is about. Is it actually defining the algorithm you should use to compute RGB pixels from YUV samples ? Is it defining the influence zone of chroma samples over luminance ones ? What I don't get is how you can talk about spatial positioning for something that is, well... not spatial (samples). Thank you again ! --[[User:Ledahulevogyre|Ledahulevogyre]] 09:52, 25 September 2010 (UTC)

:::Imagine a small 2x2 image, with the top two pixels blue, and the bottom two pixels red. Luminance will be sampled at each pixel, but (for 4:2:0), only one sample of Cr will be taken for this 2x2 set, so you'll have to decide where. If you place the sample in the middle horizontally, but aligned with every even or odd line, you'll get a sample from either blue, or red. If you place the sample in the middle both horizontally and vertically, you'll get a sample from pink. Similarly for each other possible placement algorithm. [[User:Ogg.k.ogg.k|Ogg.k.ogg.k]] 10:24, 25 September 2010 (UTC)
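
::::The 2x2 example above can be sketched numerically. This is only an illustration: the two Cr values are made up, and modelling "placing the sample" as bilinear interpolation between pixel centres is an assumption, not something the video or any standard prescribes.

```python
# Illustrative Cr values for the two colours (hypothetical, not from the thread).
CR_BLUE, CR_RED = 107.0, 255.0

# 2x2 block: top row blue, bottom row red.
block = [[CR_BLUE, CR_BLUE],
         [CR_RED, CR_RED]]

def sample_cr(block, x, y):
    """Bilinearly interpolate the 2x2 block at (x, y), each in [0, 1].

    (0, 0) is the centre of the top-left pixel, (1, 1) the bottom-right one.
    """
    top = block[0][0] * (1 - x) + block[0][1] * x
    bottom = block[1][0] * (1 - x) + block[1][1] * x
    return top * (1 - y) + bottom * y

on_top_line = sample_cr(block, 0.5, 0.0)     # centred horizontally, on the top line
on_bottom_line = sample_cr(block, 0.5, 1.0)  # centred horizontally, on the bottom line
centred = sample_cr(block, 0.5, 0.5)         # centred horizontally and vertically

print(on_top_line, on_bottom_line, centred)  # 107.0 255.0 181.0
```

The first two sitings return one of the original colours unchanged; the centred siting returns a blend of the two, which is why a decoder needs to know which siting the encoder used.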

==Containers==

==General discussion==

The video hasn't yet been formally released but we have all the sites up early in order to get everything debugged... Feedback on site functionality prior to the official release would be very helpful. --[[User:Gmaxwell|Gmaxwell]] 15:15, 22 September 2010 (UTC)
:Released now, but still tell us about bugs :-) --[[User:Xiphmont|Monty]]

=== Atom/RSS feed ===

Could not find an Atom/RSS feed for the video episodes. A videocast url with video-link enclosures would be ideal for getting future episodes. But even an announce-only feed would be convenient to track new episode releases. --[[User:Gsauthof|Gsauthof]] 17:41, 24 September 2010 (UTC)
:One does not exist yet— as a stopgap you can follow the [http://xiphmont.livejournal.com/tag/xiph Xiph tag on Monty's blog] and you'll be sure to hear about new videos. This has to be the most requested feature— I'll make sure we do it before the next video.--[[User:Gmaxwell|Gmaxwell]] 20:50, 24 September 2010 (UTC)

== 44100 Hz Trivia ==

The reason CDs use 44,100 Hz (actually 44,056 Hz in the United States) is that, before dedicated digital recorders became mainstream, the only way a recording engineer or producer could record digital audio was with a piece of gear called a "PCM processor" or a "PCM adaptor" (like a Sony PCM-F1 or PCM-501). These would take an audio input and, after running it through the A/D if necessary, modulate it onto a baseband monochrome NTSC or PAL video signal that could then be recorded onto a 3/4" U-Matic video tape. The processors would accept two inputs, at 16 bits, giving a total bit rate of 1411200 bps. This number has the serendipitous property of being evenly divisible by both 30 and 25, giving 47040 and 56448 bits per frame respectively, and these numbers allow both NTSC and PAL to encode the same number of bits, 98, per scan line (with the NTSC 480 line raster and PAL 576 line raster). It was just a convenient selection of integers. CDs would be recorded at 44.1k in Europe as they were mastered onto 25 fps tapes, while CDs in the US, recorded at a "nominal" 30 fps, were actually at 44.056k, but the difference in tone is basically inaudible. [[User:Iluvcapra|Iluvcapra]] 18:44, 24 September 2010 (UTC)

:Note that the PCM audio signal, once modulated to NTSC or PAL, can be recorded on any video recorder, not just U-matic. The most common tape format for PCM audio was Sony Betamax. Sony sold Betamax decks bundled with external PCM A/D converter units for the pro audio market. The PCM-F1 was designed to be used with Betacam VCRs. -- Dryo
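
:The arithmetic in the original comment can be checked with a short sketch. The frame rates and line counts are the ones quoted above; the 1000/1001 factor for "nominal 30 fps" NTSC is an assumption added here to show where a figure near 44,056 Hz could come from.

```python
CHANNELS = 2
BITS_PER_SAMPLE = 16
SAMPLE_RATE = 44_100

bit_rate = CHANNELS * BITS_PER_SAMPLE * SAMPLE_RATE  # bits per second

# (frames per second, lines per frame) as quoted in the comment above.
systems = {"NTSC": (30, 480), "PAL": (25, 576)}

bits_per_line = {}
for name, (fps, lines) in systems.items():
    bits_per_frame, frame_remainder = divmod(bit_rate, fps)
    per_line, line_remainder = divmod(bits_per_frame, lines)
    bits_per_line[name] = per_line
    print(name, bits_per_frame, per_line, frame_remainder, line_remainder)

# Assumption: "nominal 30 fps" NTSC actually runs at 30 * 1000/1001 fps,
# which scales the sample rate by the same factor.
ntsc_rate = SAMPLE_RATE * 1000 / 1001
print(round(ntsc_rate, 1))  # 44055.9
```

Both systems divide evenly and land on 98 bits per line, as the comment states.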

Playback Troubleshooting

2010-09-24T12:20:09Z

Ogg.k.ogg.k: die-on-pause also happens in 3.5.x

We'd like to hear detailed descriptions of problems viewing videos in standalone players and browsers. HTML5 and WebM especially are very new, and it's highly unlikely the experience in any browser is going to be free of hiccups quite yet. The more feedback we get about what doesn't work, the more we can do to make sure problems <b>get fixed</b>.

If you don't see your browser or player below, feel free to add it to the appropriate list. And to avoid any battles over natural pecking order, keep them in alphabetical order ;-)

A list of Ogg Theora players (without troubleshooting or discussion) with links to vendor pages can be found on the [[TheoraSoftwarePlayers|Theora Software Players page]].

==In-browser Playback==

===Hiccups not specific to any browser===
====Brief flash of beginning of video when changing resolutions====

There are two basic ways of changing the video currently playing back in the current HTML5 spec, and both have some practical problems we'd like to see fixed before the spec is finalized.

The first way to change streams is to create a new video element via javascript, wait for it to load, then replace the current video with the new one. Unfortunately, HTML5 gives no way to prevent the original video, even when stopped, from using all available bandwidth to keep buffering as fast as it can. This starves the replacement video of network access, causing a lengthy delay when loading. It looks very nice and seamless when it finally works, but can easily result in switching video streams taking 15-30 seconds or more.

The second option is to switch the pre-existing video element to a new stream. This is much faster as the original stream stops sinking bandwidth immediately, but upon loading it always starts from the beginning and in current browsers also displays the first frame, even if playback isn't started. After the load completes, then it's possible to seek forward to where the original stream started. It doesn't look as good, but it's much faster in practice.

Xiph's video playback scripting uses the second, faster option, so there's a brief flash back to the beginning of the video upon resolution switch.

====No 'extra' controls [resolution switching, chapter navigation] on some browsers====

The 'extra controls' that appear as a bar along the top of the video playback window are implemented using HTML5 <video> tag features, and as such can't work as written in browsers using the Cortado fallback applet. Cortado does support subtitles via the 'CC' button in the lower right of the playback area, and our Ogg streams include subtitle tracks.

===Firefox===

[https://bugzilla.mozilla.org/buglist.cgi?query_format=advanced&component=Video%2FAudio&product=Core Search for known Firefox Audio/Video bugs]

====Firefox 1.x, 2.x, 3.0.x====

Firefox before version 3.5 (or 3.1 beta) did not include native support for Ogg or WebM. These browsers can play Ogg video via the Cortado applet if a Java runtime environment is installed. With Java installed, playback is seamless but does not have a full set of HTML5 features; resolution switching and chapter navigation are disabled. Cortado has native support for Ogg Kate subtitles.

====Firefox 3.5.x====

Firefox 3.5 was the first version of Firefox to ship with native Ogg playback. It features a full HTML5 feature set, though it is known to be relatively slow about seeking and navigation.

Seeking may work poorly if your connectivity to the media passes through a proxy which strips HTTP range requests.

On common GNU/Linux systems with pulseaudio, such as Ubuntu and Fedora, playback will halt and refuse to continue after pausing (and potentially seeking) and will not continue unless the page is completely reloaded due to Mozilla [https://bugzilla.mozilla.org/show_bug.cgi?id=526411 Bug#526411].

====Firefox 3.6.x====

Firefox 3.6 behaves similarly to FF3.5, but adds poster support and more robust Ogg stream navigation along with some bug fixes.

Seeking may work poorly if your connectivity to the media passes through a proxy which strips HTTP range requests.

On common GNU/Linux systems with pulseaudio, such as Ubuntu and Fedora, playback will halt and refuse to continue after pausing (and potentially seeking) and will not continue unless the page is completely reloaded due to Mozilla [https://bugzilla.mozilla.org/show_bug.cgi?id=526411 Bug#526411].

====Firefox 4.0 (currently in beta)====

Firefox 4.0 features a new Ogg playback engine that allows considerably faster stream navigation, as well as WebM support.

===Google Chrome===
[https://code.google.com/p/chromium/issues/list?can=1&q=label%3Avideo Search for video bugs in Chrome's bugtracker]

Google Chrome added Ogg playback support in version [?], but it is known to have serious bugs when seeking in Ogg streams; it also tends to lose the beginning of videos. Recent releases of Chrome support WebM, which works considerably better, though the playback framerate is often choppy/jerky (at least on Linux).

The "Save Video As" menu item re-downloads the video, even if it's fully cached.

===Internet Explorer===

====Internet Explorer 5, 6, 7, 8====
Internet Explorer through version 8 has no support whatsoever for Ogg, WebM or the video tag. Normal installs do include Java support, however, so these browsers are able to play Ogg video through the Cortado applet. With Cortado, playback is seamless but does not have a full set of HTML5 features; resolution switching and chapter navigation are disabled. Cortado has native support for Ogg Kate subtitles.

====Internet Explorer 9====
Internet Explorer 9 (currently in alpha/beta) apparently supports the HTML 5 video tag at least somewhat; however, it does not support Ogg or WebM playback out of the box. Microsoft has stated it will support Ogg and WebM 'if the codecs are installed on the system'. Presumably having the [http://www.xiph.org/dshow/ Open Codecs pack] installed fulfills this requirement and enables Ogg and WebM support (confirmation would be appreciated! If you have the Open Codecs DirectShow filters installed, you should get full in-browser playback).

Internet Explorer 9 without Ogg/WebM support installed can presumably still play back Ogg video via the Cortado applet as in versions 8 and earlier (again, confirmation would be appreciated!)

===Opera===

Opera long supported Ogg playback in developer builds, and finally shipped native Ogg support in release 10.5. As of 10.60, WebM is also natively supported.

===Netscape Navigator===
[[Image:Cortado_ns4.png|250px|right]]
Laugh if you must, but Navigator back to version 4 can play Ogg video via the Cortado applet.

<br style="clear:both;"/>

===Safari===

Safari does not ship native support for Ogg or WebM video, however all versions can play Ogg video via the Cortado applet. With Cortado, playback is seamless but does not have a full set of HTML5 features; resolution switching and chapter navigation are disabled. Cortado has native support for Ogg Kate subtitles.

As of Safari 3.1, Safari supports full HTML5 Ogg video playback via the [http://www.xiph.org/quicktime/ XiphQT Quicktime Components].

==Standalone Players and Tools==

===Core Player===

===FFMPEG / ffplay===

As of release 0.6, ffmpeg supports WebM playback, and Ogg playback is solid with the exception of surround support (e.g. 5.1 and other surround encodings produced by modern Vorbis encoders will not play).

Prior to ffmpeg 0.6, WebM was not supported and Ogg video playback was broken due to a number of longstanding bugs caused by treating Theora as if it was just VP3 (e.g. the 'sheet lightning acid trip' bug that caused the image to disintegrate into a shower of colored blocks). Many applications and video sharing sites (such as YouTube) are still using old versions of ffmpeg internally, and as such, they cannot handle Ogg video unless it is encoded in 'vp3 compatibility mode'.

===Media Player Classic===

===Mplayer===

Recent mplayer versions have good native Ogg playback support and can handle WebM playback through libavcodec (ffmpeg libraries).

Mplayer has had a number of minor Ogg playback bugs in the past that mostly caused seeking or smoothness hiccups. Recent versions should have fixed all of the playback/seeking bugs of note.

===Helix Player (Real)===

===Quicktime===

Quicktime supports Ogg and WebM playback and encoding through the [http://www.xiph.org/quicktime/ XiphQT Quicktime Components]. These components also add Ogg support to Quicktime-aware applications such as Final Cut and Final Cut Pro.

===Totem===

Totem supports Ogg and WebM playback via native support in gstreamer.

===VLC===

VLC has had good native Ogg support since the GoldenEye release. WebM support is available in 1.1+.

===Windows Media Player===

WMP supports Ogg and WebM playback through the [http://www.xiph.org/dshow/ Open Codecs] DirectShow filter pack.

===Xine===

Ogg Skeleton 3

2010-09-21T22:12:29Z

Ogg.k.ogg.k: Undo revision 12334 by Drivelol (Talk)

'''Ogg Skeleton''' provides structuring information for multitrack [[Ogg]] files. It is compatible with Ogg [[Theora]] and provides extra clues for synchronization and content negotiation such as language selection.

Ogg is a generic container format for time-continuous data streams, enabling interleaving of several tracks of frame-wise encoded content in a time-multiplexed manner. As an example, an Ogg physical bitstream could encapsulate several tracks of video encoded in Theora and multiple tracks of audio encoded in Speex or Vorbis or FLAC at the same time. A player that decodes such a bitstream could then, for example, play one video channel as the main video playback, alpha-blend another one on top of it (e.g. a caption track), play a main Vorbis audio together with several FLAC audio tracks simultaneously (e.g. as sound effects), and provide a choice of Speex channels (e.g. providing commentary in different languages). Such a file can generally be created with Ogg. It is, however, not possible to generically parse such a file, seek on it, understand what codecs it contains, and dynamically handle and play back such content.

Ogg does not know anything about the content it carries and leaves it to the media mapping of each codec to declare and describe itself. There is no meta information available at the Ogg level about the content tracks encapsulated within an Ogg physical bitstream. This is particularly a problem if you don't have all the decoder libraries available and just want to parse an Ogg file to find out what type of data it encapsulates (such as the "file" command under *nix to determine what file it is through magic numbers), or want to seek to a temporal offset without having to decode the data (such as on a Web server that just serves out Ogg files and parts thereof).

Ogg Skeleton is being designed to overcome these problems. Ogg Skeleton is a logical bitstream within an Ogg stream that contains information about the other encapsulated logical bitstreams. For each logical bitstream it provides information such as its media type, and explains the way the granulepos field in Ogg pages is mapped to time.

Ogg Skeleton is also designed to allow the creation of substreams from Ogg physical bitstreams that retain the original timing information. For example, when cutting out the segment between the 7th and the 59th second of an Ogg file, it would be nice to continue to start this cut out file with a playback time of 7 seconds and not of 0. This is of particular interest if you're streaming this file from a Web server after a query for a temporal subpart such as in http://example.com/video.ogv?t=7-59 .
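
As a rough illustration of the temporal query mentioned above, the following sketch extracts the requested range from such a URL. It assumes the <code>t=start-end</code> form shown in the example, with both values in seconds; this page does not define the query syntax any further, and the function name is hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

def parse_time_range(url):
    """Return (start, end) in seconds from a `t=start-end` query parameter."""
    query = parse_qs(urlsplit(url).query)
    start, _, end = query["t"][0].partition("-")
    return float(start), float(end)

cut_in, cut_out = parse_time_range("http://example.com/video.ogv?t=7-59")
print(cut_in, cut_out)  # 7.0 59.0
```

A server handling such a request would presumably use the cut-in value as the presentation time of the substream it returns.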

== Specification ==

This is a motivation and design sketch.
'''For the current specification see http://svn.annodex.net/standards/draft-pfeiffer-oggskeleton-current.txt'''

=== How to describe the logical bitstreams within an Ogg container? ===

The following information about a logical bitstream is of interest to contain as meta information in the Skeleton:
* the serial number: it identifies a content track
* the mime type: it identifies the content type
* other generic name-value fields that can provide meta information such as the language of a track or the video height and width
* the number of header packets: this informs a parser about the number of actual header packets in an Ogg logical bitstream
* the granule rate: the granule rate represents the data rate in Hz at which content is sampled for the particular logical bitstream. Note that when using this to interpret timestamps, the granulepos of a data page must first be parsed to extract a granule value using the method described in [[GranulePosAndSeeking]]. This value can then be mapped to time by calculating "granules / granulerate".
* the preroll: the number of past content packets to take into account when decoding the current Ogg page, which is necessary for seeking (vorbis has generally 2, speex 3)
* the granuleshift: the number of lower bits from the granulepos field that are used to provide position information for sub-seekable units (like the keyframe shift in theora)
* a basetime: it provides a mapping for granule position 0 (for all logical bitstreams) to a playback time; an example use: most content in professional analog video creation actually starts at a time of 1 hour and thus adding this additional field allows them to retain this mapping on digitizing their content
* a UTC time: it provides a mapping for granule position 0 (for all logical bitstreams) to a real-world clock time allowing to remember e.g. the recording or broadcast time of some content
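
The "granules / granulerate" mapping from the list above can be sketched as follows. The function name is hypothetical, and the way the granulepos is split when a granuleshift is present (upper bits as the keyframe granule, lower bits as the offset since that keyframe) is an assumption modelled on Theora; the authoritative method is the one described in [[GranulePosAndSeeking]].

```python
from fractions import Fraction

def granulepos_to_seconds(granulepos, rate_num, rate_den, granuleshift=0):
    """Map a page granulepos to seconds using the granule rate fields."""
    if granuleshift:
        # Assumed Theora-style layout: keyframe granule in the high bits,
        # number of granules since that keyframe in the low bits.
        keyframe = granulepos >> granuleshift
        offset = granulepos & ((1 << granuleshift) - 1)
        granules = keyframe + offset
    else:
        granules = granulepos
    # granule rate = rate_num / rate_den, so time = granules * rate_den / rate_num
    return Fraction(granules) * rate_den / rate_num

# Audio at 44100 granules per second, no granuleshift.
audio_time = granulepos_to_seconds(441000, 44100, 1)

# Video at 25 granules per second with a granuleshift of 6:
# keyframe at granule 100, current frame 5 granules later.
video_time = granulepos_to_seconds((100 << 6) | 5, 25, 1, granuleshift=6)

print(audio_time, video_time)  # 10 21/5
```

Exact rational arithmetic is used here so that rates such as 30000/1001 do not accumulate rounding error; a player would add the basetime to this value to obtain the playback time.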

=== How to allow the creation of substreams from an Ogg physical bitstream? ===

When cutting out a subpart of an Ogg physical bitstream, the aim is to keep all the content pages intact (including the framing and granule positions) and just change some information in the Skeleton that allows reconstruction of the accurate time mapping. When remultiplexing such a bitstream, it is necessary to take into account all the different contained logical bitstreams. A given cut-in time maps to several different byte positions in the Ogg physical bitstream because each logical bitstream has its relevant information for that time at a different location. In addition, the resolution of each logical bitstream may not be high enough to accommodate for the given cut-in time and thus there may be some surplus information necessary to be remuxed into the new bitstream.

The following information is necessary to be added to the Skeleton to allow a correct presentation of a subpart of an Ogg bitstream:
* the presentation time: this is the actual cut-in time and all logical bitstreams are meant to start presenting from this time onwards, not from the time their data starts, which may be some time before that (because this time may have mapped right into the middle of a packet, or because the logical bitstream has a preroll or a keyframe shift)
* the basegranule: this represents the granule number with which this logical bitstream starts in the remuxed stream and provides for each logical bitstream the accurate start time of its data stream; this information is necessary to allow correct decoding and timing of the first data packets contained in a logical bitstream of a remuxed Ogg stream

=== Ogg Skeleton version 3.0 Format Specification ===

Adding the above information into an Ogg bitstream without breaking existing Ogg functionality and code requires the use of a logical bitstream for Ogg Skeleton. This logical bitstream may be ignored on decoding such that existing players can still continue to play back Ogg files that have a Skeleton bitstream. Skeleton enriches the Ogg bitstream to provide meta information about structure and content of the Ogg bitstream.

The Skeleton logical bitstream starts with an ident header that contains information about all of the logical bitstreams and is mapped into the Skeleton bos page.
The first 8 bytes provide the magic identifier "fishead\0".
After the fishead follows a set of secondary header packets, each of which contains information about one logical bitstream. These secondary header packets are identified by an 8 byte code of "fisbone\0". The Skeleton logical bitstream has no actual content packets. Its eos page is included into the stream before any data pages of the other logical bitstreams appear and contains a packet of length 0.

The fishead ident header looks as follows ([http://annodex.org/w/images/3/39/FishHeads.JPG inspiration]):

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | Identifier 'fishead\0'                                        | 0-3
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                                                               | 4-7
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | Version major                 | Version minor                 | 8-11
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | Presentationtime numerator                                    | 12-15
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                                                               | 16-19
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | Presentationtime denominator                                  | 20-23
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                                                               | 24-27
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | Basetime numerator                                            | 28-31
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                                                               | 32-35
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | Basetime denominator                                          | 36-39
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                                                               | 40-43
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | UTC                                                           | 44-47
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                                                               | 48-51
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                                                               | 52-55
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                                                               | 56-59
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                                                               | 60-63
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The version fields provide version information for the Skeleton track, currently being 3.0 (the number having evolved within the Annodex project).
Presentation time and basetime are specified as a rational number, the denominator providing the temporal resolution at which the time is given (e.g. to specify time in milliseconds, provide a denominator of 1000).
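
A minimal sketch of reading the fishead layout above with Python's <code>struct</code> module follows. The field widths are taken from the diagram; the byte order (little-endian) and the signedness of the 64-bit time fields are assumptions, since this page states neither, and the function and dictionary key names are hypothetical.

```python
import struct
from fractions import Fraction

# Assumptions: little-endian byte order and signed 64-bit time fields.
# 8s = identifier, HH = version, 4 x q = time fields, 20s = UTC.
FISHEAD = struct.Struct("<8sHHqqqq20s")  # 64 bytes, matching the diagram

def parse_fishead(packet):
    """Unpack the 64-byte fishead ident header into a dict."""
    magic, major, minor, pt_num, pt_den, bt_num, bt_den, utc = FISHEAD.unpack(
        packet[:FISHEAD.size])
    if magic != b"fishead\0":
        raise ValueError("not a fishead packet")
    return {
        "version": (major, minor),
        # A zero denominator is treated here as "time not given".
        "presentation_time": Fraction(pt_num, pt_den) if pt_den else None,
        "basetime": Fraction(bt_num, bt_den) if bt_den else None,
        "utc": utc,
    }

# Hypothetical header: version 3.0, presentation time 7000/1000 = 7 seconds.
example = FISHEAD.pack(b"fishead\0", 3, 0, 7000, 1000, 0, 1000, b"\0" * 20)
head = parse_fishead(example)
print(head["version"], head["presentation_time"])  # (3, 0) 7
```

The example header uses a denominator of 1000, i.e. times expressed in milliseconds as described above.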

The fisbone secondary header packet looks as follows:

0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Identifier 'fisbone\0' | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Offset to message header fields | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Serial number | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Number of header packets | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Granulerate numerator | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Granulerate denominator | 28-31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | 32-35
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Basegranule | 36-39
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| | 40-43
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Preroll | 44-47
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Granuleshift | Padding/future use | 48-51
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Message header fields ... | 52-
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The mime type is provided as a message header field specified in the same way that HTTP header fields are given (e.g. "Content-Type: audio/vorbis"). Further meta information (such as language and screen size) is also included as message header fields. The offset to the message header fields at the beginning of a fisbone packet is included for forward compatibility - to allow further fields to be included into the packet without disrupting the message header field parsing.
The granule rate is again given as a rational number in the same way that presentation time and basetime were provided above.
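
The fisbone layout can be read in the same way. This is a sketch under stated assumptions: little-endian byte order, a signed 64-bit granule rate and basegranule, an offset that is counted from the start of the offset field itself (so that a value of 44 points at byte 52), and message header lines terminated by CR LF as in HTTP. None of these details are spelled out on this page, and the function name is hypothetical.

```python
import struct
from fractions import Fraction

# Assumed little-endian. 8s = identifier, III = offset / serial / header count,
# qqq = granule rate numerator / denominator / basegranule,
# I = preroll, B = granuleshift, 3x = padding.
FISBONE = struct.Struct("<8sIIIqqqIB3x")  # 52 bytes of fixed fields

def parse_fisbone(packet):
    """Unpack a fisbone packet and its HTTP-style message header fields."""
    (magic, offset, serial, n_headers, gr_num, gr_den,
     basegranule, preroll, granuleshift) = FISBONE.unpack(packet[:FISBONE.size])
    if magic != b"fisbone\0":
        raise ValueError("not a fisbone packet")
    # Assumption: the offset is relative to the offset field at byte 8.
    text = packet[8 + offset:].decode("utf-8", errors="replace")
    headers = {}
    for line in text.split("\r\n"):
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return {
        "serial": serial,
        "header_packets": n_headers,
        "granulerate": Fraction(gr_num, gr_den) if gr_den else None,
        "basegranule": basegranule,
        "preroll": preroll,
        "granuleshift": granuleshift,
        "headers": headers,
    }

# Hypothetical Vorbis track: serial 1234, 3 header packets, 44100 Hz, preroll 2.
example = (FISBONE.pack(b"fisbone\0", 44, 1234, 3, 44100, 1, 0, 2, 0)
           + b"Content-Type: audio/vorbis\r\n")
bone = parse_fisbone(example)
print(bone["granulerate"], bone["headers"])
```

Because the parser jumps to the message header fields through the offset rather than a fixed position, it would keep working if a later version inserted additional fixed fields before them, which is the forward-compatibility goal described above.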

Further restrictions on how to encapsulate Skeleton into Ogg are proposed to allow for easier parsing:
* there can only be one Skeleton logical bitstream in an Ogg bitstream.
* the Skeleton bos page is the very first bos page in the Ogg stream such that it can be identified straight away and decoders don't get confused about it being e.g. Ogg Vorbis without this meta information
* the bos pages of all the other logical bitstreams come next (a requirement of Ogg)
* the secondary header pages of all logical bitstreams come next, including Skeleton's secondary header packets
* the Skeleton eos page ends the control section of the Ogg stream before any content pages of any of the other logical bitstreams appear

== Development ==

Ogg Skeleton is being supported by the following projects:
* the Ogg Directshow filters: see [http://www.illiminable.com/ogg/ illiminable]
* liboggz: [http://svn.annodex.net/liboggz/ liboggz svn] or [http://annodex.net/software/liboggz/ liboggz]
* the Annodex technology: [http://www.annodex.net/ annodex.net]
* [http://www.kfish.org/software/hogg/ HOgg] (Haskell)
* ffmpeg2theora (with --skeleton)
* speexenc (with --skeleton) & speexdec
* many more ...

== External links ==

* Ogg Skeleton is described in more detail in the [http://svn.annodex.net/standards/draft-pfeiffer-oggskeleton-current.txt Skeleton I-D in svn]
* Ogg Skeleton was originally specified in Annodex v3: [http://svn.annodex.net/standards/ I-D in svn] or [http://annodex.net/specifications.html I-D]

[[Category:Ogg]]

Vorbis

2010-09-07T15:01:56Z

Ogg.k.ogg.k:

'''Vorbis''' is a patent-clear, fully open general purpose audio encoding format standard that rivals or surpasses the 'upcoming' generation of proprietary coders ([[Wikipedia:Advanced Audio Coding|AAC]] and [[Wikipedia:TwinVQ|TwinVQ]], also known as VQF). There is no raw Vorbis stream defined; instead, the Vorbis codec is typically used in the [[Ogg]] container format for audio files, and was called Ogg Vorbis for a long time since the Ogg container was used almost exclusively for Vorbis. Later the [[FLAC]] audio codec and the [[Theora]] and [[Dirac]] video codecs began to be used inside Ogg too, and in 2010 the [[WebM]] format was defined using the Vorbis codec inside the WebM container, rather than Ogg.

libvorbis, a BSD-licensed implementation of Vorbis as a library, is available; see the [http://xiph.org/vorbis/ Ogg Vorbis page] for documentation, downloads and distribution terms.

Further, many players support Ogg Vorbis; see [http://www.vorbis.com/ vorbis.com] for a list of all the players we know about.

== More information ==

* [[Vorbis Hardware]]: List of hardware-players supporting Ogg Vorbis
* [[Vorbis Software Players]]: List of media players that can play Ogg Vorbis
* [[Vorbis Software Encoders]]: List of libvorbis frontends
* [[Vorbis Decoders]]: List of decoders (e.g. Xiph, Tremor, JOrbis, etc)
* [[Vorbis Encoders]]: List of encoders (e.g. Xiph, aoTuV, GT, vorbis-java)
* [[Vorbis-tools]]: Reference tools maintained by Xiph.org
* [[Games that use Vorbis]]: List of games using Ogg Vorbis
* [[VorbisStreams]]: Stations streaming with the [[Vorbis]] codec
* [[OggVorbis|Mapping in Ogg]]: useful information for developers

== External links ==

* [http://www.vorbis.com/ Vorbis.com]
* [[Wikipedia: Vorbis]]
* [http://www.rjamorim.com/test/multiformat128/results.html 128kbps public listening test]
* [http://www.hydrogenaudio.org/forums/index.php?showtopic=35438 80kbps personal listening test]
* [http://www.hydrogenaudio.org/forums/index.php?showtopic=36465 180kbps personal listening test with classical music]
* [http://www.maresweb.de/listening-tests/mf-128-1/results.htm 128kbps public listening test]

[[Category:Vorbis]]

Timed Divs HTML

2010-07-09T18:45:16Z

Ogg.k.ogg.k: Undo revision 12270 by Vsimon213 (Talk)

{{draft}}

= Introduction =

This page specifies a subclass of HTML documents that is a time-aligned text format for audio-visual content. We call the format "timed divs within HTML" or TDHT. It is intended to be used only in a World Wide Web context i.e. everywhere that Web browser functionality is available. Use cases for the format are subtitles, captions, annotations and other time aligned text as listed at http://wiki.xiph.org/index.php/OggText#Categories_of_Text_Codecs .

TDHT may be similar to W3C TimedText DFXP in many respects, but in comparison to DFXP it does not re-invent HTML, CSS and effects, but rather uses existing HTML, CSS and javascript for these. The purpose of DFXP is to create a web-independent exchange format for timed text, which is why it cannot directly be specified as a subpart of HTML.

TDHT in contrast is HTML with a minimum number of changes. TDHT is parsable by any HTML parser. It works with CSS and javascript. No new functionality has to be defined for TDHT.

= File Extension =

Files in this format are to be of text/html mime type since they are valid html files, apart from some extra attributes.

Files in this format should have a file extension of .tdht to separate them from plain html files.

= The TDHT format changes from HTML =

TDHT files are time-aligned text. This means there is a time association with blocks of text and there is time-based seeking functionality on those blocks of text.

Here is an example TDHT file for subtitles:

<pre>
<html>
<head>
<title>Desperate Housewives - Season 5, Episode 6</title>
</head>
<body>
<div start="00:00:00.070" end="00:00:02.270">
<p>Previously on...</p>
</div>
<div start="00:00:02.280" end="00:00:04.270">
<p>We had an agreement to keep things casual.</p>
</div>
<div start="00:00:04.280" end="00:00:06.660">
<p>Susan made her feelings clear.</p>
</div>
<div start="00:00:06.800" end="00:00:10.100">
<p>So if I was with another woman, that wouldn't bother you? No, it wouldn't.</p>
</div>
</body>
</html>
</pre>

The differences of TDHT from HTML are described using [http://www.w3.org/TR/html401/ HTML4.01], but the changes apply the same to [http://www.whatwg.org/specs/web-apps/current-work/ HTML5], which doesn't have a normative schema.

The following changes to HTML are made for TDHT:

== 1. The body element ==

In HTML4.01, the [http://www.w3.org/TR/html401/struct/global.html#h-7.5 body element] is defined as follows:

<pre>
<!ELEMENT BODY O O (%block;|SCRIPT)+ +(INS|DEL) -- document body -->
<!ATTLIST BODY
%attrs; -- %coreattrs, %i18n, %events --
onload %Script; #IMPLIED -- the document has been loaded --
onunload %Script; #IMPLIED -- the document has been removed --
>
</pre>

In TDHT1.0 we restrict body to just contain a sequence of div tags:

<pre>
<!ELEMENT BODY O O (DIV)+ -- document body -->
<!ATTLIST BODY
%attrs; -- %coreattrs, %i18n, %events --
onload %Script; #IMPLIED -- the document has been loaded --
onunload %Script; #IMPLIED -- the document has been removed --
>
</pre>

Any tags inside the body tag that are non-conformant to this specification (such as regular html tags that are allowed inside body) must be ignored for TDHT.

The div tags, however, can contain anything that HTML div tags can contain, thus enabling a very flexible, but time-aligned text model.

== 2. The div element ==

In HTML, the [http://www.w3.org/TR/html401/struct/global.html#h-7.5.4 div element] is defined as follows:

<pre>
<!ELEMENT DIV - - (%flow;)* -- generic language/style container -->
<!ATTLIST DIV
%attrs; -- %coreattrs, %i18n, %events --
>
</pre>

In TDHT1.0 we extend it with start and end time attributes:

<pre>
<!ELEMENT DIV - - (%flow;)* -- generic language/style container -->
<!ATTLIST DIV
%attrs; -- %coreattrs, %i18n, %events --
start %Time; #IMPLIED -- start time
end %Time; #IMPLIED -- end time
>
</pre>

The Time entity represents a valid time string according to HTML5: http://www.whatwg.org/specs/web-apps/current-work/#valid-time-string . The end time must be later than the start time, otherwise the div element does not exist for any duration and can never turn active.

<div> elements in a TDHT file should be ordered by start time to simplify parsing. Inside Ogg or when rendered, they will be ordered by start time.
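To illustrate the start/end rule, here is a minimal Python sketch. It assumes a simplified time string of the form HH:MM, HH:MM:SS or HH:MM:SS.fff, as used in the example file above; the function names are hypothetical and are not part of any TDHT implementation.

```python
import re

# Simplified pattern: HH:MM, HH:MM:SS or HH:MM:SS.fff (1 to 3 fraction digits).
_TIME_RE = re.compile(r"^(\d{2}):(\d{2})(?::(\d{2})(?:\.(\d{1,3}))?)?$")


def parse_time_ms(value):
    """Convert a time string such as "00:00:02.270" to milliseconds."""
    match = _TIME_RE.match(value)
    if match is None:
        raise ValueError("not a valid time string: %r" % (value,))
    hours, minutes, seconds, fraction = match.groups()
    minutes = int(minutes)
    seconds = int(seconds or 0)
    if minutes > 59 or seconds > 59:
        raise ValueError("minutes and seconds must be below 60: %r" % (value,))
    # Pad the fraction to three digits so that ".8" means 800 ms.
    millis = int((fraction or "0").ljust(3, "0"))
    return ((int(hours) * 60 + minutes) * 60 + seconds) * 1000 + millis


def can_become_active(start, end):
    """A div only exists for some duration if its end is after its start."""
    return parse_time_ms(end) > parse_time_ms(start)
```

Working in integer milliseconds avoids floating point rounding when comparing start and end times.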

= Rendering in a Web Browser =

A TDHT file is meant to be associated with an audio or video file and rendered in a Web browser in sync with the audio or video file.

The TDHT file's div elements are not rendered into an existing HTML page, but rather a TDHT file creates its own [http://www.whatwg.org/specs/web-apps/current-work/#the-iframe-element iframe-like] new nested browsing context. It is linked to the parent HTML page through an itext element that is inserted as a child of the video element. Creation of a nested browsing context is important because a TDHT file can come from a different URI than the Web page and thus for security reasons and for general base URI computations a nested browsing context is the better approach with the DOM nodes of the hosting page and the DOM nodes of the TDHT document in different owner documents. That way, the hosting document has the security origin of its own URL and the TDHT document has the security origin of its URL.

The rendering and CSS view port are either by default the rectangle occupied by the given <video> or <audio> tag, or an area provided for by the hosting HTML page through the itext element's properties. The zoom factor of the iframe must be set to such a value that the width of the view port established by the itext frame is equally wide in CSS px as the video frame is wide in codec pixels. (Example: If the video encodes a frame that is 240 pixels wide but is displayed at 480 CSS px wide, the zoom factor of the itext frame should be 200% so that the box that on the outside measures 480 px seems like a box of 240 px from within the itext frame.)
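The zoom rule is a simple ratio. The following Python sketch (the function name is hypothetical) reproduces the example figures given above:

```python
def itext_zoom_percent(codec_width_px, displayed_width_css_px):
    """Zoom factor, in percent, for the itext frame.

    codec_width_px: width of the encoded video frame in codec pixels.
    displayed_width_css_px: width at which the video is displayed, in CSS px.
    """
    if codec_width_px <= 0:
        raise ValueError("codec width must be positive")
    return 100.0 * displayed_width_css_px / codec_width_px
```

For a 240 pixel wide video displayed at 480 CSS px, this gives the 200% of the example.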

The itext frame is by default transparent.

A TDHT file can get to a browser either as an external resource, or as part of an audio or video resource (in particular inside Ogg, see below). Parsing in these two cases is slightly different for the browser.

For the external TDHT file case:
The TDHT file is parsed using the HTML5 parsing algorithm in its normal mode into a non-rendered DOM. To render a div, the children of the div would be cloned into the body of the rendering shell document (replacing possible previous children of body).

For the Ogg-internal TDHT case:
To multiplex an external TDHT file into Ogg, each div with its innerHTML would be placed into a data packet and the head data into an Ogg header. For decoding, the rendering shell document is set up and the head tag is included from the Ogg headers. To render a packet, the div and its innerHTML would be added to the innerHTML of the body element of the rendering shell document as they come. This will use the HTML fragment parser.

As the browser plays the video, it must render the TDHT <div> tags in sync. As the start time of a <div> tag is reached, the <div> tag is made active, and it is made inactive as the <div> tag's end time is reached. If no start time is given, the start is assumed to be 0, and if no end time is given, end is assumed to be infinity.

An "active" <div> tag may be a <div> tag that is being displayed ("display: block") in contrast to an "inactive" <div> tag, which may not be displayed ("display: none"). For some text formats however the difference between "active" and "inactive" may be a background colour or the display location on screen or some other mechanism. The default should be between "block" and "none", but changeable through CSS.

As the browser has parsed the TDHT file or its constituent <div> tags, it is expected to keep the structure in memory. When seeking happens on the video, it can then decide upon which <div> tags are supposed to be active at the seek time and display these. [There is a discussion to be had here about the effect this has on the DOM. Different selectors may apply to a caption depending on whether the video was played back all the way there or seeking skipped over data to get there. It was suggested that inactive captions should be removed from the DOM, so there's always a well-defined small unambiguous DOM to match selectors against. However, this may for example not be desirable on some text display formats.]
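The selection of active <div> tags at a given playback or seek time can be sketched as follows. This is a minimal Python illustration, not browser code: each div is modelled as a plain dictionary with optional "start" and "end" values in seconds, which is an assumption made for the example.

```python
import math


def active_divs(divs, t):
    """Return the divs that are active at time t (in seconds).

    A missing start counts as 0 and a missing end as infinity.
    A div becomes active when its start time is reached and
    inactive when its end time is reached.
    """
    result = []
    for div in divs:
        start = div.get("start", 0.0)
        end = div.get("end", math.inf)
        if start <= t < end:
            result.append(div)
    return result
```

Because the check only depends on the target time, the same function serves both normal playback and seeking.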

= Encapsulation into Ogg =

The [http://wiki.xiph.org/index.php/OggText OggText] specification is used to encapsulate a TDHT file into Ogg.

The codec-specific header data for the OggText ident header is the <head>..</head> part of the TDHT file. The complete <head> tag including all its subtags is encoded into the ident header unchanged.

The <div> elements with all their inner HTML are the data packets of the TDHT text codec and are thus encapsulated into the data packets as text codec data. A complete <div> including all its subtags is encoded into one data packet each.

= Direct linking on a HTML5 page =

Often, subtitles and other time-aligned text files are not actually provided inside a video stream (e.g. inside Ogg), but are referenced as a separate partner resource to a video.

To allow association of such files with a <video> or <audio> element, we propose the following approach:

<pre>
<video id="video" src="http://example.com/video.ogv" controls>
<itext id="caption1" category="CC" lang="en/us" src="caption.srt" style=""></itext>
<itext id="caption2" category="CC" lang="de/de" src="caption.tdht" style=""></itext>
<itext id="subtitle1" category="SUB" lang="de/de" src="german.dfxp" style=""></itext>
<itext id="subtitle2" category="SUB" lang="jp" src="japanese.smil" style=""></itext>
<itext id="subtitle3" category="SUB" lang="fr" src="translation_webservice/fr/caption.srt" style=""></itext>
</video>
</pre>

Notice the second set of closed captions being a TDHT file.

The id attribute is simply a unique identifier for the element.
The category attribute is taken from the [http://wiki.xiph.org/index.php/OggText#Categories_of_Text_Codecs Ogg text categories].
The lang attribute contains a natural language according to [http://en.wikipedia.org/wiki/Language_code language codes].
The src attribute contains the URI of the actual file that we are after.
The style attribute allows styling to be attached to marked-up import files.

The <itext> element would act like an <iframe> element and create the nested browsing context described earlier. It has been renamed from earlier mentions of this approach from <text> to <itext> to avoid name clashes with e.g. SVG.

The user agent would then provide an interface such as:

interface MediaItextElement : HTMLElement {
attribute DOMString src;
attribute DOMString category;
attribute DOMString lang;
attribute DOMString id;
attribute DOMString style;
};

In javascript there will need to be additional functions such as:

getItext (): returns an array of time-aligned text elements
addItext({src,category,lang,style,name}): adds a time-aligned text element to a <video> or <audio> element
enable(itextElement): activates display of an itext file
disable(itextElement) : deactivates display of an itext file
delay(itextElement, seconds) : delays the itext file in relation to the video file by a positive or negative number of seconds

OggKate

2010-06-02T10:33:50Z

Ogg.k.ogg.k: Contrary to popular belief, it has nothing to do with it.

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is an overlay codec, originally designed for karaoke and text, that can be
multiplexed in Ogg. Text and images can be carried by a Kate stream, and animated.
Most of the time, this would be multiplexed with audio/video to carry subtitles,
song lyrics (with or without karaoke data), etc, but doesn't have to be.

Series of curves (splines, segments, etc) may be attached to various properties
(text position, font size, etc) to create animated overlays. This allows scrolling
or fading text to be defined. This can even be used to draw arbitrary shapes, so
hand drawing can also be represented by a Kate stream.

Example uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed), and lyrics,
as created by oggenc, from vorbis-tools.

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on - I'd been quicker to write Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this. Besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex. It is being "repurposed" as timed text now, bringing it closer to what I'm doing.
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Style definitions header: a list of predefined styles to be referred to by data packets
*Region definitions header: a list of predefined regions to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*repeats: a verbatim repeat of a text packet's payload, in order to bound any backward seeking needed when starting to play a stream partway through. These are also optional.
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This would be to help seeking code
find a kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it's timed by start granule, not end granule (unlike Theora and Vorbis).
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see the granule encoding section.
*The EOS packet should have a granule pos higher or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (though old decoders will correctly parse, but ignore the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(eg, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (eg, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
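The two checks described above (signature comparison and the MSB test) can be sketched in a few lines of Python. The function names are hypothetical; only the byte values come from the text.

```python
# "\200" in octal is 0x80, the ID header packet type.
KATE_SIGNATURE = b"\x80kate\x00\x00\x00"


def is_kate_bos_packet(packet):
    """True if the first eight bytes match the Kate ID header signature."""
    return packet[:8] == KATE_SIGNATURE


def is_header_packet(packet):
    """True for header packets (type MSB set), False for data packets."""
    if not packet:
        raise ValueError("empty packet")
    return bool(packet[0] & 0x80)
```

Note that only header packets carry the magic, so data packets can only be classified by their type byte.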

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the ID header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x02 repeat
::0x7f end packet (EOS)
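The ordering rules above can be checked on a sequence of packet type bytes. The following Python sketch is only an illustration: the function name is hypothetical, and it assumes that "in order" means ascending packet type values, which the text does not state explicitly.

```python
def check_packet_order(types):
    """Return a list of problems found in a sequence of packet type bytes."""
    problems = []
    if not types or types[0] != 0x80:
        problems.append("first packet is not the ID header (0x80)")
    seen_data = False
    last_header = -1
    last_data = None
    for index, ptype in enumerate(types):
        if ptype & 0x80:  # MSB set: header packet
            if seen_data:
                problems.append(
                    "header 0x%02x at index %d follows a data packet" % (ptype, index))
            elif ptype <= last_header:
                # Assumption: headers are expected in ascending type order.
                problems.append(
                    "header 0x%02x at index %d is out of order" % (ptype, index))
            last_header = max(last_header, ptype)
        else:  # MSB clear: data packet
            seen_data = True
            last_data = ptype
    if last_data is not None and last_data != 0x7F:
        problems.append("last data packet is not the end packet (0x7f)")
    return problems
```

An empty result means no violation of these particular rules was found; it does not validate the packet contents.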

The format described here is for bitstream version 0.x.
As of 19 December 2008, the latest bitstream version is 0.4.

For more detailed information, refer to the format documentation
in libkate (see URL below in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

0 1 2 3 |
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| packtype | Identifier char[7]: 'kate\0\0\0' | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| kate magic continued | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | version major | version minor | num headers | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| text encoding | directionality| reserved - 0 | granule shift | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| cw sh | canvas width | ch sh | canvas height | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate numerator | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate denominator | 28-31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (NUL terminated) | 32-35
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 36-39
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 40-43
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 44-47
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (NUL terminated) | 48-51
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 52-55
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 56-59
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 60-63
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously will not accommodate language tags
with lots of subtags.

Category is currently loosely defined, and I haven't yet found a nice way to
present it in a generic way, but it is meant for automatic classification of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).
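As an illustration of the layout above, here is a minimal Python sketch that pulls the byte aligned fields out of a 64 byte ID header. The function name is hypothetical. It assumes that the two 32 bit granule rate fields are stored little-endian, which is not stated in this section and should be checked against the libkate format documentation. The canvas size fields are left undecoded, since their exact bit layout is not described here.

```python
import struct


def parse_id_header(packet):
    """Extract the byte aligned fields of a Kate ID header (packet type 0x80)."""
    if len(packet) < 64:
        raise ValueError("ID header must be at least 64 bytes")
    if packet[0] != 0x80 or packet[1:8] != b"kate\x00\x00\x00":
        raise ValueError("not a Kate ID header")

    # Assumption: 32 bit integers are little-endian.
    gps_numerator, gps_denominator = struct.unpack_from("<II", packet, 24)

    def cstring(raw):
        # Language and category are NUL terminated ASCII strings.
        return raw.split(b"\x00", 1)[0].decode("ascii")

    return {
        "version": (packet[9], packet[10]),
        "num_headers": packet[11],
        "text_encoding": packet[12],
        "directionality": packet[13],
        "granule_shift": packet[15],
        "gps_numerator": gps_numerator,
        "gps_denominator": gps_denominator,
        "language": cstring(packet[32:48]),
        "category": cstring(packet[48:64]),
    }
```

The byte offsets are taken directly from the "Byte" column of the diagram.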

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay
*Cortado (wikimedia version)
*vorbis-tools

I have patches for the following with Kate support:
*MPlayer
*xine
*GStreamer
*Thoggen
*Audacious
*and more...

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo,
though it is not quite yet API stable (though no major changes are expected).

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
- a base granule, in the high bits
- a granule offset, in the low bits

+----------------+----------------+
| base | offset |
+----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The timestamp T of a given Kate packet is split into a base B and
offset O, and these are stored in the granulepos of that packet.
The split is done such that the B is the time of the earliest event
still active at the time, and the O is the time elapsed between B
and T. Thus, T = B + O. This mimics the way Theora stores its own
timestamps in granulepos, where the base acts as a keyframe, and
an offset acts as the position of an inter frame from the previous
keyframe. Since Kate allows time overlapping events, however, the
choice of the base to use is slightly more complex, as it may not
be the starting time of the previous event, if the stream contains
time overlapping events.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
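The base/offset split can be sketched in Python as follows. The function names are hypothetical, and the conversion to seconds assumes that the granule rate is gps_numerator / gps_denominator granules per second, so that a granule count is divided by that rate to obtain a time.

```python
from fractions import Fraction


def split_granulepos(granulepos, granule_shift):
    """Split a granulepos into (base, offset) using granule_shift low bits."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset


def make_granulepos(base, offset, granule_shift):
    """Combine base and offset; the offset must fit in granule_shift bits."""
    if offset < 0 or offset >= (1 << granule_shift):
        raise ValueError("offset does not fit in granule_shift bits")
    return (base << granule_shift) | offset


def granulepos_to_seconds(granulepos, granule_shift, gps_numerator, gps_denominator):
    """Timestamp T = B + O, converted from granule units to seconds."""
    base, offset = split_granulepos(granulepos, granule_shift)
    return Fraction((base + offset) * gps_denominator, gps_numerator)
```

For example, with a granule shift of 32 and a rate of 1000/1, a base of 5000 and an offset of 1500 correspond to a packet at 6.5 seconds whose earliest still active event started at 5 seconds.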

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit values at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definitions, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by the means of a series of curves (static points, segments, splines (catmull-rom, bezier, and b-splines)).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have attached semantics so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting
at the right time and for the right duration a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.
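The pause trick follows directly from linear interpolation, as this small Python sketch shows. It is a generic two point interpolation, not libkate's curve code, and the function name is hypothetical.

```python
def linear_curve_point(p0, p1, t):
    """Point on a two point linear curve, for t between 0 and 1."""
    return (p0[0] + (p1[0] - p0[0]) * t,
            p0[1] + (p1[1] - p0[1]) * t)
```

When both points are equal, the difference terms are zero and the result is the same position for every t, so the motion stays put for the duration of that curve.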

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.
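The idea behind the frame and region mappings can be sketched as a scaling of 0-1 coordinates to a pixel rectangle. This Python example is only an illustration; the function name and the (left, top, width, height) rectangle layout are assumptions made for the example.

```python
def map_to_area(point, area):
    """Scale a point with 0-1 coordinates to a pixel rectangle.

    area is (left, top, width, height) in pixels. For a frame mapping this
    would be the video frame, for a region mapping the current region.
    """
    x, y = point
    left, top, width, height = area
    return (left + x * width, top + y * height)
```

The same curve can thus be reused for any frame or region size, since only the rectangle changes.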

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary, if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially when considering that the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I very much dislike
editing XML by hand, as it's really hard to read. XML is really meant for machines to
generically parse text data in a shared syntax but with possibly unknown semantics, and I need those
text representations to be easily editable.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from a style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place where a subtitle has been started, but is not removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the original seeked granule. This implies support code on players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, of which I am not certain.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.
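
The "reference frame" option above can be illustrated with a minimal Python sketch of the granule splitting. The 32-bit split (<code>GRANULE_SHIFT</code>) and all function names are assumptions made for the example; they are not taken from the Kate bitstream specification or from libkate.

```python
# Hypothetical split: how many low bits hold the delta to the current position.
# This value is an assumption for the example, not one defined by Kate.
GRANULE_SHIFT = 32


def pack_granule(base: int, offset: int, shift: int = GRANULE_SHIFT) -> int:
    """Combine a reference frame position (high bits) and a delta (low bits)."""
    if base < 0 or offset < 0 or offset >= (1 << shift):
        raise ValueError("base or offset out of range")
    return (base << shift) | offset


def split_granule(granulepos: int, shift: int = GRANULE_SHIFT):
    """Return (reference frame position, delta to the current position)."""
    return granulepos >> shift, granulepos & ((1 << shift) - 1)


def reference_granule(granulepos: int, shift: int = GRANULE_SHIFT) -> int:
    """Clear the low bits, yielding the granule of the previous reference frame."""
    return granulepos & ~((1 << shift) - 1)


def absolute_position(granulepos: int, shift: int = GRANULE_SHIFT) -> int:
    """Position of the packet itself: reference frame position plus delta."""
    base, offset = split_granule(granulepos, shift)
    return base + offset
```

A seek to a given granule position would then target <code>reference_granule(granulepos)</code>, and the player would decode forward from that sync point up to the requested position.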

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
text encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.

=== Bitstream format for floating point values ===

Floating point values are turned to a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the floating point values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over 32 bits raw writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
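
As a rough illustration of the scheme described above, here is a minimal Python sketch that quantizes values to 16.16 fixed point and works out the shared head and tail zero bit counts. It is not the libkate implementation: the function names are made up for the example, and handling the sign as a separate flag is an assumption, since the text above does not say how negative values are stored.

```python
from typing import List, Tuple

FRACTION_BITS = 16   # the ".16" part of 16.16
WORD_BITS = 32       # total width of a 16.16 value


def to_fixed(x: float) -> int:
    """Quantize a float to 16.16 fixed point (this is where precision is lost)."""
    return int(round(x * (1 << FRACTION_BITS)))


def from_fixed(v: int) -> float:
    """Convert a 16.16 fixed point integer back to a float."""
    return v / (1 << FRACTION_BITS)


def plan_packing(values: List[float]) -> Tuple[int, int, List[bool], List[int]]:
    """Return (head zero bits, tail zero bits, sign flags, remainder bits).

    head and tail would be written once; each remainder then only needs
    WORD_BITS - head - tail bits. The sign flags are an assumption of this sketch.
    """
    fixed = [to_fixed(v) for v in values]
    signs = [f < 0 for f in fixed]
    magnitudes = [abs(f) for f in fixed]
    combined = 0
    for m in magnitudes:
        if m >= (1 << WORD_BITS):
            raise ValueError("value out of range for 16.16 fixed point")
        combined |= m
    if combined == 0:
        # All values are zero: no remainder bits are needed at all.
        return WORD_BITS, 0, signs, magnitudes
    head = WORD_BITS - combined.bit_length()
    tail = (combined & -combined).bit_length() - 1
    return head, tail, signs, [m >> tail for m in magnitudes]


def unpack(tail: int, signs: List[bool], remainders: List[int]) -> List[float]:
    """Rebuild the (quantized) floats from the packed description."""
    return [from_fixed(-(r << tail) if s else (r << tail))
            for s, r in zip(signs, remainders)]
```

For values such as 0.5, 1.25 and 3.0, only four remainder bits per value are needed instead of 32, which is where the space saving comes from.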

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly to let them choose which streams to turn on and off.

Since categories are meant primarily for a machine to parse, they will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics

Please remember the 15 character limit if proposing other categories.

Note that the list of categories is subject to change, and will likely
be replaced by new, more "identifier like" ones. The three ones above,
however, would be kept for backward compatibility as they're already used.
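
The size and character set constraints on the category field can be checked with a few lines of Python. This is only a sketch: <code>encode_category</code> is a made-up helper, not a libkate function, and padding the unused bytes with zeros is an assumption.

```python
CATEGORY_FIELD_SIZE = 16  # bytes in the BOS packet, including the terminating zero


def encode_category(name: str) -> bytes:
    """Return the category as a 16 byte, zero terminated ASCII field.

    Raises UnicodeEncodeError for non-ASCII names and ValueError for names
    longer than 15 characters. Zero padding is an assumption of this sketch.
    """
    raw = name.encode("ascii")
    if b"\0" in raw:
        raise ValueError("category must not contain a zero byte")
    if len(raw) > CATEGORY_FIELD_SIZE - 1:
        raise ValueError("category is limited to 15 characters")
    return raw.ljust(CATEGORY_FIELD_SIZE, b"\0")
```

All three proposed categories fit: the longest, "spu-subtitles", is 13 characters.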

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data, and could also allow things like e-books which can be
either read or listened to from the same bitstream (I have seen no reference to this
being used anywhere, but I see no reason why the granule progression should be temporal,
and not user controlled, such as by using a "next" button which would bump a granule
position by a preset amount, simulating turning a page (this would be close to necessary
for text-to-speech, as the wall time duration of the spoken speech is not known in
advance to the Kate encoder, and can't be mapped to a time based granule progression)).
All text strings triggered consecutively between the two granule positions would then
be read in order.

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts, (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

I am also investigating SVG images. These allow for very small footprint images for simple
vector drawings, and could be very useful for things like background gradients below text.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder (kateenc) and a decoder (katedec) are included in the tools directory.
The encoder supports input from several different formats:
* a custom text based file format (see [[#The Kate file format|The Kate file format]]), which is by no means meant to be part of the Kate bitstream specification itself
* SubRip (.srt), the most common subtitle format I found
* LRC lyrics format.

As an example for the widely used SRT subtitles format, the following command line
creates a Kate subtitles stream from an SRT file:

kateenc -l en -c subtitles -t srt -o subtitles.ogg subtitles.srt

The reverse is possible, to recover an SRT file from a Kate stream, with katedec.

Note that the subtitles.ogg file should then be multiplexed into the A/V stream,
using either ogg-tools or oggz-tools.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, what motions may be applied to events may not be known in advance (for instance, for a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I was going that way. Still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

=== Packaging ===

It would be really nice to have packages for libkate/libtiger for many distros.

If you're a packager for a distro which doesn't yet have packages for libkate
or libtiger, please consider helping :)

In particular, packages for Debian would be grand.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the data packets themselves concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.
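
The layout above can be unpacked with a short routine. The following Python sketch splits Matroska private data into its header packets using Xiph style lacing (each length is a run of 255-valued bytes followed by one byte below 255, all summed). The function name is made up for the example, and error handling is kept minimal.

```python
from typing import List


def split_xiph_laced(private: bytes) -> List[bytes]:
    """Split xiph-laced codec private data into its packets."""
    if not private:
        raise ValueError("empty private data")
    packet_count = private[0] + 1      # byte 0 stores the count minus one
    pos = 1
    sizes = []
    # Only the first packet_count - 1 lengths are stored.
    for _ in range(packet_count - 1):
        size = 0
        while True:
            if pos >= len(private):
                raise ValueError("truncated lacing data")
            byte = private[pos]
            pos += 1
            size += byte
            if byte != 255:            # a byte below 255 ends this length
                break
        sizes.append(size)
    # The last packet takes whatever is left after the others.
    last_size = len(private) - pos - sum(sizes)
    if last_size < 0:
        raise ValueError("packet sizes exceed private data size")
    sizes.append(last_size)
    packets = []
    for size in sizes:
        packets.append(private[pos:pos + size])
        pos += size
    return packets
```

Because the number of Kate headers is not fixed, a caller should rely on the count stored in byte 0 rather than expecting three packets as with Vorbis or Theora.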

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== HOWTOs ==

These paragraphs describe a few ways to use Kate streams:

=== Text movie subtitles ===

Kate streams can carry Unicode text (that is, text that can represent
pretty much any existing language/script). If several Kate streams are
multiplexed along with a video, subtitles in various languages can be
made for that movie.

An easy way to create such subtitles is to use ffmpeg2theora, which
can create Kate streams from SubRip (.srt) format files, a simple but
common text subtitles format. ffmpeg2theora 0.21 or later is needed.

At its simplest:

ffmpeg2theora -o video-with-subtitles.ogg --subtitles subtitles.srt \
video-without-subtitles.avi

Several languages may be created and tagged with their language code
for easy selection in a media player:

ffmpeg2theora -o video-with-subtitles.ogg video-without-subtitles.avi \
--subtitles japanese-subtitles.srt --subtitles-language ja \
--subtitles welsh-subtitles.srt --subtitles-language cy \
--subtitles english-subtitles.srt --subtitles-language en_GB

Alternatively, kateenc (which comes with the libkate distribution) can
create Kate streams from SubRip files as well. These can then be merged
with a video with oggz-tools:

kateenc -t srt -c SUB -l it -o subtitles.ogg italian-subtitles.srt
oggz merge -o movie-with-subtitles.ogg movie-without-subtitles.ogg subtitles.ogg

This second method can also be used to add subtitles to a video which
is already encoded to Theora, as it will not transcode the video again.

=== DVD subtitles ===

DVD subtitles are not text, but images. Thoggen, a DVD ripper program,
can convert these subtitles to Kate streams (at the time of writing,
Thoggen and GStreamer have not applied the necessary patches for this
to be possible out of the box, so patching them will be required).

When configuring how to rip DVD tracks, any subtitles will be detected
by Thoggen, and selecting them in the GUI will cause them to be saved as
Kate tracks along with the movie.

=== Song lyrics ===

Kate streams carrying song lyrics can be embedded in an Ogg file. The
oggenc Vorbis encoding tool from the Xiph.Org Vorbis tools allows lyrics
to be loaded from a LRC or SRT text file and converted to a Kate stream
multiplexed with the resulting Vorbis audio. At the time of writing,
the patch to oggenc was not applied yet, so it will have to be patched
manually with the patch found in the diffs directory.

oggenc -o song-with-lyrics.ogg --lyrics lyrics.lrc --lyrics-language en_US song.wav

So called 'enhanced LRC' files (containing extra karaoke timing information)
are supported, and a simple karaoke color change scheme will be saved
out for these files. For more complex karaoke effects (such as more
complex style changes, or sprite animation), kateenc should be used with
a Kate description file to create a separate Kate stream, which can then
be merged with a Vorbis only song with oggz-tools:

oggenc -o song.ogg song.wav
kateenc -t kate -c LRC -l en_US -o lyrics.ogg lyrics-with-karaoke.kate
oggz merge -o song-with-karaoke.ogg lyrics.ogg song.ogg

This latter method may also be used if you already have an encoded Vorbis song
with no lyrics, and just want to add the lyrics without reencoding.

=== Changing a Kate stream embedded in an Ogg stream ===

If you need to change a Kate stream already embedded in an Ogg stream (eg, you have a movie with subtitles, and you want to fix a spelling mistake, or want to bring one of the subtitles forward in time, etc), you can do this easily with KateDJ, a tool that will extract Kate streams, decode them to a temporary location, and rebuild the original stream after you've made whatever changes you want.

KateDJ (included with the libkate distribution) is a GUI program using wxPython, a Python module for the wxWidgets GUI library, and the oggz tools (both need to be installed separately if they are not already present).

The procedure consists of:

* Run KateDJ
* Click 'Load Ogg stream' and select the file to load
* Click 'Demux file' to decode Kate streams in a temporary location
* Edit the Kate streams (a message box tells you where they are placed)
* When done, click 'Remux file from parts'
* If any errors are reported, continue editing until the remux step succeeds

== Frequently Asked Questions ==

=== Does libkate work on platforms other than Linux? ===

Yes, libkate is not Linux specific in any way. It optionally relies on libogg
and libpng, two libraries widely ported to various platforms.
It has been reported to work on Windows and MacOS X as well as UNIX platforms.

However, libtiger, a rendering library for Kate streams, relies on Pango and Cairo,
which are not easy to build on Windows, though they can be.
The Tiger renderer is however completely separate from libkate, and is not needed
for full encoding and decoding of Kate streams.

=== Where can I find some example files ? ===

The libkate distribution can generate various examples, but prebuilt files
can be found here:
[http://people.xiph.org/~oggk/elephants_dream/elephantsdream-with-subtitles.ogg]
[http://stallman.org/fry/Stephen_Fry-Happy_Birthday_GNU-nq_600px_425kbit.ogv]

These files use raw text only.

[[Category:Drafts]]
[[Category:Ogg Mappings]]

PortablePlayers

2010-04-15T22:29:42Z

Ogg.k.ogg.k: please crap in your own home

== Introduction ==
Here you'll find all mobile players known to natively support [[Vorbis]].

When updating this information, please consider these guidelines: Use the term Vorbis, not <strike>OGG</strike>. Add information about other Xiph codecs such as Speex, FLAC, and Theora. Do not add information about non-Xiph codecs such as MP3, WMA, or WAV.

== Portable Vorbis Native Support Table ==
{| style="font-size: 85%; text-align: center;" class="wikitable sortable"
! Brand
! Model
! Additional Xiph codecs
! FM
! Voice Rec
! Interface
! USB Mass storage
! MTP
! Built-in Capacity (GB)
! Additional Capacity via
! Storage Type
! Estim. battery life
! other
! Estimated price
! In Production?
|-

! SanDisk
! [[PortablePlayers#SanDisk_Sansa_Clip_and_Sansa_Fuze|Sansa Clip]]
| FLAC
| yes
| yes
| USB 2.0
| yes
| yes
| 8
|
| Flash
| 8 h
| 28g weight
| 50 Eur
| ?
|-
! Cowon
! [[PortablePlayers#Cowon.2FiAudio_D2.2C_F2.2C_T2.2C_U3.2C_U2.2C_G3.2C_5.2C_G2.2C_U5.2C_7|iAudio U5]]
| FLAC
| yes
| yes
| USB
| yes
| no
| 8
|
| Flash
| 8 h
|
| 85 USD(?)
| ?
|-
! Trekstor
! [[PortablePlayers#TrekStor.27s_blaxx.2C_iBeat_cody.2C_iBeat_organix_2.0.2C_iBeat_sonix.2C|iBeat Organix 2.0]]
|
|
| yes
| USB 2.0
| yes
| no
| 8
|
| Flash
| 50 h
|
| 50 Eur
| ?
|-
! HTC
! Hero
|
|
| yes
| USB 2.0
| yes
|
| 8
| MicroSD
| Flash
|
|
|
| Yes, as of 2010.04.02
|-
! Rio
! Karma
| FLAC
| no
| no
| USB
| no
|
| 20
|
| HD
|
|
|
| No
|}

@wiki-admins: It looks like this mediawiki instance does not support [http://meta.wikimedia.org/wiki/Help:Sorting#Secondary_sortkey sortable tables] and nice table cell templates (like in [http://en.wikipedia.org/wiki/List_of_BitTorrent_clients#Operating_system_support this example]). Support for these table-related features would really improve this table's layout.

== Flash Memory Storage ==

:in each description, please say if the device works "out of the box" or you have to install any software to use it properly (if the extra-software is optional, then it doesn't matter).

<i>From the information below (see the "Chinese MP4" and "PowerUp!" items), it is possible that all Chinese made [http://en.wikipedia.org/wiki/S1_MP3_Player S1 MP3] and [http://en.wikipedia.org/wiki/Chinese_MP4/MTV_player MP4] players can play the Ogg Vorbis file format, even though their manuals or advertisements do not mention this. Since many tens of millions of these units have been sold worldwide, there is a potentially huge, undocumented, base of portable media players which can play the Ogg Vorbis format. If you have one of these Chinese made players, just give it a try and see. [http://www.ebuyer.com/UK/product/126069 Here] is one cheap unbranded Chinese 1GB mp3 player that supports vorbis.</i><br><br>

<i>In the Brazilian consumer market, there are unbranded MP3 players such as [http://produto.mercadolivre.com.br/MLB-71404870-mp3-player-2-gb-pen-drive-gravador-de-voz-radio-fm-_JM this one] that can flawlessly play Ogg Vorbis files. Many of them are branded as "Sony". I have tested one "Sony" and it does play Ogg Vorbis. If you have one of these players and know that it can play Ogg Vorbis, please report which chipset it is equipped with. Many of these players can also be identified by the marking "MP3/WMA/FM/REC". All of these are basically 1GB/2GB USB pen drives.</i>

=== [http://www.netonnet.se/item.asp?iid=61510 Avant] MP-8256, MP-8512, MP-81000 ===
:No official website, product no longer available for purchase, but three models existed: MP-8256 (256MB memory), MP-8512 (512MB) and MP-81000 (1GB). Some features are a small colour display, 5-band Equalizer, FM-stereo radio, Line in, Microphone, and Charging via USB2.0.

=== [http://www.bang-olufsen.com/page.asp?id=374 Bang & Olufsen] BeoSound 6 ===
:It has 4GB of storage, USB 1.1 and 2.0 support and a small TFT LCD color display. Although advertised as Windows and Mac OS 9.2 and higher only, the device is a Mass Storage device and is perfectly usable in Linux as well. Supports Vorbis quality levels up to Q10. B&O have co-operated with Samsung to develop the device.

=== [http://www.centon.com/ Centon CraZe] ===
:8G model (at least) from Buy.com seems to have either s1mp3 or sigmatel chipset, worked fresh out of box. It is USB rechargeable device with monochrome LCD and multicolored backlight plus FM stereo. Doc only mentions mp3 and wma, not vorbis.

=== [http://linuxdevices.com/news/NS7996764346.html Cool-Karaoke] ===
:The DRM-free Cool-Karaoke supports MP3, OGG, WAV, and FLAC audio formats and MPG, AVI, and FLV video formats. Runs an ARM920t processor clocked to 400MHz, with 4GB and up NAND Flash. Battery charges through USB cable. Built in equalizer allows tuning down the voice frequencies for sing-alongs.

=== [http://en.wikipedia.org/wiki/Chinese_MP4/MTV_player Chinese MP4 players sold on eBay] ===
: I've tried two different MP4 nano lookalikes from different manufacturers and different eBay sellers, and both will play Ogg Vorbis fine, even though none of the documentation or product advertisements say this. Before you buy one, you should check out the eBay FAQ on MP4 players first.

=== Coby MP-C7052 ===
:While it does support Vorbis, buyer beware. Poor ratings at [http://reviews.cnet.com/mp3-players/coby-mp-c7052-512mb/4505-6490_7-32466874.html cnet.com]: "utterly fails at its intended purpose"

=== [http://www.cowonamerica.com Cowon/iAudio] D2, F2, T2, U3, U2, G3, 5, G2, U5, 7 ===
:NOTE: The U3 and 7 both are buggy with Vorbis, in that they exhibit artifacts in the lower frequency range. As of firmware 1.29 on the U3, and 1.17 on the 7, both are broken. Cowon fixed this on the D2 about firmware 2.41 onward, and on the 7 with release 1.18 (29-MAY-2009). By way of a code examination, it appears the U5 does not suffer from this bug (On Cowon players that have the issue, there is a hex string which matches a low precision table. On the ones that do not have the issue, it has the correct normal precision value. This is referring to the Tremor decoder used). Most people describe this as a mild high pitched squeak. See this forum [http://www.cowonamerica.com/forums/showthread.php?t=13253 post] for more details. Some also say the iriver Clix 2 has this issue as well. Cowon on the D2 firmware page does not specifically mention that they fixed this issue.
:The iAudio U2 is a small flash-based player (256MB/512MB/1GB) and supports Vorbis. Early U2 releases required a firmware upgrade for Vorbis support; as of September 2005 this support was included in the retail version. The iAudio G3 and iAudio 5 offer up to 2GB, and support Ogg Vorbis out-of-the-box. The G2 has storage from 256 MB up to 1 GB and supports the same formats. iAudio U3 is Cowon's last candy bar form factor flash-based player with a 5 way navigation control. It also supports FLAC and MPEG-4 video. All these players will talk to Linux or Mac (but the included software is Windows only. You'll need Windows for firmware updates.).
:The G3, and most likely the other models as well, supports Ogg Vorbis from q0. Quality settings q-1 and q-2 (from the aoTuV ogg encoder) are not supported. It supports the meta tags ''album'' (limited length) and ''title''.
:iAudio F2 flash memory, 512MB/1GB/2GB versions supporting Vorbis and FLAC. USB 2.0, supports Linux and Mac (Windows needed for firmware updates).
:iAudio T2 flash memory 1GB/2GB, supports Vorbis. USB 2.0, supports Linux and Mac (Windows needed for firmware updates).
:iAudio 7 is Cowon's current small form factor flash based player with touch controls for most functions and comes in 4, 8 and 16GB versions and supports Vorbis and FLAC. USB 2.0 file transfer, Linux and Mac compatible (including firmware updates). Reading Ogg tags not supported (requires browsing music in 'files' mode rather than in 'tags' mode).
:iAudio D2 comes in 4, 8 or 16GB capacities and can use SD and SDHC flash memory cards, supports music and movies supporting FLAC and Vorbis. USB 2.0 file transfer, Linux and Mac compatible (including firmware updates).
:The ''iAudio U5'' is a player with 8 GB flash and USB (speed is at USB 1.0 level). The player is configurable out of the box as a USB mass storage device or an MTP device. It has been available since early 2008. It supports Ogg Vorbis and FLAC since at least firmware 2.10. A firmware update is possible in mass storage mode, i.e. without additional proprietary software. The firmware is available at the US and Global site. The 2.10 firmware has multi language support, i.e. you can select for example English as language after flashing. Note however, that the FLAC/Vorbis firmware loses support for tag based browsing (as of version 3.16).

=== Craig ===
:Model No. CMP622E. 2GB. Even if the package of this product does not mention .ogg support it does! I bought this at a CVS pharmacy.

=== D-Wave 9830 ===
:Polish player with 2GB of internal memory. Supports Vorbis and has a FM radio, TFT display, ebook reader.

=== [http://www.audiodaihatsu.com.ar/productos.asp?cat=17 Daihatsu] D-Z40, D-Z20, D-Z10 ===
:In Argentina, Daihatsu sells 1, 2 and 4 GB music players that support Vorbis Q0 to Q10 out of the box. I tested the D-Z40. They may be available under a different brand in other places.

=== ENOX EMX-830, EMX-900, EMX-530 ===
:'The lightest and the smallest one among AAA type MP3 players.' Supports MP3, WMA, ASF, WAV, and Ogg Vorbis, has FM tuner, line-in and mic with direct MP3 encoding. Comes with 128/256/512/1024 MB flash memory and USB 2.0 interface. The EMX-900 has up to 1 GB storage and supports the same file formats.

=== [http://www.pyramid.com/Electronics/MP3_And_MP4_Players.aspx Pyramid MP3 & MP4 Players and Player Accessories] MP3/MP4 Players ===
:There is a wide range of [http://www.pyramid.com/Electronics/MP3_And_MP4_Players.aspx MP3 and MP4 Players] to choose from on Pyramid.com and other similar sites.

=== EZAV T2, EMP-600, EMP-500, EMP-400 ===
:All players support Ogg Vorbis, MP3, ASF, and WMA codecs, FM radio recording (FM, voice, and line-in). The EMP-400 has 256MB and 512MB storage. The other players have storage options up to 1GB. The EMP-600 and T2 have full color displays and add support for a proprietary video format.

=== [http://www.fascin8.co.uk/f8/index.php/tevion/mp4/6940/11-mp4/42-6940 Fascin8] 6940 (Tevion) ===
:Sold in the UK at the ALDI supermarket stores, under their brand name "Tevion" the 6940 model is a 2GB multimedia player that can receive DAB radio and has a colour screen for viewing Jpegs and movies. It connects via a USB2 interface, and appears as a mass storage device. It claims to play Vorbis files, and does so without problems. The USB connector at the player end is non-standard, but extra cables can be obtained from the manufacturer.

=== [http://www.gp2x.com/ Gamepark Holdings] GP2X ===
:Linux-based handheld audio/video/game player. Uses SD cards for storage, removable batteries (AA) providing 6-8 hours of music listening.

=== [http://www.grundig.de/ Grundig ] MPaxx 920 ===
:Very small and simple device with 2GB at a low price (about 25 EUR). Although not mentioned anywhere on the homepage or inside the documentation of this device, it is capable of playing also Ogg Vorbis files out of the box. It connects via USB 2.0 cable (with which the internal accumulator is charged) and acts like a mass storage device, which is formatted via FAT32 filesystem.

=== i-BEAD 170, 400, 600 ===
:The i-BEAD 170 & 400 models are small, light flash-based players with built in Lithium-Polymer batteries. They also have OLED displays, and FM & line-in recording. Both are available in 256MB/512MB/1GB and both support Ogg Vorbis after a firmware upgrade. The i-BEAD 600 has up to 2 GB storage and is very small and supports Ogg Vorbis out of the box. PLEASE NOTE: Ogg Vorbis files encoded using pre-1.0 versions of the encoder will not work with these players.

=== [http://www.imedian.co.kr/ iMedian] M-Cody M-20, MX-100, 250, 400, 300, 500, 700 ===
:According to the homepage, they support Ogg Vorbis (besides MP3, WMA (some devices w/ DRM), ASF, WAV). Some come with an FM receiver and USB 2.0, and even work as an IR remote. One has an OLED, the others have colour LCDs. Battery and memory are internal. I infer from a review that the MX-100 is the same as a Rio SU70, but I haven't found any information about that Rio gadget. The M-20 is the newest model, a thin portable in response to the iPod Shuffle. It looks exactly like Maxfield's Max-Sin Touch.

=== [http://www.insignia-products.com/c-22-mp3-player.aspx Insignia] Pilot and Sport ===
:Both are sold by Best Buy and advertised to support Ogg Vorbis. The Pilot supports Ogg Vorbis and GNU/Linux out of the box. Haven't tried the Sport. 2GB, 4GB and 8GB models available. The Sport does not support any tags. Ogg files can not be used in playlists. Ogg files can not be shuffled. Thus, there is no way to order the files. Windows shows an error that the format is not supported when dragging over ogg files to the player. All also support bluetooth.

=== [http://www.iops.co.kr/enghome/index.html Iops] X7, Z5, Z3, F5, F4, MFP-312, MFP-325, MFP-350 ===
:Newer players offer video and photo support (X7, Z5, F5). Iops offers the MFP-300 series player with 128/256/512MB/1GB internal flash memory. They offer voice and FM radio recording whilst maintaining a lightweight portable size.

=== [http://www.iriver.com/ iRiver's] E100, iFP-3xx, iFP-5xx, iFP-7xx, iFP-8xx, iFP-9xx, iFP-10xx, iFP-11xx, Lplayer, T7, T10, T20, T30, T50, T60, U10, Clix, Clix2, X20 ===
:iRiver has a huge line of flash-based players with various memory sizes (128MB to 2GB). Some of these players may need updated firmware in order to play Ogg Vorbis files; see the [http://www.iriveramerica.com/support/ support download page] for that. Note: on older players, only certain bitrates are supported. Various problems are reported, including reboots, silence and random noise, when a VBR Vorbis file passes outside the limit (either under 96 kbps or over 225 kbps). Newer players don't have this limitation. However, be aware that many of the newer players, such as the Clix, use the Microsoft MTP transfer protocol exclusively, so they only work with Windows, whereas other players may be shipped with MTP but have alternate non-MTP firmware available for download. Tag support is not present on the U10/Clix (others also?), so Vorbis files will appear under 'unknown artist'/'unknown album'. Please note that the H10 model does not (yet?) support Ogg, and can operate in both MTP and UMS (mass storage) modes. [http://easyh10.sf.net./ More information]. The T50 and T60 players are confirmed to support Ogg Vorbis, use UMS and have complete tag support out of the box.
** The iRiver Clix 4GB ('''not''' the iRiver Clix gen 2) available at [http://www.bhphotovideo.com/ bhphotovideo.com] supports Ogg Vorbis audio and metadata (artist/album/song names). The following notes apply:
*** The latest firmware, 2.6.0.0, was installed during the test. It is not known whether or not this is required for Ogg Vorbis support.
*** Windows XP SP2 with Windows Media Player 11 (or later) is absolutely required. '''Windows Media Player 10 will not work.'''
*** MTP is the only method to access the device. '''UMS will not work.'''
*** Once Windows Media Player 11 has been installed, other programs such as Windows Explorer or Winamp can be used to load Vorbis songs normally.
*** Do not confuse the iRiver Clix with the iRiver Clix gen 2. These notes apply only to the iRiver Clix.
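
As a rough pre-flight check for the bitrate limits mentioned above, the nominal bitrate stored in a file's Vorbis identification header can be read before copying the file to an older player. The following is only a minimal sketch in Python, not a tool from iRiver or Xiph.Org: the function names are made up for illustration, and the 96-225 kbps defaults are simply the figures reported above. The nominal bitrate is a hint written by the encoder, so a VBR file may still stray outside the range during playback.

```python
import struct


def read_vorbis_ident(data):
    """Parse the Vorbis identification header from the first Ogg page.

    `data` is the start of an Ogg Vorbis file (the first 4 KB is plenty).
    """
    if len(data) < 27 or data[:4] != b"OggS":
        raise ValueError("not an Ogg stream")
    n_segments = data[26]              # number of entries in the segment table
    packet = data[27 + n_segments:]    # payload starts right after that table
    if packet[:7] != b"\x01vorbis":
        raise ValueError("first packet is not a Vorbis identification header")
    # Little-endian: version, channels, sample rate, max/nominal/min bitrate.
    version, channels, rate, br_max, br_nom, br_min = struct.unpack_from(
        "<IBIiii", packet, 7)
    return {
        "version": version,
        "channels": channels,
        "sample_rate": rate,
        "bitrate_maximum": br_max,
        "bitrate_nominal": br_nom,
        "bitrate_minimum": br_min,
    }


def nominal_bitrate_in_range(info, low=96_000, high=225_000):
    """True if the encoder's nominal bitrate (bits/s) lies within [low, high]."""
    nominal = info["bitrate_nominal"]
    return nominal > 0 and low <= nominal <= high


def check_file(path):
    """Convenience wrapper: read the header of `path` and test the range."""
    with open(path, "rb") as f:
        return nominal_bitrate_in_range(read_vorbis_ident(f.read(4096)))
```

Calling <code>check_file("song.ogg")</code> (the file name is a placeholder) returns <code>False</code> for files whose nominal bitrate is unset or outside the range, which is a reasonable cue to re-encode before loading an older player.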

=== [http://www.jensofsweden.com/ Jens Of Sweden's] MP-120, MP-130, MP-400, MP-450, MP-500 ===
:The MP-130 is a portable player with flash memory in 128/256/512MB sizes. This appears to be a rebranded Iops player. The MP-400 is a tiny machine with lots of features (line in, mic, FM radio, USB 2.0). With the updated 4.1 firmware it supports Ogg Vorbis files encoded with libvorbis version 1.0rc2 or later. When trying to play files encoded with earlier versions it freezes on playback, and needs a USB connection or a press of the reset button (through a tiny hole) to wake up again. The MP-120, a 1 GB flash player, has supported Ogg Vorbis with a firmware upgrade since March 2005. The MP-120 still doesn't play old Ogg Vorbis files, but they don't make it freeze up. The MP-450 is basically an MP-400 with color o

=== [http://www.jnc-digital.com/Eng/ JNC's] SSF-2002, SSF-2005 ===
:These are flash-based players with 256 MB and 512 MB of storage respectively. They have the usual FM radio, which can be recorded in addition to voice. They also have a 1,9" color display.

=== [http://www.kingston.com/ Kingston] [http://www.kingston.com/flash/kpex.asp K-PEX 100] ===
:Two versions were available, with 1 GB or 2 GB of internal memory, but both are now discontinued (as of March 2007). Both models have an extra miniSD memory card slot. Ogg playback is sticky at high quality settings (firmware v2.09). The internal equalizer is disabled when playing Ogg (firmware v2.09). This device is a rebranded Cenix GMP-M6.

=== [http://www.lexar.com/mp3/index.html Lexar's] LDP-800 ===
:Available from 03/2005, the LDP-800 offers MP3, WMA and Ogg Vorbis support with 256/512MB storage. It has a digital out and an FM receiver and transmitter, can record from FM, mic and line-in, and has an SD card slot. Includes Sennheiser earbuds. Update: A sales representative reached by telephone on 2005-04-15 said that this player would be available sometime in June. Update again: A sales representative reached by telephone on 2005-06-20 again stated that the player would be available sometime in June. However, a sales representative at [http://www.ecost.com/ eCOST], an online store carrying the LDP-800, stated that their availability date is now 2005-07-15. Lexar now seems to have dropped this product. See discussion.

=== [http://www.lowrance.com/ Lowrance's] iFINDER Expedition C, Hunt C, PhD, iWay 350C, possibly others. ===
:Certain models of these GPS units support playing MP3 and Ogg Vorbis files stored on the SD/MMC card, which is primarily there to hold map files and route/track data. The item descriptions only mention MP3; you have to dig into the manual or actually use the device to discover Vorbis support. What a nice surprise! Many units seem to include voice-recorder functionality too, for tagging waypoints with audio notes, but it's not clear what codec they record in.

=== [http://www.lge.com.au/ LG's] UPANW5HSSI, UPANW1GSSI, UPANL1GSSI, UPANR1GSSI, UPANB1GSSI, FM30 ===
: Flash players with 512MB and 1GB capacity. They have no display other than a single multicolour LED. The new FM30 model has a large colour display. The FM30 (and likely the older models as well) does not support Vorbis metadata tags.

=== [http://www.maxfield.de/ Maxfield's] Max-Ivy, Max-Diamond, Max-Movie, Max-Diablo, Max-Sin Touch ===
: The Max-Diamond supports MP3, Ogg Vorbis and WMA (DRM). It has 512MB flash memory and can record from FM radio. The Max-Movie has 1GB storage and supports DivX, MP3, WMA (DRM) and Ogg Vorbis. It also has FM radio and a display with 260,000 colors. The Max-Diablo supports the same audio formats, but can also display pictures and videos on its small OLED (4096 colors). It has 1GB storage. The Max-Sin Touch has 512 MB or 1 GB internal memory. It is not to be confused with the Maxfield Max-Sin, which doesn't have Ogg support. The Max-Sin Touch looks exactly like the M-Cody M-20.

:: While the Max-Sin Touch does play Ogg Vorbis, it only does so with occasional glitches, at least with a device bought in November 2006. Perhaps a future firmware upgrade might help, but I'm skeptical. At this time, I cannot recommend the player. ― [[User:Eloquence|Eloquence]] 22:48, 22 November 2006 (PST)
::: It looks like there won't be any firmware upgrades in future. Maxfield GmbH became insolvent in January.

=== [http://www.mbird.co.kr/ M-bird's] XT-22S, XR-22 ===
: Available in 256MB/512MB/1GB sizes. USB 2.0. Supports Ogg Vorbis (although it doesn't seem to display tag info; this will probably be fixed in future firmware (?)), as well as MP3 and WMA. It has a small 200 mW built-in speaker. Inverted display with the ability to choose the foreground colour in 125 steps. Other features include FM radio, voice recorder (built-in mic), line-in, alarm, and more. The XR-22 supports up to 2GB of memory and is otherwise similar to the XT-22S.

=== [http://en.meizu.com/ Meizu] M6 miniPlayer ===
:Available in 1/2/4GB capacities. USB 2.0. Supports Ogg Vorbis and FLAC as well as MP3, MP2, WMA. DRM10 should be supported with future firmware updates. 2.4", 260k color display, text, photo (BMP, JPG, GIF), and video (AVI), FM radio/recording, built-in mic for voice recording. English, Simplified Chinese, Traditional Chinese, Japanese, Korean and partial Hebrew language support. You can buy an external battery pack which is rumored to enable USB On-The-Go support sometime in the future.

=== Mediacom JukeBox Movie 150-C 2GB ===
:I created an Ogg Vorbis file ("Ogg data, Vorbis audio, mono, 44100 Hz, ~96000 bps, created by: Xiph.Org libVorbis I") using Avidemux. It plays awesome!

=== [http://www.mobiblu.com/ MobiBLU] Cube2, DAH-2100, US2, BOXON ===
: All the above players support Ogg Vorbis (Q1-Q10). The B153 and DAH-1500i models do not mention Ogg Vorbis in their specifications.

=== MP3 MP-8256, MP-8512, MP-81000 ===
:Looks like another whitebox label. No official website found yet, but three models are offered in shops: MP-8256 with 256MB memory, MP-8512 (512MB) and MP-81000 (1GB). Plays not only Ogg Vorbis but also [[MP3]] and [[WMA]], and even shows BMP and text files on its small colour display. USB 2.0 interface. Sufficient quality in playback and recording (Radio/Line-In).

=== [http://www.mpmaneurope.com/product.aspx?product_id=77 MPMan] MP-FUB34, MP-CS157 ===
:The MPMan FUB34 and FUB35 are available (March 2007) in the UK in electrical stores such as Comet and come in 128MB, 256MB, 512MB and 1GB memory sizes. They appear to be Chinese S1 MP3 players. Although no mention is made of Ogg Vorbis support in the documentation or on the website (only MP3 & WMA), the format is supported. The MP-CS157 is a multimedia player that supports Ogg Vorbis as well, even though there is no mention of it on the box.

=== MPMan MP-160 ===
: As of today (23/11/09), the MP-160 does NOT play Ogg Vorbis files, although several shops and websites claim otherwise.

=== [http://mpeye.net/ MPeye] TS-400 ===
:A flash player which comes in 128MB/256MB/512MB/1GB sizes and has an FM receiver, colour display and a voice recorder.

=== Mustek MC-1503F ===
:Portable player with 1,5" colour display and 2GB of memory. The manual suggests that there are versions from 256MB to 4GB available. It only mentions MP3, WMA and WAV as supported formats, but Ogg Vorbis playback apparently works fine.

=== [http://www.muzio.co.kr/ Muzio's] JM200, JM250, JM300 ===
:Another Korean manufacturer jumps in and offers small flash-based players with 128MB up to 1GB storage capacities. They support the usual formats MP3/WMA/Ogg Vorbis, can record voice, receive FM radio.

=== [http://www.nextar.com/ Nextar] 933A-1B ===
:This is an inexpensive flash-based player with 1 GB of memory (recently purchased on sale for $18 US at K-mart). It comes in various other memory sizes, and I suspect these other models will also play Ogg Vorbis files. There is no mention on their web site or in the documentation that these will play Ogg Vorbis. The "drive formatting" on this device is strange: to be able to mount it under Linux, I had to delete all partitions in Linux (fdisk showed 4 non-standard partitions), then put the device in a Windows XP machine, recreate a single partition and format it as FAT. (Simply recreating a single partition and formatting it as FAT under Linux didn't allow the device to see the files copied to it.)

:This seems to work on other Nextar models including MA933A

=== [http://www.neurostechnology.com/ Neuros'] Neuros II ===
:The Neuros II can be used as a stand-alone flash-player. You can later buy an HDD "backpack" from 20 to 80 gigs in size and switch the backpacks as you please. This player now has a [http://open.neurosaudio.com/ free software (open-source) firmware].

=== [http://pentagram.com.tw/ Pentagram] Vanquish R SKIT ===
:2 or 4 GB of storage memory, USB 2.0, weighs 23 grams, plays Ogg, MP3, WMA, WAV and ASF, 1.1" OLED screen.

=== [http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=1532817&Sku=TC3G-5012 PowerUp!] 1GB USB Player ===
: Power Up! brand 1GB player, available from [http://www.tigerdirect.com TigerDirect]. The unit is either the standard S1 or Centon 1GB USB player or a clone thereof. There is no mention of Ogg Vorbis support in any of the literature, but my unit plays ogg files. Bonus!

=== [http://www.pre-view.com.tw Preview Technology] ===
:Makes a number of Ogg Vorbis compatible players. Although only a handful of their players claim support for Vorbis, it appears that Ogg Vorbis works on some of the models where it is not advertised. Their players are being rebranded and sold as inexpensive "MP4" players. Many players by Ergotech, Vakoss, and Zicplay are based on designs by Preview.

=== [http://eng.qoolqee.com/ Qoolqee's] K7 ===
:This is an interesting mix of a flash-based MP3 player and an organizer: the player has 512/1024 MB storage and contact and calendar functions and can sync with Outlook. It supports MP3, WMA and Ogg Vorbis, has FM radio and connectors for two headphones.

=== [http://www.allpmp.org/2008/09/20/ramos-t8-review/ RAmos] T8 ===
:Lightweight, with a 4.3-inch touchscreen. Screen resolution of 480×272. Uses USB connection. Based on the Rockchip RK2706 chipset.

=== Renkforce S30 ===
:The Renkforce S30, sold by Conrad Electronic in Germany and available as a 2GB and a 4GB model, is a USB stick style "S1MP3" player and plays Ogg Vorbis fine. It only displays the file name and the average bitrate, but no additional metadata. The manual only mentions MP3 and WMA.

=== [http://rovermedia.ru/ RoverMedia] ARIA X7 ===
:A portable Vorbis/MP3/WMA player with 512MB - 4GB internal flash memory, FM-receiver, recording function, picture viewer, video player.

=== [http://www.samsung.com/Products/ Samsung] / [http://www.yepp.co.kr/ Yepp] (product label), YP-C1, YP-F1, YP-MT6, YP-P2, YP-S2, YP-S3, YP-T6, YP-T7, YP-T9, YP-T10, YP-U1, YP-U2, YP-U3, YP-U4, YP-U5, YP-Z5, YP-53, YP-R1 ===
:Many Yepp players support Ogg, please see [[PortablePlayers/SamsungYepp]] for more details about each model. Note: many of these models being sold into DRM-sensitive markets (e.g. the United States) are configured as MTP (Media Transfer Protocol) devices rather than as USB mass-storage drives (UMS) and may require the use of specialized software on any system with which you use them. Samsung provides Windows drivers with these devices, which may or may not be necessary on Windows systems (recent versions of Windows Media Player reportedly support these devices without a specific driver). Using MTP-based players on non-Windows PCs will require installation of additional software. Linux support for at least some of these devices is available through [http://libmtp.sourceforge.net/ libmtp] and the "generic MTP device" plugin in [http://amarok.kde.org/ Amarok]. Read the specifications on the box carefully; if it says it depends on Windows Media Player, then it's probably an MTP device which may need Windows drivers or other MTP support software.
*The Samsung S3 (YP-S3) is (as of August 2008) a low-cost, internal flash memory player, with '''official''' out of the box support. Includes video screen. List price $80.
*The [http://www.samsung.com/my/consumer/detail/detail.do?group=mp3audiovideo&type=mp3player&subtype=mp3player&model_cd=YP-P2AB/XME Samsung P2 (YP-P2)] is a touch-based digital audio player that supports Ogg Vorbis (2GB, 4GB, 8GB... and a 16GB likely to arrive in the U.S. early 2009, already available in Korea, Fall of 2008). The P2 also has FM radio and stereo Bluetooth. In the U.S. it is likely that the device ships with MTP, but it is possible to switch it to UMS mode. Read through [http://www.anythingbutipod.com/forum/showthread.php?t=25784&highlight=samsung+games+pack this post/guide] (from anythingbutipod.com) for instructions. Vorbis playback is only available in UMS mode. As of November 2008, the 8GB player is available for between $150 and $180.
*The YP-U4 supports Ogg Vorbis out of the box. The included Samsung Media Studio also writes the correct track metadata for album, artist, title, etc. However, the player itself reads these Vorbis comments only after upgrading to firmware v1.28; in earlier versions the metadata reads as 'unknown'. The device can transfer in MTP or USB mass storage modes, as selected on the device itself.
*The YP-R1 supports Ogg Vorbis out of the box and can be switched freely between MSC and MTP modes. However, Samsung's line as of January 2010 is that metadata is unsupported due to the lack of a global standard. (This statement is flawed in multiple ways.) See this [http://forums.cnet.com/5208-4_102-0.html?threadID=380161 forum thread] with an official response.

=== [http://www.sansa.com/players SanDisk] Sansa Clip and Sansa Fuze ===
:As of May 2008, these two Sansa models '''officially''' support Ogg Vorbis and FLAC playback. The ''Clip'' series is smaller, weighs less than one ounce (28 g for the 8 GB version, as of Feb 2009), and is less expensive. It features a USB 2.0 cable, FM tuner with presets, microphone, and belt clip. Available with 1, 2, 4 and 8 GB of built-in memory. By default it works as a USB mass storage device. The audio file navigation is based on an internal tag library (artist, album etc.). This library is kept in sync by the player when the Sansa Clip is used as a USB mass storage device. Audiobooks and podcasts are organized in special categories by the player navigation system.
:The ''Fuze'' series is larger and weighs two ounces. It also features a USB 2.0 cable, FM tuner with presets, microphone, and video display (for MPEG-4 video). Available with 2, 4, and 8GB of built-in memory and microSD/SDHC expansion.
Official support is provided for this operating system through their message forum.

See also:
* [http://forums.sandisk.com/sansa/board/message?board.id=clip&thread.id=6720&view=by_date_ascending&page=1 Official firmware upgrade FAQ]
* [http://forums.sandisk.com/sansa/board/message?board.id=clip&message.id=10832&query.id=3787#M10832 Summary: Don't repartition your device if you don't know what you are doing]

=== [http://www.signeo.co.jp Signeo] / [http://www.signeo.co.jp/products/sn-a800/ SN-A800], [http://www.signeo.co.jp/products/sn-m700/ SN-M700], [http://www.signeo.co.jp/products/sn-m600/ SN-M600]. ===
:(2006-01-08) Seen in many electronics stores in Japan. The SN-A800 looks incredible, smaller than the iPod Nano, I think. I've not been able to try any for sound quality. Signeo also makes a hard drive player that supports Vorbis. Their 2005-12 sales brochure claims Linux compatibility for the SN-M600 and SN-M700.

=== Sumvision 1GB SV04-M18 ===
:My test Ogg file was created using the timidity MIDI player, and the format was checked using mplayer, which used the ffvorbis codec to play back the same file. While this is a Chinese-made MP3 player, another Sumvision player I have does not appear to play Ogg Vorbis files. The SV04-M18 works as a USB mass storage device.

=== [http://www.supportplus.cn/ SupportPlus'] SP-Advance ===
:Found this player in the local supermarket. The player is very small, has a 1 inch colour LCD and 1 GB of storage. Supports audio and video incl. Ogg Vorbis. The SP-Advance is not listed on their web site, but among the ones that are on the web site the 1-inch HDD Super Slim Jukebox claims Ogg Vorbis support.

=== [http://www.swissbit.com/ Swissbit's] Swissmemory s.beat ===
:The s.beat is a rather original piece of hardware: as you may have guessed, it is a Swiss Army knife with an MP3 player. It supports Ogg Vorbis too and comes in sizes from 1 up to 4 GB.

=== T-Budd ===
:Korean company that makes a wonderful piece of hardware: the TLN-100, which comes in 512 MB or 1 GB. Supports MPEG 1/2/2.5/3 layer 3, WMA, ASF and Ogg, plus PLF (proprietary video format), and works with two AAA batteries. Nice OLED display. FM radio. Very quick memory transfers. Not a USB key type player, but a small USB adapter is supplied that allows the device to be plugged directly into a standard USB port. USB 2.0 mass storage is implemented: works perfectly under Linux.

=== [http://www.teac.com/ TEAC's] MP-60 ===
:Very small and simple device with 2GB at a low price (about 20 EUR). Although this is not mentioned anywhere on the homepage or in the device's documentation, it can also play Ogg Vorbis files out of the box. It connects via a USB 2.0 cable (which also charges the internal battery) and acts as a mass storage device formatted with the FAT32 filesystem.

=== [http://www.teac.com/ TEAC's] MP-400 ===
:The MP-400 is a flash-player with either 512MB or 1024MB storage. (As of 01-2009, could not find product sold online.)

=== Tekmax T-1000 [http://www.ioneit.com/ "ioneit"] ===
:256/512/1024 MB USB-connected mass storage device (flash based, uses FAT16, OS independent), 64K 4.41cm² color display, MP3/WMA/ASF/OGG support, equalizer and "3D sound", FM tuner, bookmark system, clock, stopwatch, alarm timer, record from microphone/FM as MP3, dual output, firmware upgradeable. Size: 3.5x8x1.7cm @ 40 grams. 16 hours of battery life.

=== [http://www.t-logic.it/ T-Logic] TL-258 ===
:Either 2048, 4096, or 8192MB storage. Vorbis, FLAC, and MPEG-4 playback. Very small player with touch sensitive pad and FM radio.

=== [http://www.trekstor.de/ TrekStor's] blaxx, iBeat cody, iBeat organix 2.0, iBeat sonix ===
:The blaxx (also a video player) comes with a TFT display and 2GB or 4 GB. The iBeat cody (also a video player) comes with 2/4 GB storage and has a 262K color TFT display. The iBeat organix 2.0 comes with a 2-color OLED, approx. 55h battery life and 4GB or 8GB. The iBeat sonix has a large display that can be used to watch movies. It comes in sizes from 1GB to 4GB and batteries last for approx. 45 hours. All players support Linux from kernel 2.4.x (identified as USB mass storage device).

:The iBeat organix 2.0 supports Ogg Vorbis out of the box. It also reads tags from media files and stores their information in an internal database so one can then search through all songs by artist, album title, song title, year etc., regardless of the actual directory structure. This works with Ogg Vorbis files, even with UTF-8 encoded "special characters" in the tags - at least roman characters with diacritics (like in ''Komm süßes Kreuz''), also ones not belonging to latin-1 (like in ''Dvořák'').
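
The two example titles above are not equally demanding on a player's tag handling. Vorbis comments are specified as UTF-8, and the short Python sketch below (illustrative only; the helper name is made up here) shows that ''Komm süßes Kreuz'' also fits into Latin-1 while ''Dvořák'' does not. A player that displays the latter correctly is therefore presumably decoding the tags as UTF-8 rather than treating the bytes as Latin-1.

```python
def fits_latin1(text):
    """Return True if every character of `text` exists in ISO-8859-1 (Latin-1)."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


for title in ("Komm süßes Kreuz", "Dvořák"):
    utf8 = title.encode("utf-8")  # the byte form a Vorbis comment would carry
    print(title, "| characters:", len(title), "| UTF-8 bytes:", len(utf8),
          "| fits Latin-1:", fits_latin1(title))
```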

=== [http://www.turbolinux.com/ Turbolinux's] [http://www.turbolinux.com/products/wizpy/ Wizpy] ===

=== Wigo's CVM-101, CVM-103, CVM-300, CVS-100 ===
:Korean players with slick design that come in 128/256/512/1024 MB depending on the model. They support MP3/WMA/Ogg and have an FM receiver and voice recorder. Note: Ogg bitrates supported may be limited; check the manufacturer's specification for each device for details.

=== Xcent XT100 ===
:This player is sold in the U.K. and comes with 256/512MB. Supports Linux and BSD. (As of 01-2009 could not find product online.)

=== [http://www.yuraku.com.sg/ Yuraku] [http://www.yuraku.com.sg/proddetails.asp?prodid=90&catid=38 Yur.Beat Fusion Stream] ===
:This is a 1GB-Flash-[http://en.wikipedia.org/wiki/Portable_media_player PMP] that also has a MicroSD card slot. The playback function supports AAC, ADPCM, AIFF, MP3, Ogg Vorbis, WAV and WMA; the streaming function supports MP3 and WMA. FM and Internet radio (via "vTuner Internet Radio Index Service") are also available. PC connection is possible via mini USB type B, USB 2.0 high speed or Wi-Fi (IEEE 802.11b/g standards).

== Harddisk Storage ==

=== [http://www.airlinktek.com/ AL Tech's] MG-25, MG-35, MG350HD ===
:The Mediagate MG-25 is a portable HDD that also supports media playback. It uses a 2,5" disk and USB 2.0 to connect, and supports MPEG-1/-2/-4, DivX, Xvid, MP3, Ogg Vorbis, JPG. It can upsample to HDTV, and has composite, component and s-video outs, stereo and a digital out. Remote control is included. The MG-35 uses a 3,5" HDD instead, and supports WMA and ethernet. The MG350HD uses a 3,5" HDD as well and supports HDTV. There is a wiki page with an FAQ [http://mediagate.pbwiki.com/ here].

=== [http://www.apple.com/ipod Apple's] iPod* ===
:<nowiki>*</nowiki>''The native iPod firmware doesn't support Ogg Vorbis.'' You can, however, install [http://www.rockbox.org/ RockBox] or [http://www.ipodlinux.org/ iPodLinux] on all iPod models (except for the Shuffle and Nano 2nd gen). RockBox supports tags, and a number of other formats. The larger iPod models have up to 80 GB HDDs.

=== [http://www.boghe.com Boghe] Vip20 ===
:The Vip20 seems to be similar to the iBeat 500 from TrekStor and Xclef HD-800. It has the same features: MP3, WMA, WAV, Ogg Vorbis decoding plus 20 GB storage.

=== [http://www.cmt21.com/index_eng.php Creative Mind (CMTECH)'s] U250 ===
:Seems to be a Korean supplier to Samsung who also sells its own branded players. Works as a pendrive, and encodes MP3 from line-in (same jack as the headphone), FM radio and microphone. Has a built-in loudspeaker. Plays back Ogg Vorbis, MP3 and WMA. Does not display ISO-8859-2 accented characters from my VorbisComments. :-(

=== [http://www.commodore.net/ Commodore's] eVic ===
:The eVic has 20GB storage and plays WMA (incl. DRM), MP3 and Ogg Vorbis. It can record voice and music, and has USB host functionality. In hardware version M03-002 with firmware 2.203, there are '''serious problems''' with Ogg playback while using the ''Equalizer'' (disturbing crackling noises). (In reply to an email inquiry, Commodore International Corporation wrote: "eVic's new firmware is still developing. The new version will safe the issue with ogg playback while using the Equalizer.") USB host functionality seems not to be implemented at all yet.

=== [http://www.cowonamerica.com COWON America's] [http://www.cowonamerica.com/products/tvix/ Dvico TViX][http://www.tvix.co.kr/eng/ 2] ===
:This is a rather unique device; a multimedia jukebox, music tank, photo album and last but not least a portable storage. It is bigger than usual portable devices, but has also a lot more options. It can connect to the PC (USB 2.0), TV (S-Video, Composite), stereos and 5.1 surround systems (Coaxial/Optical) and comes with a remote control. Supported video formats are DVD (MPEG-2), VCD (MPEG-1), DivX, Xvid. Supported Audio formats are MP3, WMA and Ogg Vorbis (and [http://www.tvix.co.kr/eng/ mkv] with firmware upgrade). It can display JPEG pictures on the TV. It is available without a harddrive, or equipped with harddrive sizes up to 200 GB.

=== [http://www.cowonamerica.com Cowon iAudio] M3, M5, X5, A2, 6, 7 ===
:The iAudio M3 is a portable harddisk player with either 20 or 40 GB of storage. It has a built-in FM radio and mic. It supports MP3, WMA, Ogg Vorbis and WAV, and even FLAC with the newest firmware upgrade. See this [http://gear.ign.com/articles/522/522090p1.html IGN article] for more info. The M5 has 20 GB storage and supports the same formats. The X5 is similarly designed (storage sizes of 20GB, 30GB, 60GB) and can play MPEG-4 videos. It has a 1.8 inch LCD with 260,000 colors and a USB OTG (On-The-Go) feature. The A2 was released in November 2005 and is a widescreen mobile video player. It has a 480 x 272 pixel screen and supports the above mentioned set of audio, video and image formats. The tiny iAudio 6 features a 4 GB 0.85" harddisk and supports both Ogg and FLAC. The M3, M5, X5, and A2 (probably the 6 as well) all act as USB mass storage devices, which means they are supported by Linux and Mac. The software is Windows-only, though.
:'''Comment tag support''': The iAudio X5 supports the ''artist'' (limited length), ''album'', and ''title'' comment tags.

=== [http://www.digmind.com/ Digital Mind Corporation's] DMC 8280 ===
:The 8280 has 20 GB or 30 GB storage, plays Ogg Vorbis, MP3 and WMA. Standard feature set; this player does not excel in any area but price. USB mass storage compliant — you can put songs on it from non-Windows computers, but full indexing of the songs for reference by artist etc. requires Windows.

=== [http://www.emtec-international.com/ Emtec's] Movie Cube ===
:The Movie Cube comes with a 2,5" HDD with 40 or 80 GB size. It supports the playback of various audio and video formats including Ogg Vorbis. The package includes some AV cables and a remote control.

=== [http://www.freecom.com/ Freecom's] MediaPlayer-3, Network MediaPlayer-35 Drive-In ===
:The MediaPlayer-3 is again sort of an external HDD that can play media without a PC. It supports DivX, MP3, MPEG-4, AVI, WMA, ASF and Ogg Vorbis. The product with the complicated name Network MediaPlayer-35 Drive-In is an enhanced version of the MediaPlayer-3 — it has an additional network interface and supports an internal 3,5" drive. The ethernet port can be used to read media from the network, but cannot be used as network attached storage.

=== [http://www.godot.com.tw/ GoDot] M8170, M8270, M8370, M8470, M8570 ===
:GoDot's HD players have capacities ranging from 2.2 GB to 20 GB. Each model is very different. They support Ogg Vorbis, MP3 and WMA (some models support DRM).

=== [http://www.hama.de/portal?lid=2 Hama's] VSV-20/VSV-40 ===
:The VSV-20/VSV-40 has the usual mobile MP3 HDD player size and can read/write from its 16in1 memory card reader and 20 GB or 40 GB internal HDD. But it can do more than audio (MP3, WMA, Ogg Vorbis, AAC). It supports image (JPEG) and video (MPEG-1/-4) playback on the 2" display and on a connected TV. It even includes a remote control. Beware: Hama has suspended Ogg Vorbis support. However, a firmware update has been promised to re-establish Ogg Vorbis support. If you plan to buy a device, check the [http://www.hama.de/service/download/firmware/index.hsp Firmware download page] or better [http://www.hama.de/portal/pageId*2276/action*3499 ask them] about the current status of Ogg Vorbis support.

=== [http://eng.iaudio.com/ iAudio] ===
:See Cowon iAudio above.

=== [http://www.idream-multimedia.com/liste.php?cid=9 iDREAM] Jukebox 2.2 GB, 3.3 GB and 4 GB ===
:These HDD players support Ogg and can encode MP3 from line-in.

=== [http://www.ivmm.com/innoax/products/innopod.htm InnoAX's] InnoPod ===
:This is an iPod mini clone that supports MP3, WMA, WAV and Ogg Vorbis. It supports recording from line-in and mic, and has a 4 GB harddrive and USB 2.0.

=== [http://www.iomega.com/ Iomega's] ScreenPlay Pro ===
:Iomega is finally also jumping on the bandwagon and offers external HDDs with multimedia playback. The larger version, the [http://www.iomega-europe.com/eu/en/products/screenplay/screenplay_family_en.aspx ScreenPlay Pro], supports the usual audio and video codecs including Ogg Vorbis. It seems to be a repackaged Mvisto with HDD included.

=== [http://www.iriver.com/ iRiver's] iHP-1xx, H1xx, H2xx, H3xx, iGP-100 ===
:iRiver also has a number of harddisk-based items that play back Ogg Vorbis. Older models like the iHP-100 and the iHP-115 come in 10 and 15 GB sizes and need a firmware update (see the [http://www.iriveramerica.com/support support downloads] for that). The iHP-120, a 20GB portable player, and the iHP-140, a 40GB version, support Vorbis playback out of the box. Read reviews here: [http://gear.ign.com/articles/435/435472p1.html IGN on iHP-100], [http://gear.ign.com/articles/457/457818p1.html IGN on iHP-120]. The iGP-100, a 1.5 GB portable player, supports Vorbis according to the FAQ, and no firmware upgrade appears to be required. The new line of harddisk players (H120, H140) comes in 10 to 40 GB sizes. There is also a product line with USB host function and colour display that supports 32-500 kbps: H320, H340. The newer H10 player does not support Ogg Vorbis.
:Many iRiver devices can be loaded with the RockBox replacement firmware which plays Ogg Vorbis as well as adding FLAC playback.

=== [http://www.jnc-digital.com/Eng/ JNC's] SSF-M3, SSF-M5 ===
:The SSF-M3 comes with 20/40GB storage size, whereas the SSF-M5 has only 1.5 GB. Both support voice recording and FM radio. The SSF-M3 is more stylish and very slim and comes with a docking station.

=== [http://www.lge.com/ LG's] Mediagate ===
:This player is similar to the Modix or TViX. It is a portable USB HDD equipped with a 2,5" drive (size varies). It plays audio (MP3, Ogg Vorbis, WMA), video (MPEG-1/-2, Xvid, DivX) and images (JPEG). It has composite, s-video and component video output and supports progressive scan, audio output is done through a coaxial and stereo plug. The device is bundled with a remote control.

=== [http://www.mobiblu.com/ mobiBLU] DHH-200 ===

=== Modix HD-3510 ===
:The HD-3510 is similar to the TViX, as it is sort of a portable multi-talent. It can store and play back audio, video and images, and can be used for other files as well. It can decode MPEG-1/-2/-4 including DivX/Xvid, AC3, DTS, MP3, WMA, Ogg Vorbis and JPEG. It uses USB 2.0 for data input and has various output connectors: analog stereo and 5.1 out, coaxial digital out, composite, s-video and component video out with progressive scan and HDTV upscaling. The HD-3510 is bundled with a carrying bag and a remote control, but without a 3,5" HDD.

=== [http://mpeye.net/ MPeye's] HT-100, HT-150 ===
:The HT-100 uses a 1,5 GB HDD, decodes MP3, WMA, Ogg Vorbis and supports the usual features. The HT-150 seems to have the same features (maybe a mistake on the website).

=== [http://www.mpio.com/ mpio] HD300, HD200, One ===
:mpio HD300 is a harddisk player with 20GB and supports WAV/MP3/WMA/Ogg Vorbis. It has FM radio, an alarm clock and supports USB 2.0. The HD200 has 5GB storage capacity, an FM radio which can be recorded and supports the same formats as the HD300. Despite its name the One consists of three components: a player, a HDD and a CD-ROM drive, which can be combined with each other. It supports [[MP3]], [[WMA]], Ogg Vorbis, JPG, BMP and MPEG-4 movies. It has a 1" OLED display and will be available from 05/2005.

=== [http://www.imp3.net/read.php?textid=1529 Muzio's] JM-600 ===
:This player comes with either 2.2 or 4 GB harddrive and supports MP3, WMA, Ogg Vorbis and ASF. It can record voice and has a FM receiver. What sets this player apart is the LCD — it can show BMPs, JPGs and text. The device can also act as a USB host to support digital cameras.

=== [http://www.macpower.com.tw/ Macpower] Mvisto MV-U2UGS ===
:The Mvisto is a portable hardware enclosure for 2,5" harddrives. It has video and audio outs and decodes MPEG1/2/Divx/Xvid/JPEG/MP3/WMA/AAC/Ogg Vorbis. It comes with a remote control.

=== [http://www.neurostechnology.com/ Neuros'] Neuros II ===
:This mobile player comes either with various harddrive sizes up to 80 GB or as 256 MB flash player. The new firmware to support Ogg Vorbis has been developed by the Xiph.org Foundation. The Neuros Synchronization Manager for Windows is available from the same link and now fully supports the addition of Vorbis files to the Neuros. *nix users can use Xiph.org's [http://www.xiph.org/positron/ Positron], Sean Starkey's Java [http://neurosdbm.sf.net/ Neuros Database Manipulator], or [http://www.sorune.com/ Sorune], all of which provide full Neuros database support and other features. The Neuros II has been discontinued. A Neuros III is planned without a firm date, but there is a [http://open.neurosaudio.com/archives/Product%20Roadmap3-15-2005.htm roadmap].

=== [http://www.nextway.co.kr/ Nextway's] D Cube NHD-150D ===
:1.5 GB harddisk, USB 2.0, and can broadcast music through a FM transmitter.

=== [http://www.pontis.de/ Pontis'] MX2020 ===
:The MX2020 is a portable player for movies, music and photos. There is now a firmware update that adds Ogg Vorbis support.

=== [http://www.modix-hd.com/ Rapsody's] RSH-100 ===
:It is similar to the Modix HD-3510, but additionally supports USB host functionality. This web site is dead. The Savit Micro Rapsody RSH-100 can be seen on their site.

=== [http://www.digitalnetworksna.com/rioaudio/ Rio] [http://www.digitalnetworksna.com/shop/item.asp?model=261 Karma] ===
:Harddisk of 20 GB. Uses Vorbis and FLAC. Uses USB 2.0 cable or docking station, which offers Ethernet and RCA line-out support. See [http://gear.ign.com/articles/458/458401p1.html IGN review] or [http://www.riovolution.com Riovolution review] for more information. Note that firmware versions prior to 1.25 cause stability problems for some people; visit the [http://www.digitalnetworksna.com/support/rio/product.asp?prodID=113 support page] to get the newest version. The Karma was discontinued in March 2005. Rio (DNNA) was effectively dissolved on 27 July 2005, and its assets were sold to [http://www.sigmatel.com/ SigmaTel].

=== Safa HMP-110R ===
:A portable player with 1.5GB memory, FM-receiver, recording function, upgradeable firmware, etc.

=== [http://www.samsung.com Samsung] YH-J70 ===
:A portable Multimedia Jukebox as seen on their [http://www.samsung.com/common/microsite/exhibition/cebit2005/base.asp?pcode=IT01 Cebit 2005 Microsite]. Comes with 20/30GB disk, colour display, video player and USB host function. Samsung's support for Ogg Vorbis is reported to be buggy. [http://www.samsunghq.com/forum/showthread.php?t=369] The Samsung YH925 is falsely advertised to support Ogg Vorbis. [http://www.paul.sladen.org/toys/samsung-yh-925/]

=== [http://www.sitecom.com/ Sitecom's] MP-330, MP-010 ===
:The MP-330 player uses a 4,4 GB harddrive and USB 2.0, and supports MP3 and WMA (Ogg Vorbis support is claimed in the manual, but the player doesn't play Ogg files). The MP-010 is a portable media player. As such it supports music, movies and pictures. This includes MP3, WMA, Ogg Vorbis, MPEG-1/-2/-4. It has a capacity of 40GB, comes with a remote control and has various ports for the TV.

=== [http://www.teac.de/ TEAC] MP-1000, MP-2000 ===
:TEAC MP-1000 is an ultra-compact harddrive player with 1.5GB capacity and only 70g mass. The follow-up model MP-2000 has 5 GB storage and supports the same formats (MP3, WMA, Ogg Vorbis).

=== [http://www.trekstor.de/ TrekStor's] iBeat 500, iBeat 300, vibez ===
:The iBeat 500 is a portable harddisk player with 20 GB of storage. It supports MP3, WMA and Ogg Vorbis and uses USB 2.0 to connect to PCs. It has a FM radio and an in-built mic. It seems to be available only in Germany (looks like a rebadged Xclef HD-800). The iBeat 300 uses a 1,5 GB HDD and has a color display. The vibez is available in 8GB, 12GB and 15GB versions. All can play MP3, WMA, WAV, OGG and FLAC files.

=== [http://www.unibrain.com/iZak Unibrain's] iZak ===
:This is a portable USB hard disk with 40/80/100 GB of storage. It plays a wide range of video formats, including divx/xvid/bvix/dvd iso. A good review can be found [http://www.mpeg-playcenter.com/modules/Reviews/reviews/Review_iZak.pdf here].
:The most current firmware release supports Ogg Vorbis playback.

=== [http://www.agci.co.uk/customer/categories/audio/mp3players Vusys] i-DJ 370 and i-DJ 670 ===
:4GB and 20GB harddrive players listed as playing OGG on the site. 370 weighs 150g and plays for 10 hours, 670 weighs 165g and plays for 12 hours.

=== [http://www.xclef.com/ Xclef's] HD-800, HD-500 ===
:This is a harddisk player with 20/40/60 GB storage size, and can decode MP3, WMA, Ogg Vorbis and WAV. It has a FM radio and a mic for recording voice. Though not mentioned on the web site, the HD-500 does decode Ogg Vorbis. — Site is dead, and as of 2007.05.23 no results come up in Google Product Search.

== CD/DVD Audio Players ==

=== [http://www.ifreemax.com/ Freemax's] FW-960 ===
:This CD-R portable supports Ogg Vorbis playback out of the box. It has 48 hours of WMA playback if an external battery pack (2 AA batteries) is used. The FreeMax FW-960 is also known as the mpman MP-CD550.

=== [http://www.exonion.com/ Havin's] (link dead) Exonion HVC-400E, [http://www.princeton.co.jp/ Princeton's] Pocket Beat airCD ===
:The Havin HVC-400E, also known as the Princeton airCD, has probably been on sale in Japan since late November 2003.

=== [http://www.iriver.com iRiver] iMP-250, iMP-350, iMP-400, iMP-550, iMP-700(T) ===
:Ogg Vorbis is supported only through the latest beta firmwares, and there are still some bitrate restrictions, which may vary depending on the model (min=96kbps, max=160kbps). The iMP-550 supports a maximum bitrate of up to 256 kbps (still 96 kbps as minimum). Also note that the latest iMP-450 does not support Ogg for the moment; a future upgrade may correct this. The iMP-700T with firmware 1.40 supports bitrates between 96 and 210 kbps, and .ogg files are generally not as loud as .mp3 files.

=== [http://www.roadstar.com/ Roadstar] PCD-5960WOMPT ===

=== [http://www.samsungusa.com/ Samsung's] MCD-CM600 ===
:The MCD-CM600 is now available in Korea. It is a CD portable that can play Vorbis, MP3, and WMA.

== Mobile Phones ==

=== [http://www.openmoko.com/ Openmoko] ===
:Openmoko produces phones with hardware and software as open as possible. They run GNU/Linux, and software players such as mplayer and ogg123 can be used for Vorbis playback. Because it runs GPL'ed software, Ogg Theora is also supported (but needs to be encoded with a low frame rate as described at the [http://wiki.openmoko.org/wiki/Video_Player Openmoko wiki]).

=== [http://www.samsung.com Samsung] introduced phones on the 2006 3GSM that play .ogg files: SGH-i320 and [http://www.engadgetmobile.com/2006/02/13/samsung-shows-off-sph-s4300-musicphone/ SPH-S4300] ===
:Also, Samsung i900 Omnia is known to play Vorbis, in Windows Media Player only. [http://es.samsungmobile.com/mobile/Samsungi200/spec Samsung SGH-i200], also plays Vorbis.

=== SymbianOS based mobile phones from '''Nokia''', '''Sony Ericsson''', '''Siemens''', '''Motorola''', '''Samsung''' etc.===
:Plays Vorbis files with the third-party, open source [http://symbianoggplay.sourceforge.net/ Symbian OggPlay Software]. For supported mobile phones please visit the project website. The software works very well — even the still-in-development version which is strongly recommended. There is also a [http://developer.symbian.com/main/documentation/example_app_code/cpp/ogg_vorbis.jsp plugin] to Symbian itself. See also [[VorbisSoftwarePlayers#Symbian]].

=== iPhone ===
:Third-party efforts are porting the [http://coreplayer.com/content/view/28/69/ CorePlayer] and the [http://www.zodttd.com/ VLC player].

=== Android-based phones ===
Presumably all Android devices including phones support Vorbis out-of-the-box. Here are some examples with references:
* Nexus One aka the "Google Phone" (User Guide page 329)
* T-Mobile G1 [http://support.t-mobile.com/knowbase/root/public/tm30234.pdf (User Guide page 105)]
* HTC Dream [http://member.america.htc.com/download/Web_materials/Manual/Rogers_Dream/090512_Dream_HTC_US_Rogers_HEP_English_UM.pdf (User Guide page 153)]
* HTC Magic [http://member.america.htc.com/download/Web_materials/Manual/HTC_Magic_Rogers/100128_Magic_MR_Rogers_English_UM.pdf (User Guide page 198)]
* Motorola Droid [http://www.motorola.com/staticfiles/Support/US-EN/Mobile%20Phones/DROID-by-Motorola/US-EN/Documents/Static-Files/DROID_UG_Verizon_00202474c.pdf (User Guide page 34)]
* Motorola Milestone [http://www.motorola.com/staticfiles/Support/CA-EN/Mobile_Phones/Milestone/CA-EN/_Documents/Static_Files/Milestone_Telus_CA_EN_UG_68000202482A.pdf (User Guide page 35)]

===Windows Mobile based phones===
:see [[VorbisSoftwarePlayers#PocketPC]]

== Automobiles ==

See [[StaticPlayers]] page.

== Others ==

=== [http://www.ipodlinux.org/ iPodLinux] ===
:You can install a special Linux distribution on almost all Apple iPods. In combination with the Podzilla jukebox software it plays OGG (and many more audio file formats).

=== PDAs / Cell Phones / Game Consoles ===
:Other devices that run software to play Ogg Vorbis can be used as portable players as well. Please go to [[VorbisSoftwarePlayers]] page for more information.

=== [http://www.rockbox.org/ Rockbox] alternative firmware for iPods and other DAPs ===
:The Rockbox project works hard to provide an alternative firmware for some portable players. Rockbox has a rich feature set that is hard to find elsewhere, including gapless playback, Ogg Vorbis, FLAC and even [http://www.musepack.net/ Musepack] support. Currently many models by [http://www.iriver.com/ iRiver], [http://www.archos.com/ Archos], [http://www.apple.com/de/ipod/ iPod], Cowon (iAudio X5, X5V, X5L, M5 and M5l), SanDisk (Sansa c200, e200 and e200R series) and Toshiba (Gigabeat X and F series) are supported.

====[http://www.rockbox.org/twiki/bin/view/Main/RockboxPlayer Rockbox Player] - Free/Open hardware audio player (DAP) and recorder====
:There are ongoing efforts to design and build a Free/Open hardware audio player (DAP) and recorder, for use with the Rockbox firmware. Developers interested in participating are encouraged to visit the [http://www.rockbox.org/twiki/bin/view/Main/RockboxPlayer project page].

=== NAViBLUE NBC3500 GPS Navigation Device ===
:According to [http://www.tigerdirect.com/applications/SearchTools/item-details.asp?EdpNo=3123083&CatId=2374]

=== TomTom Navigation software (mentioned on e.g. [http://www.pocketgpsworld.com/tomtom-navigator-pda-5.php]) and hardware systems ===
------------

Resources and papers on Audio, Music and Speech

2010-04-09T12:30:24Z

Ogg.k.ogg.k: Undo revision 10972 by Shibo (Talk)

[http://cnx.org/content/col10338/latest/ Frequency and Music]<br/>
An overview of frequency, harmonic (Fourier) series, and their
relationship to music.

[http://cnx.org/content/col10250/latest/ Audio Localization]<br/>
This course has been created as an introduction to audio localization,
and how beamforming can be applied in a real-time environment.

[http://cnx.org/content/col10303/latest/ Fundamentals of Digital Signal Processing Lab]<br/>
The purpose of this lab is to familiarize students with the DSP
development workstation in the signal processing lab by examining
sampling, analysis, and reconstruction of continuous-time signals.
Specifically, we will first look at sampling/reconstruction of
continuous-time signals. We will then examine time- and
frequency-domain displays. Finally, we will examine the importance of
sampling frequency and its effects on aliasing.

[http://cnx.org/content/col10203/latest/ Intro to Digital Signal Processing]<br/>
The course provides an introduction to the concepts of digital signal
processing (DSP). Some of the main topics covered include DSP systems,
image restoration, z-transform, FIR filters, adaptive filters,
wavelets, and filterbanks.

[http://cnx.org/content/col10252/latest/ Methods for Voice Conversion]<br/>
This course explores methods in signal processing to perform voice
conversion: producing the words from one speaker in the voice of
another. This is the Elec 301 project of Justin Chen, Matthew
Hutchinson, Gina Upperman, and Brian VanOsdol.

[http://cnx.org/content/col10313/latest/ Musical Instrument Recognition]<br/>
To detect the pitch and instrument of a monophonic signal. To
decompose polyphonic signals into their component pitches and
instruments by analyzing the waveform and spectra of each instrument.
Elec 301 Project Fall 2005.

[[Category:Developers stuff]]

Vorbis Hardware

2010-04-09T12:26:42Z

Ogg.k.ogg.k: Undo revision 10973 by Shibo (Talk)

This is a list of hardware of all categories, from chipsets to ready-to-use products, that support Ogg [[Vorbis]].

Hardware support for Ogg Vorbis is relatively good: you can choose between a huge number of mobile flash players, many HDD based players and a respectable number of Hi-Fi components. More than 50 different companies offer a total of more than a hundred products for virtually every application; there is even a knife that can play Ogg Vorbis now ;-). If you can't find a suitable player come back next week -- new products are added on a weekly basis, as many companies are working to support Vorbis on their hardware.

If you know of any hardware or projects that are not yet mentioned here, please add them to the list. More (outdated) hardware info can be found at [http://www.xiph.org/ogg/vorbis/hardware.html vorbis hardware page].

== Consumer products ==

These players support Ogg Vorbis either out of the box or after a firmware upgrade.

* [[PortablePlayers]]: mobile players
:[[PortablePlayers#Flash_Memory_Storage|Flash Memory Storage]]
:[[PortablePlayers#Harddisk_Storage|Harddisk Storage]]
:[[PortablePlayers#CD.2FDVD_Audio_Players|CD/DVD Audio Players]]
:[[PortablePlayers#Mobile_Phones|Mobile Phones]]
:[[PortablePlayers#Others|Others]]
* [[StaticPlayers]]: installed players
:[[StaticPlayers#Hi-Fi_components|Hi-Fi components]]
:[[StaticPlayers#Car_Audio|Car Audio]]
:[[StaticPlayers#Media_Storage|Media Storage]]

For hardware that is able to run third-party software (such as PDAs and video game consoles), please visit [[VorbisSoftwarePlayers]].

== Non-consumer products ==

This is Vorbis in Silicon, meaning chips from which actual consumer products can be built.

;[http://www.vlsi.fi/ VLSI Solution Oy]: VLSI provides two Ogg Vorbis capable chips.

:[http://www.vlsi.fi/en/products/vs1000.shtml VS1000] is an Ogg Vorbis decoder and controller chip based on a 16-bit DSP.

:[http://www.vlsi.fi/en/products/vs1053.shtml VS1053] is a low-power "MP3 decoder" chip based on the same DSP. What makes the IC unique is that it can both decode and [http://www.vlsi.fi/en/support/software/vs10xxapplications.html encode] Ogg Vorbis files. There are several different quality settings to choose from varying from narrowband speech to high-quality stereo music.

;[http://oggonachip.sourceforge.net/ Ogg On A Chip]: A hardware/software implementation with a good report showing how to make FPGAs and the like to decode Vorbis streams.

;[http://www.finearch.com/english FineArch]: FineArch, Inc. developed a hardware core and control software for decoding Vorbis. This technology can be integrated into portable players or cell phones, and since it runs at only 12MHz, it uses very little battery power. It supports files up to 64Kb/s, but could be scaled to 16MHz and 128Kb/s, at the expense of battery life. For more information, see FineArch’s [http://www.finearch.com/english/news/pr_20030715/pr_20030715.htm press release].

;[http://www.mcslogic.com/ MCS Logic]: MCS Logic creates single chip decoders that can play Ogg Vorbis. They supply the Vorbis decoding chips for Havin and Freemax.

;[http://www.telechips.com Telechips]: Telechips has developed the TCC72x, a single chip decoder that can play Vorbis. The TCC72x series is based on an ARM940T core, and it is used widely in Korea for players such as Iops or MobiBlu.

;[http://www.tamulsite.co.kr Tamul Multimedia]: Tamul Multimedia manufactures decoding chips for Samsung. They claim they have Ogg Vorbis decoding firmware, according to [http://www.dt.co.kr/print.html?gisaid=2003031002011367704002 <em>The Digital Times</em>] (Korean).

;[http://www.sigmatel.com/ SigmaTel]: SigmaTel makes several chips which support Ogg Vorbis decoding. After this quote years ago, we knew it was only a matter of time:
<blockquote>"<i>I talked to Deborah Clark, product marketing engineer for audio chipmaker Sigmatel out of Austin, Tex. She is the company's expert in audio decoders. She says there is a growing base of support for Ogg Vorbis. "We can't keep paying these high licensing fees for this. Manufacturers would flock to something that's free." </i></blockquote>
:from a 2000 [http://www.forbes.com/2000/09/18/dvorak_index.html column in Forbes]

:Some STMP3500-based devices support Ogg Vorbis, but there are no notes about this on the SigmaTel website.

:SigmaTel introduces the STMP3600 with support for Ogg Vorbis, MP3, AAC, WMA and more.[http://www.finanznachrichten.de/nachrichten-2005-10/artikel-5493211.asp]

== See also ==
* [[Theora Hardware]]

[[Category:Vorbis]]

Theora

2010-04-09T12:25:59Z

Ogg.k.ogg.k: Undo revision 10962 by Shibo (Talk)

'''Theora''' is a video codec, based on the [[VP3]] codec donated by [[On2 Technologies]]. We've refined and extended it, giving it the same future scope for encoder improvement [[Vorbis]] has. See http://theora.org/ for more information.

== Features ==

Features available in the Theora format (and a comparison to VP3 and MPEG-4 ASP):

* 8x8 Type-II Discrete Cosine Transform
* block-based motion compensation
* free-form variable bit rates (VBR)
* adaptive in-loop deblocking applied to the edges of the coded blocks (not existing in MPEG-4 ASP)
* block sizes down to 8x8 (MPEG-4 ASP supports 8x8 only with 4MV)
* 384 8x8 custom quantization matrices: intra/inter, luma/chroma and even each quant (more than VP3 and MPEG-4 ASP/AVC)
* flexible entropy encoding (Theora supports 80 VLC tables selectable per-frame, MPEG-4 ASP has just one)
* 4:2:0, 4:2:2, and 4:4:4 chroma subsampling formats (VP3 and MPEG-4 ASP only support 4:2:0)
* 8 bits per pixel per color channel
* multiple reference frames (not possible in MPEG-4 ASP)
* pixel aspect ratio (e.g. for anamorphic signalling/playback)
* non-multiple of 16 picture sizes (as possible in MPEG-4 ASP, but not in VP3)
* non-linear scaling of quant values (as done in MPEG-4 AVC)
* adaptive quantization down to the block level (as possible in MPEG-4 ASP/AVC, but not in VP3)
* intra frames (I-Frames in MPEG), inter frames (P-Frames), but no B-Frames (as supported in MPEG-4 ASP/AVC)
* HalfPixel Motion Search Precision (MPEG-4 ASP/AVC supports HalfPixel or QuarterPixel)
* technologies used already in Vorbis (decoder setup configuration, bitstream headers...) not available in VP3
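The chroma subsampling formats and the quantization matrix count in the list above can be illustrated with a little arithmetic. This is only a sketch: the function and constant names are made up for illustration, and the breakdown of the 384 matrices as 2 coding types (intra/inter) × 3 colour planes × 64 quantizer indices is an assumption based on the "intra/inter, luma/chroma and even each quant" wording, not something quoted from the specification here.

```python
# Horizontal and vertical divisors applied to the luma plane size
# to obtain the size of each chroma plane.
CHROMA_DIVISORS = {
    "4:2:0": (2, 2),  # chroma halved in both directions
    "4:2:2": (2, 1),  # chroma halved horizontally only
    "4:4:4": (1, 1),  # chroma at full resolution
}


def chroma_plane_size(width, height, pixel_format):
    """Return (chroma_width, chroma_height) for a given luma plane size.

    Assumes width and height are already padded to a multiple of 16,
    so the integer divisions are exact.
    """
    h_div, v_div = CHROMA_DIVISORS[pixel_format]
    return width // h_div, height // v_div


# Assumed breakdown of the custom quantization matrices (illustrative names).
CODING_TYPES = 2        # intra, inter
COLOUR_PLANES = 3       # Y, Cb, Cr
QUANTIZER_INDICES = 64  # one matrix per quantizer index
TOTAL_MATRICES = CODING_TYPES * COLOUR_PLANES * QUANTIZER_INDICES

if __name__ == "__main__":
    for fmt in CHROMA_DIVISORS:
        print(fmt, chroma_plane_size(640, 480, fmt))
    print("quantization matrices:", TOTAL_MATRICES)
```

For a 640x480 frame, a 4:2:0 chroma plane carries a quarter of the luma samples, 4:2:2 half, and 4:4:4 the same number.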

== Status ==
* '''1.1.1''' is the latest stable release (2009-10-01).
* The bitstream format was frozen in 1.0 Alpha 3 on 2004-08-04: every file created with this encoder (and, of course, later encoders) will be playable by any compliant Theora decoder.
* The decoder in 1.0 Alpha 8 implemented all features of the [http://theora.org/doc/Theora.pdf Theora Format Specification]: every file created by any compliant Theora encoder will be playable by the decoder in 1.0 Alpha 8 (and, of course, later decoders).

== Development ==

* [[OggTheora|Mapping in Ogg]]
* [[TheoraTodo|ToDo list for development]]
* [[Cortado/release|Release checklist for the Cortado java applet]]

== More information ==
{{Template:Theora}}

It's possible to convert VP3 video to Theora. See [[vp3toTheora]].

== External links ==

* [http://www.theora.org/ Theora homepage]
* [http://www.annodex.net/software/theora/ Theora documentation daily builds]
* [[Wikipedia: Theora]]
* [http://www.vp3.com VP3 homepage]: The homepage of the codec Theora is based on
* [http://www.on2.com On2 Technologies]: The authors of VP3
* [http://forum.doom9.org/showthread.php?s=&threadid=77314 Ogg Theora Information on Doom9 Forum]
* [http://www.parrishtech.com/content/view/16/1/ HOWTO: Rip DVD to Theora using Linux]
* [http://www.doom9.org/index.html?/codecs-quali-105-1.htm Codec shoot-out 2005] Comparison of many video codecs, including Theora

[[Category:Theora]]

Cortado/todo

2010-03-22T20:06:55Z

Ogg.k.ogg.k: skeleton support

*Chaining support. (At least 'play through' on an icecast stream)
*Test new Theora implementation (pollux)
*New Vorbis implementation? Must benchmark and test with old JVMs.
*Dirac support? DS says jirac "mostly" works.
*Improved duration scanning, seeking, and support for the index when it's finalized.
*Make the ant build scripts produce proguarded binaries. (See http://proguard.sourceforge.net/manual/ant.html)
*Rethink Cortado's buffer management (default values and logic)
*Fix short files/low fps not playing correctly: [http://myrandomnode.dyndns.org:8080/~gmaxwell/cortest/cortest-vfr2.html] (should be 10 sec long)
*Do everything HTML5[http://www.w3.org/TR/html5/video.html] does.
**Loop
*Port Gstreamer's QoS system to improve behavior when CPU-constrained
*Catch the case where the packet is not zero bytes but there are no coded blocks and also treat that as a duplicate
*Find out how to eliminate extra buffer copying when overlaying a Kate stream onto the video (requires knowledge of how Java manages images, ImageProducer objects).
*Skeleton support
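The duplicate-frame item in the list above could be handled along these lines. This is a hypothetical sketch in Python rather than Cortado's Java: <code>is_duplicate_frame</code> and <code>coded_block_count</code> are illustrative names, and it assumes the decoder has already parsed the frame far enough to know how many blocks are coded.

```python
def is_duplicate_frame(packet, coded_block_count):
    """Decide whether a video packet should repeat the previous frame.

    packet            -- raw packet payload (bytes)
    coded_block_count -- number of coded blocks found after parsing the
                         packet (hypothetical value supplied by the decoder)
    """
    # A zero-byte packet is the usual signal for a dropped/duplicate frame.
    if len(packet) == 0:
        return True
    # The extra case from the todo item: the packet has data, but nothing
    # in it is actually coded, so the picture cannot have changed.
    return coded_block_count == 0
```

Treating both cases the same way would let the player skip the decode and redraw steps entirely.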

===Crazy things===
*Vorbis low-pass mode to reduce aliasing when the plugin is producing µlaw output for a 1.1 JVM.
*Perceptual noise shaping for the ulaw output (if we're feeling sufficiently insane)
*HRTF based surround -> stereo downmix.
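For context on the µlaw items above, here is a minimal sketch of µ-law companding using the continuous formula with µ = 255. It is written in Python for brevity and is not Cortado's code; a real 8-bit encoder would normally use the segmented G.711 tables instead, and the function names are illustrative.

```python
import math

MU = 255  # companding parameter commonly used for 8-bit µ-law


def mulaw_compress(x):
    """Compand a sample in [-1.0, 1.0] to a value in [-1.0, 1.0]."""
    x = max(-1.0, min(1.0, x))  # clip out-of-range input
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y):
    """Invert mulaw_compress for a companded value in [-1.0, 1.0]."""
    y = max(-1.0, min(1.0, y))
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


if __name__ == "__main__":
    for sample in (0.0, 0.01, 0.1, 0.5, 1.0):
        print(sample, round(mulaw_compress(sample), 4))
```

Quiet samples get a much larger share of the output range than loud ones, which is why the quantization noise left after companding is a natural target for the noise shaping mentioned above.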

===Done things===
*''Maybe done'': Fix handling of zero byte packets / drop frames / VFR (and add the libogg readB fix)
*DONE (needs more testing): Make the Ogg/Vorbis only build work again (requires removing the static references to Theora/Kate in Durationscanner)
*DONE: Make it work in Netscape 4 (plugins.ini access is killing it)
*DONE: Make 4:2:2 and 4:4:4 work.
*DONE: fix surround sound support in jorbis.

Cortado/todo

2010-03-21T14:54:17Z

Ogg.k.ogg.k: mention extra copying when overlaying, which needs to be avoided

*''Maybe done'': Fix handling of zero byte packets / drop frames / VFR (and add the libogg readB fix)
*DONE (needs more testing): Make the Ogg/Vorbis only build work again (requires removing the static references to Theora/Kate in Durationscanner)
*DONE: Make it work in Netscape 4 (plugins.ini access is killing it)
*DONE: Make 4:2:2 and 4:4:4 work.
*Chaining support. (At least 'play through' on an icecast stream)
*Test new Theora implementation (pollux)
*New Vorbis implementation? Must benchmark and test with old JVMs.
*Dirac support? DS says jirac "mostly" works.
*Improved duration scanning, seeking, and support for the index when it's finalized.
*Make the ant build scripts produce proguarded binaries. (See http://proguard.sourceforge.net/manual/ant.html)
*Rethink Cortado's buffer management (default values and logic)
*Fix short files/low fps not playing correctly: [http://myrandomnode.dyndns.org:8080/~gmaxwell/cortest/cortest-vfr2.html] (should be 10 sec long)
*Do everything HTML5[http://www.w3.org/TR/html5/video.html] does.
**Loop
*Port Gstreamer's QoS system to improve behavior when CPU-constrained
*DONE: fix surround sound support in jorbis.
*Catch the case where the packet is not zero bytes but there are no coded blocks and also treat that as a duplicate
*Find out how to eliminate extra buffer copying when overlaying a Kate stream onto the video (requires knowledge of how Java manages images, ImageProducer objects).

===Crazy things===
*Vorbis low-pass mode to reduce aliasing when the plugin is producing µlaw output for a 1.1 JVM.
*Perceptual noise shaping for the ulaw output (if we're feeling sufficiently insane)
*HRTF based surround -> stereo downmix.

Summer of Code 2010

2010-03-15T11:19:56Z

Ogg.k.ogg.k: Skeleton

This is our ideas page for [http://code.google.com/soc/ Google Summer of Code] projects with [http://xiph.org Xiph.org] and [http://annodex.org/ Annodex]. The two projects participate jointly this year under Xiph's name.

'''Students''' please use the template at [[Summer of Code Applications]] when applying for a GSoC position.

'''Mentors''' please visit [[Summer of Code Mentoring]] and help us prepare our application as a mentoring organization.

== Current Ideas ==

=== OggIndex ===
OggIndex has recently been introduced and adds a keyframe index to the Ogg Skeleton track. Support needs to be added to many existing open source applications, such as MPlayer, VLC, etc, so that they can take advantage of the keyframe index when seeking. For more info see [[OggIndex-Migration]], [[Ogg_Index]], and [http://blog.pearce.org.nz/2010/01/indexing-keyframes-in-ogg-videos-for.html Indexing keyframes in Ogg videos for fast seeking]. Mentor: Chris Pearce
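As a rough illustration of how a player might use such a keyframe index when seeking, the sketch below does a binary search over a list of (time, byte offset) pairs. The in-memory layout, the millisecond units and the function name are assumptions for illustration only; they are not the on-disk OggIndex format.

```python
from bisect import bisect_right


def seek_target(index, target_ms):
    """Return the (time_ms, byte_offset) entry to start decoding from.

    index     -- list of (time_ms, byte_offset) keyframe entries,
                 sorted by time (hypothetical in-memory form of the index)
    target_ms -- requested seek position in milliseconds
    """
    times = [time_ms for time_ms, _ in index]
    # Find the last keyframe at or before the requested time.
    pos = bisect_right(times, target_ms) - 1
    if pos < 0:
        pos = 0  # target is before the first keyframe: start at the beginning
    return index[pos]


if __name__ == "__main__":
    keyframes = [(0, 0), (2000, 51200), (4000, 98304)]
    print(seek_target(keyframes, 3500))
```

Without an index, a player typically has to bisect over the file itself and read pages to discover timestamps, which is what makes the index attractive for seeking over HTTP.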

=== Skeleton support===
Get skeleton patches upstream so players:
* stop choking on it.
* start using the information it contains.

Cortado/todo

2010-03-14T22:22:33Z

Ogg.k.ogg.k: -ov- works again

*''Maybe done'': Fix handling of zero byte packets / drop frames / VFR (and add the libogg readB fix)
*DONE (needs more testing): Make the Ogg/Vorbis only build work again (requires removing the static references to Theora/Kate in Durationscanner)
*DONE: Make it work in Netscape 4 (plugins.ini access is killing it)
*DONE: Make 4:2:2 and 4:4:4 work.
*Chaining support. (At least 'play through' on an icecast stream)
*Test new Theora implementation (pollux)
*New Vorbis implementation? Must benchmark and test with old JVMs.
*Dirac support? DS says jirac "mostly" works.
*Improved duration scanning, seeking, and support for the index when it's finalized.
*Make the ant build scripts produce proguarded binaries. (See http://proguard.sourceforge.net/manual/ant.html)
*Rethink Cortado's buffer management (default values and logic)
*Fix short files/low fps not playing correctly: [http://myrandomnode.dyndns.org:8080/~gmaxwell/cortest/cortest-vfr2.html] (should be 10 sec long)
*Do everything HTML5[http://www.w3.org/TR/html5/video.html] does.
**Loop
*Port Gstreamer's QoS system to improve behavior when CPU-constrained
*DONE: fix surround sound support in jorbis.
*Catch the case where the packet is not zero bytes but there are no coded blocks and also treat that as a duplicate

===Crazy things===
*Vorbis low-pass mode to reduce aliasing when the plugin is producing µlaw output for a 1.1 JVM.
*Perceptual noise shaping for the ulaw output (if we're feeling sufficiently insane)
*HRTF based surround -> stereo downmix.

Talk:Vorbis

2009-12-14T13:14:11Z

Ogg.k.ogg.k: Undo revision 10754 by Oyunlar35 (Talk)

== Question ==

What about non-standard encoders and tunings?

Shouldn't you make a page about how to encode vorbis files?

[JohnRipley] How about a list of third party implementations of the Vorbis codec itself? For example: JOrbis, and mine :)

== Windows Media Player Encoding ==

[cparker] I'd like to know how to enable Windows Media Player to encode vorbis files directly from the "Rip" tab. I'm using Windows Media Player 9-10. I checked vorbis.com[http://vorbis.com], and it appears to be quite outdated. (It makes a reference to irc.xiph.org.)

== HW requirements ? ==

[xerces8] What CPU power (in terms of popular PC CPUs) is required for realtime decoding of Vorbis ?
Does tremor require more/less time/space as the "classic" version ?
(I plan to purchase a used laptop to use as a Vorbis playing station, so I need to know, thanks)

== RE: HW requirements ? ==

> What CPU power required for realtime decoding of Vorbis

Why don't you test ? 100 MHz Pentium 1 (just a guess)

> Does tremor require more/less time/space as the "classic" version ?

Space: same/irrelevant/unreproductable (???), time: probably slightly slower

[[User:DOS386|DOS386]] 15:21, 19 October 2007 (PDT)

== Merge proposal ==

Any reason for this:

* [[VorbisSoftwareEncoders]]: List of libvorbis frontends
* [[VorbisEncoders]]: List of encoders (e.g. Xiph, aoTuV, GT, vorbis-java)

Merge them ? [[User:DOS386|DOS386]] 15:21, 19 October 2007 (PDT)

:I guess that's a solution.--[[User:Saoshyant|Ivo]] 11:16, 20 October 2007 (PDT)

Talk:FLAC

2009-12-14T13:10:47Z

Ogg.k.ogg.k: Undo revision 10755 by Oyunlar35 (Talk)

== Devices that Support FLAC ==

The FLAC Page SHOULD also note supported Devices as this is a big issue for many of us.

please add this and I'd love to add devices and such.

:Anyone willing to do this would be great.--[[User:Saoshyant|Ivo]] 07:00, 31 May 2007 (PDT)

== The L.A.B ==

I would like to see if anyone has a problem with me adding The Lossless Audio Blog under external links? The Lossless Audio Blog is a news and information site dedicated to lossless audio formats. Because it's an information site I think it fits well here.[[User:Windmiller|windmiller]] 12:36, 25 October 2006 (PDT)

== Plagiarism from Wikipedia ==
It looks to me that this article was lifted word by word from an older version at Wikipedia. Our licenses are not compatible with them thanks to the bloody FSF not cooperating with Creative Commons, or whatever, I don't care. If this article was plagiarized, I'll have no alternative but to delete it.--[[User:Saoshyant|Ivo]] 07:00, 31 May 2007 (PDT)

Talk:Main Page

2009-11-15T22:25:47Z

Ogg.k.ogg.k: spamicide

According to [[Special:Popularpages]], the various pages in the Demonstration section are the most visited parts of the wiki, so I moved that section to the top of the main page. --[[User:Andrel|Andrel]] 09:19, 26 April 2006 (PDT)

== Work in Progress ==

It's not clear on first view (to me at least) that
[[Main Page#Work in Progress]] is a link to
[[Work In Progress]] (as none of the other section headings
are). Possibly it should be a normal heading with the link
in a short text below (à la [[Main Page#Other software]]).

[[User:Imalone|Imalone]] 05:22, 1 February 2006 (PST)

You are free to fix that. It's a wiki after all -- [[User:Jmspeex|Jmspeex]] 19:22, 1 February 2006 (PST)

Done (just didn't want to trample all over the front page) -- [[User:Imalone|Imalone]] 04:33, 2 February 2006 (PST)

== Lock This Page ==

On all/most other wikis the Main Page is locked so only admins can edit it. Due to the amount of vandalism, I think the [[Main Page]] should be locked and all changes discussed here. --[[User:SonicChao|SonicChao]] 05:11, 27 August 2006 (PDT)

:Done !

== Paranoia / cdparanoia ==

Why is there no listing under software of paranoia or cdparanoia? Also, there is no listing on the main xiph.org page. Is that software acknowledged? --[[User:WhiteDragon|WhiteDragon]] 19:52, 9 September 2006 (PDT)

== Suggestion :-O ==
First I want to congratulate you on the wonderful work being done. Thank you very much :-)
Please allow anonymous edits (like wikipedia does) b'caus i'm too lazy to login :-)

:I'll give you the benefit of the doubt and assume the ton of hidden links I just culled has nothing to do with you... I can't speak for the people running Xiph but requiring a login reduces some of the flood of spam that shows up here, and Wikipedia has many more resources available to deal with it than this wiki does. [[User:Imalone|Imalone]] 06:06, 24 November 2006 (PST)

== Proposal for a developer section ==

As more developers start to "get it" about how ultra cool Ogg / Vorbis / Theora / etc are, wouldn't it be great to have a wiki section devoted to helping these budding programmers along? E.g. I've written some nice code I'd be happy to share. It could contain a programming FAQ, how-tos, and real code. Thoughts? [[User:Davec|Davec]] 13:44, 6 December 2006 (PST)

== Why CamelCase? ==

MediaWiki supports free links, why are most page titles in the CamelCase format? - [[User:Sikon|Sikon]] 05:34, 27 February 2007 (PST)
:CamelCase? I don't see what you mean. If you think something's wrong, you may go ahead and change it. That's what wikis are for.--[[User:Saoshyant|Saoshyant]] 05:37, 27 February 2007 (PST)
:[[WhatHappened|Historical reasons]]. The original wiki used software that only supported [http://c2.com/cgi/wiki?CamelCase CamelCase]. For new pages it is fine to use free links. I suggest not renaming pages, as many of them have good search ranking. [[User:Andrel|Andrel]] 07:03, 27 February 2007 (PST)
I see now. And thanks for the WhatHappened link, Andrel. I managed to recover two pages so far from the web archive. I wonder if I'll salvage more.--[[User:Saoshyant|Saoshyant]] 08:28, 27 February 2007 (PST)

== ICECast2 vs vBulletin ==

Hi there,
First, let me thank you all for the great work on the ICECast streaming server and for how well it performs.

I also wonder whether anyone has integrated ICECast directly into a vBulletin board, so that it uses the same usernames, permissions and preferences from the board's own database (running on MySQL 5). The idea is to set a permission on a user group from within vBulletin that allows its members to stream out through your ICECast server, while other users can apply for it and, in the meantime, just take part in the main forum.

I have both running on my dedicated server now. If anyone wants to help me create such a hack or template, I am willing to give them full access to work on it.
Can YOU imagine how friendly and powerful ICECast would be then?

Hey, give me a shout if you are willing to try it.

admin@gysmo.net

== News ==

Hi,

I would really strongly suggest adding this to your news items... Someone has made a graphical interface for ffmpeg2theora at http://www.softpedia.com/get/Multimedia/Video/Other-VIDEO-Tools/GFrontEnd-for-ffmpeg2theora.shtml

I don't know the technical difficulties of building such a tool, but regardless, the ability for a common Windows user like myself to be able to just easily convert a proprietary file format to Ogg-Theora is really cool... I tested it out on the sample WMV file that came with this laptop, and the converted file worked great in Cortado... Of course, news about programs that allow one to record directly to Ogg-Theora would be even better, but this is still very important, imo... [[User:Brettz9|Brettz9]] 20:56, 4 July 2007 (PDT)

== .NET :-( ==

> I would really strongly suggest adding this to your news items

Check it out: > Requirements: .NET Framework 2.0

Also, at 1.5 MiB it's bigger than FFMPEG2THEORA itself (1.3 MiB after recompressing with UPX 3.0 --ultra-brute), which is not very good IMHO.

: Sorry, do you mean MB as in Megabytes? That's a drop in the bucket of most hard drives nowadays, no? And .NET was already on my system for something I had downloaded earlier (not sure what, but maybe others may have it already too). My interest in seeing it announced is not how well it is implemented--if there are better alternatives let them be known--but that such a tool exists and it works (at least if you get the requirements).

== MiB ||| .NET ==

> Sorry, do you mean MB as in Megabytes?

NO. [http://en.wikipedia.org/wiki/Binary_prefix]

> And .NET was already on my system for something I had downloaded earlier (not sure what, but maybe others may have it already too)

I don't have .NET and don't like it :-( In the end, the important thing is the FFMPEG2THEORA core, and it works perfectly for me without .NET ;-)

== Link to Games in Demonstrations Section? ==

Shouldn't there be a link to [[Games_that_use_Theora]] in the section with other demonstrations? Maybe there's not enough games listed on the page to warrant it?
--[[User:Sim9|Sim9]] 11:47, 27 October 2007 (PDT)

:Yes. There are only two games listed right now. I'm pretty sure there are more out there. Sim9, can you help us list more games?--[[User:Saoshyant|Ivo]] 11:56, 28 October 2007 (PDT)

::I'm sure you're right, there ''has'' to be more than two. Theora (and Vorbis) is now by default included in the Torque game engine, so there must be a lot of games using it by now. I just polled the Torque community to see if they know of any to help us fill up the list with some successful integrations! --[[User:Sim9|Sim9]] 19:02, 29 October 2007 (PDT)

== Add xvid it's opensource ==

Please add xvid as a usable video format in ogv and others.
It's an open source, mature codec and doesn't have the encoder problems theora currently has.
Embedding it makes it possible to show off ogv files with a codec that shows the true power of ogv.

Please make it possible to put this in, so .ogv can be used immediately with xvid and vorbis/speex/flac to create a mature and temporary solution until theora 1.0 hits the digital streets.
Then users can just batch convert it with their applications whenever they feel/want to do it.
(When they think theora is ready.)
Could someone please look into this and tell me if this is possible and/or will be integrated?
--[[User:Vmol|Vmol]] 1 February 2008

:Vmol, you should read about the issues a bit before filling up the whole wiki with so many questions. Most of your questions, concerns and statements have already been thought about. In this case, Xvid cannot be considered because it is a patented format. That means it's not a free format like Theora and Xiph cannot use it. Theora is currently undergoing the last stages of beta before version 1.0, and quality is already on par with Xvid. Also, users can't simply transcode from one video format to another; you lose quality every time you do it, because most video formats are lossy.--[[User:Saoshyant|Ivo]] 12:37, 2 February 2008 (PST)

== Random access Ogg Vorbis decoder written in Java. ==

And I am very glad that now you have an encoder written in Java. Can vorbis-java-1.0.0 also do the decoding?
Is there an example of how to use vorbis-java decoder?
If yes, can it seek, i.e. decode an Ogg Vorbis bitstream from a random position?
--[[User:Sergey|Sergey]] 12:53, 9 February 2008 (PST)

== For lossless video compression, make it possible to have Lagarith codec as video ==

Lagarith is a lossless video codec. Please support it; by supporting I mean that it can be used in the ogg and annodex containers as a native video format.
--[[User:Vmol|Vmol]] 4 May 2008

:There is also HuffYUV besides Lagarith. Also, the new Dirac codec supports lossless compression, reportedly better than anything else. Agreed, a lossless codec should be added. Just carefully select one of them ;-) [[User:DOS386|DOS386]] 01:31, 4 May 2008 (PDT)

:: Lossless compression would still be a good thing to have for Theora.
:: Even if there are other formats available that can do lossless.
:: Not convinced about using another codec, a lossless mode for Theora is useful.
:: Link to Theora todo page where lossless mode is requested:
:: [http://wiki.xiph.org/Talk:TheoraTodo]
:: --[[User:Vmol|Vmol]] 25 Jul 2009

== Add link to "Reporting Abuse" page on the front page? ==

There's been a bunch of spam recently, but I couldn't find any way to report abuse. I've added a skeleton page, could we add a link to the front page?

Talk:OggKate

2009-08-01T17:15:07Z

Ogg.k.ogg.k: /* Possible additions restrictions trap */

== Kate is going to have support for all languages in the world, right ? ==
(This can be useful to make a video where a user can choose the right subtitle language.)

:OggKate supports Unicode (UTF-8), so yes. [[User:Martin.leese|Martin Leese]] 15:46, 29 January 2008 (PST)

::With the right fonts, I have a test stream that displays Japanese, Arabic, Chinese, as well as Latin characters. The only thing left open there is how to deal with languages like Arabic which are written right to left. The language in a stream is set in the header as a language/region tag, such as en_US, or just en. [[User:ogg.k.ogg.k]] Wed Jan 30 18:20:49 UTC 2008

::Right to left now supported (in my local version of xine). Language directionality can be overridden for each data packet from the default given in the headers. [[User:ogg.k.ogg.k]] Thurs Jan 31 13:29 UTC 2008

Be careful to make sure you're using the latest ISO standard (the one with the highest number) about languages.
Because there are already a few so you could miss and end up using a wrong one.
For the rest Kate looks very good ;)

:: Well, I am not certain about this - assuming you are referring to the latest RFC about language identification (the latest one is RFC 4646 I believe), then it is kinda complex, and I plan on supporting only part of it (yes, I know this is probably the standard's bane to have partial implementations). A full language tag can be quite long, and that RFC suggests a max "sane" size of 42 bytes. I have actually looked at what I'll do with that this weekend and am currently going with a 15 character string, which should handle easily things like primary tag and one (or two small) secondary tags, like "en_GB". Language plus country should cover most needs. However, it is possible to specify a language override in each data packet, if precision is required. [[User:ogg.k.ogg.k]] Wed Feb 6 12:08:03 UTC 2008

:I believe it would be useful if the person who asked the question went away and learnt something about Unicode. From the Unicode website, "Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language." There are around 7000 written languages in the world. (There is no agreement on an exact figure.) Unicode does the lot. [[User:Martin.leese|Martin Leese]] 10:06, 5 February 2008 (PST)

:: As a note, Kate uses UTF-8 only at the moment, and supports 31 bit UCS space (if a define is set, off by default), and current code points up to 0x10ffff (ie, the 17 currently defined planes). I haven't quite ruled out UTF-16 and UTF-32, but if I add them in, libkate will have an auto conversion option for client code. Note that Kate doesn't concern itself with rules of ligatures, etc, defined by Unicode; that is up to the rendering client. [[User:ogg.k.ogg.k]] Wed Feb 6 12:16:05 UTC 2008

:::I have learned stuff about unicode. That was stupid of me if you think that I was asking about Unicode supporting it.
I was asking about Kate supporting all unicode features.
(I didn't know about Unicode having language, country,... mapping.)
If you want to tell me that Unicode has a region, language, currency,... mapping on top of a character mapping, then say it clearly.

Reading this page: [http://cldr.unicode.org/]
All the localization stuff is under the name CLDR.
CLDR is about the Unicode Common Locale Data Repository
It does a lot more than just language and region mapping.
In fact, the other things are also very useful to have.

(e.g. Engineers and the whole scientific community would be very pleased with the number localization.)
(Because of the decimal and thousands separator issue: [http://en.wikipedia.org/wiki/Decimal_separator] )

There is a region definitions header present in Kate.
For the CLDR information, there needs to be a new header.
Will there be a CLDR Definitions header or extended Unicode Definitions header somewhere in the future?
Please?

: While I have not looked in depth at the CLDR, I don't think it's something that matters here.
: It seems to be more useful to programs' localization. The CLDR would then be more useful in a
: possible Kate editor, for instance. Once applied, text would go in the Kate stream and the CLDR
: would not be useful anymore. Feel free to correct me if I'm missing your point though. [[User:Ogg.k.ogg.k|Ogg.k.ogg.k]]

== Embedding of bitmap fonts ==

Embedding bitmap fonts in the stream seems a very odd idea to me in this day and age where display resolutions increase constantly and the number of output devices varies so much (desktop display, mobile phone, internet tablet etc.). What's the point of it? I think this idea should just be dropped. (TimMüller)

:For simple text subtitles, this is very true, but the idea was to allow control over the presentation
:of the screen for other uses. Since images are supported, adding bitmap fonts is trivial, since it
:is just a mapping from code point to bitmap index. The goal is not to encourage using custom fonts
:but to allow it if needed.
:Another point was that people wanting control over the font might use bitmaps directly to fake text
:in a particular font, and this would result in visible text that couldn't be interpreted (eg, by
:text to speech software).
:That said, I do agree with your argument. [[User:Ogg.k.ogg.k|Ogg.k.ogg.k]]

==Possible additions restrictions trap==

Please don't restrict your codec from having more than 256 colors in a bitmap.
Applications can do it anyway, and weren't these codecs meant to create freedom in the first place?
There is nothing wrong with allowing these things and MNG in overlays.
OggSpots is meant for having a timed image track, not for image overlays.
OggKate isn't duplicating because it just isn't in the same scope.
High quality images in OggSpots will look very weird with low-quality 256-color image overlays.
It probably won't look good.

: Kate is not limited to 256 color images. These can be encoded natively, but any other PNG
: image may be embedded too, including non paletted images.
: As for MNG, there is a MNG mapping for Ogg too.

Please do allow the embedding of shared data (fonts).
That's a fantastic idea you've got there, don't let it slide.
It would be great to be able to make a font, add it to the file and use it for subtitles.
It would solve the platform dependency issue with fonts, which is currently a big deal.
(There would even be more freedom added to your codec, this way.)

Please add support for svgfonts.
They have the same advantages for fonts as svg has for images.
It's vector based which means very good looking fonts.

What is your opinion about svgfonts?

: I have no opinion about svgfonts, as I do not know about them.
: I have proof of concept support for SVG images in Kate, but there are unresolved issues.
: It'd be nice to have SVG in, and I suppose (but only suppose) SVG fonts would automagically work then. [[User:Ogg.k.ogg.k|Ogg.k.ogg.k]]

Talk:OggKate

2009-08-01T17:10:12Z

Ogg.k.ogg.k: sign my comment


Talk:OggKate

2009-08-01T17:08:21Z

Ogg.k.ogg.k: /* Kate is going to have support for all languages in the world, right ? */


How to do a release

2009-06-30T10:55:08Z

Ogg.k.ogg.k: mention make distcheck too

You made a new release, the world is waiting for it.
Here is what to do:

== Update versions and CHANGES files ==

Verify all project release versions embedded throughout the source and build system have been updated to appropriate values for the release. For projects that use the autotools, this means checking configure.in/configure.ac for AC_INIT, *LIB_CURRENT, *LIB_REVISION and *LIB_AGE. Depending on the project, there might be a version.h file, vendor or lib version strings embedded in the source somewhere (eg, lib/info.c for libvorbis or lib/internal.h for libtheora) and various other build project files for non-UNIX platforms (eg, macosx/Info.plist).

Changes, additions, improvements, and major bugfixes should be summarized in the CHANGES file. A good way to avoid missing anything is to look through the SVN log since last release and cherrypick the bits that would be of interest to outside developers or project managers.
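
One way to catch version strings that were missed is to search the tree for the ''previous'' release number. The following is only a minimal sketch and not part of any Xiph tooling: the function name <tt>find_version_strings</tt> is hypothetical, and <tt>--exclude-dir</tt> assumes GNU grep.

```shell
#!/bin/sh
# Hypothetical helper: list every file under a source tree that still
# mentions a given version string.
# Assumption: GNU grep (for --exclude-dir).
find_version_strings() {
    version=$1
    tree=${2:-.}
    # -r: recurse, -l: print file names only, -F: match the version literally
    grep -rlF --exclude-dir=.svn -- "$version" "$tree"
}
```

For example, running <tt>find_version_strings 1.2.1 .</tt> while preparing 1.2.2 lists the files that may still carry the old number. Matches in the CHANGES file are expected and harmless.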

== Tag in SVN ==

All official project releases must be tagged in SVN. This is done using the SVN copy command; essentially a versioned copy of a specific module/branch is copied to the tags directory in SVN. For example, the libvorbis 1.2.2 release was tagged using:

svn copy http://svn.xiph.org/trunk/vorbis@16168 http://svn.xiph.org/tags/vorbis/libvorbis-1.2.2
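
The general shape of that command can be sketched as a small helper that only prints the command for review. It assumes the repository layout seen in the libvorbis example above (<tt>trunk/MODULE</tt> and <tt>tags/MODULE/TAG</tt>); the function name is hypothetical, and other projects may live under different paths.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print (do not run) the svn copy command that
# tags revision REV of trunk/MODULE as tags/MODULE/TAG.
# Assumption: the URL layout matches the libvorbis 1.2.2 example.
svn_tag_command() {
    module=$1
    rev=$2
    tag=$3
    base=http://svn.xiph.org
    printf 'svn copy %s/trunk/%s@%s %s/tags/%s/%s\n' \
        "$base" "$module" "$rev" "$base" "$module" "$tag"
}
```

Calling <tt>svn_tag_command vorbis 16168 libvorbis-1.2.2</tt> reproduces the command shown above. Pinning the revision with <tt>@REV</tt> means the tag records exactly the tree that was tested, even if trunk has moved on since.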

== Prepare a tarball ==
./autogen.sh
make dist

If a distcheck target is available, then it should be used instead, as it can spot
common mistakes:
./autogen.sh
make distcheck
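
The choice between the two targets can be sketched as follows. This is an illustration under assumptions: it relies on <tt>make -n distcheck</tt> (a dry run) exiting with a non-zero status when the target does not exist, as GNU make does, and on <tt>autogen.sh</tt> leaving a usable Makefile behind. The function name <tt>build_tarball</tt> is hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: regenerate the build system, then prefer
# "make distcheck" and fall back to "make dist" if that target is missing.
# Assumption: a dry run ("make -n") fails when the target has no rule.
build_tarball() {
    ./autogen.sh || return 1
    if make -n distcheck >/dev/null 2>&1; then
        make distcheck
    else
        make dist
    fi
}
```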

Ideally, offer binaries for the different systems. This is not required, and many packages (such as libvorbis, etc) ship only as source releases. If in doubt, do what previous releases did. If there are no previous releases, note that libraries usually ship as source only, while applications tend to offer binaries.

== Create a release directory under [http://svn.xiph.org/releases/ http://svn.xiph.org/releases/] ==

If you are uploading the first release of a project to Xiph.org, then first create a release directory in the svn repository. You can do this using remote svn commands (rather than checking out the entire Xiph.org release archive):

<tt>svn mkdir https://svn.xiph.org/releases/PROJECTNAME</tt>

Then check that directory out locally:

<tt>svn co https://svn.xiph.org/releases/PROJECTNAME PROJECTNAME-releases</tt>

== Add new release files ==

Add tarballs etc. to your local checkout of the release directory:

<tt>cd PROJECTNAME-releases</tt>
<tt>svn add PROJECTNAME-x.x.x.tar.gz</tt>

Then, generate MD5 and SHA1 checksums for these files. Extending the checksum files is easy on a Unix machine:

<tt>md5sum PROJECTNAME-x.x.x.tar.gz >> MD5SUMS</tt>
<tt>sha1sum PROJECTNAME-x.x.x.tar.gz >> SHA1SUMS</tt>

Check that the only modifications to the checksums are for the new files:

<tt>svn diff</tt>

If everything is ok (and the checksums for other files have not changed), commit:

<tt>svn commit</tt>
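
The checksum step above can be wrapped so that each new entry is verified right after it is appended. This is a minimal sketch assuming the GNU coreutils <tt>md5sum</tt> and <tt>sha1sum</tt> tools and their <tt>-c</tt> (check) option; the function name <tt>add_checksums</tt> is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: append checksums for one tarball to MD5SUMS and
# SHA1SUMS in the current directory, then verify the entries just written.
# Assumption: GNU coreutils md5sum/sha1sum, which read a checksum list
# from standard input when given "-c -".
add_checksums() {
    tarball=$1
    md5sum "$tarball" >> MD5SUMS || return 1
    sha1sum "$tarball" >> SHA1SUMS || return 1
    # The entry for this tarball is the last line of each file.
    tail -n 1 MD5SUMS | md5sum -c - &&
    tail -n 1 SHA1SUMS | sha1sum -c -
}
```

Run it from inside the release checkout, for example <tt>add_checksums PROJECTNAME-x.x.x.tar.gz</tt>. It does not replace the <tt>svn diff</tt> review, which is still what confirms that no existing entries have changed.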

== Website update ==
=== Add downloadable files ===
Within about 30 minutes (the mirror push runs from a half-hourly cron task), repository changes will be visible at
http://downloads.xiph.org/releases/YOUR-COMPONENT/

==== Immediate mirror update ====
The mirrorpush is performed by an every-half-hour cron task. If, for some reason, it's important to update the mirrors immediately, the following may be run as root on Motherfish to force-push:

cd /home/mirrorpush; ./update_downloads.sh

The script must be run from the /home/mirrorpush directory.

=== Update HTML ===

==== Update [http://www.xiph.org/downloads/ http://www.xiph.org/downloads/] ====

Then you should update the [http://www.xiph.org/downloads/ download section] on the Xiph website.
The downloads page is in the normal svn repository for www.xiph.org:

<tt>svn co https://svn.xiph.org/websites/xiph.org/downloads/</tt>

Update it with your release tarball name and checksum, and commit.

==== News page ====
New releases of official projects should include an announcement. The same announcement that is sent to the email announcement lists is used as the basis for a 'press release' on the Xiph [http://www.xiph.org/press/index.shtml.en news/press page]. New news entries must be added separately to the [http://www.xiph.org/press/index.shtml.en press page] and the [http://www.xiph.org/index.shtml.en Xiph front page]. Theora-related releases should also be added to the [http://www.theora.org/news/index.shtml.en Theora News page].

The various xiph.org web sites must be edited through SVN just like the release download files. The websites can be found under [https://svn.xiph.org/websites/ svn.xiph.org/websites/].

==== immediate HTML update ====

Website changes are updated by a cron script like the download mirrors. To force an immediate website update, perform the following as root on Motherfish:

cd /var/www; ./update_websites.sh

== Announcement ==
Announce your release where appropriate. This can include
* The various Xiph.Org website news pages; see above.
* The Xiph [http://lists.xiph.org/mailman/listinfo/announce Announce] mailing list
* your blog
* the project's FreshMeat page
* Linux Weekly News <lwn@lwn.net>
* comp.os.linux.announce <cola@stump.algebra.com>
* ''<other suitable places>''

It might also be a good idea to notify people maintaining ports of your project.

== See Also ==
* [[CodingGuidelines]]
* [[MIT approach to design and implementation]]

[[Category:Developers stuff]]

Summer of Code 2009

2009-03-12T07:47:31Z

Ogg.k.ogg.k: /* Javascript Library for Subtitles, Captions and other time-aligned text */

This is our ideas page for [http://code.google.com/soc/ Google Summer of Code 2009] projects with [http://xiph.org Xiph.org] and [http://annodex.org/ Annodex]. The two projects participate jointly this year under Xiph's name.

'''Students''' please use the template at [[Summer of Code Applications]] when applying for a GSoC position.

'''Mentors''' please visit [[Summer of Code Mentoring]] and help us prepare our application as a mentoring organization.

== General Ideas ==

* Kate to HTML & CSS overlay library in javascript.
* Proof of concept liboggplay-based media patch for Google's Chrome browser.
* mod_duration apache module to generate X-Content-Duration headers for Ogg files.
* Get skeleton patches upstream so players stop choking on it.
* Portable listening application for codec MOS/MUSHRA comparisons (Win32, MacOS, Linux; FF3.1 web application?).
* Conference bridge using CELT.
* Reference SIP client for CELT.
* Firefox extension to record locally and stream to icecast.
* Firefox extension to support RTP for conferencing.
* OpenMAX IL components for Ogg codecs

== Detailed Project Descriptions ==

These ideas were suggested by various members of the developer community as projects that would be beneficial and which we feel we can mentor. Students should feel free to select one of these, develop a variation, or propose their own ideas, ideally on this page.

=== Proof of Concept liboggplay (html5 video) support in Chromium Browser ===

This project would focus on integrating support for liboggplay into Chrome. It would only need to be a proof of concept, with the end result being some frames decoded in the browser. We have some direct contacts with people on the Chromium project in Google, but would expect the student mostly to work through the Xiph on Chromium online communities.

[http://code.google.com/chromium/ Chromium Home Page]

=== Metavid related projects ===

see [http://metavid.org/wiki/Summer_of_Code_2009 full page on metavid.org]
* Improve transcript import / export system:
** Wiki to SRT
** SRT to Wiki
** CMML to Wiki
** Extend oggz_chop or other tool for exporting transcript encapsulated in the ogg file.

=== Javascript Library for Subtitles, Captions and other time-aligned text ===

The main focus of the project is around enabling video accessibility for Ogg in Firefox.

Captions, subtitles and other categories of time-aligned text are starting to become relevant to HTML5. In Ogg, we currently encapsulate such data in OggKate and can use SRT or Kate as input formats. Display of OggKate is currently supported in VLC and there are patches for various other media players. We now want to enable Web browsers to also deal with these time-aligned text tracks in those Web Browsers that support the HTML5 video tag.

There is a proof of concept patch for Firefox 3.5 and liboggplay through which Firefox is capable of decoding Ogg Kate tracks and either overlaying them onto the video or handing the raw text to the browser (eg, for text to speech). However, there is no display of OggKate in Firefox 3.1 (now called 3.5) using HTML5. This can be fixed through the creation of a javascript library that can deal with Kate output and convert it to HTML and CSS. Example libraries exist for SRT, but will need to be extended to Kate in this project.

The project includes the creation of example files for different types of time-aligned text. These are then encapsulated into Ogg through Kate encoding. Firefox 3.5 with the applied OggKate patch can decode these files and hand the textual data to the Web browser. It will be necessary to extend liboggplay to pass non textual Kate data (eg, styling, etc) to the browser, as currently the only two ways of dealing with a Kate track are to render it, or to pass raw text, ignoring extra styling information. This could be part of the project, or done before the GSoC project begins. It may be necessary to extend the OggKate patch to convert Ogg Kate's representation into something that the Web browser can understand. The browser extracts the text and styling information and a javascript library implemented by the student will take care of the display. This will include an implementation of default display mechanisms for the different types of time-aligned text that we decide to deal with.

The project requires a student with experience in javascript development, HTML and CSS, but also with some understanding of C for liboggplay and libkate, and of C++ for Firefox. The student will learn how to deal with Ogg and Ogg tracks, including Ogg Kate. He/she will also get some insight into Firefox development. He/she will work with the developer of Ogg Kate and the video accessibility expert of Xiph, as well as having access to the whole Xiph community including the core developer of Ogg support in Firefox.

The project is adaptable to the qualifications of the student - it may consist in simply implementing a toolchain for handling srt inside Ogg, or it may go much further and include richer forms of time-aligned text such as audio annotations, Karaoke, ticker text, clickable text etc.

=== OpenMAX IL components for Ogg codecs ===

OpenMAX is a set of low-level C APIs for media codecs. It is used by many mobile devices, in platforms like [http://www.maemo.org/ Maemo] and [http://source.android.com/ Android]. As we'd like to encourage the use of free codecs on mobile and embedded devices, we want to develop a set of components using our codec libraries.

For details, including the motivation for this project and links to related projects, see
[http://blog.kfish.org/2009/02/is-openmax-important-for-free-software.html Is OpenMAX important for Free Software?]

==See Also==
*[[Summer of Code 2008]]
*[[Summer of Code 2007]]
*[[Summer of Code 2006]]

OggKate

2009-02-13T20:28:43Z

Ogg.k.ogg.k: Add a section about KateDJ (how to edit an existing Kate stream muxed in Ogg)

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is an overlay codec, originally designed for karaoke and text, that can be
multiplexed in Ogg. Text and images can be carried by a Kate stream, and animated.
Most of the time, this would be multiplexed with audio/video to carry subtitles,
song lyrics (with or without karaoke data), etc, but doesn't have to be.

Series of curves (splines, segments, etc) may be attached to various properties
(text position, font size, etc) to create animated overlays. This allows scrolling
or fading text to be defined. This can even be used to draw arbitrary shapes, so
hand drawing can also be represented by a Kate stream.

Example uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed), and lyrics,
as created by oggenc, from vorbis-tools.

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on - I'd been quicker to write Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this - besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex - though it is being "repurposed" as timed text now, bringing it closer to what I'm doing
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Region definitions header: a list of predefined regions to be referred to by data packets
*Style definitions header: a list of predefined styles to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*repeats: a verbatim repeat of a text packet's payload, in order to bound any backward seeking needed when starting to play a stream partway through. These are also optional.
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This would be to help seeking code
find a kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it's timed by start granule, not by end granule (as Theora and Vorbis are).
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see the granule encoding section.
*The EOS packet should have a granule pos greater than or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (though old decoders will correctly parse, but ignore the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(ie, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
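The classification rules above can be sketched in a few lines of Python. This is only an illustration of the byte tests described in this section, not code from libkate; the function names are invented for the example.

```python
KATE_MAGIC = b"kate\x00\x00\x00"            # byte offsets 1..7 of every header packet
KATE_BOS_SIGNATURE = b"\x80" + KATE_MAGIC   # same bytes as "\200kate\0\0\0"

def is_header_packet(packet: bytes) -> bool:
    """Header packets have the MSB of the type byte set and carry the magic."""
    return len(packet) >= 8 and (packet[0] & 0x80) != 0 and packet[1:8] == KATE_MAGIC

def is_data_packet(packet: bytes) -> bool:
    """Data packets have the MSB of the type byte cleared and carry no magic."""
    return len(packet) >= 1 and (packet[0] & 0x80) == 0

def is_kate_stream(first_packet: bytes) -> bool:
    """A Kate stream is recognized by the first eight bytes of its first packet."""
    return first_packet[:8] == KATE_BOS_SIGNATURE
```

A demuxer would typically run is_kate_stream on the first packet of each logical Ogg stream to decide whether to hand that stream to a Kate decoder.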

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the ID header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x02 repeat
::0x7f end packet (EOS)
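As an illustration, the ordering rules stated earlier in this section (ID header first, all headers before any data packet, end packet last) can be checked against a list of packet type bytes. The sketch below is hypothetical helper code, not part of libkate, and deliberately does not reject unknown type values, since those are left for future expansion.

```python
HEADER_TYPES = {
    0x80: "ID header", 0x81: "Vorbis comment header", 0x82: "regions list header",
    0x83: "styles list header", 0x84: "curves list header", 0x85: "motions list header",
    0x86: "palettes list header", 0x87: "bitmaps list header",
    0x88: "font ranges and mappings header",
}
DATA_TYPES = {0x00: "text data", 0x01: "keepalive", 0x02: "repeat", 0x7F: "end packet"}

def describe_packet_type(packet_type: int) -> str:
    """Return a readable name for a packet type byte."""
    table = HEADER_TYPES if packet_type & 0x80 else DATA_TYPES
    return table.get(packet_type, "unknown (ignored, reserved for future expansion)")

def check_packet_order(types: list) -> list:
    """Return a list of ordering problems found in a sequence of type bytes."""
    problems = []
    if not types or types[0] != 0x80:
        problems.append("first packet is not the ID header (0x80)")
    seen_data = False
    for t in types:
        if t & 0x80:
            if seen_data:
                problems.append(f"header packet 0x{t:02x} after a data packet")
        else:
            seen_data = True
    if not types or types[-1] != 0x7F:
        problems.append("last packet is not the end packet (0x7f)")
    return problems
```

An empty result means the sequence respects the three rules; it says nothing about the contents of the packets themselves.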

The format described here is for bitstream version 0.x.
As of 19 December 2008, the latest bitstream version is 0.4.

For more detailed information, refer to the format documentation
in libkate (see URL below in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

  0                   1                   2                   3  |
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | packtype      | Identifier char[7]: 'kate\0\0\0'              | 0-3
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | kate magic continued                                          | 4-7
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | reserved - 0  | version major | version minor | num headers   | 8-11
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | text encoding | directionality| reserved - 0  | granule shift | 12-15
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | cw sh | canvas width          | ch sh | canvas height         | 16-19
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | reserved - 0                                                  | 20-23
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | granule rate numerator                                        | 24-27
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | granule rate denominator                                      | 28-31
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (NUL terminated)                                     | 32-35
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (continued)                                          | 36-39
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (continued)                                          | 40-43
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (continued)                                          | 44-47
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (NUL terminated)                                     | 48-51
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (continued)                                          | 52-55
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (continued)                                          | 56-59
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (continued)                                          | 60-63
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

Language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously will not accommodate language tags
with lots of subtags.

Category is currently loosely defined, and I haven't yet found a nice way to
present it in a generic way, but it is meant for automatic classification of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).
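To make the layout concrete, here is a minimal sketch of reading the byte-aligned fields of the ID header in Python. It is not taken from libkate. Two things are assumptions of this example and should be checked against the libkate format documentation: that the two 32 bit granule rate fields are stored little-endian, and the names KateIdHeader and parse_id_header. The canvas fields (bytes 16-19) are left undecoded because their exact bit packing is not described here.

```python
import struct
from typing import NamedTuple

class KateIdHeader(NamedTuple):
    version_major: int
    version_minor: int
    num_headers: int
    text_encoding: int
    directionality: int
    granule_shift: int
    granule_rate_numerator: int
    granule_rate_denominator: int
    language: str
    category: str

def _c_string(field: bytes) -> str:
    """Decode a NUL terminated ASCII string stored in a fixed size field."""
    return field.split(b"\x00", 1)[0].decode("ascii")

def parse_id_header(packet: bytes) -> KateIdHeader:
    if len(packet) < 64 or packet[:8] != b"\x80kate\x00\x00\x00":
        raise ValueError("not a Kate ID header")
    # ASSUMPTION: 32 bit values are little-endian.
    numerator, denominator = struct.unpack_from("<II", packet, 24)
    return KateIdHeader(
        version_major=packet[9],      # byte 8 is reserved
        version_minor=packet[10],
        num_headers=packet[11],
        text_encoding=packet[12],
        directionality=packet[13],    # byte 14 is reserved
        granule_shift=packet[15],
        granule_rate_numerator=numerator,
        granule_rate_denominator=denominator,
        language=_c_string(packet[32:48]),
        category=_c_string(packet[48:64]),
    )
```

The fixed 16 byte language field is also why long RFC 3066 tags with many subtags cannot be stored: at most 15 characters fit before the terminating NUL.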

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay
*Cortado (wikimedia version)
*vorbis-tools

I have patches for the following with Kate support:
*MPlayer
*xine
*GStreamer
*Thoggen
*Audacious
*and more...

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo,
though it is not quite API stable yet (no major changes are expected, however).

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
- a base granule, in the high bits
- a granule offset, in the low bits

 +----------------+----------------+
 |      base      |     offset     |
 +----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The timestamp T of a given Kate packet is split into a base B and
offset O, and these are stored in the granulepos of that packet.
The split is done such that the B is the time of the earliest event
still active at the time, and the O is the time elapsed between B
and T. Thus, T = B + O. This mimics the way Theora stores its own
timestamps in granulepos, where the base acts as a keyframe, and
an offset acts as the position of an inter frame from the previous
keyframe. Since Kate allows time overlapping events, however, the
choice of the base to use is slightly more complex, as it may not
be the starting time of the previous event, if the stream contains
time overlapping events.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
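The split described above amounts to a shift and a mask. The following Python sketch shows the arithmetic; the function names are invented for the example, and it assumes that the granule rate is expressed in granules per second (as the gps_ prefix of the field names suggests), so that a time in seconds is granules * denominator / numerator.

```python
from fractions import Fraction

def split_granulepos(granulepos: int, granule_shift: int):
    """Split a granulepos into (base, offset); offset is the low granule_shift bits."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset

def make_granulepos(base: int, offset: int, granule_shift: int) -> int:
    """Combine base and offset; the offset must fit in granule_shift bits."""
    if offset >> granule_shift:
        raise ValueError("offset does not fit in granule_shift bits")
    return (base << granule_shift) | offset

def granulepos_to_seconds(granulepos: int, granule_shift: int,
                          gps_numerator: int, gps_denominator: int) -> Fraction:
    """Timestamp T = B + O, converted to seconds (assumes granules per second)."""
    base, offset = split_granulepos(granulepos, granule_shift)
    return Fraction((base + offset) * gps_denominator, gps_numerator)
```

For instance, with a granule rate of 1000/1 and a granule_shift of 32, a packet at T = 5.25 s whose earliest still active event started at 5 s would carry base 5000 and offset 250.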

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit values at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definition, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by the means of a series of curves (static points, segments, splines (catmull-rom, bezier, and b-splines)).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have attached semantics so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting
at the right time and for the right duration a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.
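The behaviour of discontinuities and pauses can be illustrated with a toy model. The Python sketch below is not how libkate represents motions: it assumes a hypothetical Curve record with a kind ("linear" or "none"), a duration in seconds and, for linear curves, exactly two points, and it ignores splines altogether.

```python
from typing import NamedTuple, Optional, Sequence, Tuple

Point = Tuple[float, float]

class Curve(NamedTuple):      # hypothetical structure, for illustration only
    kind: str                 # "linear" or "none"
    duration: float           # seconds this curve is active for
    points: Sequence[Point]   # two points for "linear", empty for "none"

def sample_motion(curves: Sequence[Curve], t: float) -> Optional[Point]:
    """Return the 2D point of the motion at time t, or None if there is none."""
    if t < 0:
        return None
    start = 0.0
    for curve in curves:
        end = start + curve.duration
        if t < end:
            if curve.kind == "none":
                return None   # discontinuity: no point is generated
            (x0, y0), (x1, y1) = curve.points
            u = (t - start) / curve.duration
            return (x0 + (x1 - x0) * u, y0 + (y1 - y0) * u)
        start = end
    return None               # past the end of the motion
```

A pause is then simply a linear curve whose two points are equal, for example Curve("linear", 1.0, ((1, 0), (1, 0))), while Curve("none", 1.0, ()) hides the marker for one second.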

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.
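As a rough illustration of the frame and region mappings, the following Python sketch scales a point given in the 0-1 range to pixels. The function names and the (x, y, width, height) region tuple are assumptions made for this example, as is the choice to add the region origin to the scaled point; the authoritative definition of each mapping is in the libkate documentation.

```python
from typing import Tuple

Point = Tuple[float, float]
Region = Tuple[float, float, float, float]   # hypothetical (x, y, width, height) in pixels

def map_to_frame(point: Point, frame_width: float, frame_height: float) -> Point:
    """Scale a 0-1 point to the size of the current video frame."""
    x, y = point
    return (x * frame_width, y * frame_height)

def map_to_region(point: Point, region: Region) -> Point:
    """Scale a 0-1 point to the current region (assumes the origin is added)."""
    x, y = point
    rx, ry, rw, rh = region
    return (rx + x * rw, ry + y * rh)
```

The same curve, defined once in 0-1 coordinates, thus lands on different pixels for a 640x480 and a 1280x720 frame without being re-encoded.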

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary, if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially when considering that the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.
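To show how the timing in such an event reads, here is a small Python sketch that extracts the start time, end time and text from an event written in exactly the minimal form above. It is in no way a parser for the Kate file format (the real encoder handles definitions, styles, macros and much more); accepting fractional seconds is an assumption of this example.

```python
import re

_TIME = r"(\d+):(\d{2}):(\d{2}(?:\.\d+)?)"
_EVENT = re.compile(r'event\s*\{\s*' + _TIME + r'\s*-->\s*' + _TIME + r'\s*"([^"]*)"\s*\}')

def _seconds(hours: str, minutes: str, seconds: str) -> float:
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

def parse_simple_event(line: str):
    """Return (start_seconds, end_seconds, text) for a minimal event line."""
    match = _EVENT.search(line)
    if match is None:
        raise ValueError("not a simple event line")
    h0, m0, s0, h1, m1, s1, text = match.groups()
    return _seconds(h0, m0, s0), _seconds(h1, m1, s1), text
```

Applied to the event above, this yields a start of 5 seconds, an end of 10 seconds, and the text "This is a text".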

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I very much dislike
editing XML by hand, as it's really hard to read. XML is really meant for letting machines
generically parse text data in a shared syntax but with possibly unknown semantics, and I need those
text representations to be editable easily.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from a style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place where a subtitle has been started, but is not removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the original seek target granule. This implies support code on players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, of which I am not certain.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.
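The two phase seek from the first option can be sketched over an in-memory list of packets. This Python example is a simplified model, not player code: the Packet record, with a time and a back link to the earliest still active packet, is hypothetical, both values are plain granule times, and real seeking would of course work on Ogg pages in a file.

```python
from bisect import bisect_left
from typing import List, NamedTuple

class Packet(NamedTuple):      # hypothetical model of a Kate data packet
    time: int                  # granule time of this packet
    earliest_active: int       # granule time of the earliest still active packet
                               # (equal to time if no earlier packet is active)

def two_phase_seek(packets: List[Packet], target: int) -> int:
    """Return the index of the packet from which decoding should start."""
    times = [p.time for p in packets]
    # Phase 1: seek to the target and find the next Kate packet.
    i = bisect_left(times, target)
    if i == len(packets):
        return i               # no packet at or after the target
    back = packets[i].earliest_active
    if back >= target:
        return i               # nothing older is still active: done
    # Phase 2: seek again, to the earliest still active packet.
    return bisect_left(times, back)
```

With packets at times 0, 10, 20 and 50, where the packet at 20 links back to 10, seeking to 15 first lands on the packet at 20 and then steps back to the one at 10, so the text started at 10 is not lost.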

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
language encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.

=== Bitstream format for floating point values ===

Floating point values are turned to a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the fixed point values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over 32 bits raw writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
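
As an illustration of the packing idea described above, here is a minimal Python sketch. It is not libkate's actual bit layout: the function names are hypothetical, the payload is modelled as a string of '0' and '1' characters for readability, and only non-negative values are handled (sign handling is left out as a simplifying assumption).

```python
def to_fixed_16_16(value):
    """Quantize a non-negative float to unsigned 16.16 fixed point."""
    fixed = int(round(value * 65536))
    if not 0 <= fixed < (1 << 32):
        raise ValueError("value out of range for this sketch")
    return fixed


def pack_floats(values):
    """Return (head, tail, payload): shared zero bit counts and packed bits."""
    fixed = [to_fixed_16_16(v) for v in values]
    # Zero bits shared by every value at the top and bottom of the 32 bit word.
    head = min(32 - f.bit_length() for f in fixed)
    tail = min(((f & -f).bit_length() - 1) if f else 32 for f in fixed)
    if head + tail >= 32:
        # Every value is zero: there are no remainder bits to store.
        head, tail = 32, 0
    width = 32 - head - tail
    payload = ""
    for f in fixed:
        if width:
            payload += format(f >> tail, "b").zfill(width)
    return head, tail, payload


def unpack_floats(head, tail, payload, count):
    """Inverse of pack_floats: rebuild the quantized values."""
    width = 32 - head - tail
    values = []
    for i in range(count):
        bits = payload[i * width:(i + 1) * width]
        fixed = (int(bits, 2) << tail) if width else 0
        values.append(fixed / 65536)
    return values


head, tail, payload = pack_floats([0.5, 1.25, 100.0])
restored = unpack_floats(head, tail, payload, 3)
```

With these three values, each one needs only 32 - head - tail bits instead of 32, at the cost of storing head and tail once.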

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly to let them choose which streams to turn on and off.

Since this category is meant primarily for a machine to parse, it will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics

Please remember the 15 character limit if proposing other categories.

Note that the list of categories is subject to change, and will likely
be replaced by new, more "identifier like" ones. The three ones above,
however, would be kept for backward compatibility as they're already used.
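
The 15 character ASCII limit can be checked before encoding. The following Python sketch uses hypothetical helper names and simply models the 16 byte, zero terminated field described above; it is not part of libkate.

```python
CATEGORY_FIELD_SIZE = 16  # 15 usable bytes plus the terminating zero


def encode_category(name):
    """Pack a category name into a 16 byte, zero padded ASCII field."""
    raw = name.encode("ascii")  # raises UnicodeEncodeError for non-ASCII names
    if len(raw) > CATEGORY_FIELD_SIZE - 1:
        raise ValueError("category is limited to 15 characters")
    return raw.ljust(CATEGORY_FIELD_SIZE, b"\0")


def decode_category(field):
    """Read the category back, stopping at the first zero byte."""
    return field.split(b"\0", 1)[0].decode("ascii")


field = encode_category("spu-subtitles")
```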

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data, and could also allow things like e-books which can be
either read or listened to from the same bitstream (I have seen no reference to this
being used anywhere, but I see no reason why the granule progression should be temporal,
and not user controlled, such as by using a "next" button which would bump a granule
position by a preset amount, simulating turning a page (this would be close to necessary
for text-to-speech, as the wall time duration of the spoken speech is not known in
advance to the Kate encoder, and can't be mapped to a time based granule progression)).
All text strings triggered consecutively between the two granule positions would then
be read in order.

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts, (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

I am also investigating SVG images. These allow for very small footprint images for simple
vector drawings, and could be very useful for things like background gradients below text.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder (kateenc) and a decoder (katedec) are included in the tools directory.
The encoder supports input from several different formats:
* a custom text based file format (see [[#The Kate file format|The Kate file format]]), which is by no means meant to be part of the Kate bitstream specification itself
* SubRip (.srt), the most common subtitle format I found
* LRC lyrics format.

As an example for the widely used SRT subtitles format, the following command line
creates a Kate subtitles stream from an SRT file:

kateenc -l en -c subtitles -t srt -o subtitles.ogg subtitles.srt

The reverse is possible, to recover an SRT file from a Kate stream, with katedec.

Note that the subtitles.ogg file should then be multiplexed into the A/V stream,
using either ogg-tools or oggz-tools.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, what motions may be applied to events may not be known in advance (for instance, for a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I was going that way. Still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

=== Packaging ===

It would be really nice to have packages for libkate/libtiger for many distros.

If you're a packager for a distro which doesn't yet have packages for libkate
or libtiger, please consider helping :)

In particular, packages for Debian would be grand.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the data packets themselves concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.
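
A reader for this private data layout could look like the Python sketch below. The function names are hypothetical. It relies on the usual Xiph lacing rule, where each length is written as a run of 255 bytes followed by one byte below 255, and the lengths add up.

```python
def split_xiph_laced(private):
    """Split Matroska private data into the header packets it contains."""
    if not private:
        raise ValueError("empty private data")
    count = private[0] + 1          # byte 0 holds the packet count minus one
    pos = 1
    sizes = []
    for _ in range(count - 1):      # the last packet's size is not stored
        size = 0
        while True:
            if pos >= len(private):
                raise ValueError("truncated lacing")
            byte = private[pos]
            pos += 1
            size += byte
            if byte != 255:         # a byte below 255 ends this length
                break
        sizes.append(size)
    last = len(private) - pos - sum(sizes)
    if last < 0:
        raise ValueError("lacing sizes exceed private data size")
    sizes.append(last)              # deduced from the total size
    packets = []
    for size in sizes:
        packets.append(private[pos:pos + size])
        pos += size
    return packets


def build_xiph_laced(packets):
    """Inverse operation, used here to produce test input."""
    out = bytearray([len(packets) - 1])
    for packet in packets[:-1]:
        out += b"\xff" * (len(packet) // 255) + bytes([len(packet) % 255])
    for packet in packets:
        out += packet
    return bytes(out)


headers = [b"\x80" + b"a" * 63, b"\x81" + b"b" * 299, b"\x82" + b"c" * 4]
private = build_xiph_laced(headers)
```

Because the number of packets is read from byte 0, this works for any number of Kate headers.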

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== HOWTOs ==

These paragraphs describe a few ways to use Kate streams:

=== Text movie subtitles ===

Kate streams can carry Unicode text (that is, text that can represent
pretty much any existing language/script). If several Kate streams are
multiplexed along with a video, subtitles in various languages can be
made for that movie.

An easy way to create such subtitles is to use ffmpeg2theora, which
can create Kate streams from SubRip (.srt) format files, a simple but
common text subtitles format. ffmpeg2theora 0.21 or later is needed.

At its simplest:

ffmpeg2theora -o video-with-subtitles.ogg --subtitles subtitles.srt
video-without-subtitles.avi

Several languages may be created and tagged with their language code
for easy selection in a media player:

ffmpeg2theora -o video-with-subtitles.ogg video-without-subtitles.avi
--subtitles japanese-subtitles.srt --subtitles-language ja
--subtitles welsh-subtitles.srt --subtitles-language cy
--subtitles english-subtitles.srt --subtitles-language en_GB

Alternatively, kateenc (which comes with the libkate distribution) can
create Kate streams from SubRip files as well. These can then be merged
with a video with oggz-tools:

kateenc -t srt -c SUB -l it -o subtitles.ogg italian-subtitles.srt
oggz merge -o movie-with-subtitles.ogg movie-without-subtitles.ogg subtitles.ogg

This second method can also be used to add subtitles to a video which
is already encoded to Theora, as it will not transcode the video again.

=== DVD subtitles ===

DVD subtitles are not text, but images. Thoggen, a DVD ripper program,
can convert these subtitles to Kate streams (at the time of writing,
Thoggen and GStreamer have not applied the necessary patches for this
to be possible out of the box, so patching them will be required).

When configuring how to rip DVD tracks, any subtitles will be detected
by Thoggen, and selecting them in the GUI will cause them to be saved as
Kate tracks along with the movie.

=== Song lyrics ===

Kate streams carrying song lyrics can be embedded in an Ogg file. The
oggenc Vorbis encoding tool from the Xiph.Org Vorbis tools allows lyrics
to be loaded from a LRC or SRT text file and converted to a Kate stream
multiplexed with the resulting Vorbis audio. At the time of writing,
the patch to oggenc was not applied yet, so it will have to be patched
manually with the patch found in the diffs directory.

oggenc -o song-with-lyrics.ogg --lyrics lyrics.lrc --lyrics-language en_US song.wav

So called 'enhanced LRC' files (containing extra karaoke timing information)
are supported, and a simple karaoke color change scheme will be saved
out for these files. For more complex karaoke effects (such as more
complex style changes, or sprite animation), kateenc should be used with
a Kate description file to create a separate Kate stream, which can then
be merged with a Vorbis only song with oggz-tools:

oggenc -o song.ogg song.wav
kateenc -t kate -c LRC -l en_US -o lyrics.ogg lyrics-with-karaoke.kate
oggz merge -o song-with-karaoke.ogg lyrics.ogg song.ogg

This latter method may also be used if you already have an encoded Vorbis song
with no lyrics, and just want to add the lyrics without reencoding.

=== Changing a Kate stream embedded in an Ogg stream ===

If you need to change a Kate stream already embedded in an Ogg stream (eg, you have a movie with subtitles, and you want to fix a spelling mistake, or want to bring one of the subtitles forward in time, etc), you can do this easily with KateDJ, a tool that will extract Kate streams, decode them to a temporary location, and rebuild the original stream after you've made whatever changes you want.

KateDJ (included with the libkate distribution) is a GUI program using wxPython, a Python module for the wxWidgets GUI library, and the oggz tools (both needing installing separately if they are not already).

The procedure consists of:

* Run KateDJ
* Click 'Load Ogg stream' and select the file to load
* Click 'Demux file' to decode Kate streams in a temporary location
* Edit the Kate streams (a message box tells you where they are placed)
* When done, click 'Remux file from parts'
* If any errors are reported, continue editing until the remux step succeeds

== Frequently Asked Questions ==

=== Does libkate work on other platforms than Linux ? ===

Yes, libkate is not Linux specific in any way. It optionally relies on libogg
and libpng, two libraries widely ported to various platforms.
It has been reported to work on Windows and MacOS X as well as UNIX platforms.

However, libtiger, a rendering library for Kate streams, relies on Pango and Cairo,
which are not easy to build on Windows, though they can be.
The Tiger renderer is however completely separate from libkate, and is not needed
for full encoding and decoding of Kate streams.

=== Where can I find some example files ? ===

The libkate distribution can generate various examples, but already built files
can be found here:
[http://people.xiph.org/~oggk/elephants_dream/elephantsdream-with-subtitles.ogg]
[http://stallman.org/fry/Stephen_Fry-Happy_Birthday_GNU-nq_600px_425kbit.ogv]

These files use raw text only.

[[Category:Drafts]]
[[Category:Ogg Mappings]]

OggKate

2009-02-08T21:30:14Z

Ogg.k.ogg.k: HOWTOs (text subtitles, DVD subtitles, song lyrics)

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is an overlay codec, originally designed for karaoke and text, that can be
multiplexed in Ogg. Text and images can be carried by a Kate stream, and animated.
Most of the time, this would be multiplexed with audio/video to carry subtitles,
song lyrics (with or without karaoke data), etc, but doesn't have to be.

Series of curves (splines, segments, etc) may be attached to various properties
(text position, font size, etc) to create animated overlays. This allows scrolling
or fading text to be defined. This can even be used to draw arbitrary shapes, so
hand drawing can also be represented by a Kate stream.

Example uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed), and lyrics,
as created by oggenc, from vorbis-tools.

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on - it would have been quicker to write Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this - besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex - though it is being "repurposed" as timed text now, bringing it closer to what I'm doing
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Style definitions header: a list of predefined styles to be referred to by data packets
*Region definitions header: a list of predefined regions to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*repeats: a verbatim repeat of a text packet's payload, in order to bound any backward seeking needed when starting to play a stream partway through. These are also optional.
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This would be to help seeking code
find a kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it's timed by start granule, not by end granule (as Theora and Vorbis are).
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see the granule encoding section.
*The EOS packet should have a granule pos higher or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (though old decoders will correctly parse, but ignore the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(ie, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
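
The two rules above (signature check on the first packet, MSB test on the type byte) can be sketched in Python as follows. The function names are hypothetical and are not part of libkate.

```python
KATE_SIGNATURE = b"\x80kate\0\0\0"  # "\200" in octal is 0x80


def is_kate_id_header(packet):
    """True if the packet starts with the eight byte Kate signature."""
    return packet[:8] == KATE_SIGNATURE


def packet_kind(packet):
    """Classify a Kate packet from the MSB of its type byte."""
    if not packet:
        raise ValueError("empty packet")
    return "header" if packet[0] & 0x80 else "data"


first_packet = KATE_SIGNATURE + bytes(56)
```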

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the ID header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x02 repeat
::0x7f end packet (EOS)

The format described here is for bitstream version 0.x.
As of 19 December 2008, the latest bitstream version is 0.4.

For more detailed information, refer to the format documentation
in libkate (see URL below in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

0 1 2 3 |
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| packtype | Identifier char[7]: 'kate\0\0\0' | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| kate magic continued | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | version major | version minor | num headers | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| text encoding | directionality| reserved - 0 | granule shift | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| cw sh | canvas width | ch sh | canvas height | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate numerator | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate denominator | 28-31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (NUL terminated) | 32-35
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 36-39
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 40-43
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 44-47
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (NUL terminated) | 48-51
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 52-55
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 56-59
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 60-63
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously will not accommodate language tags
with lots of subtags.

Category is currently loosely defined, and I haven't yet found a nice way to
present it in a generic way, but is meant for automatic classifying of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).
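
The byte aligned fields of the ID header layout above can be read with a short Python sketch. Two things here are assumptions rather than statements from this page: the 32 bit granule rate fields are taken to be little endian (as in Vorbis), and the function and key names are hypothetical. The canvas size bytes are returned undecoded, since their bit level packing is not detailed here.

```python
import struct

# type, magic, 8 single byte fields, 4 canvas bytes, 4 reserved bytes,
# granule rate numerator/denominator, language, category: 64 bytes in total.
ID_HEADER = struct.Struct("<B7s8B4s4xII16s16s")


def parse_id_header(packet):
    """Parse the fixed 64 byte Kate ID header into a dict."""
    if len(packet) < ID_HEADER.size:
        raise ValueError("ID header must be at least 64 bytes")
    (ptype, magic, _res0, major, minor, num_headers, encoding, direction,
     _res1, granule_shift, canvas, gps_num, gps_den,
     language, category) = ID_HEADER.unpack_from(packet)
    if ptype != 0x80 or magic != b"kate\0\0\0":
        raise ValueError("not a Kate ID header")
    return {
        "version": (major, minor),
        "num_headers": num_headers,
        "text_encoding": encoding,
        "directionality": direction,
        "granule_shift": granule_shift,
        "canvas_raw": canvas,  # left undecoded in this sketch
        "gps_numerator": gps_num,
        "gps_denominator": gps_den,
        "language": language.split(b"\0", 1)[0].decode("ascii"),
        "category": category.split(b"\0", 1)[0].decode("ascii"),
    }


example = (b"\x80kate\0\0\0" + bytes([0, 0, 4, 9]) + bytes([0, 0, 0, 32])
           + bytes(8) + (1000).to_bytes(4, "little") + (1).to_bytes(4, "little")
           + b"en".ljust(16, b"\0") + b"subtitles".ljust(16, b"\0"))
info = parse_id_header(example)
```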

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay
*Cortado (wikimedia version)
*vorbis-tools

I have patches for the following with Kate support:
*MPlayer
*xine
*GStreamer
*Thoggen
*Audacious
*and more...

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo,
though it is not quite yet API stable (though no major changes are expected).

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
*a base granule, in the high bits
*a granule offset, in the low bits

 +----------------+----------------+
 |      base      |     offset     |
 +----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The timestamp T of a given Kate packet is split into a base B and
offset O, and these are stored in the granulepos of that packet.
The split is done such that the B is the time of the earliest event
still active at the time, and the O is the time elapsed between B
and T. Thus, T = B + O. This mimics the way Theora stores its own
timestamps in granulepos, where the base acts as a keyframe, and
the offset acts as the position of an inter frame from the previous
keyframe. Since Kate allows time overlapping events, however, the
choice of the base to use is slightly more complex, as it may not
be the starting time of the previous event, if the stream contains
time overlapping events.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
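
The split described above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical and are not part of the libkate API, and the conversion to seconds assumes that gps_numerator/gps_denominator gives the number of granule units per second.

```python
from fractions import Fraction

def split_granulepos(granulepos, granule_shift):
    """Return (base, offset): base in the high bits, offset in the low bits."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset

def make_granulepos(base, offset, granule_shift):
    """Combine a base and an offset, both in granule units, into a granulepos."""
    if offset >= (1 << granule_shift):
        raise ValueError("offset does not fit in granule_shift bits")
    return (base << granule_shift) | offset

def granule_time(granulepos, granule_shift, gps_numerator, gps_denominator):
    """Timestamp T = B + O, converted to seconds (assumed: rate is granules per second)."""
    base, offset = split_granulepos(granulepos, granule_shift)
    return Fraction((base + offset) * gps_denominator, gps_numerator)
```

For instance, with a granule_shift of 32 and a granule rate of 1000/1, a packet at 7.5 seconds whose earliest still active event started at 5 seconds would carry a base of 5000 and an offset of 2500.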

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit values at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definition, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by means of a series of curves (static points, segments, and splines (Catmull-Rom, Bézier, and B-splines)).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have attached semantics so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting
at the right time and for the right duration a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.
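
As a rough illustration of how a client might sample such a motion, here is a minimal Python sketch. The data layout (a list of (duration, kind, points) tuples) and the function name are assumptions made for this example, not the libkate representation, and only static points, two point linear segments and 'none' curves are handled; splines are left out.

```python
def motion_point(curves, t):
    """Return the 2D point of a motion at time t, or None if there is none.

    curves is a list of (duration, kind, points) tuples played one after
    the other, where kind is "none", "static" or "linear" (hypothetical
    names chosen for this sketch).
    """
    start = 0.0
    for duration, kind, points in curves:
        if start <= t < start + duration:
            if kind == "none":
                return None  # discontinuity: no point is generated
            if kind == "static":
                return points[0]
            if kind == "linear":
                u = (t - start) / duration
                (x0, y0), (x1, y1) = points
                return (x0 + (x1 - x0) * u, y0 + (y1 - y0) * u)
            raise ValueError("unsupported curve type: " + kind)
        start += duration
    return None  # t lies outside the lifetime of the motion

# A marker moves right, pauses (two equal points), is hidden, then moves down.
curves = [
    (1.0, "linear", [(0, 0), (1, 0)]),
    (0.5, "linear", [(1, 0), (1, 0)]),
    (0.5, "none", []),
    (1.0, "linear", [(1, 0), (1, 1)]),
]
```

Sampling this motion during the second curve always yields the same point, which is the pause, while sampling during the third curve yields no point at all.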

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.
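
The effect of the frame and region mappings can be sketched as follows. The function and the tuple layouts are hypothetical, and the sketch assumes that a region mapping is relative to the top left corner of the region, which the text above does not spell out.

```python
def map_point(point, mapping, frame_size, region):
    """Scale a 0-1 curve point to pixels.

    frame_size is (width, height); region is (left, top, width, height).
    Both layouts are assumptions made for this sketch.
    """
    x, y = point
    if mapping == "frame":
        width, height = frame_size
        return (x * width, y * height)
    if mapping == "region":
        left, top, width, height = region
        return (left + x * width, top + y * height)
    raise ValueError("unknown mapping: " + mapping)
```

The same curve point (0.5, 0.5) then lands in the middle of the video frame or in the middle of the current region, depending on the mapping chosen.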

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary, if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially when considering that the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.
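
Such a minimal description is easy to generate from a program. The Python sketch below only uses the syntax shown above; the function names are hypothetical, it assumes that indentation is not significant, it is limited to whole seconds, and it rejects text containing double quotes rather than guess at an escaping rule.

```python
def format_time(seconds):
    """Format a whole number of seconds as HH:MM:SS."""
    hours, rest = divmod(int(seconds), 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

def kate_description(events):
    """Build a minimal Kate description from (start, end, text) tuples."""
    lines = ["kate {"]
    for start, end, text in events:
        if '"' in text:
            # Escaping rules are not covered here, so refuse instead of guessing.
            raise ValueError("double quotes in text are not handled by this sketch")
        lines.append(
            f'  event {{ {format_time(start)} --> {format_time(end)} "{text}" }}'
        )
    lines.append("}")
    return "\n".join(lines)

print(kate_description([(5, 10, "This is a text")]))
```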

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I very much dislike
editing XML by hand, as it's really hard to read. XML is really meant for machines to parse
text data generically, in a shared syntax but with possibly unknown semantics, and I need those
text representations to be easily editable.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from a style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place where a subtitle has been started, but has not been removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the original seeked granule. This implies support code on players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, of which I am not certain.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
text encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.

=== Bitstream format for floating point values ===

Floating point values are turned to a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the floating point values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over 32 bits raw writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
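
The idea can be illustrated with the following Python sketch. It is not the libkate implementation and does not produce the actual Kate bit layout: the function names, the handling of negative values as 32 bit two's complement patterns, and the packing of all values into one big integer are assumptions made to keep the example short.

```python
def to_fixed_16_16(value):
    """Quantize a float to signed 16.16 fixed point, as a 32 bit pattern."""
    fixed = int(round(value * 65536))
    if not -(1 << 31) <= fixed < (1 << 31):
        raise ValueError("value out of 16.16 range")
    return fixed & 0xFFFFFFFF

def from_fixed_16_16(bits):
    """Turn a 32 bit 16.16 pattern back into a float."""
    if bits & 0x80000000:
        bits -= 1 << 32
    return bits / 65536

def common_zero_bits(words):
    """Count the zero bits shared by all words at the head and at the tail."""
    merged = 0
    for word in words:
        merged |= word
    if merged == 0:
        return 32, 0
    head = 32 - merged.bit_length()
    tail = (merged & -merged).bit_length() - 1
    return head, tail

def pack(values):
    """Return (head, tail, count, payload), keeping only the remainder bits."""
    words = [to_fixed_16_16(v) for v in values]
    head, tail = common_zero_bits(words)
    width = 32 - head - tail
    payload = 0
    for word in words:
        payload = (payload << width) | (word >> tail)
    return head, tail, len(words), payload

def unpack(head, tail, count, payload):
    """Inverse of pack()."""
    width = 32 - head - tail
    mask = (1 << width) - 1
    values = []
    for i in range(count):
        shift = width * (count - 1 - i)
        values.append(from_fixed_16_16(((payload >> shift) & mask) << tail))
    return values
```

With the values 0.5, 1.25 and 100.0, the shared head is 9 bits and the shared tail is 14 bits, so each value needs only 9 bits instead of 32. Values that are not exactly representable in 16.16 come back quantized, which is the precision loss mentioned above.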

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly to let them turn individual streams on and off.

Since categories are meant primarily for a machine to parse, they will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics

Please remember the 15 character limit if proposing other categories.

Note that the list of categories is subject to change, and will likely
be replaced by new, more "identifier like" ones. The three ones above,
however, would be kept for backward compatibility as they're already used.
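
A tool accepting category names could check these constraints up front. The following Python sketch uses a hypothetical helper name, and assumes that the 16 byte field is padded with zero bytes after the terminator.

```python
def encode_category(category):
    """Return the 16 byte, zero terminated field for an ASCII category name."""
    data = category.encode("ascii")  # raises UnicodeEncodeError if not ASCII
    if len(data) > 15:
        raise ValueError("category names are limited to 15 characters")
    # Zero padding up to 16 bytes is an assumption of this sketch.
    return data.ljust(16, b"\0")
```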

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data, and could also allow things like e-books which can be
either read or listened to from the same bitstream (I have seen no reference to this
being used anywhere, but I see no reason why the granule progression should be temporal,
and not user controlled, such as by using a "next" button which would bump a granule
position by a preset amount, simulating turning a page (this would be close to necessary
for text-to-speech, as the wall time duration of the spoken speech is not known in
advance to the Kate encoder, and can't be mapped to a time based granule progression)).
All text strings triggered consecutively between the two granule positions would then
be read in order.

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts, (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

I am also investigating SVG images. These allow for very small footprint images for simple
vector drawings, and could be very useful for things like background gradients below text.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder (kateenc) and a decoder (katedec) are included in the tools directory.
The encoder supports input from several different formats:
* a custom text based file format (see [[#The Kate file format|The Kate file format]]), which is by no means meant to be part of the Kate bitstream specification itself
* SubRip (.srt), the most common subtitle format I found
* LRC lyrics format.

As an example for the widely used SRT subtitles format, the following command line
creates a Kate subtitles stream from an SRT file:

kateenc -l en -c subtitles -t srt -o subtitles.ogg subtitles.srt

The reverse is possible, to recover an SRT file from a Kate stream, with katedec.

Note that the subtitles.ogg file should then be multiplexed into the A/V stream,
using either ogg-tools or oggz-tools.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, what motions may be applied to events may not be known in advance (for instance, for a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I was going that way. Still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

=== Packaging ===

It would be really nice to have packages for libkate/libtiger for many distros.

If you're a packager for a distro which doesn't yet have packages for libkate
or libtiger, please consider helping :)

In particular, packages for Debian would be grand.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the data packets themselves concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.
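
The layout above can be sketched in Python as follows. The function names are hypothetical; the lacing itself follows the usual Xiph scheme, where a length is coded as a run of 255 bytes followed by one byte smaller than 255, the sum of which is the packet length.

```python
def xiph_lace(packets):
    """Build Matroska private data from a list of header packets (bytes)."""
    if not 1 <= len(packets) <= 256:
        raise ValueError("between 1 and 256 packets are required")
    out = bytearray([len(packets) - 1])
    for packet in packets[:-1]:  # the length of the last packet is not stored
        size = len(packet)
        out += b"\xff" * (size // 255)
        out.append(size % 255)
    for packet in packets:
        out += packet
    return bytes(out)

def xiph_unlace(data):
    """Split Matroska private data back into the original packets."""
    count = data[0] + 1
    pos = 1
    sizes = []
    for _ in range(count - 1):
        size = 0
        while data[pos] == 255:
            size += 255
            pos += 1
        size += data[pos]
        pos += 1
        sizes.append(size)
    last = len(data) - pos - sum(sizes)  # deduced from the total size
    if last < 0:
        raise ValueError("truncated private data")
    sizes.append(last)
    packets = []
    for size in sizes:
        packets.append(data[pos:pos + size])
        pos += size
    return packets
```

Since the number of Kate headers is not fixed, a demuxer written along these lines takes the packet count from the first byte rather than assuming a set number of headers.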

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== HOWTOs ==

These paragraphs describe a few ways to use Kate streams:

=== Text movie subtitles ===

Kate streams can carry Unicode text (that is, text that can represent
pretty much any existing language/script). If several Kate streams are
multiplexed along with a video, subtitles in various languages can be
made for that movie.

An easy way to create such subtitles is to use ffmpeg2theora, which
can create Kate streams from SubRip (.srt) format files, a simple but
common text subtitles format. ffmpeg2theora 0.21 or later is needed.

At its simplest:

ffmpeg2theora -o video-with-subtitles.ogg --subtitles subtitles.srt \
video-without-subtitles.avi

Several languages may be created and tagged with their language code
for easy selection in a media player:

ffmpeg2theora -o video-with-subtitles.ogg video-without-subtitles.avi \
--subtitles japanese-subtitles.srt --subtitles-language ja \
--subtitles welsh-subtitles.srt --subtitles-language cy \
--subtitles english-subtitles.srt --subtitles-language en_GB

Alternatively, kateenc (which comes with the libkate distribution) can
create Kate streams from SubRip files as well. These can then be merged
with a video with oggz-tools:

kateenc -t srt -c SUB -l it -o subtitles.ogg italian-subtitles.srt
oggz merge -o movie-with-subtitles.ogg movie-without-subtitles.ogg subtitles.ogg

This second method can also be used to add subtitles to a video which
is already encoded to Theora, as it will not transcode the video again.

=== DVD subtitles ===

DVD subtitles are not text, but images. Thoggen, a DVD ripper program,
can convert these subtitles to Kate streams (at the time of writing,
Thoggen and GStreamer have not applied the necessary patches for this
to be possible out of the box, so patching them will be required).

When configuring how to rip DVD tracks, any subtitles will be detected
by Thoggen, and selecting them in the GUI will cause them to be saved as
Kate tracks along with the movie.

=== Song lyrics ===

Kate streams carrying song lyrics can be embedded in an Ogg file. The
oggenc Vorbis encoding tool from the Xiph.Org Vorbis tools allows lyrics
to be loaded from a LRC or SRT text file and converted to a Kate stream
multiplexed with the resulting Vorbis audio. At the time of writing,
the patch to oggenc was not applied yet, so it will have to be patched
manually with the patch found in the diffs directory.

oggenc -o song-with-lyrics.ogg --lyrics lyrics.lrc --lyrics-language en_US song.wav

So called 'enhanced LRC' files (containing extra karaoke timing information)
are supported, and a simple karaoke color change scheme will be saved
out for these files. For more complex karaoke effects (such as more
complex style changes, or sprite animation), kateenc should be used with
a Kate description file to create a separate Kate stream, which can then
be merged with a Vorbis only song with oggz-tools:

oggenc -o song.ogg song.wav
kateenc -t kate -c LRC -l en_US -o lyrics.ogg lyrics-with-karaoke.kate
oggz merge -o song-with-karaoke.ogg lyrics.ogg song.ogg

This latter method may also be used if you already have an encoded Vorbis song
with no lyrics, and just want to add the lyrics without reencoding.

== Frequently Asked Questions ==

=== Does libkate work on platforms other than Linux? ===

Yes, libkate is not Linux specific in any way. It optionally relies on libogg
and libpng, two libraries widely ported to various platforms.
It has been reported to work on Windows and MacOS X as well as UNIX platforms.

However, libtiger, a rendering library for Kate streams, relies on Pango and Cairo,
which are not easy to build on Windows, though they can be.
The Tiger renderer is however completely separate from libkate, and is not needed
for full encoding and decoding of Kate streams.

=== Where can I find some example files ? ===

The libkate distribution can generate various examples, but already built files
can be found here:
[http://people.xiph.org/~oggk/elephants_dream/elephantsdream-with-subtitles.ogg]
[http://stallman.org/fry/Stephen_Fry-Happy_Birthday_GNU-nq_600px_425kbit.ogv]

These files use raw text only.

[[Category:Drafts]]
[[Category:Ogg Mappings]]

OggKate

2009-02-07T13:07:09Z

Ogg.k.ogg.k: FAQ number 2: example files

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is an overlay codec, originally designed for karaoke and text, that can be
multiplexed in Ogg. Text and images can be carried by a Kate stream, and animated.
Most of the time, this would be multiplexed with audio/video to carry subtitles,
song lyrics (with or without karaoke data), etc, but doesn't have to be.

Series of curves (splines, segments, etc) may be attached to various properties
(text position, font size, etc) to create animated overlays. This allows scrolling
or fading text to be defined. This can even be used to draw arbitrary shapes, so
hand drawing can also be represented by a Kate stream.

Example uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed), and lyrics,
as created by oggenc, from vorbis-tools.

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on - I'd have been quicker writing Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this - besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex - though it is being "repurposed" as timed text now, bringing it closer to what I'm doing
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Style definitions header: a list of predefined styles to be referred to by data packets
*Region definitions header: a list of predefined regions to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*repeats: a verbatim repeat of a text packet's payload, in order to bound any backward seeking needed when starting to play a stream partway through. These are also optional.
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This helps seeking code
find a Kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it's timed by start granule, not by end granule (as Theora and Vorbis are).
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see [[#Granule encoding|Granule encoding]].
*The EOS packet should have a granule pos greater than or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (old decoders will still parse the stream correctly, but will ignore the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(ie, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
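The recognition rule above can be sketched in a few lines of Python. This is only an illustration of the byte comparison, not code from libkate, and the function names are hypothetical:

```python
# "\200kate\0\0\0" in the text is octal notation: \200 is byte 0x80.
KATE_MAGIC = b"kate\x00\x00\x00"
KATE_BOS_SIGNATURE = b"\x80" + KATE_MAGIC


def is_kate_bos_packet(packet: bytes) -> bool:
    """True if the packet starts with the eight byte ID header signature."""
    return packet[:8] == KATE_BOS_SIGNATURE


def is_kate_header_packet(packet: bytes) -> bool:
    """True for any Kate header packet: type byte with MSB set, then the magic."""
    return (
        len(packet) >= 8
        and (packet[0] & 0x80) != 0
        and packet[1:8] == KATE_MAGIC
    )
```

Data packets carry no signature, so only the first check is suitable for detecting a Kate stream.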

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the ID header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x02 repeat
::0x7f end packet (EOS)
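As a sketch of how the table above might be used, the following Python classifies a packet from its first byte. The dictionaries mirror the table; the function name and the wording used for unknown types are assumptions made for this example:

```python
from typing import Tuple

HEADER_TYPES = {
    0x80: "ID header",
    0x81: "Vorbis comment header",
    0x82: "regions list header",
    0x83: "styles list header",
    0x84: "curves list header",
    0x85: "motions list header",
    0x86: "palettes list header",
    0x87: "bitmaps list header",
    0x88: "font ranges and mappings header",
}

DATA_TYPES = {
    0x00: "text data",
    0x01: "keepalive",
    0x02: "repeat",
    0x7F: "end packet",
}


def classify_packet(packet: bytes) -> Tuple[str, str]:
    """Return (kind, name) for a Kate packet, based on its type byte."""
    if not packet:
        raise ValueError("empty packet")
    packet_type = packet[0]
    if packet_type & 0x80:  # MSB set: header packet
        return "header", HEADER_TYPES.get(packet_type, "unknown (ignored)")
    return "data", DATA_TYPES.get(packet_type, "unknown (ignored)")
```

Unknown types are reported rather than rejected, since other packets are ignored and left for future expansion.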

The format described here is for bitstream version 0.x.
As of 19 December 2008, the latest bitstream version is 0.4.

For more detailed information, refer to the format documentation
in libkate (see URL below in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

0 1 2 3 |
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| packtype | Identifier char[7]: 'kate\0\0\0' | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| kate magic continued | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | version major | version minor | num headers | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| text encoding | directionality| reserved - 0 | granule shift | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| cw sh | canvas width | ch sh | canvas height | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate numerator | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate denominator | 28-31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (NUL terminated) | 32-35
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 36-39
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 40-43
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 44-47
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (NUL terminated) | 48-51
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 52-55
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 56-59
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 60-63
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

Language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously will not accommodate language tags
with lots of subtags.

Category is currently loosely defined, and I haven't yet found a nice way to
present it in a generic way, but it is meant for automatic classification of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).
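To make the ID header layout concrete, here is a minimal Python sketch that reads the byte aligned fields of the 64 byte header. It is not libkate code: the names are hypothetical, and it assumes that the two 32 bit granule rate fields are stored little endian. The canvas fields (bytes 16-19) are left out because their exact bit packing is not spelled out here.

```python
import struct
from typing import NamedTuple


class KateIdHeader(NamedTuple):
    version_major: int
    version_minor: int
    num_headers: int
    text_encoding: int
    directionality: int
    granule_shift: int
    gps_numerator: int
    gps_denominator: int
    language: str
    category: str


def _nul_terminated_ascii(field: bytes) -> str:
    """Decode a fixed size field holding a NUL terminated ASCII string."""
    return field.split(b"\x00", 1)[0].decode("ascii")


def parse_id_header(packet: bytes) -> KateIdHeader:
    if len(packet) < 64 or packet[:8] != b"\x80kate\x00\x00\x00":
        raise ValueError("not a Kate ID header")
    _reserved, major, minor, num_headers = struct.unpack_from("4B", packet, 8)
    encoding, direction, _reserved, shift = struct.unpack_from("4B", packet, 12)
    # Assumption: 32 bit fields are little endian.
    gps_num, gps_den = struct.unpack_from("<2I", packet, 24)
    return KateIdHeader(
        major, minor, num_headers, encoding, direction, shift,
        gps_num, gps_den,
        _nul_terminated_ascii(packet[32:48]),
        _nul_terminated_ascii(packet[48:64]),
    )
```

A caller would typically check the version fields first, then use granule_shift and the granule rate to interpret granule positions.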

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay
*Cortado (wikimedia version)
*vorbis-tools

I have patches for the following with Kate support:
*MPlayer
*xine
*GStreamer
*Thoggen
*Audacious
*and more...

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo,
although its API is not quite stable yet (no major changes are expected).

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
* a base granule, in the high bits
* a granule offset, in the low bits

 +----------------+----------------+
 |      base      |     offset     |
 +----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The timestamp T of a given Kate packet is split into a base B and
offset O, and these are stored in the granulepos of that packet.
The split is done such that the B is the time of the earliest event
still active at the time, and the O is the time elapsed between B
and T. Thus, T = B + O. This mimics the way Theora stores its own
timestamps in granulepos, where the base acts as a keyframe, and
the offset acts as the position of an inter frame relative to the previous
keyframe. Since Kate allows time overlapping events, however, the
choice of the base to use is slightly more complex, as it may not
be the starting time of the previous event, if the stream contains
time overlapping events.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
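The base/offset split can be illustrated with a short Python sketch. The function names are hypothetical, and the conversion to seconds assumes that gps_numerator / gps_denominator is a rate in granules per second, which is how the field names read:

```python
from fractions import Fraction
from typing import Tuple


def split_granulepos(granulepos: int, granule_shift: int) -> Tuple[int, int]:
    """Return (base, offset): base in the high bits, offset in the low bits."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset


def make_granulepos(base: int, offset: int, granule_shift: int) -> int:
    """Pack a base and an offset into a single granule position."""
    if offset < 0 or offset >= (1 << granule_shift):
        raise ValueError("offset does not fit in granule_shift bits")
    return (base << granule_shift) | offset


def granulepos_to_seconds(granulepos: int, granule_shift: int,
                          gps_numerator: int, gps_denominator: int) -> Fraction:
    """T = B + O, converted from granule units to seconds."""
    base, offset = split_granulepos(granulepos, granule_shift)
    return Fraction((base + offset) * gps_denominator, gps_numerator)
```

For example, with a granule shift of 32 and a rate of 1000/1, a base of 5000 and an offset of 250 describe a packet at 5.25 seconds whose earliest still active event started at 5 seconds.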

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit values at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definitions, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by means of a series of curves: static points, segments, and splines (Catmull-Rom, Bezier, and B-splines).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have attached semantics so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting
at the right time and for the right duration a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.
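The following Python sketch illustrates how a motion made of several curves might be sampled at a given time, including a pause and a discontinuity. The data layout (a list of (type, duration, points) tuples) and the type names are assumptions made for this example and do not reflect libkate's actual structures; splines are omitted.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
# Hypothetical layout: (curve type, duration in seconds, control points)
Curve = Tuple[str, float, List[Point]]


def evaluate_motion(curves: List[Curve], t: float) -> Optional[Point]:
    """Return the 2D point of the motion at time t, or None if there is none."""
    start = 0.0
    for kind, duration, points in curves:
        if start <= t < start + duration:
            if kind == "none":  # discontinuity: no point is generated
                return None
            if kind == "static":
                return points[0]
            if kind == "linear":  # polyline, segments evenly spread over time
                u = (t - start) / duration * (len(points) - 1)
                i = min(int(u), len(points) - 2)
                f = u - i
                (x0, y0), (x1, y1) = points[i], points[i + 1]
                return (x0 + (x1 - x0) * f, y0 + (y1 - y0) * f)
            raise ValueError("unsupported curve type: " + kind)
        start += duration
    return None  # t lies outside the motion


motion = [
    ("linear", 2.0, [(0.0, 0.0), (1.0, 0.0)]),  # move right for 2 seconds
    ("linear", 1.0, [(1.0, 0.0), (1.0, 0.0)]),  # pause: two equal points
    ("none", 1.0, []),                          # hidden for 1 second
    ("static", 1.0, [(0.5, 0.5)]),              # reappear at a fixed point
]
```

Here the pause is just a linear curve whose two points are equal, as described above, so no special case is needed for it.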

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.
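As an illustration of the frame and region mappings, the sketch below scales a 0-1 point to pixels. The mapping names, the argument layout, and the assumption that a region mapping is offset by the region's origin are all choices made for this example, not taken from libkate:

```python
from typing import Tuple

Point = Tuple[float, float]


def map_point(point: Point, mapping: str,
              frame_size: Tuple[int, int],
              region: Tuple[int, int, int, int]) -> Point:
    """Map a motion point to pixels.

    frame_size is (width, height); region is (x, y, width, height).
    """
    x, y = point
    if mapping == "none":  # coordinates are used as is
        return point
    if mapping == "frame":  # 0-1 covers the whole video frame
        width, height = frame_size
        return (x * width, y * height)
    if mapping == "region":  # 0-1 covers the current region
        rx, ry, rw, rh = region
        return (rx + x * rw, ry + y * rh)
    raise ValueError("unknown mapping: " + mapping)
```

The same curve can then be reused for any frame size, since only the final mapping step depends on pixel dimensions.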

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary, if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially when considering that the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.
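For simple cases, such a description can be generated from a list of timed strings. The Python below is a small sketch that only emits the minimal form shown above, with whole second timestamps; it does not attempt any escaping (texts containing a double quote are rejected), and the function names are hypothetical:

```python
from typing import Iterable, Tuple


def format_timestamp(seconds: int) -> str:
    """Format whole seconds as HH:MM:SS, as in the example above."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def kate_description(events: Iterable[Tuple[int, int, str]]) -> str:
    """Build a minimal Kate description from (start, end, text) tuples."""
    lines = ["kate {"]
    for start, end, text in events:
        if '"' in text:
            raise ValueError("quoting rules are not handled by this sketch")
        lines.append(
            f'  event {{ {format_timestamp(start)} --> '
            f'{format_timestamp(end)} "{text}" }}'
        )
    lines.append("}")
    return "\n".join(lines)


print(kate_description([(5, 10, "This is a text")]))
```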

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I very much dislike
editing XML by hand, as it's really hard to read. XML is really meant for machines to parse
text data generically, in a shared syntax but with possibly unknown semantics, whereas I need
those text representations to be easy to edit by hand.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from a style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place when a subtitle has been started, but is not removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the original seek granule. This implies support code in players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know whether or not there is a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, of which I am not certain.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.
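The two phase seek from the first option can be sketched as follows. This Python models a stream as a sorted list of (time, earliest still active time) pairs, one per data packet; the model and the function name are assumptions for illustration, and a real player would seek within the Ogg container instead:

```python
import bisect
from typing import List, Tuple


def two_phase_seek(pages: List[Tuple[int, int]], target: int) -> int:
    """Return the index of the page from which decoding should start.

    pages is sorted by time; each entry is (time, earliest_active_time).
    """
    times = [time for time, _ in pages]
    # Phase 1: find the next Kate packet at or after the target.
    index = bisect.bisect_left(times, target)
    if index == len(pages):
        return index  # nothing left to decode
    _, earliest_active = pages[index]
    # Phase 2: if an earlier event is still active, seek back to it.
    if earliest_active < target:
        index = bisect.bisect_left(times, earliest_active)
    return index
```

Decoding forward from the returned index up to the target time then restores every event that is still active at the target.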

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
language encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.

=== Bitstream format for floating point values ===

Floating point values are turned to a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the floating point values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over 32 bits raw writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
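The idea behind this packing can be illustrated in Python. This is a simplified sketch rather than the actual bitstream layout: it treats each 16.16 value as an unsigned 32 bit word (so negative values are not handled specially), and it ignores the bits needed to store the head and tail counts themselves. The function names are hypothetical.

```python
from typing import List, Tuple


def to_fixed_16_16(value: float) -> int:
    """Quantize a float to 16.16 fixed point, kept as an unsigned 32 bit word."""
    return int(round(value * 65536)) & 0xFFFFFFFF


def common_zero_bits(words: List[int]) -> Tuple[int, int]:
    """Count zero bits shared by all words at the head (MSB) and tail (LSB)."""
    merged = 0
    for word in words:
        merged |= word
    if merged == 0:
        return 32, 0  # all zero: nothing but head bits
    head = 32 - merged.bit_length()
    tail = (merged & -merged).bit_length() - 1
    return head, tail


def packed_payload_bits(values: List[float]) -> int:
    """Bits needed for the remainder bits of all values."""
    words = [to_fixed_16_16(v) for v in values]
    head, tail = common_zero_bits(words)
    return (32 - head - tail) * len(words)
```

For the values 0.5, 1.0 and 2.5, the fixed point words share 14 leading and 15 trailing zero bits, so only 3 bits per value remain to be stored, instead of 32.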

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly to allow to choose to turn some streams on and off.

Since this category is meant primarily for a machine to parse, it will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics

Please remember the 15 character limit if proposing other categories.

Note that the list of categories is subject to change, and will likely
be replaced by new, more "identifier like" ones. The three ones above,
however, would be kept for backward compatibility as they're already used.

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data. It could also allow things like e-books which can be
either read or listened to from the same bitstream.

I have seen no reference to this being used anywhere, but I see no reason why the granule
progression should be temporal rather than user controlled, such as by using a "next" button
which would bump the granule position by a preset amount, simulating turning a page. This
would be close to necessary for text-to-speech, as the wall time duration of the spoken
speech is not known in advance to the Kate encoder, and can't be mapped to a time based
granule progression. All text strings triggered consecutively between the two granule
positions would then be read in order.

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts, (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.
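The relation between palette size and pixel depth can be written down in a couple of lines of Python. This is a sketch with hypothetical names, and it assumes pixel data is packed with no padding between rows, which is not stated above:

```python
def bits_per_pixel(palette_size: int) -> int:
    """Smallest number of bits able to address every palette entry."""
    if not 1 <= palette_size <= 256:
        raise ValueError("palettes hold between 1 and 256 colors")
    return max(1, (palette_size - 1).bit_length())


def pixel_data_bits(width: int, height: int, palette_size: int) -> int:
    """Total bits of pixel data, assuming no padding between rows."""
    return width * height * bits_per_pixel(palette_size)
```

A 16 color palette therefore needs 4 bits per pixel, and a 256 color palette needs 8.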

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

I am also investigating SVG images. These allow for very small footprint images for simple
vector drawings, and could be very useful for things like background gradients below text.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping Unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder (kateenc) and a decoder (katedec) are included in the tools directory.
The encoder supports input from several different formats:
* a custom text based file format (see [[#The Kate file format|The Kate file format]]), which is by no means meant to be part of the Kate bitstream specification itself
* SubRip (.srt), the most common subtitle format I found
* LRC lyrics format.

As an example for the widely used SRT subtitles format, the following command line
creates a Kate subtitles stream from an SRT file:

 kateenc -l en -c subtitles -t srt -o subtitles.ogg subtitles.srt

The reverse is possible, to recover an SRT file from a Kate stream, with katedec.

Note that the subtitles.ogg file should then be multiplexed into the A/V stream,
using either ogg-tools or oggz-tools.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, the motions to apply to events may not be known in advance (for instance, for a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I was going that way. Still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

=== Packaging ===

It would be really nice to have packages for libkate/libtiger for many distros.

If you're a packager for a distro which doesn't yet have packages for libkate
or libtiger, please consider helping :)

In particular, packages for Debian would be grand.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the data packets themselves concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.
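
The layout above can be unpacked with a few lines of code. The following is a minimal Python sketch, not taken from any existing demuxer: the function name split_xiph_laced is made up for illustration, and it assumes the usual Xiph lacing where each length is stored as a run of 255 bytes followed by a final byte below 255.

```python
def split_xiph_laced(private_data):
    """Split Matroska private data into its Xiph-laced packets (sketch)."""
    if len(private_data) == 0:
        raise ValueError("empty private data")
    num_coded = private_data[0]  # NP: number of packets present, minus 1
    pos = 1
    sizes = []
    for _ in range(num_coded):
        size = 0
        while True:
            if pos >= len(private_data):
                raise ValueError("truncated lacing")
            byte = private_data[pos]
            pos += 1
            size += byte
            if byte != 255:  # a byte below 255 ends this length
                break
        sizes.append(size)
    # The last packet's length is not coded: it is whatever remains.
    remaining = len(private_data) - pos - sum(sizes)
    if remaining < 0:
        raise ValueError("coded lengths exceed the private data size")
    sizes.append(remaining)
    packets = []
    for size in sizes:
        packets.append(private_data[pos:pos + size])
        pos += size
    return packets
```

Since the number of Kate headers is not fixed, a caller should take however many packets come back rather than expect three as for Vorbis or Theora.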

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== Things I need to get feedback on ==

* is it a good idea to avoid floating point usage altogether ?

== Frequently Asked Questions ==

=== Does libkate work on other platforms than Linux ? ===

Yes, libkate is not Linux specific in any way. It optionally relies on libogg
and libpng, two libraries widely ported to various platforms.
It has been reported to work on Windows and MacOS X as well as UNIX platforms.

However, libtiger, a rendering library for Kate streams, relies on Pango and Cairo,
which are not easy to build on Windows, though they can be.
The Tiger renderer is however completely separate from libkate, and is not needed
for full encoding and decoding of Kate streams.

=== Where can I find some example files ? ===

The libkate distribution can generate various examples, but already built files
can be found here:
[http://people.xiph.org/~oggk/elephants_dream/elephantsdream-with-subtitles.ogg]
[http://stallman.org/fry/Stephen_Fry-Happy_Birthday_GNU-nq_600px_425kbit.ogv]

These files use raw text only.

[[Category:Drafts]]
[[Category:Ogg Mappings]]

OggKate

2009-01-30T22:34:33Z

Ogg.k.ogg.k: FAQ, first question - libkate works on Windows and MacOS X

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is an overlay codec, originally designed for karaoke and text, that can be
multiplexed in Ogg. Text and images can be carried by a Kate stream, and animated.
Most of the time, this would be multiplexed with audio/video to carry subtitles,
song lyrics (with or without karaoke data), etc, but doesn't have to be.

Series of curves (splines, segments, etc) may be attached to various properties
(text position, font size, etc) to create animated overlays. This allows scrolling
or fading text to be defined. This can even be used to draw arbitrary shapes, so
hand drawing can also be represented by a Kate stream.

Example uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed), and lyrics,
as created by oggenc, from vorbis-tools.

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on - I'd been quicker to write Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this - besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex - though it is being "repurposed" as timed text now, bringing it closer to what I'm doing
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Region definitions header: a list of predefined regions to be referred to by data packets
*Style definitions header: a list of predefined styles to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*repeats: a verbatim repeat of a text packet's payload, in order to bound any backward seeking needed when starting to play a stream partway through. These are also optional.
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This would be to help seeking code
find a kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it's timed by start granule, not by end granule (as Theora and Vorbis are).
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see the granule encoding section.
*The EOS packet should have a granule pos greater than or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (old decoders will parse the stream correctly, but ignore the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(ie, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
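
As an illustration of the rules above, here is a small Python sketch that classifies a packet from its type byte and recognizes a Kate stream from its first packet. The names classify_packet and is_kate_stream are mine and are not part of libkate.

```python
KATE_MAGIC = b"kate\x00\x00\x00"          # bytes 1..7 of every header packet
KATE_ID_SIGNATURE = b"\x80" + KATE_MAGIC  # "\200kate\0\0\0", start of the ID header


def classify_packet(packet):
    """Return 'header' or 'data' depending on the MSB of the type byte."""
    if len(packet) == 0:
        raise ValueError("empty packet")
    if packet[0] & 0x80:
        # Header packets carry the Kate magic right after the type byte.
        if packet[1:8] != KATE_MAGIC:
            raise ValueError("header packet without Kate magic")
        return "header"
    # Data packets do not contain the Kate signature.
    return "data"


def is_kate_stream(first_packet):
    """True if the first packet of a stream looks like a Kate ID header."""
    return first_packet[:8] == KATE_ID_SIGNATURE
```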

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the ID header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x02 repeat
::0x7f end packet (EOS)

The format described here is for bitstream version 0.x.
As of 19 December 2008, the latest bitstream version is 0.4.

For more detailed information, refer to the format documentation
in libkate (see URL below in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

  0                   1                   2                   3  |
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |   packtype    |       Identifier char[7]: 'kate\0\0\0'        | 0-3
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                     kate magic continued                      | 4-7
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | reserved - 0  | version major | version minor |  num headers  | 8-11
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | text encoding | directionality| reserved - 0  | granule shift | 12-15
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | cw sh |     canvas width      | ch sh |     canvas height     | 16-19
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                         reserved - 0                          | 20-23
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                    granule rate numerator                     | 24-27
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                   granule rate denominator                    | 28-31
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                   language (NUL terminated)                   | 32-35
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                     language (continued)                      | 36-39
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                     language (continued)                      | 40-43
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                     language (continued)                      | 44-47
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                   category (NUL terminated)                   | 48-51
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                     category (continued)                      | 52-55
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                     category (continued)                      | 56-59
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                     category (continued)                      | 60-63
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously will not accommodate language tags
with lots of subtags.

Category is currently loosely defined, and I haven't yet found a nice way to
present it in a generic way, but it is meant for automatic classification of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).
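
To show how the byte offsets in the diagram translate to code, here is a hedged Python sketch that pulls the main fields out of a 64 byte ID header. The function name parse_id_header and the dictionary keys are made up for this example. The text above does not state the byte order of the 32 bit granule rate fields, so little endian is an assumption here, and the canvas fields are skipped because their bit layout is not spelled out either.

```python
import struct


def parse_id_header(packet):
    """Extract the main fields of a Kate ID header (sketch, see assumptions)."""
    if len(packet) < 64:
        raise ValueError("ID header must be 64 bytes")
    if packet[:8] != b"\x80kate\x00\x00\x00":
        raise ValueError("not a Kate ID header")

    def c_string(raw):
        # language and category are NUL terminated ASCII strings
        return raw.split(b"\x00", 1)[0].decode("ascii")

    # ASSUMPTION: 32 bit values are stored little endian.
    gps_numerator, gps_denominator = struct.unpack_from("<II", packet, 24)
    return {
        "version_major": packet[9],
        "version_minor": packet[10],
        "num_headers": packet[11],
        "text_encoding": packet[12],
        "directionality": packet[13],
        "granule_shift": packet[15],
        "gps_numerator": gps_numerator,
        "gps_denominator": gps_denominator,
        "language": c_string(packet[32:48]),
        "category": c_string(packet[48:64]),
    }
```

In real code, libkate does this parsing itself; the sketch is only meant to make the layout easier to follow.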

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay
*Cortado (wikimedia version)
*vorbis-tools

I have patches for the following with Kate support:
*MPlayer
*xine
*GStreamer
*Thoggen
*Audacious
*and more...

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo.
It is not quite API stable yet, though no major changes are expected.

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
- a base granule, in the high bits
- a granule offset, in the low bits

+----------------+----------------+
| base | offset |
+----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The timestamp T of a given Kate packet is split into a base B and
offset O, and these are stored in the granulepos of that packet.
The split is done such that the B is the time of the earliest event
still active at the time, and the O is the time elapsed between B
and T. Thus, T = B + O. This mimics the way Theora stores its own
timestamps in granulepos, where the base acts as a keyframe, and
the offset acts as the position of an inter frame relative to the previous
keyframe. Since Kate allows time overlapping events, however, the
choice of the base to use is slightly more complex, as it may not
be the starting time of the previous event, if the stream contains
time overlapping events.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
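
The base/offset split can be sketched as follows. This is an illustrative Python version, not the libkate code: the function names are mine, and it assumes the granule rate is expressed in granules per second, so that a granule count is converted to seconds by multiplying by the denominator and dividing by the numerator.

```python
from fractions import Fraction


def make_granulepos(base, offset, granule_shift):
    """Pack a base (high bits) and an offset (low bits) into a granulepos."""
    if offset >> granule_shift:
        raise ValueError("offset does not fit in granule_shift bits")
    return (base << granule_shift) | offset


def split_granulepos(granulepos, granule_shift):
    """Recover (base, offset) from a granulepos."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset


def granule_time(granulepos, granule_shift, gps_numerator, gps_denominator):
    """Timestamp T = B + O, converted to seconds as an exact fraction."""
    base, offset = split_granulepos(granulepos, granule_shift)
    return Fraction((base + offset) * gps_denominator, gps_numerator)
```

For example, with a granule rate of 1000/1 and a granule_shift of 32, a packet emitted at 5.25 seconds while an event started at 5 seconds is still active would have a base of 5000 and an offset of 250.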

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit values at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definitions, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by the means of a series of curves (static points, segments, splines (catmull-rom, bezier, and b-splines)).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have attached semantics so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting
at the right time and for the right duration a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.
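
Discontinuities and pauses are easier to see in code. Below is a simplified Python sketch which models a motion as a list of (duration, curve) pairs, where a curve is either None (standing in for the 'none' type) or a pair of points to interpolate linearly. This representation and the name motion_point are purely hypothetical and much simpler than what libkate stores; splines are left out.

```python
def motion_point(motion, t):
    """Return the 2D point of a motion at time t, or None if there is none."""
    start = 0.0
    for duration, curve in motion:
        if start <= t < start + duration:
            if curve is None:
                # 'none' curve: the motion is discontinuous here
                return None
            (x0, y0), (x1, y1) = curve
            u = (t - start) / duration
            return (x0 + (x1 - x0) * u, y0 + (y1 - y0) * u)
        start += duration
    return None  # outside the lifetime of the motion


# Move right for 2 seconds, pause 1 second, hide 1 second, then move down.
example_motion = [
    (2.0, ((0.0, 0.0), (1.0, 0.0))),
    (1.0, ((1.0, 0.0), (1.0, 0.0))),  # two equal points: a pause
    (1.0, None),                      # no point generated
    (2.0, ((1.0, 0.0), (1.0, 1.0))),
]
```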

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary, if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially when considering that the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I tend to dislike very
much editing XML by hand, as it's really hard to read. XML is really meant for machines to
generically parse text data in a shared syntax but with possibly unknown semantics, and I need those
text representations to be editable easily.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from one style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place where a subtitle has been started, but is not removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the original seeked granule. This implies support code on players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, of which I am not certain.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.
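
To make the first option more concrete, here is a sketch of the two phase seek in Python. It models a stream as a list of (time, earliest_active_time) pairs sorted by time, which is a stand-in for real Ogg pages; the name two_phase_seek and this representation are assumptions made for the example only.

```python
import bisect


def two_phase_seek(packets, target):
    """Return the index of the packet decoding should start from.

    packets: list of (time, earliest_active_time) pairs, sorted by time.
    target: the time the user asked to seek to.
    """
    times = [time for time, _ in packets]
    # Phase one: find the next Kate packet at or after the target.
    index = bisect.bisect_left(times, target)
    if index == len(packets):
        return None  # nothing left in the stream
    _, earliest_active = packets[index]
    if earliest_active < target:
        # Phase two: an earlier event is still active, seek back to it.
        return bisect.bisect_left(times, earliest_active)
    return index
```

With packets at times 0, 5, 8 and 20, where the packet at 8 reports that the event from time 5 is still active, seeking to 7 lands on the packet at 8 first, then goes back to the one at 5.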

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
text encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.

=== Bitstream format for floating point values ===

Floating point values are turned to a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the floating point values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over 32 bits raw writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
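
The idea of stripping common zero bits can be sketched like this. It is a simplified Python illustration under stated assumptions, not the libkate encoder: it only handles non-negative values, the function names are mine, and the way sign and the actual bit layout are handled in the bitstream is not shown.

```python
def to_fixed_16_16(value):
    """Quantize a floating point value to 16.16 fixed point."""
    return int(round(value * 65536))


def common_zero_bits(fixed_values, width=32):
    """Count zero bits shared by all values at the head and at the tail."""
    merged = 0
    for v in fixed_values:
        if v < 0 or v >> width:
            raise ValueError("sketch only handles values that fit in width bits")
        merged |= v
    if merged == 0:
        return width, 0  # all values are zero, nothing left to store
    head = width - merged.bit_length()
    tail = (merged & -merged).bit_length() - 1
    return head, tail


def packed_bits_per_value(values):
    """Bits left to store per value once common zero bits are stripped."""
    head, tail = common_zero_bits([to_fixed_16_16(v) for v in values])
    return 32 - head - tail
```

For the values 0.5, 1.0 and 2.25, the 14 top bits and 14 bottom bits are zero in all three, so only 4 bits per value remain instead of 32.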

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly to let them turn some streams on and off.

Since categories are meant primarily for a machine to parse, they will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics

Please remember the 15 character limit if proposing other categories.

Note that the list of categories is subject to change, and will likely
be replaced by new, more "identifier like" ones. The three ones above,
however, would be kept for backward compatibility as they're already used.

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data, and could also allow things like e-books which can be
either read or listened to from the same bitstream (I have seen no reference to this
being used anywhere, but I see no reason why the granule progression should be temporal,
and not user controlled, such as by using a "next" button which would bump a granule
postion by a preset amount, simulating turning a page (this would be close to necessary
for text-to-speech, as the wall time duration of the spoken speech is not known in
advance to the Kate encoder, and can't be mapped to a time based granule progression)).
All text strings triggered consecutively between the two granule positions would then
be read in order.

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with the ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts, (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

I am also investigating SVG images. These allow for very small footprint images for simple
vector drawings, and could be very useful for things like background gradients below text.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping Unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder (kateenc) and a decoder (katedec) are included in the tools directory.
The encoder supports input from several different formats:
* a custom text based file format (see [[#The Kate file format|The Kate file format]]), which is by no means meant to be part of the Kate bitstream specification itself
* SubRip (.srt), the most common subtitle format I found
* LRC lyrics format.

As an example for the widely used SRT subtitles format, the following command line
creates a Kate subtitles stream from an SRT file:

kateenc -l en -c subtitles -t srt -o subtitles.ogg subtitles.srt

The reverse is possible, to recover an SRT file from a Kate stream, with katedec.

Note that the subtitles.ogg file should then be multiplexed into the A/V stream,
using either ogg-tools or oggz-tools.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, what motions may be applied to events may not be known in advance (for instance, for a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I was going that way. Still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

=== Packaging ===

It would be really nice to have packages for libkate/libtiger for many distros.

If you're a packager for a distro which doesn't yet have packages for libkate
or libtiger, please consider helping :)

In particular, packages for Debian would be grand.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the data packets themselves concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.
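
The layout above can be illustrated with a short Python sketch that splits the private data back into header packets. The function name is invented for this example. It assumes the usual Xiph lacing rule, where each length is stored as a run of 255 bytes followed by one byte below 255, the length being the sum of those bytes.

```python
def split_xiph_laced(private: bytes) -> list:
    """Split Matroska private data into its xiph-laced packets."""
    if not private:
        raise ValueError("empty private data")
    count_minus_one = private[0]  # NP: number of packets, minus 1
    pos = 1
    sizes = []
    for _ in range(count_minus_one):
        size = 0
        while True:
            if pos >= len(private):
                raise ValueError("truncated lacing information")
            byte = private[pos]
            pos += 1
            size += byte
            if byte != 255:  # a byte below 255 ends this length
                break
        sizes.append(size)
    # The last length is not stored: it is whatever remains.
    last = len(private) - pos - sum(sizes)
    if last < 0:
        raise ValueError("packet lengths exceed the private data size")
    sizes.append(last)
    packets = []
    for size in sizes:
        packets.append(private[pos:pos + size])
        pos += size
    return packets
```

Since the number of Kate headers is not fixed, a caller would take the packet count from the result rather than expecting a set number.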

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== Things I need to get feedback on ==

* is it a good idea to avoid floating point usage altogether ?

== Frequently Asked Questions ==

=== Does libkate work on platforms other than Linux? ===

Yes, libkate is not Linux specific in any way. It optionally relies on libogg
and libpng, two libraries widely ported to various platforms.
It has been reported to work on Windows and MacOS X as well as UNIX platforms.

However, libtiger, a rendering library for Kate streams, relies on Pango and Cairo,
which are not easy to build on Windows, though they can be.
The Tiger renderer is however completely separate from libkate, and is not needed
for full encoding and decoding of Kate streams.

[[Category:Drafts]]
[[Category:Ogg Mappings]]

OggKate

2009-01-11T18:43:50Z

Ogg.k.ogg.k: overview - repeat packets

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is an overlay codec, originally designed for karaoke and text, that can be
multiplexed in Ogg. Text and images can be carried by a Kate stream, and animated.
Most of the time, this would be multiplexed with audio/video to carry subtitles,
song lyrics (with or without karaoke data), etc, but doesn't have to be.

Series of curves (splines, segments, etc) may be attached to various properties
(text position, font size, etc) to create animated overlays. This allows scrolling
or fading text to be defined. This can even be used to draw arbitrary shapes, so
hand drawing can also be represented by a Kate stream.

Example uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed), and lyrics,
as created by oggenc, from vorbis-tools.

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on - I'd have been quicker to write Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this - besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex - though it is being "repurposed" as timed text now, bringing it closer to what I'm doing
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Region definitions header: a list of predefined regions to be referred to by data packets
*Style definitions header: a list of predefined styles to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*repeats: a verbatim repeat of a text packet's payload, in order to bound any backward seeking needed when starting to play a stream partway through. These are also optional.
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This would be to help seeking code
find a kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it's timed by start granule, not by end granule as Theora and Vorbis are.
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see the granule encoding section.
*The EOS packet should have a granule pos greater than or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (old decoders will still parse such streams correctly, ignoring the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(ie, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the ID header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x02 repeat
::0x7f end packet (EOS)
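
As an illustration, the signature check and the packet type list above could be expressed as the following Python sketch. The function and table names are invented for this example and are not libkate API.

```python
KATE_MAGIC = b"kate\x00\x00\x00"        # bytes 1 to 7 of every header packet
KATE_SIGNATURE = b"\x80" + KATE_MAGIC   # "\200kate\0\0\0", start of the ID header

# Packet types listed above, keyed by the first byte of the packet.
PACKET_TYPES = {
    0x80: "ID header (BOS)",
    0x81: "Vorbis comment header",
    0x82: "regions list header",
    0x83: "styles list header",
    0x84: "curves list header",
    0x85: "motions list header",
    0x86: "palettes list header",
    0x87: "bitmaps list header",
    0x88: "font ranges and mappings header",
    0x00: "text data",
    0x01: "keepalive",
    0x02: "repeat",
    0x7F: "end packet (EOS)",
}


def is_kate_stream(first_packet: bytes) -> bool:
    """True if the first packet of a stream starts with the Kate signature."""
    return first_packet[:8] == KATE_SIGNATURE


def classify_packet(packet: bytes) -> tuple:
    """Return ("header" or "data", type name) for a Kate packet."""
    if not packet:
        raise ValueError("empty packet")
    packet_type = packet[0]
    # Unknown types are left for future expansion and would be ignored.
    name = PACKET_TYPES.get(packet_type, "unknown")
    if packet_type & 0x80:  # MSB set: header packet, must carry the magic
        if packet[1:8] != KATE_MAGIC:
            raise ValueError("header packet without the Kate magic")
        return ("header", name)
    return ("data", name)  # MSB cleared: data packet, no magic
```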

The format described here is for bitstream version 0.x.
As of 19 December 2008, the latest bitstream version is 0.4.

For more detailed information, refer to the format documentation
in libkate (see URL below in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

0 1 2 3 |
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| packtype | Identifier char[7]: 'kate\0\0\0' | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| kate magic continued | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | version major | version minor | num headers | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| text encoding | directionality| reserved - 0 | granule shift | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| cw sh | canvas width | ch sh | canvas height | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate numerator | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate denominator | 28-31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (NUL terminated) | 32-35
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 36-39
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 40-43
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 44-47
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (NUL terminated) | 48-51
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 52-55
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 56-59
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 60-63
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously will not accommodate language tags
with lots of subtags.

Category is currently loosely defined, and I haven't yet found a nice way to
present it in a generic way, but it is meant for automatic classification of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).
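
As a rough illustration of the layout above, the byte aligned fields of the ID header could be read as in the following Python sketch. This is not libkate code, and the function name is invented. The sketch assumes that the two 32 bit granule rate fields are stored little endian, which the layout above does not state; the canvas size fields are left out because their bit packing is not described here.

```python
import struct

KATE_SIGNATURE = b"\x80kate\x00\x00\x00"


def parse_id_header(packet: bytes) -> dict:
    """Extract the byte aligned fields of a 64 byte Kate ID header."""
    if len(packet) < 64:
        raise ValueError("ID header is too short")
    if packet[:8] != KATE_SIGNATURE:
        raise ValueError("not a Kate ID header")

    def c_string(raw: bytes) -> str:
        # Fields are NUL terminated ASCII: keep what precedes the first NUL.
        return raw.split(b"\x00", 1)[0].decode("ascii")

    # Assumption: 32 bit fields are little endian.
    numerator, denominator = struct.unpack_from("<II", packet, 24)
    return {
        "version_major": packet[9],
        "version_minor": packet[10],
        "num_headers": packet[11],
        "text_encoding": packet[12],
        "directionality": packet[13],
        "granule_shift": packet[15],
        "granule_rate_numerator": numerator,
        "granule_rate_denominator": denominator,
        "language": c_string(packet[32:48]),
        "category": c_string(packet[48:64]),
    }
```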

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay
*Cortado (wikimedia version)
*vorbis-tools

I have patches for the following with Kate support:
*MPlayer
*xine
*GStreamer
*Thoggen
*Audacious
*and more...

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo,
though it is not quite yet API stable (though no major changes are expected).

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
- a base granule, in the high bits
- a granule offset, in the low bits

 +----------------+----------------+
 |      base      |     offset     |
 +----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The timestamp T of a given Kate packet is split into a base B and
offset O, and these are stored in the granulepos of that packet.
The split is done such that the B is the time of the earliest event
still active at the time, and the O is the time elapsed between B
and T. Thus, T = B + O. This mimics the way Theora stores its own
timestamps in granulepos, where the base acts as a keyframe, and
an offset acts as the position of an inter frame from the previous
keyframe. Since Kate allows time overlapping events, however, the
choice of the base to use is slightly more complex, as it may not
be the starting time of the previous event, if the stream contains
time overlapping events.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
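
The split described above can be illustrated with a small Python sketch. The function names are made up for this example. It assumes that gps_numerator/gps_denominator gives the number of granule units per second, so that a time in seconds is the granule count multiplied by gps_denominator and divided by gps_numerator.

```python
from fractions import Fraction


def make_granulepos(base: int, offset: int, granule_shift: int) -> int:
    """Pack a base (high bits) and an offset (low bits) into a granulepos."""
    if offset < 0 or offset >= (1 << granule_shift):
        raise ValueError("offset does not fit in granule_shift bits")
    return (base << granule_shift) | offset


def split_granulepos(granulepos: int, granule_shift: int) -> tuple:
    """Recover (base, offset) from a granulepos."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset


def granule_time(granulepos: int, granule_shift: int,
                 gps_numerator: int, gps_denominator: int) -> Fraction:
    """Timestamp T = B + O of a packet, in seconds, as an exact fraction."""
    base, offset = split_granulepos(granulepos, granule_shift)
    return Fraction((base + offset) * gps_denominator, gps_numerator)
```

With a granule rate of 1000/1 and a granule_shift of 32, a base of 5000 and an offset of 2500 describe a packet at 7.5 seconds, while the earliest event still active started at 5 seconds.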

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit values at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definition, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by the means of a series of curves (static points, segments, splines (catmull-rom, bezier, and b-splines)).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have attached semantics so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting
at the right time and for the right duration a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.
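
To make the idea of a motion as a timed series of curves more concrete, here is a simplified Python sketch. The data model (a list of (kind, duration, points) tuples) and the function name are invented for this example and do not mirror the libkate structures. Splines are left out, and a linear curve is assumed to spread its segments evenly over its duration.

```python
def motion_point(curves, t):
    """Return the 2D point of a motion at time t, or None if there is none.

    curves is a list of (kind, duration, points) tuples played one after
    the other, where kind is "none", "static" or "linear".
    """
    if t < 0:
        return None
    start = 0.0
    for kind, duration, points in curves:
        if t < start + duration:
            if kind == "none":
                return None  # discontinuity: no point is generated
            if kind == "static":
                return points[0]
            if kind == "linear":
                if len(points) < 2:
                    raise ValueError("a linear curve needs two points or more")
                segments = len(points) - 1
                position = (t - start) / duration * segments
                index = min(int(position), segments - 1)
                fraction = position - index
                (x0, y0), (x1, y1) = points[index], points[index + 1]
                return (x0 + (x1 - x0) * fraction, y0 + (y1 - y0) * fraction)
            raise ValueError("unsupported curve type: " + kind)
        start += duration
    return None  # past the end of the motion


# A move to the right, a one second pause, a hidden second, then a fixed point.
example_motion = [
    ("linear", 2.0, [(0.0, 0.0), (1.0, 0.0)]),
    ("linear", 1.0, [(1.0, 0.0), (1.0, 0.0)]),  # pause: two equal points
    ("none", 1.0, []),
    ("static", 1.0, [(0.5, 0.5)]),
]
```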

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary, if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially when considering that the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.
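
As an illustration, a description file in this minimal form could be generated from a list of timed texts, as in the Python sketch below. The helper names are invented. The sketch only handles whole seconds and refuses text containing double quotes or newlines, since the escaping rules of the file format are not covered here.

```python
def format_timestamp(seconds: int) -> str:
    """Format a whole number of seconds as HH:MM:SS."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def kate_description(events) -> str:
    """Build a minimal Kate description from (start, end, text) tuples."""
    lines = ["kate {"]
    for start, end, text in events:
        if '"' in text or "\n" in text:
            raise ValueError("quoting rules are outside the scope of this sketch")
        lines.append(
            f'  event {{ {format_timestamp(start)} --> '
            f'{format_timestamp(end)} "{text}" }}'
        )
    lines.append("}")
    return "\n".join(lines) + "\n"
```

Calling kate_description([(5, 10, "This is a text")]) yields the same event as the example above, ready to be passed to the encoder.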

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I very much dislike
editing XML by hand, as it's really hard to read. XML is really meant for machines to parse
text data generically, in a shared syntax but with possibly unknown semantics, and I need those
text representations to be easily editable.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from one style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place where a subtitle has been started, but has not been removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the originally requested granule. This implies support code on players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, of which I am not certain.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.
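
The first option above (each data packet carrying the granule of the earliest still active packet) could be sketched as follows. This is only an illustrative model: the <code>Packet</code> type, its field names and the in-memory list are hypothetical, and a real player would seek within an Ogg file rather than a Python list.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List


@dataclass
class Packet:
    granule: int          # start time of this packet, in granule units
    earliest_active: int  # granule of the earliest packet still active then


def two_phase_seek(packets: List[Packet], target: int) -> int:
    """Return the index of the packet decoding should start from.

    `packets` must be sorted by granule. Hypothetical model, not libkate code.
    """
    granules = [p.granule for p in packets]
    # Phase 1: land on the first packet at or after the target.
    first = bisect_left(granules, target)
    if first == len(packets):
        return first  # nothing at or after the target
    back = packets[first].earliest_active
    if back >= target:
        return first  # no earlier event is still active
    # Phase 2: seek again, to the earliest packet that is still active.
    return bisect_left(granules, back)
```

With this model, a seek to granule 15 that lands on a packet whose back link is 10 restarts decoding from the packet at granule 10.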

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
language encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.

=== Bitstream format for floating point values ===

Floating point values are turned to a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the floating point values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over 32 bits raw writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
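
A minimal sketch of the idea, assuming values are quantized to 32 bit two's complement words and that the shared head and tail zero bit counts are computed over the whole series. The function names are hypothetical and this does not reproduce the exact bit layout libkate writes.

```python
from typing import List, Tuple

FRAC_BITS = 16
WORD_BITS = 32
MASK = (1 << WORD_BITS) - 1


def to_fixed(x: float) -> int:
    """Quantize to 16.16 fixed point, kept as a 32 bit two's complement word."""
    return int(round(x * (1 << FRAC_BITS))) & MASK


def from_fixed(word: int) -> float:
    if word >= 1 << (WORD_BITS - 1):  # negative value in two's complement
        word -= 1 << WORD_BITS
    return word / (1 << FRAC_BITS)


def pack(values: List[float]) -> Tuple[int, int, List[int]]:
    """Return (head zero bits, tail zero bits, remainder bits per value)."""
    words = [to_fixed(v) for v in values]
    combined = 0
    for w in words:
        combined |= w
    if combined == 0:
        return WORD_BITS, 0, [0] * len(words)
    head = WORD_BITS - combined.bit_length()         # zero bits shared at the top
    tail = (combined & -combined).bit_length() - 1   # zero bits shared at the bottom
    return head, tail, [w >> tail for w in words]


def unpack(head: int, tail: int, remainders: List[int]) -> List[float]:
    # head is only needed to know the stored width: 32 - head - tail bits.
    return [from_fixed(r << tail) for r in remainders]
```

For the values 0.5, 1.25 and 3.0, each value needs only 32 - 14 - 14 = 4 bits once the shared zero bits are factored out. Negative values set the top bit, so a series containing any of them has no shared head zero bits in this sketch.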

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly to let them turn some streams on and off.

Since this category is meant primarily for a machine to parse, it will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics

Please remember the 15 character limit if proposing other categories.
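
As a quick check of a proposed name against the field constraints described above (ASCII only, 15 characters plus a terminating zero in a 16 byte field), something like the following could be used. The helper name is hypothetical.

```python
MAX_CATEGORY_LEN = 15  # 16 byte field, minus the terminating NUL


def encode_category(name: str) -> bytes:
    """Return the 16 byte, NUL padded field for a category name."""
    raw = name.encode("ascii")  # raises UnicodeEncodeError on non ASCII input
    if len(raw) > MAX_CATEGORY_LEN:
        raise ValueError(f"category {name!r} is longer than {MAX_CATEGORY_LEN} characters")
    return raw.ljust(MAX_CATEGORY_LEN + 1, b"\0")
```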

Note that the list of categories is subject to change, and will likely
be replaced by new, more "identifier like" ones. The three ones above,
however, would be kept for backward compatibility as they're already used.

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data, and could also allow things like e-books which can be
either read or listened to from the same bitstream (I have seen no reference to this
being used anywhere, but I see no reason why the granule progression should be temporal,
and not user controlled, such as by using a "next" button which would bump a granule
position by a preset amount, simulating turning a page (this would be close to necessary
for text-to-speech, as the wall time duration of the spoken speech is not known in
advance to the Kate encoder, and can't be mapped to a time based granule progression)).
All text strings triggered consecutively between the two granule positions would then
be read in order.

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts, (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.
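
To illustrate "as many bits per pixel as can address the palette", here is a small sketch, assuming palette sizes are powers of 2 from 2 to 256 as described above. The function name is hypothetical.

```python
def bits_per_pixel(palette_size: int) -> int:
    """Number of bits needed to index a palette of the given size."""
    is_power_of_two = palette_size & (palette_size - 1) == 0
    if not 2 <= palette_size <= 256 or not is_power_of_two:
        raise ValueError("palette size must be a power of 2 between 2 and 256")
    return (palette_size - 1).bit_length()
```

A 16 color palette thus needs 4 bits per pixel, and a full 256 color palette needs 8.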

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

I am also investigating SVG images. These allow for very small footprint images for simple
vector drawings, and could be very useful for things like background gradients below text.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping Unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder (kateenc) and a decoder (katedec) are included in the tools directory.
The encoder supports input from several different formats:
* a custom text based file format (see [[#The Kate file format|The Kate file format]]), which is by no means meant to be part of the Kate bitstream specification itself
* SubRip (.srt), the most common subtitle format I found
* LRC lyrics format.

As an example for the widely used SRT subtitles format, the following command line
creates a Kate subtitles stream from an SRT file:

kateenc -l en -c subtitles -t srt -o subtitles.ogg subtitles.srt

The reverse is possible, to recover an SRT file from a Kate stream, with katedec.
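
For batch conversion, the same kateenc invocation can be built from a script. This is a sketch only: the function name is hypothetical, and the only options used are the ones appearing in the command line above.

```python
import shlex
from typing import List


def kateenc_command(srt_path: str, ogg_path: str,
                    language: str = "en", category: str = "subtitles") -> List[str]:
    """Build the argument list for converting an SRT file with kateenc."""
    return ["kateenc", "-l", language, "-c", category,
            "-t", "srt", "-o", ogg_path, srt_path]


cmd = kateenc_command("subtitles.srt", "subtitles.ogg")
print(shlex.join(cmd))
```

The resulting list can be passed to <code>subprocess.run(cmd, check=True)</code> on a system where kateenc is installed.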

Note that the subtitles.ogg file should then be multiplexed into the A/V stream,
using either ogg-tools or oggz-tools.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, what motions may be applied to events may not be known in advance (for instance, for a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I was going that way. Still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

=== Packaging ===

It would be really nice to have packages for libkate/libtiger for many distros.

If you're a packager for a distro which doesn't yet have packages for libkate
or libtiger, please consider helping :)

In particular, packages for Debian would be grand.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the data packets themselves concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.
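
The layout above can be sketched as a pair of functions that build and split the private data. Xiph style lacing codes each length as a run of 255 bytes followed by one byte below 255, the length being the sum. The function names are hypothetical.

```python
from typing import List


def xiph_lace(packets: List[bytes]) -> bytes:
    """Concatenate header packets into Matroska private data."""
    if not 1 <= len(packets) <= 256:
        raise ValueError("need between 1 and 256 packets")
    out = bytearray([len(packets) - 1])   # byte 0: NP
    for p in packets[:-1]:                # the last length is not stored
        out += b"\xff" * (len(p) // 255)
        out.append(len(p) % 255)
    for p in packets:
        out += p
    return bytes(out)


def xiph_unlace(private: bytes) -> List[bytes]:
    """Split Matroska private data back into header packets."""
    if not private:
        raise ValueError("empty private data")
    count = private[0]
    pos = 1
    sizes = []
    for _ in range(count):
        size = 0
        while True:
            if pos >= len(private):
                raise ValueError("truncated lacing")
            byte = private[pos]
            pos += 1
            size += byte
            if byte != 255:
                break
        sizes.append(size)
    last = len(private) - pos - sum(sizes)  # deduced, not stored
    if last < 0:
        raise ValueError("packet sizes exceed private data size")
    sizes.append(last)
    packets = []
    for size in sizes:
        packets.append(private[pos:pos + size])
        pos += size
    return packets
```

Since the number of Kate headers is not fixed, the reader takes the packet count from byte 0 rather than assuming three packets as is common for Vorbis and Theora.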

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== Things I need to get feedback on ==

* is it a good idea to avoid floating point usage altogether ?

[[Category:Drafts]]
[[Category:Ogg Mappings]]

OggKate

2009-01-04T11:14:59Z

Ogg.k.ogg.k: /* Support */

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is an overlay codec, originally designed for karaoke and text, that can be
multiplexed in Ogg. Text and images can be carried by a Kate stream, and animated.
Most of the time, this would be multiplexed with audio/video to carry subtitles,
song lyrics (with or without karaoke data), etc, but doesn't have to be.

Series of curves (splines, segments, etc) may be attached to various properties
(text position, font size, etc) to create animated overlays. This allows scrolling
or fading text to be defined. This can even be used to draw arbitrary shapes, so
hand drawing can also be represented by a Kate stream.

Example uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed), and lyrics,
as created by oggenc, from vorbis-tools.

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on - I'd have been quicker to write Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this - besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex - though it is being "repurposed" as timed text now, bringing it closer to what I'm doing
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Style definitions header: a list of predefined styles to be referred to by data packets
*Region definitions header: a list of predefined regions to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This would be to help seeking code
find a kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it's timed by start granule, not end granule (as Theora and Vorbis).
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see the granule encoding section.
*The EOS packet should have a granule pos higher or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (though old decoders will correctly parse, but ignore the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(ie, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
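
A minimal sketch of these checks on raw packet bytes (the function names are hypothetical; "\200" is the octal escape for the byte 0x80):

```python
KATE_MAGIC = b"kate\x00\x00\x00"          # bytes 1 to 7 of every header packet
KATE_BOS_SIGNATURE = b"\x80" + KATE_MAGIC  # first eight bytes of the ID header


def is_header_packet(packet: bytes) -> bool:
    """True if the packet type byte has its MSB set and the magic follows."""
    return len(packet) >= 8 and bool(packet[0] & 0x80) and packet[1:8] == KATE_MAGIC


def is_data_packet(packet: bytes) -> bool:
    """True if the packet type byte has its MSB cleared."""
    return len(packet) >= 1 and not packet[0] & 0x80


def is_kate_stream(first_packet: bytes) -> bool:
    """True if the first packet of a stream is a Kate ID header."""
    return first_packet[:8] == KATE_BOS_SIGNATURE
```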

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the ID header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x02 repeat
::0x7f end packet (EOS)

The format described here is for bitstream version 0.x.
As of 19 December 2008, the latest bitstream version is 0.4.

For more detailed information, refer to the format documentation
in libkate (see URL below in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

0 1 2 3 |
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| packtype | Identifier char[7]: 'kate\0\0\0' | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| kate magic continued | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | version major | version minor | num headers | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| text encoding | directionality| reserved - 0 | granule shift | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| cw sh | canvas width | ch sh | canvas height | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate numerator | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate denominator | 28-31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (NUL terminated) | 32-35
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 36-39
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 40-43
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 44-47
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (NUL terminated) | 48-51
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 52-55
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 56-59
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 60-63
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

Language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously will not accommodate language tags
with lots of subtags.

Category is currently loosely defined, and I haven't yet found a nice way to
present it in a generic way, but it is meant for automatic classifying of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).
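
Reading the 64 byte ID header by byte offset could look like the following sketch. Two things here are assumptions rather than facts taken from this page: the 32 bit granule rate fields are read as little-endian, and the canvas bytes (16-19) are returned undecoded because their bit order is not described here. The type and function names are hypothetical; the libkate format documentation is the reference.

```python
import struct
from typing import NamedTuple


class KateIdHeader(NamedTuple):
    version_major: int
    version_minor: int
    num_headers: int
    text_encoding: int
    directionality: int
    granule_shift: int
    canvas: bytes          # bytes 16-19, left undecoded in this sketch
    gps_numerator: int
    gps_denominator: int
    language: str
    category: str


# Offsets follow the layout above; "x" skips the reserved bytes.
ID_HEADER = struct.Struct("<8s x B B B B B x B 4s 4x I I 16s 16s")


def _cstring(field: bytes) -> str:
    """Decode a NUL terminated ASCII field."""
    return field.split(b"\0", 1)[0].decode("ascii")


def parse_id_header(packet: bytes) -> KateIdHeader:
    if len(packet) < ID_HEADER.size:
        raise ValueError("ID header is too short")
    (magic, major, minor, num_headers, encoding, direction, shift,
     canvas, gps_num, gps_den, language, category) = ID_HEADER.unpack_from(packet)
    if magic != b"\x80kate\x00\x00\x00":
        raise ValueError("not a Kate ID header")
    return KateIdHeader(major, minor, num_headers, encoding, direction, shift,
                        canvas, gps_num, gps_den,
                        _cstring(language), _cstring(category))
```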

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay
*Cortado (wikimedia version)
*vorbis-tools

I have patches for the following with Kate support:
*MPlayer
*xine
*GStreamer
*Thoggen
*Audacious
*and more...

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo,
though it is not quite yet API stable (though no major changes are expected).

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
- a base granule, in the high bits
- a granule offset, in the low bits

+----------------+----------------+
| base | offset |
+----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The timestamp T of a given Kate packet is split into a base B and
offset O, and these are stored in the granulepos of that packet.
The split is done such that the B is the time of the earliest event
still active at the time, and the O is the time elapsed between B
and T. Thus, T = B + O. This mimics the way Theora stores its own
timestamps in granulepos, where the base acts as a keyframe, and
an offset acts as the position of an inter frame from the previous
keyframe. Since Kate allows time overlapping events, however, the
choice of the base to use is slightly more complex, as it may not
be the starting time of the previous event, if the stream contains
time overlapping events.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
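
The split and the time mapping can be sketched as follows, assuming the granule rate is numerator/denominator granules per second, so that a time in seconds is (B + O) * denominator / numerator. The function names are hypothetical and are not the libkate API.

```python
from fractions import Fraction
from typing import Tuple


def make_granulepos(base: int, offset: int, granule_shift: int) -> int:
    """Pack a base (high bits) and an offset (low bits) into a granulepos."""
    if offset >> granule_shift:
        raise ValueError("offset does not fit in granule_shift bits")
    return (base << granule_shift) | offset


def split_granulepos(granulepos: int, granule_shift: int) -> Tuple[int, int]:
    """Recover (base, offset) from a granulepos."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset


def granule_time(granulepos: int, granule_shift: int,
                 gps_numerator: int, gps_denominator: int) -> Fraction:
    """Timestamp T = B + O of a packet, converted to seconds."""
    base, offset = split_granulepos(granulepos, granule_shift)
    return Fraction((base + offset) * gps_denominator, gps_numerator)
```

For example, with a granule rate of 1000/1 and a shift of 32, a packet at 5.25 seconds whose earliest still active event started at 5 seconds has base 5000 and offset 250.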

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit values at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definition, originally for karaoke purposes, but
which can be used for more general purpose, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by the means of a series of curves (static points, segments, splines (catmull-rom, bezier, and b-splines)).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have attached semantics so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting
at the right time and for the right duration a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.
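
The pause and 'none' cases can be illustrated with a toy evaluator. This is a simplified, hypothetical model (a motion as a list of curves, each with a kind, a duration and control points, with only linear segments supported); it is not how libkate represents or evaluates motions.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Curve = Tuple[str, float, List[Point]]  # (kind, duration in seconds, control points)


def evaluate(motion: List[Curve], t: float) -> Optional[Point]:
    """Return the 2D point of the motion at time t, or None if there is none."""
    start = 0.0
    for kind, duration, points in motion:
        if start <= t < start + duration:
            if kind == "none":
                return None  # discontinuity: no point is generated
            if kind == "linear":
                (x0, y0), (x1, y1) = points
                u = (t - start) / duration
                return (x0 + (x1 - x0) * u, y0 + (y1 - y0) * u)
            raise ValueError(f"unsupported curve kind: {kind}")
        start += duration
    return None  # t is outside the lifetime of the motion


motion = [
    ("linear", 2.0, [(0.0, 0.0), (1.0, 0.0)]),  # move right for 2 seconds
    ("linear", 1.0, [(1.0, 0.0), (1.0, 0.0)]),  # pause: two equal points
    ("none", 1.0, []),                          # hidden for 1 second
    ("linear", 1.0, [(1.0, 0.0), (0.0, 0.0)]),  # move back
]
```

Evaluating this motion during the second curve always yields (1.0, 0.0), which is the pause, and evaluating it during the third curve yields no point at all.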

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially considering that the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.
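
For instance, a small Python sketch that writes such a minimal description from a list of timed texts. It only uses the syntax visible in the example above; the helper names are invented here, and since quoting rules for text are not covered in this section, the sketch refuses text that contains double quotes.

```python
def format_timestamp(seconds: int) -> str:
    """Format whole seconds as HH:MM:SS, the form used in the example above."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

def make_kate_description(events) -> str:
    """Build a minimal description from (start, end, text) tuples, times in whole seconds."""
    lines = ["kate {"]
    for start, end, text in events:
        if end < start:
            raise ValueError("event ends before it starts")
        if '"' in text:
            # Escaping is not described here, so this sketch does not guess at it.
            raise ValueError("text containing double quotes is not handled")
        lines.append(
            f'  event {{ {format_timestamp(start)} --> {format_timestamp(end)} "{text}" }}'
        )
    lines.append("}")
    return "\n".join(lines) + "\n"

print(make_kate_description([(5, 10, "This is a text")]), end="")
```

The output can then be saved to a file and given to the encoder as a Kate description.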

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I tend to dislike very
much editing XML by hand, as it's really hard to read. XML is really meant for machines to
generically parse text data in a shared syntax but with possibly unknown semantics, and I need
those text representations to be easily editable.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from one style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place when a subtitle has been started, but is not removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the granule originally sought. This implies support code on players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, of which I am not certain.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
language encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.

=== Bitstream format for floating point values ===

Floating point values are turned to a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the floating point values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over 32 bits raw writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
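
The idea can be sketched in Python as follows. This is only an illustration of the principle (quantize to 16.16, find the zero bits shared by all values at the head and tail, keep the remaining bits): the exact bit layout, the width of the stored counts and any sign handling used by libkate are not reproduced here, and the function names are hypothetical.

```python
def to_fixed_16_16(value: float) -> int:
    """Quantize to 16.16 fixed point, returned as a 32-bit two's complement pattern."""
    fixed = round(value * 65536)
    if not -(1 << 31) <= fixed < (1 << 31):
        raise ValueError("value out of 16.16 range")
    return fixed & 0xFFFFFFFF

def from_fixed_16_16(bits: int) -> float:
    """Inverse of to_fixed_16_16."""
    if bits & 0x80000000:
        bits -= 1 << 32
    return bits / 65536

def common_zero_bits(words):
    """Count zero bits shared by all words at the head (MSB side) and tail (LSB side)."""
    merged = 0
    for word in words:
        merged |= word
    if merged == 0:
        return 32, 0
    head = 32 - merged.bit_length()
    tail = (merged & -merged).bit_length() - 1
    return head, tail

def pack_values(values):
    """Return (head, tail, payload); each payload entry needs only 32-head-tail bits."""
    words = [to_fixed_16_16(v) for v in values]
    head, tail = common_zero_bits(words)
    return head, tail, [w >> tail for w in words]

def unpack_values(head, tail, payload):
    """Rebuild the quantized values; head is only needed to know the payload width."""
    return [from_fixed_16_16(p << tail) for p in payload]

head, tail, payload = pack_values([0.5, 1.0, 2.5])
bits_per_value = 32 - head - tail
```

With the three values above, each value needs 3 bits instead of 32. A negative value sets the top bits of its two's complement pattern, so in this naive sketch it removes the head savings for the whole set.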

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly to let them turn some streams on and off.

Since this category is meant primarily for a machine to parse, it will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics

Please remember the 15 character limit if proposing other categories.

Note that the list of categories is subject to change, and will likely
be replaced by new, more "identifier like" ones. The three ones above,
however, would be kept for backward compatibility as they're already used.

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data, and could also allow things like e-books which can be
either read or listened to from the same bitstream (I have seen no reference to this
being used anywhere, but I see no reason why the granule progression should be temporal,
and not user controlled, such as by using a "next" button which would bump a granule
position by a preset amount, simulating turning a page (this would be close to necessary
for text-to-speech, as the wall time duration of the spoken speech is not known in
advance to the Kate encoder, and can't be mapped to a time based granule progression)).
All text strings triggered consecutively between the two granule positions would then
be read in order.

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with the ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts, (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.
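
As a small worked example of "as many bits per pixel as can address the palette", here is a Python sketch. The helper names are invented, the lower bound of 2 colors is an assumption of this sketch, and any row padding or packing order used in the bitstream is not covered.

```python
def bits_per_pixel(palette_size: int) -> int:
    """Bits needed to index a palette whose size is a power of 2, up to 256."""
    # Assumption of this sketch: the smallest palette considered has 2 colors.
    if not 2 <= palette_size <= 256 or palette_size & (palette_size - 1):
        raise ValueError("palette size must be a power of 2 between 2 and 256")
    return palette_size.bit_length() - 1

def pixel_data_bits(width: int, height: int, palette_size: int) -> int:
    """Total bits of pixel data for a bitmap, ignoring any padding."""
    return width * height * bits_per_pixel(palette_size)
```

A 16 color palette thus needs 4 bits per pixel, and a full 256 color palette needs 8.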

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

I am also investigating SVG images. These allow for very small footprint images for simple
vector drawings, and could be very useful for things like background gradients below text.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder (kateenc) and a decoder (katedec) are included in the tools directory.
The encoder supports input from several different formats:
* a custom text based file format (see [[#The Kate file format|The Kate file format]]), which is by no means meant to be part of the Kate bitstream specification itself
* SubRip (.srt), the most common subtitle format I found
* LRC lyrics format.

As an example for the widely used SRT subtitles format, the following command line
creates a Kate subtitles stream from an SRT file:

 kateenc -l en -c subtitles -t srt -o subtitles.ogg subtitles.srt

The reverse is possible, to recover an SRT file from a Kate stream, with katedec.

Note that the subtitles.ogg file should then be multiplexed into the A/V stream,
using either ogg-tools or oggz-tools.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, what motions may be applied to events may not be known in advance (for instance, for a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I was going that way. Still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

=== Packaging ===

It would be really nice to have packages for libkate/libtiger for many distros.

If you're a packager for a distro which doesn't have yet packages for libkate
or libtiger, please consider helping :)

In particular, packages for Debian would be grand.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the data packets themselves concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.
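
The layout above can be unpacked with a few lines of code. The following Python sketch splits a private data blob into its header packets; the function name is hypothetical, and it relies only on the usual xiph lacing rule where a length is the sum of a run of bytes that ends at the first byte below 255.

```python
def split_xiph_laced(private: bytes):
    """Split xiph-laced private data into the packets it contains."""
    if not private:
        raise ValueError("empty private data")
    count = private[0] + 1           # byte 0 holds the packet count minus 1
    pos = 1
    sizes = []
    for _ in range(count - 1):       # the last packet's length is not stored
        size = 0
        while True:
            if pos >= len(private):
                raise ValueError("truncated lacing")
            value = private[pos]
            pos += 1
            size += value
            if value != 255:         # a byte below 255 ends this length
                break
        sizes.append(size)
    last = len(private) - pos - sum(sizes)
    if last < 0:
        raise ValueError("packet lengths exceed private data size")
    sizes.append(last)               # deduced from the total size
    packets = []
    for size in sizes:
        packets.append(private[pos:pos + size])
        pos += size
    return packets
```

Since the number of Kate headers is not fixed, a demuxer would pass every packet returned here to the decoder rather than expecting exactly three as with Vorbis or Theora.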

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== Things I need to get feedback on ==

* is it a good idea to avoid floating point usage altogether ?

[[Category:Drafts]]
[[Category:Ogg Mappings]]

OggKate

2009-01-01T22:52:52Z

Ogg.k.ogg.k: list repeat packet (0x02)

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is an overlay codec, originally designed for karaoke and text, that can be
multiplexed in Ogg. Text and images can be carried by a Kate stream, and animated.
Most of the time, this would be multiplexed with audio/video to carry subtitles,
song lyrics (with or without karaoke data), etc, but doesn't have to be.

Series of curves (splines, segments, etc) may be attached to various properties
(text position, font size, etc) to create animated overlays. This allows scrolling
or fading text to be defined. This can even be used to draw arbitrary shapes, so
hand drawing can also be represented by a Kate stream.

Example uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed), and lyrics,
as created by oggenc, from vorbis-tools.

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on - I'd been quicker to write Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this - besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex - though it is being "repurposed" as timed text now, bringing it closer to what I'm doing
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Style definitions header: a list of predefined styles to be referred to by data packets
*Region definitions header: a list of predefined regions to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This would be to help seeking code
find a kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it's timed by start granule, not end granule (as Theora and Vorbis are).
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see the granule encoding section.
*The EOS packet should have a granule pos greater than or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (though old decoders will correctly parse, but ignore the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(eg, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (eg, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
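
In code, these two checks can look like the following Python sketch (function names are hypothetical; the byte values come straight from the description above, with octal \200 being 0x80).

```python
KATE_ID_SIGNATURE = b"\x80kate\x00\x00\x00"  # "\200kate\0\0\0"
KATE_MAGIC = b"kate\x00\x00\x00"             # bytes 1 to 7 of every header packet

def is_kate_id_header(packet: bytes) -> bool:
    """True if the packet starts with the ID header signature."""
    return packet[:8] == KATE_ID_SIGNATURE

def classify_packet(packet: bytes) -> str:
    """Return "header" or "data" based on the MSB of the type byte."""
    if not packet:
        raise ValueError("empty packet")
    if packet[0] & 0x80:
        # Header packets carry the magic; data packets do not.
        if packet[1:8] != KATE_MAGIC:
            raise ValueError("header packet without Kate magic")
        return "header"
    return "data"
```

A demuxer would typically run <code>is_kate_id_header</code> on the first packet of each logical stream to decide whether to hand it to a Kate decoder.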

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the ID header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x02 repeat
::0x7f end packet (EOS)

The format described here is for bitstream version 0.x.
As of 19 December 2008, the latest bitstream version is 0.4.

For more detailed information, refer to the format documentation
in libkate (see URL below in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

0 1 2 3 |
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| packtype | Identifier char[7]: 'kate\0\0\0' | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| kate magic continued | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | version major | version minor | num headers | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| text encoding | directionality| reserved - 0 | granule shift | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| cw sh | canvas width | ch sh | canvas height | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate numerator | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate denominator | 28-31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (NUL terminated) | 32-35
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 36-39
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 40-43
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 44-47
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (NUL terminated) | 48-51
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 52-55
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 56-59
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 60-63
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously will not accommodate language tags
with lots of subtags.

Category is currently loosely defined, and I haven't yet found a nice way to
present it in a generic way, but it is meant for automatic classifying of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).
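
To make the byte offsets of the ID header concrete, here is a Python sketch that pulls the byte-aligned fields out of a 64 byte ID header. Two things are assumptions of this sketch and not stated on this page: the 32 bit granule rate fields are read as little-endian, and the canvas size bytes are returned raw rather than decoded. The function name and the dictionary keys are made up for the example; the libkate format documentation remains the reference.

```python
import struct

def parse_id_header(packet: bytes) -> dict:
    """Extract byte-aligned fields from a 64 byte Kate ID header."""
    if len(packet) < 64:
        raise ValueError("ID header is 64 bytes long")
    if packet[:8] != b"\x80kate\x00\x00\x00":
        raise ValueError("not a Kate ID header")

    def cstring(raw: bytes) -> str:
        # NUL terminated ASCII string in a fixed 16 byte field
        return raw.split(b"\x00", 1)[0].decode("ascii")

    # Assumption: multi-byte integers are stored little-endian.
    rate_num, rate_den = struct.unpack_from("<II", packet, 24)
    return {
        "version": (packet[9], packet[10]),
        "num_headers": packet[11],
        "text_encoding": packet[12],
        "directionality": packet[13],
        "granule_shift": packet[15],
        "canvas_raw": packet[16:20],   # left undecoded in this sketch
        "granule_rate": (rate_num, rate_den),
        "language": cstring(packet[32:48]),
        "category": cstring(packet[48:64]),
    }
```

Note that user code normally has no need for this, since the library parses the headers itself; the sketch is only meant to show how the table maps to bytes.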

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay
*Cortado (wikimedia version)

I have patches for the following with Kate support:
*MPlayer
*xine
*GStreamer
*Thoggen
*Audacious
*and more...

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo.
Its API is not quite stable yet, though no major changes are expected.

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
- a base granule, in the high bits
- a granule offset, in the low bits

+----------------+----------------+
| base | offset |
+----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The timestamp T of a given Kate packet is split into a base B and
offset O, and these are stored in the granulepos of that packet.
The split is done such that the B is the time of the earliest event
still active at the time, and the O is the time elapsed between B
and T. Thus, T = B + O. This mimics the way Theora stores its own
timestamps in granulepos, where the base acts as a keyframe, and
the offset acts as the position of an inter frame from the previous
keyframe. Since Kate allows time overlapping events, however, the
choice of the base to use is slightly more complex, as it may not
be the starting time of the previous event, if the stream contains
time overlapping events.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
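
As a rough sketch of the split described above (this is not code from libkate, and the function names are made up for illustration), a granulepos can be taken apart and turned into a time as follows. It assumes the granule rate is gps_numerator/gps_denominator granules per second, and that the offset occupies the low granule_shift bits:

```python
from fractions import Fraction


def split_granulepos(granulepos, granule_shift):
    """Return (base, offset): base in the high bits, offset in the low granule_shift bits."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset


def make_granulepos(base, offset, granule_shift):
    """Pack a base and an offset into a single granulepos."""
    if offset >= (1 << granule_shift):
        raise ValueError("offset does not fit in granule_shift bits")
    return (base << granule_shift) | offset


def granulepos_to_seconds(granulepos, granule_shift, gps_numerator, gps_denominator):
    """T = B + O in granule units, converted to seconds.

    Assumption: the granule rate is gps_numerator / gps_denominator granules per second.
    """
    base, offset = split_granulepos(granulepos, granule_shift)
    return Fraction((base + offset) * gps_denominator, gps_numerator)
```

With a granule rate of 1000/1 and a granule_shift of 32, an event whose earliest still active event started at 5 seconds and which itself starts 250 ms later would get base 5000 and offset 250.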

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit values at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definitions, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by the means of a series of curves (static points, segments, splines (catmull-rom, bezier, and b-splines)).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have attached semantics so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting
at the right time and for the right duration a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.
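
To illustrate the idea (this is only a sketch, not the libkate implementation; the function names are hypothetical, and adding the region origin is my assumption of what "scale to the current region" means for a client):

```python
def is_application_mapping(mapping):
    """Mappings are coded on 8 bits: 0-127 are reserved for Kate, 128-255 are application specific."""
    if not 0 <= mapping <= 255:
        raise ValueError("a mapping is coded on 8 bits")
    return mapping >= 128


def map_to_frame(point, frame_size):
    """Scale a 0-1 point to the size of the current video frame."""
    x, y = point
    width, height = frame_size
    return x * width, y * height


def map_to_region(point, region):
    """Scale a 0-1 point to a region given as (left, top, width, height).

    Assumption: the result is expressed in frame pixels, so the region origin is added.
    """
    x, y = point
    left, top, width, height = region
    return left + x * width, top + y * height
```

The same curve can then be reused unchanged on a 640x480 or a 1920x1080 frame, since only the mapping step knows about pixel sizes.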

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary, if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially when considering that the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.
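
For what it's worth, such minimal input is easy to generate from a list of timed strings. The following is a sketch only: the helper names are hypothetical, it handles whole seconds only, and since the escaping rules for quotes are not described here it simply refuses text containing a double quote. The leading colons shown above are wiki indentation and are not emitted.

```python
def format_time(seconds):
    """Format a whole number of seconds as hh:mm:ss."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def kate_script(events):
    """Build a minimal Kate file from (start_seconds, end_seconds, text) tuples."""
    lines = ["kate {"]
    for start, end, text in events:
        if '"' in text:
            # Escaping is not covered by this sketch, so be conservative.
            raise ValueError("double quotes in text are not handled")
        lines.append(f'  event {{ {format_time(start)} --> {format_time(end)} "{text}" }}')
    lines.append("}")
    return "\n".join(lines)


print(kate_script([(5, 10, "This is a text")]))
```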

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I tend to dislike very
much editing XML by hand, as it's really hard to read. XML is really meant for machines to parse
generically text data in a shared syntax but with possibly unknown semantics, and I need those
text representations to be editable easily.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from a style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place where a subtitle has been started, but is not removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the granule originally sought. This implies support code on players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, of which I am not certain.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
language encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.

=== Bitstream format for floating point values ===

Floating point values are turned to a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the floating point values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over 32 bits raw writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
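
The idea can be sketched as follows. This is an illustration of the principle only, not the actual libkate bit packing: the function names are hypothetical, and the handling of negative values (32 bit two's complement) is an assumption.

```python
def to_fixed_16_16(value):
    """Quantize a float to 16.16 fixed point, as a 32 bit two's complement pattern (assumed)."""
    return int(round(value * 65536)) & 0xFFFFFFFF


def from_fixed_16_16(fixed):
    """Convert a 32 bit 16.16 pattern back to a float."""
    if fixed & 0x80000000:
        fixed -= 1 << 32
    return fixed / 65536


def common_zero_bits(values):
    """Return (head, tail): the leading and trailing zero bits shared by all 32 bit values."""
    merged = 0
    for v in values:
        merged |= v
    if merged == 0:
        return 32, 0
    head = 32 - merged.bit_length()
    tail = (merged & -merged).bit_length() - 1
    return head, tail


values = [to_fixed_16_16(v) for v in (1.5, 2.0)]
head, tail = common_zero_bits(values)
print(head, tail, 32 - head - tail)  # bits that still need storing per value
```

Here head and tail would be written once, and only the remaining middle bits of each value would be written after that, which is where the space saving comes from.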

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly to allow to choose to turn some streams on and off.

Since this category is meant primarily for a machine to parse, they will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics

Please remember the 15 character limit if proposing other categories.

Note that the list of categories is subject to change, and will likely
be replaced by new, more "identifier like" ones. The three ones above,
however, would be kept for backward compatibility as they're already used.

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data, and could also allow things like e-books which can be
either read or listened to from the same bitstream (I have seen no reference to this
being used anywhere, but I see no reason why the granule progression should be temporal,
and not user controlled, such as by using a "next" button which would bump a granule
position by a preset amount, simulating turning a page (this would be close to necessary
for text-to-speech, as the wall time duration of the spoken speech is not known in
advance to the Kate encoder, and can't be mapped to a time based granule progression)).
All text strings triggered consecutively between the two granule positions would then
be read in order.

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with the ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts, (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.
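
As an illustration (a sketch only, with hypothetical names, and assuming palettes of 2 to 256 entries), the number of bits per pixel follows from the palette size, and since palettes and images are separate, the same pixel data can be resolved against any palette of the right size:

```python
def bits_per_pixel(palette_size):
    """Bits needed to address a palette whose size is a power of 2 (2 to 256 assumed here)."""
    if palette_size < 2 or palette_size > 256 or palette_size & (palette_size - 1):
        raise ValueError("palette size must be a power of 2 between 2 and 256")
    return palette_size.bit_length() - 1


def resolve_pixels(pixels, palette):
    """Turn palette indices into colors, using whichever palette is chosen at display time."""
    return [palette[index] for index in pixels]


# The same 1 bit per pixel image shown with two different palettes (RGBA tuples).
image = [0, 1, 1, 0]
black_on_white = [(255, 255, 255, 255), (0, 0, 0, 255)]
red_on_clear = [(0, 0, 0, 0), (255, 0, 0, 255)]
print(bits_per_pixel(len(black_on_white)))
print(resolve_pixels(image, black_on_white))
print(resolve_pixels(image, red_on_clear))
```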

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

I am also investigating SVG images. These allow for very small footprint images for simple
vector drawings, and could be very useful for things like background gradients below text.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder (kateenc) and a decoder (katedec) are included in the tools directory.
The encoder supports input from several different formats:
* a custom text based file format (see [[#The Kate file format|The Kate file format]]), which is by no means meant to be part of the Kate bitstream specification itself
* SubRip (.srt), the most common subtitle format I found
* LRC lyrics format.

As an example for the widely used SRT subtitles format, the following command line
creates a Kate subtitles stream from an SRT file:

kateenc -l en -c subtitles -t srt -o subtitles.ogg subtitles.srt

The reverse is possible, to recover an SRT file from a Kate stream, with katedec.

Note that the subtitles.ogg file should then be multiplexed into the A/V stream,
using either ogg-tools or oggz-tools.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, what motions may be applied to events may not be known in advance (for instance, for a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I was going that way. Still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

=== Packaging ===

It would be really nice to have packages for libkate/libtiger for many distros.

If you're a packager for a distro which doesn't yet have packages for libkate
or libtiger, please consider helping :)

In particular, packages for Debian would be grand.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the packets themselves concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.
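
A reader for this layout could look like the following sketch (not taken from any Matroska demuxer, and the function name is hypothetical). It assumes the usual Xiph lacing rule, where each length is coded as a run of 255 bytes ended by a byte below 255, the length being the sum of those bytes:

```python
def split_xiph_laced(private_data):
    """Split Matroska private data into the Kate header packets it contains."""
    if not private_data:
        raise ValueError("empty private data")
    count = private_data[0] + 1  # byte 0 holds the number of packets minus 1
    pos = 1
    sizes = []
    for _ in range(count - 1):  # the last packet's length is not stored
        size = 0
        while True:
            if pos >= len(private_data):
                raise ValueError("truncated lacing")
            byte = private_data[pos]
            pos += 1
            size += byte
            if byte != 255:
                break
        sizes.append(size)
    last = len(private_data) - pos - sum(sizes)
    if last < 0:
        raise ValueError("packet lengths exceed private data size")
    sizes.append(last)
    packets = []
    for size in sizes:
        packets.append(private_data[pos:pos + size])
        pos += size
    return packets
```

Since the number of packets is read from the first byte, this does not need to know in advance how many Kate headers there are.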

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== Things I need to get feedback on ==

* is it a good idea to avoid floating point usage altogether ?

[[Category:Drafts]]
[[Category:Ogg Mappings]]

OggKate

2009-01-01T13:25:29Z

Ogg.k.ogg.k: /* Support */

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is an overlay codec, originally designed for karaoke and text, that can be
multiplexed in Ogg. Text and images can be carried by a Kate stream, and animated.
Most of the time, this would be multiplexed with audio/video to carry subtitles,
song lyrics (with or without karaoke data), etc, but doesn't have to be.

Series of curves (splines, segments, etc) may be attached to various properties
(text position, font size, etc) to create animated overlays. This allows scrolling
or fading text to be defined. This can even be used to draw arbitrary shapes, so
hand drawing can also be represented by a Kate stream.

Example uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed), and lyrics,
as created by oggenc, from vorbis-tools.

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on - it was quicker to write Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this - besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex - though it is being "repurposed" as timed text now, bringing it closer to what I'm doing
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Style definitions header: a list of predefined styles to be referred to by data packets
*Region definitions header: a list of predefined regions to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This would be to help seeking code
find a kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it's timed by start granule, not end granule (as Theora and Vorbis).
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see the granule encoding section.
*The EOS packet should have a granule pos greater than or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (though old decoders will correctly parse, but ignore the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(ie, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
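
As an illustration, the two checks described above can be sketched in Python. This is not libkate code: the function names are made up for this example, and only the byte values come from the text above.

```python
KATE_MAGIC = b"kate\0\0\0"                # bytes 1..7 of every header packet
KATE_ID_SIGNATURE = b"\x80" + KATE_MAGIC  # same bytes as "\200kate\0\0\0"


def classify_packet(packet: bytes) -> str:
    """Return 'header' or 'data' depending on the MSB of the type byte."""
    if not packet:
        raise ValueError("empty packet")
    if packet[0] & 0x80:
        # Header packets carry the Kate magic; data packets do not.
        if packet[1:8] != KATE_MAGIC:
            raise ValueError("header packet without the Kate magic")
        return "header"
    return "data"


def is_kate_stream(first_packet: bytes) -> bool:
    """True if the first packet of a stream starts with the ID header signature."""
    return first_packet[:8] == KATE_ID_SIGNATURE
```

A demuxer would typically call is_kate_stream on the first packet of each logical stream, then classify_packet on every following packet.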

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the ID header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x7f end packet (EOS)

The format described here is for bitstream version 0.x.
As of 19 December 2008, the latest bitstream version is 0.4.

For more detailed information, refer to the format documentation
in libkate (see URL below in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

0 1 2 3 |
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| packtype | Identifier char[7]: 'kate\0\0\0' | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| kate magic continued | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | version major | version minor | num headers | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| text encoding | directionality| reserved - 0 | granule shift | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| cw sh | canvas width | ch sh | canvas height | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate numerator | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate denominator | 28-31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (NUL terminated) | 32-35
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 36-39
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 40-43
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 44-47
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (NUL terminated) | 48-51
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 52-55
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 56-59
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 60-63
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

Language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously the 16 byte field will not accommodate
language tags with lots of subtags.

Category is currently loosely defined, and I have not yet found a nice way to
present it in a generic way, but it is meant for automatic classification of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).
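
As a rough sketch, the single byte and string fields of the ID header laid out above could be read like this in Python. The class and function names are invented for this example and are not part of libkate. The byte order of the two granule rate fields is not stated on this page, so little-endian is an assumption here; the canvas fields are left out because their bit layout is not detailed above.

```python
import struct
from typing import NamedTuple


class KateIdHeader(NamedTuple):
    version_major: int
    version_minor: int
    num_headers: int
    text_encoding: int
    directionality: int
    granule_shift: int
    gps_numerator: int
    gps_denominator: int
    language: str
    category: str


def parse_id_header(packet: bytes) -> KateIdHeader:
    """Parse the 64 byte ID header (packet type 0x80)."""
    if len(packet) < 64:
        raise ValueError("ID header is 64 bytes long")
    if packet[:8] != b"\x80kate\0\0\0":
        raise ValueError("not a Kate ID header")
    # Bytes 8-15 are single byte fields; offsets 8 and 14 are reserved.
    version_major, version_minor, num_headers = packet[9], packet[10], packet[11]
    text_encoding, directionality, granule_shift = packet[12], packet[13], packet[15]
    # Bytes 24-31: granule rate. Little-endian is an ASSUMPTION of this sketch.
    gps_numerator, gps_denominator = struct.unpack_from("<II", packet, 24)
    # Bytes 32-47 and 48-63: NUL terminated ASCII strings.
    language = packet[32:48].split(b"\0", 1)[0].decode("ascii")
    category = packet[48:64].split(b"\0", 1)[0].decode("ascii")
    return KateIdHeader(version_major, version_minor, num_headers,
                        text_encoding, directionality, granule_shift,
                        gps_numerator, gps_denominator, language, category)
```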

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay
*Cortado (wikimedia version)

I have patches for the following with Kate support:
*MPlayer
*xine
*GStreamer
*Thoggen
*Audacious
*and more...

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo.
Its API is not quite stable yet, though no major changes are expected.

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
* a base granule, in the high bits
* a granule offset, in the low bits

+----------------+----------------+
| base | offset |
+----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The timestamp T of a given Kate packet is split into a base B and
offset O, and these are stored in the granulepos of that packet.
The split is done such that B is the time of the earliest event
still active at the time, and O is the time elapsed between B
and T. Thus, T = B + O. This mimics the way Theora stores its own
timestamps in granulepos, where the base identifies a keyframe, and
the offset is the position of an inter frame relative to the previous
keyframe. Since Kate allows time overlapping events, however, the
choice of the base is slightly more complex: when events overlap, it
may not be the starting time of the previous event.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
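
As an illustration of the split described above, here is a small Python sketch. The function names are made up for this example (this is not the libkate API), and the arguments stand for the kate_info fields of the same names.

```python
from fractions import Fraction
from typing import Tuple


def split_granulepos(granulepos: int, granule_shift: int) -> Tuple[int, int]:
    """Return (base, offset): base in the high bits, offset in the low bits."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset


def make_granulepos(base: int, offset: int, granule_shift: int) -> int:
    """Pack a base and an offset back into a single granulepos."""
    if offset >= (1 << granule_shift):
        raise ValueError("offset does not fit in granule_shift bits")
    return (base << granule_shift) | offset


def granule_time(granulepos: int, granule_shift: int,
                 gps_numerator: int, gps_denominator: int) -> Fraction:
    """Timestamp in seconds: T = B + O, converted using the granule rate."""
    base, offset = split_granulepos(granulepos, granule_shift)
    return Fraction((base + offset) * gps_denominator, gps_numerator)
```

For instance, with a granule rate of 1000/1 and a shift of 32, a packet at 5.25 seconds whose earliest still active event started at 5 seconds would have base 5000 and offset 250.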

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit values at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definitions, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by means of a series of curves: static points, segments, and splines (Catmull-Rom, Bezier, and B-splines).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have attached semantics so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting,
at the right time and for the right duration, a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.
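
To illustrate the idea of a mapping, here is a minimal Python sketch that scales a normalized 0-1 point to pixels. The names are invented for this example, and treating a region as an (x, y, width, height) rectangle whose origin is added to the result is an assumption of the sketch, not something taken from libkate.

```python
from typing import Tuple

# x, y, width, height in pixels (hypothetical representation)
Rect = Tuple[float, float, float, float]


def map_point(point: Tuple[float, float], target: Rect) -> Tuple[float, float]:
    """Scale a normalized (0-1, 0-1) point to the given frame or region."""
    u, v = point
    x, y, width, height = target
    return x + u * width, y + v * height
```

The same curve can then be reused for a 640x480 frame or for a small region, since only the target rectangle changes.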

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially considering that the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.
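
As an illustration, a script could generate such a minimal description from a list of timed strings. The Python sketch below only covers the simple event form shown above; the function names are made up, whole second timestamps are assumed, and since escaping rules are not described here, texts containing double quotes or newlines are rejected rather than escaped.

```python
from typing import Iterable, Tuple


def format_timestamp(seconds: int) -> str:
    """Format whole seconds as HH:MM:SS, as in the example above."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def kate_description(events: Iterable[Tuple[int, int, str]]) -> str:
    """Build a minimal Kate description from (start, end, text) tuples."""
    lines = ["kate {"]
    for start, end, text in events:
        if '"' in text or "\n" in text:
            # Escaping is not covered by this sketch.
            raise ValueError("unsupported character in text")
        lines.append(
            f'  event {{ {format_timestamp(start)} --> {format_timestamp(end)} "{text}" }}'
        )
    lines.append("}")
    return "\n".join(lines) + "\n"
```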

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I very much dislike
editing XML by hand, as it's really hard to read. XML is really meant for machines to
generically parse text data in a shared syntax but with possibly unknown semantics, and I need those
text representations to be easily editable.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from a style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place where a subtitle has been started, but is not removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the granule originally sought. This implies support code on players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, of which I am not certain.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
text encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.

=== Bitstream format for floating point values ===

Floating point values are turned to a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the floating point values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over 32 bits raw writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
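
The following Python sketch illustrates the idea, not the actual libkate bit layout: values are quantized to 16.16, the zero bits shared by all values at the head and tail are counted once, and only the remaining bits need to be stored per value. The function names are invented, the storage of the two counts themselves is ignored, and negative values are simply kept as 32 bit two's complement patterns, which is an assumption of this sketch.

```python
from typing import List, Tuple


def to_fixed_16_16(value: float) -> int:
    """Quantize to 16.16 fixed point, kept as a 32 bit pattern."""
    return int(round(value * 65536)) & 0xFFFFFFFF


def common_zero_bits(words: List[int]) -> Tuple[int, int]:
    """Zero bits shared by all 32 bit words at the head (MSB) and tail (LSB)."""
    merged = 0
    for word in words:
        merged |= word
    if merged == 0:
        return 32, 0
    head = 32 - merged.bit_length()
    tail = (merged & -merged).bit_length() - 1
    return head, tail


def packed_bits(values: List[float]) -> int:
    """Bits needed for the remainder of all values (counts excluded)."""
    words = [to_fixed_16_16(v) for v in values]
    head, tail = common_zero_bits(words)
    return (32 - head - tail) * len(words)
```

For the three values 0.5, 0.25 and 0.75, only two bits per value remain once the shared zero bits are factored out.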

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly to allow to choose to turn some streams on and off.

Since this category is meant primarily for a machine to parse, it will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics

Please remember the 15 character limit if proposing other categories.

Note that the list of categories is subject to change, and will likely
be replaced by new, more "identifier like" ones. The three ones above,
however, would be kept for backward compatibility as they're already used.

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data, and could also allow things like e-books which can be
either read or listened to from the same bitstream.

I have seen no reference to this being used anywhere, but I see no reason why the granule
progression should be temporal, and not user controlled, such as by using a "next" button
which would bump a granule position by a preset amount, simulating turning a page. This would
be close to necessary for text-to-speech, as the wall time duration of the spoken speech is
not known in advance to the Kate encoder, and can't be mapped to a time based granule progression.
All text strings triggered consecutively between the two granule positions would then
be read in order.

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with the ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts, (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.
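
As a small illustration of "as many bits per pixel as can address the palette", here is a Python sketch. The function name is made up, and taking 2 colors as the smallest palette is an assumption of this example.

```python
def bits_per_pixel(palette_size: int) -> int:
    """Bits needed to index a palette sized as a power of 2 (2 to 256)."""
    is_power_of_two = palette_size & (palette_size - 1) == 0
    if not 2 <= palette_size <= 256 or not is_power_of_two:
        raise ValueError("palette size must be a power of 2 between 2 and 256")
    return palette_size.bit_length() - 1
```

A 16 color palette thus needs 4 bits per pixel, and a full 256 color palette needs 8.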

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

I am also investigating SVG images. These allow for very small footprint images for simple
vector drawings, and could be very useful for things like background gradients below text.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder (kateenc) and a decoder (katedec) are included in the tools directory.
The encoder supports input from several different formats:
* a custom text based file format (see [[#The Kate file format|The Kate file format]]), which is by no means meant to be part of the Kate bitstream specification itself
* SubRip (.srt), the most common subtitle format I found
* LRC lyrics format.

As an example for the widely used SRT subtitles format, the following command line
creates a Kate subtitles stream from an SRT file:

kateenc -l en -c subtitles -t srt -o subtitles.ogg subtitles.srt

The reverse is possible, to recover an SRT file from a Kate stream, with katedec.

Note that the subtitles.ogg file should then be multiplexed into the A/V stream,
using either ogg-tools or oggz-tools.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, what motions may be applied to events may not be known in advance (for instance, for a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I was going that way. Still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

=== Packaging ===

It would be really nice to have packages for libkate/libtiger for many distros.

If you're a packager for a distro which doesn't yet have packages for libkate
or libtiger, please consider helping :)

In particular, packages for Debian would be grand.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the packets themselves, concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.
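
As an illustration, the private data layout described above could be split back into header packets as follows. This is a Python sketch with a made up function name; it relies on the usual Xiph style lacing, where each length is the sum of a run of bytes that ends with the first byte below 255.

```python
from typing import List


def split_xiph_laced(private: bytes) -> List[bytes]:
    """Split xiph-laced private data into its packets."""
    if not private:
        raise ValueError("empty private data")
    num_laced = private[0]  # NP: number of packets, minus 1
    pos = 1
    sizes = []
    for _ in range(num_laced):
        size = 0
        while True:
            if pos >= len(private):
                raise ValueError("truncated lacing values")
            byte = private[pos]
            pos += 1
            size += byte
            if byte != 255:
                break
        sizes.append(size)
    # The last packet's length is deduced from what is left.
    remaining = len(private) - pos - sum(sizes)
    if remaining < 0:
        raise ValueError("packet sizes exceed private data size")
    sizes.append(remaining)
    packets = []
    for size in sizes:
        packets.append(private[pos:pos + size])
        pos += size
    return packets
```

Since the number of Kate headers is not fixed, a reader should take the packet count from byte 0 rather than assume a set number, as is commonly done for Vorbis and Theora.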

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== Things I need to get feedback on ==

* is it a good idea to avoid floating point usage altogether ?

[[Category:Drafts]]
[[Category:Ogg Mappings]]

OggKate

2008-12-30T13:38:15Z

Ogg.k.ogg.k: add more...

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is an overlay codec, originally designed for karaoke and text, that can be
multiplexed in Ogg. Text and images can be carried by a Kate stream, and animated.
Most of the time, this would be multiplexed with audio/video to carry subtitles,
song lyrics (with or without karaoke data), etc, but doesn't have to be.

Series of curves (splines, segments, etc) may be attached to various properties
(text position, font size, etc) to create animated overlays. This allows scrolling
or fading text to be defined. This can even be used to draw arbitrary shapes, so
hand drawing can also be represented by a Kate stream.

Example uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed), and lyrics,
as created by oggenc, from vorbis-tools.

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on - I'd been quicker to write Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this. Besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex. It is also being "repurposed" as timed text now, bringing it closer to what I'm doing.
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Region definitions header: a list of predefined regions to be referred to by data packets
*Style definitions header: a list of predefined styles to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This would be to help seeking code
find a kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it's timed by start granule, not by end granule as Theora and Vorbis are.
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see the granule encoding section.
*The EOS packet should have a granule pos higher or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (though old decoders will correctly parse, but ignore the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(ie, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
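As a rough illustration of the detection rule above, here is a minimal Python sketch. The function names are made up for this example and are not part of libkate; the sketch only compares raw packet bytes against the signature described in this section.

```python
# Signature of the ID header: type byte 0x80 followed by the 7 byte magic.
KATE_MAGIC = b"kate\x00\x00\x00"
KATE_ID_SIGNATURE = b"\x80" + KATE_MAGIC  # same bytes as "\200kate\0\0\0"


def is_kate_stream(first_packet: bytes) -> bool:
    """True if the first packet of a logical stream is a Kate ID header."""
    return first_packet[:8] == KATE_ID_SIGNATURE


def has_kate_magic(packet: bytes) -> bool:
    """True for any header packet: MSB of the type set, then the magic.

    Data packets never carry the magic, so this returns False for them.
    """
    return len(packet) >= 8 and bool(packet[0] & 0x80) and packet[1:8] == KATE_MAGIC
```

Only the first packet of a stream needs to go through is_kate_stream; has_kate_magic is a sanity check that can be applied to each later header packet.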

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the ID header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x7f end packet (EOS)
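The table above can be turned into a small lookup, sketched here in Python. The dictionary and function names are illustrative only; unknown types are reported as ignored, following the rule that other packets are left for future expansion.

```python
# Packet types from the list above (bitstream 0.x).
HEADER_TYPES = {
    0x80: "ID header",
    0x81: "Vorbis comment header",
    0x82: "regions list header",
    0x83: "styles list header",
    0x84: "curves list header",
    0x85: "motions list header",
    0x86: "palettes list header",
    0x87: "bitmaps list header",
    0x88: "font ranges and mappings header",
}
DATA_TYPES = {
    0x00: "text data",
    0x01: "keepalive",
    0x7F: "end packet",
}


def describe_packet_type(type_byte: int) -> str:
    """Classify a packet from its first byte: MSB set means header."""
    if not 0 <= type_byte <= 0xFF:
        raise ValueError("packet type must fit in one byte")
    if type_byte & 0x80:
        return "header: " + HEADER_TYPES.get(type_byte, "unknown (ignored)")
    return "data: " + DATA_TYPES.get(type_byte, "unknown (ignored)")
```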

The format described here is for bitstream version 0.x.
As of 19 December 2008, the latest bitstream version is 0.4.

For more detailed information, refer to the format documentation
in libkate (see URL below in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

  0                   1                   2                   3  |
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |   packtype    | Identifier char[7]: 'kate\0\0\0'              | 0-3
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | kate magic continued                                          | 4-7
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | reserved - 0  | version major | version minor |  num headers  | 8-11
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | text encoding | directionality| reserved - 0  | granule shift | 12-15
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | cw sh |     canvas width      | ch sh |     canvas height     | 16-19
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | reserved - 0                                                  | 20-23
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | granule rate numerator                                        | 24-27
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | granule rate denominator                                      | 28-31
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (NUL terminated)                                     | 32-35
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (continued)                                          | 36-39
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (continued)                                          | 40-43
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (continued)                                          | 44-47
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (NUL terminated)                                     | 48-51
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (continued)                                          | 52-55
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (continued)                                          | 56-59
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (continued)                                          | 60-63
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously will not accommodate language tags
with lots of subtags.

Category is currently loosely defined, and I haven't found yet a nice way to
present it in a generic way, but is meant for automatic classifying of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).
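To make the layout above concrete, here is a minimal Python sketch that decodes the 64 byte ID header with the standard struct module. It is not taken from libkate. Two points are assumptions rather than facts stated on this page: that multi-byte fields are little-endian, and that each 16 bit canvas field holds the shift in its low 4 bits and the base in its high 12 bits, with size = base << shift. The libkate format documentation is the authority on both.

```python
import struct

KATE_ID_SIGNATURE = b"\x80kate\x00\x00\x00"


def parse_kate_id_header(packet: bytes) -> dict:
    """Decode the fixed 64 byte Kate ID header into a dictionary."""
    if len(packet) < 64 or packet[:8] != KATE_ID_SIGNATURE:
        raise ValueError("not a Kate ID header")

    # Bytes 8-31: eight single bytes, two 16 bit canvas fields, one reserved
    # 32 bit word, then the granule rate. Little-endian is an assumption.
    (_res0, major, minor, num_headers,
     text_encoding, directionality, _res1, granule_shift,
     canvas_w, canvas_h, _res2,
     gps_numerator, gps_denominator) = struct.unpack_from("<8B2HI2I", packet, 8)

    def canvas_size(field: int) -> int:
        # Assumed packing: low 4 bits are the shift, high 12 bits the base.
        return (field >> 4) << (field & 0x0F)

    def c_string(raw: bytes) -> str:
        # NUL terminated ASCII string in a fixed 16 byte field.
        return raw.split(b"\x00", 1)[0].decode("ascii")

    return {
        "version": (major, minor),
        "num_headers": num_headers,
        "text_encoding": text_encoding,
        "directionality": directionality,
        "granule_shift": granule_shift,
        "canvas_width": canvas_size(canvas_w),
        "canvas_height": canvas_size(canvas_h),
        "granule_rate": (gps_numerator, gps_denominator),
        "language": c_string(packet[32:48]),
        "category": c_string(packet[48:64]),
    }
```

For streams older than bitstream 0.3 the canvas fields are 0, so both canvas sizes come out as 0 with this sketch.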

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay
*Cortado

I have patches for the following with Kate support:
*MPlayer
*xine
*GStreamer
*Thoggen
*Audacious
*and more...

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo,
which is not quite API stable yet, though no major changes are expected.

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
- a base granule, in the high bits
- a granule offset, in the low bits

+----------------+----------------+
| base | offset |
+----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The timestamp T of a given Kate packet is split into a base B and
offset O, and these are stored in the granulepos of that packet.
The split is done such that the B is the time of the earliest event
still active at the time, and the O is the time elapsed between B
and T. Thus, T = B + O. This mimics the way Theora stores its own
timestamps in granulepos, where the base acts as a keyframe, and
an offset acts as the position of an inter frame from the previous
keyframe. Since Kate allows time overlapping events, however, the
choice of the base to use is slightly more complex, as it may not
be the starting time of the previous event, if the stream contains
time overlapping events.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
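The split described above can be sketched in a few lines of Python. The function names are invented for this example; the parameters mirror the kate_info fields named above. Base and offset are both counted in granule units, so the time in seconds is (base + offset) * gps_denominator / gps_numerator.

```python
from fractions import Fraction


def make_granulepos(base: int, offset: int, granule_shift: int) -> int:
    """Pack a base (high bits) and an offset (low bits) into a granulepos."""
    if base < 0 or offset < 0 or offset >> granule_shift:
        raise ValueError("offset does not fit in granule_shift bits")
    return (base << granule_shift) | offset


def split_granulepos(granulepos: int, granule_shift: int):
    """Recover (base, offset) from a granulepos."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset


def granulepos_to_seconds(granulepos: int, granule_shift: int,
                          gps_numerator: int, gps_denominator: int) -> Fraction:
    """Timestamp T = B + O, converted from granule units to seconds."""
    base, offset = split_granulepos(granulepos, granule_shift)
    return Fraction((base + offset) * gps_denominator, gps_numerator)
```

For instance, with a granule rate of 1000/1 and a granule_shift of 32, an event starting at 7.5 seconds while the earliest still active event started at 5 seconds would use base 5000 and offset 2500.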

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definition, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by the means of a series of curves (static points, segments, splines (catmull-rom, bezier, and b-splines)).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have attached semantics so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting
at the right time and for the right duration a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.
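The following Python sketch illustrates how a motion made of consecutive curves might be sampled, including a pause and a discontinuity. The data model (a list of duration, kind, points tuples) is hypothetical and much simpler than the bitstream: only 'none', 'static' and two-point 'linear' curves are handled, and splines are left out.

```python
def sample_motion(curves, t):
    """Return the 2D point of a motion at time t, or None if there is none.

    curves is a list of (duration, kind, points) tuples played back to back,
    with duration > 0. This layout is an illustration, not libkate's.
    """
    start = 0.0
    for duration, kind, points in curves:
        if start <= t < start + duration:
            if kind == "none":
                return None              # discontinuity: no point generated
            if kind == "static":
                return points[0]
            # linear: interpolate between the two end points
            u = (t - start) / duration
            (x0, y0), (x1, y1) = points[0], points[-1]
            return (x0 + (x1 - x0) * u, y0 + (y1 - y0) * u)
        start += duration
    return None                          # outside the motion's lifetime


# Move right for 1s, pause for 0.5s, hide for 0.5s, then move down for 1s.
example_motion = [
    (1.0, "linear", [(0.0, 0.0), (1.0, 0.0)]),
    (0.5, "linear", [(1.0, 0.0), (1.0, 0.0)]),   # pause: two equal points
    (0.5, "none", []),                           # marker temporarily hidden
    (1.0, "linear", [(1.0, 0.0), (1.0, 1.0)]),
]
```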

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.
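As an illustration of the frame and region mappings, a Python sketch follows. The mapping names as strings, the region tuple layout (x, y, width, height in pixels), and the choice to offset region-mapped points by the region origin are all assumptions made for this example.

```python
def map_point(point, mapping, frame_size, region=None):
    """Scale a 0-1 motion point to pixels according to a mapping.

    frame_size is (width, height); region is (x, y, width, height).
    Any other mapping returns the point unchanged.
    """
    x, y = point
    if mapping == "frame":
        width, height = frame_size
        return (x * width, y * height)
    if mapping == "region":
        if region is None:
            raise ValueError("region mapping needs a region")
        rx, ry, rw, rh = region
        return (rx + x * rw, ry + y * rh)
    return (x, y)
```

The same curve, defined once in 0-1 space, then covers a 640x480 frame or a 320x60 subtitle region without being redefined.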

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary, if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially when considering that the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.
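A description file like the one above is easy to generate from a list of timed strings. The Python sketch below emits only the minimal event syntax shown here, with whole-second timestamps, and assumes the text contains no double quotes since escaping rules are not covered on this page. The function names are made up for the example.

```python
def format_timestamp(seconds: int) -> str:
    """Format whole seconds as HH:MM:SS."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def kate_description(events) -> str:
    """Build a minimal Kate description from (start, end, text) tuples."""
    lines = ["kate {"]
    for start, end, text in events:
        if '"' in text:
            raise ValueError("quotes in text are not handled by this sketch")
        lines.append(
            f'  event {{ {format_timestamp(start)} --> {format_timestamp(end)} "{text}" }}'
        )
    lines.append("}")
    return "\n".join(lines)
```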

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I tend to dislike very
much editing XML by hand, as it's really hard to read. XML is really meant for machines to parse
text data generically, in a shared syntax but with possibly unknown semantics, and I need those
text representations to be editable easily.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from a style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place when a subtitle has been started, but is not removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the granule originally sought. This implies support code on players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, of which I am not certain.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.
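The first option (a back link to the earliest still active packet) can be sketched as follows in Python. The packet model is hypothetical: a list sorted by start time, where each entry records its own start and the start of the earliest event still active when it was emitted. A real player would seek within the Ogg file instead of indexing a list.

```python
from bisect import bisect_left


def two_phase_seek(packets, target):
    """Return the index of the packet decoding should restart from.

    packets is a list of dicts sorted by "start", each with a "backlink"
    holding the start time of the earliest still active event.
    """
    starts = [p["start"] for p in packets]
    # Phase 1: find the next Kate packet at or after the target time.
    i = bisect_left(starts, target)
    if i == len(packets):
        return i                      # nothing left to read a back link from
    backlink = packets[i]["backlink"]
    # Phase 2: seek again only if an earlier event is still active.
    if backlink < target:
        return bisect_left(starts, backlink)
    return i


example_packets = [
    {"start": 0, "backlink": 0},
    {"start": 10, "backlink": 0},     # event from t=0 still active
    {"start": 20, "backlink": 20},
    {"start": 30, "backlink": 20},    # event from t=20 still active
]
```

Seeking to 25 first lands on the packet at 30, whose back link is 20, so decoding restarts from the packet at 20. When no packet follows the target there is no back link to read, which is where periodic keepalive packets would help.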

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
text encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.

=== Bitstream format for floating point values ===

Floating point values are turned to a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the floating point values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over 32 bits raw writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
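The idea can be sketched in Python as below. This is a simplified model and not the libkate bit layout: values are quantized to 16.16, the magnitudes are ORed together to find how many zero bits they share at the head and tail of a 32 bit word, and each value then only needs a sign plus the remaining middle bits. Storing the sign separately is an assumption of this sketch.

```python
def pack_fixed_16_16(values):
    """Quantize floats to 16.16 and strip the zero bits shared by all.

    Returns (head, tail, payload): shared leading and trailing zero bit
    counts, and one (is_negative, remaining_bits) pair per value.
    """
    fixed = [int(round(v * 65536)) for v in values]
    merged = 0
    for f in fixed:
        merged |= abs(f)                       # OR of all magnitudes
    if merged >= 1 << 31:
        raise ValueError("value out of 16.16 range")
    head = 32 - merged.bit_length()            # zero bits shared at the top
    tail = (merged & -merged).bit_length() - 1 if merged else 0
    payload = [(f < 0, abs(f) >> tail) for f in fixed]
    return head, tail, payload


def unpack_fixed_16_16(tail, payload):
    """Rebuild the (quantized) floats from the packed representation."""
    return [(-1 if negative else 1) * (bits << tail) / 65536
            for negative, bits in payload]
```

With the values 0.5, 1.0 and 320.25, the shared head is 7 bits and the shared tail 14 bits, so each value needs 11 bits plus a sign instead of 32.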

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly to allow to choose to turn some streams on and off.

Since this category is meant primarily for a machine to parse, it will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics

Please remember the 15 character limit if proposing other categories.

Note that the list of categories is subject to change, and will likely
be replaced by new, more "identifier like" ones. The three ones above,
however, would be kept for backward compatibility as they're already used.

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data, and could also allow things like e-books which can be
either read or listened to from the same bitstream (I have seen no reference to this
being used anywhere, but I see no reason why the granule progression should be temporal,
and not user controlled, such as by using a "next" button which would bump a granule
position by a preset amount, simulating turning a page (this would be close to necessary
for text-to-speech, as the wall time duration of the spoken speech is not known in
advance to the Kate encoder, and can't be mapped to a time based granule progression)).
All text strings triggered consecutively between the two granule positions would then
be read in order.

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts, (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.
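The relation between palette size and pixel depth can be written down directly. In the Python sketch below, the 2 to 256 range and the tightly packed size computation (no padding between rows) are assumptions made for illustration, and the function names are invented.

```python
def bits_per_pixel(ncolors: int) -> int:
    """Bits needed to address a palette whose size is a power of 2."""
    if ncolors < 2 or ncolors > 256 or ncolors & (ncolors - 1):
        raise ValueError("palette size must be a power of 2 between 2 and 256")
    return ncolors.bit_length() - 1


def packed_bitmap_bytes(width: int, height: int, ncolors: int) -> int:
    """Bytes needed for the pixel data if pixels are packed with no padding."""
    total_bits = width * height * bits_per_pixel(ncolors)
    return (total_bits + 7) // 8
```

A 16 color palette gives 4 bits per pixel, so a 32x32 glyph image takes 512 bytes of pixel data under these assumptions.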

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

I am also investigating SVG images. These allow for very small footprint images for simple
vector drawings, and could be very useful for things like background gradients below text.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.
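A font mapping of this kind amounts to a range lookup. The Python sketch below assumes each range is a (first code point, last code point, index of first bitmap) triple, with bitmaps for a range stored consecutively; this layout is an illustration and should be checked against the libkate documentation.

```python
def glyph_bitmap_index(ranges, code_point):
    """Return the bitmap index for a code point, or None if unmapped.

    ranges is a list of (first_code_point, last_code_point, first_bitmap).
    """
    for first, last, first_bitmap in ranges:
        if first <= code_point <= last:
            return first_bitmap + (code_point - first)
    return None


# Upper case letters use bitmaps 0-25, lower case letters bitmaps 26-51.
example_ranges = [(0x41, 0x5A, 0), (0x61, 0x7A, 26)]
```

A renderer would fall back to a regular font for any code point that returns None.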

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder (kateenc) and a decoder (katedec) are included in the tools directory.
The encoder supports input from several different formats:
* a custom text based file format (see [[#The Kate file format|The Kate file format]]), which is by no means meant to be part of the Kate bitstream specification itself
* SubRip (.srt), the most common subtitle format I found
* LRC lyrics format.

As an example for the widely used SRT subtitles format, the following command line
creates a Kate subtitles stream from an SRT file:

kateenc -l en -c subtitles -t srt -o subtitles.ogg subtitles.srt

The reverse is possible, to recover an SRT file from a Kate stream, with katedec.

Note that the subtitles.ogg file should then be multiplexed into the A/V stream,
using either ogg-tools or oggz-tools.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, the motions to apply to events may not be known in advance (for instance, in a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I go that way. I still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

=== Packaging ===

It would be really nice to have packages for libkate/libtiger for many distros.

If you're a packager for a distro which doesn't yet have packages for libkate
or libtiger, please consider helping :)

In particular, packages for Debian would be grand.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the data packets themselves concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.
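
The private data layout above can be sketched in Python as follows. This is only an illustration, not code from libkate or from any Matroska demuxer: the function name split_xiph_laced is hypothetical, and the sketch assumes the usual Xiph lacing rule, where each length is coded as a run of 255 bytes ended by one byte below 255, the length being the sum of those bytes.

```python
from typing import List


def split_xiph_laced(private: bytes) -> List[bytes]:
    """Split Xiph-laced private data into its individual packets."""
    if not private:
        raise ValueError("empty private data")
    num_sized = private[0]  # NP: number of packets present, minus 1
    pos = 1
    sizes = []
    for _ in range(num_sized):
        size = 0
        while True:
            if pos >= len(private):
                raise ValueError("truncated lacing data")
            byte = private[pos]
            pos += 1
            size += byte
            if byte != 255:  # a byte below 255 ends this length
                break
        sizes.append(size)
    # The last packet's length is not coded: it is whatever remains.
    remaining = len(private) - pos - sum(sizes)
    if remaining < 0:
        raise ValueError("packet sizes exceed the private data size")
    sizes.append(remaining)
    packets = []
    for size in sizes:
        packets.append(private[pos:pos + size])
        pos += size
    return packets
```

Since the number of Kate headers is not fixed, a caller would take the packet count from the returned list rather than expecting, say, exactly three packets as with Vorbis.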

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== Things I need to get feedback on ==

* is it a good idea to avoid floating point usage altogether ?

[[Category:Drafts]]
[[Category:Ogg Mappings]]

OggKate

2008-12-29T15:10:02Z

Ogg.k.ogg.k: /* What is Kate? */

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is an overlay codec, originally designed for karaoke and text, that can be
multiplexed in Ogg. Text and images can be carried by a Kate stream, and animated.
Most of the time, this would be multiplexed with audio/video to carry subtitles,
song lyrics (with or without karaoke data), etc, but doesn't have to be.

Series of curves (splines, segments, etc) may be attached to various properties
(text position, font size, etc) to create animated overlays. This allows scrolling
or fading text to be defined. This can even be used to draw arbitrary shapes, so
hand drawing can also be represented by a Kate stream.

Example uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed), and lyrics,
as created by oggenc, from vorbis-tools.

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on. It would have been quicker to write Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this. Besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex. It is also being "repurposed" as timed text now, bringing it closer to what I'm doing.
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Region definitions header: a list of predefined regions to be referred to by data packets
*Style definitions header: a list of predefined styles to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This would be to help seeking code
find a kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it's timed by start granule, not by end granule as Theora and Vorbis are.
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see the granule encoding section.
*The EOS packet should have a granule pos greater than or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (though old decoders will correctly parse, but ignore the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(ie, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
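
As a rough illustration of the two rules above (type byte with the MSB set for headers, and the eight byte signature on the first packet), here is a minimal Python sketch. The function names are hypothetical and are not part of libkate.

```python
KATE_MAGIC = b"kate\0\0\0"  # bytes 1 to 7 of every header packet


def classify_packet(packet: bytes) -> str:
    """Return "header" or "data" from the first byte of a Kate packet."""
    if not packet:
        raise ValueError("empty packet")
    if packet[0] & 0x80:  # MSB set: header packet
        if packet[1:8] != KATE_MAGIC:
            raise ValueError("header packet without the Kate magic")
        return "header"
    return "data"  # MSB cleared: data packets carry no signature


def is_kate_bos(packet: bytes) -> bool:
    """Check whether a first packet looks like a Kate ID header."""
    return packet[:8] == b"\x80" + KATE_MAGIC
```

A demuxer would typically run a check like is_kate_bos on the first packet of each logical stream to decide which streams to hand over to a Kate decoder.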

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the ID header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x7f end packet (EOS)

The format described here is for bitstream version 0.x.
As of 19 December 2008, the latest bitstream version is 0.4.

For more detailed information, refer to the format documentation
in libkate (see URL below in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | packtype      | Identifier char[7]: 'kate\0\0\0'              | 0-3
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | kate magic continued                                          | 4-7
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | reserved - 0  | version major | version minor | num headers   | 8-11
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | text encoding | directionality| reserved - 0  | granule shift | 12-15
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | cw sh | canvas width          | ch sh | canvas height         | 16-19
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | reserved - 0                                                  | 20-23
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | granule rate numerator                                        | 24-27
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | granule rate denominator                                      | 28-31
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (NUL terminated)                                     | 32-35
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (continued)                                          | 36-39
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (continued)                                          | 40-43
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (continued)                                          | 44-47
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (NUL terminated)                                     | 48-51
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (continued)                                          | 52-55
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (continued)                                          | 56-59
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (continued)                                          | 60-63
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

Language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously will not accommodate language tags
with lots of subtags.

Category is currently loosely defined, and I haven't yet found a nice way to
present it in a generic way, but it is meant for automatic classification of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).
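
To make the byte offsets of the ID header concrete, here is a minimal Python sketch that reads the byte-aligned fields. It is not taken from libkate: the function name parse_id_header and the dictionary keys are hypothetical, and the sketch assumes that the two 32 bit granule rate fields are stored little endian, which should be checked against the libkate format documentation. The canvas size fields are left undecoded, since their exact bit layout is not described here.

```python
import struct


def parse_id_header(packet: bytes) -> dict:
    """Read the byte-aligned fields of a 64 byte Kate ID header."""
    if len(packet) < 64:
        raise ValueError("ID header must be at least 64 bytes")
    if packet[:8] != b"\x80kate\0\0\0":
        raise ValueError("not a Kate ID header")
    # ASSUMPTION: 32 bit integers are little endian.
    gps_numerator, gps_denominator = struct.unpack_from("<II", packet, 24)
    return {
        "version": (packet[9], packet[10]),  # major, minor
        "num_headers": packet[11],
        "text_encoding": packet[12],
        "directionality": packet[13],
        "granule_shift": packet[15],
        "canvas_raw": packet[16:20],  # left undecoded in this sketch
        "gps_numerator": gps_numerator,
        "gps_denominator": gps_denominator,
        # Both strings are NUL terminated ASCII in 16 byte fields.
        "language": packet[32:48].split(b"\0", 1)[0].decode("ascii"),
        "category": packet[48:64].split(b"\0", 1)[0].decode("ascii"),
    }
```

Reading num_headers from this packet is what lets library code, rather than user code, know how many header packets to expect.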

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay

I have patches for the following with Kate support:
*MPlayer (for multiplexed per-language subtitles - all region/style info is ignored)
*xine (everything kate supports, as xine is my testbed)
*GStreamer
*Thoggen

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo,
though it is not quite yet API stable (though no major changes are expected).

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
- a base granule, in the high bits
- a granule offset, in the low bits

+----------------+----------------+
| base | offset |
+----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The timestamp T of a given Kate packet is split into a base B and
offset O, and these are stored in the granulepos of that packet.
The split is done such that the B is the time of the earliest event
still active at the time, and the O is the time elapsed between B
and T. Thus, T = B + O. This mimics the way Theora stores its own
timestamps in granulepos, where the base acts as a keyframe, and
an offset acts as the position of an inter frame from the previous
keyframe. Since Kate allows time overlapping events, however, the
choice of the base to use is slightly more complex, as it may not
be the starting time of the previous event, if the stream contains
time overlapping events.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
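
The base/offset split can be illustrated with a few lines of Python. This is a sketch under stated assumptions, not libkate code: the function names are hypothetical, and it assumes that the granule rate is expressed in granules per second (gps_numerator / gps_denominator), so that a count of granules converts to seconds by multiplying by the denominator and dividing by the numerator.

```python
from fractions import Fraction


def make_granulepos(base: int, offset: int, granule_shift: int) -> int:
    """Pack a base (high bits) and an offset (low bits) into a granulepos."""
    if offset < 0 or offset >= (1 << granule_shift):
        raise ValueError("offset does not fit in granule_shift bits")
    return (base << granule_shift) | offset


def split_granulepos(granulepos: int, granule_shift: int):
    """Recover (base, offset) from a granulepos."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset


def granulepos_to_seconds(granulepos: int, granule_shift: int,
                          gps_numerator: int, gps_denominator: int) -> Fraction:
    """Timestamp T = B + O, converted from granules to seconds."""
    base, offset = split_granulepos(granulepos, granule_shift)
    return Fraction((base + offset) * gps_denominator, gps_numerator)
```

For example, with a granule rate of 1000/1 and a granule_shift of 32, a packet at 5.25 seconds whose earliest still active event started at 5 seconds would get base 5000 and offset 250.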

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit values at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definitions, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by the means of a series of curves (static points, segments, splines (catmull-rom, bezier, and b-splines)).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have attached semantics so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting
at the right time and for the right duration a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.
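
The two tricks above (a 'none' curve to hide the point, and a linear curve with two equal points to pause) can be sketched as follows. The data layout is purely hypothetical and is not how libkate represents motions: each curve here is a (duration, kind, points) tuple, and only linear and 'none' curves are handled.

```python
def motion_point(curves, t):
    """Return the 2D point of a motion at time t, or None if hidden.

    curves is a list of (duration, kind, points) tuples played one
    after the other; kind is "linear" (points is a pair of 2D points)
    or "none" (points is ignored and no point is generated).
    """
    start = 0.0
    for duration, kind, points in curves:
        if start <= t < start + duration:
            if kind == "none":
                return None  # discontinuity: nothing to draw
            (x0, y0), (x1, y1) = points
            u = (t - start) / duration  # 0 at curve start, 1 at its end
            return (x0 + u * (x1 - x0), y0 + u * (y1 - y0))
        start += duration
    return None  # outside the lifetime of the motion


example = [
    (2.0, "linear", ((0.0, 0.0), (1.0, 0.0))),  # move right
    (1.0, "linear", ((1.0, 0.0), (1.0, 0.0))),  # pause: two equal points
    (0.5, "none", None),                        # marker hidden
    (1.0, "linear", ((1.0, 0.0), (1.0, 1.0))),  # move down
]
```

With this example, the point stays at (1.0, 0.0) for the whole second of the pause, and no point is produced during the half second covered by the 'none' curve.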

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary, if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially considering that the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.
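
As an illustration, a description like the one above could be generated from a list of timed texts with a short script. This is only a sketch: the helper names are hypothetical, only whole seconds in the hh:mm:ss form shown above are produced, and texts containing double quotes or newlines are rejected because the escaping rules of the file format are not covered here.

```python
def format_time(seconds: int) -> str:
    """Format a whole number of seconds as hh:mm:ss."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return "%02d:%02d:%02d" % (hours, minutes, secs)


def kate_description(events) -> str:
    """Build a minimal Kate description from (start, end, text) tuples."""
    lines = ["kate {"]
    for start, end, text in events:
        if '"' in text or "\n" in text:
            # Escaping is outside the scope of this sketch.
            raise ValueError("text needs escaping: %r" % (text,))
        lines.append('  event { %s --> %s "%s" }'
                     % (format_time(start), format_time(end), text))
    lines.append("}")
    return "\n".join(lines) + "\n"


print(kate_description([(5, 10, "This is a text")]))
```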

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I tend to dislike very
much editing XML by hand, as it's really hard to read. XML is really meant for machines to parse
generically text data in a shared syntax but with possibly unknown semantics, and I need those
text representations to be editable easily.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from a style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place when a subtitle has been started, but is not removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the originally requested granule. This implies support code on players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, of which I am not certain.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.
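
The two phase seek of the first option can be sketched on an in-memory list standing in for the Kate pages of an Ogg file. This is an illustration only, with hypothetical names: each packet is reduced to a (start granule, granule of the earliest still active packet) pair, and the list is assumed to be sorted by start granule.

```python
import bisect
from typing import List, Optional, Tuple


def two_phase_seek(packets: List[Tuple[int, int]], target: int) -> Optional[int]:
    """Return the index of the packet decoding should restart from.

    packets holds (start_granule, earliest_active_granule) pairs,
    sorted by start_granule. Returns None if no packet starts at or
    after the target.
    """
    starts = [start for start, _ in packets]
    # Phase one: seek, then find the next Kate packet.
    index = bisect.bisect_left(starts, target)
    if index == len(packets):
        return None
    earliest_active = packets[index][1]
    # Phase two: seek again if an earlier event is still active.
    if earliest_active < target:
        index = bisect.bisect_left(starts, earliest_active)
    return index
```

For example, with packets [(0, 0), (10, 0), (20, 20), (30, 20), (40, 40)], seeking to granule 25 first lands on the packet at 30, whose back link of 20 then sends the decoder back to index 2.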

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
text encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.

=== Bitstream format for floating point values ===

Floating point values are turned to a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the floating point values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over 32 bits raw writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
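
The idea can be illustrated with a small Python sketch. It is not the libkate implementation: the function names are hypothetical, negative values are assumed to be stored as 32 bit two's complement (in which case no head bits are saved), and the small per-stream cost of storing the two zero bit counts is ignored.

```python
def to_fixed_16_16(value: float) -> int:
    """Quantize a float to 16.16 fixed point, as an unsigned 32 bit word."""
    fixed = int(round(value * 65536))
    if not -(1 << 31) <= fixed < (1 << 31):
        raise ValueError("value out of 16.16 range")
    return fixed & 0xFFFFFFFF  # ASSUMPTION: two's complement for negatives


def shared_zero_bits(words):
    """Count the zero bits common to all words at the head and at the tail."""
    combined = 0
    for word in words:
        combined |= word
    if combined == 0:
        return 32, 0  # every value is zero: nothing left to store
    head = 32 - combined.bit_length()
    tail = (combined & -combined).bit_length() - 1
    return head, tail


def packed_size_bits(values) -> int:
    """Bits needed for the remainder bits of all values."""
    words = [to_fixed_16_16(v) for v in values]
    head, tail = shared_zero_bits(words)
    return len(words) * (32 - head - tail)
```

For the values 0.5, 1.25 and 100.0, the three words share 9 zero bits at the head and 14 at the tail, so each value needs only 9 bits instead of 32.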

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly to allow them to choose to turn some streams on and off.

Since categories are meant primarily for a machine to parse, they will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics

Please remember the 15 character limit if proposing other categories.
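
A proposed category name can be checked against these constraints with a few lines of Python. This is only a sketch: the helper name is hypothetical and not part of libkate.

```python
MAX_CATEGORY_BYTES = 15  # 16 byte field, minus the terminating NUL


def is_valid_category(name):
    """Check that a category is pure ASCII, has no NUL, and fits in the field."""
    return name.isascii() and "\0" not in name and len(name) <= MAX_CATEGORY_BYTES


proposed = ["subtitles", "spu-subtitles", "lyrics"]
all_fit = all(is_valid_category(name) for name in proposed)
```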

Note that the list of categories is subject to change, and will likely
be replaced by new, more "identifier like" ones. The three ones above,
however, would be kept for backward compatibility as they're already used.

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data, and could also allow things like e-books which can be
either read or listened to from the same bitstream.

I have seen no reference to this being used anywhere, but I see no reason why the granule
progression should be temporal rather than user controlled, such as by using a "next" button
which would bump the granule position by a preset amount, simulating turning a page. This would
be close to necessary for text-to-speech, as the wall time duration of the spoken speech is not
known in advance to the Kate encoder, and can't be mapped to a time based granule progression.
All text strings triggered consecutively between the two granule positions would then
be read in order.

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with the ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts, (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

I am also investigating SVG images. These allow for very small footprint images for simple
vector drawings, and could be very useful for things like background gradients below text.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping Unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder (kateenc) and a decoder (katedec) are included in the tools directory.
The encoder supports input from several different formats:
* a custom text based file format (see [[#The Kate file format|The Kate file format]]), which is by no means meant to be part of the Kate bitstream specification itself
* SubRip (.srt), the most common subtitle format I found
* LRC lyrics format.

As an example for the widely used SRT subtitles format, the following command line
creates a Kate subtitles stream from an SRT file:

kateenc -l en -c subtitles -t srt -o subtitles.ogg subtitles.srt

The reverse is possible, to recover an SRT file from a Kate stream, with katedec.

Note that the subtitles.ogg file should then be multiplexed into the A/V stream,
using either ogg-tools or oggz-tools.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, what motions may be applied to events may not be known in advance (for instance, for a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I go that way. Still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

=== Packaging ===

It would be really nice to have packages for libkate/libtiger for many distros.

If you're a packager for a distro which doesn't yet have packages for libkate
or libtiger, please consider helping :)

In particular, packages for Debian would be grand.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the data packets themselves concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.
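
The layout above can be sketched in Python as follows. Xiph style lacing codes each length as a run of 255 bytes followed by one byte below 255, the length being the sum of all those bytes. The function names are hypothetical, chosen for this example only.

```python
def build_xiph_laced(packets):
    """Concatenate header packets into Matroska private data using Xiph lacing."""
    if not 1 <= len(packets) <= 256:
        raise ValueError("need between 1 and 256 packets")
    out = bytearray([len(packets) - 1])          # byte 0: number of packets minus 1
    for packet in packets[:-1]:                  # the last length is not stored
        size = len(packet)
        out += b"\xff" * (size // 255)
        out.append(size % 255)
    for packet in packets:
        out += packet
    return bytes(out)


def split_xiph_laced(private):
    """Split Matroska private data back into the individual header packets."""
    if not private:
        raise ValueError("empty private data")
    count = private[0] + 1
    pos = 1
    lengths = []
    for _ in range(count - 1):
        size = 0
        while True:
            if pos >= len(private):
                raise ValueError("truncated lacing")
            value = private[pos]
            pos += 1
            size += value
            if value != 255:
                break
        lengths.append(size)
    last = len(private) - pos - sum(lengths)     # deduced from the total size
    if last < 0:
        raise ValueError("lengths exceed private data size")
    lengths.append(last)
    packets = []
    for size in lengths:
        packets.append(private[pos:pos + size])
        pos += size
    return packets


headers = [b"\x80kate\x00\x00\x00" + bytes(56), b"\x81" + bytes(300), b"\x82abc"]
private = build_xiph_laced(headers)
```

Since the number of Kate headers is not fixed, a reader would take the packet count from byte 0 rather than assume three packets as for Vorbis and Theora.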

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== Things I need to get feedback on ==

* is it a good idea to avoid floating point usage altogether ?

[[Category:Drafts]]
[[Category:Ogg Mappings]]

OggKate

2008-12-29T15:09:21Z

Ogg.k.ogg.k: /* What is Kate? */

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is an overlay codec, originally designed for karaoke and text, that can be
multiplexed in Ogg. Text and images can be carried by a Kate stream, and animated.
Most of the time, this would be multiplexed with audio/video to carry subtitles,
song lyrics (with or without karaoke data), etc, but doesn't have to be.

Series of curves (splines, segments, etc) may be attached to various properties
(text position, font size, etc) to create animated overlays. This allows scrolling
or fading text to be defined. This can even be used to draw arbitrary shapes, so
hand drawing can also be represented by a Kate stream.

Example uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed), and lyrics,
as created by oggenc, from vorbis-tools.

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on. It would have been quicker to write Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this. Besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex. It is also being "repurposed" as timed text now, bringing it closer to what I'm doing.
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I already had a working Kate implementation by that time).

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Style definitions header: a list of predefined styles to be referred to by data packets
*Region definitions header: a list of predefined regions to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This would be to help seeking code
find a Kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it's timed by start granule, not end granule (as Theora and Vorbis are).
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see the granule encoding section.
*The EOS packet should have a granule pos greater than or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (old decoders will parse such streams correctly, but ignore the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(ie, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
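
The two checks described above (header versus data packet, and stream recognition) can be sketched in Python as below. The function names are hypothetical; only the byte values come from the text.

```python
KATE_SIGNATURE = b"\x80kate\x00\x00\x00"  # "\200kate\0\0\0": type 0x80, then the magic


def is_header_packet(packet):
    """A packet whose type byte has the MSB set is a header packet."""
    return len(packet) > 0 and (packet[0] & 0x80) != 0


def has_kate_magic(packet):
    """Header packets carry the Kate magic at byte offsets 1 to 7."""
    return is_header_packet(packet) and packet[1:8] == b"kate\x00\x00\x00"


def is_kate_stream(first_packet):
    """A Kate stream is recognized by the first eight bytes of its first packet."""
    return first_packet[:8] == KATE_SIGNATURE


id_header = KATE_SIGNATURE + bytes(56)
comment_header = b"\x81kate\x00\x00\x00" + bytes(8)
keepalive = b"\x01"
```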

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the ID header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x7f end packet (EOS)

The format described here is for bitstream version 0.x.
As of 19 December 2008, the latest bitstream version is 0.4.

For more detailed information, refer to the format documentation
in libkate (see URL below in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

0 1 2 3 |
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| packtype | Identifier char[7]: 'kate\0\0\0' | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| kate magic continued | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | version major | version minor | num headers | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| text encoding | directionality| reserved - 0 | granule shift | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| cw sh | canvas width | ch sh | canvas height | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate numerator | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate denominator | 28-31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (NUL terminated) | 32-35
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 36-39
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 40-43
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 44-47
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (NUL terminated) | 48-51
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 52-55
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 56-59
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 60-63
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

Language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously will not accommodate language tags
with lots of subtags.

Category is currently loosely defined, and I haven't yet found a nice way to
present it in a generic way, but it is meant for automatic classifying of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).
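
To make the layout concrete, here is a minimal Python sketch that reads the byte aligned fields of the ID header. It assumes the 32 bit granule rate fields are little endian, which the diagram above does not state, so check the libkate format documentation before relying on it. The canvas size fields are left out, and the function name and dictionary keys are made up for this example.

```python
import struct


def parse_id_header(packet):
    """Extract the byte aligned fields of a 64 byte Kate ID header."""
    if len(packet) < 64 or packet[:8] != b"\x80kate\x00\x00\x00":
        raise ValueError("not a Kate ID header")

    def cstring(raw):
        # Language and category are NUL terminated ASCII strings.
        return raw.split(b"\x00", 1)[0].decode("ascii")

    # Assumption: 32 bit fields are stored little endian.
    numerator, denominator = struct.unpack_from("<II", packet, 24)
    return {
        "version_major": packet[9],
        "version_minor": packet[10],
        "num_headers": packet[11],
        "text_encoding": packet[12],
        "directionality": packet[13],
        "granule_shift": packet[15],
        "gps_numerator": numerator,
        "gps_denominator": denominator,
        "language": cstring(packet[32:48]),
        "category": cstring(packet[48:64]),
    }


# Build a hypothetical header by hand to exercise the parser.
raw = bytearray(64)
raw[0:8] = b"\x80kate\x00\x00\x00"
raw[10] = 4            # version minor
raw[11] = 9            # num headers
raw[15] = 32           # granule shift
struct.pack_into("<II", raw, 24, 1000, 1)
raw[32:34] = b"en"
raw[48:57] = b"subtitles"
info = parse_id_header(bytes(raw))
```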

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay

I have patches for the following with Kate support:
*MPlayer (for multiplexed per-language subtitles - all region/style info is ignored)
*xine (everything kate supports, as xine is my testbed)
*GStreamer
*Thoggen

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo,
though it is not quite yet API stable (though no major changes are expected).

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
- a base granule, in the high bits
- a granule offset, in the low bits

+----------------+----------------+
| base | offset |
+----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The timestamp T of a given Kate packet is split into a base B and
offset O, and these are stored in the granulepos of that packet.
The split is done such that the B is the time of the earliest event
still active at the time, and the O is the time elapsed between B
and T. Thus, T = B + O. This mimics the way Theora stores its own
timestamps in granulepos, where the base acts as a keyframe, and
an offset acts as the position of an intra frame from the previous
keyframe. Since Kate allows time overlapping events, however, the
choice of the base to use is slightly more complex, as it may not
be the starting time of the previous event, if the stream contains
time overlapping events.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
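
The split described above can be sketched in Python as follows. The function names are hypothetical; the parameters mirror the kate_info fields granule_shift, gps_numerator and gps_denominator.

```python
from fractions import Fraction


def make_granulepos(base, offset, granule_shift):
    """Pack a base (high bits) and an offset (low bits) into one granulepos."""
    if offset < 0 or offset >= (1 << granule_shift):
        raise ValueError("offset does not fit in granule_shift bits")
    return (base << granule_shift) | offset


def split_granulepos(granulepos, granule_shift):
    """Recover (base, offset) from a granulepos."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset


def granule_time(granulepos, granule_shift, gps_numerator, gps_denominator):
    """Time in seconds of a packet: T = B + O, in units of the granule rate."""
    base, offset = split_granulepos(granulepos, granule_shift)
    return Fraction((base + offset) * gps_denominator, gps_numerator)


# 1000 granules per second, 32 bits reserved for the offset.
# The earliest event still active started at 5.0 s, this packet is at 7.5 s.
gp = make_granulepos(5000, 2500, 32)
seconds = granule_time(gp, 32, 1000, 1)
```

Here the base alone (5000 granules, so 5 seconds) tells a seeking player how far back the earliest still active event started.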

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit values at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definition, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by the means of a series of curves (static points, segments, splines (catmull-rom, bezier, and b-splines)).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have attached semantics so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting
at the right time and for the right duration a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.
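
As an illustration of what such a mapping does on the client side, here is a small Python sketch. The function name and the (left, top, width, height) tuple are assumptions made for this example, as is the choice to offset the result by the top left corner of the area. libkate's own structures differ.

```python
def map_to_area(point, area):
    """Scale a point in the 0-1 range to pixel coordinates within an area.

    area is (left, top, width, height) in pixels: the whole video frame for
    a frame mapping, or the current region for a region mapping.
    """
    x, y = point
    left, top, width, height = area
    return (left + x * width, top + y * height)


frame = (0, 0, 640, 480)          # hypothetical video frame
region = (100, 50, 200, 80)       # hypothetical region inside that frame
in_frame = map_to_area((0.5, 0.25), frame)
in_region = map_to_area((0.5, 0.25), region)
```

The same curve therefore lands at different pixel positions depending on the mapping, without the curve itself having to change.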

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary, if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially when considering that the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I tend to dislike very
much editing XML by hand, as it's really hard to read. XML is really meant for machines to parse
generically text data in a shared syntax but with possibly unknown semantics, and I need those
text representations to be editable easily.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from a style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place where a subtitle has been started, but has not been removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the granule originally sought. This implies support code on players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek hopping back from event to event with no real way to know whether a previous event is still active - I suppose CMML has no need to know this if its "clips" do not overlap - mine can).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, of which I am not certain.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.
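
The granule splitting used by the "reference frame" option can be sketched as follows. This is only an illustration: the 32 bit split (<code>DELTA_BITS</code>) and the function names are assumptions made for the example, not part of the Kate bitstream.

```python
# Sketch of the "reference frame" granule split described above.
# DELTA_BITS is a hypothetical choice; the text does not fix a value.
DELTA_BITS = 32
DELTA_MASK = (1 << DELTA_BITS) - 1


def pack_granule(reference: int, delta: int) -> int:
    """Combine a reference frame position (high bits) and a delta (low bits)."""
    if reference < 0 or not 0 <= delta <= DELTA_MASK:
        raise ValueError("reference or delta out of range")
    return (reference << DELTA_BITS) | delta


def seek_target(granule: int) -> int:
    """Clear the low bits: this is the granule of the previous reference frame."""
    return granule & ~DELTA_MASK


def current_position(granule: int) -> int:
    """Reference frame position plus delta gives the position of the packet."""
    return (granule >> DELTA_BITS) + (granule & DELTA_MASK)
```

A seek to any granule then lands on the sync point where active strings were reissued, and the decoder still has to "dummy" decode from there up to <code>current_position</code> of the requested granule.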

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
text encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.

=== Bitstream format for floating point values ===

Floating point values are turned into a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over raw 32 bit writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
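
A minimal sketch of this scheme, restricted to non-negative values for simplicity (sign handling is left out and is an assumption of the example, as are all function names; this is not the libkate code):

```python
def to_fixed_16_16(x: float) -> int:
    """Quantize a non-negative float to 16.16 fixed point (sketch only)."""
    if not 0.0 <= x < 65536.0:
        raise ValueError("value out of range for this sketch")
    return min(int(round(x * 65536.0)), 0xFFFFFFFF)


def shared_zero_bits(values):
    """Return (head, tail): zero bits shared by every 32 bit value."""
    merged = 0
    for v in values:
        merged |= v
    if merged == 0:
        return 32, 0
    head = 32 - merged.bit_length()
    tail = (merged & -merged).bit_length() - 1
    return head, tail


def pack_values(floats):
    """Return head, tail, bits per value, and the remainder of each value."""
    fixed = [to_fixed_16_16(x) for x in floats]
    head, tail = shared_zero_bits(fixed)
    width = 32 - head - tail
    return head, tail, width, [v >> tail for v in fixed]


def unpack_values(tail, remainders):
    """Rebuild the (quantized) floats from the stored remainders."""
    return [(r << tail) / 65536.0 for r in remainders]
```

For the values 1.0, 2.5 and 0.25, the shared head and tail are 14 bits each, so each value needs only 4 bits instead of 32.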

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the Kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly so they can choose to turn some streams on and off.

Since categories are meant primarily for a machine to parse, they will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics

Please remember the 15 character limit if proposing other categories.

Note that the list of categories is subject to change, and will likely
be replaced by new, more "identifier like" ones. The three ones above,
however, would be kept for backward compatibility as they're already used.
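
As an illustration of the constraints above (ASCII only, at most 15 characters, zero terminated in a 16 byte field), a category name could be checked and padded like this. The helper names are hypothetical and not part of libkate:

```python
CATEGORY_FIELD_SIZE = 16  # bytes in the BOS packet, including the terminator


def encode_category(name: str) -> bytes:
    """Return the 16 byte field for a category name, or raise ValueError."""
    raw = name.encode("ascii")  # UnicodeEncodeError is a ValueError subclass
    if b"\x00" in raw:
        raise ValueError("category must not contain a zero byte")
    if len(raw) > CATEGORY_FIELD_SIZE - 1:
        raise ValueError("category is limited to 15 characters")
    return raw.ljust(CATEGORY_FIELD_SIZE, b"\x00")


def decode_category(field: bytes) -> str:
    """Read the name back, stopping at the first zero byte."""
    return field.split(b"\x00", 1)[0].decode("ascii")
```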

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data, and could also allow things like e-books which can be
either read or listened to from the same bitstream (I have seen no reference to this
being used anywhere, but I see no reason why the granule progression should be temporal,
and not user controlled, such as by using a "next" button which would bump a granule
position by a preset amount, simulating turning a page (this would be close to necessary
for text-to-speech, as the wall time duration of the spoken speech is not known in
advance to the Kate encoder, and can't be mapped to a time based granule progression)).
All text strings triggered consecutively between the two granule positions would then
be read in order.

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with the ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts, (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

I am also investigating SVG images. These allow for very small footprint images for simple
vector drawings, and could be very useful for things like background gradients below text.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping Unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder (kateenc) and a decoder (katedec) are included in the tools directory.
The encoder supports input from several different formats:
* a custom text based file format (see [[#The Kate file format|The Kate file format]]), which is by no means meant to be part of the Kate bitstream specification itself
* SubRip (.srt), the most common subtitle format I found
* LRC lyrics format.

As an example for the widely used SRT subtitles format, the following command line
creates a Kate subtitles stream from an SRT file:

kateenc -l en -c subtitles -t srt -o subtitles.ogg subtitles.srt

The reverse is possible, to recover an SRT file from a Kate stream, with katedec.

Note that the subtitles.ogg file should then be multiplexed into the A/V stream,
using either ogg-tools or oggz-tools.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, what motions may be applied to events may not be known in advance (for instance, for a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I was going that way. Still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

=== Packaging ===

It would be really nice to have packages for libkate/libtiger for many distros.

If you're a packager for a distro which doesn't yet have packages for libkate
or libtiger, please consider helping :)

In particular, packages for Debian would be grand.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the data packets themselves concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.
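
The layout above can be sketched as a small packer and parser. The function names are made up for the example; only the byte layout comes from the description above:

```python
def xiph_lace(packets):
    """Build private data: count byte, laced lengths, then the packets."""
    if not 1 <= len(packets) <= 256:
        raise ValueError("between 1 and 256 packets are required")
    out = bytearray([len(packets) - 1])
    for packet in packets[:-1]:  # the last length is not stored
        out += b"\xff" * (len(packet) // 255)
        out.append(len(packet) % 255)
    for packet in packets:
        out += packet
    return bytes(out)


def split_xiph_laced(private):
    """Split private data back into the individual header packets."""
    if not private:
        raise ValueError("empty private data")
    pos = 1
    sizes = []
    for _ in range(private[0]):
        size = 0
        while True:
            if pos >= len(private):
                raise ValueError("truncated lacing")
            value = private[pos]
            pos += 1
            size += value
            if value != 255:  # a byte below 255 ends this length
                break
        sizes.append(size)
    last = len(private) - pos - sum(sizes)
    if last < 0:
        raise ValueError("lengths exceed the private data size")
    sizes.append(last)  # deduced from the total size
    packets = []
    for size in sizes:
        packets.append(private[pos:pos + size])
        pos += size
    return packets
```

Since the number of Kate headers is not fixed, a reader should rely on the count byte rather than expect three packets as for Vorbis.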

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== Things I need to get feedback on ==

* is it a good idea to avoid floating point usage altogether ?

[[Category:Drafts]]
[[Category:Ogg Mappings]]

VorbisComment

2008-12-28T11:07:46Z

Ogg.k.ogg.k: Patch sent to VLC to fix broken file with coverart, should get in 1.0.0

VorbisComment is a base-level [[Metadata]] format initially created for use with Ogg [[Vorbis]]. It has since been adopted in the specifications of
[[Ogg]] encapsulations for other Xiph.Org codecs including [[Theora]], [[Speex]] and [[FLAC]].

The use case for VorbisComment is given as:
<blockquote>
... much like someone jotting a quick note on the bottom of a CDR. It should be a little information to remember the disc by and explain it to others; a short, to-the-point text note that need not only be a couple words, but isn't going to be more than a short paragraph.[http://xiph.org/vorbis/doc/v-comment.html]
</blockquote>

VorbisComments are typically used to provide basic information like the title and copyright holder of a work.
As such the scope is similar to that of ID3 tags used with MP3 files.
VorbisComment is widely supported on [[VorbisHardware|portable Ogg Vorbis players]] as well as streaming, editing and playback software.

Although the syntax of VorbisComment is well-specified, various conventions exist for the field names in use.
The goal for this page is to codify best practices and collect proposals for standardization of VorbisComment field names.

VorbisComments are typically encoded as the second packet in a codec stream. When VorbisComments are included in the first (ie. Theora) stream of an Ogg Theora file, they are assumed to cover all streams in the multiplexed group. [http://lists.xiph.org/pipermail/vorbis-dev/2008-December/019676.html]

VorbisComment is the simplest and most widely-supported mechanism for storing metadata with Xiph.Org codecs. For other existing and proposed mechanisms, see [[Metadata]].

==Recommended field names==

The current [http://xiph.org/vorbis/doc/v-comment.html VorbisComment recommendation] contains a recommended set
of field names for comments.

==Proposed field names==

Some proposals for extra field names:

* [http://reactor-core.org/ogg-tagging.html Ogg Vorbis Comment Field Recommendations]
* [http://gophernet.org/articles/vorbiscomment/ Proposals for extending Ogg Vorbis comments]

Comments are intended to be free-form, but for the purposes of interoperability, it is helpful to define
tag sets for particular applications, and provide some guidelines for machine parsing.

:: '''Some''' field names may have to be non-free-form to achieve machine parsing. Such as ENCODER, DATE, RIGHTS-DATE, and RIGHTS-URI. See reasoning below.

=== Cover art ===
VorbisComments don't officially support album cover art yet. Since this is a frequently requested feature though, the goal is to find a consensus and an official standard on how to embed (or link) album cover art pictures within Ogg Vorbis files.

==== Unofficial "COVERART" field ====
There exists an unofficial, not well supported comment field named "COVERART". It includes a base64-encoded string of the binary picture data (usually a JPEG file, but this could be a different file format too). The disadvantages are that
* no additional information like a description about the cover art is provided,
* the base64 string is displayed within many tag editors as plain text because of their missing support for this "COVERART" field
* it may break playback on hardware players because of a large vorbis comment header
* the cover art can't be linked

==== Proposal ====
Placing the [http://flac.sourceforge.net/format.html#metadata_block_picture binary FLAC coverart structure] within a vorbis comment named "BINARY_COVERART" would have the following benefits:

* Easy to use for developers since the identical (or similar) structure is also used by FLAC and MP3, which means that chances are good that people and software programmers are willing to support this.
* Old C / C++ based implementations don't display the binary data as string since it always starts with a zero byte at the first position, which is an empty string when interpreted as UTF-8.
* The cover art can either be linked or embedded within the stream.
* All common picture file formats are supported (jpg, gif, whatever).
* Additional information like a description or the picture type (front cover, back cover...) is supported.
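
As a rough sketch of what such a field value would contain, the FLAC picture structure can be assembled as below. The field order follows the FLAC format documentation linked above; the function name and the sample values are only illustrative. Because the picture type is a small number stored as a 32 bit big-endian integer, the value starts with a zero byte, which is the property mentioned in the list.

```python
import struct


def build_picture_block(picture_type, mime, description,
                        width, height, depth, colors, data):
    """Serialize a FLAC style picture structure (all integers big-endian)."""
    mime_bytes = mime.encode("ascii")
    description_bytes = description.encode("utf-8")
    return b"".join([
        struct.pack(">II", picture_type, len(mime_bytes)),
        mime_bytes,
        struct.pack(">I", len(description_bytes)),
        description_bytes,
        struct.pack(">IIIII", width, height, depth, colors, len(data)),
        data,
    ])


# Picture type 3 is "front cover". The MIME type "-->" marks the data
# as a URL to the picture rather than the picture itself (linked art).
embedded = build_picture_block(3, "image/jpeg", "front", 0, 0, 0, 0,
                               b"\xff\xd8")
linked = build_picture_block(3, "-->", "", 0, 0, 0, 0,
                             b"http://example.org/cover.jpg")
```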

Possible disadvantages are:
* As with the base64 "COVERART" field, it might break playback of existing players (especially hardware players, software players could be updated easily). A workaround would be to link the picture within the tag, or to notify a user of a software tagger that his hardware player ''might'' not support playback of the file if he embeds a picture.

In order to test if there are playback problems with this proposal, there is a test file available [http://www.audioranger.com/with_coverart.ogg here]. You're invited to download this file, test playback on your software and hardware players, and report the results here on the wiki.

'''Tested software players'''
* Audacious 1.5.1: no problem
* foobar2000: no problems
* Gnome: built-in preview playback: no problem
* MediaMonkey: no problems
* Media Player Classic (unicode build) 6.4.9.1: no problem
* RoarAudio: no problems (server and client side)
* Rhythmbox 0.11.6: no problem
* Totem 2.24.3: no problem
* VLC 0.9.4/0.9.6: doesn't play
** Patch sent to VLC to fix this - should get in 1.0.0
* WinAmp: no problems
* Windows Media Player 11: no problem

'''Tested hardware players'''
* Logitech Squeezebox: doesn't play this file (and all other oggs with embedded picture)
** Workaround: The needed server software (called SqueezeCenter) can convert ogg to mp3 on the fly, and also has no problem converting oggs with embedded pictures

'''Tested tag editors'''
* Easytag 2.1.6: can open the file to edit the normal tag fields

===Dates and time===
The goal is to specify '''one''' standard format for describing dates and time.

====ISO proposal====
The date format for any field describing a date must follow the ISO scheme: YYYY-MM-DD or shortened to just YYYY-MM or simply YYYY.

We have been recommending this usage with the DATE tag for some time. It is proposed that the spec be amended to include this
information for machinability.

The time format for any field '''except''' track duration must be specified with a leading T and end with a time zone. Schemas with and without dates: YYYY-MM-DDTHH:MM:SS+TZ and THH:MM+TZ
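
A tagging application could check the three proposed date forms (YYYY, YYYY-MM, YYYY-MM-DD) with something like the following sketch. The names are hypothetical, and the time forms are left out since the time zone notation is not spelled out here:

```python
import re

# Matches YYYY, YYYY-MM or YYYY-MM-DD. Month lengths are not checked,
# so "2008-02-31" is accepted by this sketch.
ISO_DATE = re.compile(
    r"[0-9]{4}(?:-(?:0[1-9]|1[0-2])(?:-(?:0[1-9]|[12][0-9]|3[01]))?)?"
)


def is_valid_date(value: str) -> bool:
    """Return True if value is one of the three proposed date forms."""
    return ISO_DATE.fullmatch(value) is not None
```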

===New ENCODER field name proposal===
The goal is to attribute encoder software. This value can be used in the future to determine which files can be improved by being re-encoded with a newer version.

:'''Comment''': What is lacking from the vendor string present in the spec from the start? All libvorbis and encoder tunings I'm aware of have recorded the encoder version here.
:: Note that ffmpeg2theora uses ENCODER, but does not include a url.
::: A URI/L—especially one with version numbering—will be more unique. See the above goal for this comment.
:: I've also seen ENCODED_BY.
::: ENCODED_BY is usually the person who did the encoding. This should not be part of the recommendation due to legal problems around deliberate and accidental distribution to third parties. Basically the name of the encoder should not be included to protect encoders from their own egos and possible legal prosecution.
:: I am trying to get the specification to include that this field '''must''' contain a unique URL and version number, for the reason listed above. Whether to include the field at all would of course be optional.

====Proposal====
The ENCODER field value must be a unique URL providing both the encoder software name and version. If no unique URL is available where both name and version are available, then the version number can be specified after the URL, separated by a space character. For example:

<nowiki>ENCODER=http://flac.sourceforge.net/ 1.2.1</nowiki>

===Improving license data===
The goal is to provide a method for proclaiming license and copyright information (basically clarifying ‘distribution rights (if any) and ownership’).

The [http://xiph.org/vorbis/doc/v-comment.html specification document] describes LICENSE and COPYRIGHT fields, but it is not clear enough about whether these should be machine-readable.

We should consider working together with Creative Commons to have complementary and interlinked information on the CC and Xiph wikis. Refer to the [http://wiki.creativecommons.org/Ogg Ogg page] in the CC wiki.

==== New RIGHTS field name proposal ====
One proposal is to replace the COPYRIGHT and LICENSE field names with RIGHTS. RIGHTS must be a human-readable copyright statement. Basic example:

<nowiki>RIGHTS=Copyright © Recording Company Inc. All distribution rights reserved.</nowiki>

But this is not machine-readable. Adding two complementary field names should do the trick: RIGHTS-DATE, describing the date of copyright; and RIGHTS-URI, providing a method for linking to a license. Software agents can assume that multiple songs use the same URIs, such as in the case of Creative Commons. Full example:

<nowiki>RIGHTS=Copyright © 2019 Recording Company Inc. All distribution rights reserved.</nowiki><br />
<nowiki>RIGHTS-DATE=2019-04</nowiki><br />
<nowiki>RIGHTS-URI=http://somewhere.com/license.xhtml</nowiki>

Multimedia management and playback software is encouraged to display the RIGHTS statement as a linked phrase using RIGHTS-URI.

RIGHTS-DATE does not need to be displayed as it is required in the human readable version by international copyright agreements. RIGHTS-DATE can be used to determine when a copyrighted work falls into the public domain and related matters. (''The Beatles''' copyrights on their original studio recordings (not the remixes) are soon expiring, so mechanisms such as RIGHTS-DATE are indeed required in music management and filesharing software!)

To remain machine-readable it would be required to have at most one instance of each RIGHTS field name. All fields would of course remain optional.

The ''Dublin Core Metadata Initiative'' recommends the use of ‘rights’ to describe license and copyright matters. The web feed format Atom 1.0 has implemented a rights element in their specification.

==== Improving existing fields proposal ====
Similar to the DATE tag above, we have generally recommended that a URL uniquely identifying the license be included in the LICENSE field to allow machine identification of the license. This is in agreement with the proposal in the CC wiki. Since the COPYRIGHT field is a human-readable statement of the copyright, like the proposed RIGHTS tag above, some people include a license url there. Therefore if a url can't be found in a LICENSE tag if any, applications should use one from the COPYRIGHT tag, if any. Contact information for verification, attribution, relicensing, etc. can be obtained from the COPYRIGHT field, but CC also recommend a separate CONTACT tag for this information. This is reasonable, so we propose it be included.
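
The lookup order suggested above (LICENSE first, then COPYRIGHT) might be implemented along these lines. This is a sketch: the comment list representation, the URL pattern and the function name are assumptions of the example.

```python
import re

# Deliberately simple URL pattern, good enough for an illustration.
URL_PATTERN = re.compile(r"https?://[^\s<>\"]+")


def find_license_url(comments):
    """Return a license URL from (name, value) pairs, or None.

    LICENSE fields are searched first, then COPYRIGHT fields.
    Field names are compared case-insensitively.
    """
    for wanted in ("LICENSE", "COPYRIGHT"):
        for name, value in comments:
            if name.upper() != wanted:
                continue
            match = URL_PATTERN.search(value)
            if match:
                # Drop punctuation that merely ends the sentence.
                return match.group(0).rstrip(".,;:!?)")
    return None
```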

=== Attributing involved parties ===
The goal is to attribute more persons and organisations involved in audio and music productions to make room for more advanced search and sorting.

'''NO PROPOSALS!''' Needs much extending beyond just ARTIST field name. See work at proposed XML replacement for Vorbis Comments, [[M3F]].

=== Geo Location fields ===
The LOCATION field is meant to carry a human readable location for the recording/creation of the media file.

Having geographical coordinates according to [http://en.wikipedia.org/wiki/World_Geodetic_System WGS84] can be useful as well, especially in a form that can be machine parsed. The agreed format is similar to this [http://en.wikipedia.org/wiki/Geo_(microformat) geo microformat]:

GEO_LOCATION= ''latitude'' ; ''longitude'' [; ''elevation'' ]

where each value is a fixed point decimal number formatted in the C locale with a period (.) for the radix. Values are separated with a ';' and white space is not significant. The elevation is optional.

''latitude'' is the geo latitude location of where the media has been recorded or produced in decimal degrees according to WGS84 (zero at the equator, negative values for southern latitudes) (C double).

''longitude'' is the geo longitude location of where the media has been recorded or produced in decimal degrees according to WGS84 (zero at the prime meridian in Greenwich/UK, negative values for western longitudes). (C double).

''elevation'' is the geo elevation of where the media has been recorded or produced in meters according to WGS84 (zero is average sea level) (C double).
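
A parser for this field could look like the sketch below. It takes the part after <code>GEO_LOCATION=</code>; the function name and the range checks are additions for the example, not part of the agreed format.

```python
import re

# Fixed point decimal with a period as the radix, as described above.
DECIMAL = re.compile(r"[+-]?[0-9]+(?:\.[0-9]+)?")


def parse_geo_location(value: str):
    """Return (latitude, longitude, elevation); elevation may be None."""
    parts = [part.strip() for part in value.split(";")]
    if len(parts) not in (2, 3):
        raise ValueError("expected two or three values separated by ';'")
    for part in parts:
        if not DECIMAL.fullmatch(part):
            raise ValueError("not a fixed point decimal: %r" % part)
    numbers = [float(part) for part in parts]
    latitude, longitude = numbers[0], numbers[1]
    if not -90.0 <= latitude <= 90.0:
        raise ValueError("latitude out of range")
    if not -180.0 <= longitude <= 180.0:
        raise ValueError("longitude out of range")
    elevation = numbers[2] if len(numbers) == 3 else None
    return latitude, longitude, elevation
```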

== Character encoding ==
The goal is to offer better support for more languages and make machine processing faster.

The specification should be a little more strict to achieve this.

==== Proposals ====
''Field names may be UTF-8 and all UPPERCASE for easier machine processing.''

Allowing tag names to be UTF-8 instead of ASCII is a backwards-incompatible spec change. If we did this, requiring that the case mapping happen in the tagging application rather than in decoders is reasonable, since case mapping in Unicode is non-trivial.

The original argument for ASCII was that we need standardized tag names for interoperability, so there's no point in being able to localize them, and we might as well go with our native prejudice. Localizing the values should be done by appending a language code to the tag, since this is both machinable and there may be collisions between translated tag names.

:UTF-8 is a bad idea in field names. The field names are for machine interpretation, localisation should be done on the software side. UTF-8 introduces matching problems (canonical form) and encoding/decoding problems (difficulty in finding length of a string). Please sign comments; I ''think'' the above is a cumulative set of comments, but no idea.--[[User:Imalone|Imalone]] 01:11, 17 September 2007 (PDT)

== Implementations ==

* [http://sbooth.org/importers/ Spotlight importer]
* vorbiscomment
* oggz-comment

OggKate

2008-12-26T13:49:56Z

Ogg.k.ogg.k: packaging

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is an overlay codec, originally designed for karaoke and text, that can be
multiplexed in Ogg. Text and images can be carried by a Kate stream, and animated.
Most of the time, this would be multiplexed with audio/video to carry subtitles,
song lyrics (with or without karaoke data), etc, but doesn't have to be.

Series of curves (splines, segments, etc) may be attached to various properties
(text position, font size, etc) to create animated overlays. This allows scrolling
or fading text to be defined. This can even be used to draw arbitrary shapes, so
hand drawing can also be represented by a Kate stream.

Example uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed).

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on - I'd have been quicker to write Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this - besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex - though it is being "repurposed" as timed text now, bringing it closer to what I'm doing
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Region definitions header: a list of predefined regions to be referred to by data packets
*Style definitions header: a list of predefined styles to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This would be to help seeking code
find a kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it's timed by start granule, not by end granule (as Theora and Vorbis are).
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see the granule encoding section.
*The EOS packet should have a granule pos greater than or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (old decoders will parse the stream correctly, but ignore the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(ie, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
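The detection rule above can be sketched in a few lines of Python. This is only an illustration of the byte comparisons described here, not libkate code, and the function names are made up for the example.

```python
# Illustrative helpers; the names are hypothetical and not part of libkate.
KATE_MAGIC = b"kate\x00\x00\x00"             # bytes 1..7 of every header packet
KATE_BOS_SIGNATURE = b"\x80" + KATE_MAGIC    # "\200kate\0\0\0"


def is_header_packet(packet: bytes) -> bool:
    """True if the packet type byte has its MSB set (0x80..0xff)."""
    return len(packet) >= 1 and (packet[0] & 0x80) != 0


def has_kate_magic(packet: bytes) -> bool:
    """True for any header packet carrying the Kate magic at offsets 1..7."""
    return is_header_packet(packet) and packet[1:8] == KATE_MAGIC


def is_kate_id_header(packet: bytes) -> bool:
    """True if the first eight bytes match the ID header signature."""
    return packet[:8] == KATE_BOS_SIGNATURE
```

Data packets do not carry the magic, so only the first packet of a logical stream is useful for recognizing a Kate stream.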

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the ID header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x7f end packet (EOS)

The format described here is for bitstream version 0.x.
As of 19 December 2008, the latest bitstream version is 0.4.

For more detailed information, refer to the format documentation
in libkate (see URL below in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

0 1 2 3 |
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| packtype | Identifier char[7]: 'kate\0\0\0' | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| kate magic continued | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | version major | version minor | num headers | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| text encoding | directionality| reserved - 0 | granule shift | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| cw sh | canvas width | ch sh | canvas height | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate numerator | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate denominator | 28-31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (NUL terminated) | 32-35
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 36-39
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 40-43
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 44-47
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (NUL terminated) | 48-51
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 52-55
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 56-59
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 60-63
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously will not accommodate language tags
with lots of subtags.

Category is currently loosely defined, and I haven't yet found a nice way to
present it in a generic way, but it is meant for automatic classification of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).
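As a rough illustration of the layout above, the following Python sketch pulls the fields out of a 64 byte ID header. It is not the libkate parser. The little-endian byte order for the two 32 bit granule rate fields is an assumption (this page does not state it), and the canvas fields are returned as raw bytes because their bit packing is not detailed here. The class and function names are invented for the example.

```python
import struct
from typing import NamedTuple


class KateIdHeader(NamedTuple):
    """Fields of the ID header; a hypothetical container, not a libkate type."""
    version_major: int
    version_minor: int
    num_headers: int
    text_encoding: int
    directionality: int
    granule_shift: int
    canvas_raw: bytes        # bytes 16..19, left undecoded
    gps_numerator: int
    gps_denominator: int
    language: str
    category: str


def _c_string(field: bytes) -> str:
    """Decode a NUL terminated ASCII field, dropping the terminator and padding."""
    return field.split(b"\x00", 1)[0].decode("ascii")


def parse_id_header(packet: bytes) -> KateIdHeader:
    if len(packet) < 64:
        raise ValueError("ID header must be at least 64 bytes")
    if packet[:8] != b"\x80kate\x00\x00\x00":
        raise ValueError("not a Kate ID header")
    # Bytes 8..15 are eight single byte fields (two of them reserved).
    (_reserved0, major, minor, num_headers,
     encoding, direction, _reserved1, shift) = struct.unpack_from("8B", packet, 8)
    # Bytes 24..31: granule rate. Little-endian is ASSUMED here.
    numerator, denominator = struct.unpack_from("<II", packet, 24)
    return KateIdHeader(
        version_major=major,
        version_minor=minor,
        num_headers=num_headers,
        text_encoding=encoding,
        directionality=direction,
        granule_shift=shift,
        canvas_raw=packet[16:20],
        gps_numerator=numerator,
        gps_denominator=denominator,
        language=_c_string(packet[32:48]),
        category=_c_string(packet[48:64]),
    )
```

A real program would normally let libkate do this; the sketch is only useful for quick inspection of a stream.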

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay

I have patches for the following with Kate support:
*MPlayer (for multiplexed per-language subtitles - all region/style info is ignored)
*xine (everything kate supports, as xine is my testbed)
*GStreamer
*Thoggen

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo.
Its API is not quite stable yet, though no major changes are expected.

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
- a base granule, in the high bits
- a granule offset, in the low bits

+----------------+----------------+
| base | offset |
+----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The timestamp T of a given Kate packet is split into a base B and
offset O, and these are stored in the granulepos of that packet.
The split is done such that the B is the time of the earliest event
still active at the time, and the O is the time elapsed between B
and T. Thus, T = B + O. This mimics the way Theora stores its own
timestamps in granulepos, where the base gives the position of a keyframe, and
the offset gives the position of an inter frame relative to the previous
keyframe. Since Kate allows time overlapping events, however, the
choice of the base to use is slightly more complex, as it may not
be the starting time of the previous event, if the stream contains
time overlapping events.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
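The split described above can be sketched as follows. This is an illustration only: the function names are invented, and the conversion to seconds assumes that gps_numerator / gps_denominator is a rate in granules per second.

```python
from fractions import Fraction
from typing import Tuple


def split_granulepos(granulepos: int, granule_shift: int) -> Tuple[int, int]:
    """Return (base, offset): base in the high bits, offset in the low bits."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset


def make_granulepos(base: int, offset: int, granule_shift: int) -> int:
    """Pack a base and an offset; the offset must fit in granule_shift bits."""
    if offset < 0 or offset >= (1 << granule_shift):
        raise ValueError("offset does not fit in granule_shift bits")
    return (base << granule_shift) | offset


def granulepos_to_seconds(granulepos: int, granule_shift: int,
                          gps_numerator: int, gps_denominator: int) -> Fraction:
    """T = B + O, converted with the granule rate (ASSUMED granules per second)."""
    base, offset = split_granulepos(granulepos, granule_shift)
    return Fraction((base + offset) * gps_denominator, gps_numerator)
```

For instance, with a granule rate of 1000/1 and a granule_shift of 32, a base of 5000 and an offset of 250 describe a packet at 5.25 seconds whose earliest still active event started at 5 seconds.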

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit values at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definition, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by the means of a series of curves (static points, segments, splines (catmull-rom, bezier, and b-splines)).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have an attached semantics so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting
at the right time and for the right duration a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.
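To make the 'none' curves and the pause trick concrete, here is a small Python sketch of sampling a motion. The data model is hypothetical (a motion as a list of timed curves, with only piecewise linear curves, no splines) and is not how libkate represents motions.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
# A curve is a list of control points joined by straight segments, or None to
# stand for a curve of 'none' type. Splines are left out of this sketch.
Curve = Optional[List[Point]]
# A motion is a sequence of (duration in seconds, curve), durations > 0.
Motion = List[Tuple[float, Curve]]


def eval_linear(points: List[Point], u: float) -> Point:
    """Sample a piecewise linear curve at u in [0, 1]."""
    if len(points) == 1:
        return points[0]
    scaled = u * (len(points) - 1)
    i = min(int(scaled), len(points) - 2)
    f = scaled - i
    (x0, y0), (x1, y1) = points[i], points[i + 1]
    return (x0 + (x1 - x0) * f, y0 + (y1 - y0) * f)


def eval_motion(motion: Motion, t: float) -> Optional[Point]:
    """Return the 2D point at time t, or None if no point is generated."""
    start = 0.0
    for duration, curve in motion:
        if start <= t < start + duration:
            if curve is None:          # discontinuity: nothing to draw
                return None
            return eval_linear(curve, (t - start) / duration)
        start += duration
    return None                        # t lies outside the motion


# Move right for 2s, pause 1s at (1, 0), hide for 1s, then move down for 2s.
example_motion: Motion = [
    (2.0, [(0.0, 0.0), (1.0, 0.0)]),
    (1.0, [(1.0, 0.0), (1.0, 0.0)]),   # pause: two equal points
    (1.0, None),                       # curve of 'none' type
    (2.0, [(1.0, 0.0), (1.0, 1.0)]),
]
```

The pause needs no special case in the sampling code: interpolating between two equal points always yields that same point.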

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.
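A mapping can be pictured as a small scaling step applied to each point produced by a motion. The sketch below is illustrative only: the mapping names are plain strings chosen for the example, and offsetting region-mapped points by the region origin is an assumption rather than something stated on this page.

```python
from typing import Tuple

Point = Tuple[float, float]


def map_point(point: Point, mapping: str,
              frame_size: Tuple[int, int],
              region: Tuple[int, int, int, int]) -> Point:
    """Scale a 0-1 point to pixels.

    frame_size is (width, height); region is (x, y, width, height).
    """
    x, y = point
    if mapping == "none":
        return (x, y)
    if mapping == "frame":
        width, height = frame_size
        return (x * width, y * height)
    if mapping == "region":
        rx, ry, rw, rh = region
        # ASSUMPTION: region-mapped points are relative to the region origin.
        return (rx + x * rw, ry + y * rh)
    raise ValueError("unknown mapping: " + mapping)
```

The same curve can then be reused unchanged for a 640x480 or a 1920x1080 video, since only the final scaling differs.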

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary, if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially when considering that the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I tend to dislike very
much editing XML by hand, as it's really hard to read. XML is really meant for machines to parse
generically text data in a shared syntax but with possibly unknown semantics, and I need those
text representations to be editable easily.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from a style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place where a subtitle has been started, but has not been removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the original seek granule. This implies support code on players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, of which I am not certain.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.
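The two phase seek from the first option can be simulated on an in-memory list of packets. This Python sketch is only a model of the idea: the Packet class is hypothetical, times are plain seconds rather than granules, and the first phase anchors on the last packet at or before the target, which is one reading of "find the next Kate packet" after an initial seek that lands before the target.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import List


@dataclass
class Packet:
    """Hypothetical stand-in for a Kate data packet."""
    start: float      # start time of this event
    end: float        # end time of this event
    backlink: float   # start of the earliest event still active at `start`
    text: str


def two_phase_seek(packets: List[Packet], target: float) -> List[str]:
    """Return the texts active at `target`; packets must be sorted by start."""
    starts = [p.start for p in packets]
    # Phase 1: land on the last packet starting at or before the target.
    i = bisect_right(starts, target) - 1
    if i < 0:
        return []
    # Phase 2: seek back to the earliest event still active at that packet,
    # then read forward, keeping whatever is still active at the target.
    j = bisect_left(starts, packets[i].backlink)
    return [p.text for p in packets[j:i + 1] if p.start <= target < p.end]


example_packets = [
    Packet(0.0, 30.0, 0.0, "long"),
    Packet(5.0, 10.0, 0.0, "a"),
    Packet(12.0, 15.0, 0.0, "b"),
    Packet(40.0, 45.0, 40.0, "c"),
]
```

With these packets, seeking to 13 seconds lands on "b", whose back link points to time 0, so the second seek goes back to "long" and both "long" and "b" are recovered. Seeking to 42 seconds needs no second seek, since the back link of "c" is its own start time.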

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
text encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.

=== Bitstream format for floating point values ===

Floating point values are turned to a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the floating point values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over 32 bits raw writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
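The idea can be illustrated with a short Python sketch. It shows the 16.16 quantization and the counting of zero bits shared at the head and tail of a set of values. It does not reproduce the exact bit layout libkate writes, which is not described on this page, and the function names are invented for the example.

```python
from typing import Iterable, List, Tuple


def to_fixed_16_16(value: float) -> int:
    """Quantize to 16.16 fixed point, as a 32 bit two's complement pattern."""
    fixed = int(round(value * 65536))
    if not -(1 << 31) <= fixed < (1 << 31):
        raise OverflowError("value does not fit in 16.16 fixed point")
    return fixed & 0xFFFFFFFF


def from_fixed_16_16(raw: int) -> float:
    """Inverse of to_fixed_16_16 (up to the quantization error)."""
    if raw & 0x80000000:
        raw -= 1 << 32
    return raw / 65536


def common_zero_bits(raws: Iterable[int]) -> Tuple[int, int]:
    """Count the zero bits shared by all values at the head and at the tail."""
    merged = 0
    for raw in raws:
        merged |= raw
    if merged == 0:
        return 32, 0
    head = 32 - merged.bit_length()
    tail = (merged & -merged).bit_length() - 1
    return head, tail


def packed_size_bits(raws: List[int]) -> int:
    """Payload bits needed once the shared head and tail zeros are dropped."""
    head, tail = common_zero_bits(raws)
    return len(raws) * (32 - head - tail)
```

For the values 0.0, 0.25, 0.5 and 1.0, the fixed point patterns share 15 zero bits at the head and 14 at the tail, so each value needs only 3 bits instead of 32. Negative values set the top bit, which removes the head savings in this simple model.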

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly to allow to choose to turn some streams on and off.

Since this category is meant primarily for a machine to parse, it will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics

Please remember the 15 character limit if proposing other categories.

Note that the list of categories is subject to change, and will likely
be replaced by new, more "identifier like" ones. The three ones above,
however, would be kept for backward compatibility as they're already used.

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data, and could also allow things like e-books which can be
either read or listened to from the same bitstream (I have seen no reference to this
being used anywhere, but I see no reason why the granule progression should be temporal,
and not user controlled, such as by using a "next" button which would bump a granule
position by a preset amount, simulating turning a page (this would be close to necessary
for text-to-speech, as the wall time duration of the spoken speech is not known in
advance to the Kate encoder, and can't be mapped to a time based granule progression)).
All text strings triggered consecutively between the two granule positions would then
be read in order.

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts, (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.
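The relation between palette size and pixel depth can be written down directly. The helper below is a sketch with an invented name; it only encodes the rule stated above (palette sizes are powers of 2 up to 256, and pixels use as many bits as are needed to address the palette), and treating 2 as the smallest palette size is an assumption.

```python
def bits_per_pixel(palette_size: int) -> int:
    """Bits needed per pixel to address a palette of the given size."""
    if not 2 <= palette_size <= 256:
        raise ValueError("palette size must be between 2 and 256")
    if palette_size & (palette_size - 1):
        raise ValueError("palette size must be a power of 2")
    # For a power of 2, log2(n) is the bit length minus one.
    return palette_size.bit_length() - 1
```

A 16 color palette therefore pairs with 4 bit pixel data, and a 256 color palette with 8 bit pixel data.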

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

I am also investigating SVG images. These allow for very small footprint images for simple
vector drawings, and could be very useful for things like background gradients below text.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder (kateenc) and a decoder (katedec) are included in the tools directory.
The encoder supports input from several different formats:
* a custom text based file format (see [[#The Kate file format|The Kate file format]]), which is by no means meant to be part of the Kate bitstream specification itself
* SubRip (.srt), the most common subtitle format I found
* LRC lyrics format.

As an example for the widely used SRT subtitles format, the following command line
creates a Kate subtitles stream from an SRT file:

kateenc -l en -c subtitles -t srt -o subtitles.ogg subtitles.srt

The reverse is possible, to recover an SRT file from a Kate stream, with katedec.

Note that the subtitles.ogg file should then be multiplexed into the A/V stream,
using either ogg-tools or oggz-tools.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, what motions may be applied to events may not be known in advance (for instance, for a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I go that way. I still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

=== Packaging ===

It would be really nice to have packages for libkate/libtiger for many distros.

If you're a packager for a distro which doesn't have yet packages for libkate
or libtiger, please consider helping :)

In particular, packages for Debian would be grand.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the data packets themselves concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.
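
The layout above can be unpacked as in the following minimal Python sketch. The function name is hypothetical and not part of libkate or any Matroska library. It assumes the usual xiph lacing rule, where each length is the sum of a run of bytes ending with the first byte below 255.

```python
def split_xiph_laced(private_data):
    """Split xiph-laced private data into its individual header packets."""
    if not private_data:
        raise ValueError("empty private data")
    num_lengths = private_data[0]  # NP: number of packets, minus 1
    pos = 1
    lengths = []
    for _ in range(num_lengths):
        size = 0
        while True:
            if pos >= len(private_data):
                raise ValueError("truncated lacing data")
            byte = private_data[pos]
            pos += 1
            size += byte
            if byte != 255:  # a byte below 255 ends this length
                break
        lengths.append(size)
    # The last length is not stored: it is whatever remains.
    remaining = len(private_data) - pos - sum(lengths)
    if remaining < 0:
        raise ValueError("packet lengths exceed private data size")
    lengths.append(remaining)
    packets = []
    for size in lengths:
        packets.append(bytes(private_data[pos:pos + size]))
        pos += size
    return packets
```

Since the number of packets is read from byte 0, this works without assuming a fixed number of headers.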

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== Things I need to get feedback on ==

* is it a good idea to avoid floating point usage altogether ?

[[Category:Drafts]]
[[Category:Ogg Mappings]]

OggKate

2008-12-26T13:43:20Z

Ogg.k.ogg.k: More info about tools, and an example

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is an overlay codec, originally designed for karaoke and text, that can be
multiplexed in Ogg. Text and images can be carried by a Kate stream, and animated.
Most of the time, this would be multiplexed with audio/video to carry subtitles,
song lyrics (with or without karaoke data), etc, but doesn't have to be.

Series of curves (splines, segments, etc) may be attached to various properties
(text position, font size, etc) to create animated overlays. This allows scrolling
or fading text to be defined. This can even be used to draw arbitrary shapes, so
hand drawing can also be represented by a Kate stream.

Example uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed).

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on - I'd have been quicker to write Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this - besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex - though it is being "repurposed" as timed text now, bringing it closer to what I'm doing
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Style definitions header: a list of predefined styles to be referred to by data packets
*Region definitions header: a list of predefined regions to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This would be to help seeking code
find a kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it's timed by start granule, not end granule (as Theora and Vorbis).
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see the granule encoding section.
*The EOS packet should have a granule pos greater than or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (old decoders will parse such streams correctly, ignoring the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(ie, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
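
As a minimal sketch of the two rules above (stream recognition from the first eight bytes, and header/data classification from the MSB of the type byte), the following Python can be used. The function names are hypothetical and are not part of libkate.

```python
KATE_MAGIC = b"kate\0\0\0"          # bytes 1..7 of every header packet
KATE_SIGNATURE = b"\x80" + KATE_MAGIC  # same bytes as "\200kate\0\0\0"


def is_kate_stream(first_packet):
    """True if the first packet of a stream looks like a Kate ID header."""
    return bytes(first_packet[:8]) == KATE_SIGNATURE


def classify_packet(packet):
    """Return "header" or "data" from the one byte packet type."""
    if not packet:
        raise ValueError("empty packet")
    if packet[0] & 0x80:  # MSB set: header packet
        if bytes(packet[1:8]) != KATE_MAGIC:
            raise ValueError("header packet without the Kate magic")
        return "header"
    return "data"  # MSB cleared: data packets carry no signature
```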

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the ID header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x7f end packet (EOS)

The format described here is for bitstream version 0.x.
As of 19 December 2008, the latest bitstream version is 0.4.

For more detailed information, refer to the format documentation
in libkate (see URL below in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

  0                   1                   2                   3  |
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | packtype      | Identifier char[7]: 'kate\0\0\0'              | 0-3
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | kate magic continued                                          | 4-7
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | reserved - 0  | version major | version minor | num headers   | 8-11
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | text encoding | directionality| reserved - 0  | granule shift | 12-15
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | cw sh | canvas width          | ch sh | canvas height         | 16-19
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | reserved - 0                                                  | 20-23
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | granule rate numerator                                        | 24-27
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | granule rate denominator                                      | 28-31
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (NUL terminated)                                     | 32-35
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (continued)                                          | 36-39
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (continued)                                          | 40-43
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (continued)                                          | 44-47
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (NUL terminated)                                     | 48-51
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (continued)                                          | 52-55
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (continued)                                          | 56-59
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (continued)                                          | 60-63
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously will not accommodate language tags
with lots of subtags.

Category is currently loosely defined, and I haven't found yet a nice way to
present it in a generic way, but is meant for automatic classifying of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).
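
The byte-aligned fields of the ID header described in this section can be read with a short Python sketch such as the one below. The function names are hypothetical. Reading the two granule rate fields as little-endian 32 bit integers is an assumption to be checked against the libkate format documentation. The canvas fields are left out because their exact bit packing is not spelled out here.

```python
import struct

KATE_SIGNATURE = b"\x80kate\0\0\0"


def _c_string(field):
    """Decode a NUL terminated ASCII field."""
    return bytes(field).split(b"\0", 1)[0].decode("ascii")


def parse_id_header(packet):
    """Extract the byte-aligned fields of a 64 byte Kate ID header."""
    if len(packet) < 64:
        raise ValueError("ID header must be at least 64 bytes")
    if bytes(packet[:8]) != KATE_SIGNATURE:
        raise ValueError("not a Kate ID header")
    # ASSUMPTION: multi-byte integers are stored little-endian.
    gps_numerator, gps_denominator = struct.unpack_from("<II", packet, 24)
    return {
        "version": (packet[9], packet[10]),   # major, minor
        "num_headers": packet[11],
        "text_encoding": packet[12],
        "directionality": packet[13],
        "granule_shift": packet[15],
        "gps_numerator": gps_numerator,
        "gps_denominator": gps_denominator,
        "language": _c_string(packet[32:48]),
        "category": _c_string(packet[48:64]),
    }
```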

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay

I have patches for the following with Kate support:
*MPlayer (for multiplexed per-language subtitles - all region/style info is ignored)
*xine (everything kate supports, as xine is my testbed)
*GStreamer
*Thoggen

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo,
though it is not quite yet API stable (though no major changes are expected).

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
* a base granule, in the high bits
* a granule offset, in the low bits

+----------------+----------------+
| base | offset |
+----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The timestamp T of a given Kate packet is split into a base B and
offset O, and these are stored in the granulepos of that packet.
The split is done such that the B is the time of the earliest event
still active at the time, and the O is the time elapsed between B
and T. Thus, T = B + O. This mimics the way Theora stores its own
timestamps in granulepos, where the base acts as a keyframe, and
an offset acts as the position of an inter frame from the previous
keyframe. Since Kate allows time overlapping events, however, the
choice of the base to use is slightly more complex, as it may not
be the starting time of the previous event, if the stream contains
time overlapping events.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
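
To make the split concrete, here is a small Python sketch of the base/offset arithmetic described above. The function names are hypothetical. It assumes that the granule rate is gps_numerator / gps_denominator granules per second, and that the base occupies the bits above granule_shift.

```python
from fractions import Fraction


def make_granulepos(base, offset, granule_shift):
    """Pack a base B (high bits) and an offset O (low bits)."""
    if offset < 0 or offset >= (1 << granule_shift):
        raise ValueError("offset does not fit in granule_shift bits")
    return (base << granule_shift) | offset


def split_granulepos(granulepos, granule_shift):
    """Recover (base, offset) from a granulepos."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset


def granulepos_to_seconds(granulepos, granule_shift,
                          gps_numerator, gps_denominator):
    """Timestamp T = B + O, converted from granule units to seconds."""
    base, offset = split_granulepos(granulepos, granule_shift)
    return Fraction((base + offset) * gps_denominator, gps_numerator)
```

For instance, with a granule rate of 1000/1 and a granule_shift of 32, a packet at 7.5 seconds whose earliest still active event started at 5 seconds would carry a base of 5000 and an offset of 2500.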

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit values at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definitions, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by the means of a series of curves (static points, segments, splines (catmull-rom, bezier, and b-splines)).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have attached semantics so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting
at the right time and for the right duration a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary, if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially when considering that the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.
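
As an illustration, a file in this minimal form can be generated from a list of timed strings with a few lines of Python. This is only a sketch: the helper names are hypothetical, it uses only the event syntax shown above with whole-second timestamps, and it refuses text containing double quotes, since escaping rules are not covered here.

```python
def format_timestamp(seconds):
    """Format whole seconds as HH:MM:SS."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def kate_description(events):
    """Build a minimal Kate description from (start, end, text) tuples."""
    lines = ["kate {"]
    for start, end, text in events:
        if '"' in text:
            raise ValueError("quote escaping is not handled by this sketch")
        lines.append(
            f'  event {{ {format_timestamp(start)} --> '
            f'{format_timestamp(end)} "{text}" }}'
        )
    lines.append("}")
    return "\n".join(lines) + "\n"
```

The resulting text can then be saved to a file and given to kateenc as a Kate description file.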

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I tend to dislike very
much editing XML by hand, as it's really hard to read. XML is really meant for machines to parse
generically text data in a shared syntax but with possibly unknown semantics, and I need those
text representations to be editable easily.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from a style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place where a subtitle has been started, but is not removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earlier still active packet is less than the original seeked granule. This implies support code on players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, of which I am not certain.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.
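
The two phase seek from the first option can be sketched in Python as follows. This is a simplified model, not player code: the stream is represented by a hypothetical list of (time, earliest_active_time) pairs, one per data packet, sorted by time, and the function returns the index of the packet decoding should start from.

```python
import bisect


def two_phase_seek(packets, target):
    """Return the index of the packet to start decoding from.

    packets: list of (time, earliest_active_time), sorted by time.
    """
    times = [time for time, _ in packets]
    # Phase one: find the first packet at or after the target time.
    first = bisect.bisect_left(times, target)
    if first == len(packets):
        return first  # nothing at or after the target
    earliest_active = packets[first][1]
    if earliest_active >= target:
        return first  # no older event is still active: one seek suffices
    # Phase two: seek back to the earliest event still active.
    return bisect.bisect_left(times, earliest_active)
```

In this model, regular keepalive packets would simply add entries to the list, so the first phase finds a packet close to the target time.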

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
language encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.

=== Bitstream format for floating point values ===

Floating point values are turned to a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the floating point values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over 32 bits raw writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
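
The idea can be illustrated with the Python sketch below. It is not the libkate bit layout: the function names are hypothetical, and negative values are assumed to be represented as 32 bit two's complement, which leaves them no shared leading zero bits.

```python
def to_fixed_16_16(value):
    """Quantize a float to 16.16 fixed point, as a 32 bit pattern."""
    return int(round(value * 65536)) & 0xFFFFFFFF


def shared_zero_bits(fixed_values):
    """Count zero bits common to all values at the head and at the tail."""
    merged = 0
    for value in fixed_values:
        merged |= value
    if merged == 0:
        return 32, 0  # every value is zero: nothing left to store
    head = 32 - merged.bit_length()
    tail = (merged & -merged).bit_length() - 1
    return head, tail


def packed_size_in_bits(values):
    """Bits needed for the remainder bits of all values."""
    fixed = [to_fixed_16_16(v) for v in values]
    head, tail = shared_zero_bits(fixed)
    return (32 - head - tail) * len(fixed)
```

For the values 0.5, 1.0 and 2.25, the fixed point patterns are 0x8000, 0x10000 and 0x24000, which share 14 zero bits at the head and 14 at the tail, so only 4 bits per value remain to be stored.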

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly to let them turn some streams on and off.

Since this category is meant primarily for a machine to parse, it will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics

Please remember the 15 character limit if proposing other categories.
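
A proposed category name can be checked against these constraints with a small Python sketch like the one below. The function name is hypothetical, and padding the unused bytes of the field with zeros is an assumption.

```python
def encode_category(name):
    """Encode a category name into the 16 byte header field."""
    data = name.encode("ascii")  # raises UnicodeEncodeError if not ASCII
    if len(data) > 15:
        raise ValueError("category names are limited to 15 characters")
    # ASSUMPTION: the rest of the field is padded with zero bytes.
    return data.ljust(16, b"\0")
```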

Note that the list of categories is subject to change, and will likely
be replaced by new, more "identifier like" ones. The three ones above,
however, would be kept for backward compatibility as they're already used.

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data, and could also allow things like e-books which can be
either read or listened to from the same bitstream (I have seen no reference to this
being used anywhere, but I see no reason why the granule progression should be temporal,
and not user controlled, such as by using a "next" button which would bump a granule
position by a preset amount, simulating turning a page (this would be close to necessary
for text-to-speech, as the wall time duration of the spoken speech is not known in
advance to the Kate encoder, and can't be mapped to a time based granule progression)).
All text strings triggered consecutively between the two granule positions would then
be read in order.

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with the ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts, (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

I am also investigating SVG images. These allow for very small footprint images for simple
vector drawings, and could be very useful for things like background gradients below text.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder (kateenc) and a decoder (katedec) are included in the tools directory.
The encoder supports input from several different formats:
* a custom text based file format (see [[#The Kate file format|The Kate file format]]), which is by no means meant to be part of the Kate bitstream specification itself
* SubRip (.srt), the most common subtitle format I found
* LRC lyrics format.

As an example for the widely used SRT subtitles format, the following command line
creates a Kate subtitles stream from an SRT file:

kateenc -l en -c subtitles -t srt -o subtitles.ogg subtitles.srt

The reverse is possible, to recover an SRT file from a Kate stream, with katedec.

Note that the subtitles.ogg file should then be multiplexed into the A/V stream,
using either ogg-tools or oggz-tools.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, what motions may be applied to events may not be known in advance (for instance, for a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I was going that way. Still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the data packets themselves concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.
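
As an illustration of the layout above, here is a minimal Python sketch that splits a private data blob back into its packets. It is not taken from any Matroska or libkate code; the function name is made up for this example, and it assumes the usual xiph lacing rule where each length is the sum of a run of 255 bytes ended by a byte lower than 255.

```python
from typing import List


def split_xiph_laced(private: bytes) -> List[bytes]:
    """Split xiph-laced private data into its packets (illustrative sketch)."""
    if not private:
        raise ValueError("empty private data")
    num_lengths = private[0]  # number of packets minus 1 (NP)
    pos = 1
    lengths = []
    for _ in range(num_lengths):
        size = 0
        while True:
            if pos >= len(private):
                raise ValueError("truncated lacing data")
            byte = private[pos]
            pos += 1
            size += byte
            if byte != 255:  # a byte below 255 ends this length
                break
        lengths.append(size)
    # The last packet's length is not stored: it is whatever remains.
    remaining = len(private) - pos - sum(lengths)
    if remaining < 0:
        raise ValueError("packet lengths exceed private data size")
    lengths.append(remaining)
    packets = []
    for size in lengths:
        packets.append(private[pos:pos + size])
        pos += size
    return packets
```

Since the number of Kate headers is not fixed, a caller should simply use however many packets come back rather than expecting a particular count.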

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== Things I need to get feedback on ==

* is it a good idea to avoid floating point usage altogether ?

[[Category:Drafts]]
[[Category:Ogg Mappings]]

OggKate

2008-12-25T17:25:54Z

Ogg.k.ogg.k: Rewrite the 'what is Kate ?' intro - now clearer - I think

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is an overlay codec, originally designed for karaoke and text, that can be
multiplexed in Ogg. Text and images can be carried by a Kate stream, and animated.
Most of the time, this would be multiplexed with audio/video to carry subtitles,
song lyrics (with or without karaoke data), etc, but doesn't have to be.

Series of curves (splines, segments, etc) may be attached to various properties
(text position, font size, etc) to create animated overlays. This allows scrolling
or fading text to be defined. This can even be used to draw arbitrary shapes, so
hand drawing can also be represented by a Kate stream.

Example uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed).

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on - I'd have been quicker to write Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this - besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex - though it is being "repurposed" as timed text now, bringing it closer to what I'm doing
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Style definitions header: a list of predefined styles to be referred to by data packets
*Region definitions header: a list of predefined regions to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This would be to help seeking code
find a kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it's timed by start granule, not end granule (as Theora and Vorbis are).
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see the granule encoding section.
*The EOS packet should have a granule pos greater than or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (though old decoders will correctly parse, but ignore the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(ie, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
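
As a minimal sketch of these rules in Python (the function names are made up for this example, and are not libkate's), classifying a packet and recognizing the start of a Kate stream could look like this:

```python
KATE_MAGIC = b"kate\0\0\0"                 # bytes 1 to 7 of every header packet
KATE_BOS_SIGNATURE = b"\x80" + KATE_MAGIC  # "\200kate\0\0\0", first 8 bytes of the ID header


def is_header_packet(packet: bytes) -> bool:
    """A packet whose type byte has the MSB set is a header packet."""
    return len(packet) >= 1 and (packet[0] & 0x80) != 0


def has_kate_magic(packet: bytes) -> bool:
    """Only header packets carry the magic; data packets do not."""
    return is_header_packet(packet) and packet[1:8] == KATE_MAGIC


def is_kate_stream_start(packet: bytes) -> bool:
    """True if this first packet of a stream looks like a Kate ID header."""
    return packet[:8] == KATE_BOS_SIGNATURE
```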

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the id header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x7f end packet (EOS)

The format described here is for bitstream version 0.x.
As of 19 December 2008, the latest bitstream version is 0.4.

For more detailed information, refer to the format documentation
in libkate (see URL below in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

0 1 2 3 |
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| packtype | Identifier char[7]: 'kate\0\0\0' | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| kate magic continued | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | version major | version minor | num headers | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| text encoding | directionality| reserved - 0 | granule shift | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| cw sh | canvas width | ch sh | canvas height | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate numerator | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate denominator | 28-31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (NUL terminated) | 32-35
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 36-39
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 40-43
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 44-47
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (NUL terminated) | 48-51
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 52-55
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 56-59
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 60-63
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously will not accommodate language tags
with lots of subtags.

Category is currently loosely defined, and I haven't found yet a nice way to
present it in a generic way, but is meant for automatic classifying of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).
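
To make the byte layout of the ID header concrete, here is a rough Python sketch that pulls the fields out of a 64 byte packet. It is not libkate code, and the names are made up for this example. Two assumptions are worth flagging: the 32 bit granule rate fields are read as little-endian, which is not stated above and should be checked against the libkate format documentation, and the four canvas bytes are returned raw, since their exact bit packing is not described here.

```python
import struct
from typing import NamedTuple, Tuple

KATE_BOS_SIGNATURE = b"\x80kate\0\0\0"


class KateIdHeader(NamedTuple):
    version_major: int
    version_minor: int
    num_headers: int
    text_encoding: int
    directionality: int
    granule_shift: int
    canvas_raw: bytes              # bytes 16-19, left undecoded in this sketch
    granule_rate: Tuple[int, int]  # (numerator, denominator)
    language: str
    category: str


def _nul_terminated_ascii(field: bytes) -> str:
    """Decode a fixed size field up to its first NUL byte."""
    return field.split(b"\0", 1)[0].decode("ascii")


def parse_id_header(packet: bytes) -> KateIdHeader:
    if len(packet) < 64:
        raise ValueError("a Kate ID header is 64 bytes long")
    if packet[:8] != KATE_BOS_SIGNATURE:
        raise ValueError("not a Kate ID header")
    # ASSUMPTION: 32 bit fields are little-endian (not stated in the text above).
    numerator, denominator = struct.unpack_from("<II", packet, 24)
    return KateIdHeader(
        version_major=packet[9],
        version_minor=packet[10],
        num_headers=packet[11],
        text_encoding=packet[12],
        directionality=packet[13],
        granule_shift=packet[15],
        canvas_raw=packet[16:20],
        granule_rate=(numerator, denominator),
        language=_nul_terminated_ascii(packet[32:48]),
        category=_nul_terminated_ascii(packet[48:64]),
    )
```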

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay

I have patches for the following with Kate support:
*MPlayer (for multiplexed per-language subtitles - all region/style info is ignored)
*xine (everything kate supports, as xine is my testbed)
*GStreamer
*Thoggen

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo.
It is not quite API stable yet, though no major changes are expected.

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
- a base granule, in the high bits
- a granule offset, in the low bits

 +----------------+----------------+
 |      base      |     offset     |
 +----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The timestamp T of a given Kate packet is split into a base B and
offset O, and these are stored in the granulepos of that packet.
The split is done such that the B is the time of the earliest event
still active at the time, and the O is the time elapsed between B
and T. Thus, T = B + O. This mimics the way Theora stores its own
timestamps in granulepos, where the base acts as a keyframe, and
an offset acts as the position of an inter frame from the previous
keyframe. Since Kate allows time overlapping events, however, the
choice of the base to use is slightly more complex, as it may not
be the starting time of the previous event.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
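
As a sketch of how the pieces fit together (in Python, with made-up function names, and not a copy of libkate's code), a granulepos can be split and converted to a time like this, with base and offset both counted in granule units:

```python
from fractions import Fraction
from typing import Tuple


def make_granulepos(base: int, offset: int, granule_shift: int) -> int:
    """Pack a base (high bits) and an offset (low bits) into one granulepos."""
    if offset < 0 or offset >= (1 << granule_shift):
        raise ValueError("offset does not fit in granule_shift bits")
    return (base << granule_shift) | offset


def split_granulepos(granulepos: int, granule_shift: int) -> Tuple[int, int]:
    """Recover (base, offset) from a granulepos."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset


def granule_time(granulepos: int, granule_shift: int,
                 gps_numerator: int, gps_denominator: int) -> Fraction:
    """Timestamp in seconds: T = B + O, scaled by the granule rate."""
    base, offset = split_granulepos(granulepos, granule_shift)
    return Fraction((base + offset) * gps_denominator, gps_numerator)
```

For example, with a granule rate of 1000/1 and a granule shift of 32, a packet at 7.5 seconds whose earliest still active event started at 5 seconds would have base 5000 and offset 2500.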

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit values at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definition, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by the means of a series of curves (static points, segments, splines (catmull-rom, bezier, and b-splines)).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have attached semantics so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting
at the right time and for the right duration a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.
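
The following Python sketch illustrates the idea of sampling a motion made of consecutive curves, including a 'none' curve and a pause. The data representation (a list of kind, duration, points tuples) is purely hypothetical and chosen for this example; it is not how libkate stores motions, and only linear curves are handled.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
# Hypothetical representation: (kind, duration in seconds, control points)
Curve = Tuple[str, float, List[Point]]


def sample_motion(motion: List[Curve], t: float) -> Optional[Point]:
    """Return the 2D point at time t, or None if no point is generated."""
    start = 0.0
    for kind, duration, points in motion:
        if start <= t < start + duration:
            if kind == "none":
                return None  # discontinuity: nothing to show at this time
            if kind == "linear":
                (x0, y0), (x1, y1) = points
                u = (t - start) / duration  # 0 at curve start, 1 at curve end
                return (x0 + (x1 - x0) * u, y0 + (y1 - y0) * u)
            raise ValueError("curve type not handled in this sketch: " + kind)
        start += duration
    return None  # t lies outside the lifetime of the motion


# A marker moves right for 2s, pauses 1s, is hidden 1s, then moves back for 2s.
example_motion = [
    ("linear", 2.0, [(0.0, 0.0), (1.0, 0.0)]),
    ("linear", 1.0, [(1.0, 0.0), (1.0, 0.0)]),  # pause: two equal points
    ("none", 1.0, []),                          # hidden marker
    ("linear", 2.0, [(1.0, 0.0), (0.0, 0.0)]),
]
```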

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary, if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially when considering the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.
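
For what it's worth, generating such a minimal file from a list of timed strings is straightforward. This Python sketch (function names made up for this example) only uses the syntax shown above; it assumes whole second timestamps, and refuses text containing double quotes since quoting rules are not covered here.

```python
from typing import Iterable, Tuple


def format_timestamp(seconds: int) -> str:
    """Format whole seconds as hh:mm:ss, as in the example above."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def make_kate_file(events: Iterable[Tuple[int, int, str]]) -> str:
    """Build a minimal Kate description from (start, end, text) tuples."""
    lines = ["kate {"]
    for start, end, text in events:
        if '"' in text:
            raise ValueError("quoting is not handled in this sketch")
        lines.append(
            f'  event {{ {format_timestamp(start)} --> {format_timestamp(end)} "{text}" }}'
        )
    lines.append("}")
    return "\n".join(lines) + "\n"
```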

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I tend to dislike very
much editing XML by hand, as it's really hard to read. XML is really meant for machines to
generically parse text data in a shared syntax but with possibly unknown semantics, and I need those
text representations to be editable easily.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from a style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place where a subtitle has been started, but is not removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the granule originally sought. This implies support code on players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, of which I am not certain.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.
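
The first option above (each data packet carrying the granule of the earliest still active packet) can be sketched in Python as follows. This works on an in-memory list of packets rather than on a real Ogg file, and the Packet type and function name are invented for this example.

```python
from bisect import bisect_left
from typing import List, NamedTuple


class Packet(NamedTuple):
    granule: int          # start granule of this packet
    earliest_active: int  # granule of the earliest still active packet
                          # (its own granule if none is active)


def two_phase_seek(packets: List[Packet], target: int) -> int:
    """Return the index of the packet to start decoding from for target."""
    granules = [p.granule for p in packets]
    # Phase 1: find the next Kate packet at or after the target.
    first = bisect_left(granules, target)
    if first == len(packets):
        return first  # nothing left to decode after the target
    back_link = packets[first].earliest_active
    if back_link >= target:
        return first  # no earlier packet is still active
    # Phase 2: seek again, to the earliest still active packet.
    return bisect_left(granules, back_link)
```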

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
text encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.
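
To illustrate what removing the markup means for a client, here is a minimal Python sketch. It is not libkate's implementation: the function name `strip_markup` is hypothetical, and it assumes the markup is limited to simple HTML-like tags and entities.

```python
import html
import re

# Matches any HTML-like tag such as <b>, </i> or <span foreground="red">.
_TAG = re.compile(r"<[^>]*>")


def strip_markup(text: str) -> str:
    """Drop HTML-like tags and decode entities, leaving plain text."""
    return html.unescape(_TAG.sub("", text))


print(strip_markup("<b>Hello</b> &amp; <i>goodbye</i>"))
```

A text-to-speech client would feed the stripped string to its synthesizer and ignore styling altogether.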

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.

=== Bitstream format for floating point values ===

Floating point values are turned to a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the floating point values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over 32 bits raw writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
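
The packing idea can be sketched in Python as follows. This is an illustration under assumptions, not the libkate bit layout: the function names are hypothetical, values are treated as 32 bit patterns, and negative values (two's complement here) simply leave no shared head zeros.

```python
def to_fixed_16_16(x: float) -> int:
    """Quantize to 16.16 fixed point, returned as an unsigned 32 bit pattern."""
    return int(round(x * 65536)) & 0xFFFFFFFF


def common_zero_bits(values):
    """Return (head, tail): zero bits shared by all values at the top and bottom."""
    merged = 0
    for v in values:
        merged |= v
    if merged == 0:
        return 32, 0
    head = 32 - merged.bit_length()
    tail = (merged & -merged).bit_length() - 1
    return head, tail


def packed_bits(floats):
    """Return (head, tail, payload_bits) for a series of floats."""
    fixed = [to_fixed_16_16(x) for x in floats]
    head, tail = common_zero_bits(fixed)
    width = 32 - head - tail  # bits actually stored per value
    return head, tail, width * len(fixed)


print(packed_bits([0.5, 1.0, 1.5]))
```

The head and tail counts are written once, so the saving grows with the number of values sharing the same range and precision.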

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly so they can choose to turn some streams on and off.

Since this category is meant primarily for a machine to parse, category names will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics

Please remember the 15 character limit if proposing other categories.

Note that the list of categories is subject to change, and will likely
be replaced by new, more "identifier like" ones. The three ones above,
however, would be kept for backward compatibility as they're already used.
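
As a small illustration of the constraints above (ASCII only, at most 15 characters, zero terminated in a 16 byte field), a proposed category name could be checked like this. The helper name `encode_category` is hypothetical and not part of libkate.

```python
def encode_category(name: str) -> bytes:
    """Return the 16 byte, NUL padded field for a category name."""
    raw = name.encode("ascii")  # raises UnicodeEncodeError on non-ASCII input
    if len(raw) > 15:
        raise ValueError("category names are limited to 15 characters")
    return raw.ljust(16, b"\0")


for category in ("subtitles", "spu-subtitles", "lyrics"):
    print(category, encode_category(category))
```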

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data, and could also allow things like e-books which can be
either read or listened to from the same bitstream (I have seen no reference to this
being used anywhere, but I see no reason why the granule progression should be temporal,
and not user controlled, such as by using a "next" button which would bump a granule
position by a preset amount, simulating turning a page (this would be close to necessary
for text-to-speech, as the wall time duration of the spoken speech is not known in
advance to the Kate encoder, and can't be mapped to a time based granule progression)).
All text strings triggered consecutively between the two granule positions would then
be read in order.

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts, (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

I am also investigating SVG images. These allow for very small footprint images for simple
vector drawings, and could be very useful for things like background gradients below text.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder and a decoder are included in the tools directory. The encoder pulls its input from a custom
text based file format (see [[#The Kate file format|The Kate file format]]),
which is by no means meant to be part of the Kate bitstream specification itself,
from a SubRip (.srt) format file (the most common subtitle format I found, and a very basic one),
or from a lyrics (.lrc) format file.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, what motions may be applied to events may not be known in advance (for instance, for a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I was going that way. Still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the data packets themselves concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.
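
The layout above can be sketched in Python as a pair of helpers. The function names are hypothetical. Xiph style lacing codes each length as a run of 255 bytes followed by one byte below 255, the length being their sum.

```python
def xiph_lace(packets):
    """Pack header packets into a Matroska private data blob."""
    if not 1 <= len(packets) <= 256:
        raise ValueError("need between 1 and 256 packets")
    out = bytearray([len(packets) - 1])
    for p in packets[:-1]:  # the last length is implied, not stored
        out += b"\xff" * (len(p) // 255)
        out.append(len(p) % 255)
    for p in packets:
        out += p
    return bytes(out)


def xiph_unlace(data):
    """Split a private data blob back into its packets."""
    if not data:
        raise ValueError("empty private data")
    count = data[0] + 1
    pos, sizes = 1, []
    for _ in range(count - 1):
        size = 0
        while True:
            b = data[pos]
            pos += 1
            size += b
            if b != 255:
                break
        sizes.append(size)
    last = len(data) - pos - sum(sizes)
    if last < 0:
        raise ValueError("truncated private data")
    sizes.append(last)
    packets = []
    for s in sizes:
        packets.append(data[pos:pos + s])
        pos += s
    return packets
```

Because the number of Kate headers is not fixed, a demuxer should rely on the count in byte 0 rather than assume three packets as it might for Vorbis or Theora.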

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== Things I need to get feedback on ==

* is it a good idea to avoid floating point usage altogether ?

[[Category:Drafts]]
[[Category:Ogg Mappings]]

OggKate

2008-12-24T18:32:59Z

Ogg.k.ogg.k: fix category note layout

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is a codec for karaoke and text encapsulation for Ogg. Text and images can be
carried by a Kate stream, and animated. Most of the time, this would be multiplexed
with audio/video to carry subtitles, song lyrics (with or without karaoke data), etc,
but doesn't have to be. A possible use of a lone Kate stream would be an e-book.
Moreover, the motion feature gives Kate a powerful means to describe arbitrary curves, so
hand drawing of shapes can be achieved. This was originally meant for karaoke use, but
can be used for any purpose. Motions can be attached to various semantics, like position,
color, etc, so scrolling or fading text can be defined.

Simple uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed).

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on - I'd have been quicker to write Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this - besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex - though it is being "repurposed" as timed text now, bringing it closer to what I'm doing
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Style definitions header: a list of predefined styles to be referred to by data packets
*Region definitions header: a list of predefined regions to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This would be to help seeking code
find a kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it's timed by start granule, not end granule (as Theora and Vorbis are).
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see the granule encoding section.
*The EOS packet should have a granule pos higher than or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (though old decoders will correctly parse, but ignore the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(ie, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
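
A demuxer could apply the two rules above (MSB of the type byte, and the eight byte signature) along these lines. This is a minimal Python sketch with hypothetical function names, not code from libkate.

```python
# "\200" in octal is 0x80, the ID header packet type.
KATE_SIGNATURE = b"\x80kate\x00\x00\x00"


def is_kate_bos(packet: bytes) -> bool:
    """True if the first packet of a stream carries the Kate signature."""
    return packet[:8] == KATE_SIGNATURE


def classify(packet: bytes) -> str:
    """Return 'header' if the type byte has its MSB set, else 'data'."""
    if not packet:
        raise ValueError("empty packet")
    return "header" if packet[0] & 0x80 else "data"
```

Only header packets carry the magic, so `is_kate_bos` is meant for the first packet of a logical stream, while `classify` works on any packet once the stream is known to be Kate.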

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the ID header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x7f end packet (EOS)

The format described here is for bitstream version 0.x.
As of 19 December 2008, the latest bitstream version is 0.4.

For more detailed information, refer to the format documentation
in libkate (see URL below in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

0 1 2 3 |
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| packtype | Identifier char[7]: 'kate\0\0\0' | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| kate magic continued | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | version major | version minor | num headers | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| text encoding | directionality| reserved - 0 | granule shift | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| cw sh | canvas width | ch sh | canvas height | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate numerator | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate denominator | 28-31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (NUL terminated) | 32-35
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 36-39
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 40-43
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 44-47
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (NUL terminated) | 48-51
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 52-55
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 56-59
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 60-63
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

Language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously will not accommodate language tags
with lots of subtags.

Category is currently loosely defined, and I haven't yet found a nice way to
present it in a generic way, but it is meant for automatic classifying of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).
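
As a rough illustration of the ID header layout, the fixed fields can be pulled out of the 64 byte packet as below. This is a sketch, not libkate code: the function name and dictionary keys are hypothetical, the 32 bit granule rate fields are assumed to be little endian (check the libkate format documentation), and the canvas fields in bytes 16-19 are left undecoded.

```python
import struct


def parse_id_header(packet: bytes) -> dict:
    """Extract the byte aligned fields of a Kate ID header (packet type 0x80)."""
    if len(packet) < 64 or packet[:8] != b"\x80kate\x00\x00\x00":
        raise ValueError("not a Kate ID header")
    (_, major, minor, num_headers,
     text_encoding, directionality, _, granule_shift) = struct.unpack_from("8B", packet, 8)
    # Assumption: 32 bit values are stored little endian.
    gps_numerator, gps_denominator = struct.unpack_from("<II", packet, 24)
    language = packet[32:48].split(b"\0", 1)[0].decode("ascii")
    category = packet[48:64].split(b"\0", 1)[0].decode("ascii")
    return {
        "version": (major, minor),
        "num_headers": num_headers,
        "text_encoding": text_encoding,
        "directionality": directionality,
        "granule_shift": granule_shift,
        "granule_rate": (gps_numerator, gps_denominator),
        "language": language,
        "category": category,
    }
```

The `num_headers` value is what lets library code, rather than user code, know how many header packets to expect.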

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay

I have patches for the following with Kate support:
*MPlayer (for multiplexed per-language subtitles - all region/style info is ignored)
*xine (everything kate supports, as xine is my testbed)
*GStreamer
*Thoggen

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo,
though it is not quite yet API stable (though no major changes are expected).

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
- a base granule, in the high bits
- a granule offset, in the low bits

 +----------------+----------------+
 |      base      |     offset     |
 +----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The timestamp T of a given Kate packet is split into a base B and
offset O, and these are stored in the granulepos of that packet.
The split is done such that the B is the time of the earliest event
still active at the time, and the O is the time elapsed between B
and T. Thus, T = B + O. This mimics the way Theora stores its own
timestamps in granulepos, where the base acts as a keyframe, and
an offset acts as the position of an inter frame from the previous
keyframe. Since Kate allows time overlapping events, however, the
choice of the base to use is slightly more complex, as it may not
be the starting time of the previous event, if the stream contains
time overlapping events.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
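
The base/offset split and the conversion to time can be sketched as follows. The function names are hypothetical, and the sketch assumes the granule rate means granules per second, ie gps_numerator / gps_denominator.

```python
from fractions import Fraction


def split_granulepos(granulepos: int, granule_shift: int):
    """Return (base, offset) from a granulepos."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset


def make_granulepos(base: int, offset: int, granule_shift: int) -> int:
    """Combine a base and an offset, both in granule units."""
    if offset >> granule_shift:
        raise ValueError("offset does not fit in granule_shift bits")
    return (base << granule_shift) | offset


def granule_time(granulepos, granule_shift, gps_numerator, gps_denominator):
    """Timestamp T = B + O, converted to seconds as an exact fraction."""
    base, offset = split_granulepos(granulepos, granule_shift)
    return Fraction((base + offset) * gps_denominator, gps_numerator)


# An event at 7.5 s while the earliest still active event started at 5 s,
# with 1000 granules per second and 32 bits of offset.
gp = make_granulepos(5000, 2500, 32)
print(split_granulepos(gp, 32), granule_time(gp, 32, 1000, 1))
```

Clearing the offset bits of a granulepos yields the time of the earliest still active event, which is where a seek needs to land to recover everything on screen.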

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definition, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by the means of a series of curves (static points, segments, splines (catmull-rom, bezier, and b-splines)).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have attached semantics so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting
at the right time and for the right duration a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.
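
To make the pause and 'none' cases concrete, here is a toy Python evaluator. It is not the libkate data model: the tuple layout `(kind, duration, points)` is hypothetical, and only two point linear curves and 'none' curves are handled (splines are left out).

```python
def evaluate(motion, t):
    """Return the 2D point of a motion at time t, or None if there is none.

    motion is a list of (kind, duration, points) tuples played back to back.
    """
    start = 0.0
    for kind, duration, points in motion:
        if start <= t < start + duration:
            if kind == "none":
                return None  # discontinuity: no point is generated
            u = (t - start) / duration
            (x0, y0), (x1, y1) = points
            return (x0 + (x1 - x0) * u, y0 + (y1 - y0) * u)
        start += duration
    return None


motion = [
    ("linear", 2.0, [(0, 0), (1, 0)]),  # move right
    ("linear", 1.0, [(1, 0), (1, 0)]),  # pause: two equal points
    ("none", 1.0, []),                  # marker hidden
    ("linear", 2.0, [(1, 0), (1, 1)]),  # move down
]
print(evaluate(motion, 2.5), evaluate(motion, 3.5))
```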

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary, if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially when considering the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.
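
A converter from another subtitle format could emit this minimal form with something like the following Python sketch. The function names are hypothetical, times are whole seconds for simplicity, and since quote escaping in the Kate file format is not described here, the sketch refuses text containing double quotes rather than guess.

```python
def timestamp(seconds: int) -> str:
    """Format whole seconds as hh:mm:ss."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def kate_script(events) -> str:
    """Build a minimal Kate description from (start, end, text) tuples."""
    lines = ["kate {"]
    for start, end, text in events:
        if '"' in text:
            raise ValueError("quote escaping is not covered by this sketch")
        lines.append(f'  event {{ {timestamp(start)} --> {timestamp(end)} "{text}" }}')
    lines.append("}")
    return "\n".join(lines)


print(kate_script([(5, 10, "This is a text")]))
```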

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I very much dislike
editing XML by hand, as it's really hard to read. XML is really meant for machines to parse
text data generically, in a shared syntax but with possibly unknown semantics, and I need those
text representations to be easily editable.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from a style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place where a subtitle has been started, but has not been removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the granule originally sought. This implies support code in players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, which I am not certain of.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.
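
The two phase seek from the first option can be simulated on an in-memory list of packets. This is a sketch only: real code would seek within an Ogg file, and the (granule, earliest active granule) pairs and the function name are assumptions made for the illustration.

```python
from bisect import bisect_left


def two_phase_seek(packets, target):
    """Return the index of the packet decoding should start from.

    packets is a list of (granule, earliest_active_granule) pairs sorted by
    granule, standing in for the Kate data packets of a stream.
    """
    granules = [granule for granule, _ in packets]
    # Phase 1: find the first packet at or after the requested position.
    index = bisect_left(granules, target)
    if index == len(packets):
        return index  # no packet at or after the target
    earliest_active = packets[index][1]
    # Phase 2: if an earlier event is still active at that packet,
    # seek again, back to where that event was issued.
    if earliest_active < target:
        return bisect_left(granules, earliest_active)
    return index


stream = [(0, 0), (5, 0), (12, 5), (20, 20)]
print(two_phase_seek(stream, 10))
```

Note that only events still active at the packet found in the first phase are recovered this way, which is one reason regular keepalive packets help.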

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
text encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.

=== Bitstream format for floating point values ===

Floating point values are turned into a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the floating point values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over 32 bits raw writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
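
The idea can be sketched as follows. This is not the libkate code: function names are invented, the exact bit layout written to the stream is not shown, and since sign handling is not described here the sketch only accepts non-negative values below 32768.

```python
def to_fixed_16_16(value):
    """Quantize a float to 16.16 fixed point (non-negative values only)."""
    if value < 0 or value >= 32768:
        raise ValueError("sketch only handles 0 <= value < 32768")
    return int(round(value * 65536))


def common_zero_bits(values):
    """Count the zero bits shared by all 32 bit values at head and tail."""
    merged = 0
    for v in values:
        merged |= v
    if merged == 0:
        return 32, 0  # all values are zero: nothing is left to store
    head = 32 - merged.bit_length()
    tail = (merged & -merged).bit_length() - 1
    return head, tail


def pack(floats):
    """Return (head, tail, width, payload) for a series of floats."""
    fixed = [to_fixed_16_16(f) for f in floats]
    head, tail = common_zero_bits(fixed)
    width = 32 - head - tail  # bits actually stored per value
    return head, tail, width, [v >> tail for v in fixed]


def unpack(tail, payload):
    """Rebuild the (quantized) floats from the stored remainder bits."""
    return [(p << tail) / 65536 for p in payload]


head, tail, width, payload = pack([0.5, 0.25, 1.0])
print(head, tail, width, payload, unpack(tail, payload))
```

In this example only 3 bits per value remain to be stored instead of 32, at the cost of storing the two zero bit counts once.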

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly to allow to choose to turn some streams on and off.

Since this category is meant primarily for a machine to parse, it will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics

Please remember the 15 character limit if proposing other categories.

Note that the list of categories is subject to change, and will likely
be replaced by new, more "identifier like" ones. The three ones above,
however, would be kept for backward compatibility as they're already used.

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data, and could also allow things like e-books which can be
either read or listened to from the same bitstream (I have seen no reference to this
being used anywhere, but I see no reason why the granule progression should be temporal,
and not user controlled, such as by using a "next" button which would bump a granule
position by a preset amount, simulating turning a page (this would be close to necessary
for text-to-speech, as the wall time duration of the spoken speech is not known in
advance to the Kate encoder, and can't be mapped to a time based granule progression)).
All text strings triggered consecutively between the two granule positions would then
be read in order.

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts, (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.
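
As a small illustration of "as many bits per pixel as can address the palette", the pixel depth follows directly from the palette size. The function name is made up, and treating 2 as the smallest palette size is an assumption of this sketch.

```python
def bits_per_pixel(palette_size):
    """Bits needed to index a palette whose size is a power of 2 (2..256)."""
    is_power_of_two = palette_size & (palette_size - 1) == 0
    if not 2 <= palette_size <= 256 or not is_power_of_two:
        raise ValueError("palette size must be a power of 2 between 2 and 256")
    return palette_size.bit_length() - 1


for size in (2, 16, 256):
    print(size, bits_per_pixel(size))
```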

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

I am also investigating SVG images. These allow for very small footprint images for simple
vector drawings, and could be very useful for things like background gradients below text.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder and a decoder are included in the tools directory. The encoder takes its input from a custom
text based file format (see [[#The Kate file format|The Kate file format]]),
which is by no means meant to be part of the Kate bitstream specification itself,
from a SubRip (.srt) format file (the most common subtitle format I found, and a very basic one),
or from a lyrics (.lrc) format file.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, what motions may be applied to events may not be known in advance (for instance, for a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I was going that way. Still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the data packets themselves concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.
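
A minimal sketch of splitting such private data back into packets follows. It assumes the usual xiph lacing convention, where each length is the sum of a run of bytes ended by the first byte below 255; the function name is made up for the example.

```python
def split_xiph_laced(private):
    """Split xiph-laced codec private data into its packets."""
    if not private:
        raise ValueError("empty private data")
    count = private[0] + 1  # byte 0 holds the number of packets, minus 1
    pos = 1
    sizes = []
    for _ in range(count - 1):  # the last packet's length is not stored
        size = 0
        while True:
            if pos >= len(private):
                raise ValueError("truncated lacing")
            byte = private[pos]
            pos += 1
            size += byte
            if byte != 255:  # a byte below 255 ends this length
                break
        sizes.append(size)
    # The last length is deduced from the total size.
    last = len(private) - pos - sum(sizes)
    if last < 0:
        raise ValueError("lengths exceed private data size")
    sizes.append(last)
    packets = []
    for size in sizes:
        packets.append(private[pos:pos + size])
        pos += size
    return packets


data = bytes([2, 3, 255, 1]) + b"abc" + b"x" * 256 + b"tail"
print([len(p) for p in split_xiph_laced(data)])
```

Since the number of Kate headers is not fixed, code like this should take the packet count from byte 0 rather than expect a set number of headers.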

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== Things I need to get feedback on ==

* is it a good idea to avoid floating point usage altogether ?

[[Category:Drafts]]
[[Category:Ogg Mappings]]

OggKate

2008-12-19T21:33:09Z

Ogg.k.ogg.k: bitstream verison

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is a codec for karaoke and text encapsulation for Ogg. Text and images can be
carried by a Kate stream, and animated. Most of the time, this would be multiplexed
with audio/video to carry subtitles, song lyrics (with or without karaoke data), etc,
but doesn't have to be. A possible use of a lone Kate stream would be an e-book.
Moreover, the motion feature gives Kate a powerful means to describe arbitrary curves, so
hand drawing of shapes can be achieved. This was originally meant for karaoke use, but
can be used for any purpose. Motions can be attached to various semantics, like position,
color, etc, so scrolling or fading text can be defined.

Simple uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed).

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on - I'd have been quicker to write Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this - besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex - though it is being "repurposed" as timed text now, bringing it closer to what I'm doing
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Region definitions header: a list of predefined regions to be referred to by data packets
*Style definitions header: a list of predefined styles to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This would be to help seeking code
find a Kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it's timed by start granule, not end granule (as Theora and Vorbis are).
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see the granule encoding section.
*The EOS packet should have a granule pos higher or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (though old decoders will correctly parse, but ignore the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(ie, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
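
These rules translate into a few byte comparisons. The following is a sketch with made-up function names, not part of libkate:

```python
KATE_MAGIC = b"kate\0\0\0"                  # bytes 1 to 7 of every header packet
ID_HEADER_SIGNATURE = b"\x80" + KATE_MAGIC  # "\200kate\0\0\0"


def is_header_packet(packet):
    """True if the packet type has its MSB set and the Kate magic follows."""
    return len(packet) >= 8 and (packet[0] & 0x80) != 0 and packet[1:8] == KATE_MAGIC


def is_data_packet(packet):
    """True if the packet type has its MSB cleared (no magic to check)."""
    return len(packet) >= 1 and (packet[0] & 0x80) == 0


def is_kate_stream(first_packet):
    """Recognize a Kate stream from the first eight bytes of its first packet."""
    return first_packet[:8] == ID_HEADER_SIGNATURE


print(is_kate_stream(b"\x80kate\0\0\0" + bytes(56)))
```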

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the ID header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x7f end packet (EOS)

The format described here is for bitstream version 0.x.
As of 19 December 2008, the latest bitstream version is 0.4.

For more detailed information, refer to the format documentation
in libkate (see URL below in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

0 1 2 3 |
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| packtype | Identifier char[7]: 'kate\0\0\0' | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| kate magic continued | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | version major | version minor | num headers | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| text encoding | directionality| reserved - 0 | granule shift | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| cw sh | canvas width | ch sh | canvas height | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate numerator | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate denominator | 28-31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (NUL terminated) | 32-35
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 36-39
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 40-43
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 44-47
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (NUL terminated) | 48-51
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 52-55
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 56-59
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 60-63
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

Language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously will not accommodate language tags
with lots of subtags.

Category is currently loosely defined, and I haven't found yet a nice way to
present it in a generic way, but is meant for automatic classifying of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).
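
A sketch of reading the byte-aligned fields of this ID header follows. It is not the libkate parser: the function name is invented, the 32 bit granule rate fields are assumed to be little-endian (please check against the libkate format documentation), and the canvas size fields are left undecoded as their bit packing is not detailed on this page.

```python
import struct


def parse_id_header(packet):
    """Parse the byte-aligned fields of a 64 byte Kate ID header."""
    if len(packet) < 64 or packet[:8] != b"\x80kate\0\0\0":
        raise ValueError("not a Kate ID header")
    # Bytes 8 to 15 are eight single byte fields, two of them reserved.
    (_reserved0, major, minor, num_headers,
     text_encoding, directionality, _reserved1,
     granule_shift) = struct.unpack_from("8B", packet, 8)
    # ASSUMPTION: little-endian 32 bit values for the granule rate.
    gps_numerator, gps_denominator = struct.unpack_from("<II", packet, 24)
    # Language and category are NUL terminated ASCII, 16 bytes each.
    language = packet[32:48].split(b"\0", 1)[0].decode("ascii")
    category = packet[48:64].split(b"\0", 1)[0].decode("ascii")
    return {
        "version": (major, minor),
        "num_headers": num_headers,
        "text_encoding": text_encoding,
        "directionality": directionality,
        "granule_shift": granule_shift,
        "gps_numerator": gps_numerator,
        "gps_denominator": gps_denominator,
        "language": language,
        "category": category,
    }
```

A header built by hand for testing (all values here are arbitrary) can be passed straight to this function; in real code the bytes would come from the first packet of the Ogg logical stream.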

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay

I have patches for the following with Kate support:
*MPlayer (for multiplexed per-language subtitles - all region/style info is ignored)
*xine (everything kate supports, as xine is my testbed)
*GStreamer
*Thoggen

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo,
though it is not quite yet API stable (though no major changes are expected).

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
- a base granule, in the high bits
- a granule offset, in the low bits

+----------------+----------------+
| base | offset |
+----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The timestamp T of a given Kate packet is split into a base B and
offset O, and these are stored in the granulepos of that packet.
The split is done such that the B is the time of the earliest event
still active at the time, and the O is the time elapsed between B
and T. Thus, T = B + O. This mimics the way Theora stores its own
timestamps in granulepos, where the base acts as a keyframe, and
an offset acts as the position of an inter frame relative to the previous
keyframe. Since Kate allows time overlapping events, however, the
choice of the base to use is slightly more complex, as it may not
be the starting time of the previous event, if the stream contains
time overlapping events.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
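
The split described above can be sketched as follows. This is a minimal illustration in Python rather than libkate code: the function names are hypothetical, and only the three kate_info fields named above are taken from the text. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def make_granulepos(base, offset, granule_shift):
    """Pack a base (high bits) and an offset (low bits) into a granulepos."""
    if offset >= (1 << granule_shift):
        raise ValueError("offset does not fit in granule_shift bits")
    return (base << granule_shift) | offset


def split_granulepos(granulepos, granule_shift):
    """Recover (base, offset) from a granulepos."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset


def granule_time(granulepos, granule_shift, gps_numerator, gps_denominator):
    """Return the timestamp T = B + O in seconds, as an exact fraction."""
    base, offset = split_granulepos(granulepos, granule_shift)
    return Fraction((base + offset) * gps_denominator, gps_numerator)
```

With a granule rate of 1000/1 and a granule_shift of 32, a packet at 7.5 seconds whose earliest still active event started at 5 seconds would carry a base of 5000 and an offset of 2500.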

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit values at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definitions, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by the means of a series of curves (static points, segments, splines (catmull-rom, bezier, and b-splines)).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have attached semantics so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting
at the right time and for the right duration a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.
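
How a motion made of consecutive curves yields a point, a gap, or a pause can be sketched as below. This is a simplified model, not the libkate data structures: the tuple layout `(kind, duration, points)` and the function name are hypothetical, linear curves are assumed to have exactly two points, and splines are left out.

```python
def motion_point(curves, t):
    """Return the (x, y) point of a motion at time t, or None.

    curves is a list of (kind, duration, points) tuples played one after
    the other, with kind being "none", "static" or "linear".
    """
    start = 0.0
    for kind, duration, points in curves:
        if start <= t < start + duration:
            if kind == "none":
                # Discontinuity: no 2D point is generated here.
                return None
            if kind == "static":
                return points[0]
            if kind == "linear":
                (x0, y0), (x1, y1) = points
                u = (t - start) / duration
                return (x0 + (x1 - x0) * u, y0 + (y1 - y0) * u)
            raise ValueError("unsupported curve type: " + kind)
        start += duration
    # Outside the lifetime of the motion.
    return None


motion = [
    ("linear", 2.0, [(0.0, 0.0), (1.0, 0.0)]),  # move right
    ("linear", 1.0, [(1.0, 0.0), (1.0, 0.0)]),  # pause: two equal points
    ("none", 1.0, []),                          # marker hidden
    ("linear", 2.0, [(1.0, 0.0), (0.0, 0.0)]),  # move back
]
```

The second curve is the pause described above: a linear interpolation between two equal points returns the same position for its whole duration.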

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.
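
As an illustration of the frame and region mappings, the sketch below scales a 0-1 point to a pixel size. The mapping names are written as plain strings and the function is hypothetical. Only scaling is shown, as in the description above; whether a region's origin is also added is not covered here.

```python
def map_point(point, mapping, frame_size, region_size):
    """Scale a 0-1 point according to a mapping ("none", "frame", "region")."""
    x, y = point
    if mapping == "none":
        return (x, y)
    if mapping == "frame":
        width, height = frame_size
    elif mapping == "region":
        width, height = region_size
    else:
        raise ValueError("unknown mapping: " + mapping)
    return (x * width, y * height)
```

The same curve can thus be reused for a 640x480 frame or a 320x100 region without changing its control points.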

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary, if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially when considering that the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.
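
A script can produce such a minimal description from a list of timed texts. The sketch below is only an illustration: the helper names are hypothetical, times are assumed to be whole seconds, and texts containing a double quote are rejected because quoting rules are not described here.

```python
def format_time(seconds):
    """Format a whole number of seconds as hh:mm:ss."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def kate_description(events):
    """Build a minimal Kate description from (start, end, text) tuples."""
    lines = ["kate {"]
    for start, end, text in events:
        if '"' in text:
            raise ValueError("quoting of double quotes is not handled here")
        lines.append(
            f'  event {{ {format_time(start)} --> {format_time(end)} "{text}" }}'
        )
    lines.append("}")
    return "\n".join(lines)


print(kate_description([(5, 10, "This is a text")]))
```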

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I tend to dislike very
much editing XML by hand, as it's really hard to read. XML is really meant for machines to parse
generically text data in a shared syntax but with possibly unknown semantics, and I need those
text representations to be editable easily.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from a style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place where a subtitle has been started, but has not been removed yet. Pure streaming doesn't have this problem, as the decoder remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earlier still active packet is less than the original seeked granule. This implies support code on players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, of which I am not certain.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.
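
The two phase seek from the first option can be sketched on an in-memory list of granule positions. This is a toy model under stated assumptions: a real player seeks through Ogg pages rather than a Python list, the function names are hypothetical, and the granulepos layout is the base/offset split described in the granule encoding section.

```python
import bisect


def linear_granule(granulepos, granule_shift):
    """Return base + offset, ie the packet time in granule units."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base + offset


def two_phase_seek(granuleposes, granule_shift, target):
    """Return the index of the packet to start decoding from.

    granuleposes holds the granulepos of each data packet, in stream
    order; target is the wanted time in granule units.
    """
    linear = [linear_granule(g, granule_shift) for g in granuleposes]
    # Phase 1: find the next Kate packet at or after the target.
    first = bisect.bisect_left(linear, target)
    if first == len(linear):
        return first
    # Phase 2: if an earlier event is still active, seek back to it.
    base = granuleposes[first] >> granule_shift
    if base >= target:
        return first
    return bisect.bisect_left(linear, base)
```

Note that, as sketched, phase 1 only learns about still active events from the next packet found, so an event that ends before any later packet is emitted would not be picked up.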

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
language encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.

=== Bitstream format for floating point values ===

Floating point values are turned into a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the floating point values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over 32 bits raw writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
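
The idea of storing shared head and tail zero bits once can be sketched as follows. This is not the libkate bit layout: the function names are hypothetical, and the sketch assumes non negative values only, leaving sign handling aside.

```python
def to_fixed_16_16(value):
    """Quantize a float to 16.16 fixed point."""
    return int(round(value * 65536))


def pack_plan(values):
    """Return (head, tail, remainders) for a series of non negative floats.

    head and tail are the numbers of zero bits shared by all 32 bit
    values at the top and bottom; remainders are the bits in between.
    """
    fixed = [to_fixed_16_16(v) for v in values]
    if any(f < 0 or f >= (1 << 32) for f in fixed):
        raise ValueError("this sketch only handles values in [0, 65536)")
    merged = 0
    for f in fixed:
        merged |= f
    if merged == 0:
        return 32, 0, [0] * len(fixed)
    head = 32 - merged.bit_length()
    tail = (merged & -merged).bit_length() - 1
    return head, tail, [f >> tail for f in fixed]


def unpack(tail, remainders):
    """Rebuild the floats from the tail size and the remainders."""
    return [(r << tail) / 65536 for r in remainders]
```

For the values 0.5, 1.25 and 3.0, each value needs only 32 - head - tail = 4 bits instead of 32, which illustrates where the space saving comes from.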

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly to allow to choose to turn some streams on and off.

Since this category is meant primarily for a machine to parse, it will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics

Please remember the 15 character limit if proposing other categories.

* Note that the list of categories is subject to change, and will likely
be replaced by new, more "identifier like" ones. The three ones above,
however, would be kept for backward compatibility as they're already used.

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data, and could also allow things like e-books which can be
either read or listened to from the same bitstream (I have seen no reference to this
being used anywhere, but I see no reason why the granule progression should be temporal,
and not user controlled, such as by using a "next" button which would bump a granule
position by a preset amount, simulating turning a page (this would be close to necessary
for text-to-speech, as the wall time duration of the spoken speech is not known in
advance to the Kate encoder, and can't be mapped to a time based granule progression)).
All text strings triggered consecutively between the two granule positions would then
be read in order.

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with the ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts, (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.
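
The relation between palette size and pixel depth can be written down in a few lines. This is a sketch that takes the power of 2 sizing above literally; the function names are hypothetical and the actual storage layout of pixel data is not shown.

```python
def bits_per_pixel(palette_size):
    """Bits needed to address a palette sized in powers of 2, up to 256."""
    if not 1 <= palette_size <= 256 or palette_size & (palette_size - 1):
        raise ValueError("palette size must be a power of 2 up to 256")
    return palette_size.bit_length() - 1


def bitmap_data_bits(width, height, palette_size):
    """Total number of bits of pixel data for a paletted bitmap."""
    return width * height * bits_per_pixel(palette_size)
```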

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

I am also investigating SVG images. These allow for very small footprint images for simple
vector drawings, and could be very useful for things like background gradients below text.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder and a decoder are included in the tools directory. The encoder pulls its input from a custom
text based file format (see [[#The Kate file format|The Kate file format]]),
which is by no means meant to be part of the Kate bitstream specification itself,
from a SubRip (.srt) format file (the most common subtitle format I found, and a very basic one),
or from a lyrics (.lrc) format file.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, what motions may be applied to events may not be known in advance (for instance, for a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I was going that way. Still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the data packets themselves concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.
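
Splitting the private data back into header packets can be sketched as below. The function name is hypothetical. Xiph style lacing codes each length as a run of bytes that are summed, where a byte of 255 means another byte follows.

```python
def split_xiph_laced(private_data):
    """Split xiph-laced private data into a list of packets."""
    if not private_data:
        raise ValueError("empty private data")
    num_packets = private_data[0] + 1
    pos = 1
    sizes = []
    # Lengths of all packets but the last are stored explicitly.
    for _ in range(num_packets - 1):
        size = 0
        while True:
            if pos >= len(private_data):
                raise ValueError("truncated lacing")
            byte = private_data[pos]
            pos += 1
            size += byte
            if byte != 255:
                break
        sizes.append(size)
    # The last length is deduced from what remains.
    remaining = len(private_data) - pos - sum(sizes)
    if remaining < 0:
        raise ValueError("packet sizes exceed private data size")
    sizes.append(remaining)
    packets = []
    for size in sizes:
        packets.append(private_data[pos:pos + size])
        pos += size
    return packets
```

Since the number of Kate headers is not fixed, the caller should use the packet count found in byte 0 rather than expect a set number of headers.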

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== Things I need to get feedback on ==

* is it a good idea to avoid floating point usage altogether ?

[[Category:Drafts]]
[[Category:Ogg Mappings]]

OggKate

2008-12-17T10:53:29Z

Ogg.k.ogg.k: Explain what base and offset represent in the granpos encoding

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is a codec for karaoke and text encapsulation for Ogg. Text and images can be
carried by a Kate stream, and animated. Most of the time, this would be multiplexed
with audio/video to carry subtitles, song lyrics (with or without karaoke data), etc,
but doesn't have to be. A possible use of a lone Kate stream would be an e-book.
Moreover, the motion feature gives Kate a powerful means to describe arbitrary curves, so
hand drawing of shapes can be achieved. This was originally meant for karaoke use, but
can be used for any purpose. Motions can be attached to various semantics, like position,
color, etc, so scrolling or fading text can be defined.

Simple uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed).

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on - I'd have been quicker writing Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this - besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex - though it is being "repurposed" as timed text now, bringing it closer to what I'm doing
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Style definitions header: a list of predefined styles to be referred to by data packets
*Region definitions header: a list of predefined regions to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This would be to help seeking code
find a kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it's timed by start granule, not end granule (as Theora and Vorbis).
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see the granule encoding section.
*The EOS packet should have a granule pos higher or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (though old decoders will correctly parse, but ignore the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(ie, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
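
The checks described above fit in a few lines. This is a minimal sketch with hypothetical function names; packets are taken to be Python bytes objects holding one Kate packet each.

```python
KATE_MAGIC = b"kate\0\0\0"
KATE_SIGNATURE = b"\x80" + KATE_MAGIC  # same bytes as "\200kate\0\0\0"


def is_header_packet(packet):
    """A packet type with the MSB set indicates a header packet."""
    return len(packet) >= 1 and (packet[0] & 0x80) != 0


def has_kate_magic(packet):
    """Header packets carry the magic from byte offset 1 to 7."""
    return is_header_packet(packet) and packet[1:8] == KATE_MAGIC


def is_kate_stream(first_packet):
    """Compare the first eight bytes of the first packet to the signature."""
    return first_packet[:8] == KATE_SIGNATURE
```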

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the ID header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x7f end packet (EOS)
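
The table above maps directly onto a lookup. The sketch below uses hypothetical names and returns a descriptive string; unknown types are reported rather than rejected, since such packets are left for future expansion.

```python
HEADER_TYPES = {
    0x80: "ID header",
    0x81: "Vorbis comment header",
    0x82: "regions list header",
    0x83: "styles list header",
    0x84: "curves list header",
    0x85: "motions list header",
    0x86: "palettes list header",
    0x87: "bitmaps list header",
    0x88: "font ranges and mappings header",
}

DATA_TYPES = {
    0x00: "text data",
    0x01: "keepalive",
    0x7F: "end packet",
}


def describe_packet(packet):
    """Name a Kate packet from its first byte."""
    if not packet:
        raise ValueError("empty packet")
    packet_type = packet[0]
    if packet_type & 0x80:
        return HEADER_TYPES.get(packet_type, "unknown header packet")
    return DATA_TYPES.get(packet_type, "unknown data packet")
```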

The format described here is for bitstream version 0.x.

For more detailed information, refer to the format documentation
in libkate (see URL below in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

0 1 2 3 |
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| packtype | Identifier char[7]: 'kate\0\0\0' | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| kate magic continued | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | version major | version minor | num headers | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| text encoding | directionality| reserved - 0 | granule shift | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| cw sh | canvas width | ch sh | canvas height | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate numerator | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate denominator | 28-31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (NUL terminated) | 32-35
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 36-39
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 40-43
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 44-47
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (NUL terminated) | 48-51
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 52-55
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 56-59
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 60-63
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

Language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously will not accommodate language tags
with lots of subtags.

Category is currently loosely defined, and I haven't yet found a nice way to
present it in a generic way, but it is meant for automatic classification of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles in a set of languages, and some others are commentary in a
possibly different set of languages, etc).
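
As an illustration of the layout above, here is a minimal sketch of reading the fixed 64 byte ID header. The function name and the returned dictionary are hypothetical and are not part of libkate. The byte order of the two 32 bit granule rate fields is assumed to be little endian, which should be checked against the libkate format documentation. The canvas fields at bytes 16-19 are skipped, since their bit packing is not detailed here.

```python
import struct

KATE_MAGIC = b"kate\0\0\0"  # bytes 1-7 of every Kate header packet


def parse_kate_id_header(packet: bytes) -> dict:
    """Decode the byte-aligned fields of a Kate ID header (packet type 0x80)."""
    if len(packet) < 64:
        raise ValueError("a Kate ID header is 64 bytes long")
    if packet[0] != 0x80 or packet[1:8] != KATE_MAGIC:
        raise ValueError("not a Kate ID header")
    # Assumption: 32 bit fields are stored little endian.
    gps_numerator, gps_denominator = struct.unpack_from("<II", packet, 24)
    return {
        "version_major": packet[9],
        "version_minor": packet[10],
        "num_headers": packet[11],
        "text_encoding": packet[12],
        "directionality": packet[13],
        "granule_shift": packet[15],
        "gps_numerator": gps_numerator,
        "gps_denominator": gps_denominator,
        # Both strings occupy 16 bytes and are NUL terminated ASCII.
        "language": packet[32:48].split(b"\0", 1)[0].decode("ascii"),
        "category": packet[48:64].split(b"\0", 1)[0].decode("ascii"),
    }
```

A demuxer could call this on the first packet of each Ogg logical stream and treat a ValueError as "not Kate".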

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay

I have patches for the following with Kate support:
*MPlayer (for multiplexed per-language subtitles - all region/style info is ignored)
*xine (everything kate supports, as xine is my testbed)
*GStreamer
*Thoggen

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo,
though it is not quite yet API stable (though no major changes are expected).

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
- a base granule, in the high bits
- a granule offset, in the low bits

+----------------+----------------+
| base | offset |
+----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The timestamp T of a given Kate packet is split into a base B and
offset O, and these are stored in the granulepos of that packet.
The split is done such that the B is the time of the earliest event
still active at the time, and the O is the time elapsed between B
and T. Thus, T = B + O. This mimics the way Theora stores its own
timestamps in granulepos, where the base acts as a keyframe, and
an offset acts as the position of an intra frame from the previous
keyframe. Since Kate allows time overlapping events, however, the
choice of the base to use is slightly more complex, as it may not
be the starting time of the previous event, if the stream contains
time overlapping events.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
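
The split described above can be sketched as follows. The function names are hypothetical and are not libkate calls. The sketch assumes the granule rate is expressed in granules per second, so that a granule count is converted to seconds by multiplying by gps_denominator and dividing by gps_numerator.

```python
from fractions import Fraction


def make_granulepos(base: int, offset: int, granule_shift: int) -> int:
    """Pack a base (high bits) and an offset (low bits) into one granulepos."""
    if offset < 0 or offset >= (1 << granule_shift):
        raise ValueError("offset does not fit in granule_shift bits")
    return (base << granule_shift) | offset


def split_granulepos(granulepos: int, granule_shift: int) -> tuple:
    """Return (base, offset) for a granulepos."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset


def granulepos_to_seconds(granulepos: int, granule_shift: int,
                          gps_numerator: int, gps_denominator: int) -> Fraction:
    """Timestamp T = B + O, converted from granule units to seconds."""
    base, offset = split_granulepos(granulepos, granule_shift)
    return Fraction((base + offset) * gps_denominator, gps_numerator)
```

With a granule rate of 1000/1 and a shift of 32, a base of 5000 and an offset of 250 describe a packet at 5.25 seconds whose earliest still active event started at 5 seconds.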

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit values at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definition, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by the means of a series of curves (static points, segments, splines (catmull-rom, bezier, and b-splines)).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have attached semantics so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting
at the right time and for the right duration a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.
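
To make the mapping idea concrete, here is a small sketch of how a client might turn a normalized 0-1 curve point into pixels. The function names and the region tuple layout are hypothetical. The region variant also adds the region origin, which is an assumption on top of the scaling described above.

```python
def map_to_frame(point, frame_width, frame_height):
    """Scale a 0-1 point to the size of the current video frame."""
    x, y = point
    return (x * frame_width, y * frame_height)


def map_to_region(point, region):
    """Scale a 0-1 point to a region given as (left, top, width, height).

    Offsetting by the region origin is an assumption of this sketch.
    """
    x, y = point
    left, top, width, height = region
    return (left + x * width, top + y * height)
```

The same curve can then be reused unchanged whatever the pixel size of the frame or region turns out to be at playback time.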

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary, if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially when considering that the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.
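
As a sketch of how such a description could be generated from a list of timed lines, the following builds the minimal syntax shown above. The helper names are hypothetical. Only whole second timestamps are produced, and text containing double quotes or newlines is rejected, since quoting rules are not described here.

```python
def format_time(seconds: int) -> str:
    """Format a whole number of seconds as HH:MM:SS."""
    hours, rest = divmod(seconds, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def kate_description(events) -> str:
    """Build a minimal Kate description from (start, end, text) tuples."""
    lines = ["kate {"]
    for start, end, text in events:
        if '"' in text or "\n" in text:
            # Escaping rules are not covered by this sketch.
            raise ValueError("text with quotes or newlines is not handled")
        lines.append(
            f'  event {{ {format_time(start)} --> {format_time(end)} "{text}" }}'
        )
    lines.append("}")
    return "\n".join(lines) + "\n"
```

Calling kate_description([(5, 10, "This is a text")]) yields the same single event as the example above.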

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I tend to dislike very
much editing XML by hand, as it's really hard to read. XML is really meant for machines to parse
generically text data in a shared syntax but with possibly unknown semantics, and I need those
text representations to be editable easily.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from a style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place when a subtitle has been started, but is not removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the original seeked granule. This implies support code on players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, which I am not certain.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
language encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.

=== Bitstream format for floating point values ===

Floating point values are turned to a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the floating point values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over 32 bits raw writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
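
The idea can be sketched as below. This illustrates the principle only and is not the exact bit layout libkate writes. The function names are hypothetical, and values are assumed to be held as 32 bit two's complement words once quantized.

```python
def to_fixed_16_16(value: float) -> int:
    """Quantize a float to 16.16 fixed point, as an unsigned 32 bit word."""
    fixed = int(round(value * 65536))
    if not -(1 << 31) <= fixed < (1 << 31):
        raise ValueError("value out of range for 16.16 fixed point")
    return fixed & 0xFFFFFFFF


def common_zero_bits(words) -> tuple:
    """Count zero bits shared by all words at the head and at the tail."""
    combined = 0
    for word in words:
        combined |= word
    if combined == 0:
        return 32, 0  # every value is zero, nothing left to store
    head = 32 - combined.bit_length()
    tail = (combined & -combined).bit_length() - 1
    return head, tail


def packed_size_bits(values) -> int:
    """Bits needed for the remainder bits of all values (head/tail counts excluded)."""
    words = [to_fixed_16_16(v) for v in values]
    head, tail = common_zero_bits(words)
    return (32 - head - tail) * len(words)
```

For the values 0.5, 1.0 and 2.5, the words share 14 leading and 15 trailing zero bits, so each value needs only 3 bits instead of 32.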

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly to allow to choose to turn some streams on and off.

Since this category is meant primarily for a machine to parse, they will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics

Please remember the 15 character limit if proposing other categories.

* Note that the list of categories is subject to change, and will likely
be replaced by new, more "identifier like" ones. The three ones above,
however, would be kept for backward compatibility as they're already used.

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data, and could also allow things like e-books which can be
either read or listened to from the same bitstream (I have seen no reference to this
being used anywhere, but I see no reason why the granule progression should be temporal,
and not user controlled, such as by using a "next" button which would bump a granule
position by a preset amount, simulating turning a page (this would be close to necessary
for text-to-speech, as the wall time duration of the spoken speech is not known in
advance to the Kate encoder, and can't be mapped to a time based granule progression)).
All text strings triggered consecutively between the two granule positions would then
be read in order.

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts, (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.
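
The relation between palette size and pixel depth can be sketched as follows. The function name is hypothetical, and the sketch assumes a palette has at least 2 entries.

```python
def bits_per_pixel(palette_size: int) -> int:
    """Bits needed per pixel to address a palette sized as a power of 2."""
    is_power_of_two = palette_size & (palette_size - 1) == 0
    if palette_size < 2 or palette_size > 256 or not is_power_of_two:
        raise ValueError("palette size must be a power of 2 between 2 and 256")
    return (palette_size - 1).bit_length()
```

A 16 color palette therefore pairs with 4 bit pixels, and a full 256 color palette with 8 bit pixels.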

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

I am also investigating SVG images. These allow for very small footprint images for simple
vector drawings, and could be very useful for things like background gradients below text.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder and a decoder are included in the tools directory. The encoder pulls its input from a custom
text based file format (see [[#The Kate file format|The Kate file format]]),
which is by no means meant to be part of the Kate bitstream specification itself,
from a SubRip (.srt) format file (the most common subtitle format I found, and a very basic one),
or from a lyrics (.lrc) format file.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, what motions may be applied to events may not be known in advance (for instance, for a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I was going that way. Still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the header packets themselves, concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.
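
The layout above can be unpacked with a short routine such as this sketch. The function name is hypothetical. Xiph style lacing is taken to mean that each length is written as a run of 255 valued bytes followed by one byte below 255, the length being their sum.

```python
def split_xiph_laced(private: bytes) -> list:
    """Split Matroska private data into the xiph-laced packets it holds."""
    if not private:
        raise ValueError("empty private data")
    packet_count = private[0] + 1  # byte 0 holds the count minus 1
    pos = 1
    sizes = []
    # Lengths are stored for all packets except the last one.
    for _ in range(packet_count - 1):
        size = 0
        while True:
            if pos >= len(private):
                raise ValueError("truncated lacing data")
            value = private[pos]
            pos += 1
            size += value
            if value != 255:
                break
        sizes.append(size)
    # The last length is whatever remains after the known packets.
    last_size = len(private) - pos - sum(sizes)
    if last_size < 0:
        raise ValueError("packet lengths exceed private data size")
    sizes.append(last_size)
    packets = []
    for size in sizes:
        packets.append(private[pos:pos + size])
        pos += size
    return packets
```

Since the number of Kate headers is not fixed, the caller should rely on the count stored in byte 0 rather than expect a set number of packets, as one might for Vorbis or Theora.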

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== Things I need to get feedback on ==

* is it a good idea to avoid floating point usage altogether ?

[[Category:Drafts]]
[[Category:Ogg Mappings]]

OggKate

2008-12-15T19:25:26Z

Ogg.k.ogg.k: mention category is likely to use Silvia's new identifiers

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is a codec for karaoke and text encapsulation for Ogg. Text and images can be
carried by a Kate stream, and animated. Most of the time, this would be multiplexed
with audio/video to carry subtitles, song lyrics (with or without karaoke data), etc,
but doesn't have to be. A possible use of a lone Kate stream would be an e-book.
Moreover, the motion feature gives Kate a powerful means to describe arbitrary curves, so
hand drawing of shapes can be achieved. This was originally meant for karaoke use, but
can be used for any purpose. Motions can be attached to various semantics, like position,
color, etc, so scrolling or fading text can be defined.

Simple uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed).

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on - I'd been quicker to write Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this - besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex - though it is being "repurposed" as timed text now, bringing it closer to what I'm doing
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Style definitions header: a list of predefined styles to be referred to by data packets
*Region definitions header: a list of predefined regions to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This would be to help seeking code
find a kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it is timed by start granule rather than by end granule (unlike Theora and Vorbis, which are timed by end granule).
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see [[#Granule encoding|Granule encoding]].
*The EOS packet should have a granule position greater than or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, as this is handled inside the library code (unlike Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (old decoders will still parse the stream correctly, but will ignore the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(ie, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
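The classification rules above can be sketched in a few lines of Python. This is only an illustration, not libkate code, and the function names are made up for the example:

```python
# Bytes 1..7 of every Kate header packet.
KATE_MAGIC = b"kate\0\0\0"
# First eight bytes of the ID header: "\200kate\0\0\0".
KATE_SIGNATURE = b"\x80" + KATE_MAGIC


def is_header_packet(packet: bytes) -> bool:
    """True if the type byte has its MSB set and the Kate magic follows."""
    return len(packet) >= 8 and (packet[0] & 0x80) != 0 and packet[1:8] == KATE_MAGIC


def is_data_packet(packet: bytes) -> bool:
    """True if the type byte has its MSB cleared (data packets carry no magic)."""
    return len(packet) >= 1 and (packet[0] & 0x80) == 0


def looks_like_kate_stream(first_packet: bytes) -> bool:
    """Compare the first eight bytes of the first packet with the signature."""
    return first_packet[:8] == KATE_SIGNATURE
```

A demuxer would call looks_like_kate_stream on the first packet of each logical stream, then use the other two helpers to tell headers from data.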

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the ID header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x7f end packet (EOS)

The format described here is for bitstream version 0.x.

For more detailed information, refer to the format documentation
in libkate (see the URL in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

0 1 2 3 |
0 1 2 3
0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| packtype | Identifier char[7]: 'kate\0\0\0' | 0-3
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| kate magic continued | 4-7
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | version major | version minor | num headers | 8-11
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| text encoding | directionality| reserved - 0 | granule shift | 12-15
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| cw sh | canvas width | ch sh | canvas height | 16-19
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| reserved - 0 | 20-23
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate numerator | 24-27
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| granule rate denominator | 28-31
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (NUL terminated) | 32-35
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 36-39
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 40-43
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| language (continued) | 44-47
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (NUL terminated) | 48-51
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 52-55
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 56-59
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| category (continued) | 60-63
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

Language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously will not accommodate language tags
with lots of subtags.

Category is currently loosely defined, and I haven't yet found a nice way to
present it in a generic way, but it is meant for automatic classification of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).
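As a rough illustration of the ID header layout, here is a minimal Python sketch that pulls the byte-aligned fields out of a 64 byte ID header. It is not libkate code: the function name and the returned dictionary are made up for the example, the 32 bit fields are assumed to be little-endian, and bytes 16-19 (the canvas fields) are returned raw because their exact bit layout is not detailed on this page.

```python
import struct


def parse_id_header(packet: bytes) -> dict:
    """Extract the byte-aligned fields of a Kate ID header (packet type 0x80)."""
    if len(packet) < 64 or packet[:8] != b"\x80kate\0\0\0":
        raise ValueError("not a Kate ID header")
    # Assumption: 32 bit fields are stored little-endian.
    gps_numerator, gps_denominator = struct.unpack_from("<II", packet, 24)
    return {
        "version": (packet[9], packet[10]),
        "num_headers": packet[11],
        "text_encoding": packet[12],
        "directionality": packet[13],
        "granule_shift": packet[15],
        "canvas_raw": packet[16:20],  # left undecoded, see the note above
        "gps_numerator": gps_numerator,
        "gps_denominator": gps_denominator,
        # Both strings are NUL terminated ASCII in 16 byte fields.
        "language": packet[32:48].split(b"\0", 1)[0].decode("ascii"),
        "category": packet[48:64].split(b"\0", 1)[0].decode("ascii"),
    }
```

For real use, the format documentation shipped with libkate remains the reference for field encodings.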

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay

I have patches for the following with Kate support:
*MPlayer (for multiplexed per-language subtitles - all region/style info is ignored)
*xine (everything kate supports, as xine is my testbed)
*GStreamer
*Thoggen

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo.
Its API is not quite stable yet, though no major changes are expected.

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
*a base granule, in the high bits
*a granule offset, in the low bits

 +----------------+----------------+
 |      base      |     offset     |
 +----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
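The base/offset split can be sketched as follows. This is an illustration rather than libkate code: the function names are hypothetical, and the conversion to seconds assumes that the time of a granule position is the sum of its base and offset parts, both counted in granule units.

```python
from fractions import Fraction


def split_granulepos(granulepos: int, granule_shift: int):
    """Return (base, offset): base in the high bits, offset in the low bits."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset


def make_granulepos(base: int, offset: int, granule_shift: int) -> int:
    """Combine base and offset; the offset must fit in granule_shift bits."""
    if offset >> granule_shift:
        raise ValueError("offset does not fit in granule_shift bits")
    return (base << granule_shift) | offset


def granule_time(granulepos, granule_shift, gps_numerator, gps_denominator):
    """Time in seconds, assuming time = (base + offset) / granule rate."""
    base, offset = split_granulepos(granulepos, granule_shift)
    return Fraction((base + offset) * gps_denominator, gps_numerator)
```

A larger granule_shift leaves more room for the offset but less for the base, which is the space limitation mentioned under Generic timing.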

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit values at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definitions, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by means of a series of curves (static points, segments, and splines (Catmull-Rom, Bezier, and B-splines)).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have attached semantics so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting
at the right time and for the right duration a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.
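To make the idea of discontinuities and pauses concrete, here is a small Python sketch that samples a motion made of consecutive curves. The data layout (a list of kind, duration and points) is purely hypothetical and is not how libkate represents motions; only 'none' and two-point linear curves are handled, splines being left out for brevity.

```python
def sample_motion(curves, t):
    """Return the 2D point of a motion at time t, or None if there is none.

    curves is a list of (kind, duration, points) tuples played back to back,
    where kind is "none" or "linear" (two control points).
    """
    start = 0.0
    for kind, duration, points in curves:
        if start <= t < start + duration:
            if kind == "none":
                return None  # discontinuity: no point is generated
            (x0, y0), (x1, y1) = points
            u = (t - start) / duration  # position within this curve, 0..1
            return (x0 + (x1 - x0) * u, y0 + (y1 - y0) * u)
        start += duration
    return None  # t lies outside the motion


motion = [
    ("linear", 2.0, [(0.0, 0.0), (1.0, 0.0)]),  # move right for 2 seconds
    ("linear", 1.0, [(1.0, 0.0), (1.0, 0.0)]),  # pause: two equal points
    ("none", 1.0, []),                          # marker hidden for 1 second
    ("linear", 1.0, [(1.0, 0.0), (0.0, 0.0)]),  # move back
]
```

Sampling this motion at 2.5 seconds falls in the pause curve and yields the paused position, while sampling at 3.5 seconds falls in the 'none' curve and yields no point.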

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.
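As a sketch of how client code might apply the frame and region mappings and pair up two color motions, consider the following Python. The function names, the string mapping identifiers and the region tuple are assumptions made for the example, as is the choice to offset region mapped points by the region origin; libkate's own types differ.

```python
def map_point(point, mapping, frame_size, region):
    """Scale a 0-1 point to pixels according to a mapping.

    frame_size is (width, height); region is (x, y, width, height).
    """
    x, y = point
    if mapping == "frame":
        width, height = frame_size
        return (x * width, y * height)
    if mapping == "region":
        rx, ry, rw, rh = region
        # Assumption: the result is relative to the frame, so the region
        # origin is added to the scaled point.
        return (rx + x * rw, ry + y * rh)
    return (x, y)  # no mapping: use the point as is


def combine_color(rg_point, ba_point):
    """Build an RGBA value from text_color_rg and text_color_ba points."""
    return (rg_point[0], rg_point[1], ba_point[0], ba_point[1])
```

Because the curve itself only stores 0-1 values, the same motion can be reused for any frame or region size.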

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially considering that the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I very much dislike
editing XML by hand, as it's really hard to read. XML is really meant for machines to parse
text data generically, in a shared syntax but with possibly unknown semantics, and I need those
text representations to be easily editable.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from one style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place where a subtitle has been started, but has not been removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the granule originally sought. This implies support code in players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, of which I am not certain.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.
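The two phase seek from the first option can be sketched in Python as below. This is only an illustration over an in-memory list: the (start, backlink) tuples and the function name are hypothetical, with backlink standing for the granule of the earliest still active packet (or the packet's own granule if none is active). A real player would seek in the Ogg file instead of indexing a list.

```python
import bisect


def two_phase_seek(packets, target):
    """Return the index of the packet decoding should restart from.

    packets is a list of (start, backlink) tuples sorted by start.
    """
    starts = [start for start, _ in packets]
    # Phase 1: find the next Kate packet at or after the target.
    i = bisect.bisect_left(starts, target)
    if i == len(packets):
        return i  # no packet at or after the target
    backlink = packets[i][1]
    # Phase 2: if something issued before the target is still active,
    # seek again to the packet that issued it.
    if backlink < target:
        return bisect.bisect_left(starts, backlink)
    return i


# Events issued at 0, 5, 20, 30 and 50; the one at 0 is still active at 5,
# and the one at 20 is still active at 30.
packets = [(0, 0), (5, 0), (20, 20), (30, 20), (50, 50)]
```

Seeking to 25 first lands on the packet at 30, whose back link is 20, so the second phase restarts decoding from the packet at 20.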

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
text encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.

=== Bitstream format for floating point values ===

Floating point values are turned to a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the floating point values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over 32 bits raw writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean smaller
space savings, though the difference is likely to be insignificant when Kate streams are interleaved with
a video.
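The packing idea can be sketched in Python as follows. This is a simplified illustration, not the libkate bit layout: the function names are made up, negative values are assumed to be handled as 32 bit two's complement (which costs all the head bits when any value is negative), and the remainders are returned as a list of integers rather than written to an actual bitstream.

```python
def to_fixed_16_16(value: float) -> int:
    """Quantize to 16.16 fixed point, as an unsigned 32 bit pattern."""
    return int(round(value * 65536)) & 0xFFFFFFFF


def pack_fixed(values):
    """Return (head, tail, remainders) for a series of values.

    head and tail are the numbers of leading and trailing bits that are
    zero in every value; remainders hold the bits in between.
    """
    fixed = [to_fixed_16_16(v) for v in values]
    merged = 0
    for f in fixed:
        merged |= f
    if merged == 0:
        return 32, 0, [0] * len(fixed)
    head = 32 - merged.bit_length()
    tail = (merged & -merged).bit_length() - 1
    width = 32 - head - tail
    mask = (1 << width) - 1
    return head, tail, [(f >> tail) & mask for f in fixed]


def unpack_fixed(tail, remainders):
    """Rebuild the floating point values from the remainders."""
    result = []
    for r in remainders:
        f = r << tail
        if f >= 1 << 31:  # two's complement sign (assumption, see above)
            f -= 1 << 32
        result.append(f / 65536)
    return result
```

For the values 0.5, 1.0 and 1.5, only 2 bits per value remain once the 15 head bits and 15 tail bits are factored out.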

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly so they can choose to turn some streams on and off.

Since the category is meant primarily for a machine to parse, it will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics

Please remember the 15 character limit if proposing other categories.

Note that the list of categories is subject to change, and will likely
be replaced by new, more "identifier like" ones. The three above,
however, would be kept for backward compatibility as they're already used.
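The constraints on the category field (ASCII only, at most 15 characters, zero terminated in a 16 byte field) can be checked with a few lines of Python. The function name is hypothetical and this is not part of libkate:

```python
def encode_category(category: str) -> bytes:
    """Return the 16 byte, NUL padded form of a category name."""
    raw = category.encode("ascii")  # raises UnicodeEncodeError if not ASCII
    if len(raw) > 15:
        raise ValueError("category is limited to 15 characters")
    return raw.ljust(16, b"\0")
```

All three proposed categories fit: the longest, spu-subtitles, is 13 characters.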

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data, and could also allow things like e-books which can be
either read or listened to from the same bitstream.

I have seen no reference to this being used anywhere, but I see no reason why the granule
progression should have to be temporal rather than user controlled, such as by using a "next"
button which would bump the granule position by a preset amount, simulating turning a page.
This would be close to necessary for text-to-speech, as the wall time duration of the spoken
speech is not known in advance to the Kate encoder, and can't be mapped to a time based
granule progression. All text strings triggered consecutively between the two granule
positions would then be read in order.

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with the ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.
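The relation between palette size and pixel depth can be illustrated with a short Python sketch. The function name is made up, and accepting a palette of a single color (0 bits per pixel) is an assumption about an edge case this page does not cover:

```python
def bits_per_pixel(palette_size: int) -> int:
    """Bits needed to address a palette sized in powers of 2 up to 256."""
    is_power_of_two = palette_size > 0 and palette_size & (palette_size - 1) == 0
    if not is_power_of_two or palette_size > 256:
        raise ValueError("palette size must be a power of 2, up to 256")
    return palette_size.bit_length() - 1
```

A 16 color palette thus needs 4 bits per pixel, and a full 256 color palette needs 8.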

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

I am also investigating SVG images. These allow for very small footprint images for simple
vector drawings, and could be very useful for things like background gradients below text.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder and a decoder are included in the tools directory. The encoder pulls its input from a custom
text based file format (see [[#The Kate file format|The Kate file format]]),
which is by no means meant to be part of the Kate bitstream specification itself,
from a SubRip (.srt) format file (the most common subtitle format I found, and a very basic one),
or from a lyrics (.lrc) format file.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, what motions may be applied to events may not be known in advance (for instance, for a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I was going that way. Still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the packets themselves, concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.
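The private data layout described above can be unpacked with a short Python sketch. The function name is hypothetical; the lacing rule used is the usual Xiph one, where each length is the sum of a run of bytes that ends with the first byte below 255.

```python
def split_xiph_laced(private: bytes) -> list:
    """Split Matroska private data into the Kate header packets it holds."""
    if not private:
        raise ValueError("empty private data")
    count = private[0] + 1  # byte 0 holds the number of packets minus 1
    pos = 1
    sizes = []
    for _ in range(count - 1):  # the last packet's length is not stored
        size = 0
        while True:
            if pos >= len(private):
                raise ValueError("truncated lacing")
            value = private[pos]
            pos += 1
            size += value
            if value != 255:
                break
        sizes.append(size)
    remaining = len(private) - pos - sum(sizes)
    if remaining < 0:
        raise ValueError("packet lengths exceed private data size")
    sizes.append(remaining)  # deduced from the total size
    packets = []
    for size in sizes:
        packets.append(private[pos:pos + size])
        pos += size
    return packets
```

Since the number of packets is read from the data itself, this does not assume a fixed number of headers.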

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== Things I need to get feedback on ==

* is it a good idea to avoid floating point usage altogether ?

[[Category:Drafts]]
[[Category:Ogg Mappings]]

OggKate

2008-12-13T10:35:42Z

Ogg.k.ogg.k: mention SVG preliminary work

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is a codec for karaoke and text encapsulation for Ogg. Text and images can be
carried by a Kate stream, and animated. Most of the time, this would be multiplexed
with audio/video to carry subtitles, song lyrics (with or without karaoke data), etc,
but doesn't have to be. A possible use of a lone Kate stream would be an e-book.
Moreover, the motion feature gives Kate a powerful means to describe arbitrary curves, so
hand drawing of shapes can be achieved. This was originally meant for karaoke use, but
can be used for any purpose. Motions can be attached to various semantics, like position,
color, etc, so scrolling or fading text can be defined.

Simple uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed).

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on. It would have been quicker to write Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this. Besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex. It is now being "repurposed" as timed text, bringing it closer to what I'm doing.
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I already had a working Kate implementation by that time).

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Style definitions header: a list of predefined styles to be referred to by data packets
*Region definitions header: a list of predefined regions to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This would be to help seeking code
find a kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it's timed by start granule, not by end granule (as Theora and Vorbis are).
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see the granule encoding section.
*The EOS packet should have a granule pos greater than or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (old decoders will still parse the stream correctly, but will ignore the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(ie, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
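
The two rules above (packet type in the first byte, magic in header packets only) can be sketched as follows. The function names are invented for this example and are not part of libkate.

```python
KATE_MAGIC = b"kate\x00\x00\x00"             # bytes 1 to 7 of every header packet
KATE_SIGNATURE = b"\x80" + KATE_MAGIC        # "\200kate\0\0\0": ID header type + magic


def is_kate_id_header(packet):
    """True if this packet looks like the first packet of a Kate stream."""
    return packet[:8] == KATE_SIGNATURE


def classify_packet(packet):
    """Return "header" or "data" from the one byte packet type."""
    if not packet:
        raise ValueError("empty packet")
    if packet[0] & 0x80:                     # MSB set: header packet
        if packet[1:8] != KATE_MAGIC:
            raise ValueError("header packet without the Kate magic")
        return "header"
    return "data"                            # MSB cleared: no magic to check
```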

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the ID header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x7f end packet (EOS)

The format described here is for bitstream version 0.x.

For more detailed information, refer to the format documentation
in libkate (see URL below in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | packtype      | Identifier char[7]: 'kate\0\0\0'              | 0-3
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | kate magic continued                                          | 4-7
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | reserved - 0  | version major | version minor | num headers   | 8-11
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | text encoding | directionality| reserved - 0  | granule shift | 12-15
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | cw sh | canvas width          | ch sh | canvas height         | 16-19
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | reserved - 0                                                  | 20-23
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | granule rate numerator                                        | 24-27
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | granule rate denominator                                      | 28-31
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (NUL terminated)                                     | 32-35
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (continued)                                          | 36-39
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (continued)                                          | 40-43
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (continued)                                          | 44-47
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (NUL terminated)                                     | 48-51
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (continued)                                          | 52-55
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (continued)                                          | 56-59
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (continued)                                          | 60-63
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously will not accommodate language tags
with lots of subtags.

Category is currently loosely defined, and I haven't yet found a nice way to
present it in a generic way, but it is meant for automatic classifying of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).
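
To make the byte offsets of the ID header concrete, here is a rough Python sketch that reads the byte-sized fields and the two strings. The function name is made up for this example. It assumes the 32 bit granule rate fields are little-endian, which the text above does not state, and it leaves the canvas size fields alone since their bit packing is not detailed here.

```python
import struct


def parse_kate_id_header(packet):
    """Read the simple fields of a 64 byte Kate ID header (packet type 0x80)."""
    if len(packet) < 64:
        raise ValueError("the ID header is 64 bytes long")
    if packet[:8] != b"\x80kate\x00\x00\x00":
        raise ValueError("not a Kate ID header")

    # ASSUMPTION: 32 bit integers are little-endian; check the libkate
    # format documentation before relying on this.
    numerator, denominator = struct.unpack_from("<II", packet, 24)

    def c_string(field):
        # language and category are NUL terminated ASCII strings
        return field.split(b"\x00", 1)[0].decode("ascii")

    return {
        "version": (packet[9], packet[10]),      # major, minor
        "num_headers": packet[11],
        "text_encoding": packet[12],
        "directionality": packet[13],
        "granule_shift": packet[15],
        "granule_rate": (numerator, denominator),
        "language": c_string(packet[32:48]),
        "category": c_string(packet[48:64]),
    }
```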

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay

I have patches for the following with Kate support:
*MPlayer (for multiplexed per-language subtitles - all region/style info is ignored)
*xine (everything kate supports, as xine is my testbed)
*GStreamer
*Thoggen

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo,
though it is not quite yet API stable (though no major changes are expected).

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
- a base granule, in the high bits
- a granule offset, in the low bits

+----------------+----------------+
| base | offset |
+----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
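
As a small illustration of how these three fields fit together, the following Python sketch splits a granulepos and converts granule units to seconds. The function names are invented for this example. It assumes the granule rate is expressed in granule units per second, so that seconds = units * denominator / numerator.

```python
from fractions import Fraction


def split_granulepos(granulepos, granule_shift):
    """Return (base, offset): base in the high bits, offset in the low bits."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset


def granule_units_to_seconds(units, gps_numerator, gps_denominator):
    """Convert a count of granule units to seconds, as an exact fraction."""
    # ASSUMPTION: the rate is gps_numerator / gps_denominator units per second.
    return Fraction(units * gps_denominator, gps_numerator)
```

For instance, with a granule shift of 32 and a rate of 1000/1, a granulepos of (5000 << 32) | 250 splits into a base of 5000 and an offset of 250, that is 5 seconds and a quarter of a second respectively.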

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit values at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definitions, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by means of a series of curves (static points, segments, and splines (Catmull-Rom, Bezier, and B-splines)).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Motions have attached semantics, so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting
at the right time and for the right duration a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.
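
The behaviour of 'none' curves and pauses can be illustrated with a small Python sketch. The data model used here (a motion as a list of (duration, points) pairs, restricted to two-point linear curves) is a simplification invented for this example, and is not the libkate representation.

```python
def sample_motion(curves, t):
    """Return the 2D point at time t, or None if no point is generated.

    curves is a list of (duration, points) pairs, where points is either
    None (a 'none' curve) or a pair of 2D points to interpolate between.
    """
    start = 0.0
    for duration, points in curves:
        if start <= t < start + duration:
            if points is None:            # 'none' curve: the motion is hidden
                return None
            (x0, y0), (x1, y1) = points
            u = (t - start) / duration    # progress along this curve, 0 to 1
            return (x0 + (x1 - x0) * u, y0 + (y1 - y0) * u)
        start += duration
    return None                           # outside the lifetime of the motion


motion = [
    (2.0, ((0.0, 0.0), (1.0, 0.0))),      # move right for 2 seconds
    (1.0, ((1.0, 0.0), (1.0, 0.0))),      # pause: two equal points
    (0.5, None),                          # no point for half a second
    (2.0, ((1.0, 0.0), (1.0, 1.0))),      # move down for 2 seconds
]
```

Sampling this motion at 2.5 seconds yields the paused position, while sampling it at 3.2 seconds yields no point at all.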

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.
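
A rough Python sketch of how a client might apply a frame mapping, and merge the two color motions into one RGBA value, follows. Both function names are hypothetical, and the value range of color components is not specified here, so the sketch passes them through unchanged.

```python
def map_to_frame(point, frame_width, frame_height):
    """Scale a 0-1 point to pixel coordinates in the current video frame."""
    x, y = point
    return (x * frame_width, y * frame_height)


def combine_rgba(rg_point, ba_point):
    """Merge the points sampled from a text_color_rg and a text_color_ba motion."""
    r, g = rg_point
    b, a = ba_point
    return (r, g, b, a)
```

A region mapping would work the same way as map_to_frame, using the size of the current region instead of the size of the frame.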

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary, if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially when considering that the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.
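
Generating such a file from a list of timed texts is straightforward. The following Python sketch only uses the syntax shown above, with whole-second timestamps. The function names are invented for this example, and since quote escaping is not described here, texts containing a double quote are rejected rather than guessed at.

```python
def format_timestamp(seconds):
    """Format a whole number of seconds as hh:mm:ss."""
    seconds = int(seconds)
    return "%02d:%02d:%02d" % (seconds // 3600, seconds // 60 % 60, seconds % 60)


def kate_description(events):
    """Build a minimal Kate description from (start, end, text) tuples."""
    lines = ["kate {"]
    for start, end, text in events:
        if '"' in text:
            # ASSUMPTION: escaping rules are unknown to this sketch
            raise ValueError("quote escaping is not covered by this sketch")
        lines.append('  event { %s --> %s "%s" }'
                     % (format_timestamp(start), format_timestamp(end), text))
    lines.append("}")
    return "\n".join(lines) + "\n"
```

Calling kate_description([(5, 10, "This is a text")]) reproduces the example above.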

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I tend to dislike very
much editing XML by hand, as it's really hard to read. XML is really meant for machines to parse
generically text data in a shared syntax but with possibly unknown semantics, and I need those
text representations to be editable easily.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from a style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place where a subtitle has been started, but has not been removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the granule originally sought. This implies support code on players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, of which I am not certain.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.
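
The first option (a two phase seek) can be sketched in Python over an in-memory index. This is a toy model invented for this example: each Kate page is reduced to a (start granule, earliest still active granule) pair, and a real player would seek in the Ogg file rather than in a list.

```python
import bisect


def two_phase_seek(pages, target):
    """Return the index of the page decoding should restart from.

    pages is a list of (start_granule, earliest_active_granule) pairs,
    sorted by start granule.
    """
    starts = [start for start, _ in pages]
    # Phase 1: find the first Kate page at or after the target.
    index = bisect.bisect_left(starts, target)
    if index == len(pages):
        return index                      # no page at or after the target
    earliest_active = pages[index][1]
    if earliest_active >= target:
        return index                      # nothing older is still active
    # Phase 2: seek again, back to the earliest event still active.
    return bisect.bisect_left(starts, earliest_active)


# An event starting at 10 is still active when the event at 20 is issued.
pages = [(0, 0), (10, 10), (20, 10), (30, 30)]
```

Seeking to granule 15 in this index first lands on the page at 20, then goes back to the page at 10, since that event is still active.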

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
text encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.

=== Bitstream format for floating point values ===

Floating point values are turned to a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the floating point values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over 32 bits raw writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
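
The idea can be sketched in Python as follows. This is not the libkate code: the function names are invented, and the sketch only handles non-negative values (how signs are stored is not described here). It computes how many zero bits are shared at the head and tail of a set of 16.16 values, and therefore how many bits remain to be stored per value.

```python
def to_fixed_16_16(value):
    """Quantize a float to 16.16 fixed point."""
    return int(round(value * 65536))


def pack_layout(values):
    """Return (head, tail, bits_per_value) for a list of non-negative floats."""
    fixed = [to_fixed_16_16(v) for v in values]
    if any(f < 0 or f >= 1 << 32 for f in fixed):
        # ASSUMPTION: sign handling is left out of this sketch
        raise ValueError("this sketch only covers values in [0, 65536)")
    merged = 0
    for f in fixed:
        merged |= f                       # a bit is set if any value uses it
    if merged == 0:
        return 32, 0, 0                   # all zero: nothing to store per value
    head = 32 - merged.bit_length()       # zero bits shared at the top
    tail = (merged & -merged).bit_length() - 1   # zero bits shared at the bottom
    return head, tail, 32 - head - tail
```

For the values 0.5, 1.0 and 2.5, only 3 bits per value are left once the 14 head bits and 15 tail bits are factored out.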

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly to allow to choose to turn some streams on and off.

Since this category is meant primarily for a machine to parse, category names will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics
* transcript - exact words of a speech
* commentary - running commentary about an accompanying stream (eg, a video)
* narration - narration of an accompanying stream (eg, a video)
* book - a full book as text, might be a lone Kate stream (or muxed with other languages)

Please remember the 15 character limit if proposing other categories.
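
As a quick check when proposing a category, the constraints above (ASCII only, 15 characters at most, NUL terminated in a 16 byte field) can be expressed in a few lines of Python. The function name is made up for this example.

```python
def encode_category(name):
    """Return the 16 byte category field for a category name."""
    data = name.encode("ascii")          # raises UnicodeEncodeError if not ASCII
    if len(data) > 15:
        raise ValueError("category names are limited to 15 characters")
    return data.ljust(16, b"\x00")       # pad with NULs up to the field size
```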

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data, and could also allow things like e-books which can be
either read or listened to from the same bitstream (I have seen no reference to this
being used anywhere, but I see no reason why the granule progression should be temporal,
and not user controlled, such as by using a "next" button which would bump a granule
position by a preset amount, simulating turning a page (this would be close to necessary
for text-to-speech, as the wall time duration of the spoken speech is not known in
advance to the Kate encoder, and can't be mapped to a time based granule progression)).
All text strings triggered consecutively between the two granule positions would then
be read in order.

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts, (see next
paragraph), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.
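
The relation between palette size and pixel depth can be written down in Python as below. The function name is invented, and the lower bound of 2 colors is an assumption of this sketch, since only the upper bound of 256 is stated above.

```python
def bits_per_pixel(palette_size):
    """Number of bits needed per pixel to address a palette of this size."""
    if palette_size < 2 or palette_size > 256 or palette_size & (palette_size - 1):
        # ASSUMPTION: the smallest palette has 2 colors
        raise ValueError("palette size must be a power of 2 between 2 and 256")
    return palette_size.bit_length() - 1
```

So a 16 color palette goes with 4 bits per pixel, and a 256 color palette with 8.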

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

I am also investigating SVG images. These allow for very small footprint images for simple
vector drawings, and could be very useful for things like background gradients below text.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder and a decoder are included in the tools directory. The encoder pulls its input from a custom
text based file format (see [[#The Kate file format|The Kate file format]]),
which is by no means meant to be part of the Kate bitstream specification itself,
from a SubRip (.srt) format file (the most common subtitle format I found, and a very basic one),
or from a lyrics (.lrc) format file.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, what motions may be applied to events may not be known in advance (for instance, for a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I go that way. I still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML due to me finding editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
Bytes n+1..end: the data packets themselves concatenated one after the other

Note that the length of the last packet isn't encoded, it is deduced from the sizes of the other
packets and the total size of the private data.
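
The layout above can be unpacked along the following lines. This is only an illustrative Python sketch, not code taken from libkate or from any Matroska demuxer: the function name split_xiph_laced is made up for the example, and it assumes the usual xiph lacing rule, where each length is coded as a run of 255 bytes followed by one final byte below 255, all summed together.

```python
def split_xiph_laced(private_data):
    """Split Matroska private data into its xiph-laced packets."""
    if not private_data:
        raise ValueError("empty private data")
    num_lengths = private_data[0]  # NP: number of packets, minus 1
    pos = 1
    lengths = []
    for _ in range(num_lengths):
        size = 0
        while True:
            if pos >= len(private_data):
                raise ValueError("truncated lacing")
            byte = private_data[pos]
            pos += 1
            size += byte
            if byte != 255:  # a byte below 255 ends this length
                break
        lengths.append(size)
    # The last packet's length is not stored: it is whatever remains.
    remaining = len(private_data) - pos - sum(lengths)
    if remaining < 0:
        raise ValueError("coded lengths exceed the private data size")
    lengths.append(remaining)
    packets = []
    for size in lengths:
        packets.append(private_data[pos:pos + size])
        pos += size
    return packets
```

Since the number of packets is read from the first byte, the caller does not need to assume a set number of headers.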

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== Things I need to get feedback on ==

* is it a good idea to avoid floating point usage altogether ?

[[Category:Drafts]]
[[Category:Ogg Mappings]]

OggKate

2008-12-13T00:31:45Z

Ogg.k.ogg.k: not a draft anymore - it's been stable since the first released version, almost a year ago

== Disclaimer ==
This is not a Xiph codec, though it may be embedded in Ogg alongside other Xiph
codecs, such as Vorbis and Theora. As such, please do not assume that Xiph has
anything to do with this, much less responsibility.

== What is Kate? ==

Kate is a codec for karaoke and text encapsulation for Ogg. Text and images can be
carried by a Kate stream, and animated. Most of the time, this would be multiplexed
with audio/video to carry subtitles, song lyrics (with or without karaoke data), etc,
but doesn't have to be. A possible use of a lone Kate stream would be an e-book.
Moreover, the motion feature gives Kate a powerful means to describe arbitrary curves, so
hand drawing of shapes can be achieved. This was originally meant for karaoke use, but
can be used for any purpose. Motions can be attached to various semantics, like position,
color, etc, so scrolling or fading text can be defined.

Simple uses of Kate streams are movie subtitles for Theora videos, either text based,
as may be created by [http://www.v2v.cc/~j/ffmpeg2theora ffmpeg2theora], or image
based, such as created by [http://thoggen.net Thoggen] (patching needed).

== Why a new codec? ==

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself
wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to
the headers, one can't add them in the stream as they are sung, so another multiplexed stream
would be needed to carry them.

The three possible bases usable for such a codec I found were Writ, CMML, and OGM/SRT.

*[[OggWrit|Writ]] is an unmaintained start at an implementation of a very basic design, though I did find an encoder/decoder in py-ogg2 later on - I'd have been quicker to write Kate from scratch anyway.
*[[CMML]] is more geared towards encapsulating metadata about an accompanying stream, rather than being a data stream itself, and seemed complex for a simple use, though I have now revised my view on this. Besides, it seems designed for Annodex (which I haven't had a look at), though it does seem relatively generic for use outwith Annodex. It is also being "repurposed" as timed text now, bringing it closer to what I'm doing.
*OGM/SRT, which I only found when I added Kate support to MPlayer, is shoehorning various data formats into an Ogg stream, and just dumps the SRT subtitle format as is, AFAICS (though I haven't looked at this one in detail, since I'd already had a working Kate implementation by that time)

I then decided to roll my own, not least because it's a fun thing to do.

I found other formats, such as USF (designed for inclusion in Matroska) and various subtitle formats,
but none were designed for embedding inside an Ogg container.

== Overview of the Kate bitstream format ==

I've taken much inspiration from Vorbis and Theora here.
Headers and packets (as well as the API design) follow the design of these two codecs.

A rough overview (see [[#Format specification|Format specification]] for more details) is:

Header packets:
*ID header [BOS]: magic, version, granule fraction, encoding, language, etc
*Comment header: Vorbis comments, as per Vorbis/Theora streams
*Region definitions header: a list of predefined regions to be referred to by data packets
*Style definitions header: a list of predefined styles to be referred to by data packets
*Curves definitions header: a list of predefined curves to be referred to by data packets
*Motion definitions header: a list of predefined motions to be referred to by data packets
*Palette definitions header: a list of predefined palettes to be referred to by data packets
*Bitmap definitions header: a list of predefined bitmaps to be referred to by data packets
*Font mapping definitions header: a list of predefined font mappings to be referred to by data packets

Other header packets are ignored, and left for future expansion.

Data packets:
*text data: text/image and optional motions, accompanied by optional overrides for style, region, language, etc
*keepalive: can be emitted at any time to help a demuxer know where we're at, but those packets are optional
*end data [EOS]: marks the end of the stream, it doesn't have any useful payload

Other data packets are ignored, and left for future expansion.

The intent of the "keepalive" packet is to be sent at regular
intervals when no other packet has been emitted for a while. This would be to help seeking code
find a kate page more easily.

Things of note:
*Kate is a discontinuous codec, as defined in [http://www.xiph.org/ogg/doc/ogg-multiplex.html ogg-multiplex.html] in the Ogg documentation, which means it's timed by start granule, not by end granule as Theora and Vorbis are.
* All data packets are on their own page, for two reasons:
**Ogg keeps track of granules at the page level, not the packet level
**if no text event happens for a while after a particular text event, we don't want to delay it so a larger page can be issued

See also [[#Seeking and memory|Problems to solve: Seeking and memory]].

*The granule encoding is not a direct time/granule correspondence, see the granule encoding section.
*The EOS packet should have a granule pos greater than or equal to the end time of all events.
*User code doesn't have to know the number of headers to expect, this is moved inside the library code (as opposed to Vorbis and Theora).
*The format contains hooks so that additional information may be added in future revisions while keeping backward compatibility (though old decoders will correctly parse, but ignore the new information).

== Format specification ==

The Kate bitstream format consists of a number of sequential packets.
Packets can be either header packets or data packets. All header packets
must appear before any data packet.

Header packets must appear in order. Decoding of a data packet is not
possible until all header packets have been decoded.

Each Kate packet starts with a one byte type. A type with the MSB set
(ie, between 0x80 and 0xff) indicates a header packet, while a type with
the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet.
All header packets then have the Kate magic, from byte offset 1 to byte
offset 7 ("kate\0\0\0"). Note that this applies only to header packets:
data packets do not contain the Kate signature.

Since the ID header must appear first, a Kate stream can be recognized
by comparing the first eight bytes of the first packet with the signature
string "\200kate\0\0\0".
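
As a small illustration of these rules, here is a Python sketch of the checks a demuxer might do. The function names are made up for the example and are not part of libkate.

```python
KATE_MAGIC = b"kate\0\0\0"  # bytes 1 to 7 of every header packet


def is_kate_bos(packet):
    """True if this looks like a Kate ID header (type 0x80 plus magic)."""
    return packet[:8] == b"\x80" + KATE_MAGIC


def is_header_packet(packet):
    """Header packets have the MSB of the type byte set, then the magic."""
    return (len(packet) >= 8
            and (packet[0] & 0x80) != 0
            and packet[1:8] == KATE_MAGIC)


def is_data_packet(packet):
    """Data packets have the MSB of the type byte cleared, and no magic."""
    return len(packet) >= 1 and (packet[0] & 0x80) == 0
```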

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80,
the ID header packet) must be placed on a separate page. The corresponding Ogg
packet must be marked as beginning of stream (BOS). All subsequent header packets
must be on one or more pages. Subsequently, each data packet must be on a separate
page.

The last data packet must be the end of stream packet (packet type 0x7f).

When embedded in Ogg, the corresponding Ogg packet must be marked as end of stream (EOS).

As per the Ogg specification, granule positions must be non decreasing
within the stream. Header packets have granule position 0.

Currently existing packet types are:
:headers:
::0x80 ID header (BOS)
::0x81 Vorbis comment header
::0x82 regions list header
::0x83 styles list header
::0x84 curves list header
::0x85 motions list header
::0x86 palettes list header
::0x87 bitmaps list header
::0x88 font ranges and mappings header
:data:
::0x00 text data (including optional motions and overrides)
::0x01 keepalive
::0x7f end packet (EOS)

The format described here is for bitstream version 0.x.

For more detailed information, refer to the format documentation
in libkate (see URL below in the [[#Downloading|Downloading]] section).

Following is the definition of the ID header (packet type 0x80).
This works out to a 64 byte ID header. This is the header that should be
used to detect a Kate stream within an Ogg stream.

  0                   1                   2                   3  |
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | packtype      | Identifier char[7]: 'kate\0\0\0'              | 0-3
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | kate magic continued                                          | 4-7
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | reserved - 0  | version major | version minor | num headers   | 8-11
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | text encoding | directionality| reserved - 0  | granule shift | 12-15
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | cw sh | canvas width          | ch sh | canvas height         | 16-19
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | reserved - 0                                                  | 20-23
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | granule rate numerator                                        | 24-27
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | granule rate denominator                                      | 28-31
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (NUL terminated)                                     | 32-35
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (continued)                                          | 36-39
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (continued)                                          | 40-43
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | language (continued)                                          | 44-47
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (NUL terminated)                                     | 48-51
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (continued)                                          | 52-55
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (continued)                                          | 56-59
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 | category (continued)                                          | 60-63
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The fields cw sh, canvas width, ch sh, and canvas height were introduced
in bitstream 0.3. Earlier bitstreams will have 0 in these fields.

language and category are NUL terminated ASCII strings.
Language follows RFC 3066, though obviously will not accommodate language tags
with lots of subtags.

Category is currently loosely defined, and I haven't yet found a nice way to
present it in a generic way, but it is meant for automatic classifying of
various multiplexed Kate streams (eg, to recognize that some streams are
subtitles (in a set of languages), and some others are commentary (in a
possibly different set of languages), etc).
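
To make the byte offsets concrete, here is a rough Python sketch that pulls the main fields out of a 64 byte ID header. It is not the libkate parser. The function name and the dictionary keys are made up for the example, and it assumes the 32 bit granule rate fields are stored little endian, which this page does not state. The canvas size fields are left alone, since their exact bit layout is not described here.

```python
import struct


def parse_id_header(packet):
    """Extract the main fields of a Kate ID header (packet type 0x80)."""
    if len(packet) < 64 or packet[:8] != b"\x80kate\0\0\0":
        raise ValueError("not a Kate ID header")
    # Assumption: 32 bit fields are little endian.
    gps_numerator, gps_denominator = struct.unpack_from("<II", packet, 24)
    return {
        "version": (packet[9], packet[10]),       # major, minor
        "num_headers": packet[11],
        "text_encoding": packet[12],
        "directionality": packet[13],
        "granule_shift": packet[15],
        "gps_numerator": gps_numerator,
        "gps_denominator": gps_denominator,
        # Both strings are ASCII, NUL terminated, in 16 byte fields.
        "language": packet[32:48].split(b"\0", 1)[0].decode("ascii"),
        "category": packet[48:64].split(b"\0", 1)[0].decode("ascii"),
    }
```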

== API overview ==

libkate offers an API very similar to that of libvorbis and libtheora, as well as
an extra higher level decoding API.

Here's an overview of the three main modules:

=== Decoding ===

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a
kate_comment structure. Then, read headers by calling kate_decode_headerin. Once
all headers have been read, a kate_state is initialized for decoding using kate_decode_init,
and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be
retrieved via kate_decode_eventout.

=== Encoding ===

Encoding is also done in a way similar to libvorbis. First initialize a kate_info
and a kate_comment structure, and fill them out as needed. kate_encode_headers will
create ogg packets from those. Then, kate_encode_text is called repeatedly for all
the text events to add. When done, calling kate_encode_finish will create an end of
stream packet.

=== High level decoding API ===

There are only 3 calls here:

kate_high_decode_init
kate_high_decode_packetin
kate_high_decode_clear

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right
thing (header/data classification, decoding, and event retrieval). Note that you
do not get access to the comments directly using this, but you do get access to the
kate_info via events.

The libkate distribution includes commented examples for each of those.

Additionally, libkate includes a layer (liboggkate) to make it easier to use when
embedded in Ogg. While the normal API uses kate_packet structures, liboggkate uses
ogg_packet structures.

The High level decoding API does not have an Ogg specific layer, but functions exist
to wrap a kate_packet around a memory buffer (such as the one ogg_packet uses, for instance).

== Support ==

Among the software with Kate support:
*VLC
*ffmpeg2theora
*liboggz
*liboggplay

I have patches for the following with Kate support:
*MPlayer (for multiplexed per-language subtitles - all region/style info is ignored)
*xine (everything kate supports, as xine is my testbed)
*GStreamer
*Thoggen

These may be found in the libkate source distribution (see [[#Downloading|Downloading]]
for links).

In addition, libtiger is a rendering library for Kate streams using Pango and Cairo,
though it is not quite yet API stable (though no major changes are expected).

== Granule encoding ==

=== Ogg ===

Ogg leaves the encoding of granules up to a particular codec, only
mandating that granules be non decreasing with time.

The Kate bitstream format uses a linear mapping between time and
granule, described here.

A Kate granule position is composed of two different parts:
- a base granule, in the high bits
- a granule offset, in the low bits

 +----------------+----------------+
 |      base      |     offset     |
 +----------------+----------------+

The number of bits these parts occupy is variable, and each stream
may choose how many bits to dedicate to each. The kate_info structure
for a stream holds that information in the granule_shift field,
so each part may be reconstructed from a granulepos.

The kate_info structure for a stream holds a rational fraction
representing the time span of granule units for both the base and
the offset parts.

The granule rate is defined by the two fields:

kate_info::gps_numerator
kate_info::gps_denominator

The number of bits reserved for the offset is defined by the field:

kate_info::granule_shift
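
As an illustration, the split and the conversion to a time could look like the Python sketch below. The function names are made up for the example. It assumes the timestamp of a packet is the sum of the base and offset parts, both counted in granule units, and that the granule rate is a number of granules per second. Check this against the libkate format documentation before relying on it.

```python
from fractions import Fraction


def split_granulepos(granulepos, granule_shift):
    """Return the (base, offset) parts of a Kate granule position."""
    base = granulepos >> granule_shift                    # high bits
    offset = granulepos & ((1 << granule_shift) - 1)      # low bits
    return base, offset


def granulepos_to_seconds(granulepos, granule_shift,
                          gps_numerator, gps_denominator):
    """Convert a granule position to seconds, as an exact fraction."""
    base, offset = split_granulepos(granulepos, granule_shift)
    # Assumption: time in granule units is base + offset.
    return Fraction((base + offset) * gps_denominator, gps_numerator)
```

With a granule rate of 1000/1 and a granule_shift of 32, a granule position with base 5000 and offset 250 would map to 5.25 seconds under these assumptions.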

=== Generic timing ===

Kate data packets (data packet type 0) include timing information (start time,
end time, and time of the earliest event still active). All these are stored as
64 bit values at the rate defined by the granule rate, so they do not suffer from the
granule_shift space limitation.

This also allows for Kate streams to be stored in other containers.

== Motion ==

The Kate bitstream format includes motion definitions, originally for karaoke purposes, but
which can be used for more general purposes, such as line based drawing, or animation of
the text (position, color, etc).

Motions are defined by the means of a series of curves (static points, segments, splines (catmull-rom, bezier, and b-splines)).
A 2D point can be obtained from a motion for any timestamp during the lifetime of a text.
This can be used for moving a marker in 2D above the text for karaoke, or to use the x
coordinate to color text when the motion position passes each letter or word, etc.
Each motion has an attached semantic so the client code knows how to use a particular motion.
Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have
an arbitrary number of control points, complex motions can be achieved. If the motion is
the main object of an event, it is even possible to have an empty text, and use the motion
as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwriting subtitles could
be done this way, though this would require a lot of control points, and would not be able
to be used with text-to-speech.

As a proof of concept, I also have a "draw chat" program where two people can draw, and
the shapes are turned to b-splines and sent as a kate motion to be displayed on the other
person's window.

It is also possible for motions to be discontinuous - simply insert a curve of 'none' type.
While the timestamp lies within such a curve, no 2D point will be generated. This can be
used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting
at the right time and for the right duration a simple linear interpolation curve with only
two equal points, equal to the position the motion is supposed to pause at.
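
The 'none' curves and the pause trick can be shown with a toy evaluator. This is a Python sketch only: the (kind, duration, points) tuples are a made up representation for the example, not the libkate data structures, and only static, two point linear, and 'none' curves are handled (no splines).

```python
def evaluate_motion(curves, t):
    """Return the 2D point of a motion at time t, or None if there is none.

    curves is a list of (kind, duration, points) tuples played in sequence.
    """
    start = 0.0
    for kind, duration, points in curves:
        if t < start + duration:
            if kind == "none":
                return None              # discontinuity: no point generated
            if kind == "static":
                return points[0]
            if kind == "linear":
                u = (t - start) / duration
                (x0, y0), (x1, y1) = points
                return (x0 + (x1 - x0) * u, y0 + (y1 - y0) * u)
            raise ValueError("unsupported curve type: %s" % kind)
        start += duration
    return None                          # past the end of the motion


motion = [
    ("linear", 2.0, [(0.0, 0.0), (1.0, 0.0)]),  # move right
    ("linear", 1.0, [(1.0, 0.0), (1.0, 0.0)]),  # pause: two equal points
    ("none", 1.0, []),                          # marker hidden
    ("linear", 1.0, [(1.0, 0.0), (1.0, 1.0)]),  # move down
]
```

Here the second curve holds the marker at (1.0, 0.0) for one second, and the third one hides it for another second.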

Kate defines a set of predefined mappings so that each decoder user interprets a motion in
the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 are reserved
for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses
of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of
the current video frame), or region, to scale 0-1 to the current region. This allows curves
to be defined without knowing in advance the pixel size of the area it should cover.
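
For instance, the frame and region mappings could be applied as in this Python sketch. The function name, the string mapping names, and the (x, y, width, height) region tuple are assumptions made for the example. In the bitstream a mapping is an 8 bit code, and whether the region mapping also adds the region origin, as done here, is a guess.

```python
def map_point(point, mapping, frame_size, region):
    """Scale a 0-1 motion point to pixels, for the given mapping."""
    x, y = point
    if mapping == "frame":
        width, height = frame_size
        return (x * width, y * height)
    if mapping == "region":
        rx, ry, rw, rh = region
        # Assumption: the result is relative to the frame, not the region.
        return (rx + x * rw, ry + y * rh)
    raise ValueError("unknown mapping: %s" % mapping)
```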

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are
needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be
obtained using two different motions.

There are higher level constructs, such as morphing between two styles, or predefined
karaoke effects. More are planned to be added in the future.

See also [[#Trackers|Trackers]].

== Trackers ==

Since attaching motions to text position, etc, makes it hard for the client to keep track of
everything, doing interpolation, etc, the library supplies a tracker object, which handles the
interpolation of the relevant properties.
Once initialized with a text and a set of motions, the client code can give the tracker a new
timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary, if one wants to use the motions directly, or just ignore them,
but it makes life easier, especially when considering that the order in which motions are applied
does matter (to be defined formally, but the current source code is informative at this point).

== The Kate file format ==

Though this is not a feature of the bitstream format, I have created a text file format to
describe a series of events to be turned into a Kate bitstream.
At its minimum, the following is a valid input to the encoder:

: kate {
:: event { 00:00:05 --> 00:00:10 "This is a text" }
: }

This will create a simple stream with "This is a text" emitted at an offset of 5 seconds into
the track, lasting 5 seconds to an end time at 10 seconds.
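
Such minimal files are easy to generate from a list of timed strings. The Python sketch below writes only the event syntax shown above. The function names are made up, timestamps are limited to whole seconds, and text containing a double quote is rejected, because the escaping rules are not covered here.

```python
def format_timestamp(seconds):
    """Format a whole number of seconds as HH:MM:SS."""
    whole = int(seconds)
    return "%02d:%02d:%02d" % (whole // 3600, whole // 60 % 60, whole % 60)


def kate_description(events):
    """Build a minimal Kate description from (start, end, text) tuples."""
    lines = ["kate {"]
    for start, end, text in events:
        if '"' in text:
            raise ValueError("quote escaping is not handled in this sketch")
        lines.append('  event { %s --> %s "%s" }' % (
            format_timestamp(start), format_timestamp(end), text))
    lines.append("}")
    return "\n".join(lines) + "\n"
```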

Motions, regions, styles can be declared in a definitions block to be reused by events, or can
be defined inline. Defining those in the definitions block places them in a header so they can
be reused later, saving space. However, they can also be defined in each event, so they will be
sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being
streamed from a realtime input).

For convenience, the Kate file format also allows C style macros, though without parameters.

Please note that the Kate file format is fully separate from the Kate bitstream format. The
difference between the two is similar to the difference between a C source file and the resulting
object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I very much dislike
editing XML by hand, as it's really hard to read. XML is really meant for machines to generically
parse text data in a shared syntax but with possibly unknown semantics, and I need those
text representations to be easily editable.

This also implies that there could be an XML representation of a Kate stream, which would be
useful if one were to make an editor that worked on a higher level than the current all-text
representation, and it is something that might very well happen in the future, in parallel with
the current format.

== Karaoke ==

Karaoke effects rely on motions, and there will be predefined higher level ways of specifying
timings and effects, two of which are already done. As an example, this is a valid Karaoke script:

:kate {
:: simple_timed_glyph_style_morph {
::: from style "start_style" to style "end_style"
::: "Let " at 1.0
::: "us " at 1.2
::: "sing " at 1.4
::: "to" at 2.0
::: "ge" at 2.5
::: "ther" at 3.0
:: }
:}

The syllables will change from one style to another as time passes. The definition of the start_style
and end_style styles is omitted for brevity.

== Problems to solve ==

There are a few things to solve before the Kate bitstream format can be considered good
enough to be frozen:

Note: the following is mostly solved, and the bitstream is now stable, and has been
backward and forward compatible since the first released version. This will be updated
when I get some time.

=== Seeking and memory ===

When seeking to a particular time in a movie with subtitles, we may end up at a place where a subtitle has been started, but is not removed yet. Pure streaming doesn't have this problem as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.

I see three possible ways to solve this:
*each data packet includes the granule of the earliest still active packet (if none, this will be the granule of this very packet)
**this means seeks are two phased: first seek, find the next Kate packet, and seek again if the granule of the earliest still active packet is less than the original seeked granule. This implies support code on players to do the double seek.

*use "reference frames", a bit like Theora does, where the granule position is split in several fields: the higher bits represent a position for the reference frame, and the lowest bits a delta time to the current position. When seeking to a granule position, the lower bits are cleared off, yielding the granule position of the previous reference frame, so the seek ends up at the reference frame. The reference frame is a sync point where any active strings are issued again. This is a variant of the method described in the Writ wiki page, but the granule splitting avoids any "downtime".
**this requires reissuing packets, and it doesn't feel right (and wastes space).
**it also requires "dummy" decoding of Kate data from the reference frame to the actual seek point to fully refresh the state "memory".

*A variant of the two-granules-in-one system used by libcmml, where the "back link" points to the earliest still active string, rather than the previous one (this allows a two phase seek, rather than a multiphase seek, hopping back from event to event, with no real way to know if there is or not a previous event which is still active - I suppose CMML has no need to know this, if their "clips" do not overlap - mine can do).
**Such a system considerably shortens the usable granule space, though it can do a one phase seek, if I understand the system correctly, of which I am not certain.
*** Well, it seems it can't do a one phase seek anyway.

*Additionally, it could be possible to emit simple "keepalive" packets at regular intervals to help a seek algorithm to sync up to the stream without needing too much data reading - this helps for discontinuous streams where there could be no pages for a while if no data is needed at that time.
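
To illustrate the first option, here is a Python sketch of the two phase seek over an in-memory list of packets. Everything in it is hypothetical: a real player would seek through Ogg pages rather than a list, and the (start, earliest_active) pairs stand in for the timing carried by each data packet.

```python
import bisect


def two_phase_seek(packets, target):
    """Return the index of the packet to start decoding from.

    packets is a list of (start, earliest_active) pairs sorted by start,
    where earliest_active is the start of the earliest event still active
    when that packet was emitted.
    """
    starts = [start for start, _ in packets]
    # Phase 1: find the next packet at or after the target time.
    first = bisect.bisect_left(starts, target)
    if first == len(packets):
        return first                     # nothing left to decode
    earliest_active = packets[first][1]
    # Phase 2: seek again if an older event is still active.
    if earliest_active < target:
        return bisect.bisect_left(starts, earliest_active)
    return first
```

In the list [(0, 0), (5, 5), (8, 5), (12, 5), (20, 20)], a seek to time 10 first lands on the packet starting at 12, which reports an event active since 5, so decoding restarts from the packet starting at 5.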

=== Text encoding ===

A header field declares the text encoding used in the stream. At the moment, only UTF-8 is
supported, for simplicity. There are no plans to support other encodings, such as UTF-16,
at the moment.

Note that strings included in the header (language, category) are not affected by that
text encoding (rather obviously for language itself). These are ASCII.

The actual text in events may include simple HTML-like markup (at the moment, allowed markup
is the same as the one Pango uses, but more markup types may be defined in the future).
It is also possible to ask libkate to remove this markup if the client prefers to receive
plain text without the markup.
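
How libkate removes markup is not described here. As a rough idea of what a client would end up with, the Python sketch below strips tags and decodes entities with the standard html.parser module. It is an approximation for illustration and may not match libkate's behaviour on malformed markup.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect character data and drop every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_markup(text):
    """Return text with HTML-like tags removed and entities decoded."""
    parser = _TextOnly()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)
```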

=== Language encoding ===

A header field defines the language (if any) used in the stream (this can be overridden in a
data packet, but this is not relevant to this point). At the moment, my test code uses
ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching
a language to a user selection may be simpler for user code if the language encoding is kept
simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary
tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.

If a stream contains more than one language, there usually is a predominant language, which
can be set as the default language for the stream. Each event can then have a language
override. If there is no predominant language, and it is not possible to split the stream
into multiple substreams, each with its own language, then it is possible to use the "mul"
language tag, as a last resort.

=== Bitstream format for floating point values ===

Floating point values are turned to a 16.16 fixed point format, then stored in a bitpacked
format, storing the number of zero bits at the head and tail of the floating point values once
per stream, and the remainder bits for all values in the stream. This seems to yield good results
(typically a 50% reduction over 32 bits raw writes, and 70% over the snprintf based storage), and
has the big advantage of being portable (eg, independent of any IEEE format).
However, this means reduced precision due to the quantization to 16.16. I may add support for
variable precision (eg, 8.24 fixed point formats) to alleviate this. This would however mean less
space savings, though these are likely to be insignificant when Kate streams are interleaved with
a video.
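
The idea can be sketched in Python as follows. This shows the principle only, not the exact libkate bit layout: the function names are made up, values are treated as unsigned 32 bit words, and sign handling, which would matter for negative values, is ignored.

```python
def to_fixed_16_16(value):
    """Quantize a float to 16.16 fixed point, as an unsigned 32 bit word."""
    return int(round(value * 65536)) & 0xFFFFFFFF


def head_tail_zero_bits(words):
    """Count the zero bits shared by all words at the head and the tail."""
    merged = 0
    for word in words:
        merged |= word
    if merged == 0:
        return 32, 0                     # every bit is zero
    head = 32 - merged.bit_length()
    tail = (merged & -merged).bit_length() - 1
    return head, tail


def packed_bits(words):
    """Bits needed for the remainder bits of all words."""
    head, tail = head_tail_zero_bits(words)
    return (32 - head - tail) * len(words)
```

For the values 0.5, 1.25 and 3.0, the fixed point words share 14 zero bits at each end, so only 4 bits per value remain to be stored, against 32 for a raw write.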

*Though this is not a Kate issue per se, the motion feature is very difficult to use without a curve editor. While tools may be coded to create a Kate bitstream for various existing subtitle formats, it is not certain it will be easy to find a good authoring tool for a series of curves. That said, it's not exactly difficult to do if you know a widget set.

=== Higher dimensional curves/motions ===

It is quite annoying to have to create two motions to control a color change, due to curves
being restricted to two dimensions. I may add support for arbitrary dimensions. It would also
help for 1D motions, like changing the time flow, where one coordinate is simply ignored at
the moment.
Alternatively, changes could be made to the Kate file format to hide the two dimensionality and
allow simpler specification of non-2 dimensional motions, but still map them to 2D in the kate
bitstream format.

=== Category definition ===

The category field in the BOS packet is a 16 byte text field (15 really, as it is zero terminated
in the bitstream itself). Its goal is to provide the reader with a short description of what kind
of information the stream contains, eg subtitles, lyrics, etc. This would be displayed to the user,
possibly to let them choose which streams to turn on and off.

Since this category is meant primarily for a machine to parse, it will be kept to ASCII. When
a player recognizes a category, it is free to replace its name with one in the user's language if
it prefers. Even in English, the "lyrics" category could be displayed by a player as "Lyrics".

Since this is a free text field rather than an enumeration, it would be good to have a list of
common predefined category names that Kate streams can use.

This is a list of proposed predefined categories, feedback/additions welcome:

* subtitles - the usual movie subtitles, as text
* spu-subtitles - movie subtitles in DVD style paletted images
* lyrics - song lyrics
* transcript - exact words of a speech
* commentary - running commentary about an accompanying eg. video
* narration - narration of an accompanying eg. video
* book - a full book as text, might be a lone Kate stream (or muxed with other languages)

Please remember the 15 character limit if proposing other categories.
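
A quick way to check a proposed name against that limit is sketched below in Python. The function name is made up for the example. It builds the 16 byte, zero terminated ASCII field described above.

```python
def encode_category(category):
    """Return the 16 byte, NUL padded field for a category name."""
    data = category.encode("ascii")      # raises on non ASCII names
    if len(data) > 15:
        raise ValueError("category is limited to 15 characters")
    return data.ljust(16, b"\0")
```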

== Text to speech ==

One of the goals of the Kate bitstream format is that text data can be easily parsed
by the user of the decoder, so any additional information, such as style, placement,
karaoke data, etc, should be able to be stripped to leave only the bare text. This is
in view of allowing text-to-speech software to use Kate bitstreams as a bandwidth-cheap
way of conveying speech data, and could also allow things like e-books which can be
either read or listened to from the same bitstream (I have seen no reference to this
being used anywhere, but I see no reason why the granule progression should be temporal,
and not user controlled, such as by using a "next" button which would bump a granule
postion by a preset amount, simulating turning a page (this would be close to necessary
for text-to-speech, as the wall time duration of the spoken speech is not known in
advance to the Kate encoder, and can't be mapped to a time based granule progression)).
All text strings triggered consecutively between the two granule positions would then
be read in order.
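
The "next" button idea could look roughly like the sketch below. Nothing here is part of libkate: the event representation (a list of granule position and text pairs), the function name and the choice of a half-open interval are all assumptions made for illustration.

```python
def next_page(events, position, step):
    """Bump the granule position by `step` and collect the text to be read.

    `events` is a list of (granule_position, text) pairs (hypothetical layout).
    Returns the new position and the texts triggered in [position, position + step),
    ordered by granule position.
    """
    new_position = position + step
    # sorted() is stable, so events sharing a granule position keep their stream order.
    ordered = sorted(events, key=lambda event: event[0])
    texts = [text for granule, text in ordered if position <= granule < new_position]
    return new_position, texts
```

A text-to-speech front end could call this each time the user asks for the next page, and read the returned strings in order.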

== Possible additions ==

=== Embedded binary data ===

Images and font mappings can be included within a Kate stream.

==== Images ====

Though this could be misused to interfere with the ability to render as text-to-speech, Kate
can use images as well as text. The same caveat as for fonts applies with regard to data
duplication.

Complex images might however be best left to a multiplexed OggSpots or OggMNG stream, unless the
images mesh with the text (eg, graphical exclamation points, custom fonts (see the Fonts
section below), etc).

There is support for simple paletted bitmap images, with a variable length palette of up
to 256 colors (in fact, sized in powers of 2 up to 256) and matching pixel data in as
many bits per pixel as can address the palette. Palettes and images are stored separately,
so can be used with one another with no fixed assignment.
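
To make the relation between palette size and pixel depth concrete, here is a small sketch. The function name is hypothetical, and the lower bound of 2 colors is an assumption; the text above only states that palettes are sized in powers of 2 up to 256.

```python
def bits_per_pixel(palette_size: int) -> int:
    """Number of bits needed to address every entry of a power-of-two palette."""
    is_power_of_two = palette_size > 0 and palette_size & (palette_size - 1) == 0
    if not is_power_of_two or not 2 <= palette_size <= 256:
        raise ValueError("palette size must be a power of 2 between 2 and 256")
    # For a power of two 2**n, bit_length() is n + 1.
    return palette_size.bit_length() - 1
```

So a 16 color palette pairs with 4 bits per pixel, and a full 256 color palette with 8.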

Palettes and bitmaps are put in two separate headers for later use by reference, but can
also be placed in data packets, as with motions, etc, if they are not going to be reused.

PNG bitmaps can also be embedded in a Kate stream. These do not have associated palettes
(but the PNGs themselves may or may not be paletted). There is no support for decoding PNG
images in libkate itself, so a program will have to use libpng (or similar code) to decode
the PNG image. For instance, the libtiger rendering library uses Cairo to decode and render
PNG images in Kate streams.

This can be used to have custom fonts, so that raw text is still available if the stream
creator wants a custom look.

I expect that the need for more than 256 colors in a bitmap, or non palette bitmap data,
would be best handled by another codec, eg OggMNG or OggSpots. The goal of images in a
Kate stream is to mesh the images with the text, not to have large images by themselves.

On the other hand, interesting Karaoke effects could be achieved by having MNG images
instead of simple paletted bitmaps in a Kate stream. Comments would be most welcome on
whether this is going too far, however.

A possible solution to the duplication issue is to have another stream in the container
stream, which would hold the shared data (eg, fonts), which the user program could load,
and which could then be used by any Kate (and other) stream. Typically, this type of stream
would be a degenerate stream with only header packets (so it is fully processed before any
other stream presents data packets that might make use of that shared data), and all payload
such as fonts being contained within the headers. Thinking about it, it has parallels with
the way Vorbis stores its codebooks within a header packet, or even the way Kate stores the
list of styles within a header packet.

==== Fonts ====

Custom fonts are merely a set of ranges mapping unicode code points to bitmaps. As this implies,
fonts are bitmap fonts, not vector fonts, so scaling, if supported by the rendering client,
may not look as good as with a vector font.
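
As an illustration of "ranges mapping unicode code points to bitmaps", a lookup could be sketched as follows. The field names and the idea that a range stores the index of its first bitmap are assumptions for the example, not the actual libkate structures.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class FontRange:
    """One contiguous run of code points (hypothetical layout)."""
    first_code_point: int
    last_code_point: int
    first_bitmap: int  # index of the bitmap for first_code_point


def bitmap_for(ranges: Sequence[FontRange], ch: str) -> Optional[int]:
    """Return the bitmap index for a character, or None if no range covers it."""
    code_point = ord(ch)
    for r in ranges:
        if r.first_code_point <= code_point <= r.last_code_point:
            return r.first_bitmap + (code_point - r.first_code_point)
    return None
```

A renderer that gets None back would presumably fall back to a system font for that character.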

A style may also refer to a font name to use (eg, "Tahoma"). These fonts may or may not be
available on the playing system, however, since the font data is not included in the stream,
just referenced by name. For this reason, it is best to keep to widely known fonts.

== Reference encoder/decoder ==

An encoder and a decoder are included in the tools directory. The encoder pulls its input from a custom
text based file format (see [[#The Kate file format|The Kate file format]]),
which is by no means meant to be part of the Kate bitstream specification itself,
from a SubRip (.srt) format file (the most common subtitle format I found, and a very basic one),
or from a lyrics (.lrc) format file.

The Kate bitstreams encoded and decoded by those tools are (supposed to be) correct for this
specification, provided their input is correct.

== Next steps ==

=== Continuations ===

Continuations are a way to add to existing events, and are mostly meant for motions. When streaming
in real time, what motions may be applied to events may not be known in advance (for instance, for a
draw chat program where two programs exchange Kate streams, the drawing motions are only known as
they are drawn). Continuations will allow an event to be extended in time, and motions to be appended
to it. This is only useful for streaming, as when stored in a file, everything is already known in
advance.

=== A rendering library ===

This will allow easier integration in other packages (movie players, etc).
I have started working on an implementation using Cairo and Pango, though I'm still at the early stages.
I might add support for embedding vector fonts in a Kate stream if I go that way. I still need to think about this.
Another point of note is that when this library is available, it would make it easier to add
capabilities such as rotation, scaling, etc, to the bitstream, since this would not cause too
much work for playing programs using the rendering library. It is expected that these additions
would stay backward compatible (eg, an old player would ignore this information but still correctly
decode the information they can work with from a newly encoded stream).

=== An XML representation ===

While I purposefully did not write Kate description files in XML, because I find editing XML such
a chore, it would be nice to be able to losslessly convert between the more user friendly representation
and an XML document, so one can do what one does with XML documents, like transformations.

And after all, some people might prefer editing the XML version.

== Matroska mapping ==

The codec ID is "S_KATE".

As for Theora and Vorbis, Kate headers are stored in the private data as xiph-laced packets:

* Byte 0: number of packets present, minus 1 (there must be at least one packet) - let this number be NP
* Bytes 1..n: lengths of the first NP packets, coded in xiph style lacing
* Bytes n+1..end: the data packets themselves concatenated one after the other

Note that the length of the last packet isn't encoded; it is deduced from the sizes of the other
packets and the total size of the private data.

This mapping is similar to the Vorbis and Theora mappings, with the caveat that one should not
expect a set number of headers.
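
The private data layout described above can be sketched as follows. This is a minimal illustration, not code taken from libkate or any Matroska muxer, and the function names are made up. It relies on xiph style lacing coding a length as a run of 255 bytes followed by one byte smaller than 255, the length being the sum of those bytes.

```python
def xiph_lace_size(size: int) -> bytes:
    """Code one packet length: as many 255 bytes as fit, then the remainder (< 255)."""
    if size < 0:
        raise ValueError("size must not be negative")
    return b"\xff" * (size // 255) + bytes([size % 255])


def pack_private_data(packets):
    """Build the private data from a list of header packets (bytes objects)."""
    if not 1 <= len(packets) <= 256:
        raise ValueError("need between 1 and 256 packets")
    out = bytearray([len(packets) - 1])      # byte 0: packet count minus 1 (NP)
    for packet in packets[:-1]:              # the last length is not stored
        out += xiph_lace_size(len(packet))
    for packet in packets:
        out += packet
    return bytes(out)


def unpack_private_data(data: bytes):
    """Split the private data back into header packets, however many there are."""
    if not data:
        raise ValueError("empty private data")
    count = data[0]                          # NP: number of coded lengths
    pos = 1
    sizes = []
    for _ in range(count):
        size = 0
        while True:
            if pos >= len(data):
                raise ValueError("truncated lacing")
            byte = data[pos]
            pos += 1
            size += byte
            if byte != 255:
                break
        sizes.append(size)
    last = len(data) - pos - sum(sizes)      # deduced length of the last packet
    if last < 0:
        raise ValueError("packet lengths exceed private data size")
    sizes.append(last)
    packets = []
    for size in sizes:
        packets.append(data[pos:pos + size])
        pos += size
    return packets
```

Note that the reader takes the packet count from byte 0 rather than assuming a fixed number of headers, in line with the caveat above.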

== Downloading ==

libkate encodes and decodes Kate streams, and is API and ABI stable.

The libkate source distribution is available at [http://libkate.googlecode.com/ http://libkate.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/kate.git;a=summary http://git.xiph.org/?p=users/oggk/kate.git;a=summary].

libtiger renders Kate streams using Pango and Cairo, and is alpha, with API changes still possible.

The libtiger source distribution is available at [http://libtiger.googlecode.com/ http://libtiger.googlecode.com/].

A public git repository is available at [http://git.xiph.org/?p=users/oggk/tiger.git;a=summary http://git.xiph.org/?p=users/oggk/tiger.git;a=summary].

== Things I need to get feedback on ==

* is it a good idea to avoid floating point usage altogether ?

[[Category:Drafts]]
[[Category:Ogg Mappings]]