From XiphWiki

Revision as of 02:09, 23 November 2010 by Cpearce (Talk | contribs)


Ogg Skeleton 4 Message Headers

Adding New Message Headers to Skeleton

With the HTML5 video element, Ogg is now a major format on the Web and is being applied to solve use cases it hasn't had to solve before, but was built to allow, see

One such use case is dealing with multitrack audio and video, such as videos with multiple view angles encoded in one file, or videos with a sign language video track, an audio description track, a caption track and several subtitle tracks in different languages (i.e. several Theora, several Vorbis and several Kate tracks).

While encoding of multitrack files is already possible, it is unclear how such files would be rendered, how tracks would be differentiated and addressed (e.g. from a JavaScript API), etc. Skeleton has been built in a way such that it is extensible with message header fields for this purpose.

On this wiki page, we are collecting such new information fields.


Right now, there is one mandatory message header field for all of the logical bitstreams: the "Content-type" header field, which contains the mime type of the track. The mime types in use here are listed at


Content in a track usually originates from a specific language. This language can be specified in a Language message header field. The code is created according to and

For audio tracks with speech, the Language would be the language that dominates.

For video tracks, it might be the language that is signed (if it is a sign language video), or the language that is most often represented in scene text.

For text tracks, it is the dominating language in the text, e.g. English or German subtitles.

Examples are: en-US, de-DE, sgn-ase, en-cockney

The Language field specifies the dominating language first. Additional, less dominant languages can be given as a list after the main language.


Language: en-US, fr
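
As a minimal sketch of how a reader might consume this field, the following Python splits a Language value into its dominating language and the remaining ones. The function name parse_language_field is hypothetical, and the comma-separated syntax is assumed from the example above rather than taken from a finished specification.

```python
def parse_language_field(value: str) -> list[str]:
    """Split a Language header value into language tags, dominant first."""
    tags = [tag.strip() for tag in value.split(",")]
    # Drop empty entries caused by stray or trailing commas.
    return [tag for tag in tags if tag]


languages = parse_language_field("en-US, fr")
dominant, others = languages[0], languages[1:]
print(dominant, others)
```

The sketch does not validate the tags themselves; it only preserves their order, which is what carries the "dominating language first" meaning.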


Role describes what semantic type of content is contained in a track. Every track can have only a single role value, so the most appropriate role has to be chosen. The same role can be used across multiple tracks.

The following lists some commonly used roles. Other roles are possible, too, but should only be introduced if there is a real need for them.

Text tracks:

  • "text/caption" - transcription of all sounds, including speech, for purposes of the hard-of-hearing
  • "text/subtitle" - translation of all speech, typically into a different language
  • "text/textaudiodesc" - description/transcription of everything that happens in a video as text to be used for the vision-impaired through screen readers or braille
  • "text/karaoke" - music lyrics delivered in chunks for singing along
  • "text/chapters" - titles for sections of the media that provide a kind of chapter segmentation (similar to DVD chapters)
  • "text/tickertext" - text to run as informative text at the bottom of the media display
  • "text/lyrics" - transcript of the text used in music media
  • "text/metadata" - name-value pairs that are associated with certain sections of the media
  • "text/annotation" - free text associated with certain sections of the media
  • "text/linguistic" - linguistic markup of the spoken words

Video tracks:

  • "video/main" - the main video track
  • "video/alternate" - an alternative video track, e.g. different camera angle
  • "video/sign" - a sign language video track

Audio tracks:

  • "audio/main" - the main audio track
  • "audio/alternate" - an alternative audio track, probably linked to an alternate video track
  • "audio/dub" - the audio track but with speech in a different language to the original
  • "audio/audiodesc" - an audio description recording for the vision-impaired
  • "audio/music" - a music track, e.g. when music, speech and sound effects are delivered in different tracks
  • "audio/speech" - a speech track, e.g. when music, speech and sound effects are delivered in different tracks
  • "audio/sfx" - a sound effects track, e.g. when music, speech and sound effects are delivered in different tracks

Notice how we are re-using the Content-type approach of specifying the main semantic type of the track first. This is necessary, since mime types don't always provide the right main content type (e.g. application/kate is semantically a text format).

There may also be parameters to describe the roles better, such as "video/alternate;angle=nw"
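
A Role value of this shape can be taken apart much like a mime type. The following Python is a sketch under the assumption that parameters are separated by ";" and written as key=value, as in the example above; parse_role is a hypothetical helper, not part of any existing library.

```python
def parse_role(value: str) -> tuple[str, str, dict[str, str]]:
    """Split a Role value such as "video/alternate;angle=nw".

    Returns (main_type, role, parameters).
    """
    role_part, *param_parts = [part.strip() for part in value.split(";")]
    # The part before "/" is the main semantic type, the rest is the role.
    main_type, _, role = role_part.partition("/")
    params: dict[str, str] = {}
    for part in param_parts:
        if not part:
            continue  # tolerate a trailing ";"
        key, _, val = part.partition("=")
        params[key.strip()] = val.strip()
    return main_type, role, params


print(parse_role("video/alternate;angle=nw"))
```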


The Name field associates a free text string with the track, so that the track can be addressed directly through its name.

Characters allowed are basically all the characters that are also allowed for XML id fields:

the first character has to be one of:
[A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] |
[#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
any following characters can be one of:
[A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | 
[#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF] | 
"-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]

The name needs to be unique between all the track names, otherwise it is undefined which of the tracks is retrieved when addressing by name.

An example means of addressing the track by name is: track[name="Madonna_singing"]
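
The character rules above translate fairly directly into a regular expression. The following Python is a sketch that copies the listed ranges into two character classes; is_valid_track_name is a hypothetical name, and uniqueness across tracks is not checked here.

```python
import re

# Characters allowed in the first position, copied from the ranges listed above.
_START = (
    "A-Z_a-z"
    "\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF"
    "\u0370-\u037D\u037F-\u1FFF\u200C-\u200D"
    "\u2070-\u218F\u2C00-\u2FEF\u3001-\uD7FF"
    "\uF900-\uFDCF\uFDF0-\uFFFD"
    "\U00010000-\U000EFFFF"
)
# Following positions additionally allow "-", ".", digits and a few marks.
_FOLLOW = _START + r"\-.0-9" + "\u00B7\u0300-\u036F\u203F-\u2040"

_NAME_RE = re.compile(f"[{_START}][{_FOLLOW}]*")


def is_valid_track_name(name: str) -> bool:
    """Return True if every character of `name` is allowed in its position."""
    return _NAME_RE.fullmatch(name) is not None


print(is_valid_track_name("Madonna_singing"))
print(is_valid_track_name("1st_track"))
```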


Title is a free text field that provides a description of the track content.


Title: "the French audio track for the movie"


Media players that are not informed about how a content author intends a media file to be displayed have no chance of displaying the content "correctly". This is why the Display-hint message header field allows hints to be given on how a certain track should be displayed. A media player can of course decide to ignore these hints.

Currently proposed hints are:

  • pip(x,y,w,h) on a video track - picture-in-picture display in relation to the zero coordinates of the display area of the video with x,y providing the origin of the top left corner of the PIP video and w,h the width and height in pixels which are optional. x, y, w, and h can be specified in percentage, thus allowing persistent placement independent of the scaling of the video display.


Display-hint: pip(20%,20%)
Display-hint: pip(40,40,690,60)
  • mask(img,x,y,w,h) on a video track - use the image given at img url (?) as a video mask to allow the video to appear in shapes other than rectangular. The masking image should be a black shape on a white background. The image is placed at offset x,y and scaled to width and height w and h. x,y,w, and h can be provided in pixels or in percent. Pixels under the white background are made transparent and only pixels under the black shape are retained.


Display-hint: mask(
Display-hint: mask(,30%,25%)
Display-hint: mask(,20,20,400,320)
  • transparent(transparency) on a video track - apply the given transparency (an integer percentage between 0 and 100) to the complete video track as it is rendered on top of other content. This transparency is applied to all pixels in the same way.


Display-hint: transparent(25%)
Display-hint: transparent(7%)
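
To illustrate how a player might read these hints, the following Python sketch splits a Display-hint value into the hint name and its arguments, marking each numeric argument as a percentage or a pixel value. parse_display_hint and Length are hypothetical names. Non-numeric arguments, such as the image reference of mask(), are kept verbatim, and the sketch assumes that such a reference contains no commas.

```python
import re
from typing import NamedTuple, Union


class Length(NamedTuple):
    value: float
    unit: str  # "%" for percentages, "px" for plain pixel values


_HINT_RE = re.compile(r"^\s*([A-Za-z]+)\s*\((.*)\)\s*$")
_NUM_RE = re.compile(r"^(\d+(?:\.\d+)?)(%?)$")


def parse_display_hint(value: str) -> tuple[str, list[Union[Length, str]]]:
    """Parse e.g. "pip(20%,20%)" into ("pip", [Length(20.0, "%"), ...])."""
    match = _HINT_RE.match(value)
    if match is None:
        raise ValueError(f"not a display hint: {value!r}")
    name, raw_args = match.groups()
    args: list[Union[Length, str]] = []
    for raw in raw_args.split(","):
        raw = raw.strip()
        num = _NUM_RE.match(raw)
        if num:
            unit = "%" if num.group(2) else "px"
            args.append(Length(float(num.group(1)), unit))
        else:
            args.append(raw)  # non-numeric argument, kept as written
    return name, args


print(parse_display_hint("pip(40,40,690,60)"))
print(parse_display_hint("transparent(25%)"))
```

Since a player may ignore hints, a caller could catch the ValueError and simply fall back to its default layout.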

Track order

In many applications it is necessary to walk through all the tracks in a media file and address tracks by an index.

In Ogg, tracks are numbered by the order in which their BOS pages appear in the Ogg stream. If a file is re-encoded, the order may change, so this numbering can only be relied on for addressing as long as the file doesn't change.

For example, a video file with the following composition would have the following indexes:

  • track[0]: Skeleton BOS
  • track[1]: Theora BOS for main video
  • track[2]: Vorbis BOS for main audio
  • track[3]: Kate BOS for English captions
  • track[4]: Kate BOS for German subtitles
  • track[5]: Vorbis BOS for audio descriptions
  • track[6]: Theora BOS for sign language

This track order is simply to have a means to address tracks through an index in a consistent manner across different media players, such that e.g. JavaScript can always link to the same track reliably across browsers. It has no influence on what should be displayed on top of which other track.
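
The indexing rule can be sketched in a few lines of Python. The representation of pages as (serial_number, is_bos) pairs is an assumption made for illustration; a real player would obtain this information from its Ogg demuxer.

```python
from typing import Iterable


def track_indexes(pages: Iterable[tuple[int, bool]]) -> dict[int, int]:
    """Map each logical bitstream serial number to its track index.

    `pages` yields (serial_number, is_bos) pairs in stream order. A track's
    index is the position of its BOS page among all BOS pages.
    """
    indexes: dict[int, int] = {}
    for serial, is_bos in pages:
        if is_bos and serial not in indexes:
            indexes[serial] = len(indexes)
    return indexes


# Hypothetical serial numbers: three BOS pages followed by data pages.
pages = [(100, True), (200, True), (300, True), (200, False), (100, False)]
print(track_indexes(pages))
```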


The Altitude (better name?) message header field defines the stack order of the tracks, i.e. which track is displayed further towards the top of the stack and which further down. By default, a "main" track is always displayed bottom-most unless otherwise defined.

The Altitude field takes the same numerical values as the z-index in CSS, unlimited negative and positive numbers. An element with greater stack order is always in front of an element with a lower stack order.

Example: Altitude: -150
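
A renderer could derive its drawing order from this field roughly as follows. This Python sketch represents each track as a plain dict of its message header fields, which is an assumption for illustration. It also assumes that a "main" track without an Altitude sorts below everything else and that any other track without an Altitude is treated as 0; the text above does not define the latter case.

```python
import math


def effective_altitude(track: dict) -> float:
    """Return the stack position used for sorting one track."""
    altitude = track.get("Altitude")
    if altitude is not None:
        return int(altitude)
    # Assumption: a main track without Altitude goes bottom-most,
    # every other track without Altitude is treated as 0.
    role = track.get("Role", "").split(";")[0].strip()
    return -math.inf if role.endswith("/main") else 0


def stacking_order(tracks: list[dict]) -> list[str]:
    """Return track names from bottom-most to top-most."""
    ordered = sorted(tracks, key=effective_altitude)
    return [track["Name"] for track in ordered]


tracks = [
    {"Name": "captions", "Role": "text/caption", "Altitude": "10"},
    {"Name": "sign", "Role": "video/sign", "Altitude": "-150"},
    {"Name": "main", "Role": "video/main"},
]
print(stacking_order(tracks))
```

Python's sorted is stable, so tracks with equal altitude keep the order in which they were passed in.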

Track dependencies

It is tempting to introduce dependencies between tracks - to specify things such as:

  • track b depends on track a being available (e.g. main audio depending on main video), so always display them together and if you remove a track, remove all depending tracks, too
  • track c and d are alternative tracks to track b (e.g. dubs in other languages for main audio), so don't display them together and if you activate one, disable the others
  • track a, one of b, c, d, and one of e, f, g, where e depends on b, f depends on c, and g depends on d, make up a presentation profile and should be displayed together (e.g. main video, one of the audio dubs, and their respective captions).

It is not clear yet whether there is an actual need to maintain this information as author-provided hints or whether a media player can itself determine a lot from the other fields, such as role and language.

MPEG has a "groupID" element which allows for tracks to be put into groups of alternative tracks. This feature is, however, not used very often and decisions are being left to the media player.

At this stage, it's probably too early to make a specification for how to encode this in Ogg. The need has not been totally clarified yet.
