SkeletonHeaders

From XiphWiki
Revision as of 17:39, 20 March 2010 by Silvia (talk | contribs) (turn it into actual percentages)

Adding Required Headers to Skeleton

With the HTML5 video element, Ogg is now a major format on the Web. It is being applied to use cases that it has not had to solve before, but that it was designed to allow; see http://www.xiph.org/ogg/doc/oggstream.html.

One such use case is multitrack audio and video: for example, videos with multiple view angles encoded in one file, or videos with a sign language video track, an audio description track, a caption track and several subtitle tracks in different languages (i.e. several Theora, several Vorbis and several Kate tracks).

While encoding of multitrack files is already possible, it is unclear how such files would be rendered, how tracks would be differentiated and addressed (e.g. from a JavaScript API), etc. Skeleton has been built in a way such that it is extensible with message header fields for this purpose.

On this wiki page, we are collecting such new information fields.


Content-type

Right now, there is one mandatory message header field for all of the logical bitstreams: the "Content-type" header field, which contains the MIME type of the track. The MIME types in use here are listed at http://wiki.xiph.org/MIME_Types_and_File_Extensions#Codec_MIME_types.
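
As a rough illustration, the sketch below parses a block of message header fields into a dictionary. It assumes the fields are stored as "Name: value" lines separated by CRLF, in the style of HTTP headers. The function name parse_message_headers and the sample bytes are made up for this example and do not come from any Xiph library.

```python
def parse_message_headers(raw: bytes) -> dict:
    """Parse CRLF-separated 'Name: value' lines into a dict (assumed layout)."""
    headers = {}
    for line in raw.decode("utf-8").split("\r\n"):
        if not line.strip():
            continue  # skip empty lines, e.g. after the final CRLF
        # Split at the first colon only, so values may contain colons (URLs).
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"malformed header line: {line!r}")
        headers[name.strip()] = value.strip()
    return headers


# Hypothetical header block for one Theora track.
example = (
    b"Content-type: video/theora\r\n"
    b"Language: en-US\r\n"
    b"Display-hint: mask(http://www.example.com/image.png)\r\n"
)
fields = parse_message_headers(example)
```

Splitting at the first colon only is a deliberate choice here, because values such as URLs contain colons themselves.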


Language

Content in a track usually originates from a specific language. This language can be specified in a Language message header field. The language code is formed according to http://www.w3.org/TR/ltli/ and http://www.rfc-editor.org/rfc/bcp/bcp47.txt.

For audio tracks with speech, the Language would be the language that dominates.

For video tracks, it might be the language that is signed (if it is a sign language video), or the language that is most often represented in scene text.

For text tracks, it is the dominating language in the text, e.g. English or German subtitles.

Examples are: en-US, de-DE, sgn-ase, en-cockney
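
One way a player might use the Language field is to select tracks that match a user's preferred language. The sketch below uses simple prefix matching in the spirit of the "basic filtering" scheme of RFC 4647; whether a player should match this way is an assumption, and the track names in the example are hypothetical.

```python
def matches(language_range: str, tag: str) -> bool:
    """True if tag equals the range, or starts with the range plus '-'.

    The comparison is case-insensitive; '*' matches any tag.
    """
    r = language_range.lower()
    t = tag.lower()
    return r == "*" or t == r or t.startswith(r + "-")


# Hypothetical mapping of track names to Language values.
tracks = {"captions": "en-US", "subtitles": "de-DE", "signing": "sgn-ase"}
english = [name for name, tag in tracks.items() if matches("en", tag)]
```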


Role

Role describes what semantic type of content is contained in a track. Every track can only have a single role value, so the most appropriate role has to be chosen. The same role can be used across multiple tracks.

The following lists some commonly used roles. Other roles are possible, too, but should only be used or introduced if there is really a need for them.

Text tracks:

  • "text/caption"
  • "text/subtitle"
  • "text/textaudiodesc"
  • "text/karaoke"
  • "text/chapters"
  • "text/tickertext"
  • "text/lyrics"
  • "text/activeregion"
  • "text/metadata"
  • "text/annotation"
  • "text/transcript"
  • "text/linguistic"

Video tracks:

  • "video/main"
  • "video/alternate" (e.g. different camera angle)
  • "video/sign" (for sign language)
  • "video/alpha" (a track to alpha blend)

Audio tracks:

  • "audio/main"
  • "audio/alternate" (probably linked to an alternate video track)
  • "audio/dub"
  • "audio/audiodesc"
  • "audio/music"
  • "audio/speech"
  • "audio/sfx" (sound effects)

Notice how we are re-using the Content-type approach of specifying the main semantic type of the track first. This is necessary, since MIME types don't always provide the right main content type (e.g. application/kate is semantically a text format).

There may also be parameters to describe the roles better, such as "video/alternate;angle=nw"
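
A minimal sketch of how a Role value with optional parameters could be split into its parts. The grammar is assumed to be "maintype/subtype" followed by ";key=value" pairs, based only on the example above; parse_role is a hypothetical helper.

```python
def parse_role(value: str):
    """Split a Role value into (main type, role, parameters)."""
    parts = [p.strip() for p in value.split(";")]
    main, sep, kind = parts[0].partition("/")
    if not sep or not main or not kind:
        raise ValueError(f"expected 'type/role', got {parts[0]!r}")
    params = {}
    for part in parts[1:]:
        if not part:
            continue  # tolerate a trailing ';'
        key, sep, val = part.partition("=")
        if not sep:
            raise ValueError(f"expected 'key=value', got {part!r}")
        params[key.strip()] = val.strip()
    return main, kind, params


role = parse_role("video/alternate;angle=nw")
```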


Display-hint

Media players that are not told how a content author intends a media file to be displayed have no chance to display the content "correctly". The Display-hint message header field therefore allows authors to provide hints on how a certain track should be displayed. A media player can of course decide to ignore these hints.

Currently available hints are:

  • pip(x,y,w,h) on a video track - picture-in-picture display in relation to the "main" video track. x,y give the position of the top left corner of the PIP video; w,h give its width and height and are optional.

Examples:

Display-hint: pip(20,20)
Display-hint: pip(40,40,690,60)
  • mask(img,x,y,w,h) on a video track - use the image given at img url (?) as a video mask to allow the video to appear in shapes other than rectangular. The masking image should be a black shape on a white background. The image is placed at offset x,y and scaled to width and height w and h. Pixels under the white background are made transparent and only pixels under the black shape are retained.

Examples:

Display-hint: mask(http://www.example.com/image.png)
Display-hint: mask(http://www.example.com/image.png,20,20,400,320)
  • transparent(transparency) on a video track - apply the given transparency, as a percentage (integer value between 0 and 100), to the complete video track as it will be rendered on top of other content.

Examples:

Display-hint: transparent(25)
Display-hint: transparent(7)
  • transparentcolor(colorcode) on a video track - turn the color identified by the colorcode into transparent pixels.

Examples:

Display-hint: transparentcolor(#454545)
Display-hint: transparentcolor(#777777)
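
The hints listed above could be parsed along the following lines. This is only a sketch: the allowed argument counts (2 or 4 for pip, 1 or 5 for mask) are inferred from the examples, parse_display_hint is a hypothetical name, and a mask URL that itself contains a comma would not be handled.

```python
import re

_HINT_RE = re.compile(r"^\s*([a-z]+)\((.*)\)\s*$")

# Allowed numbers of arguments per hint, inferred from the examples.
_ARG_COUNTS = {
    "pip": {2, 4},
    "mask": {1, 5},
    "transparent": {1},
    "transparentcolor": {1},
}


def parse_display_hint(value: str):
    """Return (hint name, list of arguments) for a Display-hint value."""
    m = _HINT_RE.match(value)
    if not m or m.group(1) not in _ARG_COUNTS:
        raise ValueError(f"unknown display hint: {value!r}")
    name = m.group(1)
    args = [a.strip() for a in m.group(2).split(",")]
    if len(args) not in _ARG_COUNTS[name]:
        raise ValueError(f"wrong number of arguments in {value!r}")
    if name == "pip":
        return name, [int(a) for a in args]
    if name == "mask":
        # First argument is the image URL, the rest are x, y, w, h.
        return name, [args[0]] + [int(a) for a in args[1:]]
    if name == "transparent":
        percent = int(args[0])
        if not 0 <= percent <= 100:
            raise ValueError("transparency must be between 0 and 100")
        return name, [percent]
    return name, args  # transparentcolor: keep the color code as text


hint = parse_display_hint("pip(40,40,690,60)")
```

Since a media player may ignore hints, a caller could catch ValueError and simply fall back to its default layout.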

Name

This field provides the opportunity to associate a free text string with the track to allow direct addressing of the track through its name.

Characters allowed are basically all the characters that are also allowed for XML id fields:

the first character has to be one of:
[A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] |
[#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
any following characters can be one of:
[A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | 
[#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF] | 
"-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]

The name needs to be unique among all the track names; otherwise it is undefined which of the tracks is retrieved when addressing by name.

An example means of addressing the track by name is: track[name="Madonna_singing"]
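
The character rules above translate fairly directly into a regular expression. The sketch below builds one from the listed ranges and uses it while indexing tracks by name. Rejecting duplicate names with an error is a choice made for this example (the text above only says the result is undefined), and the track dictionaries are hypothetical.

```python
import re

# Characters allowed in first position, copied from the ranges above.
_START = (
    "A-Z_a-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02FF\u0370-\u037D"
    "\u037F-\u1FFF\u200C-\u200D\u2070-\u218F\u2C00-\u2FEF"
    "\u3001-\uD7FF\uF900-\uFDCF\uFDF0-\uFFFD\U00010000-\U000EFFFF"
)
# Following characters additionally allow "-", ".", digits and a few others.
_FOLLOW = _START + "\\-.0-9\u00B7\u0300-\u036F\u203F-\u2040"
_NAME_RE = re.compile("[" + _START + "][" + _FOLLOW + "]*")


def is_valid_name(name: str) -> bool:
    """True if the whole string follows the Name character rules."""
    return _NAME_RE.fullmatch(name) is not None


def index_by_name(tracks):
    """Map each Name to its track; raise on invalid or duplicate names."""
    index = {}
    for track in tracks:
        name = track.get("Name")
        if name is None:
            continue  # unnamed tracks cannot be addressed by name
        if not is_valid_name(name):
            raise ValueError(f"invalid track name: {name!r}")
        if name in index:
            raise ValueError(f"duplicate track name: {name!r}")
        index[name] = track
    return index


tracks = [{"Name": "Madonna_singing", "Role": "audio/main"}]
by_name = index_by_name(tracks)
```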


Track order

In many applications it is necessary to walk through all the tracks in a media file and address tracks by an index.

In Ogg, tracks are numbered by the order in which their BOS (beginning of stream) pages appear in the Ogg stream. If a file is re-encoded, the order may change, so this numbering can only be relied on for addressing as long as the file doesn't change.

For example, a video file with the following composition would have the following indexes:

  • track[0]: Skeleton BOS
  • track[1]: Theora BOS for main video
  • track[2]: Vorbis BOS for main audio
  • track[3]: Kate BOS for English captions
  • track[4]: Kate BOS for German subtitles
  • track[5]: Vorbis BOS for audio descriptions
  • track[6]: Theora BOS for sign language

This track order is simply to have a means to address tracks through an index in a consistent manner across different media players, such that e.g. JavaScript can always link to the same track reliably across browsers. It has no influence on what should be displayed on top of which other track.
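
To make the indexing concrete, the following sketch walks the pages of an Ogg byte string and collects the serial numbers of BOS pages in order of appearance, using the page header layout from the Ogg specification (RFC 3533). It does not verify checksums or handle chained streams, and make_page only builds fake pages (with a zero CRC) for demonstration; both function names are hypothetical.

```python
import struct

# After the "OggS" capture pattern: version, header type, granule position,
# serial number, page sequence number, CRC, number of segments.
_HEADER = struct.Struct("<BBqIIIB")
_BOS_FLAG = 0x02


def bos_serials(data: bytes):
    """Return the serial numbers of BOS pages in order of appearance."""
    serials = []
    pos = 0
    while pos + 27 <= len(data):
        if data[pos:pos + 4] != b"OggS":
            raise ValueError(f"no page at offset {pos}")
        _, header_type, _, serial, _, _, nsegs = _HEADER.unpack_from(data, pos + 4)
        # The segment table holds one length byte per segment of the body.
        body_len = sum(data[pos + 27:pos + 27 + nsegs])
        if header_type & _BOS_FLAG:
            serials.append(serial)
        pos += 27 + nsegs + body_len
    return serials


def make_page(serial: int, header_type: int, body: bytes = b"x") -> bytes:
    """Build a fake single-segment page (CRC left at 0, not a valid file)."""
    assert len(body) < 255
    header = _HEADER.pack(0, header_type, 0, serial, 0, 0, 1)
    return b"OggS" + header + bytes([len(body)]) + body


# Three BOS pages followed by one ordinary data page.
stream = (make_page(10, 2) + make_page(20, 2) + make_page(30, 2)
          + make_page(20, 0))
track_index = dict(enumerate(bos_serials(stream)))
```

Here track_index maps track[0], track[1] and track[2] to the serial numbers of the corresponding logical bitstreams.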


Altitude

The Altitude (better name?) message header field defines the stack order of the tracks, i.e. which track is displayed further towards the top of the stack and which further down. By default, a "main" track is always displayed bottom-most unless otherwise defined.

The Altitude field takes the same numerical values as the z-index in CSS: integers, negative or positive, with no fixed limit. An element with greater stack order is always in front of an element with a lower stack order.

Example: Altitude: -150
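
A small sketch of how a player might derive a drawing order from the Altitude field. Treating a "main" track without an Altitude as bottom-most follows the default described above; giving other tracks without an Altitude the value 0 is purely an assumption of this example, as are the track dictionaries.

```python
def stack_order(tracks):
    """Return tracks sorted bottom-most first."""
    def altitude(track):
        if "Altitude" in track:
            return int(track["Altitude"])
        if track.get("Role", "").endswith("/main"):
            return float("-inf")  # main track defaults to the bottom
        return 0  # assumed default for all other tracks

    # sorted() is stable, so tracks with equal altitude keep their track order.
    return sorted(tracks, key=altitude)


tracks = [
    {"Name": "sign", "Role": "video/sign", "Altitude": "10"},
    {"Name": "main", "Role": "video/main"},
    {"Name": "alt", "Role": "video/alternate", "Altitude": "-150"},
]
order = [t["Name"] for t in stack_order(tracks)]
```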


Track dependencies

It is tempting to introduce dependencies between tracks - to specify things such as:

  • track b depends on track a being available (e.g. main audio depending on main video), so always display them together and if you remove a track, remove all depending tracks, too
  • track c and d are alternative tracks to track b (e.g. dubs in other languages for main audio), so don't display them together and if you activate one, disable the others
  • track a, one of b, c, d, and one of e, f, g, where e depends on b, f depends on c, and g depends on d, make up a presentation profile and should be displayed together (e.g. main video, one of the audio dubs, and their respective captions).

It is not clear yet whether there is an actual need to maintain this information as author-provided hints or whether a media player can itself determine a lot from the other fields, such as role and language.

MPEG has a "groupID" element which allows for tracks to be put into groups of alternative tracks. This feature is, however, not used very often and decisions are being left to the media player.

At this stage, it's probably too early to make a specification for how to encode this in Ogg. The need has not been totally clarified yet.