From XiphWiki

Revision as of 15:24, 22 March 2008 by Silvia



ROE (Rich Open multitrack media Exposition) is a way of describing the relationships between tracks of media in a stream. It is used to group tracks that serve a similar purpose and to identify alternatives.



One use of ROE is to author a multi-track audio-visual stream from multiple input files. In this document, we present a description of how to use ROE to author multi-track Ogg files.

Dynamic Web Requests

Another use of ROE is in a Web client-server scenario. The Web server uses ROE as a means of representing the different tracks that are available for a multi-track Web resource. A Web client may not require all available tracks to present the resource to the user. It may decide to request the ROE representation first and then request only a subset of tracks from the server, e.g. only the English soundtrack. Or it may directly request particular tracks only. The server will use the request from the client to dynamically compose a multi-track stream with the requested tracks and mandatory tracks and serve this to satisfy the resource request.
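The server-side selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `compose_stream`, the track names, and the idea that the server knows its mandatory tracks as a plain list are all hypothetical, since ROE itself does not prescribe how a server stores this information.

```python
def compose_stream(available, requested, mandatory):
    """Return the tracks to serve: requested plus mandatory, in stream order.

    `available` is the ordered list of track names the resource offers.
    All names here are hypothetical placeholders for this sketch.
    """
    unknown = set(requested) - set(available)
    if unknown:
        raise KeyError(f"requested tracks not available: {sorted(unknown)}")
    wanted = set(requested) | set(mandatory)
    # Keep the order of the original resource rather than the request order.
    return [track for track in available if track in wanted]


AVAILABLE = ["video", "audio-en", "audio-de", "captions-en"]
served = compose_stream(AVAILABLE, requested=["audio-en"], mandatory=["video"])
print(served)
```

Here a client that asks only for the English soundtrack still receives the video track, because the sketch treats it as mandatory.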

The ROE model

Here we describe two representations of ROE: that of ROE XML, and that of ROE in Ogg Skeleton. Each representation is capable of entirely encoding the relationships of the ROE model, such that it is possible to losslessly convert between them.


ROE XML is an XML markup language that describes a hierarchical serialization of the ROE model.

A ROE XML file is an instance document of the ROE XML schema.

It is composed of a <head> tag followed by a <body> tag.

Head Element

Head Tags

The <head> tag is optional and may contain:

  • a <title> tag to provide a textual description for the multi-track stream,
  • a set of <link> tags that provide alternative representations of the multi-track stream, e.g. as an HTML document,
  • an <img> tag to provide a representative thumbnail for the multi-track stream,
  • a set of <meta> tags that provide structured name-value annotations of the multi-track stream,
  • a <base> tag to provide a base URI for resources referred to in the ROE file, and
  • a set of <profile> tags that allows description of so-called track profiles.

The <title>, <link>, <meta>, and <base> tags are borrowed from XHTML and serve the same purpose as they do there.

Track Profiles

A track profile is a combination of tracks that is pre-defined within the ROE file and can be accessed by Web clients or authoring applications directly. Examples of such profiles are the Director's cut, or the Australian version.

A profile lists references to the tracks of a media resource and, for each track, may select one of that track's alternative media sources.

To that end, the <profile> element has a subelement called <partial>, which carries the ID of a selected track in its "track" attribute and, optionally, the ID of a selected alternate media source for that track in its "select" attribute.

An example profile is:

 <profile name="director's cut">
   <partial track="v" select="v1" />
   <partial track="a" />
 </profile>

The <head> tag essentially separates the profiles from the core document structure being provided in the <body> element.
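As a minimal sketch of how an authoring tool might read a track profile, the Python snippet below parses a <profile> element with the standard library and maps each referenced track ID to its selected media source. The function name `profile_selections` is hypothetical and not part of ROE. A <partial> without a "select" attribute is reported as `None`, which this sketch takes to mean that no particular alternative was chosen.

```python
import xml.etree.ElementTree as ET

PROFILE_XML = """
<profile name="director's cut">
  <partial track="v" select="v1" />
  <partial track="a" />
</profile>
"""


def profile_selections(profile_xml):
    """Map each track ID named in a profile to its selected source ID."""
    profile = ET.fromstring(profile_xml)
    # get() returns None when the optional "select" attribute is absent.
    return {p.get("track"): p.get("select") for p in profile.findall("partial")}


selections = profile_selections(PROFILE_XML)
print(selections)
```

For the profile above, track "v" is pinned to source "v1" while track "a" is left without an explicit selection.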

Body Element

The <body> tag consists of a sequence of <track> elements that each describe a logical media track.

The Track Tag

A media track may consist of one of:

  • a media source, such as an audio, video, or text stream described in a <mediaSource> tag,
  • a sequence of media sources described in a <seq> tag with start and end times, or
  • a set of alternate media sources described in a <switch> tag, only one of which can be selected.

The <track> element contains a mandatory "provides" attribute, which introduces a virtual label such as "commentary", "video", "audio", "textoverlay", "closedcaption", "logo", or "scoreboard". The track provides that kind of content.
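Because several tracks may carry the same "provides" label, a tool will often want an index from label to track IDs. The following Python sketch builds one from a <body> fragment. The sample tracks and the function name `tracks_by_label` are invented for illustration; only the element and attribute names come from the description above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical body fragment; the src values are left empty on purpose.
BODY_XML = """
<body>
  <track id="v" provides="video">
    <mediaSource id="v1" content-type="video/theora" src="" />
  </track>
  <track id="a" provides="audio">
    <mediaSource id="a1" content-type="audio/vorbis" src="" />
  </track>
  <track id="c" provides="audio">
    <mediaSource id="c1" content-type="audio/vorbis" src="" />
  </track>
</body>
"""


def tracks_by_label(body_xml):
    """Group track IDs by the label given in their "provides" attribute."""
    body = ET.fromstring(body_xml)
    index = defaultdict(list)
    for track in body.findall("track"):
        label = track.get("provides")
        if label is None:
            # "provides" is mandatory on <track>, so treat its absence as an error.
            raise ValueError(f"track {track.get('id')!r} lacks a provides attribute")
        index[label].append(track.get("id"))
    return dict(index)


label_index = tracks_by_label(BODY_XML)
print(label_index)
```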

The Switch Tag

The <switch> tag provides a choice between alternates, distinguished for a specific reason. The reason is given in the "distinction" attribute of the <switch> tag.

Inside a <switch> tag, the choices can be specified through the following means:

  • directly as a <mediaSource>,
  • as a sequence of media sources in a <seq> element, or
  • as the outcome of another <switch> tag.

Example <switch> element:

 <switch distinction="language" default="a3">
   <switch id="a1" distinction="bitrate" default="a1b1">
     <mediaSource id="a1b1" lang="en" content-type="audio/vorbis" src="" />
     <mediaSource id="a1b2" lang="en" content-type="audio/vorbis" src="" />
   </switch>
   <mediaSource id="a2" lang="de" content-type="audio/vorbis" src="" />
   <seq id="a3">
     <mediaSource id="a3a" lang="fr" content-type="audio/vorbis" src="" />
     <mediaSource id="a3b" lang="fr" content-type="audio/vorbis" src="" />
   </seq>
 </switch>

In this example, we have a choice between three languages: en, de and fr. The English language track also comes in two different bitrates. The French language track comes in two different files that should be played in sequence.
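One way to picture how a <switch> is resolved is the Python sketch below, which walks the example switch and returns the IDs of the media sources that a given choice leads to. The function names `choose` and `expand` are hypothetical, and falling back to the first child when a switch has no "default" attribute is an assumption of this sketch, not something the ROE description states.

```python
import xml.etree.ElementTree as ET

SWITCH_XML = """
<switch distinction="language" default="a3">
  <switch id="a1" distinction="bitrate" default="a1b1">
    <mediaSource id="a1b1" lang="en" content-type="audio/vorbis" src="" />
    <mediaSource id="a1b2" lang="en" content-type="audio/vorbis" src="" />
  </switch>
  <mediaSource id="a2" lang="de" content-type="audio/vorbis" src="" />
  <seq id="a3">
    <mediaSource id="a3a" lang="fr" content-type="audio/vorbis" src="" />
    <mediaSource id="a3b" lang="fr" content-type="audio/vorbis" src="" />
  </seq>
</switch>
"""


def expand(element):
    """Return the media source IDs that playing `element` involves."""
    if element.tag == "mediaSource":
        return [element.get("id")]
    if element.tag == "seq":
        # Every member of a sequence is played, in document order.
        return [sid for child in element for sid in expand(child)]
    if element.tag == "switch":
        return choose(element)
    raise ValueError(f"unexpected element <{element.tag}>")


def choose(switch, wanted=None):
    """Pick one alternative of a switch: `wanted`, else its default."""
    children = list(switch)
    target = wanted if wanted is not None else switch.get("default")
    if target is None:
        # Assumption of this sketch: no default means first alternative.
        return expand(children[0])
    for child in children:
        if child.get("id") == target:
            return expand(child)
    raise KeyError(f"no alternative with id {target!r}")


root = ET.fromstring(SWITCH_XML)
print(choose(root))
print(choose(root, "a1"))
print(choose(root, "a2"))
```

With no explicit choice, the default "a3" yields both French files in order. Asking for "a1" descends into the nested bitrate switch and yields its default, "a1b1".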

Inline XML files

Some media source elements are XML documents themselves. These can be represented inline in a ROE file. The purpose of this is to contain all the annotation information of a media resource inside one XML file.

An example inline XML file is the use of CMML inside a ROE track:

 <track id="t1" provides="caption">
   <mediaSource id="c" src="" inline="true" content-type="text/cmml" >
     <cmml role="caption" xmlns:cmml="">
       <cmml:title>random 1</cmml:title>
       <cmml:clip start="t1" end="t2">
         <html:p><html:span>rillian:</html:span>FOMS rocks</html:p>
       </cmml:clip>
     </cmml>
   </mediaSource>
 </track>

An example ROE XML file

Putting it all together, here is an example of a ROE XML file:

 <?xml version="1.0"?>
 <xs:schema targetNamespace=""
   <head>
     <link id="html_linkback" rel="alternate" type="text/html" href=""/>
     <img id="stream_thumb" src=""/>
     <title>Example video</title>
     <profile name="director's cut">
       <partial track="v" select="v1" />
       <partial track="a" />
     </profile>
   </head>
   <body>
     <track id="v" provides="video">
       <switch distinction="angle">
         <mediaSource id="v1" content-type="video/theora" src="" />
         <mediaSource id="v2" content-type="video/theora" src="" />
       </switch>
     </track>
     <track id="a" provides="audio">
       <switch distinction="Content-Language">
         <switch distinction="bitrate">
           <mediaSource id="a1b1" lang="en" content-type="audio/vorbis" src="" />
           <mediaSource id="a1b2" lang="en" content-type="audio/vorbis" src="" />
         </switch>
         <mediaSource id="a2" lang="de" content-type="audio/vorbis" src="" />
         <mediaSource id="a3" lang="fr" content-type="audio/vorbis" src="" />
         <mediaSource id="a4" lang="fr" content-type="audio/vorbis" src="" />
       </switch>
     </track>
     <track id="t" provides="text overlay">
       <switch distinction="Content-Language">
         <mediaSource id="t1" lang="en" content-type="text/cmml" src="" />
         <mediaSource id="t2" lang="de" content-type="text/cmml" src="" />
         <mediaSource id="t3" lang="fr" content-type="text/cmml" src="" />
       </switch>
     </track>
     <track id="l" provides="logo">
       <mediaSource id="O1" content-type="application/ogg" src="" />
       <mediaSource id="O2" content-type="application/ogg" src="" />
     </track>
   </body>

Representation in Skeleton

When the relationships described by ROE are written into an Ogg stream, they are encoded using the message header fields of Ogg Skeleton fisbones for each track. One of the primary design goals for fisbone headers is to minimize the need for global information to be stored in a stream. Each track's fisbone contains headers describing only itself and its relationship to other tracks in the stream. This allows tracks to be inserted or removed at the Ogg level without needing to modify any data in individual headers.


Relationships between tracks are given by the following headers:


Provides introduces a virtual label such as "commentary", which this track provides. Many tracks may provide the same such label, and as long as one is present then a dependency on that label can be satisfied.


Depends declares that it is not valid to include this track in a stream unless the track it depends on is present. An example use of this might be the generic captioning of sound effects for the deaf, which may not make sense unless the captioning of speech (in an appropriate language) is also rendered. Depends refers to either a virtual label provided by another track, or an explicit track ID.

When removing a track from a file, any other tracks dependent on it must also be removed.
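The removal rule can be sketched as a small fixed-point loop in Python. Everything named here is an assumption made for illustration: the `Track` class, the track IDs, and the "speech-captions" label are hypothetical, and the sketch models only Provides and Depends. A dependency counts as satisfied when some other remaining track has the named ID or provides the named label.

```python
from dataclasses import dataclass, field


@dataclass
class Track:
    track_id: str
    provides: set = field(default_factory=set)
    depends: set = field(default_factory=set)


def unsatisfied(tracks):
    """Return IDs of tracks with a dependency no other present track meets."""
    bad = []
    for t in tracks:
        others = [o for o in tracks if o is not t]
        available = {o.track_id for o in others}
        available |= {label for o in others for label in o.provides}
        if not t.depends <= available:
            bad.append(t.track_id)
    return bad


def remove_track(tracks, track_id):
    """Remove a track, then keep dropping tracks whose dependencies broke."""
    remaining = [t for t in tracks if t.track_id != track_id]
    while True:
        bad = set(unsatisfied(remaining))
        if not bad:
            return remaining
        remaining = [t for t in remaining if t.track_id not in bad]


tracks = [
    Track("speech_en", provides={"speech-captions"}),
    Track("speech_de", provides={"speech-captions"}),
    Track("sfx", depends={"speech-captions"}),
    Track("audio", provides={"audio"}),
]

step1 = remove_track(tracks, "speech_en")
step2 = remove_track(step1, "speech_de")
print([t.track_id for t in step1])
print([t.track_id for t in step2])
```

Removing the English speech captions leaves the sound-effect captions in place, because the German track still provides the same label. Removing the German track as well takes the sound-effect captions with it.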


Recommends refers to either a virtual label provided by another track, or an explicit track ID.


Suggests refers to either a virtual label provided by another track, or an explicit track ID.


Conflicts refers to either a virtual label provided by another track, or an explicit track ID.
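To make the header-based representation more concrete, here is a rough Python sketch that parses HTTP-style "Name: value" lines into a dictionary. The exact wire syntax of the ROE headers is not spelled out above, so the CRLF-separated sample, the comma-separated value lists, and the function name `parse_message_headers` are all assumptions of this sketch.

```python
def parse_message_headers(raw):
    """Parse HTTP-style "Name: value" lines into {name: [values]}.

    Assumes UTF-8 text, one header per line, and comma-separated values.
    Header names are lower-cased so lookups are case-insensitive.
    """
    headers = {}
    for line in raw.decode("utf-8").splitlines():
        if not line.strip():
            continue
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"malformed header line: {line!r}")
        values = [v.strip() for v in value.split(",") if v.strip()]
        headers.setdefault(name.strip().lower(), []).extend(values)
    return headers


# Hypothetical header block for a sound-effects caption track.
RAW = b"Provides: sfx-captions\r\nDepends: speech-captions, audio\r\n"
parsed = parse_message_headers(RAW)
print(parsed)
```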

Serving Suggestions


HTTP-style message headers for client-server negotiation
