The following is a draft!

It is at best incomplete and at worst completely broken. In any case, it is not an “official” Xiph spec or codec, so use with care.

Introduction

This page specifies a subclass of HTML documents that is a time-aligned text format for audio-visual content. We call the format "timed divs within HTML" or TDHT. It is intended to be used only in a World Wide Web context i.e. everywhere that Web browser functionality is available. Use cases for the format are subtitles, captions, annotations and other time aligned text as listed at http://wiki.xiph.org/index.php/OggText#Categories_of_Text_Codecs .

TDHT may be similar to W3C TimedText DFXP in many respects, but in comparison to DFXP it does not re-invent HTML, CSS and effects, but rather uses existing HTML, CSS and javascript for these. The purpose of DFXP is to create a web-independent exchange format for timed text, which is why it cannot directly be specified as a subpart of HTML.

TDHT in contrast is HTML with a minimum number of changes. TDHT is parsable by any HTML parser. It works with CSS and javascript. No new functionality has to be defined for TDHT.

File Extension

Files in this format are to be of text/html mime type since they are valid html files, apart from some extra attributes.

Files in this format should have a file extension of .tdht to separate them from plain html files.

The TDHT format changes from HTML

TDHT files are time-aligned text. This means there is a time association with blocks of text and there is time-based seeking functionality on those blocks of text.

Here is an example TDHT file for subtitles:

<html>
  <head>
    <title>Desperate Housewives - Season 5, Episode 6</title>
  </head>
  <body>
    <div start="00:00:00.070" end="00:00:02.270">
      <p>Previously on...</p>
    </div>
    <div start="00:00:02.280" end="00:00:04.270">
      <p>We had an agreement to keep things casual.</p>
    </div>
    <div start="00:00:04.280" end="00:00:06.660">
      <p>Susan made her feelings clear.</p>
    </div>
    <div start="00:00:06.800" end="00:00:10.100">
      <p>So if I was with another woman, that wouldn't bother you? No, it wouldn't.</p>
    </div>
  </body>
</html>

The differences of TDHT from HTML are described using HTML4.01, but the changes apply the same to HTML5, which doesn't have a normative schema.

The following changes to HTML are made for TDHT:

1. The body element

In HTML4.01, the body element is defined as follows:

<!ELEMENT BODY O O (%block;|SCRIPT)+ +(INS|DEL) -- document body -->
<!ATTLIST BODY
  %attrs;                              -- %coreattrs, %i18n, %events --
  onload          %Script;   #IMPLIED  -- the document has been loaded --
  onunload        %Script;   #IMPLIED  -- the document has been removed --
  >

In TDHT1.0 we restrict body to just contain a sequence of div tags:

<!ELEMENT BODY O O (DIV)+ -- document body -->
<!ATTLIST BODY
  %attrs;                              -- %coreattrs, %i18n, %events --
  onload          %Script;   #IMPLIED  -- the document has been loaded --
  onunload        %Script;   #IMPLIED  -- the document has been removed --
  >

Any tags inside the body tag that are non-conformant to this specification (such as regular html tags that are allowed inside body) must be ignored for TDHT.

The div tags, however, can contain anything that HTML div tags can contain, thus enabling a very flexible, but time-aligned text model.

2. The div element

In HTML, the div element is defined as follows:

<!ELEMENT DIV - - (%flow;)*            -- generic language/style container -->
<!ATTLIST DIV
  %attrs;                              -- %coreattrs, %i18n, %events --
  >

In TDHT1.0 we extend it with start and end time attributes:

<!ELEMENT DIV - - (%flow;)*            -- generic language/style container -->
<!ATTLIST DIV
  %attrs;                              -- %coreattrs, %i18n, %events --
  start           %Time;    #IMPLIED   -- start time
  end             %Time;    #IMPLIED   -- end time
  >

The Time entity represents a valid time string accroding to HTML5: http://www.whatwg.org/specs/web-apps/current-work/#valid-time-string . The end time string must be larger than the start time string, otherwise the div element does not exist for any duration and can never turn active.

<div> elements in a TDHT file should be ordered by start time to simplify parsing. Inside Ogg or when rendered, they will be ordered by start time.

Rendering in a Web Browser

A TDHT file is meant to be associated with a audio or video file and rendered in a Web browser in sync with the audio or video file.

The TDHT file's div elements are not rendered into an existing HTML page, but rather a TDHT file creates its own iframe-like new nested browsing context. It is linked to the parent HTML page through an itext element that is inserted as a child of the video element. Creation of a nested browsing context is important because a TDHT file can come from a different URI than the Web page and thus for security reasons and for general base URI computations a nested browsing context is the better approach with the DOM nodes of the hosting page and the DOM nodes of the TDHT document in different owner documents. That way, the hosting document has the security origin of its own URL and the TDHT document has the security origin of its URL.

The rendering and CSS view port are either by default the rectangle occupied by the given <video> or <audio> tag, or an area provided for by the hosting HTML page through the itext element's properties. The zoom factor of the iframe must be set to such a value that the width of the view port established by the itext frame is equally wide in CSS px as the video frame is wide in codec pixels. (Example: If the video encodes a frame that is 240 pixels wide but is displayed at 480 CSS px wide, the zoom factor of the itext frame should be 200% so that the box that on the outsize measures 480 px seems like a box of 240 px from within the itext frame.)

The itext frame is by default transparent.

A TDHT file can get to a browser either as a external resource, or as part of audio or video resource (in particular inside Ogg, see below). Parsing in these two cases is slightly different for the browser.

For the external TDHT file case: The TDHT file is parsed using the HTML5 parsing algorithm in its normal mode into a non-rendered DOM. To render a div, the children of the div would be cloned into the body of the rendering shell document (replacing possible previous children of body).

For the Ogg-internal TDHT case: To multiplex an external TDHT file into Ogg, each div with its innerHTML would be placed into a data packet and the head data in to an Ogg header. For decoding, the rendering shell document is set up and the head tag is included from the Ogg headers. To render a packet, the div and its innerHTML would be added to the innerHTML of the body element of the rendering shell document as they come. This will use the HTML fragment parser.

As the browser plays the video, it must render the TDHT <div> tags in sync. As the start time of a <div> tag is reached, the <div> tag is made activate, and it is made inactive as the <div> tag's end time is reached. If no start time is given, the start is assumed to be 0, and if no end time is given, end is assumed to be infinity.

An "active" <div> tag may be a <div> tag that is being displayed ("display: block") in contrast to an "inactive" <div> tag, which may not be displayed ("display: none"). For some text formats however the difference between "active" and "inactive" may be a background colour or the display location on screen or some other mechanism. The default should be between "block" and "none", but changeable through CSS.

As the browser has parsed the TDHT file or its consitutent <div> tags, it is expected to keep the structure in memory. When seeking happens on the video, it can then decide upon which <div> tags are supposed to be active at the seek time and display these. [There is a discussion to be had here about the effect this has on the DOM. Different selectors may apply to a caption depending on whether the video was played back all the way there or seeking skipped over data to get there. It was suggested that inactive captions should be removed from the DOM, so there's always a well-defined small unambiguous DOM to match selectors against. However, this may for example not be desirable on some text display formats.]

Encapsulation into Ogg

The OggText specification is used to encapsulate a TDHT file into Ogg.

The codec-specific header data for the OggText ident header is the <head>..</head> part of the TDHT file. The complete <head> tag including all its subtags is encoded into the ident header unchanged.

The <div> elements with all their inner HTML are the data packets of the TDHT text codec and are thus encapsulated into the data packets as text codec data. A complete <div> including all its subtags is encoded into one data packet each.

Direct linking on a HTML5 page

Often, subtitles and other time-aligned text files are not actually provided inside a video stream (e.g. inside Ogg), but are referenced as a separate partner resource to a video.

To allow association of such files with a <video> or <audio> element, we propose the following approach:

<video src="http://example.com/video.ogv" controls>
  <itext category="CC" lang="en" src="caption.srt" style=""></itext>
  <itext category="CC" lang="de" src="caption.tdht" style=""></itext>
  <itext category="SUB" lang="de" src="german.dfxp" style=""></itext>
  <itext category="SUB" lang="jp" src="japanese.smil" style="></itext>
  <itext category="SUB" lang="fr" src="translation_webservice/fr/caption.srt" style=""></itext>
</video>

Notice the second set of closed captions being a TDHT file.

The <itext> element would act like an <iframe> element and create the nested browsing context described earlier. It has been renamed from earlier mentions of this approach from <text> to <itext> to avoid name clashes with e.g. SVG.

Timed Divs HTML

Contents