The following is a draft!

It is at best incomplete and at worst completely broken. In any case, it is not an “official” Xiph spec or codec, so use with care.

Introduction

This page specifies a subclass of HTML documents that serves as a time-aligned text format for audio-visual content. We call the format "timed divs within HTML", or TDHT. It is intended to be used only in a World Wide Web context, i.e. wherever Web browser functionality is available. Use cases for the format are subtitles, captions, annotations and other time-aligned text, as listed at http://wiki.xiph.org/index.php/OggText#Categories_of_Text_Codecs .

TDHT may be similar to W3C Timed Text DFXP in many respects, but unlike DFXP it does not re-invent HTML, CSS and effects; instead, it uses existing HTML, CSS and JavaScript for them. DFXP is designed as a Web-independent exchange format for timed text, which is why it cannot be specified directly as part of HTML.

TDHT, in contrast, is HTML with a minimum number of changes. Any HTML parser can parse TDHT, and it works with CSS and JavaScript. No new functionality has to be defined for TDHT.

File Extension

Files in this format are to be served with the MIME type text/x-tdht.

Files in this format should use the file extension .tdht.
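Because text/x-tdht is not a registered type, local tooling usually has to be told about it. As one illustration, Python's standard mimetypes module can map the extension to the type; this only affects code that consults that module, and the file name below is just an example.

```python
import mimetypes

# Map the .tdht extension to the (unregistered) TDHT MIME type.
mimetypes.add_type("text/x-tdht", ".tdht")

# guess_type() now reports the TDHT type for that extension.
tdht_type, tdht_encoding = mimetypes.guess_type("caption.tdht")
# tdht_type == "text/x-tdht", tdht_encoding is None
```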

Changes to HTML in the TDHT format

TDHT files are time-aligned text: each block of text is associated with a time interval, and time-based seeking is possible over those blocks.

Here is an example TDHT file for subtitles:

<html>
  <head>
    <title>Desperate Housewives - Season 5, Episode 6</title>
  </head>
  <body>
    <div start="00:00:00.070" end="00:00:02.270">
      <p>Previously on...</p>
    </div>
    <div start="00:00:02.280" end="00:00:04.270">
      <p>We had an agreement to keep things casual.</p>
    </div>
    <div start="00:00:04.280" end="00:00:06.660">
      <p>Susan made her feelings clear.</p>
    </div>
    <div start="00:00:06.800" end="00:00:10.100">
      <p>So if I was with another woman, that wouldn't bother you? No, it wouldn't.</p>
    </div>
  </body>
</html>
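To illustrate that an ordinary HTML parser can read such a file, here is a minimal sketch using Python's standard html.parser module. It collects the start and end attributes and the whitespace-normalised text of each top-level <div>; nested divs remain part of their enclosing cue. The class name TdhtCueParser is hypothetical, and timestamps are kept as raw strings.

```python
from html.parser import HTMLParser


class TdhtCueParser(HTMLParser):
    """Collects start/end attributes and plain text of each top-level <div>."""

    def __init__(self):
        super().__init__()
        self.cues = []    # one dict per top-level div: start, end, text
        self._depth = 0   # current <div> nesting depth
        self._text = []   # text fragments of the current top-level div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth == 0:
            a = dict(attrs)
            self.cues.append({"start": a.get("start"), "end": a.get("end"), "text": ""})
            self._text = []
        self._depth += 1

    def handle_endtag(self, tag):
        if tag != "div" or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace, as a renderer would for normal flow text.
            self.cues[-1]["text"] = " ".join("".join(self._text).split())

    def handle_data(self, data):
        if self._depth > 0:   # text outside any div is ignored
            self._text.append(data)


parser = TdhtCueParser()
parser.feed('<body><div start="00:00:00.070" end="00:00:02.270"><p>Previously on...</p></div></body>')
parser.close()
# parser.cues == [{"start": "00:00:00.070", "end": "00:00:02.270", "text": "Previously on..."}]
```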

Right now, TDHT is based on HTML4.01, but it should also be possible to base it on HTML5, which is still in flux.

The following changes to HTML4.01 are made for TDHT:

1. The body element

In HTML4.01, the body element is defined as follows:

<!ELEMENT BODY O O (%block;|SCRIPT)+ +(INS|DEL) -- document body -->
<!ATTLIST BODY
  %attrs;                              -- %coreattrs, %i18n, %events --
  onload          %Script;   #IMPLIED  -- the document has been loaded --
  onunload        %Script;   #IMPLIED  -- the document has been removed --
  >

In TDHT1.0 we restrict it to contain only a sequence of div elements:

<!ELEMENT BODY O O (DIV)+ -- document body -->
<!ATTLIST BODY
  %attrs;                              -- %coreattrs, %i18n, %events --
  onload          %Script;   #IMPLIED  -- the document has been loaded --
  onunload        %Script;   #IMPLIED  -- the document has been removed --
  >

Any elements inside the body that do not conform to this specification (such as other HTML elements that HTML4.01 allows inside body) should be ignored when processing TDHT.

The div elements, however, can contain anything that HTML div elements can contain, which enables a very flexible yet time-aligned text model.
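The "ignore non-conformant children" rule can be sketched with Python's xml.dom.minidom as a stand-in DOM (an assumption: minidom needs well-formed XML, which real HTML need not be). The hypothetical helper tdht_body_divs keeps only the <div> children of <body> and silently skips everything else, while leaving each div's own content, including nested divs, untouched.

```python
from xml.dom.minidom import parseString


def tdht_body_divs(body):
    """Return the <div> children of <body>; other children are ignored, not rejected."""
    return [
        node for node in body.childNodes
        if node.nodeType == node.ELEMENT_NODE and node.tagName.lower() == "div"
    ]


doc = parseString(
    "<html><body>"
    '<div start="00:00:01"><p>kept</p></div>'
    "<p>stray paragraph, ignored</p>"
    '<div start="00:00:03"><p>also kept</p></div>'
    "</body></html>"
)
body = doc.getElementsByTagName("body")[0]
starts = [d.getAttribute("start") for d in tdht_body_divs(body)]
# starts == ["00:00:01", "00:00:03"]
```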

2. The div element

In HTML4.01, the div element is defined as follows:

<!ELEMENT DIV - - (%flow;)*            -- generic language/style container -->
<!ATTLIST DIV
  %attrs;                              -- %coreattrs, %i18n, %events --
  >

In TDHT1.0 we extend it with start and end time attributes:

<!ELEMENT DIV - - (%flow;)*            -- generic language/style container -->
<!ATTLIST DIV
  %attrs;                              -- %coreattrs, %i18n, %events --
  start           %Time;     #IMPLIED  -- start time --
  end             %Time;     #IMPLIED  -- end time --
  >

The Time type is not part of HTML4.01; its values use the syntax of a valid time string as defined in HTML5: http://www.whatwg.org/specs/web-apps/current-work/#valid-time-string .
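Start and end values can be checked and converted with a small parser. The Python sketch below follows HTML5's valid time string syntax (two-digit hours and minutes, optional two-digit seconds, and an optional fraction of one to three digits after a period) and returns integer milliseconds. The function name parse_valid_time is hypothetical, and SRT-style values with a comma before the milliseconds are rejected. Because hours are limited to 00 through 23, this syntax cannot address points beyond 24 hours.

```python
import re

# HH:MM, optionally followed by :SS and a .fraction of 1 to 3 digits.
_VALID_TIME = re.compile(r"([0-9]{2}):([0-9]{2})(?::([0-9]{2})(?:\.([0-9]{1,3}))?)?")


def parse_valid_time(value: str) -> int:
    """Convert a valid time string to milliseconds, raising ValueError otherwise."""
    m = _VALID_TIME.fullmatch(value)
    if m is None:
        raise ValueError(f"not a valid time string: {value!r}")
    hours, minutes = int(m.group(1)), int(m.group(2))
    seconds = int(m.group(3) or 0)
    # ".07" means 70 ms, so right-pad the fraction to three digits.
    millis = int((m.group(4) or "0").ljust(3, "0"))
    if hours > 23 or minutes > 59 or seconds > 59:
        raise ValueError(f"time component out of range: {value!r}")
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis


# parse_valid_time("00:00:02.270") == 2270
```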

Rendering in a Web Browser

A TDHT file is meant to be associated with an audio or video file and rendered in a Web browser in sync with it.

The TDHT file's div elements are not rendered into an existing HTML page; instead, a TDHT file creates its own iframe-like nested browsing context. A TDHT file can come from a different URI than the Web page, so for security reasons and for base URI computation a nested browsing context is the better approach: the DOM nodes of the hosting page and those of the TDHT document then belong to different owner documents. That way, the hosting document has the security origin of its own URL and the TDHT document has the security origin of its URL.

By default, the rendering area and CSS viewport are the rectangle occupied by the given <video> or <audio> element; alternatively, they can be an area provided by the hosting HTML page. The zoom factor of the iframe must be set so that the viewport established by the iframe is as wide in CSS px as the video frame is wide in codec pixels. (Example: if the video encodes frames that are 240 pixels wide but displays them 480 CSS px wide, the zoom factor of the iframe should be 200%, so that the box that measures 480 px on the outside appears as a 240 px box from within the iframe.)
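The zoom rule is a single ratio: displayed width in CSS px divided by frame width in codec pixels. A minimal sketch with hypothetical function names, reproducing the 240 to 480 example:

```python
def iframe_zoom(displayed_width_css_px: float, frame_width_codec_px: int) -> float:
    """Zoom factor (1.0 == 100%) that makes the iframe viewport as many CSS px
    wide as the video frame is wide in codec pixels."""
    if frame_width_codec_px <= 0:
        raise ValueError("frame width must be positive")
    return displayed_width_css_px / frame_width_codec_px


def inner_viewport_width(displayed_width_css_px: float, zoom: float) -> float:
    """Viewport width as seen from inside the zoomed iframe, in CSS px."""
    return displayed_width_css_px / zoom


zoom = iframe_zoom(480, 240)             # 2.0, i.e. 200%
inner = inner_viewport_width(480, zoom)  # 240.0, matching the codec width
```

A frame shown smaller than its coded size yields a zoom below 100%, for example 0.5 for a 640-pixel frame displayed at 320 CSS px.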

When the browser encounters a TDHT file, it must create a document by calling createDocument() on DOMImplementation and then calling open() on the created document. The browser must insert an <html> element in the HTML namespace as the root element of the document, and insert <head> and <body> elements in the HTML namespace into the root element.
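The document-construction steps can be mirrored outside a browser with Python's xml.dom.minidom, whose DOMImplementation provides createDocument(). This is only an analogy: minidom has no counterpart to the HTML DOM's document.open(), so that step is omitted, and the function name create_tdht_shell is hypothetical.

```python
from xml.dom.minidom import getDOMImplementation

HTML_NS = "http://www.w3.org/1999/xhtml"


def create_tdht_shell():
    """Create a document with an <html> root holding <head> and <body>,
    all in the HTML namespace."""
    impl = getDOMImplementation()
    doc = impl.createDocument(HTML_NS, "html", None)  # root <html> element
    root = doc.documentElement
    root.appendChild(doc.createElementNS(HTML_NS, "head"))
    root.appendChild(doc.createElementNS(HTML_NS, "body"))
    return doc


shell_doc = create_tdht_shell()
```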

A TDHT file can be received by an HTML parser in one go, as a complete TDHT file. Alternatively, a div-less TDHT file consisting of just its <head> part can be received first to create an HTML shell, into which the <div> elements are added as they arrive (e.g. from a video file that is decoded and played back in parallel) by using an HTML fragment parser.

The <head> part must be decoded into a DOMString (using U+FFFD REPLACEMENT CHARACTER for decoding errors), and the TDHT DOM property of the <head> element must be set to that DOMString.
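The incremental path can be sketched as follows. Python's html.parser stands in for a real HTML fragment parser here (it is not the HTML5 fragment parsing algorithm), and string concatenation stands in for DOM insertion; the names TdhtShell and add_div are hypothetical, and add_div only checks that a fragment contains some <div>.

```python
from html.parser import HTMLParser


class _FirstDivAttrs(HTMLParser):
    """Tiny fragment parser: remembers the attributes of the first <div>."""

    def __init__(self):
        super().__init__()
        self.div_attrs = None

    def handle_starttag(self, tag, attrs):
        if tag == "div" and self.div_attrs is None:
            self.div_attrs = dict(attrs)


class TdhtShell:
    """Shell built from the <head> part; <div> fragments are appended in
    arrival order (e.g. one per decoded packet)."""

    def __init__(self, head_html: str):
        self.head_html = head_html
        self.divs = []  # (attributes, raw fragment) pairs

    def add_div(self, fragment: str) -> dict:
        parser = _FirstDivAttrs()
        parser.feed(fragment)
        parser.close()
        if parser.div_attrs is None:
            raise ValueError("fragment contains no <div> element")
        self.divs.append((parser.div_attrs, fragment))
        return parser.div_attrs

    def to_html(self) -> str:
        body = "".join(fragment for _, fragment in self.divs)
        return f"<html>{self.head_html}<body>{body}</body></html>"


shell = TdhtShell("<head><title>t</title></head>")
shell.add_div('<div start="00:00:01.000"><p>Hi</p></div>')
```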
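The decoding step amounts to a lenient decode. The specification does not name a character encoding, so this sketch assumes UTF-8; Python's errors="replace" substitutes U+FFFD for each undecodable sequence. The helper name decode_head is hypothetical.

```python
def decode_head(head_bytes: bytes, encoding: str = "utf-8") -> str:
    """Decode the raw <head>...</head> bytes (UTF-8 assumed by default),
    substituting U+FFFD REPLACEMENT CHARACTER for invalid sequences."""
    return head_bytes.decode(encoding, errors="replace")


# b"\xc3\xa9" is a valid UTF-8 "é"; b"\xff" is never valid UTF-8.
head_text = decode_head(b"<head><title>Caf\xc3\xa9 \xff</title></head>")
```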

As the browser plays the video, it must render the TDHT <div> elements in sync: when a <div>'s start time is reached, the <div> appears, and when its end time is reached, it is removed. If no start time is given, the start is assumed to be 0; if no end time is given, the end is assumed to be infinity.
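This timing rule can be written as a predicate. In this sketch, times are integer milliseconds (an assumption, as are the function names), a missing start or end is passed as None, and the end time is treated as exclusive because the div is removed as soon as its end time is reached.

```python
import math
from typing import Optional, Tuple


def effective_interval(start_ms: Optional[int], end_ms: Optional[int]) -> Tuple[float, float]:
    """Apply the defaults: no start means 0, no end means infinity."""
    start = 0 if start_ms is None else start_ms
    end = math.inf if end_ms is None else end_ms
    return start, end


def is_active(start_ms: Optional[int], end_ms: Optional[int], t_ms: int) -> bool:
    """True from the start time (inclusive) until the end time (exclusive)."""
    start, end = effective_interval(start_ms, end_ms)
    return start <= t_ms < end
```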

As the browser parses the TDHT file or its constituent <div> elements, it is expected to keep the structure in memory. When seeking happens on the video, it can then determine which <div> elements are supposed to be active at the seek time and display these. [There is a discussion to be had here about the effect this has on the DOM. Different selectors may apply to a caption depending on whether the video was played back all the way to that point or seeking skipped over data to get there. It was suggested that inactive captions should be removed from the DOM, so that there is always a well-defined, small, unambiguous DOM to match selectors against. However, this may not be desirable for some text display formats.]
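With the structure in memory, a seek becomes a query. The sketch below (the class CueIndex and its (start_ms, end_ms, payload) tuples are hypothetical) sorts divs by start time, uses bisect to skip every div that starts after the seek time, and keeps those whose end time has not yet passed. The result depends only on the seek time, not on whether playback ran there or jumped there. The final filter is linear in the number of divs already started; an interval tree would avoid that if it ever mattered.

```python
import bisect
import math


class CueIndex:
    """In-memory index answering "which divs are active at time t?"."""

    def __init__(self, cues):
        # cues: iterable of (start_ms, end_ms, payload); None applies the defaults.
        normalised = [
            (0 if start is None else start, math.inf if end is None else end, payload)
            for start, end, payload in cues
        ]
        normalised.sort(key=lambda cue: cue[0])
        self._cues = normalised
        self._starts = [cue[0] for cue in normalised]

    def active_at(self, t_ms):
        # Only divs with start <= t can be active; keep those not yet ended.
        started = bisect.bisect_right(self._starts, t_ms)
        return [payload for _, end, payload in self._cues[:started] if t_ms < end]


index = CueIndex([
    (70, 2270, "Previously on..."),
    (2280, 4270, "We had an agreement to keep things casual."),
    (None, None, "(always shown)"),
])
```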

An "active" <div> element is typically one that is being displayed ("display: block"), in contrast to an "inactive" <div> element, which is not displayed ("display: none"). For some text formats, however, the difference between "active" and "inactive" may be a background colour, the display location on screen or some other mechanism. The default should be to switch between "display: block" and "display: none", but this should be changeable through CSS.
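One way to keep the active/inactive presentation changeable through CSS is to toggle class names rather than inline styles, with a default stylesheet supplying block/none. The sketch uses minidom as a stand-in DOM; the class names tdht-active and tdht-inactive, the DEFAULT_CSS text and the is_active callback are all hypothetical. Existing classes on a div are preserved.

```python
from xml.dom.minidom import parseString

# Hypothetical default stylesheet; page CSS may override either rule.
DEFAULT_CSS = """
div.tdht-active   { display: block; }
div.tdht-inactive { display: none; }
"""

STATE_CLASSES = ("tdht-active", "tdht-inactive")


def mark_activity(body, t_ms, is_active):
    """Tag each top-level <div> of body with its state at time t_ms."""
    for node in body.childNodes:
        if node.nodeType != node.ELEMENT_NODE or node.tagName.lower() != "div":
            continue
        state = STATE_CLASSES[0] if is_active(node, t_ms) else STATE_CLASSES[1]
        # Drop any previous state class, keep author classes, add the new state.
        classes = [c for c in node.getAttribute("class").split() if c not in STATE_CLASSES]
        classes.append(state)
        node.setAttribute("class", " ".join(classes))


doc = parseString('<html><body><div class="cue" start="1000"/><div start="5000"/></body></html>')
body = doc.getElementsByTagName("body")[0]
mark_activity(body, 3000, lambda div, t: int(div.getAttribute("start")) <= t)
```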

Encapsulation into Ogg

The OggText specification is used to encapsulate a TDHT file into Ogg.

The codec-specific header data for the OggText ident header is the <head>...</head> part of the TDHT file. The complete <head> element, including all its child elements, is encoded unchanged into the ident header.

The <div> elements are the data packets of the TDHT text codec and are thus encapsulated into the data packets as text codec data. Each complete <div>, including all its child elements, is encoded into its own data packet.
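The split into ident header and data packets can be sketched as below. It locates the <head> element and then tracks <div> nesting depth so that each top-level div, nested divs included, becomes one packet. This is a regex-based sketch with a hypothetical function name: it does not handle ">" inside attribute values, comments, or script content, and it leaves out the Ogg page framing and OggText header fields themselves.

```python
import re

# Opening or closing div tag; group 1 is "/" for a closing tag.
_DIV_TAG = re.compile(r"<(/?)div\b[^>]*>", re.IGNORECASE)
# The complete <head>...</head> element.
_HEAD = re.compile(r"<head\b.*?</head\s*>", re.IGNORECASE | re.DOTALL)


def split_for_ogg(tdht: str):
    """Return (ident_header_text, [data_packet_text, ...]) for a TDHT document."""
    head = _HEAD.search(tdht)
    if head is None:
        raise ValueError("no <head> element found")
    packets, depth, start = [], 0, 0
    for tag in _DIV_TAG.finditer(tdht, head.end()):
        if tag.group(1) == "":        # opening <div ...>
            if depth == 0:
                start = tag.start()   # a new top-level div begins here
            depth += 1
        elif depth > 0:               # closing </div>
            depth -= 1
            if depth == 0:
                packets.append(tdht[start:tag.end()])  # one packet per top-level div
    return head.group(0), packets
```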

Direct linking on a HTML5 page

Often, subtitles and other time-aligned text files are not actually provided inside a video stream (e.g. inside Ogg), but are referenced as a separate partner resource to a video.

To allow association of such files with a <video> or <audio> element, we propose the following approach:

<video src="http://example.com/video.ogv" controls>
  <text category="CC" lang="en" src="caption.srt" style=""></text>
  <text category="CC" lang="de" src="caption.tdht" style=""></text>
  <text category="SUB" lang="de" src="german.dfxp" style=""></text>
  <text category="SUB" lang="ja" src="japanese.smil" style=""></text>
  <text category="SUB" lang="fr" src="translation_webservice/fr/caption.srt" style=""></text>
</video>

Note that the second closed-caption track (lang="de") is a TDHT file.

The <text> element would act like an <iframe> element and create the nested browsing context described earlier.
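For illustration, a user agent or a test tool could enumerate the proposed <text> children with any HTML parser. In the sketch below, TextTrackLister uses Python's html.parser; the selection policy in pick_track (first exact match on category and lang) is a hypothetical placeholder, since this proposal does not define how a track is chosen.

```python
from html.parser import HTMLParser


class TextTrackLister(HTMLParser):
    """Collects the attributes of <text> elements inside <video>/<audio>."""

    def __init__(self):
        super().__init__()
        self.tracks = []
        self._media_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("video", "audio"):
            self._media_depth += 1
        elif tag == "text" and self._media_depth > 0:
            self.tracks.append(dict(attrs))

    def handle_endtag(self, tag):
        if tag in ("video", "audio") and self._media_depth > 0:
            self._media_depth -= 1


def pick_track(tracks, category, lang):
    """Hypothetical policy: first track whose category and lang match exactly."""
    for track in tracks:
        if track.get("category") == category and track.get("lang") == lang:
            return track
    return None


lister = TextTrackLister()
lister.feed('<video src="http://example.com/video.ogv" controls>'
            '<text category="CC" lang="en" src="caption.srt" style=""></text>'
            '<text category="CC" lang="de" src="caption.tdht" style=""></text>'
            '<text category="SUB" lang="de" src="german.dfxp" style=""></text>'
            '</video>')
lister.close()
```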
