TransOgg Seeking Proposals

Three proposals for seeking structures/methods in transOgg

Seeking within a container is an example of 'we can't have everything, so we have to choose what we really want'. We consider the following design requirements wants and needs:

 1) Efficient seeking with minimum latency.  Minimum overall latency
 can depend heavily on the scenario, where seeking within a file on a
 hard disk, over http on a high bandwidth link/high latency and over
 http on a low bandwidth/low latency link will all have different
 optimal design requires.  Thus the design will have to shoot for
 good behavior across the lot.

 2) Low or no additional overhead.  Because, hey, wasted space is
 wasted space.

 3) Seeking must be available within Ogg's specific streaming design
 philosophy; an encoder must be able to write a seekable file in one
 pass, and a demuxer must be able to seek in any valid Ogg file.

 4) Muxing must not require repaginating existing data (the 'deck of
 cards' metaphor)

 5) A design that the designers can actually document completely and
 that implementors are actually willing and able to implement.

Seeking in original Ogg treated 2, 3, 4 as paramount, 1 as secondary, and was a disaster due to 5. transOgg will need to strike a different balance. I submit three seeking mechanism proposals for consideration.

Proposal 1: The original Ogg bisection search only

 We stick with the original seeking algorithm as intended in the
 original Ogg bitstream and correct the corner cases that made it a
 serious pain.  There is no index.  Seeking is done via bisection
 search:

 1) Make an initial guess as to the physical position of the desired
    seekpoint based on known stream duration and simple linear
    interpolation.

 2) Seek and capture a page

 3) Construct a new interpolated best guess as to the desired seek
    location based on DTS of captured page

 4) repeat steps 2 and 3 until we have seen contiguous pages, the
    first with a time stamp immediately preceeding the desired point
    in time and the second with a time stamp equal-to-or-following
    the desired point.

 5.a) If none of the codec streams being decoded use backreferences,
    we're finished.

 5.b) In the event of codec streams with backreferences (such as is
    typical in video), the pages for the codec stream also include
    the DTS of the earliest required reference.  Read forward
    until we've seen one page from each decoded stream using
    backreferences; save the earliest backreference DTS.

 6) Repeat the bisection search in 2-4 using the backreference
    DTS as the desired final seek location.

 This algorithm appears more complex than it is; bisection searching
 is a well understood technique, and the average bisection search in
 a 1GB ogg file (of mixed Vorbis audio and Theora video) required
 approximately 3.5 seeks for the initial seek and just over one more
 seek for the second (as the first search can inform the second quite
 well).

 This original Ogg seeking method was unpopular for several reasons:

 1) The DTS and backreference encoding was defined by-codec,
    and thus decoding the timestamps required additional
    infrastructure that most frameworks had to code from scratch.
    Several frameworks never implemented precise seeking for this
    reason alone.

 2) Corner cases introduced by specific design decisions (eg, the
    presence of pages with no timestamps) complicated the actual
    implementation.  In addition, correct behavior for these corner
    cases was not well documented or not documented at all.

 3) Xiph never implemented its own all-encompassing framework to
    provide an example of complete seeking that worked in any Ogg
    file.  In general, we hardwired our own apps for seeking in
    either Vorbis alone or Vorbis+Theora only.

 4) Many frameworks that attempted to implement proper bisection
    seeking had bugs that went unnoticed or unfixed (owing to all of
    the above). These bugs could increase the number of seeks in a
    bisection search by a factor of 10X-100X. This gave bisection
    seeking a reputation for inherent poor performace.  Even
    Vorbisfile has suffered from a number of bugs of exactly this
    sort at several points.

 5) Stream structure discovery was also based on performing multiple
    bisection searches to find link boundaries. Given all the points
    above, and the fact that stream structure decisions added
    additional corner cases to implementations, structure discovery
    also tended to be buggy and slow.

 The lesson of the above is that it is not enough to have a
 conceptually simple design.  The actual implementation complexity
 can easily explode given seemingly small design decisions going in
 the wrong direction.

 The primary strikes against proposal 1 are:

 1) Given the practically poor and unstable performance of preceding
    bisection seeking implmentations, it will be difficult to sell
    adopter opinion on an updated bisection design, even one that
    farts unicorns.

 2) Even when the bisection search performs perfectly, it still
    requires multiple seeks (4.5 on average for exact seeking in a
    very smart bisector, and an upper bound of 7 for precise seek in
    a simplest-possible-correct implementation).

 3) Backreference discovery requires reading enough data to find at
    least one page of each stream that uses backreferences.  Although
    the required amount of data is bounded (and on average, much less
    than the upper bound) this too increases the seeking latency.

 The primary advantages of proposal 1 are:

 1) The design is conceptually simple amd presents a single mechanism
    making use of non-redundant, authoritative data.  Although it is
    still possible to mux an invalid file, it is much harder to end
    up with a stream in which second-order seeking 'helper'
    structures disagree with the authoritative timestamps within the
    page data.

 2) Proper design would allow an optimal seek implementation nearly as
    simple as the design.

 3) Zero additional data overhead.

 4) No additional requirements on a muxer (muxers need only obey
    timestamp ordering. The small deviations from this plan in
    original Ogg were rather serious errors in hindsight)

 5) Strict adherence to the overall Ogg muxing philosophy of having
    no features unavailable in a stream context (every feature of the
    stream format is available to a streaming muxer).

Proposal 2: Mandatory bisection index with an optional head-or-tail index

 The structure of the Proposal 2 Ogg file is as in Proposal 1, with
 no changes to the bisection seeking algorithm.  However, in addition
 to the bisection seeking mechanism, an index may be optionally
 present.

 1) The index may be in either the header or footer of the file (but
    not both). --- point of discussion; perhaps only in the tail?
    This would make the index a fully encode-stream-available
    feature.

 2) The index consists of hard byte offsets to the earliest
    backreference required to decode all continuous media types from
    the specified DTS. -- point of discussion; secondary index for
    discontinuous media streams? Discontinuous spacing can easily be
    too wide to include in a single-backreference index system; there
    would potentially need to be a dedicated index per discontinuous
    stream.

 3) The index does not have a specific, hard resolution.  It may be
    as fine or coarse as desired.

 4) The index is ephemeral and non-authoritative. It can be generated
    or regenerated trivially via a linear scan of the stream.
    Remuxing a stream would necessarily invalidate any preexisting
    index.

 5) The index is not mandatory and cannot be relied upon to exist.

 Experience with Matroska and the original version of Ogg shows that
 where an index and a bisection mechanism coexist, implementations
 usually ignore the bisection mechanism entirely and rely on the
 index alone, refusing to seek when no index is present.  This had
 been the original reason for resisting an index in Ogg, borne out by
 the the fact that few Matroska demuxers implement the mandatory
 bisection seeking and so can seek only in Matroska files with
 indexes.  WebM went one step further by mandating an index and
 eliminating the Matroska bisection seek mechanism entirely.

 An index is undeniably useful.  Bisection seeking requires 3.5 seeks
 on average (without backreferences) and between 4.5 and 7 average
 seeks in a file with backreferences, depending on the sophistication
 of the search algorithm.  Indexes do not necessarily eliminate all
 seeks, and may not improve upon the amount of data that needs to be
 read, but it's possible to generate an index that does so whenever
 such an index is warranted.

 Thus Proposal 2 assumes that the benefit of an index (improved
 seeking performance) outweighs the disadvantages (incomplete demuxer
 implementations that cannot seek when the index is absent).

 The disadvantages of Proposal 2 are:

 1) Redundant, non-structural information. Having a non-authoritative
    index increases the opportunity for mux and demux implementation
    bugs. There will certainly be cases in the wild where the index
    does not agree with the stream.  This disagreement isn't possible
    if there's no non-structural or redundant seeking information.

 2) A front-positioned index breaks the Ogg 'everything is always a
    stream' model, as a front-positioned index necessarily requires a
    second pass.

 3) Experience in multiple containers show that implementors will
    tend to implement index-only seeking [and even tell users their
    files are 'broken' if the files don't contain an index].

 The advantages of Proposal 2 are:

 1) Latency. Indexes can be generated for consumption models like the
    web where seeks can be very high latency, and reducing the seek
    number to 'always only 1' is a boon.

 2) A tail-positioned index would preserve the streaming model and
    keep time-to-first-frame low by avoiding a large index in the
    header at stream startup.  The index can be grabbed from the
    stream tail asynchronously.

 3) Coarse-grained indexes can be used as a mechanism to inform
    bisection search, still resulting in fewer seeks while also
    reducing the index overhead in situations where this overhead is
    critical.

Proposal 3: Bisection search with threaded backreference

 Proposal 3 makes use of bisection seeking as above, but rather than
 using an index, adds an auxiliary backreference structure threaded
 through the stream in order to eliminate the second stage of the
 bisection seek.  The bisection sarch algorithm always short-circuits
 at step 5:

 1) Make an initial guess as to the physical position of the desired
    seekpoint based on known stream duration and simple linear
    interpolation.

 2) Seek and capture a page

 3) Construct a new interpolated best guess as to the desired seek
    location based on DTS of captured page

 4) repeat steps 2 and 3 until we have seen contiguous pages, the
    first with a time stamp immediately preceeding the desired point
    in time and the second with a time stamp equal-to-or-following
    the desired point.

 5.a) If none of the codec streams being decoded use backreferences,
    we're finished.

 5.b) If we need to seek to a backreference, the auxiliary
    backreference structure gives us a hard byte offset to seek back
    to.
 The proposed threaded backreference mechanism would append a single,
 variable-length-encoded byte offset to the end of each page after
 the packet data.  This addiiton would modify the capture criteria
 such that capture must account for the optional, unsignaled field.
 After reading packet data, demux framing would have to check for
 either a capture pattern, or a backreference followed by a capture
 pattern.

 Variations of this proposal could also include bisection hints to
 further reduce the order of complexity of the bisection search.

 As I see it, this middle of the road approach is a compromise
 without real advantages.  In its favor:

 1) Streams that make use of backreferences (most AV streams) would
    avoid the second bisection search, always reducing the average
    number of seeks by a little more than one.

 However, there are significant disadvantages.

 1) Like Proposal 2, the threaded backreference mechanism means
    seeking makes use of two ~ unrelated mechanisms instead of one.
    The threaded backreference increases mux and demux complexity.

 2) The average seek number drops from ~ 4.5 physcal seeks to ~ 3.5.
    This is not a significant improvement given the additional
    compexity.  Proposal 2 is simpler, similar overhead, and drops
    best case seeks to 1.

 3) Like proposal 2, the additional backreference offset is
    redundant, non-authoritative, and non-structural.

In short, this mechanism has the disadvantages of Proposal 2 (in addition to being weird) but does not deliver as much advantage as Proposal 2.