TransOgg Seeking Proposals
Three proposals for seeking structures/methods in transOgg
Seeking within a container is an example of 'we can't have everything, so we have to choose what we really want'. We consider the following design requirements wants and needs:
1) Efficient seeking with minimum latency. Minimum overall latency can depend heavily on the scenario, where seeking within a file on a hard disk, over http on a high bandwidth link/high latency and over http on a low bandwidth/low latency link will all have different optimal design requires. Thus the design will have to shoot for good behavior across the lot.
2) Low or no additional overhead. Because, hey, wasted space is wasted space.
3) Seeking must be available within Ogg's specific streaming design philosophy; an encoder must be able to write a seekable file in one pass, and a demuxer must be able to seek in any valid Ogg file.
4) Muxing must not require repaginating existing data (the 'deck of cards' metaphor)
5) A design that the designers can actually document completely and that implementors are actually willing and able to implement.
Seeking in original Ogg treated 2, 3, 4 as paramount, 1 as secondary, and was a disaster due to 5. transOgg will need to strike a different balance. I submit three seeking mechanism proposals for consideration.
Proposal 1: The original Ogg bisection search only
We stick with the original seeking algorithm as intended in the original Ogg bitstream and correct the corner cases that made it a serious pain. There is no index. Seeking is done via bisection search:
1) Make an initial guess as to the physical position of the desired
seekpoint based on known stream duration and simple linear
interpolation.
2) Seek and capture a page
3) Construct a new interpolated best guess as to the desired seek
location based on DTS of captured page
4) repeat steps 2 and 3 until we have seen contiguous pages, the
first with a time stamp immediately preceeding the desired point
in time and the second with a time stamp equal-to-or-following
the desired point.
5.a) If none of the codec streams being decoded use backreferences,
we're finished.
5.b) In the event of codec streams with backreferences (such as is
typical in video), the pages for the codec stream also include
the DTS of the earliest required reference. Read forward
until we've seen one page from each decoded stream using
backreferences; save the earliest backreference DTS.
6) Repeat the bisection search in 2-4 using the backreference
DTS as the desired final seek location.
This algorithm appears more complex than it is; bisection searching is a well understood technique, and the average bisection search in a 1GB ogg file (of mixed Vorbis audio and Theora video) required approximately 3.5 seeks for the initial seek and just over one more seek for the second (as the first search can inform the second quite well).
This original Ogg seeking method was unpopular for several reasons:
1) The DTS and backreference encoding was defined by-codec,
and thus decoding the timestamps required additional
infrastructure that most frameworks had to code from scratch.
Several frameworks never implemented precise seeking for this
reason alone.
2) Corner cases introduced by specific design decisions (eg, the
presence of pages with no timestamps) complicated the actual
implementation. In addition, correct behavior for these corner
cases was not well documented or not documented at all.
3) Xiph never implemented its own all-encompassing framework to
provide an example of complete seeking that worked in any Ogg
file. In general, we hardwired our own apps for seeking in
either Vorbis alone or Vorbis+Theora only.
4) Many frameworks that attempted to implement proper bisection
seeking had bugs that went unnoticed or unfixed (owing to all of
the above). These bugs could increase the number of seeks in a
bisection search by a factor of 10X-100X. This gave bisection
seeking a reputation for inherent poor performace. Even
Vorbisfile has suffered from a number of bugs of exactly this
sort at several points.
5) Stream structure discovery was also based on performing multiple
bisection searches to find link boundaries. Given all the points
above, and the fact that stream structure decisions added
additional corner cases to implementations, structure discovery
also tended to be buggy and slow.
The lesson of the above is that it is not enough to have a conceptually simple design. The actual implementation complexity can easily explode given seemingly small design decisions going in the wrong direction.
The primary strikes against proposal 1 are:
1) Given the practically poor and unstable performance of preceding
bisection seeking implmentations, it will be difficult to sell
adopter opinion on an updated bisection design, even one that
farts unicorns.
2) Even when the bisection search performs perfectly, it still
requires multiple seeks (4.5 on average for exact seeking in a
very smart bisector, and an upper bound of 7 for precise seek in
a simplest-possible-correct implementation).
3) Backreference discovery requires reading enough data to find at
least one page of each stream that uses backreferences. Although
the required amount of data is bounded (and on average, much less
than the upper bound) this too increases the seeking latency.
The primary advantages of proposal 1 are:
1) The design is conceptually simple amd presents a single mechanism
making use of non-redundant, authoritative data. Although it is
still possible to mux an invalid file, it is much harder to end
up with a stream in which second-order seeking 'helper'
structures disagree with the authoritative timestamps within the
page data.
2) Proper design would allow an optimal seek implementation nearly as
simple as the design.
3) Zero additional data overhead.
4) No additional requirements on a muxer (muxers need only obey
timestamp ordering. The small deviations from this plan in
original Ogg were rather serious errors in hindsight)
5) Strict adherence to the overall Ogg muxing philosophy of having
no features unavailable in a stream context (every feature of the
stream format is available to a streaming muxer).
Proposal 2: Mandatory bisection index with an optional head-or-tail index
The structure of the Proposal 2 Ogg file is as in Proposal 1, with no changes to the bisection seeking algorithm. However, in addition to the bisection seeking mechanism, an index may be optionally present.
1) The index may be in either the header or footer of the file (but
not both). --- point of discussion; perhaps only in the tail?
This would make the index a fully encode-stream-available
feature.
2) The index consists of hard byte offsets to the earliest
backreference required to decode all continuous media types from
the specified DTS. -- point of discussion; secondary index for
discontinuous media streams? Discontinuous spacing can easily be
too wide to include in a single-backreference index system; there
would potentially need to be a dedicated index per discontinuous
stream.
3) The index does not have a specific, hard resolution. It may be
as fine or coarse as desired.
4) The index is ephemeral and non-authoritative. It can be generated
or regenerated trivially via a linear scan of the stream.
Remuxing a stream would necessarily invalidate any preexisting
index.
5) The index is not mandatory and cannot be relied upon to exist.
Experience with Matroska and the original version of Ogg shows that where an index and a bisection mechanism coexist, implementations usually ignore the bisection mechanism entirely and rely on the index alone, refusing to seek when no index is present. This had been the original reason for resisting an index in Ogg, borne out by the the fact that few Matroska demuxers implement the mandatory bisection seeking and so can seek only in Matroska files with indexes. WebM went one step further by mandating an index and eliminating the Matroska bisection seek mechanism entirely.
An index is undeniably useful. Bisection seeking requires 3.5 seeks on average (without backreferences) and between 4.5 and 7 average seeks in a file with backreferences, depending on the sophistication of the search algorithm. Indexes do not necessarily eliminate all seeks, and may not improve upon the amount of data that needs to be read, but it's possible to generate an index that does so whenever such an index is warranted.
Thus Proposal 2 assumes that the benefit of an index (improved seeking performance) outweighs the disadvantages (incomplete demuxer implementations that cannot seek when the index is absent).
The disadvantages of Proposal 2 are:
1) Redundant, non-structural information. Having a non-authoritative
index increases the opportunity for mux and demux implementation
bugs. There will certainly be cases in the wild where the index
does not agree with the stream. This disagreement isn't possible
if there's no non-structural or redundant seeking information.
2) A front-positioned index breaks the Ogg 'everything is always a
stream' model, as a front-positioned index necessarily requires a
second pass.
3) Experience in multiple containers show that implementors will
tend to implement index-only seeking [and even tell users their
files are 'broken' if the files don't contain an index].
The advantages of Proposal 2 are:
1) Latency. Indexes can be generated for consumption models like the
web where seeks can be very high latency, and reducing the seek
number to 'always only 1' is a boon.
2) A tail-positioned index would preserve the streaming model and
keep time-to-first-frame low by avoiding a large index in the
header at stream startup. The index can be grabbed from the
stream tail asynchronously.
3) Coarse-grained indexes can be used as a mechanism to inform
bisection search, still resulting in fewer seeks while also
reducing the index overhead in situations where this overhead is
critical.
Proposal 3: Bisection search with threaded backreference
Proposal 3 makes use of bisection seeking as above, but rather than using an index, adds an auxiliary backreference structure threaded through the stream in order to eliminate the second stage of the bisection seek. The bisection sarch algorithm always short-circuits at step 5:
1) Make an initial guess as to the physical position of the desired
seekpoint based on known stream duration and simple linear
interpolation.
2) Seek and capture a page
3) Construct a new interpolated best guess as to the desired seek
location based on DTS of captured page
4) repeat steps 2 and 3 until we have seen contiguous pages, the
first with a time stamp immediately preceeding the desired point
in time and the second with a time stamp equal-to-or-following
the desired point.
5.a) If none of the codec streams being decoded use backreferences,
we're finished.
5.b) If we need to seek to a backreference, the auxiliary
backreference structure gives us a hard byte offset to seek back
to.
The proposed threaded backreference mechanism would append a single,
variable-length-encoded byte offset to the end of each page after
the packet data. This addiiton would modify the capture criteria
such that capture must account for the optional, unsignaled field.
After reading packet data, demux framing would have to check for
either a capture pattern, or a backreference followed by a capture
pattern.
Variations of this proposal could also include bisection hints to further reduce the order of complexity of the bisection search.
As I see it, this middle of the road approach is a compromise without real advantages. In its favor:
1) Streams that make use of backreferences (most AV streams) would
avoid the second bisection search, always reducing the average
number of seeks by a little more than one.
However, there are significant disadvantages.
1) Like Proposal 2, the threaded backreference mechanism means
seeking makes use of two ~ unrelated mechanisms instead of one.
The threaded backreference increases mux and demux complexity.
2) The average seek number drops from ~ 4.5 physcal seeks to ~ 3.5.
This is not a significant improvement given the additional
compexity. Proposal 2 is simpler, similar overhead, and drops
best case seeks to 1.
3) Like proposal 2, the additional backreference offset is
redundant, non-authoritative, and non-structural.
In short, this mechanism has the disadvantages of Proposal 2 (in addition to being weird) but does not deliver as much advantage as Proposal 2.