TransOgg Seeking Proposals
Three proposals for seeking structures/methods in transOgg
Seeking within a container is an example of 'we can't have everything, so we have to choose what we really want'. We consider the following design requirements wants and needs:
1) Efficient seeking with minimum latency. Minimum overall latency can depend heavily on the scenario, where seeking within a file on a hard disk, over http on a high bandwidth link/high latency and over http on a low bandwidth/low latency link will all have different optimal design requires. Thus the design will have to shoot for good behavior across the lot.
2) Low or no additional overhead. Because, hey, wasted space is wasted space.
3) Seeking must be available within Ogg's specific streaming design philosophy; an encoder must be able to write a seekable file in one pass, and a demuxer must be able to seek in any valid Ogg file.
4) Muxing must not require repaginating existing data (the 'deck of cards' metaphor)
5) A design that the designers can actually document completely and that implementors are actually willing and able to implement.
Seeking in original Ogg treated 2, 3, 4 as paramount, 1 as secondary, and was a disaster due to 5. transOgg will need to strike a different balance. I submit three seeking mechanism proposals for consideration.
Proposal 1: The original Ogg bisection search only
We stick with the original seeking algorithm as intended in the original Ogg bitstream and correct the corner cases that made it a serious pain. There is no index. Seeking is done via bisection search:
1) Make an initial guess as to the physical position of the desired seekpoint based on known stream duration and simple linear interpolation.
2) Seek and capture a page
3) Construct a new interpolated best guess as to the desired seek location based on DTS of captured page
4) repeat steps 2 and 3 until we have seen contiguous pages, the first with a time stamp immediately preceeding the desired point in time and the second with a time stamp equal-to-or-following the desired point.
5.a) If none of the codec streams being decoded use backreferences, we're finished.
5.b) In the event of codec streams with backreferences (such as is typical in video), the pages for the codec stream also include the DTS of the earliest required reference. Read forward until we've seen one page from each decoded stream using backreferences; save the earliest backreference DTS.
6) Repeat the bisection search in 2-4 using the backreference DTS as the desired final seek location.
This algorithm appears more complex than it is; bisection searching is a well understood technique, and the average bisection search in a 1GB ogg file (of mixed Vorbis audio and Theora video) required approximately 3.5 seeks for the initial seek and just over one more seek for the second (as the first search can inform the second quite well).
This original Ogg seeking method was unpopular for several reasons:
1) The DTS and backreference encoding was defined by-codec, and thus decoding the timestamps required additional infrastructure that most frameworks had to code from scratch. Several frameworks never implemented precise seeking for this reason alone.
2) Corner cases introduced by specific design decisions (eg, the presence of pages with no timestamps) complicated the actual implementation. In addition, correct behavior for these corner cases was not well documented or not documented at all.
3) Xiph never implemented its own all-encompassing framework to provide an example of complete seeking that worked in any Ogg file. In general, we hardwired our own apps for seeking in either Vorbis alone or Vorbis+Theora only.
4) Many frameworks that attempted to implement proper bisection seeking had bugs that went unnoticed or unfixed (owing to all of the above). These bugs could increase the number of seeks in a bisection search by a factor of 10X-100X. This gave bisection seeking a reputation for inherent poor performace. Even Vorbisfile has suffered from a number of bugs of exactly this sort at several points.
5) Stream structure discovery was also based on performing multiple bisection searches to find link boundaries. Given all the points above, and the fact that stream structure decisions added additional corner cases to implementations, structure discovery also tended to be buggy and slow.
The lesson of the above is that it is not enough to have a conceptually simple design. The actual implementation complexity can easily explode given seemingly small design decisions going in the wrong direction.
The primary strikes against proposal 1 are:
1) Given the practically poor and unstable performance of preceding bisection seeking implmentations, it will be difficult to sell adopter opinion on an updated bisection design, even one that farts unicorns.
2) Even when the bisection search performs perfectly, it still requires multiple seeks (4.5 on average for exact seeking in a very smart bisector, and an upper bound of 7 for precise seek in a simplest-possible-correct implementation).
3) Backreference discovery requires reading enough data to find at least one page of each stream that uses backreferences. Although the required amount of data is bounded (and on average, much less than the upper bound) this too increases the seeking latency.
The primary advantages of proposal 1 are:
1) The design is conceptually simple amd presents a single mechanism making use of non-redundant, authoritative data. Although it is still possible to mux an invalid file, it is much harder to end up with a stream in which second-order seeking 'helper' structures disagree with the authoritative timestamps within the page data.
2) Proper design would allow an optimal seek implementation nearly as simple as the design.
3) Zero additional data overhead.
4) No additional requirements on a muxer (muxers need only obey timestamp ordering. The small deviations from this plan in original Ogg were rather serious errors in hindsight)
5) Strict adherence to the overall Ogg muxing philosophy of having no features unavailable in a stream context (every feature of the stream format is available to a streaming muxer).
Proposal 2: Mandatory bisection index with an optional head-or-tail index
The structure of the Proposal 2 Ogg file is as in Proposal 1, with no changes to the bisection seeking algorithm. However, in addition to the bisection seeking mechanism, an index may be optionally present.
1) The index may be in either the header or footer of the file (but not both). --- point of discussion; perhaps only in the tail? This would make the index a fully encode-stream-available feature.
2) The index consists of hard byte offsets to the earliest backreference required to decode all continuous media types from the specified DTS. -- point of discussion; secondary index for discontinuous media streams? Discontinuous spacing can easily be too wide to include in a single-backreference index system; there would potentially need to be a dedicated index per discontinuous stream.
3) The index does not have a specific, hard resolution. It may be as fine or coarse as desired.
4) The index is ephemeral and non-authoritative. It can be generated or regenerated trivially via a linear scan of the stream. Remuxing a stream would necessarily invalidate any preexisting index.
5) The index is not mandatory and cannot be relied upon to exist.
Experience with Matroska and the original version of Ogg shows that where an index and a bisection mechanism coexist, implementations usually ignore the bisection mechanism entirely and rely on the index alone, refusing to seek when no index is present. This had been the original reason for resisting an index in Ogg, borne out by the the fact that few Matroska demuxers implement the mandatory bisection seeking and so can seek only in Matroska files with indexes. WebM went one step further by mandating an index and eliminating the Matroska bisection seek mechanism entirely.
An index is undeniably useful. Bisection seeking requires 3.5 seeks on average (without backreferences) and between 4.5 and 7 average seeks in a file with backreferences, depending on the sophistication of the search algorithm. Indexes do not necessarily eliminate all seeks, and may not improve upon the amount of data that needs to be read, but it's possible to generate an index that does so whenever such an index is warranted.
Thus Proposal 2 assumes that the benefit of an index (improved seeking performance) outweighs the disadvantages (incomplete demuxer implementations that cannot seek when the index is absent).
The disadvantages of Proposal 2 are:
1) Redundant, non-structural information. Having a non-authoritative index increases the opportunity for mux and demux implementation bugs. There will certainly be cases in the wild where the index does not agree with the stream. This disagreement isn't possible if there's no non-structural or redundant seeking information.
2) A front-positioned index breaks the Ogg 'everything is always a stream' model, as a front-positioned index necessarily requires a second pass.
3) Experience in multiple containers show that implementors will tend to implement index-only seeking [and even tell users their files are 'broken' if the files don't contain an index].
The advantages of Proposal 2 are:
1) Latency. Indexes can be generated for consumption models like the web where seeks can be very high latency, and reducing the seek number to 'always only 1' is a boon.
2) A tail-positioned index would preserve the streaming model and keep time-to-first-frame low by avoiding a large index in the header at stream startup. The index can be grabbed from the stream tail asynchronously.
3) Coarse-grained indexes can be used as a mechanism to inform bisection search, still resulting in fewer seeks while also reducing the index overhead in situations where this overhead is critical.
Proposal 3: Bisection search with threaded backreference
Proposal 3 makes use of bisection seeking as above, but rather than using an index, adds an auxiliary backreference structure threaded through the stream in order to eliminate the second stage of the bisection seek. The bisection sarch algorithm always short-circuits at step 5:
1) Make an initial guess as to the physical position of the desired seekpoint based on known stream duration and simple linear interpolation.
2) Seek and capture a page
3) Construct a new interpolated best guess as to the desired seek location based on DTS of captured page
4) repeat steps 2 and 3 until we have seen contiguous pages, the first with a time stamp immediately preceeding the desired point in time and the second with a time stamp equal-to-or-following the desired point.
5.a) If none of the codec streams being decoded use backreferences, we're finished.
5.b) If we need to seek to a backreference, the auxiliary backreference structure gives us a hard byte offset to seek back to. The proposed threaded backreference mechanism would append a single, variable-length-encoded byte offset to the end of each page after the packet data. This addiiton would modify the capture criteria such that capture must account for the optional, unsignaled field. After reading packet data, demux framing would have to check for either a capture pattern, or a backreference followed by a capture pattern.
Variations of this proposal could also include bisection hints to further reduce the order of complexity of the bisection search.
As I see it, this middle of the road approach is a compromise without real advantages. In its favor:
1) Streams that make use of backreferences (most AV streams) would avoid the second bisection search, always reducing the average number of seeks by a little more than one.
However, there are significant disadvantages.
1) Like Proposal 2, the threaded backreference mechanism means seeking makes use of two ~ unrelated mechanisms instead of one. The threaded backreference increases mux and demux complexity.
2) The average seek number drops from ~ 4.5 physcal seeks to ~ 3.5. This is not a significant improvement given the additional compexity. Proposal 2 is simpler, similar overhead, and drops best case seeks to 1.
3) Like proposal 2, the additional backreference offset is redundant, non-authoritative, and non-structural.
In short, this mechanism has the disadvantages of Proposal 2 (in addition to being weird) but does not deliver as much advantage as Proposal 2.