'''Note that this document is obsolete, and incorrect with respect to seeking in multiplexed streams.''' It does accurately describe the rationale behind the two-part granulepos scheme (option 3 below) now used in Theora, Dirac, CMML and other codecs in Ogg.
Folks have noticed that the documentation is semi-silent about how to properly encode the granule position and interleave synchronization of keyframe-based video. The primary reasons for this:
* we at Xiph hadn't had to do it yet
* there are several easy possibilities, and the longer we had to think about it before mandating One True Spec, the better that spec would likely be.
The lack of a painfully explicit spec has led to the theory that it's not possible; that's not true, there are a few ways to do it. Several require no extension to Ogg stream v 0. A last way requires an extra field (a point against it), but does not actually break any stream that currently exists.
The time has come to lay down the spec, as we're now building the real abstraction layers in a concrete Ogg framework where the Ogg engine, the codecs, and the overarching Ogg control layers are neatly put into boxes connected in formalized ways.
Below I go into detail about each scheme in a 'thinking aloud' sort of way. This is not because I haven't already given the matter sufficient thought, it is because I wish to give the reader sufficient background information to understand why one way is better than the others. This is not a call for input so much as an educational effort (and a public sanity check of my thinking; please do pipe up if it appears I missed a salient point).
==== Starting Assumptions: ====
1) Ogg is not a non-linear format. It is not a replacement for the scripting system of a DVD player. It is a media transport format designed to do nothing more than deliver content, in a stream, and have all the pieces arrive on time and in sync. It is not designed to *prevent* more complex use of content, it merely does not implement anything beyond a linear representation of the data contained within. If you want to build a real non-linear format, build it *from* Ogg, not *into* Ogg. This has been the intent from day 1.
2) The Ogg layer does not know specifics of the codec data it's multiplexing into a stream. It knows nothing beyond 'Oooo, packets!', that the packets belong to different buckets, that the packets go in order, and that packets have position markers. Ogg does not even have a concept of 'time'; it only knows about the sequentially increasing, unitless position markers. It is up to higher layers which have access to the codec APIs to assign and convert units of framing or time.
3) Given pre-cached decode headers, a player may seek into a stream at any point and begin decode. It may be the case that audio may start after video by a fraction of a second, or video might be blank until the stream hits the next keyframe, but this simplest case must just work, and there will be sufficient information to maintain perfect cross-media sync.
4) (This departs from current reality, but it will be the reality very soon; vorbisfile currently blurs the careful abstraction I'm about to describe) Seeking at an arbitrary level of precision is a distributed abstraction in the larger Ogg picture. At the lowest-level Ogg stream abstraction, seeking is one operation: "find me the page from logical stream 'n' with granule position 'x'". All more complex seeking operations are a function of a higher-level layer (with knowledge of the media types and codec in use) making intelligent use of this lowest Ogg abstraction. The Ogg stream abstraction need deal with nothing more complex than 'find this page'.
The various granulepos strategies for keyframes concern this last point.
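The single Ogg-level operation from assumption 4 can be sketched roughly as follows. This is only an illustration, not libogg code: the `Page` tuple and `find_page` function are hypothetical names, the pages live in an in-memory list rather than being bisected by byte offset in a file, and I am assuming that "the page for 'x'" means the earliest page whose granulepos is the largest value strictly below 'x'.

```python
from bisect import bisect_left
from typing import NamedTuple, Optional, Sequence


class Page(NamedTuple):
    serialno: int    # which logical stream the page belongs to
    granulepos: int  # unitless, non-decreasing position marker


def find_page(pages: Sequence[Page], serialno: int, x: int) -> Optional[int]:
    """Return the index in `pages` of the earliest page of `serialno`
    whose granulepos is the largest one strictly below x, or None."""
    # Indices of the pages that belong to the requested logical stream.
    own = [i for i, p in enumerate(pages) if p.serialno == serialno]
    positions = [pages[i].granulepos for i in own]

    # First position >= x; everything before it is strictly below x.
    cut = bisect_left(positions, x)
    if cut == 0:
        return None  # no page of this stream ends before x
    best = positions[cut - 1]
    # Several pages may share `best`; take the earliest of them.
    first = bisect_left(positions, best)
    return own[first]


# Two interleaved logical streams (serial numbers 1 and 2).
pages = [Page(1, 10), Page(2, 5), Page(1, 20), Page(1, 20), Page(2, 9), Page(1, 30)]
result = find_page(pages, serialno=1, x=25)
```

Note that nothing here knows about frames, keyframes or time. Choosing which 'x' to ask for is left entirely to the caller.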
The basic issue with video from which complexity arises is that frames often depend on previous and possibly future frames. This is one instance of a larger, general category: codecs whose streams may not begin decode from just any packet, as well as packets that may not represent an entire frame, or even a fixed-time sampling algorithm. It is a mistake to design a seeking system tied to an exact set of very specific cases. While one could implement an explicit keyframe mechanism at the Ogg level, this mechanism would not cover any of the other interesting seeking cases while, as I'll show below, the mechanism would not actually be necessary.
There will be a few complaints that Ogg is being unnecessarily subtle and shifts a great deal of complexity into software which a few extra page header fields could eliminate. Consider the following:
1) Ogg was designed to impose roughly 0.5-1% overhead over the raw packet data over a wide range of packet usage patterns. 'A few extra fields' begins inflating that figure for specific special cases that only apply to a few stream types. Right now there is no header field that is not general to every stream. There is no fat in the page headers.
2) The Ogg-level seeking algorithm is exceptionally simple and can be described in a single sentence: "Find the earliest page with a granulepos less than but closest to 'x'". This shifts the onus of assembling more complex seeking operations, which require knowledge of a specific media type, into a higher layer that has knowledge of that media type. The higher layer becomes responsible for determining for what 'x' Ogg should search. The division of labor is clear and sensible.
3) Complex, precise seeking operations are still contained entirely within the framework, just at a higher layer than Ogg-stream. At no time is an application developer required to deal with seeking mechanisms within an Ogg stream or to manually maintain stream
==== High level handwaving - How seeking really works ====
The granulepos is intended to mean, roughly, "If I stop decode at the end of this page, I will get data from my decoder up to position 'granulepos'." The granulepos simultaneously provides seeking information and a 'length-of-stream' indicator. Depending on the codec, it can also usually be used to indicate a timebase, but that isn't our problem right now.
In this way, we can always seek, first time, to a desired key frame page (by seeking to Ogg page 'x' where (x & 0xff) == 0). In addition, each frame still has a unique frame number and also a clear 'group' number, potentially useful information to the decoder. Lastly, granulepos is still semantically correct, although it is now, in a sense, representing a whole.fractional frame number for buffering purposes.
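As a rough sketch of the two-part idea, reading the key frame test above as "the low eight bits are zero": the granulepos is split into an upper 'group' field and a lower field counting frames within the group. The eight-bit width is taken from the 0xff mask in the text. The function names are hypothetical, and the exact meaning of the upper field is an assumption made for illustration.

```python
KEY_SHIFT = 8                        # assumed width of the low field (from the 0xff mask)
OFFSET_MASK = (1 << KEY_SHIFT) - 1   # 0xff


def pack_granulepos(group: int, frames_since_key: int) -> int:
    """Combine a group number and an offset within the group."""
    if not 0 <= frames_since_key <= OFFSET_MASK:
        raise ValueError("offset does not fit in the low field")
    return (group << KEY_SHIFT) | frames_since_key


def unpack_granulepos(granulepos: int) -> tuple:
    """Split a granulepos back into (group, frames_since_key)."""
    return granulepos >> KEY_SHIFT, granulepos & OFFSET_MASK


def is_keyframe(granulepos: int) -> bool:
    """A key frame is the frame at offset zero within its group."""
    return (granulepos & OFFSET_MASK) == 0


def keyframe_seek_target(granulepos: int) -> int:
    """The 'x' a higher layer would hand to the Ogg-level seek in
    order to land on the key frame that starts this frame's group."""
    return granulepos & ~OFFSET_MASK


wanted = pack_granulepos(group=3, frames_since_key=5)
target = keyframe_seek_target(wanted)
```

The higher layer computes `target` from the frame it actually wants, asks Ogg for that one value, and then decodes forward from the key frame, so a single seek is enough.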
===== Scheme Four: Extra 'Seekpos' Field / Straw Man =====
Another possibility requires extension of the current Ogg page format. Although older players would reject any such extended pages as invalid, we do have versioning and typing fields, so there aren't actually any compatibility problems with current Ogg pages... in the future.
The idea in this scheme is to keep the current granulepos as a frame number field (ala scheme 1), but also add a new field 'seekpos' that is used, rather than granulepos, in seeking. The seekpos would represent the number of the last keyframe that passed by.
advantages:
1) The net effect of this strategy is to modify scheme 1 to only require one bisection seek rather than two. Some amount of code simplification (over scheme 1) at the decision-making level.
disadvantages:
1) The Ogg format will need to be revved. No current (ala 1.0) Ogg code will understand the new pages.
2) The header becomes larger, from a minimum size of 27 bytes to a minimum size of 35.
3) This strategy only enhances keyframes; it is of no use in other odd seeking cases.
4) Gives no more information than scheme 3, but is still more complicated, both in code and API (Ogg would have to understand keyframes).
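The header arithmetic in disadvantage 2 can be checked with a small sketch. The first layout is the fixed part of the current page header (before the segment table). The second is hypothetical: it assumes 'seekpos' is a 64-bit field like granulepos, and its placement right after granulepos is chosen purely for illustration.

```python
import struct

# Fixed part of an Ogg page header, little-endian, no padding:
# capture pattern "OggS", version, header type, granulepos,
# serial number, page sequence number, CRC checksum, segment count.
CURRENT_HEADER = struct.Struct("<4sBBqIIIB")

# Hypothetical scheme-four layout: the same fields plus a 64-bit seekpos.
# Where the field would actually go is not specified in the text.
EXTENDED_HEADER = struct.Struct("<4sBBqqIIIB")

current_size = CURRENT_HEADER.size
extended_size = EXTENDED_HEADER.size
extra_bytes = extended_size - current_size

print(current_size, extended_size, extra_bytes)
```

Every page in every stream would pay those extra bytes, whether or not its codec has any notion of key frames.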
Thus, there's no substantial reason to prefer extending the format over a scheme that's possible within the existing framework. Note that schemes 1-3 can all be implemented within the Ogg stream today.