== Granulepos encoding and How seeking really works ==


This describes how to seek on a multiplexed stream which contains logical bitstreams with granuleshift, such as
Theora, Kate, CMML or OggText.
The purpose is to locate the earliest page that is required for rendering a given time offset.
Because two time-seeking operations are required, this procedure is commonly referred to as a "double seek".


=== Definitions ===


Overload '''time''' to mean the time represented by a granulepos. Hence the time of a page is the time represented by the page granulepos.


Define '''seek''' to mean: for each logical bitstream, locate the bytewise-latest page in the bitstream with a time < the target time, then choose the bytewise-earliest among these pages. Thus if two pages have the same time, seeking will locate the bytewise-earlier page.


==== Granules and Granuleshift ====


We use the term '''granule''' to refer to time measured in the units of the codec. For audio codecs this is usually samples, and for video codecs it is frames or fields.


In some formats, pages have a dependency on the data of an earlier page; for example in Theora, interframes have a dependency on an earlier keyframe -- the keyframe data is required to decode the interframe. We encode both the time of the page and the time of the page it depends on into the granulepos. In order to do this we treat the granulepos as a bitfield as follows:


+---------------------+-------------+
| prev_granule        | offset      |
+---------------------+-------------+


Then if a page has time in units of codec granules <tt>curr_granule</tt>, and the page it depends on has time <tt>prev_granule</tt>, we define <tt>offset</tt> as the difference between these:


offset = curr_granule - prev_granule


We refer to the number of bits used to encode the offset as the "granuleshift". This is fixed for all pages in that track (logical bitstream). So we encode the later page's granulepos as:


granulepos = (prev_granule << granuleshift) | offset

When decoding, we can extract the curr_granule from a granulepos by simply adding these fields:

curr_granule = prev_granule + offset

This expands to the following expression in terms of the page granulepos:

curr_granule = (granulepos >> granuleshift) + (granulepos & ((1 << granuleshift) - 1))
 
Keyframes, and other data with no dependency on earlier packets, are encoded with:
 
prev_granule = curr_granule, offset = 0
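
For illustration, here is a minimal C sketch of this packing and unpacking. The granuleshift value itself comes from the codec's headers; the function names below are hypothetical, not part of libogg:

 #include <stdint.h>
 
 /* Pack a page's own time and the time of the page it depends on into a
    granulepos, using the track's granuleshift. */
 static int64_t granulepos_pack(int64_t prev_granule, int granuleshift,
                                int64_t curr_granule)
 {
     int64_t offset = curr_granule - prev_granule;
     return (prev_granule << granuleshift) | offset;
 }
 
 /* Unpack the dependency time again. */
 static int64_t granulepos_prev(int64_t granulepos, int granuleshift)
 {
     return granulepos >> granuleshift;
 }
 
 /* Recover the page's own time by adding the two fields. */
 static int64_t granulepos_curr(int64_t granulepos, int granuleshift)
 {
     int64_t mask = ((int64_t)1 << granuleshift) - 1;
     return (granulepos >> granuleshift) + (granulepos & mask);
 }

A keyframe page, which depends on no earlier page, simply packs prev_granule = curr_granule, so its offset field is 0.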
 
=== Procedure for seeking on a single track ===
 
In order to locate the earliest page in a track (a logical bitstream)
that is required for rendering a given time offset:
 
# seek to the desired time
# read the prev_granule out of the granulepos
# seek to the time represented by the prev_granule
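
A sketch of these steps in C, assuming two hypothetical helpers that a player would provide: <tt>seek_to_time()</tt>, which performs the bytewise seek described under "Definitions" and returns the granulepos of the page it lands on, and <tt>granule_to_time()</tt>, the codec-specific conversion from granules to seconds:

 #include <stdint.h>
 
 typedef struct player player;              /* opaque player state (hypothetical) */
 extern int64_t seek_to_time(player *p, double seconds);
 extern double  granule_to_time(const player *p, int64_t granule);
 
 /* Locate the earliest page needed to render target_time on one track. */
 double seek_single_track(player *p, double target_time, int granuleshift)
 {
     int64_t granulepos   = seek_to_time(p, target_time);      /* step 1 */
     int64_t prev_granule = granulepos >> granuleshift;        /* step 2 */
     double  start_time   = granule_to_time(p, prev_granule);
     seek_to_time(p, start_time);                              /* step 3 */
     return start_time;
 }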
 
=== Procedure for seeking on a multitrack file ===
 
In order to locate the earliest page in a multitrack file (a physical bitstream) that is required for rendering all tracks from a given time offset:
 
# seek to the desired time
# scan forward until a page has been seen from all of the tracks that use granuleshift; while doing so, record the prev_granule of the bytewise-earliest page encountered from each track
# seek to the minimum of the prev_granules of those pages
 
It is useful to put a bound on the forward scan; the distance scanned depends only on the way the stream is constructed, so it can be large if pages in a particular logical bitstream are sparse.
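
A sketch of the multitrack case in C, using the same hypothetical helpers as above plus a <tt>read_next_page()</tt> that returns the serial number and granulepos of the next page (non-zero at end of stream or at the scan bound), and a <tt>track_index()</tt> that maps a serial number to a small index for tracks that use granuleshift, or -1 for those that do not:

 #include <stdint.h>
 
 typedef struct player player;
 extern int64_t seek_to_time(player *p, double seconds);
 extern int     read_next_page(player *p, int *serialno, int64_t *granulepos);
 extern int     track_index(const player *p, int serialno);
 extern double  granule_to_time(const player *p, int track, int64_t granule);
 
 #define MAX_TRACKS 16                 /* arbitrary bound for this sketch */
 
 double seek_multitrack(player *p, double target_time, int ntracks,
                        const int granuleshift[])
 {
     int64_t prev[MAX_TRACKS];
     int     seen[MAX_TRACKS] = {0};
     int     remaining = ntracks, serialno;
     int64_t granulepos;
 
     seek_to_time(p, target_time);                             /* step 1 */
 
     /* step 2: scan forward until a page from every granuleshift track has
        been seen, recording the prev_granule of its first page */
     while (remaining > 0 && read_next_page(p, &serialno, &granulepos) == 0) {
         int t = track_index(p, serialno);
         if (t < 0 || seen[t] || granulepos < 0) continue;     /* -1: no packet ends here */
         prev[t] = granulepos >> granuleshift[t];
         seen[t] = 1;
         remaining--;
     }
 
     /* step 3: seek to the earliest of the recorded prev_granules,
        comparing them as times since granule units differ between tracks */
     double start = target_time;
     for (int t = 0; t < ntracks; t++) {
         if (!seen[t]) continue;
         double s = granule_to_time(p, t, prev[t]);
         if (s < start) start = s;
     }
     seek_to_time(p, start);
     return start;
 }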
 
=== But how do I "seek to the desired time"? ===
The above assumes that you already know how to seek to a particular granulepos within the stream efficiently.
 
This isn't as simple as it sounds because the Ogg format does not include an index. The lack of an index is a feature rather than a deficiency, and it is one of the primary reasons to use Ogg over some other formats. Because Ogg doesn't have an index, infinite streams and partial streams are automatically supported by correctly written applications. There is no risk of truncation or minor corruption making a stream unseekable. No memory is required to store an index, no bandwidth is wasted to transmit it, and seeking granularity is not limited to the precision of the index. On the other hand, non-indexed formats can require a little more intelligence in the application, and many applications have gotten it wrong (although this intelligence is also needed in a well written application for indexed formats, so that it can seek with a corrupted index or below the index granularity).
 
If you are thinking about seeking in an Ogg file by first building your own complete index: STOP. This is not the correct procedure. It seems simple but it requires a costly read of the entire stream (which may be gigabytes in size, or even infinite).  There is a better way.
 
The correct way to seek to a particular granule value in Ogg is to use a [http://en.wikipedia.org/wiki/Bisection_method bisection search]: seek to the middle of the stream, obtain sync, and compare your target granule position with the current position. If the target is less than the current position, repeat these steps on the left half; if it is greater, repeat them on the right half. By applying this recursive algorithm you are guaranteed to find your target location quickly.
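
A sketch of the bisection in C. It assumes a hypothetical helper <tt>granule_at_or_after()</tt> that seeks to a byte offset, scans forward for the next "OggS" capture pattern, and returns that page's granulepos converted to the target track's granule units, or -1 if no further page exists:

 #include <stdint.h>
 
 typedef struct player player;
 extern int64_t granule_at_or_after(player *p, int64_t byte_pos);
 
 /* Narrow the byte range [lo, hi) until it brackets the page whose granule
    first reaches 'target'.  Ogg pages are at most ~64 kB, so stopping once
    the range is that small leaves only a short linear scan. */
 int64_t bisect_to_granule(player *p, int64_t lo, int64_t hi, int64_t target)
 {
     while (hi - lo > 65536) {
         int64_t mid = lo + (hi - lo) / 2;
         int64_t got = granule_at_or_after(p, mid);
         if (got >= 0 && got < target)
             lo = mid;                /* target lies in the right half */
         else
             hi = mid;                /* target lies in the left half (or past the last page) */
     }
     return lo;
 }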
 
To correctly support chaining you should first use this kind of search to locate the stream endpoints. Then the above approach can be applied within the streams to seek to any location within a chained file.
 
Doing this correctly is somewhat more complicated than it seems due to the existence of continued pages and the risk of a small valid page being contained within a packet. Both of these challenges can be addressed but the solution is left as an exercise for the reader. <!-- Hint: The maximum page size is ~64kbytes -->
 
The bisection search is very good compared to the obvious alternative (a linear scan of the whole file), often taking just a couple of reads to locate the correct location in a file gigabytes in size. The truly obsessive can out-perform bisection on average by using the local bitrate to pick a better target than the halfway point used in a bisection search (the [http://en.wikipedia.org/wiki/Secant_method secant method], but be careful about the worst case becoming linear; see [http://en.wikipedia.org/wiki/Brent%27s_method Brent's method]). The improvement possible from better-than-bisection approaches is probably only relevant for seeking across a high-latency network. In typical applications the added complexity may not be worth the cost.
 
== References ==
 
From an Email by Monty, [http://web.archive.org/web/20031201054855/http://www.xiph.org/archives/theora-dev/200209/0040.html 13th Sept 2002]
 
'''Note that this document is obsolete, and incorrect with respect to seeking in multiplexed streams.''' It does accurately describe the rationale behind the two-part granulepos scheme (option 3 below) now used in Theora, Dirac, CMML and other codecs in Ogg.
 
 
Folks have noticed that the documentation is semi-silent about how to properly encode the granule position and interleave synchronization of keyframe-based video. The primary reasons for this:
 
* we at Xiph hadn't had to do it yet
 
* there are several easy possibilities, and the longer we had to think about it before mandating One True Spec, the better that spec would likely be.
 
The lack of a painfully explicit spec has led to the theory that it's not possible; that's not true, there are a few ways to do it. Several require no extension to Ogg stream v 0. A last way requires an extra field (a point against it), but does not actually break any stream that currently exists.
 
The time has come to lay down the spec as we're currently building the real abstraction layers in a concrete Ogg framework now where the Ogg engine, the codecs, and the overarching Ogg control layers are neatly put into boxes connected in formalized ways. Below I go into detail about each scheme in a 'thinking aloud' sort of way. This is not because I haven't already given the matter sufficient thought, it is because I wish to give the reader sufficient background information to understand why one way is better than the others. This is not a call for input so much as an educational effort (and a public sanity check of my thinking; please do pipe up if it appears I missed a salient point).
 
==== Starting Assumptions: ====
 
1) Ogg is not a non-linear format. It is not a replacement for the scripting system of a DVD player. It is a media transport format
designed to do nothing more than deliver content, in a stream, and have all the pieces arrive on time and in sync. It is not designed to *prevent* more complex use of content, it merely does not implement anything beyond a linear representation of the data contained within. If you want to build a real non-linear format, build it *from* Ogg, not *into* Ogg. This has been the intent from day 1.
 
2) The Ogg layer does not know specifics of the codec data it's multiplexing into a stream. It knows nothing beyond 'Oooo, packets!', that the packets belong to different buckets, that the packets go in order, and that packets have position markers. Ogg does not even have a concept of 'time'; it only knows about the sequentially increasing, unitless position markers. It is up to higher layers which have access to the codec APIs to assign and convert units of framing or time.
 
3) Given pre-cached decode headers, a player may seek into a stream at any point and begin decode. It may be the case that audio may start after video by a fraction of a second, or video might be blank until the stream hits the next keyframe, but this simplest case must just work, and there will be sufficient information to maintain perfect cross-media sync.
 
4) (This departs from current reality, but it will be the reality very soon; vorbisfile currently blurs the careful abstraction I'm about to describe) Seeking at an arbitrary level of precision is a distributed abstraction in the larger Ogg picture. At the lowest-level Ogg stream abstraction, seeking is one operation: "find me the page from logical stream 'n' with granule position 'x'". All more complex seeking operations are a function of a higher-level layer (with knowledge of the media types and codec in use) making intelligent use of this lowest Ogg abstraction. The Ogg stream abstraction need deal with nothing more complex than 'find this page'.


The various granulepos strategies for keyframes concern this last point.


The basic issue with video from which complexity arises is that frames often depend on previous and possibly future frames. This happens in a larger, general category of codecs whose streams may not begin decode from just any packet as well as packets that may not represent an entire frame, or even a fixed-time sampling algorithm. It is a mistake to design a seeking system tied to an exact set of very specific cases. While one could implement an explicit keyframe mechanism at the Ogg level, this mechanism would not cover any of the other interesting seeking cases while, as I'll show below, the mechanism would not actually be necessary.


There will be a few complaints that Ogg is being unnecessarily subtle and shifts a great deal of complexity into software which a few extra page header fields could eliminate. Consider the following:


1) Ogg was designed to impose roughly .5-1% overhead over the raw packet data over a wide range of packet usage patterns. 'A few extra fields' begins inflating that figure for specific special cases that only apply to a few stream types. Right now there is no header field that is not general to every stream. There is no fat in the page headers.


2) The Ogg-level seeking algorithm is exceptionally simple and can be described in a single sentence: "Find the earliest page with a granulepos less than but closest to 'x'". This shifts the onus of assembling more complex seeking operations requiring knowledge of a specific media type into a higher layer that has knowledge of that media type. The higher layer becomes responsible for determining for what 'x' Ogg should search. The division of labor is clear and sensible.


3) Complex, precise seeking operations are still contained entirely within the framework, just at a higher layer than Ogg-stream. At no time is an application developer required to deal with seeking mechanisms within an Ogg stream or to manually maintain stream synchronization.


==== High level handwaving- How seeking really works ====


The granulepos is intended to mean, roughly, 'If I stop decode at the end of this page, I will get data from my decoder up to position 'granulepos'. The granulepos simultaneously provides seeking information and a 'length-of-stream' indicator. Depending on the codec, it can also usually be used to indicate a timebase, but that isn't our problem right now.


By inference, the granulepos is also used to construct a value 'y' such that 'if I begin decode *from* point 'y', I will get data beginning at position 'granulepos'. Although in some codecs, y == granulepos, that is not necessarily the case when decode can't begin at any arbitrary packet. The granulepos encoding method candidates I will now describe affect exactly the 'granulepos' to 'y' conversion process. Note also that none of these affect Ogg, only the higher decision-making layers... Different circumstances necessitated by different codecs can lead to different valid choices, all of which work as far as Ogg is concerned. However, for our I-/P-/B-frame video case, there is a pretty clear winner.


===== Strategy 1: Straight Granulepos, Keyframes Are Not Our Problem. =====

In this scheme, the granulepos is a simple frame counter. The seeking decision-maker in the codec's framework plugin is responsible for determining if a frame is a keyframe or not, and if it can't begin decode from a given frame, it must request another earlier frame until it finds a keyframe. If the codec so desires, it can store 'what is my keyframe?' information in the stream packets.


This case means that each seek to a *specific* frame in a video stream will generally result in two Ogg seeks; a first seek to the requested frame, then a second seek backwards to find that frame's keyframe.


A larger concern is the semantic accuracy of the granulepos; it's intended to reflect position accurately when decoding forward. In this scheme, it's fine for a P-frame to update the counter (as it can be decoded going strictly forward), but B frames will also advance the counter; they can't be decoded without subsequent P or I frames. Thus, the semantic value of granulepos no longer strictly represents 'we can decode up to 'granulepos' at the end of this frame'.


===== Strategy 2: Granulepos Represents Keyframes Only =====

In this scheme, only keyframes update the granulepos (monotonically or non-monotonically). It simplifies the seeking process to a keyframe as an Ogg-level seek to page 'x' will always yield a page with a keyframe. In addition, granulepos will also always mean 'we can decode up to *at least* this point in the stream'. If the stream is truncated at P or B frames past granulepos, the extra frames can be discarded. (A special case would need to be defined to terminate a stream that doesn't end on an I frame).


The difficulty with this scheme is that it presents slightly more for the software level decoder to track; a proper frame number could not be determined internally without tracking from an I frame. Also, the granulepos of an Ogg page would not necessarily map to the last packet on the page, or even any packet on that page; multiple sequential pages could have the same granulepos. It is conceptually slightly messy, although the 'messiness' does not make it at all impractical.


===== Strategy 3: Granulepos Encodes Some State =====

In some ways, this strategy is the most semantically 'over clever', but also the easiest to implement and the one that gives the most correct, up to date sync information. Pending comments, it is the I/P/B video strategy I currently favor.


The granulepos is 64 bits, a size that is absolutely necessary if, for example, it represents the PCM sample count in an audio codec. When being used to encode video frame number, however, it is comparatively absurdly large*.


* note that although granulepos is not permitted to wrap around, we can simply begin a new logical stream segment with a new serial number should a 30fps video stream ever hit the ten-billion year mark.


Thus we clearly have room to skim a few bits off the bottom of granulepos to represent I, P or B frame. These bits are not used as flags, but rather, frame representation becomes a counting problem; we do this such that the count is still always strictly increasing.


For example, if we know that I frames will never be more than 256 frames apart and P frames no more than 31 B frames apart, the granulepos of an I frame can be defined to always satisfy granulepos & 0xff == 0. If we can have up to seven intervening P frames, they could be numbered granulepos-of-iframe + 0x20, 0x40, 0x60... 0xe0. B frames between the I and P frames would use the remaining five bits and be numbered as sub-I and sub-P frames 1 through 31. Thus, starting from zero, the frames/packets in the pattern IPBBPBBI would be numbered 0x000, 0x020, 0x021, 0x022, 0x040, 0x041, 0x042, 0x100.


If we wish to preserve the ability to represent a timebase, the granulepos number for I frames need not be increased monotonically and shifted; it can be used to represent the frame number. The above example becomes 0x000, 0x020, 0x021, 0x022, 0x040, 0x041, 0x042, 0x700. To get real frame number (from an I frame), we just shift granulepos >> 8. This scheme can be taken further or modified to get frame number from any video frame.
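
For illustration, a small C sketch of this numbering under the assumptions of the example (8 low bits, of which 3 count P frames and 5 count B frames; the helper names are hypothetical). It reproduces the IPBBPBBI sequence above:

 #include <stdint.h>
 #include <stdio.h>
 
 /* I frame: frame number in the high bits, low byte zero. */
 static int64_t gp_I(int64_t frame_number)        { return frame_number << 8; }
 /* P frame: counted 1..7 since the last I frame. */
 static int64_t gp_P(int64_t last_I, int p)        { return (last_I << 8) | (p << 5); }
 /* B frame: counted 1..31 since the last I or P frame. */
 static int64_t gp_B(int64_t last_I, int p, int b) { return (last_I << 8) | (p << 5) | b; }
 
 int main(void)
 {
     /* The pattern I P B B P B B I from the text. */
     int64_t gp[8] = {
         gp_I(0),       gp_P(0, 1),    gp_B(0, 1, 1), gp_B(0, 1, 2),
         gp_P(0, 2),    gp_B(0, 2, 1), gp_B(0, 2, 2), gp_I(7),
     };
     for (int i = 0; i < 8; i++)
         printf("0x%03x ", (unsigned)gp[i]);  /* 0x000 0x020 0x021 0x022 0x040 0x041 0x042 0x700 */
     printf("\n");
     return 0;
 }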


In this way, we can always seek, first time, to a desired key frame page (by seeking to Ogg page 'x' where x & 0xff == 0). In addition, each frame still has a unique frame number and also a clear 'group' number, potentially useful information to the decoder. Lastly, granulepos is still semantically correct, although it is now, in a sense, representing a whole.fractional frame number for buffering purposes.


===== Scheme Four: Extra 'Seekpos' Field / Straw Man =====

Another possibility requires extension of the current Ogg page format. Although older players would reject any such extended pages as invalid, we do have versioning and typing fields, so there's not actually any compatibility problems with current Ogg pages... in the future.


The idea in this scheme is to keep the current granulepos as a frame number field (ala scheme 1), but also add a new field 'seekpos' that is used, rather than granulepos, in seeking. The seekpos would represent the number of the last keyframe that passed by.


advantages:


1) The net effect of this strategy is to modify scheme 1 to only require one bisection seek rather than two. Some amount of code simplification (over scheme 1) at the decision-making level.


disadvantages:


1) The Ogg format will need to be revved. No current (ala 1.0) Ogg code will understand the new pages.


2) The header becomes larger, from a minimum size of 27 bytes to a minimum size of 35.


3) This strategy only enhances keyframes; it is of no use in other odd seeking cases.


4) Gives no more information than scheme 3, but is still more complicated, both in code and API (Ogg would have to understand keyframes).


Thus, there's no substantial reason to prefer extending the format over a scheme that's possible within the existing framework. Note that schemes 1-3 can all be implemented within the Ogg stream today.


Monty
[[Category:Ogg]]
