Ogg Skeleton 4
Ogg Skeleton provides structuring information for multitrack Ogg files. It is compatible with Ogg Theora and provides extra clues for synchronization and content negotiation such as language selection. Skeleton version 4.0 also provides keyframes indexes to enable optimal seeking over high-latency connections, such as the internet.
Ogg is a generic container format for time-continuous data streams, enabling interleaving of several tracks of frame-wise encoded content in a time-multiplexed manner. As an example, an Ogg physical bitstream could encapsulate several tracks of video encoded in Theora and multiple tracks of audio encoded in Speex or Vorbis or FLAC at the same time. A player that decodes such a bitstream could then, for example, play one video channel as the main video playback, alpha-blend another one on top of it (e.g. a caption track), play a main Vorbis audio together with several FLAC audio tracks simultaneously (e.g. as sound effects), and provide a choice of Speex channels (e.g. providing commentary in different languages). Such a file is generally possible to create with Ogg, it is however not possible to generically parse such a file, seek on it, understand what codecs are contained in such a file, and dynamically handle and play back such content.
Ogg does not know anything about the content it carries and leaves it to the media mapping of each codec to declare and describe itself. There is no meta information available at the Ogg level about the content tracks encapsulated within an Ogg physical bitstream. This is particularly a problem if you don't have all the decoder libraries available and just want to parse an Ogg file to find out what type of data it encapsulates (such as the "file" command under *nix to determine what file it is through magic numbers), or want to seek to a temporal offset without having to decode the data (such as on a Web server that just serves out Ogg files and parts thereof).
Ogg Skeleton is being designed to overcome these problems. Ogg Skeleton is a logical bitstream within an Ogg stream that contains information about the other encapsulated logical bitstreams. For each logical bitstream it provides information such as its media type, and explains the way the granulepos field in Ogg pages is mapped to time.
Ogg Skeleton is also designed to allow the creation of substreams from Ogg physical bitstreams that retain the original timing information. For example, when cutting out the segment between the 7th and the 59th second of an Ogg file, it would be nice to continue to start this cut out file with a playback time of 7 seconds and not of 0. This is of particular interest if you're streaming this file from a Web server after a query for a temporal subpart such as in http://example.com/video.ogv?t=7-59 .
This is a motivation and design sketch. For the current specification see http://svn.annodex.net/standards/draft-pfeiffer-oggskeleton-current.txt
For the current specification for the keyframe index packets see http://github.com/cpearce/OggIndex/blob/master/Skeleton-4.0-Index-Specification.txt
How to describe the logical bitstreams within an Ogg container?
The following information about a logical bitstream is of interest to contain as meta information in the Skeleton:
- the serial number: it identifies a content track
- the mime type: it identifies the content type
- other generic name-value fields that can provide meta information such as the language of a track or the video height and width
- the number of header packets: this informs a parser about the number of actual header packets in an Ogg logical bitstream
- the granule rate: the granule rate represents the data rate in Hz at which content is sampled for the particular logical bitstream. Note that when using this to interpret timestamps, the granulepos of a data page must first be parsed to extract a granule value using the method described in GranulePosAndSeeking. This value can then be mapped to time by calculating "granules / granulerate".
- the preroll: the number of past content packets to take into account when decoding the current Ogg page, which is necessary for seeking (vorbis has generally 2, speex 3)
- the granuleshift: the number of lower bits from the granulepos field that are used to provide position information for sub-seekable units (like the keyframe shift in theora)
- a basetime: it provides a mapping for granule position 0 (for all logical bitstreams) to a playback time; an example use: most content in professional analog video creation actually starts at a time of 1 hour and thus adding this additional field allows them retain this mapping on digitizing their content
- a UTC time: it provides a mapping for granule position 0 (for all logical bitstreams) to a real-world clock time allowing to remember e.g. the recording or broadcast time of some content
- the granulepos radix, used during complex granulepos-to-time conversions, particuarly in streams such as Dirac.
- predelay: the delay of the presentation time behind the decode time. Used in discontinuous streams such as Dirac.
How to allow the creation of substreams from an Ogg physical bitstream?
When cutting out a subpart of an Ogg physical bitstream, the aim is to keep all the content pages intact (including the framing and granule positions) and just change some information in the Skeleton that allows reconstruction of the accurate time mapping. When remultiplexing such a bitstream, it is necessary to take into account all the different contained logical bitstreams. A given cut-in time maps to several different byte positions in the Ogg physical bitstream because each logical bitstream has its relevant information for that time at a different location. In addition, the resolution of each logical bitstream may not be high enough to accommodate for the given cut-in time and thus there may be some surplus information necessary to be remuxed into the new bitstream.
The following information is necessary to be added to the Skeleton to allow a correct presentation of a subpart of an Ogg bitstream:
- the presentation time: this is the actual cut-in time and all logical bitstreams are meant to start presenting from this time onwards, not from the time their data starts, which may be some time before that (because this time may have mapped right into the middle of a packet, or because the logical bitstream has a preroll or a keyframe shift)
- the basegranule: this represents the granule number with which this logical bitstream starts in the remuxed stream and provides for each logical bitstream the accurate start time of its data stream; this information is necessary to allow correct decoding and timing of the first data packets contained in a logcial bitstream of a remuxed Ogg stream
Keyframe indexes for faster seeking
Seeking in an Ogg file is typically implemented as a bisection search over the pages in the file. The bisection method above works fine for seeking in local files, but for seeking in files served over the Internet via HTTP, each bisection or non sequential read can trigger a new HTTP request, which can have very high latency, making seeking very slow. Seeking is further complicated by the fact that packets often span multiple Ogg pages, and that Ogg pages from different streams can be interleaved between spanning packets.
Each content track has a separate index, which is stored in its own packet in the Skeleton 4.0 track. The index for streams without the concept of a keyframe, such as Vorbis streams, can instead record the time position at periodic intervals, which achieves the same result. When this document refers to keyframes, it also implicitly refers to these independent periodic samples from keyframe-less streams.
Because all the Skeleton track's index packets appear in the header pages of the Ogg segment, all the keyframe indexes are immediately available once the header packets have been read when playing the media over a network connection.
For every content stream in an Ogg segment, the Ogg index bitstream provides seek algorithms with an ordered table of "key points". A key point is intrinsically associated with exactly one stream, and stores the offset, o, of the last page which lies before all data required to decode the keyframe, as well as the presentation time of the keyframe t, as a fraction of seconds.
The offset is relative from the beginning of the Ogg segment, and is exactly the first byte of the a page in the indexed stream, so if you seek to a keypoint's offset and don't find the beginning of a page there, or you find a page from another stream, you can assume that the Ogg segment has been modified since the index was constructed, and the index can be considered invalid. The time t is the keyframe's presentation time corresponding to the granulepos, and is represented as a fraction in seconds. Note that if a stream requires any preroll, this will be accounted for in the time stored in the keypoint.
The Skeleton 4.0 track contains one index for each content stream in the file. To seek in an Ogg file which contains keyframe indexes, first construct the set which contains every active streams' last keypoint which has time less than or equal to the seek target time. This tells you a known point on every stream which lies before the seek target. Then from that set of key points, select the key point with the smallest byte offset. You then verify that there's a page from the keypoint's stream found at exactly that offset, and if so, you can begin decoding. You are guaranteed to pass keyframes on all streams with time less than or equal to your seek target time while decoding up to the seek target. However if you don't encounter a keyframe with the same presentation time as is stored in the keypoint, then the index is invalid (possibly the file has been changed without updating the index) and you must either fallback to a bisection search, or keep decoding if you've landed "close enough" to the seek target.
Be aware that you cannot assume that any or all Ogg files will contain keyframe indexes, so when implementing Ogg seeking, you must gracefully fall-back to a bisection search or other seek algorithm when the index is not present, or when it is invalid.
The Skeleton 4.0 index packets also stores meta data about the segment in which it resides. It stores the timestamps of the first and last samples in its track. This also allows you to determine the duration of the indexed Ogg media without having to decode the start and end of the Ogg segment to calculate the difference (which is the duration). With the index packets storing the start and end times of every track, you can calculate the duration as the end time of the last active stream minus the start time of first active stream.
The Skeleton 4.0 BOS packet contains the length of the indexed segment in bytes. This is so that if the seek target is outside of the indexed range, you can immediately move to the next/previous segment and either seek using that segment's index, or narrow the bisection window if that segment has no index. You can also use the segement length to verify if the index is valid. If the contents of the segment have changed, it's highly likely that the length of the segment has changed as well. When you load the segment's header pages, you should check the length of the physical segment, and if it doesn't match the length stored in the Skeleton header packet, you know that either the index is out of date, or the file has been chained since indexing.
The Skeleton 4.0 BOS packet also contains the offset of the first non header page in the Ogg segment. This means that if you wish to delay loading of an index for whatever reason, you can skip forward to that offset, and start decoding from that offset forwards.
When using the index to seek, you must verify that the index is still correct. You can consider the index invalid if any of the following are true:
- The segment doesn't end at the segment length offset stored in the Skeleton BOS packet (note that a new "link" in a "chain" can start at the end of the segment), or
- after a seek to a keypoint's offset, you don't land exactly on a page boundary, or
- after a seek to a keypoint's offset, you don't land on a page which belongs to that keypoint's stream.
While loading the Skeleton BOS header, you should always check the Skeleton version field to ensure your decoder correctly knows how to parse the Skeleton track.
Be aware that a keyframe index may not index all keyframes in the Ogg segment, it may only index periodic keyframes instead.
Ogg Skeleton version 4.1 Format Specification
Adding the above information into an Ogg bitstream without breaking existing Ogg functionality and code requires the use of a logical bitstream for Ogg Skeleton. This logical bitstream may be ignored on decoding such that existing players can still continue to play back Ogg files that have a Skeleton bitstream. Skeleton enriches the Ogg bitstream to provide meta information about structure and content of the Ogg bitstream.
The Skeleton logical bitstream starts with an ident header that contains information about all of the logical bitstreams and is mapped into the Skeleton bos page. The first 8 bytes provide the magic identifier "fishead\0". After the fishead follows a set of secondary header packets, each of which contains information about one logical bitstream. These secondary header packets are identified by an 8 byte code of "fisbone\0". The Skeleton logical bitstream has no actual content packets. Its eos page is included into the stream before any data pages of the other logical bitstreams appear and contains a packet of length 0.
The fishead ident header looks as follows (inspiration):
0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Identifier 'fishead\0' | 0-3 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | 4-7 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Version major | Version minor | 8-11 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Presentationtime numerator | 12-15 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | 16-19 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Presentationtime denominator | 20-23 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | 24-27 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Basetime numerator | 28-31 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | 32-35 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Basetime denominator | 36-39 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | 40-43 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | UTC | 44-47 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | 48-51 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | 52-55 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | 56-59 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | 60-63 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Segment length | 64-67 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | 68-71 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Content offset | 72-75 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | 76-79 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The version fields provide version information for the Skeleton track, currently being 4.1 (the number having evolved within the Annodex project). Presentation time and basetime are specified as a rational number, the denominator providing the temporal resolution at which the time is given (e.g. to specify time in milliseconds, provide a denominator of 1000).
The fisbone secondary header packet looks as follows:
0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1| Byte +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Identifier 'fisbone\0' | 0-3 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | 4-7 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Offset to message header fields | 8-11 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Serial number | 12-15 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Number of header packets | 16-19 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Granulerate numerator | 20-23 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | 24-27 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Granulerate denominator | 28-31 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | 32-35 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Basegranule | 36-39 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | 40-43 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Preroll | 44-47 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Granuleshift | PTS/DTS predelay |Padding/unused | 48-51 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Granulepos Radix | 52-55 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Message header fields ... | 56- +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The mime type is provided as a message header field specified in the same way that HTTP header fields are given (e.g. "Content-Type: audio/vorbis", terminated/delimited by "\r\n"). Further meta information (such as language and screen size) are also included as message header fields. The offset to the message header fields at the beginning of a fisbone packet is included for forward compatibility - to allow further fields to be included into the packet without disrupting the message header field parsing. The granule rate is again given as a rational number in the same way that presentation time and basetime were provided above.
The following message headers are compulsory in Skeleton 4.1:
- Content-type: mime-type of the content encoded in this stream, e.g. audio/vorbis, video/theora, etc. The mime types in use here are listed at http://wiki.xiph.org/MIME_Types_and_File_Extensions#Codec_MIME_types.
- Role: describes the function of this track. Common examples are "video/main", "audio/main", "text/caption". For a complete list of possibilities, see http://wiki.xiph.org/SkeletonHeaders#Role.
- Name: a unique free text string which can be used to directly address the track in scripting applications, such as an HTML5 viewer.
For more message headers, see SkeletonHeaders.
Before the Skeleton EOS page in the segment header pages come the Skeleton 4.0 keyframe index packets. There should be one index packet foreach content track in the Ogg segment, but index packets are not required for a Skeleton 4.0 track to be considered valid. Each keypoint in the index is stored in a "keypoint", which in turn stores an offset, and timestamp. In order to save space, the offsets and timestamps are stored as deltas, and then variable byte-encoded. The offset and timestamp deltas store the difference between the keypoint's offset and timestamp from the previous keypoint's offset and timestamp. So to calculate the page offset of a keypoint you must sum the offset deltas of up to and including the keypoint in the index.
The variable byte encoded integers are encoded using 7 bits per byte to store the integer's bits, and the high bit is set in the last byte used to encode the integer. The bits and bytes are in little endian byte order. For example, the integer 7843, or 0001 1110 1010 0011 in binary, would be stored as two bytes: 0xBD 0x23, or 1011 1101 0010 0011 in binary.
Each index packet contains the following:
- Identifier 6 bytes: "index\0"
- The serialno of the stream this index applies to, as a 4 byte field.
- The number of keypoints in this index packet, 'n' as a 8 byte unsigned integer. This can be 0.
- The presentation time denominator for this stream, as an 8 byte signed integer. All timestamps, including keypoint timestamps, first and last sample timestamps are fractions of seconds over this denominator. This must not be 0.
- First-sample-time numerator: 8 byte signed integer representing the numerator for the presentation time of the first sample in the track.
- Last-sample-time numerator: 8 byte signed integer representing the end time of the last sample in the track.
- 'n' key points, each of which contain, in the following order:
- the keyframe's page's byte offset delta, as a variable byte encoded integer. This is the number of bytes that this keypoint is after the preceeding keypoint's offset, or from the start of the segment if this is the first keypoint. The keypoint's page start is therefore the sum of the byte-offset-deltas of all the keypoints which come before it.
- the presentation time numerator delta, of the first key frame which starts on the page at the keypoint's offset, as a variable byte encoded integer. This is the difference from the previous keypoint's timestamp numerator. The keypoint's timestamp numerator is therefore the sum of all the timestamp numerator deltas up to and including the keypoint's. Divide the timestamp numerator sum by the timestamp denominator stored earlier in the index packet to determine the presentation time of the keyframe in seconds.
The key points are stored in increasing order by offset (and thus by presentation time as well).
The byte offsets stored in keypoints are relative to the start of the Ogg bitstream segment. So if you have a physical Ogg bitstream made up of two chained Oggs, the offsets in the second Ogg segment's bitstream's index are relative to the beginning of the second Ogg in the chain, not the first. Also note that if a physical Ogg bitstream is made up of chained Oggs, the presence of an index in one segment does not imply that there will be an index in any other segment.
The first-sample-time and last-sample-time are rational numbers, in units of seconds. If the denominator is 0 for the first-sample-time or the last-sample-time, then that value was unable to be determined at indexing time, and is unknown.
The exact number of keyframes used to construct key points in the index is up to the indexer, but to limit the index size, we recommend including at most one key point per every 64KB of data, or every 2000ms, whichever is least frequent.
A further restriction on how to encapsulate Skeleton into Ogg is proposed to allow for easier parsing:
- there can only be one Skeleton logical bitstream in a Ogg bitstream.
- the Skeleton bos page is the very first bos page in the Ogg stream such that it can be identified straight away and decoders don't get confused about it being e.g. Ogg Vorbis without this meta information
- the bos pages of all the other logical bistreams come next (a requirement of Ogg)
- the secondary header pages of all logical bitstreams come next, including Skeleton's secondary header packets
- the Skeleton eos page end the control section of the Ogg stream before any content pages of any of the other logical bitstreams appear
Ogg Skeleton is being supported by the following projects:
- the Ogg Directshow filters: see illiminable
- liboggz: liboggz svn or liboggz
- the Annodex technology: annodex.net
- HOgg (Haskell)
- ffmpeg2theora (with --skeleton)
- speexenc (with --skeleton) & speexdec
- many more ...