[[Image:Dmpfg_mo_001.jpg|360px|right]]<br />
This page documents some of the background information behind the production of ''A Digital Media Primer For Geeks''. To see the video or its wiki edition, visit [[A Digital Media Primer For Geeks (episode 1)|the main video page]].<br />
<br />
=The making of…=<br />
<br />
==Equipment==<br />
===Camera===<br />
Canon HV40 HDV camera w/ wide-angle lens operating on a tripod. At the time I was looking for MyFirstVideoCamera, the six people I asked who did video work all recommended this same camera, and two said not to get it without the wide angle lens. I took their advice and have been happy with it. Among other nifty features, the camera offers true progressive scan modes, live firewire output, and the ability to act as a digitizer for external video input. With the patches I made in my Git repo, Cinelerra natively handles the Canon HDV progressive modes.<br />
<br />
The wide angle lens gives the camera a nice close macro mode, and approximately triples the amount of light coming into the sensor for a given zoom/aperture. Useful for shooting indoors at night (e.g., this entire video).<br />
<br />
No additional lighting kit was used.<br />
<br />
===Audio===<br />
<br />
Two Crown PCC160 boundary microphones placed on a table approximately 4-8 feet in front of the speaker, run through a cheap Behringer portable mixer and into the camera's microphone input. <br />
<br />
No additional audio kit was used.<br />
<br />
===Sundries===<br />
<br />
Whiteboard markers by 'Bic'<br />
<br />
Drawing aids by Staedtler, McMaster-Carr, and 'Generic'.<br />
<br />
==Video shooting sequence==<br />
<br />
Scenes were pre-scripted and memorized, usually with lots of on-the-fly revision. In the future... I'm getting a teleprompter. OTOH, I can totally rattle off the entire video script from beginning to end as a party trick, thus ensuring I'll not be invited to many parties.<br />
<br />
Diagrams were drawn by hand on a physical whiteboard with whiteboard markers and magnetic T-squares, triangles, and yardsticks. Despite looking a lot like greenscreen work, there is no image compositing in use (actually-- there are two small composites where an error in a whiteboard diagram was corrected by subtracting part of the original image and then adding a corrected version of the diagram).<br />
<br />
Camera operated in 24F shutter priority mode (Tv set to "24") with exposure and white balance both calibrated to the whiteboard (or a white piece of paper) and locked. Microphone attenuation setting was active, with gain locked such that room noise peaked at -40dB (all the rooms in the shooting sequences were noisy due to the building's ventilation system or active equipment). Lighting in the whiteboard rooms tended to be odd, with little relative light cast on a presenter standing just in front of the whiteboard; the presenter is practically standing in the room's only shadow, since most of the room light is focused on the table and walls. Additional fill lighting kit would have been useful, but for the first vid, I didn't want 'perfect' to be the enemy of 'good'.<br />
<br />
Autofocus used for whiteboard scenes, manual focus used for several workshop scenes as the autofocus tended to hunt continuously in very low light.<br />
<br />
Continuous capture to a Thinkpad with firewire input via a simple [http://people.xiph.org/~xiphmont/video/gst-rec gstreamer script].<br />
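<br />
For the curious, a minimal sketch of that kind of capture pipeline (the actual gst-rec script may differ; this assumes the hdv1394src element from gst-plugins-bad, and relies on HDV arriving over firewire as an MPEG-2 transport stream that can simply be written to disk):<br />
gst-launch-0.10 hdv1394src ! filesink location=capture.m2t<br />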
<br />
==Production sequence==<br />
===All hail Cinelerra. You better hail, or Cinelerra will get pissy about it.===<br />
<br />
Most of the production sequence hinged on making Cinelerra happy; it is a hulking rusty cast-iron WWI tank of a program that can seem like it's composed entirely of compressed bugs. That said, it was neither particularly crashy nor did it ever accidentally corrupt or lose work. It was also the only FOSS editor with a working 2D compositor. It got the job done once I found a workflow it would cope with (and fixed a number of bugs; these fixes are available from my Cinelerra Git repo at http://git.xiph.org/?p=users/xiphmont/cinelerraCV.git;a=summary).<br />
<br />
===Choosing takes===<br />
<br />
Each shooting session yielded four to six hours of raw video. The first step was to load the raw video into the Cinelerra timeline, label each complete take, compare and choose the take to use, then render the chosen take out as a YUV4MPEG raw video file and a WAV raw audio file. Be careful that Settings->Align Cursor On Frames is set, else the audio and video renders won't start on the same boundary.<br />
<br />
===Postprocessing===<br />
<br />
At this point, the raw video clips were adjusted for gamma, contrast, and saturation in gstreamer and mplayer. In the earlier shoots the camera was underexposing due to pilot error, which required quite a bit of gamma and saturation inflation to 'correct' (there is no real correction as the low-end data is gone, but it's possible to make it look better). Later shoots used saner settings and the adjustments were mostly to keep different shooting sessions more uniform. The whiteboard tends not to look white because it's mildly reflective, and picked up the color of the cyan and orange audio baffles in the room like a big diffuse mirror.<br />
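<br />
Purely as an illustration (the settings varied from clip to clip and these particular numbers are made up), mplayer's eq2 filter can do this sort of gamma/contrast/saturation adjustment while passing YUV4MPEG straight through:<br />
mplayer raw.y4m -vf eq2=1.2:1.05:0.0:1.3 -nosound -noconsolecontrols -vo yuv4mpeg:file=adjusted.y4m<br />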
<br />
The audio was both noisy (the building's ventilation system sounded like either a loud low rumble or a jet engine taking off) and reverberant (the rooms were glass on two sides and plaster on the other two). Early takes used no additional sound absorbing material in the rooms, and Postfish filtering and deverberation were used heavily. It gives the early audio in the vid a slightly odd, processed feel (I had almost decided the original audio was simply unusable). Later takes used some big fleece 'soft flats' in the room to absorb some additional reverb, and are less heavily filtered.<br />
<br />
The Postfish filtering chain used declip (for the occasional overrange oops), deverb (to remove room reverberation), multicompand (for noise gating), single compand (for volume levelling) and EQ (the Crown mics are nice, but are very midrange heavy).<br />
<br />
===Special Effects===<br />
<br />
Audio special effects were one-offs, mostly done using SoX. The processed demo sections of audio were then spliced back into the original audio takes using Audacity.<br />
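<br />
The individual one-offs aren't reproduced here, but as a hypothetical example of the sort of thing involved, the 'telephone bandwidth' demo audio could be made by resampling down to 8kHz and back up to 48kHz, so that the processed section splices cleanly into a 48kHz take:<br />
sox take.wav -r 8000 temp8k.wav rate<br />
sox temp8k.wav -r 48000 demo.wav rate<br />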
<br />
Video special effects (e.g., removing a color channel) were done by writing quick, one-off filters in C for y4oi. A few effects were done by dumping a take as a directory full of PNGs, batch-processing the PNGs with a one-off C program, then reassembling with mplayer. Video effects were then stitched back into the original video takes in Cinelerra.<br />
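<br />
A sketch of the PNG round trip, assuming a 24fps clip (the filter program name and the file names here are hypothetical):<br />
mplayer take.y4m -nosound -vo png:z=5<br />
./oneoff_filter *.png<br />
mplayer mf://*.png -mf fps=24:type=png -nosound -vo yuv4mpeg:file=fx.y4m<br />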
<br />
===Editing===<br />
<br />
All editing was done in Cinelerra. This primarily consisted of stitching the individual takes back together with crossfades. All input and rendering output were done with raw YUV4MPEG and WAV files. Note that making this work well and correctly required several patches to the YUV4MPEG handler and colorspace conversion code.<br />
<br />
===Encoding===<br />
<br />
I encoded by hand external to Cinelerra using mplayer for final postprocessing, the encoder_example tool included with the (Ptalarbvorm) Theora source distribution, and ivfenc for WebM. I synced subtitles to the video by hand with Audacity (I already had the script) in SRT format (for easy editing/translation and syncing with the video in HTML5), and transcoded to Ogg Kate using kateenc. The Kate subs were then muxed with the Ogg video encoding using oggz-merge, and finally an index was added to the Ogg with OggIndex.<br />
<br />
<center><div style="background-color:#DDDDFF;border-color:#CCCCDD;border-style:solid;width:80%;padding:0 1em 1em 1em;text-align:left;"><br />
'''Sample Ogg command lines…'''<br />
...for producing 360p, 128-ish (a4) audio and 500-ish (v50) video with subtitles and index<br />
<br />
* perform a little denoising, scale, and deband the raw render:<br />
mplayer -vf hqdn3d,scale=640:360,gradfun=1.5,unsharp=l3x3:.1 complete.y4m -fast -noconsolecontrols -vo yuv4mpeg:file=filtered.y4m<br />
* encode the basic Ogg Vorbis/Theora file:<br />
encoder_example -a 4 -v 50 -k 240 complete.wav filtered.y4m -o basic.ogv<br />
* produce Kate subs from the SRT input file:<br />
kateenc -t srt -l en_US -c SUB -o subs.kate subs.srt<br />
* add the subs to the Ogg video file:<br />
oggz-merge basic.ogv subs.kate -o subbed.ogv<br />
* add index for faster seeking on the Web:<br />
OggIndex subbed.ogv -o A_Digital_Media_Primer_For_Geeks-360p.ogv<br />
</div></center><br />
<br />
<br style="clear:both;"/><br />
<br />
<center><div style="background-color:#DDDDFF;border-color:#CCCCDD;border-style:solid;width:80%;padding:0 1em 1em 1em;text-align:left;"><br />
'''Sample WebM command lines…'''<br />
...for producing 360p, 128-ish (a4) audio and 500kbps video with index<br />
<br />
* Might as well reuse the Vorbis encoding already done for the Ogg file:<br />
oggz-rip -c vorbis A_Digital_Media_Primer_For_Geeks-360p.ogv -o vorbis.ogg<br />
* Produce VP8 encoding from the y4m file used for Theora<br />
ivfenc filtered.y4m vp8.ivf -p 2 -t 4 --best --target-bitrate=1500 --end-usage=0 --auto-alt-ref=1 -v --minsection-pct=5 --maxsection-pct=800 --lag-in-frames=16 --kf-min-dist=0 --kf-max-dist=120 --static-thresh=0 --drop-frame=0 --min-q=0 --max-q=60<br />
* Mux the audio and video into our first-stage WebM file<br />
mkvmerge vorbis.ogg vp8.ivf -o first-stage.webm<br />
* mkvmerge by itself doesn't generate a fully-compliant WebM file; mkclean will make the last necessary alterations<br />
mkclean --remux first-stage.webm A_Digital_Media_Primer_For_Geeks-360p.webm<br />
</div></center><br />
<br />
<br style="clear:both;"/><br />
<br />
==Web Presentation==<br />
<br />
HTML5 is new, so I found (to my unpleasant surprise) that I got to script all my website controls from scratch. Virtually everything preexisting was either a very large, inscrutable, inflexible 'complete web video solution!' that offered features I did not want while missing features I did, or a proof of concept that was obviously unfinished, unpolished, and not well tested.<br />
<br />
===Playback Controls===<br />
<br />
I wanted more than the standard set of controls, but I did *not* want to fall into the usual web geek trap: a UI with 50 buttons in a big heap with no thought to usability, and extra points for using at least twelve colors. I wanted new controls to be unobtrusive but obvious when you wanted them, and to blend into the preexisting controls.<br />
<br />
Clearly the best way to do this would be to put a transparent canvas layer over the video window and implement completely fresh controls. This would probably be the most bug-proof and future-proof approach, and would definitely give the most consistent look and feel across browsers. I also estimate it would take several weeks of full-time scripting to make it work as expected (remember, HTML5 is new and still a draft, so there are endless inconsistencies and implementation bugs to deal with. Writing a script is easy and fast. Making it work consistently is time-consuming and frustrating).<br />
<br />
Adding a fade-in bar that approximately matched the existing controls in most players would be finicky, shorter-lived and not as pretty, but it could be practical and working far sooner than the overkill solution of reimplementing everything. As HTML5 is still a draft and I'll probably have to revisit any site scripting regularly anyway, option two seemed the sensible way to go.<br />
<br />
The nice thing about HTML and JavaScript both is that they're inherently Open Source; anyone can inspect the code I wrote (and point and laugh).<br />
<br />
===Subtitles===<br />
<br />
Although I think external subtitles aren't the best overall direction, it's all HTML5 currently offers. The Ogg files include Kate format subtitles, but HTML5 offers no API for accessing them. What HTML5 does give is a high-resolution playback timer, and the ability to load and parse subtitle files. <br />
<br />
[http://www.xiph.org/video/subtitles.js subtitles.js] is an updated version of jQuery.srt that loads and parses SRT format subs on demand from any URL, and places the text of each subtitle into a &lt;div&gt; element in synchronization with the video playback timer. A little additional CSS is all that's necessary to put a translucent background behind it and display it over the video frame. <br />
<br />
===Resolution / stream switching===<br />
<br />
This was considerably less elegant due to some apparent inadequacies in the HTML5 draft spec. There seem to be two basic ways of changing the currently playing video in the current draft.<br />
<br />
The first way to change streams is to create a new video element via javascript, wait for it to load, then replace the current video with the new one. Unfortunately, HTML5 gives no way to prevent the original video, even when stopped, from using all available bandwidth to keep buffering as fast as it can. This starves the replacement video of network access, causing a lengthy delay when loading. It looks very nice and seamless when it finally works, but can easily result in switching video streams taking 15-30 seconds or more.<br />
<br />
The second option is to switch the preexisting video element to a new stream. This is much faster as the original stream stops sinking bandwidth immediately, but upon loading it always starts from the beginning and in current browsers also displays the first frame, even if playback isn't started. After the load completes, then it's possible to seek forward to where the original stream started. It doesn't look as good, but it's much faster in practice.<br />
<br />
I use the second, faster option, so there's a brief flash back to the beginning of the video upon resolution switch. <br />
<br />
===Chapter Navigation===<br />
<br />
Nothing special here; it's just a &lt;select&gt; dropdown with an onchange handler that sets a new 'video.currentTime'.<br />
<br />
===Control pop/unpop===<br />
<br />
Oddly enough this was the hardest part, not because it's hard to do, but because it's hard to make consistent across browsers. Every browser fires radically different UI events for the same mouse/keyboard actions.<br />
<br />
===Dimming===<br />
<br />
...In retrospect, not as gratuitous as it seemed when I first wrote it. Many aspects of how video is made and presented assume viewing in a relatively dim environment, where the video being watched is the brightest thing in sight (or close to it). Xiph's web styling uses white backgrounds, which I found actively distracting and out of place, but altering the style of the site for just the video pages also seemed clearly wrong. So I added an animated dim/undim on playback/pause (an instantaneous dim/undim was jarring). I'm now convinced it was a good call, assuming it actually works everywhere as intended (it won't work on browsers using the Cortado fallback).
<hr />
=A Digital Media Primer For Geeks=
<small>''Wiki edition''</small><br />
[[Image:Dmpfg_001.jpg|360px|right]]<br />
<br />
This first video from Xiph.Org presents the technical foundations of modern digital media via a half-hour firehose of information. One community member called it "a Uni lecture I never got but really wanted."<br />
<br />
The program offers a brief history of digital media, a quick summary of the sampling theorem, and myriad details of low level audio and video characterization and formatting. It's intended for budding geeks looking to get into video coding, as well as the technically curious who want to know more about the media they wrangle for work or play.<br />
<br/><br/><br/><br />
<center><font size="+2">[http://www.xiph.org/video/vid1.shtml Download or Watch online]</font></center><br />
<br style="clear:both;"/><br />
Players supporting WebM: [http://www.videolan.org/vlc/ VLC 1.1+], [https://www.mozilla.com/en-US/firefox/all-beta.html Firefox 4 (beta)], [http://www.chromium.org/getting-involved/dev-channel Chrome (development versions)], [http://www.opera.com/ Opera], [http://www.webmproject.org/users/ more…]<br />
<br />
Players supporting Ogg/Theora: [http://www.videolan.org/vlc/ VLC], [http://www.firefox.com/ Firefox], [http://www.opera.com/ Opera], [[TheoraSoftwarePlayers|more…]]<br />
<br />
If you're having trouble with playback in a modern browser or player, please visit our [[Playback_Troubleshooting|playback troubleshooting and discussion]] page.<br />
<br/><br />
<hr/><br />
<br />
==Introduction==<br />
[[Image:Dmpfg_000.jpg|360px|right]]<br />
[[Image:Dmpfg_002.jpg|360px|right]]<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Introduction|Discuss this section]]</small><br />
<br />
Workstations and high-end personal computers have been able to<br />
manipulate digital audio pretty easily for about fifteen years now.<br />
It's only been about five years that a decent workstation's been able<br />
to handle raw video without a lot of expensive special purpose<br />
hardware.<br />
<br />
But today even most cheap home PCs have the processor power and<br />
storage necessary to really toss raw video around, at least without<br />
too much of a struggle. So now that everyone has all of this cheap media-capable hardware, <br />
more people, not surprisingly, want to do interesting<br />
things with digital media, especially streaming. YouTube was the first huge<br />
success, and now everybody wants in.<br />
<br />
Well good! Because this stuff is a lot of fun!<br />
<br />
<br />
It's no problem finding consumers for digital media. But here I'd<br />
like to address the engineers, the mathematicians, the hackers, the<br />
people who are interested in discovering and making things and<br />
building the technology itself. The people after my own heart.<br />
<br />
Digital media, compression especially, is perceived to be super-elite,<br />
somehow incredibly more difficult than anything else in computer<br />
science. The big industry players in the field don't mind this<br />
perception at all; it helps justify the staggering number of very<br />
basic patents they hold. They like the image that their media<br />
researchers "are the best of the best, so much smarter than anyone<br />
else that their brilliant ideas can't even be understood by mere<br />
mortals." This is bunk. <br />
<br />
Digital audio and video and streaming and compression offer endless<br />
deep and stimulating mental challenges, just like any other<br />
discipline. It seems elite because so few people have been<br />
involved. So few people have been involved perhaps because so few<br />
people could afford the expensive, special-purpose equipment it<br />
required. But today, just about anyone watching this video has a<br />
cheap, general-purpose computer powerful enough to play with the big<br />
boys. There are battles going on today around HTML5 and browsers and<br />
video and open vs. closed. So now is a pretty good time to get<br />
involved. The easiest place to start is probably understanding the<br />
technology we have right now.<br />
<br />
This is an introduction. Since it's an introduction, it glosses over a<br />
ton of details so that the big picture's a little easier to see.<br />
Quite a few people watching are going to be way past anything that I'm<br />
talking about, at least for now. On the other hand, I'm probably<br />
going to go too fast for folks who really are brand new to all of<br />
this, so if this is all new, relax. The important thing is to pick out<br />
any ideas that really grab your imagination. Especially pay attention<br />
to the terminology surrounding those ideas, because with those, and<br />
Google, and Wikipedia, you can dig as deep as interests you.<br />
<br />
So, without any further ado, welcome to one hell of a new hobby.<br />
<br />
<center><div style="background-color:#DDDDFF;border-color:#CCCCDD;border-style:solid;width:80%;padding:0 1em 1em 1em;text-align:left;"><br />
'''Going deeper…'''<br />
*[http://www.xiph.org/about/ About Xiph.Org]: Why you should care about open media<br />
*[http://www.0xdeadbeef.com/weblog/2010/01/html5-video-and-h-264-what-history-tells-us-and-why-were-standing-with-the-web/ HTML5 Video and H.264: what history tells us and why we're standing with the web]: Chris Blizzard of Mozilla on free formats and the open web<br />
*[http://diveintohtml5.org/video.html Dive into HTML5]: tutorial on HTML5 web video<br />
*[http://webchat.freenode.net/?channels=xiph Chat with the creators of the video] via freenode IRC in #xiph.<br />
</div></center><br />
<br />
<br style="clear:both;"/><br />
<br />
==Analog vs Digital==<br />
[[Image:Dmpfg_004.jpg|360px|right]]<br />
[[Image:Dmpfg_006.jpg|360px|right]]<br />
[[Image:Dmpfg_007.jpg|360px|right]]<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Analog_vs_Digital|Discuss this section]]</small><br />
<br />
<br />
[[WikiPedia:Sound|Sound]] is the propagation of pressure waves through air, spreading out<br />
from a source like ripples spread from a stone tossed into a pond. A<br />
microphone, or the human ear for that matter, transforms these passing<br />
ripples of pressure into an electric signal. Right, this is<br />
middle school science class, everyone remembers this. Moving on.<br />
<br />
That audio signal is a one-dimensional function, a single value<br />
varying over time. If we slow the [[WikiPedia:Oscilloscope|'scope]] down a bit... that should be<br />
a little easier to see. A few other aspects of the signal are<br />
important. It's [[WikiPedia:Continuous_function|continuous]] in both value and time; that is, at any<br />
given time it can have any real value, and there's a smoothly varying<br />
value at every point in time. No matter how much we zoom in, there<br />
are no discontinuities, no singularities, no instantaneous steps or<br />
points where the signal ceases to exist. It's defined<br />
everywhere. Classic continuous math works very well on these signals.<br />
<br />
A digital signal on the other hand is [[WikiPedia:Discrete_math|discrete]] in both value and time.<br />
In the simplest and most common system, called [[WikiPedia:Pulse code modulation|Pulse Code Modulation]],<br />
one of a fixed number of possible values directly represents the<br />
instantaneous signal amplitude at points in time spaced a fixed<br />
distance apart. The end result is a stream of digits.<br />
<br />
Now this looks an awful lot like this. It seems intuitive that we<br />
should somehow be able to rigorously transform one into the other, and<br />
good news, the [[WikiPedia:Nyquist-Shannon sampling theorem|Sampling Theorem]] says we can and tells us<br />
how. Published in its most recognizable form by [[WikiPedia:Claude Shannon|Claude Shannon]] in 1949<br />
and built on the work of [[WikiPedia:Harry Nyquist|Nyquist]], and [[WikiPedia:Ralph Hartley|Hartley]], and tons of others, the<br />
sampling theorem not only says that we can go back and<br />
forth between analog and digital, it also lays<br />
down the set of conditions under which the conversion is lossless and the two<br />
representations become equivalent and interchangeable. When the<br />
lossless conditions aren't met, the sampling theorem tells us how and<br />
how much information is lost or corrupted.<br />
<br />
Up until very recently, analog technology was the basis for<br />
practically everything done with audio, and that's not because most<br />
audio comes from an originally analog source. You may also think that<br />
since computers are fairly recent, analog signal technology must have<br />
come first. Nope. Digital is actually older. The [[WikiPedia:Telegraph|telegraph]] predates<br />
the telephone by half a century and was already fully mechanically<br />
automated by the 1860s, sending coded, multiplexed digital signals<br />
long distances. You know... [[WikiPedia:Tickertape|tickertape]]. Harry Nyquist of [[WikiPedia:Bell_labs|Bell Labs]] was<br />
researching telegraph pulse transmission when he published his<br />
description of what later became known as the [[WikiPedia:Nyquist_frequency|Nyquist frequency]], the<br />
core concept of the sampling theorem. Now, it's true the telegraph<br />
was transmitting symbolic information, text, not a digitized analog<br />
signal, but with the advent of the telephone and radio, analog and<br />
digital signal technology progressed rapidly and side-by-side.<br />
<br />
Audio had always been manipulated as an analog signal because... well,<br />
gee, it's so much easier. A [[WikiPedia:Low-pass_filter#Continuous-time_low-pass_filters|second-order low-pass filter]], for example,<br />
requires two passive components. An all-analog [[WikiPedia:Short-time_Fourier_transform|short-time Fourier<br />
transform]], a few hundred. Well, maybe a thousand if you want to build<br />
something really fancy (bang on the [http://www.testequipmentdepot.com/usedequipment/hewlettpackard/spectrumanalyzers/3585a.htm 3585]). Processing signals<br />
digitally requires millions to billions of transistors running at<br />
microwave frequencies, support hardware at very least to digitize and<br />
reconstruct the analog signals, a complete software ecosystem for<br />
programming and controlling that billion-transistor juggernaut,<br />
digital storage just in case you want to keep any of those bits for<br />
later...<br />
<br />
So we come to the conclusion that analog is the only practical way to<br />
do much with audio... well, unless you happen to have a billion<br />
transistors and all the other things just lying around. And [[WikiPedia:File:Transistor_Count_and_Moore's_Law_-_2008.svg|since we<br />
do]], digital signal processing becomes very attractive.<br />
<br />
For one thing, analog componentry just doesn't have the flexibility of<br />
a general purpose computer. Adding a new function to this<br />
beast [the 3585]... yeah, it's probably not going to happen. On a digital<br />
processor though, just write a new program. Software isn't trivial,<br />
but it is a lot easier.<br />
<br />
Perhaps more importantly, though, every analog component is an<br />
approximation. There's no such thing as a perfect transistor, or a<br />
perfect inductor, or a perfect capacitor. In analog, every component<br />
adds [[WikiPedia:Johnson–Nyquist_noise|noise]] and [[WikiPedia:Distortion#Electronic_signals|distortion]], usually not very much, but it adds up. Just<br />
transmitting an analog signal, especially over long distances,<br />
progressively, measurably, irretrievably corrupts it. Besides, all of<br />
those single-purpose analog components take up a lot of space. Two<br />
lines of code on the billion transistors back here can implement a<br />
filter that would require an [[WikiPedia:Inductor|inductor]] the size of a refrigerator.<br />
<br />
Digital systems don't have these drawbacks. Digital signals can be<br />
stored, copied, manipulated, and transmitted without adding any noise<br />
or distortion. We do use [[WikiPedia:Lossy_compression|lossy]] algorithms from time to time, but the<br />
only unavoidably non-ideal steps are digitization and reconstruction,<br />
where digital has to interface with all of that messy analog. Messy<br />
or not, modern [[WikiPedia:Digital-to-analog_converter|conversion stages]] are very, very good. By the<br />
standards of our ears, we can consider them practically lossless as<br />
well.<br />
<br />
With a little extra hardware, then, most of which is now small and<br />
inexpensive due to our modern industrial infrastructure, digital audio<br />
is the clear winner over analog. So let us then go about storing it,<br />
copying it, manipulating it, and transmitting it.<br />
<br />
<center><div style="background-color:#DDDDFF;border-color:#CCCCDD;border-style:solid;width:80%;padding:0 1em 1em 1em;text-align:left;"><br />
'''Going deeper…'''<br />
*Wikipedia: [[WikiPedia:Nyquist–Shannon_sampling_theorem|Nyquist–Shannon sampling theorem]]<br />
*MIT OpenCourseWare [http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-003-signals-and-systems-spring-2010/lecture-notes/ Lecture notes from 6.003 signals and systems.]<br />
*Wikipedia: [[WikiPedia:Passive_analogue_filter_development|The history of analog filters]] such as the [[WikiPedia:RC circuit|RC low-pass]] shown connected to the [[wikipedia:Spectrum_analyzer|spectrum analyzer]] in the video.<br />
</div></center><br />
<br />
<br style="clear:both;"/><br />
<br />
==Raw (digital audio) meat==<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Raw_.28digital_audio.29_meat|Discuss this section]]</small><br />
<br />
Pulse Code Modulation is the most common representation for <br />
raw audio. Other practical representations do exist: for example, the<br />
[[WikiPedia:Delta-sigma_modulation|Sigma-Delta coding]] used by the [[WikiPedia:Super_Audio_CD|SACD]], which is a form of [[wikipedia:Pulse-density_modulation|Pulse Density<br />
Modulation]]. That said, Pulse Code Modulation is far<br />
and away dominant, mainly because it's so mathematically<br />
convenient. An audio engineer can spend an entire career without<br />
running into anything else.<br />
<br />
PCM encoding can be characterized in three parameters, making it easy<br />
to account for every possible PCM variant with mercifully little<br />
hassle.<br />
<br style="clear:both;"/><br />
===sample rate===<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Raw_.28digital_audio.29_meat|Discuss this section]]</small><br />
[[Image:Dmpfg_009.jpg|360px|right]]<br />
[[Image:Dmpfg_008.jpg|360px|right]]<br />
The first parameter is the [[wikipedia:Sampling_rate|sampling rate]]. The highest frequency an<br />
encoding can represent is called the Nyquist Frequency. The Nyquist<br />
frequency of PCM happens to be exactly half the sampling rate.<br />
Therefore, the sampling rate directly determines the highest possible<br />
frequency in the digitized signal.<br />
<br />
Analog telephone systems traditionally [[wikipedia:Bandlimiting|band-limited]] voice channels to<br />
just under 4kHz, so digital telephony and most classic voice<br />
applications use an 8kHz sampling rate: the minimum sampling rate<br />
necessary to capture the entire bandwidth of a 4kHz channel. This is<br />
what an 8kHz sampling rate sounds like&mdash;a bit muffled but perfectly<br />
intelligible for voice. This is the lowest sampling rate that's ever<br />
been used widely in practice.<br />
<br />
From there, as power, and memory, and storage increased, consumer<br />
computer hardware went to offering 11, and then 16, and then 22, and<br />
then 32kHz sampling. With each increase in the sampling rate and the<br />
Nyquist frequency, it's obvious that the high end becomes a little<br />
clearer and the sound more natural.<br />
<br />
The Compact Disc uses a 44.1kHz sampling rate, which is again slightly<br />
better than 32kHz, but the gains are becoming less distinct. 44.1kHz<br />
is a bit of an oddball choice, especially given that it hadn't been<br />
used for anything prior to the compact disc, but the huge success of<br />
the CD has made it a common rate.<br />
<br />
The most common hi-fidelity sampling rate aside from the CD is 48kHz.<br />
There's virtually no audible difference between the two. This video,<br />
or at least the original version of it, was shot and produced with<br />
48kHz audio, which happens to be the original standard for<br />
high-fidelity audio with video.<br />
<br />
Super-hi-fidelity sampling rates of 88, and 96, and 192kHz have also<br />
appeared. The reason for the sampling rates beyond 48kHz isn't to<br />
extend the audible high frequencies further. It's for a different<br />
reason.<br />
<br />
Stepping back for just a second, the French mathematician [[wikipedia:Joseph_Fourier|Jean<br />
Baptiste Joseph Fourier]] showed that we can also think of signals like<br />
audio as a set of component frequencies. This [[wikipedia:Frequency_domain|frequency-domain]]<br />
representation is equivalent to the time representation; the signal is<br />
exactly the same, we're just looking at it [[wikipedia:Basis_(linear_algebra)|a different way]]. Here we see the<br />
frequency-domain representation of a hypothetical analog signal we<br />
intend to digitally sample.<br />
<br />
The sampling theorem tells us two essential things about the sampling<br />
process. First, that a digital signal can't represent any<br />
frequencies above the Nyquist frequency. Second, and this is the new<br />
part, if we don't remove those frequencies with a low-pass [[wikipedia:Audio_filter|filter]]<br />
before sampling, the sampling process will fold them down into the<br />
representable frequency range as [[wikipedia:Aliasing|aliasing distortion]].<br />
<br />
Aliasing, in a nutshell, sounds freakin' awful, so it's essential to<br />
remove any beyond-Nyquist frequencies before sampling and after<br />
reconstruction.<br />
<br />
Human frequency perception is considered to extend to about 20kHz. In<br />
44.1 or 48kHz sampling, the low pass before the sampling stage has to<br />
be extremely sharp to avoid cutting any audible frequencies below<br />
[[wikipedia:Hearing_range|20kHz]] but still not allow frequencies above the Nyquist to leak<br />
forward into the sampling process. This is a difficult filter to<br />
build, and no practical filter succeeds completely. If the sampling<br />
rate is 96kHz or 192kHz on the other hand, the low pass has an extra<br />
[[wikipedia:Octave_(electronics)|octave]] or two for its [[wikipedia:Transition_band|transition band]]. This is a much easier filter to<br />
build. Sampling rates beyond 48kHz are actually one of those messy<br />
analog stage compromises.<br />
<br style="clear:both;"/><br />
<br />
===sample format===<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Raw_.28digital_audio.29_meat|Discuss this section]]</small><br />
[[Image:Dmpfg_anim.gif|right]]<br />
<br />
The second fundamental PCM parameter is the sample format; that is,<br />
the format of each digital number. A number is a number, but a number<br />
can be represented in bits a number of different ways.<br />
<br />
Early PCM was [[wikipedia:Quantization_(sound_processing)#Audio_quantization|eight-bit]] [[wikipedia:Linear_pulse_code_modulation|linear]], encoded as an [[wikipedia:Signedness|unsigned]] [[wikipedia:Integer_(computer_science)#Bytes_and_octets|byte]]. The<br />
[[wikipedia:Dynamic_range#Audio|dynamic range]] is limited to about [[wikipedia:Decibel|50dB]] and the [[wikipedia:Quantization_error|quantization noise]], as<br />
you can hear, is pretty severe. Eight-bit audio is vanishingly rare<br />
today.<br />
<br />
Digital telephony typically uses one of two related non-linear eight<br />
bit encodings called [[wikipedia:A-law_algorithm|A-law]] and [[wikipedia:Μ-law_algorithm|μ-law]]. These formats encode a roughly<br />
[[wikipedia:Audio_bit_depth#Dynamic_range|14 bit dynamic range]] into eight bits by spacing the higher amplitude<br />
values farther apart. A-law and mu-law obviously improve quantization<br />
noise compared to linear 8-bit, and voice harmonics especially hide<br />
the remaining quantization noise well. All three eight-bit encodings,<br />
linear, A-law, and mu-law, are typically paired with an 8kHz sampling<br />
rate, though I'm demonstrating them here at 48kHz.<br />
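<br />
To hear these for yourself, SoX can round-trip a 16-bit recording through each encoding (a quick sketch; the file names are placeholders):<br />
sox voice.wav -b 8 -e unsigned-integer linear8.wav<br />
sox voice.wav -e a-law alaw.wav<br />
sox voice.wav -e mu-law mulaw.wav<br />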
<br />
Most modern PCM uses 16- or 24-bit [[wikipedia:Two's_complement|two's-complement]] signed integers to<br />
encode the range from negative infinity to zero decibels in 16 or 24<br />
bits of precision. The maximum absolute value corresponds to zero decibels.<br />
As with all the sample formats so far, signals beyond zero decibels, and thus<br />
beyond the maximum representable range, are [[wikipedia:Clipping_(audio)|clipped]].<br />
<br />
In mixing and mastering, it's not unusual to use [[wikipedia:Floating_point|floating-point]]<br />
numbers for PCM instead of [[wikipedia:Integer_(computer_science)|integers]]. A 32 bit [[wikipedia:IEEE_754-2008|IEEE754]] float, that's<br />
the normal kind of floating point you see on current computers, has 24<br />
bits of resolution, but an eight-bit floating-point exponent increases<br />
the representable range. Floating point usually represents zero<br />
decibels as +/-1.0, and because floats can obviously represent<br />
considerably beyond that, temporarily exceeding zero decibels during<br />
the mixing process doesn't cause clipping. Floating-point PCM takes<br />
up more space, so it tends to be used only as an intermediate<br />
production format.<br />
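<br />
As a concrete example, a mix that momentarily peaks at ±2.0 is about 6dB over full scale (20·log<sub>10</sub>2 ≈ 6dB); in floating point nothing is clipped, and scaling the final mix by 0.5 before conversion to integer PCM brings it back under zero decibels.<br />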
<br />
Lastly, most general purpose computers still read and<br />
write data in octet bytes, so it's important to remember that samples<br />
bigger than eight bits can be in [[wikipedia:Endianness|big- or little-endian order]], and both<br />
endiannesses are common. For example, Microsoft [[wikipedia:WAV|WAV]] files are little-endian,<br />
and Apple [[wikipedia:AIFC|AIFC]] files tend to be big-endian. Be aware of it.<br />
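<br />
For instance, with headerless raw PCM, SoX has to be told the byte order explicitly (a sketch: -B marks the input as big-endian, -L requests little-endian output):<br />
sox -t raw -r 48000 -e signed -b 16 -c 2 -B bigendian.raw -t raw -L littleendian.raw<br />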
<br style="clear:both;"/><br />
<br />
===channels===<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Raw_.28digital_audio.29_meat|Discuss this section]]</small><br />
<br />
The third PCM parameter is the number of [[wikipedia:Multichannel_audio|channels]]. The convention in<br />
raw PCM is to encode multiple channels by interleaving the samples of<br />
each channel together into a single stream. Straightforward and extensible.<br />
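<br />
For stereo, for example, the stream is simply left and right samples alternating: L0 R0 L1 R1 L2 R2 and so on; more channels extend the pattern in a fixed order within each sample frame.<br />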
<br style="clear:both;"/><br />
===done!===<br />
<br />
And that's it! That describes every PCM representation ever. Done.<br />
Digital audio is ''so easy''! There's more to do of course, but at this<br />
point we've got a nice useful chunk of audio data, so let's get some<br />
video too.<br />
<br />
<br />
<center><div style="background-color:#DDDDFF;border-color:#CCCCDD;border-style:solid;width:80%;padding:0 1em 1em 1em;text-align:left;"><br />
'''Going deeper…'''<br />
* [[wikipedia:Roll-off|Wikipedia's article on filter roll-off]], to learn why it's hard to build analog filters with a very narrow [[wikipedia:Transition_band|transition band]] between the [[wikipedia:Passband|passband]] and the [[wikipedia:Stopband|stopband]]. Filters that achieve such hard edges often do so at the expense of increased [[wikipedia:Ripple_(filters)#Frequency-domain_ripple|ripple]] and [http://www.ocf.berkeley.edu/~ashon/audio/phase/phaseaud2.htm phase distortion].<br />
* [http://wiki.multimedia.cx/index.php?title=PCM Some more minutiae] about PCM in practice.<br />
* [[wikipedia:DPCM|DPCM]] and [[wikipedia:ADPCM|ADPCM]], simple audio codecs loosely inspired by PCM.<br />
</div></center><br />
<br />
<br style="clear:both;"/><br />
<br />
==Video vegetables (they're good for you!)==<br />
[[Image:Dmpfg_010.jpg|360px|right]]<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Video_vegetables_.28they.27re_good_for_you.21.29|Discuss this section]]</small><br />
<br />
One could think of video as being like audio but with two additional<br />
spatial dimensions, X and Y, in addition to the dimension of time.<br />
This is mathematically sound. The Sampling Theorem applies to all<br />
three video dimensions just as it does the single time dimension of<br />
audio.<br />
<br />
Audio and video are obviously quite different in practice. For one,<br />
compared to audio, video is huge. [[wikipedia:Red_Book_(audio_Compact_Disc_standard)#Technical_details|Raw CD audio]] is about 1.4 megabits<br />
per second. Raw [[wikipedia:1080i|1080i]] HD video is over 700 megabits per second. That's<br />
more than 500 times more data to capture, process, and store per<br />
second. By [[wikipedia:Moore's_law|Moore's law]]... that's... let's see... roughly eight<br />
doublings times two years, so yeah, computers requiring about an extra<br />
fifteen years to handle raw video after getting raw audio down pat was<br />
about right.<br />
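<br />
The back-of-the-envelope arithmetic (assuming 8 bit, 4:2:0 video for the HD figure):<br />
CD audio: 44100 samples/s × 16 bits × 2 channels ≈ 1.4 million bits per second<br />
1080i video: 1920 × 1080 pixels × 12 bits/pixel × 30 frames/s ≈ 746 million bits per second<br />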
<br />
Basic raw video is also just more complex than basic raw audio. The<br />
sheer volume of data currently necessitates a representation more<br />
efficient than the linear PCM used for audio. In addition, electronic<br />
video comes almost entirely from broadcast television alone, and the<br />
standards committees that govern broadcast video have always been very<br />
concerned with backward compatibility. Up until just last year in the<br />
US, a sixty-year-old black and white television could still show a<br />
normal [[wikipedia:NTSC|analog television broadcast]]. That's actually a really neat<br />
trick.<br />
<br />
The downside to backward compatibility is that once a detail makes it<br />
into a standard, you can't ever really throw it out again. Electronic<br />
video has never started over from scratch the way audio has multiple<br />
times. Sixty years worth of clever but obsolete hacks necessitated by<br />
the passing technology of a given era have built up into quite a pile,<br />
and because digital standards also come from broadcast television, all<br />
these eldritch hacks have been brought forward into the digital<br />
standards as well.<br />
<br />
In short, there are a whole lot more details involved in digital video<br />
than there were with audio. There's no hope of covering them<br />
all completely here, so we'll cover the broad fundamentals.<br />
<br style="clear:both;"/><br />
===resolution and aspect===<br />
[[Image:Dmpfg_011.jpg|360px|right]]<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Video_vegetables_.28they.27re_good_for_you.21.29|Discuss this section]]</small><br />
<br />
The most obvious raw video parameters are the width and height of the<br />
picture in pixels. As simple as that may sound, the pixel dimensions<br />
alone don't actually specify the absolute width and height of the<br />
picture, as most broadcast-derived video doesn't use square pixels.<br />
The number of [[wikipedia:Scan_line|scanlines]] in a broadcast image was fixed, but the<br />
effective number of horizontal pixels was a function of channel<br />
[[wikipedia:Bandwidth_(signal_processing)|bandwidth]]. Depending on the effective horizontal resolution, pixels could<br />
end up either narrower or wider than the spacing between scanlines.<br />
<br />
Standards have generally specified that digitally sampled video should<br />
reflect the real resolution of the original analog source, so a large<br />
amount of digital video also uses non-square pixels. For example, a<br />
normal 4:3 aspect NTSC DVD is typically encoded with a display<br />
resolution of [[wikipedia:DVD-Video#Frame_size_and_frame_rate|704 by 480]], a ratio wider than 4:3. In this case, the<br />
pixels themselves are assigned an aspect ratio of [[wikipedia:Standard-definition_television#Resolution|10:11]], making them<br />
taller than they are wide and narrowing the image horizontally to the<br />
correct aspect. Such an image has to be resampled to show properly on<br />
a digital display with square pixels.<br />
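<br />
Working that DVD example through (a sketch; 10:11 is the pixel aspect<br />
ratio, width to height):<br />
<pre>
width, height = 704, 480
pixel_aspect  = 10 / 11                     # each pixel slightly taller than wide

square_pixel_width = width * pixel_aspect   # 704 * 10/11 = 640
print(square_pixel_width / height)          # 640 / 480 = 1.333..., i.e. 4:3
</pre>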
<br style="clear:both;"/><br />
===frame rate and interlacing===<br />
[[Image:Dmpfg_012.jpg|360px|right]]<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Video_vegetables_.28they.27re_good_for_you.21.29|Discuss this section]]</small><br />
<br />
The second obvious video parameter is the [[wikipedia:Frame_rate|frame rate]], the number of<br />
full frames per second. Several standard frame rates are in active<br />
use. Digital video, in one form or another, can use all of them. Or,<br />
any other frame rate. Or even variable rates where the frame rate<br />
changes adaptively over the course of the video. The higher the frame<br />
rate, the smoother the motion, and that brings us, unfortunately, to<br />
[[wikipedia:Interlace|interlacing]].<br />
<br />
In the very earliest days of broadcast video, engineers sought the<br />
fastest practical frame rate to smooth motion and to minimize [[wikipedia:Flicker_(screen)|flicker]]<br />
on phosphor-based [[wikipedia:Cathode_ray_tube|CRTs]]. They were also under pressure to use the<br />
least possible bandwidth for the highest resolution and fastest frame<br />
rate. Their solution was to interlace the video where the even lines<br />
are sent in one pass and the odd lines in the next. Each pass is<br />
called a field and two fields sort of produce one complete frame.<br />
"Sort of", because the even and odd fields aren't actually from the<br />
same source frame. In a 60 field per second picture, the source frame<br />
rate is actually 60 full frames per second, and half of each frame,<br />
every other line, is simply discarded. This is why we can't<br />
[[wikipedia:Deinterlacing|deinterlace]] a video simply by combining two fields into one frame;<br />
they're not actually from one frame to begin with.<br />
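<br />
A sketch of what the interlacing step effectively does, treating each<br />
frame as a list of scanlines:<br />
<pre>
def frames_to_fields(frames):
    # 60 source frames/sec in, 60 fields/sec out: keep only the even lines
    # of one frame, only the odd lines of the next, and discard the rest.
    fields = []
    for n, frame in enumerate(frames):
        fields.append(frame[n % 2::2])
    return fields

# Weaving fields[0] and fields[1] back together does not recover a source
# frame; the two fields were sampled at two different moments in time.
</pre>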
<br style="clear:both;"/><br />
<br />
===gamma===<br />
[[Image:Dmpfg_013.jpg|360px|right]]<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Video_vegetables_.28they.27re_good_for_you.21.29|Discuss this section]]</small><br />
<br />
The cathode ray tube was the only available display technology for<br />
most of the history of electronic video. A CRT's output brightness is<br />
nonlinear, approximately equal to the input controlling voltage raised<br />
to the 2.5th power. This exponent, 2.5, is designated gamma, and so<br />
it's often referred to as the gamma of a display. Cameras, though,<br />
are linear, and if you feed a CRT a linear input signal, it looks a<br />
bit like this.<br />
<br />
As there were originally to be very few cameras, which were<br />
fantastically expensive anyway, and hopefully many, many television<br />
sets, which had best be as inexpensive as possible, engineers decided to<br />
add the necessary [[wikipedia:Gamma_correction|gamma correction]] circuitry to the cameras rather<br />
than the sets. Video transmitted over the airwaves would thus have a<br />
nonlinear intensity using the inverse of the set's gamma exponent, so that<br />
once a camera's signal was finally displayed on the CRT, the overall<br />
response of the system from camera to set was back to linear again.<br />
<br />
Almost.<br />
<br />
There were also two other tweaks. A television camera actually uses a<br />
gamma exponent that's the inverse of 2.2, not 2.5. That's just a<br />
correction for viewing in a dim environment. Also, the exponential<br />
curve transitions to a linear ramp near black. That's just an old<br />
hack for suppressing sensor noise in the camera.<br />
<br />
Gamma correction also had a lucky benefit. It just so happens that the<br />
human eye has a perceptual gamma of about 3. This is relatively close<br />
to the CRT's gamma of 2.5. An image using gamma correction devotes<br />
more resolution to lower intensities, where the eye happens to have<br />
its finest intensity discrimination, and therefore uses the available<br />
scale resolution more efficiently. Although CRTs are currently<br />
vanishing, a standard [[wikipedia:sRGB|sRGB]] computer display still uses a nonlinear<br />
intensity curve similar to television, with a linear ramp near black,<br />
followed by an exponential curve with a gamma exponent of 2.4. This<br />
encodes a sixteen bit linear range down into eight bits.<br />
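<br />
For reference, the sRGB transfer curve just described, sketched with<br />
values normalized to the range 0.0 to 1.0:<br />
<pre>
def srgb_encode(linear):
    # A short linear ramp near black, then a power curve with exponent 1/2.4.
    if linear <= 0.0031308:
        return 12.92 * linear
    return 1.055 * linear ** (1 / 2.4) - 0.055

def srgb_decode(encoded):
    # The inverse: undo the ramp, then apply the 2.4 exponent.
    if encoded <= 0.04045:
        return encoded / 12.92
    return ((encoded + 0.055) / 1.055) ** 2.4
</pre>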
<br style="clear:both;"/><br />
<br />
===color and colorspace===<br />
[[Image:Dmpfg_014.jpg|360px|right]]<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Video_vegetables_.28they.27re_good_for_you.21.29|Discuss this section]]</small><br />
<br />
The human eye has three apparent color channels, red, green, and blue,<br />
and most displays use these three colors as [[wikipedia:Additive_color|additive primaries]] to<br />
produce a full range of color output. The primary pigments in<br />
printing are [[wikipedia:CMYK|Cyan, Magenta, and Yellow]] for the same reason; pigments<br />
are [[wikipedia:Subtractive_color|subtractive]], and each of these pigments subtracts one pure color<br />
from reflected light. Cyan subtracts red, magenta subtracts green, and<br />
yellow subtracts blue.<br />
<br />
Video can be, and sometimes is, represented with red, green, and blue<br />
color channels, but RGB video is atypical. The human eye is far more<br />
sensitive to [[wikipedia:Luminance_(relative)|luminosity]] than it is to color, and RGB tends to spread<br />
the energy of an image across all three color channels. That is, the<br />
red plane looks like a red version of the original picture, the green<br />
plane looks like a green version of the original picture, and the blue<br />
plane looks like a blue version of the original picture. Black and<br />
white times three. Not efficient.<br />
<br />
For those reasons and because, oh hey, television just happened to<br />
start out as black and white anyway, video usually is represented as a<br />
high resolution [[wikipedia:Luma_(video)|luma channel]]&mdash;the black & white&mdash;along with<br />
additional, often lower resolution [[wikipedia:Chrominance|chroma channels]], the color. The<br />
luma channel, Y, is produced by weighting and then adding the separate<br />
red, green and blue signals. The chroma channels U and V are then<br />
produced by subtracting the luma signal from blue and the luma signal<br />
from red.<br />
<br />
When YUV is scaled, offset, and quantized for digital video, it's<br />
usually more correctly called [[wikipedia:Y'CbCr|Y'CbCr]], but the more generic term YUV is<br />
widely used to describe all the analog and digital variants of this<br />
color model.<br />
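<br />
As a concrete sketch, here's the classic Rec. 601 weighting (one of several<br />
YUV definitions in use; Rec. 709 uses different weights), operating on<br />
gamma-corrected R'G'B' values in the range 0.0 to 1.0:<br />
<pre>
def rgb_to_yuv_601(r, g, b):
    y = 0.299 * r + 0.587 * g + 0.114 * b   # weighted sum of R, G, B -> luma
    u = (b - y) / 1.772                     # scaled "blue minus luma"
    v = (r - y) / 1.402                     # scaled "red minus luma"
    return y, u, v
</pre>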
<br style="clear:both;"/><br />
<br />
===chroma subsampling===<br />
[[Image:Dmpfg_015.jpg|360px|right]]<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Video_vegetables_.28they.27re_good_for_you.21.29|Discuss this section]]</small><br />
<br />
The U and V chroma channels can have the same resolution as the Y<br />
channel, but because the human eye has far less spatial color<br />
resolution than spatial luminosity resolution, chroma resolution is<br />
usually [[wikipedia:Chroma_subsampling|halved or even quartered]] in the horizontal direction, the<br />
vertical direction, or both, typically without any significant impact on the<br />
apparent raw image quality. Practically every possible subsampling<br />
variant has been used at one time or another, but the common choices<br />
today are [[wikipedia:Chroma_subsampling#4:4:4_Y.27CbCr|4:4:4]] video, which isn't actually subsampled at all, [[wikipedia:Chroma_subsampling#4:2:2|4:2:2]] video in<br />
which the horizontal resolution of the U and V channels is halved, and<br />
most common of all, [[wikipedia:Chroma_subsampling#4:2:0|4:2:0]] video in which both the horizontal and vertical<br />
resolutions of the chroma channels are halved, resulting in U and V<br />
planes that are each one quarter the size of Y.<br />
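<br />
A minimal sketch of the 4:2:0 case: average each 2x2 block of a<br />
full-resolution chroma plane down to a single value (ignoring chroma<br />
siting and edge handling):<br />
<pre>
def subsample_420(plane):
    # plane is a list of rows; width and height are assumed to be even.
    out = []
    for y in range(0, len(plane), 2):
        row = []
        for x in range(0, len(plane[0]), 2):
            row.append((plane[y][x]     + plane[y][x + 1] +
                        plane[y + 1][x] + plane[y + 1][x + 1]) / 4)
        out.append(row)
    return out
</pre>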
<br />
The terms 4:2:2, 4:2:0, [[wikipedia:Chroma_subsampling#4:1:1|4:1:1]], and so on and so forth, aren't complete<br />
descriptions of a chroma subsampling. There are multiple possible ways<br />
to position the chroma pixels relative to luma, and again, several<br />
variants are in active use for each subsampling. For example, [[wikipedia:Motion_Jpeg|motion<br />
JPEG]], [[wikipedia:MPEG-1#Part_2:_Video|MPEG-1 video]], [[wikipedia:MPEG-2#Video_coding_.28simplified.29|MPEG-2 video]], [[wikipedia:DV#DV_Compression|DV]], [[wikipedia:Theora|Theora]], and [[wikipedia:WebM|WebM]] all use or can<br />
use 4:2:0 subsampling, but they site the chroma pixels [http://www.mir.com/DMG/chroma.html three different ways].<br />
<br />
Motion JPEG, MPEG-1 video, Theora and WebM all site chroma pixels<br />
between luma pixels both horizontally and vertically.<br />
<br />
MPEG-2 sites chroma pixels between lines, but horizontally aligned with<br />
every other luma pixel. Interlaced modes complicate things somewhat,<br />
resulting in a siting arrangement that's a tad bizarre.<br />
<br />
And finally PAL-DV, which is always interlaced, places the chroma<br />
pixels in the same position as every other luma pixel in the<br />
horizontal direction, and vertically alternates chroma channel on<br />
each line.<br />
<br />
That's just 4:2:0 video. I'll leave the other subsamplings as homework for the<br />
viewer. Got the basic idea, moving on.<br />
<br style="clear:both;"/><br />
<br />
===pixel formats===<br />
[[Image:Dmpfg_016.jpg|360px|right]]<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Video_vegetables_.28they.27re_good_for_you.21.29|Discuss this section]]</small><br />
<br />
In audio, we always represent multiple channels in a PCM stream by<br />
interleaving the samples from each channel in order. Video uses both<br />
packed formats that interleave the color channels, as well as planar<br />
formats that keep the pixels from each channel together in separate<br />
planes stacked in order in the frame. There are at least [http://www.fourcc.org/yuv.php 50 different formats] in<br />
these two broad categories with possibly ten or fifteen in common use.<br />
<br />
Each chroma subsampling and bit depth requires a different<br />
packing arrangement, and so a different pixel format. For a given<br />
unique subsampling, there are usually also several equivalent formats<br />
that consist of trivial channel order rearrangements or repackings, due either to<br />
convenience once-upon-a-time on some particular piece of hardware, or<br />
sometimes just good old-fashioned spite.<br />
<br />
Pixel formats are described by a unique name or [[wikipedia:FourCC|fourcc]] code. There<br />
are quite a few of these and there's no sense going over each one now.<br />
Google is your friend. Be aware that fourcc codes for raw video<br />
specify the pixel arrangement and chroma subsampling, but generally<br />
don't imply anything certain about chroma siting or color space. [http://www.fourcc.org/yuv.php#YV12 YV12]<br />
video, to pick one, can use JPEG, MPEG-2 or DV chroma siting, and any<br />
one of [[wikipedia:YUV#BT.709_and_BT.601|several YUV colorspace definitions]].<br />
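<br />
To make one of those concrete, a sketch of how a planar 4:2:0 buffer such<br />
as YV12 is laid out (YV12 stores the V plane ahead of U; the otherwise<br />
identical I420 swaps them):<br />
<pre>
def yv12_layout(width, height):
    # Full-size Y plane, then quarter-size V, then quarter-size U.
    y_size = width * height
    c_size = (width // 2) * (height // 2)
    return {"Y": (0, y_size),
            "V": (y_size, c_size),
            "U": (y_size + c_size, c_size),
            "total_bytes": y_size + 2 * c_size}   # 1.5 bytes per pixel

print(yv12_layout(640, 480)["total_bytes"])       # 460800
</pre>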
<br style="clear:both;"/><br />
<br />
===done!===<br />
<br />
That wraps up our not-so-quick and yet very incomplete tour of raw<br />
video. The good news is we can already get quite a lot of real work<br />
done using that overview. In plenty of situations, a frame of video<br />
data is a frame of video data. The details matter, greatly, when it<br />
comes time to write software, but for now I am satisfied that the<br />
esteemed viewer is broadly aware of the relevant issues.<br />
<br />
<center><div style="background-color:#DDDDFF;border-color:#CCCCDD;border-style:solid;width:80%;padding:0 1em 1em 1em;text-align:left;"><br />
'''Going deeper…'''<br />
* YCbCr is defined in terms of RGB by the ITU in two incompatible standards: [[wikipedia:Rec. 601|Rec. 601]] and [[wikipedia:Rec. 709|Rec. 709]]. Both conversion standards are lossy, which has prompted some to adopt a lossless alternative called [http://wiki.multimedia.cx/index.php?title=YCoCg YCoCg].<br />
* Learn about [[wikipedia:High_dynamic_range_imaging|high dynamic range imaging]], which achieves better representation of the full range of brightnesses in the real world by using more than 8 bits per channel.<br />
* Learn about how [[wikipedia:Trichromatic_vision|trichromatic color vision]] works in humans, and how human color perception is encoded in the [[wikipedia:CIE 1931 color space|CIE 1931 XYZ color space]].<br />
** Compare with the [[wikipedia:Lab_color_space|Lab color space]], mathematically equivalent but structured to account for "perceptual uniformity".<br />
** If we were all [[wikipedia:Dichromacy|dichromats]] then video would only need two color channels. Some humans might be [[wikipedia:Tetrachromacy#Possibility_of_human_tetrachromats|tetrachromats]], in which case they would need an additional color channel for video to fully represent their vision.<br />
** [http://www.xritephoto.com/ph_toolframe.aspx?action=coloriq Test your color vision] (or at least your monitor).<br />
</div></center><br />
<br />
<br style="clear:both;"/><br />
<br />
==Containers==<br />
[[Image:Dmpfg_017.jpg|360px|right]]<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Containers|Discuss this section]]</small><br />
<br />
So. We have audio data. We have video data. What remains is the more<br />
familiar non-signal data and straight-up engineering that software<br />
developers are used to, and plenty of it.<br />
<br />
Chunks of raw audio and video data have no externally-visible<br />
structure, but they're often uniformly sized. We could just string<br />
them together in a rigid predetermined ordering for streaming and<br />
storage, and some simple systems do approximately that. Compressed<br />
frames, though, aren't necessarily a predictable size, and we usually want<br />
some flexibility in using a range of different data types in streams.<br />
If we string random formless data together, we lose the boundaries<br />
that separate frames and don't necessarily know what data belongs to<br />
which streams. A stream needs some generalized structure to be<br />
generally useful.<br />
<br />
In addition to our signal data, we also have our PCM and video<br />
parameters. There's probably plenty of other [[wikipedia:Metadata#Video|metadata]] we also want to<br />
deal with, like audio tags and video chapters and subtitles, all<br />
essential components of rich media. It makes sense to place this<br />
metadata&mdash;that is, data about the data&mdash;within the media itself.<br />
<br />
Storing and structuring formless data and disparate metadata is the<br />
job of a [[wikipedia:Container_format_(digital)|container]]. Containers provide framing for the data blobs,<br />
interleave and identify multiple data streams, provide timing<br />
information, and store the metadata necessary to parse, navigate,<br />
manipulate, and present the media. In general, any container can hold<br />
any kind of data. And data can be put into any container.<br />
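<br />
As a tiny concrete example, the single-purpose y4m container mentioned in<br />
the links below is about as simple as framing gets: a one-line text header<br />
carrying the video parameters, then each raw frame prefixed with a short<br />
marker. A sketch of a writer, assuming 8-bit 4:2:0 frames:<br />
<pre>
def write_y4m(path, frames, width, height, fps=(30000, 1001)):
    # frames: an iterable of bytes objects, each width*height*3//2 bytes of
    # planar Y, U, V data.
    with open(path, "wb") as f:
        f.write(("YUV4MPEG2 W%d H%d F%d:%d Ip A1:1 C420\n"
                 % (width, height, fps[0], fps[1])).encode("ascii"))
        for frame in frames:
            f.write(b"FRAME\n")    # per-frame framing marker
            f.write(frame)
</pre>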
<br />
<br />
<center><div style="background-color:#DDDDFF;border-color:#CCCCDD;border-style:solid;width:80%;padding:0 1em 1em 1em;text-align:left;"><br />
'''Going deeper…'''<br />
* There are several common general-purpose container formats: [[wikipedia:Audio_Video_Interleave|AVI]], [[wikipedia:Matroska|Matroska]], [[wikipedia:Ogg|Ogg]], [[wikipedia:QuickTime_File_Format|QuickTime]], and [[wikipedia:Comparison_of_container_formats|many others]]. These can contain and interleave many different types of media streams.<br />
* Some special-purpose containers have been designed that can only hold one format:<br />
** [http://wiki.multimedia.cx/index.php?title=YUV4MPEG2 The y4m format] is the most common single-purpose container for raw YUV video. It can also be stored in a general-purpose container, for example in Ogg using [[OggYUV]].<br />
** MP3 files use a [[wikipedia:MP3#File_structure|special single-purpose file format]].<br />
** [[wikipedia:WAV|WAV]] and [[wikipedia:AIFC|AIFC]] are semi-single-purpose formats. They're audio-only, and typically contain raw PCM audio, but are occasionally used to store other kinds of audio data ... even MP3!<br />
</div></center><br />
<br />
<br style="clear:both;"/><br />
<br />
==Credits==<br />
[[Image:Dmpfg_018.jpg|360px|right]]<br />
[[Image:Dmpfg_019.png|360px|right]]<br />
<br />
In the past thirty minutes, we've covered digital audio, video, some<br />
history, some math and a little engineering. We've barely scratched the<br />
surface, but it's time for a well-earned break.<br />
<br />
There's so much more to talk about, so I hope you'll join me again in<br />
our next episode. Until then&mdash;Cheers!<br />
<br />
Written by:<br />
Christopher (Monty) Montgomery<br />
and the Xiph.Org Community<br />
<br />
Intro, title and credits music:<br><br />
"Boo Boo Coming", by Joel Forrester<br><br />
Performed by the [http://microscopicseptet.com/ Microscopic Septet]<br><br />
Used by permission of Cuneiform Records.<br><br />
Original source track All Rights Reserved.<br><br />
[http://www.cuneiformrecords.com www.cuneiformrecords.com]<br />
<br />
This Video Was Produced Entirely With Free and Open Source Software:<br><br />
<br />
[http://www.gnu.org/ GNU]<br><br />
[http://www.linux.org/ Linux]<br><br />
[http://fedoraproject.org/ Fedora]<br><br />
[http://cinelerra.org/ Cinelerra]<br><br />
[http://www.gimp.org/ The Gimp]<br><br />
[http://audacity.sourceforge.net/ Audacity]<br><br />
[http://svn.xiph.org/trunk/postfish/README Postfish]<br><br />
[http://gstreamer.freedesktop.org/ Gstreamer]<br><br />
<br />
All trademarks are the property of their respective owners. <br />
<br />
''Complete video'' [http://creativecommons.org/licenses/by-nc-sa/3.0/legalcode CC-BY-NC-SA]<br><br />
''Text transcript and Wiki edition'' [http://creativecommons.org/licenses/by-sa/3.0/legalcode CC-BY-SA]<br><br />
<br />
A Co-Production of Xiph.Org and Red Hat Inc.<br><br />
(C) 2010, Some Rights Reserved<br><br />
<br />
<br style="clear:both;"/><hr/><br />
<center><font size="+1">''[[A Digital Media Primer For Geeks (episode 1)/making|Learn more about the making of this video…]]''</font></center></div>Edrzhttps://wiki.xiph.org/index.php?title=Videos/A_Digital_Media_Primer_For_Geeks&diff=12599Videos/A Digital Media Primer For Geeks2010-09-30T20:11:42Z<p>Edrz: /* Introduction */</p>
<hr />
<div><small>''Wiki edition''</small><br />
[[Image:Dmpfg_001.jpg|360px|right]]<br />
<br />
This first video from Xiph.Org presents the technical foundations of modern digital media via a half-hour firehose of information. One community member called it "a Uni lecture I never got but really wanted."<br />
<br />
The program offers a brief history of digital media, a quick summary of the sampling theorem, and myriad details of low level audio and video characterization and formatting. It's intended for budding geeks looking to get into video coding, as well as the technically curious who want to know more about the media they wrangle for work or play.<br />
<br/><br/><br/><br />
<center><font size="+2">[http://www.xiph.org/video/vid1.shtml Download or Watch online]</font></center><br />
<br style="clear:both;"/><br />
Players supporting WEBM: [http://www.videolan.org/vlc/ VLC 1.1+], [https://www.mozilla.com/en-US/firefox/all-beta.html Firefox 4 (beta)], [http://www.chromium.org/getting-involved/dev-channel Chrome (development versions)], [http://www.opera.com/ Opera], [http://www.webmproject.org/users/ more…]<br />
<br />
Players supporting Ogg/Theora: [http://www.videolan.org/vlc/ VLC], [http://www.firefox.com/ Firefox], [http://www.opera.com/ Opera], [[TheoraSoftwarePlayers|more…]]<br />
<br />
If you're having trouble with playback in a modern browser or player, please visit our [[Playback_Troubleshooting|playback troubleshooting and discussion]] page.<br />
<br/><br />
<hr/><br />
<br />
==Introduction==<br />
[[Image:Dmpfg_000.jpg|360px|right]]<br />
[[Image:Dmpfg_002.jpg|360px|right]]<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Introduction|Discuss this section]]</small><br />
<br />
Workstations and high-end personal computers have been able to<br />
manipulate digital audio pretty easily for about fifteen years now.<br />
It's only been about five years that a decent workstation's been able<br />
to handle raw video without a lot of expensive special purpose<br />
hardware.<br />
<br />
But today even most cheap home PCs have the processor power and<br />
storage necessary to really toss raw video around, at least without<br />
too much of a struggle. So now that everyone has all of this cheap media-capable hardware, <br />
more people, not surprisingly, want to do interesting<br />
things with digital media, especially streaming. YouTube was the first huge<br />
success, and now everybody wants in.<br />
<br />
Well good! Because this stuff is a lot of fun!<br />
<br />
<br />
It's no problem finding consumers for digital media. But here I'd<br />
like to address the engineers, the mathematicians, the hackers, the<br />
people who are interested in discovering and making things and<br />
building the technology itself. The people after my own heart.<br />
<br />
Digital media, compression especially, is perceived to be super-elite,<br />
somehow incredibly more difficult than anything else in computer<br />
science. The big industry players in the field don't mind this<br />
perception at all; it helps justify the staggering number of very<br />
basic patents they hold. They like the image that their media<br />
researchers "are the best of the best, so much smarter than anyone<br />
else that their brilliant ideas can't even be understood by mere<br />
mortals." This is bunk. <br />
<br />
Digital audio and video and streaming and compression offer endless<br />
deep and stimulating mental challenges, just like any other<br />
discipline. It seems elite because so few people have been<br />
involved. So few people have been involved perhaps because so few<br />
people could afford the expensive, special-purpose equipment it<br />
required. But today, just about anyone watching this video has a<br />
cheap, general-purpose computer powerful enough to play with the big<br />
boys. There are battles going on today around HTML5 and browsers and<br />
video and open vs. closed. So now is a pretty good time to get<br />
involved. The easiest place to start is probably understanding the<br />
technology we have right now.<br />
<br />
This is an introduction. Since it's an introduction, it glosses over a<br />
ton of details so that the big picture's a little easier to see.<br />
Quite a few people watching are going to be way past anything that I'm<br />
talking about, at least for now. On the other hand, I'm probably<br />
going to go too fast for folks who really are brand new to all of<br />
this, so if this is all new, relax. The important thing is to pick out<br />
any ideas that really grab your imagination. Especially pay attention<br />
to the terminology surrounding those ideas, because with those, and<br />
Google, and Wikipedia, you can dig as deep as interests you.<br />
<br />
So, without any further ado, welcome to one hell of a new hobby.<br />
<br />
<center><div style="background-color:#DDDDFF;border-color:#CCCCDD;border-style:solid;width:80%;padding:0 1em 1em 1em;text-align:left;"><br />
'''Going deeper…'''<br />
*[http://www.xiph.org/about/ About Xiph.Org]: Why you should care about open media<br />
*[http://www.0xdeadbeef.com/weblog/2010/01/html5-video-and-h-264-what-history-tells-us-and-why-were-standing-with-the-web/ HTML5 Video and H.264: what history tells us and why we're standing with the web]: Chris Blizzard of Mozilla on free formats and the open web<br />
*[http://diveintohtml5.org/video.html Dive into HTML5]: tutorial on HTML5 web video<br />
*[http://webchat.freenode.net/?channels=xiph Chat with the creators of the video] via freenode IRC in #xiph.<br />
</div></center><br />
<br />
<br style="clear:both;"/><br />
<br />
==Analog vs Digital==<br />
[[Image:Dmpfg_004.jpg|360px|right]]<br />
[[Image:Dmpfg_006.jpg|360px|right]]<br />
[[Image:Dmpfg_007.jpg|360px|right]]<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Analog_vs_Digital|Discuss this section]]</small><br />
<br />
<br />
[[WikiPedia:Sound|Sound]] is the propagation of pressure waves through air, spreading out<br />
from a source like ripples spread from a stone tossed into a pond. A<br />
microphone, or the human ear for that matter, transforms these passing<br />
ripples of pressure into an electric signal. Right, this is<br />
middle school science class, everyone remembers this. Moving on.<br />
<br />
That audio signal is a one-dimensional function, a single value<br />
varying over time. If we slow the [[WikiPedia:Oscilloscope|'scope]] down a bit... that should be<br />
a little easier to see. A few other aspects of the signal are<br />
important. It's [[WikiPedia:Continuous_function|continuous]] in both value and time; that is, at any<br />
given time it can have any real value, and there's a smoothly varying<br />
value at every point in time. No matter how much we zoom in, there<br />
are no discontinuities, no singularities, no instantaneous steps or<br />
points where the signal ceases to exist. It's defined<br />
everywhere. Classic continuous math works very well on these signals.<br />
<br />
A digital signal on the other hand is [[WikiPedia:Discrete_math|discrete]] in both value and time.<br />
In the simplest and most common system, called [[WikiPedia:Pulse code modulation|Pulse Code Modulation]],<br />
one of a fixed number of possible values directly represents the<br />
instantaneous signal amplitude at points in time spaced a fixed<br />
distance apart. The end result is a stream of digits.<br />
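<br />
A tiny sketch of that process: evaluate an "analog" function at evenly<br />
spaced instants and round each value to one of a fixed set of integer<br />
levels (here, 16-bit):<br />
<pre>
import math

def pcm_sample(signal, rate, seconds, bits=16):
    # signal is a function of time (seconds) returning values in -1.0..1.0.
    full_scale = 2 ** (bits - 1)
    return [max(-full_scale, min(full_scale - 1,
                round(signal(n / rate) * full_scale)))
            for n in range(int(seconds * rate))]

samples = pcm_sample(lambda t: math.sin(2 * math.pi * 440 * t), 8000, 0.01)
</pre>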
<br />
Now this looks an awful lot like this. It seems intuitive that we<br />
should somehow be able to rigorously transform one into the other, and<br />
good news, the [[WikiPedia:Nyquist-Shannon sampling theorem|Sampling Theorem]] says we can and tells us<br />
how. Published in its most recognizable form by [[WikiPedia:Claude Shannon|Claude Shannon]] in 1949<br />
and built on the work of [[WikiPedia:Harry Nyquist|Nyquist]], and [[WikiPedia:Ralph Hartley|Hartley]], and tons of others, the<br />
sampling theorem not only states that we can go back and<br />
forth between analog and digital, but also lays<br />
down the conditions under which the conversion is lossless and the two<br />
representations become equivalent and interchangeable. When the<br />
lossless conditions aren't met, the sampling theorem tells us how and<br />
how much information is lost or corrupted.<br />
<br />
Up until very recently, analog technology was the basis for<br />
practically everything done with audio, and that's not because most<br />
audio comes from an originally analog source. You may also think that<br />
since computers are fairly recent, analog signal technology must have<br />
come first. Nope. Digital is actually older. The [[WikiPedia:Telegraph|telegraph]] predates<br />
the telephone by half a century and was already fully mechanically<br />
automated by the 1860s, sending coded, multiplexed digital signals<br />
long distances. You know... [[WikiPedia:Tickertape|tickertape]]. Harry Nyquist of [[WikiPedia:Bell_labs|Bell Labs]] was<br />
researching telegraph pulse transmission when he published his<br />
description of what later became known as the [[WikiPedia:Nyquist_frequency|Nyquist frequency]], the<br />
core concept of the sampling theorem. Now, it's true the telegraph<br />
was transmitting symbolic information, text, not a digitized analog<br />
signal, but with the advent of the telephone and radio, analog and<br />
digital signal technology progressed rapidly and side-by-side.<br />
<br />
Audio had always been manipulated as an analog signal because... well,<br />
gee, it's so much easier. A [[WikiPedia:Low-pass_filter#Continuous-time_low-pass_filters|second-order low-pass filter]], for example,<br />
requires two passive components. An all-analog [[WikiPedia:Short-time_Fourier_transform|short-time Fourier<br />
transform]], a few hundred. Well, maybe a thousand if you want to build<br />
something really fancy (bang on the [http://www.testequipmentdepot.com/usedequipment/hewlettpackard/spectrumanalyzers/3585a.htm 3585]). Processing signals<br />
digitally requires millions to billions of transistors running at<br />
microwave frequencies, support hardware at very least to digitize and<br />
reconstruct the analog signals, a complete software ecosystem for<br />
programming and controlling that billion-transistor juggernaut,<br />
digital storage just in case you want to keep any of those bits for<br />
later...<br />
<br />
So we come to the conclusion that analog is the only practical way to<br />
do much with audio... well, unless you happen to have a billion<br />
transistors and all the other things just lying around. And [[WikiPedia:File:Transistor_Count_and_Moore's_Law_-_2008.svg|since we<br />
do]], digital signal processing becomes very attractive.<br />
<br />
For one thing, analog componentry just doesn't have the flexibility of<br />
a general purpose computer. Adding a new function to this<br />
beast [the 3585]... yeah, it's probably not going to happen. On a digital<br />
processor though, just write a new program. Software isn't trivial,<br />
but it is a lot easier.<br />
<br />
Perhaps more importantly, though, every analog component is an<br />
approximation. There's no such thing as a perfect transistor, or a<br />
perfect inductor, or a perfect capacitor. In analog, every component<br />
adds [[WikiPedia:Johnson–Nyquist_noise|noise]] and [[WikiPedia:Distortion#Electronic_signals|distortion]], usually not very much, but it adds up. Just<br />
transmitting an analog signal, especially over long distances,<br />
progressively, measurably, irretrievably corrupts it. Besides, all of<br />
those single-purpose analog components take up a lot of space. Two<br />
lines of code on the billion transistors back here can implement a<br />
filter that would require an [[WikiPedia:Inductor|inductor]] the size of a refrigerator.<br />
<br />
Digital systems don't have these drawbacks. Digital signals can be<br />
stored, copied, manipulated, and transmitted without adding any noise<br />
or distortion. We do use [[WikiPedia:Lossy_compression|lossy]] algorithms from time to time, but the<br />
only unavoidably non-ideal steps are digitization and reconstruction,<br />
where digital has to interface with all of that messy analog. Messy<br />
or not, modern [[WikiPedia:Digital-to-analog_converter|conversion stages]] are very, very good. By the<br />
standards of our ears, we can consider them practically lossless as<br />
well.<br />
<br />
With a little extra hardware, then, most of which is now small and<br />
inexpensive due to our modern industrial infrastructure, digital audio<br />
is the clear winner over analog. So let us then go about storing it,<br />
copying it, manipulating it, and transmitting it.<br />
<br />
<center><div style="background-color:#DDDDFF;border-color:#CCCCDD;border-style:solid;width:80%;padding:0 1em 1em 1em;text-align:left;"><br />
'''Going deeper…'''<br />
*Wikipedia: [[WikiPedia:Nyquist–Shannon_sampling_theorem|Nyquist–Shannon sampling theorem]]<br />
*MIT OpenCourseWare [http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-003-signals-and-systems-spring-2010/lecture-notes/ Lecture notes from 6.003 signals and systems.]<br />
*Wikipedia: [[WikiPedia:Passive_analogue_filter_development|The history of analog filters]] such as the [[WikiPedia:RC circuit|RC low-pass]] shown connected to the [[wikipedia:Spectrum_analyzer|spectrum analyzer]] in the video.<br />
</div></center><br />
<br />
<br style="clear:both;"/><br />
<br />
==Raw (digital audio) meat==<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Raw_.28digital_audio.29_meat|Discuss this section]]</small><br />
<br />
Pulse Code Modulation is the most common representation for <br />
raw audio. Other practical representations do exist: for example, the<br />
[[WikiPedia:Delta-sigma_modulation|Sigma-Delta coding]] used by the [[WikiPedia:Super_Audio_CD|SACD]], which is a form of [[wikipedia:Pulse-density_modulation|Pulse Density<br />
Modulation]]. That said, Pulse Code Modulation is far<br />
and away dominant, mainly because it's so mathematically<br />
convenient. An audio engineer can spend an entire career without<br />
running into anything else.<br />
<br />
PCM encoding can be characterized in three parameters, making it easy<br />
to account for every possible PCM variant with mercifully little<br />
hassle.<br />
<br style="clear:both;"/><br />
===sample rate===<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Raw_.28digital_audio.29_meat|Discuss this section]]</small><br />
[[Image:Dmpfg_009.jpg|360px|right]]<br />
[[Image:Dmpfg_008.jpg|360px|right]]<br />
The first parameter is the [[wikipedia:Sampling_rate|sampling rate]]. The highest frequency an<br />
encoding can represent is called the Nyquist Frequency. The Nyquist<br />
frequency of PCM happens to be exactly half the sampling rate.<br />
Therefore, the sampling rate directly determines the highest possible<br />
frequency in the digitized signal.<br />
<br />
Analog telephone systems traditionally [[wikipedia:Bandlimiting|band-limited]] voice channels to<br />
just under 4kHz, so digital telephony and most classic voice<br />
applications use an 8kHz sampling rate: the minimum sampling rate<br />
necessary to capture the entire bandwidth of a 4kHz channel. This is<br />
what an 8kHz sampling rate sounds like&mdash;a bit muffled but perfectly<br />
intelligible for voice. This is the lowest sampling rate that's ever<br />
been used widely in practice.<br />
<br />
From there, as power, and memory, and storage increased, consumer<br />
computer hardware went to offering 11, and then 16, and then 22, and<br />
then 32kHz sampling. With each increase in the sampling rate and the<br />
Nyquist frequency, it's obvious that the high end becomes a little<br />
clearer and the sound more natural.<br />
<br />
The Compact Disc uses a 44.1kHz sampling rate, which is again slightly<br />
better than 32kHz, but the gains are becoming less distinct. 44.1kHz<br />
is a bit of an oddball choice, especially given that it hadn't been<br />
used for anything prior to the compact disc, but the huge success of<br />
the CD has made it a common rate.<br />
<br />
The most common hi-fidelity sampling rate aside from the CD is 48kHz.<br />
There's virtually no audible difference between the two. This video,<br />
or at least the original version of it, was shot and produced with<br />
48kHz audio, which happens to be the original standard for<br />
high-fidelity audio with video.<br />
<br />
Super-hi-fidelity sampling rates of 88, and 96, and 192kHz have also<br />
appeared. The reason for the sampling rates beyond 48kHz isn't to<br />
extend the audible high frequencies further. It's for a different<br />
reason.<br />
<br />
Stepping back for just a second, the French mathematician [[wikipedia:Joseph_Fourier|Jean<br />
Baptiste Joseph Fourier]] showed that we can also think of signals like<br />
audio as a set of component frequencies. This [[wikipedia:Frequency_domain|frequency-domain]]<br />
representation is equivalent to the time representation; the signal is<br />
exactly the same, we're just looking at it [[wikipedia:Basis_(linear_algebra)|a different way]]. Here we see the<br />
frequency-domain representation of a hypothetical analog signal we<br />
intend to digitally sample.<br />
<br />
The sampling theorem tells us two essential things about the sampling<br />
process. First, that a digital signal can't represent any<br />
frequencies above the Nyquist frequency. Second, and this is the new<br />
part, if we don't remove those frequencies with a low-pass [[wikipedia:Audio_filter|filter]]<br />
before sampling, the sampling process will fold them down into the<br />
representable frequency range as [[wikipedia:Aliasing|aliasing distortion]].<br />
<br />
Aliasing, in a nutshell, sounds freakin' awful, so it's essential to<br />
remove any beyond-Nyquist frequencies before sampling and after<br />
reconstruction.<br />
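<br />
A quick sketch of the fold-down arithmetic (not a real resampler), showing<br />
where an unfiltered tone lands after sampling:<br />
<pre>
def aliased_frequency(f, rate):
    # A sampled tone appears at its distance from the nearest multiple of
    # the sampling rate, so the result always lands between 0 and rate/2.
    return abs(f - rate * round(f / rate))

print(aliased_frequency(20000, 48000))   # 20000: below Nyquist, unchanged
print(aliased_frequency(30000, 48000))   # 30000 folds down to 18000
</pre>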
<br />
Human frequency perception is considered to extend to about 20kHz. In<br />
44.1 or 48kHz sampling, the low pass before the sampling stage has to<br />
be extremely sharp to avoid cutting any audible frequencies below<br />
[[wikipedia:Hearing_range|20kHz]] but still not allow frequencies above the Nyquist to leak<br />
forward into the sampling process. This is a difficult filter to<br />
build, and no practical filter succeeds completely. If the sampling<br />
rate is 96kHz or 192kHz on the other hand, the low pass has an extra<br />
[[wikipedia:Octave_(electronics)|octave]] or two for its [[wikipedia:Transition_band|transition band]]. This is a much easier filter to<br />
build. Sampling rates beyond 48kHz are actually one of those messy<br />
analog stage compromises.<br />
<br style="clear:both;"/><br />
<br />
===sample format===<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Raw_.28digital_audio.29_meat|Discuss this section]]</small><br />
[[Image:Dmpfg_anim.gif|right]]<br />
<br />
The second fundamental PCM parameter is the sample format; that is,<br />
the format of each digital number. A number is a number, but a number<br />
can be represented in bits a number of different ways.<br />
<br />
Early PCM was [[wikipedia:Quantization_(sound_processing)#Audio_quantization|eight-bit]] [[wikipedia:Linear_pulse_code_modulation|linear]], encoded as an [[wikipedia:Signedness|unsigned]] [[wikipedia:Integer_(computer_science)#Bytes_and_octets|byte]]. The<br />
[[wikipedia:Dynamic_range#Audio|dynamic range]] is limited to about [[wikipedia:Decibel|50dB]] and the [[wikipedia:Quantization_error|quantization noise]], as<br />
you can hear, is pretty severe. Eight-bit audio is vanishingly rare<br />
today.<br />
<br />
Digital telephony typically uses one of two related non-linear eight<br />
bit encodings called [[wikipedia:A-law_algorithm|A-law]] and [[wikipedia:Μ-law_algorithm|μ-law]]. These formats encode a roughly<br />
[[wikipedia:Audio_bit_depth#Dynamic_range|14 bit dynamic range]] into eight bits by spacing the higher amplitude<br />
values farther apart. A-law and mu-law obviously improve quantization<br />
noise compared to linear 8-bit, and voice harmonics especially hide<br />
the remaining quantization noise well. All three eight-bit encodings,<br />
linear, A-law, and mu-law, are typically paired with an 8kHz sampling<br />
rate, though I'm demonstrating them here at 48kHz.<br />
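<br />
The shape of the mu-law curve, sketched as its continuous companding<br />
function (the actual G.711 codec uses a segmented approximation, but the<br />
idea is the same):<br />
<pre>
import math

MU = 255.0

def mulaw_encode(x):
    # x in -1.0..1.0; small values get most of the output range, so the
    # eight-bit quantization steps are finest where signals are quietest.
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)

def mulaw_decode(y):
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
</pre>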
<br />
Most modern PCM uses 16- or 24-bit [[wikipedia:Two's_complement|two's-complement]] signed integers to<br />
encode the range from negative infinity to zero decibels in 16 or 24<br />
bits of precision. The maximum absolute value corresponds to zero decibels.<br />
As with all the sample formats so far, signals beyond zero decibels, and thus<br />
beyond the maximum representable range, are [[wikipedia:Clipping_(audio)|clipped]].<br />
<br />
In mixing and mastering, it's not unusual to use [[wikipedia:Floating_point|floating-point]]<br />
numbers for PCM instead of [[wikipedia:Integer_(computer_science)|integers]]. A 32 bit [[wikipedia:IEEE_754-2008|IEEE754]] float, that's<br />
the normal kind of floating point you see on current computers, has 24<br />
bits of resolution, but an eight-bit floating-point exponent increases<br />
the representable range. Floating point usually represents zero<br />
decibels as +/-1.0, and because floats can obviously represent<br />
considerably beyond that, temporarily exceeding zero decibels during<br />
the mixing process doesn't cause clipping. Floating-point PCM takes<br />
up more space, so it tends to be used only as an intermediate<br />
production format.<br />
<br />
Lastly, most general purpose computers still read and<br />
write data in octet bytes, so it's important to remember that samples<br />
bigger than eight bits can be in [[wikipedia:Endianness|big- or little-endian order]], and both<br />
endiannesses are common. For example, Microsoft [[wikipedia:WAV|WAV]] files are little-endian,<br />
and Apple [[wikipedia:AIFC|AIFC]] files tend to be big-endian. Be aware of it.<br />
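<br />
Pulling a few of those details together, a sketch of converting<br />
floating-point samples (zero decibels at +/-1.0) to clipped 16-bit integers<br />
packed in an explicit byte order:<br />
<pre>
import struct

def float_to_int16(samples, little_endian=True):
    # Clip to the representable range, scale, and pack with an explicit
    # endianness: '<' (little-endian, as in WAV) or '>' (big-endian, as in AIFC).
    ints = [max(-32768, min(32767, round(s * 32767))) for s in samples]
    return struct.pack(("<" if little_endian else ">") + "%dh" % len(ints), *ints)
</pre>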
<br style="clear:both;"/><br />
<br />
===channels===<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Raw_.28digital_audio.29_meat|Discuss this section]]</small><br />
<br />
The third PCM parameter is the number of [[wikipedia:Multichannel_audio|channels]]. The convention in<br />
raw PCM is to encode multiple channels by interleaving the samples of<br />
each channel together into a single stream. Straightforward and extensible.<br />
<br style="clear:both;"/><br />
===done!===<br />
<br />
And that's it! That describes every PCM representation ever. Done.<br />
Digital audio is ''so easy''! There's more to do of course, but at this<br />
point we've got a nice useful chunk of audio data, so let's get some<br />
video too.<br />
<br />
<br />
<center><div style="background-color:#DDDDFF;border-color:#CCCCDD;border-style:solid;width:80%;padding:0 1em 1em 1em;text-align:left;"><br />
'''Going deeper…'''<br />
* [[wikipedia:Roll-off|Wikipedia's article on filter roll-off]], to learn why it's hard to build analog filters with a very narrow [[wikipedia:Transition_band|transition band]] between the [[wikipedia:Passband|passband]] and the [[wikipedia:Stopband|stopband]]. Filters that achieve such hard edges often do so at the expense of increased [[wikipedia:Ripple_(filters)#Frequency-domain_ripple|ripple]] and [http://www.ocf.berkeley.edu/~ashon/audio/phase/phaseaud2.htm phase distortion].<br />
* [http://wiki.multimedia.cx/index.php?title=PCM Some more minutiae] about PCM in practice.<br />
* [[wikipedia:DPCM|DPCM]] and [[wikipedia:ADPCM|ADPCM]], simple audio codecs loosely inspired by PCM.<br />
</div></center><br />
<br />
<br style="clear:both;"/><br />
<br />
==Video vegetables (they're good for you!)==<br />
[[Image:Dmpfg_010.jpg|360px|right]]<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Video_vegetables_.28they.27re_good_for_you.21.29|Discuss this section]]</small><br />
<br />
One could think of video as being like audio but with two additional<br />
spatial dimensions, X and Y, in addition to the dimension of time.<br />
This is mathematically sound. The Sampling Theorem applies to all<br />
three video dimensions just as it does the single time dimension of<br />
audio.<br />
<br />
Audio and video are obviously quite different in practice. For one,<br />
compared to audio, video is huge. [[wikipedia:Red_Book_(audio_Compact_Disc_standard)#Technical_details|Raw CD audio]] is about 1.4 megabits<br />
per second. Raw [[wikipedia:1080i|1080i]] HD video is over 700 megabits per second. That's<br />
more than 500 times more data to capture, process, and store per<br />
second. By [[wikipedia:Moore's_law|Moore's law]]... that's... let's see... roughly eight<br />
doublings times two years, so yeah, computers requiring about an extra<br />
fifteen years to handle raw video after getting raw audio down pat was<br />
about right.<br />
<br />
Basic raw video is also just more complex than basic raw audio. The<br />
sheer volume of data currently necessitates a representation more<br />
efficient than the linear PCM used for audio. In addition, electronic<br />
video comes almost entirely from broadcast television alone, and the<br />
standards committees that govern broadcast video have always been very<br />
concerned with backward compatibility. Up until just last year in the<br />
US, a sixty-year-old black and white television could still show a<br />
normal [[wikipedia:NTSC|analog television broadcast]]. That's actually a really neat<br />
trick.<br />
<br />
The downside to backward compatibility is that once a detail makes it<br />
into a standard, you can't ever really throw it out again. Electronic<br />
video has never started over from scratch the way audio has multiple<br />
times. Sixty years worth of clever but obsolete hacks necessitated by<br />
the passing technology of a given era have built up into quite a pile,<br />
and because digital standards also come from broadcast television, all<br />
these eldritch hacks have been brought forward into the digital<br />
standards as well.<br />
<br />
In short, there are a whole lot more details involved in digital video<br />
than there were with audio. There's no hope of covering them<br />
all completely here, so we'll cover the broad fundamentals.<br />
<br style="clear:both;"/><br />
===resolution and aspect===<br />
[[Image:Dmpfg_011.jpg|360px|right]]<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Video_vegetables_.28they.27re_good_for_you.21.29|Discuss this section]]</small><br />
<br />
The most obvious raw video parameters are the width and height of the<br />
picture in pixels. As simple as that may sound, the pixel dimensions<br />
alone don't actually specify the absolute width and height of the<br />
picture, as most broadcast-derived video doesn't use square pixels.<br />
The number of [[wikipedia:Scan_line|scanlines]] in a broadcast image was fixed, but the<br />
effective number of horizontal pixels was a function of channel<br />
[[wikipedia:Bandwidth_(signal_processing)|bandwidth]]. Effective horizontal resolution could result in pixels that<br />
were either narrower or wider than the spacing between scanlines.<br />
<br />
Standards have generally specified that digitally sampled video should<br />
reflect the real resolution of the original analog source, so a large<br />
amount of digital video also uses non-square pixels. For example, a<br />
normal 4:3 aspect NTSC DVD is typically encoded with a display<br />
resolution of [[wikipedia:DVD-Video#Frame_size_and_frame_rate|704 by 480]], a ratio wider than 4:3. In this case, the<br />
pixels themselves are assigned an aspect ratio of [[wikipedia:Standard-definition_television#Resolution|10:11]], making them<br />
taller than they are wide and narrowing the image horizontally to the<br />
correct aspect. Such an image has to be resampled to show properly on<br />
a digital display with square pixels.<br />
<br style="clear:both;"/><br />
===frame rate and interlacing===<br />
[[Image:Dmpfg_012.jpg|360px|right]]<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Video_vegetables_.28they.27re_good_for_you.21.29|Discuss this section]]</small><br />
<br />
The second obvious video parameter is the [[wikipedia:Frame_rate|frame rate]], the number of<br />
full frames per second. Several standard frame rates are in active<br />
use. Digital video, in one form or another, can use all of them. Or,<br />
any other frame rate. Or even variable rates where the frame rate<br />
changes adaptively over the course of the video. The higher the frame<br />
rate, the smoother the motion and that brings us, unfortunately, to<br />
[[wikipedia:Interlace|interlacing]].<br />
<br />
In the very earliest days of broadcast video, engineers sought the<br />
fastest practical frame rate to smooth motion and to minimize [[wikipedia:Flicker_(screen)|flicker]]<br />
on phosphor-based [[wikipedia:Cathode_ray_tube|CRTs]]. They were also under pressure to use the<br />
least possible bandwidth for the highest resolution and fastest frame<br />
rate. Their solution was to interlace the video where the even lines<br />
are sent in one pass and the odd lines in the next. Each pass is<br />
called a field and two fields sort of produce one complete frame.<br />
"Sort of", because the even and odd fields aren't actually from the<br />
same source frame. In a 60 field per second picture, the source frame<br />
rate is actually 60 full frames per second, and half of each frame,<br />
every other line, is simply discarded. This is why we can't<br />
[[wikipedia:Deinterlacing|deinterlace]] a video simply by combining two fields into one frame;<br />
they're not actually from one frame to begin with.<br />
<br style="clear:both;"/><br />
<br />
===gamma===<br />
[[Image:Dmpfg_013.jpg|360px|right]]<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Video_vegetables_.28they.27re_good_for_you.21.29|Discuss this section]]</small><br />
<br />
The cathode ray tube was the only available display technology for<br />
most of the history of electronic video. A CRT's output brightness is<br />
nonlinear, approximately equal to the input controlling voltage raised<br />
to the 2.5th power. This exponent, 2.5, is designated gamma, and so<br />
it's often referred to as the gamma of a display. Cameras, though,<br />
are linear, and if you feed a CRT a linear input signal, it looks a<br />
bit like this.<br />
<br />
As there were originally to be very few cameras, which were<br />
fantastically expensive anyway, and hopefully many, many television<br />
sets which best be as inexpensive as possible, engineers decided to<br />
add the necessary [[wikipedia:Gamma_correction|gamma correction]] circuitry to the cameras rather<br />
than the sets. Video transmitted over the airwaves would thus have a<br />
nonlinear intensity using the inverse of the set's gamma exponent, so that<br />
once a camera's signal was finally displayed on the CRT, the overall<br />
response of the system from camera to set was back to linear again.<br />
<br />
Almost.<br />
<br />
There were also two other tweaks. A television camera actually uses a<br />
gamma exponent that's the inverse of 2.2, not 2.5. That's just a<br />
correction for viewing in a dim environment. Also, the exponential<br />
curve transitions to a linear ramp near black. That's just an old<br />
hack for suppressing sensor noise in the camera.<br />
<br />
Gamma correction also had a lucky benefit. It just so happens that the<br />
human eye has a perceptual gamma of about 3. This is relatively close<br />
to the CRT's gamma of 2.5. An image using gamma correction devotes<br />
more resolution to lower intensities, where the eye happens to have<br />
its finest intensity discrimination, and therefore uses the available<br />
scale resolution more efficiently. Although CRTs are currently<br />
vanishing, a standard [[wikipedia:sRGB|sRGB]] computer display still uses a nonlinear<br />
intensity curve similar to television, with a linear ramp near black,<br />
followed by an exponential curve with a gamma exponent of 2.4. This<br />
encodes a sixteen bit linear range down into eight bits.<br />
<br style="clear:both;"/><br />
<br />
===color and colorspace===<br />
[[Image:Dmpfg_014.jpg|360px|right]]<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Video_vegetables_.28they.27re_good_for_you.21.29|Discuss this section]]</small><br />
<br />
The human eye has three apparent color channels, red, green, and blue,<br />
and most displays use these three colors as [[wikipedia:Additive_color|additive primaries]] to<br />
produce a full range of color output. The primary pigments in<br />
printing are [[wikipedia:CMYK|Cyan, Magenta, and Yellow]] for the same reason; pigments<br />
are [[wikipedia:Subtractive_color|subtractive]], and each of these pigments subtracts one pure color<br />
from reflected light. Cyan subtracts red, magenta subtracts green, and<br />
yellow subtracts blue.<br />
<br />
Video can be, and sometimes is, represented with red, green, and blue<br />
color channels, but RGB video is atypical. The human eye is far more<br />
sensitive to [[wikipedia:Luminance_(relative)|luminosity]] than it is the color, and RGB tends to spread<br />
the energy of an image across all three color channels. That is, the<br />
red plane looks like a red version of the original picture, the green<br />
plane looks like a green version of the original picture, and the blue<br />
plane looks like a blue version of the original picture. Black and<br />
white times three. Not efficient.<br />
<br />
For those reasons and because, oh hey, television just happened to<br />
start out as black and white anyway, video usually is represented as a<br />
high resolution [[wikipedia:Luma_(video)|luma channel]]&mdash;the black & white&mdash;along with<br />
additional, often lower resolution [[wikipedia:Chrominance|chroma channels]], the color. The<br />
luma channel, Y, is produced by weighting and then adding the separate<br />
red, green and blue signals. The chroma channels U and V are then<br />
produced by subtracting the luma signal from blue and the luma signal<br />
from red.<br />
<br />
When YUV is scaled, offset, and quantized for digital video, it's<br />
usually more correctly called [[wikipedia:Y'CbCr|Y'CbCr]], but the more generic term YUV is<br />
widely used to describe all the analog and digital variants of this<br />
color model.<br />
<br style="clear:both;"/><br />
<br />
===chroma subsampling===<br />
[[Image:Dmpfg_015.jpg|360px|right]]<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Video_vegetables_.28they.27re_good_for_you.21.29|Discuss this section]]</small><br />
<br />
The U and V chroma channels can have the same resolution as the Y<br />
channel, but because the human eye has far less spatial color<br />
resolution than spatial luminosity resolution, chroma resolution is<br />
usually [[wikipedia:Chroma_subsampling|halved or even quartered]] in the horizontal direction, the<br />
vertical direction, or both, usually without any significant impact on the<br />
apparent raw image quality. Practically every possible subsampling<br />
variant has been used at one time or another, but the common choices<br />
today are [[wikipedia:Chroma_subsampling#4:4:4_Y.27CbCr|4:4:4]] video, which isn't actually subsampled at all, [[wikipedia:Chroma_subsampling#4:2:2|4:2:2]] video in<br />
which the horizontal resolution of the U and V channels is halved, and<br />
most common of all, [[wikipedia:Chroma_subsampling#4:2:0|4:2:0]] video in which both the horizontal and vertical<br />
resolutions of the chroma channels are halved, resulting in U and V<br />
planes that are each one quarter the size of Y.<br />
<br />
The terms 4:2:2, 4:2:0, [[wikipedia:Chroma_subsampling#4:1:1|4:1:1]], and so on and so forth, aren't complete<br />
descriptions of a chroma subsampling. There are multiple possible ways<br />
to position the chroma pixels relative to luma, and again, several<br />
variants are in active use for each subsampling. For example, [[wikipedia:Motion_Jpeg|motion<br />
JPEG]], [[wikipedia:MPEG-1#Part_2:_Video|MPEG-1 video]], [[wikipedia:MPEG-2#Video_coding_.28simplified.29|MPEG-2 video]], [[wikipedia:DV#DV_Compression|DV]], [[wikipedia:Theora|Theora]], and [[wikipedia:WebM|WebM]] all use or can<br />
use 4:2:0 subsampling, but they site the chroma pixels [http://www.mir.com/DMG/chroma.html three different ways].<br />
<br />
Motion JPEG, MPEG-1 video, Theora and WebM all site chroma pixels<br />
between luma pixels both horizontally and vertically.<br />
<br />
MPEG-2 sites chroma pixels between lines, but horizontally aligned with<br />
every other luma pixel. Interlaced modes complicate things somewhat,<br />
resulting in a siting arrangement that's a tad bizarre.<br />
<br />
And finally PAL-DV, which is always interlaced, places the chroma<br />
pixels in the same position as every other luma pixel in the<br />
horizontal direction, and alternates the chroma channels from one<br />
line to the next.<br />
<br />
That's just 4:2:0 video. I'll leave the other subsamplings as homework for the<br />
viewer. Got the basic idea, moving on.<br />
<br style="clear:both;"/><br />
<br />
===pixel formats===<br />
[[Image:Dmpfg_016.jpg|360px|right]]<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Video_vegetables_.28they.27re_good_for_you.21.29|Discuss this section]]</small><br />
<br />
In audio, we always represent multiple channels in a PCM stream by<br />
interleaving the samples from each channel in order. Video uses both<br />
packed formats that interleave the color channels, as well as planar<br />
formats that keep the pixels from each channel together in separate<br />
planes stacked in order in the frame. There are at least [http://www.fourcc.org/yuv.php 50 different formats] in<br />
these two broad categories with possibly ten or fifteen in common use.<br />
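<br />
To make the packed/planar distinction concrete, the C sketch below shows how a single pixel's samples are located in two illustrative layouts: planar 4:2:0 "I420" (a Y plane followed by quarter-sized U and V planes) and packed 4:2:2 "YUY2" (Y and chroma bytes interleaved along each row). These two are just convenient examples from the long list linked above.<br />
<pre>
/* Sketch: addressing one pixel in a planar format (I420) versus a
   packed format (YUY2).  Frame widths and heights are assumed even. */
#include <stdint.h>

/* Planar I420: full-size Y plane, then half-by-half U and V planes. */
static void i420_sample(const uint8_t *frame, int w, int h, int x, int y,
                        const uint8_t **Y, const uint8_t **U, const uint8_t **V)
{
    const uint8_t *yplane = frame;
    const uint8_t *uplane = yplane + w * h;             /* after Y    */
    const uint8_t *vplane = uplane + (w / 2) * (h / 2); /* after U    */
    *Y = &yplane[y * w + x];
    *U = &uplane[(y / 2) * (w / 2) + x / 2];            /* shared 2x2 */
    *V = &vplane[(y / 2) * (w / 2) + x / 2];
}

/* Packed YUY2: bytes Y0 U Y1 V repeating, 4 bytes per 2 pixels. */
static void yuy2_sample(const uint8_t *frame, int w, int x, int y,
                        const uint8_t **Y, const uint8_t **U, const uint8_t **V)
{
    const uint8_t *pair = frame + y * w * 2 + (x / 2) * 4;
    *Y = &pair[(x & 1) ? 2 : 0];  /* first or second luma in the pair */
    *U = &pair[1];                /* chroma shared by both pixels     */
    *V = &pair[3];
}

int main(void)
{
    enum { W = 16, H = 8 };
    uint8_t planar[W * H + 2 * (W / 2) * (H / 2)] = {0};
    uint8_t packed[W * H * 2] = {0};
    const uint8_t *Y, *U, *V;
    i420_sample(planar, W, H, 5, 3, &Y, &U, &V);
    yuy2_sample(packed, W, 5, 3, &Y, &U, &V);
    return 0;
}
</pre>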
<br />
Each chroma subsampling and different bit-depth requires a different<br />
packing arrangement, and so a different pixel format. For a given<br />
unique subsampling, there are usually also several equivalent formats<br />
that consist of trivial channel order rearrangements or repackings, due either to<br />
convenience once-upon-a-time on some particular piece of hardware, or<br />
sometimes just good old-fashioned spite.<br />
<br />
Pixel formats are described by a unique name or [[wikipedia:FourCC|fourcc]] code. There<br />
are quite a few of these and there's no sense going over each one now.<br />
Google is your friend. Be aware that fourcc codes for raw video<br />
specify the pixel arrangement and chroma subsampling, but generally<br />
don't imply anything certain about chroma siting or color space. [http://www.fourcc.org/yuv.php#YV12 YV12]<br />
video, to pick one, can use JPEG, MPEG-2 or DV chroma siting, and any<br />
one of [[wikipedia:YUV#BT.709_and_BT.601|several YUV colorspace definitions]].<br />
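<br />
As one small example of how trivially some of these formats differ, YV12 and I420 carry identical 4:2:0 data and differ only in whether the V or the U plane is stored first, so converting between them is just a matter of copying the chroma planes in the opposite order. Note that nothing in the sketch below says anything about siting or colorspace; the fourcc leaves those unspecified.<br />
<pre>
/* Sketch: YV12 (Y, then V, then U) rearranged into I420 (Y, then U,
   then V).  Same samples, same subsampling, different plane order. */
#include <stdint.h>
#include <string.h>

static void yv12_to_i420(const uint8_t *src, uint8_t *dst, int w, int h)
{
    int luma   = w * h;
    int chroma = (w / 2) * (h / 2);
    memcpy(dst,                 src,                 luma);   /* Y           */
    memcpy(dst + luma,          src + luma + chroma, chroma); /* U was last  */
    memcpy(dst + luma + chroma, src + luma,          chroma); /* V was first */
}

int main(void)
{
    enum { W = 8, H = 8, SZ = W * H + 2 * (W / 2) * (H / 2) };
    uint8_t yv12[SZ] = {0}, i420[SZ];
    yv12_to_i420(yv12, i420, W, H);
    return 0;
}
</pre>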
<br style="clear:both;"/><br />
<br />
===done!===<br />
<br />
That wraps up our not-so-quick and yet very incomplete tour of raw<br />
video. The good news is we can already get quite a lot of real work<br />
done using that overview. In plenty of situations, a frame of video<br />
data is a frame of video data. The details matter, greatly, when it<br />
comes time to write software, but for now I am satisfied that the<br />
esteemed viewer is broadly aware of the relevant issues.<br />
<br />
<center><div style="background-color:#DDDDFF;border-color:#CCCCDD;border-style:solid;width:80%;padding:0 1em 1em 1em;text-align:left;"><br />
'''Going deeper…'''<br />
* YCbCr is defined in terms of RGB by the ITU in two incompatible standards: [[wikipedia:Rec. 601|Rec. 601]] and [[wikipedia:Rec. 709|Rec. 709]]. Both conversion standards are lossy, which has prompted some to adopt a lossless alternative called [http://wiki.multimedia.cx/index.php?title=YCoCg YCoCg] (a sketch of its reversible form appears just after this box).<br />
* Learn about [[wikipedia:High_dynamic_range_imaging|high dynamic range imaging]], which achieves better representation of the full range of brightnesses in the real world by using more than 8 bits per channel.<br />
* Learn about how [[wikipedia:Trichromatic_vision|trichromatic color vision]] works in humans, and how human color perception is encoded in the [[wikipedia:CIE 1931 color space|CIE 1931 XYZ color space]].<br />
** Compare with the [[wikipedia:Lab_color_space|Lab color space]], mathematically equivalent but structured to account for "perceptual uniformity".<br />
** If we were all [[wikipedia:Dichromacy|dichromats]] then video would only need two color channels. Some humans might be [[wikipedia:Tetrachromacy#Possibility_of_human_tetrachromats|tetrachromats]], in which case they would need an additional color channel for video to fully represent their vision.<br />
** [http://www.xritephoto.com/ph_toolframe.aspx?action=coloriq Test your color vision] (or at least your monitor).<br />
</div></center><br />
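<br />
The YCoCg transform mentioned in the box above has a reversible integer form, YCoCg-R, built from simple lifting steps; a C sketch follows. (The shifts on negative values assume the usual arithmetic-shift behavior.)<br />
<pre>
/* Sketch of the reversible YCoCg-R transform: integer lifting steps
   that round-trip RGB exactly.  Co and Cg need one extra bit of range
   compared to the RGB inputs. */
#include <assert.h>
#include <stdio.h>

static void rgb_to_ycocg_r(int r, int g, int b, int *y, int *co, int *cg)
{
    *co   = r - b;
    int t = b + (*co >> 1);
    *cg   = g - t;
    *y    = t + (*cg >> 1);
}

static void ycocg_r_to_rgb(int y, int co, int cg, int *r, int *g, int *b)
{
    int t = y - (cg >> 1);
    *g    = cg + t;
    *b    = t - (co >> 1);
    *r    = *b + co;
}

int main(void)
{
    int y, co, cg, r, g, b;
    rgb_to_ycocg_r(200, 50, 30, &y, &co, &cg);
    ycocg_r_to_rgb(y, co, cg, &r, &g, &b);
    assert(r == 200 && g == 50 && b == 30);   /* lossless round trip */
    printf("Y=%d Co=%d Cg=%d\n", y, co, cg);
    return 0;
}
</pre>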
<br />
<br style="clear:both;"/><br />
<br />
==Containers==<br />
[[Image:Dmpfg_017.jpg|360px|right]]<br />
<small>[[Talk:A_Digital_Media_Primer_For_Geeks_(episode_1)#Containers|Discuss this section]]</small><br />
<br />
So. We have audio data. We have video data. What remains is the more<br />
familiar non-signal data and straight-up engineering that software<br />
developers are used to, and plenty of it.<br />
<br />
Chunks of raw audio and video data have no externally-visible<br />
structure, but they're often uniformly sized. We could just string<br />
them together in a rigid predetermined ordering for streaming and<br />
storage, and some simple systems do approximately that. Compressed<br />
frames, though, aren't necessarily a predictable size, and we usually want<br />
some flexibility in using a range of different data types in streams.<br />
If we string random formless data together, we lose the boundaries<br />
that separate frames and don't necessarily know what data belongs to<br />
which streams. A stream needs some generalized structure to be<br />
generally useful.<br />
<br />
In addition to our signal data, we also have our PCM and video<br />
parameters. There's probably plenty of other [[wikipedia:Metadata#Video|metadata]] we also want to<br />
deal with, like audio tags and video chapters and subtitles, all<br />
essential components of rich media. It makes sense to place this<br />
metadata&mdash;that is, data about the data&mdash;within the media itself.<br />
<br />
Storing and structuring formless data and disparate metadata is the<br />
job of a [[wikipedia:Container_format_(digital)|container]]. Containers provide framing for the data blobs,<br />
interleave and identify multiple data streams, provide timing<br />
information, and store the metadata necessary to parse, navigate,<br />
manipulate, and present the media. In general, any container can hold<br />
any kind of data. And data can be put into any container.<br />
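<br />
To make the framing idea concrete, here's a deliberately toy C sketch (not any real container format) that writes each data blob with a small header carrying a stream id, a timestamp, and the payload length. That's just enough for a reader to find packet boundaries and demultiplex the interleaved streams again.<br />
<pre>
/* Toy framing sketch (not any real container format): each blob gets a
   small header with a stream id, a timestamp, and the payload length so
   a reader can find packet boundaries and split the streams back apart. */
#include <stdint.h>
#include <stdio.h>

struct packet_header {
    uint8_t  stream_id;     /* which stream this blob belongs to     */
    uint64_t timestamp_ns;  /* presentation time of the blob         */
    uint32_t length;        /* payload size in bytes, follows header */
};

static int write_packet(FILE *out, uint8_t stream_id, uint64_t ts_ns,
                        const void *payload, uint32_t length)
{
    struct packet_header hdr = { stream_id, ts_ns, length };
    /* A real container would nail down byte order, alignment and much
       more; this sketch just dumps the struct as the compiler laid it out. */
    if (fwrite(&hdr, sizeof hdr, 1, out) != 1) return -1;
    if (fwrite(payload, 1, length, out) != length) return -1;
    return 0;
}

int main(void)
{
    FILE *out = fopen("toy.container", "wb");
    if (!out) return 1;
    const char audio[] = "pretend-compressed-audio";
    const char video[] = "pretend-compressed-video-frame";
    /* Interleave a few packets from two streams with their timestamps. */
    write_packet(out, 0, 0,        audio, sizeof audio);
    write_packet(out, 1, 0,        video, sizeof video);
    write_packet(out, 0, 23219954, audio, sizeof audio);
    fclose(out);
    return 0;
}
</pre>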
<br />
<br />
<center><div style="background-color:#DDDDFF;border-color:#CCCCDD;border-style:solid;width:80%;padding:0 1em 1em 1em;text-align:left;"><br />
'''Going deeper…'''<br />
* There are several common general-purpose container formats: [[wikipedia:Audio_Video_Interleave|AVI]], [[wikipedia:Matroska|Matroska]], [[wikipedia:Ogg|Ogg]], [[wikipedia:QuickTime_File_Format|QuickTime]], and [[wikipedia:Comparison_of_container_formats|many others]]. These can contain and interleave many different types of media streams.<br />
* Some special-purpose containers have been designed that can only hold one format:<br />
** [http://wiki.multimedia.cx/index.php?title=YUV4MPEG2 The y4m format] is the most common single-purpose container for raw YUV video. It can also be stored in a general-purpose container, for example in Ogg using [[OggYUV]]. (A minimal y4m writer is sketched just after this box.)<br />
** MP3 files use a [[wikipedia:MP3#File_structure|special single-purpose file format]].<br />
** [[wikipedia:WAV|WAV]] and [[wikipedia:AIFC|AIFC]] are semi-single-purpose formats. They're audio-only, and typically contain raw PCM audio, but are occasionally used to store other kinds of audio data ... even MP3!<br />
</div></center><br />
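<br />
Since the y4m format mentioned above is about as simple as a container gets, here's a minimal C sketch that writes a few flat grey 4:2:0 frames into it: one plain-text stream header, then "FRAME" followed by the planar Y, U and V data for each frame. The C420jpeg tag marks JPEG-style chroma siting; C420mpeg2 and C420paldv cover the other sitings discussed earlier.<br />
<pre>
/* Minimal y4m (YUV4MPEG2) writer sketch: a text stream header, then
   "FRAME\n" plus raw planar 4:2:0 data for each frame. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    enum { W = 64, H = 48, FRAMES = 3 };
    static uint8_t ybuf[W * H];
    static uint8_t ubuf[(W / 2) * (H / 2)], vbuf[(W / 2) * (H / 2)];

    FILE *out = fopen("test.y4m", "wb");
    if (!out) return 1;

    /* Stream header: frame size, frame rate, progressive, square pixels,
       4:2:0 with JPEG-style chroma siting. */
    fprintf(out, "YUV4MPEG2 W%d H%d F30:1 Ip A1:1 C420jpeg\n", W, H);

    for (int i = 0; i < FRAMES; i++) {
        memset(ybuf, 16 + i * 40, sizeof ybuf);  /* a flat grey frame   */
        memset(ubuf, 128, sizeof ubuf);          /* neutral chroma      */
        memset(vbuf, 128, sizeof vbuf);
        fprintf(out, "FRAME\n");                 /* per-frame header    */
        fwrite(ybuf, 1, sizeof ybuf, out);       /* then Y, U, V planes */
        fwrite(ubuf, 1, sizeof ubuf, out);
        fwrite(vbuf, 1, sizeof vbuf, out);
    }
    fclose(out);
    return 0;
}
</pre>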
<br />
<br style="clear:both;"/><br />
<br />
==Credits==<br />
[[Image:Dmpfg_018.jpg|360px|right]]<br />
[[Image:Dmpfg_019.png|360px|right]]<br />
<br />
In the past thirty minutes, we've covered digital audio, video, some<br />
history, some math and a little engineering. We've barely scratched the<br />
surface, but it's time for a well-earned break.<br />
<br />
There's so much more to talk about, so I hope you'll join me again in<br />
our next episode. Until then&mdash;Cheers!<br />
<br />
Written by:<br />
Christopher (Monty) Montgomery<br />
and the Xiph.Org Community<br />
<br />
Intro, title and credits music:<br><br />
"Boo Boo Coming", by Joel Forrester<br><br />
Performed by the [http://microscopicseptet.com/ Microscopic Septet]<br><br />
Used by permission of Cuneiform Records.<br><br />
Original source track All Rights Reserved.<br><br />
[http://www.cuneiformrecords.com www.cuneiformrecords.com]<br />
<br />
This Video Was Produced Entirely With Free and Open Source Software:<br><br />
<br />
[http://www.gnu.org/ GNU]<br><br />
[http://www.linux.org/ Linux]<br><br />
[http://fedoraproject.org/ Fedora]<br><br />
[http://cinelerra.org/ Cinelerra]<br><br />
[http://www.gimp.org/ The Gimp]<br><br />
[http://audacity.sourceforge.net/ Audacity]<br><br />
[http://svn.xiph.org/trunk/postfish/README Postfish]<br><br />
[http://gstreamer.freedesktop.org/ Gstreamer]<br><br />
<br />
All trademarks are the property of their respective owners. <br />
<br />
''Complete video'' [http://creativecommons.org/licenses/by-nc-sa/3.0/legalcode CC-BY-NC-SA]<br><br />
''Text transcript and Wiki edition'' [http://creativecommons.org/licenses/by-sa/3.0/legalcode CC-BY-SA]<br><br />
<br />
A Co-Production of Xiph.Org and Red Hat Inc.<br><br />
(C) 2010, Some Rights Reserved<br><br />
<br />
<br style="clear:both;"/><hr/><br />
<center><font size="+1">''[[A Digital Media Primer For Geeks (episode 1)/making|Learn more about the making of this video…]]''</font></center></div>Edrz