I would recommend changing the checksum polynomial.
For the range of page sizes possible in transogg, the CRC-32C (Castagnoli) polynomial offers superior error-detection ability over the 802.3 CRC-32 that we currently have specified. This polynomial is also used by SCTP and iSCSI, so I could wave my arms and suggest that it might be more likely to get hardware assistance (Ethernet CRC is usually done on the adaptor, but the SCTP and iSCSI CRCs can't be easily offloaded) ... but no arm-waving is required: the Core i7 has an SSE4.2 instruction for the Castagnoli polynomial. There are polynomials superior to the iSCSI one, at least for some sizes relevant to us, but they lack obvious prospects of hardware assistance.
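For concreteness, a minimal bit-at-a-time sketch of the Castagnoli CRC (Python used purely for illustration; a real implementation would be table-driven or use the hardware instruction):

```python
def crc32c(data, crc=0):
    # Bitwise reference: reflected polynomial 0x82F63B78, with the
    # conventional 0xFFFFFFFF init and final XOR -- the same parameters
    # as iSCSI/SCTP and the SSE4.2 crc32 instruction.
    crc ^= 0xFFFFFFFF
    for b in data:
        crc ^= b
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

# Standard check value for the Castagnoli polynomial:
assert crc32c(b"123456789") == 0xE3069283
```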
Upsides to switching to Castagnoli CRC:
- Faster CRC on some hardware
- Increased error detection
Upsides to switching to another improved CRC:
- (Possibly) Increased error detection over Castagnoli
Upsides to staying with current CRC:
- Decreased implementation size and complexity for software that also supports the Ogg format
...though supporting multiple generator polynomials in a typical software implementation isn't hard. --Gmaxwell 06:56, 29 May 2010 (UTC)
- If a CRC fails to detect an error in a message, then that same message wrapped in a header and footer and protected by the same CRC polynomial will also fail to detect that error. I think it should also be true that if the original message is split across several wrapper packets but the error falls entirely within one fragment, then the wrapper will still fail to detect the error. I think that's a strong case for avoiding 802.3's polynomial, but I'd suggest avoiding any popular polynomial, to extend the principle to other media with other polynomials. I think a case could also be made for ignoring the low-BER assumption in choosing a polynomial, because that should already be handled by layers closer to the media. A pure hash is all that is needed here. In fact, do we just want a fast hash instead of a CRC? --Gumboot 09:09, 17 October 2010 (UTC)
- If there is a movement towards a non-CRC hash, please be careful to select something SIMDable. --Gumboot 10:59, 17 October 2010 (UTC)
Arrange checksum order for fast page edits of essential data
IP allows fast editing of packets by using a checksum that isn't order-sensitive: you can subtract the old value and add the new value of any given field directly to the checksum without having to re-sum the whole packet. CRC allows the same thing using EOR, but the modification to the checksum depends on both the data and its distance from the end of the checked block. If data which is likely to be edited in a repetitive way over many pages is stored at a fixed distance from the end of the checksummed block, then the checksum delta can be pre-computed and applied directly without recalculation.
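The precomputed-delta idea can be sketched as follows. CRC-32C is used here as the example polynomial, and the 64-byte block layout and edit offsets are made up for illustration; the point is only that, because the CRC is affine, the delta depends on the XOR pattern and its distance from the end of the block and not on the surrounding data:

```python
def crc32c(data, crc=0):
    # Bitwise CRC-32C reference (reflected poly 0x82F63B78).
    crc ^= 0xFFFFFFFF
    for b in data:
        crc ^= b
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def edit_delta(xor_pattern, tail_len):
    # CRC delta for XORing `xor_pattern` into the data `tail_len` bytes
    # before the end of the checked block.  Subtracting the CRC of the
    # all-zero message of the same length isolates the linear part, so
    # the result is independent of the surrounding bytes.
    padded = xor_pattern + bytes(tail_len)
    return crc32c(padded) ^ crc32c(bytes(len(padded)))

old = bytes(range(64))            # hypothetical 64-byte checked block
new = bytearray(old)
new[40:44] = b"\xde\xad\xbe\xef"  # edit 4 bytes, 20 bytes before the end
pattern = bytes(a ^ b for a, b in zip(old[40:44], new[40:44]))
delta = edit_delta(pattern, tail_len=64 - 44)

# The new CRC follows from the old one without re-summing the block:
assert crc32c(bytes(new)) == crc32c(old) ^ delta
```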
At least two ideas have been discussed so far:
- checksum payload, then variable-length header content, then fixed-length header
- checksum different components separately and EOR them together in an order-insensitive way, or EOR them with length-agnostic perturbations which preserve the useful error-detection features of a CRC.
Note that some CRC libraries and accelerators will have overheads that restrict their ability to accumulate data in a random order. They're likely to prefer well-aligned data and long runs of conventionally-ordered data. An ideal solution wouldn't interfere with their operation unnecessarily. --Gumboot 12:14, 4 June 2010 (UTC)
- I guess we should figure out what fields people would want to change cheaply. The lacing, for example, isn't going to get changed without changing the data. I think the obvious cases include the DTS (among other things) and that's currently a variable length field. --Gmaxwell 15:06, 4 June 2010 (UTC)
- All I had in mind was the stream ID. In fact, perhaps making it easily editable would be preferable to making it large enough that it rarely needs to be changed.
- Of course this is much easier for changing fields from one constant to another. If various bits change in different ways from page to page then you need a larger table of twiddles to mash into the CRC.
- If it's still a desirable feature, then you can use a set of tables to edit each bit in the relevant field, or some collection of those tables representing the adjustment for various distances from the end of the checksum, and the only requirement would be that the possible distances can be constrained. --Gumboot 22:35, 4 June 2010 (UTC)
- Actually it is desirable, even if it is just for the stream ID. You'll win over no embedded programmers with the promise of rarely doing a lot of work rather than always doing a little, and if stream ID rewrites are only necessary very infrequently, you'll never guarantee that all software performs them when they are. --Gumboot 09:14, 22 September 2010 (UTC)
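The per-bit table idea above can be sketched like this. Again CRC-32C stands in for whatever polynomial is chosen, and the 32-byte block with a 4-byte stream ID 8 bytes before its end is a made-up layout; by linearity of the CRC, flipping bit i of a field at a fixed distance from the end of the checked block always XORs the same precomputed value into the checksum:

```python
def crc32c(data, crc=0):
    # Bitwise CRC-32C reference (reflected poly 0x82F63B78).
    crc ^= 0xFFFFFFFF
    for b in data:
        crc ^= b
        for _ in range(8):
            crc = (crc >> 1) ^ (0x82F63B78 if crc & 1 else 0)
    return crc ^ 0xFFFFFFFF

def bit_twiddle_table(field_len, tail_len):
    # One precomputed delta per bit of the field: flipping bit i of a
    # field lying tail_len bytes before the end of the checked block
    # XORs table[i] into the CRC.
    zeros = crc32c(bytes(field_len + tail_len))
    table = []
    for i in range(field_len * 8):
        d = bytearray(field_len)
        d[i // 8] = 0x80 >> (i % 8)          # MSB-first bit numbering
        table.append(crc32c(bytes(d) + bytes(tail_len)) ^ zeros)
    return table

# Rewrite a hypothetical 4-byte stream ID, 8 bytes before the end of a
# 32-byte checked block, without re-summing the block:
page = bytearray(range(32))
table = bit_twiddle_table(field_len=4, tail_len=8)
old_id, new_id = bytes(page[20:24]), b"\x12\x34\x56\x78"
crc = crc32c(bytes(page))
for i in range(32):
    if ((old_id[i // 8] ^ new_id[i // 8]) >> (7 - i % 8)) & 1:
        crc ^= table[i]                      # apply the per-bit twiddles
page[20:24] = new_id
assert crc == crc32c(bytes(page))
```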
Several modern and upcoming file systems (e.g. BTRFS) support block-level de-duplication: partially duplicate files can share backing storage while remaining logically separate. For this to work the duplicated portions must have sufficient alignment, e.g. the FS can't discover duplicate blocks offset by one byte. Perhaps we should consider some facility to insert a small amount of padding so that recovery points are aligned to some increment (it looks like 4k is sufficient). This way partially duplicate media files (e.g. editing intermediaries) could be effectively deduped. Aligning recovery points would also have positive error-robustness properties on block-oriented media (e.g. if data is destroyed a sector at a time, you want your coding units to cover a minimum number of partial sectors). --Gmaxwell 17:55, 6 January 2011 (UTC)
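The padding arithmetic for this is trivial; a sketch, assuming a 4096-byte granularity (the increment is not part of any spec yet):

```python
ALIGN = 4096  # assumed dedup/sector granularity; illustrative only

def pad_before_recovery_point(offset):
    # Padding bytes to insert so that the next recovery point starts
    # on an ALIGN-byte boundary.
    return -offset % ALIGN

assert pad_before_recovery_point(4096) == 0
assert pad_before_recovery_point(4000) == 96
assert pad_before_recovery_point(4097) == 4095
```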