cbloom rants

12/17/2014

12-17-14 - PVQ Vector Distribution Note

So, PVQ, as in "Pyramid VQ" is just a way of making a VQ codebook for a unit vector that has a certain probability model (each value is laplacian (the same laplacian) and independent).

You have a bunch of values, you send the length separately so you're left with a unit vector. Independent values are not equally distributed on the unit sphere, so you don't want a normal unit vector quantizer, you want this one that is scaled squares.

Okay, that's all fine, but the application to images is problematic.

In images, we have these various AC residuals. Let's assume 8x8 blocks for now for clarity. In each block, you have ACij (ij in 0-7 and AC00 excluded). The ij are frequency coordinates for each coefficient. You also have spatial neighbors in the adjacent blocks.

The problem is that the AC's are not independent - their not independent either in frequency or spatial coordinates. AC42 is strongly correlated to AC21 and also to AC42 in the neighboring blocks. They also don't have the same distribution; lower freqency-index AC's have much higher means.

In order to use Pyramid VQ, we need to find a grouping of AC's into a vector, such that the values we are putting in that vector are as uncorrelated and equally distributed as possible.

One paper I found that I linked in the previous email forms a vector by taking all the coefficients in the same frequency slot in a spatial region (for each ij, take all ACij in the 4x4 neighorhood of blocks). This is appealing in the sense that it gathers AC's of the same frequency subband, so they have roughly the same distribution. The problem is there are strong spatial correlations.

In Daala they form "subbands" of the coefficients by grouping together chunks of ACs that are in similar frequency groups.

The reason why correlation is bad is that it makes the PVQ codebook not optimal. For correlated values you should have more codebook vectors with neighboring values similar. eg. more entries around {0, 2, 2, 0} and fewer around {2, 0, 0, 2}. The PVQ codebok assumes those are equally likely.

You can however, make up for this a bit with the way you encode the codebook index. It doesn't fix the quantizer, but it does extract the correlation.

In classical PVQ (P = Pyramid) you would simply form an index to the vector and send it with an equiprobable code. But in practice you might do it with binary subdivision or incremental enumeration schemes, and then you need not make all codebook vectors equiprobable.

For example in Daala one of the issues for the lower subbands is that the vectors that have signal in the low AC's are more probable than the high AC's. eg. for subband that spans AC10 - AC40 , {1,0,0,0} is much more likely than {0,0,0,1}.

Of course this becomes a big mess when you consider Predictive VQ, because the Householder transform scrambles everywhere up in a way that makes it hard to model these built-in skews. On the other hand, if the Predictive VQ removes enough of the correlation with neighbors and subband, then you are left with a roughly evenly distributed vector again which is what you want.

12/16/2014

12-16-14 - Daala PVQ Emails

First, I want to note that the PVQ Demo page has good links at the bottom with more details in them, worth reading.

Also, the Main Daala page has more links, including the "Intro to Video" series, which is rather more than an intro and is a good read. It's a broad survey of modern video coding.

Now, a big raw dump of emails between me, ryg, and JM Valin. I'm gonna try to color them to make it a bit easier to follow. Thusly :

cbloom
ryg
valin

And this all starts with me being not very clear on PVQ so the beginning is a little fuzzy.

I will be following this up with a "summary of PVQ as I now understand it" which is probably more useful for most people. So, read that, not this.

(also jebus the internet is rindoculuos. Can I have BBS's back? Like literally 1200 baud text is better than the fucking nightmare that the internet has become. And I wouldn't mind playing a little Trade Wars once a day...)


Hi, I'm the author of the latest Daala demo on PVQ on which you commented recently. Here's some comments on your comments. I wasn't able to submit this to your blog, but feel free to copy them there. > 1. I've long believed that blocks should be categorized into > "smooth", "detail" and "edge". For most of this discussion we're > going to ignore smooth and edge and just talk about detail. That is, > blocks that are not entirely smooth, and don't have a dominant edge > through them (perhaps because that edge was predicted). I agree here, although right now we're treating "smooth" and "detail" in the same way (smooth is just low detail). Do you see any reason to treat those separately? > 2. The most important thing in detail blocks is preserving the > amount of energy in the various frequency subbands. This is something > that I've talked about before in terms of perceptual metrics. This is exactly what I had in mind with this PVQ work. Before Daala, I worked on the CELT part of the Opus codec, which has strict preservation of the energy. In the case of Daala, it looks so far like we want to relax the energy constraint a little. Right now, the codebook has an energy-preserving structure, but the actual search is R/D optimized with the same weight given to the amount of energy and its location. It's pretty easy to change the code to give more weight to energy preservation. I could even show you how to play with it if you're interested. > 3. You can take a standard type of codec and optimize the encoding > towards this type of perceptual metric, and that helps a bit, but > it's the wrong way to go. Because you're still spending bits to > exactly specify the noise in the high frequency area. Correct, hence the gain-shape quantization in my post. > 4. What you really want is a joint quantizer of summed energy and > the distribution of that energy. At max bit rate you send all the > coefficients exactly. As you reduce bitrate, the sum is preserved > pretty well, but the distribution of the lower-right (highest > frequency) coefficients becomes lossy. As you reduce bit rate more, > the total sum is still pretty good and the overall distribution of > energy is mostly right, but you get more loss in where the energy is > going in the lower frequency subbands, and you also get more scalar > quantization of the lower frequency subbands, etc. Well, my experience with both Opus and CELT is that you want the same resolution for the energy as you use for the "details". That being said, having an explicit energy still means you can better preserve it in the quantization process (i.e. it won't increase or decrease too much due to the quantization). > 5. When the energy is unspecified, you'd like to restore in some nice > way. That is, don't just restore to the same quantization vector > every time ("center" of the quantization bucket), since that could > create patterns. I dunno. Maybe restore with some randomness; restore > based on prediction from the neighborhood; restore to maximum > likelihood? (ML based on neighborhood/prediction/image not just a > global ML) I experimented a little bit with adding noise at the correct energy and while it slightly improved the quality on still images, it wasn't clear how to apply it to video because then you have the problem of static vs dynamic noise. > 6. An idea I've tossed around for a while is a quadtree/wavelet-like > coding scheme. Take the 8x8 block of coefficients (and as always > exclude DC in some way). Send the sum of the whole thing. Divide into > four children. So now you have to send a (lossy) distribution of that > sum onto the 4 child slots. Go to the upper left (LL band) and do it > again, etc. I considered something along these lines, but it would not be easy to do because the lowest frequencies would kind of drown out the high frequencies. > 7. The more energy you have, the less important its exact > distribution, due to masking. As you have more energy to distribute, > the number of vectors you need goes up a lot, but the loss you can > tolerate also goes up. In terms of the bits to send a block, it > should still increase as a function of the energy level of that > block, but it should increase less quickly than naive > log2(distributions) would indicate. Yes, that is exactly what the PVQ companding does. > 8. Not all AC's are equally likely or equally perceptually important. > Specifically the vector codebook should contain more entries that > preserve values in the upper-left (low frequency) area. This is the equivalent of the quantization matrix, which PVQ has as well (though I didn't really talk about it). > 9. The interaction with prediction is ugly. (eg. I don't know how to > do it right). The nature of AC values after mocomp or > intra-prediction is not the same as the nature of AC's after just > transform (as in JPEG). Specifically, ideas like variance masking and > energy preservation apply to the transformed AC values, *not* to the > deltas that you typically see in video coding. Handling the prediction is exactly what the whole Householder reflection in PVQ is about (see the 6 steps figure). The PVQ gain encoding scheme is always done on the input and not on the prediction. So the activity masking is applied on the input energy and not based on the energy of the residual. > 10. You want to send the information about the AC in a useful order. > That is, the things you send first should be very strong classifiers > of the entropy of that block for coding purposes, and of the masking > properties for quantization purposes. Well, coding the energy first achieves most of this. > You don't want sending the "category" or "masking" information to be > separate side-band data. It should just be the first part of sending > the coefficients. So your category is maybe something like the > bit-vector of which coefficient groups have any non-zero > coefficients. Something like that which is not redundant with sending > them, it's just the first gross bit of information about their > distribution. Well, the masking information is tied to the gain. For now, the category information is only tied to the block size decision (based on the assumption that edges will be 4x4), but it's not ideal and it's something I'd like to improve. On the topic of lapped transform, this has indeed been causing us all sorts of headaches, but it also has interesting properties. Jury's still out on that one, but so far I think we've managed to make reasonably good use of it. Cheers, Jean-Marc

Thanks for writing! Before I address specific points, maybe you can teach me a bit about PVQ and how you use it? I can't find any good resources on the web (your abstract is rather terse). Maybe you can point me at some relevant reference material. (the CELT paper is rather terse too!) Are you constructing the PVQ vector from the various AC's within a single block? Or gathering the same subband from spatial neighbors? (I think the former, but I've seen the latter in papers) Assuming the former - Isn't it just wrong? The various AC's have different laplacian distributions (lower frequencies more likely) so using PVQ just doesn't seem right. In particular PVQ assumes all coefficients are equally likely and equally distributed. In your abstract you seem to describe a coding scheme which is not a uniform length codeword like traditional PVQ. It looks like it assigns shorter codes to vectors that have their values early on in some kind of z-scan order. How is K chosen?

Hi, On 02/12/14 08:52 PM, Charles Bloom wrote: > Thanks for writing! Before I address specific points, maybe you can > teach me a bit about PVQ and how you use it? I can't find any good > resources on the web (your abstract is rather terse). Maybe you can > point me at some relevant reference material. (the CELT paper is rather > terse too!) I'm currently writing a longer paper for a conference in February, but for now there isn't much more than the demo and the abstract I link to at the bottom. I have some notes that describe some of the maths, but it's a bit all over the place right now: http://jmvalin.ca/video/video_pvq.pdf > Are you constructing the PVQ vector from the various AC's within a > single block? Or gathering the same subband from spatial neighbors? (I > think the former, but I've seen the latter in papers) > > Assuming the former - Correct. You can see the grouping (bands) in Fig. 1 of: http://jmvalin.ca/video/spie_pvq_abstract.pdf > Isn't it just wrong? The various AC's have different laplacian > distributions (lower frequencies more likely) so using PVQ just doesn't > seem right. > > In particular PVQ assumes all coefficients are equally likely and > equally distributed. > > > In your abstract you seem to describe a coding scheme which is not a > uniform length codeword like traditional PVQ. It looks like it assigns > shorter codes to vectors that have their values early on in some kind of > z-scan order. One thing to keep in mind if that the P in PVQ now stands for "perceptual". In Daala we are no longer using the indexing scheme from CELT (which does assume identical distribution). Rather, we're using a coding scheme based on Laplace distribution of unequal variance. You can read more about the actual encoding process in another document: http://jmvalin.ca/video/pvq_encoding.pdf > How is K chosen? The math is described (poorly) in section 6.1 of http://jmvalin.ca/video/video_pvq.pdf Basically, the idea is to have the same resolution in the direction of the gain as in any other direction. In the no prediction case, it's roughly proportional to the gain times the square root of the number of dimensions. Because K only depends on values that are available to the decoder, we don't actually need to signal it. Hope this helps, Jean-Marc

Thanks for the responses and the early release papers, yeah I'm figuring most of it out. K is chosen so that distortion from the PVQ (P = Pyramid) quantization is the same as distortion from gain quantization. Presumably under a simple D metric like L2. The actual PVQ (P = Pyramid) part is the simplest and least ambiguous. The predictive stuff is complex. Let me make sure I understand this correctly - You never actually make a "residual" in the classic sense by subtracting the prediction off. You form the prediction in transformed space. (perhaps by having a motion vector, taking the pixels it points to and transforming them, dealing with lapping, yuck!) The gain of the current block is sent (for each subband). Not the gain of the delta. The gain of the prediction in the same band is used as coding context? (the delta of the quantized gains could be sent). The big win that you guys were after in sending the gain seems to have been the non-linear quantization levels; essentially you're getting "variance adaptive quantization" without explicitly sending per block quantizers. The Householder reflection is the way that vectors near the prediction are favored. This is the only way that the predicted block is used!? Madness! (presumably if the prediction had detail that was finer than the quantization level of the current block that could be used to restore within the quantization bucket; eg. for "golden frames")

On 03/12/14 12:18 AM, Charles Bloom wrote: > K is chosen so that distortion from the PVQ (P = Pyramid) quantization > is the same as distortion from gain quantization. Presumably under a > simple D metric like L2. Yes, it's an L2 metric, although since the gain is already warped, the distortion is implicitly weighted by the activity masking, which is exactly what we want. > You never actually make a "residual" in the classic sense by subtracting > the prediction off. Correct. > You form the prediction in transformed space. (perhaps by having a > motion vector, taking the pixels it points to and transforming them, > dealing with lapping, yuck!) We have the input image and we have a predicted image. We just transform both. Lapping doesn't actually cause any issues there (unlike many other places). As far as I can tell, this part is similar to what a wavelet coder would do. > The gain of the current block is sent (for each subband). Not the gain > of the delta. Correct. > The gain of the prediction in the same band is used as > coding context? (the delta of the quantized gains could be sent). Yes, the gain is delta-coded, so coding "same gain" is cheap. Especially, there's a special symbol for gain=0,theta=0, which means "skip this band and use prediction as is". > The big win that you guys were after in sending the gain seems to have > been the non-linear quantization levels; essentially you're getting > "variance adaptive quantization" without explicitly sending per block > quantizers. Exactly. Not only that but it's adaptive based on the variance of the current band, not just an entire macroblock. > The Householder reflection is the way that vectors near the prediction > are favored. This is the only way that the predicted block is used!? > Madness! Well, the reference is used to compute the reflection *and* the gain. In the end, we're using exactly the same amount of information, just in a different space. > (presumably if the prediction had detail that was finer than the > quantization level of the current block that could be used to restore > within the quantization bucket; eg. for "golden frames") Can you explain what you mean here? Jean-Marc

So one thing that strikes me is that at very low bit rate, it would be nice to go below K=1. In the large high-frequency subbands, the vector dimension N is very large, so even at K=1 it takes a lot of bits to specify where the energy should go. It would be nice to be more lossy with that location. It seems that for low K you're using a zero-runlength coder to send the distribution, with a kind of Z-scan order, which makes it very similar to standard MPEG. (maybe you guys aren't focusing on such low bit rates; when I looked at low bit rate video the K=1 case dominated) At 09:42 PM 12/2/2014, you wrote: > (presumably if the prediction had detail that was finer than the > quantization level of the current block that could be used to restore > within the quantization bucket; eg. for "golden frames") Can you explain what you mean here? If you happen to have a very high quality previous block (much better than your current quantizer / bit rate should give you) - with normal mocomp you can easily carry that block forward, and perhaps apply corrections to it, but the high detail of that block is preserved. With the PVQ scheme it's not obvious to me that that works. When you send the quantized gain of the subbands you're losing precision (it looks like you guys have a special fudge to fix this, by offsetting the gain based on the prediction's gain?) But for the VQ part, you can't really "carry forward" detail in the same way. I guess the reflection vector can be higher precision than the quantizer, so in a sense that preserves detail, but it doesn't carry forward the same values, because they drift due to rotation and staying a unit vector, etc.

Some more questions - Is the Householder reflection method also used for Intra prediction? (do you guys do the directional Intra like H26x ?) How much of this scheme is because you believe it's the best thing to do vs. you have to avoid H26x patents? If you're not sending any explicit per-block quantizer, it seems like that removes a lot of freedom for future encoders to do more sophisticated perceptual optimization. (ROI bit allocation or whatever)

On 03/12/14 02:17 PM, Charles Bloom wrote: > So one thing that strikes me is that at very low bit rate, it would be > nice to go below K=1. In the large high-frequency subbands, the vector > dimension N is very large, so even at K=1 it takes a lot of bits to > specify where the energy should go. It would be nice to be more lossy > with that location. Well, for large N, the first gain step already has K>1, which I believe is better than K=1. I've considered adding an extra gain step with K=1 or below, but never had anything that was really worth it (didn't try very hard). > It seems that for low K you're using a zero-runlength coder to send the > distribution, with a kind of Z-scan order, which makes it very similar > to standard MPEG. > > (maybe you guys aren't focusing on such low bit rates; when I looked at > low bit rate video the K=1 case dominated) We're also targeting low bit-rates, similar to H.265. We're not yet at our target level of performance though. > Is the Householder reflection method also used for Intra prediction? > (do you guys do the directional Intra like H26x ?) We also use it for intra prediction, though right now our intra prediction is very limited because of the lapped transform. Except for chroma which we predict from the luma. PVQ makes this particularly easy. We just use the unit vector from luma as chroma prediction and code the gain. > How much of this scheme is because you believe it's the best thing to > do vs. you have to avoid H26x patents? The original goal wasn't to avoid patents, but it's a nice added benefit. > If you're not sending any explicit per-block quantizer, it seems like > that removes a lot of freedom for future encoders to do more > sophisticated perceptual optimization. (ROI bit allocation or > whatever) We're still planning on adding some per-block/macroblock/something quantizers, but we just won't need them for activity masking. Cheers, Jean-Marc

Hi, Just read your "smooth blocks" post and I thought I'd mention on thing we do in Daala to improve the quality of smooth regions. It's called "Haar DC" and the idea is basically to apply a Haar transforms to all the DCs in a superblock. This has the advantage of getting us much better quantization resolution at large scales. Unfortunately, there's absolutely no documentation about it, so you'd have to look at the source code, mostly in od_quantize_haar_dc() and a bit of od_compute_dcts() http://git.xiph.org/?p=daala.git;a=blob;f=src/encode.c;h=879dda;hb=HEAD Cheers, Jean-Marc

Yeah I definitely can't follow that code without digging into it. But this : "much better quantization resolution at large scales." is interesting. When I did the DLI test : http://cbloomrants.blogspot.com/2014/08/08-31-14-dli-image-compression.html something I noticed in both JPEG and DLI (and in everything else, I'm sure) is : Because everyone just does naive scalar quantization on DC's, large regions of solid color will shift in a way that is very visible. That is, it's a very bad perceptual RD allocation. Some bits should be taken away from AC detail and put into making that large region DC color more precise. The problem is that DC scalar quantization assumes the blocks are independent and random and so on. It models the distortion of each block as being independent, etc. But it's not. If you have the right scalar quantizer for the DC when the blocks are in a region of high variation (lots of different DC's) then that is much too large a quantizer for regions where blocks all have roughly the same DC. This is true even when there is a decent amount of AC energy, eg. the image I noticed it in was the "Porsche640" test image posted on that page - the greens of the bushes all color shift in a very bad way. The leaf detail does not mask this kind of perceptual error.

Two more questions - 1. Do you use a quantization matrix (ala JPEG CSF or whatever) ? If so, how does that work with gain preservation and the Pyramid VQ unit vector? 2. Do you mind if I post all these mails publicly?

On 11/12/14 02:03 PM, Charles Bloom wrote: > 1. Do you use a quantization matrix (ala JPEG CSF or whatever) ? If so, > how does that work with gain preservation and the Pyramid VQ unit vector? Right now, we just set a different quantizer value for each "band", so we can't change resolution on a coefficient-by-coefficient basis, but it still looks like a good enough approximation. If needed we might try doing something fancier at some point. > 2. Do you mind if I post all these mails publicly? I have no problem with that and in fact I encourage you to do so. Cheers, Jean-Marc


ryg: Don't wanna post this to your blog because it's a long comment and will probably fail Blogger's size limit. Re "3 1/2. The normal zig-zag coding schemes we use are really bad." Don't agree here about zig-zag being the problem. Doesn't it just boil down to what model you use for the run lengths? Classic JPEG/MPEG style coding rules (H.264 and later are somewhat different) 1. assume short runs are more probable than long ones and 2. give a really cheap way to end blocks early. The result is that the coder likes blocks with a fairly dense cluster in the first few coded components (and only this is where zig-zag comes in) and truncated past that point. Now take Fischer-style PVQ (original paper is behind a paywall, but this: http://www.nul.com/pbody17.pdf covers what seems to be the proposed coding scheme). You have two parameters, N and K. N is the dimensionality of the data you're coding (this is a constant at the block syntax level and not coded) and K is the number of unit pulses (=your "energy"). You code K and then send an integer (with a uniform model!) that says which of all possible arrangements of K unit pulses across N dimensions you mean. For 16-bit ACs in a 8x8 block so N=63, there's on the order of 2^(63*16) = 2^1008 different values you could theoretically code, so clearly for large K this integer denoting the configuration can get quite huge. Anyway, suppose that K=1 (easiest case). Then the "configuration number" will tell us where the pulse goes and what sign it has, uniformly coded. That's essentially a run length with *uniform* distribution plus sign. K=2: we have two pulses. There's N*2 ways to code +-2 in one AC and the rest zeros (code AC index, code sign), and (N choose 2) * 2^2 ways to code two slots at +-1 each. And so forth for higher K. From there, we can extrapolate what the general case looks like. I think the overall structure ends up being isomorphic to this: 1. You code the number M (<=N) of nonzero coefficients using a model derived from the combinatorics given N and K (purely counting-based). (K=1 implies M=1, so nothing to code in that case.) 2. Code the M sign bits. 3. Code the positions of the M nonzero coeffs - (N choose M) options here. 4. Code another number denoting how we split the K pulses among the M coeffs - that's an integer partition of K into exactly M parts, not sure if there's a nice name/formula for that. This is close enough to the structure of existing AC entropy coders that we can meaningfully talk about the differences. 1) and 2) are bog-standard (we use a different model knowing K than a regular codec that doesn't know K would, but that's it). You can view 3) in terms of significance masks, and the probabilities have a reasonably simple form (I think you can adapt the Reservoir sampling algorithm to generate them) - or, by looking at the zero runs, in term of run lengths. And 4) is a magnitude coder constrained by knowing the final sum of everything. So the big difference is that we know K at the start, which influences our choice of models forthwith. But it's not actually changing the internal structure that much! That said, I don't think they're actually doing Fischer-style PVQ of "just send a uniform code". The advantage of breaking it down like above is that you have separate syntax elements that you can apply additional modeling on separately. Just having a giant integer flat code is not only massively unwieldy, it's also a bit of a dead end as far as further modeling is concerned. -Fabian

cbloom: At 01:36 PM 12/2/2014, you wrote: Don't wanna post this to your blog because it's a long comment and will probably fail Blogger's size limit. Re "3 1/2. The normal zig-zag coding schemes we use are really bad." Don't agree here about zig-zag being the problem. Doesn't it just boil down to what model you use for the run lengths? My belief is that for R/D optimization, it's bad when there's a big R step that doesn't correspond to a big D step. You want the prices of things to be "fair". So the problem is cases like : XX00 X01 0 vs XX00 X001 000 00 0 which is not a very big D change at all, but is a very big R step. I think it's easy to see that even keeping something equivalent to the zigzag, you could change it so that the position of the next coded value is sent in a way such that the rates better match entropy and distortion. But of course, really what you want is to send the positions of those later values in a lossy way. Even keeping something zigzagish you can imagine easy ways to do it, like you send a zigzag RLE that's something like {1,2,3-4,5-7,8-13} whatever.

ryg: Actually the Fischer (magnitude enumeration) construction corresponds pretty much directly to a direct coder: from the IEEE paper, l = dim, k = number of pulses, then number of code words N(l,k) is N(l,k) = sum_{i=-k}^k N(l-1, k-|i|) This is really direct: N(l,k) just loops over all possible values i for the first AC coeff. The remaining uncoded ACs then are l-1 dimensional and <= k-|i|. Divide through by N(l,k) and you have a probability distribution for coding a single AC coeff. Splitting out the i=0 case and sign, we get: N(l,k) = N(l-1,k) + 2 * sum_{j=1}^k N(l-1,k-j) =: N(l-1,k) + 2 * S(l-1,k) which corresponds 1:1 to this encoder: // While energy (k) left for (i = 0; k > 0; i++) { { assert(i < Ndims); // shouldn't get to N with leftover energy int l = N - i; // remaining dims int coeff = coeffs[i]; // encode significance code_binary(coeff == 0, N(l-1,k) / N(l,k)); if (coeff != 0) { // encode sign code_binary(coeff < 0, 0.5); int mag = abs(coeff); // encode magnitude (multi-symbol) // prob(mag=j) = N(l-1,k-j) / S(l-1, k) // then: k -= mag; } } and this is probably how you'd want to implement it given an arithmetic back end anyway. Factoring it into multiple decisions is much more convenient (and as said before, easier to do secondary modeling on) than the whole "one giant bigint" mess you get if you're not low-dimensional. Having the high-dimensional crap in there blows because the probabilities can get crazy. Certainly Ndims=63 would suck to work with directly. Separately, I'd expect that for k "large" (k >= Ndims? Ndims*4? More? Less?) you can use a simpler coder and/or fairly inaccurate probabilities because that's gonna be infrequent. Maybe given k = AC_sum(1,63) = sum_{i=1}^63 |coeff_i|, there's a reasonably nice way to figure out say AC_sum(1,32) and AC_sum(33,63). And if you can do that once, you can do it more than once. Kind of a top-down approach: you start with "I have k energy for this block" and first figure out which subband groups that energy goes into. Then you do the "detail" encode like above within each subband of maybe 8-16 coeffs; with l<=Ndim<=8 and k<=Ndim*small, you would have reasonable (practical) model sizes. -Fabian

cbloom: No, I don't think that's right. The N recursion is just for counting the number of codewords, it doesn't imply a coding scheme. It explicitly says that the pyramid vector index is coded with a fixed length word, using ceil( N ) bits. Your coding scheme is variable length. I need to find the original Fisher paper because this isn't making sense to me. The AC's aren't equally probable and don't have the same Laplacian distribution so PVQ just seems wrong. I did find this paper ("Robust image and video coding with pyramid vector quantisation") which uses PVQ and is making the vectors not from within the same block, but within the same *subband* in different spatial locations. eg. gathering all the AC20's from lots of neigboring blocks. That does make sense to me but I'm not sure if that's what everyone means when they talk about PVQ ? (paper attached to next email)

ryg: On 12/2/2014 5:46 PM, Charles Bloom {RAD} wrote: No, I don't think that's right. The N recursion is just for counting the number of codewords, it doesn't imply a coding scheme. It explicitly says that the pyramid vector index is coded with a fixed length word, using ceil( N ) bits. Your coding scheme is variable length. I wasn't stating that Fischer's scheme is variable-length; I was stating that the decomposition as given implies a corresponding way to encode it that is equivalent (in the sense of exact same cost). It's not variable length. It's variable number of symbols but the output length is always the same (provided you use an exact multi-precision arithmetic coder that is, otherwise it can end up larger due to round-off error). log2(N(l,k)) is the number of bits we need to spend to encode which one out of N(l,k) equiprobable codewords we use. The ceil(log2(N)) is what you get when you say "fuck it" and just round it to an integral number of bits, but clearly that's not required. So suppose we're coding to the exact target rate using a bignum rationals and an exact arithmetic coder. Say I have a permutation of 3 values and want to encode which one it is. I can come up with a canonical enumeration (doesn't matter which) and send an index stating which one of the 6 candidates it is, in log2(6) bits. I can send one bit stating whether it's an even or odd permutation, which partitions my 6 cases into 2 disjoint subsets of 3 cases each, and then send log2(3) bits to encode which of the even/odd permutations I am, for a total of log2(2) + log2(3) = log2(6) bits. Or I can get fancier. In the general case, I can (arbitrarily!) partition my N values into disjoint subsets with k_1, k_2, ..., k_m elements, respectively, sum_i k_i = N. To code a number, I then first code the number of the subset it's in (using probability p_i = k_i/N) and then send a uniform integer denoting which element it is, in log2(k_i) bits. Say I want to encode some number x, and it falls into subset j. Then I will spend -log2(p_i) + log2(k_i) = -log2(k_i / N) + log2(k_i) = log2(N / k_i) + log2(k_i) = log2(N) bits (surprise... not). I'm just partitioning my uniform distribution into several distributions over smaller sets, always setting probabilities exactly according to the number of "leaves" (=final coded values) below that part of the subtree, so that the product along each path is still a uniform distribution. I can nest that process of course, and it's easy to do so in some trees but not others meaning I get non-uniform path lengths, but at no point am I changing the size of the output bitstream. That's exactly what I did in the "coder" given below. What's the value of the first AC coefficient? It must obey -k <= ac_0 <= k per definition of k, and I'm using that to partition our codebook C into 2k+1 disjoint subsets, namely C_x = { c in C | ac0(c) = x } and nicely enough, by the unit-pulse definition that leads to the enumeration formula, each of the C_x corresponds to another PVQ codebook, namely with dimension l-1 and energy k-|x|. Which implies the whole thing decomposes into "send x and then do a PVQ encode of the rest", i.e. the loop I gave. That said, one important point that I didn't cover in my original mail: from the purposes of coding this is really quite similar to a regular AC coder, but of course the values being coded don't mean the same thing. In a JPEG/MPEG style entropy coder, the values I'm emitting are raw ACs. PVQ works (for convenience) with code points on an integer lattice Z^N, but the actual AC coeffs coded aren't those lattice points, they're (gain(K) / len(lattice_point)) * lattice_point (len here being Euclidean and not 1-norm!). I need to find the original Fisher paper because this isn't making sense to me. The AC's aren't equally probable and don't have the same Laplacian distribution so PVQ just seems wrong. I did find this paper ("Robust image and video coding with pyramid vector quantisation") which uses PVQ and is making the vectors not from within the same block, but within the same *subband* in different spatial locations. eg. gathering all the AC20's from lots of neigboring blocks. That does make sense to me but I'm not sure if that's what everyone means when they talk about PVQ ? (paper attached to next email) The link to the extended abstract for the Daala scheme (which covers this) is on the Xiph demo page: http://jmvalin.ca/video/spie_pvq_abstract.pdf Page 2 has the assignment of coeffs to subbands. They're only using a handful, and notably they treat 4x4 blocks as a single subband. -Fabian

cbloom: Ah yeah, you are correct of course. I didn't see how you had the probabilities in the coding. There are a lot of old papers I can't get about how to do the PVQ enumeration in an efficient way. I'm a bit curious about what they do. But as I'm starting to understand it all a bit now, that just seems like the least difficult part of the problem. Basically the idea is something like - divide the block into subbands. Let's say the standard wavelet tree for concreteness - 01447777 23447777 55667777 55667777 8.. Send the sum in each subband ; this is the "gain" ; let's say g_s g_s is sent with some scalar quantizer (how do you choose q_s ?) (in Daala a non-linear quantizer is used) For each subband, scale the vector to an L1 length K_s (how do you choose K_s?) Quantize the vector to a PVQ lattice point; send the lattice index So PVQ (P = Pyramid) solves this problem of how to enumerate the distribution given the sum. But that's sort of the trivial part. The how do you send the subband gains, what is K, etc. is the hard part. Do the subband gains mask each other? Then there's the whole issue of PVQ where P = Predictive. This Householder reflection business. Am I correct in understanding that Daala doesn't subtract off the motion prediction and make a residual? The PVQ (P = predictive) scheme is used instead? That's quite amazing. And it seems that Daala sends the original gain, not the gain of the residual (and uses the gain of the prediction as context). The slides (reference #4) clear things up a bit.

ryg: On 12/2/2014 8:21 PM, Charles Bloom {RAD} wrote: Ah yeah, you are correct of course. I didn't see how you had the probabilities in the coding. There are a lot of old papers I can't get about how to do the PVQ enumeration in an efficient way. I'm a bit curious about what they do. Well, the one I linked to has a couple variants already. But it's pretty much besides the point. You can of course turn this into a giant combinatorical circle-jerk, but I don't see the use. For example (that's one of the things in the paper I linked to) if you're actually assigning indexes to values then yeah, the difference between assigning codewords in order { 0, -1, 1, -2, 2, ... } and { -k, -k+1, ..., -1, 0, 1, 2, ..., k } matters, but once you decompose it into several syntax elements most of that incidental complexity just disappears completely. But as I'm starting to understand it all a bit now, that just seems like the least difficult part of the problem. Yeah, agreed. Basically the idea is something like - divide the block into subbands. Let's say the standard wavelet tree for concreteness - 01447777 23447777 55667777 55667777 8.. Yup. Send the sum in each subband ; this is the "gain" ; let's say g_s No, the gain isn't sum (1-norm), it's the Euclidean (2-norm) length. If you used 1-norm you wouldn't deform the integer lattice, meaning you're still just a scalar quantizer, just one with a funky backend. E.g. in 2D, just sending k = q(|x| + |y|) (with q being a uniform scalar quantizer without dead zone for simplicity) and then coding where the pulses go is just using the same rectangular lattice as you would have if you were sending q(x), q(y) directly. (Once you add a dead zone that's not true any more; scalar favors a "+" shape around the origin whereas the 1-norm PVQ doesn't. But let's ignore that for now.) With an ideal vector quantizer you make the "buckets" (=Voronoi regions) approximately equally likely. For general arbitrary 2D points that means the usual hex lattice. The PVQ equivalent of that is the NPVQ pattern: https://people.xiph.org/~jm/daala/pvq_demo/quantizer4.png That's clearly suboptimal (not a real hex lattice at all), but it has the nice gain/shape-separation: the circles are all equal-gain. You unwrap each circle by normalizing the point in the 1-norm, and then sending the corresponding AC pulses. g_s is sent with some scalar quantizer (how do you choose q_s ?) (in Daala a non-linear quantizer is used) q_s would come from the rate control, as usual. g codes overall intensity. You would want that to be roughly perceptually uniform. And you're not sending g at all, you're sending K. CIELab gamma (which is ~perceptually uniform) is 3, i.e. linear->CIELab is pow(x, 1/3). The Daala gain compander uses, surprise, 1/3. This would make sense except for the part where the CIE gamma deals in *linear* values and Daala presumably works on a gamma-infested color space, because that's what you get. My theory is this: the thing they're companding is not g_s, but g_s^2, i.e. sum of squares of AC coeffs. That makes for a total companding curve of (g_s)^(2/3). Display gamma is ~2, perceptually uniform gamma is ~3, so this would be in the right range to actually work out. They're not doing a good job of describing this though! For each subband, scale the vector to an L1 length K_s (how do you choose K_s?) You don't. You have your companded g's. The companding warps the space so now we're in NPVQ land (the thing I sent the image URL for). The companded g is the radius of the circle you're actually on. But of course this is a quantizer so your choices of radius are discrete and limited. You look at circles with a radius in the right neighborhood (most obviously, just floor(g) and ceil(g), though you might want to widen the search if you're doing RDO). You find the closest lattice points on both circles (this is convex, so no risk of getting stuck in a local min). Choose whichever of the two circles is better. (All points *on* the same circle have the same cost, at least with vanilla PVQ. So the only RD trade-off you do is picking which circle.) K_s is the index of the circle you're on. The origin is K_s=0, the first real circle is K_s=1 (and has 2N points where N is your dimensionality), and so forth. Quantize the vector to a PVQ lattice point; send the lattice index Finding that is the convex search. So PVQ (P = Pyramid) solves this problem of how to enumerate the distribution given the sum. But that's sort of the trivial part. Well, that's the combinatorical part. The actual vector quantizer is the idea of warping the 1-norm diamonds into sensibly-spaced 2-norm circles. The regular structure enables the simplified search. The how do you send the subband gains, what is K, etc. is the hard part. Do the subband gains mask each other? Not sure if they're doing any additional masking beyond that. If they do, they're not talking about it. Then there's the whole issue of PVQ where P = Predictive. This Householder reflection business. Am I correct in understanding that Daala doesn't subtract off the motion prediction and make a residual? The PVQ (P = predictive) scheme is used instead? That's quite amazing. And it seems that Daala sends the original gain, not the gain of the residual (and uses the gain of the prediction as context). As far as I can tell, yeah. And yes, definitely gain of the overall block, not of the residual! Again, you have the separation into gain and shape here. The gains are coded separately, and hence out of the equation. What remains is unit vectors for both your target block and your prediction. That means your points are on a sphere. You do a reflection that aligns your prediction vector with the 1st AC coefficient. This rotates (well, reflects...) everything around but your block is still a unit vector on a sphere. 1st AC will now contain block_gain * dot(block_unit_vec, prediction_unit_vec). You already know block_gain. They send the dot product (cosine of the angle, but speaking about this in terms of angles is just confusing IMO; it's a correlation coefficient, period). This tells you how good the prediction is. If it's 0.9, you've just removed 90% of the energy to code. You need to quantize this appropriately - you want to make sure the quantizer resolution here is reasonably matched to quantizer resolution of points on your sphere, or you're wasting bits. Now you turn whatever is left of g into K (as above). You can iterate this as necessary. If you do bi-prediction, you can do another Householder reflection to align the energy of pred2 that was orthogonal to pred1 (the rest is gone already!) with the 2nd AC. You code another correlation coefficient and then deal with the residuals. Fade-in / fade-out just kind of fall out when you do prediction like this. It's not a special thing. The "ACs" don't change, just the gains. Handling cross-fades with one predictor is still shitty, but if you're doing bipred they kinda fall out as well. It all sounds pretty cool. But I have no clue at all how well it works in practice or where it stands cost/benefit-wise. -Fabian

ryg: That means your points are on a sphere. You do a reflection that aligns your prediction vector with the 1st AC coefficient. This rotates (well, reflects...) everything around but your block is still a unit vector on a sphere. Important note for this and all that follows: For this to work as I described it, your block and the prediction need to be in the same space, which in this context has to be frequency (DCT) space (since that's what you eventually want to code with PVQ), so you need to DCT your reference block first. This combined with the reflections etc. make this pretty pricey, all things considered. If you weren't splitting by subbands, I believe you could finesse your way around this: (normalized) DCT and Householder reflections are both unitary, so they preserve both the L2 norm and dot products. Which means you could calculate both the overall gain and the correlation coeffs for your prediction *before* you do the DCT (and hence in the decoder, add that stuff back in post-IDCT, without having to DCT your reference). But with the subband splitting, that no longer works, at least not directly. You could still do it with a custom filter bank that just passes through precisely the DCT coeffs we're interested in for each subband, but eh, somehow I have my doubts that this is gonna be much more efficient than just eating the DCT. It would certainly add yet another complicated mess to the pile. -Fabian

cbloom: At 10:23 PM 12/2/2014, Fabian Giesen wrote: For this to work as I described it, your block and the prediction need to be in the same space, which in this context has to be frequency (DCT) space (since that's what you eventually want to code with PVQ), so you need to DCT your reference block first. This combined with the reflections etc. make this pretty pricey, all things considered. Yeah, I asked Valin about this. They form an entire predicted *image* rather than block-by-block because of lapping. They transform the predicted image the same way as the current frame. Each subband gain is sent as a delta from the predicted image subband gain. Crazy! His words : > You form the prediction in transformed space. (perhaps by having a > motion vector, taking the pixels it points to and transforming them, > dealing with lapping, yuck!) We have the input image and we have a predicted image. We just transform both. Lapping doesn't actually cause any issues there (unlike many other places). As far as I can tell, this part is similar to what a wavelet coder would do.

cbloom: At 09:31 PM 12/2/2014, you wrote: q_s would come from the rate control, as usual. Yeah, I just mean the details of that is actually one of the most important issues. eg. how does Q vary for the different subbands. Is there inter-subband masking, etc. In Daala the Q is non-linear (variance adaptive quantizer) g codes overall intensity. You would want that to be roughly perceptually uniform. And you're not sending g at all, you're sending K. In Daala they send g and derive K. CIELab gamma (which is ~perceptually uniform) is 3, i.e. linear->CIELab is pow(x, 1/3). The Daala gain compander uses, surprise, 1/3. This would make sense except for the part where the CIE gamma deals in *linear* values and Daala presumably works on a gamma-infested color space, because that's what you get. My theory is this: the thing they're companding is not g_s, but g_s^2, i.e. sum of squares of AC coeffs. That makes for a total companding curve of (g_s)^(2/3). Display gamma is ~2, perceptually uniform gamma is ~3, so this would be in the right range to actually work out. They're not doing a good job of describing this though! Err, yeah maybe. What they actually did was take the x264 VAQ and try to reproduce it. For each subband, scale the vector to an L1 length K_s (how do you choose K_s?) You don't. You have your companded g's. The companding warps the space so now we're in NPVQ land (the thing I sent the image URL for). The companded g is the radius of the circle you're actually on. But of course this is a quantizer so your choices of radius are discrete and limited. No, that's not right. K is effectively your "distribution" quantizer. It should be proportional to g in some way (or some power of g) but it's not just g. As the quantizer for g goes up, K goes down. In Daala they choose K such that the distortion due to PVQ is the same as the distortion due to gain scalar quantization. 1st AC will now contain block_gain * dot(block_unit_vec, prediction_unit_vec). You already know block_gain. They send the dot product (cosine of the angle, but speaking about this in terms of angles is just confusing IMO; it's a correlation coefficient, period). I think that in Daala they actually send the angle, not the cosine, which is important because of the non-linear quantization buckets. It's difficult for me to intuit what the Householder reflection is doing to the residuals. But I guess it doesn't matter much. It also all seems to fall apart a bit if the prediction is not very good. Then the gains might mismatch quite a bit, and even though you had some pixels that matched well, they will be scaled differently when normalized. It's a bit blah.

ryg: Yeah, I asked Valin about this. They form an entire predicted *image* rather than block-by-block because of lapping. That doesn't have anything to do with the lapping, I think - that's because they don't use regular block-based mocomp. At least their proposal was to mix overlapping-block MC and Control Grid Interpolation (CGI, essentially you specify a small mesh with texture coordinates and do per-pixel tex coord interpolation). There's no nice way to do this block-per-block in the first place, not with OBMC in the mix anyway; if you chop it up into tiles you end up doing a lot of work twice.

ryg: On 12/03/2014 10:26 AM, Charles Bloom {RAD} g codes overall intensity. You would want that to be roughly perceptually uniform. And you're not sending g at all, you're sending K. In Daala they send g and derive K. Ah, my bad. For each subband, scale the vector to an L1 length K_s (how do you choose K_s?) You don't. You have your companded g's. The companding warps the space so now we're in NPVQ land (the thing I sent the image URL for). The companded g is the radius of the circle you're actually on. But of course this is a quantizer so your choices of radius are discrete and limited. No, that's not right. K is effectively your "distribution" quantizer. It should be proportional to g in some way (or some power of g) but it's not just g. As the quantizer for g goes up, K goes down. In Daala they choose K such that the distortion due to PVQ is the same as the distortion due to gain scalar quantization. Ah OK, that makes sense. 1st AC will now contain block_gain * dot(block_unit_vec, prediction_unit_vec). You already know block_gain. They send the dot product (cosine of the angle, but speaking about this in terms of angles is just confusing IMO; it's a correlation coefficient, period). I think that in Daala they actually send the angle, not the cosine, which is important because of the non-linear quantization buckets. Didn't check how they send it. I do find thinking of this in terms of cross-correlation between block and pred a lot simpler than phrasing it in terms of angles. It's difficult for me to intuit what the Householder reflection is doing to the residuals. But I guess it doesn't matter much. The reflection itself doesn't do anything meaningful. Your normalized points were on a unit sphere before, and still are after. You're just spinning it around. It does mean that your coefficient numbering really loses all meaning. After one such reflection, you're already scrambled. Overall energy is still the same (because it's unitary) but the direction is completely different. Since PVQ already assumes that the directions are equiprobable (well, more or less, since the PVQ doesn't actually uniformly cover the sphere), they don't care. It also all seems to fall apart a bit if the prediction is not very good. Then the gains might mismatch quite a bit, and even though you had some pixels that matched well, they will be scaled differently when normalized. It's a bit blah. Well, it's just a different goal for the predictors. Regular motion search tries to minimize SAD or similar, as do the H.264 spatial predictors. For this kind of scheme you don't care about differences at all, instead you want to maximize the normalized correlation coeff between image and reference. (You want texture matches, not pixel matches.) -Fabian

cbloom: The other thing I note is that it doesn't seem very awesome at low bit rate. Their subband chunks are very large. Even at K=1 the N slots that could have that one value is very large, so sending the index of that one slot is a lot of bits. At that point, the way you model the zeros and the location of the 1 is the most important thing. What I'm getting at is a lossy way of sending that.

ryg: On 12/3/2014 1:04 PM, Charles Bloom {RAD} wrote: The other thing I note is that it doesn't seem very awesome at low bit rate. Their subband chunks are very large. Even at K=1 the N slots that could have that one value is very large, so sending the index of that one slot is a lot of bits. Yeah, the decision to send a subband *at all* means you have to code gain, theta and your AC index. For N=16 that's gonna be hard to get below 8 bits even for trivial signals. At which point you get a big jump in the RD curve, which is bad. Terriberry has a few slides that explain how they're doing inter-band activity masking currently: https://people.xiph.org/~tterribe/daala/pvq201404.pdf The example image is kind of terrible though. The "rose" dress (you'll see what I mean) is definitely better in the AM variant, but the rest is hard to tell for me unless I zoom in, which is cheating. At that point, the way you model the zeros and the location of the 1 is the most important thing. What I'm getting at is a lossy way of sending that. This is only really interesting at low K, where the PVQ codebook is relatively small. So, er, let's just throw this one in: suppose you're actually sending codebook indices. You just have a rate allocation function that tells you how many bits to send, independent of how big the codebook actually is. If you truly believe that preserving narrowband energy is more important than getting the direction right, then getting a random vector with the right energy envelope is better than nothing. Say K=1, Ndim=16. You have N=32 codewords, so a codebook index stored directly is 5 bits. Rate function says "you get 0 bits". So you don't send an index at all, and the decoder just takes codeword 0. Or rate function says "you get 2 bits" so you send two bits of the codebook index, and take the rest as zero. This is obviously biased. So the values you send aren't raw codebook indices. You have some random permutation function family p_x(i) : { 0, ..., N-1 } -> { 0, ..., N-1 } where x is a per-block value that both the encoder and decoder know (position or something), and what you send is not the codebook id but p_x(id). For any given block (subband, whatever), this doesn't help you at all. You either guess right or you guess wrong. But statistically, suppose you shaved 2 bits off the codebook IDs for 1000 blocks. Then you'd expect about 250 of these blocks to reconstruct the right ACs. For the rest, you reconstructed garbage ACs, but it's garbage with the right energy levels at least! :) No clue if this is actually a good idea at all. It definitely allows you to remove a lot of potholes from the RD curve. -Fabian

ryg: On 12/3/2014 1:48 PM, Fabian Giesen wrote: > [..] This is obviously biased. So the values you send aren't raw codebook indices. You have some random permutation function family p_x(i) : { 0, ..., N-1 } -> { 0, ..., N-1 } where x is a per-block value that both the encoder and decoder know (position or something), and what you send is not the codebook id but p_x(id). Now this is all assuming you either get the right code or you get garbage, and living with whichever one it is. You can also go in the other direction and try to get the direction at least mostly right. You can try to determine an ordering of the code book so that distortion more or less smoothly goes down as you add extra bits. (First bit tells you which hemisphere, that kind of thing.) That way, if you get 4 bits out of 5, it's not a 50:50 chance between right vector and some random other vector, it's either the right vector or another vector that's "close". (Really with K=1 and high dim it's always gonna be garbage, though, because you just don't have any other vector in the code book that's even close; this is more interesting at K=2 or up). This makes the per-block randomization (you want that to avoid systematic bias) harder, though. One approach that would work is to do a Householder reflection with a random vector (again hashed from position or similar). All that said, I don't believe in this at all. It's "solving" a problem by "reducing" it to a more difficult unsolved problem (in this case, "I want a VQ codebook that's close to optimal for embedded coding"). Of course, even if you do a bad job here, it's still not gonna be worse than the direct "random permutation" stuff. But I doubt it's gonna be appreciably better either, and it's definitely more complex. -Fabian

cbloom: At 02:17 PM 12/3/2014, Fabian Giesen wrote: That way, if you get 4 bits out of 5, it's not a 50:50 chance between right vector and some random other vector, it's either the right vector or another vector that's "close". Yes, this is the type of scheme I imagine. Sort of like a wavelet significant bit thing. As you send fewer bits the location gets coarser. The codebook for K=1 is pretty obvious. You're just sending a location; you want the top bits to grossly classify the AC and the bottom bits to distinguish neighbors (H neighbors for H-type AC's, and V neighbors for V-type AC's) For K=2 and up it's more complex. You could just train them and store them (major over-train risk) up to maybe K=3 but then you have to switch to an algorithmic method. Really the only missing piece for me is how you get the # of bits used to specify the locations. It takes too many bits to actually send it, so it has to be implicit from some other factors like the block Q and K and I'm not sure how to get that.

12/08/2014

12-08-14 - BPG

BPG is a nice packaging of HEVC (H265) I-frame compression for still images. He provides Windows command line tools with reasonable options (yay!), so I'm quite happy to test it.

It's pretty dang slow. Really slow. It's all covered by many patents. So I'm not sure how realistic it is as a useable format. Nonetheless, it's very useful as something to compare against.

I ran with default options (YCbCr in 420). I compared against JPEG_pdec, which as I previously noted JPEG pdec is very comparable to DLI . (JPEG_pdec = JPEG+packjpg+ my jpegdec (decblocker, etc)).

Conclusion :

BPG is really good. The best I've seen. It kills JPEG-pdec in RMSE, in fact I think it's the best RMSE performance I've seen despite being at a disadvantage (YCbCr and 420). Under the perceptual metrics (MS-SSIM-Y and "Combo") it doesn't win so strongly. That tells me there is probably room for better perceptual tuning of bit allocation and quantizers. But it's definitely strong.

Quick visual evaluation by me :

Porsche640 : BPG wins pretty hard here. Perhaps the most noticeable thing is much better detail preservation in the texture regions (the gravel and bushes). It also does a better job on the edges of the car, it doesn't smear them into nasty DCT block artifacts. You may download Porsche640 comparison images here (1 MB, RAR)

Moses : actually not a very big win here. It does much better at preserving the smooth gradient background (my current JPEGdec doesn't have any special modes for big smooth areas). Visually the main thing you'll notice is that the smooth gradients are nasty chunky steps with JPEG and are nice and smooth with BPG. Other than that, I actually think JPEG is better on Moses himself. Both make a big perceptual rate allocation mistake and put too many bits on the jacket texture and not enough on the human skin texture. But JPEG preserves more of the face texture; when you A/B compare it's clear that BPG is way over-smoothing his face. Particularly on the forehead and the neck fat. But all over really. Both BPG and JPEG make a classic mistake on Moses : they kill too much of the red and blue detail in the tie because it's in chroma.

The raw reports :


porsche640 :

imdiff RMSE_RGB
Built Aug 30 2014 11:30:27
r:\porsche640.bmp
r:\porsche640.bmp_bpg
r:\porsche640.bmp_jpg_pdec

Live Chart PlayGround

raw imdiff data : -2.31,-1.29,-0.42,-0.13,0.02,0.31,0.69,0.92,1.56,1.93|18.55,13.46,9.59,8.46,7.87,6.79,5.45,4.72,3.14,2.56|-2.26,-1.96,-1.44,-1.13,-0.89,-0.71,-0.44,-0.24,-0.04,0.08,0.21,0.35,0.54,0.76,1.06,1.55,2.54|22.22,19.98,16.74,14.86,13.60,12.73,11.49,10.62,9.82,9.35,8.81,8.30,7.63,6.87,5.91,4.57,3.07| fit imdiff data : -2.31,-1.29,-0.42,-0.13,0.02,0.31,0.69,0.92,1.56,1.93|4.64,5.17,5.69,5.87,5.96,6.16,6.43,6.60,7.02,7.20|-2.26,-1.96,-1.44,-1.13,-0.89,-0.71,-0.44,-0.24,-0.04,0.08,0.21,0.35,0.54,0.76,1.06,1.55,2.54|4.33,4.52,4.82,5.01,5.16,5.26,5.42,5.54,5.65,5.72,5.81,5.89,6.01,6.14,6.33,6.63,7.04|

r:\porsche640.bmp_bpgr:\porsche640.bmp_jpg_pdec
-2.310328bpg_test_000007742.bmp-2.256377jpeg_pdec_000008037.bmp
-1.290066bpg_test_000015703.bmp-1.956043jpeg_pdec_000009897.bmp
-0.422974bpg_test_000028642.bmp-1.443264jpeg_pdec_000014121.bmp
-0.129923bpg_test_000035093.bmp-1.125039jpeg_pdec_000017606.bmp
0.023995bpg_test_000039044.bmp-0.893289jpeg_pdec_000020674.bmp
0.310310bpg_test_000047615.bmp-0.710473jpeg_pdec_000023467.bmp
0.686454bpg_test_000061798.bmp-0.442294jpeg_pdec_000028261.bmp
0.920750bpg_test_000072695.bmp-0.235261jpeg_pdec_000032622.bmp
1.557974bpg_test_000113065.bmp-0.039573jpeg_pdec_000037361.bmp
1.930195bpg_test_000146345.bmp0.079876jpeg_pdec_000040586.bmp
0.213736jpeg_pdec_000044532.bmp
0.348876jpeg_pdec_000048905.bmp
0.535742jpeg_pdec_000055668.bmp
0.758401jpeg_pdec_000064958.bmp
1.063934jpeg_pdec_000080280.bmp
1.546419jpeg_pdec_000112163.bmp
2.541619jpeg_pdec_000223581.bmp
imdiff MS_SSIM_IW_Y
Built Aug 30 2014 11:30:27
r:\porsche640.bmp
r:\porsche640.bmp_bpg
r:\porsche640.bmp_jpg_pdec

Live Chart PlayGround

raw imdiff data : -2.31,-1.29,-0.42,-0.13,0.02,0.31,0.69,0.92,1.56,1.93|83.03,89.78,93.93,95.07,95.57,96.43,97.39,97.88,98.87,99.25|-2.26,-1.96,-1.44,-1.13,-0.89,-0.71,-0.44,-0.24,-0.04,0.08,0.21,0.35,0.54,0.76,1.06,1.55,2.54|79.88,82.47,86.76,89.12,90.67,91.72,93.14,94.07,94.84,95.30,95.74,96.18,96.72,97.27,97.95,98.70,99.40| fit imdiff data : -2.31,-1.29,-0.42,-0.13,0.02,0.31,0.69,0.92,1.56,1.93|4.23,5.06,5.73,5.96,6.07,6.29,6.57,6.73,7.18,7.41|-2.26,-1.96,-1.44,-1.13,-0.89,-0.71,-0.44,-0.24,-0.04,0.08,0.21,0.35,0.54,0.76,1.06,1.55,2.54|3.90,4.17,4.66,4.97,5.19,5.35,5.59,5.76,5.91,6.01,6.11,6.22,6.36,6.53,6.76,7.08,7.53|

r:\porsche640.bmp_bpgr:\porsche640.bmp_jpg_pdec
-2.310328bpg_test_000007742.bmp-2.256377jpeg_pdec_000008037.bmp
-1.290066bpg_test_000015703.bmp-1.956043jpeg_pdec_000009897.bmp
-0.422974bpg_test_000028642.bmp-1.443264jpeg_pdec_000014121.bmp
-0.129923bpg_test_000035093.bmp-1.125039jpeg_pdec_000017606.bmp
0.023995bpg_test_000039044.bmp-0.893289jpeg_pdec_000020674.bmp
0.310310bpg_test_000047615.bmp-0.710473jpeg_pdec_000023467.bmp
0.686454bpg_test_000061798.bmp-0.442294jpeg_pdec_000028261.bmp
0.920750bpg_test_000072695.bmp-0.235261jpeg_pdec_000032622.bmp
1.557974bpg_test_000113065.bmp-0.039573jpeg_pdec_000037361.bmp
1.930195bpg_test_000146345.bmp0.079876jpeg_pdec_000040586.bmp
0.213736jpeg_pdec_000044532.bmp
0.348876jpeg_pdec_000048905.bmp
0.535742jpeg_pdec_000055668.bmp
0.758401jpeg_pdec_000064958.bmp
1.063934jpeg_pdec_000080280.bmp
1.546419jpeg_pdec_000112163.bmp
2.541619jpeg_pdec_000223581.bmp
imdiff Combo
Built Aug 30 2014 11:30:27
r:\porsche640.bmp
r:\porsche640.bmp_bpg
r:\porsche640.bmp_jpg_pdec

Live Chart PlayGround

raw imdiff data : -2.31,-1.29,-0.42,-0.13,0.02,0.31,0.69,0.92,1.56,1.93|4.83,4.05,3.41,3.21,3.10,2.90,2.63,2.46,2.03,1.80|-2.26,-1.96,-1.44,-1.13,-0.89,-0.71,-0.44,-0.24,-0.04,0.08,0.21,0.35,0.54,0.76,1.06,1.55,2.54|5.30,4.97,4.44,4.11,3.88,3.72,3.48,3.31,3.16,3.06,2.96,2.86,2.72,2.56,2.35,2.05,1.74| fit imdiff data : -2.31,-1.29,-0.42,-0.13,0.02,0.31,0.69,0.92,1.56,1.93|4.19,4.99,5.65,5.86,5.98,6.18,6.45,6.62,7.05,7.28|-2.26,-1.96,-1.44,-1.13,-0.89,-0.71,-0.44,-0.24,-0.04,0.08,0.21,0.35,0.54,0.76,1.06,1.55,2.54|3.69,4.04,4.59,4.93,5.17,5.34,5.58,5.75,5.91,6.01,6.11,6.22,6.36,6.52,6.74,7.03,7.34|

r:\porsche640.bmp_bpgr:\porsche640.bmp_jpg_pdec
-2.310328bpg_test_000007742.bmp-2.256377jpeg_pdec_000008037.bmp
-1.290066bpg_test_000015703.bmp-1.956043jpeg_pdec_000009897.bmp
-0.422974bpg_test_000028642.bmp-1.443264jpeg_pdec_000014121.bmp
-0.129923bpg_test_000035093.bmp-1.125039jpeg_pdec_000017606.bmp
0.023995bpg_test_000039044.bmp-0.893289jpeg_pdec_000020674.bmp
0.310310bpg_test_000047615.bmp-0.710473jpeg_pdec_000023467.bmp
0.686454bpg_test_000061798.bmp-0.442294jpeg_pdec_000028261.bmp
0.920750bpg_test_000072695.bmp-0.235261jpeg_pdec_000032622.bmp
1.557974bpg_test_000113065.bmp-0.039573jpeg_pdec_000037361.bmp
1.930195bpg_test_000146345.bmp0.079876jpeg_pdec_000040586.bmp
0.213736jpeg_pdec_000044532.bmp
0.348876jpeg_pdec_000048905.bmp
0.535742jpeg_pdec_000055668.bmp
0.758401jpeg_pdec_000064958.bmp
1.063934jpeg_pdec_000080280.bmp
1.546419jpeg_pdec_000112163.bmp
2.541619jpeg_pdec_000223581.bmp


imdiff RMSE_RGB
Built Aug 30 2014 11:30:27
r:\PDI_1200.bmp
r:\PDI_1200.bmp_bpg
r:\PDI_1200.bmp_jpg_pdec

Live Chart PlayGround

raw imdiff data : -2.50,-1.64,-0.85,-0.58,-0.43,-0.16,0.22,0.46,1.11,1.51|16.50,12.13,8.70,7.79,7.27,6.38,5.31,4.79,3.76,3.43|-1.77,-1.47,-1.24,-1.06,-0.79,-0.58,-0.38,-0.26,-0.12,0.02,0.21,0.78|17.05,15.36,14.18,13.28,12.07,11.20,10.38,9.91,9.36,8.80,8.09,6.38| fit imdiff data : -2.50,-1.64,-0.85,-0.58,-0.43,-0.16,0.22,0.46,1.11,1.51|4.84,5.34,5.83,5.98,6.07,6.24,6.46,6.58,6.84,6.93|-1.77,-1.47,-1.24,-1.06,-0.79,-0.58,-0.38,-0.26,-0.12,0.02,0.21,0.78|4.79,4.96,5.09,5.19,5.34,5.46,5.57,5.64,5.72,5.81,5.93,6.24|

r:\PDI_1200.bmp_bpgr:\PDI_1200.bmp_jpg_pdec
-2.502747bpg_test_000020273.bmp-1.771229jpeg_pdec_000033661.bmp
-1.643425bpg_test_000036779.bmp-1.468118jpeg_pdec_000041531.bmp
-0.845949bpg_test_000063924.bmp-1.240727jpeg_pdec_000048621.bmp
-0.575801bpg_test_000077088.bmp-1.059914jpeg_pdec_000055113.bmp
-0.434878bpg_test_000084998.bmp-0.793755jpeg_pdec_000066279.bmp
-0.158926bpg_test_000102915.bmp-0.579211jpeg_pdec_000076906.bmp
0.219408bpg_test_000133773.bmp-0.382314jpeg_pdec_000088152.bmp
0.455907bpg_test_000157602.bmp-0.261965jpeg_pdec_000095821.bmp
1.113320bpg_test_000248578.bmp-0.123163jpeg_pdec_000105498.bmp
1.510508bpg_test_000327362.bmp0.019419jpeg_pdec_000116457.bmp
0.212218jpeg_pdec_000133108.bmp
0.777766jpeg_pdec_000196993.bmp
imdiff MS_SSIM_IW_Y
Built Aug 30 2014 11:30:27
r:\PDI_1200.bmp
r:\PDI_1200.bmp_bpg
r:\PDI_1200.bmp_jpg_pdec

Live Chart PlayGround

raw imdiff data : -2.50,-1.64,-0.85,-0.58,-0.43,-0.16,0.22,0.46,1.11,1.51|87.81,92.21,95.12,95.98,96.34,96.99,97.77,98.16,98.98,99.30|-1.77,-1.47,-1.24,-1.06,-0.79,-0.58,-0.38,-0.26,-0.12,0.02,0.21,0.78|89.84,91.44,92.46,93.22,94.23,94.94,95.53,95.87,96.23,96.60,97.04,98.08| fit imdiff data : -2.50,-1.64,-0.85,-0.58,-0.43,-0.16,0.22,0.46,1.11,1.51|4.80,5.43,5.97,6.17,6.26,6.44,6.69,6.84,7.24,7.45|-1.77,-1.47,-1.24,-1.06,-0.79,-0.58,-0.38,-0.26,-0.12,0.02,0.21,0.78|5.07,5.31,5.47,5.60,5.79,5.93,6.06,6.14,6.23,6.33,6.45,6.81|

r:\PDI_1200.bmp_bpgr:\PDI_1200.bmp_jpg_pdec
-2.502747bpg_test_000020273.bmp-1.771229jpeg_pdec_000033661.bmp
-1.643425bpg_test_000036779.bmp-1.468118jpeg_pdec_000041531.bmp
-0.845949bpg_test_000063924.bmp-1.240727jpeg_pdec_000048621.bmp
-0.575801bpg_test_000077088.bmp-1.059914jpeg_pdec_000055113.bmp
-0.434878bpg_test_000084998.bmp-0.793755jpeg_pdec_000066279.bmp
-0.158926bpg_test_000102915.bmp-0.579211jpeg_pdec_000076906.bmp
0.219408bpg_test_000133773.bmp-0.382314jpeg_pdec_000088152.bmp
0.455907bpg_test_000157602.bmp-0.261965jpeg_pdec_000095821.bmp
1.113320bpg_test_000248578.bmp-0.123163jpeg_pdec_000105498.bmp
1.510508bpg_test_000327362.bmp0.019419jpeg_pdec_000116457.bmp
0.212218jpeg_pdec_000133108.bmp
0.777766jpeg_pdec_000196993.bmp
imdiff Combo
Built Aug 30 2014 11:30:27
r:\PDI_1200.bmp
r:\PDI_1200.bmp_bpg
r:\PDI_1200.bmp_jpg_pdec

Live Chart PlayGround

raw imdiff data : -2.50,-1.64,-0.85,-0.58,-0.43,-0.16,0.22,0.46,1.11,1.51|4.51,3.84,3.27,3.08,2.98,2.80,2.55,2.40,2.02,1.81|-1.77,-1.47,-1.24,-1.06,-0.79,-0.58,-0.38,-0.26,-0.12,0.02,0.21,0.78|4.25,3.96,3.76,3.60,3.39,3.23,3.09,3.00,2.91,2.80,2.67,2.32| fit imdiff data : -2.50,-1.64,-0.85,-0.58,-0.43,-0.16,0.22,0.46,1.11,1.51|4.52,5.21,5.80,5.99,6.09,6.28,6.53,6.68,7.07,7.27|-1.77,-1.47,-1.24,-1.06,-0.79,-0.58,-0.38,-0.26,-0.12,0.02,0.21,0.78|4.79,5.08,5.29,5.46,5.67,5.84,5.98,6.07,6.17,6.28,6.41,6.76|

r:\PDI_1200.bmp_bpgr:\PDI_1200.bmp_jpg_pdec
-2.502747bpg_test_000020273.bmp-1.771229jpeg_pdec_000033661.bmp
-1.643425bpg_test_000036779.bmp-1.468118jpeg_pdec_000041531.bmp
-0.845949bpg_test_000063924.bmp-1.240727jpeg_pdec_000048621.bmp
-0.575801bpg_test_000077088.bmp-1.059914jpeg_pdec_000055113.bmp
-0.434878bpg_test_000084998.bmp-0.793755jpeg_pdec_000066279.bmp
-0.158926bpg_test_000102915.bmp-0.579211jpeg_pdec_000076906.bmp
0.219408bpg_test_000133773.bmp-0.382314jpeg_pdec_000088152.bmp
0.455907bpg_test_000157602.bmp-0.261965jpeg_pdec_000095821.bmp
1.113320bpg_test_000248578.bmp-0.123163jpeg_pdec_000105498.bmp
1.510508bpg_test_000327362.bmp0.019419jpeg_pdec_000116457.bmp
0.212218jpeg_pdec_000133108.bmp
0.777766jpeg_pdec_000196993.bmp


moses :

imdiff RMSE_RGB
Built Aug 30 2014 11:30:27
r:\moses.bmp
r:\moses.bmp_bpg
r:\moses.bmp_jpg_pdec

Live Chart PlayGround

raw imdiff data : -2.89,-1.97,-1.18,-0.90,-0.76,-0.48,-0.08,0.17,0.92,1.36|13.96,9.96,7.15,6.37,5.99,5.30,4.38,3.86,2.65,2.18|-2.10,-1.78,-1.56,-1.38,-1.12,-0.91,-0.72,-0.60,-0.46,-0.32,-0.11,0.49|13.20,11.59,10.38,9.65,8.60,7.86,7.18,6.83,6.44,6.06,5.58,4.40| fit imdiff data : -2.89,-1.97,-1.18,-0.90,-0.76,-0.48,-0.08,0.17,0.92,1.36|5.11,5.63,6.09,6.24,6.32,6.46,6.68,6.81,7.18,7.34|-2.10,-1.78,-1.56,-1.38,-1.12,-0.91,-0.72,-0.60,-0.46,-0.32,-0.11,0.49|5.20,5.41,5.57,5.68,5.84,5.97,6.09,6.15,6.23,6.30,6.40,6.68|

r:\moses.bmp_bpgr:\moses.bmp_jpg_pdec
-2.885454bpg_test_000037087.bmp-2.096290jpeg_pdec_000064089.bmp
-1.966356bpg_test_000070129.bmp-1.780268jpeg_pdec_000079784.bmp
-1.181866bpg_test_000120796.bmp-1.558392jpeg_pdec_000093048.bmp
-0.902603bpg_test_000146595.bmp-1.378797jpeg_pdec_000105383.bmp
-0.762217bpg_test_000161577.bmp-1.122379jpeg_pdec_000125881.bmp
-0.481048bpg_test_000196345.bmp-0.914342jpeg_pdec_000145407.bmp
-0.080435bpg_test_000259189.bmp-0.719298jpeg_pdec_000166456.bmp
0.174334bpg_test_000309250.bmp-0.598617jpeg_pdec_000180979.bmp
0.923476bpg_test_000519785.bmp-0.461917jpeg_pdec_000198966.bmp
1.358995bpg_test_000702956.bmp-0.315356jpeg_pdec_000220241.bmp
-0.108696jpeg_pdec_000254161.bmp
0.489100jpeg_pdec_000384648.bmp
imdiff MS_SSIM_IW_Y
Built Aug 30 2014 11:30:27
r:\moses.bmp
r:\moses.bmp_bpg
r:\moses.bmp_jpg_pdec

Live Chart PlayGround

raw imdiff data : -2.89,-1.97,-1.18,-0.90,-0.76,-0.48,-0.08,0.17,0.92,1.36|84.91,90.55,94.24,95.26,95.72,96.50,97.41,97.88,98.88,99.25|-2.10,-1.78,-1.56,-1.38,-1.12,-0.91,-0.72,-0.60,-0.46,-0.32,-0.11,0.49|87.54,89.76,91.23,92.22,93.54,94.44,95.20,95.62,96.02,96.43,96.93,98.06| fit imdiff data : -2.89,-1.97,-1.18,-0.90,-0.76,-0.48,-0.08,0.17,0.92,1.36|4.45,5.17,5.79,6.00,6.11,6.30,6.57,6.73,7.18,7.42|-2.10,-1.78,-1.56,-1.38,-1.12,-0.91,-0.72,-0.60,-0.46,-0.32,-0.11,0.49|4.76,5.06,5.27,5.43,5.66,5.83,5.99,6.08,6.18,6.29,6.42,6.80|

r:\moses.bmp_bpgr:\moses.bmp_jpg_pdec
-2.885454bpg_test_000037087.bmp-2.096290jpeg_pdec_000064089.bmp
-1.966356bpg_test_000070129.bmp-1.780268jpeg_pdec_000079784.bmp
-1.181866bpg_test_000120796.bmp-1.558392jpeg_pdec_000093048.bmp
-0.902603bpg_test_000146595.bmp-1.378797jpeg_pdec_000105383.bmp
-0.762217bpg_test_000161577.bmp-1.122379jpeg_pdec_000125881.bmp
-0.481048bpg_test_000196345.bmp-0.914342jpeg_pdec_000145407.bmp
-0.080435bpg_test_000259189.bmp-0.719298jpeg_pdec_000166456.bmp
0.174334bpg_test_000309250.bmp-0.598617jpeg_pdec_000180979.bmp
0.923476bpg_test_000519785.bmp-0.461917jpeg_pdec_000198966.bmp
1.358995bpg_test_000702956.bmp-0.315356jpeg_pdec_000220241.bmp
-0.108696jpeg_pdec_000254161.bmp
0.489100jpeg_pdec_000384648.bmp
imdiff Combo
Built Aug 30 2014 11:30:27
r:\moses.bmp
r:\moses.bmp_bpg
r:\moses.bmp_jpg_pdec

Live Chart PlayGround

raw imdiff data : -2.89,-1.97,-1.18,-0.90,-0.76,-0.48,-0.08,0.17,0.92,1.36|4.43,3.78,3.22,3.04,2.95,2.79,2.56,2.44,2.03,1.81|-2.10,-1.78,-1.56,-1.38,-1.12,-0.91,-0.72,-0.60,-0.46,-0.32,-0.11,0.49|4.19,3.89,3.65,3.51,3.29,3.13,2.98,2.90,2.81,2.72,2.59,2.26| fit imdiff data : -2.89,-1.97,-1.18,-0.90,-0.76,-0.48,-0.08,0.17,0.92,1.36|4.60,5.28,5.84,6.03,6.12,6.29,6.52,6.65,7.05,7.28|-2.10,-1.78,-1.56,-1.38,-1.12,-0.91,-0.72,-0.60,-0.46,-0.32,-0.11,0.49|4.85,5.16,5.41,5.56,5.78,5.94,6.09,6.18,6.27,6.36,6.49,6.83|

r:\moses.bmp_bpgr:\moses.bmp_jpg_pdec
-2.885454bpg_test_000037087.bmp-2.096290jpeg_pdec_000064089.bmp
-1.966356bpg_test_000070129.bmp-1.780268jpeg_pdec_000079784.bmp
-1.181866bpg_test_000120796.bmp-1.558392jpeg_pdec_000093048.bmp
-0.902603bpg_test_000146595.bmp-1.378797jpeg_pdec_000105383.bmp
-0.762217bpg_test_000161577.bmp-1.122379jpeg_pdec_000125881.bmp
-0.481048bpg_test_000196345.bmp-0.914342jpeg_pdec_000145407.bmp
-0.080435bpg_test_000259189.bmp-0.719298jpeg_pdec_000166456.bmp
0.174334bpg_test_000309250.bmp-0.598617jpeg_pdec_000180979.bmp
0.923476bpg_test_000519785.bmp-0.461917jpeg_pdec_000198966.bmp
1.358995bpg_test_000702956.bmp-0.315356jpeg_pdec_000220241.bmp
-0.108696jpeg_pdec_000254161.bmp
0.489100jpeg_pdec_000384648.bmp

old rants