Lossless compression algorithms reduce entropy

Do lossless compression algorithms reduce entropy?

According to Wikipedia:

Shannon's entropy measures the information contained in a message as opposed to the part of the message that is determined (or predictable). Examples of the latter are redundancy in the structure of the language or statistical properties relating to the frequency of occurrence of letter or word pairs, triplets, etc.

Entropy is therefore a measure of the amount of information contained in a message. Entropy encoders are used to losslessly compress such a message to the minimum number of bits required to represent it (entropy). To me it looks like a perfect entropy encoder is all that is needed to compress a message as losslessly as possible.

However, many compression algorithms use pre-entropy encoding steps to supposedly reduce the entropy of the message.

According to German Wikipedia

Entropy encoders are often combined with other encoders. Upstream processes serve to reduce the entropy of the data.

In English:

Entropy coders are often combined with other coders. Previous steps are designed to reduce the entropy of the data.

ie bzip2 uses the Burrows-Wheeler transform followed by a move-to-front transform before applying entropy coding (in this case Huffman coding).

Do these steps really lower the entropy of the message, which would mean reducing the amount of information contained in the message? This strikes me as a contradiction in terms, as it causes information to be lost during compression and prevents lossless decompression. Or are they just transforming the message to improve the efficiency of the entropy coding algorithm? Or does the entropy not directly correspond to the amount of information in the message?


Many casual descriptions of entropy are confusing this way because the entropy is not quite as neat as it is sometimes portrayed. In particular, the standard definition of Shannon entropy provides that it only applies if, as Wikipedia puts it, "information due to independent events is additive".

In other words, independent events must be statistical be independent. If not, then you need to find a representation of the data that defines events so that they are truly independent. Otherwise, you are overestimating the entropy.

To put it another way, the Shannon entropy only applies to true probability distributions and not to random processes in general. For specific examples of processes that do not conform to the assumptions of Shannon entropy, consider ...

Markov processes

A Markov process generates a series of events where the latest event is sampled from a distribution that depends on one or more previous events. Obviously, a large number of real phenomena are better modeled than Markov processes as discrete, independent probability distributions. For example: the text you are currently reading!

The naively calculated Shannon entropy rate of a Markov process is always greater than or equal to that actual Entropy rate of the process. To get the true entropy of the process, you need to consider the statistical dependency between events. In simple cases the formula for that looks like this:

H (S) = - ∑ichpich∑jpich (j) logpich (j)

This can also be represented like this:

H (Y) = - ∑ich jμichPich jLogPich j

Again, Wikipedia quotes here "is the asymptotic distribution of the chain" - that is the overall probability with which a certain event will occur over a long horizon

This is all a complicated way of saying that even if you can calculate the overall likelihood of a particular event, certain Sequences of events are more likely than others to be generated by a Markov process. For example, the following three English word sequences are becoming increasingly unlikely:

  • They ran to the tree
  • The tree ran to them
  • Tree they ran

But Shannon entropy rates all three strings as equally probable. The Markov process entropy takes the difference into account and therefore assigns a lower entropy rate to the process.

Entropy rates are model-dependent

If you zoom out far, you can see the big picture: the rate of entropy of a given sequence of events from an unknown source is model dependent. You assign a different rate of entropy to a particular set of events depending on how you model the process that generated them.

And very often your model of the process will not be entirely correct. This is not a simple problem or an easy one to solve. In fact, it is generally impossible to describe a sufficiently long and complex sequence of events real Assign entropy rate when you don't know what the real underlying process is. This is a central result of algorithmic information theory.

In practice, this means that given an unknown source of event sequences, different models will yield different entropies, and it is impossible to know which one is correct in the long run - although the one that assigns the lowest entropy is probably the best.

No, if the algorithm is lossless, no steps in the compression sequence can reduce its entropy - otherwise it could not be decompressed / decoded. However, the additional entropy can be stored in "out-of-band" information - for example in the list that must be maintained in order to decode the move-to-front transform.

They reduce that apparent Entropy inherent in the structure of the original message. In other words, they optimize the message to take advantage of the strengths of the next levels of compression.

A simple example would be to replace the name in the end tags of xml with a special symbol. You can perfectly restore the original XML file, but the compressor does not need to provide the full name again at this point.

A more realistic example is PNG compression. Its entropy compressor is DEFLATE, a combination of Lempel-Ziff and Huffman. This means that it works best with values ​​and patterns that are repetitive. Most of the neighboring pixels are usually similar colors. Each line is assigned a filter that converts the original pixel values ​​into a differential coding. In this way, the values ​​encoded by DEFLATE are usually close to 0. In the extreme case, this converts an even curve of all different values ​​into a single value in the entire line, with which the LZ part or DEFLATE works very quickly.

Entropy encoders do not compress the message to the minimum number of bits required to represent it. I know this is tempting to think, but it's not what they do. They are not magic and they cannot achieve that.

Instead, they do something less magical - but still useful. For now, let's say we knew that each character of the message was chosen regardless of a distribution. Then it would be possible to create a lossless compression algorithm that optimally compresses the messages. These algorithms are called entropy encoders.

Now, real news usually doesn't have this independence property. For example, if you see a Q, the next letter is likely a U. And so on. It is still possible to apply an entropy encoder algorithm to a real message in which not every character is selected independently of the rest. The algorithm is still lossless, can still be used for compression and in practice often shortens the length of the message. However, it is not shortened to the minimum possible length. It does not compress the message into something whose length matches the entropy of the message. it compresses it less than that.

Once you recognize this property of entropy encoders, the paradox evaporates.

In general, a lossless step never decreases the entropy of the message. However, the message may be put into a form in which a different compression algorithm is more effective, so it may still be useful in practice (on average).

The word "entropy" is often used loosely to refer to two different things:

  • The "total amount of information" in a message or system

  • The information density or how densely the information is packed.

OP's quote from the Wikipedia entry for https://en.wikipedia.org/wiki/Entropy_(information_theory) refers to the first:

But (at least as I write this) the same article starts with:

So one is an amount and one is a rate (similar to distance vs. speed). These are sometimes referred to as "extensive" and "intensive" properties (see https://en.wikipedia.org/wiki/Intensive_and_extensive_properties#Extensive_properties).

A classic example of this distinction is Paul Revere's famous lantern signal: "One on land, two on water". 1 bit total information (if we ignore the case "None, if I have not yet arrived at North Church"). If Paul were to put another row of lanterns in each window of the building, this would be superfluous: no further information, so the same "total" or "extensive" entropy; but much more message length, so much less "intense" entropy.

If it starts like this, but changes to use only one set of lanterns, it is "lossless compression" as in OP's question. The "extensive" entropy is the same, but the "intense" entropy is different: since the number of lanterns in the second window is highly correlated with the number of those seen in the first window, the redundant message is more predictable or less random much less intense entropy.

There are two other important things to remember:

  • First, we usually do not know the "true" entropy of a system in any way. A naive viewer does not know whether "3 lanterns" would be a different message or whether signals in different windows are redundant or not. If Paul makes his drive a habit, we can count and see if the windows always match. But maybe we didn't look long enough to see the rare (and probably important!) Exceptions.

  • Second, how you measure is important. Try to estimate how much of each successive text letter is transmitted (this is a rate, so "intense" entropy, sometimes also called "relative entropy"):

    • If you just notice that text is being sent in 8-bit units, your first "guess" might be 8 bits per letter.
    • If you count the number of different letters used, you would estimate log2 (26), or 4.7 bits per letter (a bit higher if you consider spaces, upper and lower case, etc.).
    • If you think "e" is a better choice for "next letter" than "z", measure the letter frequency and get around 4.14 (see http://people.seas.harvard.edu/~jones / cscie129 /). papers / stanford_info_paper / entropy_of_english_9.htm).
    • When you count pairs of letters you pick up patterns like "qu", "th", etc. and get about 3.56.
    • Counting sequences of up to 5 letters will give you even lower scores, and as a bonus, you will be able to tell fairly reliably what human language the text is in.
    • If you're as persistent and smart as NG Burton and JCR Licklider in "Long-Range Constraints in the Statistical Structure of Printed English" (American Journal of Psychology 68 (1955)), you can get up to 10 episodes, 0000 letters in one Row and find one more entropy value.

But of course messages can (and do) have many patterns that have not been modeled with such n-gram methods, so the "true" entropy is still lower.

If you model a theoretical infinite source with a perfectly random Zipfian distribution of tokens, you can calculate the large and intense entropy, which only depends on the number of possible different tokens. At [http://www.derose.net/steve/writings/dissertation/Diss.0.html] there are diagrams of how each type of entropy looks like with increasing number. The two behave very differently:

Hope that helps or is at least interesting ...

I suspect that the wording in the German Wikipedia is wrong. Compressors increase entropy. That is, not the total entropy, but the entropy per bit : the information density. For example, a run-length coding and dictionary scheme is used to condense the data. Now the same information is packed into fewer bits so that each bit contains more information. The Huffman coding below does a little more of the same; It's just another layer of compression.

We use cookies and other tracking technologies to improve your browsing experience on our website, to show you personalized content and targeted ads, to analyze our website traffic, and to understand where our visitors are coming from.

By continuing, you consent to our use of cookies and other tracking technologies and affirm you're at least 16 years old or have consent from a parent or guardian.

You can read details in our Cookie policy and Privacy policy.