A Beginner's Guide to Entropy - Cybrary

Recently on Cybrary OP3N, I read an article on Distributed Random Bit Generators. While the article was informative, I thought that readers unfamiliar with the concept of entropy might not be able to grasp some of its contents. I was thus inspired to write this complete beginner's guide to Shannon entropy, including a use case in binary analysis.

Entropy in Information Theory

Entropy is probably best known as an idea from thermodynamics, but the information theory definition of entropy was introduced by Claude Shannon in 1948 (and hence is also known as the Shannon entropy). It measures the unpredictability of information, typically from a data source.

The mathematical definition of the entropy of a sequence X, denoted H(X), is:

H(X) = -\frac{1}{n} \sum_{i=1}^{n} \log_b P(x_i)

where n is the length of the sequence, x_i is the i-th symbol in X, P(x_i) is the probability of that symbol, and b is the number of symbols in the source alphabet.

If you're familiar with probability, you'll note this is the expected value of the negative logarithm of P(x), also known as the expected value of the information I(X). A sequence of only one symbol repeated, such as “AAAAAAAA,” would have entropy 0.
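
Working that repeated-symbol example through the formula: in "AAAAAAAA" the only symbol is A, so P(A) = 1 and

H = -\frac{1}{8} \sum_{i=1}^{8} \log_2 1 = 0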

The maximum entropy is achieved when all symbols in the source alphabet (0 and 1 in the binary case) are equally likely, which is fairly intuitive. If any one symbol occurred more often than the others, we could use that fact to predict the next symbol with better accuracy than random guessing. But if all symbols are equally probable, then we cannot improve on random guessing.
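
To make the binary case concrete: if P(0) = P(1) = 1/2, then every term -\log_2 P(x_i) equals 1, so

H = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1

which is the largest value a two-symbol alphabet can reach.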

Although the general formula works for any b-ary alphabet, the convention when working with files is to calculate the entropy in bits per byte. This means that we take logarithms base 2 and sum over all 2^8 = 256 possible byte values (note that summing over all possible byte values is equivalent to summing over each character in the sequence; we can take this shortcut because we know the set of possible symbols in advance).

In practical terms, the entropy is calculated by counting the frequency of each of the 256 possible byte values and then computing:

H = -\sum_{i=0}^{255} P(x_i) \log_2 P(x_i)

where P(x_i) is the observed frequency of byte value i.
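
As a concrete illustration, here is a minimal Python sketch of that calculation (the file name sample.bin is only a placeholder for whatever you want to measure):

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    # Relative frequency of each byte value that actually occurs in the data
    freqs = [count / total for count in Counter(data).values()]
    return -sum(p * math.log2(p) for p in freqs)

print(shannon_entropy(b"AAAAAAAA"))   # zero entropy, as in the example above

# Entropy of an arbitrary file on disk (placeholder path)
with open("sample.bin", "rb") as f:
    print(shannon_entropy(f.read()))
```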

Compression and Encryption

Consider a string of English plaintext, such as:

We the People, of the United States, in Order to form a more perfect Union, establish Justice, insure domestic Tranquility, provide for the common defense, promote the general Welfare, and secure the Blessings of Liberty to ourselves and our Posterity, do ordain and establish this Constitution for the United States of America.

The entropy of this sequence, computed in the traditional fashion, is H = 4.2724.

Compressed using gzip (and shown here base64-encoded), this string looks like:

H4sIAAAAAAAA/1WQUU4DMQxEr+IDrLgD8IeQqASo3+lm0rWUxIvtVOrtiRcJxKcznvGbnEG+gU6QvWIhKcf42dmR6d2TwxbiTm+aoeRCRbRRoiYK2qEFq8e69IVgni6VbaOXYc4rwmljLmZpiBf60NS/Blf2+0K7yo0zIvK4ukpr0imjoId56k38B/CKDk2Vzqgl6RRTz2RYIz30pwoz7leLCq98gfo9cGWood5gh2FOdBJz6AGQp645zXoh/uH7xkbP0ieyD5/dfhH/fUycemwza00P3zplEK5IAQAA

Now, the entropy of the compressed string is H = 5.78974.

Note: Although the compression is difficult to see because the string is short and is displayed base64-encoded, gzip compressed it from 328 bytes to 222 bytes, a compression ratio of about 1.48 (roughly a 32% reduction in size).

Encrypted with AES-256 using the key “entropy” (and again shown base64-encoded), this string looks like:

m+qw2UL9JpS70dozoAAIDtKqfiU7RjGXkKoYEcU5m8QfbMAS0pija/ndZKRVug9rQr2DgP79ZFCUqt5pBeiNrTvovvRZmn1syXvpWUaUDxtVHHC+kyoe2i9tZALmiXQCMKZF9Idp87RLF9HFtfo/Dg31Aki8QKzgAeM5jYOMDn1DR5Vmi0epR2DjYSjK4phyN10E7iIKQNjmegm4T9ycSEgYL8fRxyLIGhSuQSvI6HyBIgEnlPPsYuSaiEOTAynGK8biRamcZ2adgVrmXiCI3vCsLH2mU98NfJAs7fK1LL8CPQ50T32NwfkFBnj2zwzNyCAhUpeCbH44UgvuRtURkA4G9s2qEgKx53ZCjRgXH9J8VXqYUR1jX4m5yAkHA3EpGEtEE0Q2IR8XesGlwTwykGJNNlJzM6TbKBqyY5ubWE2XceF8FQSaP4oJ/lkbwTjd/sHZG44MmpXE68SSl2Imtg==

The entropy of the encrypted string seen here is H = 5.90141.
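
If you want to reproduce this comparison, the sketch below (assuming the shannon_entropy function defined earlier is in scope) measures the preamble text, its gzip output, and an AES-256 encryption of it. The compressed and encrypted strings displayed above are base64-encoded, and base64 uses only 64 symbols, so its entropy cannot exceed about 6 bits per character; that is why the measured values sit just below 6 rather than near 8, and why the sketch measures the base64 encodings as well. It uses the third-party cryptography package with a randomly generated AES-256-GCM key rather than a key derived from the password "entropy", so the exact figures will differ slightly.

```python
import base64
import gzip
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

plaintext = (
    "We the People, of the United States, in Order to form a more perfect "
    "Union, establish Justice, insure domestic Tranquility, provide for the "
    "common defense, promote the general Welfare, and secure the Blessings "
    "of Liberty to ourselves and our Posterity, do ordain and establish "
    "this Constitution for the United States of America."
).encode()

print(shannon_entropy(plaintext))                     # roughly 4.27

compressed = gzip.compress(plaintext)
print(shannon_entropy(base64.b64encode(compressed)))  # a little under 6

key = AESGCM.generate_key(bit_length=256)             # random 256-bit key
nonce = os.urandom(12)
ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)
print(shannon_entropy(base64.b64encode(ciphertext)))  # also a little under 6
```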

Both compression and encryption increase the entropy of plaintext, because their output retains far fewer of the patterns found in natural language. However, there is an inherent tension between the two processes.

In high-level terms, encryption transforms data to appear as random as possible, while compression works by exploiting the repetitiveness and non-randomness of data. Therefore, if you encrypt data and then try to compress it, the data will be virtually incompressible (assuming the encryption has been performed correctly).
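
A quick way to see this in practice is the sketch below, which uses os.urandom output as a stand-in for well-encrypted data:

```python
import gzip
import os

encrypted_like = os.urandom(100_000)        # random bytes mimic good ciphertext
print(len(gzip.compress(encrypted_like)))   # ~100,000 bytes or more: no savings

repetitive = b"AAAAAAAA" * 12_500           # 100,000 highly redundant bytes
print(len(gzip.compress(repetitive)))       # collapses to a few hundred bytes
```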

A common misconception is that reversing the order, compressing and then encrypting, solves the problem. This does successfully reduce the size of the encrypted data, but it has issues that security researchers (Dan Boneh of Stanford, for one prominent example) have described at length, because the size of the compressed data leaks information about the plaintext. I will not go into detail about these attacks here, but if you're interested, read about the CRIME and BREACH exploits.

Use in Binary Analysis

Returning to the topic of calculating the entropy of a binary file, it's useful to find the upper limit of that value, so we know what range of numbers to expect. Again, the highest entropy is reached when every possible byte value is equally likely, i.e. P(x_i) = 1/256 for each i, which gives:

H_{max} = -\sum_{i=0}^{255} \frac{1}{256} \log_2 \frac{1}{256} = \log_2 256 = 8

With 256 possible byte values, then, the maximum Shannon entropy of a file is 8 bits per byte. What is the practical application of this fact? If the entropy of a binary file approaches that theoretical limit, we can be reasonably confident that its bytes have been manipulated through compression or encryption.

When I spoke to one malware analyst, he said his rule of thumb was that a file with entropy over 6 was likely to be compressed or encrypted, and anything over 7 almost certainly was. If the entropy is lower but the file is evidently not plaintext, it might be merely obfuscated. Applying a single-byte XOR mask, for example, does not alter the entropy of a file at all (think about why), yet it would still render our text from above unrecognizable.
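
Here is a small sketch of that point, reusing the plaintext and shannon_entropy from the earlier examples (the key 0x5C is arbitrary):

```python
key = 0x5C                                    # arbitrary single-byte XOR key
obfuscated = bytes(b ^ key for b in plaintext)

# XOR with a fixed byte maps each byte value to another one-to-one, so the
# frequency distribution (and therefore the entropy) is exactly the same.
print(shannon_entropy(plaintext))     # roughly 4.27
print(shannon_entropy(obfuscated))    # identical value, unreadable content
```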

Once we determine that a file is compressed or encrypted, the next step is to try to diagnose the scheme, particularly by checking whether the file's header matches any common compression format (assuming we do not have a file extension to go on). If the file turns out to be merely obfuscated rather than encrypted, de-obfuscation is a much simpler task than decryption, and a variety of classical cryptanalytic techniques apply.
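
As one possible starting point, here is a sketch that checks a file's first bytes against a few well-known compression and archive signatures (the file name suspicious.bin is only a placeholder):

```python
# Well-known "magic bytes" for common compression/archive formats
MAGIC_NUMBERS = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
    b"PK\x03\x04": "zip",
    b"7z\xbc\xaf\x27\x1c": "7z",
}

def guess_compression(data: bytes):
    """Return the format whose signature matches the start of the data, if any."""
    for magic, name in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return name
    return None

with open("suspicious.bin", "rb") as f:
    print(guess_compression(f.read(16)))
```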

One can also compare the entropy before and after compression or encryption to gauge how effective the algorithm is. If a compressed file still has relatively low entropy, it suggests the algorithm was not very good at removing the redundant content of the file.

Hopefully, this exercise has provided a fundamental understanding of Shannon entropy and hinted at its usefulness in analyzing binaries.

