Imagine that you are a crooked corporate manager, trying to convince your large financial firm's customers that they own a set of continually growing stocks, when in fact you blew the whole thing investing in math podcasts over a decade ago. You carefully create artificial monthly statements indicating made-up balances and profits, choosing numbers where each digit 1-9 appears as the leading digit about 1/9th of the time, so everything looks as random as real balances would. You are then shocked when the cops come and arrest you, telling you that the distribution of these leading digits is a key piece of evidence. In fact, due to a bizarre but accurate mathematical rule known as Benford's Law, the first digit should have been 1 about 30% of the time, with probabilities trailing off until 9s only appear about 5% of the time. How could this be? Could the random processes of reality actually favor some digits over others?
This surprising mathematical law was first discovered by American astronomer Simon Newcomb back in 1881, in a pre-automation era when performing advanced computations efficiently required a small book listing tables of logarithms. Newcomb noticed that in his logarithm book, the earlier pages, which covered numbers starting with 1, were much more worn than later ones. In 1938, physicist Frank Benford investigated this in more detail, which is why he got to put his name on the law. He looked at thousands of data sets as diverse as the surface areas of rivers, a large set of molecular weights, 104 physical constants, and all the numbers he could gather from an issue of Reader's Digest. He found the results remarkably consistent: a 1 would be the leading digit about 30% of the time, followed by 2 at about 18%, and gradually trailing down to about 5% each for 8 and 9.
While counterintuitive at first, Benford's Law actually makes a lot of sense if you look at a piece of logarithmic graph paper. You probably saw this kind of paper in high school physics class: it has a large interval between 1 and 2, with shrinking intervals as you get up to 9, and then the intervals grow large again at the start of the next order of magnitude. The idea is that this scale can represent very small and very large values on the same graph, by having the same amount of space represent ever larger intervals as the order of magnitude grows; it effectively transforms exponential intervals into linear ones. If you generate a data set that tends to vary evenly across orders of magnitude, its values will land at effectively random locations on this log scale, which means the probability of landing in a 1-2 interval is much larger than that of landing in a 2-3 interval, a 3-4 interval, and so on. In fact, the fraction of each decade that the interval from d to d+1 occupies on a log scale is log10(d+1) - log10(d), which for d = 1 works out to about 30%, exactly the proportion Benford observed.
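As a quick check on that log-scale argument, here is a short Python sketch computing how wide each digit's interval is on a base-10 logarithmic scale; these widths are precisely Benford's predicted probabilities:

```python
import math

# Width of the interval [d, d+1) on a base-10 log scale: this is
# Benford's predicted probability for leading digit d.
for d in range(1, 10):
    p = math.log10(d + 1) - math.log10(d)
    print(f"{d}: {p:.1%}")
```

Running this shows the digit 1 at about 30.1%, trailing down to about 4.6% for 9, matching the figures from Benford's surveys.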
Now you are probably thinking of the next logical question: why would a data set vary smoothly across several orders of magnitude? Actually, there are some very natural ways this could happen. One way is if you are choosing a bunch of totally arbitrary numbers generated from diverse sources, as in the Reader's Digest example, or the set of assorted physical constants. Another simple explanation is exponential growth. Take a look, for example, at the powers of 2: 2, 4, 8, 16, 32, 64, 128, etc. You can see that at each count of digits, you only pass through a few values before jumping to the next order of magnitude. And whenever a doubling carries you past a power of 10, the result must be less than twice that power of 10, so it has to begin with a 1. If you try writing out the first 20 or so powers of 2 and look at the first digits, you will see that we are already not too far off from Benford's Law, with 1s appearing most commonly in the lead.
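If you would rather not write them out by hand, a few lines of Python will tally the leading digits of those first 20 powers of 2:

```python
from collections import Counter

# Tally the leading digits of 2^1 through 2^20.
leads = Counter(str(2 ** n)[0] for n in range(1, 21))
for digit, count in sorted(leads.items()):
    print(digit, count)
```

The digit 1 leads 6 of the 20 values, or 30% of them, already right in line with Benford's prediction.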
Sets of arbitrarily occurring human or natural data that span multiple orders of magnitude also tend to share this Benford distribution. The key is that you need to choose a data set that has this kind of span, encompassing both very small and very large examples. If you look at populations of towns in England, ranging from the tiniest hovel to London, you will see that they obey Benford's law. However, if you define "small town" as a town with 100-999 residents, creating a category restricted to three-digit numbers only, this phenomenon will go away, and the leading digits will likely show a roughly equal distribution.
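You can watch range restriction kill the effect in a small simulation. This sketch uses made-up populations drawn uniformly from 100 to 999, not real census data, just to illustrate what happens when a data set is confined to a single order of magnitude:

```python
import random

random.seed(42)

# Hypothetical "small towns" with populations drawn uniformly from
# 100-999: the data spans only one order of magnitude.
towns = [random.randint(100, 999) for _ in range(9000)]
counts = {str(d): 0 for d in range(1, 10)}
for pop in towns:
    counts[str(pop)[0]] += 1
print(counts)
```

Each leading digit turns up roughly one ninth of the time, nowhere near the lopsided Benford distribution.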
The most intriguing part of Benford's law is that it leads to several powerful real-life applications. As we alluded to in the intro to this topic, Benford's Law is legally admissible in cases of accounting fraud, and can often be used to ensnare foolish fraudsters who haven't had the foresight to listen to Math Mutation. (Or who are listening too slowly and haven't reached this episode yet.) A link in the show notes goes to an article that demonstrates fraud in several bankrupt U.S. municipalities based on their reported data not conforming to Benford's law. It has been claimed that this law exposes fraud in Iran's 2009 election data as well, and in the economic data Greece used to enter the Eurozone. It has also been proposed as a good test for detecting scientific fraud in published papers. Naturally, however, once someone knows about Benford's law they can use it to generate their fake data, so compliance with this law doesn't prove the absence of fraud.
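To give a flavor of how such a screening might work, here is a rough Python sketch. The `benford_chi2` helper is a hypothetical illustration of comparing observed leading-digit counts to Benford's prediction with a chi-squared-style statistic, not the actual procedure forensic accountants use in court:

```python
from collections import Counter
from math import log10

# Benford's predicted probability for each leading digit.
BENFORD = {d: log10(1 + 1 / d) for d in range(1, 10)}

def benford_chi2(values):
    """Chi-squared-style distance between observed leading-digit counts
    and Benford's expected counts; larger values look more suspicious.
    (A rough screening sketch, not a rigorous forensic test.)"""
    counts = Counter(int(str(v)[0]) for v in values)
    n = len(values)
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p)
               for d, p in BENFORD.items())

# Exponentially growing data hugs Benford's curve; data faked with
# equal leading digits stands out.
honest = [2 ** k for k in range(1, 101)]
suspicious = [d * 100 for d in range(1, 10)] * 11
print(benford_chi2(honest), benford_chi2(suspicious))
```

The exponentially growing list scores far lower than the list whose leading digits were spread evenly, which is exactly the signature the fraud investigators look for.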
So, next time you are looking at a large data set in an accounting table, scientific article, or newspaper story, take a close look at the first digits of all the numbers. If you don't see the digits appearing in the proportions identified by Benford, you may very well be seeing a set of made-up numbers.
And this has been your math mutation for today.