Does Your Data Follow Benford’s Law?

Minding The Data
3 min read · Sep 9, 2020

Benford’s Law is the observation that, in many real-world datasets, numbers are much more likely to start with a small digit, such as 1 or 2, than with a large digit, such as 8 or 9. This pattern shows up all over the place, from social media websites to stock market data, and it has even been used to detect fraud and manipulated tax returns. If you’d like to see a further breakdown and analysis of these real-life examples of Benford’s Law, you can check out my YouTube video here.
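To make "much more likely" concrete, here is a minimal sketch that computes the proportions Benford’s Law predicts, using the standard formula P(d) = log10(1 + 1/d) for a leading digit d. The name `benford` is just my choice for this illustration.

```python
import math

# Expected proportion of numbers whose leading digit is d, for d = 1..9,
# according to Benford's Law: P(d) = log10(1 + 1/d)
benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

for digit, proportion in benford.items():
    print(f"{digit}: {proportion:.3f}")
```

Under this formula a leading 1 is expected about 30.1% of the time, while a leading 9 is expected only about 4.6% of the time.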

In this project, we will take a dataset and test whether a given variable follows the distribution laid out by Benford’s Law. I will show all the code required to test your variable, but you will need to use Python and Jupyter notebooks, so if you are unfamiliar with these tools, you can learn more by watching this tutorial here.

The libraries that we will need for this project are pandas, matplotlib, and seaborn, so be sure to import them with the following code:
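A typical import cell for these three libraries might look like the one below. The aliases `pd`, `plt`, and `sns` are the common community conventions rather than anything required.

```python
import pandas as pd               # data loading and manipulation
import matplotlib.pyplot as plt   # base plotting library
import seaborn as sns             # statistical plots built on matplotlib
```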

Next, we need to import the data that we want to compare to Benford’s Law. You will first need this data in a CSV file, which you can load into a data frame with pandas. Then you can use the .head() method to see the first few rows of your data frame. For my example, I will be using the Trending YouTube Video Dataset available on Kaggle.com. Kaggle is a great source of data to practice with, so be sure to check out their site if you’d like to use your own variable.
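Here is a rough sketch of the loading step. The helper `load_dataset` and the file name `my_data.csv` are placeholders I made up for illustration, so swap in the path to whichever CSV file you downloaded.

```python
import pandas as pd

def load_dataset(csv_path):
    """Read a CSV file into a pandas DataFrame."""
    return pd.read_csv(csv_path)

# Usage in a notebook cell (the file name is a placeholder for your own CSV):
# df = load_dataset("my_data.csv")
# df.head()   # displays the first five rows
```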

Now that you have your table all set, you will need to select the variable you would like from your data frame and remove any missing data. We will also need to ensure that this variable is treated as a string rather than a number so that we can extract the first digit from each value. Lastly, we will get the counts of each digit, setting the parameter normalize=True in .value_counts() to get the proportion of times that each digit occurs. In my example, I want to compare Benford’s Law to the distribution of leading digits in the number of views on Trending Videos. If you’re following along with another dataset, simply replace “views” with the name of your variable.
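The steps above can be sketched as a small helper. The function name `leading_digit_proportions` is my own, and two details go slightly beyond what the text describes: it strips leading minus signs, zeros, and decimal points so that values like 0.045 or -12 still yield a digit from 1 to 9, and it fills in 0 for any digit that never appears.

```python
import pandas as pd

def leading_digit_proportions(series):
    """Return the proportion of values starting with each digit 1-9."""
    cleaned = series.dropna()                          # remove missing data
    as_text = cleaned.astype(str).str.lstrip("-0.")    # treat values as strings
    first = as_text.str[0]                             # first remaining character
    first = first[first.isin(list("123456789"))]       # keep only digits 1-9
    proportions = first.value_counts(normalize=True)   # share of each digit
    proportions.index = proportions.index.astype(int)
    # Sort by digit and report 0.0 for digits that never occur
    return proportions.reindex(range(1, 10), fill_value=0.0)

# Usage in a notebook cell, assuming df has a "views" column:
# digit_proportions = leading_digit_proportions(df["views"])
# digit_proportions
```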

We can see from the data above that this example already looks like it will follow Benford’s Law! But let’s graph the results and see how they compare side by side. Replace my_variable_name with a description of your variable to get a direct comparison of your leading digits to Benford’s Law.
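One way to draw that side-by-side comparison is a grouped bar chart. This is only a sketch: the function `plot_benford_comparison` is a name I chose, and it assumes `observed` is a pandas Series of proportions indexed by the digits 1 to 9.

```python
import math

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def plot_benford_comparison(observed, my_variable_name):
    """Plot observed leading-digit proportions next to Benford's Law."""
    digits = list(range(1, 10))
    # Proportions predicted by Benford's Law: log10(1 + 1/d)
    benford = [math.log10(1 + 1 / d) for d in digits]
    # Long-format table: one row per (digit, source) pair
    comparison = pd.DataFrame({
        "digit": digits * 2,
        "proportion": [float(observed.get(d, 0.0)) for d in digits] + benford,
        "source": [my_variable_name] * 9 + ["Benford's Law"] * 9,
    })
    fig, ax = plt.subplots(figsize=(8, 5))
    sns.barplot(data=comparison, x="digit", y="proportion", hue="source", ax=ax)
    ax.set_xlabel("Leading digit")
    ax.set_ylabel("Proportion")
    ax.set_title(f"{my_variable_name} vs. Benford's Law")
    return ax

# Usage in a notebook cell, assuming digit_proportions holds your proportions:
# plot_benford_comparison(digit_proportions, "YouTube trending video views")
# plt.show()
```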

We can see that our variable matches Benford’s Law almost exactly! Hopefully you were able to make it through this tutorial successfully and found a variable that also aligns with the Benford’s Law pattern. I’d love to see what you were able to find, so feel free to share your findings with me on the Contact page!

If you’d like to access the Jupyter notebook with all of the above code cells, check it out on my GitHub here.

Thanks for making it all the way through! Please check out some of my other Coding Projects or some of the videos that I’ve created on my YouTube channel.

Originally published at http://mindingthedata.com on September 9, 2020.
