At some point or another, we’ve all heard the phrase “garbage in, garbage out.” In the world of data, this usually refers to the importance of having high-quality input data in order to achieve high-quality outputs from analysis.

A famous example is the story of early infrared (IR) sensors, such as those used in the Microsoft Kinect and in touchless soap dispensers. These sensors did not function reliably for people with dark skin. The root cause was that calibration of the sensors was conducted almost exclusively on white subjects. A lack of skin-tone diversity during testing led the machine learning algorithms to be tuned to only a subset of the total population, and the result was a massive commercial failure.

What is less frequently discussed is the case where the data is accurate and high quality, but the wrong analysis is applied to it. Using the wrong analysis can lead to unanticipated poor outcomes, or to deliberately misleading conclusions. Perhaps the most common example is applying a simple average to a data set that is not normally distributed. Data sets that do not follow that familiar bell curve are better served by other analysis approaches.

Let’s first look at a case where inappropriately using the average to assess a data set leads to an unanticipated poor outcome.

Some sensors are inherently noisy. Ultrasonic sensors fall into this category. Ultrasonic sensors emit a sound pulse (above the frequency range humans can hear) and listen for how long it takes that pulse to reflect back after hitting a surface. The round-trip time corresponds directly to the distance between the sensor and the surface, which is useful in robotics applications, for example.

In practice, the noise shows up as sporadic measurements that are wildly inaccurate, mostly caused by stray environmental noise that gets detected and mixed in with authentic sensor measurements. The total number of these wildly wrong data points is small, but if a robot arm were positioned based on the average sensor reading, it could swing in a drastically wrong direction; a few hugely wrong measurements completely throw off the average. The median measurement, which takes the midpoint of the sorted data set, is almost always very close to the actual distance being sensed, because those few noisy points cannot shift the midpoint enough to cause a problem.
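As a rough illustration, here is a minimal Python sketch of the difference. The readings are invented for the example: most values cluster around a true distance of roughly 50 cm, with two spurious echoes mixed in.

```python
import statistics

# Hypothetical ultrasonic readings in centimeters: the true distance is
# about 50 cm, but two spurious echoes produced wildly wrong values.
readings = [50.2, 49.8, 50.1, 50.3, 212.7, 49.9, 50.0, 3.4, 50.2, 50.1]

mean_distance = statistics.mean(readings)      # pulled far off by the two outliers
median_distance = statistics.median(readings)  # stays near the true distance

print(f"mean:   {mean_distance:.1f} cm")    # ~61.7 cm, enough to misposition a robot arm
print(f"median: {median_distance:.1f} cm")  # ~50.1 cm, close to the actual distance
```

Just two bad echoes out of ten readings drag the average off by more than 10 cm, while the median barely moves.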

Averages are also frequently used by politicians, lobbyists, salespeople, and storytellers to frame data toward the conclusion they want their audience to reach on their own.

Agriculture remains the largest industry segment in North Carolina. RIoT is researching a smart agriculture project in Eastern NC to help small and medium farms become more competitive. The project spans seven eastern NC counties (Edgecombe, Greene, Johnston, Nash, Pitt, Wayne, and Wilson) that collectively have more than one million acres of farmland spread across 3,249 farms (2017 USDA Census). Looking at the data more closely, 71% of the farms are smaller than 180 acres and 65% of the farms earn less than $50k in revenue per year. The great majority of farms are very small businesses.

Agricultural data is often presented with averages. You may be surprised to learn that the average farm in this seven-county region is 320 acres and the average revenue per farm is $527k. Those averages don’t accurately tell the story of the agricultural business climate in the region. They would lead you to believe that a typical farm is a small business making half a million dollars a year, i.e., probably a healthy business.

But the reality is that a very small number of mega-farms rake in millions of dollars and skew the average for the whole region upward, because a handful of very large values in a data set full of small values has an outsized impact on the average. The majority of farmers are struggling just to break even, which feeds a cycle of continued consolidation.

A much more accurate way to present the real situation would be to use the median, perhaps paired with the range (the difference between the highest and lowest values).
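To make that concrete, here is a small Python sketch using invented revenue numbers (not the actual census data), chosen to roughly echo the proportions described above: nine small farms and one mega-farm.

```python
import statistics

# Invented annual revenues (in dollars) for ten hypothetical farms:
# nine small operations and one mega-farm.
revenues = [22_000, 31_000, 35_000, 40_000, 44_000,
            47_000, 52_000, 60_000, 75_000, 4_800_000]

mean_rev = statistics.mean(revenues)
median_rev = statistics.median(revenues)
value_range = max(revenues) - min(revenues)

print(f"mean:   ${mean_rev:,.0f}")     # ~$520,600, dominated by the single mega-farm
print(f"median: ${median_rev:,.0f}")   # $45,500, closer to what a typical farm earns
print(f"range:  ${value_range:,.0f}")  # $4,778,000, shows how spread out the farms are
```

The average suggests a half-million-dollar business; the median and range together show that nine of the ten farms are small operations alongside one enormous outlier.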

The economy is a hot topic in any year, and particularly in an election year. Those who want to tout how strong the economy is may cite the average income for a family of four: $105,555 (2022 US Census). Those who want to describe the economy more accurately might point out that the median income is significantly lower, at $74,580. A small number of extremely high earners skew the average upward.

In fact, only 20.2% of four-person households earn more than the average; four out of five households earn less. Even less frequently pointed out by elected officials is that the median has been declining by about 2.2% per year since 2019. In other words, the very rich are getting far richer while the data for the typical household keeps getting worse.

Large aggregate data sets can also hide trends within the data. Sticking with the 2022 census, the median full-time male worker earned $52,612, while the median female worker earned $39,688, roughly 75% as much as men. When the data is presented as a single population-wide figure, it hides this clear and terrible discrepancy.

It is this kind of data framing that should make people skeptical of the media’s habit of judging US economic health by the performance of the stock market. The stock market is just one of many measures, and it reflects how the very largest companies are performing. These are the same large companies that skew the averages of much broader economic data sets. Stock market health correlates far more closely with the economic fortunes of the 20.2% of households earning above the average than with those of the vast majority of Americans.

The key takeaway is that we should all be cognizant of how data is being processed and presented. Keep your high school math skills honed, and use common sense to think about whether the average is really telling the whole story.