Editor’s note: Tom Snyder, executive director of the rapidly growing Raleigh-based RIoT and a thought leader in the emerging Internet of Things, is the newest columnist to join WRAL TechWire’s list of top-drawer contributors. “Datafication Nation” is part of WRAL TechWire’s Startup Monday package.


RALEIGH – In this column, I regularly discuss Digital Transformation, which is happening across industry, government and society. A proliferation of low-cost sensors, wireless networks and cloud computing is enabling real-time analytics and the automation of everything. We are pivoting from an Information Age (based on internet technologies) to a Data Economy (based on AI and IoT).

So what does Digital Transformation mean for journalism? Back in July, I discussed the Associated Press’ decision to feed decades of articles into a large language model to train AI to automate news story publication. Today, an AI can automatically produce a sporting event recap or weather forecast write-up that is difficult to distinguish from a human-produced piece. You can read that article here.

Today, I’d like to focus instead on the growing trend of “data journalism”: the practice of analyzing large data sets to generate new news stories. Data analysis has long played a supporting role in journalism. Now human journalists are using data analytics to discover and create the news itself.

In 1921, C.P. Scott, the editor of The Guardian, said, “Comment is free, but facts are sacred.” Data, when captured accurately, is fact. But at that time, journalists did not have adequate tools to analyze millions of small “facts” at scale. Thirty years after Scott’s statement, technology was becoming available to do just that. In 1952, CBS became the first news organization to use a mainframe to analyze polling data and predict the outcome of a presidential election (Eisenhower won). By the late 1960s, analytics in journalism was mainstream, and today nearly every news story is supported by data analytics.

The new trend is to create news directly from analytics of large data sets – with the analytics itself “discovering” the trends that are noteworthy enough to report on. Paul Bradshaw, a BBC data journalist and Birmingham City University professor, created the Inverted Pyramid of Data Journalism to explain this concept.

The steps are: Find → Clean → Visualize → Publish → Distribute → Measure.

Find: Data journalists first identify large data sets that may hold untapped discoveries. Examples include crime, environmental, transportation or weather data sets. In general, larger data sets hold more promise than smaller ones, but data veracity (and journalist access to the data) is key.

Clean: Large data sets are often polluted with individual erroneous data points or with collection biases. Before running analytics, it is important to filter out bad data. When fusing multiple data sets for analytical purposes (like testing whether crime frequency correlates with weather conditions or traffic patterns), there may also be transformation or structuring tasks required to make the data suitable for algorithmic interrogation and interpretation.
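To make the Clean step concrete, here is a minimal sketch in Python. Everything in it is invented for illustration – the counts, temperatures and thresholds are assumptions, not real crime or weather figures – but it shows both tasks described above: filtering out bad data points and fusing two data sets for joint analysis.

```python
# Hypothetical daily crime counts; None marks a failed collection.
crime = {
    "2023-07-01": 14,
    "2023-07-02": None,  # missing data point
    "2023-07-03": 9,
    "2023-07-04": 950,   # implausible outlier
}

# Hypothetical daily high temperatures (F) for the same city.
weather = {
    "2023-07-01": 91,
    "2023-07-03": 78,
    "2023-07-04": 88,
}

# Filter out bad data: drop missing values and implausible outliers.
clean_crime = {d: n for d, n in crime.items()
               if n is not None and 0 <= n <= 100}

# Fuse the two data sets on date, keeping only dates present in both,
# structured for algorithmic interrogation.
fused = [(d, clean_crime[d], weather[d])
         for d in sorted(clean_crime.keys() & weather.keys())]

print(fused)  # [('2023-07-01', 14, 91), ('2023-07-03', 9, 78)]
```

In a real newsroom this filtering and joining would be done with dedicated tooling at far larger scale, but the shape of the work is the same.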

Visualize: Computers are very good at rendering large data sets into formats suitable for understanding trends. Graphs, charts, heat maps and distribution plots all help to showcase the meaning we might interpret from the analysis. These visualizations do not need to be two-dimensional or static. Animation to show data trends changing over time, for example, increases our ability to understand and communicate the “news” hidden in the data.
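As a toy illustration of the Visualize step (again with invented numbers – a real newsroom would reach for a charting library rather than text), even a few lines of Python can turn a cleaned data set into a crude visual that makes a trend visible at a glance:

```python
# Hypothetical monthly incident counts after cleaning (assumed data).
counts = {"Jan": 12, "Feb": 18, "Mar": 7, "Apr": 25}

# One '#' per incident: a minimal text bar chart of the trend.
chart = [f"{month} {'#' * n} ({n})" for month, n in counts.items()]
print("\n".join(chart))
```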

Publish: This is the step of integrating the data visualizations with additional context and explanation. At the completion of this stage, a news story feels familiar – a journalist communicates the core story, which the audience sees as supported by the data visualization (even though the data analytics was actually the point of discovery).

Distribute & Measure: These two steps focus on disseminating the news as widely as possible and then measuring its impact, so that data feeds back to inform future work (and for monetization purposes like advertising).
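Taken together, the six steps above can be sketched as a single pipeline. This is purely illustrative – every function body here is an assumption standing in for real newsroom tooling – but it shows how each stage feeds the next:

```python
def find():
    # Step 1: locate a raw data set (hypothetical daily incident counts).
    return [("2023-07-01", 14), ("2023-07-02", None), ("2023-07-03", 9)]

def clean(rows):
    # Step 2: drop erroneous or missing data points.
    return [(d, n) for d, n in rows if n is not None]

def visualize(rows):
    # Step 3: render a crude text chart of the cleaned data.
    return [f"{d}: {'#' * n}" for d, n in rows]

def publish(chart):
    # Step 4: wrap the visualization in narrative context.
    return "Incidents held steady this week:\n" + "\n".join(chart)

def distribute(story):
    # Step 5: push the story out; here, just print it.
    print(story)
    return story

def measure(story):
    # Step 6: record impact metrics that inform future work.
    return {"length": len(story)}

metrics = measure(distribute(publish(visualize(clean(find())))))
```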

Data journalism techniques are increasingly used across fields of reporting. Environmental data is studied to discover new drivers of climate change and identify species on the verge of extinction. Social media analytics identifies what is happening in conflict zones and tracks shifts in public opinion. Geospatial data is uncovering food deserts and empowering community journalism.

I anticipate that the next step in the digital transformation of journalism will be the automation of news created from big data. Today, human journalists identify the data sets and run hypothesis-based analytics to discover new stories. But the same paradigm now being used to automate simple stories tied to real-time data will, over time, expand to the discovery of new stories in legacy large data sets.

Generative AI has already proven its ability to determine “context” from very large datasets. There is little reason to believe it will not be able to explore analytical hypotheses against large datasets and publish news discoveries that, with high probability, are as reasonable as those a human journalist might deduce, validate and report.

Perhaps we’ll read the story of how that came together, published by an AI journalist, in the not-too-distant future.