Editor’s note: Tom Snyder, executive director of rapidly growing Raleigh-based RIoT and a thought leader in the emerging Internet of Things, is the newest columnist to join WRAL TechWire’s list of top drawer contributors. “Datafication Nation” is part of WRAL TechWire’s Startup Monday package.
RALEIGH – Late last week, the Associated Press announced a deal with OpenAI to license a significant portion of its historical news archive. Undoubtedly, this is a huge win for the maker of ChatGPT, which gains valuable data to further train its large language models. In return, the AP gains access to OpenAI’s generative AI tools and technology.
Since the announcement, there has been public concern about whether this points toward a future of automated journalism. After all, there have been numerous recent and public gaffes resulting from industry use of these emerging technologies. It was only last month that lawyers got busted for filing a brief that included fake case citations fabricated by ChatGPT. How can we entrust one of the most sacred institutions of modern society to error-prone tools? How does the AP, a globally trusted brand, maintain the public trust?
It may surprise you to know that the AP has already been using automation to produce some of its content. Back in 2016, the AP began feeding data from MLB Advanced Media into Wordsmith, the natural language generation platform built by Durham-based Automated Insights, to automatically report on minor league baseball. Even earlier, in mid-2014, the AP began using Wordsmith to automate thousands of quarterly earnings reports, with the system configured to follow the AP Style Guide.
While an occasional error in a baseball recap or a misreported earnings figure may not spark a revolution or crash the economy, it is societally important that we can trust journalistic institutions to hold accuracy and fairness paramount. Facts are facts. An occasional mistake, whether by a human reporter or an AI algorithm, should never put society in a position to doubt the veracity of the news itself.
The big concern that I have is that LLMs and AI tools are increasingly training on data sets that do not have the fact-checking filters that organizations like the AP apply. Social media, in particular, is rife with “fake news” and wholly inaccurate data. There are massive disinformation engines, many funded by hostile states and offshore entities, that are programmed to flood the public with dubious information.
There is massive momentum across entertainment and commerce to personalize the content presented to us, based on AI analysis of the data that has been collected about us. Streaming services recommend movies we might like. E-commerce platforms recommend things we might like to buy. Music services create playlists for us.
What happens when data sets that have been polluted by our own individual biases begin to feed the algorithms that power fact-searching and journalism? Will we begin to see journalistic institutions present reporting that is customized to our interests?
I’m a Hokie and am already presented with Virginia Tech headlines more frequently than I receive stories about LSU (sorry Pete M). I also tend to prefer longer-form reporting over quick soundbite headlines. How will I know that the thorough Tech football article I’m reading wasn’t automatically customized to my preferences? My neighbor may read a shorter, different version that is customized for them. Do these small customizations infringe on factual news reporting?
Article length is somewhat non-controversial. What happens when those customizations draw more deeply on our personal data and preferences? How do we ensure, for example, that I’m not receiving an automated article informing me about the success of vaccines in fighting the pandemic while others get an article reporting on the dangers of vaccination? I hold deep trust in scientific evidence, peer review and objective study. Journalists have historically done the same. Sadly, in today’s society not everyone holds these same views.
I would surmise that the historical AP archive contains far more factual reporting than non-factual. I hope that the AP’s deal with OpenAI improves the factual quality of ChatGPT and other tools. But throughout history, reporting has reflected the voice and culture of its time. That voice will be reflected in the machine learning training, and I expect, for example, to see shadows of our history of racism show up in new reporting automated from past journalistic voices.
To my knowledge, the AP has not yet publicly discussed how it intends to use the tools gained through this new partnership. We can only hope it does not lose sight of the importance of data veracity for AI training and story inputs, in the same way that the institution of journalism has always maintained standards of fact-checking, use of multiple sources and other safeguards of the truth.