Special to WRAL Tech Wire

(Editor’s note: Now in its fourth year, the 12 Days of Broadband runs Dec. 4 through Dec. 19, highlighting a dozen innovations and stories directly impacted this year by the expanding reach of high-speed connectivity in North Carolina and throughout the country.)

RESEARCH TRIANGLE PARK, N.C. – In September, Bob Lannon and Andrew Pendleton at the Sunlight Foundation put together a tremendous report categorizing 800,000 public comments on the FCC’s net neutrality plan. The authors were happy to share their research and story, making it a worthy inclusion in this year’s 12 Days of Broadband.

Net neutrality refers to the current practice of providing equal access, bandwidth and download speeds to all websites, content providers and services. In November, President Obama expressed his support for keeping the Internet neutral and asked the FCC to begin regulating the Internet more like a public utility, which pushed the “net neutrality” discussion further into the political spotlight in recent weeks.

One critic of net neutrality, Texas Sen. Ted Cruz, has claimed that net neutrality will break the Internet. Just after the president’s remarks in November, Cruz said via Twitter that net neutrality is essentially “Obamacare for the Internet.” The comment set off waves of trending conversation on social media from both sides of the argument.

On Aug. 5, the FCC announced the bulk release of the comments from its largest-ever public comment collection. Lannon and Pendleton spent three weeks cleaning and preparing the data to try to make sense of the hundreds of thousands of comments in the docket.

Here is a high-level breakdown of the research (all of which is presented from the authors’ perspective):

Our first exploration uses natural language processing techniques to identify topical keywords within comments and use those keywords to group comments together. We analyzed a corpus of 800,959 comments.
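The report doesn’t spell out the exact pipeline, but the core idea — extract topical keywords from each comment, then group comments that share them — can be sketched in a few lines. Everything below (the tiny stopword list, the greedy grouping, the `min_shared` threshold) is illustrative, not the authors’ actual code:

```python
# Small stopword list for the toy example; a real pipeline would use a
# proper NLP stopword set and smarter keyword scoring.
STOPWORDS = {"the", "a", "an", "to", "for", "of", "and", "is", "under", "on", "no"}

def keywords(comment):
    """Topical keywords = lowercase tokens minus stopwords."""
    return {t for t in comment.lower().split() if t not in STOPWORDS}

def group_comments(comments, min_shared=2):
    """Greedy grouping: a comment joins the first group whose keyword
    set overlaps its own by at least `min_shared` terms."""
    groups = []  # list of (keyword_set, member_indices)
    for i, comment in enumerate(comments):
        kws = keywords(comment)
        for group_kws, members in groups:
            if len(kws & group_kws) >= min_shared:
                members.append(i)
                break
        else:
            groups.append((kws, [i]))
    return groups
```

On a toy corpus, two “fast lane” comments end up grouped together while a Title II comment forms its own group — the same shape of output, at miniature scale, as the topical clusters described above.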

Some key findings:

• We estimate that less than 1 percent of comments were clearly opposed to net neutrality.

• At least 60 percent of comments submitted were form letters written by organized campaigns (484,692 comments); while these make up the majority of comments, this is actually a lower percentage than is common for high-volume regulatory dockets.

• At least 200 comments came from law firms, on behalf of themselves or their clients.

In-depth exploration of the topical keywords revealed several prominent recurring themes, both in form letter and non-form letter submissions.

Among the most common: About two-thirds of commenters objected to the idea of paid priority for Internet traffic, or division of Internet traffic into separate speed tiers. This topic was discussed in many independent comments as well as form letter campaigns organized by the Nation, Battle for the Net, CREDO Action, Daily Kos and Free Press. Common keywords in this group included “slow/fast lane,” “pay to play,” “wealthy,” “divide” and “Netflix.”

About the same number of comments, including submissions from form letter campaigns organized by the Nation, Badass Digest, CREDO Action, Daily Kos and Free Press, asked the FCC to reclassify ISPs as common carriers under the 1934 Communications Act. Common keywords in these comments included “common carrier,” “(re)classify,” “authority” and “Title II” (a part of the act that might grant the FCC this authority). A smaller portion of commenters advocated a regulatory strategy with a similar effect but a different legal basis, relying on section 706 of the 1996 Telecommunications Act.

The subject of Internet access as an essential freedom comprised more than half of comments included in form letters from the Nation, Battle for the Net, CREDO Action and Daily Kos. Common topic words included “important,” “vitally,” “economy,” “essential,” “resource” and “cornerstone.”

Almost half of comments, including form letters from Electronic Frontier Foundation, the Nation, Battle for the Net, Daily Kos and Free Press, discussed the economic impact, or the impact on small businesses and innovation, of the end of net neutrality. Typical terms in these comments included “work,” “competition,” “startup,” “kill,” “barrier” and “entry.”

About 40 percent of comments, including campaign letters from EFF, Battle for the Net and Daily Kos, discussed the importance of consumer choice or the impact of regulations on consumer fees. Topic words included “access,” “choice,” “entertainment,” “fee,” “content,” “extort” and “extract.”

About one-third of comments, including those in Battle for the Net’s campaign, discussed the importance of competition among ISPs. Frequent terms included “monopoly” and “competition,” “Comcast,” “Verizon” and “Warner.”

Several form letters either from the Daily Kos or of unknown provenance (combined with non-form letters) advocated treating broadband providers like a public utility. About 15 percent of comments discussed this topic.

A small number of comments (about 5 percent, including letters from Stop Net Neutrality and a Tea Partier blog) had anti-regulation messages. Interestingly, some of these comments seemed to emphasize freedom for consumers while others advocated freedom for ISPs, two positions seemingly at odds with one another.

Additionally, a couple of topics came up in significant enough numbers to be noteworthy despite not appearing in any of the form letter campaigns. About 2,500 comments called for the resignation of FCC Chairman Tom Wheeler or other FCC commissioners or staff. Another 1,500 or so either mentioned John Oliver by name or used the words “dingo” or “f*ckery,” again typically directed at Wheeler, and were likely inspired by Oliver’s use of those terms in his net neutrality segment.

Wait, where are the 1.1 million comments?

The comments were originally released by the FCC as six continuous XML files, with two caveats: First, mailed comments postmarked prior to July 18 are still being scanned and entered into the ECFS and may not be reflected in the files. We haven’t received word of any updates since the original release. Second, certain handwritten comments may not be searchable; for this reason, source links to these comments are included in the files.

Also, more than 500 comments had blank text fields. Our guess is that these may correspond to handwritten comments.

The XML files contained 446,719 records. Many of these contained a single comment each, but some contained multitudes. We wrote custom processing scripts to break up the multiple-comment records, revealing the total count of 801,781 comments. Of these, some were discarded as unparsable or too long (both Les Misérables and War and Peace were submitted as comments), leaving the final count at 800,959 comments.
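The bulk files’ exact schema isn’t described here, so the record-splitting step can only be illustrated with a hypothetical layout: records whose text field holds several comments separated by a delimiter line. The element names (`export`, `record`, `text`) and the `-----` delimiter below are invented for the sketch:

```python
import xml.etree.ElementTree as ET

# Hypothetical layout — the real ECFS export format and the authors'
# actual splitting heuristics are not published in this article.
SAMPLE = """
<export>
  <record id="1"><text>I support net neutrality.</text></record>
  <record id="2"><text>Comment one.
-----
Comment two.</text></record>
</export>
"""

def extract_comments(xml_text, delimiter="-----"):
    """Yield individual comments, splitting multi-comment records on a
    delimiter line (a stand-in for whatever separators the files used)."""
    root = ET.fromstring(xml_text)
    for record in root.iter("record"):
        body = record.findtext("text", default="")
        for part in body.split(delimiter):
            part = part.strip()
            if part:
                yield part
```

Here two records yield three comments — the same kind of expansion that turned 446,719 records into 801,781 comments.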

Detecting expert submissions

After speaking with policy experts from the Open Technology Institute and Public Knowledge, we learned some interesting details about comment submission. While most public comments were submitted using a simplified form or via email, experienced submitters made use of a more complex form. Comments submitted by these “experts” were marked in the data, giving us an easy way to isolate them.

Once isolated, they provided the basis for training a piece of artificial intelligence software called a text classifier. We trained the classifier to detect expert language based on examples from submissions that we knew were from experts. It was then able to read comments submitted through the simple form or via email and tell us whether or not each was likely to have been written by an expert. The classifier found approximately 6,700 such comments.
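The article doesn’t name the classifier the team used; a multinomial Naive Bayes model is one common, simple choice for this kind of text classification, sketched here on invented toy data:

```python
import math
from collections import Counter

class NaiveBayesTextClassifier:
    """Tiny multinomial Naive Bayes with Laplace smoothing — a stand-in
    for whatever classifier the Sunlight team actually used."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.label_counts = Counter()  # label -> number of training docs

    def train(self, texts, labels):
        for text, label in zip(texts, labels):
            self.label_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            counts.update(text.lower().split())

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        vocab = {w for c in self.word_counts.values() for w in c}
        best_label, best_score = None, float("-inf")
        for label, counts in self.word_counts.items():
            # log prior + smoothed log likelihood of each word
            score = math.log(self.label_counts[label] / total_docs)
            denom = sum(counts.values()) + len(vocab)
            for w in text.lower().split():
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Trained on a handful of expert-flagged submissions (dense with terms like “forbearance” and “Title II”) versus plain-language ones, the model can then score the simple-form and email comments for expert-like language.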

Approximately 3,900 of these were form letters with this basic structure:

“To Chairman Tom Wheeler and the FCC Commissioners To the FCC Please build any net neutrality argument upon solid legal standing. Specifically, this means reclassifying broadband under Title II of the Telecommunications Act of 1934. 706 authority from the Telecommunications Act has been repeatedly struck down in court after legal challenges by telecom companies. Take the appropriate steps to prevent this from happening again. Sincerely, *XXXX*”

While this was almost certainly penned by an expert, we’re considering it a non-expert submission because it seems to have been part of a broader organized campaign. Of the remaining 2,846 comments, 567 contain at least 200 words, which we feel is an appropriate heuristic for expert submissions. In summary, our back-of-the-envelope estimate of the number of expert submissions is 600, or 0.08 percent of the 800,959 comments analyzed.

Form letters

We searched within the topical groupings that powered the visualization above for groups of comments with very little textual variation from one comment to the next. This yielded results similar to the form letter detection visualizations in our Docket Wrench tool, though using different technology better suited to the extreme size of this docket.
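At small scale, low-variation grouping can be illustrated with greedy near-duplicate clustering over a character-level similarity ratio. Note that this pairwise approach would be far too slow for 800,000 comments — as noted above, the actual analysis used technology suited to the docket’s size — so treat this purely as an illustration:

```python
from difflib import SequenceMatcher

def find_form_letter_groups(comments, threshold=0.8):
    """Greedy near-duplicate clustering: each comment joins the first
    existing group whose exemplar it matches above `threshold`.
    (Illustrative only; quadratic pairwise comparison does not scale.)"""
    groups = []  # list of (exemplar_text, member_indices)
    for i, comment in enumerate(comments):
        for exemplar, members in groups:
            if SequenceMatcher(None, exemplar, comment).ratio() >= threshold:
                members.append(i)
                break
        else:
            groups.append((comment, [i]))
    return groups
```

Two lightly personalized copies of the same template cluster together, while an unrelated comment stands alone — the signature of a form letter campaign.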

After manual review of these groups, we estimate at least 20 separate form letter writing campaigns drove submissions to this docket, ranging in size from a few hundred comments to more than 100,000 and together comprising almost 500,000 comments, or about 60 percent of the corpus that we examined. We made a cursory attempt at trying to find the organizations that orchestrated each form letter writing campaign.

While form letters do appear to make up the majority of the comments, it’s actually surprising how many of the submitted comments seemed not to have been driven by form letter writing campaigns. In previous analyses of high-volume dockets, we’ve found that it’s not unusual for form letter contributions to make up in excess of 90 percent of a docket’s total submissions, with the percentage of comments coming from form letter campaigns being well-correlated with the total number of comments received. The two largest dockets in Docket Wrench, the Department of State Keystone XL rulemaking and the Internal Revenue Service docket on political activity undertaken by social welfare organizations, both from earlier this year, are each dominated by form letter comments, with more than 75 percent of the comments in each having been classified as form letter submissions by our detection systems.

It’s difficult to know why, exactly, more members of the public apparently wrote letters themselves in this rulemaking than is typical for large dockets. It could be an indicator of a genuinely higher level of personal investment and interest in this issue, or perhaps this docket drew organizers who employed different “get out the comment” techniques than we have seen in the past.

Even within the form letters, we see evidence of various kinds of innovation in the way form letter campaigns have been run. EFF’s campaign gives submitters several opportunities to choose from a menu of options at various points within the text, for example. More intriguingly, several groups of comments that we were unable to attribute to anyone show subtle textual variations that don’t seem to alter the meaning of the text in the way that EFF’s do. These groups all appear to be about the same size, leading us to believe that a single overall population of users might have been solicited to submit comments and then automatically segmented in some uniform fashion. This could have been to test which versions of the comment text got the most users to submit (along the lines of the A/B testing commonly used in software development). It could also perhaps be an effort to foil exactly the kind of automated grouping tools we (and some federal agencies) might employ to make large volumes of comments like this one easier to review.
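Such uniform segmentation is easy to implement on the campaign side: a deterministic hash of a user identifier is one common A/B-testing technique. This is speculative — we don’t know how, or whether, the campaigns actually assigned variants — but the mechanism would look something like:

```python
import hashlib

def assign_variant(user_id, n_variants):
    """Deterministically assign a user to one of n comment-text variants.
    Hashing spreads users roughly uniformly across variants, which would
    produce the similarly sized groups observed in the docket.
    (Speculative sketch, not a documented campaign mechanism.)"""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_variants
```

Because the assignment depends only on the identifier, the same user always sees the same variant, and group sizes even out as the population grows.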

Finally, while comments submitted as part of form letter campaigns are similar to one another, it’s important to note that they’re not identical. Many submitters take the opportunity to personalize their comment beyond what was supplied by the campaign’s template language. How exactly they vary is an interesting question, and worth pursuing.

Ideas for further investigation

We’ve only just scratched the surface of what could be learned from such a rich dataset. Here are some other promising avenues of investigation that have occurred to us:

• How do commenters augment the template responses provided by form letter campaigns? What do they add, delete or modify? What consistently stays intact?

• Do models of non-form submissions surface topics that we haven’t found? What about models of expert submissions?

• How are individual words related to one another? What modifiers are used for terms like “ISP,” “Wheeler,” “Internet,” etc.?

• Looking at email addresses, which domains are most popular?

• How often are key political figures or elements of government mentioned?

• Which other services or utilities is broadband Internet compared with, and how often?

• How do commenters break out by gender? (This is more difficult than it seems, even using the way fun Genderize API: often the commenter’s real name can be found only in the body of the comment itself, not in the “applicant” field.)
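As a sketch of why the gender question is tricky: a likely first name has to be pulled out of the comment body before it can be sent to the Genderize API (https://api.genderize.io). The sign-off regex below is a naive, illustrative heuristic, and the lookup function requires network access:

```python
import json
import re
import urllib.parse
import urllib.request

def signature_name(comment_text):
    """Pull a likely first name from a sign-off like 'Sincerely, Jane Doe'.
    Needed because the real name often appears only in the comment body,
    not in the 'applicant' field. Naive heuristic for illustration."""
    m = re.search(r"(?:Sincerely|Regards|Thanks),?\s+([A-Z][a-z]+)", comment_text)
    return m.group(1) if m else None

def guess_gender(first_name):
    """Query the Genderize API; returns a dict like
    {'name': ..., 'gender': ..., 'probability': ..., 'count': ...}.
    Requires network access, so it is not exercised here."""
    url = "https://api.genderize.io?name=" + urllib.parse.quote(first_name)
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Comments without a recognizable sign-off simply yield no name, which is part of why any gender breakdown would be incomplete.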

Again, thank you to Bob Lannon and Andrew Pendleton as well as the rest of the contributors to this report at the Sunlight Foundation for this data-driven summary of a controversial and heated debate this year that’s sure to continue well into 2015 and beyond.