Editor’s note: MCNC Chief Technology Strategist Mark Johnson discusses the future of big data and the technology infrastructure needed to support modern science, offering key takeaways from the first National Research Platform (NRP) Workshop held this month in Montana. MCNC operates the statewide North Carolina Research and Education Network.

BOZEMAN, Mont. – Modern science is all about big data, and a group of scientists and engineers across the country are driving the technology that enables it.

About 125 scientists and engineers gathered at Montana State University on Aug. 7 and 8 for the first National Research Platform (NRP) Workshop to plan the future of the infrastructure needed to support modern science.

The purpose of the workshop was to bring together technology leaders to discuss implementation strategies for the deployment of interoperable Science DMZs at a national scale – essentially building out a national big data superhighway. Sessions were devoted to science-driver application researchers describing their needs for high-speed data transfer, including their successes and frustrations. Discussions primarily focused on requirements from the domain scientists and the networking architecture, policies, tools and security necessary to deploy a 200-institution National Research Platform.

The event was sponsored by the National Science Foundation through the Pacific Research Platform, Montana State University, and CENIC.

Petabytes and Exabytes

Physicists were the original big data users. Instruments like the Large Hadron Collider at CERN produce data described in terms like petabyte and exabyte (a typical movie downloaded from Netflix is a few gigabytes; an exabyte is a billion gigabytes). Physicist Harvey Newman, known among computer scientists and engineers as the biggest of big data users, was there to remind attendees that new instruments like the Square Kilometer Array will soon come online, creating volumes of data that dwarf the output of today’s instruments.
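For rough perspective on those units, here is a back-of-the-envelope calculation in Python (assuming, purely for illustration, a 3-gigabyte movie and decimal storage prefixes):

# Illustrative arithmetic only: the storage units mentioned above.
GIGABYTE = 10**9              # bytes, using decimal prefixes
PETABYTE = 10**6 * GIGABYTE   # a million gigabytes
EXABYTE = 10**9 * GIGABYTE    # a billion gigabytes

movie_size = 3 * GIGABYTE     # assumed size of a typical streamed movie

print(f"Movies per petabyte: {PETABYTE // movie_size:,}")   # about 333 thousand
print(f"Movies per exabyte:  {EXABYTE // movie_size:,}")    # about 333 million

By that rough math, a single exabyte corresponds to hundreds of millions of movie downloads.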

It’s not just physicists

Computer scientist Larry Smarr, one of the organizers of the event, is working with archaeologists in California to develop techniques for digitizing important archaeological sites so they can be preserved, analyzed, and visualized. Climate scientists are gathering ever more detailed data to support better modeling of climate and weather. All of this requires more computing, better networks, and more storage for the resulting data and analysis.

The group in Bozeman was focused on how to help modern researchers do their work efficiently and effectively while allowing the normal business of university campuses to continue. A typical university has many administrative computing applications. Students use the Internet for research, but they live on campus and use Netflix and YouTube like the rest of us. Campus CIOs have to manage capacity and security for the day-to-day uses of the Internet while supporting the special needs of their faculty.

The special requirements of academic research are pushing the envelope of computer science. Software-driven networks are changing how scientists access computing, manage their data, and analyze their results.

Meanwhile, in North Carolina

North Carolina institutions are at the forefront of this work. At the Renaissance Computing Institute (RENCI) at UNC Chapel Hill, Dr. Claris Castillo is leading SciDAS (Scientific Data Analysis at Scale), an effort to give researchers a more fluid and flexible cyberinfrastructure for working with and analyzing large-scale data. At the NRP Workshop, Castillo discussed the project. “SciDAS will integrate a wealth of tools into an advanced cyberinfrastructure ecosystem to support distributed computing and the injection of large data sets and workflows into the computing environment,” said Castillo. Tracy Futhey of Duke University addressed the tension between the privacy and security obligations faced by researchers at universities with medical schools and the desire for open, transparent research.

The North Carolina Research and Education Network (NCREN) provides very high-capacity communications supporting the Breakable Experimental Network, or BEN. MCNC also facilitates access to national and international research networks and facilities like CERN via the Internet2 network. And because NCREN also serves K-12 schools and community colleges, those institutions have unique educational opportunities tied to the advanced research conducted at the university level.

Common functionality benefits everything from research to residential

The National Science Foundation funded a five-year cooperative agreement for the Pacific Research Platform (PRP) to improve end-to-end, high-speed data transfer capabilities for collaborative, big-data science among 20 institutions. As part of the PRP cooperative agreement, NSF requires that the ensemble of PRP technologies be extensible to other scientific domains and to other regional and national networks. In response to this requirement, the NRP Workshop solicited input from many multi-state networking organizations (Internet2, The Quilt, ESnet and others) on how the PRP model might further blossom.

The NRP is committed to facilitating the necessary social engineering among a diverse group of science, R&E network, and IT leaders, as well as providing proven end-to-end networking. An effective national partnership will need cyberinfrastructure experts who work with scientists at their interface and understand the desired scientific outcomes, rather than viewing the technology as an end in itself. Key tasks include identifying common functionality that can be leveraged across science applications to make the NRP partnership more efficient and effective, and prioritizing high-performance access to supercomputer centers.

This is important work that will transform the Internet as we know it. The techniques being discussed for the future, and those already in use to support cutting-edge research, will soon filter into the Internet we all use at home.