Big Blue is giving a big spark at the Spark development conference that opens today, disclosing plans to commit some 3,500 programmers to an effort costing “hundreds of millions” of dollars a year for real-time big data analytics.

Spark is an open-source project, formally known as Apache Spark. Developed in 2009, it is seen as an advance over the widely used Hadoop “architecture.”

IBM also says it will donate its SystemML software to the Spark ecosystem.

Researchers at UC Berkeley developed Spark, and a company called Databricks was spun out two years ago to offer Spark as a cloud service.

The New York Times quoted Robert Picciano, senior vice president for IBM’s data analytics business, as saying the effort will amount to “hundreds of millions of dollars” a year.

“We think this will help take Big Data to a whole new space and unleash much more developer innovation, which has been somewhat encumbered by the limitations in Hadoop architecture and the limitations in how easy it is for developers to take their skills and apply them to things like real time understanding of customer sentiment or risk analysis,” Picciano told the Wall Street Journal.

“Hadoop is good as a collection point of large amounts of historical information, and it can use data that is both structured and unstructured. One thing that is missing is the ease of use for developers and the speed to gain insights from data moving in and out of repositories.”

The initiative is also part of IBM’s growing Bluemix development efforts for cloud-based computing. The IBM campus in RTP has a growing Bluemix commitment as well as one of IBM’s newest “cloud” data centers.

IBM is also likely to announce a new Spark research outpost in San Francisco, where the conference is taking place.

IBM’s initiatives

Here’s how IBM summed up its initiatives:

  • IBM will build Spark into the core of the company’s analytics and commerce platforms.
  • IBM’s Watson Health Cloud will leverage Spark as a key underpinning for its insight platform, helping to deliver faster time to value for medical providers and researchers as they access new analytics around population health data.
  • IBM will open source its breakthrough IBM SystemML machine learning technology and collaborate with Databricks to advance Spark’s machine learning capabilities.
  • IBM will offer Spark as a Cloud service on IBM Bluemix to make it possible for app developers to quickly load data, model it, and derive the predictive artifact to use in their app.
  • IBM will commit more than 3,500 researchers and developers to work on Spark-related projects at more than a dozen labs worldwide, and open a Spark Technology Center in San Francisco for the Data Science and Developer community to foster design-led innovation in intelligent applications.
  • IBM will educate more than 1 million data scientists and data engineers on Spark through extensive partnerships with AMPLab, DataCamp, MetiStream, Galvanize and Big Data University MOOC.

More about Spark from its website:

“Apache Spark is a fast and general engine for large-scale data processing.”

  • Speed
  1. Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
  2. Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
  3. (Chart: Logistic regression in Hadoop and Spark)
  • Ease of Use
  1. Write applications quickly in Java, Scala or Python.
  2. Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala and Python shells.
  • Generality
  1. Combine SQL, streaming, and complex analytics.
  2. Spark powers a stack of high-level tools including Spark SQL, MLlib for machine learning, GraphX, and Spark Streaming. You can combine these libraries seamlessly in the same application.
  • Runs Everywhere
  1. Spark runs on Hadoop, Mesos, standalone, or in the cloud. It can access diverse data sources including HDFS, Cassandra, HBase, and S3.
  2. You can run Spark using its standalone cluster mode, on EC2, on Hadoop YARN, or on Apache Mesos. Access data in HDFS, Cassandra, HBase, Hive, Tachyon, and any Hadoop data source.
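
To give a rough feel for the "high-level operators" mentioned under Ease of Use, here is a minimal sketch in plain Python. It is not the Spark API and does nothing in parallel: `TinyDataset` and `word_count` are hypothetical names invented for this illustration, and the class is a small in-memory stand-in that only mimics the chained style of operators such as map, filter and reduce-by-key.

```python
from collections import defaultdict
from functools import reduce


class TinyDataset:
    """Hypothetical in-memory stand-in for a distributed dataset (NOT Spark)."""

    def __init__(self, items):
        self.items = list(items)

    def flat_map(self, fn):
        # Apply fn to each item and flatten the resulting iterables.
        return TinyDataset(out for item in self.items for out in fn(item))

    def map(self, fn):
        # Transform each item independently.
        return TinyDataset(fn(item) for item in self.items)

    def filter(self, predicate):
        # Keep only the items for which predicate returns True.
        return TinyDataset(item for item in self.items if predicate(item))

    def reduce_by_key(self, fn):
        # Group (key, value) pairs by key, then fold each group's values with fn.
        groups = defaultdict(list)
        for key, value in self.items:
            groups[key].append(value)
        return TinyDataset(
            (key, reduce(fn, values)) for key, values in groups.items()
        )

    def collect(self):
        # Return the results as an ordinary list.
        return list(self.items)


def word_count(lines):
    """Count words across lines using a chain of operators."""
    return dict(
        TinyDataset(lines)
        .flat_map(str.split)
        .map(lambda word: (word.lower(), 1))
        .reduce_by_key(lambda a, b: a + b)
        .collect()
    )


if __name__ == "__main__":
    print(word_count(["Spark runs on Hadoop", "Spark runs on Mesos"]))
```

The point of the sketch is the shape of the program: a short chain of transformations describes the whole job. In a real engine the same chain would be split across many machines, which this toy version does not attempt.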