
Big Data Languages, Tools, and Frameworks


The data scientists we spoke with most frequently mentioned Python, Spark, and Kafka as their go-to data science toolkit.

Nov. 16, 2018 · Big Data Zone



To understand the current and future state of big data, we spoke to 31 IT executives from 28 organizations. We asked them, "What are the most prevalent languages, tools, and frameworks you see being used in data ingestion, analysis, and reporting?" Here’s what they told us:

Python, Spark, Kafka

  • With big data and the push into AI/ML, Scala and Python are leading the way, with Apache Spark gaining popularity. Teams are moving from OLAP cubes and data warehouses to less organized structures and applying ML with Python. Developers are writing Python ML models because of the library support that's out there (a minimal scikit-learn sketch follows this list).
  • Kafka for streaming ingest. R and Python for programming. Java is prevalent. SQL hasn't gone away; it may not be big data's best friend, but it opens data access to a much broader range of people. Gartner has SQL-on-Hadoop coming out of the trough of disillusionment.
  • We see a lot of Hadoop, Spark, and Kafka. The emerging tech is in data warehousing, where there is a lot of interest in Redshift, Snowflake, and BigQuery. ML is out there too, with added capabilities for TensorFlow and early interest in it. The third area is Kubernetes, with a lot of interest in leveraging it to scale out consumption.
  • Other open source tools are widely used, such as Spark, R, and Python, which is why platforms offer integrations with them. In our workflows, it is possible to introduce a new node in which to script Python, R, or Spark code. At execution time, the node runs that code and becomes part of the node pipeline in the workflow.
  • For a while, R was predominant, especially for operationalizing data science models. Now the real innovation is around Python, with a lot of tools, libraries, and support. People are starting to explore Spark and Kafka: Spark processes huge volumes at speed, and Kafka is a messaging system for getting data into Spark (a Kafka-to-Spark sketch follows this list). R is great for analyzing historical data; you take the model, get real-time data, marshal it so it can be processed in real time, and apply the models.
  • Some of the common tools and frameworks include in-memory relational databases like VoltDB; Spark, Storm, Flink, and Kafka; and NoSQL databases.
  • We provide a LINQ-type API for all CRUD data operations, which can be called from a variety of languages such as C#, Go, Java, JavaScript, Python, Ruby, Scala, and Swift. Designed as a high-performance (predictable, low-latency) database, our primary data access was created to be programmatic rather than declarative, and as such, we do not currently support SQL. As our customers add analysis to the workloads they are currently running, we will be adding SQL support. We support exporting data to backend data warehouses and data lakes for analysis. For ingestion, tools such as Kafka and Kinesis are gaining traction as the default data communication pipes within our customers' environments.
  • We see SQL as the primary protocol used by companies of all sizes for data residing in our platform. For deployment management, we have seen a rapidly growing use of Docker and Kubernetes. For data ingestion, Apache Kafka is used by many of our customers and we recently announced the certification of our Kafka Connector within the Confluent partner program. For analysis, we frequently see Apache Spark used along with Apache Ignite as an in-memory data store. 
  • Apache Kafka has become, essentially, a standard for streaming high volumes of data (particularly sensor data) into data analytical platforms in near real-time at ingest. For the highest analytical performance, in-database machine learning and advanced analytics are becoming an increasingly important way for organizations to deliver predictive analytics at scale. For reporting, there are a variety of data visualization tools on the market today – from Tableau to Looker to Microsoft Power BI to IBM Cognos to MicroStrategy and many others. Business analysts have never had more options to report on and visualize data. However, they should insist that their underlying data analytical platform has the scale and performance to enable them to get insight from the largest volumes of data with complete accuracy in seconds or minutes, not after the business opportunity has passed. 
  • We leverage several data ingestion and orchestration tools, with the Apache Kafka and NiFi projects being the most prevalent. We use Hadoop YARN with HBase/HDFS for our persistence layer, and we take advantage of data processing, predictive modeling, analytics, and deep learning projects such as Apache Zeppelin, Spark/Spark Streaming, Storm, scikit-learn, and Elasticsearch. In addition to these open source projects, we leverage Talend, Pentaho, Tableau, and other best-in-class commercially licensed tools.
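One theme above is Python's ML library support. As a minimal, hypothetical sketch of what that looks like in practice, the following scikit-learn example trains and evaluates a small classifier on a bundled toy dataset; the dataset and model choice are illustrative, not anything a respondent described:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Illustrative only: a bundled toy dataset stands in for real business data.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# A few lines of library code produce a trained, evaluable model.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```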
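Several respondents also describe Kafka as the pipe that feeds Spark. A minimal PySpark Structured Streaming sketch of that pattern follows; the broker address and the sensor-events topic name are assumptions, and the job additionally needs the spark-sql-kafka connector package on its classpath:

```python
from pyspark.sql import SparkSession

# Assumes a local Kafka broker and the spark-sql-kafka connector package.
spark = SparkSession.builder.appName("kafka-to-spark").getOrCreate()

# Subscribe to a (hypothetical) sensor-events topic as a streaming DataFrame.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "sensor-events")
    .load()
)

# Kafka delivers keys and values as bytes; cast the value to a string for downstream parsing.
payloads = events.selectExpr("CAST(value AS STRING) AS payload")

# Write to the console for demonstration; a real job would write to a durable sink.
query = payloads.writeStream.format("console").outputMode("append").start()
query.awaitTermination()
```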

TensorFlow, Tableau, PowerBI

  • 1) We use Amazon Athena (Apache Presto) for log analysis. 2) We use Mode Analytics for data visualization and reporting. 3) We use TensorFlow to analyze traffic patterns.
  • Data science from an ML perspective. The availability of DL frameworks such as TensorFlow, PyTorch, Keras, and Caffe has made a huge difference in applying ML and creating models for large-scale data (a short Keras sketch follows this list).
  • Working through the platforms as a way to deliver insights at scale. BI use cases are about scaling analysts: Tableau, PowerBI, MicroStrategy, TIBCO, and Qlik try to expand the number of people dashboards get in front of.
  • We see a lot of Spark as organizations move away from MapReduce. Java and Python are popular. Kafka is being used for ingestion, and Arcadia Data, Tableau, Qlik, and PowerBI for visualization.
  • Many projects use multiple languages and multiple analytics tools. We see a lot of SQL use, of course, and data science-oriented languages such as Python and R, but also significant use of classic programming languages such as Java and C#. For data science, the top package we're seeing as an adjunct to our products is TensorFlow, followed closely by self-service BI tools such as Tableau, PowerBI, and QlikView.
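As a minimal sketch of how little code a DL framework requires to define and train a model (the data shapes, layer sizes, and labels below are placeholders, not anything a respondent described), a small Keras model might look like this:

```python
import numpy as np
from tensorflow import keras

# Placeholder data: 1,000 samples with 20 features and a binary label.
X = np.random.rand(1000, 20).astype("float32")
y = np.random.randint(0, 2, size=(1000,))

# A small feed-forward network; a real model would be sized to the actual data.
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=32, verbose=0)
print(model.evaluate(X, y, verbose=0))
```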

Other

  • Open source. More are moving to streaming data. This is driven by a need/desire for real-time answers.
  • It depends on the project. We see multiple mechanisms being used for ingestion, enrichment, and document classification. SciByte and Thomson Reuters provide ontologies and intelligent tagging tools to drill down into the data, along with personality insights and sentiment analysis to enrich it.
  • The customer drives what they use from the browser. Customers are looking at how to build off the tools they already have. SQL is still the language for big data; it works on top of Hadoop and other databases.
  • OData isn't that new, but people are using it on both the server side and the client side. Others use GraphQL to dynamically query and retrieve data. There is a lot of new technology on the server side: MongoDB does certain things well, Redis is good for caching (a Redis cache-aside sketch follows this list), and S3 is useful for data storage, with Elasticsearch and S3 as the backend. We're getting more specific about what each is offering, with more clearly defined technologies and design patterns.
  • People who use R and Python stick with what they use. There are a number of APIs in the system with growing support. From an ingestion point of view, you want to offer as many ways to get data into and out of the system as possible and to support as many tools as possible. There is no critical mass, so cater to the talent; developer tools and APIs should support a wide range of both.
  • Larger companies would like people using the same tool for BI and data science since they have a mix of tools and it’s hard to standardize thousands of people on one tool. The way to integrate with different backends and accelerate production varies from tool to tool. We provide integration, acceleration, and a catalog of what the data is and the semantic meaning of the data. The catalog is centrally located in the platform. Pull security, integration, and acceleration into a central open source layer that works with all tools and data sources.
  • The big data world is quickly evolving in so many ways across all environments: on-premises, in the cloud, etc. We see lots of variations of languages, execution engines, and data formats. Our core value is allowing customers to bypass having to deal with all those different tools and standards. With the drag-and-drop, no-code environment that we deliver, customers don't have to code anything by hand. This allows them to develop data pipelines once as part of a repeatable framework and then deploy them en masse regardless of the technology, platform, or language. For example, we have customers that have used Infoworks to implement on-premises on Cloudera once and then run those same pipelines without re-coding on Google Cloud using Dataproc.
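For the Redis-as-cache point above, here is a minimal cache-aside sketch in Python; the key naming, TTL, and the expensive_query function are hypothetical stand-ins for a slow backend call:

```python
import json

import redis

# Assumes a local Redis instance; host, port, and key naming are illustrative.
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)

def expensive_query(user_id):
    # Hypothetical placeholder for a slow backend call (database, API, etc.).
    return {"user_id": user_id, "segment": "premium"}

def get_user_profile(user_id, ttl_seconds=300):
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)        # cache hit
    profile = expensive_query(user_id)   # cache miss: compute and store
    cache.set(key, json.dumps(profile), ex=ttl_seconds)
    return profile

print(get_user_profile(42))
```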



Topics:
big data, python, apache spark, apache kafka, big data frameworks

Opinions expressed by DZone contributors are their own.


