Foreword to 'Big Data Analytics Beyond Hadoop'
One point that I attempt to impress upon people learning about
Big Data is that while Apache Hadoop is quite useful, and most
certainly quite successful as a technology, the underlying premise has
become dated. Consider the timeline: Google's MapReduce (MR)
implementation came from work that dates back to 2002 and was published
in 2004. Yahoo! began to sponsor the Hadoop project in 2006. MR is
based on the economics of data centers from a decade ago. Since that
time, much has changed: multi-core processors, large memory
spaces, 10G networks, SSDs, and the like have all become
cost-effective. These dramatically alter the trade-offs for building
fault-tolerant distributed systems at scale on commodity hardware.
Moreover, our notions of what can be accomplished with
data at scale have also changed. Successes of firms such as Amazon,
eBay, and Google raised the bar, prompting subsequent business leaders
to rethink, “What can be performed with data?” For example, would
there have been a use case for large-scale graph queries to optimize
business for a large book publisher a decade ago? No, not particularly.
It is unlikely that senior executives in publishing would have bothered
to read such an outlandish engineering proposal. The marketing of this
book itself will be based on a large-scale, open source, graph query
engine described in subsequent chapters. Similarly, the ad-tech and
social network use cases that drove the development and adoption of
Apache Hadoop are now dwarfed by data rates from the Industrial
Internet, the so-called “Internet of Things” (IoT)—in some cases, by
several orders of magnitude.
The shape of the underlying systems has changed so much
since MR at scale on commodity hardware was first formulated.
The shape of our business needs and expectations has also changed
dramatically because many people have begun to realize what is
possible. Furthermore, the applications of math for data at scale are
quite different from what would have been conceived a decade ago.
Popular programming languages have evolved along with that to
support better software engineering practices for parallel processing.
Dr. Agneeswaran considers these topics and more in a careful,
methodical approach, presenting a thorough view of the contemporary
Big Data environment and beyond. He leads the reader to look past
the preceding decade’s fixation on batch analytics via MapReduce.
The chapters include historical context, which is crucial for
understanding the material, and they provide clear business use cases that
are essential for applying this technology to what matters. The arguments
provide analyses, per use case, that indicate why Hadoop does not
particularly fit. They are thoroughly researched with citations, making for
an excellent survey of available open source technologies, along with a
review of the published literature for that which is not open source.
This book explores the best practices and available technologies
for data access patterns that are required in business today beyond
Hadoop: iterative, streaming, graphs, and more. For example, in some
businesses revenue loss can be measured in milliseconds, such that the
notion of a “batch window” has no bearing. Real-time analytics are the
only conceivable solutions in those cases. Open source frameworks
such as Apache Spark, Storm, Titan, GraphLab, and Apache Mesos
address these needs. Dr. Agneeswaran guides the reader through the
architectures and computational models for each, exploring common
design patterns. He includes both the scope of business implications
and the details of specific implementations and code examples.
Along with these frameworks, this book also presents a compelling
case for the open standard PMML, allowing predictive models to be
migrated consistently between different platforms and environments.
It also leads up to YARN and the next generation beyond MapReduce.
This is precisely the focus that is needed in industry today—given
that Hadoop was based on IT economics from 2002, while the newer
frameworks address contemporary industry use cases much more
closely. Moreover, this book provides both an expert guide and a warm
welcome into a world of possibilities enabled by Big Data
analytics.
Contents:
Foreword
About the Author
Chapter 1 Introduction: Why Look Beyond Hadoop Map-Reduce?
  Hadoop Suitability
  Big Data Analytics: Evolution of Machine Learning Realizations
  Closing Remarks
  References
Chapter 2 What Is the Berkeley Data Analytics Stack (BDAS)?
  Motivation for BDAS
  BDAS Design and Architecture
  Spark: Paradigm for Efficient Data Processing on a Cluster
  Shark: SQL Interface over a Distributed System
  Mesos: Cluster Scheduling and Management System
  Closing Remarks
  References
Chapter 3 Realizing Machine Learning Algorithms with Spark
  Basics of Machine Learning
  Logistic Regression: An Overview
  Logistic Regression Algorithm in Spark
  Support Vector Machine (SVM)
  PMML Support in Spark
  Machine Learning on Spark with MLbase
  References
Chapter 4 Realizing Machine Learning Algorithms in Real Time
  Introduction to Storm
  Design Patterns in Storm
  Implementing Logistic Regression Algorithm in Storm
  Implementing Support Vector Machine Algorithm in Storm
  Naive Bayes PMML Support in Storm
  Real-Time Analytic Applications
  Spark Streaming
  References
Chapter 5 Graph Processing Paradigms
  Pregel: Graph-Processing Framework Based on BSP
  Open Source Pregel Implementations
  GraphLab
  References
Chapter 6 Conclusions: Big Data Analytics Beyond Hadoop Map-Reduce
  Overview of Hadoop YARN
  Other Frameworks over YARN
  What Does the Future Hold for Big Data Analytics?
  References
Appendix A Code Sketches
  Code for Naive Bayes PMML Scoring in Spark
  Code for Linear Regression PMML Support in Spark
  Page Rank in GraphLab
  SGD in GraphLab
Index