Stream Processing with Apache Spark (Reprint Edition, English)
Authors: Gerard Maas, François Garillot
Publication date: 2020 edition
Description
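Before you can build analytics tools to gain quick insights, you first need to know how to process data in real time. With this practical guide, developers familiar with Apache Spark will learn how to put this in-memory framework to use for streaming data. You will discover how Spark lets you write streaming jobs in almost the same way you write batch jobs. Authors Gerard Maas and François Garillot help you explore the theoretical underpinnings of Apache Spark. In two parts, the book compares and contrasts the two streaming APIs Spark now supports: the original Spark Streaming library and the newer Structured Streaming API. Learn fundamental stream-processing concepts and examine different streaming architectures. Explore Structured Streaming through practical examples, and learn about different aspects of stream processing in detail. Create and operate streaming jobs and applications with Spark Streaming, and integrate Spark Streaming with other Spark APIs. Learn advanced Spark Streaming techniques, including approximation algorithms and machine learning algorithms. Compare Apache Spark to other stream-processing projects, including Apache Storm, Apache Flink, and Apache Kafka Streams.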
Table of Contents
Foreword
Preface
Part Ⅰ. Fundamentals of Stream Processing with Apache Spark
1. Introducing Stream Processing
What Is Stream Processing?
Batch Versus Stream Processing
The Notion of Time in Stream Processing
The Factor of Uncertainty
Some Examples of Stream Processing
Scaling Up Data Processing
MapReduce
The Lesson Learned: Scalability and Fault Tolerance
Distributed Stream Processing
Stateful Stream Processing in a Distributed System
Introducing Apache Spark
The First Wave: Functional APIs
The Second Wave: SQL
A Unified Engine
Spark Components
Spark Streaming
Structured Streaming
Where Next?
2. Stream-Processing Model
Sources and Sinks
Immutable Streams Defined from One Another
Transformations and Aggregations
Window Aggregations
Tumbling Windows
Sliding Windows
Stateless and Stateful Processing
Stateful Streams
An Example: Local Stateful Computation in Scala
A Stateless Definition of the Fibonacci Sequence as a Stream
Transformation
Stateless or Stateful Streaming
The Effect of Time
Computing on Timestamped Events
Timestamps as the Provider of the Notion of Time
Event Time Versus Processing Time
Computing with a Watermark
Summary
3. Streaming Architectures
Components of a Data Platform
Architectural Models
The Use of a Batch-Processing Component in a Streaming Application
Referential Streaming Architectures
The Lambda Architecture
The Kappa Architecture
Streaming Versus Batch Algorithms
Streaming Algorithms Are Sometimes Completely Different in Nature
……