The Data Source APIs define how Spark reads from and writes to storage systems, for example Hadoop's InputFormat/OutputFormat and Hive's SerDe. These APIs are a natural fit when programming against RDDs in Spark. They get the job done, but they are costly for users to work with, and Spark cannot optimize through them. To address this, Spark 1.3 introduced Data Source API V1, which makes it easy to read data from a variety of sources while letting Spark SQL's optimizer apply optimizations such as column pruning and filter pushdown to the reads.
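To make these two optimizations concrete, here is a toy illustration in plain Python (not Spark's actual API; the class and method names are invented for this sketch). The point is that the scan only materializes the columns and rows the query needs, instead of loading everything and filtering afterwards:

```python
# Toy rows standing in for a storage system.
ROWS = [
    {"id": 1, "name": "a", "age": 20, "city": "SH"},
    {"id": 2, "name": "b", "age": 35, "city": "BJ"},
    {"id": 3, "name": "c", "age": 41, "city": "SZ"},
]

class ToyScan:
    """Hypothetical reader that accepts pushed-down columns and predicates."""

    def __init__(self, rows):
        self.rows = rows
        self.required_columns = None   # set by column pruning
        self.pushed_filters = []       # set by filter pushdown

    def prune_columns(self, columns):
        self.required_columns = columns
        return self

    def push_filters(self, *predicates):
        self.pushed_filters.extend(predicates)
        return self

    def read(self):
        for row in self.rows:
            # Filtering happens at the source, before rows reach the engine...
            if all(p(row) for p in self.pushed_filters):
                # ...and only the required columns are materialized.
                yield {c: row[c] for c in (self.required_columns or row)}

scan = ToyScan(ROWS).prune_columns(["id", "age"]).push_filters(lambda r: r["age"] > 30)
print(list(scan.read()))  # [{'id': 2, 'age': 35}, {'id': 3, 'age': 41}]
```

Without pushdown, the engine would have to read all four columns of all three rows and discard most of that work afterwards; with it, the source does the cheap filtering itself.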
Data Source API V1 abstracts a set of interfaces that cover most scenarios, but as adoption grew, a number of problems surfaced:
Some interfaces depend on SQLContext and DataFrame
Limited extensibility, making it hard to push down other operators
No support for columnar reads
No partitioning or sorting information
Writes are not transactional
No streaming support
To address these problems, the community introduced Data Source API V2 starting with Apache Spark 2.3.0. While keeping the original functionality, it fixes several of V1's issues: it no longer depends on the upper-level APIs, and it is far more extensible. The corresponding issue is SPARK-15689. Although the feature first appeared in the Spark 2.x line, it was not very stable, so the community opened two separate issues to track stabilization and new functionality for Data Source API V2: SPARK-25186 and SPARK-22386. The stable version of Data Source API V2, together with its new functionality, will ship with Apache Spark 3.0.0 at the end of the year, making it one of the major new features of the 3.0.0 release.
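The decoupling from the upper-level APIs can be sketched as a chain of small, engine-agnostic objects. The following plain-Python sketch only loosely mirrors the layering of the real Scala interfaces (Table, ScanBuilder, Batch, PartitionReader); everything else here is simplified and invented for illustration, so the source never has to touch SQLContext or DataFrame:

```python
class PartitionReader:
    """Reads the rows of one input partition."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._pos = -1

    def next(self):
        self._pos += 1
        return self._pos < len(self._rows)

    def get(self):
        return self._rows[self._pos]


class Batch:
    """Describes how a scan splits into partitions and builds readers."""

    def __init__(self, partitions):
        self._partitions = partitions

    def plan_input_partitions(self):
        return self._partitions

    def create_reader(self, partition):
        return PartitionReader(partition)


class ScanBuilder:
    """Built by the table; pushdown hooks would be negotiated here."""

    def __init__(self, data, num_partitions=2):
        self._data, self._n = data, num_partitions

    def build(self):
        # Round-robin split into num_partitions chunks.
        chunks = [self._data[i::self._n] for i in range(self._n)]
        return Batch(chunks)


class Table:
    """Entry point: the engine asks the table for a ScanBuilder."""

    def __init__(self, data):
        self._data = data

    def new_scan_builder(self):
        return ScanBuilder(self._data)


# The "engine" side: plan partitions, then pull rows through the readers.
table = Table([1, 2, 3, 4, 5])
batch = table.new_scan_builder().build()
rows = []
for part in batch.plan_input_partitions():
    reader = batch.create_reader(part)
    while reader.next():
        rows.append(reader.get())
print(sorted(rows))  # [1, 2, 3, 4, 5]
```

Because the engine only ever talks to these narrow interfaces, the same source definition can, in the real API, be reused for batch and streaming reads and extended with new pushdown capabilities.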
SPARK-11215 Multiple columns support added to various Transformers: StringIndexer
SPARK-11150 Implement Dynamic Partition Pruning
SPARK-13677 Support Tree-Based Feature Transformation
SPARK-16692 Add MultilabelClassificationEvaluator
SPARK-19591 Add sample weights to decision trees
SPARK-19712 Pushing Left Semi and Left Anti joins through Project, Aggregate, Window, Union etc.
SPARK-19827 R API for Power Iteration Clustering
SPARK-20286 Improve logic for timing out executors in dynamic allocation
SPARK-20636 Eliminate unnecessary shuffle with adjacent Window expressions
SPARK-22148 Acquire new executors to avoid hang because of blacklisting
SPARK-22796 Multiple columns support added to various Transformers: PySpark QuantileDiscretizer
SPARK-23128 A new approach to do adaptive execution in Spark SQL
SPARK-23155 Apply custom log URL pattern for executor log URLs in SHS
SPARK-23539 Add support for Kafka headers
SPARK-23674 Add Spark ML Listener for Tracking ML Pipeline Status
SPARK-23710 Upgrade the built-in Hive to 2.3.5 for hadoop-3.2
SPARK-24333 Add fit with validation set to Gradient Boosted Trees: Python API
SPARK-24417 Build and Run Spark on JDK11
SPARK-24615 Accelerator-aware task scheduling for Spark
SPARK-24920 Allow sharing Netty’s memory pool allocators
SPARK-25250 Fix race condition with tasks running when new attempt for same stage is created leads to other task in the next attempt running on the same partition id retry multiple times
SPARK-25341 Support rolling back a shuffle map stage and re-generate the shuffle files
SPARK-25348 Data source for binary files
SPARK-25390 data source V2 API refactoring
SPARK-25501 Add Kafka delegation token support
SPARK-25603 Generalize Nested Column Pruning
SPARK-26132 Remove support for Scala 2.11 in Spark 3.0.0
SPARK-26215 define reserved keywords after SQL standard
SPARK-26412 Allow Pandas UDF to take an iterator of pd.DataFrames
SPARK-26651 Use Proleptic Gregorian calendar
SPARK-26759 Arrow optimization in SparkR’s interoperability
SPARK-26848 Introduce new option to Kafka source: offset by timestamp (starting/ending)
SPARK-27064 create StreamingWrite at the beginning of streaming execution
SPARK-27119 Do not infer schema when reading Hive serde table with native data source
SPARK-27225 Implement join strategy hints
SPARK-27240 Use pandas DataFrame for struct type argument in Scalar Pandas UDF
SPARK-27338 Fix deadlock between TaskMemoryManager and UnsafeExternalSorter$SpillableIterator
SPARK-27396 Public APIs for extended Columnar Processing Support
SPARK-27463 Support Dataframe Cogroup via Pandas UDFs
SPARK-27589 Re-implement file sources with data source V2 API
SPARK-27677 Disk-persisted RDD blocks served by shuffle service, and ignored for Dynamic Allocation
SPARK-27699 Partially push down disjunctive predicated in Parquet/ORC
SPARK-27763 Port test cases from PostgreSQL to Spark SQL
SPARK-27884 Deprecate Python 2 support
SPARK-27921 Convert applicable *.sql tests into UDF integrated test base
SPARK-27963 Allow dynamic allocation without an external shuffle service
SPARK-28177 Adjust post shuffle partition number in adaptive execution
SPARK-28199 Move Trigger implementations to Triggers.scala and avoid exposing these to the end users