Aggregating over keys, e.g. groupByKey and reduceByKey
Joining or regrouping two RDDs by key, e.g. join (when the parent RDDs are not hash-partitioned)
Operations that require repartitioning, e.g. partitionBy
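As a concrete sketch of the last case (the dataset and partition counts here are my own illustration, not from the original): partitionBy redistributes a pair RDD according to an explicit Partitioner, which forces a shuffle:

scala> import org.apache.spark.HashPartitioner
import org.apache.spark.HashPartitioner

//create a pair RDD with 2 partitions
scala> var pairs = sc.parallelize(List(("A",1),("B",2),("A",3)),2)

//explicitly redistribute it across 4 partitions by key hash
scala> var partitioned = pairs.partitionBy(new HashPartitioner(4))
scala> partitioned.partitions.size
res0: Int = 4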
Transformations

...
res: Array[(String, Int)] = Array((B,1), (A,4))
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
aggregateByKey is relatively complex, and I am not very fluent with it myself... For example, suppose we want to find the maximum value within each partition, and then sum those maxima across all partitions:
scala> var data = sc.parallelize(List((1,1),(1,2),(1,3),(2,4)),2)
data: org.apache.spark.rdd.RDD[(Int, Int)] = ...
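The call itself is elided in the original; a reconstruction consistent with the description above (seqOp takes the max within each partition, combOp sums the per-partition results) would be:

//seqOp: max within a partition; combOp: sum across partitions
scala> data.aggregateByKey(0)(math.max(_,_), _+_).collect
res0: Array[(Int, Int)] = Array((2,4), (1,5))

Partition 0 holds (1,1) and (1,2), partition 1 holds (1,3) and (2,4); for key 1 the per-partition maxima are 2 and 3, which sum to 5, while key 2 appears only once and yields 4. The output order may vary.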
join(otherDataset, [numTasks])

It behaves somewhat like: select a.value, b.value from a inner join b on a.key = b.key;

Here is an example
//create the first dataset
scala> var data1 = ...
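The example is truncated at this point in the original; below is a minimal reconstruction with illustrative data (the values and the second dataset are my own assumptions):

scala> var data1 = sc.parallelize(List(("A",1),("B",2),("C",3)))

//create the second dataset
scala> var data2 = sc.parallelize(List(("A",4),("B",5)))

//inner join on the key: only keys present in both RDDs survive
scala> data1.join(data2).collect
res0: Array[(String, (Int, Int))] = Array((A,(1,4)), (B,(2,5)))

As with a SQL inner join, key C is dropped because it has no match in data2; the output order may vary.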
coalesce(numPartitions, shuffle)

coalesce reassigns the RDD to the given number of partitions; the first parameter is the target partition count, and the second parameter specifies whether to perform a shuffle (it defaults to false)

//create the dataset
scala> var data = sc.parallelize(1 to 9,3)
data: org.apache.spark.rdd.RDD[Int] = ...
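The rest of the example is cut off in the original; the following sketch (my own continuation) shows how coalesce behaves on this 3-partition RDD:

//check the current number of partitions
scala> data.partitions.size
res0: Int = 3

//shrink to 2 partitions without a shuffle (the default)
scala> var result = data.coalesce(2)
scala> result.partitions.size
res1: Int = 2

//passing true as the second parameter forces a shuffle, which also allows increasing the partition count
scala> data.coalesce(5, true).partitions.size
res2: Int = 5

Note that repartition is simply coalesce with shuffle set to true.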