我有一个整数的rdd(即RDD[Int]
),我想做的是计算以下10个百分数:[0th, 10th, 20th, ..., 90th, 100th]
.怎样才是最有效的方法?
可以参考下面的链接:
https://gist.github.com/felixcheung/92ae74bc349ea83a9e29
函数如下:
/**
* compute percentile from an unsorted Spark RDD
* @param data: input data set of Long integers
* @param tile: percentile to compute (eg. 85 percentile)
* @return value of input data at the specified percentile
*/
def computePercentile(data: RDD[Long], tile: Double): Double = {
// NIST method; data to be sorted in ascending order
val r = data.sortBy(x => x)
val c = r.count()
if (c == 1) r.first()
else {
val n = (tile / 100d) * (c + 1d)
val k = math.floor(n).toLong
val d = n - k
if (k <= 0) r.first()
else {
val index = r.zipWithIndex().map(_.swap)
val last = c
if (k >= c) {
index.lookup(last - 1).head
} else {
index.lookup(k - 1).head + d * (index.lookup(k).head - index.lookup(k - 1).head)
}
}
}
}
你可以:
计算中位数和第九十九百分位数:getPercentiles(rdd,新双)。[]{0.5,0.99},大小,数字);
在Java 8中:
public static double[] getPercentiles(JavaRDD<Double> rdd, double[] percentiles, long rddSize, int numPartitions) {
double[] values = new double[percentiles.length];
JavaRDD<Double> sorted = rdd.sortBy((Double d) -> d, true, numPartitions);
JavaPairRDD<Long, Double> indexed = sorted.zipWithIndex().mapToPair((Tuple2<Double, Long> t) -> t.swap());
for (int i = 0; i < percentiles.length; i++) {
double percentile = percentiles[i];
long id = (long) (rddSize * percentile);
values[i] = indexed.lookup(id).get(0);
}
return values;
}
相似问题