This can be done with the following steps:
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, countDistinct

spark = SparkSession.builder.appName("ArrayCount").getOrCreate()

# Sample data: each row pairs an id with an array of integers.
data = [("A", [1, 2, 3]),
        ("B", [2, 3, 4]),
        ("C", [3, 4, 5])]
df = spark.createDataFrame(data, ["id", "array_col"])

# Explode the array so each element gets its own row.
df_exploded = df.select("id", explode("array_col").alias("value"))

# Group back by id and count the distinct elements.
result = df_exploded.groupBy("id").agg(countDistinct("value").alias("distinct_count"))
result.show()
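For the sample data above, show() should print something like the following (row order is not guaranteed):

+---+--------------+
| id|distinct_count|
+---+--------------+
|  A|             3|
|  B|             3|
|  C|             3|
+---+--------------+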
This code computes the number of distinct elements in each row's array. It first explodes the array column so that every element becomes its own row, then groups by id and applies countDistinct to count the unique elements for each id. Finally, it prints the per-id distinct counts.
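On Spark 2.4 or later, a minimal alternative sketch avoids the explode and regroup entirely by combining the built-in array_distinct and size functions (reusing the df defined above):

from pyspark.sql.functions import array_distinct, size

# Deduplicate each array in place, then take its length per row.
result = df.select("id", size(array_distinct("array_col")).alias("distinct_count"))
result.show()

Because this is a plain per-row projection with no shuffle, it is generally cheaper than exploding and re-aggregating when you do not need the individual elements.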