I am very new to PySpark. My dataframe looks like this -
id  value  subject
1   75     eng
1   80     his
2   83     math
2   73     science
3   88     eng
I want my dataframe to look like this -
id  eng  his  math  science
1   .48  .52  0     0
2   0    0    .53   .47
3   1    0    0     0
That is, sum each row and then divide every cell by that row's total; I want each cell expressed as its share of the row sum.
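For example, for id 1 the row total is 75 + 80 = 155, so eng = 75/155 ≈ 0.48 and his = 80/155 ≈ 0.52.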
I have tried the following code, but it does not work -
from pyspark.sql import functions as F
from pyspark.sql import Window

df = df.withColumn('rank', F.dense_rank().over(Window.orderBy("id", "value", "subject")))
df.withColumn('combcol', F.concat(F.lit('col_'), df['rank'])) \
  .groupby('id').pivot('combcol').agg(F.first('value')).show()
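(A likely reason this fails: dense_rank here numbers the rows 1 through 5 across the whole frame, so the pivot produces columns col_1 … col_5 rather than one column per subject.)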
Posted on 2019-05-09 03:19:48
Check whether the following code works for you.
from pyspark.sql import functions as F
df = spark.createDataFrame(
[ (1,75,'eng'), (1,80,'his'), (2,83,'math'), (2,73,'science'), (3,88,'eng') ]
, [ 'id','value','subject' ]
)
# pivot subjects into columns, one row per id, filling missing subjects with 0
df1 = df.groupby('id').pivot('subject').agg(F.first('value')).fillna(0)
# subject column names (everything except 'id'), used for the row total
cols = df1.columns[1:]
# add the row total, then replace each subject column with its share of that total
df1.withColumn('total', sum([F.col(c) for c in cols])) \
   .select('id', *[F.format_number(F.col(c)/F.col('total'), 2).alias(c) for c in cols]) \
   .show()
#+---+----+----+----+-------+
#| id| eng| his|math|science|
#+---+----+----+----+-------+
#| 1|0.48|0.52|0.00| 0.00|
#| 3|1.00|0.00|0.00| 0.00|
#| 2|0.00|0.00|0.53| 0.47|
#+---+----+----+----+-------+
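As a possible variation (not part of the original answer, just a sketch assuming the same df as above): the row share can also be computed with a window aggregate before pivoting. F.round keeps the columns numeric, whereas F.format_number returns formatted strings.

from pyspark.sql import functions as F
from pyspark.sql import Window

# share of each row's value within its id group, rounded to 2 decimals
w = Window.partitionBy('id')
df2 = df.withColumn('pct', F.round(F.col('value') / F.sum('value').over(w), 2))

# pivot the precomputed shares into one column per subject
df2.groupby('id').pivot('subject').agg(F.first('pct')).fillna(0).show()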
https://stackoverflow.com/questions/56051438