How can I pass a list of columns (multiple columns), instead of a single column, when filtering in PySpark with a command like this?
new_df = new_df.filter(new_df.color.isin(*filter_list) == False)
For example:
I use this code to remove junk values (#, $) from a single column:
filter_list = ['##', '$']
new_df = new_df.filter(new_df.color.isin(*filter_list) == False)
In this example, 'color' is the column.
But I want to remove rows with junk values (#, ##, $, $$$) across multiple columns.
Sample input:
id name Salary
# Yogita 3000
2 Bhavana 5000
$$ ### 7000
%$4# Neha $$$$
Sample output:
id name salary
2 Bhavana 5000
I would appreciate any help.
Thanks in advance,
Yogita
Posted on 2017-11-27 09:21:47
Here is an answer using a user-defined function:
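from functools import reduce  # reduce is not a builtin in Python 3
from itertools import chain
import pyspark.sql.functions as f
from pyspark.sql.types import BooleanType
filter_list = ['#', '##', '$', '$$$']
def filterfn(*x):
    # One boolean per (column value, junk token) pair: True if the token does not occur in the value.
    # str() lets the substring test work on non-string columns too.
    booleans = list(chain(*[[junk not in str(elt) for junk in filter_list] for elt in x]))
    # Keep the row only if every check passed.
    return reduce(lambda a, b: a and b, booleans, True)
filter_udf = f.udf(filterfn, BooleanType())
# Pass every column of the DataFrame to the UDF.
new_df.filter(filter_udf(*new_df.columns)).show(10)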
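To see which rows this predicate keeps without starting Spark, here is a minimal plain-Python sketch that applies the same substring check to the sample input from the question. The name keep_row and the choice to treat every value as a string are my assumptions for illustration, not part of the original answer.

```python
from functools import reduce

filter_list = ['#', '##', '$', '$$$']

def keep_row(*values):
    """Return True only if no junk token occurs inside any value."""
    # str() is an assumption so that numbers and None can be checked too.
    checks = [junk not in str(v) for v in values for junk in filter_list]
    return reduce(lambda a, b: a and b, checks, True)

# Sample rows from the question, with all values assumed to be strings.
rows = [
    ('#', 'Yogita', '3000'),
    ('2', 'Bhavana', '5000'),
    ('$$', '###', '7000'),
    ('%$4#', 'Neha', '$$$$'),
]

kept = [row for row in rows if keep_row(*row)]
print(kept)  # [('2', 'Bhavana', '5000')]
```

Note that this is a substring test, not an exact match like isin: values such as '%$4#' and '$$$$' are dropped even though they are not in filter_list, which matches the sample output in the question.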
https://stackoverflow.com/questions/47511967