Goal
Fuzzy-match and correct a misfilled Excel file: approximate-match each input against an accepted list, then compare the erroneous data with the FuzzyWuzzy matches.
# Load the Excel file into a DataFrame
import pandas as pd

xl = pd.read_excel("/../data/expenses.xlsx")
# Let's clarify how many similar categories exist...
from pandasql import sqldf

q = """
SELECT DISTINCT Expense
FROM xl
ORDER BY Expense ASC
"""
expenses = sqldf(q)
print(expenses)
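For reference, the same distinct list can be produced directly in pandas without pandasql; a minimal equivalent sketch (assuming the column is named Expense):

# Pure-pandas version of the SQL above: unique categories, sorted ascending.
expenses = xl['Expense'].drop_duplicates().sort_values().to_frame()
print(expenses)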
# Let's add some acceptable categories and use fuzzywuzzy to match
from fuzzywuzzy import fuzz, process

accepted = ['Severance', 'Legal Fees', 'Import & Export Fees', 'I.T. Fees', 'Board Fees', 'Acquisition Fees']

# Select from the list of accepted values and return the closest match
process.extractOne("Company Acquired", accepted, scorer=fuzz.token_set_ratio)
('Acquisition Fees', 38) is not a high score, but it is high enough to return the expected output.
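Since extractOne returns a (match, score) tuple, a score floor can be applied before accepting a correction; a small sketch (the threshold of 30 is an illustrative assumption to tune):

# Illustrative guard: only accept the match if its score clears a threshold.
match, score = process.extractOne("Company Acquired", accepted, scorer=fuzz.token_set_ratio)
corrected = match if score >= 30 else None  # 30 is an assumed cutoff, tune as needed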
!!!!!ISSUE!!!!!
# Time to loop through all the expenses and use FuzzyWuzzy to generate and return the closest matches.
def correct_expense(expense):
    for expense in expenses:
        return expense, process.extractOne(expense, accepted, scorer=fuzz.token_set_ratio)

correct_expense(expenses)
('Expense', ('Legal Fees', 47))
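Note: iterating over a DataFrame yields its column names, and the return statement exits on the first one, which is why the call produces ('Expense', ...). A minimal sketch that iterates over the column's values instead (the function name is hypothetical):

# Sketch: loop over the values of the Expense column rather than the DataFrame itself.
def correct_expenses(df):
    results = []
    for expense in df['Expense']:
        results.append((expense, process.extractOne(expense, accepted, scorer=fuzz.token_set_ratio)))
    return results

correct_expenses(expenses)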
Posted on 2017-01-02 12:58:37
The way I've done this in the past is to simply use the get_close_matches function from Python's difflib module. You can then create a function to get the closest match and apply it to the Expense column.
from difflib import get_close_matches

def correct_expense(row):
    accepted = ['Severance', 'Legal Fees', 'Import & Export Fees', 'I.T. Fees', 'Board Fees', 'Acquisition Fees']
    match = get_close_matches(row, accepted, n=1, cutoff=0.3)
    return match[0] if match else ''

df['Expense_match'] = df['Expense'].apply(correct_expense)
This produces the original Expense column alongside the closest matches from the accepted list.
You may need to fine-tune the accepted list and the cutoff value for get_close_matches (I found 0.3 worked well for the sample data).
Once you are happy with the results, you can change the function to overwrite the Expense column directly, then save to Excel with the DataFrame to_excel method, as sketched below.
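A minimal sketch of that final step (the output filename is an illustrative assumption):

# Overwrite the Expense column with the closest accepted match, then save.
df['Expense'] = df['Expense'].apply(correct_expense)
df.to_excel('expenses_corrected.xlsx', index=False)  # illustrative output path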
Posted on 2020-12-19 04:11:57
This is known as gazetteer deduplication: messy data is deduplicated by matching it against canonical data (i.e., the gazetteer). The pandas_dedupe library does exactly this.
Example:
import pandas as pd
import pandas_dedupe
clean_data = pd.DataFrame({'street': ['Onslow square', 'Sydney Mews', 'Summer Place', 'Bury Walk', 'sydney mews']})
messy_data = pd.DataFrame({'street_name':['Onslow sq', 'Sidney Mews', 'Summer pl', 'Onslow square', 'Bury walk', 'onslow sq', 'Bury Wall'],
'city' : ['London', 'London', 'London', 'London', 'London', 'London', 'London']})
dd = pandas_dedupe.gazetteer_dataframe(
clean_data,
messy_data,
field_properties = 'street_name',
canonicalize=True,
)
During this process, pandas_dedupe will ask you to label a few examples as duplicate or distinct records. The library then uses that knowledge to find potential duplicate entries, match them against the clean data, and return all relevant information, including its confidence in each result.
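To inspect the result, printing the returned frame is enough; the exact metadata columns appended (e.g. a confidence score and, with canonicalize=True, the canonical value) depend on the pandas_dedupe version, so treat specific column names as assumptions:

# dd is messy_data plus the match metadata appended by pandas_dedupe.
print(dd.head())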
https://stackoverflow.com/questions/41418287