Suppose I have a list of strings that contains near-duplicates of varying length:
liste = ['I am googling for the solution for an hour now',
'I am googling for the solution for an hour now --Sent via mail--',
'I am googling for the solution for an hour now --Sent via mail-- What are you doing?',
'Hello I am good thanks >> How are you?',
'Hello I am good thanks',
'Hello I am good thanks >>']
Desired output:
liste = ['I am googling for the solution for an hour now', 'Hello I am good thanks']
As you can see, the strings are near-duplicates rather than exact duplicates, so something like this will not work:
mylist = list(dict.fromkeys(liste))
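Any idea how to keep only the shortest version of each duplicate? The duplicates are always consecutive.
Edit:
The order of the input list should not be changed.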
Posted on 2021-11-11 15:38:39
You can do the following:
mylist = []
for s in sorted(liste):
    # after sorting, a string comes right before the strings that extend it
    if not (mylist and s.startswith(mylist[-1])):
        mylist.append(s)
Then you can restore the original order of appearance:
mylist[:] = filter(set(mylist).__contains__, liste)
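The two steps above can be wrapped into a single function. This is only a minimal sketch: the name `keep_shortest_prefixes` is hypothetical (it is not part of the answer), and it assumes that a "duplicate" is any string that starts with another string in the list. It also drops exact repeats when restoring the order, which the one-liner above does not do.

```python
def keep_shortest_prefixes(strings):
    """Drop every string that starts with another string from the list."""
    kept = []
    # In lexicographic order, a string sorts directly before all strings it is a prefix of.
    for s in sorted(strings):
        if not (kept and s.startswith(kept[-1])):
            kept.append(s)
    kept_set = set(kept)
    # Walk the input (exact repeats removed) to restore the original order.
    return [s for s in dict.fromkeys(strings) if s in kept_set]
```

Calling it on the `liste` from the question should return the two short strings in their input order.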
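Posted on 2021-11-11 15:33:39
OK, although I suggested regular expressions in the comments, I went with an approach that does not use them. Instead, I build a numpy array that tracks how similar the strings are and use it to find the matching strings. It is a bit clunky, and the main algorithm in the nested for loops could probably be cleaned up to improve performance, but it seems to work.
When comparing a string with itself I use a default value of 0.9 rather than 1, so that a string does not always end up matched to itself, but I have not really explored whether this is necessary.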
import numpy as np
mylist = ['I am googling for the solution for an hour now',
'I am googling for the solution for an hour now --Sent via mail--',
'I am googling for the solution for an hour now --Sent via mail-- What are you doing?',
'Hello I am good thanks >> How are you?',
'Hello I am good thanks',
'Hello I am good thanks >>']
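N = len(mylist)
# overlap[i, j]: length of the common prefix of strings i and j, as a fraction of len(mylist[i])
overlap = np.ones((N, N))
for i in range(N):
    for j in range(N):
        if i == j:
            # 0.9 rather than 1, so a shorter prefix string (ratio 1.0) wins over the string itself
            overlap[i, j] = 0.9
        else:
            x = min(len(mylist[i]), len(mylist[j]))
            k = 0
            # count the matching leading characters
            while k < x and mylist[i][k] == mylist[j][k]:
                k += 1
            overlap[i, j] = k / len(mylist[i])
newlist = []
for i in range(N):
    # pick the string that is most completely contained as a prefix of string i
    j = np.argmax(overlap[:, i])
    print(f"{mylist[i]} --> {mylist[j]}")
    newlist.append(mylist[j])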
#I am googling for the solution for an hour now --> I am googling for the solution for an hour now
#I am googling for the solution for an hour now --Sent via mail-- --> I am googling for the solution for an hour now
#I am googling for the solution for an hour now --Sent via mail-- What are you doing? --> I am googling for the solution for an hour now
#Hello I am good thanks >> How are you? --> Hello I am good thanks
#Hello I am good thanks --> Hello I am good thanks
#Hello I am good thanks >> --> Hello I am good thanks
Your new collection is then:
print(list(set(newlist)))
#['Hello I am good thanks', 'I am googling for the solution for an hour now']
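Note that a `set` does not guarantee any particular order, so the order printed above may differ between runs. Since the question asks for the input order to be kept, one option is `dict.fromkeys`, which keeps first-occurrence order on Python 3.7+. A minimal sketch, using a hardcoded stand-in for the `newlist` built above:

```python
# Stand-in for `newlist`: every input string replaced by its shortest match.
newlist = [
    'I am googling for the solution for an hour now',
    'I am googling for the solution for an hour now',
    'I am googling for the solution for an hour now',
    'Hello I am good thanks',
    'Hello I am good thanks',
    'Hello I am good thanks',
]

# Dict keys are unique and keep insertion order, so this deduplicates without reordering.
result = list(dict.fromkeys(newlist))
print(result)
```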
Posted on 2021-11-11 15:16:56
You can sort the list by length and then, for each element, check whether any of the other (longer) strings start with it.
mySortedList = list(set(liste))
mySortedList.sort(key=len)
i = 0
while i < len(mySortedList) - 1:
    j = i + 1
    while j < len(mySortedList):
        if mySortedList[j].startswith(mySortedList[i]):
            # a longer string that starts with the shorter one: drop it
            mySortedList.pop(j)
        else:
            j += 1
    i += 1
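The loop above leaves the result ordered by length, not by position in the input. Below is a sketch of the same idea as a function that also restores the input order; the name `dedupe_by_prefix` is hypothetical, and the function assumes its argument is a list.

```python
def dedupe_by_prefix(strings):
    """Keep only the strings that do not start with a shorter string from the list."""
    kept = []
    # Shortest first, so any prefix is seen before the strings that extend it.
    for s in sorted(set(strings), key=len):
        if not any(s.startswith(k) for k in kept):
            kept.append(s)
    # Order the survivors by their first position in the input.
    return sorted(kept, key=strings.index)
```

Like the loop above, this compares each string against every kept string, so it is quadratic in the worst case; that should be fine for short lists.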
https://stackoverflow.com/questions/69930533