Q: Shortest duplicate in a list of strings

Stack Overflow user

Asked 2021-11-11 15:04:41

6 answers · 90 views · 1 follower · 1 vote

Suppose I have a list of strings that contains near-duplicates of different lengths:

liste = ['I am googling for the solution for an hour now',
         'I am googling for the solution for an hour now --Sent via mail--',
         'I am googling for the solution for an hour now --Sent via mail-- What are you doing?',
         'Hello I am good thanks >> How are you?',
         'Hello I am good thanks',
         'Hello I am good thanks >>']

Desired output:

liste = ['I am googling for the solution for an hour now', 'Hello I am good thanks']

As you can see, the strings are near-duplicates rather than exact duplicates, so an approach like this does not work:

mylist = list(dict.fromkeys(liste))

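Do you know how to keep only the shortest duplicate? The duplicates are always consecutive.

Edit:

The order of the input list should not be changed.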

python

6 Answers

Stack Overflow user

Accepted answer

Posted 2021-11-11 15:38:39

You can do the following:

mylist = []
for s in sorted(liste):
    if not (mylist and s.startswith(mylist[-1])):
        mylist.append(s)
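To see why the sorted pass works, here is a minimal self-contained sketch of the same loop wrapped in a function. The name `drop_prefixed` is only illustrative and not part of the original answer. Lexicographic sorting places a string before every string that extends it, with nothing unrelated in between, so comparing against the last kept element is enough.

```python
def drop_prefixed(strings):
    """Keep each string unless it starts with the most recently kept one."""
    kept = []
    # After sorting, a prefix comes before all of its extensions.
    for s in sorted(strings):
        if not (kept and s.startswith(kept[-1])):
            kept.append(s)
    return kept


liste = ['I am googling for the solution for an hour now',
         'I am googling for the solution for an hour now --Sent via mail--',
         'I am googling for the solution for an hour now --Sent via mail-- What are you doing?',
         'Hello I am good thanks >> How are you?',
         'Hello I am good thanks',
         'Hello I am good thanks >>']

print(drop_prefixed(liste))
# ['Hello I am good thanks', 'I am googling for the solution for an hour now']
```

Note that the result comes back in sorted order, not in the order of the input list.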

You can then restore the original order of appearance:

mylist[:] = filter(set(mylist).__contains__, liste)
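Since the question states that duplicates are always consecutive, a single pass without sorting may also be enough. The sketch below is an alternative, not the answerer's method. The function name `shortest_per_run` is hypothetical, and it assumes every string in a run either extends, or is a prefix of, the shortest string seen so far in that run.

```python
def shortest_per_run(strings):
    """Collapse each run of consecutive near-duplicates to its shortest member."""
    result = []
    for s in strings:
        if result and (s.startswith(result[-1]) or result[-1].startswith(s)):
            # Same run as the current representative: keep the shorter string.
            if len(s) < len(result[-1]):
                result[-1] = s
        else:
            # Neither string is a prefix of the other: a new run begins.
            result.append(s)
    return result


liste = ['I am googling for the solution for an hour now',
         'I am googling for the solution for an hour now --Sent via mail--',
         'I am googling for the solution for an hour now --Sent via mail-- What are you doing?',
         'Hello I am good thanks >> How are you?',
         'Hello I am good thanks',
         'Hello I am good thanks >>']

print(shortest_per_run(liste))
# ['I am googling for the solution for an hour now', 'Hello I am good thanks']
```

This keeps the input order without a second step. It can split a run if two longer variants that diverge from each other appear before the shortest one, so the sorted approach is the safer choice when that may happen.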

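Votes 4

Stack Overflow user

Posted 2021-11-11 15:33:39

OK, although I suggested regular expressions in the comments, I ended up with an approach that does not use them. Instead, I build a NumPy array that tracks how similar the strings are and use it to find the matching strings. It is a bit clunky, and the main algorithm in the nested for loops could probably be cleaned up for performance, but it seems to work.

When a string is compared with itself, I use a default value of 0.9 rather than 1, so that a string does not always map to itself. I have not really explored whether this is necessary.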

import numpy as np

mylist = ['I am googling for the solution for an hour now',
          'I am googling for the solution for an hour now --Sent via mail--',
          'I am googling for the solution for an hour now --Sent via mail-- What are you doing?',
          'Hello I am good thanks >> How are you?',
          'Hello I am good thanks',
          'Hello I am good thanks >>']

N = len(mylist)

overlap = np.ones((N,N))

for i in range(N):
   for j in range(N):
      if i == j: overlap[i,j] = 0.9
      else:
         x = min(len(mylist[i]), len(mylist[j]))
         # k counts the leading characters the two strings share
         k = 0
         while k < x and mylist[i][k] == mylist[j][k]:
            k += 1
         overlap[i,j] = k / len(mylist[i])

newlist = []
for i in range(N):
   j = np.argmax(overlap[:,i])
   print(f"{mylist[i]} --> {mylist[j]}")
   newlist.append(mylist[j])
#I am googling for the solution for an hour now --> I am googling for the solution for an hour now
#I am googling for the solution for an hour now --Sent via mail-- --> I am googling for the solution for an hour now
#I am googling for the solution for an hour now --Sent via mail-- What are you doing? --> I am googling for the solution for an hour now
#Hello I am good thanks >> How are you? --> Hello I am good thanks
#Hello I am good thanks --> Hello I am good thanks
#Hello I am good thanks >> --> Hello I am good thanks

Your new set is then:

print(list(set(newlist)))
#['Hello I am good thanks', 'I am googling for the solution for an hour now']
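The same scoring idea can be sketched without NumPy. This is a simplified variant under stated assumptions, not the answerer's code: the helper name `map_to_shortest` is hypothetical, all strings are assumed to be non-empty, and the standard library's `os.path.commonprefix` (which compares character by character) replaces the inner loop. Using `dict.fromkeys` instead of `set` keeps the order of first appearance, which the question asks for.

```python
import os


def map_to_shortest(strings, self_score=0.9):
    """Map each string to the candidate whose text it covers most completely."""
    n = len(strings)
    mapped = []
    for i in range(n):
        scores = []
        for j in range(n):
            if i == j:
                # A string scores below 1 against itself, so a full prefix wins.
                scores.append(self_score)
            else:
                common = os.path.commonprefix([strings[i], strings[j]])
                # Fraction of candidate j that is a leading part of string i.
                scores.append(len(common) / len(strings[j]))
        # index() returns the first maximum, like np.argmax.
        mapped.append(strings[scores.index(max(scores))])
    return mapped


mylist = ['I am googling for the solution for an hour now',
          'I am googling for the solution for an hour now --Sent via mail--',
          'I am googling for the solution for an hour now --Sent via mail-- What are you doing?',
          'Hello I am good thanks >> How are you?',
          'Hello I am good thanks',
          'Hello I am good thanks >>']

unique = list(dict.fromkeys(map_to_shortest(mylist)))
print(unique)
# ['I am googling for the solution for an hour now', 'Hello I am good thanks']
```

One caveat of the 0.9 default: a string that makes up more than 90% of a longer variant would be mapped to that longer variant rather than to itself, so the threshold may need tuning for other data.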

Votes 1

Stack Overflow user

Posted 2021-11-11 15:16:56

You can sort the list by length, then walk through each element and check whether any of the other (longer) strings start with it.

mySortedList = list(set(liste))
mySortedList.sort(key=len)
i = 0
while i<len(mySortedList)-1 :
    j = i+1
    while j < len(mySortedList):
        if mySortedList[j].startswith(mySortedList[i]):
            mySortedList.pop(j)
        else:
            j+=1
    i+=1
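The loop above leaves its result in `mySortedList`, ordered by length rather than by position in the input. A possible way to get the input order back is sketched below. The function name `keep_shortest` is an assumption for illustration, and the prefix test is the same `startswith` check as in the loop.

```python
def keep_shortest(strings):
    """Drop every string that starts with a shorter string from the list."""
    kept = []
    # Visit shorter strings first so that prefixes are kept before extensions.
    for s in sorted(set(strings), key=len):
        if not any(s.startswith(k) for k in kept):
            kept.append(s)
    kept_set = set(kept)
    # Walk the input again to recover the original order, without repeats.
    return list(dict.fromkeys(s for s in strings if s in kept_set))


liste = ['I am googling for the solution for an hour now',
         'I am googling for the solution for an hour now --Sent via mail--',
         'I am googling for the solution for an hour now --Sent via mail-- What are you doing?',
         'Hello I am good thanks >> How are you?',
         'Hello I am good thanks',
         'Hello I am good thanks >>']

print(keep_shortest(liste))
# ['I am googling for the solution for an hour now', 'Hello I am good thanks']
```

Like the original loop, this compares every string against all kept strings, so it does not depend on the duplicates being consecutive.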

Votes 0

Original page content provided by Stack Overflow. Translation support provided by Tencent Cloud's Xiaowei IT-domain engine.

Original link:

https://stackoverflow.com/questions/69930533
