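Boolean retrieval means applying Boolean operations to a document collection. For example, take the following three documents (already normalized):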
doc1 = ["1", "hello", "word", "i", "love", "dazhu"]
doc2 = ["2", "hi", "i", "can", "speak", "love"]
doc3 = ["3", "can", "i", "say", "hello", "make", "dazhu", "hi"]
The task is to find the documents in this collection that contain both "i" and "can". Suppose the input is:
"i" AND "can"
The result should be [2, 3]. In other words, the operation tells us that doc2 and doc3 satisfy the condition.
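Before building any index, the query can be checked with a brute-force scan. The following is only a minimal sketch: the helper name `naive_and` is hypothetical, and it assumes the first element of each document is its ID, as the sample data suggests.

```python
# The three example documents; the first element is assumed to be the document ID.
docs = [
    ["1", "hello", "word", "i", "love", "dazhu"],
    ["2", "hi", "i", "can", "speak", "love"],
    ["3", "can", "i", "say", "hello", "make", "dazhu", "hi"],
]


def naive_and(docs, term_a, term_b):
    """Return the IDs of documents containing both terms by scanning every document."""
    hits = []
    for doc in docs:
        doc_id, terms = int(doc[0]), set(doc[1:])
        if term_a in terms and term_b in terms:
            hits.append(doc_id)
    return hits


print(naive_and(docs, "i", "can"))  # [2, 3]
```

This scan reads every document for every query, which is the cost an inverted index is meant to avoid.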
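The keys to implementing Boolean retrieval are building an inverted index and computing the intersection and union of N sets. Here we start with simple algorithms for the intersection and union of two sets.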
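As a rough illustration of the inverted index mentioned above, the sketch below maps each term to a sorted list of the document IDs that contain it. The function name `build_inverted_index` is hypothetical, and treating the first element of each document as its ID is an assumption based on the sample data.

```python
from collections import defaultdict

# The three example documents; the first element is assumed to be the document ID.
docs = [
    ["1", "hello", "word", "i", "love", "dazhu"],
    ["2", "hi", "i", "can", "speak", "love"],
    ["3", "can", "i", "say", "hello", "make", "dazhu", "hi"],
]


def build_inverted_index(docs):
    """Map each term to a sorted list of the IDs of the documents containing it."""
    index = defaultdict(list)
    for doc in docs:
        doc_id = int(doc[0])
        # set() ensures a term repeated within one document is recorded once.
        for term in set(doc[1:]):
            index[term].append(doc_id)
    # Sorted posting lists are what a merge-style intersection or union relies on.
    return {term: sorted(ids) for term, ids in index.items()}


index = build_inverted_index(docs)
print(index["i"])    # [1, 2, 3]
print(index["can"])  # [2, 3]
```

With the index in place, answering "i" AND "can" reduces to intersecting the two posting lists [1, 2, 3] and [2, 3].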
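Boolean retrieval starts with the intersection or union of two sets. For two sorted lists of lengths x and y, both operations take O(x + y) time.
Reference code: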
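def arr_and(arr1, arr2):
    """Intersect two sorted lists with a two-pointer merge."""
    p1 = 0
    p2 = 0
    result = []
    # Stop as soon as either list is exhausted: no further matches are possible.
    while p1 != len(arr1) and p2 != len(arr2):
        if arr1[p1] == arr2[p2]:
            result.append(arr1[p1])
            p1 += 1
            p2 += 1
        else:
            # Advance the pointer that sits on the smaller value.
            if arr1[p1] < arr2[p2]:
                p1 += 1
            else:
                p2 += 1
    return result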
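
def arr_or(arr1, arr2):
    """Union two sorted lists with a two-pointer merge, emitting shared values once."""
    p1 = 0
    p2 = 0
    result = []
    while p1 != len(arr1) and p2 != len(arr2):
        if arr1[p1] == arr2[p2]:
            result.append(arr1[p1])
            p1 += 1
            p2 += 1
        else:
            # Emit the smaller value and advance its pointer.
            if arr1[p1] < arr2[p2]:
                result.append(arr1[p1])
                p1 += 1
            else:
                result.append(arr2[p2])
                p2 += 1
    # At most one list still has elements left; append that tail.
    if p1 < len(arr1):
        result += arr1[p1:]
    if p2 < len(arr2):
        result += arr2[p2:]
    return result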

# test
arr1 = [1, 3, 5, 7, 8, 12]
arr2 = [1, 4, 5, 6, 7, 8]
print(arr_and(arr1, arr2))  # [1, 5, 7, 8]
print(arr_or(arr1, arr2))   # [1, 3, 4, 5, 6, 7, 8, 12]
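The two-list merge extends to N lists by folding the intersection over them one at a time. The sketch below is one possible approach rather than a definitive implementation: `intersect_sorted` and `intersect_many` are hypothetical names, the block redefines a two-list intersection so that it runs on its own, and the posting lists are written out by hand from the three example documents.

```python
from functools import reduce


def intersect_sorted(a, b):
    """Two-pointer intersection of two sorted lists."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def intersect_many(lists):
    """Intersect any number of sorted lists, shortest first."""
    if not lists:
        return []
    # Starting from the shortest list keeps the intermediate results small.
    ordered = sorted(lists, key=len)
    return reduce(intersect_sorted, ordered)


# Posting lists written by hand from the three example documents.
postings = {"i": [1, 2, 3], "can": [2, 3], "hello": [1, 3]}
print(intersect_many([postings["i"], postings["can"]]))                     # [2, 3]
print(intersect_many([postings["i"], postings["can"], postings["hello"]]))  # [3]
```

Sorting by length first is a common heuristic: the running result can never grow beyond the shortest list, so later merges tend to finish sooner.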