倒排索引原理与python实现

Chaofa Yuan2021年6月10日大约 2 分钟

倒排索引的定义

如果说编程对于程序员来说是基本工具，那么在检索领域最重要的工具就是倒排索引。因此我们将花三篇文章的方式解释一下什么是倒排索引。

首先我们需要先理解一下什么是索引。快速找到数据。比如在数据库中，主键(unique)可以当做是一种索引。也就是说我们通过一个 id 找到一个文件。

倒排索引（英语：Inverted index），也常被称为反向索引、置入档案或反向档案，是一种索引方法，被用来存储在全文搜索下某个单词在一个文档或者一组文档中的存储位置的映射。它是文档检索系统中最常用的数据结构。

倒排索引是一个直译过来的词，因此很容易出现误解。

一个未经处理的数据库中，一般是以文档ID作为索引，以文档内容作为记录。而Inverted index 指的是将单词或记录作为索引，将文档ID作为记录，这样便可以方便地通过单词或记录查找到其所在的文档。

当然，倒排索引不止应用于全文搜索领域。我们可以知道，一个 key 可以对应什么内容。可以是 {key: "I love China"}，也可以是： {uid: {aid1, aid2, aid3}}。

那么在文本检索中，倒排索引就是变成了 term: key. 而在广告的检索中，就变成了 aid: uid。

我们可以简单实现以下倒排索引：

from collections import defaultdict


class InvertedIndex(object):
    def __init__(self):
        self.documents = {}
        self.inverted_index = defaultdict(set)

    def insert_doc(self, doc):
        key = len(self.documents)
        self.documents[key] = doc
        # 这里我们假设文本是干净的，只要用 space tokenizer 就行了
        words = doc.split(" ")
        for word in words:
            self.inverted_index[word].add(key)

    def search(self, query):
        res_doc_ids = set()

        query_words = query.split(" ")
        for word in query_words:
            tmp_doc_ids = self.inverted_index[word]
            res_doc_ids = res_doc_ids.union(tmp_doc_ids)

        # 这样就得到了所有的文档 id;  保存在 doc_ids 中
        return [self.documents[id_] for id_ in res_doc_ids]

def main():
    inverted_index = InvertedIndex()
    inverted_index.insert_doc("he is teacher")
    inverted_index.insert_doc("she is a dentist")
    inverted_index.insert_doc("I am not a programmer, just a typist")
    print(inverted_index.documents)
    print(inverted_index.inverted_index)
    # 这里我们只实现 ES 中的并集操作，也就是判断出现其中一个就行了
    docs = inverted_index.search("typist dentist")
    print(docs)


if __name__ == "__main__":
    main()

看完这段代码后，细心的小伙伴们会发现有两个可以优化的点，这里当做课后作业布置给大家？

(1) 根据 term 获取到了 document_ids 之后，set 之间要取交集，这个时间复杂度很高

(2) 第二个就是，我们的单词和Query实在是太多了。我们要怎么才能去快速定位到这个 term。