Zhihu Live Full-Text Search: Search Suggestions with Elasticsearch

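Back to work after the New Year holiday!
This post was supposed to cover the async API backend built on Sanic, but over the holiday I decided the project should have an autocomplete (suggest) search feature. Some readers of my WeChat official account also asked how I set up my ES environment, so today is one more groundwork post.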

Setting up the Elasticsearch environment

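I have a habit of keeping notes in Evernote, so here they are.
I always like to install the latest version of everything. For Chinese search I use [elasticsearch-analysis-ik](https://github.com/medcl/elasticsearch-analysis-ik). At the time ik only supported up to 5.1.1, while brew installed the latest 5.1.2, so the Chinese analysis plugin could not be used.
First install mvnvm, which provides the mvn command used later to build the plugin:

❯ brew install mvnvm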

As things stand now, installing Elasticsearch directly with brew works:

❯ brew install elasticsearch

If you also run into a mismatch between the Elasticsearch version and the version its plugin supports, you can pin both to the same version the way I did:


❯ cd /usr/local/Homebrew/Library/Taps/homebrew/homebrew-core  # my brew prefix is /usr/local
❯ git reset --hard c34106e3065234012a3f103aa4ad996df91f8d7a~1  # roll homebrew-core back to the last commit that still ships 5.1.1; [c34106e3065234012a3f103aa4ad996df91f8d7a](https://github.com/Homebrew/homebrew-core/commit/c34106e3065234012a3f103aa4ad996df91f8d7a) is the commit that changes the version, which you can find with git blame
❯ export HOMEBREW_NO_AUTO_UPDATE="1"  # temporarily stop brew from auto-updating during install
❯ brew install elasticsearch

Next, build elasticsearch-analysis-ik:


❯ git clone https://github.com/medcl/elasticsearch-analysis-ik  
❯ cd elasticsearch-analysis-ik  
❯ mvn package

Then copy the generated zip file into the ES plugins directory:


❯ cd /usr/local/Cellar/elasticsearch/5.1.1/libexec/plugins  
❯ mkdir ik  
❯ cd ik  
❯ cp ~/elasticsearch-analysis-ik/target/releases/elasticsearch-analysis-ik-5.1.1.zip .  
❯ unzip elasticsearch-analysis-ik-5.1.1.zip

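Some words have no reason to show up in suggestions. They are called stop words: they occur very frequently but add nothing to the meaning of a text. ES also ships with a dic file for them, but it contains too few words. I found stopwords-utf8.txt at [Mixed Chinese-English stop word list](http://www.smartpeer.net/2007/05/%E4%B8%AD%E8%8B%B1%E6%96%87%E6%B7%B7%E5%90%88%E5%81%9C%E7%94%A8%E8%AF%8D%E8%A1%A8-stop-word-list/) and extended it a little, which basically covers my needs. ik supports custom extension stop-word dictionaries, configured in config/IKAnalyzer.cfg.xml. I keep the existing dictionary file name rather than changing the configuration:

❯ cp ~/zhihu/stopwords-utf8.txt config/custom/ext_stopword.dic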

Finally, restart ES:

❯ brew services restart elasticsearch

If the restart fails, check /usr/local/var/log/elasticsearch.log to see what the log reports. This step matters a lot.

Suggest search

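Elasticsearch supports several types of suggester, and it also supports fuzzy search.
ES provides the following 4 suggester types:

Term. Suggests terms based on edit distance, that is, the minimum number of edit operations needed to turn one string into another; the smaller the edit distance, the closer the two strings. You have to specify the field to search, and it is the most basic suggester.
Phrase. An improvement on Term that uses co-occurrence and frequency to make better decisions about which tokens to pick.
Completion. Provides autocomplete / search-as-you-type. It is a navigation feature that guides users to relevant results while they type, which improves search precision. Unlike the first two, it requires a suggester field to be declared in the mapping and uses a data structure built for fast lookups, so it is heavily optimized for speed.
Context. Completion searches all documents in the index, but sometimes you want to apply some and/or filtering to the results. That is what the Context type is for.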
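To make the idea behind Completion a bit more concrete, here is a minimal sketch of prefix lookup with weights. It is only a toy stand-in: ES stores completion data in its own structure, while this sketch assumes a sorted list plus binary search. The class name PrefixIndex and the sample entries are made up for illustration.

```python
import bisect


class PrefixIndex:
    """Toy stand-in for a completion suggester: sorted keys + binary search."""

    def __init__(self, entries):
        # entries: iterable of (input_text, weight) pairs
        self._items = sorted((text.lower(), weight) for text, weight in entries)
        self._keys = [text for text, _ in self._items]

    def suggest(self, prefix, size=5):
        prefix = prefix.lower()
        # All keys that start with `prefix` form one contiguous run
        # in sorted order, beginning at this position.
        start = bisect.bisect_left(self._keys, prefix)
        matches = []
        for text, weight in self._items[start:]:
            if not text.startswith(prefix):
                break
            matches.append((text, weight))
        # Highest weight first, ties broken alphabetically.
        matches.sort(key=lambda item: (-item[1], item[0]))
        return matches[:size]


index = PrefixIndex([('python', 10), ('pytorch', 5), ('quant', 3), ('pypy', 7)])
print(index.suggest('py'))  # [('python', 10), ('pypy', 7), ('pytorch', 5)]
```

The lookup only touches the entries that share the prefix, which is the property that makes this kind of suggester fast enough to run on every keystroke.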
We use the Suggestions support that ships with elasticsearch_dsl. First, add a field to the model:


from elasticsearch_dsl import Completion, DocType, Text
from elasticsearch_dsl.analysis import CustomAnalyzer as _CustomAnalyzer


class CustomAnalyzer(_CustomAnalyzer):
    def get_analysis_definition(self):
        # Return an empty definition, like the built-in Analyzer;
        # otherwise saving a document raises an error
        return {}

ik_analyzer = CustomAnalyzer(
    'ik_analyzer',
    filter=['lowercase']
)


class Live(DocType):
    ...
    live_suggest = Completion(analyzer=ik_analyzer)
    speaker_name = Text(analyzer='ik_max_word')  # so that Lives can be searched by speaker name

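Compared with the earlier subject = Text(analyzer='ik_max_word'), this takes a bit more work. The analyzer argument of Completion does not accept a string, so you have to create an object with CustomAnalyzer. Because of the way elasticsearch_dsl is designed (I read through the source), I make get_analysis_definition return an empty dict, just like the built-in Analyzer. Otherwise, even though the mapping gets updated on save, get_analysis_definition keeps returning the custom definition and an error is raised.
Next, modify the crawler code so that it fills in live_suggest: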


from datetime import datetime

from elasticsearch_dsl.connections import connections

from models import Live

# Assumes models.py has already registered the default connection
es = connections.get_connection()
index = Live._doc_type.index
used_words = set()  # tokens already claimed by a higher-weight field


def analyze_tokens(text):
    if not text:
        return []
    # Tokenize with ik_max_word through the ES analyze API
    result = es.indices.analyze(index=index, analyzer='ik_max_word',
                                params={'filter': ['lowercase']}, body=text)

    # Single-character tokens are useless as suggestions, so drop them
    words = {r['token'] for r in result['tokens'] if len(r['token']) > 1}

    # Keep only tokens that no earlier (higher-weight) field produced
    new_words = words.difference(used_words)
    used_words.update(words)
    return new_words


def gen_suggests(topics, tags, outline, username, subject):
    global used_words
    used_words = set()  # start fresh for every Live
    suggests = []

    # Fields are listed from highest weight to lowest
    for item, weight in ((topics, 10), (subject, 5), (outline, 3),
                         (tags, 3), (username, 2)):
        item = analyze_tokens(item)
        if item:
            suggests.append({'input': list(item), 'weight': weight})
    return suggests


class Crawler:
    async def parse_link(self, response):
        ...
        live_dict['starts_at'] = datetime.fromtimestamp(
            live_dict['starts_at'])  # already there before this change
        live_dict['speaker_name'] = user.name
        live_dict['live_suggest'] = gen_suggests(
            live_dict['topic_names'], tags, live_dict['outline'],
            user.name, live_dict['subject'])
        Live.add(**live_dict)
        ...

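In plain terms, this code does 3 things:

analyze_tokens sends the topic_names, outline, subject, tags and username fields to the ES analyze API, tokenizes them with the ik_max_word analyzer, and returns only the tokens longer than 1 character (a single character is meaningless as a search suggestion).
gen_suggests assigns a weight to each kind of field: tokens from topics get the highest weight, 10, and the user name gets the lowest, 2. Note: I did not use the description field.
Because the fields carry different weights, a token produced by several fields keeps only the weight of the highest-weighted field. This works because the fields are processed in descending weight order and used_words filters out tokens that have already been seen.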
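The "highest weight wins" behavior can be tried without a running ES instance. The sketch below replaces the analyze call with a hypothetical whitespace tokenizer (fake_tokens), so the tokens differ from what ik_max_word would produce; only the de-duplication logic mirrors the crawler code, and the sample field values are made up.

```python
def fake_tokens(text):
    # Hypothetical stand-in for the ES analyze call: split on whitespace,
    # lowercase, and drop single-character tokens.
    return {word.lower() for word in text.split() if len(word) > 1}


def gen_suggests_sketch(weighted_fields):
    # weighted_fields must be ordered from highest weight to lowest.
    used_words = set()
    suggests = []
    for text, weight in weighted_fields:
        new_words = fake_tokens(text) - used_words
        used_words |= new_words
        if new_words:
            # sorted() only makes the output deterministic for this demo
            suggests.append({'input': sorted(new_words), 'weight': weight})
    return suggests


result = gen_suggests_sketch([
    ('Python', 10),                # topics
    ('Python engineer guide', 5),  # subject
    ('guide to Quant', 3),         # outline
])
for entry in result:
    print(entry)
```

In this run "python" appears only in the weight-10 entry and "guide" only in the weight-5 entry, even though both also occur in lower-weighted fields.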
Re-run the crawl script and take a look at the result:


In : from models import Live  
In : s = Live.search()  
In : s = s.suggest('live_suggestion', 'python', completion={'field': 'live_suggest', 'fuzzy': {'fuzziness': 2}, 'size': 10})  
  
In : suggestions = s.execute_suggest()  
In : for match in suggestions.live_suggestion[0].options:  
...:     source = match._source  
...:     print(source['subject'], source['speaker_name'], source['topic_names'], match._score)  
...:  
Python 工程师的入门和进阶 董伟明 Python 40.0  
聊聊 Python 和 Quant 用python的交易员 金融 20.0  
外汇交易，那些 MT4 背后的东西 用python的交易员 外汇交易 8.0  
金融外行如何入门量化交易 用python的交易员 金融 8.0  
聊聊期权交易 用python的交易员 金融 8.0

The results are close to what the Zhihu Live service account returns for "Python", but because I gave topic a large weight, my own Live is ranked first.
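For readers who want to see what is sent to ES, here is a rough sketch of the suggest body behind the s.suggest(...) call above. As far as I can tell, elasticsearch_dsl stores the text under a "text" key and merges in the keyword arguments; treat that layout as an assumption and compare it with s.to_dict() on your own version. The helper name completion_suggest_body is made up for this example.

```python
import json


def completion_suggest_body(name, text, field, fuzziness=None, size=10):
    # Build the body of one named completion suggestion.
    completion = {'field': field, 'size': size}
    if fuzziness is not None:
        completion['fuzzy'] = {'fuzziness': fuzziness}
    return {name: {'text': text, 'completion': completion}}


body = completion_suggest_body('live_suggestion', 'python', 'live_suggest',
                               fuzziness=2)
print(json.dumps(body, indent=2))
```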
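One last thing: ES also supports fuzzy search, for cases where the query accidentally contains a typo, or where you only half remember a term and want to see whether the right one can still be found. The fuzzy parameter above turns it on. fuzziness defaults to AUTO and can also be set to 0, 1 or 2. I used 2, which allows matches within an edit distance of 2: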


In : s = s.suggest('live_suggestion', 'pyhton', completion={'field': 'live_suggest', 'fuzzy': {'fuzziness': 2}, 'size': 10})  # edit distance 1
In : suggestions = s.execute_suggest()
In : suggestions.live_suggestion[0].options[0]._source['subject']
Out: 'Python 工程师的入门和进阶'

In : s = s.suggest('live_suggestion', 'pythni', completion={'field': 'live_suggest', 'fuzzy': {'fuzziness': 2}, 'size': 10})  # edit distance 2
In : suggestions = s.execute_suggest()
In : suggestions.live_suggestion[0].options[0]._source['subject']
Out: 'Python 工程师的入门和进阶'

In : s = s.suggest('live_suggestion', 'pyhtne', completion={'field': 'live_suggest', 'fuzzy': {'fuzziness': 2}, 'size': 10})  # edit distance 3

In : suggestions = s.execute_suggest()

In : suggestions.live_suggestion[0].options  # beyond the allowed edit distance, so nothing is found
Out: []
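The edit distances in the comments above can be checked by hand with a small function. This is a sketch of the optimal string alignment distance, which counts a swap of two adjacent characters as a single edit. That matches my understanding of how ES treats transpositions in fuzzy matching by default, but it is not ES's own implementation, and the name osa_distance is made up here.

```python
def osa_distance(a, b):
    """Optimal string alignment distance: insert, delete, substitute,
    or swap two adjacent characters, each costing 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


for typo in ('pyhton', 'pythni', 'pyhtne'):
    print(typo, osa_distance(typo, 'python'))
# pyhton 1
# pythni 2
# pyhtne 3
```

With fuzziness set to 2, the first two typos are within range and the third is not, which is consistent with the session above.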

PS: All the code for this post can be found in the code repository project for my WeChat official account articles.

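Copyright notice: This is an original post by 董伟明. Without the author's authorization, reproduction by any WeChat official account or on Juejin (juejin.im) is prohibited. Technical blogs may repost it under the Attribution-NonCommercial-NoDerivatives 4.0 International license.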