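Simple Full-Text Search with Elasticsearch

1. Overview

Full-text search queries documents and performs linguistic searches against them. It accepts single or multiple words or phrases and returns the documents that match the search condition.

Elasticsearch is a search engine based on Apache Lucene, a free and open-source information retrieval software library. It provides a distributed full-text search engine with an HTTP web interface and schema-free JSON documents.

This article examines the Elasticsearch REST API and demonstrates basic operations using only HTTP requests.

2. Setup

To install Elasticsearch on your machine, refer to the official setup guide.

The RESTful API runs on port 9200. Let's check that it is running properly with the following curl command: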

curl -XGET 'http://localhost:9200/'

If we see the following response, the instance is running properly:

{
  "name": "NaIlQWU",
  "cluster_name": "elasticsearch",
  "cluster_uuid": "enkBkWqqQrS0vp_NXmjQMQ",
  "version": {
    "number": "5.1.2",
    "build_hash": "c8c4c16",
    "build_date": "2017-01-11T20:18:39.146Z",
    "build_snapshot": false,
    "lucene_version": "6.3.0"
  },
  "tagline": "You Know, for Search"
}
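
The same check can be scripted. The following is a minimal Python sketch, assuming the response body has the shape shown above; `summarize_root_response` is a hypothetical helper name, and the sample is a trimmed copy of that response rather than a live call.

```python
import json


def summarize_root_response(body: str) -> str:
    """Summarize the JSON document returned by GET / on port 9200."""
    info = json.loads(body)
    version = info["version"]
    return (
        f'{info["cluster_name"]} is running version {version["number"]} '
        f'(Lucene {version["lucene_version"]})'
    )


# Trimmed copy of the response shown above (not fetched from a server).
sample_body = """
{
  "name": "NaIlQWU",
  "cluster_name": "elasticsearch",
  "version": {"number": "5.1.2", "lucene_version": "6.3.0"},
  "tagline": "You Know, for Search"
}
"""

print(summarize_root_response(sample_body))
```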

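3. Indexing Documents

Elasticsearch is document-oriented. It stores and indexes documents. Indexing creates or updates documents. After indexing, we can search, sort, and filter complete documents rather than rows of columnar data. This is a fundamentally different way of thinking about data, and it is one of the reasons Elasticsearch can perform complex full-text search.

Documents are represented as JSON objects. JSON serialization is supported by most programming languages and has become the standard format of the NoSQL movement. It is simple, concise, and easy to read.

We'll use the following random entries for our full-text search: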

{
  "title": "He went",
  "random_text": "He went such dare good fact. The small own seven saved man age."
}

{
  "title": "He oppose",
  "random_text": "He oppose at thrown desire of no. Announcing impression unaffected day his are unreserved indulgence."
}

{
  "title": "Repulsive questions",
  "random_text": "Repulsive questions contented him few extensive supported."
}

{
  "title": "Old education",
  "random_text": "Old education him departure any arranging one prevailed."
}

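Before we index a document, we need to decide where to store it. It's possible to have multiple indexes, each of which contains multiple types. These types hold multiple documents, and each document has multiple fields.

We'll store our documents using the following scheme:

text: the index name
article: the type name
id: the ID of this particular example text entry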
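
The scheme above maps directly onto the request path. As a small Python sketch of that mapping (the `document_url` helper and the `BASE_URL` constant are illustrative names, and the host assumes a default local instance):

```python
from urllib.parse import quote

BASE_URL = "http://localhost:9200"  # assumption: default local instance


def document_url(index: str, doc_type: str, doc_id) -> str:
    """Build the /<index>/<type>/<id> path used to address one document."""
    parts = [quote(str(part), safe="") for part in (index, doc_type, doc_id)]
    return f"{BASE_URL}/" + "/".join(parts)


print(document_url("text", "article", 1))
```

Percent-encoding each segment keeps an ID that contains spaces or slashes from changing the path structure.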

To add a document, we'll run the following command:

curl -XPUT 'localhost:9200/text/article/1?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "title": "He went",
  "random_text": 
    "He went such dare good fact. The small own seven saved man age."
}'

Here we use id=1; we can add the other entries with the same command and an incremented ID.
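
Repeating the command by hand for every entry gets tedious, so here is a hedged Python sketch that prepares one PUT request per document with IDs 1 through 4. The names `DOCUMENTS` and `build_index_requests` are illustrative, and the requests are only built, not sent, so the snippet runs without a server.

```python
import json
from urllib.request import Request

# Index "text", type "article", as in the scheme above.
BASE_URL = "http://localhost:9200/text/article"

DOCUMENTS = [
    {"title": "He went",
     "random_text": "He went such dare good fact. The small own seven saved man age."},
    {"title": "He oppose",
     "random_text": "He oppose at thrown desire of no. Announcing impression "
                    "unaffected day his are unreserved indulgence."},
    {"title": "Repulsive questions",
     "random_text": "Repulsive questions contented him few extensive supported."},
    {"title": "Old education",
     "random_text": "Old education him departure any arranging one prevailed."},
]


def build_index_requests(documents):
    """Build one PUT request per document, using IDs 1, 2, 3, ..."""
    requests = []
    for doc_id, document in enumerate(documents, start=1):
        requests.append(Request(
            url=f"{BASE_URL}/{doc_id}",
            data=json.dumps(document).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PUT",
        ))
    return requests


for request in build_index_requests(DOCUMENTS):
    print(request.get_method(), request.full_url)
```

Passing each request to `urllib.request.urlopen` would actually send it to a running instance.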

4. Retrieving Documents

After adding all the documents, we can check how many documents the cluster holds with the following command:

curl -XGET 'http://localhost:9200/_count?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "query": {
    "match_all": {}
  }
}'

We can also retrieve a document by its ID with the following command:

curl -XGET 'localhost:9200/text/article/1?pretty'

We should get the following answer from Elasticsearch:

{
  "_index": "text",
  "_type": "article",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "title": "He went",
    "random_text": 
      "He went such dare good fact. The small own seven saved man age."
  }
}

As we can see, this answer corresponds to the entry we added with ID 1.
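
When a program consumes this answer, the two fields that matter most are `found` and `_source`. A minimal Python sketch, assuming the response shape shown above (`extract_source` is a hypothetical helper, and the not-found sample is an assumed shape for a missing ID):

```python
def extract_source(response: dict):
    """Return the stored document, or None when the lookup found nothing."""
    if not response.get("found", False):
        return None
    return response["_source"]


# Parsed form of the response shown above.
found_response = {
    "_index": "text",
    "_type": "article",
    "_id": "1",
    "_version": 1,
    "found": True,
    "_source": {
        "title": "He went",
        "random_text": "He went such dare good fact. The small own seven saved man age.",
    },
}

# Assumed shape of the answer for an ID that does not exist.
missing_response = {"_index": "text", "_type": "article", "_id": "99", "found": False}

print(extract_source(found_response)["title"])
print(extract_source(missing_response))
```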

5. Querying Documents

Now let's run a full-text search with the following command:

curl -XGET 'localhost:9200/text/article/_search?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "query": {
    "match": {
      "random_text": "him departure"
    }
  }
}'

This gives us the following result:

{
  "took": 32,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1.4513469,
    "hits": [
      {
        "_index": "text",
        "_type": "article",
        "_id": "4",
        "_score": 1.4513469,
        "_source": {
          "title": "Old education",
          "random_text": "Old education him departure any arranging one prevailed."
        }
      },
      {
        "_index": "text",
        "_type": "article",
        "_id": "3",
        "_score": 0.28582606,
        "_source": {
          "title": "Repulsive questions",
          "random_text": "Repulsive questions contented him few extensive supported."
        }
      }
    ]
  }
}

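As we can see, we searched for "him departure" and received two results with different scores. The first result is the obvious one, because its text contains the exact search phrase; its score is 1.4513469.

The second result was retrieved because the target document contains the word "him".

By default, Elasticsearch sorts matching results by their relevance score, that is, by how well each document matches the query. Note that the second result scores lower than the first hit, indicating lower relevance.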
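
To make the ordering concrete, the sketch below pulls the ID and score out of each hit and checks that the list is in descending score order. It is an illustration in Python over a trimmed copy of the result above; `ranked_hits` and `is_sorted_by_score` are hypothetical helper names.

```python
def ranked_hits(response: dict):
    """Return (id, score) pairs in the order Elasticsearch returned them."""
    return [(hit["_id"], hit["_score"]) for hit in response["hits"]["hits"]]


def is_sorted_by_score(response: dict) -> bool:
    """Check that the hits are ordered from highest to lowest score."""
    scores = [score for _, score in ranked_hits(response)]
    return scores == sorted(scores, reverse=True)


# Trimmed copy of the search result shown above (the _source fields are omitted).
search_response = {
    "hits": {
        "total": 2,
        "max_score": 1.4513469,
        "hits": [
            {"_index": "text", "_type": "article", "_id": "4", "_score": 1.4513469},
            {"_index": "text", "_type": "article", "_id": "3", "_score": 0.28582606},
        ],
    }
}

for doc_id, score in ranked_hits(search_response):
    print(doc_id, score)
print(is_sorted_by_score(search_response))
```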

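6. Fuzzy Search

Fuzzy matching treats two words that are "fuzzily" similar as if they were the same word. First, we need to define what we mean by fuzziness.

Elasticsearch supports a maximum edit distance of 2, specified with the fuzziness parameter. The fuzziness parameter can also be set to AUTO, which results in the following maximum edit distances:

0 for strings of one or two characters
1 for strings of three, four, or five characters
2 for strings of more than five characters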
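
The AUTO rule listed above is easy to express in code. This Python sketch only mirrors the three thresholds from the list; `auto_fuzziness` is an illustrative name, not an Elasticsearch function.

```python
def auto_fuzziness(term: str) -> int:
    """Maximum edit distance allowed for a term under the AUTO rule above."""
    length = len(term)
    if length <= 2:
        return 0  # one or two characters: exact match only
    if length <= 5:
        return 1  # three, four, or five characters
    return 2      # more than five characters


for term in ("he", "him", "seven", "departure"):
    print(term, auto_fuzziness(term))
```

With AUTO, the short query term "him" would tolerate a single edit, while "departure" would tolerate two.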

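We may find that an edit distance of 2 returns results that don't appear to be related.

Using a maximum fuzziness of 1 may give better results and better performance. Distance here refers to the Levenshtein distance, a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits needed to make the two strings equal.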
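
For readers who want to see the metric at work, here is a plain Levenshtein distance in Python using the classic row-by-row dynamic programming approach. It is a sketch of the definition given above, not of Elasticsearch's internal matching, which may count edits differently (for example, by treating a swap of adjacent characters as one edit).

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


print(levenshtein("him", "his"))  # 1: substitute "m" with "s"
print(levenshtein("him", "he"))   # 2: substitute "i" with "e", delete "m"
```

Assuming terms are lowercased before matching, these distances suggest why a fuzziness of 2 can pull in documents that contain only "his" or "He" for the query term "him".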

Now, let's run a fuzzy search:

curl -XGET 'localhost:9200/text/article/_search?pretty' \
  -H 'Content-Type: application/json' -d '
{
  "query": {
    "match": {
      "random_text": {
        "query": "him departure",
        "fuzziness": "2"
      }
    }
  }
}'

Here is the result:

{
  "took": 88,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 1.5834423,
    "hits": [
      {
        "_index": "text",
        "_type": "article",
        "_id": "4",
        "_score": 1.4513469,
        "_source": {
          "title": "Old education",
          "random_text": "Old education him departure any arranging one prevailed."
        }
      },
      {
        "_index": "text",
        "_type": "article",
        "_id": "2",
        "_score": 0.41093433,
        "_source": {
          "title": "He oppose",
          "random_text": "He oppose at thrown desire of no. Announcing impression unaffected day his are unreserved indulgence."
        }
      },
      {
        "_index": "text",
        "_type": "article",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "title": "Repulsive questions",
          "random_text": "Repulsive questions contented him few extensive supported."
        }
      },
      {
        "_index": "text",
        "_type": "article",
        "_id": "1",
        "_score": 0.0,
        "_source": {
          "title": "He went",
          "random_text": "He went such dare good fact. The small own seven saved man age."
        }
      }
    ]
  }
}

As we can see, fuzziness gives us more results.

We need to use fuzziness carefully, because it tends to retrieve results that look unrelated.
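
One way to keep fuzziness under control is to build the request body in code and reject values outside the supported range. The Python sketch below produces the same body as the fuzzy query in this section; `fuzzy_match_body` is a hypothetical helper, and its default of 1 simply follows the suggestion made earlier in this section.

```python
import json


def fuzzy_match_body(field: str, text: str, fuzziness=1) -> dict:
    """Build a match query body with a fuzziness of 0, 1, 2, or "AUTO"."""
    if fuzziness != "AUTO" and fuzziness not in (0, 1, 2):
        raise ValueError("fuzziness must be 0, 1, 2, or 'AUTO'")
    return {
        "query": {
            "match": {
                field: {"query": text, "fuzziness": fuzziness}
            }
        }
    }


body = fuzzy_match_body("random_text", "him departure", fuzziness=2)
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to curl with -d, as in the commands above.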

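7. Conclusion

In this quick tutorial, we focused on indexing documents and querying Elasticsearch for full-text search directly through its REST API.

Of course, client APIs are available for multiple programming languages when we need them, but the REST API is still quite convenient and language-agnostic.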
