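Introduction
Elasticsearch is document-oriented, meaning it can store entire objects or documents, much as MongoDB does. It does more than store them, though: it indexes the content of each document so that the content can be searched. Documents can be indexed, searched, sorted, and filtered.
Documents are stored as JSON. For example: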
```json
{
    "email": "john@smith.com",
    "first_name": "John",
    "last_name": "Smith",
    "info": {
        "bio": "Eco-warrior and defender of the weak",
        "age": 25,
        "interests": [ "dolphins", "whales" ]
    },
    "join_date": "2014/05/01"
}
```
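Theory
In Elasticsearch, the act of storing data is called indexing. A document belongs to a type, and types live inside an index. (Types apply to the Elasticsearch versions used in this article; they were deprecated in 6.x and removed in later releases.)
The following comparison gives a rough picture: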
```
MySQL         -> Database -> Tables      -> Rows      -> Columns
MongoDB       -> Database -> Collections -> Documents -> Fields
Elasticsearch -> Indices  -> Types       -> Documents -> Fields
```
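An Elasticsearch cluster can contain multiple indices (databases), each index can contain multiple types (tables), each type holds multiple documents (rows), and each document has multiple fields (columns).
The different meanings of "index"
You may have noticed that the word "index" carries several meanings in Elasticsearch, so it is worth separating them:
- Index (noun): as described above, an index is like a database in a traditional relational database. It is the place where related documents are stored. The plural of index is indices or indexes.
- Index (verb): "to index a document" means to store a document in an index (noun) so that it can be retrieved and queried. This is much like the `INSERT` keyword in SQL, except that if the document already exists, the new document replaces the old one.
- Inverted index: a traditional database adds an index, such as a B-Tree index, to specific columns to speed up retrieval. Elasticsearch and Lucene use a data structure called an inverted index for the same purpose.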
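To make the inverted index idea concrete, here is a minimal sketch in plain PHP. It only illustrates the concept of mapping each term to the documents that contain it; the function name `buildInvertedIndex` and the tokenization rule are assumptions made for this example, and Lucene's actual data structure is far more elaborate.

```php
<?php
// Toy inverted index: maps each lowercase term to the IDs of the documents
// that contain it. Illustration only, not Lucene's real implementation.
function buildInvertedIndex(array $docs): array
{
    $index = [];
    foreach ($docs as $id => $text) {
        // Split on anything that is not a lowercase letter or digit.
        $terms = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
        // Record each term once per document.
        foreach (array_unique($terms) as $term) {
            $index[$term][] = $id;
        }
    }
    return $index;
}

$docs = [
    1 => 'I love to go rock climbing',
    2 => 'I like to collect rock albums',
    3 => 'I like to build cabinets',
];

$index = buildInvertedIndex($docs);
print_r($index['rock']); // documents 1 and 2
print_r($index['like']); // documents 2 and 3
```

Looking up a term is then a single array access instead of a scan over every document, which is the property a search engine relies on.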
Operations
Inserting documents
Inserting a single document:
```php
<?php
require_once './vendor/autoload.php';

// Build a client that talks to a local Elasticsearch node.
$client = Elasticsearch\ClientBuilder::create();
$client->setHosts(['127.0.0.1']);
$client = $client->build();

$params = [
    'index' => 'megacorp',
    'type' => 'employee',
    'id' => 1,
    'body' => [
        'first_name' => 'John',
        'last_name' => 'Smith',
        'age' => 25,
        'about' => 'I love to go rock climbing',
        'interests' => ['sports', 'music'],
    ]
];
$response = $client->index($params);
print_r($response);
?>
```
The output is:
```
Array
(
    [_index] => megacorp
    [_type] => employee
    [_id] => 1
    [_version] => 1
    [result] => created
    [_shards] => Array
        (
            [total] => 2
            [successful] => 1
            [failed] => 0
        )
    [created] => 1
)
```
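This means the document was inserted successfully.
The parameter array above includes an `id` field. If we leave it out, Elasticsearch generates an `_id` for the document automatically.
Auto-generated IDs are URL-safe, Base64-encoded UUID strings. They are 20 characters long in Elasticsearch 2.x and later (22 characters in earlier releases).
Bulk insert
Next, let's insert two documents in a single request: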
```php
<?php
require_once './vendor/autoload.php';
$client = Elasticsearch\ClientBuilder::create();
$client->setHosts(['127.0.0.1']);
$client = $client->build();

// The bulk body alternates between an action line and the document source.
$params = [];
$params['body'] = [
    [
        'index' => [
            '_index' => 'megacorp',
            '_type' => 'employee',
            '_id' => 2
        ],
    ],
    [
        "first_name" => "Jane",
        "last_name" => "Smith",
        "age" => 32,
        "about" => "I like to collect rock albums",
        "interests" => ["music"],
    ],
    [
        'index' => [
            '_index' => 'megacorp',
            '_type' => 'employee',
            '_id' => 3
        ],
    ],
    [
        "first_name" => "Douglas",
        "last_name" => "Fir",
        "age" => 35,
        "about" => "I like to build cabinets",
        "interests" => [ "forestry" ]
    ]
];
$responses = $client->bulk($params);
print_r($responses);
?>
```
Output like the following indicates that the insert succeeded:
```
Array
(
    [took] => 217
    [errors] =>
    [items] => Array
        (
            [0] => Array
                (
                    [index] => Array
                        (
                            [_index] => megacorp
                            [_type] => employee
                            [_id] => 2
                            [_version] => 1
                            [result] => created
                            [_shards] => Array
                                (
                                    [total] => 2
                                    [successful] => 1
                                    [failed] => 0
                                )
                            [created] => 1
                            [status] => 201
                        )
                )
            [1] => Array
                (
                    [index] => Array
                        (
                            [_index] => megacorp
                            [_type] => employee
                            [_id] => 3
                            [_version] => 1
                            [result] => created
                            [_shards] => Array
                                (
                                    [total] => 2
                                    [successful] => 1
                                    [failed] => 0
                                )
                            [created] => 1
                            [status] => 201
                        )
                )
        )
)
```
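Writing the alternating action and source entries by hand gets tedious as the number of documents grows. The following is a minimal sketch of a helper that builds the bulk body from a list of documents; `buildBulkBody` is a hypothetical name introduced here, not part of the elasticsearch-php client, and the employee data is abbreviated.

```php
<?php
// Hypothetical helper: turns [id => document] pairs into the alternating
// action/source list used in the bulk example above.
function buildBulkBody(string $index, string $type, array $docs): array
{
    $body = [];
    foreach ($docs as $id => $doc) {
        // Action line: which index, type, and ID the next entry belongs to.
        $body[] = ['index' => ['_index' => $index, '_type' => $type, '_id' => $id]];
        // Source line: the document itself.
        $body[] = $doc;
    }
    return $body;
}

$params = ['body' => buildBulkBody('megacorp', 'employee', [
    2 => ['first_name' => 'Jane', 'last_name' => 'Smith', 'age' => 32],
    3 => ['first_name' => 'Douglas', 'last_name' => 'Fir', 'age' => 35],
])];

// Two entries per document: one action line plus one source line.
echo count($params['body']), "\n";
// With a client built as in the examples above:
// $responses = $client->bulk($params);
```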
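Retrieving documents
Now let's retrieve documents. A request can return either all fields of a document or only selected fields. We cover each case in turn:
Retrieving a single document
Retrieving all fields
For example, to retrieve the document with `id=2`: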
```php
<?php
require_once './vendor/autoload.php';
$client = Elasticsearch\ClientBuilder::create();
$client->setHosts(['127.0.0.1']);
$client = $client->build();

$params = [
    'index' => 'megacorp',
    'type' => 'employee',
    'id' => 2,
];

print_r($client->get($params));
?>
```
Running it prints:
```
Array
(
    [_index] => megacorp
    [_type] => employee
    [_id] => 2
    [_version] => 1
    [found] => 1
    [_source] => Array
        (
            [first_name] => Jane
            [last_name] => Smith
            [age] => 32
            [about] => I like to collect rock albums
            [interests] => Array
                (
                    [0] => music
                )
        )
)
```
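The `_source` field holds exactly what we inserted. The `found` field is printed as 1 because `print_r` renders boolean `true` that way; it means the document was found. If we request a document that doesn't exist, Elasticsearch still answers with JSON, but `found` is `false`. Note that the PHP client reports that 404 response as a `Missing404Exception` by default.
Retrieving specific fields
Suppose we don't need all of these fields, only `first_name`, `last_name`, and `age`. We can request them like this: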
```php
<?php
require_once './vendor/autoload.php';
$client = Elasticsearch\ClientBuilder::create();
$client->setHosts(['127.0.0.1']);
$client = $client->build();

$params = [
    'index' => 'megacorp',
    'type' => 'employee',
    'id' => 2,
    '_source' => ['first_name', 'last_name', 'age']
];

print_r($client->get($params));
?>
```
The response contains only those fields:
```
Array
(
    [_index] => megacorp
    [_type] => employee
    [_id] => 2
    [_version] => 1
    [found] => 1
    [_source] => Array
        (
            [last_name] => Smith
            [first_name] => Jane
            [age] => 32
        )
)
```
Checking whether a document exists
If we don't need the document's content and only want to know whether it exists, we can do this:
```php
<?php
require_once './vendor/autoload.php';
$client = Elasticsearch\ClientBuilder::create();
$client->setHosts(['127.0.0.1']);
$client = $client->build();

$params = [
    'index' => 'megacorp',
    'type' => 'employee',
    'id' => 2,
];

var_dump($client->exists($params));
?>
```
This time the result is not an array but a boolean value:
Retrieving multiple documents
Retrieving all fields:
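A minimal sketch of fetching several documents in one request, assuming the multi-get (`mget`) endpoint is what this section has in mind. `buildMgetParams` is a hypothetical helper introduced for illustration; the client call itself is shown only as a comment because it needs a running cluster.

```php
<?php
// Hypothetical helper that builds the parameters for a multi-get request.
function buildMgetParams(string $index, string $type, array $ids): array
{
    return [
        'index' => $index,
        'type' => $type,
        // All requested documents share the index and type given above.
        'body' => ['ids' => array_values($ids)],
    ];
}

$params = buildMgetParams('megacorp', 'employee', [1, 2]);
print_r($params);
// With a client built as in the earlier examples:
// $response = $client->mget($params);
// Each entry of $response['docs'] carries its own `found` flag and `_source`.
```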
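Updating documents
Partial document updates
A partial update is suited to changing existing fields or adding new ones. The changes go under a `doc` key inside `body`.
For example, suppose we want to modify the employee whose `id` is 2.
First, let's look at that employee's current data: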
```
Array
(
    [_index] => megacorp
    [_type] => employee
    [_id] => 2
    [_version] => 1
    [found] => 1
    [_source] => Array
        (
            [first_name] => Jane
            [last_name] => Smith
            [age] => 32
            [about] => I like to collect rock albums
            [interests] => Array
                (
                    [0] => music
                )
        )
)
```
Now let's change the record: set the age to 33 and add a new field, `mobile_phone`, with the value `1234567890`.
```php
<?php
require_once './vendor/autoload.php';
$client = Elasticsearch\ClientBuilder::create();
$client->setHosts(['127.0.0.1']);
$client = $client->build();

$response = $client->update([
    'index' => 'megacorp',
    'type' => 'employee',
    'id' => 2,
    'body' => [
        'doc' => [
            'age' => 33,
            'mobile_phone' => '1234567890'
        ]
    ]
]);
print_r($response);
?>
```
The result is:
```
Array
(
    [_index] => megacorp
    [_type] => employee
    [_id] => 2
    [_version] => 2
    [result] => updated
    [_shards] => Array
        (
            [total] => 2
            [successful] => 1
            [failed] => 0
        )
)
```
Let's look at the employee's data again:
```
Array
(
    [_index] => megacorp
    [_type] => employee
    [_id] => 2
    [_version] => 2
    [found] => 1
    [_source] => Array
        (
            [first_name] => Jane
            [last_name] => Smith
            [age] => 33
            [about] => I like to collect rock albums
            [interests] => Array
                (
                    [0] => music
                )
            [mobile_phone] => 1234567890
        )
)
```
Scripted updates
Sometimes we need to increment a counter or append a value to an array. Scripted updates handle those cases.
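```php
// Assumes $client has been built as in the earlier examples.

// Append a new value to the `interests` array.
$params = [
    'index' => 'megacorp',
    'type' => 'employee',
    'id' => 2,
    'body' => [
        'script' => 'ctx._source.interests.add("sports")'
    ]
];
$response = $client->update($params);

// Increase the `age` field by 2.
$params = [
    'index' => 'megacorp',
    'type' => 'employee',
    'id' => 2,
    'body' => [
        'script' => 'ctx._source.age += 2'
    ]
];
$response = $client->update($params);
```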
Upsert
An upsert is an update-or-insert operation: it attempts the update, and if the document doesn't exist, it inserts the supplied default document instead.
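As a sketch of what such a request could look like, the update body can carry an `upsert` entry next to `doc`. The helper name `buildUpsertParams` and the sample employee values below are made up for illustration; the client call is left as a comment because it requires a running cluster.

```php
<?php
// Hypothetical helper: parameters for an update that falls back to inserting
// $default when the document does not exist yet.
function buildUpsertParams(string $index, string $type, $id, array $doc, array $default): array
{
    return [
        'index' => $index,
        'type' => $type,
        'id' => $id,
        'body' => [
            'doc' => $doc,        // applied when the document already exists
            'upsert' => $default, // inserted as-is when it does not
        ],
    ];
}

// Made-up example data: employee 4 is assumed not to exist yet.
$params = buildUpsertParams(
    'megacorp',
    'employee',
    4,
    ['age' => 41],
    ['first_name' => 'Ann', 'age' => 40]
);
print_r($params['body']);
// With a client built as in the earlier examples:
// $response = $client->update($params);
```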
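Updates and conflicts
To avoid losing data, the `update` API retrieves the document's current `_version` during the retrieve step and passes it to the `index` request during the reindex step. If another process has changed the document between retrieve and reindex, the `_version` will not match and the update fails.
In that case we simply need to retry the update. This can be done automatically by setting the `retry_on_conflict` parameter to the number of times `update` should retry before reporting an error; it defaults to 0.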
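The version check and the retry behaviour can be sketched with an in-memory stand-in. Everything below (`VersionedStore`, `ConflictException`, `updateWithRetry`) is hypothetical code written for this illustration and is not the Elasticsearch API; the `$retries` argument plays the role that `retry_on_conflict` plays in a real request.

```php
<?php
// In-memory stand-in for a versioned document store (not the Elasticsearch API).
class ConflictException extends Exception {}

class VersionedStore
{
    private $docs = [];

    // Unconditional write: bumps the version, starting at 1.
    public function put($id, array $source): void
    {
        $version = isset($this->docs[$id]) ? $this->docs[$id]['version'] + 1 : 1;
        $this->docs[$id] = ['version' => $version, 'source' => $source];
    }

    public function get($id): array
    {
        return $this->docs[$id];
    }

    // Writes only if the caller saw the latest version; otherwise reports a conflict.
    public function putIfVersion($id, array $source, int $expected): void
    {
        if ($this->docs[$id]['version'] !== $expected) {
            throw new ConflictException("version conflict on document $id");
        }
        $this->put($id, $source);
    }
}

// Retrieve, modify, reindex; retry up to $retries extra times on a conflict.
// Returns the number of retries that were needed.
function updateWithRetry(VersionedStore $store, $id, callable $modify, int $retries, ?callable $interfere = null): int
{
    for ($attempt = 0; ; $attempt++) {
        $current = $store->get($id);
        if ($interfere !== null) {
            $interfere($attempt); // simulates another process writing in between
        }
        try {
            $store->putIfVersion($id, $modify($current['source']), $current['version']);
            return $attempt;
        } catch (ConflictException $e) {
            if ($attempt >= $retries) {
                throw $e; // out of retries: surface the conflict
            }
        }
    }
}

$store = new VersionedStore();
$store->put(2, ['age' => 32]);

// A competing writer sneaks in during the first attempt only.
$interfere = function (int $attempt) use ($store) {
    if ($attempt === 0) {
        $store->put(2, ['age' => 40]);
    }
};

$addOneYear = function (array $source) {
    $source['age'] += 1;
    return $source;
};

$attempts = updateWithRetry($store, 2, $addOneYear, 1, $interfere);
echo $attempts, "\n";
print_r($store->get(2));
```

With `$retries` set to 0 the same scenario would end in a `ConflictException`, which corresponds to the default behaviour described above.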
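Summary
The `update` operation appears to let you modify part of a document in place, but it still follows a retrieve-then-change process. The steps are:
- Retrieve the JSON from the old document
- Modify it
- Delete the old document
- Index the new document
The only difference is that `update` needs just one client request, with no separate `get` and `index` requests.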
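The steps listed above can be mimicked on a plain PHP array. This is a sketch of the idea only: `partialUpdate` is a hypothetical function, it merges top-level fields only, and it says nothing about how Elasticsearch implements the cycle internally.

```php
<?php
// Simulates the retrieve-modify-reindex cycle on a plain array.
function partialUpdate(array $stored, array $doc): array
{
    // 1. Retrieve the JSON from the old document.
    $source = $stored['_source'];
    // 2. Modify it: existing fields are overwritten, new fields are added.
    $source = array_merge($source, $doc);
    // 3 and 4. The old document is discarded and the new one is indexed
    // under the next version number.
    return ['_version' => $stored['_version'] + 1, '_source' => $source];
}

$old = ['_version' => 1, '_source' => ['first_name' => 'Jane', 'age' => 32]];
$new = partialUpdate($old, ['age' => 33, 'mobile_phone' => '1234567890']);
print_r($new);
```

The returned entry is a whole new document with a higher version, which matches the `_version` change from 1 to 2 seen in the partial update example earlier.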
Deleting
For example, suppose we want to delete the employee whose `id` is 3.
First, let's look the employee up:
```
Array
(
    [_index] => megacorp
    [_type] => employee
    [_id] => 3
    [_version] => 1
    [found] => 1
    [_source] => Array
        (
            [first_name] => Douglas
            [last_name] => Fir
            [age] => 35
            [about] => I like to build cabinets
            [interests] => Array
                (
                    [0] => forestry
                )
        )
)
```
Now let's run the delete:
```php
print_r($client->delete([
    'index' => 'megacorp',
    'type' => 'employee',
    'id' => 3,
]));
```
The response is:
```
Array
(
    [found] => 1
    [_index] => megacorp
    [_type] => employee
    [_id] => 3
    [_version] => 2
    [result] => deleted
    [_shards] => Array
        (
            [total] => 2
            [successful] => 1
            [failed] => 0
        )
)
```
Note that `found` is 1 and that `_version` has gone from 1 to 2.
If we run the same delete again, the response looks like this:
```json
{"found":false,"_index":"megacorp","_type":"employee","_id":"3","_version":1,"result":"not_found","_shards":{"total":2,"successful":1,"failed":0}}
```
```json
{
    "found": false,
    "_index": "megacorp",
    "_type": "employee",
    "_id": "3",
    "_version": 2,
    "result": "not_found",
    "_shards": {
        "total": 2,
        "successful": 1,
        "failed": 0
    }
}
```
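Deleting a document that doesn't exist raised an error. We can see that `found` is `false`, yet `_version` still carries a value. This is part of Elasticsearch's internal bookkeeping, which ensures that operations are applied in the correct order across multiple nodes.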