Analyzer
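The purpose of an Analyzer is to perform the following processing:
- Split the text to be indexed and the query string into words, so that they can be used for full-text search.
- Absorb language-specific notation variations.
What are notation variations?
- Stemming: inflected word forms, e.g. making -> make, dogs -> dog, 食べる|食べた -> 食べ
- Normalization: upper/lower case, katakana/hiragana, full-width/half-width characters
- Stop words: "the", "of", etc.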
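The normalization and stop word ideas above can be sketched in plain Python. This is only an illustration of the concepts, not how Elasticsearch implements them; the function names and the two-word stop list are assumptions made for this example.

```python
import unicodedata

# Hypothetical stop word list, for illustration only.
STOP_WORDS = {"the", "of"}


def katakana_to_hiragana(text):
    """Map katakana letters (U+30A1..U+30F6) to their hiragana counterparts."""
    return "".join(
        chr(ord(ch) - 0x60) if "\u30a1" <= ch <= "\u30f6" else ch
        for ch in text
    )


def normalize(token):
    """Fold width, case, and kana differences into a single form."""
    # NFKC turns full-width ASCII into half-width ASCII and
    # half-width katakana into full-width katakana.
    token = unicodedata.normalize("NFKC", token)
    token = token.lower()
    return katakana_to_hiragana(token)


def drop_stop_words(tokens):
    """Remove tokens that carry little meaning for search."""
    return [t for t in tokens if t not in STOP_WORDS]


print(normalize("ＥＳ"))        # full-width Latin letters
print(normalize("ｶﾀｶﾅ"))      # half-width katakana
print(drop_stop_words(["the", "name", "of", "pen"]))
```

With this folding, "ＥＳ", "ES" and "es" all end up as the same term, which is what lets a query written in one form find text indexed in another.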
Defining an Analyzer
$ curl -XPUT 'http://localhost:9200/my_index' -H 'Content-Type: application/json' -d ' { "mappings": { "properties": { "blog_message": { "type": "text", "analyzer": "standard" } } } } '
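Custom Analyzer
An Analyzer is made up of the following three elements, and the processing of each one can be defined:
- char_filter: replaces parts of the input string with other strings (e.g. <br> -> \n)
- tokenizer: splits the text into words according to its rules
- filter: applies stemming and other conversions, according to its rules, to the words produced by the tokenizer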
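To make the order of the three stages concrete, here is a toy pipeline in Python. The regular expressions are rough stand-ins written for this sketch and do not reproduce the real html_strip char filter, standard tokenizer, or stop filter; the stop word list is also an assumption.

```python
import html
import re

# Hypothetical stop word list for this sketch.
STOP_WORDS = {"the", "a", "not", "is"}


def char_filter(text):
    """Stage 1: rewrite the raw string (<br> becomes a newline, other tags are dropped)."""
    text = re.sub(r"<br\s*/?>", "\n", text, flags=re.IGNORECASE)
    text = re.sub(r"<[^>]+>", "", text)
    return html.unescape(text)


def tokenizer(text):
    """Stage 2: split the filtered string into word tokens."""
    return re.findall(r"\w+", text)


def token_filters(tokens):
    """Stage 3: lowercase every token, then remove stop words."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOP_WORDS]


def analyze(text):
    """Run the stages in the fixed order char_filter -> tokenizer -> filter."""
    return token_filters(tokenizer(char_filter(text)))


print(analyze("<p>This is a <b>Pen</b><br>Not the pencil</p>"))
# → ['this', 'pen', 'pencil']
```

The order matters: the stop filter only sees "Not" as "not" because lowercasing runs before it, just as the order of entries in the "filter" array matters in an analyzer definition.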
$ curl -XPUT 'http://localhost:9200/my_index' -H 'Content-Type: application/json' -d ' { "settings": { "analysis": { "analyzer": { "my_analyzer": { "type": "custom", "char_filter": ["html_strip"], "tokenizer": "standard", "filter": ["lowercase", "stop"] } } } } } '
A custom filter can be defined as follows.
$ curl -XPUT 'http://localhost:9200/my_index' -H 'Content-Type: application/json' -d ' { "settings": { "analysis": { "analyzer": { "my_analyzer": { "type": "custom", "char_filter": ["html_strip"], "tokenizer": "standard", "filter": ["lowercase", "my_stop"] } }, "filter": { "my_stop": { "type": "stop", "stopwords": ["the", "a", "not", "is"] } } } } } '
Checking the analyzer
$ curl -XGET 'localhost:9200/my_index/_analyze?pretty' -H 'Content-Type: application/json' -d '{"analyzer": "my_analyzer", "text": "this is a pen"}'
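The same check can be scripted. The sketch below builds the _analyze request with Python's standard library; the helper names and the default base URL (taken from the curl examples) are assumptions for this illustration, and actually sending the request needs a running cluster.

```python
import json
import urllib.request


def build_analyze_request(index, analyzer, text, base_url="http://localhost:9200"):
    """Build (but do not send) a request to the index's _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_tokens(request):
    """Send the request and return only the token strings from the response."""
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [entry["token"] for entry in payload["tokens"]]


request = build_analyze_request("my_index", "my_analyzer", "this is a pen")
print(request.get_method(), request.full_url)
# fetch_tokens(request) would contact the cluster, so it is not called here.
```

Keeping request construction separate from sending makes it easy to inspect the exact body before it goes to the cluster.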
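Installing an Analyzer for Japanese
Use the kuromoji plugin. Restart Elasticsearch after installing it so that the plugin is loaded.
$ cd /usr/share/elasticsearch
$ sudo bin/elasticsearch-plugin install analysis-kuromoji
Plugin with an updated dictionary
It only supports somewhat older Elasticsearch versions, so pay attention to the version.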
$ sudo bin/elasticsearch-plugin install org.codelibs:elasticsearch-analysis-kuromoji-ipadic-neologd:7.2.0
https://github.com/codelibs/elasticsearch-analysis-kuromoji-ipadic-neologd
Using the kuromoji analyzer
Defining the mapping
$ curl -XPUT 'http://localhost:9200/my_index' -H 'Content-Type: application/json' -d ' { "mappings": { "properties": { "user_name": { "type": "text", "analyzer": "kuromoji" } } } } '
Checking the analyzer
$ curl -XGET 'localhost:9200/my_index/_analyze?pretty' -H 'Content-Type: application/json' -d '{"analyzer": "kuromoji", "text": "私は日々貯金したお金で、近々、関西国際空港からアメリカ テキサス州へ旅行に行きます。常々夢見ていたので、とても楽しみで、このコンピューターで予約しました。料金は九万円ほどでした"}'
{
  "tokens" : [
    { "token" : "私", "start_offset" : 0, "end_offset" : 1, "type" : "word", "position" : 0 },
    { "token" : "日々", "start_offset" : 2, "end_offset" : 4, "type" : "word", "position" : 2 },
    { "token" : "貯金", "start_offset" : 4, "end_offset" : 6, "type" : "word", "position" : 3 },
    { "token" : "お金", "start_offset" : 8, "end_offset" : 10, "type" : "word", "position" : 6 },
    { "token" : "近々", "start_offset" : 12, "end_offset" : 14, "type" : "word", "position" : 8 },
    { "token" : "関西", "start_offset" : 15, "end_offset" : 17, "type" : "word", "position" : 9 },
    { "token" : "関西国際空港", "start_offset" : 15, "end_offset" : 21, "type" : "word", "position" : 9, "positionLength" : 3 },
    { "token" : "国際", "start_offset" : 17, "end_offset" : 19, "type" : "word", "position" : 10 },
    { "token" : "空港", "start_offset" : 19, "end_offset" : 21, "type" : "word", "position" : 11 },
    { "token" : "アメリカ", "start_offset" : 23, "end_offset" : 27, "type" : "word", "position" : 13 },
    { "token" : "テキサス", "start_offset" : 28, "end_offset" : 32, "type" : "word", "position" : 14 },
    { "token" : "州", "start_offset" : 32, "end_offset" : 33, "type" : "word", "position" : 15 },
    { "token" : "旅行", "start_offset" : 34, "end_offset" : 36, "type" : "word", "position" : 17 },
    { "token" : "行く", "start_offset" : 37, "end_offset" : 39, "type" : "word", "position" : 19 },
    { "token" : "常々", "start_offset" : 42, "end_offset" : 44, "type" : "word", "position" : 21 },
    { "token" : "夢見る", "start_offset" : 44, "end_offset" : 46, "type" : "word", "position" : 22 },
    { "token" : "とても", "start_offset" : 52, "end_offset" : 55, "type" : "word", "position" : 27 },
    { "token" : "楽しみ", "start_offset" : 55, "end_offset" : 58, "type" : "word", "position" : 28 },
    { "token" : "コンピュータ", "start_offset" : 62, "end_offset" : 69, "type" : "word", "position" : 31 },
    { "token" : "予約", "start_offset" : 70, "end_offset" : 72, "type" : "word", "position" : 33 },
    { "token" : "料金", "start_offset" : 77, "end_offset" : 79, "type" : "word", "position" : 37 },
    { "token" : "九", "start_offset" : 80, "end_offset" : 81, "type" : "word", "position" : 39 },
    { "token" : "万", "start_offset" : 81, "end_offset" : 82, "type" : "word", "position" : 40 },
    { "token" : "円", "start_offset" : 82, "end_offset" : 83, "type" : "word", "position" : 41 }
  ]
}
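Customizing kuromoji
Defining a custom analyzer. If my_index already exists from the earlier steps, delete it first (curl -XDELETE 'http://localhost:9200/my_index'), because creating an index under an existing name fails.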
$ curl -XPUT 'http://localhost:9200/my_index' -H 'Content-Type: application/json' -d ' { "settings": { "analysis": { "analyzer": { "my_japanese": { "type": "custom", "tokenizer": "kuromoji_tokenizer", "char_filter": ["kuromoji_iteration_mark"], "filter": [ "kuromoji_baseform", "kuromoji_part_of_speech", "ja_stop", "kuromoji_number", "kuromoji_stemmer" ] } } } }, "mappings": { "properties": { "user_name": { "type": "text", "analyzer": "my_japanese" } } } } '
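Two entries in this definition are easy to picture with a small sketch: kuromoji_iteration_mark expands the repetition mark 々 before tokenizing, and kuromoji_number turns kanji numerals into Arabic digits. The Python below is a simplified illustration written for this note, not the plugin's implementation; it handles only single-character repetition with 々 and plain kanji numerals.

```python
DIGITS = {"〇": 0, "一": 1, "二": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10**4, "億": 10**8, "兆": 10**12}


def expand_iteration_marks(text):
    """Replace each 々 with the character that precedes it (simplified)."""
    out = []
    for ch in text:
        if ch == "々" and out:
            out.append(out[-1])
        else:
            out.append(ch)
    return "".join(out)


def kanji_to_int(text):
    """Convert a kanji numeral such as 九万 or 三千二百 to an int."""
    total = 0   # sum of finished 万/億/兆 groups
    group = 0   # value of the current group below 10,000
    digit = 0   # digit waiting for its unit
    for ch in text:
        if ch in DIGITS:
            digit = digit * 10 + DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare unit such as 十 means 1 x 10.
            group += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in LARGE_UNITS:
            total += ((group + digit) or 1) * LARGE_UNITS[ch]
            group = 0
            digit = 0
        else:
            raise ValueError(f"not a kanji numeral: {ch!r}")
    return total + group + digit


print(expand_iteration_marks("日々"))  # → 日日
print(kanji_to_int("九万"))            # → 90000
```

Normalizing 日々 to 日日 and 九万 to 90000 means that either spelling in a query resolves to the same indexed term.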
Checking the custom analyzer
$ curl -XGET 'localhost:9200/my_index/_analyze?pretty' -H 'Content-Type: application/json' -d '{"analyzer": "my_japanese", "text": "私は日々貯金したお金で、近々、関西国際空港からアメリカ テキサス州へ旅行に行きます。常々夢見ていたので、とても楽しみで、このコンピューターで予約しました。料金は九万円ほどでした"}'
{
  "tokens" : [
    { "token" : "私", "start_offset" : 0, "end_offset" : 1, "type" : "word", "position" : 0 },
    { "token" : "日日", "start_offset" : 2, "end_offset" : 4, "type" : "word", "position" : 2 },
    { "token" : "貯金", "start_offset" : 4, "end_offset" : 6, "type" : "word", "position" : 3 },
    { "token" : "お金", "start_offset" : 8, "end_offset" : 10, "type" : "word", "position" : 6 },
    { "token" : "近近", "start_offset" : 12, "end_offset" : 14, "type" : "word", "position" : 8 },
    { "token" : "関西", "start_offset" : 15, "end_offset" : 17, "type" : "word", "position" : 9 },
    { "token" : "関西国際空港", "start_offset" : 15, "end_offset" : 21, "type" : "word", "position" : 9, "positionLength" : 3 },
    { "token" : "国際", "start_offset" : 17, "end_offset" : 19, "type" : "word", "position" : 10 },
    { "token" : "空港", "start_offset" : 19, "end_offset" : 21, "type" : "word", "position" : 11 },
    { "token" : "アメリカ", "start_offset" : 23, "end_offset" : 27, "type" : "word", "position" : 13 },
    { "token" : "テキサス", "start_offset" : 28, "end_offset" : 32, "type" : "word", "position" : 14 },
    { "token" : "州", "start_offset" : 32, "end_offset" : 33, "type" : "word", "position" : 15 },
    { "token" : "旅行", "start_offset" : 34, "end_offset" : 36, "type" : "word", "position" : 17 },
    { "token" : "行く", "start_offset" : 37, "end_offset" : 39, "type" : "word", "position" : 19 },
    { "token" : "常常", "start_offset" : 42, "end_offset" : 44, "type" : "word", "position" : 21 },
    { "token" : "夢見る", "start_offset" : 44, "end_offset" : 46, "type" : "word", "position" : 22 },
    { "token" : "とても", "start_offset" : 52, "end_offset" : 55, "type" : "word", "position" : 27 },
    { "token" : "楽しみ", "start_offset" : 55, "end_offset" : 58, "type" : "word", "position" : 28 },
    { "token" : "コンピュータ", "start_offset" : 62, "end_offset" : 69, "type" : "word", "position" : 31 },
    { "token" : "予約", "start_offset" : 70, "end_offset" : 72, "type" : "word", "position" : 33 },
    { "token" : "料金", "start_offset" : 77, "end_offset" : 79, "type" : "word", "position" : 37 },
    { "token" : "90000", "start_offset" : 80, "end_offset" : 82, "type" : "word", "position" : 38 },
    { "token" : "円", "start_offset" : 82, "end_offset" : 83, "type" : "word", "position" : 39 }
  ]
}
Other
Convert tokens to their katakana readings.
filter: [ "kuromoji_readingform" ]
Example index using the kuromoji analyzer
$ curl -XPUT 'http://localhost:9200/my_index' -H 'Content-Type: application/json' -d ' { "mappings": { "properties": { "user_name": { "type": "text", "analyzer": "kuromoji" }, "date": { "type": "date" }, "message": { "type": "text", "analyzer": "kuromoji" } } } } '
Test data
$ curl -XPOST 'http://localhost:9200/my_index/_doc/' -H 'Content-Type: application/json' -d ' { "user_name": "山本 太郎", "date": "2017-10-15T15:09:45", "message": "秋は京都で紅葉狩りをします。" } '
$ curl -XPOST 'http://localhost:9200/my_index/_doc/' -H 'Content-Type: application/json' -d ' { "user_name": "佐藤 洋子", "date": "2017-10-15T15:09:45", "message": "冬は北海道でスキーをします。" } '
Checking the analyzer
$ curl -XPOST 'http://localhost:9200/my_index/_analyze' -H 'Content-Type: application/json' -d ' { "analyzer": "kuromoji", "text": "秋は京都で紅葉狩りをします。" } '
{"tokens":[{"token":"秋","start_offset":0,"end_offset":1,"type":"word","position":0},{"token":"京都","start_offset":2,"end_offset":4,"type":"word","position":2},{"token":"紅葉狩り","start_offset":5,"end_offset":9,"type":"word","position":4}]}
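The response is easier to read once the offsets are mapped back onto the source text. The sketch below does that for the response shown above; the helper names are made up for this example, and slicing by offset assumes every character is in the Basic Multilingual Plane, as is the case here.

```python
import json

source_text = "秋は京都で紅葉狩りをします。"
response_text = (
    '{"tokens":['
    '{"token":"秋","start_offset":0,"end_offset":1,"type":"word","position":0},'
    '{"token":"京都","start_offset":2,"end_offset":4,"type":"word","position":2},'
    '{"token":"紅葉狩り","start_offset":5,"end_offset":9,"type":"word","position":4}'
    ']}'
)


def token_table(raw_response, text):
    """Return (token, original substring, position) for every token."""
    tokens = json.loads(raw_response)["tokens"]
    return [
        (t["token"], text[t["start_offset"]:t["end_offset"]], t["position"])
        for t in tokens
    ]


def skipped_positions(raw_response):
    """List positions that no surviving token occupies."""
    positions = {t["position"] for t in json.loads(raw_response)["tokens"]}
    return [p for p in range(max(positions) + 1) if p not in positions]


for row in token_table(response_text, source_text):
    print(row)
print(skipped_positions(response_text))  # → [1, 3]
```

The skipped positions 1 and 3 line up with the particles は and で, which the tokenizer produced but the analyzer's filters removed before indexing.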
Example search query
$ curl -XGET 'http://localhost:9200/my_index/_search' -H 'Content-Type: application/json' -d ' { "query": { "match": { "message": "京都の秋のおすすめ" } } } '
{"took":186,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":1,"relation":"eq"},"max_score":1.3862942,"hits":[{"_index":"my_index","_type":"_doc","_id":"Uic8CnMBjvhLNEhN50wI","_score":1.3862942,"_source": { "user_name": "山本 太郎", "date": "2017-10-15T15:09:45", "message": "秋は京都で紅葉狩りをします。" } }]}}
The message is indexed under the term "紅葉狩り", so a search for "紅葉" does not match it.
$ curl -XGET 'http://localhost:9200/my_index/_search' -H 'Content-Type: application/json' -d ' { "query": { "match": { "message": "紅葉の写真" } } } '
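Why one query hits and the other does not comes down to term overlap. The sketch below models a match query with its default OR operator as a set intersection. The indexed terms come from the _analyze output earlier in these notes; the query term lists are assumptions about how kuromoji would split each query and were not checked against a cluster.

```python
def matches(index_terms, query_terms):
    """A document matches when at least one query term is an indexed term."""
    return bool(set(index_terms) & set(query_terms))


# Terms indexed for "秋は京都で紅葉狩りをします。" (from the _analyze output).
indexed = ["秋", "京都", "紅葉狩り"]

# Assumed analysis results for the two queries.
query_kyoto = ["京都", "秋", "おすすめ"]   # "京都の秋のおすすめ"
query_momiji = ["紅葉", "写真"]            # "紅葉の写真"

print(matches(indexed, query_kyoto))   # → True
print(matches(indexed, query_momiji))  # → False
```

Terms are compared whole, so "紅葉" is not treated as a prefix of "紅葉狩り"; the two are simply different terms.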
Links