谷粒商城 (Guli Mall) Study Notes, Day 5: Elasticsearch Full-Text Search
2020-11-09 16:11:00 【有一个小阿飞】
I. Basic Concepts
Note: types are deprecated in ES 7 and removed entirely in ES 8.
1. Index
Equivalent to a database in MySQL.
2. Type (no longer supported from ES 8 onward)
Equivalent to a table in MySQL.
3. Document (JSON format)
Equivalent to a row of data in MySQL.
Forward index vs. inverted index:
Forward index:
When a user searches for the keyword "华为手机" (Huawei phone) on the home page, if only a forward index exists, every document in the index library has to be scanned to find the ones that contain the keyword; they are then scored by the ranking model and presented in ranked order. Since the number of documents a search engine collects from the internet is astronomical, such an index structure cannot return ranked results in real time.
Inverted index:
An inverted index consists of a list of all distinct terms that appear in the documents; for each term, there is a list of the documents that contain it. We first split each document into individual terms, build a sorted list of all distinct terms, and then record which documents each term appears in.
The resulting inverted index looks like this:
Term | Document IDs |
---|---|
term 1 | doc 1, doc 2 |
term 2 | doc 2, doc 7 |
With an inverted index we go from the term directly to the documents that contain it.
II. Installing ES with Docker
1. Pull the images
docker pull elasticsearch:7.4.2
docker pull kibana:7.4.2 ## for visualizing and querying the data
2. Install
## First create the local mount directories
mkdir -p /opt/software/mydata/elasticsearch/config
mkdir -p /opt/software/mydata/elasticsearch/data
mkdir -p /opt/software/mydata/elasticsearch/plugins
## Grant read/write permissions
chmod -R 777 /opt/software/mydata/elasticsearch/
## Config file: allow remote access (note the space after the colon; without it ES cannot parse the setting)
echo "http.host: 0.0.0.0" >> /opt/software/mydata/elasticsearch/config/elasticsearch.yml
## Run ES
docker run --name elasticsearch -p 9200:9200 -p 9300:9300 \
-e "discovery.type=single-node" \
-e ES_JAVA_OPTS="-Xms64m -Xmx128m" \
-v /opt/software/mydata/elasticsearch/config/elasticsearch.yml:/usr/share/elasticsearch/config/elasticsearch.yml \
-v /opt/software/mydata/elasticsearch/data:/usr/share/elasticsearch/data \
-v /opt/software/mydata/elasticsearch/plugins:/usr/share/elasticsearch/plugins \
-d elasticsearch:7.4.2
## Check the logs
docker logs elasticsearch
## Run Kibana
docker run --name kibana -p 5601:5601 \
-e ELASTICSEARCH_HOSTS=http://182.92.191.49:9200 \
-d kibana:7.4.2
Verify:
Visit http://your-server-IP:9200 for ES
and http://your-server-IP:5601 for Kibana.
On an Alibaba Cloud server, open ports 9200 and 5601 in the security-group rules first.
Port 9200 is the HTTP port, used for external communication.
Port 9300 is the TCP transport port; Java clients (jars) and the nodes of an ES cluster communicate with each other over 9300.
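A quick liveness check from the server's own shell (a minimal sketch; assumes the container is up and the ports are reachable):
## Should return a small JSON document with the node name and version
curl http://127.0.0.1:9200
## Cluster health; a single-node setup typically reports green or yellow
curl http://127.0.0.1:9200/_cat/health?v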
III. Basic Retrieval
3.1 _cat
Command | Purpose |
---|---|
GET /_cat/nodes | list all nodes |
GET /_cat/health | check ES health |
GET /_cat/master | show the master node |
GET /_cat/indices | list all indices (like `show databases`) |
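The same endpoints can also be called with curl; appending ?v adds column headers (illustrative sketch):
curl http://IP:9200/_cat/nodes?v
curl http://IP:9200/_cat/indices?v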
3.2 Creating data: PUT/POST
## Both PUT and POST can create and update documents.
## POST may be sent with or without an ID: the first request creates the document, the second updates it (without an ID, every POST creates a new document with an auto-generated ID).
## PUT must carry an ID, otherwise it returns an error; likewise the first request creates and the second updates. [Because the ID is mandatory, PUT is generally used for updates.]
## Syntax
POST http://IP:9200/index/type/id    ## the id is optional
{
    "name": "lee"
}
PUT http://IP:9200/index/type/id     ## the id is required
{
    "name": "lee"
}
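For reference, the same requests sent with curl (customer/external are just example index and type names):
curl -X PUT  "http://IP:9200/customer/external/1" -H 'Content-Type: application/json' -d '{"name":"lee"}'
## Without an ID, POST lets ES generate one
curl -X POST "http://IP:9200/customer/external"   -H 'Content-Type: application/json' -d '{"name":"lee"}'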
3.3 Reading data: GET & optimistic locking
## Syntax
GET http://IP:9200/customer/external/1
Result:
{
    "_index": "customer",      ## index
    "_type": "external",       ## type
    "_id": "1",                ## document ID
    "_version": 5,             ## version number, incremented on every create/update
    "_seq_no": 5,              ## concurrency-control field, incremented on every update; used for optimistic locking
    "_primary_term": 1,        ## also used for optimistic locking; changes when the primary shard is reallocated, e.g. after a restart
    "found": true,             ## true means the document was found
    "_source": {
        "name": "ren"
    }
}
## Optimistic locking: prevents lost updates under concurrent modification
## Pass the current values as preconditions when updating (note that ES 7.x uses the if_ prefix for these parameters):
PUT http://IP:9200/customer/external/1?if_seq_no=5&if_primary_term=1
{
    "name":"ren"
}
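A sketch of what the precondition buys you (assumes the document is currently at _seq_no=5, _primary_term=1):
## Client A updates first and succeeds; the document's _seq_no is incremented
PUT http://IP:9200/customer/external/1?if_seq_no=5&if_primary_term=1
{ "name": "client-A" }
## Client B then sends the same, now stale, precondition and is rejected with 409 version_conflict_engine_exception,
## so it must re-read the document and retry with the new _seq_no
PUT http://IP:9200/customer/external/1?if_seq_no=5&if_primary_term=1
{ "name": "client-B" }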
3.4 Updating data: PUT/POST
## Full update (overwrite): the original document is effectively deleted and replaced by the new body
POST http://IP:9200/customer/external/1
{
    "name":"lee",
    "gender":"male",
    "age":20
}
PUT http://IP:9200/customer/external/1
{
    "name":"ren",
    "gender":"male"
}
## Partial update: only the listed fields are modified
POST http://IP:9200/customer/external/1/_update
{
    "doc":{
        "name":"lee_ren"
    }
}
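One practical difference worth noting: _update compares the incoming fields with the stored document and skips the write when nothing changes, whereas a full PUT/POST always writes a new version. A small sketch:
## Sending the same partial update twice: the second call returns "result": "noop" and _version stays the same
POST http://IP:9200/customer/external/1/_update
{ "doc": { "name": "lee_ren" } }
## Sending the same full document twice with PUT: _version is incremented each time
PUT http://IP:9200/customer/external/1
{ "name": "lee_ren" }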
3.5 Deleting documents and indices: DELETE
## Delete a document
DELETE http://IP:9200/customer/external/1
## Delete an index (a type by itself cannot be deleted)
DELETE http://IP:9200/customer
3.6 Bulk operations: _bulk
Note: the actions are executed independently; there is no transaction.
## Syntax: action is one of create / index / update / delete
{ action: { metadata }}
{ request body }
{ action: { metadata }}
{ request body }
## Bulk create
POST http://IP:9200/_bulk
{"create":{"_index":"haoke","_type":"user","_id":"aaa"}}
{"id":1001,"name":"name1","age":20,"sex":"男"}
{"create":{"_index":"haoke","_type":"user","_id":"bbb"}}
{"id":1002,"name":"name2","age":21,"sex":"女"}
{"create":{"_index":"haoke","_type":"user","_id":"ccc"}}
{"id":1003,"name":"name3","age":22,"sex":"女"}
or:
{"index":{"_index":"haoke","_type":"user","_id":"aaa"}}
{"id":1001,"name":"name1","age":20,"sex":"男"}
{"index":{"_index":"haoke","_type":"user","_id":"bbb"}}
{"id":1002,"name":"name2","age":21,"sex":"女"}
{"index":{"_index":"haoke","_type":"user","_id":"ccc"}}
{"id":1003,"name":"name3","age":22,"sex":"女"}
## Bulk delete
POST http://IP:9200/_bulk
{"delete":{"_index":"haoke","_type":"user","_id":"aaa"}}
{"delete":{"_index":"haoke","_type":"user","_id":"bbb"}}
{"delete":{"_index":"haoke","_type":"user","_id":"ccc"}}
## Bulk update (full overwrite)
POST http://IP:9200/_bulk
Request body:
{"index":{"_index":"haoke","_type":"user","_id":"aaa"}}
{"id":1001,"name":"name111","age":20,"sex":"男"}
{"index":{"_index":"haoke","_type":"user","_id":"bbb"}}
{"id":1002,"name":"name222","age":21,"sex":"女"}
{"index":{"_index":"haoke","_type":"user","_id":"ccc"}}
{"id":1003,"name":"name333","age":22,"sex":"女"}
## Bulk partial update
POST http://IP:9200/_bulk
Request body:
{"update":{"_index":"haoke","_type":"user","_id":"aaa"}}
{"doc":{"name":"name111a"}}
{"update":{"_index":"haoke","_type":"user","_id":"bbb"}}
{"doc":{"name":"name222b"}}
{"update":{"_index":"haoke","_type":"user","_id":"ccc"}}
{"doc":{"name":"name333c"}}
Sample bulk test data for Elasticsearch:
https://download.elastic.co/demos/kibana/gettingstarted/8.x/accounts.zip
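A hedged example of loading that sample file with curl (assumes accounts.json has been unzipped into the current directory; the _bulk body is newline-delimited JSON, hence the application/x-ndjson content type and --data-binary):
curl -H "Content-Type: application/x-ndjson" -X POST "http://IP:9200/bank/account/_bulk?pretty" --data-binary "@accounts.json"
## Verify the import
curl "http://IP:9200/_cat/indices?v"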
IV. Advanced Search
First search style:
## All request parameters are placed in the URI
GET http://IP:9200/bank/_search?q=*&sort=account_number:asc
[q=* matches everything; sort orders by account_number ascending; from defaults to 0 and size to 10]
Second search style: Query DSL
## All request parameters are placed in the request body
GET http://IP:9200/bank/_search
{
"query": {
"match_all": {}
},
"sort": {
"account_number": "asc"
}
}
[query is the query clause, match_all matches everything, sort orders the results; from defaults to 0 and size to 10]
4.1 Query DSL basics
GET http://IP:9200/bank/_search
{
"query": {
"match_all": {}
},
"sort": {
"account_number": "asc"
},
"from": 0,
"size": 5,
"_source": ["account_number","balance"]
}
[query is the query clause, match_all matches everything, sort orders the results, from is the offset to start at]
[size is the number of records to return, _source lists the fields to include in the response]
4.2 Query DSL: match
## match
## For non-string fields it is an exact match, equivalent to ==
## For string (text) fields it behaves like LIKE: the input is analyzed and each resulting term is matched exactly against the index
GET http://IP:9200/bank/_search
{
"query": {
"match": {
"address": "mill"
}
}
}
## match on the .keyword sub-field is an exact match: address must be exactly "990 Mill Road", case included
## Note the difference between match + keyword and match_phrase:
## match + keyword requires the whole value to be identical (==), while match_phrase only requires the phrase to be contained (LIKE) and is case-insensitive
GET http://IP:9200/bank/_search
{
"query": {
"match": {
"address.keyword": "990 Mill Road"
}
}
}
4.3 Query DSL: match_phrase
## match_phrase: phrase match
## match splits "mill road" into the two terms "mill" and "road" and matches them independently,
## while match_phrase matches "mill road" as a complete phrase (case and extra spaces in between do not affect the result)
GET http://IP:9200/bank/_search
{
"query": {
"match_phrase": {
"address": "mill road"
}
}
}
4.4 Query DSL: multi_match
## multi_match: match across multiple fields
## Equivalent to username == "aa" OR nickname == "aa"
GET http://IP:9200/bank/_search
{
"query": {
"multi_match": {
"query":"mill movico",
"fields":["address","city"]
}
}
}
## ES analyzes "mill movico" into the terms mill and movico, so the result is roughly address == mill OR address == movico OR city == mill OR city == movico
4.5 Query DSL: bool
## bool: compound query
## bool combines several sub-queries; together they behave like AND
## must clauses are required to match, must_not clauses are required not to match, should clauses are optional, but matching them ranks the document higher
## must and should contribute to the relevance score; must_not does not affect the score (neither does filter)
GET /bank/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"address":"mill"
}
},
{
"match": {
"gender": "m"
}
}
],
"must_not": [
{
"match": {
"age": "28"
}
}
],
"should": [
{
"match": {
"city": "blackgum"
}
}
]
}
}
}
4.6 Query DSL: filter
## filter: result filtering
## filter is similar to must but does not affect the relevance score
## must and should influence _score; filter does not
## range is similar to BETWEEN
GET /bank/_search
{
"query": {
"bool": {
"filter": {
"range": {
"age": {
"gte": 10,
"lte": 30
}
}
}
}
}
}
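filter is usually combined with scoring clauses; only the must part contributes to _score (a small sketch using the same bank fields):
GET /bank/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "address": "mill" } }
      ],
      "filter": [
        { "range": { "balance": { "gte": 10000, "lte": 50000 } } }
      ]
    }
  }
}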
4.7 Query DSL: term
## term is similar to match, but term is meant for exact lookups on non-text fields (no analysis is applied to the query value)
## match is used for analyzed searches on text fields
## Convention: use term for non-text fields and match for text fields
## For example, an exact-value field such as city: "hebei" can be looked up with term,
## while a text field such as address: "hebei handan" should be searched with match
GET /bank/_search
{
"query": {
"term": {
"address": "mill"
}
}
}
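For exact matches on string data the usual pattern is term on the keyword sub-field (assuming the default dynamic mapping, which creates address.keyword alongside the text field):
GET /bank/_search
{
  "query": {
    "term": {
      "address.keyword": "990 Mill Road"
    }
  }
}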
V. Aggregations
## Aggregation: grouping data and extracting statistics from it
## Similar to GROUP BY / COUNT / AVG in SQL
## Aggregation types: terms (bucketing), avg, min, max, sum
## Note: aggregate text/string fields on their keyword sub-field
## Syntax:
"aggs": {
    "<aggregation name>": {
        "<aggregation type>": {
            "field": "<field to aggregate on>",
            "size": <number of buckets to return>
        },
        ....sub-aggregations
    },
    ...other aggregations at the same level
}
## Examples
## 1. For everyone whose address contains "mill", show the age distribution, the average age, and the average balance
## [size: 0 returns zero search hits, i.e. only the aggregation results are shown]
## sibling aggregations (all at the same level)
GET /bank/_search
{
"query": {
"match": {
"address": "mill"
}
},
"aggs": {
"ageAggs": {
"terms": {
"field": "age",
"size": 10
}
},
"ageAvgAggs":{
"avg": {
"field": "age"
}
},
"balanceAvgAggs":{
"avg": {
"field": "balance"
}
}
},
"size":0
}
## 2. Bucket by age and compute the average balance for each age bucket
## sub-aggregation
GET /bank/_search
{
"query": {
"match_all": {}
},
"aggs": {
"ageAggs": {
"terms": {
"field": "age",
"size": 10
},
"aggs": {
"balanceAvg": {
"avg": {
"field": "balance"
}
}
}
}
},
"size": 0
}
## 3. Show the full age distribution, and for each age bucket the average balance as well as the average balance of males (M) and females (F)
## [gender is a text field and cannot be aggregated directly; aggregate on gender.keyword instead]
GET /bank/_search
{
"query": {
"match_all": {}
},
"aggs": {
"ageAggs": {
"terms": {
"field": "age",
"size": 10
},
"aggs": {
"ageBalanceAvg": {
"avg": {
"field": "balance"
}
},
"genderAggs": {
"terms": {
"field": "gender.keyword"
},
"aggs": {
"genderBalanceAvg": {
"avg": {
"field": "balance"
}
}
}
}
}
}
},
"size": 0
}
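For orientation, the aggregation part of the response to example 2 comes back roughly in this shape (the numbers below are illustrative, not real results):
"aggregations": {
    "ageAggs": {
        "buckets": [
            { "key": 31, "doc_count": 61, "balanceAvg": { "value": 28312.9 } },
            { "key": 39, "doc_count": 60, "balanceAvg": { "value": 25269.6 } }
        ]
    }
}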
VI. Mapping
## Mapping: similar to defining column data types in SQL
## 1. View a mapping
GET /bank/_mapping
## 2. Create a mapping
PUT /myindex
{
"mappings": {
"properties": {
"name":{"type": "text"},
"age":{"type": "integer"},
"email":{"type": "keyword"},
"birthday":{"type": "date"}
}
}
}
## 3. Add a new field to an existing mapping
PUT /myindex/_mapping
{
"properties":{
"employee_id":{"type":"long"}
}
}
## 4. Modify: an existing field in an existing mapping cannot be changed; you can only add new fields.
## If you really need to change an existing mapping, create a new index with the new mapping and then migrate the data.
## Example: the old bank index still carries a type, so we create a new index newbank whose mapping drops the type
## Create the new mapping
PUT /newbank
{
"mappings": {
"properties": {
"account_number": {
"type": "long"
},
"address": {
"type": "text"
},
"age": {
"type": "integer"
},
"balance": {
"type": "long"
},
"city": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"email": {
"type": "keyword"
},
"employer": {
"type": "text"
},
"firstname": {
"type": "text"
},
"gender": {
"type": "keyword"
},
"lastname": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"state": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
## Data migration: move the data from the old bank index into newbank
## Reindex
POST _reindex
{
"source": {
"index": "bank",
"type": "account"
},
"dest": {
"index": "newbank"
}
}
## Note: the request above migrates data from an older-style index that carries a type; for data without a type the syntax is:
POST _reindex
{
"source": {
"index": "bank"
},
"dest": {
"index": "newbank"
}
}
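After the reindex finishes you can sanity-check the new index; the migrated documents in newbank should now carry the default type _doc instead of account:
GET /newbank/_search
## or with curl:
curl "http://IP:9200/newbank/_search?size=1&pretty"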
VII. Analysis (Tokenization)
7.1 Tokenizers
A tokenizer receives a stream of characters, splits it into individual tokens (usually individual words), and emits a stream of tokens.
## The standard analyzer
POST _analyze
{
"analyzer": "standard",
"text": "Besides traveling and volunteering, something else that’s great for when you don’t give a hoot is to read."
}
7.2 The IK Analyzer
Download: GitHub medcl/elasticsearch-analysis-ik
## Install the IK analyzer: download it into the plugins directory mounted for elasticsearch
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
## Unzip into an ik directory
unzip elasticsearch-analysis-ik-7.4.2.zip -d ik
## Fix permissions
chmod -R 777 ik/
## Check that the plugin is picked up
docker exec -it elasticsearch /bin/bash
cd bin
elasticsearch-plugin list
## Restart ES
docker restart elasticsearch
Test:
## Chinese word segmentation (coarse-grained)
POST _analyze
{
"analyzer": "ik_smart",
"text": "我是中国河北人"
}
## Chinese word segmentation (fine-grained)
POST _analyze
{
"analyzer": "ik_max_word",
"text": "我是中国河北人"
}
7.3 Custom Extension Dictionary
7.3.1 Installing Nginx with Docker
## Pull nginx
docker pull nginx:1.10
## Start a temporary container just to copy its config out
docker run -p 80:80 --name nginx -d nginx:1.10
## Copy the nginx config from the container to the host
cd /opt/software/mydata/
mkdir nginx
cd nginx
mkdir html
mkdir logs
docker container cp nginx:/etc/nginx .
## Rename the copied directory to conf (it is mounted as /etc/nginx below) and remove the temporary container
mv nginx conf
docker stop nginx
docker rm nginx
## Run the real nginx with the host directories mounted
docker run -p 80:80 --name nginx \
-v /opt/software/mydata/nginx/html:/usr/share/nginx/html \
-v /opt/software/mydata/nginx/logs:/var/log/nginx \
-v /opt/software/mydata/nginx/conf:/etc/nginx \
-d nginx:1.10
Test:
## Create index.html under /opt/software/mydata/nginx/html
## Content:
<h1>welcome to nginx</h1>
## Visit in the browser:
http://IP:80/
7.3.2 Extending the Dictionary
Create the word list
## Under /opt/software/mydata/nginx/html create the file es/fenci.txt
## Content:
乔碧萝
徐静蕾
## Visit in the browser:
http://IP/es/fenci.txt
Point ES at the word list
## Edit IKAnalyzer.cfg.xml in the IK plugin's config directory under the ES plugins directory
<!-- Configure the remote extension dictionary here -->
<entry key="remote_ext_dict">http://your-IP/es/fenci.txt</entry>
## Restart ES
docker restart elasticsearch
## Test
POST _analyze
{
"analyzer": "ik_smart",
"text": "我不认识乔碧萝殿下"
}
Copyright notice
This article was written by [有一个小阿飞]; please include a link to the original when reposting. Thank you.
https://my.oschina.net/ngc7293/blog/4710092