Implementation methods of BERT
2022-07-28 07:06:00 【ithicker】
bert-as-service
BERT is an NLP pre-training technique. This article does not cover the principles of BERT; it focuses on how to quickly get started with a BERT model and generate word vectors for downstream tasks.
Google has released the TensorFlow version of the pre-trained models and code, which can be used to generate word vectors, but there is a simpler way: call the ready-made library bert-as-service.
Using bert-as-service to generate word vectors
bert-as-service is an open-source BERT service from Tencent AI Lab. It lets users call a BERT model as a service without worrying about BERT's implementation details. bert-as-service consists of a client and a server; the service can be called from Python code or accessed over HTTP.
Installation
Install with pip; the client and server can be installed on different machines:
pip install bert-serving-server   # server
pip install bert-serving-client   # client (independent of the server)
The server requires Python >= 3.5 and TensorFlow >= 1.10.
The client can run on Python 2 or Python 3.
Download a pre-trained model
Google provides pre-trained models of several types and sizes to suit different NLP tasks:
- BERT-Base, Chinese: Simplified and Traditional Chinese, 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Base, Multilingual Cased: 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Base, Uncased: English, case-insensitive (all text lowercased), 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Base, Cased: English, case-sensitive, 12-layer, 768-hidden, 12-heads, 110M parameters
You can also use the Harbin Institute of Technology (HIT) version of BERT, which gives better results on Chinese:
- Chinese-BERT-wwm
The above are a few commonly used pre-trained models; more can be found in the official BERT repository (see References).
After unzipping the downloaded .zip file, you will find the following files:
- TensorFlow model files (bert_model.ckpt.*): the pre-trained model weights, stored as three checkpoint files
- Vocabulary file (vocab.txt): the mapping between tokens and ids
- Configuration file (bert_config.json): the model's hyperparameters
Start the BERT service
Start the service with the bert-serving-start command:
bert-serving-start -model_dir /tmp/english_L-12_H-768_A-12/ -num_worker=2
Here, -model_dir is the path to the pre-trained model, and -num_worker is the number of workers, i.e. the maximum number of concurrent requests the server can handle at the same time.
If startup succeeds, the server prints a message indicating that it is ready and listening.
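Once the server reports that it is ready, the connection can be verified from Python. A minimal sketch, assuming bert-serving-client is installed, the server runs on the same machine with default ports, and that the server_status property behaves as documented in recent bert-as-service releases:

from bert_serving.client import BertClient

bc = BertClient()          # connects to localhost on the default ports 5555/5556
print(bc.server_status)    # dictionary of the server's settings (e.g. max_seq_len, num_worker)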
Get sentence vectors from the client
The vector representation of a corpus can be obtained with just a few lines of code:
from bert_serving.client import BertClient
bc = BertClient()
doc_vecs = bc.encode(['First do it', 'then do it right', 'then do it better'])
doc_vecs is a numpy.ndarray in which each row is a fixed-size sentence vector; with the default pooling strategy its length equals the model's hidden size (768 for BERT-Base). The maximum input length can be set with the max_seq_len parameter when starting the service; longer sentences are truncated from the right.
Another feature of BERT is that it can encode a pair of sentences: separate the two sentences with |||, for example:
bc.encode(['First do it ||| then do it right'])
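As a quick sanity check, the shapes of the returned arrays can be inspected. A minimal sketch, assuming the service started above (default pooling strategy) is running locally with a BERT-Base model (768-dimensional hidden size):

from bert_serving.client import BertClient

bc = BertClient()
# three independent sentences -> one pooled vector per sentence
doc_vecs = bc.encode(['First do it', 'then do it right', 'then do it better'])
print(doc_vecs.shape)   # (3, 768) with the default pooling strategy
# a sentence pair separated by ||| is still encoded into a single pooled vector
pair_vec = bc.encode(['First do it ||| then do it right'])
print(pair_vec.shape)   # (1, 768)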
Get word vectors
When starting the service, set the pooling_strategy parameter to NONE:
bert-serving-start -pooling_strategy NONE -model_dir /tmp/english_L-12_H-768_A-12/
In this case, the service returns the embedding matrix of every token in the corpus:
bc = BertClient()
vec = bc.encode(['hey you', 'whats up?'])
vec          # shape [2, 25, 768]: 2 sentences, default max_seq_len 25, hidden size 768
vec[0]       # shape [25, 768], token embeddings for `hey you`
vec[0][0]    # shape [768], embedding of `[CLS]`
vec[0][1]    # shape [768], embedding of `hey`
vec[0][2]    # shape [768], embedding of `you`
vec[0][3]    # shape [768], embedding of `[SEP]`
vec[0][4]    # shape [768], embedding of a padding symbol
vec[0][25]   # IndexError: index out of range (valid indices are 0..24)
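To see exactly which token each row corresponds to, the client can also return the tokenized text alongside the vectors. A minimal sketch, assuming the show_tokens option of encode behaves as in recent bert-as-service releases and the service is still running with -pooling_strategy NONE:

from bert_serving.client import BertClient

bc = BertClient()
vecs, tokens = bc.encode(['hey you', 'whats up?'], show_tokens=True)
# tokens is a list of token lists aligned with the second axis of vecs,
# e.g. tokens[0] -> ['[CLS]', 'hey', 'you', '[SEP]']
for tok, v in zip(tokens[0], vecs[0]):
    print(tok, v[:3])   # each token with the first three dimensions of its vector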
Calling the BERT service remotely
You can call a BERT service running on another machine:
# on another CPU machine
from bert_serving.client import BertClient
bc = BertClient(ip='xx.xx.xx.xx') # ip address of the GPU machine
bc.encode(['First do it', 'then do it right', 'then do it better'])
In this case, only the client package needs to be installed on the client machine: pip install -U bert-serving-client
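If the server was started on non-default ports, they must be passed to the client explicitly. A minimal sketch, assuming the server's -port and -port_out options were left at their defaults of 5555 and 5556 (the IP address is a placeholder):

from bert_serving.client import BertClient

# port / port_out must match the -port / -port_out options given to bert-serving-start
bc = BertClient(ip='xx.xx.xx.xx', port=5555, port_out=5556)
print(bc.encode(['First do it']).shape)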
Other notes
Configuration requirements
The BERT model has high memory requirements. If startup hangs at "load graph from model_dir", set num_worker to 1 or increase the machine's memory.
Should Chinese text be segmented in advance?
When computing vectors for Chinese text, you can feed whole sentences in directly without any prior word segmentation, because Chinese BERT processes the corpus character by character; for Chinese corpora the output is therefore per-character vectors.
For example, given the input:
bc.encode(['hey you', 'whats up?', '你好么？', '我 还可以'])
the actual input to the BERT model is:
tokens: [CLS] hey you [SEP]
input_ids: 101 13153 8357 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
input_mask: 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
tokens: [CLS] what ##s up ? [SEP]
input_ids: 101 9100 8118 8644 136 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
input_mask: 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
tokens: [CLS] 你 好 么 ？ [SEP]
input_ids: 101 872 1962 720 8043 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
input_mask: 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
tokens: [CLS] 我 还 可 以 [SEP]
input_ids: 101 2769 6820 1377 809 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
input_mask: 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
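This tokenization can be reproduced offline with the FullTokenizer from the google-research/bert repository (tokenization.py). A minimal sketch, assuming that file is importable and that the vocab.txt path below (a placeholder for the Chinese model's directory) is correct:

from tokenization import FullTokenizer  # tokenization.py from google-research/bert

tokenizer = FullTokenizer(vocab_file='/tmp/chinese_L-12_H-768_A-12/vocab.txt')
tokens = tokenizer.tokenize('我 还可以')
print(tokens)                                   # e.g. ['我', '还', '可', '以']
print(tokenizer.convert_tokens_to_ids(tokens))  # the input_ids, without [CLS]/[SEP]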
What does the ## prefix (as in ##something) in English tokens mean?
When a word is not in the vocabulary, it is split into sub-word pieces using a greedy longest-match-first (WordPiece) algorithm, and every piece except the first is prefixed with ##. For example:
input = "unaffable"
tokenizer_output = ["un", "##aff", "##able"]
Reference material
https://github.com/google-research/bert
https://github.com/hanxiao/bert-as-service
Several implementation approaches: https://zhuanlan.zhihu.com/p/112235454
Example: https://spaces.ac.cn/archives/6736
keras_bert and bert4keras: https://www.cnblogs.com/dogecheng/p/11617940.html