Implementation methods of BERT
2022-07-28 07:06:00 【ithicker】
bert-as-service
BERT is an NLP pre-training technique. This article does not cover the principles of BERT; it focuses on how to quickly get started with a BERT model and generate word vectors for downstream tasks.
Google has released the TensorFlow version of the pre-trained models and code, which can be used to generate word vectors, but there is a simpler way: call the ready-made library bert-as-service.
Using bert-as-service to generate word vectors
bert-as-service is an open-source BERT service from Tencent AI Lab. It lets users call a BERT model as a service without worrying about BERT's implementation details. bert-as-service consists of a client and a server; the service can be called from Python code or accessed over HTTP.
Installation
Install with pip; the client and server can be installed on different machines:
pip install bert-serving-server   # server
pip install bert-serving-client   # client (independent of the server)
The server requires Python >= 3.5 and TensorFlow >= 1.10.
The client can run on Python 2 or Python 3.
Download a pre-trained model
Google provides pre-trained models of several types and sizes to suit different NLP tasks:
- BERT-Base, Chinese: Simplified and Traditional Chinese, 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Base, Multilingual Cased: 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Base, Uncased: English, case-insensitive (all text lowercased), 12-layer, 768-hidden, 12-heads, 110M parameters
- BERT-Base, Cased: English, case-sensitive, 12-layer, 768-hidden, 12-heads, 110M parameters
You can also use the Harbin Institute of Technology (HIT) version of BERT, which gives better results on Chinese:
- Chinese-BERT-wwm
The above are a few commonly used pre-trained models; more can be found in the official BERT repository (see References).
After unzipping the downloaded .zip file, you will find the following files:
- TensorFlow model files (bert_model.ckpt.*): the pre-trained model weights, stored as three checkpoint files
- Vocabulary file (vocab.txt): the mapping between tokens and ids
- Configuration file (bert_config.json): the model's hyperparameters
Start the BERT service
Start the service with the bert-serving-start command:
bert-serving-start -model_dir /tmp/english_L-12_H-768_A-12/ -num_worker=2
Here, -model_dir is the path to the pre-trained model, and -num_worker is the number of workers, i.e. the maximum number of concurrent requests the server can handle at the same time.
If startup succeeds, the server prints a message indicating that it is ready and listening.
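Once the server reports that it is ready, the connection can be verified from Python. A minimal sketch, assuming bert-serving-client is installed, the server runs on the same machine with default ports, and that the server_status property behaves as documented in recent bert-as-service releases:

from bert_serving.client import BertClient

bc = BertClient()          # connects to localhost on the default ports 5555/5556
print(bc.server_status)    # dictionary of the server's settings (e.g. max_seq_len, num_worker)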
Get sentence vectors from the client
The vector representation of a corpus can be obtained with just a few lines of code:
from bert_serving.client import BertClient
bc = BertClient()
doc_vecs = bc.encode(['First do it', 'then do it right', 'then do it better'])
doc_vecs is a numpy.ndarray in which each row is a fixed-size sentence vector; with the default pooling strategy its length equals the model's hidden size (768 for BERT-Base). The maximum input length can be set with the max_seq_len parameter when starting the service; longer sentences are truncated from the right.
Another feature of BERT is that it can encode a pair of sentences: separate the two sentences with |||, for example:
bc.encode(['First do it ||| then do it right'])
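As a quick sanity check, the shapes of the returned arrays can be inspected. A minimal sketch, assuming the service started above (default pooling strategy) is running locally with a BERT-Base model (768-dimensional hidden size):

from bert_serving.client import BertClient

bc = BertClient()
# three independent sentences -> one pooled vector per sentence
doc_vecs = bc.encode(['First do it', 'then do it right', 'then do it better'])
print(doc_vecs.shape)   # (3, 768) with the default pooling strategy
# a sentence pair separated by ||| is still encoded into a single pooled vector
pair_vec = bc.encode(['First do it ||| then do it right'])
print(pair_vec.shape)   # (1, 768)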
Get word vectors
When starting the service, set the pooling_strategy parameter to NONE:
bert-serving-start -pooling_strategy NONE -model_dir /tmp/english_L-12_H-768_A-12/
In this case, the service returns the embedding matrix of every token in the corpus:
bc = BertClient()
vec = bc.encode(['hey you', 'whats up?'])
vec          # shape [2, 25, 768]: 2 sentences, default max_seq_len 25, hidden size 768
vec[0]       # shape [25, 768], token embeddings for `hey you`
vec[0][0]    # shape [768], embedding of `[CLS]`
vec[0][1]    # shape [768], embedding of `hey`
vec[0][2]    # shape [768], embedding of `you`
vec[0][3]    # shape [768], embedding of `[SEP]`
vec[0][4]    # shape [768], embedding of a padding symbol
vec[0][25]   # IndexError: index out of range (valid indices are 0..24)
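To see exactly which token each row corresponds to, the client can also return the tokenized text alongside the vectors. A minimal sketch, assuming the show_tokens option of encode behaves as in recent bert-as-service releases and the service is still running with -pooling_strategy NONE:

from bert_serving.client import BertClient

bc = BertClient()
vecs, tokens = bc.encode(['hey you', 'whats up?'], show_tokens=True)
# tokens is a list of token lists aligned with the second axis of vecs,
# e.g. tokens[0] -> ['[CLS]', 'hey', 'you', '[SEP]']
for tok, v in zip(tokens[0], vecs[0]):
    print(tok, v[:3])   # each token with the first three dimensions of its vector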
Calling the BERT service remotely
You can call a BERT service running on another machine:
# on another CPU machine
from bert_serving.client import BertClient
bc = BertClient(ip='xx.xx.xx.xx') # ip address of the GPU machine
bc.encode(['First do it', 'then do it right', 'then do it better'])
In this case, only the client package needs to be installed on the client machine: pip install -U bert-serving-client
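If the server was started on non-default ports, they must be passed to the client explicitly. A minimal sketch, assuming the server's -port and -port_out options were left at their defaults of 5555 and 5556 (the IP address is a placeholder):

from bert_serving.client import BertClient

# port / port_out must match the -port / -port_out options given to bert-serving-start
bc = BertClient(ip='xx.xx.xx.xx', port=5555, port_out=5556)
print(bc.encode(['First do it']).shape)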
Other notes
Configuration requirements
The BERT model has high memory requirements. If startup hangs at "load graph from model_dir", set num_worker to 1 or increase the machine's memory.
Should Chinese text be segmented in advance?
When computing vectors for Chinese text, you can feed whole sentences in directly without any prior word segmentation, because Chinese BERT processes the corpus character by character; for Chinese corpora the output is therefore per-character vectors.
For example, given the input:
bc.encode(['hey you', 'whats up?', '你好么？', '我 还可以'])
the actual input to the BERT model is:
tokens: [CLS] hey you [SEP]
input_ids: 101 13153 8357 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
input_mask: 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
tokens: [CLS] what ##s up ? [SEP]
input_ids: 101 9100 8118 8644 136 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
input_mask: 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
tokens: [CLS] 你 好 么 ？ [SEP]
input_ids: 101 872 1962 720 8043 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
input_mask: 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
tokens: [CLS] 我 还 可 以 [SEP]
input_ids: 101 2769 6820 1377 809 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
input_mask: 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
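This tokenization can be reproduced offline with the FullTokenizer from the google-research/bert repository (tokenization.py). A minimal sketch, assuming that file is importable and that the vocab.txt path below (a placeholder for the Chinese model's directory) is correct:

from tokenization import FullTokenizer  # tokenization.py from google-research/bert

tokenizer = FullTokenizer(vocab_file='/tmp/chinese_L-12_H-768_A-12/vocab.txt')
tokens = tokenizer.tokenize('我 还可以')
print(tokens)                                   # e.g. ['我', '还', '可', '以']
print(tokenizer.convert_tokens_to_ids(tokens))  # the input_ids, without [CLS]/[SEP]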
What does the ## prefix (as in ##something) in English tokens mean?
When a word is not in the vocabulary, it is split into sub-word pieces using a greedy longest-match-first (WordPiece) algorithm, and every piece except the first is prefixed with ##. For example:
input = "unaffable"
tokenizer_output = ["un", "##aff", "##able"]
Reference material
https://github.com/google-research/bert
https://github.com/hanxiao/bert-as-service
Several implementation approaches: https://zhuanlan.zhihu.com/p/112235454
Example: https://spaces.ac.cn/archives/6736
keras_bert and bert4keras: https://www.cnblogs.com/dogecheng/p/11617940.html