[NLP] bert4vec: a sentence-vector generation tool based on pretrained models
2022-07-06 09:46:00 · Demeanor 78
bert4vec, a sentence-vector generation tool based on pretrained models:
https://github.com/zejunwang1/bert4vec
Requirements
transformers>=4.6.0,<5.0.0
torch>=1.6.0
numpy
huggingface-hub
faiss (optional)
Installation
Option 1: install from PyPI
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple/ bert4vec
Option 2: install from source
git clone https://github.com/zejunwang1/bert4vec
cd bert4vec/
python setup.py sdist
pip install dist/bert4vec-1.0.0.tar.gz
Features
Currently, the supported pretrained sentence-vector models are SimBERT, RoFormer-Sim, and paraphrase-multilingual-MiniLM-L12-v2. SimBERT and RoFormer-Sim are Chinese sentence-representation models open-sourced by Su Jianlin, while paraphrase-multilingual-MiniLM-L12-v2 is a multilingual pretrained model released by sentence-transformers; all of them support Chinese sentence-vector generation.
Sentence vector generation
from bert4vec import Bert4Vec
# Four modes are supported: simbert-base / roformer-sim-base / roformer-sim-small / paraphrase-multilingual-minilm
model = Bert4Vec(mode='simbert-base')
sentences = ["What kind of girls do boys who play basketball like?",
             "It snowed in Xi'an? Is it very cold?",
             "How should I behave when meeting my girlfriend's parents for the first time?",
             "How is the story of the little tadpole looking for his mother?",
             "Recommend a red car to me",
             "I like Beijing"]
vecs = model.encode(sentences, convert_to_numpy=True, normalize_to_unit=False)
# Default parameters of encode: batch_size=64, convert_to_numpy=False, normalize_to_unit=False
print(vecs.shape)
print(vecs)
This prints the shape and the values of the resulting sentence-vector matrix.
To compute dense vectors for English sentences, set mode='paraphrase-multilingual-minilm'.
Similarity calculation
from bert4vec import Bert4Vec
model = Bert4Vec(mode='paraphrase-multilingual-minilm')
sent1 = ["What kind of girls do boys who play basketball like?",
         "It snowed in Xi'an? Is it very cold?",
         "How should I behave when meeting my girlfriend's parents for the first time?",
         "How is the story of the little tadpole looking for his mother?",
         "Recommend a red car to me",
         "I like Beijing",
         "That is a happy person"]
sent2 = ["What kind of girls do boys who play basketball like?",
         "What's the weather like in Xi'an? Is it still snowing?",
         "What should I do when meeting the parents for the first time?",
         "Who drew the little tadpole looking for his mother?",
         "Recommend a black car to me",
         "I don't like Beijing",
         "That is a happy dog"]
similarity = model.similarity(sent1, sent2, return_matrix=False)
# Default parameters of similarity: batch_size=64, return_matrix=False
print(similarity)
This prints the cosine similarity between each pair of corresponding sentences.
Suppose sent1 contains M sentences and sent2 contains N sentences. When the return_matrix parameter of similarity is set to False, the function returns the cosine similarity between each pair of corresponding sentences in sent1 and sent2; this requires M = N, otherwise an error is raised. When return_matrix is set to True, the function returns an M×N similarity matrix whose element in row i, column j is the cosine similarity between the i-th sentence of sent1 and the j-th sentence of sent2.
similarity = model.similarity(sent1, sent2, return_matrix=True)
print(similarity)
This prints the full 7×7 similarity matrix.
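The return_matrix semantics described above can be sketched in plain NumPy. Note that cosine_similarity here is a hypothetical helper written for illustration, not bert4vec's actual implementation:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray, return_matrix: bool = False) -> np.ndarray:
    # Normalize rows to unit length so a dot product equals cosine similarity.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    if return_matrix:
        return a @ b.T                  # full M x N similarity matrix
    assert a.shape[0] == b.shape[0], "M must equal N when return_matrix=False"
    return np.sum(a * b, axis=1)        # row-wise similarities, shape (M,)

# Toy 2-D vectors standing in for sentence embeddings
x = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([[1.0, 0.0], [1.0, 1.0]])
print(cosine_similarity(x, y))                      # row-wise: [1.0, 0.7071...]
print(cosine_similarity(x, y, return_matrix=True))  # 2 x 2 matrix
```

The same normalize-then-dot-product logic underlies both modes; only the pairing (row-wise vs. all pairs) differs.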
Semantic retrieval
bert4vec supports building CPU/GPU sentence-vector indexes with faiss. The parameters of the Bert4Vec class's build_index function are listed below:
def build_index(
self,
sentences_or_file_path: Union[str, List[str]],
ann_search: bool = False,
gpu_index: bool = False,
n_search: int = 64,
batch_size: int = 64
)
sentences_or_file_path: the path of a sentence file, or a list of sentences, to build the index from.
ann_search: whether to use approximate nearest-neighbor search. If False, the search is brute-force and returns exact results.
gpu_index: whether to build a GPU index.
n_search: the number of clusters probed during approximate nearest-neighbor search; larger values yield more accurate results.
batch_size: the batch size used when computing sentence vectors.
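For intuition, what ann_search=False computes (exact, brute-force search over all indexed vectors) can be sketched in NumPy. exact_search is a hypothetical helper for illustration, not the faiss-backed implementation bert4vec actually uses:

```python
import numpy as np

def exact_search(index_vecs: np.ndarray, query_vecs: np.ndarray,
                 threshold: float = 0.6, top_k: int = 5):
    # Normalize so that an inner product equals cosine similarity.
    index_vecs = index_vecs / np.linalg.norm(index_vecs, axis=1, keepdims=True)
    query_vecs = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    sims = query_vecs @ index_vecs.T    # (num_queries, num_indexed)
    results = []
    for row in sims:
        top = np.argsort(-row)[:top_k]  # indices of the top_k most similar items
        results.append([(int(i), float(row[i])) for i in top if row[i] >= threshold])
    return results

rng = np.random.default_rng(0)
index = rng.normal(size=(100, 8))                      # 100 indexed "sentence vectors"
queries = index[:2] + 0.01 * rng.normal(size=(2, 8))   # near-duplicates of items 0 and 1
print(exact_search(index, queries, threshold=0.5, top_k=3))
```

Approximate search (ann_search=True) trades this exhaustive scan for a clustered index that only probes n_search clusters per query, which is faster but may miss some neighbors.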
Using the deduplicated sentences of the Chinese-STS-B validation set (https://github.com/zejunwang1/CSTS) as the index, example code for approximate nearest-neighbor search looks like this:
from bert4vec import Bert4Vec
model = Bert4Vec(mode='roformer-sim-small')
sentences_path = "./sentences.txt"  # located in the examples/ folder
model.build_index(sentences_path, ann_search=True, gpu_index=False, n_search=32)
results = model.search(queries=[' A man is playing the guitar .', ' A woman is cooking '], threshold=0.6, top_k=5)
# threshold is the minimum similarity score; top_k is the number of nearest neighbors to return
print(results)
This prints, for each query, the retrieved sentences together with their similarity scores.
The Bert4Vec class supports saving and loading sentence-vector index files with the following functions:
def write_index(self, index_path: str)
def read_index(self, sentences_path: str, index_path: str, is_faiss_index: bool = True)
sentences_path is the path of the sentence file used to build the index; index_path is the path where the sentence-vector index is stored.
Model download
The author converted the original SimBERT and RoFormer-Sim model weights into a format that can be loaded with Huggingface Transformers: https://huggingface.co/WangZeJun
from bert4vec import Bert4Vec
model = Bert4Vec(mode='simbert-base', model_name_or_path='WangZeJun/simbert-base-chinese')
model = Bert4Vec(mode='roformer-sim-base', model_name_or_path='WangZeJun/roformer-sim-base-chinese')
model = Bert4Vec(mode='roformer-sim-small', model_name_or_path='WangZeJun/roformer-sim-small-chinese')
model = Bert4Vec(mode='paraphrase-multilingual-minilm', model_name_or_path='sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
The correspondence between mode and model_name_or_path is as follows:
| mode | model_name_or_path |
| --- | --- |
| simbert-base | WangZeJun/simbert-base-chinese |
| roformer-sim-base | WangZeJun/roformer-sim-base-chinese |
| roformer-sim-small | WangZeJun/roformer-sim-small-chinese |
| paraphrase-multilingual-minilm | sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 |
Once mode is set, model_name_or_path does not need to be specified; the code automatically downloads the corresponding pretrained weights from https://huggingface.co/ and loads them.
Links
https://github.com/ZhuiyiTechnology/simbert
https://github.com/ZhuiyiTechnology/roformer-sim
https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2