Basic Use of the Transformers Library
2022-06-25 01:31:00 【Empty cup realm】
This article introduces the basic use of the Transformers library.
1.1 Overview of the Transformers Library
Transformers is an open-source library; all of the pretrained models it provides are based on the Transformer architecture.
1.1.1 Transformers library
The APIs provided by the Transformers library make it easy to download and train state-of-the-art pretrained models. Using pretrained models reduces computational cost and saves the time of training a model from scratch. These models cover tasks in different modalities, for example:
- Text: text classification, information extraction, question answering, summarization, machine translation, and text generation.
- Images: image classification, object detection, and image segmentation.
- Audio: speech recognition and audio classification.
- Multimodal: table question answering, OCR, information extraction from scanned documents, video classification, and visual question answering.
The Transformers library supports the three most popular deep learning frameworks (PyTorch, TensorFlow, and JAX).
Related resources:
| Resource | URL |
|---|---|
| GitHub repository | https://github.com/huggingface/transformers |
| Official documentation | https://huggingface.co/docs/transformers/index |
| Pretrained model hub | https://huggingface.co/models |
1.1.2 Models and frameworks supported by the Transformers library
The models currently supported by the Transformers library are listed below. For each model, whether a slow or fast tokenizer is available and whether the model is implemented in PyTorch, TensorFlow, and/or Flax can be checked in the official documentation:

ALBERT, BART, BEiT, BERT, Bert Generation, BigBird, BigBirdPegasus, Blenderbot, BlenderbotSmall, CamemBERT, Canine, CLIP, ConvBERT, ConvNext, CTRL, Data2VecAudio, Data2VecText, Data2VecVision, DeBERTa, DeBERTa-v2, Decision Transformer, DeiT, DETR, DistilBERT, DPR, DPT, ELECTRA, Encoder decoder, FairSeq Machine-Translation, FlauBERT, Flava, FNet, Funnel Transformer, GLPN, GPT Neo, GPT-J, Hubert, I-BERT, ImageGPT, LayoutLM, LayoutLMv2, LED, Longformer, LUKE, LXMERT, M2M100, Marian, MaskFormer, mBART, MegatronBert, MobileBERT, MPNet, mT5, Nystromformer, OpenAI GPT, OpenAI GPT-2, OPT, Pegasus, Perceiver, PLBart, PoolFormer, ProphetNet, QDQBert, RAG, Realm, Reformer, RegNet, RemBERT, ResNet, RetriBERT, RoBERTa, RoFormer, SegFormer, SEW, SEW-D, Speech Encoder decoder, Speech2Text, Speech2Text2, Splinter, SqueezeBERT, Swin, T5, TAPAS, TAPEX, Transformer-XL, TrOCR, UniSpeech, UniSpeechSat, VAN, ViLT, Vision Encoder decoder, VisionTextDualEncoder, VisualBert, ViT, ViTMAE, Wav2Vec2, WavLM, XGLM, XLM, XLM-RoBERTa, XLM-RoBERTa-XL, XLMProphetNet, XLNet, YOLOS
Note: a slow tokenizer implements the tokenization process in pure Python, while a fast tokenizer is based on the Rust library Tokenizers.
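As a minimal sketch (bert-base-chinese is used here only as an illustrative checkpoint name, not one taken from this article), the use_fast argument of AutoTokenizer.from_pretrained() lets you choose between the two implementations:

from transformers import AutoTokenizer

# use_fast=True (the default) loads the Rust-backed fast tokenizer when one exists;
# use_fast=False forces the pure-Python slow implementation.
fast_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=True)
slow_tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese", use_fast=False)
print(fast_tokenizer.is_fast, slow_tokenizer.is_fast)  # True False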
1.2 Pipeline
The pipeline() function performs inference with a pretrained model; it supports all models that can be downloaded from the model hub (https://huggingface.co/models).
1.2.1 Pipeline Supported task types
pipeline() supports many common tasks:
- Text
  - Sentiment analysis
  - Text generation
  - Named entity recognition (NER)
  - Question answering
  - Fill-mask
  - Summarization
  - Translation
  - Feature extraction
- Images
  - Image classification
  - Image segmentation
  - Object detection
- Audio
  - Audio classification
  - Automatic speech recognition (ASR)
Note: the full set of supported tasks can be found in the Transformers source code (see the SUPPORTED_TASKS definition in transformers/pipelines/__init__.py); different versions support different task types. A quick way to list them is shown below.
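For example, a minimal sketch that prints the task names registered in SUPPORTED_TASKS (the exact import path may vary between library versions):

from transformers.pipelines import SUPPORTED_TASKS

# Each key is a task name that can be passed to pipeline(), e.g. "sentiment-analysis".
for task_name in sorted(SUPPORTED_TASKS):
    print(task_name)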
1.2.2 Using pipeline()
(1) Basic usage
For example, suppose we need to run a sentiment-analysis inference task. We can use the following code directly:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
result = classifier("We are very happy to show you the Transformers library.")
print(result)
The following results will be output:
[{'label': 'POSITIVE', 'score': 0.9997795224189758}]
In the code above, pipeline("sentiment-analysis") downloads and caches a default pretrained sentiment-analysis model together with its tokenizer. For other task types, the valid task names are documented in the explanation of the task parameter of pipeline(); the default pretrained model for each task type can be found in the SUPPORTED_TASKS definition in transformers/pipelines/__init__.py in the library source code.
To run inference on more than one sentence at a time, pass the sentences in as a list:
from transformers import pipeline
classifier = pipeline("sentiment-analysis")
results = classifier(["We are very happy to show you the Transformers library.",
"We hope you don't hate it."])
print(results)
The following results will be output:
[{'label': 'POSITIVE', 'score': 0.9997795224189758},
 {'label': 'NEGATIVE', 'score': 0.5308570265769958}]
(2) Choose a model
In the examples above, inference used the default model for the given task. Sometimes, however, we want to use a specific model; this can be done through the model parameter of pipeline().
Method 1:
from transformers import pipeline
classifier = pipeline("sentiment-analysis",
model="IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment")
result = classifier("今天心情很好")
print(result)
The following results will be output:
[{'label': 'Positive', 'score': 0.9374911785125732}]
Method 2 (this loads the same model as Method 1, but it can also load a model from a local path for inference):
from transformers import AutoModelForSequenceClassification
from transformers import AutoTokenizer
from transformers import pipeline
model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
model = AutoModelForSequenceClassification.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
classifier = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer)
result = classifier("今天心情很好")
print(result)
The following results will be output:
[{'label': 'Positive', 'score': 0.9374911785125732}]
Summary: this section described how to use pipeline() for inference on a text classification task. For other text tasks as well as image and audio tasks, the usage is essentially the same; see the official pipeline tutorial for details.
1.3 Load model
Next, we will introduce some methods of loading models .
1.3.1 Random initialization of model weights
Sometimes the model weights need to be randomly initialized (for example, when pretraining on your own data). First initialize a config object, then pass this config object to the model as a parameter:
from transformers import BertConfig
from transformers import BertModel
config = BertConfig()
model = BertModel(config)
The config above uses default values, but we can modify the corresponding parameters as needed (a small sketch follows the code below). We can also use AutoConfig.from_pretrained() to load the config of another model:
from transformers import AutoConfig
from transformers import AutoModel
model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
config = AutoConfig.from_pretrained(model_path)
model = AutoModel.from_config(config)
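As a minimal sketch of overriding default config values before building a model (num_hidden_layers and hidden_dropout_prob are standard BertConfig fields; the values here are purely illustrative):

from transformers import BertConfig
from transformers import BertModel

# Override a few defaults; the resulting model weights are still randomly initialized.
config = BertConfig(num_hidden_layers=6, hidden_dropout_prob=0.2)
model = BertModel(config)
print(model.config.num_hidden_layers)  # 6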
1.3.2 Initializing model weights from pretrained weights
More often, we need to load weights from a pretrained model. In general, AutoModelForXXX.from_pretrained() is used to load the pretrained model for the corresponding task; the XXX part differs because different task types use different classes. For example, to load a text sequence classification model we need AutoModelForSequenceClassification.
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained(
"IDEA-CCNL/Erlangshen-Roberta-110M-Sentiment")
The first parameter of AutoModelForSequenceClassification.from_pretrained(), pretrained_model_name_or_path, can be either a model name string or a local folder path:
from transformers import AutoModelForSequenceClassification
model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
model = AutoModelForSequenceClassification.from_pretrained(model_path)
We can also use a concrete model class, such as BertForSequenceClassification below:
from transformers import BertForSequenceClassification
model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
model = BertForSequenceClassification.from_pretrained(model_path)
Note: the model classes above are all PyTorch models. To use a TensorFlow model, add TF before the PyTorch class name; for example, the TensorFlow counterpart of BertForSequenceClassification is TFBertForSequenceClassification.
Summary: the official documentation recommends loading pretrained models with AutoModelForXXX and TFAutoModelForXXX, because this ensures that the correct architecture is loaded every time. A TensorFlow example is sketched below.
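A minimal, hedged sketch of the TensorFlow side (bert-base-chinese and num_labels=2 are illustrative assumptions, not taken from this article; TensorFlow must be installed):

from transformers import TFAutoModelForSequenceClassification

# Load a TensorFlow sequence classification model; a fresh 2-class head is added on top.
tf_model = TFAutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2)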
1.4 Preprocessing
A model cannot understand raw text, images, or audio directly, so the data must first be converted into a form the model can accept and then passed into the model.
1.4.1 NLP:AutoTokenizer
The main tool for processing text data is the tokenizer. First, the tokenizer splits text into tokens according to a set of rules. Then it converts these tokens to numbers (according to the vocabulary, i.e. the vocab); these numbers are assembled into tensors that serve as model inputs. Any other inputs the model needs are also added by the tokenizer.
When using a pretrained model, be sure to use its corresponding pretrained tokenizer. Only then is the text split in the same way as the pretraining corpus and mapped with the same token-to-index correspondence (i.e. the same vocab).
(1)Tokenize
Use AutoTokenizer.from_pretrained() to load a pretrained tokenizer, then pass the text into it:
from transformers import AutoTokenizer
model_path = r"../pretrained_model/IDEA-CCNL(Erlangshen-Roberta-110M-Sentiment)"
tokenizer = AutoTokenizer.from_pretrained(model_path)
encoded_input = tokenizer("今天天气真好")
print(encoded_input)
The following results will be output:
{'input_ids': [101, 791, 1921, 1921, 3698, 4696, 1962, 102],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
You can see that the output above consists of three parts:
- input_ids: the index of each token in the sentence.
- token_type_ids: when there are multiple sequences, identifies which sequence each token belongs to.
- attention_mask: indicates whether the corresponding token should be attended to (1 = attended, 0 = not attended; this relates to the attention mechanism).
We can also use the tokenizer to decode input_ids back into the original input:
decoded_input = tokenizer.decode(encoded_input["input_ids"])
print(decoded_input)
The following results will be output:
[CLS] 今 天 天 气 真 好 [SEP]
Compared with the original text, the decoded output above contains the extra tokens [CLS] and [SEP]; these are special tokens used by BERT and similar models.
If multiple sentences need to be processed at once, pass the texts into the tokenizer as a list.
(2) Padding
When we process a batch of sentences, their lengths are not always the same, but the model input must have a uniform shape. Padding is a strategy to meet this requirement: special padding tokens are added to the sentences that have fewer tokens.
Pass the parameter padding=True to tokenizer():
batch_sentences = ["今天天气真好",
                   "今天天气真好，适合出游"]
encoded_inputs = tokenizer(batch_sentences, padding=True)
print(encoded_inputs)
The following results will be output:
{'input_ids': [[101, 791, 1921, 1921, 3698, 4696, 1962, 102, 0, 0, 0, 0, 0],
               [101, 791, 1921, 1921, 3698, 4696, 1962, 8024, 6844, 1394, 1139, 3952, 102]],
 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]],
 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
                    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}
We can see that the tokenizer padded the first sentence with 0s.
(3) Truncation
Padding handles sentences that are too short, but sometimes a sentence is too long for the model to handle. In that case, the sentence can be truncated.
Simply pass the parameter truncation=True to tokenizer().
For more information about the padding and truncation parameters of tokenizer(), refer to the official documentation; a small sketch follows.
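A minimal sketch combining truncation with an explicit length limit (max_length=8 is an arbitrary illustrative value; the tokenizer and batch_sentences defined above are reused):

encoded_inputs = tokenizer(batch_sentences,
                           padding=True,
                           truncation=True,
                           max_length=8)  # sequences longer than 8 tokens are cut off
print(encoded_inputs["input_ids"])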
(4) Building tensors
Finally, if we want the tokenizer to return the actual tensors that are fed into the model, we need to set the return_tensors parameter: set it to "pt" for a PyTorch model, or "tf" for a TensorFlow model.
batch_sentences = ["今天天气真好",
                   "今天天气真好，适合出游"]
encoded_inputs = tokenizer(batch_sentences,
padding=True, truncation=True,
return_tensors="pt")
print(encoded_inputs)
The following results will be output:
{'input_ids': tensor([[ 101,  791, 1921, 1921, 3698, 4696, 1962,  102,    0,    0,    0,    0,    0],
                      [ 101,  791, 1921, 1921, 3698, 4696, 1962, 8024, 6844, 1394, 1139, 3952,  102]]),
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
                           [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
1.4.2 Other modalities
For audio data, preprocessing mainly involves resampling, feature extraction (Feature Extractor), padding, and truncation; for image data, it mainly involves feature extraction and data augmentation; for multimodal data, each data type uses the corresponding preprocessing described above. Details for each can be found in the official preprocessing documentation. Although the preprocessing methods are not exactly the same, the ultimate goal is identical: transform the raw data into a form the model can accept. A small audio sketch follows.
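As a minimal, hedged sketch of audio preprocessing (facebook/wav2vec2-base is used purely as an illustrative checkpoint, and the random array stands in for a real 1-second waveform sampled at 16 kHz):

import numpy as np
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

# A fake 1-second mono waveform; in practice this comes from an audio file
# resampled to the rate the model was pretrained with.
waveform = np.random.randn(16000).astype(np.float32)
inputs = feature_extractor(waveform, sampling_rate=16000,
                           padding=True, return_tensors="pt")
print(inputs["input_values"].shape)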
1.5 Fine-tuning a pretrained model
The following takes multi-class text classification as an example and briefly introduces how to train a classification model on our own data.
1.5.1 Prepare the data
Before fine-tuning the pretrained model, we first need to prepare the data. We can use load_dataset from the Datasets library to load the dataset:
from datasets import load_dataset
# Step 1: prepare the data
# Read the raw data from local files via a loading script
datasets = load_dataset('./my_dataset.py')
# Print the first example of the training set
print(datasets["train"][0])
Note that because we are training on our own data, the argument passed to load_dataset above is the path to a .py file. This .py file reads the files and returns the training data according to the rules of the Datasets library; for more information, refer to the Datasets documentation on dataset loading scripts.
If we just want to learn how to use the Transformers library, we can use one of the datasets preset in the Datasets library. In that case the argument passed to load_dataset is a string (for example, load_dataset("imdb")), and the corresponding dataset is downloaded automatically, as in the sketch below.
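A minimal sketch using the built-in imdb dataset (the split names and the text/label columns are those of the imdb dataset on the Hugging Face Hub):

from datasets import load_dataset

imdb = load_dataset("imdb")
print(imdb)  # a DatasetDict with "train", "test" and "unsupervised" splits
print(imdb["train"][0]["label"], imdb["train"][0]["text"][:80])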
1.5.2 Preprocessing
Before feeding the data to the model, the data needs to be preprocessed (tokenization, padding, truncation, and so on).
from transformers import AutoTokenizer
# Step 2: preprocess the data
# 2.1 Load the tokenizer (configure is the author's own configuration dict;
#     configure["model_path"] points to the pretrained model directory)
tokenizer = AutoTokenizer.from_pretrained(configure["model_path"])
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)
# 2.2 Get the tokenized data
tokenized_datasets = datasets.map(tokenize_function, batched=True)
print(tokenized_datasets["train"][0])
First, load the tokenizer; then use datasets.map() to generate the preprocessed data. We use datasets.map() because data passed directly through tokenizer() would no longer be in dataset format, whereas map() applies the tokenizer while keeping the dataset structure.
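Optionally, a smaller subset can be selected for quick experiments (shuffle and select are standard Datasets methods; the subset size of 1000 is arbitrary):

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
print(len(small_train_dataset))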
1.5.3 Load model
Model loading was introduced in an earlier section; here we can use AutoModelForXXX.from_pretrained() to load the model:
from transformers import AutoModelForSequenceClassification
# Step 3: load the model
# get_num_labels() is the author's own helper returning the number of classes in the dataset
classification_model = AutoModelForSequenceClassification.from_pretrained(
    configure["model_path"], num_labels=get_num_labels())
The difference from the earlier section is the num_labels parameter in the code above: it must be set to the number of classes in our dataset.
1.5.4 Set metrics
During training, we want to output the model's performance metrics (for example accuracy, precision, recall, and F1 score) to monitor how training is going. This can be done with load_metric() provided by the Datasets library. The following code computes accuracy:
import numpy as np
from datasets import load_metric
# Step 4: set the metric
# "./accuracy.py" is a local copy of the accuracy metric script;
# load_metric("accuracy") would download it from the Hub instead
metric = load_metric("./accuracy.py")
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
For more information, refer to the Datasets documentation on metrics.
1.5.5 Set training hyperparameters
Model training also requires setting some hyperparameters; the Transformers library provides the TrainingArguments class for this.
from transformers import TrainingArguments
# Step 5: set the training hyperparameters
training_args = TrainingArguments(output_dir=configure["output_dir"],
                                  evaluation_strategy="epoch")
In the code above we set two parameters: output_dir specifies the output path where the model is saved, and evaluation_strategy decides when the model is evaluated; setting it to "epoch" means an evaluation is run after every training epoch, using the metric defined in the previous step.
For more parameters and their exact meanings, refer to the TrainingArguments documentation; a few commonly used ones are sketched below.
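A hedged sketch adding a few more commonly used hyperparameters (the values are purely illustrative; all of the fields shown are standard TrainingArguments parameters):

training_args = TrainingArguments(
    output_dir=configure["output_dir"],
    evaluation_strategy="epoch",
    learning_rate=2e-5,                  # optimizer learning rate
    per_device_train_batch_size=16,      # training batch size per device
    per_device_eval_batch_size=16,       # evaluation batch size per device
    num_train_epochs=3,                  # total number of training epochs
    weight_decay=0.01,                   # weight decay applied by the optimizer
)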
1.5.6 Train and save models
After the previous steps, we can finally start training. The Transformers library provides the Trainer class, which makes training simple and convenient. First create a Trainer, then call its train() method to start training; when training finishes, call save_model() to save the model.
from transformers import Trainer
# Step 6: train the model
trainer = Trainer(model=classification_model,
                  args=training_args,
                  train_dataset=tokenized_datasets["train"],
                  eval_dataset=tokenized_datasets["validation"],
                  tokenizer=tokenizer,
                  compute_metrics=compute_metrics)
trainer.train()
# Save the trained model to output_dir
trainer.save_model()
Sometimes, to debug the model, we need to write our own training loop; for the detailed approach, refer to the official documentation on training in native PyTorch. A rough sketch is given below.
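A rough, hedged sketch of a manual PyTorch training loop. It assumes the tokenized dataset has columns named "text" and "label" (as implied by tokenize_function above); the column handling that Trainer normally performs internally is done explicitly here:

import torch
from torch.utils.data import DataLoader

# Put the dataset into a form a PyTorch DataLoader can consume:
# drop the raw text column, rename "label" to "labels", and return torch tensors.
train_data = tokenized_datasets["train"].remove_columns(["text"]).rename_column("label", "labels")
train_data.set_format("torch")
train_dataloader = DataLoader(train_data, shuffle=True, batch_size=8)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
classification_model.to(device)
optimizer = torch.optim.AdamW(classification_model.parameters(), lr=2e-5)

classification_model.train()
for epoch in range(3):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = classification_model(**batch)  # the model returns a loss when "labels" is given
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()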
1.5.7 summary
With the above introduction, we can now start training our own multi-class text classification model.
The sections above used multi-class text classification as the example of fine-tuning a pretrained model with the Transformers library. Other task types differ somewhat from text classification; for specific guidance, refer to the following links:
| Task type | Reference link |
|---|---|
| Text classification (Text classification) | https://huggingface.co/docs/transformers/tasks/sequence_classification |
| Token classification (e.g. NER) | https://huggingface.co/docs/transformers/tasks/token_classification |
| Question answering system (Question answering) | https://huggingface.co/docs/transformers/tasks/question_answering |
| Language model (Language modeling) | https://huggingface.co/docs/transformers/tasks/language_modeling |
| Machine translation (Translation) | https://huggingface.co/docs/transformers/tasks/translation |
| Summarization | https://huggingface.co/docs/transformers/tasks/summarization |
| Multiple choice (Multiple choice) | https://huggingface.co/docs/transformers/tasks/multiple_choice |
| Audio classification (Audio classification) | https://huggingface.co/docs/transformers/tasks/audio_classification |
| Automatic speech recognition (ASR) | https://huggingface.co/docs/transformers/tasks/asr |
| Image classification (Image classification) | https://huggingface.co/docs/transformers/tasks/image_classification |
References:
[1] GitHub repository: https://github.com/huggingface/transformers
[2] Official documentation: https://huggingface.co/docs/transformers/index
[4] https://github.com/nlp-with-transformers/notebooks
[5] https://github.com/datawhalechina/learn-nlp-with-transformers