PaddleNLP UIE classification model (taking sentiment analysis and news classification as examples, including an intelligent annotation scheme)
2022-07-23 18:51:00 【Ting】
Related articles:
PaddleNLP UIE model in action: entity extraction tasks (taxi data, express waybills)
Project link: fork my project directly on Baidu AI Studio to reproduce it
PaddleNLP UIE classification model (taking sentiment analysis and news classification as examples, including an intelligent annotation scheme)
0 Preface
First, a review of the previous project:
PaddleNLP UIE model in action: entity extraction tasks (taxi data, express waybills)
That project left the following questions open:
- How do you annotate your own sample data?
- When the sample size is large, what is a good method for intelligent annotation?
- A detailed introduction to visualization tools
This project explains data annotation, intelligent annotation, and data visualization methods in detail.
0.1 How to label data: doccano
Strongly recommended: the doccano data annotation platform: introduction, installation, usage, and pitfall records
For detailed steps, please refer to that blog post.
Official documentation:
https://github.com/doccano/doccano
Remember to enter your virtual environment first!
Step 1. Install doccano locally (please do not run this inside AI Studio; the local test environment uses python=3.8)
$ pip install doccano
Step 2. Initialize the database and account (the username and password can be replaced with custom values)
# initialize, setting username = admin, password = pass
doccano init
doccano createuser --username admin --password pass
------------------------- Personal settings ---------------------------
$ doccano init
$ doccano createuser --username my_admin_name --password my_password
Step 3. Start doccano
Start the doccano WebServer in one window and keep the window open:
$ doccano webserver --port 8000
Start the doccano task queue in another window:
$ doccano task


Open a browser (Chrome recommended), enter http://127.0.0.1:8000/ in the address bar, and press Enter to reach the interface below.

For the specific annotation process, refer to the blog post or the official documentation.
0.2 Intelligent annotation
When your data sample is large, labeling item by item is time-consuming and inefficient.
It is recommended to load a pretrained model from Hugging Face for a first-pass annotation and then do a manual review.

Data imported in JSON format must follow the doccano platform's format requirements. Here is an example with entities and relations:
{
  "text": "Google was founded on September 4, 1998, by Larry Page and Sergey Brin.",
  "entities": [
    { "id": 0, "start_offset": 0, "end_offset": 6, "label": "ORG" },
    { "id": 1, "start_offset": 22, "end_offset": 39, "label": "DATE" },
    { "id": 2, "start_offset": 44, "end_offset": 54, "label": "PERSON" },
    { "id": 3, "start_offset": 59, "end_offset": 70, "label": "PERSON" }
  ],
  "relations": [
    { "from_id": 0, "to_id": 1, "type": "foundedAt" },
    { "from_id": 0, "to_id": 2, "type": "foundedBy" },
    { "from_id": 0, "to_id": 3, "type": "foundedBy" }
  ]
}
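As a quick illustration, records in this shape are written one JSON object per line (a .jsonl file) before import; the file name below is made up:

import json

# one record in the doccano relation-extraction import shape shown above
record = {
    "text": "Google was founded on September 4, 1998, by Larry Page and Sergey Brin.",
    "entities": [{"id": 0, "start_offset": 0, "end_offset": 6, "label": "ORG"}],
    "relations": [],
}
# doccano imports one JSON object per line (JSONL)
with open("import.jsonl", "w", encoding="utf8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")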
0.3 Intelligent entity annotation + format conversion
0.3.1 Long text (one long article per txt file) code
The annotation step below uses a pretrained model to recognize entities; a commented-out variant at the end targets the import format of the Elf Annotation Assistant (精灵标注助手).
Note: the following program runs on PyTorch, because I directly used a Hugging Face pretrained model to reduce our workload. If you want to stay within Paddle, the quick recommendation is to use UIE with a small number of samples to build a simple pre-annotation model to assist labeling!
from transformers import pipeline
import pandas as pd
import json

def return_single_entity(name, start, end):
    # doccano label format: [start_offset, end_offset, label]
    return [int(start), int(end), name]

# Alternative helper for the Elf Annotation Assistant format (see the end of the script):
# def return_entity_dict(name, word, start, end, id_, attributes=[]):
#     return {'type': 'T', 'name': name, 'value': word, 'start': int(start),
#             'end': int(end), 'attributes': attributes, 'id': int(id_)}

input_path = 'C:/Users/admin/Desktop/test_input.txt'
output_dir = 'C:/Users/admin/Desktop/outputs'

tagger = pipeline(task='ner',
                  model='xlm-roberta-large-finetuned-conll03-english',
                  aggregation_strategy='simple')
# Map the model's entity groups to the labels used in doccano;
# LOC (locations) and MISC (other entity types) are ignored here.
keywords = {'PER': '人', 'ORG': '机构'}

with open(input_path, 'r', encoding='utf8') as f:
    text = f.readlines()

json_list = []
for line in text:
    sentence = line.strip('\n').strip("'").strip('"')
    named_ents = tagger(sentence)  # pretrained model inference
    # Result columns: entity_group, score, word, start, end, e.g.
    #   0  ORG  0.999997  National Science Board  18  40
    df = pd.DataFrame(named_ents)
    entity_list = []  # reset for every line, or the previous line's entities would leak over
    for _, elem in df.iterrows():
        if elem.entity_group not in keywords:
            continue
        if elem.end - elem.start <= 1:  # skip spurious one-character spans
            continue
        entity_list.append(
            return_single_entity(keywords[elem.entity_group], elem.start, elem.end))
    json_list.append(json.dumps({'text': line, 'label': entity_list}))

with open(f'{output_dir}/data_2.json', 'w', encoding='utf8') as f:
    for record in json_list:
        f.write(record + '\n')
print('done!')

# --- Alternative: convert to the Elf Annotation Assistant import format.
# Note: its NLP annotation module has encoding problems (some UTF-8 characters
# cannot be displayed correctly), which will affect annotation results.
# python_obj = {'path': f'{input_dir}/{filename}',
#               'outputs': {'annotation': {'T': entity_list, 'E': [''], 'R': [''], 'A': ['']}},
#               'time_labeled': int(1000 * time()), 'labeled': True, 'content': text}
# with open(f'{output_dir}/{filename.rstrip(".txt")}.json', 'w', encoding='utf8') as f:
#     f.write(json.dumps(python_obj))
Output:
{"text": "The company was founded in 1852 by Jacob Estey\n", "label": [[35, 46, "\u4eba"]]}
{"text": "The company was founded in 1852 by Jacob Estey, who bought out another Brattleboro manufacturing business.", "label": [[35, 46, "\u4eba"], [71, 82, "\u673a\u6784"]]}
You can see that the labels look garbled; they are just ASCII-escaped Unicode. Don't worry: after importing into doccano, the platform displays them normally.
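As a side note (optional, not required for doccano), json.dumps can keep the file itself human-readable by skipping the ASCII escaping:

import json

# ensure_ascii=False keeps the Chinese label readable instead of \uXXXX escapes
print(json.dumps({"label": [[35, 46, "人"]]}, ensure_ascii=False))
# prints: {"label": [[35, 46, "人"]]}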
0.3.2 Improving annotation quality
- Manual review
Nothing fancy here: just check the results one by one; intelligent annotation has already saved most of the trouble.
- Delete invalid annotations
import json

dir_path = r'C:/Users/admin/Desktop/Photosynthetic project/Auto Label'  # change this to your own directory
with open(f'{dir_path}/pre_data.jsonl', 'r', encoding='utf8') as f:  # input file name
    text = f.readlines()
content = [json.loads(elem.strip('\n')) for elem in text]
# keep only the records that contain at least one entity
content = [json.dumps(cont) for cont in content if cont['entities'] != []]
with open(f'{dir_path}/remove_empty_data.jsonl', 'w', encoding='utf8') as f:  # output file name
    f.write('\n'.join(content))
print('Data written')
- The processing above works well on the English dataset. For Chinese data you can build on the same idea with Paddle's UIE or a similar model: manually annotate a small batch first, train a base model from it, pre-annotate incoming data with that model, and then manually recheck the results, as sketched below.
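A minimal sketch of that Paddle-side loop, assuming a small manually labeled batch has already been used to fine-tune a UIE checkpoint (the schema and checkpoint path below are placeholders):

from paddlenlp import Taskflow

# hypothetical checkpoint produced by fine-tuning UIE on a small manual batch
pre_annotator = Taskflow('information_extraction',
                         schema=['人', '机构'],  # placeholder entity schema
                         task_path='./checkpoint/model_best')

# pre-annotate new samples, then hand the output to human reviewers
print(pre_annotator('2008年,张三在北京创办了一家机构'))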
Of course, you may ask whether there is an even simpler way. Of course there is!
Introduction to EasyData data services
Making its debut!
EasyData is a one-stop data processing and service platform launched by Baidu Brain. It provides complete data services for the data-centric steps of AI model development: data collection, data quality inspection, intelligent data processing, and data annotation. EasyData currently supports processing five basic data types: images, text, audio, video, and tables.
EasyData is also integrated with the data management modules of the EasyDL and BML platforms, so data processed with EasyData can be used directly for model training on EasyDL and BML.

It is complete and powerful, so give it a try!
I won't do any more promotion; trust Baidu, no problem, haha.
0.4 Using the VisualDL visualization tool
VisualDL is a visualization tool designed for deep learning tasks. It presents data with a rich set of charts, so users can inspect data characteristics and trends more intuitively and clearly, which helps with analyzing data, catching errors in time, and thereby improving neural network model design.
Currently, VisualDL supports seven components: scalar, image, audio, graph, histogram, pr curve, and high dimensional.
I won't go into more detail here; see my project or blog post for a detailed walkthrough:
Paddle and VisualDL tool usage (visualization tools).
from visualdl import LogWriter

if __name__ == '__main__':
    value = [i / 1000.0 for i in range(1000)]
    # Step 1: create the parent folder `log` and the subfolder `scalar_test`
    with LogWriter(logdir="./log/scalar_test") as writer:
        for step in range(1000):
            # Step 2: add data tagged `train/acc` to the logger
            writer.add_scalar(tag="train/acc", step=step, value=value[step])
            # Step 2: add data tagged `train/loss` to the logger
            writer.add_scalar(tag="train/loss", step=step, value=1 / (value[step] + 1))

    # Step 1: create a second subfolder, `scalar_test2`
    value = [i / 500.0 for i in range(1000)]
    with LogWriter(logdir="./log/scalar_test2") as writer:
        for step in range(1000):
            # Step 2: add scalar_test2's accuracy data under the same tag `train/acc`
            writer.add_scalar(tag="train/acc", step=step, value=value[step])
            # Step 2: add scalar_test2's loss data under the same tag `train/loss`
            writer.add_scalar(tag="train/loss", step=step, value=1 / (value[step] + 1))
1. Background
Text classification is the most common task in natural language processing: given a sentence or a passage of text, a text classifier assigns it to a category. Text classification is widely applied, for example in long/short text classification, sentiment analysis, news classification, event classification, government data classification, product information and category prediction, article/paper/patent classification, case description and charge classification, intent classification, automatic email tagging, positive/negative review identification, drug reaction classification, dialogue classification, tax category identification, automatic classification of incoming calls and complaints, advertising detection, sensitive and illegal content detection, content safety detection, public opinion analysis, topic tagging, and other everyday or professional domains.
By label type, text classification tasks can be divided into multi-class, multi-label, and hierarchical classification.
Getting to the point: this project demonstrates how to fine-tune a model for a multi-class task with a small number of samples.
Dataset overview: 7000+ hotel review entries, 5000+ positive and 2000+ negative.
Recommended experiment: sentiment/opinion/review tendency analysis.
Data source: Ctrip.
Original dataset: ChnSentiCorp_htl, compiled by Songbo Tan.
cla.jsonl is a dataset demo:
{"id":1286,"text":" The environment and service attitude of this hotel are also good , But the room space is too small ~~ It is not declared to accommodate too large luggage ~~ And the style of the room is ok ~~ Cantonese dim sum in Chinese restaurant is not very delicious ~~ It needs to be improved ~~~~ But the price is fair ~~ acceptable ~~ The style of Western restaurants is very good ~~ But the taste of the food is ordinary and it makes people wait too long ~~ It needs to be improved ~~\t","label":[" positive "]}
{"id":1287,"text":"< Letter of recommendation > Recommend all like < Red Mansions > Our fans must collect this book , You know, when I heard about this book, I spent a long time in the library looking for it and borrowing it, but it didn't work , So this time, when I saw it, I should have , I'll buy it right away , Red fans should also remember to stock up !\t","label":[" positive "]}
{"id":1288,"text":" The shortage of goods has not been found yet , JD's order processing speed is really ....... Package on Tuesday , It will be delivered on Friday ...\t","label":[" Negative "]}
{"id":1289,"text":"2001 Fuzhou has lived here for years , This time I feel the room is a little , There is still hot spring water . On the whole, I'm very satisfied . Breakfast is simpler .\t","label":[" positive "]}
{"id":1290,"text":" Good netbook , The shape is very beautiful , The operating system should be a big Selling point , The battery is ok . On the whole , Positioning as a netbook , It's not bad .\t","label":[" positive "]}
{"id":1291,"text":" The carpet in the room is too dirty , It's very noisy near the railway station , Fortunately, it's double glass . The service was average , In front of the hotel TAXI It's a long-term cooperative relationship of the hotel , Pay the hotel every month . From the hotel to the airport, it's about clocking 147 element , When you arrive, you have to 200 element , May be slaughtered 30-40 element .\t","label":[" Negative "]}
{"id":1292,"text":" I wanted to turn over when I was free , Unfortunately, I can't watch it , Still can't compare with Zhang , Most of his books are still influenced by Zhang , I really don't like this man , I don't know how to buy it , regret \t","label":[" Negative "]}
1.1 Preview of results
Input:
The hotel environment and service are all pretty good , The location is also good , Especially the northern Sichuan jelly in the north of the hotel is really delicious .
Facilities are aging , It's too noisy close to the road . At night, the water flow sound and air conditioning noise in the upstairs bathroom are very loud , Can't sleep .
A very good hotel , The bed is big , Very comfortable . The service attitude of the hotel staff is very friendly .
Very bad. ! We contracted a car to visit the West Lake through its business center , The car took us to informal scenic spots to buy tea .
in general , The hotel is not bad . It's quieter , The geographical location is better , Very good service , Including check-in and check-out .
Output:
[{'text': ' The hotel environment and service are all pretty good , The location is also good , Especially the northern Sichuan jelly in the north of the hotel is really delicious .\n', 'label': 'positive', 'score': 0.8420582413673401},
{'text': ' Facilities are aging , It's too noisy close to the road . At night, the water flow sound and air conditioning noise in the upstairs bathroom are very loud , Can't sleep .\n', 'label': 'negative', 'score': 0.9905866980552673},
{'text': ' A very good hotel , The bed is big , Very comfortable . The service attitude of the hotel staff is very friendly .\n', 'label': 'positive', 'score': 0.9800688028335571},
{'text': ' Very bad. ! We contracted a car to visit the West Lake through its business center , The car took us to informal scenic spots to buy tea .\n', 'label': 'negative', 'score': 0.9315289258956909},
{'text': ' in general , The hotel is not bad . It's quieter , The geographical location is better , Very good service , Including check-in and check-out .', 'label': 'positive', 'score': 0.90092933177948}]
1.2 Dataset loading
!python doccano.py \
--doccano_file ./data/cla.jsonl \
--task_type 'cls' \
--save_dir ./data \
--splits 0.8 0.1 0.1 \
--negative_ratio 5 \
--prompt_prefix "sentiment tendency" \
--options "positive" "negative"
[2022-07-18 11:28:41,687] [ INFO] - Converting doccano data...
0%| | 0/8 [00:00<?, ?it/s]
[2022-07-18 11:28:41,689] [ INFO] - Converting doccano data...
0%| | 0/1 [00:00<?, ?it/s]
[2022-07-18 11:28:41,690] [ INFO] - Converting doccano data...
0%| | 0/2 [00:00<?, ?it/s]
[2022-07-18 11:28:41,691] [ INFO] - Save 8 examples to ./data/train.txt.
[2022-07-18 11:28:41,691] [ INFO] - Save 1 examples to ./data/dev.txt.
[2022-07-18 11:28:41,691] [ INFO] - Save 2 examples to ./data/test.txt.
[2022-07-18 11:28:41,691] [ INFO] - Finished! It takes 0.00 seconds
doccano_file: the data annotation file exported from doccano.
save_dir: directory where the training data is saved; defaults to the data directory.
negative_ratio: maximum ratio of negative examples. This parameter is only effective for extraction tasks; constructing negatives properly can improve model performance. The number of negatives relates to the actual number of labels: maximum negatives = negative_ratio * number of positives. It only applies to the training set and defaults to 5. To keep evaluation metrics accurate, the validation and test sets are built with all negatives by default.
splits: proportions of the training, validation, and test sets when splitting the dataset. The default [0.8, 0.1, 0.1] splits the data 8:1:1, which here yields the 8/1/2 split shown in the log above.
task_type: task type; extraction ('ext') and classification ('cls') are supported.
options: category labels for the classification task; only effective for classification tasks. Defaults to ["positive", "negative"].
prompt_prefix: prompt prefix declared for the classification task; only effective for classification tasks. Defaults to "sentiment tendency".
During data conversion, the prompt used for model training is constructed automatically. For sentence-level sentiment classification, for example, the prompt is "sentiment tendency [positive, negative]", controlled through the prompt_prefix and options parameters.
is_shuffle: whether to shuffle the dataset; defaults to True.
seed: random seed; defaults to 1000.
separator: separator between the entity category / evaluation dimension and the classification label; only applicable to entity-level / aspect-level classification tasks. Defaults to "##".
Part of the converted output:
{"content": "No stock shortage has been found yet, but JD's order processing speed is really....... packed on Tuesday, delivered on Friday...\t", "result_list": [{"text": "negative", "start": -4, "end": -2}], "prompt": "sentiment tendency [positive,negative]"}
{"content": "I wanted to flip through it in my free time; unfortunately I can't get into it. It still can't compare with Zhang's own work; most of his books are influenced by Zhang. I really don't like this man and don't know why I bought it. Regret.\t", "result_list": [{"text": "negative", "start": -7, "end": -5}], "prompt": "sentiment tendency [negative,positive]"}
{"content": "Full keyboard with numeric keys. The graphics card is powerful enough. N card versus A card: personally biased toward N card. GHOST XP is easy. Apart from fingerprint recognition, all drivers can be installed; fingerprint recognition must be used under XP, and alternative drivers can be used. (ASUS official address, don't worry)\t", "result_list": [{"text": "positive", "start": -4, "end": -2}], "prompt": "sentiment tendency [negative,positive]"}
{"content": "The carpet in the room is too dirty, and it's very noisy near the railway station; fortunately there is double glazing. The service was average. The TAXIs in front of the hotel have a long-term cooperative relationship with the hotel and pay the hotel every month. From the hotel to the airport it's about 147 yuan on the meter, but you are asked for 200 on arrival; you may be overcharged 30-40 yuan.\t", "result_list": [{"text": "negative", "start": -7, "end": -5}], "prompt": "sentiment tendency [negative,positive]"}
{"content": "<Letter of recommendation> I recommend that all fans who like <Red Mansions> collect this book. You know, when I first heard about this book I spent a long time in the library looking for it and trying to borrow it, without success. So this time when I saw it, I bought it right away. Red Mansions fans should also remember to stock up!\t", "result_list": [{"text": "positive", "start": -7, "end": -5}], "prompt": "sentiment tendency [positive,negative]"}
2. Model training
!python finetune.py \
--train_path "./data/train.txt" \
--dev_path "./data/dev.txt" \
--save_dir "./checkpoint" \
--learning_rate 1e-5 \
--batch_size 16 \
--max_seq_len 512 \
--num_epochs 100 \
--model "uie-base" \
--seed 1000 \
--logging_steps 10 \
--valid_steps 50 \
--device "gpu"
Part of the training output is shown below (the full output has been collapsed).
(Because there are few training samples and the task is relatively simple, it is easy to reach F1 = 100%.)
[2022-07-17 11:33:46,088] [ INFO] - global step 10, epoch: 10, loss: 0.00021, speed: 1.50 step/s
[2022-07-17 11:33:52,276] [ INFO] - global step 20, epoch: 20, loss: 0.00011, speed: 1.62 step/s
[2022-07-17 11:33:58,431] [ INFO] - global step 30, epoch: 30, loss: 0.00007, speed: 1.62 step/s
[2022-07-17 11:34:04,630] [ INFO] - global step 40, epoch: 40, loss: 0.00006, speed: 1.61 step/s
[2022-07-17 11:34:10,816] [ INFO] - global step 50, epoch: 50, loss: 0.00005, speed: 1.62 step/s
[2022-07-17 11:34:10,863] [ INFO] - Evaluation precision: 1.00000, recall: 1.00000, F1: 1.00000
[2022-07-17 11:34:10,863] [ INFO] - best F1 performence has been updated: 0.00000 --> 1.00000
[2022-07-17 11:34:11,996] [ INFO] - tokenizer config file saved in ./checkpoint/model_best/tokenizer_config.json
[2022-07-17 11:34:11,997] [ INFO] - Special tokens file saved in ./checkpoint/model_best/special_tokens_map.json
[2022-07-17 11:34:18,202] [ INFO] - global step 60, epoch: 60, loss: 0.00004, speed: 1.61 step/s
[2022-07-17 11:34:24,355] [ INFO] - global step 70, epoch: 70, loss: 0.00003, speed: 1.63 step/s
[2022-07-17 11:34:30,515] [ INFO] - global step 80, epoch: 80, loss: 0.00003, speed: 1.62 step/s
[2022-07-17 11:34:36,700] [ INFO] - global step 90, epoch: 90, loss: 0.00003, speed: 1.62 step/s
[2022-07-17 11:34:42,851] [ INFO] - global step 100, epoch: 100, loss: 0.00002, speed: 1.63 step/s
[2022-07-17 11:34:42,897] [ INFO] - Evaluation precision: 1.00000, recall: 1.00000, F1: 1.00000
A GPU environment is recommended; otherwise memory overflow may occur. In a CPU environment, you can change the model to uie-tiny and adjust batch_size appropriately.
To improve accuracy, set --num_epochs higher for a longer training run.
Configurable parameters:
train_path: training set file path.
dev_path: validation set file path.
save_dir: model save path; defaults to ./checkpoint.
learning_rate: learning rate; defaults to 1e-5.
batch_size: batch size; adjust it to fit GPU memory, and lower it appropriately if memory runs short; defaults to 16.
max_seq_len: maximum text length; inputs exceeding this length are split automatically; defaults to 512.
num_epochs: number of training epochs; defaults to 100.
model: the model to fine-tune; uie-base and uie-tiny are available; defaults to uie-base.
seed: random seed; defaults to 1000.
logging_steps: number of steps between log prints; defaults to 10.
valid_steps: number of steps between evaluations; defaults to 100.
device: device used for training; cpu or gpu.
3. Model evaluation
!python evaluate.py \
--model_path ./checkpoint/model_best \
--test_path ./data/test.txt \
--batch_size 16 \
--max_seq_len 512
[2022-07-18 11:37:05,934] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint/model_best'.
W0718 11:37:05.965226 2210 gpu_context.cc:278] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W0718 11:37:05.969079 2210 gpu_context.cc:306] device: 0, cuDNN Version: 7.6.
[2022-07-18 11:37:11,584] [ INFO] - -----------------------------
[2022-07-18 11:37:11,584] [ INFO] - Class Name: all_classes
[2022-07-18 11:37:11,584] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
model_path: path to the model folder to evaluate; it must contain the weights file model_state.pdparams and the configuration file model_config.json.
test_path: test set file used for evaluation.
batch_size: batch size; adjust it to your machine; defaults to 16.
max_seq_len: maximum text length; inputs exceeding this length are split automatically; defaults to 512.
model: the model to use; uie-base, uie-medium, uie-mini, uie-micro, and uie-nano are available; defaults to uie-base.
debug: whether to enable debug mode, which evaluates each positive-example category separately; intended only for model debugging, off by default (see the example below).
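For instance, to re-run the evaluation with per-class metrics, add the debug flag to the same command as above:

!python evaluate.py \
--model_path ./checkpoint/model_best \
--test_path ./data/test.txt \
--debug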
4. Prediction
import json
from paddlenlp import Taskflow

def openreadtxt(file_name):
    # read the input file and return a list of lines
    with open(file_name, 'r', encoding='UTF-8') as file:
        return file.readlines()

data_input = openreadtxt('./input/nlp.txt')

schema = 'sentiment tendency [positive,negative]'
few_ie = Taskflow('information_extraction', schema=schema, batch_size=32,
                  task_path='./checkpoint/model_best')
# few_ie = Taskflow('sentiment_analysis', schema=schema, batch_size=32, task_path='./checkpoint/model_best')
results = few_ie(data_input)

# "w+" creates the file if it does not exist and overwrites it otherwise
with open("./output/result.txt", "w+", encoding='UTF-8') as f:
    for result in results:
        # json.dumps escapes Chinese to ASCII by default;
        # ensure_ascii=False keeps the real characters
        line = json.dumps(result, ensure_ascii=False)
        f.write(line + "\n")
print("Results exported")
print(results)
Input file contents:
The hotel environment and service are all pretty good , The location is also good , Especially the northern Sichuan jelly in the north of the hotel is really delicious .
Facilities are aging , It's too noisy close to the road . At night, the water flow sound and air conditioning noise in the upstairs bathroom are very loud , Can't sleep .
A very good hotel , The bed is big , Very comfortable . The service attitude of the hotel staff is very friendly .
Very bad. ! We contracted a car to visit the West Lake through its business center , The car took us to informal scenic spots to buy tea .
in general , The hotel is not bad . It's quieter , The geographical location is better , Very good service , Including check-in and check-out .
Output:
[{'text': ' The hotel environment and service are all pretty good , The location is also good , Especially the northern Sichuan jelly in the north of the hotel is really delicious .\n', 'label': 'positive', 'score': 0.8420582413673401},
{'text': ' Facilities are aging , It's too noisy close to the road . At night, the water flow sound and air conditioning noise in the upstairs bathroom are very loud , Can't sleep .\n', 'label': 'negative', 'score': 0.9905866980552673},
{'text': ' A very good hotel , The bed is big , Very comfortable . The service attitude of the hotel staff is very friendly .\n', 'label': 'positive', 'score': 0.9800688028335571},
{'text': ' Very bad. ! We contracted a car to visit the West Lake through its business center , The car took us to informal scenic spots to buy tea .\n', 'label': 'negative', 'score': 0.9315289258956909},
{'text': ' in general , The hotel is not bad . It's quieter , The geographical location is better , Very good service , Including check-in and check-out .', 'label': 'positive', 'score': 0.90092933177948}]
PaddleNLP also provides two ready-made sentiment analysis models: the default BiLSTM, and SKEP. A quick sketch of the default model follows; SKEP is described after it.
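As a minimal sketch of the default (BiLSTM) usage; the example sentence here is made up:

from paddlenlp import Taskflow

# with no model argument, the sentiment_analysis Taskflow uses the default BiLSTM model
senta = Taskflow("sentiment_analysis")
print(senta("这家酒店位置很好,服务也周到"))
# e.g. [{'text': '这家酒店位置很好,服务也周到', 'label': 'positive', 'score': ...}]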
SKEP (Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis) is a sentiment pretraining algorithm developed by Baidu's research team. It automatically mines sentiment knowledge with unsupervised methods and then uses that knowledge to build the pretraining objectives, so the model learns to understand sentiment semantics. Pretrained on massive Chinese data, SKEP provides a unified and powerful sentiment representation for all kinds of sentiment analysis tasks; it comprehensively surpassed the previous SOTA on 14 typical Chinese and English sentiment analysis tasks, and the work was accepted at ACL 2020.
Paper: SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis
import json
from paddlenlp import Taskflow

def openreadtxt(file_name):
    # read the input file and return a list of lines
    with open(file_name, 'r', encoding='UTF-8') as file:
        return file.readlines()

data_input = openreadtxt('./input/nlp.txt')

# the sentiment_analysis task needs no schema; just select the SKEP model
few_ie = Taskflow("sentiment_analysis", model="skep_ernie_1.0_large_ch", batch_size=16)
results = few_ie(data_input)

# "w+" creates the file if it does not exist and overwrites it otherwise
with open("./output/result.txt", "w+", encoding='UTF-8') as f:
    for result in results:
        line = json.dumps(result, ensure_ascii=False)  # keep real Chinese instead of ASCII escapes
        f.write(line + "\n")
print("Results exported")
print(results)
[{'text': ' The hotel environment and service are all pretty good , The location is also good , Especially the northern Sichuan jelly in the north of the hotel is really delicious .\n', 'label': 'positive', 'score': 0.9441452622413635},
{'text': ' Facilities are aging , It's too noisy close to the road . At night, the water flow sound and air conditioning noise in the upstairs bathroom are very loud , Can't sleep .\n', 'label': 'negative', 'score': 0.991821825504303},
{'text': ' A very good hotel , The bed is big , Very comfortable . The service attitude of the hotel staff is very friendly .\n', 'label': 'positive', 'score': 0.989535927772522},
{'text': ' Very bad. ! We contracted a car to visit the West Lake through its business center , The car took us to informal scenic spots to buy tea .\n', 'label': 'negative', 'score': 0.9811170697212219},
{'text': ' in general , The hotel is not bad . It's quieter , The geographical location is better , Very good service , Including check-in and check-out .', 'label': 'positive', 'score': 0.8622702360153198}]
5. Active learning and flexible application (news text classification demo)
Obtain a relevant dataset and process it. Here only part of the agriculture, finance, and real estate data is selected, just to test the feasibility of the scheme.
Export the news dataset provided on the official site, then annotate it yourself on the platform!
!python doccano.py \
--doccano_file ./data/input.jsonl \
--task_type 'cls' \
--save_dir ./data \
--splits 0.85 0.15 0 \
--negative_ratio 5 \
--prompt_prefix "news classification" \
--options "agriculture" "real estate" "finance"
!python finetune.py \
--train_path "./data/train.txt" \
--dev_path "./data/dev.txt" \
--save_dir "./checkpoint2" \
--learning_rate 1e-5 \
--batch_size 16 \
--max_seq_len 512 \
--num_epochs 200 \
--model "uie-base" \
--seed 1000 \
--logging_steps 10 \
--valid_steps 50 \
--device "gpu"
from paddlenlp import Taskflow

data = [
    "The dollar once again raises the financial butcher's knife; the economies of these 10 countries may be in jeopardy; Xicheng again exposes 10 illegal intermediaries",
    "The era of Wiki chain is coming; it will sweep across China in an instant, skyrocketing 800%! Grab it and earn!",
    "Our province issued this year's minimum living security standards for urban and rural residents",
]
schema = 'news classification [agriculture,real estate,finance]'
few_ie = Taskflow('information_extraction', schema=schema, task_path='./checkpoint2/model_best')
results = few_ie(data)
for text, result in zip(data, results):
    print('Data: {} \t Label: {}'.format(text, result))
Data: The dollar once again raises the financial butcher's knife; the economies of these 10 countries may be in jeopardy; Xicheng again exposes 10 illegal intermediaries 	 Label: {'news classification [agriculture,real estate,finance]': [{'text': 'finance', 'probability': 0.8674286780486753}]}
Data: The era of Wiki chain is coming; it will sweep across China in an instant, skyrocketing 800%! Grab it and earn! 	 Label: {'news classification [agriculture,real estate,finance]': [{'text': 'agriculture', 'probability': 0.4909489670645364}]}
Data: Our province issued this year's minimum living security standards for urban and rural residents 	 Label: {'news classification [agriculture,real estate,finance]': [{'text': 'agriculture', 'probability': 0.980139386504348}]}
The results look decent; the approach shows some effect. This was only a quick attempt, and I have not verified the performance in detail.
Still, for classification a dedicated classification model is recommended; ERNIE 3.0 works better. A sketch is given below.
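As a starting point, a minimal sketch of loading a dedicated classifier from PaddleNLP (the model name "ernie-3.0-medium-zh" is one entry from the PaddleNLP model zoo; this is an untrained head, not a benchmarked recipe, and would need fine-tuning before use):

import paddle
from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer

# hypothetical 3-way news classifier head on ERNIE 3.0 (fine-tune before use)
model = AutoModelForSequenceClassification.from_pretrained("ernie-3.0-medium-zh", num_classes=3)
tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh")

inputs = tokenizer("我省今年发布城乡居民最低生活保障标准", return_tensors="pd")
logits = model(**inputs)
print(paddle.argmax(logits, axis=-1))  # predicted class index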
**Follow-up: I will verify the performance of the hub, ERNIE, and prompt approaches. Since they are all based on the same underlying models, the results may well be similar; I will recommend the most convenient solution then!**
6. Summary
UIE (Universal Information Extraction): Yaojie Lu et al. proposed the unified information extraction framework UIE at ACL 2022. The framework models entity extraction, relation extraction, event extraction, and sentiment analysis tasks in a unified way, giving different tasks good transfer and generalization ability. Drawing on this paper, PaddleNLP trained and open-sourced the first Chinese universal information extraction model, UIE, based on the knowledge-enhanced pretrained model ERNIE 3.0. The model can extract key information without restrictions on industry domain or extraction target, supports zero-shot rapid cold start, and has excellent few-shot fine-tuning ability that quickly adapts to specific extraction targets.
Advantages of UIE:
Easy to use: users can specify extraction targets in natural language and uniformly extract the corresponding information from input text without any training. It works out of the box and meets all kinds of information extraction needs.
Cost-efficient: previous information extraction techniques required large amounts of labeled data to guarantee extraction quality. Open-domain information extraction can work with zero samples (zero-shot) or few samples (few-shot), greatly reducing the dependence on labeled data and cutting costs while also improving results.
Leading results: open-domain information extraction is used in many scenarios, and it performs excellently across a variety of tasks.
This post shared sentiment classification and news classification cases, mainly around the open-source PaddleNLP. The official documentation does not cover how to fine-tune for specific classification tasks, so a demo is given here for reference. For now the few-shot, multi-class results look fine; I will expand the sample later for a complete test.
Of course, you can replace the dataset in this project and try building your own classification model. Start with binary classification and work your way up to more categories!
My blog :https://blog.csdn.net/sinat_39620217?type=blog