PaddleNLP UIE classification model (taking sentiment analysis and news classification as examples, including an intelligent annotation scheme)
2022-07-23 18:51:00 【Ting】
Related articles:
PaddleNLP UIE model in action: entity extraction tasks (taxi data, express waybills)
Project link: fork my project directly on Baidu AI Studio to reproduce it
PaddleNLP UIE classification model (taking sentiment analysis and news classification as examples, including an intelligent annotation scheme)
0 Preface
First, a review of the previous project:
PaddleNLP UIE model in action: entity extraction tasks (taxi data, express waybills)
That project left the following questions open:
- How do you annotate your own sample data?
- When the sample size is large, what is a good method for intelligent annotation?
- A detailed introduction to visualization tools
This project explains data annotation, intelligent annotation, and data visualization methods in detail.
0.1 How to label data: doccano
Strongly recommended: the doccano data annotation platform: introduction, installation, usage, and pitfall records
For detailed steps, please refer to that blog post.
Official documentation:
https://github.com/doccano/doccano
Remember to enter your virtual environment first!
Step 1. Install doccano locally (please do not run this inside AI Studio; the local test environment uses python=3.8)
$ pip install doccano
Step 2. Initialize the database and account (the username and password can be replaced with custom values)
# initialize, setting username = admin, password = pass
doccano init
doccano createuser --username admin --password pass
------------------------- Personal settings ---------------------------
$ doccano init
$ doccano createuser --username my_admin_name --password my_password
Step 3. Start doccano
Start the doccano WebServer in one window and keep the window open:
$ doccano webserver --port 8000
Start the doccano task queue in another window:
$ doccano task


Open a browser (Chrome recommended), enter http://127.0.0.1:8000/ in the address bar, and press Enter to reach the interface below.

For the specific annotation process, refer to the blog post or the official documentation.
0.2 Intelligent annotation
When your data sample is large, labeling item by item is time-consuming and inefficient.
It is recommended to load a pretrained model from Hugging Face for a first-pass annotation and then do a manual review.

Data imported in JSON format must follow the doccano platform's format requirements. Here is an example with entities and relations:
{
  "text": "Google was founded on September 4, 1998, by Larry Page and Sergey Brin.",
  "entities": [
    { "id": 0, "start_offset": 0, "end_offset": 6, "label": "ORG" },
    { "id": 1, "start_offset": 22, "end_offset": 39, "label": "DATE" },
    { "id": 2, "start_offset": 44, "end_offset": 54, "label": "PERSON" },
    { "id": 3, "start_offset": 59, "end_offset": 70, "label": "PERSON" }
  ],
  "relations": [
    { "from_id": 0, "to_id": 1, "type": "foundedAt" },
    { "from_id": 0, "to_id": 2, "type": "foundedBy" },
    { "from_id": 0, "to_id": 3, "type": "foundedBy" }
  ]
}
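As a quick illustration, records in this shape are written one JSON object per line (a .jsonl file) before import; the file name below is made up:

import json

# one record in the doccano relation-extraction import shape shown above
record = {
    "text": "Google was founded on September 4, 1998, by Larry Page and Sergey Brin.",
    "entities": [{"id": 0, "start_offset": 0, "end_offset": 6, "label": "ORG"}],
    "relations": [],
}
# doccano imports one JSON object per line (JSONL)
with open("import.jsonl", "w", encoding="utf8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")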
0.3 Intelligent entity annotation + format conversion
0.3.1 Long text (one long article per txt file) code
The annotation step below uses a pretrained model to recognize entities; a commented-out variant at the end targets the import format of the Elf Annotation Assistant (精灵标注助手).
Note: the following program runs on PyTorch, because I directly used a Hugging Face pretrained model to reduce our workload. If you want to stay within Paddle, the quick recommendation is to use UIE with a small number of samples to build a simple pre-annotation model to assist labeling!
from transformers import pipeline
import pandas as pd
import json

def return_single_entity(name, start, end):
    # doccano label format: [start_offset, end_offset, label]
    return [int(start), int(end), name]

# Alternative helper for the Elf Annotation Assistant format (see the end of the script):
# def return_entity_dict(name, word, start, end, id_, attributes=[]):
#     return {'type': 'T', 'name': name, 'value': word, 'start': int(start),
#             'end': int(end), 'attributes': attributes, 'id': int(id_)}

input_path = 'C:/Users/admin/Desktop/test_input.txt'
output_dir = 'C:/Users/admin/Desktop/outputs'

tagger = pipeline(task='ner',
                  model='xlm-roberta-large-finetuned-conll03-english',
                  aggregation_strategy='simple')
# Map the model's entity groups to the labels used in doccano;
# LOC (locations) and MISC (other entity types) are ignored here.
keywords = {'PER': '人', 'ORG': '机构'}

with open(input_path, 'r', encoding='utf8') as f:
    text = f.readlines()

json_list = []
for line in text:
    sentence = line.strip('\n').strip("'").strip('"')
    named_ents = tagger(sentence)  # pretrained model inference
    # Result columns: entity_group, score, word, start, end, e.g.
    #   0  ORG  0.999997  National Science Board  18  40
    df = pd.DataFrame(named_ents)
    entity_list = []  # reset for every line, or the previous line's entities would leak over
    for _, elem in df.iterrows():
        if elem.entity_group not in keywords:
            continue
        if elem.end - elem.start <= 1:  # skip spurious one-character spans
            continue
        entity_list.append(
            return_single_entity(keywords[elem.entity_group], elem.start, elem.end))
    json_list.append(json.dumps({'text': line, 'label': entity_list}))

with open(f'{output_dir}/data_2.json', 'w', encoding='utf8') as f:
    for record in json_list:
        f.write(record + '\n')
print('done!')

# --- Alternative: convert to the Elf Annotation Assistant import format.
# Note: its NLP annotation module has encoding problems (some UTF-8 characters
# cannot be displayed correctly), which will affect annotation results.
# python_obj = {'path': f'{input_dir}/{filename}',
#               'outputs': {'annotation': {'T': entity_list, 'E': [''], 'R': [''], 'A': ['']}},
#               'time_labeled': int(1000 * time()), 'labeled': True, 'content': text}
# with open(f'{output_dir}/{filename.rstrip(".txt")}.json', 'w', encoding='utf8') as f:
#     f.write(json.dumps(python_obj))
Output:
{"text": "The company was founded in 1852 by Jacob Estey\n", "label": [[35, 46, "\u4eba"]]}
{"text": "The company was founded in 1852 by Jacob Estey, who bought out another Brattleboro manufacturing business.", "label": [[35, 46, "\u4eba"], [71, 82, "\u673a\u6784"]]}
You can see that the labels look garbled; they are just ASCII-escaped Unicode. Don't worry: after importing into doccano, the platform displays them normally.
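As a side note (optional, not required for doccano), json.dumps can keep the file itself human-readable by skipping the ASCII escaping:

import json

# ensure_ascii=False keeps the Chinese label readable instead of \uXXXX escapes
print(json.dumps({"label": [[35, 46, "人"]]}, ensure_ascii=False))
# prints: {"label": [[35, 46, "人"]]}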
0.3.2 Improving annotation quality
- Manual review
Nothing fancy here: just check the results one by one; intelligent annotation has already saved most of the trouble.
- Delete invalid annotations
import json

dir_path = r'C:/Users/admin/Desktop/Photosynthetic project/Auto Label'  # change this to your own directory
with open(f'{dir_path}/pre_data.jsonl', 'r', encoding='utf8') as f:  # input file name
    text = f.readlines()
content = [json.loads(elem.strip('\n')) for elem in text]
# keep only the records that contain at least one entity
content = [json.dumps(cont) for cont in content if cont['entities'] != []]
with open(f'{dir_path}/remove_empty_data.jsonl', 'w', encoding='utf8') as f:  # output file name
    f.write('\n'.join(content))
print('Data written')
- The processing above works well on the English dataset. For Chinese data you can build on the same idea with Paddle's UIE or a similar model: manually annotate a small batch first, train a base model from it, pre-annotate incoming data with that model, and then manually recheck the results, as sketched below.
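A minimal sketch of that Paddle-side loop, assuming a small manually labeled batch has already been used to fine-tune a UIE checkpoint (the schema and checkpoint path below are placeholders):

from paddlenlp import Taskflow

# hypothetical checkpoint produced by fine-tuning UIE on a small manual batch
pre_annotator = Taskflow('information_extraction',
                         schema=['人', '机构'],  # placeholder entity schema
                         task_path='./checkpoint/model_best')

# pre-annotate new samples, then hand the output to human reviewers
print(pre_annotator('2008年,张三在北京创办了一家机构'))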
Of course, you may ask whether there is an even simpler way. Of course there is!
Introduction to EasyData data services
Making its debut!
EasyData is a one-stop data processing and service platform launched by Baidu Brain. It provides complete data services for the data-centric steps of AI model development: data collection, data quality inspection, intelligent data processing, and data annotation. EasyData currently supports processing five basic data types: images, text, audio, video, and tables.
EasyData is also integrated with the data management modules of the EasyDL and BML platforms, so data processed with EasyData can be used directly for model training on EasyDL and BML.

It is complete and powerful, so give it a try!
I won't do any more promotion; trust Baidu, no problem, haha.
0.4 Using the VisualDL visualization tool
VisualDL is a visualization tool designed for deep learning tasks. It presents data with a rich set of charts, so users can inspect data characteristics and trends more intuitively and clearly, which helps with analyzing data, catching errors in time, and thereby improving neural network model design.
Currently, VisualDL supports seven components: scalar, image, audio, graph, histogram, pr curve, and high dimensional.
I won't go into more detail here; see my project or blog post for a detailed walkthrough:
Paddle and VisualDL tool usage (visualization tools).
from visualdl import LogWriter

if __name__ == '__main__':
    value = [i / 1000.0 for i in range(1000)]
    # Step 1: create the parent folder `log` and the subfolder `scalar_test`
    with LogWriter(logdir="./log/scalar_test") as writer:
        for step in range(1000):
            # Step 2: add data tagged `train/acc` to the logger
            writer.add_scalar(tag="train/acc", step=step, value=value[step])
            # Step 2: add data tagged `train/loss` to the logger
            writer.add_scalar(tag="train/loss", step=step, value=1 / (value[step] + 1))

    # Step 1: create a second subfolder, `scalar_test2`
    value = [i / 500.0 for i in range(1000)]
    with LogWriter(logdir="./log/scalar_test2") as writer:
        for step in range(1000):
            # Step 2: add scalar_test2's accuracy data under the same tag `train/acc`
            writer.add_scalar(tag="train/acc", step=step, value=value[step])
            # Step 2: add scalar_test2's loss data under the same tag `train/loss`
            writer.add_scalar(tag="train/loss", step=step, value=1 / (value[step] + 1))
1. Background
Text classification is the most common task in natural language processing: given a sentence or a passage of text, a text classifier assigns it to a category. Text classification is widely applied, for example in long/short text classification, sentiment analysis, news classification, event classification, government data classification, product information and category prediction, article/paper/patent classification, case description and charge classification, intent classification, automatic email tagging, positive/negative review identification, drug reaction classification, dialogue classification, tax category identification, automatic classification of incoming calls and complaints, advertising detection, sensitive and illegal content detection, content safety detection, public opinion analysis, topic tagging, and other everyday or professional domains.
By label type, text classification tasks can be divided into multi-class, multi-label, and hierarchical classification.
Getting to the point: this project demonstrates how to fine-tune a model for a multi-class task with a small number of samples.
Dataset overview: 7000+ hotel review entries, 5000+ positive and 2000+ negative.
Recommended experiment: sentiment/opinion/review tendency analysis.
Data source: Ctrip.
Original dataset: ChnSentiCorp_htl, compiled by Songbo Tan.
cla.jsonl is a dataset demo:
{"id":1286,"text":" The environment and service attitude of this hotel are also good , But the room space is too small ~~ It is not declared to accommodate too large luggage ~~ And the style of the room is ok ~~ Cantonese dim sum in Chinese restaurant is not very delicious ~~ It needs to be improved ~~~~ But the price is fair ~~ acceptable ~~ The style of Western restaurants is very good ~~ But the taste of the food is ordinary and it makes people wait too long ~~ It needs to be improved ~~\t","label":[" positive "]}
{"id":1287,"text":"< Letter of recommendation > Recommend all like < Red Mansions > Our fans must collect this book , You know, when I heard about this book, I spent a long time in the library looking for it and borrowing it, but it didn't work , So this time, when I saw it, I should have , I'll buy it right away , Red fans should also remember to stock up !\t","label":[" positive "]}
{"id":1288,"text":" The shortage of goods has not been found yet , JD's order processing speed is really ....... Package on Tuesday , It will be delivered on Friday ...\t","label":[" Negative "]}
{"id":1289,"text":"2001 Fuzhou has lived here for years , This time I feel the room is a little , There is still hot spring water . On the whole, I'm very satisfied . Breakfast is simpler .\t","label":[" positive "]}
{"id":1290,"text":" Good netbook , The shape is very beautiful , The operating system should be a big Selling point , The battery is ok . On the whole , Positioning as a netbook , It's not bad .\t","label":[" positive "]}
{"id":1291,"text":" The carpet in the room is too dirty , It's very noisy near the railway station , Fortunately, it's double glass . The service was average , In front of the hotel TAXI It's a long-term cooperative relationship of the hotel , Pay the hotel every month . From the hotel to the airport, it's about clocking 147 element , When you arrive, you have to 200 element , May be slaughtered 30-40 element .\t","label":[" Negative "]}
{"id":1292,"text":" I wanted to turn over when I was free , Unfortunately, I can't watch it , Still can't compare with Zhang , Most of his books are still influenced by Zhang , I really don't like this man , I don't know how to buy it , regret \t","label":[" Negative "]}
1.1 Preview of results
Input:
The hotel environment and service are all pretty good , The location is also good , Especially the northern Sichuan jelly in the north of the hotel is really delicious .
Facilities are aging , It's too noisy close to the road . At night, the water flow sound and air conditioning noise in the upstairs bathroom are very loud , Can't sleep .
A very good hotel , The bed is big , Very comfortable . The service attitude of the hotel staff is very friendly .
Very bad. ! We contracted a car to visit the West Lake through its business center , The car took us to informal scenic spots to buy tea .
in general , The hotel is not bad . It's quieter , The geographical location is better , Very good service , Including check-in and check-out .
Output:
[{'text': ' The hotel environment and service are all pretty good , The location is also good , Especially the northern Sichuan jelly in the north of the hotel is really delicious .\n', 'label': 'positive', 'score': 0.8420582413673401},
{'text': ' Facilities are aging , It's too noisy close to the road . At night, the water flow sound and air conditioning noise in the upstairs bathroom are very loud , Can't sleep .\n', 'label': 'negative', 'score': 0.9905866980552673},
{'text': ' A very good hotel , The bed is big , Very comfortable . The service attitude of the hotel staff is very friendly .\n', 'label': 'positive', 'score': 0.9800688028335571},
{'text': ' Very bad. ! We contracted a car to visit the West Lake through its business center , The car took us to informal scenic spots to buy tea .\n', 'label': 'negative', 'score': 0.9315289258956909},
{'text': ' in general , The hotel is not bad . It's quieter , The geographical location is better , Very good service , Including check-in and check-out .', 'label': 'positive', 'score': 0.90092933177948}]
1.2 Dataset loading
!python doccano.py \
--doccano_file ./data/cla.jsonl \
--task_type 'cls' \
--save_dir ./data \
--splits 0.8 0.1 0.1 \
--negative_ratio 5 \
--prompt_prefix "sentiment tendency" \
--options "positive" "negative"
[2022-07-18 11:28:41,687] [ INFO] - Converting doccano data...
0%| | 0/8 [00:00<?, ?it/s]
[2022-07-18 11:28:41,689] [ INFO] - Converting doccano data...
0%| | 0/1 [00:00<?, ?it/s]
[2022-07-18 11:28:41,690] [ INFO] - Converting doccano data...
0%| | 0/2 [00:00<?, ?it/s]
[2022-07-18 11:28:41,691] [ INFO] - Save 8 examples to ./data/train.txt.
[2022-07-18 11:28:41,691] [ INFO] - Save 1 examples to ./data/dev.txt.
[2022-07-18 11:28:41,691] [ INFO] - Save 2 examples to ./data/test.txt.
[2022-07-18 11:28:41,691] [ INFO] - Finished! It takes 0.00 seconds
doccano_file: the data annotation file exported from doccano.
save_dir: directory where the training data is saved; defaults to the data directory.
negative_ratio: maximum ratio of negative examples. This parameter is only effective for extraction tasks; constructing negatives properly can improve model performance. The number of negatives relates to the actual number of labels: maximum negatives = negative_ratio * number of positives. It only applies to the training set and defaults to 5. To keep evaluation metrics accurate, the validation and test sets are built with all negatives by default.
splits: proportions of the training, validation, and test sets when splitting the dataset. The default [0.8, 0.1, 0.1] splits the data 8:1:1, which here yields the 8/1/2 split shown in the log above.
task_type: task type; extraction ('ext') and classification ('cls') are supported.
options: category labels for the classification task; only effective for classification tasks. Defaults to ["positive", "negative"].
prompt_prefix: prompt prefix declared for the classification task; only effective for classification tasks. Defaults to "sentiment tendency".
During data conversion, the prompt used for model training is constructed automatically. For sentence-level sentiment classification, for example, the prompt is "sentiment tendency [positive, negative]", controlled through the prompt_prefix and options parameters.
is_shuffle: whether to shuffle the dataset; defaults to True.
seed: random seed; defaults to 1000.
separator: separator between the entity category / evaluation dimension and the classification label; only applicable to entity-level / aspect-level classification tasks. Defaults to "##".
Part of the converted output:
{"content": "No stock shortage has been found yet, but JD's order processing speed is really....... packed on Tuesday, delivered on Friday...\t", "result_list": [{"text": "negative", "start": -4, "end": -2}], "prompt": "sentiment tendency [positive,negative]"}
{"content": "I wanted to flip through it in my free time; unfortunately I can't get into it. It still can't compare with Zhang's own work; most of his books are influenced by Zhang. I really don't like this man and don't know why I bought it. Regret.\t", "result_list": [{"text": "negative", "start": -7, "end": -5}], "prompt": "sentiment tendency [negative,positive]"}
{"content": "Full keyboard with numeric keys. The graphics card is powerful enough. N card versus A card: personally biased toward N card. GHOST XP is easy. Apart from fingerprint recognition, all drivers can be installed; fingerprint recognition must be used under XP, and alternative drivers can be used. (ASUS official address, don't worry)\t", "result_list": [{"text": "positive", "start": -4, "end": -2}], "prompt": "sentiment tendency [negative,positive]"}
{"content": "The carpet in the room is too dirty, and it's very noisy near the railway station; fortunately there is double glazing. The service was average. The TAXIs in front of the hotel have a long-term cooperative relationship with the hotel and pay the hotel every month. From the hotel to the airport it's about 147 yuan on the meter, but you are asked for 200 on arrival; you may be overcharged 30-40 yuan.\t", "result_list": [{"text": "negative", "start": -7, "end": -5}], "prompt": "sentiment tendency [negative,positive]"}
{"content": "<Letter of recommendation> I recommend that all fans who like <Red Mansions> collect this book. You know, when I first heard about this book I spent a long time in the library looking for it and trying to borrow it, without success. So this time when I saw it, I bought it right away. Red Mansions fans should also remember to stock up!\t", "result_list": [{"text": "positive", "start": -7, "end": -5}], "prompt": "sentiment tendency [positive,negative]"}
2. Model training
!python finetune.py \
--train_path "./data/train.txt" \
--dev_path "./data/dev.txt" \
--save_dir "./checkpoint" \
--learning_rate 1e-5 \
--batch_size 16 \
--max_seq_len 512 \
--num_epochs 100 \
--model "uie-base" \
--seed 1000 \
--logging_steps 10 \
--valid_steps 50 \
--device "gpu"
Part of the training output is shown below (the full output has been collapsed).
(Because there are few training samples and the task is relatively simple, it is easy to reach F1 = 100%.)
[2022-07-17 11:33:46,088] [ INFO] - global step 10, epoch: 10, loss: 0.00021, speed: 1.50 step/s
[2022-07-17 11:33:52,276] [ INFO] - global step 20, epoch: 20, loss: 0.00011, speed: 1.62 step/s
[2022-07-17 11:33:58,431] [ INFO] - global step 30, epoch: 30, loss: 0.00007, speed: 1.62 step/s
[2022-07-17 11:34:04,630] [ INFO] - global step 40, epoch: 40, loss: 0.00006, speed: 1.61 step/s
[2022-07-17 11:34:10,816] [ INFO] - global step 50, epoch: 50, loss: 0.00005, speed: 1.62 step/s
[2022-07-17 11:34:10,863] [ INFO] - Evaluation precision: 1.00000, recall: 1.00000, F1: 1.00000
[2022-07-17 11:34:10,863] [ INFO] - best F1 performence has been updated: 0.00000 --> 1.00000
[2022-07-17 11:34:11,996] [ INFO] - tokenizer config file saved in ./checkpoint/model_best/tokenizer_config.json
[2022-07-17 11:34:11,997] [ INFO] - Special tokens file saved in ./checkpoint/model_best/special_tokens_map.json
[2022-07-17 11:34:18,202] [ INFO] - global step 60, epoch: 60, loss: 0.00004, speed: 1.61 step/s
[2022-07-17 11:34:24,355] [ INFO] - global step 70, epoch: 70, loss: 0.00003, speed: 1.63 step/s
[2022-07-17 11:34:30,515] [ INFO] - global step 80, epoch: 80, loss: 0.00003, speed: 1.62 step/s
[2022-07-17 11:34:36,700] [ INFO] - global step 90, epoch: 90, loss: 0.00003, speed: 1.62 step/s
[2022-07-17 11:34:42,851] [ INFO] - global step 100, epoch: 100, loss: 0.00002, speed: 1.63 step/s
[2022-07-17 11:34:42,897] [ INFO] - Evaluation precision: 1.00000, recall: 1.00000, F1: 1.00000
A GPU environment is recommended; otherwise memory overflow may occur. In a CPU environment, you can change the model to uie-tiny and adjust batch_size appropriately.
To improve accuracy, set --num_epochs higher for a longer training run.
Configurable parameters:
train_path: training set file path.
dev_path: validation set file path.
save_dir: model save path; defaults to ./checkpoint.
learning_rate: learning rate; defaults to 1e-5.
batch_size: batch size; adjust it to fit GPU memory, and lower it appropriately if memory runs short; defaults to 16.
max_seq_len: maximum text length; inputs exceeding this length are split automatically; defaults to 512.
num_epochs: number of training epochs; defaults to 100.
model: the model to fine-tune; uie-base and uie-tiny are available; defaults to uie-base.
seed: random seed; defaults to 1000.
logging_steps: number of steps between log prints; defaults to 10.
valid_steps: number of steps between evaluations; defaults to 100.
device: device used for training; cpu or gpu.
3. Model evaluation
!python evaluate.py \
--model_path ./checkpoint/model_best \
--test_path ./data/test.txt \
--batch_size 16 \
--max_seq_len 512
[2022-07-18 11:37:05,934] [ INFO] - We are using <class 'paddlenlp.transformers.ernie.tokenizer.ErnieTokenizer'> to load './checkpoint/model_best'.
W0718 11:37:05.965226 2210 gpu_context.cc:278] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W0718 11:37:05.969079 2210 gpu_context.cc:306] device: 0, cuDNN Version: 7.6.
[2022-07-18 11:37:11,584] [ INFO] - -----------------------------
[2022-07-18 11:37:11,584] [ INFO] - Class Name: all_classes
[2022-07-18 11:37:11,584] [ INFO] - Evaluation Precision: 1.00000 | Recall: 1.00000 | F1: 1.00000
model_path: path to the model folder to evaluate; it must contain the weights file model_state.pdparams and the configuration file model_config.json.
test_path: test set file used for evaluation.
batch_size: batch size; adjust it to your machine; defaults to 16.
max_seq_len: maximum text length; inputs exceeding this length are split automatically; defaults to 512.
model: the model to use; uie-base, uie-medium, uie-mini, uie-micro, and uie-nano are available; defaults to uie-base.
debug: whether to enable debug mode, which evaluates each positive-example category separately; intended only for model debugging, off by default (see the example below).
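For instance, to re-run the evaluation with per-class metrics, add the debug flag to the same command as above:

!python evaluate.py \
--model_path ./checkpoint/model_best \
--test_path ./data/test.txt \
--debug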
4. Prediction
import json
from paddlenlp import Taskflow

def openreadtxt(file_name):
    # read the input file and return a list of lines
    with open(file_name, 'r', encoding='UTF-8') as file:
        return file.readlines()

data_input = openreadtxt('./input/nlp.txt')

schema = 'sentiment tendency [positive,negative]'
few_ie = Taskflow('information_extraction', schema=schema, batch_size=32,
                  task_path='./checkpoint/model_best')
# few_ie = Taskflow('sentiment_analysis', schema=schema, batch_size=32, task_path='./checkpoint/model_best')
results = few_ie(data_input)

# "w+" creates the file if it does not exist and overwrites it otherwise
with open("./output/result.txt", "w+", encoding='UTF-8') as f:
    for result in results:
        # json.dumps escapes Chinese to ASCII by default;
        # ensure_ascii=False keeps the real characters
        line = json.dumps(result, ensure_ascii=False)
        f.write(line + "\n")
print("Results exported")
print(results)
Input file contents:
The hotel environment and service are all pretty good , The location is also good , Especially the northern Sichuan jelly in the north of the hotel is really delicious .
Facilities are aging , It's too noisy close to the road . At night, the water flow sound and air conditioning noise in the upstairs bathroom are very loud , Can't sleep .
A very good hotel , The bed is big , Very comfortable . The service attitude of the hotel staff is very friendly .
Very bad. ! We contracted a car to visit the West Lake through its business center , The car took us to informal scenic spots to buy tea .
in general , The hotel is not bad . It's quieter , The geographical location is better , Very good service , Including check-in and check-out .
Output:
[{'text': ' The hotel environment and service are all pretty good , The location is also good , Especially the northern Sichuan jelly in the north of the hotel is really delicious .\n', 'label': 'positive', 'score': 0.8420582413673401},
{'text': ' Facilities are aging , It's too noisy close to the road . At night, the water flow sound and air conditioning noise in the upstairs bathroom are very loud , Can't sleep .\n', 'label': 'negative', 'score': 0.9905866980552673},
{'text': ' A very good hotel , The bed is big , Very comfortable . The service attitude of the hotel staff is very friendly .\n', 'label': 'positive', 'score': 0.9800688028335571},
{'text': ' Very bad. ! We contracted a car to visit the West Lake through its business center , The car took us to informal scenic spots to buy tea .\n', 'label': 'negative', 'score': 0.9315289258956909},
{'text': ' in general , The hotel is not bad . It's quieter , The geographical location is better , Very good service , Including check-in and check-out .', 'label': 'positive', 'score': 0.90092933177948}]
PaddleNLP also provides two ready-made sentiment analysis models: the default BiLSTM, and SKEP. A quick sketch of the default model follows; SKEP is described after it.
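As a minimal sketch of the default (BiLSTM) usage; the example sentence here is made up:

from paddlenlp import Taskflow

# with no model argument, the sentiment_analysis Taskflow uses the default BiLSTM model
senta = Taskflow("sentiment_analysis")
print(senta("这家酒店位置很好,服务也周到"))
# e.g. [{'text': '这家酒店位置很好,服务也周到', 'label': 'positive', 'score': ...}]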
SKEP (Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis) is a sentiment pretraining algorithm developed by Baidu's research team. It automatically mines sentiment knowledge with unsupervised methods and then uses that knowledge to build the pretraining objectives, so the model learns to understand sentiment semantics. Pretrained on massive Chinese data, SKEP provides a unified and powerful sentiment representation for all kinds of sentiment analysis tasks; it comprehensively surpassed the previous SOTA on 14 typical Chinese and English sentiment analysis tasks, and the work was accepted at ACL 2020.
Paper: SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis
import json
from paddlenlp import Taskflow

def openreadtxt(file_name):
    # read the input file and return a list of lines
    with open(file_name, 'r', encoding='UTF-8') as file:
        return file.readlines()

data_input = openreadtxt('./input/nlp.txt')

# the sentiment_analysis task needs no schema; just select the SKEP model
few_ie = Taskflow("sentiment_analysis", model="skep_ernie_1.0_large_ch", batch_size=16)
results = few_ie(data_input)

# "w+" creates the file if it does not exist and overwrites it otherwise
with open("./output/result.txt", "w+", encoding='UTF-8') as f:
    for result in results:
        line = json.dumps(result, ensure_ascii=False)  # keep real Chinese instead of ASCII escapes
        f.write(line + "\n")
print("Results exported")
print(results)
[{'text': ' The hotel environment and service are all pretty good , The location is also good , Especially the northern Sichuan jelly in the north of the hotel is really delicious .\n', 'label': 'positive', 'score': 0.9441452622413635},
{'text': ' Facilities are aging , It's too noisy close to the road . At night, the water flow sound and air conditioning noise in the upstairs bathroom are very loud , Can't sleep .\n', 'label': 'negative', 'score': 0.991821825504303},
{'text': ' A very good hotel , The bed is big , Very comfortable . The service attitude of the hotel staff is very friendly .\n', 'label': 'positive', 'score': 0.989535927772522},
{'text': ' Very bad. ! We contracted a car to visit the West Lake through its business center , The car took us to informal scenic spots to buy tea .\n', 'label': 'negative', 'score': 0.9811170697212219},
{'text': ' in general , The hotel is not bad . It's quieter , The geographical location is better , Very good service , Including check-in and check-out .', 'label': 'positive', 'score': 0.8622702360153198}]
5. Active learning and flexible application (news text classification demo)
Obtain a relevant dataset and process it. Here only part of the agriculture, finance, and real estate data is selected, just to test the feasibility of the scheme.
Export the news dataset provided on the official site, then annotate it yourself on the platform!
!python doccano.py \
--doccano_file ./data/input.jsonl \
--task_type 'cls' \
--save_dir ./data \
--splits 0.85 0.15 0 \
--negative_ratio 5 \
--prompt_prefix "news classification" \
--options "agriculture" "real estate" "finance"
!python finetune.py \
--train_path "./data/train.txt" \
--dev_path "./data/dev.txt" \
--save_dir "./checkpoint2" \
--learning_rate 1e-5 \
--batch_size 16 \
--max_seq_len 512 \
--num_epochs 200 \
--model "uie-base" \
--seed 1000 \
--logging_steps 10 \
--valid_steps 50 \
--device "gpu"
from paddlenlp import Taskflow

data = [
    "The dollar once again raises the financial butcher's knife; the economies of these 10 countries may be in jeopardy; Xicheng again exposes 10 illegal intermediaries",
    "The era of Wiki chain is coming; it will sweep across China in an instant, skyrocketing 800%! Grab it and earn!",
    "Our province issued this year's minimum living security standards for urban and rural residents",
]
schema = 'news classification [agriculture,real estate,finance]'
few_ie = Taskflow('information_extraction', schema=schema, task_path='./checkpoint2/model_best')
results = few_ie(data)
for text, result in zip(data, results):
    print('Data: {} \t Label: {}'.format(text, result))
Data: The dollar once again raises the financial butcher's knife; the economies of these 10 countries may be in jeopardy; Xicheng again exposes 10 illegal intermediaries 	 Label: {'news classification [agriculture,real estate,finance]': [{'text': 'finance', 'probability': 0.8674286780486753}]}
Data: The era of Wiki chain is coming; it will sweep across China in an instant, skyrocketing 800%! Grab it and earn! 	 Label: {'news classification [agriculture,real estate,finance]': [{'text': 'agriculture', 'probability': 0.4909489670645364}]}
Data: Our province issued this year's minimum living security standards for urban and rural residents 	 Label: {'news classification [agriculture,real estate,finance]': [{'text': 'agriculture', 'probability': 0.980139386504348}]}
The results look decent; the approach shows some effect. This was only a quick attempt, and I have not verified the performance in detail.
Still, for classification a dedicated classification model is recommended; ERNIE 3.0 works better. A sketch is given below.
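As a starting point, a minimal sketch of loading a dedicated classifier from PaddleNLP (the model name "ernie-3.0-medium-zh" is one entry from the PaddleNLP model zoo; this is an untrained head, not a benchmarked recipe, and would need fine-tuning before use):

import paddle
from paddlenlp.transformers import AutoModelForSequenceClassification, AutoTokenizer

# hypothetical 3-way news classifier head on ERNIE 3.0 (fine-tune before use)
model = AutoModelForSequenceClassification.from_pretrained("ernie-3.0-medium-zh", num_classes=3)
tokenizer = AutoTokenizer.from_pretrained("ernie-3.0-medium-zh")

inputs = tokenizer("我省今年发布城乡居民最低生活保障标准", return_tensors="pd")
logits = model(**inputs)
print(paddle.argmax(logits, axis=-1))  # predicted class index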
**Follow-up: I will verify the performance of the hub, ERNIE, and prompt approaches. Since they are all based on the same underlying models, the results may well be similar; I will recommend the most convenient solution then!**
6. Summary
UIE (Universal Information Extraction): Yaojie Lu et al. proposed the unified information extraction framework UIE at ACL 2022. The framework models entity extraction, relation extraction, event extraction, and sentiment analysis tasks in a unified way, giving different tasks good transfer and generalization ability. Drawing on this paper, PaddleNLP trained and open-sourced the first Chinese universal information extraction model, UIE, based on the knowledge-enhanced pretrained model ERNIE 3.0. The model can extract key information without restrictions on industry domain or extraction target, supports zero-shot rapid cold start, and has excellent few-shot fine-tuning ability that quickly adapts to specific extraction targets.
Advantages of UIE:
Easy to use: users can specify extraction targets in natural language and uniformly extract the corresponding information from input text without any training. It works out of the box and meets all kinds of information extraction needs.
Cost-efficient: previous information extraction techniques required large amounts of labeled data to guarantee extraction quality. Open-domain information extraction can work with zero samples (zero-shot) or few samples (few-shot), greatly reducing the dependence on labeled data and cutting costs while also improving results.
Leading results: open-domain information extraction is used in many scenarios, and it performs excellently across a variety of tasks.
This post shared sentiment classification and news classification cases, mainly around the open-source PaddleNLP. The official documentation does not cover how to fine-tune for specific classification tasks, so a demo is given here for reference. For now the few-shot, multi-class results look fine; I will expand the sample later for a complete test.
Of course, you can replace the dataset in this project and try building your own classification model. Start with binary classification and work your way up to more categories!
My blog :https://blog.csdn.net/sinat_39620217?type=blog