Intelligent annotation for entity recognition with Hugging Face pre-trained models: generating the doccano JSON import format
2022-07-07 05:24:00 【Ting】
Reference: Data annotation platform doccano: introduction, installation, usage, and pitfalls
1. Hugging Face
For tutorials on Hugging Face and on training models, see these articles by other authors:
[Huggingface Transformers] Step-by-step tutorial (Part 1) - Zhihu
Guide to using the Trainer in huggingface transformers - Zhihu
2. doccano annotation platform format requirements
For how to operate the doccano platform, see the reference article linked at the top of this post.
JSON import format requirements: entities, shown here in the style that also includes relations
{
    "text": "Google was founded on September 4, 1998, by Larry Page and Sergey Brin.",
    "entities": [
        {
            "id": 0,
            "start_offset": 0,
            "end_offset": 6,
            "label": "ORG"
        },
        {
            "id": 1,
            "start_offset": 22,
            "end_offset": 39,
            "label": "DATE"
        },
        {
            "id": 2,
            "start_offset": 44,
            "end_offset": 54,
            "label": "PERSON"
        },
        {
            "id": 3,
            "start_offset": 59,
            "end_offset": 70,
            "label": "PERSON"
        }
    ],
    "relations": [
        {
            "from_id": 0,
            "to_id": 1,
            "type": "foundedAt"
        },
        {
            "from_id": 0,
            "to_id": 2,
            "type": "foundedBy"
        },
        {
            "from_id": 0,
            "to_id": 3,
            "type": "foundedBy"
        }
    ]
}
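The offsets in the sample above behave like Python slice indices: they count characters and the end offset is exclusive. As a minimal sketch (the helper name `check_offsets` is hypothetical, not part of doccano), the following slices the text by each entity's offsets so that a record can be checked before import:

```python
def check_offsets(record):
    """Return (id, label, surface text) for each entity, slicing the text by its offsets."""
    text = record["text"]
    spans = []
    for ent in record["entities"]:
        start, end = ent["start_offset"], ent["end_offset"]
        # Offsets must lie inside the text and describe a non-empty span
        if not 0 <= start < end <= len(text):
            raise ValueError(f"entity {ent['id']} has invalid offsets {start}-{end}")
        spans.append((ent["id"], ent["label"], text[start:end]))
    return spans


# Same text and entities as the sample above (relations omitted)
sample = {
    "text": "Google was founded on September 4, 1998, by Larry Page and Sergey Brin.",
    "entities": [
        {"id": 0, "start_offset": 0, "end_offset": 6, "label": "ORG"},
        {"id": 1, "start_offset": 22, "end_offset": 39, "label": "DATE"},
        {"id": 2, "start_offset": 44, "end_offset": 54, "label": "PERSON"},
        {"id": 3, "start_offset": 59, "end_offset": 70, "label": "PERSON"},
    ],
}

for ent_id, label, surface in check_offsets(sample):
    print(ent_id, label, repr(surface))
```

If a printed surface string is cut off or shifted, the offsets were probably computed on a different string than the one stored in "text".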
3. Intelligent entity annotation + format conversion
3.1 Long text (one long article in a single txt file)
The code below uses a pre-trained model to recognize entities and writes them in the doccano import format; the conversion to the format required by the wizard annotation assistant is kept as commented-out code
from transformers import pipeline
from tqdm import tqdm
import pandas as pd
import json


def return_single_entity(name, start, end):
    # doccano span format: [start_offset, end_offset, label]
    return [int(start), int(end), name]


# Variant of the helper for the wizard annotation assistant format:
# def return_single_entity(name, word, start, end, id, attributes=[]):
#     entity = {}
#     entity['type'] = 'T'
#     entity['name'] = name
#     entity['value'] = word
#     entity['start'] = int(start)
#     entity['end'] = int(end)
#     entity['attributes'] = attributes
#     entity['id'] = int(id)
#     return entity

# Despite its name, input_dir here is the path of a single txt file with one text per line
# input_dir = 'E:/datasets/myUIE/inputs'
input_dir = 'C:/Users/admin/Desktop/test_input.txt'
output_dir = 'C:/Users/admin/Desktop/outputs'

tagger = pipeline(task='ner', model='xlm-roberta-large-finetuned-conll03-english',
                  aggregation_strategy='simple')
# Keep only persons and organizations; the model also predicts LOC (locations) and MISC (other entity types)
keywords = {'PER': '人', 'ORG': '机构'}

with open(input_dir, 'r', encoding='utf8') as f:
    lines = f.readlines()

# Create json_list outside the loop; if it were defined inside, every iteration
# would redefine it and the previous results would be lost
json_list = []
for line in tqdm(lines):
    # Remove the trailing newline and any surrounding quotes
    text = line.strip("\n").strip("'").strip('"')
    named_ents = tagger(text)  # run the pre-trained model
    df = pd.DataFrame(named_ents)
    # Example tagging result:
    #   entity_group     score                    word  start  end
    # 0          ORG  0.999997  National Science Board     18   40
    # 1          ORG  0.999997                     NSB     42   45
    # 2          ORG  0.999997                     NSF     71   74
    entity_list = []
    for _, elem in df.iterrows():
        if elem.entity_group not in keywords:
            continue
        # Skip spans that are only one character long
        if elem.end - elem.start <= 1:
            continue
        entity = return_single_entity(
            keywords[elem.entity_group], elem.start, elem.end)
        entity_list.append(entity)
    # Store the same string that was tagged so the offsets stay aligned with the text;
    # indexing by the DataFrame row index here would pair entities with the wrong line
    json_obj = {"text": text, "label": entity_list}
    json_list.append(json.dumps(json_obj))

# Write one JSON object per line
with open(f'{output_dir}/data_2.json', 'w', encoding='utf8') as f:
    for line in json_list:
        f.write(line + "\n")
print('done!')
# Convert to the wizard annotation assistant import format (note: the assistant's NLP annotation
# module has an encoding problem; some UTF-8 characters are not displayed correctly, which affects the annotation results)
# id = 1
# entity_list = ['']
# for index, elem in df.iterrows():
#     if not elem.entity_group in keywords:
#         continue
#     entity = return_single_entity(keywords[elem.entity_group], elem.word, elem.start, elem.end, id)
#     id += 1
#     entity_list.append(entity)
# python_obj = {'path': f'{input_dir}/{filename}',
#               'outputs': {'annotation': {'T': entity_list, "E": [""], "R": [""], "A": [""]}},
#               'time_labeled': int(1000 * time()), 'labeled': True, 'content': text}
# data = json.dumps(python_obj)
# with open(f'{output_dir}/{filename.rstrip(".txt")}.json', 'w', encoding='utf8') as f:
#     f.write(data)
Recognition result :
{"text": "The company was founded in 1852 by Jacob Estey\n", "label": [[35, 46, "\u4eba"]]}
{"text": "The company was founded in 1852 by Jacob Estey, who bought out another Brattleboro manufacturing business.", "label": [[35, 46, "\u4eba"], [71, 82, "\u673a\u6784"]]}
The label values look garbled only because json.dumps escapes non-ASCII characters as \uXXXX sequences by default. This is not a problem: after importing into doccano, the labels are displayed normally.
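If readable labels in the output file are preferred, the standard-library `json.dumps` accepts `ensure_ascii=False`. A small sketch comparing both forms, using a record shaped like the result above (the output file should then be opened with `encoding='utf8'`, as the scripts in this post already do):

```python
import json

record = {"text": "The company was founded in 1852 by Jacob Estey",
          "label": [[35, 46, "人"]]}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keep non-ASCII characters as they are

print(escaped)
print(readable)

# Both strings decode to the same object, so either form carries the same labels
assert json.loads(escaped) == json.loads(readable) == record
```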
3.2 Multiple short texts (txt files)
from transformers import pipeline
import os
from tqdm import tqdm
import pandas as pd
import json


def return_single_entity(name, start, end):
    # doccano span format: [start_offset, end_offset, label]
    return [int(start), int(end), name]


input_dir = 'C:/Users/admin/Desktop/inputs_test'
output_dir = 'C:/Users/admin/Desktop/outputs'
tagger = pipeline(task='ner', model='xlm-roberta-large-finetuned-conll03-english', aggregation_strategy='simple')
json_list = []
keywords = {'PER': '人', 'ORG': '机构'}  # person, organization

# Only the first three files are processed here; remove the [:3] slice to process all of them
for filename in tqdm(os.listdir(input_dir)[:3]):
    # Read the data and tag it automatically
    with open(f'{input_dir}/{filename}', 'r', encoding='utf8') as f:
        text = f.read()
    named_ents = tagger(text)
    df = pd.DataFrame(named_ents)
    # Convert to the doccano import format
    entity_list = []
    for _, elem in df.iterrows():
        if elem.entity_group not in keywords:
            continue
        # Skip spans that are only one character long
        if elem.end - elem.start <= 1:
            continue
        entity = return_single_entity(keywords[elem.entity_group], elem.start, elem.end)
        entity_list.append(entity)
    file_obj = {'text': text, 'label': entity_list}
    json_obj = json.dumps(file_obj)
    json_list.append(json_obj)

# Write one JSON object per line, one line per input file
with open(f'{output_dir}/data3.json', 'w', encoding='utf8') as f:
    f.write('\n'.join(json_list))
print('done!')
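The filtering and conversion step can also be written as a small function without pandas, which makes it easy to try without downloading the model. This is only a sketch: `to_doccano_labels` is a hypothetical name, and `fake_ents` is hand-written data that imitates the shape of the pipeline output (`entity_group`, `score`, `word`, `start`, `end`), not real model predictions.

```python
import json


def to_doccano_labels(named_ents, keywords, min_length=2):
    """Convert aggregated NER results into doccano [start, end, label] triples."""
    labels = []
    for ent in named_ents:
        # Keep only the entity groups we have a target label for
        if ent["entity_group"] not in keywords:
            continue
        # Skip spans shorter than min_length characters
        if ent["end"] - ent["start"] < min_length:
            continue
        labels.append([int(ent["start"]), int(ent["end"]), keywords[ent["entity_group"]]])
    return labels


text = "The company was founded in 1852 by Jacob Estey, who bought out another Brattleboro manufacturing business."
# Hand-written stand-in for tagger(text); the values are made up for illustration
fake_ents = [
    {"entity_group": "PER", "score": 0.99, "word": "Jacob Estey", "start": 35, "end": 46},
    {"entity_group": "LOC", "score": 0.98, "word": "Brattleboro", "start": 71, "end": 82},
    {"entity_group": "ORG", "score": 0.60, "word": "E", "start": 41, "end": 42},
]

keywords = {"PER": "人", "ORG": "机构"}
labels = to_doccano_labels(fake_ents, keywords)
print(json.dumps({"text": text, "label": labels}, ensure_ascii=False))
```

Here the LOC entity is dropped because it is not in `keywords`, and the one-character ORG span is dropped by the length check, mirroring the two `continue` conditions in the script.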
3.3 Conversion to the wizard annotation assistant format
from transformers import pipeline
import os
from tqdm import tqdm
import pandas as pd
from time import time
import json


def return_single_entity(name, word, start, end, entity_id, attributes=None):
    # Build one entity record for the wizard annotation assistant
    entity = {}
    entity['type'] = 'T'
    entity['name'] = name
    entity['value'] = word
    entity['start'] = int(start)
    entity['end'] = int(end)
    # Create a fresh list per call instead of sharing a mutable default argument
    entity['attributes'] = [] if attributes is None else attributes
    entity['id'] = int(entity_id)
    return entity


input_dir = 'E:/datasets/myUIE/inputs'
output_dir = 'E:/datasets/myUIE/outputs'
tagger = pipeline(task='ner', model='xlm-roberta-large-finetuned-conll03-english', aggregation_strategy='simple')
keywords = {'PER': '人', 'ORG': '机构'}  # person, organization

for filename in tqdm(os.listdir(input_dir)):
    # Read the data and tag it automatically
    with open(f'{input_dir}/{filename}', 'r', encoding='utf8') as f:
        text = f.read()
    named_ents = tagger(text)
    df = pd.DataFrame(named_ents)
    # Convert to the wizard annotation assistant import format (note: the assistant's NLP annotation
    # module has an encoding problem; some UTF-8 characters are not displayed correctly, which affects the annotation results)
    entity_id = 1
    entity_list = ['']  # the leading empty string is kept from the original script
    for _, elem in df.iterrows():
        if elem.entity_group not in keywords:
            continue
        entity = return_single_entity(keywords[elem.entity_group], elem.word, elem.start, elem.end, entity_id)
        entity_id += 1
        entity_list.append(entity)
    python_obj = {'path': f'{input_dir}/{filename}',
                  'outputs': {'annotation': {'T': entity_list, "E": [""], "R": [""], "A": [""]}},
                  'time_labeled': int(1000 * time()), 'labeled': True, 'content': text}
    data = json.dumps(python_obj)
    # os.path.splitext removes only the extension; rstrip(".txt") would strip any trailing '.', 't' or 'x'
    stem = os.path.splitext(filename)[0]
    with open(f'{output_dir}/{stem}.json', 'w', encoding='utf8') as f:
        f.write(data)
print('done!')
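One detail to watch when deriving the output file name from the input file name: `str.rstrip(".txt")` does not remove a suffix, it removes any trailing characters from the set `.`, `t`, `x`. A short sketch of the difference, using a made-up file name:

```python
import os

filename = "test.txt"  # hypothetical input file name

# rstrip treats ".txt" as a set of characters, so the trailing "t" of "test" is removed too
stripped = filename.rstrip(".txt")

# os.path.splitext splits off only the final extension
stem, ext = os.path.splitext(filename)

print(stripped)  # tes
print(stem)      # test
```

On Python 3.9 and later, `str.removesuffix(".txt")` is another option for the same purpose.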