Knowledge mapping 1.1 -- starting from NER

This article outlines ： Come back to know -KG Open source project set Medium BERT-NER-pytorch After the project , Some of the learning records , It has reference significance for Xiaobai, who has just entered the industry .

Information ： About BERT In the model transformer Introduce , What has to be shared is Jay Alammar Animation of , Why didn't I see such a foreign masterpiece early ？

One 、 preparation ：

1、 Data sets

How data sets are organized 、 The way we deal with it is the most important thing in the field of depth , It can be said that an algorithm engineer 80% We're all dealing with data . Now let's talk about datasets ：

Access method ： Direct use google Expand gitzip from git Download it directly ,

Raw data Only 3 term , yes

among train There is 45000 individual The sentence ,test Set is 3442, We need to artificially divide val Set .

Raw data form - It looks like this , Let's take a look msra_train_bio Before 17 That's ok （ altogether 176042 That's ok ）：

 in     B-ORG
 common     I-ORG
 in     I-ORG
 Central     I-ORG
 Cause     O
 in     B-ORG
 countries     I-ORG
 Cause     I-ORG
 Male     I-ORG
 The party     I-ORG
 Ten     I-ORG
 One     I-ORG
 Big     I-ORG
 Of     O
 greetings     O
 word     O
 various     O

tags（ There are only three entities ： Institutions , people , Location ）：

O
B-ORG
I-PER
B-PER
I-LOC
I-ORG
B-LOC

ps： You can see , It's using BIO Tagging , We can certainly modify ！

We'll divide the dataset into （3 A catalog ）：

Dataset	Number
training set	42000
validation set	3000
test set	3442

Get three directories .

The form of data processing ( Take the first two )：

sentences.txt file ：

 Such as   What   Explain   "   foot   The ball   world   Long   period   save   stay   Of   various   many   Spear   shield  ,  heavy   Vibration   Once upon a time   Japan   tianjin   door   foot   The ball   Of   Male   wind  ,  become   by   God   tianjin   foot   The altar   On   Next   Inside   Outside   To   It's about   discussion   On   Of   word   topic  .
 The   county   One   hand   Catch   farmers   trade   Technology   Technique   PUSH   wide  ,  One   hand   Catch   farmers   civil   Families,   Technology   teach   Education   and   farmers   Technology   water   flat   Of   carry   high  .
 and   gen   new   Of   Turn off   key   Just   yes   know   knowledge   and   Letter   Rest   Of   raw   production  、  Pass on   seeding  、  send   use  .

Corresponding ,tags.txt file ：

O O O O O O O O O O O O O O O O O O O O O B-LOC I-LOC O O O O O O O O B-LOC I-LOC O O O O O O O O O O O O O O
O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O
O O O O O O O O O O O O O O O O O O O O O O O

2、 Prepare models and trained model parameters

After many attempts in the experiment , It is not difficult to find the model parameters ,
When the author reproduces the code , There is no direct access to pt The model parameters under , Now it's a matter of parameters .
train.py In the code , establish model Sometimes there is such a sentence ：

model = BertForTokenClassification.from_pretrained()

Let's click in to see from_pretrained() Method , stay pytorch_pretrained_bert In the catalog modeling.py In file .
You can use pathnames or url To get models and parameters （ And then there was a little bit of an episode in training , I'm abandoning the author's offer , Compressed from the package itself , It will be mentioned later ）：

PRETRAINED_MODEL_ARCHIVE_MAP = {
    'bert-base-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-uncased.tar.gz",
    'bert-large-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-uncased.tar.gz",
    'bert-base-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased.tar.gz",
    'bert-large-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-large-cased.tar.gz",
    'bert-base-multilingual-uncased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-uncased.tar.gz",
    'bert-base-multilingual-cased': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased.tar.gz",
    'bert-base-chinese': "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese.tar.gz",
}

ps： First get tf The model parameters are then converted to pt, It took me a lot of effort , Directly from Transformer The library downloaded a convert_tf_checkpoint_to_pytorch.py File to tf Model parameters directory , After a little operation, I got pt Of .bin file , The good thing is, a word now .

3、train.py Document interpretation

Data set processing and super parameter setting are simple , It's not about the core of our article , We don't make too many records here , If you want to know, go and see github project , Portal ,
This file contains a lot of things , Yes
①parse,params Set up ,logger journal
②dataloader,model,optimizer and train_and_evaluate

Parameters parse（ This parameter resolution is to run .py File parameters ）

parser = argparse.ArgumentParser()
parser.add_argument('--data_dir', default='NER-BERT-pytorch-data-msra', help="Directory containing the dataset")
parser.add_argument('--bert_model_dir', default='pt_things', help="Directory containing the BERT model in PyTorch")
……

After the main in , Record parameters to memory ：
args = parser.parse_args()
In the next use, by args.param To get the parameters ,

logger journal

The relevant processing is put in utils.py in , stay train.py In the direct ：

# establish 
utils.set_logger(os.path.join(args.model_dir, 'train.log'))
#  When you need to record ：
logging.info("device: {}, n_gpu: {}, 16-bits training: {}".format(params.device, params.n_gpu, args.fp16))

model

Simple , Two words ：

model = BertForTokenClassification.from_pretrained(args.bert_model_dir, num_labels=len(params.tag2idx))
model.to(params.device)

It includes model loading and parameter loading , In training, we're seeing modelconfig after , There are also two hints ：

Weights of BertForTokenClassification not initialized from pretrained model: ['classifier.weight', 'classifier.bias']
Weights from pretrained model not used in BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']

（ At first, I thought it was a model parameter import failure , Find a solution .

I used to think pytorch_model.bin The parameters in the file do not match the model , Download the compressed package again from the link provided by the source code , But there are still these two words .

It turns out that Google The last forum found that these two sentences mean successful call of parameters .）
Solve the mystery ： This model It includes embedding and bert NER Two parts , The parameters of the model should be embedding Part of the , Specific NER Tasks should use our own data to train ！
read model Source code ： Each layer explains in detail
see model（BertForTokenClassification）：
There are three floors ：
①BertModel
BertEmbeddings： There are multiple layers in it

    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): BertLayerNorm()
    (dropout): Dropout(p=0.1, inplace=False)

BertEncoder： Is punctuated with 12 layer encoder, every last encoder It's all one BertLayer：

    (attention): BertAttention  # A top priority , Be sure to have a chance to brush 
    (intermediate): BertIntermediate
    (output): BertOutput

BertPooler

②DropOut
③Linear

optimizer

full_finetuning

4、 Single machine multi card parallel and fp16

Dorka

To put Models and data All assigned to multi card ;
① Specify virtual gpu：
os.environ["CUDA_VISIBLE_DEVICES"] = '1,2,3,0'
The physical address corresponding to the virtual address is 1,2,3,0（ At that time, because the master of elder martial brother gpu It's physics 0 Number one card , So avoid this card with a relatively small amount of remaining video memory ）

ps： In the screenshot, elder martial brother no longer uses .

② Before preparing models and data , Put this sentence ：
params.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
It is estimated that 'cuda' The latter part is that the author does not want to see the error reported ……
Be careful ： If you want to write 'cuda:[num]', Please write the previous designation gpu The first of the time , Here it is 1！

③ Random seeds are assigned to each gpu：

#  Set up random seeds for repeatable trials 
random.seed(args.seed)
torch.manual_seed(args.seed)    # to cpu Set up 
if params.n_gpu > 0:
    torch.cuda.manual_seed_all(args.seed)  # set random seed for all GPUs

ps： If a single card , Just get rid of all;

④ to model Assign multiple cards ：

model = BertForTokenClassification.from_pretrained(args.bert_model_dir, num_labels=len(params.tag2idx))
model.to(params.device)
if params.n_gpu > 1 and args.multi_gpu:
    model = torch.nn.DataParallel(model)

⑤ Last , Assign multiple cards to the data ：

# Initialize the DataLoader
data_loader = DataLoader(args.data_dir, args.bert_model_dir, params, token_pad_idx=0)
# Load training data and test data
train_data = data_loader.load_data('train')
val_data = data_loader.load_data('val')
……
 stay train() Function before ：
# data iterator for training
train_data_iterator = data_loader.data_iterator(train_data, shuffle=True)
# Train for one epoch on training set
train(model, train_data_iterator, optimizer, scheduler, params)

stay data_iterator() Set in for each batch Assigned cards ：

# shift tensors to GPU if available
batch_data, batch_tags = batch_data.to(self.device), batch_tags.to(self.device)
yield batch_data, batch_tags

there self.device Is the parameter accepted by the class .

fp16

Reference resources ： A talk fp16 Acceleration principle CSDN
fp16 Use 2 Byte coded storage
advantage ： Less memory usage （ Lord ）+ Accelerate Computing
shortcoming ： Addition operation is easy to overflow
（ If you have a chance, you can make a special experiment ）

5、 Progress bar tool

Here we calculate each one epoch Lower computation 1400 individual batch,
So put the progress bar in every one of them epoch in ：

t = trange(params.train_steps)
for i in t:
    # fetch the next training batch
 batch_data, batch_tags = next(data_iterator)
 ……
 loss = model（~）
 ……
 t.set_postfix(loss='{:05.3f}'.format(loss_avg()))

Result display ：

6、 Evaluation index database metrics

Indicators of outcome evaluation that can be used （ The tip of the iceberg ）：

from metrics import f1_score
from metrics import accuracy_score
from metrics import classification_report

Templates ：

metrics = {}
f1 = f1_score(true_tags, pred_tags)
accuracy=accuracy_score(true_tags, pred_tags)
metrics['loss'] = loss_avg()
metrics['f1'] = f1
metrics['accuracy']=accuracy
metrics_str = "; ".join("{}: {:05.2f}".format(k, v) for k, v in metrics.items())
logging.info("- {} metrics: ".format(mark) + metrics_str)

Result display ：

How to distinguish between accuracy and recall ？

7、 The problem with many experiments

my f1 The scores are always there 50 following , But accuracy has always been 97% near , Recently, because of the liver link extraction , So let's put this part first . I'll go back and complete it

当前位置：网站首页>Knowledge mapping 1.1 -- starting from NER