
Re11: Reading the paper EPM, Legal Judgment Prediction via Event Extraction with Constraints

2022-07-28 17:02:00 【The gods were silent】


Paper title: Legal Judgment Prediction via Event Extraction with Constraints
Official ACL download page: https://aclanthology.org/2022.acl-long.48/
Official GitHub project: WAPAY/EPM

This is an ACL 2022 paper; the authors are from Nanjing University.
The paper addresses legal judgment prediction (LJP) on the CAIL dataset: given the fact description of a case as input, predict the applicable law article, the charge, and the term of penalty. It is a multi-task, multi-class classification problem. The paper uses constraints (penalty terms added to the loss function) to exploit the relationships among the three subtasks.
Event extraction serves as an intermediate task: the extracted event information is used to help predict the judgment.

Contents

1. Background & Motivation
2. Problem definition and model introduction
- 2.1 Define hierarchical events
- 2.2 EPM
3. Experiments
4. Code reproduction

1. Background & Motivation

The paper argues that earlier LJP models mispredict for two reasons: they fail to locate the key event information that determines the judgment, and they do not use the cross-task consistency constraints among the LJP subtasks (a given law article can only correspond to certain charges and terms of penalty). It therefore proposes EPM, a prediction model based on events and constraints, to address both problems.

A law article consists of two parts: an event pattern and a judgment/penalty. The paper's premise is that once the event information in a case is extracted, the correct judgment can be predicted.

① Extract events to assist the LJP task (the claim is that earlier models failed because they got the events wrong).
② Impose constraints on the event output and across subtasks, adding a penalty whenever a condition is not met: certain event roles are required to appear, event types must correspond, and a given law article restricts the admissible charges and terms of penalty. The concrete list of constraints is given in the code.

2. Problem definition and model introduction

2.1 Define hierarchical events

(This differs from the traditional definition of events in the legal domain: the trigger types and argument roles are defined so that they are useful for the LJP task.)
Legal events are defined from the law articles. Because the law is hierarchical, the events defined from it are hierarchical too.

Fine-grained events:
[Figure: fine-grained event definitions]

event trigger: signals that an event occurred and identifies the specific event (for example, the event Robbery has the trigger type Trigger-Rob).
event role: the type of an event argument (roughly, a role is a class and an argument is an instance of it).

Token labeling paradigm: each token is labeled with a subordinate trigger type (if the token is part of a trigger) or a subordinate role type (if the token is part of an argument).
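
As a small illustration of this labeling scheme, the sketch below groups per-token labels into labeled spans. Only Trigger-Rob comes from the paper's example; the sentence, the role names (Role-Criminal, Role-Victim, Role-Property), the outside label O, and the function labels_to_spans are all hypothetical and chosen only for illustration.

```python
from itertools import groupby


def labels_to_spans(tokens, labels, outside="O"):
    """Group consecutive tokens that share a label into (label, start, end, tokens) spans."""
    spans = []
    start = 0
    for label, group in groupby(labels):
        length = len(list(group))
        if label != outside:
            spans.append((label, start, start + length, tokens[start:start + length]))
        start += length
    return spans


# Hypothetical sentence and labels; only "Trigger-Rob" appears in the paper's example.
tokens = ["Zhang", "robbed", "Li", "of", "a", "phone"]
labels = ["Role-Criminal", "Trigger-Rob", "Role-Victim", "O", "Role-Property", "Role-Property"]

spans = labels_to_spans(tokens, labels)
for span in spans:
    print(span)
```

Adjacent tokens that share a label are merged into one span, so this simple grouping cannot separate two back-to-back arguments of the same role.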

2.2 EPM

I swear this is the most bizarre model I have seen this year. It just keeps stacking one component on top of another!

Joint training: ① extract events; ② perform multi-task classification with the event features (attending over the law article text features, subject to the event-output constraints and the cross-task constraints).
There is a baseline version and the full EPM model (a Switch classifier switches between models; see the experiments section below).
Baseline version: attend between the representation of the fact description (context features) and the representations of the law articles (article embeddings), then perform the three classification tasks.
EPM version: concatenate the representation of each extracted event with the representation of its tokens to obtain the event features, which replace the context features of the baseline.


Token representation layer: a pre-trained Legal BERT model encodes the fact description into token representations.
Max pooling over the token representations of the fact description gives the context feature (used in the baseline; in EPM it is replaced by the event features introduced below).
To use the semantic information of the law articles: encode the text of each article with the token representation layer, apply max pooling to get one representation per article, and then attend between these article representations and the context feature.
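
The pooling and attention steps above can be sketched in plain Python. The attention formula of the paper was only shown as an image, so the dot-product scoring used here is an assumption, and the function names and toy vectors are hypothetical.

```python
import math


def max_pool(vectors):
    """Element-wise maximum over a list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(context, articles):
    """Attend from the context feature over article embeddings.

    Assumption: plain dot-product scores; the paper may use a different scoring function.
    """
    scores = [sum(c * a for c, a in zip(context, article)) for article in articles]
    weights = softmax(scores)
    attended = [
        sum(w * article[i] for w, article in zip(weights, articles))
        for i in range(len(context))
    ]
    return weights, attended


# Toy token representations of a fact description (3 tokens, 2 dimensions).
token_vectors = [[0.1, 0.9], [0.7, 0.2], [0.3, 0.4]]
context = max_pool(token_vectors)

# Toy article embeddings, each assumed to be already max-pooled.
article_embeddings = [[1.0, 0.0], [0.0, 1.0]]
weights, attended = attend(context, article_embeddings)
print(context, weights, attended)
```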

Legal judgment prediction layer: one linear classifier per subtask.

Hierarchical event extraction layer: ① the superordinate module computes, for the representation of each token in the fact description, a correlation score with every superordinate type/role; ② the subordinate module uses this hierarchical information to compute the probability distribution over subordinate types/roles.

Each superordinate type/role has a trainable vector representing its semantic features. A fully connected layer computes the correlation score between each token and each superordinate type/role.
A softmax over these scores gives each token's superordinate type/role feature (a weighted sum, i.e. a soft representation).
To predict the probability that a token belongs to each subordinate type/role, the input feature is the concatenation of the token representation and its superordinate type/role feature.
A CRF generates the highest-scoring sequence of types/roles.
The predicted type sequence is used to build the event features: for each span, take its text representation (max pooling over its token representations) and concatenate it with the subordinate type/role embedding to get the span representation; then max-pool over all span representations to get the final event features.
These event features replace the context feature used in the baseline.
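
A minimal sketch of two of the steps above, the soft superordinate feature and the pooled event feature, may help. It makes several assumptions: a dot product stands in for the fully connected scoring layer, the CRF decoding is skipped (the spans are given directly), and all vectors, label names, and function names are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def max_pool(vectors):
    return [max(column) for column in zip(*vectors)]


def superordinate_feature(token, super_vectors):
    """Soft superordinate feature: softmax-weighted sum of the superordinate vectors.

    Assumption: a dot product replaces the fully connected scoring layer.
    """
    scores = [sum(t * v for t, v in zip(token, vec)) for vec in super_vectors]
    weights = softmax(scores)
    dim = len(super_vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, super_vectors)) for i in range(dim)]


def event_feature(token_vectors, spans, label_embeddings):
    """Max-pool each span, concatenate its label embedding, then max-pool over spans."""
    span_reprs = []
    for label, start, end in spans:
        text_repr = max_pool(token_vectors[start:end])
        span_reprs.append(text_repr + label_embeddings[label])
    return max_pool(span_reprs)


token_vectors = [[0.1, 0.9], [0.7, 0.2], [0.3, 0.4]]
super_vectors = [[1.0, 0.0], [0.0, 1.0]]  # one trainable vector per superordinate type/role

soft = superordinate_feature([1.0, 0.0], super_vectors)
subordinate_input = [1.0, 0.0] + soft  # concatenation fed to the subordinate classifier

spans = [("Trigger-Rob", 1, 2), ("Role-Victim", 2, 3)]  # hypothetical decoded spans
label_embeddings = {"Trigger-Rob": [1.0, 0.0], "Role-Victim": [0.0, 1.0]}
event = event_feature(token_vectors, spans, label_embeddings)
print(soft, subordinate_input, event)
```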

Loss functions in the training stage:
Each of the 3 subtasks uses a cross-entropy loss.
Event extraction has its own loss, and the total loss combines it with the subtask losses (this version does not yet include the penalty for the event-output constraints).
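
As a rough sketch, the total loss can be read as a weighted sum of the per-task losses. The weights 0.5, 0.5, 0.4, 0.2 are the four λ values reported in the experimental setup, but which weight belongs to which loss is an assumption here, as are the dictionary keys and function names.

```python
import math


def cross_entropy(probs, gold_index):
    """Cross-entropy of one sample against a one-hot gold label."""
    return -math.log(probs[gold_index])


def total_loss(losses, weights):
    """Weighted sum of the individual losses."""
    return sum(weights[name] * losses[name] for name in weights)


# Assumption: this assignment of the four reported lambdas to the losses is a guess.
weights = {"article": 0.5, "charge": 0.5, "term": 0.4, "event": 0.2}

losses = {
    "article": cross_entropy([0.7, 0.2, 0.1], 0),
    "charge": cross_entropy([0.6, 0.4], 0),
    "term": cross_entropy([0.2, 0.5, 0.3], 1),
    "event": 0.8,  # stand-in value for the event extraction loss
}
print(total_loss(losses, weights))
```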

Event-output constraints: a penalty is added if a required trigger or role is missing, and for a given trigger type specific roles must appear.
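
The idea of these constraints can be sketched as a violation count over a decoded label sequence. This is a simplification: a penalty used for training would have to be computed on probabilities to be differentiable, and the REQUIRED_ROLES table, role names, and function names below are hypothetical. The default weight 0.1 is the $\lambda_p$ value reported in the experimental setup.

```python
# Hypothetical table: roles that must appear for each trigger type.
REQUIRED_ROLES = {"Trigger-Rob": {"Role-Criminal", "Role-Victim"}}


def constraint_violations(labels, required_roles):
    """Count a missing trigger, plus each required role missing for a present trigger."""
    present = set(labels)
    triggers = [label for label in present if label.startswith("Trigger-")]
    violations = 0
    if not triggers:
        violations += 1  # no trigger was extracted at all
    for trigger in triggers:
        missing = required_roles.get(trigger, set()) - present
        violations += len(missing)
    return violations


def penalized_loss(base_loss, labels, required_roles, lambda_p=0.1):
    """Add lambda_p times the number of violated constraints to the base loss."""
    return base_loss + lambda_p * constraint_violations(labels, required_roles)


labels = ["Role-Criminal", "Trigger-Rob", "O", "O"]  # the victim role is missing
print(constraint_violations(labels, REQUIRED_ROLES))
print(penalized_loss(1.0, labels, REQUIRED_ROLES))
```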

Cross-task consistency constraints: the predicted law article restricts the admissible charges and terms of penalty. (During training, the mask is added to the loss function when the law article is predicted correctly. I discussed this with a junior labmate: viewed as single-label classification, the mask should be useless during training, yet in the ablation study the article subtask is affected too, so the mask must also have an effect in the training stage. In addition, the loss function at this point in the paper has two consecutive summation signs, while the other loss formulas are written for a single sample, so I suspected a typo. I asked the author, and the explanation is: ① label smoothing is applied, so the mask always has an effect; ② the data are noisy, so the gold labels of the training set are not necessarily correct, and the mask is not necessarily 1 where y is 1. Hence the mask is always applied.)
(The two points above come from the author's reply.)
The code was accordingly changed from one-hot labels to label smoothing.
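
To see why label smoothing makes the mask matter, compare a one-hot target with a smoothed one. This is a generic sketch: the smoothing value 0.1 and the choice to spread it evenly over the non-gold classes are assumptions, not the settings of the official code.

```python
def one_hot(gold_index, num_classes):
    return [1.0 if i == gold_index else 0.0 for i in range(num_classes)]


def smooth_labels(gold_index, num_classes, epsilon=0.1):
    """Move epsilon of the probability mass from the gold class to the other classes.

    Assumption: epsilon and the even split over non-gold classes are illustrative.
    """
    off_value = epsilon / (num_classes - 1)
    return [1.0 - epsilon if i == gold_index else off_value for i in range(num_classes)]


hard = one_hot(1, 4)
soft = smooth_labels(1, 4)
print(hard)
print(soft)
```

With the one-hot target only the gold class contributes to the cross-entropy, whereas the smoothed target gives every class a non-zero weight, so classes affected by the mask still influence the loss.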

How the mask is applied: the probabilities of the classes that are not allowed to be output are set directly to 0.
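
A minimal sketch of this masking step follows. The ALLOWED_CHARGES table that maps an article index to its admissible charge indices is hypothetical, as are the function names; following the description above, disallowed probabilities are simply zeroed and not renormalized.

```python
# Hypothetical mapping: article index -> charge indices that article permits.
ALLOWED_CHARGES = {0: {0, 1}, 1: {2}}


def apply_mask(charge_probs, predicted_article, allowed):
    """Zero the probability of every charge the predicted article does not permit."""
    admissible = allowed[predicted_article]
    return [p if i in admissible else 0.0 for i, p in enumerate(charge_probs)]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


charge_probs = [0.2, 0.3, 0.5]
masked = apply_mask(charge_probs, predicted_article=0, allowed=ALLOWED_CHARGES)
print(argmax(charge_probs), argmax(masked))
```

In this toy case the unmasked prediction is a charge that the predicted article does not permit, and the mask moves the prediction to the best admissible charge.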

3. Experiments

3.1 Datasets

CAIL (two datasets: big and small)
New dataset LJP-E (manually annotated event information for cases covering 15 charges)

3.2 Baselines

① Baseline: the EPM model with event extraction and the constraints removed, and with the event features replaced by the context features of the fact description.
③ EPM with the event part removed (that is, the baseline model in ①) is first pre-trained on the training set of the original CAIL dataset, and then trained on the training set of LJP-E, the dataset annotated with event information. (Question: is this a trick? It seems unfair to me, since the model gains extra information from fact descriptions.)

3.3 Experimental setup

Hyperparameters: the maximum input length of Legal-BERT is 512; the optimizer is Adam with a learning rate of $10^{-4}$, a batch size of 32, and 3000 warmup steps. Models are trained for at most 20 epochs, and for each subtask the checkpoint with the highest Macro-F1 on the validation set is saved (CAIL-big has no validation set, so the validation set of CAIL-small is used directly). On the LJP-E dataset the experiment is run 5 times and the average results are reported.
The four λ in the total loss function listed in ② are 0.5, 0.5, 0.4, and 0.2, and the hyperparameter $\lambda_p$ of the event-output penalty is 0.1.
The experiments run on 2 Tesla V100 GPUs.
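
For reference, the reported settings can be collected in one configuration object. The class and field names are hypothetical and are not taken from the official code; the values are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    """Reported hyperparameters; field names are illustrative only."""
    max_input_length: int = 512
    optimizer: str = "Adam"
    learning_rate: float = 1e-4
    batch_size: int = 32
    warmup_steps: int = 3000
    max_epochs: int = 20
    loss_lambdas: tuple = (0.5, 0.5, 0.4, 0.2)  # the four lambdas of the total loss
    lambda_p: float = 0.1                        # weight of the event-output penalty
    runs_on_ljp_e: int = 5                       # results on LJP-E are averaged over 5 runs
    num_gpus: int = 2                            # Tesla V100


config = TrainingConfig()
print(config)
```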

3.4 Main experimental results

Evaluation metrics: Accuracy (Acc), Macro-Precision (MP), Macro-Recall (MR), and Macro-F1 (F1).
(In the result tables, gold or @G means the results were produced with the gold event labels rather than with predicted events.)
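
The four metrics can be computed as in the sketch below. It assumes the common definition in which Macro-F1 is the unweighted mean of the per-class F1 scores; some implementations instead derive F1 from MP and MR, and the post does not say which one the paper uses. The function name and toy labels are illustrative.

```python
def macro_metrics(gold, pred):
    """Accuracy and macro-averaged precision, recall, and F1 over all observed classes."""
    classes = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for cls in classes:
        tp = sum(1 for g, p in zip(gold, pred) if g == cls and p == cls)
        fp = sum(1 for g, p in zip(gold, pred) if g != cls and p == cls)
        fn = sum(1 for g, p in zip(gold, pred) if g == cls and p != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    count = len(classes)
    accuracy = sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)
    return {
        "Acc": accuracy,
        "MP": sum(precisions) / count,
        "MR": sum(recalls) / count,
        "F1": sum(f1s) / count,
    }


gold = [0, 0, 1, 1]
pred = [0, 1, 1, 1]
metrics = macro_metrics(gold, pred)
print(metrics)
```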

Results on LJP-E:
[Table image: experimental results on LJP-E]

Results on CAIL:
Because the LJP-E dataset covers only 15 kinds of cases, a Legal BERT is first trained on the CAIL training set, using the [CLS] token representation to predict whether a case belongs to one of these 15 kinds. This classifier is called Switch (batch size 32, 20 training epochs, Adam optimizer, learning rate 0.0001; its accuracy is 89.82% on CAIL-big and 85.32% on CAIL-small). If the case does belong to one of them, the fine-tuned EPM makes the prediction; if not, the EPM from before fine-tuning (the pre-trained EPM in ③) makes it.
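
The routing performed by Switch can be sketched as a simple dispatch. The three callables below are toy stand-ins for the real models, and their names, inputs, and outputs are hypothetical.

```python
def route_prediction(case_text, switch, fine_tuned_epm, pretrained_epm):
    """Use the fine-tuned EPM when Switch says the case is one of the 15 LJP-E kinds."""
    if switch(case_text):
        return fine_tuned_epm(case_text)
    return pretrained_epm(case_text)


# Toy stand-ins for the Switch classifier and the two models.
def toy_switch(text):
    return "robbed" in text


def toy_fine_tuned_epm(text):
    return {"model": "fine-tuned EPM"}


def toy_pretrained_epm(text):
    return {"model": "pre-trained EPM"}


in_scope = route_prediction("Zhang robbed Li", toy_switch, toy_fine_tuned_epm, toy_pretrained_epm)
out_of_scope = route_prediction("Wang drove drunk", toy_switch, toy_fine_tuned_epm, toy_pretrained_epm)
print(in_scope, out_of_scope)
```

Because the final prediction depends on Switch, its accuracy (89.82% on CAIL-big, 85.32% on CAIL-small) bounds how often a case reaches the intended model.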

Besides comparing EPM directly with the SOTA models, there are two further settings:

Adding EPM to the compared SOTA models (if Switch predicts that the case belongs to one of the 15 kinds in LJP-E, the fine-tuned EPM classifies it; otherwise the original model does). (I find it odd to simply bolt the two together.)
Modifying the TOPJUDGE model (reported as TOPJUDGE+Event): the CNN encoder is replaced with an LSTM, and the original fact description representation used as input is replaced with the event features. This performs worse than plain TOPJUDGE+EPM, which suggests that using EPM directly as a black box works better.


Event extraction metrics:
[Table image: event extraction results]

3.5 Model analysis

3.5.1 Ablation study

Remove the event component
Remove the event-output constraints (absolute constraint→CSTR1, event-based consistency constraint→CSTR2)
Remove the constraints between subtasks (article-charge constraint→DEP1, article-term constraint→DEP2)
Remove the superordinate types, so the model makes its token-level predictions directly, without the superordinate features
Treat event extraction as an auxiliary task (sharing the encoder with the LJP task)
