DialogRPT-Dialog Ranking Pretrained Transformers
2022-07-26 01:54:00 【just do it now】
Finding the right human feedback data
To build such a chatbot, we first need to learn human preferences from appropriate data. The data should take the form of one-to-many dialogue: the same conversational context has multiple replies, and each reply carries a popularity indicator showing how much feedback it received.
However, collecting one-to-many dialogue data directly and hiring people to annotate it is time-consuming and laborious. Moreover, feedback strength cannot be judged by a single annotator; it requires votes from many people. Meanwhile, conventional quality metrics that can be computed automatically, such as lexical diversity, do not reflect how much humans prefer a reply.

Figure 1: The tree structure of replies to a post on social media.
Three indicators measure the degree of human feedback. Updown: the number of upvotes minus the number of downvotes. Depth: the number of subsequent reply turns. Width: the number of comments that reply directly to this comment.
The researchers therefore looked to social networks for suitable human feedback data.
As shown in Figure 1, a post and its reply comments on a social network form a tree. Each node represents a comment and holds its content together with its upvote and downvote counts. The parent of a node is the comment it replies to, and the children of a node are the comments that reply to it. For a given node, the path from the root to its parent defines the context C (the preceding conversation), and the comment itself is a reply r to that context. This yields one dialogue example (C, r). Because one comment can receive several replies, that is, a parent node can have several children, the data is one-to-many.
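The mapping from a reply tree to (C, r) examples can be sketched in a few lines. This is only an illustration: the `Comment` structure and the function name are hypothetical and are not taken from the paper or its code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Comment:
    """One node of the reply tree (hypothetical structure for illustration)."""
    cid: str
    text: str
    parent: Optional[str] = None  # None marks the root post


def context_reply_pairs(comments: List[Comment]) -> List[Tuple[List[str], str]]:
    """Return one (context, reply) example per non-root comment."""
    by_id = {c.cid: c for c in comments}
    pairs = []
    for c in comments:
        if c.parent is None:
            continue  # the root post has no context
        # Walk from the parent up to the root, then reverse to root-first order.
        context = []
        node = by_id[c.parent]
        while node is not None:
            context.append(node.text)
            node = by_id[node.parent] if node.parent is not None else None
        context.reverse()
        pairs.append((context, c.text))
    return pairs


comments = [
    Comment("p", "post"),
    Comment("a", "reply A", "p"),
    Comment("b", "reply B", "p"),
    Comment("c", "reply to A", "a"),
]
pairs = context_reply_pairs(comments)
```

Here "reply A" and "reply B" share the context `["post"]`, which is exactly the one-to-many situation the ranking model is trained on.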

Figure 2: Spearman coefficients between the human feedback indicators. The larger the value, the stronger the correlation between two indicators.
Each comment carries three feedback signals: (1) Width: the number of direct replies received, i.e. the number of child nodes. (2) Depth: the number of reply turns that follow, i.e. the maximum depth of its descendants. (3) Updown: the vote score, i.e. upvotes minus downvotes. The authors used data from Reddit to study the correlation among these three indicators.
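As a rough sketch of how the three signals could be computed from a reply tree, assume each comment is described by its parent id and an (upvotes, downvotes) pair. The input layout and the function name are assumptions made for this example only.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple


def feedback_metrics(
    parents: Dict[str, Optional[str]],
    votes: Dict[str, Tuple[int, int]],
) -> Dict[str, Dict[str, int]]:
    """Compute width, depth and updown for every comment id."""
    children = defaultdict(list)
    for cid, parent in parents.items():
        if parent is not None:
            children[parent].append(cid)

    def depth(cid: str) -> int:
        # Maximum number of reply turns below this comment.
        # Recursive for brevity; very deep threads would need an iterative version.
        kids = children.get(cid, [])
        return 0 if not kids else 1 + max(depth(k) for k in kids)

    return {
        cid: {
            "width": len(children.get(cid, [])),   # direct replies
            "depth": depth(cid),                   # longest chain of descendants
            "updown": votes[cid][0] - votes[cid][1],
        }
        for cid in parents
    }


parents = {"p": None, "a": "p", "b": "p", "c": "a"}
votes = {"p": (10, 2), "a": (5, 1), "b": (0, 3), "c": (1, 0)}
metrics = feedback_metrics(parents, votes)
```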
The results are shown in Figure 2. Width and depth are strongly correlated with each other, but both are only weakly correlated with updown. The authors speculate that people may stop commenting once they have upvoted. Note that updown is the difference between upvotes and downvotes. Controversial posts may provoke more discussion, so posts with many upvotes and many downvotes, or with only a small gap between the two, may trigger wider and deeper threads.
All three indicators reflect, in some way, how strongly people respond. According to related research, however, although the vote score generally tracks popularity, it is also affected by many other factors. First, the indicators follow long-tailed distributions: a few replies receive the vast majority of replies and upvotes. In addition, the indicators depend on the section of the site where the post appears, the time the comment was posted, and the poster's influence on the social network. The data therefore needs careful normalization before it is used to train a model.
DIALOGRPT: a 12-layer transformer model based on GPT-2
Next, the researchers used this data to inject human preference information into a model. Given a context C and a set of replies, the model must rank the replies according to the feedback each received, separately for each of the three indicators: width, depth, and updown.
Existing dialogue systems mainly rely on relevance measures (such as perplexity and mutual information) or hand-crafted features to judge how suitable each candidate reply is. Such methods cannot be trained end to end on real-world human feedback data.
As noted above, many factors can affect the feedback a comment receives, so the authors recast the ranking problem as a comparison problem. During training, the model no longer predicts a score for a single comment in isolation. Instead, it compares two comparable replies at a time and is trained to give the higher score to the positive example (the one that received more replies or upvotes).

Figure 3: The DIALOGRPT objective function. Optimization raises the score of the positive example (+) while lowering the score of the negative example (-).
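The figure itself is not reproduced in this text. One common way to write a pairwise objective that matches the caption is the negative log of a two-way softmax over the positive and negative scores. The sketch below assumes that form; it is an illustration, not necessarily the exact formula in the figure, and `pairwise_loss` is a name made up for this example.

```python
import math


def pairwise_loss(score_pos: float, score_neg: float) -> float:
    """Assumed form: -log( e^{s+} / (e^{s+} + e^{s-}) ).

    Algebraically this equals softplus(s- - s+) = log(1 + e^{s- - s+}),
    which is evaluated below in a numerically stable way.
    """
    diff = score_neg - score_pos
    return max(diff, 0.0) + math.log1p(math.exp(-abs(diff)))


# The loss shrinks as the positive score rises above the negative score.
loss_good = pairwise_loss(2.0, 0.0)   # positive ranked higher
loss_tie = pairwise_loss(0.0, 0.0)    # no preference: log(2)
loss_bad = pairwise_loss(0.0, 2.0)    # negative ranked higher
```

Only the difference between the two scores matters in this form, which fits the idea of comparing two replies to the same context instead of predicting an absolute score for each.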
To make the positive and negative examples comparable, the authors placed strict constraints on each pair: (1) Both replies come from the same context, i.e. they share the same C; in the reply tree, the two comments have the same parent node. (2) The two comments were posted within a time threshold of each other (one hour in the paper), so that a large time gap does not distort the feedback. (3) The feedback of the positive example must exceed that of the negative example by a threshold, to reduce noise. Because all three indicators are long-tailed, the authors filter with both an absolute threshold and a percentage threshold, and they also discard comments with more downvotes than upvotes. When training on multiple indicators, the model combines them with weights so that no single indicator dominates.
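The pair constraints can be sketched as a simple filter. Only the one-hour window comes from the article; the two gap thresholds, the way the percentage threshold is defined, and the `Reply` structure are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Reply:
    parent_id: str
    created_at: float  # seconds since epoch
    feedback: int      # one indicator: width, depth or updown
    ups: int
    downs: int


MAX_TIME_GAP = 3600.0  # one hour, as stated in the article
MIN_ABS_GAP = 4        # hypothetical absolute threshold
MIN_REL_GAP = 0.25     # hypothetical percentage threshold


def is_valid_pair(pos: Reply, neg: Reply) -> bool:
    """Check whether (pos, neg) is a comparable training pair."""
    if pos.parent_id != neg.parent_id:
        return False  # (1) must share the same context
    if abs(pos.created_at - neg.created_at) > MAX_TIME_GAP:
        return False  # (2) must be posted close together in time
    if pos.downs > pos.ups or neg.downs > neg.ups:
        return False  # drop comments with more downvotes than upvotes
    gap = pos.feedback - neg.feedback
    if gap < MIN_ABS_GAP:
        return False  # (3) absolute gap too small
    # Assumed definition: gap measured relative to the positive's feedback.
    return gap >= MIN_REL_GAP * max(pos.feedback, 1)


pos = Reply("p", 0.0, 20, 15, 1)
neg = Reply("p", 1800.0, 2, 3, 0)
late = Reply("p", 7200.0, 2, 3, 0)
ok = is_valid_pair(pos, neg)
too_late = is_valid_pair(pos, late)
```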
Beyond human preference, the model should still be able to judge whether a reply is relevant to the context. The researchers therefore introduced a human-vs-fake task: given a context and a comment, decide whether the comment actually belongs to that context.

Figure 4: The model evaluates a reply from two angles: how much humans prefer it (3 indicators) and how "human-like" it is. The training data for both parts is shown on the right of the figure.
The researchers crawled Reddit posts from 2011-2012 and assembled 147 million dialogues (the dataset is open source). For the task of learning human preference, filtering with the rules described above yielded 133 million pairs.
For the task of learning content relevance, negative examples are obtained in two ways: (1) Retrieval (Sampled): pick a reply at random from the training set. (2) Generation (Generated): use the dialogue generation model DialoGPT to generate a reply as the negative example. Because DialoGPT is very good at imitating human replies, the authors chose only 53,000 replies with high feedback values as positive examples, so that they differ from machine-generated output.
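The "Sampled" variant is easy to sketch: for each context, pair the true human reply with a reply drawn at random from a different example. The function name and data layout below are assumptions for illustration. The "Generated" variant is not shown because it needs a trained generation model.

```python
import random
from typing import List, Tuple


def sampled_negatives(
    pairs: List[Tuple[str, str]], seed: int = 0
) -> List[Tuple[str, str, str]]:
    """Turn (context, reply) pairs into (context, human_reply, fake_reply) triples."""
    if len(pairs) < 2:
        raise ValueError("need at least two examples to sample a negative")
    rng = random.Random(seed)
    triples = []
    for i, (context, reply) in enumerate(pairs):
        # Draw an index from all positions except i.
        j = rng.randrange(len(pairs) - 1)
        if j >= i:
            j += 1
        triples.append((context, reply, pairs[j][1]))
    return triples


pairs = [
    ("how is the weather?", "sunny all week"),
    ("any book tips?", "try a short story collection"),
    ("what time is the game?", "kickoff is at eight"),
    ("is the cafe open?", "yes, until six"),
]
triples = sampled_negatives(pairs)
```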
With this data, the authors trained DIALOGRPT (Dialog Ranking Pretrained Transformer), a 12-layer transformer model based on GPT-2, and initialized part of its parameters from DialoGPT-medium. They compared it under several experimental settings against baselines including BoW (bag of words), dialog perplexity, BM25 (a keyword-similarity measure), ConveRT (a transformer-based model pretrained on Reddit data), and reply length.

Figure 4: Examples of experimental results.
The bag-of-words analysis shows that replies carrying little information also receive fewer responses and upvotes. Replies phrased as questions (what/who/why/how) lead to longer (deeper) conversations, and comments addressed to a broad audience (anyone, everyone) receive more direct replies (wider). As shown in Figure 4, DIALOGRPT captures the same patterns.

Figure 5: Ranking evaluation results.
Pairwise column: given a positive and a negative example from the same context, the fraction of pairs in which the model scores the positive example higher. Spearman column: the rank correlation obtained when ranking a group of replies from the same context.
Next, the authors quantitatively evaluated the model with two ranking tests. (1) Matching the training objective: given a positive and a negative example, test whether the model gives the positive one the higher score. (2) Given a group of replies to the same context, use the Spearman coefficient to measure how well the model's ranking agrees with the actual degree of human preference.
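Both evaluation measures are standard and can be sketched in plain Python. The function names are chosen for this example; the Spearman coefficient is computed as the Pearson correlation of average ranks, which handles ties.

```python
from typing import List, Sequence, Tuple


def pairwise_accuracy(score_pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of (positive, negative) score pairs ranked correctly."""
    return sum(pos > neg for pos, neg in score_pairs) / len(score_pairs)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation (undefined if either input is constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Model scores vs. observed feedback for replies to one context.
model_scores = [0.9, 0.1, 0.5]
observed_width = [12, 0, 3]
rho = spearman(model_scores, observed_width)
acc = pairwise_accuracy([(0.9, 0.1), (0.2, 0.7), (0.6, 0.4), (0.8, 0.3)])
```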
Under both settings, DIALOGRPT performs significantly better than the other methods.

Figure 6: Reply retrieval results on different datasets.
The model must retrieve suitable replies given the context. Hits@k is the proportion of correct answers that appear in the top k of the ranked reply list. The value of k is the number of human replies for each context in the corresponding dataset, i.e. the number of correct answers.
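A minimal sketch of the retrieval metric follows, assuming it is computed per context as the share of the top-k ranked candidates that are correct, with k set to the number of correct answers. The function name and this exact normalization are assumptions for illustration.

```python
from typing import Sequence


def hits_at_k(scores: Sequence[float], is_correct: Sequence[bool], k: int) -> float:
    """Share of the k highest-scored candidates that are correct replies."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sum(is_correct[i] for i in ranked[:k]) / k


# Four candidate replies for one context, two of them correct (so k = 2).
scores = [0.9, 0.2, 0.7, 0.4]
is_correct = [True, False, False, True]
hits = hits_at_k(scores, is_correct, k=2)
```

In this toy case the top two candidates are the first and the third, and only the first is a correct reply.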
The authors also applied the Reddit-trained model directly to other datasets such as DailyDialog, Twitter, and PersonaChat, retrieving suitable replies given the context. As Figure 6 shows, the model achieved the best results even though it was never trained on these datasets. On the Reddit dataset each context has multiple correct replies (k>5), so the authors also included reference-based metrics such as BLEU. DIALOGRPT cannot see the reference replies, yet it significantly outperforms the methods that use them.

Figure 7: Pairwise evaluation results.
Given a pair containing a positive and a negative example, can the model score the positive example higher? Human vs. Human: both examples are human replies, and the positive one has the higher feedback indicator. Human vs. Fake: the positive example is the human reply from this context, and the negative example is either a randomly selected human reply (Rand) or a machine-generated reply for this context (Generated).
Next, the authors evaluated how well the model judges content relevance, comparing models trained on different data and tasks. The results are shown in Figure 7. The results in the upper left are consistent with the correlation analysis of the three indicators above. According to the results on the right, models trained on human preference information are weak at distinguishing the true human reply from a randomly selected human reply, but better at telling it apart from machine-generated output.
This suggests that although DialoGPT can generate relevant, human-like replies, those replies do not attract more human feedback. The results also show that there is still a gap between being human-like and matching human preference. In the last row, therefore, the authors combine the two tasks in a single model, which achieves strong results on both. The model can then weigh both criteria when choosing a reply.
Now that chatbots sound more and more like people, how to generate high-quality replies has become a new research direction. This paper proposes the strength of the feedback a reply receives as one concrete measure of high quality. This goal is valuable in many applications. In psychological intervention, for example, where expectations are high, a chatbot must not only generate replies that fit the content but also keep the other party willing to talk while guiding them toward a more positive state of mind.