[2020 Survey] A Survey of Link Prediction Based on Knowledge Graph Embedding
2022-07-01 04:40:00 【Necther】
Source | Expertise

Abstract
Knowledge graphs (KGs) have many applications in industry and academia, which in turn has driven a great deal of research on extracting information from a variety of sources at scale. Despite these efforts, it is well known that even state-of-the-art KGs are incomplete. Link Prediction (LP), the task of predicting missing facts based on the entities already present in a KG, is a promising and widely studied way to address KG incompleteness. Among recent LP techniques, those based on KG embeddings have achieved good performance on several benchmarks. Although the literature in this field is growing rapidly, the influence of the various design choices in these methods has not received enough attention. Moreover, the standard practice in this area is to report accuracy aggregated over a large number of test facts, in which some entities are over-represented; this allows LP methods to show good performance by fitting only the structural properties involving those entities, while ignoring the majority of the KG. This survey provides a comprehensive comparison of embedding-based LP methods, extending the dimensions of analysis beyond what is common in the literature. We experimentally compare the effectiveness and efficiency of 16 state-of-the-art methods, consider a rule-based baseline, and report a detailed analysis of the most popular benchmarks in the literature.
Introduction
Knowledge graphs (KGs) are structured representations of real-world information. In a KG, nodes represent entities, such as people and places; labels are the types of the relations that connect them; edges are specific facts connecting two entities with a relation. Because KGs can structure and model complex data in a machine-readable way, they are widely used across domains, from question answering to information retrieval and content-based recommendation systems, and they are important for any Semantic Web project. Well-known public KGs include FreeBase, WikiData, DBPedia, and Yago; industrial KGs include the Google KG, Satori, and Facebook Graph Search. These huge KGs can contain millions of entities and billions of facts. Despite such efforts, it is well known that even state-of-the-art KGs suffer from incompleteness. For example, FreeBase, one of the largest KGs and among the most widely used for research purposes, is missing the place of birth for more than 70% of its person entities and the nationality for over 99% of them. This has led researchers to propose a variety of techniques to correct errors and to add missing facts to KGs, a task often called knowledge graph completion or knowledge graph augmentation. An existing KG can be grown either by extracting new facts from external sources (such as web corpora) or by inferring missing facts from those already in the KG. The latter approach, called Link Prediction (LP), is the focus of our analysis. LP has been an increasingly active research field, and has recently benefited from the explosive growth of machine learning and deep learning techniques. Currently, the vast majority of LP models use the original KG elements to learn low-dimensional representations, called knowledge graph embeddings, and then use them to infer new facts.
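As a minimal illustration of this data model (the specific facts below are chosen for this example only and are not drawn from any particular KG discussed in the survey), a KG can be viewed as a set of ⟨head, relation, tail⟩ triples, and link prediction asks which entity best completes a partial triple:

```python
# A KG as a set of <head, relation, tail> triples.
kg = {
    ("Barack_Obama", "born_in", "Honolulu"),
    ("Honolulu", "located_in", "Hawaii"),
    ("Barack_Obama", "nationality", "USA"),
}

def known_tails(kg, head, relation):
    """Return the tail entities already observed for (head, relation).

    Link prediction goes one step further: it ranks *unobserved*
    candidate tails for queries such as ("Barack_Obama", "born_in", ?).
    """
    return {t for h, r, t in kg if h == head and r == relation}

print(known_tails(kg, "Barack_Obama", "born_in"))  # {'Honolulu'}
```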
In just a few years, starting from seminal models such as RESCAL and TransE, researchers have developed dozens of new models based on different architectures. Most papers in this field share a common problem: the results they report are aggregated over a large number of test facts, in which some entities are over-represented. Therefore, LP methods can perform well on these benchmarks just by modeling these entities well, while ignoring the others. Moreover, the limitations of current best practices may make it difficult to understand how the papers in this literature fit together, and to identify research directions worth pursuing. In addition, the advantages, shortcomings, and limitations of current techniques remain largely unknown; that is, few studies analyze what allows a model to perform better. Roughly speaking, we still do not know what makes a fact easy or difficult to learn and predict. In order to alleviate the above problems, we extensively compare and analyze a representative group of embedding-based LP models. We prioritize state-of-the-art systems, and consider works belonging to a broad range of architectures. We train and tune these systems from scratch, and by proposing new, informative evaluation practices, we provide experimental results beyond those in the original papers. Specifically:
- We consider 16 models belonging to different machine learning and deep learning architectures; we also use an additional state-of-the-art rule-mining LP model as a baseline. We provide a detailed description of the methods included in the experimental comparison, a summary of the related literature, and an educational taxonomy of knowledge graph embedding techniques.
- We consider the 5 most commonly used datasets and the most popular metrics currently used for benchmarking, and we analyze their characteristics in detail.
- For each model, we provide quantitative results on efficiency and effectiveness for each dataset.
- We propose a set of structural features of the training data, and measure how they affect the predictive performance of each model on each test fact.
Overview of Methods
In this section, we describe and discuss the main LP methods based on latent features. As described in Section 2, LP models can rely on very different approaches and architectures, depending on how they model the optimization problem and on the techniques they implement to solve it.
In order to outline their highly diverse characteristics, we propose a new taxonomy, shown in Figure 1. We identify three main families of models and further divide them into smaller groups, marked with unique colors. For each group we include the most effective representative models, prioritizing those that achieve state-of-the-art performance and, whenever possible, those with publicly available implementations. The result is a set of 16 models based on extremely diverse architectures; these are the models we use later in the experimental comparison. For each model, we also report the year of publication and its relationships to other models. We believe this taxonomy helps in understanding both the models and the experiments carried out in our work. Table 1 reports further information about the included models, such as their loss functions and space complexity. We identify three families of models: 1) tensor decomposition models; 2) geometric models; 3) deep learning models.


Tensor decomposition models
Models in this family interpret the LP task as a tensor decomposition. They implicitly consider the KG as a three-dimensional adjacency matrix (that is, a 3D tensor), which is only partially observable due to the KG's incompleteness. The tensor is decomposed into a combination of low-dimensional vectors (for example, via a multilinear product): these vectors are used as the embeddings of entities and relations. The core idea of tensor decomposition is that, as long as the model does not overfit the training set, the learned embeddings should generalize, associating high values with the unobserved true facts in the graph adjacency matrix. In practice, the score of each fact is computed by combining the specific embeddings involved in that fact; the embeddings are learned, as usual, by optimizing the scoring function over all training facts. These models tend to use few or no shared parameters, which makes them particularly easy to train.
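As a concrete sketch of such a scoring function (this example is illustrative and not taken from the survey; the entity names and random embeddings stand in for learned ones), a DistMult-style model scores a fact ⟨h, r, t⟩ with the multilinear product of the three embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50  # embedding dimensionality (illustrative choice)

# One vector per entity and per relation; in a real system these
# are learned by optimizing the scores of the training facts.
entities = {e: rng.normal(size=dim) for e in ["Obama", "Honolulu", "Paris"]}
relations = {r: rng.normal(size=dim) for r in ["born_in"]}

def distmult_score(h, r, t):
    """Multilinear product: sum_i h_i * r_i * t_i (DistMult scoring)."""
    return float(np.sum(entities[h] * relations[r] * entities[t]))

# Rank candidate tails for the query (Obama, born_in, ?) by score.
candidates = ["Honolulu", "Paris"]
ranked = sorted(candidates,
                key=lambda t: distmult_score("Obama", "born_in", t),
                reverse=True)
print(ranked)
```

Note that this particular factorization is symmetric in head and tail (swapping them leaves the score unchanged), one of the modeling trade-offs that distinguishes members of this family.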
Geometric models
Geometric models interpret relations as geometric transformations in the latent space. For a given fact, the head entity embedding undergoes a spatial transformation τ that uses the relation embedding as its parameters. The score of the fact is the distance between the resulting vector and the tail embedding, computed with a distance function δ (for example, the L1 or L2 norm).
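TransE, the archetypal geometric model, instantiates τ as a translation: it models a fact as h + r ≈ t and scores it by the distance δ between h + r and t. A minimal sketch, using random stand-in embeddings purely to illustrate the scoring (not a trained model):

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 50

entities = {e: rng.normal(size=dim) for e in ["Obama", "Honolulu", "Paris"]}
relations = {r: rng.normal(size=dim) for r in ["born_in"]}

def transe_score(h, r, t, p=2):
    """TransE: distance between (h + r) and t under the Lp norm.

    Lower distance means a more plausible fact; training pushes
    h + r close to t for true facts and far from it for false ones.
    """
    return float(np.linalg.norm(
        entities[h] + relations[r] - entities[t], ord=p))

# A perfectly modeled fact would satisfy h + r == t, i.e. distance 0.
entities["Honolulu"] = entities["Obama"] + relations["born_in"]
print(transe_score("Obama", "born_in", "Honolulu"))  # 0.0
```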

Deep learning models
Deep learning models use deep neural networks to perform the LP task. Neural networks learn parameters, such as weights and biases, which they combine with the input data to identify significant patterns. Deep neural networks usually organize their parameters into separate layers, generally interleaved with non-linear activation functions.
Over time, many different types of layers have been developed, applying different operations to the input data. For example, a fully connected layer combines the input data X with weights W and adds a bias B: W X + B. For simplicity, in the following formulas we keep the bias implicit and do not mention it. More advanced layers perform more complex operations, such as convolutional layers (which learn convolution kernels to apply to the input data) or recurrent layers (which process sequential inputs recursively).
In LP tasks, the KG embeddings are usually learned jointly with the weights and biases of the layers; these shared parameters make the models more expressive, but may lead to larger numbers of parameters, harder training, and a greater tendency to overfit.
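As a hedged sketch of this idea (a generic feed-forward scorer invented for illustration, not any specific published model such as ConvE), the head, relation, and tail embeddings can be concatenated and passed through fully connected layers whose weights are shared across all facts:

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 50      # embedding size (illustrative)
hidden = 64   # hidden layer size (illustrative)

entities = {e: rng.normal(size=dim) for e in ["Obama", "Honolulu"]}
relations = {r: rng.normal(size=dim) for r in ["born_in"]}

# Layer parameters shared by every fact; bias kept explicit here.
W1 = rng.normal(size=(hidden, 3 * dim)) * 0.1
b1 = np.zeros(hidden)
W2 = rng.normal(size=(1, hidden)) * 0.1

def mlp_score(h, r, t):
    """Score a fact with a two-layer feed-forward network:
    x -> ReLU(W1 @ x + b1) -> W2 @ (...)."""
    x = np.concatenate([entities[h], relations[r], entities[t]])
    hid = np.maximum(0.0, W1 @ x + b1)  # ReLU activation
    return float(W2 @ hid)

print(mlp_score("Obama", "born_in", "Honolulu"))
```

Because W1 and W2 are reused for every fact, the network can capture interactions beyond a fixed algebraic form, at the cost of the extra parameters and overfitting risk noted above.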









