A Survey and Practice of Knowledge Graph Construction Techniques
2022-07-05 02:02:00 【Necther】
Preface
A knowledge graph is a special kind of semantic network. Using basic units such as entities, relations, and attributes, it describes, in symbolic form, the concepts of the physical world and the relationships between them. Why are knowledge graphs so important for information retrieval, recommender systems, and question answering? Let's illustrate with an example.
Suppose we are in a search scenario and type into the search box:

Can I take a bath during postpartum confinement?
You can see that this query is a complete question. If the retrieval system has a large question-answer corpus (e.g., an FAQ scenario), or a large enough article database (with high coverage of article titles), then using semantic matching to compute similarity between the query and the FAQ questions or article titles may be a fairly ideal solution (see the earlier article [reference]). Reality, however, is always harsh: it is rare for the content being retrieved to be linguistically complete. So what does the retrieval baseline look like in this case?
Under the traditional retrieval process, we first run word segmentation on the query, so the original sentence becomes:

sit / confinement / can / bathe / [question particle] / ?
For the articles in the database, we build an inverted index in advance, then use the BM25 algorithm to recall and rank documents against the segmented query terms.
Of course, we can build a business dictionary to fit the scenario and prepare some business keywords; in the example, "postpartum confinement" is a compound noun. We also drop some stop words at indexing time, such as "can" and the question particle, and then adjust weights by part of speech (e.g., disease terms outrank other words), hoping to improve the relevance of the final article ranking. A minimal sketch of this baseline follows.
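Below is a minimal sketch of this keyword baseline: a toy inverted index plus BM25 scoring. The corpus, tokenization, and parameter values (k1 = 1.5, b = 0.75) are illustrative assumptions, not the production setup.

```python
import math
from collections import Counter, defaultdict

# Pre-segmented toy documents standing in for the indexed article database.
docs = [
    ["postpartum", "confinement", "bathing", "guidelines"],
    ["daily", "care", "during", "the", "puerperium"],
]

k1, b = 1.5, 0.75
N = len(docs)
avgdl = sum(len(d) for d in docs) / N

# Inverted index: term -> {doc_id: term frequency}
index = defaultdict(dict)
for doc_id, doc in enumerate(docs):
    for term, tf in Counter(doc).items():
        index[term][doc_id] = tf

def bm25(query_terms):
    scores = defaultdict(float)
    for term in query_terms:
        postings = index.get(term, {})
        idf = math.log((N - len(postings) + 0.5) / (len(postings) + 0.5) + 1)
        for doc_id, tf in postings.items():
            dl = len(docs[doc_id])
            scores[doc_id] += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(bm25(["postpartum", "confinement", "bathing"]))  # doc 0 ranks first
```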
But up to this point, retrieval is still keyword-centric: the article content must literally contain the words "postpartum confinement" and "bathe". Meanwhile, because stop words and particles are removed, the query suffers semantic loss: "postpartum confinement" and "bathe" become two disconnected concepts.
As humans, though, this is how we actually understand the query:

postpartum confinement -> puerperal woman
bathing -> activity of daily living

The whole sentence can be understood as "can puerperal women perform daily activities such as bathing", and we can naturally reason that suitable feedback for this query might be content like "daily taboos during the puerperium".
Notice that when humans interpret natural language, we reason over several concepts, with a clear sense of which concepts subsume which. These concepts come from a knowledge system acquired over many years; we can reason, analyze, and associate from what we already know, which is exactly what computers lack. Traditional corpus-based NLP hopes to reverse-engineer a map of the whole real world from large amounts of text data. This is easy to implement and can work well in small scenarios, but it exposes the problems described above: the learned knowledge representation is tightly bound to the scenario and hard to transfer; it lacks an organized knowledge system; and its results cannot be explained.
To solve this class of problems, the author believes the knowledge graph may be the key breakthrough. Compared with long text, knowledge points (entities and relations) are easier to organize, extend, and update. Natural language is an abstraction of human cognition; for computers to understand natural language, they too need a brain filled with knowledge.
Dingxiangyuan (DXY) is also building its own medical-brain knowledge graph. In the future we will share technical research notes on knowledge extraction, knowledge representation, knowledge fusion, and knowledge reasoning related to the graph, as well as practice in search, recommendation, and other scenarios. This article shares our research notes on knowledge graph construction.
Concepts
As every knowledge graph overview notes, a knowledge graph is essentially a semantic network: nodes represent entities or concepts, and edges represent the various semantic relations between entities (or concepts). At the level of knowledge organization there are three specific classifications: Ontology, Taxonomy, and Folksonomy.
These three can be understood as three degrees of strictness about hierarchy. An Ontology is a tree structure with the strictest IsA relations between nodes of adjacent levels (for example, human activities -> sports -> football); the advantage of such a graph is convenient knowledge reasoning, but it cannot express the diversity of conceptual relations. A Taxonomy is also a tree structure, but with a less strict hierarchy: nodes are connected by hypernym-hyponym relations. The benefit is richer conceptual relations, but ambiguity creeps in easily and effective reasoning becomes hard. A Folksonomy is non-hierarchical: all nodes are organized by tags. Beyond flexibility, both semantic precision and reasoning ability are lost.
At present, the Taxonomy is the popular organization type in the Internet industry, because it balances hierarchical relations and a tag system to some degree and offers the best flexibility across applications. This article therefore focuses on Taxonomy construction techniques.
Construction
The data sources for building a large-scale knowledge base can be open semi-structured and unstructured data, plus third-party structured databases. Getting data from a structured database is the easiest; most of the work is unifying concepts, aligning entities, and so on. Next comes acquiring knowledge from semi-structured data, for example from Wikipedia:
Encyclopedia entry descriptions are a good data source: they provide rich context for entity terms, and each entry has a detailed table of contents and paragraph structure.
We can extract the corresponding structured fields from different paragraphs. Of course, some terms may be buried in running text, which requires NLP algorithms for recognition and extraction, such as named entity recognition, relation extraction, attribute extraction, and coreference resolution. The relevant techniques and model optimizations will be discussed in detail in later articles, so I won't elaborate here.
As mentioned above, a Taxonomy-type knowledge graph is an IsA tree structure. From structured or semi-structured data sources, specific relation types are easy to obtain, but hypernym-hyponym hierarchy data is relatively scarce. To expand this category of data, or to extend relation types absent from the original graph, you need to extract from unstructured text, or from the free-text paragraphs of semi-structured sources.
For example, from the text above one can extract "gout IsA purine metabolic disorder", "proteoglycan IsA extracellular matrix", "collagen IsA extracellular matrix", "chondroitin sulfate IsA extracellular matrix", and so on.
As the knowledge graph field heats up, research has grown in recent years, including construction methods for Chinese. Below we survey Taxonomy construction techniques and introduce some concrete methods for both the English and Chinese domains.
Taxonomy construction techniques
Although we imagined above the benefits a well-built Taxonomy brings to downstream NLP applications, we should soberly recognize that research in this area is not yet mature. The main difficulties come from three sources. First, text data varies enormously in length, topic, and quality, so hand-designed extraction templates (e.g., regular expressions) are hard to make compatible with language from different domains. Second, because language expression is so diverse, extracted data is often incomplete, which greatly hurts the final accuracy. Third, again due to domain differences, disambiguating the extracted knowledge is a headache.
Below we introduce existing results from academia and industry. These methods use algorithms from different angles to improve accuracy, and they cover several subtasks under the construction task: hyponym acquisition, hypernym prediction, and taxonomy induction.
Current free-text-based taxonomy construction can be summarized in two main steps: i) extract Is-A relations from text using pattern-based or distributional methods; ii) induce a complete taxonomy structure from the extracted relation data.
Pattern-based Methods
The pattern-based (template) method, as the name suggests, designs fixed templates to match against raw text; the most intuitive form is writing regular expressions to grab matches. Hearst pioneered this field with a few simple patterns such as "[C] such as [E]" and "[C] and [E]", which yield hypernym-hyponym pairs from sentences that fit these logical sentence shapes. The patterns look simple, but many successful applications are built on them, most famously Microsoft's Probase dataset. A minimal sketch follows.
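To make the idea concrete, here is a minimal sketch of Hearst-style extraction using regular expressions. The two patterns and the sample sentence are illustrative; real systems usually delimit noun phrases with POS tagging or parsing rather than raw character classes.

```python
import re

# "[C] such as [E1], [E2] and [E3]"
SUCH_AS = re.compile(r"(?P<hyper>[\w\- ]+?) such as (?P<hypos>[^.;]+)")
# "[E1], [E2] and other [C]"
AND_OTHER = re.compile(r"(?P<hypos>[\w\- ,]+?) and other (?P<hyper>[\w\- ]+)")

def extract_isa(sentence):
    pairs = []
    for pat in (SUCH_AS, AND_OTHER):
        for m in pat.finditer(sentence):
            hyper = m.group("hyper").strip()
            for hypo in re.split(r",|\band\b", m.group("hypos")):
                if hypo.strip():
                    pairs.append((hypo.strip(), hyper))
    return pairs

print(extract_isa("chronic diseases such as gout, diabetes and hypertension"))
# [('gout', 'chronic diseases'), ('diabetes', 'chronic diseases'),
#  ('hypertension', 'chronic diseases')]
```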
Precisely because it is simple, the template method has obvious drawbacks. The biggest is low recall, for a simple reason: natural language has diverse expressions, the number of templates is finite, and it is hard to cover all sentence structures. Second, the flexibility of language also hurts extraction precision; common errors involve unknown common words, wrong expressions, incomplete extraction, ambiguity, and the like. Much academic work under the pattern-based approach targets exactly these two questions: how to improve recall and how to improve precision.
a). How to improve recall?
The first method is pattern generalization, e.g., designing dedicated templates for different types of entity words, or allowing flexible substitution of articles and auxiliary words inside a template.
There is also work that extends templates automatically; for example, Rion Snow's group, in 《Learning syntactic patterns for automatic hypernym discovery》, uses syntactic dependency paths to acquire new templates automatically.
However, automatic template generation brings new problems: if the original corpus is very large, the features of any single template become very sparse (the feature sparsity problem). One line of work therefore focuses on a pattern's features, improving their generality to raise recall. Related work includes the star pattern concept from Navigli's group: replace the low-frequency entity words in a sentence to obtain more general patterns.
Approaches that use syntactic dependencies as features share a similar idea: in the PATTY system, POS tags, ontological types, or entity words are randomly substituted along the dependency path, and the best patterns are selected at the end.
The second method is iterative extraction. The main assumption is that some wrong relation data, caused by linguistic ambiguity or semantic drift, gets extracted repeatedly by overly general patterns; if we design a verification mechanism, perhaps we can weed it out. For example, 《Semantic Class Learning from the Web with Hyponym Pattern Linkage Graphs》 designed "doubly-anchored" patterns and extracts with a bootstrapping loop.
The third method is hypernym inference. Another factor limiting recall with template extraction is that templates target complete sentences, meaning the hypernym and hyponym of a relation must co-occur in one sentence. A natural idea is to exploit transitivity between words: if y is a hypernym of x, and x is very similar to x', then the relation can propagate; existing work trains an HMM to make such cross-sentence predictions. Beyond noun-to-noun inference, some studies make syntactic inferences from a hyponym's modifiers, e.g., a "grizzly bear" is also a kind of "bear".
The work 《Revisiting Taxonomy Induction over Wikipedia》 uses the head word of a phrase as a feature and builds a set of heuristic extraction procedures around it; a minimal head-word sketch follows.
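As a concrete illustration of the head-word idea, here is a tiny sketch for English noun phrases, under the simplifying assumption that the last token is the head; real pipelines use a parser.

```python
# Head-word heuristic: for an English noun phrase, the final noun is usually
# its hypernym ("grizzly bear" IsA "bear"). Whitespace split is a toy stand-in
# for proper syntactic head extraction.
def head_word_isa(noun_phrase):
    tokens = noun_phrase.lower().split()
    if len(tokens) < 2:
        return None
    return (noun_phrase, tokens[-1])  # (hyponym, hypernym candidate)

print(head_word_isa("grizzly bear"))   # ('grizzly bear', 'bear')
print(head_word_isa("gastric ulcer"))  # ('gastric ulcer', 'ulcer')
```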
b). How to improve accuracy?
For a problem like graph construction, how do we evaluate accuracy? The most common methods are statistical. For a candidate is-a pair (x, y): the KnowItAll system uses search engine counts to compute the pointwise mutual information (PMI) between x and y; Probase uses a likelihood probability to express how likely y is a hypernym of x, taking the most probable as the result; others validate predictions via Bayesian classifiers, external data, or expert review. A small PMI sketch follows.
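As an illustration, here is a minimal PMI computation for a candidate pair, in the spirit of the KnowItAll approach; the count arguments are toy stand-ins for search-engine hit counts or corpus statistics.

```python
import math

def pmi(count_xy, count_x, count_y, total):
    """PMI(x, y) = log( p(x, y) / (p(x) * p(y)) )."""
    p_xy = count_xy / total
    p_x, p_y = count_x / total, count_y / total
    return math.log(p_xy / (p_x * p_y))

# Toy counts: "gout" and "metabolic disorder" co-occur 120 times in 1e6 sentences.
print(pmi(count_xy=120, count_x=900, count_y=2500, total=1_000_000))
# A strongly positive value suggests the pair is genuinely associated.
```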
Beyond the extraction process itself, most research improves accuracy by selecting a validation metric and then building a classifier to optimize iteratively. However, accuracy using templates alone remains generally low, and most of the classifier work appears in template + distributional hybrid solutions, so let's introduce the distributional extraction idea next.
Distributional Methods
In NLP, distributional methods cover representation-learning results such as word vectors and sentence vectors. A great advantage of distributed representations is that they turn the originally discrete data of NLP into continuous, computable data. This idea carries over to graph construction: computability means there are relations among word vectors, and those relations can encode the hypernym-hyponym structure of Is-A pairs. Another advantage of distributional extraction is that we can predict Is-A relations directly rather than extract them. The main steps can be summarized as: i) obtain a seed dataset (key terms); ii) use unsupervised or supervised models to obtain more candidate Is-A pairs.
a). Key term extraction
There are many ways to obtain the seed dataset. The most intuitive is strictly designed patterns, which guarantees high precision of the key terms and works well when the corpus is large; but with less corpus data, the sample count may be insufficient and downstream models overfit during training. Besides pattern extraction, some studies use sequence labeling models or NER tools for pre-extraction, then filter with a few rules.
Some studies of vertical-domain taxonomy construction attach domain-specific post-processing (domain filtering). Most apply thresholds on statistics such as TF and TF-IDF, or on other domain-related scores. Others weight sentences during selection and extract key terms only from sentences with high domain weight.
Once the key terms are obtained, the next step is expanding new relation pairs from these seeds.
b). Unsupervised Models
The first direction is clustering. For clustering, the core research question is which distance measure to use. Simple measures such as cosine, Jaccard, and Jensen-Shannon divergence are all worth trying; slightly more complex ones compare an entity pair (x, y) over extracted features and their weights, such as the LIN measure:

$Lin(x, y) = \frac{\sum_{f \in F_x \cap F_y} \left[ w_x(f) + w_y(f) \right]}{\sum_{f \in F_x} w_x(f) + \sum_{f \in F_y} w_y(f)}$

where $F_x$ and $F_y$ are the extracted feature sets of x and y, and $w$ denotes a feature's weight.
In addition, researchers have noticed an asymmetry: on a Wikipedia entry page, for example, the hyponym appears in only some of the contexts describing the hypernym, while the hypernym may appear in all of the hyponym's contexts. Distance measures have been adjusted accordingly, for example WeedsPrec:

$WeedsPrec(x, y) = \frac{\sum_{f \in F_x \cap F_y} w_x(f)}{\sum_{f \in F_x} w_x(f)}$

Assumptions of this kind are called the Distributional Inclusion Hypothesis (DIH); similar measures include WeedsRec, BalAPInc, ClarkeDE, cosWeeds, invCL, etc. A small sketch of both measures follows.
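Here is a small sketch contrasting the symmetric Lin measure with the directional WeedsPrec over sparse feature weights (e.g., PMI values of context words); the toy feature dictionaries are illustrative.

```python
def lin(fx, fy):
    shared = set(fx) & set(fy)
    num = sum(fx[f] + fy[f] for f in shared)
    den = sum(fx.values()) + sum(fy.values())
    return num / den if den else 0.0

def weeds_prec(fx, fy):
    # High when x's features are included in y's: evidence that y subsumes x (DIH).
    shared = set(fx) & set(fy)
    den = sum(fx.values())
    return sum(fx[f] for f in shared) / den if den else 0.0

gout = {"pain": 2.1, "uric_acid": 3.0, "joint": 1.4}
disease = {"pain": 1.0, "uric_acid": 0.5, "joint": 0.8, "treatment": 2.2, "chronic": 1.7}
print(lin(gout, disease), weeds_prec(gout, disease))
# weeds_prec(gout, disease) = 1.0: gout's contexts are fully included in disease's.
```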
Besides the distance measure, another concern is how to obtain the features; common choices include co-occurrence frequency in the text corpus, pointwise mutual information, LMI, and so on.
c). Supervised Models
After key-term extraction and clustering, the way to further improve accuracy is to build a supervised model, using either classification or ranking.
From the classifier perspective, the most popular solution is to pre-train a language model such as Word2Vec, map a candidate pair (x, y) to its two vectors, concatenate them, and train a binary classifier such as an SVM. This method serves as the baseline in many later studies. It is simple and effective, but recent work has pointed out problems: in practice, such a classifier learns semantic relatedness rather than the hypernymy we want; in other words, it overfits very easily. Alternatives take the difference of v(x) and v(y), or combine concatenation, addition, element-wise product, and other operations into a composite feature. A minimal sketch follows.
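Here is a minimal sketch of that baseline, with a toy embedding table standing in for a pretrained model (a gensim KeyedVectors object would work the same way); the pairs and labels are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def pair_features(wv, x, y, mode="diff"):
    vx, vy = np.asarray(wv[x]), np.asarray(wv[y])
    if mode == "concat":
        return np.concatenate([vx, vy])  # tends to learn relatedness and overfit
    return vy - vx                       # vector-offset ("diff") alternative

def train_isa_classifier(wv, pairs, labels, mode="diff"):
    X = np.stack([pair_features(wv, x, y, mode) for x, y in pairs])
    return SVC(kernel="rbf").fit(X, labels)

# Toy stand-in for a pretrained embedding table:
rng = np.random.default_rng(1)
wv = {w: rng.normal(size=50) for w in ["gout", "disease", "bathing", "diabetes"]}
clf = train_isa_classifier(wv, [("gout", "disease"), ("bathing", "disease")], [1, 0])
print(clf.predict([pair_features(wv, "diabetes", "disease")]))
```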
Later researchers argued that word vector training is heavily influenced by the linguistic environment, so mapping hypernymy directly into the word embedding space is difficult. On top of the word vectors, they therefore build a separate layer of embeddings for x and y to represent the relation itself; experiments show this gives a solid improvement for graph construction in specific domains.
Besides classifiers, hypernym generation is another option, and currently the best-performing one. Roughly, a piecewise linear projection model is built to select the hypernym vector v(y) closest to the projection of v(x), with ranking techniques applied on top. We chose related work from the Chinese field to try; see the chapters below for details.
Taxonomy Induction
The previous chapter introduced various techniques for extracting Is-A pairs from text; the last step is merging these relations into the data to form a complete graph. Most methods follow an incremental learning pattern: initialize a seed taxonomy, then add new Is-A data into the graph. Research in this direction designs evaluation metrics as the basis for inserting new data.
A common approach treats construction as a clustering problem and merges similar subtrees by clustering. For example, 《Unsupervised Learning of an IS-A Taxonomy from a Limited Domain-Specific Corpus》 uses K-Medoids clustering at the stage of finding the lowest common ancestor.
Graph algorithms are another direction, since a Taxonomy is naturally a graph structure. 《A Semi-Supervised Method to Learn and Construct Taxonomies using the Web》 offers one idea: find all nodes with in-degree 0, which are probably at the top of the taxonomy, and all nodes with out-degree 0, which are probably bottom-level instances; then search the graph for the longest path from a root to an instance, which yields a fairly reasonable structure. A minimal sketch follows.
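Here is a tiny sketch of that root/instance heuristic over a toy Is-A edge list; real systems run it on large, noisy extracted graphs.

```python
from functools import lru_cache

# Toy Is-A edges: (parent, child).
edges = [("entity", "disease"), ("disease", "metabolic disorder"),
         ("metabolic disorder", "gout"), ("entity", "gout")]
children = {}
for parent, child in edges:
    children.setdefault(parent, []).append(child)

nodes = {n for e in edges for n in e}
roots = nodes - {c for _, c in edges}        # in-degree 0: taxonomy top candidates
instances = nodes - {p for p, _ in edges}    # out-degree 0: bottom-level instances

@lru_cache(maxsize=None)
def longest_path(node):
    paths = [longest_path(c) for c in children.get(node, [])]
    best = max(paths, key=len, default=[])
    return [node] + best

print(max((longest_path(r) for r in roots), key=len))
# ['entity', 'disease', 'metabolic disorder', 'gout']
```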
Others attach domain-related weights to the graph's edges and then use dynamic programming to find an optimal partition, e.g., the optimal branching algorithm.
The final construction step is cleaning the taxonomy, removing wrong Is-A relation data. The first key property is that a taxonomy's hierarchy contains no cycles; the Probase build removes cycle structures, cleaning out roughly 74K erroneous Is-A pairs.
Another big problem is the ambiguity of entity words, for which there is still no particularly effective solution. Especially in automated graph construction systems, expanding data via the "transitivity" mentioned above often brings a greater risk of dirty data. For example, given the two Is-A pairs:
(Albert Einstein, is-a, professor)
(professor, is-a, position)
we cannot obtain by transitivity:
(Albert Einstein, is-a, position)
Some work tries to learn multiple senses per entity word, but having multiple choices doesn't mean knowing which one is right. Most of the time, disambiguating entity words needs more corroborating evidence, which means you first need rich background knowledge; yet building that knowledge graph is exactly what we are doing, a chicken-and-egg problem. At the academic level, constructing a fully-disambiguated taxonomy still has a long way to go. Fortunately, there are many application-level tricks: collecting users' search and click logs, analyzing UGC content, mining them for signals that help disambiguation, and feeding the results back into the knowledge graph.
Above we briefly surveyed Taxonomy construction techniques; next let's look at complete graph-construction pipelines in the English and Chinese fields.
The construction of Probase
Starting from Microsoft's Probase, graph construction has emphasized the concept of a probabilistic taxonomy: knowledge does not exist in black and white, but with a certain probability. Retaining this uncertainty reduces the impact of noise in the data and helps subsequent knowledge computation. In Probase, each hypernym-hyponym pair keeps its co-occurrence frequency to represent the degree of certainty, as follows:
For example, company and google are associated 7,816 times during graph construction; this value feeds confidence calculations in downstream applications.
Probase's construction process can be understood as two steps. The first uses Hearst patterns to obtain candidate hypernym-hyponym noun-phrase pairs from the raw corpus.
After obtaining candidate pairs, they are merged into a tree data structure according to parent-child relations. The whole process is relatively simple, as follows:
mine the candidate entity pairs in the original sentences (such as (company, IBM), (company, Nokia)) to form small subtrees;
then merge the subtrees into a complete graph according to horizontal or vertical merging principles; the paper gives the complete construction procedure.
Notice that when merging subtrees, the merge is not performed directly; a similarity computation Sim(Child(T1), Child(T2)) makes the decision, and the similarity itself is simple: Jaccard similarity suffices.
Suppose there are three subtrees:
A = {Microsoft, IBM, HP}
B = {Microsoft, IBM, Intel}
C = {Microsoft, IBM, HP, EMC, Intel, Google, Apple}
By calculation, J(A,B) = 2/4 = 0.5 and J(A,C) = 3/7 ≈ 0.43. A threshold is set during construction, say 0.5: then A and B can be merged horizontally, while A and C are left apart. A minimal sketch of this test follows.
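Below is a minimal sketch of the Jaccard-based merge test, using the worked example above; Probase's actual merge logic distinguishes more cases (horizontal vs. vertical) than shown here.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

A = {"Microsoft", "IBM", "HP"}
B = {"Microsoft", "IBM", "Intel"}
C = {"Microsoft", "IBM", "HP", "EMC", "Intel", "Google", "Apple"}

THRESHOLD = 0.5  # follows the worked example above
for name, other in (("B", B), ("C", C)):
    sim = jaccard(A, other)
    print(f"J(A,{name}) = {sim:.2f} ->", "merge" if sim >= THRESHOLD else "keep separate")
```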
Problems in constructing a Chinese graph
In general, Probase's method is relatively simple; it provides a framework-level idea for construction, but many problems remain in the details. 《Learning Semantic Hierarchies via Word Embeddings》 discusses this in detail, especially for Chinese. The first problem is that Chinese grammar is more flexible than English: Chinese Hearst-style lexical patterns achieve high precision but very low recall compared with other methods, so the F1 is also poor, because manually enumerating all sentence structures is both difficult and inefficient, as the comparison of several methods in the paper shows.
On the other hand, using patterns alone is also weak at semantic hierarchy construction; a simple construction like Probase's easily produces missing or wrong relations.
A hybrid framework of projection models and patterns
At present, mainstream construction schemes, besides using Hearst patterns, commonly use distributional representations: for example, computing the PMI of a hypernym-hyponym pair, the standard IR Average Precision, or the offset between the two word vectors. Continuing the work of the Harbin Institute of Technology team, the team at East China Normal University built a Taxonomy construction method that combines word vectors, a projection model, and Hearst patterns. This section introduces the work discussed in 《Predicting hypernym-hyponym relations for Chinese taxonomy learning》.
Problem definition
First, based on the knowledge background and data of the vertical domain, we build the basic skeleton of a knowledge graph, called a Taxonomy and denoted T = (V, R), where V is the set of entities and R the set of relations.
Then is-a relation data is sampled from the graph T, denoted R, and its transitive closure is taken, denoted R*; the relation data in this set can be treated as high-confidence data.
Taking Baidu Baike as an example, the entity collection obtained from the external data source is denoted E; for each element x in the collection, its parent category is denoted Cat(x). The whole crawled dataset can then be expressed as:

$U = \{ (x, c) \mid x \in E,\; c \in Cat(x) \}$
The whole problem can then be defined as: based on R*, learn an algorithm F that can label the unlabeled data in U and merge it into T.
Basic framework
The article 《Learning Semantic Hierarchies via Word Embeddings》 uses word vectors: regularities like v(king) - v(queen) = v(man) - v(woman) can help predict hypernym-hyponym relations between entities during construction (for example, diabetes and abnormal glucose metabolism, or gastric ulcer and gastrointestinal disease). The present paper continues to use word vectors, with the idea of training a projection model in the word vector space. The overall framework is as follows:
The process can be expressed as: first, extract some initialization data from the existing Taxonomy and map it into the word embedding space; train a piecewise linear projection model in this space to obtain a prior representation of it. Then fetch new relation pairs from the data source; through the model's predictions plus other rule-based screening, extract a new batch of relations to add to the training set. This new data corrects the projection model in turn, and the cycle repeats.
Model definition
Following the description above, the first step uses a large corpus to train reliable word vectors. The author used the Skip-gram model, obtaining word vectors from a corpus of one billion words; the method won't be repeated here.
Given the word vectors, for a given word x, the conditional probability of a word u in its context can be expressed by the standard Skip-gram softmax:

$p(u \mid x) = \frac{\exp\left( v(u) \cdot v(x) \right)}{\sum_{u' \in V} \exp\left( v(u') \cdot v(x) \right)}$

where v(·) denotes taking a word's vector and V is the vocabulary built from the whole corpus.
The second step builds the projection model. The model is very simple: for a relation pair $(x_i, y_i)$, it assumes the hyponym vector can be carried onto the hypernym vector by a transformation matrix M and an offset vector b:

$v(y_i) \approx M \, v(x_i) + b$
In addition, drawing on earlier work, the author found experimentally that a single projection model cannot learn the mapping across the whole space well. For example, on an open-domain dataset, the spatial representation of natural-biology knowledge and that of financial-economic knowledge differ too much for a single model to cover. The solution is to train several models for different regions; rather than introducing prior knowledge, the author uses K-means to find the categories directly:
After the clustering, animals and professions fall into different clusters, and a projection model can then be built per cluster.
The optimization objective of the whole model is also easy to understand: after transforming the vector of entity x, it should lie as close as possible to the vector of entity y. The objective function is:

$\min \sum_{k=1}^{K} \sum_{(x_i, y_i) \in C_k} \left\| M_k \, v(x_i) + b_k - v(y_i) \right\|^2$

where k indexes the k-th cluster, $C_k$ is the relation dataset under that cluster, and optimization uses stochastic gradient descent (SGD). A minimal sketch of this piecewise projection follows.
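Here is a minimal sketch of the piecewise projection idea: cluster the hyponym vectors with K-means, then fit one affine map (M_k, b_k) per cluster. For brevity it uses closed-form least squares in place of the paper's SGD; dimensions and data are toy.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_piecewise_projection(X, Y, n_clusters=2):
    """X[i], Y[i]: hyponym / hypernym vectors of the i-th training pair."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X)
    models = {}
    for k in range(n_clusters):
        idx = km.labels_ == k
        # Augment with a bias column so lstsq learns [M_k | b_k] jointly.
        Xa = np.hstack([X[idx], np.ones((idx.sum(), 1))])
        W, *_ = np.linalg.lstsq(Xa, Y[idx], rcond=None)
        models[k] = W
    return km, models

def project(km, models, x):
    k = km.predict(x.reshape(1, -1))[0]
    return np.append(x, 1.0) @ models[k]  # v(y) ≈ M_k v(x) + b_k

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
Y = X @ rng.normal(size=(8, 8)) + 0.1
km, models = fit_piecewise_projection(X, Y)
print(project(km, models, X[0])[:3], Y[0][:3])  # projected vs. true hypernym vector
```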
Training procedure
The system is trained in a loop. The core idea is to dynamically expand the training set R(t) (t = 1, 2, ..., T) through clustering and the projection model; by continually retraining, the model's generalization ability on the target data source gradually improves.
First, the initialization step defines some notation; the loop then proceeds as follows:
Step 1.
Step 2.
After the model predicts, candidates still need to be filtered by patterns; several Chinese patterns are available.
After this screening we finally obtain a high-confidence relation set. Note that the filter f is not a simple pattern match: the "Is-A", "Such-As", and "Co-Hyponym" pattern families are each analyzed separately.
Based on this analysis, an algorithm decides which relation data in U(t)- may enter U(t)+. For each relation in the candidate set U(t)-, two confidence quantities are defined, positive and negative. In the positive score, a, ranging over (0, 1), is an adjustment factor and gamma is a smoothing coefficient; the paper uses the empirical values a = 0.5, gamma = 1.
A negative score NS(t) is defined as well: a high NS(t) means x_i and y_i are more likely to be "co-hyponyms" and less likely to stand in an "Is-A" relation.
Next, the algorithm maximizes the total positive score while bounding the negative scores, where m is the size of U(t)+ and theta is a constraint threshold. This problem turns out to be a special case of the budgeted maximum coverage problem, which is NP-hard, so a greedy algorithm is introduced to solve it. A sketch of the greedy strategy follows.
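Here is a generic greedy sketch for this family of problems; the marginal-gain scoring is an illustrative stand-in for the paper's exact positive/negative formulation.

```python
def greedy_select(candidates, budget):
    """candidates: list of (pair, positive_score, covered_elements)."""
    selected, covered = [], set()
    remaining = list(candidates)
    while remaining and len(selected) < budget:
        # Pick the candidate with the best marginal gain over what is covered.
        best = max(remaining, key=lambda c: c[1] * len(set(c[2]) - covered))
        if not set(best[2]) - covered:
            break  # no remaining candidate adds new coverage
        selected.append(best[0])
        covered |= set(best[2])
        remaining.remove(best)
    return selected

cands = [
    (("gout", "disease"), 0.9, {"s1", "s2"}),
    (("colitis", "disease"), 0.7, {"s2", "s3"}),
    (("HP", "company"), 0.8, {"s4"}),
]
print(greedy_select(cands, budget=2))
# [('gout', 'disease'), ('HP', 'company')]
```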
Finally, besides updating U(t)+, the selected relations in U(t)+ are merged with the original training data:
Step 3.
Step 4. Model prediction
Summary
The construction method surveyed here is one of the few works targeting Chinese; compared with English, both the pattern design and the data sources differ greatly. The second half of the paper also discusses problems found along the way. First, the system's clustering: during parameter tuning, the effect turned out not to be sensitive to the number of cluster centers K. With small K the differences are minor, with roughly K = 10 being best, but setting K too large makes the final result very poor. The author built an open-domain knowledge graph, whereas we are building a vertical-domain (medical) one; in practice we try splitting sub-datasets manually according to prior knowledge and training the projection models separately.
Second, in specific cases some hypernym-hyponym errors appear; for example, "herbs" was identified as the parent of "traditional Chinese medicine". Although most traditional Chinese medicine is indeed made from herbs, this is unreasonable from a classification standpoint. Such cases may stem from how Chinese is phrased in the data source, and they are hard to handle without external knowledge. In our own practice we design targeted patterns for specific data sources; for example, in the book 《Practical Diagnostics》, common sentence patterns describing attribute classification are:
According to the difference in *, it is divided into a1, a2, a3
x has * types: y1, y2, y3, etc.
Observing these manually up front helps improve both the recall and the precision of information extraction; a regex sketch for such patterns follows.
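Here is a small regex sketch for sentence patterns of this shape; the Chinese expressions are illustrative guesses at the book's phrasing ("根据…不同,分为…" / "x有…类型:…"), not verified templates.

```python
import re

PATTERNS = [
    # "According to the difference in *, it is divided into a1, a2, a3"
    re.compile(r"根据.+?不同[,,]\s*(?:可)?分为(?P<items>.+)"),
    # "x has * types: y1, y2, y3"
    re.compile(r"\S+有.+?(?:类型|种类)[::](?P<items>.+)"),
]

def extract_subtypes(sentence):
    for pat in PATTERNS:
        m = pat.search(sentence)
        if m:
            items = re.split(r"[、,,和及]|等", m.group("items"))
            return [i.strip() for i in items if i.strip()]
    return []

print(extract_subtypes("根据病因不同,分为原发性痛风、继发性痛风"))
# ['原发性痛风', '继发性痛风']
```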
In addition, some word vectors turn out to be poorly trained: in the paper's experiments, for example, the vectors of "plant" and "monocotyledon" are extremely similar, probably because some words are too low-frequency in the corpus. This is a hard limitation; we need to find ways to improve the effect of pretraining in the Chinese field.
References

1. 《Web Scale Taxonomy Cleansing》
2. 《Probase: A Probabilistic Taxonomy for Text Understanding》
3. 《An Inference Approach to Basic Level of Categorization》
4. 《Improving Hypernymy Detection with an Integrated Path-based and Distributional Method》
5. 《A Short Survey on Taxonomy Learning from Text Corpora: Issues, Resources and Recent Advances》
6. 《Learning Semantic Hierarchies via Word Embeddings》
7. 《Chinese Hypernym-Hyponym Extraction from User Generated Categories》
8. 《Learning Fine-grained Relations from Chinese User Generated Categories》
9. 《Predicting Hypernym-Hyponym Relations for Chinese Taxonomy Learning》
10. 《Unsupervised Learning of an IS-A Taxonomy from a Limited Domain-Specific Corpus》
11. 《Supervised Distributional Hypernym Discovery via Domain Adaptation》
12. 《Semantic Class Learning from the Web with Hyponym Pattern Linkage Graphs》
13. 《What Is This, Anyway: Automatic Hypernym Discovery》
14. 《Learning Word-Class Lattices for Definition and Hypernym Extraction》
15. 《Taxonomy Construction Using Syntactic Contextual Evidence》
16. 《Learning Syntactic Patterns for Automatic Hypernym Discovery》
17. 《A Semi-Supervised Method to Learn and Construct Taxonomies using the Web》
18. 《Entity Linking with a Knowledge Base: Issues, Techniques, and Solutions》
19. 《Incorporating Trustiness and Collective Synonym/Contrastive Evidence into Taxonomy Construction》
20. https://www.shangyexinzhi.com/Article/details/id-23935/
21. Xu Bo: Construction of an Encyclopedia Knowledge Graph (Zhihu)