
Research on technology subject division method based on patent multi-attribute fusion

2022-07-26 14:01:00 【Midoer technology house】

Abstract

[Objective] Dividing technical topics reasonably, effectively and accurately is of great significance. This paper integrates multiple attributes of patents to improve the division of technical topics. [Methods] Patent text vectors, patent citation vectors and patent classification vectors are constructed from the text content, citation relations and classification information of patents. The three are combined into a patent vector based on multi-attribute fusion, and technical topics are then obtained by clustering patents. [Results] Compared with patent vector representations based on one or two attributes, the multi-attribute fusion method achieves higher patent classification precision, recall and F1 on sample sets of different IPC levels and different sizes. It measures patent similarity more accurately, which indirectly shows that the technology topic division method based on patent multi-attribute fusion has an advantage. [Limitations] The effect of technical topic division is evaluated indirectly through an automatic patent classification experiment rather than directly. [Conclusions] The technology topic division method based on patent multi-attribute fusion combines the ability of different patent attributes to represent technical topics, and improves the accuracy of patent similarity measurement and technical topic division.

Keywords: multi-attribute fusion; technical topics; patent similarity

1 Introduction

The division of technical topics underlies related research such as emerging technology identification, technology evolution analysis and technology trend analysis, so dividing technical topics reasonably, effectively and accurately is of great significance. A technical topic usually refers to a branch technology, sub-technology or technical direction within a technical field. In studies based on papers or patents, a group of keywords/phrases or a group of documents is usually used to reveal the core content of a technical topic [1]. To address the shortcomings of the main current methods, such as patent citation network clustering and text mining, this paper proposes a technology topic division method based on patent multi-attribute fusion. The method integrates the text content, citation relations and classification information of patents to represent patent vectors and calculate patent similarity, and obtains technical topics by clustering patents. It aims to improve the accuracy of patent vector representation and patent similarity measurement, and thereby the quality of technical topic division.

2 Research status at home and abroad

Methods for dividing technical topics mainly include traditional methods based on paper or patent classification systems, bibliometric methods and text mining methods. Among them, bibliometric methods such as citation analysis and co-word analysis, and text mining methods represented by the vector space model and the LDA topic model, are the most widely used.

Some scholars use co-word analysis to divide technical topics. Based on the co-occurrence of pairs of words or phrases, they apply multidimensional scaling, principal component analysis or cluster analysis to obtain technical topics represented by word clusters. Shen Jun et al. [2] took third-generation mobile communication technology as an example, extracted subject terms from the titles and abstracts of patent documents, and clustered the co-word network so that each cluster represents a technical topic. They then analyzed the distribution and development trends of technical topics in the field. Huang Lu et al. [3] predicted emerging technology topics based on weighted network link prediction: the network after link prediction is converted into a matrix, low-frequency words are removed, and hierarchical clustering of the patent co-word network yields the technical topics.

Other scholars cluster the direct citation network, bibliographic coupling network or co-citation network of papers or patents, grouping documents with similar topics into technical topics. Kajikawa et al. [4] obtained technical topics in the energy field by clustering the direct citation network of papers, in order to track emerging technological changes in the field. Small et al. [5] combined co-citation network clustering with direct citation network clustering to identify science and technology topics that are novel and fast-growing. Hopcroft et al. [6] identified several emerging topics in the computer field through bibliographic coupling analysis.

With the wide application of text mining, methods such as patent text clustering and LDA topic models are increasingly used for technical topic division. The core of patent text clustering is the vector representation of patent text. Traditional text representation models are mainly bag-of-words models such as the Boolean model, the vector space model and the probabilistic model. More recently, some scholars have used deep learning models such as Word2Vec and Doc2Vec for word vector learning and text vector representation. Feng et al. [7] used text mining and Word2Vec cluster analysis to study the discovery of patent technology opportunities. Xue Jincheng et al. [8] used the Word2Vec algorithm and took the average of the word vectors as the text vector of a patent. Scholars at home and abroad have also explored the LDA topic model extensively for analyzing patent technology topics. For example, Trappey et al. [9], Wang et al. [10], Zhoujingsheng [11] and Luo Jian et al. [12] used the LDA topic model to extract patent technology topics for patent technology trend analysis, technology topic value assessment and emerging technology topic identification.

The above methods still fall short in accurately reflecting the semantic and thematic relationships between words or texts and in accurately dividing technical topics. For example, citing behavior is subjective and serves diverse purposes: the core topics of documents linked by citation are not necessarily closely related, and topically related documents do not necessarily cite each other. Text mining methods can go deep into the text and reveal the topic of a document, but they are limited by current word segmentation, term extraction and feature selection techniques, so the accuracy of their technical topic division is also hard to raise to a fully satisfactory level. Multiple attributes of patent documents can therefore be considered together to improve the accuracy of the vector representation.

Some studies already combine patent citations and text content to improve the accuracy of patent similarity measurement and technical topic division. In a study of technological evolution, Liu Xiaoling et al. [13] used the weighted average of the patent citation relationship and patent text similarity as the edge weight in the patent network, and revealed the evolution of the technology by clustering that network. Xiao Xue et al. [14] proposed a community partition method based on a weighted citation network, in which an improved cosine similarity is used to calculate the similarity between papers and serves as the edge weight in the citation network. Methods of this kind fix the relative importance of citation relations and text content in the similarity measure in advance, which is highly subjective. This paper instead fuses the different patent attributes in a more objective way at the level of the patent vector representation, in order to improve the accuracy of technical topic division.

3 Method design

3.1 Method design ideas and theoretical basis

Patent technology topics have two characteristics. First, each patent has a single topic: a patent document covers only one technical topic and usually describes only one device, piece of equipment, material, method or process. Second, patents are written in diverse ways: the same meaning may be expressed by many different technical terms, that is, technical subject terms have synonyms [15,16]. Therefore, patents can be clustered directly according to patent similarity to obtain technical topics.

In this paper, the text content, citation relations and classification information of a patent are called the attributes of the patent document. These attributes reflect different characteristics of patent documents as carriers of recorded technical knowledge, and all of them contain information about the technical topic of the patent. Patent text analysis and patent citation analysis are two relatively independent methods. Each relies on a single attribute of the patent document and rests on a different theoretical basis, so the two can compensate for each other's weaknesses while retaining their own strengths. Besides citations and text, the classification number of a patent is also important information reflecting its technical field or technical topic. Patents with the same classification number usually belong to the same technical field and have similar technical topics, so patent classification can also serve as a factor in patent similarity calculation and technical topic division. The theoretical assumptions of patent similarity measurement based on multi-attribute fusion are shown in Table 1. Based on Table 1, this paper integrates the text content, citation relations and classification information of patents and proposes a patent similarity calculation method based on multi-attribute fusion.

Table 1 Theoretical assumptions of patent similarity measurement based on multi-attribute fusion

Similarity computed from a single attribute	Actual patent similarity	Correction of similarity by other attributes
Patents A and B share few patent references or citing patents, so the citation-based similarity is low	A and B have similar text content and the same or similar classification numbers; their actual technical topics are similar	The similarity based on text content or classification is high and can correct the citation-based similarity
Patents A and B share many patent references or citing patents, so the citation-based similarity is high	The citations of A and B do not concern their core related technologies; their actual technical topics are not very similar	The similarity based on text content or classification is low and can correct the citation-based similarity
Patents A and B have low text-content similarity	A and B differ in writing habits and wording but share many patent references or citing patents; their technical topics are in fact similar	The similarity based on citation relations or classification is high and can correct the text-based similarity

3.2 Method flow

The process of the technology topic division method based on patent multi-attribute fusion is shown in Fig. 1.

(1) Represent patents as vectors based on their text content, citation relations and classification information. ① Train a Doc2Vec model on patent titles and abstracts to obtain the text vector of each patent (patent text vector for short). ② Build a patent vector from the number of patent references and citing patents shared between patents (patent citation vector for short). ③ Build a patent vector from the number of shared patent classification numbers (patent classification vector for short).

(2) Determine the dimensions of the three types of vectors experimentally, concatenate the three end to end to form a patent multi-attribute vector, and calculate the distance between vectors to form a patent similarity matrix.

(3) Apply a hierarchical clustering algorithm to the patent similarity matrix to obtain several patent clusters, each of which represents a patent technology topic.

(4) Use the TF-IDF or TextRank algorithm to extract, from patent titles and abstracts, the keywords/phrases that represent the core content of each patent cluster.
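Step (3) above can be illustrated with a small, standard-library-only sketch. It is not the authors' implementation: the source names only "hierarchical clustering", so average linkage and a fixed similarity threshold as the stopping rule are assumptions, and the function name `cluster_patents` and the toy matrix are hypothetical.

```python
from itertools import combinations


def cluster_patents(sim, threshold):
    """Agglomerative (average-linkage) clustering on a similarity matrix.

    sim       -- symmetric n x n list of lists of pairwise patent similarities
    threshold -- assumed stopping rule: stop merging once the best average
                 similarity between two clusters drops below this value
    Returns a sorted list of clusters, each a sorted list of patent indices.
    """
    clusters = [[i] for i in range(len(sim))]

    def avg_sim(a, b):
        # Average linkage: mean similarity over all cross-cluster pairs.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the most similar pair of clusters.
        (x, y), best = max(
            (((x, y), avg_sim(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        merged = sorted(clusters[x] + clusters[y])
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return sorted(clusters)


# Toy similarity matrix for four hypothetical patents.
toy_sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
topics = cluster_patents(toy_sim, threshold=0.5)
print(topics)  # two clusters: patents 0 and 1, patents 2 and 3
```

In practice the threshold (or the number of clusters) would have to be tuned on the real patent set; each resulting cluster corresponds to one candidate technical topic.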

Fig. 1 Process of technology topic division based on patent multi-attribute fusion

3.3 Patent similarity measurement based on multi-attribute fusion

The patent similarity calculation method based on multi-attribute fusion is shown in Fig. 2. It consists of four parts: patent text vector representation, patent citation vector representation, patent classification vector representation and patent similarity calculation.

Fig. 2 Schematic diagram of the patent similarity calculation method based on multi-attribute fusion

(1) Patent text vector representation based on the Doc2Vec model

The Doc2Vec model is used to represent patent text as vectors. Doc2Vec is a distributed document representation method that converts sentences or paragraphs directly into fixed-dimension vectors, taking into account the semantics and contextual information of words, sentences and even paragraphs. In recent years, some scholars have explored the application of Doc2Vec to patent text analysis, for example to discover technical topics and to analyze patent similarity [17,18].

The vectors of the words appearing in the patent title and abstract are fed to the input layer of the Doc2Vec DM (Distributed Memory) model, together with a paragraph vector for the concatenated title and abstract. Word vectors and paragraph vectors are adjusted continuously during training, and the resulting paragraph vector of the title and abstract is the patent text vector.

The text vector $V(P_i)$ of patent $P_i$ is given by formula (1).

$$V(P_i) = \langle w_1, w_2, \ldots, w_m \rangle \tag{1}$$

where $V(P_i)$ is the text vector of patent $P_i$, $m$ is the dimension of the vector, and $w_m$ is the value of the $m$-th dimension. $m$ is usually set to 50 or 100, but the value must be determined through repeated experiments.

(2) Patent citation vector representation based on citation relations

Citation relations between documents reflect, to some extent, the inheritance of knowledge and the relatedness of subject content. Bibliographic coupling and co-citation analysis can confirm and complement each other in analyzing the topical similarity of documents [19,20]. This paper therefore takes the bibliographic coupling and co-citation relations of patents as one basis for dividing technical topics.

Taking patents $P_i$ and $P_j$ as an example, their common citation strength is given by formula (2).

$$\mathrm{SCI}(P_i, P_j) = \frac{c(P_i, P_j)}{a(P_i, P_j)} \tag{2}$$

where $\mathrm{SCI}(P_i, P_j)$ is the common citation strength of $P_i$ and $P_j$, $c(P_i, P_j)$ is the number of patents in the intersection of their citation sets, and $a(P_i, P_j)$ is the number of patents in the union of their citation sets. The greater the common citation strength, the stronger the citation relationship between the two patents and the greater the topical similarity revealed by that relationship.
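Formula (2) is the Jaccard coefficient of the two citation sets. The sketch below is a minimal illustration; the function name, the patent identifiers and the choice to return 0 when both sets are empty are assumptions, not details given in the source.

```python
def common_citation_strength(cites_i, cites_j):
    """Formula (2): size of the intersection over size of the union."""
    cites_i, cites_j = set(cites_i), set(cites_j)
    union = cites_i | cites_j
    if not union:
        # Assumption: two patents without any citations get strength 0.
        return 0.0
    return len(cites_i & cites_j) / len(union)


# Hypothetical citation sets (patent references plus citing patents).
p_i = {"REF1", "REF2", "REF3"}
p_j = {"REF2", "REF3", "REF4", "REF5"}
strength = common_citation_strength(p_i, p_j)
print(strength)  # 2 shared / 5 in the union
```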

From the common citation strength between patents, the common citation strength matrix of the patent set is obtained, as shown in formula (3).

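$$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} \tag{3}$$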

Matrix $A$ is symmetric. Each element is the common citation strength between two patents, and $n$ is the number of patents in the patent set. The $i$-th row of $A$ is taken as the citation vector of patent $P_i$, as shown in formula (4).

$$A(P_i) = \langle a_{i1}, a_{i2}, \ldots, a_{in} \rangle \tag{4}$$
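Formulas (3) and (4) can be sketched as follows: the matrix is filled with pairwise common citation strengths, and each row is read off as a patent citation vector. The helper names and the toy citation sets are hypothetical, and the diagonal value (1 for any patent with a non-empty citation set) simply follows from applying formula (2) to a patent and itself, which the source does not state explicitly.

```python
def jaccard(a, b):
    """Common citation strength of two citation sets (formula (2))."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def strength_matrix(citation_sets):
    """Build the n x n matrix of formula (3) from {patent_id: citation set}."""
    ids = sorted(citation_sets)
    matrix = [[jaccard(citation_sets[p], citation_sets[q]) for q in ids]
              for p in ids]
    return ids, matrix


# Hypothetical patents and citation sets.
citations = {
    "P1": {"X", "Y"},
    "P2": {"Y", "Z"},
    "P3": {"W"},
}
ids, A = strength_matrix(citations)

# Formula (4): the citation vector of P1 is its row in A.
vector_p1 = A[ids.index("P1")]
print(vector_p1)
```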

（3） Patent classification vector representation based on patent classification number

The International Patent Classification (IPC) adopts a hierarchical structure that expands step by step through six levels: section, subsection, class, subclass, main group and subgroup. The more classification numbers two patents share, the more similar they are in technical function and application field, and the more closely related their technical topics are. The IPC number can therefore be used as an attribute for judging the similarity of technical topics. This paper uses the IPC subclass (4-character code) to calculate the common classification strength of patents, as shown in formula (5).

$$\mathrm{SCL}(P_i, P_j) = \frac{n(P_i, P_j)}{m(P_i, P_j)} \tag{5}$$

where $\mathrm{SCL}(P_i, P_j)$ is the common classification strength of $P_i$ and $P_j$, $n(P_i, P_j)$ is the number of IPC classification numbers they share, and $m(P_i, P_j)$ is the number of classification numbers in the union of their classification numbers. The greater the common classification strength, the more IPC numbers the two patents share and the more similar the topics revealed by patent classification.
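Formula (5) has the same form as formula (2), applied to IPC subclasses. The sketch below assumes that full IPC symbols are truncated to their first four characters to obtain the subclass; the function names and the example symbols are illustrative only.

```python
def ipc_subclasses(ipc_codes):
    """Reduce full IPC symbols to 4-character subclasses (e.g. G06F)."""
    return {code.replace(" ", "")[:4].upper() for code in ipc_codes}


def common_classification_strength(ipc_i, ipc_j):
    """Formula (5): shared subclasses over the union of subclasses."""
    sub_i, sub_j = ipc_subclasses(ipc_i), ipc_subclasses(ipc_j)
    union = sub_i | sub_j
    # Assumption: patents without any classification number get strength 0.
    return len(sub_i & sub_j) / len(union) if union else 0.0


# Illustrative IPC symbols for two hypothetical patents.
patent_i = ["G06F 17/27", "G10L 15/22"]
patent_j = ["G06F 40/232", "G06N 5/02"]
scl = common_classification_strength(patent_i, patent_j)
print(scl)  # one shared subclass (G06F) out of three in the union
```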

From the common classification strength between patents, the common classification strength matrix of the patent set is obtained, as shown in formula (6).

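$$B = \begin{bmatrix} b_{11} & b_{12} & \cdots & b_{1n} \\ b_{21} & b_{22} & \cdots & b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ b_{n1} & b_{n2} & \cdots & b_{nn} \end{bmatrix} \tag{6}$$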

Each element of matrix $B$ is the common classification strength between two patents, and $n$ is the number of patents in the patent set. The $i$-th row of $B$ is taken as the classification vector of patent $P_i$, as shown in formula (7).

$$B(P_i) = \langle b_{i1}, b_{i2}, \ldots, b_{in} \rangle \tag{7}$$

（4） Patent similarity calculation based on multi-attribute fusion

The three kinds of vectors are concatenated end to end to obtain the multi-attribute fusion patent vector. Their respective dimensions are determined experimentally and reduced by singular value decomposition (SVD). The cosine distance between multi-attribute vectors is then calculated to obtain the multi-attribute fusion patent similarity.
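The fusion step can be sketched with plain Python lists. The example assumes the three vectors have already been reduced to their target dimensions, and it reports cosine similarity (one minus cosine distance under the usual convention); the function names and the toy vectors are hypothetical.

```python
import math


def fuse(text_vec, citation_vec, class_vec):
    """Concatenate the three attribute vectors end to end."""
    return list(text_vec) + list(citation_vec) + list(class_vec)


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an all-zero vector is similar to nothing
    return dot / (norm_u * norm_v)


# Two toy patents whose text vectors are orthogonal but whose citation
# and classification vectors agree.
p1 = fuse([1.0, 0.0], [0.5], [1.0])
p2 = fuse([0.0, 1.0], [0.5], [1.0])

text_only = cosine_similarity([1.0, 0.0], [0.0, 1.0])
fused = cosine_similarity(p1, p2)
print(text_only, fused)
```

In this toy case the text-only similarity is zero, while the fused vectors are clearly similar. This is the kind of correction by other attributes that Table 1 describes.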

To test the proposed method, the accuracy of patent similarity measurement based on multi-attribute fusion is compared with that based on one or two attributes. Manually labelled training data for patent text similarity are hard to obtain, and manual judgment of patent similarity at scale is time-consuming, labor-intensive and not necessarily accurate. Therefore, IPC classification numbers are used as the class labels of patents, the XGBoost classification algorithm is used for automatic patent classification, and classification accuracy serves as an indirect measure of the accuracy of the patent vector representation and similarity measurement.

4 Empirical analysis: natural language processing technology as an example

Natural language processing (NLP) is a very important branch of artificial intelligence, known as "the jewel in the crown of artificial intelligence". The main purpose of NLP is to overcome the various limitations of human-machine dialogue, so that users can talk to computers in their own language and computers can understand natural language. NLP research includes basic research, generic technologies and applied research [21,22].

The patent data come from the Clarivate Derwent Innovation database (DI database for short). The search query for NLP patents was constructed from the classification of NLP technologies published by the World Intellectual Property Organization (WIPO), together with the associated keywords/phrases, IPC classification numbers and CPC classification numbers. WIPO divides NLP into 8 sub-technologies: general NLP technology, human-machine dialogue, information extraction, machine translation, morphology, natural language generation, semantics, and sentiment analysis. The search returned 15,429 invention patents granted by the United States Patent and Trademark Office. The search date was June 15, 2020.

The multi-attribute fusion patent similarity is calculated with the method designed in this paper. To verify the hypothesis that this method measures patent similarity more accurately, it is compared with methods that use only one or two patent attributes. IPC classification numbers are used as class labels, and the XGBoost classification algorithm is trained on the patent vectors obtained from the different attributes. The more accurate the classification, the better the corresponding patent vector reflects the subject content of the patent. This indirectly indicates a more accurate patent similarity measure and a more reasonable and accurate division of technical topics.

4.1 Patent vector representation and evaluation based on multi-attribute fusion

The patent text vector is obtained by training a Doc2Vec model and has 300 dimensions. The patent classification vector is calculated from IPC subclasses (4-character codes). The patent citation vector and the patent classification vector are each compressed to 300 dimensions, and the three vectors are concatenated end to end to give a 900-dimensional multi-attribute fusion vector.
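The compression and concatenation described above can be sketched with NumPy. The source states only that SVD is used, so keeping the top-k left singular vectors scaled by their singular values (as in latent semantic analysis) is an assumption, as are the function name `compress_rows` and the random toy matrices; the real experiment uses 300 dimensions per attribute rather than the small sizes shown here.

```python
import numpy as np


def compress_rows(matrix, k):
    """Represent each row of `matrix` by k coordinates via truncated SVD."""
    u, s, _ = np.linalg.svd(np.asarray(matrix, dtype=float),
                            full_matrices=False)
    return u[:, :k] * s[:k]  # shape: (number of patents, k)


rng = np.random.default_rng(0)
n = 6  # toy patent set

# Random symmetric stand-ins for the strength matrices A and B.
A = rng.random((n, n))
A = (A + A.T) / 2
B = rng.random((n, n))
B = (B + B.T) / 2

text = rng.random((n, 4))       # stand-in for Doc2Vec text vectors
citation = compress_rows(A, 3)  # compressed citation vectors
classes = compress_rows(B, 2)   # compressed classification vectors

# End-to-end concatenation gives one fused vector per patent.
fused = np.hstack([text, citation, classes])
print(fused.shape)
```

With k equal to n nothing is lost: the inner products between the compressed rows equal those between the original rows, so truncation only discards the directions with the smallest singular values.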

The same approach is used to represent patent vectors based on two attributes. There are three combinations: text vector and citation vector, text vector and classification vector, and citation vector and classification vector. Each of these combined vectors has 600 dimensions.

To compare the patent classification performance of the different methods at different IPC levels, three sample sets are constructed from patent data under IPC subclasses (4-character codes) and IPC main groups (6-character codes). Sample set 1 contains all patents under the 10 IPC subclasses with the most patents (G06F, G10L, G06N, H04L, G06K, G06Q, H04M, H04N, G09B, G06T), 14,897 in total. Sample set 2 contains all patents under the 10 IPC main groups with the most patents (G06F17, G10L15, G06F03, G06F07, G06F16, G06F15, G06F09, G06K09, G06F40, G06N05), 12,913 in total. Sample set 3 contains the patents under the IPC main groups ranked 2nd to 6th by number of patents, 3,413 in total; its classes are more balanced. The XGBoost algorithm is used for automatic classification, with 75% of the data as the training set and 25% as the test set.

The multi-attribute fusion method outperforms the other methods on all sample sets, and the representations that fuse two attributes outperform those based on a single attribute. For sample set 1, the precision, recall and F1 of the multi-attribute fusion representation (Doc2Vec_CIT_IPC) reach 0.853, 0.864 and 0.847 respectively, all higher than the other methods. Among the three single-attribute methods Doc2Vec, CIT and IPC, IPC scores highest on all three metrics; the precision of Doc2Vec is lower than that of CIT, but its recall and F1 are higher. An important reason why IPC performs clearly better than the other two is that it generates patent vectors from the number of shared IPC classification numbers, so it naturally predicts IPC classification numbers more accurately. Even so, all three metrics of the IPC method are lower than those of the multi-attribute fusion method. Among the two-attribute methods, fusing classification information with either text content or citation relations gives comparable results, whereas fusing text content with citation relations performs relatively poorly.

In sample sets 2 and 3 the IPC level is the main group (6-character code), which is finer than in sample set 1, so patents under the same classification number are more similar. The number of patents per class is more balanced in sample set 3 than in sample set 2. On sample set 2, the precision of the multi-attribute fusion representation is slightly lower than that of Doc2Vec_IPC, but its recall and F1 are higher than those of all other methods. On the more balanced sample set 3, the multi-attribute fusion method classifies markedly better than the other methods.

The results on the three sample sets therefore show that the patent vector representation based on multi-attribute fusion outperforms representations based on one or two attributes and predicts IPC classification numbers well at different levels. This indicates that the method represents the subject content of patents better and helps divide technical topics more accurately. Table 2 compares the patent classification results of the multi-attribute fusion representation with those of the other methods.

Table 2 Patent classification results of the patent vector representation based on multi-attribute fusion and of other methods

Patent vector representation	Attributes	IPC_4 (Top10) P	IPC_4 (Top10) R	IPC_4 (Top10) F1	IPC_6 (Top10) P	IPC_6 (Top10) R	IPC_6 (Top10) F1	IPC_6 (Top2-6) P	IPC_6 (Top2-6) R	IPC_6 (Top2-6) F1
Doc2Vec_CIT_IPC	Multiple attributes	0.853	0.864	0.847	0.708	0.725	0.662	0.694	0.700	0.692
Doc2Vec	Single attribute	0.777	0.816	0.760	0.644	0.680	0.590	0.609	0.617	0.609
CIT	Single attribute	0.791	0.812	0.757	0.684	0.683	0.592	0.540	0.547	0.539
IPC	Single attribute	0.820	0.836	0.801	0.623	0.700	0.610	0.586	0.542	0.512
Doc2Vec_CIT	Two attributes	0.814	0.829	0.779	0.674	0.693	0.610	0.644	0.650	0.644
Doc2Vec_IPC	Two attributes	0.848	0.857	0.837	0.712	0.719	0.649	0.653	0.664	0.652
CIT_IPC	Two attributes	0.844	0.855	0.838	0.696	0.724	0.659	0.639	0.652	0.641

(Note: Doc2Vec_CIT_IPC is the patent vector representation based on multi-attribute fusion; CIT is the patent citation vector representation based on patent citation relations; IPC is the patent classification vector representation based on patent classification; Doc2Vec_CIT, Doc2Vec_IPC and CIT_IPC are the patent vector representations that fuse two attributes. IPC_4 (Top10), IPC_6 (Top10) and IPC_6 (Top2-6) are sample sets 1, 2 and 3 respectively. P and R denote precision and recall.)
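For readers who want to reproduce metrics of the kind reported in Table 2, the sketch below computes precision, recall and F1 for a multi-class prediction. The source does not say how per-class scores are averaged, so the macro average used here is an assumption, and the labels are toy data rather than results from the paper.

```python
def precision_recall_f1(y_true, y_pred):
    """Macro-averaged precision, recall and F1 over all observed labels."""
    labels = sorted(set(y_true) | set(y_pred))
    p_sum = r_sum = f_sum = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        p_sum += precision
        r_sum += recall
        f_sum += f1
    n = len(labels)
    return p_sum / n, r_sum / n, f_sum / n


# Toy IPC subclass labels: one G06F patent is misclassified as G10L.
y_true = ["G06F", "G06F", "G10L", "G10L"]
y_pred = ["G06F", "G10L", "G10L", "G10L"]
P, R, F1 = precision_recall_f1(y_true, y_pred)
print(round(P, 3), round(R, 3), round(F1, 3))
```

Because the scores are averaged over classes, the averaged F1 is not the harmonic mean of the averaged P and R.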

4.2 Patent similarity measurement and evaluation based on multi-attribute fusion

On the more balanced sample set 3, the patent text vector representation based on the Doc2Vec model performs best among the single-attribute methods, followed by the citation vector and then the classification vector. Because the classification vector is derived from IPC subclasses, its classification granularity is coarse. The dimensions of the patent text vector, citation vector and classification vector are therefore set to 300, 250 and 50 respectively, and the three are concatenated end to end into a 600-dimensional patent multi-attribute vector, from which the multi-attribute fusion patent similarity is calculated.

An NLP patent, US9275036B2, was selected at random, and the 20 patents with the highest multi-attribute fusion similarity, text similarity and citation similarity to it were extracted for comparison, as shown in Tables 3 to 5. The patent describes a system and method for adaptive spell checking and correction: for strings that were previously changed or are unrecognized, it provides a list of historical replacement strings and substitutes the selected string.

Table 3 Top 20 patents with the highest citation similarity to patent US9275036B2

Rank	Publication number	Patent title	Application year	Similarity
1	US10229108B2	System and method for adaptive spell checking	2016	0.773
2	US6732333B2	System and method for managing statistical data related to correction of word processing documents	2001	0.085
3	US7647554B2	System and method for improved spell checking	2006	0.071
4	US8543378B1	System and method for recognizing misspelled words	2003	0.068
5	US9489372B2	Web-based spell checker	2013	0.060
6	US9069753B2	Proximity measurement between misspelled input and expected input	2010	0.057
7	US4730269A	Method and apparatus for generating word skeletons using alpha sets	1986	0.053
8	US5765180A	Method and system for correcting misspelled words	1996	0.046
9	US5572423A	Spelling correction using error frequencies	1995	0.045
10	US7669112B2	Automatic spelling analysis	2007	0.044
……
20	US10310628B2	Input error correction method	2013	0.038

Table 4 Top 20 patents with the highest text similarity to patent US9275036B2

Rank	Publication number	Patent title	Application year	Similarity
1	US10229108B2	System and method for adaptive spell checking	2016	0.992
2	US9779080B2	Text auto-correction using N-grams	2012	0.813
3	US5765180A	Method and system for correcting misspelled words	1996	0.782
4	US5604897A	Method and system for correcting misspelled words	1990	0.781
5	US10468015B2	Automated TTS self-tuning system	2017	0.775
6	US4777596B1	Text replacement typing aid for computer text editors	1986	0.769
7	US10318631B2	Removable spell checker device	2018	0.765
8	US5270927A	Method for converting Chinese speech into Chinese characters	1990	0.765
9	US4783758A	Automatic word replacement using a numerical ranking of structural differences between misspelled words and candidate replacement words	1985	0.763
10	US8543378B1	System and method for recognizing misspelled words	2003	0.755
……
20	US10467338B2	Correcting user input	2017	0.740

Table 5 Top 20 patents with the highest similarity to patent US9275036B2, calculated with the multi-attribute fusion method

Rank	Publication number	Patent title	Application year	Similarity
1	US10229108B2	System and method for adaptive spell checking	2016	0.996
2	US9779080B2	Text auto-correction using N-grams	2012	0.926
3	US5765180A	Method and system for correcting misspelled words	1996	0.913
4	US5604897A	Method and system for correcting misspelled words	1990	0.913
5	US4777596B1	Text replacement typing aid for computer text editors	1986	0.908
6	US10318631B2	Removable spell checker device	2018	0.907
7	US4783758A	Automatic word replacement using a numerical ranking of structural differences between misspelled words and candidate replacement words	1985	0.906
8	US8543378B1	System and method for recognizing misspelled words	2003	0.903
9	US5276741A	Fuzzy string matcher	1991	0.903
10	US5761687A	Character-based correction method with correction propagation	1995	0.902
11	US7831911B2	Spell checking system including a voice speller	2006	0.901
……
20	US8457946B2	Recognition architecture for generating Asian characters	2007	0.888

All of the methods identify "US10229108B2" as the patent most similar to the target patent "US9275036B2". This patent is a technological update of the target patent: the titles and abstracts of the two patents are almost identical, and most of their citations and references are the same. Apart from this most similar patent, the citation similarity between the other patents and the target patent is low. The patent ranked 3rd, "US7647554B2", has a citation similarity of only 0.071, because it shares only 4 cited patents and patent references with the target patent, whereas the cited patents and patent references of the two number 54. This shows that even when topics are highly related, the citation relationship between patents may be weak or absent. Moreover, patents whose subject matter is similar to that of the target patent also receive low similarity scores when they share few citations with it.
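The effect described above, where patents on the same topic share few citations and therefore score low, can be illustrated with a set-overlap measure. The following is a minimal sketch that assumes a Jaccard coefficient over the sets of cited documents; the exact citation similarity formula is not given in this section, so both the function name `citation_similarity` and the toy identifiers are hypothetical.

```python
def citation_similarity(refs_a, refs_b):
    """Jaccard overlap of two sets of cited documents (an assumed measure)."""
    a, b = set(refs_a), set(refs_b)
    union = a | b
    if not union:
        # Two patents without any citations have no evidence of similarity.
        return 0.0
    return len(a & b) / len(union)


# Toy data with hypothetical identifiers, not taken from the paper.
target = {"R1", "R2", "R3", "R4", "R5", "R6"}
same_topic = {"R1", "R7", "R8", "R9", "R10"}

# Only one shared citation out of ten distinct ones.
print(citation_similarity(target, same_topic))  # 0.1
```

Even a small absolute number of shared citations yields a low score once the union of citations is large, which is consistent with the weak citation similarity reported for topically related patents.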

The 20 patents with the highest text similarity to the target patent are listed in Table 4. Most of them concern methods and systems for spell checking, automatic correction of character input, and the like, and are highly similar in subject to the target patent. However, the patent ranked 8th, "US5270927A", belongs to speech recognition technology and is only weakly related to the subject of the target patent, which indicates that the Doc2Vec model's judgment of text semantics is not completely accurate. Combining the text with patent citation relationships and classification information corrects the similarity between this patent and the target patent: in the multi-attribute fusion ranking in Table 5 it no longer appears in the top 20, which is more consistent with the actual situation.
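The correction described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's implementation: it assumes the three attribute vectors are L2-normalized, weighted, and concatenated, and that similarity is measured with the cosine. The function names, the weights, and the toy vectors are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; leave all-zero vectors unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def fuse(text_vec, cite_vec, class_vec, weights=(1.0, 1.0, 1.0)):
    """Concatenate the weighted, normalized attribute vectors (assumed scheme)."""
    fused = []
    for w, vec in zip(weights, (text_vec, cite_vec, class_vec)):
        fused.extend(w * x for x in l2_normalize(vec))
    return fused


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


# Toy vectors (hypothetical): similar text, disjoint citations and classes.
target_text, target_cite, target_class = [1.0, 0.0], [1.0, 1.0, 0.0], [1.0, 0.0]
other_text, other_cite, other_class = [0.9, 0.1], [0.0, 0.0, 1.0], [0.0, 1.0]

text_sim = cosine(target_text, other_text)
fused_sim = cosine(
    fuse(target_text, target_cite, target_class),
    fuse(other_text, other_cite, other_class),
)
print(text_sim > fused_sim)  # True: the fused score is pulled down
```

With equal weights and non-zero attribute vectors, the fused cosine is the mean of the three per-attribute cosines, so a patent that only looks similar in its text loses rank once citations and classification disagree.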

4.3 Patent clustering and subject word extraction

The automatic patent classification experiment shows that the patent vector representation based on multi-attribute fusion reveals the subject of a patent more accurately. A bottom-up hierarchical clustering algorithm is therefore used to cluster the 9 800 patents applied for in the NLP field between 2008 and 2017. After many experiments, the more accurate clustering result is selected, namely the one that ensures the patent topics within each cluster are highly related, which yields 473 clusters. Because the patent similarity matrix is not pruned and weakly related patents are retained, a relatively large number of clusters is obtained, and some clusters contain few patents.
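As a rough illustration of bottom-up clustering on an unpruned similarity matrix, the sketch below repeatedly merges the two most similar clusters until no pair reaches a threshold. The average-linkage criterion, the `threshold` parameter, and the toy matrix are assumptions made for this example; the section does not state which linkage or stopping rule was used.

```python
def agglomerative(sim, threshold):
    """Average-linkage bottom-up clustering on a full similarity matrix.

    sim is a symmetric list of lists; merging stops when the best
    average similarity between any two clusters falls below threshold.
    """
    clusters = [[i] for i in range(len(sim))]

    def avg_sim(c1, c2):
        return sum(sim[i][j] for i in c1 for j in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, pair = float("-inf"), None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = avg_sim(clusters[a], clusters[b])
                if s > best:
                    best, pair = s, (a, b)
        if best < threshold:
            break
        a, b = pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Toy similarity matrix for four patents (hypothetical values).
toy_sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
print(agglomerative(toy_sim, threshold=0.5))  # [[0, 1], [2, 3]]
```

A stricter threshold keeps each cluster topically tight at the cost of more and smaller clusters, which matches the trade-off noted above. This naive version is cubic in the number of patents and is meant only to show the procedure.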

Based on the patent clustering results, the TF-IDF and TextRank algorithms are each used to extract keywords that represent the topic from the titles and abstracts of the patents in each cluster. Both methods extract words that represent the cluster topics reasonably well, but the words extracted by TextRank contain more noise than those extracted by TF-IDF.
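The TF-IDF step can be sketched as follows, assuming that the titles and abstracts of each cluster are concatenated into one document and that whitespace tokenization is sufficient. Both assumptions, along with the function name `tfidf_keywords` and the toy texts, are illustrative and are not taken from the paper.

```python
import math
from collections import Counter


def tfidf_keywords(cluster_texts, top_k=3):
    """Return the top_k TF-IDF words for each cluster-level document."""
    docs = [text.lower().split() for text in cluster_texts]
    n_docs = len(docs)

    # Document frequency: in how many clusters does each word occur?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    keywords = []
    for tokens in docs:
        tf = Counter(tokens)
        scores = {
            w: (c / len(tokens)) * math.log(n_docs / df[w])
            for w, c in tf.items()
        }
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords.append(ranked[:top_k])
    return keywords


# Toy cluster texts (hypothetical).
toy_clusters = [
    "spell checking method spell correction system",
    "voice dialogue system voice assistant method",
]
print(tfidf_keywords(toy_clusters))
# [['spell', 'checking', 'correction'], ['voice', 'assistant', 'dialogue']]
```

Words such as "method" and "system" occur in every cluster, so their inverse document frequency is zero and they drop out, which is why TF-IDF tends to suppress generic patent vocabulary.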

The topics of the main clusters are determined by carefully reading the titles and abstracts of the patents in each cluster, together with the automatically extracted subject words. Except for cluster 1, the topics of the clusters are relatively concentrated. The topics and patent counts of some clusters are shown in Table 6.

Table 6 Topics and Number of Patents in Some NLP Clusters

Cluster number	Topics	Number of patents
1	Document processing; entity extraction; information extraction; question answering systems; keyword extraction; speech recognition; word sense disambiguation; semantic text search; text clustering; text visualization systems; text similarity; document query; structured text indexing; digital document generation; question preprocessing in question answering systems; professional language recognition; SVO structure extraction; sentiment analysis; candidate answer generation; document classification; web page ranking	569
2	Dialogue systems; voice control systems; voice command processing; dynamic voice mail reception; Internet of Things dialogue; virtual reality system interaction; digital assistants; voice call to text conversion; speech understanding; automatic translation of multi-user audio and video; concierge robot systems; dynamic dialogue analysis; intelligent automated assistants; dialogue transcription; voice response; virtual assistant systems; dialect recognition; audio information extraction; speech recognition; speech translation; interactive voice systems	325
3	Character input error correction; touch keyboards; text editing; user input suggestions; emoticon word sense disambiguation; character input; writing systems; string auto-suggestion; input method editors; target text selection; multilingual keyboard systems; user input prediction; assisted keyboard input; spell checking; virtual keyboards; cube input systems; predictive input	293
4	Information retrieval; search engines; semi-structured question answering systems; digital element search; object data extraction from unstructured documents; information services; web search mapping; corpus query; record search systems; concept recommendation based on multilingual user interaction; natural language query generation; automatic structured query generation; semantic search; search result ranking; search query intent	288
5	Document processing; document parsing; rule-based parsers; document conversion; document content summarization; methods and devices for displaying web pages; XML file parsing; document grouping; structured data file processing; page break identification; structured search queries; documentation; acronym generation; automatic file acquisition; document sequence management; portable format document generation; document preparation platforms; document content identification	279
……
10	Question answering systems; relation extraction; deep question answering systems; context-based language analysis in conversation; human-computer interaction systems; medical differential diagnosis and treatment using question answering systems; authentication using cognitive analysis; type evaluation in question answering systems; candidate answer generation; session query processors; complete question generation; intelligent Q&A; answering questions using chatbot systems	212

5 Conclusion

This paper proposes a technology topic division method based on patent multi-attribute fusion. It integrates the semantic information revealed by the patent text, the knowledge association revealed by citation relationships, and the subject category reflected by the classification number, so that it combines the ability of different patent attributes to represent technical topics and improves the accuracy of patent similarity measurement and technology topic division. In the empirical analysis of natural language processing technology, the multi-attribute fusion method achieves higher precision, recall, and F1 values in patent classification than patent vector representations based on one or two attributes, across different IPC classification levels and sample sets of different sizes. This indirectly demonstrates the advantages of the technology topic division method based on patent multi-attribute fusion.

Because it is difficult to build a training set and a test set, the effect of technical topic division could not be evaluated directly; the indirect evaluation method of automatic patent classification was used instead. In the future, the method will be evaluated in other ways, for example by comparing the accuracy of patent similarity measurement between this method and other methods, or by inviting experts to assess the topic division in terms of the interpretability of the technical topics, the clarity of their boundaries, and the amount of noise present. The application of the method to larger data sets and other technical fields will also be explored.
