
Aggregation analysis of research word association based on graph data

2022-06-13 03:20:00 【Tnoy. Ma】

Natural language processing (NLP) is one of the key technologies for mining text data, and mining word associations based on an ontology helps with synonym analysis. Word associations are useful in NLP tasks such as part-of-speech tagging, parsing, and entity extraction. Common word associations are mainly of two kinds, aggregation and combination. This test focuses on word association analysis of the aggregation relation, using research report data as the data source.

1. Algorithm introduction

The aggregation relation is analyzed using each word's context window and the Jaccard coefficient. For example, to compute the aggregation correlation of word1 and word2, use Jaccard to compute the similarity of the two words' preceding contexts and of their following contexts separately, then sum the two values. Reference: encyclopedia introduction to the Jaccard coefficient.
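Written out, the calculation described above looks roughly as follows. The notation is an assumption introduced here for illustration: $L(w)$ is the set of words that precede $w$ in the context window and $R(w)$ is the set of words that follow it.

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}

\mathrm{AggSim}(w_1, w_2) = J\bigl(L(w_1), L(w_2)\bigr) + J\bigl(R(w_1), R(w_2)\bigr)
```

Some of the scripts in the later sections divide this sum by 2, which for Jaccard values in $[0, 1]$ keeps the score in the same range.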

2. Data model

The data model schema is: (keyword)-[link]->(keyword)

Keyword data is generated by word segmentation. Stop words and other words that do not noticeably improve the business analysis are removed, which can be done with a custom dictionary. The resulting keyword context connection network is shown in the figure.
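As a rough illustration of how a segmented sentence could map onto this schema, the sketch below pairs each token with the token that follows it. Each returned row would correspond to one link between two keyword nodes. The token list is made up, and linking only directly adjacent tokens (a context window of one) is an assumption, not something the source specifies.

```cypher
// Hypothetical token list left after segmentation and stop-word removal
WITH ['new', 'energy', 'vehicle', 'industry'] AS tokens
// Pair every token with its immediate successor
UNWIND range(0, size(tokens) - 2) AS i
RETURN tokens[i] AS preceding_word, tokens[i + 1] AS following_word
```

In a real import, each row would be turned into two keyword nodes and one directed link between them, for example with MERGE.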

3. Computing keyword context aggregation similarity

The aggregation correlation algorithm is implemented in Cypher. The script iterates over all keywords, computes the aggregation correlation between every pair, and writes the final results back to the graph database.

MATCH (s: key word )
// Collect the left and right context of s
// left: keywords that link into s
MATCH (w: key word )-[: Connect ]->(s)
WITH COLLECT(DISTINCT w.name) AS left,s
// right: keywords that s links to
MATCH (w: key word )<-[: Connect ]-(s)
WITH left,s,COLLECT(DISTINCT w.name) AS right
// Match every other keyword o
MATCH (o: key word ) WHERE NOT s=o
WITH left,right,s,o
// Collect the left and right context of o
// left
MATCH (w: key word )-[: Connect ]->(o)
WITH COLLECT(DISTINCT w.name) AS left_o,left,right,s,o
// right
MATCH (w: key word )<-[: Connect ]-(o)
WITH left_o,COLLECT(DISTINCT w.name) AS right_o,left,right,s,o
// left: intersection and union (the union is the two lists concatenated)
WITH [x IN left WHERE x IN left_o] AS l_intersect,(left+left_o) AS l_union,right_o,right,s,o
// right: intersection and union (the union is the two lists concatenated)
WITH [x IN right WHERE x IN right_o] AS r_intersect,(right+right_o) AS r_union,l_intersect,l_union,s,o
WITH DISTINCT l_intersect,r_intersect,l_union,r_union,s,o
// Compute the Jaccard similarity coefficients
WITH s,o,
// Jaccard similarity coefficient of the left contexts
1.0*SIZE(l_intersect)/SIZE(l_union) AS l_jaccard,
// Jaccard similarity coefficient of the right contexts
1.0*SIZE(r_intersect)/SIZE(r_union) AS r_jaccard
// Aggregation similarity: the average of the `left` and `right` Jaccard coefficients
WITH s,o,l_jaccard,r_jaccard,(l_jaccard+r_jaccard)/2 AS aggSim
// Write the result back to the graph
CREATE UNIQUE (s)-[r:AggSim]->(o) SET r.parading=aggSim;
//RETURN s,o,l_jaccard,r_jaccard,aggSim
//LIMIT 1
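
One detail of the script above is worth checking on a small example. In Cypher, `left+left_o` concatenates the two lists, so any word that appears in both contexts is counted twice in `l_union`. The sketch below uses made-up ID lists to compare that denominator with a deduplicated union built by a list comprehension. Whether the deduplicated form matches the author's intent is an assumption.

```cypher
// Made-up context ID lists for two words
WITH [1, 2, 3] AS left, [2, 3, 4] AS left_o
WITH [x IN left WHERE x IN left_o] AS l_intersect,
     left + left_o AS l_concat,                            // as in the script: duplicates kept
     left + [x IN left_o WHERE NOT x IN left] AS l_union   // deduplicated union
RETURN size(l_intersect) AS shared,                                    // 2
       size(l_concat) AS concat_size,                                  // 6
       size(l_union) AS union_size,                                    // 4
       1.0 * size(l_intersect) / size(l_concat) AS ratio_with_concat,  // 2/6
       1.0 * size(l_intersect) / size(l_union) AS jaccard              // 2/4 = 0.5
```

With concatenation the ratio cannot exceed 0.5, even for two words with identical contexts, because the shared words appear twice in the denominator.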

4. Keyword context aggregation performance test

Graph database service: a single-node graph database with 4 GB of heap memory and 8 GB of page cache. Server configuration: an AWS server with an 8-core, 8-thread CPU and a 2 TB mechanical hard disk. Data scale: 150,000 keyword nodes and 2.95 million relationships. The test mainly measures how long it takes to collect the context set of a keyword. The conclusion is that passing only IDs through WITH is more efficient: in the timings below it is roughly 2 to 3 times faster than passing complete node data.

// Variant 1: collect ID(w) and pass ID(s)
MATCH (s: key word )
// left: keywords that link into s
MATCH (w: key word )-[: Connect ]->(s)
WITH COLLECT(DISTINCT ID(w)) AS left,ID(s) AS s
RETURN left,s LIMIT 1
// Started streaming 1 records after 3671 ms and completed after 3672 ms.
// Started streaming 1 records after 3731 ms and completed after 3731 ms.
// Started streaming 1 records after 3691 ms and completed after 3691 ms.

// Variant 2: collect ID(w) and pass the whole node s
MATCH (s: key word )
// left: keywords that link into s
MATCH (w: key word )-[: Connect ]->(s)
WITH COLLECT(DISTINCT ID(w)) AS left,s
RETURN left,s LIMIT 1
// Started streaming 1 records after 5665 ms and completed after 5665 ms.
// Started streaming 1 records after 5013 ms and completed after 5013 ms.
// Started streaming 1 records after 5048 ms and completed after 5048 ms.

// Variant 3: collect w.name and pass the whole node s
MATCH (s: key word )
// left: keywords that link into s
MATCH (w: key word )-[: Connect ]->(s)
WITH COLLECT(DISTINCT w.name) AS left,s
RETURN left,s LIMIT 1
// Started streaming 1 records after 9308 ms and completed after 9308 ms.
// Started streaming 1 records after 8312 ms and completed after 8312 ms.
// Started streaming 1 records after 8568 ms and completed after 8568 ms.
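
To compare the three queries at a glance, the run times reported in the comments above can be averaged. The snippet only restates those nine measurements as literals. The variant labels are descriptive names chosen here, not names from the source.

```cypher
UNWIND [
  {variant: 'IDs for w and s',     ms: [3671, 3731, 3691]},
  {variant: 'IDs for w, node s',   ms: [5665, 5013, 5048]},
  {variant: 'names for w, node s', ms: [9308, 8312, 8568]}
] AS run
UNWIND run.ms AS t
// One row per variant with its mean time to the first record
RETURN run.variant AS variant, avg(t) AS avg_ms
ORDER BY avg_ms
```

On these figures the ID-only query averages roughly 3.7 s, against about 5.2 s and 8.7 s for the other two.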

5. Computing aggregation similarity (Cypher optimization)

This optimized script mainly changes the data passed downstream to node IDs. Its performance is about 3 times better than that of the script in Section 3.

MATCH (s: key word )
// Collect the left and right context of s, passing only IDs downstream
// left
MATCH (w: key word )-[: Connect ]->(s)
WITH COLLECT(DISTINCT ID(w)) AS left,ID(s) AS sId
// right
MATCH (w: key word )<-[: Connect ]-(s: key word ) WHERE ID(s)=sId
WITH left,sId,COLLECT(DISTINCT ID(w)) AS right
// Match every other keyword o
MATCH (o: key word ) WHERE NOT ID(o)=sId
WITH left,right,sId,ID(o) AS oId
// Collect the left and right context of o
// left
MATCH (w: key word )-[: Connect ]->(o) WHERE ID(o)=oId
WITH COLLECT(DISTINCT ID(w)) AS left_o,left,right,sId,oId
// right
MATCH (w: key word )<-[: Connect ]-(o) WHERE ID(o)=oId
WITH left_o,COLLECT(DISTINCT ID(w)) AS right_o,left,right,sId,oId
// left: intersection and union (the union is the two lists concatenated)
WITH [x IN left WHERE x IN left_o] AS l_intersect,(left+left_o) AS l_union,right_o,right,sId,oId
// right: intersection and union (the union is the two lists concatenated)
WITH [x IN right WHERE x IN right_o] AS r_intersect,(right+right_o) AS r_union,l_intersect,l_union,sId,oId
WITH DISTINCT l_intersect,r_intersect,l_union,r_union,sId,oId
// Compute the Jaccard similarity coefficients
WITH sId,oId,
// Jaccard similarity coefficient of the left contexts
1.0*SIZE(l_intersect)/SIZE(l_union) AS l_jaccard,
// Jaccard similarity coefficient of the right contexts
1.0*SIZE(r_intersect)/SIZE(r_union) AS r_jaccard
// Aggregation similarity: the average of the `left` and `right` Jaccard coefficients
WITH sId,oId,l_jaccard,r_jaccard,(l_jaccard+r_jaccard)/2 AS aggSim
//CREATE UNIQUE (s)-[r:AggSim]->(o) SET r.parading=aggSim;
RETURN sId,oId,l_jaccard,r_jaccard,aggSim
LIMIT 1

6. Computing aggregation similarity for a word pair

This script modifies the one in Section 5 to analyze the aggregation similarity of two words.

MATCH (s: key word ) WHERE s.name IN [' business ',' Industry ']
// Collect the left and right context of s, passing only IDs downstream
// left
MATCH (w: key word )-[: Connect ]->(s)
WITH COLLECT(DISTINCT ID(w)) AS left,ID(s) AS sId
// right
MATCH (w: key word )<-[: Connect ]-(s: key word ) WHERE ID(s)=sId
WITH left,sId,COLLECT(DISTINCT ID(w)) AS right
// Match the other word of the pair
MATCH (o: key word ) WHERE NOT ID(o)=sId AND o.name IN [' business ',' Industry ']
WITH left,right,sId,ID(o) AS oId
// Collect the left and right context of o
// left
MATCH (w: key word )-[: Connect ]->(o) WHERE ID(o)=oId
WITH COLLECT(DISTINCT ID(w)) AS left_o,left,right,sId,oId
// right
MATCH (w: key word )<-[: Connect ]-(o) WHERE ID(o)=oId
WITH left_o,COLLECT(DISTINCT ID(w)) AS right_o,left,right,sId,oId
// left: intersection and union (the union is the two lists concatenated)
WITH [x IN left WHERE x IN left_o] AS l_intersect,(left+left_o) AS l_union,right_o,right,sId,oId
// right: intersection and union (the union is the two lists concatenated)
WITH [x IN right WHERE x IN right_o] AS r_intersect,(right+right_o) AS r_union,l_intersect,l_union,sId,oId
WITH DISTINCT l_intersect,r_intersect,l_union,r_union,sId,oId
// Compute the Jaccard similarity coefficients
WITH sId,oId,
// Jaccard similarity coefficient of the left contexts
1.0*SIZE(l_intersect)/SIZE(l_union) AS l_jaccard,
// Jaccard similarity coefficient of the right contexts
1.0*SIZE(r_intersect)/SIZE(r_union) AS r_jaccard
// Aggregation similarity: the average of the `left` and `right` Jaccard coefficients
//WHERE l_jaccard>0 OR r_jaccard>0
WITH sId,oId,l_jaccard,r_jaccard,(l_jaccard+r_jaccard)/2 AS aggSim
//CREATE UNIQUE (s)-[r:AggSim]->(o) SET r.parading=aggSim;
RETURN sId,oId,l_jaccard,r_jaccard,aggSim

7. Computing aggregation similarity in parallel (Cypher optimization II)

// By default, the maximum number of partitions (the degree of parallelism) is the number of CPU cores x 100,
// and the maximum batch size is 10000. For example, if the Neo4j database is allocated 4 cores,
// the maximum degree of parallelism is 400.
CALL apoc.cypher.parallel(
   fragment,
   params,
   parallelizeOn
) YIELD value

// 1 word pair
// Started streaming 2 records in less than 1 ms and completed after 499 ms.
// Started streaming 2 records in less than 1 ms and completed after 498 ms.
// Started streaming 2 records after 1 ms and completed after 500 ms.
CALL apoc.cypher.parallel(
  'MATCH (s: key word ) WHERE s.name IN $name MATCH (w: key word )-[: Connect ]->(s) WITH COLLECT(DISTINCT ID(w)) AS left,ID(s) AS sId MATCH (w: key word )<-[: Connect ]-(s: key word ) WHERE ID(s)=sId WITH left,sId,COLLECT(DISTINCT ID(w)) AS right MATCH (o: key word ) WHERE NOT ID(o)=sId AND o.name IN $name WITH left,right,sId,ID(o) AS oId MATCH (w: key word )-[: Connect ]->(o) WHERE ID(o)=oId WITH COLLECT(DISTINCT ID(w)) AS left_o,left,right,sId,oId MATCH (w: key word )<-[: Connect ]-(o) WHERE ID(o)=oId WITH left_o,COLLECT(DISTINCT ID(w)) AS right_o,left,right,sId,oId WITH [x IN left WHERE x IN left_o] AS l_intersect,(left+left_o) AS l_union,right_o,right,sId,oId WITH [x IN right WHERE x IN right_o] AS r_intersect,(right+right_o) AS r_union,l_intersect,l_union,sId,oId WITH DISTINCT l_intersect,r_intersect,l_union,r_union,sId,oId WITH sId,oId, 1.0*SIZE(l_intersect)/SIZE(l_union) AS l_jaccard, 1.0*SIZE(r_intersect)/SIZE(r_union) AS r_jaccard WITH sId,oId,l_jaccard,r_jaccard,(l_jaccard+r_jaccard)/2 AS aggSim RETURN sId,oId,l_jaccard,r_jaccard,aggSim',
  {name:[[' Industry ',' business ']]},
  'name'
)

// Query that generates a list of 100 word pairs
MATCH (s: key word ),(o: key word ) WITH s.name AS s,o.name AS o limit 100
WITH [s,o] AS list
return COLLECT(list)

// 100 word pairs
// Started streaming 200 records after 2 ms and completed after 47239 ms.
// Started streaming 200 records after 1 ms and completed after 48107 ms.
// Started streaming 200 records in less than 1 ms and completed after 48266 ms.
CALL apoc.cypher.parallel(
  'MATCH (s: key word ) WHERE s.name IN $name MATCH (w: key word )-[: Connect ]->(s) WITH COLLECT(DISTINCT ID(w)) AS left,ID(s) AS sId MATCH (w: key word )<-[: Connect ]-(s: key word ) WHERE ID(s)=sId WITH left,sId,COLLECT(DISTINCT ID(w)) AS right MATCH (o: key word ) WHERE NOT ID(o)=sId AND o.name IN $name WITH left,right,sId,ID(o) AS oId MATCH (w: key word )-[: Connect ]->(o) WHERE ID(o)=oId WITH COLLECT(DISTINCT ID(w)) AS left_o,left,right,sId,oId MATCH (w: key word )<-[: Connect ]-(o) WHERE ID(o)=oId WITH left_o,COLLECT(DISTINCT ID(w)) AS right_o,left,right,sId,oId WITH [x IN left WHERE x IN left_o] AS l_intersect,(left+left_o) AS l_union,right_o,right,sId,oId WITH [x IN right WHERE x IN right_o] AS r_intersect,(right+right_o) AS r_union,l_intersect,l_union,sId,oId WITH DISTINCT l_intersect,r_intersect,l_union,r_union,sId,oId WITH sId,oId, 1.0*SIZE(l_intersect)/SIZE(l_union) AS l_jaccard, 1.0*SIZE(r_intersect)/SIZE(r_union) AS r_jaccard WITH sId,oId,l_jaccard,r_jaccard,(l_jaccard+r_jaccard)/2 AS aggSim RETURN sId,oId,l_jaccard,r_jaccard,aggSim',
  {name:[[' Industry ',' business '],[' Bonding property ',' Tear resistance '],[' Bonding property ',' For reference '],[' Bonding property ',' Compared with the valuation '],[' Bonding property ',' Nantong new Youfei Hotel '],[' Bonding property ',' Xu Changjiang '],[' Bonding property ',' Dongwenfeng '],[' Bonding property ',' Single crystal camp '],[' Bonding property ',' Garbo group '],[' Bonding property ',' Jiabao '],[' Bonding property ',' Chrysanthemum Garden '],[' Bonding property ',' Shengchuang '],[' Bonding property ',' Style room '],[' Bonding property ',' Green style '],[' Bonding property ',' Harbor city '],[' Bonding property ',' Limited to '],[' Bonding property ',' Science Park period '],[' Bonding property ',' Shanghai Anting old temple gold '],[' Bonding property ',' Shanghai Gaotai Precious Metals Co., Ltd '],[' Bonding property ',' Gaotai '],[' Bonding property ',' China nuclear industry group '],[' Bonding property ',' Zirconium tube '],[' Bonding property ',' Producing nuclear reactors '],[' Bonding property ',' Shape material '],[' Bonding property ','29X-31X'],[' Bonding property ',' Oil tank '],[' Bonding property ',' Shanghai Pudong New Area land resources reserve center '],[' Bonding property ',' Anna '],[' Bonding property ',' Quasi magnetic '],[' Bonding property ',' Ping Hai '],[' Bonding property ',' Network power '],[' Bonding property ','551'],[' Bonding property ',' Control this '],[' Bonding property ',' Construction permit '],[' Bonding property ',' Qingyang '],[' Bonding property ',' Order milk '],[' Bonding property ',' Order milk quantity '],[' Bonding property ',' The new emperor '],[' Bonding property ',' Dairy stores '],[' Bonding property ',' score '],[' Bonding property ',' The production capacity of Sier has increased '],[' Bonding property ',' Make '],[' Bonding property ',' Amortized options '],[' Bonding property ',' To decide '],[' Bonding property ',' State owned assets management '],[' Bonding property ',' 
Burden －'],[' Bonding property ',' Huizhou Desai shaped battery '],[' Bonding property ',' Shanghai Economic and Information Commission '],[' Bonding property ',' Watch the party '],[' Bonding property ',' Porcelain clay '],[' Bonding property ',' Changyuan technology industry '],[' Bonding property ',' Protective plate '],[' Bonding property ',' Desai '],[' Bonding property ',' Up to a high point '],[' Bonding property ',' Supply slightly exceeds demand '],[' Bonding property ',' Net increase '],[' Bonding property ',' Huazhong '],[' Bonding property ',' Ten thousand people buy a house '],[' Bonding property ',' Shuffle year '],[' Bonding property ',' Yitong '],[' Bonding property ',' Third quarter rate '],[' Bonding property ',' Chuannanzi '],[' Bonding property ',' Evaluation Center '],[' Bonding property ',' Virtue '],[' Bonding property ',' Additional notes '],[' Bonding property ',' Hammersley '],[' Bonding property ',' Preah Vihear District, Cambodia '],[' Bonding property ',' Western Australia '],[' Bonding property ',' Up can '],[' Bonding property ',' Qingdao Tianxin '],[' Bonding property ',' Kangneng '],[' Bonding property ',' Hero '],[' Bonding property ',' GCL accounts for '],[' Bonding property ',' Covering machine '],[' Bonding property ',' Spinning frame '],[' Bonding property ',' Wall glue '],[' Bonding property ',' Door and window sealant '],[' Bonding property ',' Industrial glue '],[' Bonding property ',' Benefit from the '],[' Bonding property ',' Xiamen Sanhong '],[' Bonding property ',' Jiangxi Jutong Industrial Co., Ltd '],[' Bonding property ',' Alloy powder '],[' Bonding property ',' Cobaltic acid '],[' Bonding property ',' According to the certificate '],[' Bonding property ',' Quick money '],[' Bonding property ','MNC'],[' Bonding property ',' Lingtong '],[' Bonding property ',' Huayou century '],[' Bonding property ',' To worsen '],[' Bonding property ',' Beiwei company '],[' Bonding property ',' Traffic flow '],[' Bonding property ',' 
Che '],[' Bonding property ',' Medium 〕'],[' Bonding property ',' origin '],[' Bonding property ',' Internal machine '],[' Bonding property ',' Zhangyubing '],[' Bonding property ',' Quality wine '],[' Bonding property ',' White Wine '],[' Bonding property ',' Without exception '],[' Bonding property ',' Changyu KAS ']]},
  'name'
)

8. Packaging the word-pair Cypher script as a procedure

8.1 Further optimize the query

This subsection continues optimizing the query from Section 6. The previous query matched keywords repeatedly in MATCH. Here the generation of the word pair is optimized to support the analysis of two words. Specifying the context depth is not supported yet, and the default is one degree.

MATCH (s: key word  {name:' business '}) MATCH (o: key word  {name:' Industry '})
WITH ID(s) AS sId,ID(o) AS oId
WITH sId,oId
WHERE sId<>oId
// Look up s by ID
MATCH (s: key word ) WHERE ID(s)=sId
// Collect the left and right context of s
// left
MATCH (w: key word )-[: Connect ]->(s)
WITH COLLECT(DISTINCT ID(w)) AS left,sId,oId
// right
MATCH (w: key word )<-[: Connect ]-(s: key word ) WHERE ID(s)=sId
WITH left,COLLECT(DISTINCT ID(w)) AS right,sId,oId
// Look up o by ID
MATCH (o: key word ) WHERE ID(o)=oId
WITH left,right,sId,oId
// Collect the left and right context of o
// left
MATCH (w: key word )-[: Connect ]->(o) WHERE ID(o)=oId
WITH COLLECT(DISTINCT ID(w)) AS left_o,left,right,sId,oId
// right
MATCH (w: key word )<-[: Connect ]-(o) WHERE ID(o)=oId
WITH left_o,COLLECT(DISTINCT ID(w)) AS right_o,left,right,sId,oId
// left: intersection and union (the union is the two lists concatenated)
WITH [x IN left WHERE x IN left_o] AS l_intersect,(left+left_o) AS l_union,right_o,right,sId,oId
// right: intersection and union (the union is the two lists concatenated)
WITH [x IN right WHERE x IN right_o] AS r_intersect,(right+right_o) AS r_union,l_intersect,l_union,sId,oId
WITH DISTINCT l_intersect,r_intersect,l_union,r_union,sId,oId
// Compute the Jaccard similarity coefficients
WITH sId,oId,
// Jaccard similarity coefficient of the left contexts
1.0*SIZE(l_intersect)/SIZE(l_union) AS l_jaccard,
// Jaccard similarity coefficient of the right contexts
1.0*SIZE(r_intersect)/SIZE(r_union) AS r_jaccard
// Aggregation similarity: the average of the `left` and `right` Jaccard coefficients
WITH sId,oId,l_jaccard,r_jaccard,(l_jaccard+r_jaccard)/2 AS aggSim
//CREATE UNIQUE (s)-[r:AggSim]->(o) SET r.parading=aggSim;
RETURN sId,oId,l_jaccard,r_jaccard,aggSim

8.2 Install the query as a procedure

Wrapping a complex query as a procedure or function makes it easier for data analysts to call.

8.2.1 Summing the context Jaccard coefficients

CALL apoc.custom.asProcedure(
	'jaccard.agg.lr.sum',
    'MATCH (s: key word  {name:$first}) MATCH (o: key word  {name:$second}) WITH ID(s) AS sId,ID(o) AS oId WITH sId,oId WHERE sId<>oId MATCH (s: key word ) WHERE ID(s)=sId MATCH (w: key word )-[: Connect ]->(s) WITH COLLECT(DISTINCT ID(w)) AS left,sId,oId MATCH (w: key word )<-[: Connect ]-(s: key word ) WHERE ID(s)=sId WITH left,COLLECT(DISTINCT ID(w)) AS right,sId,oId MATCH (o: key word ) WHERE ID(o)=oId WITH left,right,sId,oId MATCH (w: key word )-[: Connect ]->(o) WHERE ID(o)=oId WITH COLLECT(DISTINCT ID(w)) AS left_o,left,right,sId,oId MATCH (w: key word )<-[: Connect ]-(o) WHERE ID(o)=oId WITH left_o,COLLECT(DISTINCT ID(w)) AS right_o,left,right,sId,oId WITH [x IN left WHERE x IN left_o] AS l_intersect,(left+left_o) AS l_union,right_o,right,sId,oId WITH [x IN right WHERE x IN right_o] AS r_intersect,(right+right_o) AS r_union,l_intersect,l_union,sId,oId WITH DISTINCT l_intersect,r_intersect,l_union,r_union,sId,oId WITH sId,oId, 1.0*SIZE(l_intersect)/SIZE(l_union) AS l_jaccard, 1.0*SIZE(r_intersect)/SIZE(r_union) AS r_jaccard WITH sId,oId,l_jaccard,r_jaccard,(l_jaccard+r_jaccard) AS aggSim RETURN sId,oId,l_jaccard,r_jaccard,aggSim',
    'READ',
    [['sId','LONG'],['oId','LONG'],['l_jaccard','DOUBLE'],['r_jaccard','DOUBLE'],['aggSim','DOUBLE']],
    [['first','STRING'],['second','STRING']],
    'Aggregation similarity of a word pair: sum of the left and right context Jaccard similarities'
 );
 // CALL custom.jaccard.agg.lr.sum(' business ',' Industry ')

8.2.2 Averaging the context Jaccard coefficients

CALL apoc.custom.asProcedure(
	'jaccard.agg.lr.avr',
    'MATCH (s: key word  {name:$first}) MATCH (o: key word  {name:$second}) WITH ID(s) AS sId,ID(o) AS oId WITH sId,oId WHERE sId<>oId MATCH (s: key word ) WHERE ID(s)=sId MATCH (w: key word )-[: Connect ]->(s) WITH COLLECT(DISTINCT ID(w)) AS left,sId,oId MATCH (w: key word )<-[: Connect ]-(s: key word ) WHERE ID(s)=sId WITH left,COLLECT(DISTINCT ID(w)) AS right,sId,oId MATCH (o: key word ) WHERE ID(o)=oId WITH left,right,sId,oId MATCH (w: key word )-[: Connect ]->(o) WHERE ID(o)=oId WITH COLLECT(DISTINCT ID(w)) AS left_o,left,right,sId,oId MATCH (w: key word )<-[: Connect ]-(o) WHERE ID(o)=oId WITH left_o,COLLECT(DISTINCT ID(w)) AS right_o,left,right,sId,oId WITH [x IN left WHERE x IN left_o] AS l_intersect,(left+left_o) AS l_union,right_o,right,sId,oId WITH [x IN right WHERE x IN right_o] AS r_intersect,(right+right_o) AS r_union,l_intersect,l_union,sId,oId WITH DISTINCT l_intersect,r_intersect,l_union,r_union,sId,oId WITH sId,oId, 1.0*SIZE(l_intersect)/SIZE(l_union) AS l_jaccard, 1.0*SIZE(r_intersect)/SIZE(r_union) AS r_jaccard WITH sId,oId,l_jaccard,r_jaccard,(l_jaccard+r_jaccard)/2 AS aggSim RETURN sId,oId,l_jaccard,r_jaccard,aggSim',
    'READ',
    [['sId','LONG'],['oId','LONG'],['l_jaccard','DOUBLE'],['r_jaccard','DOUBLE'],['aggSim','DOUBLE']],
    [['first','STRING'],['second','STRING']],
    'Aggregation similarity of a word pair: average of the left and right context Jaccard similarities'
 );
 // CALL custom.jaccard.agg.lr.avr(' business ',' Industry ')

8.2.3 Procedure usage and return value description

// sId: ID of the first keyword
// oId: ID of the second keyword
// l_jaccard: similarity of the preceding (left) contexts
// r_jaccard: similarity of the following (right) contexts
// aggSim: aggregation similarity
CALL custom.jaccard.agg.lr.avr(' business ',' Industry ') YIELD sId,oId,l_jaccard,r_jaccard,aggSim RETURN sId,oId,l_jaccard,r_jaccard,aggSim
CALL custom.jaccard.agg.lr.sum(' business ',' Industry ') YIELD sId,oId,l_jaccard,r_jaccard,aggSim RETURN sId,oId,l_jaccard,r_jaccard,aggSim

9. Analyzing the aggregation relevance of a research report keyword list

9.1 Word list analysis

WITH [' business ',' Industry ',' Enterprises ',' Practice ',' Business ',' chemical industry ',' Drop out ',' Textile industry ',' The financial sector ',' Business professionals ',' Different industries '] AS wordList
UNWIND wordList AS first
UNWIND wordList AS second
WITH first,second
WHERE first<>second
CALL custom.jaccard.agg.lr.sum(first,second) YIELD sId,oId,l_jaccard,r_jaccard,aggSim RETURN sId,algo.asNode(sId).name AS sIdName,oId,algo.asNode(oId).name AS oIdName,l_jaccard,r_jaccard,aggSim ORDER BY aggSim DESC


9.2 Word list analysis optimization

The results in the previous section show that each word pair is computed twice, once in each order. Avoiding this double counting in Cypher greatly improves query performance (QPS). For N distinct keywords, the aggregation correlation analysis should produce C(N, 2) = N(N-1)/2 results.

The query from the previous section is therefore optimized as follows. The keywords are numbered so that the combination formula above is easy to apply.

WITH [{id:1,word:' business '},{id:2,word:' Industry '},{id:3,word:' Enterprises '},{id:4,word:' Practice '},{id:5,word:' Business '},{id:6,word:' chemical industry '},{id:7,word:' Drop out '},{id:8,word:' Textile industry '},{id:9,word:' The financial sector '},{id:10,word:' Business professionals '},{id:11,word:' Different industries '}] AS wordList
UNWIND wordList AS first
UNWIND wordList AS second
WITH first,second
WHERE first.id<second.id
WITH first.word AS first,second.word AS second
MATCH (o: key word ),(s: key word ) WHERE o.name=first AND s.name=second 
CALL custom.jaccard.agg.lr.sum(first,second) YIELD sId,oId,l_jaccard,r_jaccard,aggSim RETURN sId,algo.asNode(sId).name AS sIdName,oId,algo.asNode(oId).name AS oIdName,l_jaccard,r_jaccard,aggSim ORDER BY aggSim DESC
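
The saving from the `first.id<second.id` filter can be checked by counting pairs for a list of 11 numbered items, the same size as the word list above. This is a standalone sketch that uses plain integers instead of keywords.

```cypher
WITH range(1, 11) AS ids
UNWIND ids AS a
UNWIND ids AS b
RETURN sum(CASE WHEN a <> b THEN 1 ELSE 0 END) AS ordered_pairs,    // 11 * 10 = 110
       sum(CASE WHEN a < b THEN 1 ELSE 0 END) AS unordered_pairs    // C(11, 2) = 55
```

Because the Jaccard coefficient is symmetric in its two arguments, each unordered pair needs to be scored only once.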
