
CSDN Q&A Tag Skill Tree (II) -- Effect Optimization

2022-07-06 10:42:00 Alexxinlu

Series articles


Team blog: CSDN AI team


1. Problem background

This article continues the previous one, "CSDN Q&A Tag Skill Tree (I): Construction of the Basic Framework". In short, for each domain tag in the CSDN Q&A module we build a complete knowledge system, associate the content, users, and other entities of the Q&A module with that system, and finally form a graph containing heterogeneous nodes, which provides a better resource base for downstream NLP tasks.

This article mainly introduces the optimization of the skill tree structure and of the effect of matching Q&A content against the skill tree.

2. Skill tree optimization

This chapter covers two parts: optimizing the skill tree structure, and optimizing the effect of matching Q&A content against the skill tree.

The skill tree has been expanded to 16 languages, but since the overall effect is still limited, the current focus is on two programming languages that are popular in the CSDN Q&A module, Python and Java, plus the currently popular concept of cloud native, which spans multiple tags and forms a large knowledge field.

2.1 Skill tree structure optimization

There are two main requirements on the structure of the skill tree: the coverage of knowledge should be complete, and the tree needs a clear hierarchy.

2.1.1 Knowledge coverage

Knowledge coverage was already discussed in Section 2.1 of the previous article: the skill tree is built mainly by crawling the tables of contents of books and learning forums.

In this update, more book catalogs were crawled to expand the coverage of knowledge. We will continue to expand the knowledge in the future, and will also consider adding a user-editing feature so that users can edit and add knowledge points themselves.

2.1.2 Hierarchy

Knowledge is divided into beginner, intermediate, and advanced stages. Taking the Python language as an example, there is a preliminary knowledge skeleton as shown below:

python
├── Beginner
│   ├── Preliminary knowledge
│   ├── Basic syntax
│   ├── Advanced syntax
│   └── Object-oriented programming
├── Intermediate
│   ├── Basic skills
│   ├── Web application development
│   ├── Web crawler
│   └── Desktop application development
└── Advanced
    ├── Scientific computing package NumPy
    ├── Structured data analysis tool Pandas
    ├── Plotting library Matplotlib
    ├── Scientific computing toolkit SciPy
    ├── Machine learning toolkit Scikit-learn
    ├── Deep learning
    ├── Computer vision
    └── Natural language processing

To give the knowledge system a hierarchical structure, the idea of this article is as follows: first build a knowledge skeleton like the one above, then split and integrate the directories of all the books and learning forums obtained, and hang them on the corresponding nodes of the skeleton. For example, for an introductory Python book, split its directory and hang it on the corresponding nodes under the Beginner level of the skeleton; if the skeleton does not yet cover a knowledge point, add a new node; if a newly found knowledge point is redundant, do not hang it on the skeleton.

In the specific implementation stage, the current strategy is fairly simple: the crawled directories are hung on the corresponding level nodes purely by manual assignment, as shown below (a small sketch of this assignment step is given after the list):

Beginner --> using_python
Beginner --> tutorial_python
Beginner --> reference_python
Beginner --> library_python
Beginner --> liaoxuefeng_python
Beginner --> Python Basic Tutorial (3rd edition)
Beginner --> Python Programming Quick Start: Automating Tedious Work (2nd edition)
Beginner --> Fluent Python
Beginner --> Python Programming: From Beginner to Practice (2nd edition)
Beginner --> Python From Beginner to Master
Beginner --> Python Cookbook (3rd edition)
Intermediate --> Python Advanced Programming
Intermediate --> Python Core Programming (3rd edition)
Intermediate --> Python Geek Project Programming
Intermediate --> Flask Web Development: Web Application Development with Python in Practice (2nd edition)
Intermediate --> Web crawler --> Python 3 Web Crawler Development in Practice
Advanced --> Python 3 Advanced Tutorial (3rd edition)
Advanced --> Python Data Analysis Basics
Advanced --> Python for Data Analysis (2nd edition)
Advanced --> Python Advanced Data Analysis: Machine Learning, Deep Learning and NLP Examples
Advanced --> Computer vision --> Practical Convolutional Neural Networks: Implementing Advanced Deep Learning Models with Python
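To make the manual assignment above concrete, the following is a minimal sketch of how a crawled directory could be hung on the skeleton under this strategy. The skeleton representation, the LEVEL_OF_SOURCE mapping, and the attach_toc helper are hypothetical names invented for illustration; this is not the production code.

# A minimal sketch of hanging crawled directories on the skeleton.
# The data structures and helper names are hypothetical, for illustration only.

# Simplified skeleton: level -> set of knowledge points already covered
skeleton = {
    'Beginner': {'Preliminary knowledge', 'Basic syntax', 'Advanced syntax',
                 'Object-oriented programming'},
    'Intermediate': {'Basic skills', 'Web application development',
                     'Web crawler', 'Desktop application development'},
    'Advanced': {'NumPy', 'Pandas', 'Matplotlib', 'SciPy', 'Scikit-learn',
                 'Deep learning', 'Computer vision', 'Natural language processing'},
}

# Manual assignment of each crawled source to a level (subset of the list above)
LEVEL_OF_SOURCE = {
    'tutorial_python': 'Beginner',
    'liaoxuefeng_python': 'Beginner',
    'Python 3 Web Crawler Development in Practice': 'Intermediate',
    'Python for Data Analysis (2nd edition)': 'Advanced',
}

def attach_toc(source_name, toc_entries, skeleton, level_of_source):
    """Hang the chapter titles of one crawled source on its assigned level.

    Entries the skeleton already covers are treated as redundant and skipped;
    the remaining entries are added as new nodes under the assigned level.
    """
    level = level_of_source.get(source_name)
    if level is None:
        return
    for entry in toc_entries:
        if entry not in skeleton[level]:   # not covered yet -> add a new node
            skeleton[level].add(entry)

attach_toc('tutorial_python',
           ['Basic syntax', 'Modules', 'Virtual environments'],
           skeleton, LEVEL_OF_SOURCE)

In practice the skeleton is a multi-level tree rather than one flat set per level, but the add-only-if-not-covered logic is the same idea described above.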

Later, we will consider borrowing the tree-building idea of decision tree algorithms, in order to build a skill tree with less redundancy and a more reasonable structure.

2.2 Optimization of matching Q&A content with the skill tree

After the skill tree is built, the accepted questions in the Q&A module need to be hung on the corresponding nodes via the matching algorithm, becoming the data resources of those nodes. For a user's new question, the matching algorithm is used to recommend the accepted questions on the most relevant node, and further to recommend related knowledge adjacent to that node on the skill tree (a rough sketch of collecting such adjacent knowledge points is given below).
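The article does not show this recommendation step in code. Below is a minimal sketch, assuming the skill tree is stored as simple parent and children mappings, of how the knowledge points adjacent to a matched node (its parent, siblings, and children) could be collected as recommendation candidates; all names and the example data are hypothetical.

# Minimal sketch of collecting adjacent knowledge points around a matched node.
# The tree representation and all names are hypothetical, for illustration only.

def adjacent_nodes(node, parent_of, children_of):
    """Return the parent, siblings, and children of a matched node."""
    neighbors = []
    parent = parent_of.get(node)
    if parent is not None:
        neighbors.append(parent)
        neighbors.extend(c for c in children_of.get(parent, []) if c != node)
    neighbors.extend(children_of.get(node, []))
    return neighbors

parent_of = {'Web crawler': 'Intermediate'}
children_of = {
    'Intermediate': ['Basic skills', 'Web application development',
                     'Web crawler', 'Desktop application development'],
    'Web crawler': ['requests', 'Scrapy'],
}
print(adjacent_nodes('Web crawler', parent_of, children_of))
# ['Intermediate', 'Basic skills', 'Web application development',
#  'Desktop application development', 'requests', 'Scrapy']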

2.2.1 Matching algorithm optimization

In Section 2.4 of the previous article, only the leaf nodes of the skill tree and the question title were used for matching. This approach has the following defects:

  • The ancestors of a leaf node also contain a lot of useful information; using only the leaf node loses this ancestral information.
  • Some user questions are not about a finely described leaf node; they may be overview-style questions about a broad category, i.e. a non-leaf node.
  • English keywords play a vital role in matching and need to be given extra attention.

To address the problems above, the matching algorithm is improved as follows:

  • Get the set of paths of all nodes in the skill tree, where a path means the path from the root node to the current node.
  • Calculate the similarity between the question title and every path. Because the last node of a path is the node to be matched, and nodes at deeper layers describe the knowledge point more precisely and carry more information, each node is weighted by its layer number within the path when computing the similarity: the deeper the layer, the greater the weight.
  • When matching, increase the weight of English keywords.

In addition, because paths vary in length, the algorithm above tends to favor shorter paths and to give longer paths lower scores, so the final score needs to be normalized by path length. Concretely, for a trimmed path with n nodes, the node at depth position i (i = 1 for the shallowest retained node) gets weight i / (n(n+1)/2), so deeper nodes weigh more and the weights sum to 1; the weighted score is then multiplied by n / max_path_len so that longer, more specific matches are not penalized. The specific code implementation is as follows:

def get_most_sim_node_paths_to_leaves(self, all_paths_to_leaves, node_id_seg_dict, query, lang):
    '''Score the user's question title against every path in the skill tree.'''
    # The first node (programming language name) and the second node (sub-directory
    # name) carry no useful information, so they are excluded from the path length.
    max_path_len = max([len(m_path) for m_path in all_paths_to_leaves]) - 2

    # Segment the query into words
    query_seg = word_segmentation(query)

    # Remove the name (and synonyms) of the current programming language
    lang_syn_name_list = lang_std_syn_dict[lang]
    query_seg = [m_word for m_word in query_seg if m_word not in lang_syn_name_list]

    result = {'score': 0, 'path_to_leaf': []}
    for ori_path_to_leaf in all_paths_to_leaves:
        # Skip the first two nodes (language name and sub-directory name)
        path_to_leaf = ori_path_to_leaf[2:]
        score_list = []
        num_nodes = len(path_to_leaf)
        if num_nodes == 0:
            continue

        for m_node_id in path_to_leaf:
            # Similarity between the node's word sequence and the query's word sequence
            sim_score, _, _ = cal_simlarity(node_id_seg_dict[m_node_id], query_seg,
                                            lcs_ratio_threshold=0, alpha=0)
            score_list.append(sim_score)

        # On the root-to-leaf path, cut off the trailing nodes whose score is 0,
        # keeping the sub-path that ends at the last non-zero node
        score_list.reverse()
        non_zero_index = 0
        for m_index, sim_score in enumerate(score_list):
            if sim_score != 0:
                non_zero_index = m_index
                break
        score_list = score_list[non_zero_index:]
        score_list.reverse()

        # If some node matches the query exactly (intersection-over-union == 1),
        # return the path from the root to that node directly
        value_one_flag = False
        value_one_index_list = [m_index for m_index, sim_score in enumerate(score_list) if sim_score == 1]
        if len(value_one_index_list) != 0:
            value_one_index = value_one_index_list[-1]
            score_list = score_list[: value_one_index + 1]
            value_one_flag = True

        # Weight each node by its layer number (deeper nodes weigh more),
        # then normalize by the path length
        num_nodes = len(score_list)
        final_score = 0.0
        denominator = (1.0 + num_nodes) * num_nodes / 2
        for m_index, m_score in enumerate(score_list):
            floor_score = (m_index + 1) / denominator
            final_score += m_score * floor_score

        path_len_score = num_nodes / max_path_len
        final_score *= path_len_score

        if final_score > result['score']:
            result['score'] = final_score
            result['path_to_leaf'] = ori_path_to_leaf[: num_nodes + 2]

        if value_one_flag:
            # Exact node match: accept this path and stop searching
            result['score'] = final_score
            result['path_to_leaf'] = ori_path_to_leaf[: num_nodes + 2]
            break

    return result
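The helpers word_segmentation and cal_simlarity used above are not shown in the post. The sketch below only illustrates what a similarity function of this kind might look like, assuming a token-overlap (intersection over union) score in which English keywords are given extra weight, as described earlier; the actual formula, the extra weight value, and the meaning of the lcs_ratio_threshold and alpha parameters are not specified in the article, so everything here is an assumption.

import re

# A sketch of a token-overlap similarity with extra weight on English keywords.
# This is NOT the article's cal_simlarity implementation; it only illustrates
# the idea of boosting English terms during matching. The weight value is an
# assumption, and the real function's lcs_ratio_threshold/alpha are ignored here.
EN_WORD = re.compile(r'^[A-Za-z][A-Za-z0-9_.+-]*$')

def sketch_similarity(node_seg, query_seg, en_weight=2.0):
    """Weighted intersection-over-union of two word lists."""
    def weight(word):
        return en_weight if EN_WORD.match(word) else 1.0

    node_set, query_set = set(node_seg), set(query_seg)
    inter = sum(weight(w) for w in node_set & query_set)
    union = sum(weight(w) for w in node_set | query_set)
    return inter / union if union else 0.0

# Example: the shared English keyword "pandas" dominates the score
print(sketch_similarity(['pandas', '数据', '分析'], ['pandas', '报错']))   # 0.4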

2.2.2 Preprocessing optimization

This task mainly matches node descriptions against user questions. Stop words and the quality of the remaining keywords both affect the matching effect, so appropriate preprocessing, tailored to the characteristics of this task, is needed to keep only high-quality keywords and thereby improve matching.

In addition, user questions are rather colloquial, while the node descriptions in the skill tree are more formal. When matching, it is enough to remove stop words from one side only, so this article only analyzes the node descriptions in the skill tree, which greatly reduces the preprocessing workload.

  1. Stop word filtering
    Previously, only a few publicly available stop-word dictionaries from the Internet were used to filter stop words, but that is not enough for a domain-specific task. After analyzing the data, we found that words with certain parts of speech contribute little to this task, mainly the following POS categories:

    POS tag    Meaning
    c          conjunction
    e          interjection
    h          prefix
    k          suffix
    m          numeral
    o          onomatopoeia
    p          preposition
    r          pronoun
    u          auxiliary word
    w          punctuation
    x          string (non-word characters)
    y          modal particle
    z          status word
    P.S. All POS tags beginning with the above letters are included.

    For POS categories that contain both valid words and stop words, high-frequency stop words were found through statistics plus manual screening and added to the stop-word dictionary; 159 stop words were added in total, as shown below:

    The 159 added stop words are colloquial Chinese terms that frequently appear in question titles; their English glosses include, among others: novice, rookie, newcomer, beginner, self-taught, help, please, beg, ask for advice, urgent, urgently needed, rescue, thanks, thank you, boss, great god, master, teacher, brother, sister, homework, exercise, exam question, a question, source code, programming question, course design, how to write, how to solve, can't find, too hard, recently, at present, and overdue.
  2. POS-based bigram phrases
    For words that the word segmenter splits incorrectly, merging them into bigram phrases makes their meaning clearer and can improve matching accuracy. Specifically, adjacent words are concatenated according to their parts of speech to form larger phrase units. By observing the data, the following common POS combinations that need to be concatenated were obtained (a rough sketch of both preprocessing steps is given after the table):

    First POS    Second POS
    d            q
    d            v
    d            n
    f            q
    h            n
    h            v
    h            a
    n            k
    n            n
    n            q
    n            s
    P.S. All POS tags beginning with the above letters are included.
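As a rough illustration of the two preprocessing steps above, the sketch below assumes the input is a list of (word, POS) pairs produced by some Chinese POS tagger; the tagger itself, the single-letter tag set, and the three stand-in stop words are assumptions, and the tiny stop-word set shown is only a placeholder for the 159-word dictionary.

# Sketch of POS-based stop-word filtering followed by POS-based bigram merging.
# Operates on (word, pos) pairs from some Chinese POS tagger; the tag letters,
# the tiny stop-word set, and all names are assumptions for illustration only.

STOP_POS_PREFIXES = ('c', 'e', 'h', 'k', 'm', 'o', 'p', 'r', 'u', 'w', 'x', 'y', 'z')
STOP_WORDS = {'新手', '求助', '谢谢'}   # stand-ins for the 159-word dictionary

# First-letter POS pairs that are merged into a single bigram phrase
MERGE_PAIRS = {('d', 'q'), ('d', 'v'), ('d', 'n'), ('f', 'q'),
               ('h', 'n'), ('h', 'v'), ('h', 'a'),
               ('n', 'k'), ('n', 'n'), ('n', 'q'), ('n', 's')}

def preprocess(tagged_words):
    """Drop stop words / stop POS tags, then merge adjacent words into bigrams."""
    # 1. Stop-word filtering by POS prefix and by dictionary
    kept = [(w, p) for w, p in tagged_words
            if not p.startswith(STOP_POS_PREFIXES) and w not in STOP_WORDS]

    # 2. POS-based bigram merging
    merged, i = [], 0
    while i < len(kept):
        if i + 1 < len(kept) and (kept[i][1][:1], kept[i + 1][1][:1]) in MERGE_PAIRS:
            merged.append(kept[i][0] + kept[i + 1][0])
            i += 2
        else:
            merged.append(kept[i][0])
            i += 1
    return merged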

2.2.3 Adding "others" nodes

Some user questions go beyond the current domain, or are hard to enumerate exhaustively on the skill tree, so an additional "others" node is added to catch this kind of question. For example, for the Python tag, the "others" node is subdivided into the following three categories:

others
├── Questions about other domain tags
├── Application/exercise questions
└── Third-party package questions

Identifying these "other" questions is mainly rule-based. For the above three categories, the following rules are used (a combined routing sketch is given after the rules):

  • Questions about other domain tags: the question does not contain the current domain's tag, but its title contains tags of other domains.
  • Application/exercise questions:
    re.compile(r'(( how | Yes? ).*?( Calculation | Realization | Make )| practice | subject )')
  • Third-party package questions:
    re.compile(r'( tool kit |(?: The third party |[a-z]+?) library |py(?!thon)[a-z0-9]+| install [a-z0-9]+|[a-z0-9]+ Installation )')
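Putting the three rules together, a minimal sketch of the rule-based routing could look like the following. OTHER_DOMAIN_TAGS and all helper names are hypothetical, and the two regular expressions simply reuse the (machine-translated) keyword alternatives shown above; in practice they would contain the original Chinese keywords.

import re

# Sketch of routing a question title into one of the three "others" sub-nodes.
# All names are hypothetical; the patterns reuse the translated keywords above
# and would contain the original Chinese keywords in a real deployment.
APPLICATION_RE = re.compile(r'(( how | Yes? ).*?( Calculation | Realization | Make )| practice | subject )')
THIRD_PARTY_RE = re.compile(r'( tool kit |(?: The third party |[a-z]+?) library |py(?!thon)[a-z0-9]+| install [a-z0-9]+|[a-z0-9]+ Installation )')

OTHER_DOMAIN_TAGS = {'java', 'c++', 'javascript', 'mysql'}   # hypothetical tag set

def classify_others(title, current_domain_tags):
    """Return the 'others' sub-node a question title falls into, or None."""
    lowered = title.lower()
    if (not any(tag in lowered for tag in current_domain_tags)
            and any(tag in lowered for tag in OTHER_DOMAIN_TAGS)):
        return 'Questions about other domain tags'
    if THIRD_PARTY_RE.search(lowered):
        return 'Third-party package questions'
    if APPLICATION_RE.search(lowered):
        return 'Application/exercise questions'
    return None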

2.3 Matching effect

After the above optimizations, the improvement is as follows. The evaluation metric is accuracy.

Domain tag     Baseline   Now
Python         54%        78%
Java           57%        82%
Cloud native   -          -

P.S. Because there is still little valid cloud-native data, no data set has been built for it and its effect has not been evaluated yet.

3. Summary and next steps

Sixteen skill trees are already supported. This article focuses on optimizing the skill trees of three domain tags: Python, Java, and cloud native. The work mainly covers the optimization of the skill tree structure and of the matching algorithm.

Next steps:

  • Further upgrade the skill tree structure: borrow the tree-building idea of decision tree algorithms to build a skill tree with less redundancy and a more reasonable structure.
  • Build cloud-native data resources.
  • Expand other types of data resources. Currently only Q&A data resources are used; we will consider using heterogeneous data resources from other CSDN modules, including blogs, courses, videos, columns, user favorites, etc.

Related links

P.S.

This series of articles will be continuously updated. We hope colleagues, teachers, and experts in NLP and other fields can offer valuable advice. Thank you!
