Bag-of-words model and TF-IDF

2022-07-06 21:01:00 【wx5d786476cd8b2】
       
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import os
import re
import jieba.posseg as pseg

# Load stop-word lists (disabled: the whole block is wrapped in a string literal)
'''stop_words_path = './stop_words/'
stopwords1 = [line.rstrip() for line in open(os.path.join(stop_words_path, ' Chinese Thesaurus .txt'), 'r', encoding='utf-8')]
stopwords2 = [line.rstrip() for line in open(os.path.join(stop_words_path, ' Stoppage vocabulary of Harbin Institute of Technology .txt'), 'r', encoding='utf-8')]
stopwords3 = [line.rstrip() for line in
              open(os.path.join(stop_words_path, ' The Machine Intelligence Laboratory of Sichuan University stopped using thesaurus .txt'), 'r', encoding='utf-8')]
stopwords = stopwords1 + stopwords2 + stopwords3
'''
        
def proc_text(raw_line):
    """
    Process one line of raw text.
    Return the segmented words joined by spaces.
    """
    # 1. Use a regular expression to remove non-Chinese characters
    filter_pattern = re.compile('[^\u4E00-\u9FD5]+')
    chinese_only = filter_pattern.sub('', raw_line)

    # 2. jieba word segmentation + part-of-speech tagging
    word_list = pseg.cut(chinese_only)

    # 3. Keep only meaningful parts of speech
    #    (stop-word removal is not actually applied in this function):
    #    verbs ('v'), adjectives ('a'), adverbial adjectives ('ad')
    used_flags = ['v', 'a', 'ad']
    meaningful_words = []
    for word, flag in word_list:
        if flag in used_flags:
            meaningful_words.append(word)
    return ' '.join(meaningful_words)
        
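# Note: CountVectorizer's default token_pattern keeps only tokens of two or
# more characters, so single-character words are dropped from the vocabulary
count_vectorizer = CountVectorizer()
transformer = TfidfTransformer()
print(count_vectorizer)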
        
# NOTE: these sample reviews appear here in English translation. proc_text keeps
# only Chinese characters, so it returns an empty string for each of them and
# CountVectorizer has no vocabulary to fit; use Chinese review text to run the
# pipeline end to end.
ch_text1 = "Very disappointed, The script is completely perfunctory, The main plot didn't break through, you can understand, But all the characters lack motivation, Between good and evil、There is no spark inside the women's Federation. unity - split - Although the three-stage style of unity is old-fashioned, it can also make use of the accumulated image charm to make sense, But the script is very superficial、Plane. The scheduling on the scene is chaotic and rigid, Full screen of armor aesthetic fatigue. Only a smile can be regarded as unsatisfactory."
ch_text2 = "2015 The most disappointing work of the year. Think everything is covered, In fact, it's like painting a snake to make it superfluous; Think the theme is profound, In fact, the old tune is repeated; Think that through the old and bring forth the new, In fact, it is unbearable; I thought the scene was very high, But in fact high Lack of strength. gas！ The last episode was completely uninteresting, The laughter point of this episode is obviously deliberately guilty. There is no episode in the whole film, which gives me a time of tension, Too weak., Like aochuang."
ch_text3 = "《Iron Man 2》 Seduce iron man,《Women's Federation 1》 Seduce eagle eyes,《American team 2》 Seduce Captain America, stay 《Women's Federation 2》 Finally …… Confessed to the Hulk, The black widow told us what loyalty is with practical actions; And in order to treat infertility, even combat weapons have become two pregnancy test rods (Firmly believe that kuaiyin is not dead, I have to come back later)"
ch_text4 = "Although from beginning to end, But it's really boring."
ch_text5 = "The plot is not as interesting as the first episode, It all depends on dense laughter to refresh. The direct consequence of too many monks and too few girls is that every widowed sister has to change her teammates to fall in love, It's harder than fighting, Sincerely beg to let go ～～～（At the end, the egg thought it was rocky, As a result, I bah！）"
ch_texts = [ch_text1, ch_text2, ch_text3, ch_text4, ch_text5]
corpus = [proc_text(ch_text) for ch_text in ch_texts]
print(corpus)
        
tfidf = transformer.fit_transform(count_vectorizer.fit_transform(corpus))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
word = count_vectorizer.get_feature_names_out()
print(tfidf)
weights = tfidf.toarray()  # convert to a dense array once and reuse it
print(weights)
# Print the TF-IDF weight of every word for each text: the outer loop walks
# over the texts, the inner loop over the vocabulary
for i in range(len(weights)):
    print("------- TF-IDF word weights for text", i, "-------")
    for j in range(len(word)):
        print(word[j], weights[i][j])
        
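new_text = "The plot is chaotic, I'm so disappointed"
new_pro_text = proc_text(new_text)
print(new_pro_text)
# Use transform(), not fit_transform(), so the IDF learned from the corpus is
# reused instead of being re-fitted on this single document
print(transformer.transform(count_vectorizer.transform([new_pro_text])).toarray())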
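The weights printed by the script come from scikit-learn's TfidfTransformer. As a rough illustration of what those numbers mean, the following pure-Python sketch assumes the transformer's default settings (smooth_idf=True, norm='l2', sublinear_tf=False), under which idf(t) = ln((1 + n) / (1 + df(t))) + 1 and each document vector is scaled to unit length. The helper names fit_idf and tfidf_vector and the toy documents are hypothetical and are not part of the original script; a new document is weighted with the IDF learned from the corpus, and words outside the fitted vocabulary are ignored.

```python
import math
from collections import Counter


def fit_idf(corpus):
    """Learn the vocabulary and smoothed IDF from space-separated documents."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        # Count each term once per document (document frequency).
        doc_freq.update(set(doc.split()))
    vocab = sorted(doc_freq)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    return vocab, idf


def tfidf_vector(doc, vocab, idf):
    """Weight one document with an already-fitted IDF, then L2-normalize."""
    counts = Counter(doc.split())
    raw = [counts[t] * idf[t] for t in vocab]  # unseen vocabulary terms count as 0
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm if norm else 0.0 for w in raw]


# Hypothetical pre-segmented documents (placeholders, not the article's reviews).
toy_corpus = [
    "disappointed perfunctory disappointed",
    "disappointed boring",
    "interesting refreshing",
]
vocab, idf = fit_idf(toy_corpus)

# "chaotic" is not in the fitted vocabulary, so it contributes nothing.
weights = tfidf_vector("disappointed chaotic", vocab, idf)
for term, weight in zip(vocab, weights):
    print(term, round(weight, 4))
```

Because the IDF is computed once from the corpus and only looked up for the new document, the new document's weights stay comparable with those of the corpus documents.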
       
       
Original site
Copyright notice
This article was written by [wx5d786476cd8b2]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/02/202202131134238883.html