
Bag-of-Words Model and TF-IDF

2022-07-06 21:01:00 wx5d786476cd8b2
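
This post walks through a minimal bag-of-words / TF-IDF pipeline in Python: jieba segments the Chinese reviews and tags parts of speech, CountVectorizer turns the cleaned corpus into a term-count (bag-of-words) matrix, and TfidfTransformer re-weights those counts. For reference, with scikit-learn's default settings (smooth_idf=True, norm='l2') the transformer computes, for term $t$ in document $d$ of an $n$-document corpus,

$$\text{tf-idf}(t,d) = \text{tf}(t,d)\cdot\text{idf}(t), \qquad \text{idf}(t) = \ln\frac{1+n}{1+\text{df}(t)} + 1,$$

where $\text{df}(t)$ is the number of documents containing $t$, and then L2-normalizes each document vector.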


       
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
import os
import re
import jieba.posseg as pseg

# Load stop word lists (optional; this block is left commented out)
'''stop_words_path = './stop_words/'
stopwords1 = [line.rstrip() for line in open(os.path.join(stop_words_path, ' Chinese Thesaurus .txt'), 'r', encoding='utf-8')]
stopwords2 = [line.rstrip() for line in open(os.path.join(stop_words_path, ' Stoppage vocabulary of Harbin Institute of Technology .txt'), 'r', encoding='utf-8')]
stopwords3 = [line.rstrip() for line in open(os.path.join(stop_words_path, ' The Machine Intelligence Laboratory of Sichuan University stopped using thesaurus .txt'), 'r', encoding='utf-8')]
stopwords = stopwords1 + stopwords2 + stopwords3
'''
def proc_text(raw_line):
    """
    Process one line of raw text and return the segmentation result
    as a space-separated string of the kept words.
    """
    # 1. Use a regular expression to remove all non-Chinese characters
    filter_pattern = re.compile('[^\u4E00-\u9FD5]+')
    chinese_only = filter_pattern.sub('', raw_line)

    # 2. jieba word segmentation with part-of-speech tagging
    word_list = pseg.cut(chinese_only)

    # 3. Keep only meaningful parts of speech:
    #    verbs ('v'), adjectives ('a') and adverbially used adjectives ('ad')
    used_flags = ['v', 'a', 'ad']
    meaningful_words = []
    for word, flag in word_list:
        if flag in used_flags:
            meaningful_words.append(word)
    return ' '.join(meaningful_words)
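
Note that proc_text filters purely by part of speech; the stop word lists in the commented-out block above are never actually applied. The variant below is only a sketch (proc_text_with_stopwords is an illustrative name, not from the original post) of one way to combine both filters if the lists are loaded:

def proc_text_with_stopwords(raw_line, stopwords):
    """Sketch: same part-of-speech filter as proc_text, plus a stop word filter."""
    chinese_only = re.compile('[^\u4E00-\u9FD5]+').sub('', raw_line)
    return ' '.join(w for w, flag in pseg.cut(chinese_only)
                    if flag in ('v', 'a', 'ad') and w not in stopwords)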
# Bag-of-words counter and TF-IDF transformer
count_vectorizer = CountVectorizer()
transformer = TfidfTransformer()
print(count_vectorizer)   # prints the vectorizer's parameters
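
As a quick illustration of the bag-of-words step on its own (the toy documents below are hypothetical, not part of the original post): CountVectorizer learns a vocabulary over the corpus and counts how often each term occurs in each document.

toy_corpus = ['very boring plot', 'boring boring script', 'great plot great script']
toy_vec = CountVectorizer()
toy_counts = toy_vec.fit_transform(toy_corpus)
print(toy_vec.vocabulary_)      # term -> column index mapping
print(toy_counts.toarray())     # document-term count matrix: rows = documents, columns = terms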
# Five sample movie reviews (the original post used the Chinese review text;
# note that proc_text keeps Chinese characters only)
ch_text1 = 'Very disappointing. The script is completely perfunctory; that the main plot breaks no new ground is understandable, but every character lacks motivation, and there is no spark either between good and evil or within the Avengers themselves. The unity-split-reunion three-act structure is old-fashioned but could still have traded on the accumulated charm of the characters; the script, however, is shallow and flat. The staging is chaotic and rigid, and a screen full of armor brings on aesthetic fatigue. Only a few laughs keep it from being a complete letdown.'
ch_text2 = 'The most disappointing work of 2015. It thinks it covers everything, but it is really gilding the lily; it thinks its themes are profound, but it only rehashes old tunes; it thinks it refreshes the formula, but the result is unbearable; I expected an epic scale, but it lacks the strength to deliver it. Ugh! The finale is completely flat, and the jokes in this installment feel forced. Not a single sequence in the whole film gave me any tension. Too weak, just like Ultron.'
ch_text3 = '《Iron Man 2》 teased Iron Man, 《Avengers 1》 teased Hawkeye, 《Captain America 2》 teased Captain America, and in 《Avengers 2》 she finally... confessed to the Hulk. With her actions, Black Widow showed us what loyalty means; and to treat infertility, even combat weapons turn into two pregnancy test sticks (I firmly believe Quicksilver is not dead and will have to come back later).'
ch_text4 = 'Although it goes from beginning to end, it is really boring.'
ch_text5 = 'The plot is not as interesting as the first one; it relies entirely on a dense stream of jokes to stay fresh. The direct consequence of too many heroes and too few women is that Black Widow has to switch partners and fall in love in every installment, which is harder than the fighting. Please just let her be. (I thought the post-credits scene would be Loki, but bah, it was not!)'
ch_texts = [ch_text1, ch_text2, ch_text3, ch_text4, ch_text5]
corpus = [proc_text(ch_text) for ch_text in ch_texts]
print(corpus)

# Bag-of-words counts -> TF-IDF weights
tfidf = transformer.fit_transform(count_vectorizer.fit_transform(corpus))
word = count_vectorizer.get_feature_names()   # vocabulary terms (use get_feature_names_out() on scikit-learn >= 1.2)
print(tfidf)
print(tfidf.toarray())
# Print the TF-IDF weight of every word for each document:
# the outer loop iterates over documents, the inner loop over vocabulary terms
for i in range(len(tfidf.toarray())):
    print("------- TF-IDF weights of the words in document", i, "-------")
    for j in range(len(word)):
        print(word[j], tfidf.toarray()[i][j])
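
Printing every weight for every document is verbose; the small sketch below (not in the original post) lists only the top-weighted terms of each document:

import numpy as np

weights = tfidf.toarray()
for i, row in enumerate(weights):
    top = np.argsort(row)[::-1][:5]   # indices of the five largest TF-IDF weights in this document
    print(i, [(word[j], round(float(row[j]), 3)) for j in top if row[j] > 0])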
# Weight a new document using the vocabulary and IDF already fitted on the corpus
new_text = "The plot is chaotic, I'm so disappointed"
new_pro_text = proc_text(new_text)
print(new_pro_text)
# Use transform (not fit_transform) so the new document is weighted with the IDF learned above
print(transformer.transform(count_vectorizer.transform([new_pro_text])).toarray())
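
For comparison, scikit-learn's TfidfVectorizer combines CountVectorizer and TfidfTransformer in a single estimator; a minimal equivalent sketch of the pipeline above:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(corpus)          # fit vocabulary and IDF on the training corpus
print(tfidf_matrix.toarray())
print(tfidf_vec.transform([new_pro_text]).toarray())    # weight a new document with the fitted IDF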



Copyright notice
This article was written by [wx5d786476cd8b2]; please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/02/202202131134238883.html