Bag-of-words model and TF-IDF
2022-07-06 21:01:00 【wx5d786476cd8b2】
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
import os
import re
import jieba.posseg as pseg
# Load stop-word lists (disabled: the block below is a string literal, so it never runs)
'''stop_words_path = './stop_words/'
stopwords1 = [line.rstrip() for line in open(os.path.join(stop_words_path, ' Chinese Thesaurus .txt'), 'r',encoding='utf-8')]
stopwords2 = [line.rstrip() for line in open(os.path.join(stop_words_path, ' Stoppage vocabulary of Harbin Institute of Technology .txt'), 'r',encoding='utf-8')]
stopwords3 = [line.rstrip() for line in
open(os.path.join(stop_words_path, ' The Machine Intelligence Laboratory of Sichuan University stopped using thesaurus .txt'), 'r', encoding='utf-8')]
stopwords = stopwords1 + stopwords2 + stopwords3
'''
def proc_text(raw_line):
    """
    Clean one raw review and return its segmented words
    as a single space-separated string.
    """
    # 1. Use a regular expression to remove non-Chinese characters
    filter_pattern = re.compile('[^\u4E00-\u9FD5]+')
    chinese_only = filter_pattern.sub('', raw_line)
    # 2. jieba word segmentation with part-of-speech tagging
    word_list = pseg.cut(chinese_only)
    # 3. Keep only the meaningful parts of speech:
    #    verbs (v), adjectives (a) and adverbial adjectives (ad).
    #    The stop-word lists above are disabled, so no stop-word filtering happens here.
    used_flags = ['v', 'a', 'ad']
    meaningful_words = []
    for word, flag in word_list:
        if flag in used_flags:
            meaningful_words.append(word)
    return ' '.join(meaningful_words)
# CountVectorizer builds the bag-of-words counts; TfidfTransformer reweights those counts
count_vectorizer = CountVectorizer()
transformer = TfidfTransformer()
print(count_vectorizer)
# Sample reviews. proc_text() keeps only Chinese characters, so the pipeline needs the
# reviews in Chinese; English text such as the renderings below is filtered out entirely.
ch_text1 = "Very disappointed, The script is completely perfunctory, The main plot didn't break through, you can understand, But all the characters lack motivation, Between good and evil、There is no spark inside the women's Federation. unity - split - Although the three-stage style of unity is old-fashioned, it can also make use of the accumulated image charm to make sense, But the script is very superficial、Plane. The scheduling on the scene is chaotic and rigid, Full screen of armor aesthetic fatigue. Only a smile can be regarded as unsatisfactory."
ch_text2 = "2015 The most disappointing work of the year. Think everything is covered, In fact, it's like painting a snake to make it superfluous; Think the theme is profound, In fact, the old tune is repeated; Think that through the old and bring forth the new, In fact, it is unbearable; I thought the scene was very high, But in fact high Lack of strength. gas! The last episode was completely uninteresting, The laughter point of this episode is obviously deliberately guilty. There is no episode in the whole film, which gives me a time of tension, Too weak., Like aochuang."
ch_text3 = "《Iron Man 2》Seduce iron man,《Women's Federation 1》Seduce eagle eyes,《American team 2》Seduce Captain America, stay《Women's Federation 2》Finally …… Confessed to the Hulk, The black widow told us what loyalty is with practical actions; And in order to treat infertility, even combat weapons have become two pregnancy test rods (Firmly believe that kuaiyin is not dead, I have to come back later)"
ch_text4 = "Although from beginning to end, But it's really boring."
ch_text5 = "The plot is not as interesting as the first episode, It all depends on dense laughter to refresh. The direct consequence of too many monks and too few girls is that every widowed sister has to change her teammates to fall in love, It's harder than fighting, Sincerely beg to let go ~~~(At the end, the egg thought it was rocky, As a result, I bah!)"
ch_texts = [ch_text1, ch_text2, ch_text3, ch_text4, ch_text5]
corpus = [proc_text(ch_text) for ch_text in ch_texts]
print(corpus)
# Bag-of-words counts first, then TF-IDF weights computed from those counts
tfidf = transformer.fit_transform(count_vectorizer.fit_transform(corpus))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
word = count_vectorizer.get_feature_names_out()
print(tfidf)
tfidf_array = tfidf.toarray()
print(tfidf_array)
# Outer loop: every document; inner loop: every vocabulary word and its weight in that document
for i in range(len(tfidf_array)):
    print("------- TF-IDF word weights for document", i, "-------")
    for j in range(len(word)):
        print(word[j], tfidf_array[i][j])
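new_text = "The plot is chaotic, I'm so disappointed"
new_pro_text = proc_text(new_text)
print(new_pro_text)
# Reuse the IDF weights learned from the corpus: call transform(), not fit_transform(),
# so the new document is not used to refit the model.
print(transformer.transform(count_vectorizer.transform([new_pro_text])).toarray())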
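The script above leaves the TF-IDF arithmetic to scikit-learn. As a rough sketch of what happens underneath, the example below recomputes the weights by hand on a small, hypothetical pre-segmented corpus (`toy_corpus` stands in for the output of `proc_text` and is not data from the original post). It assumes the `TfidfTransformer` defaults (`smooth_idf=True`, `norm="l2"`). The custom `token_pattern` is also an assumption added here, not part of the original script.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical pre-segmented reviews (space-separated tokens), standing in
# for what proc_text() would return. Not taken from the original post.
toy_corpus = ["失望 敷衍 失望", "失望 无聊", "有趣 提神"]

# Assumption: this pattern also keeps single-character tokens.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b\w+\b")
counts = vectorizer.fit_transform(toy_corpus)

# Defaults: smooth_idf=True, norm="l2"
tfidf_model = TfidfTransformer()
weights = tfidf_model.fit_transform(counts).toarray()

# Recompute the same weights by hand:
#   idf(t) = ln((1 + n_docs) / (1 + df(t))) + 1
#   tfidf  = count * idf, then each row is scaled to unit L2 length
dense_counts = counts.toarray()
n_docs = dense_counts.shape[0]
doc_freq = (dense_counts > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
raw = dense_counts * idf
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)
print(np.allclose(weights, manual))

# A new document is only transformed, never refitted, so it is weighted
# with the IDF values learned from toy_corpus. Words outside the fitted
# vocabulary (here 混乱) are ignored.
new_counts = vectorizer.transform(["失望 混乱"])
new_weights = tfidf_model.transform(new_counts).toarray()
print(new_weights)
```

The default `CountVectorizer` token pattern only matches tokens of two or more characters, so single-character Chinese words would be dropped silently. Whether to keep them depends on the task, which is why the pattern above is labelled an assumption.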