Bag-of-words model and TF-IDF
2022-07-06 21:01:00 【wx5d786476cd8b2】
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer
import os
import re
import jieba.posseg as pseg
# Load the stop-word lists (disabled: the block is wrapped in a string literal, so it never runs)
'''stop_words_path = './stop_words/'
stopwords1 = [line.rstrip() for line in open(os.path.join(stop_words_path, ' Chinese Thesaurus .txt'), 'r',encoding='utf-8')]
stopwords2 = [line.rstrip() for line in open(os.path.join(stop_words_path, ' Stoppage vocabulary of Harbin Institute of Technology .txt'), 'r',encoding='utf-8')]
stopwords3 = [line.rstrip() for line in
              open(os.path.join(stop_words_path, ' The Machine Intelligence Laboratory of Sichuan University stopped using thesaurus .txt'), 'r', encoding='utf-8')]
stopwords = stopwords1 + stopwords2 + stopwords3
'''
def proc_text(raw_line):
    """
    Process one line of raw text.
    Return the segmented words joined by spaces.
    """
    # 1. Use a regular expression to remove non-Chinese characters
    filter_pattern = re.compile('[^\u4E00-\u9FD5]+')
    chinese_only = filter_pattern.sub('', raw_line)
    # 2. Segment with jieba and tag each word's part of speech
    word_list = pseg.cut(chinese_only)
    # 3. Keep only the meaningful parts of speech:
    #    verbs ('v'), adjectives ('a'), adverbial adjectives ('ad')
    used_flags = ['v', 'a', 'ad']
    meaningful_words = []
    for word, flag in word_list:
        if flag in used_flags:
            meaningful_words.append(word)
    return ' '.join(meaningful_words)
count_vectorizer = CountVectorizer()
transformer=TfidfTransformer()
print(count_vectorizer)
# Sample film reviews. proc_text keeps only Chinese characters, so the corpus must be
# Chinese text; the strings here are English renderings of the reviews.
ch_text1 = " Very disappointed , The script is completely perfunctory , The main plot didn't break through, you can understand , But all the characters lack motivation , Between good and evil 、 There is no spark inside the women's Federation . unity - split - Although the three-stage style of unity is old-fashioned, it can also make use of the accumulated image charm to make sense , But the script is very superficial 、 Plane . The scheduling on the scene is chaotic and rigid , Full screen of armor aesthetic fatigue . Only a smile can be regarded as unsatisfactory ."
ch_text2 = " 2015 The most disappointing work of the year . Think everything is covered , In fact, it's like painting a snake to make it superfluous ; Think the theme is profound , In fact, the old tune is repeated ; Think that through the old and bring forth the new , In fact, it is unbearable ; I thought the scene was very high, But in fact high Lack of strength . gas ! The last episode was completely uninteresting , The laughter point of this episode is obviously deliberately guilty . There is no episode in the whole film, which gives me a time of tension , Too weak. , Like aochuang ."
ch_text3 = " 《 Iron Man 2》 Seduce iron man ,《 Women's Federation 1》 Seduce eagle eyes ,《 American team 2》 Seduce Captain America , stay 《 Women's Federation 2》 Finally …… Confessed to the Hulk , The black widow told us what loyalty is with practical actions ; And in order to treat infertility, even combat weapons have become two pregnancy test rods ( Firmly believe that kuaiyin is not dead , I have to come back later )"
ch_text4 = " Although from beginning to end , But it's really boring ."
ch_text5 = " The plot is not as interesting as the first episode , It all depends on dense laughter to refresh . The direct consequence of too many monks and too few girls is that every widowed sister has to change her teammates to fall in love , It's harder than fighting , Sincerely beg to let go ~~~( At the end, the egg thought it was rocky , As a result, I bah !)"
ch_texts = [ch_text1, ch_text2, ch_text3, ch_text4, ch_text5]
corpus = [proc_text(ch_text) for ch_text in ch_texts]
print(corpus)
tfidf = transformer.fit_transform(count_vectorizer.fit_transform(corpus))
# Vocabulary learned by the vectorizer (use get_feature_names() on scikit-learn older than 1.0)
word = count_vectorizer.get_feature_names_out()
print(tfidf)
weights = tfidf.toarray()
print(weights)
# Outer loop: every document; inner loop: the TF-IDF weight of each vocabulary word in that document
for i in range(len(weights)):
    print("------- TF-IDF word weights for document", i, "-------")
    for j in range(len(word)):
        print(word[j], weights[i][j])
new_text = " The plot is chaotic , I'm so disappointed "
new_pro_text = proc_text(new_text)
print(new_pro_text)
# Reuse the vocabulary and IDF weights learned from the corpus: call transform(), not fit_transform()
print(transformer.transform(count_vectorizer.transform([new_pro_text])).toarray())
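
The weights printed by the script above come from scikit-learn's default TfidfTransformer settings: a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplied by the raw term count and followed by L2 normalization of each row. The sketch below reproduces that arithmetic with the standard library only, so the numbers can be checked by hand. The helper name `tfidf_rows` and the three-document toy corpus are made up for illustration and are not part of the original script. Unlike CountVectorizer, this sketch splits on whitespace and keeps every token, whereas CountVectorizer's default token pattern drops single-character tokens.

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """Compute TF-IDF rows for space-separated, pre-segmented documents.

    Hypothetical helper: mirrors the smoothed-IDF + L2-normalization
    defaults of TfidfTransformer, but tokenizes with a plain split().
    Returns (vocab, rows), where rows[i][j] is the weight of vocab[j]
    in document i.
    """
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)

    # Document frequency: in how many documents each token appears.
    df = Counter(tok for toks in tokenized for tok in set(toks))

    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}

    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        # L2-normalize the row; an empty document stays all zeros.
        norm = math.sqrt(sum(w * w for w in raw))
        rows.append([w / norm if norm else 0.0 for w in raw])
    return vocab, rows


# Toy corpus (assumed for illustration): already segmented, space-joined,
# in the same shape that proc_text returns.
toy_corpus = ["失望 敷衍 失望", "失望 无聊", "有趣"]
vocab, rows = tfidf_rows(toy_corpus)

for i, row in enumerate(rows):
    print("document", i)
    for tok, weight in zip(vocab, row):
        print("  ", tok, round(weight, 4))
```

In the second toy document, 无聊 receives a higher weight than 失望 because 失望 also occurs in the first document and therefore has a lower IDF. The same effect explains why words shared by many reviews score lower in the script's output.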