ML9 self-study notes
2022-07-29 06:17:00 【19-year-old flower girl】
Text features
import pandas as pd
import numpy as np
import re
import nltk #pip install nltk
Basic preprocessing
corpus = ['The sky is blue and beautiful.',
'Love this blue and beautiful sky!',
'The quick brown fox jumps over the lazy dog.',
'The brown fox is quick and the blue dog is lazy!',
'The sky is very blue and the sky is very beautiful today',
'The dog is lazy but the brown fox is quick!'
]
labels = ['weather', 'weather', 'animals', 'animals', 'weather', 'animals']
corpus = np.array(corpus)
corpus_df = pd.DataFrame({
'Document': corpus,
'Category': labels})
corpus_df = corpus_df[['Document', 'Category']]
corpus_df

The task is to classify each document by topic: animals or weather.
Running the line below opens the NLTK downloader in a pop-up window; go to the rightmost column and install stopwords. (Calling nltk.download('stopwords') fetches the same package directly.)
nltk.download()
Remove stop words, which do not help identify the topic.
# Load stop words
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')
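def normalize_document(doc):
    # Remove special characters (flags must be a keyword argument;
    # the fourth positional argument of re.sub is count)
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I)
    # Convert to lowercase and trim surrounding whitespace
    doc = doc.lower()
    doc = doc.strip()
    # Tokenize
    tokens = wpt.tokenize(doc)
    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Rejoin the tokens into a document
    doc = ' '.join(filtered_tokens)
    return doc

# Vectorize the function so it can be applied to the whole corpus array
normalize_corpus = np.vectorize(normalize_document)
norm_corpus = normalize_corpus(corpus)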
norm_corpus
This is the processed result; compare it with the original corpus.
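To see what the normalization does to a single sentence without downloading any NLTK data, here is a minimal sketch. The name normalize_text and the tiny DEMO_STOP_WORDS set are assumptions made up for this illustration; NLTK's English stop word list is much longer. Plain whitespace splitting stands in for WordPunctTokenizer, which is equivalent here only because punctuation is stripped first.

```python
import re

# Assumption: a small hand-picked stop word set, used only so this sketch
# runs without NLTK data. It is not NLTK's English list.
DEMO_STOP_WORDS = {"the", "is", "and", "this", "over", "very", "but"}

def normalize_text(doc, stop_words=DEMO_STOP_WORDS):
    """Strip punctuation, lowercase, split on whitespace, drop stop words."""
    doc = re.sub(r"[^a-zA-Z0-9\s]", "", doc)
    tokens = doc.lower().split()
    return " ".join(token for token in tokens if token not in stop_words)

before = "The sky is blue and beautiful."
after = normalize_text(before)
print(before, "->", after)
```

With this demo list, only the content words of the sentence survive.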
Bag-of-words model
The bag-of-words model counts how often each word occurs.
from sklearn.feature_extraction.text import CountVectorizer
print(norm_corpus)
# Instantiate the vectorizer
cv = CountVectorizer(min_df=0., max_df=1.)
cv.fit(norm_corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(cv.get_feature_names_out())
cv_matrix = cv.fit_transform(norm_corpus)
# Convert the sparse matrix to a dense array
cv_matrix = cv_matrix.toarray()
cv_matrix
The distinct words across all sentences form the vocabulary.
The encoded result: a word that appears once in a sentence is counted as 1, a word that appears twice as 2, and a word that does not appear as 0.
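The counting itself can be sketched in plain Python. The two documents and the helper count_vector are hypothetical, chosen only so the counts are easy to check by hand. The vocabulary is sorted alphabetically, which is also the order CountVectorizer uses for its columns.

```python
from collections import Counter

# Hypothetical, already-normalized documents
docs = ["sky blue beautiful", "sky blue sky beautiful today"]

# Vocabulary: every distinct word, in alphabetical order
vocab = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc, vocab):
    """Return one count per vocabulary word for a single document."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

matrix = [count_vector(doc, vocab) for doc in docs]
print(vocab)
for row in matrix:
    print(row)
```

In the second row, "sky" gets a 2 because it occurs twice; in the first row, "today" gets a 0 because it is absent.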
N-Grams Model
N-grams make up for the lack of context information in the bag-of-words model.
They take combinations of adjacent words into account; ngram_range=(2,2) means combinations of exactly two words.
In practice the combinations are usually limited to two words, because longer ones make the matrix larger and sparser.
bv = CountVectorizer(ngram_range=(2,2))
bv_matrix = bv.fit_transform(norm_corpus)
bv_matrix = bv_matrix.toarray()
vocab = bv.get_feature_names_out()
pd.DataFrame(bv_matrix, columns=vocab)
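What counts as a two-word combination can be shown with a short sketch. The helper word_ngrams is a hypothetical name for this illustration, not part of scikit-learn; it simply slides a window of n tokens over the sentence.

```python
def word_ngrams(doc, n=2):
    """Return all runs of n adjacent words in a whitespace-separated document."""
    tokens = doc.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = word_ngrams("quick brown fox jumps")
print(bigrams)
```

A sentence of k words yields k - 1 bigrams, and most of them appear in only one document, which is why the resulting matrix grows wider and sparser.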

TF-IDF Model
TF: term frequency; IDF: inverse document frequency.
If a word is rare across the corpus as a whole (high IDF) but occurs many times in the current document (high TF), its TF-IDF value is large: the word is more important and more discriminative.
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()
vocab = tv.get_feature_names_out()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)
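The weighting can be worked through by hand. The sketch below assumes scikit-learn's default settings (raw counts for TF, smoothed IDF ln((1 + n) / (1 + df)) + 1, then L2 normalization of each document); the token lists and the helper names idf and tfidf are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenized documents
docs = [["sky", "blue", "sky"], ["dog", "blue"], ["dog", "lazy", "dog"]]
n_docs = len(docs)

# Document frequency: the number of documents each term appears in
df = Counter(term for doc in docs for term in set(doc))

def idf(term):
    # Smoothed IDF, assuming scikit-learn's default (smooth_idf=True)
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(doc):
    """TF-IDF weights for one document, scaled to unit L2 norm."""
    tf = Counter(doc)
    weights = {term: count * idf(term) for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

weights = tfidf(docs[0])
print(weights)
```

In the first document, "sky" outweighs "blue" for two reasons: it occurs twice there, and it appears in only one document while "blue" appears in two.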

Similarity features
The similarity between documents can also be used as a feature.
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df
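For two vectors, cosine similarity is the dot product divided by the product of their lengths. Here is a minimal sketch with made-up vectors; the function cosine is a hypothetical helper, not the scikit-learn implementation.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]  # same direction as a, twice the length
c = [0.0, 0.0, 3.0]  # shares no non-zero component with a

print(cosine(a, b))
print(cosine(a, c))
```

Vectors pointing in the same direction score (up to rounding) 1 regardless of length, and vectors with no terms in common score 0, so documents on the same topic should get higher values.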

Cluster features
These do not work very well in practice.
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2)
km.fit_transform(similarity_df)
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
pd.concat([corpus_df, cluster_labels], axis=1)
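When reading the table, keep in mind that the cluster numbers 0 and 1 are arbitrary: the same grouping can come out with the numbers swapped. A small sketch of how the labels could be compared with Category under both possible mappings follows. The function cluster_agreement and the example_labels values are hypothetical, not a recorded KMeans result.

```python
def cluster_agreement(cluster_labels, categories):
    """Best fraction of documents whose cluster matches its category,
    trying both ways of mapping two cluster ids to two category names."""
    names = sorted(set(categories))
    best = 0.0
    for mapping in ({0: names[0], 1: names[1]}, {0: names[1], 1: names[0]}):
        hits = sum(mapping[c] == cat for c, cat in zip(cluster_labels, categories))
        best = max(best, hits / len(categories))
    return best

categories = ['weather', 'weather', 'animals', 'animals', 'weather', 'animals']
example_labels = [1, 1, 0, 0, 1, 0]  # hypothetical cluster ids for illustration
score = cluster_agreement(example_labels, categories)
print(score)
```

With the code above, km.labels_ would be passed as cluster_labels (the array, before it is wrapped in a DataFrame).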
Topic model
Less commonly used.
from sklearn.decomposition import LatentDirichletAllocation
# n_topics was renamed to n_components in scikit-learn 0.19
lda = LatentDirichletAllocation(n_components=2, max_iter=100, random_state=42)
dt_matrix = lda.fit_transform(tv_matrix)
features = pd.DataFrame(dt_matrix, columns=['T1', 'T2'])
features
Word embedding model word2vec
Word2vec addresses the problem mentioned earlier: the previous models ignore the relationship between a word and its context. Words that occur in similar contexts, such as apple and banana, or keyboard and mouse, should lie close together in the vector space.
from gensim.models import word2vec
wpt = nltk.WordPunctTokenizer()
tokenized_corpus = [wpt.tokenize(document) for document in norm_corpus]
# Set the model parameters
feature_size = 10 # Word vector dimension
window_context = 10 # Sliding window size
min_word_count = 1 # Minimum word frequency
# gensim >= 4.0 uses vector_size (the parameter was called size in gensim 3.x)
w2v_model = word2vec.Word2Vec(tokenized_corpus, vector_size=feature_size,
                              window=window_context, min_count=min_word_count)

For a sentence, add up the vectors of its words position by position and divide by the number of words; the result is the vector for the sentence.
def average_word_vectors(words, model, vocabulary, num_features):
    feature_vector = np.zeros((num_features,), dtype="float64")
    nwords = 0.
    for word in words:
        if word in vocabulary:
            nwords = nwords + 1.
            # Look vectors up through model.wv; indexing the model directly
            # was removed in gensim 4.0
            feature_vector = np.add(feature_vector, model.wv[word])
    if nwords:
        feature_vector = np.divide(feature_vector, nwords)
    return feature_vector

def averaged_word_vectorizer(corpus, model, num_features):
    # index_to_key is the gensim >= 4.0 name for index2word
    vocabulary = set(model.wv.index_to_key)
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                for tokenized_sentence in corpus]
    return np.array(features)
Build the document vectors.
w2v_feature_array = averaged_word_vectorizer(corpus=tokenized_corpus, model=w2v_model,
num_features=feature_size)
pd.DataFrame(w2v_feature_array) #lstm
Each row is a document, and the columns are the components of its ten-dimensional vector.
Taking a plain average is in fact somewhat problematic; improvements come later.
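One known limitation of averaging, possibly what is meant here, is that it discards word order. The sketch below uses made-up three-dimensional vectors (toy_vectors and sentence_vector are hypothetical names, not gensim objects) to show two sentences with opposite meanings receiving the same vector.

```python
# Hypothetical 3-dimensional word vectors, made up for illustration
toy_vectors = {
    "dog":   [1.0, 0.0, 2.0],
    "bites": [0.0, 3.0, 1.0],
    "man":   [2.0, 3.0, 0.0],
}

def sentence_vector(words, vectors, dim=3):
    """Average the vectors of the known words; zeros if none are known."""
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]

v1 = sentence_vector(["dog", "bites", "man"], toy_vectors)
v2 = sentence_vector(["man", "bites", "dog"], toy_vectors)
print(v1)
print(v1 == v2)
```

Because addition does not care about order, any reordering of the same words averages to the same point.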