Ml9 self study notes
2022-07-29 06:17:00 【19-year-old flower girl】
Text features
import pandas as pd
import numpy as np
import re
import nltk #pip install nltk
Basic preprocessing
corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!'
          ]
labels = ['weather', 'weather', 'animals', 'animals', 'weather', 'animals']
corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus,
                          'Category': labels})
corpus_df = corpus_df[['Document', 'Category']]
corpus_df

The task is to classify each document by topic: animals or weather.
Running the statement below opens the NLTK downloader window. In its rightmost tab, install the stopwords package.
nltk.download()  # nltk.download('stopwords') fetches only the stop word list, without the window
Next, remove stop words, which carry little information about the topic.
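# Load the tokenizer and the English stop word list
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # Remove special characters. Flags must be passed by keyword:
    # the fourth positional argument of re.sub is count, not flags.
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I)
    # Convert to lowercase and trim surrounding whitespace
    doc = doc.lower()
    doc = doc.strip()
    # Tokenize
    tokens = wpt.tokenize(doc)
    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Rejoin the tokens into a document
    doc = ' '.join(filtered_tokens)
    return doc

# Vectorize the function so it can be applied to the whole corpus array
normalize_corpus = np.vectorize(normalize_document)
norm_corpus = normalize_corpus(corpus)
norm_corpus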
This is the normalized corpus. Compare it with the original documents.
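To see what the normalization step does without downloading anything, here is a minimal sketch that does not use NLTK. The stop word set `TOY_STOP_WORDS` and the function `toy_normalize` are made up for illustration, and the sketch splits on whitespace instead of using WordPunctTokenizer, so it only approximates the code above.

```python
import re

# Hypothetical, hand-picked stop words; NLTK's English list is much longer.
TOY_STOP_WORDS = {"the", "is", "and", "this", "over", "but", "very"}

def toy_normalize(doc):
    # Keep letters, digits and whitespace only
    doc = re.sub(r"[^a-zA-Z0-9\s]", "", doc)
    # Lowercase, trim, and split on whitespace
    tokens = doc.lower().strip().split()
    # Drop stop words and rejoin
    return " ".join(t for t in tokens if t not in TOY_STOP_WORDS)

print(toy_normalize("The sky is blue and beautiful."))
# sky blue beautiful
print(toy_normalize("The quick brown fox jumps over the lazy dog."))
# quick brown fox jumps lazy dog
```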
Bag-of-words model
The bag-of-words model counts how often each word occurs in each document.
from sklearn.feature_extraction.text import CountVectorizer
print (norm_corpus)
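# Instantiate the vectorizer
cv = CountVectorizer(min_df=0., max_df=1.)
cv.fit(norm_corpus)
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
print(cv.get_feature_names_out())
cv_matrix = cv.fit_transform(norm_corpus)
# Convert the sparse matrix to a dense array
cv_matrix = cv_matrix.toarray()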
cv_matrix
The distinct words across all sentences form the vocabulary.
In the encoded matrix, each row is a document and each column is a word. An entry is 1 if the word occurs once, 2 if it occurs twice, and 0 if it does not occur.
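The counting itself is simple enough to do by hand. The following is a rough sketch of the idea in plain Python, not scikit-learn's implementation; the two documents are normalized sentences from the corpus above, and the names `docs`, `vocab` and `count_vector` are only for this example.

```python
from collections import Counter

docs = ["sky blue beautiful", "sky blue sky beautiful today"]

# Vocabulary: the sorted set of distinct words across all documents
vocab = sorted({word for doc in docs for word in doc.split()})

def count_vector(doc):
    # Count each word of the document, then read the counts in vocabulary order
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

matrix = [count_vector(doc) for doc in docs]
print(vocab)   # ['beautiful', 'blue', 'sky', 'today']
print(matrix)  # [[1, 1, 1, 0], [1, 1, 2, 1]]
```

"sky" occurs twice in the second document, so its entry is 2. "today" is missing from the first document, so its entry there is 0.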
N-grams model
N-grams make up for the lack of context information in the bag-of-words model.
They consider combinations of adjacent words. ngram_range=(2,2) means that only two-word combinations (bigrams) are used.
In practice n is usually kept at two, because longer combinations make the matrix larger and sparser.
bv = CountVectorizer(ngram_range=(2,2))
bv_matrix = bv.fit_transform(norm_corpus)
bv_matrix = bv_matrix.toarray()
vocab = bv.get_feature_names_out()
pd.DataFrame(bv_matrix, columns=vocab)
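As a small illustration of what a bigram is, the sketch below slides a window of n words over one normalized sentence. `word_ngrams` is a name made up here, and this is only the idea behind ngram_range, not how CountVectorizer is written.

```python
def word_ngrams(doc, n=2):
    # Slide a window of n consecutive tokens over the document
    tokens = doc.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(word_ngrams("quick brown fox jumps lazy dog"))
# ['quick brown', 'brown fox', 'fox jumps', 'jumps lazy', 'lazy dog']
```

Every distinct bigram becomes its own column, which is why the matrix grows and becomes sparse so quickly.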

TF-IDF Model
TF: term frequency. IDF: inverse document frequency.
If a word is rare across the corpus (high IDF) but occurs many times in the current document (high TF), its TF-IDF value is large. Such a word is more important and more discriminative.
from sklearn.feature_extraction.text import TfidfVectorizer
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()
vocab = tv.get_feature_names_out()
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)
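To make the weighting concrete, here is a hand-rolled sketch. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalization of each row, which is my understanding of TfidfVectorizer's defaults. The three short documents and all names are made up for this example.

```python
import math
from collections import Counter

docs = ["sky blue beautiful", "brown fox lazy dog", "sky blue sky today"]
tokenized = [doc.split() for doc in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word
df = {w: sum(w in doc for doc in tokenized) for w in vocab}
# Smoothed IDF (assumed to match scikit-learn's default smooth_idf=True)
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_vector(doc):
    # Raw term count times IDF, then scale the row to unit L2 length
    counts = Counter(doc)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw] if norm else raw

rows = [tfidf_vector(doc) for doc in tokenized]
first = dict(zip(vocab, rows[0]))
print(round(first["beautiful"], 2), round(first["sky"], 2))
```

In the first document "beautiful" gets a larger weight than "sky", because "sky" also appears in another document and so has a lower IDF.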

Similarity features
The similarity between articles can also be used as a feature .
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df
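For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch in plain Python follows; returning 0.0 for an all-zero vector is a choice made for this example.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the two vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # choice for this sketch: an empty document is similar to nothing
    return dot / (norm_u * norm_v)

print(cosine([1, 1, 0], [2, 2, 0]))  # same direction, close to 1
print(cosine([1, 0, 0], [0, 1, 0]))  # no shared words, 0.0
```

Two documents that share no words get 0, and two documents with the same word proportions get a value close to 1 regardless of their length.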

Clustering features
Not very useful in practice.
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2)
km.fit_transform(similarity_df)
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
pd.concat([corpus_df, cluster_labels], axis=1)
Topic model
Less commonly used.
from sklearn.decomposition import LatentDirichletAllocation
# n_components replaces the older n_topics parameter
lda = LatentDirichletAllocation(n_components=2, max_iter=100, random_state=42)
dt_matrix = lda.fit_transform(tv_matrix)
features = pd.DataFrame(dt_matrix, columns=['T1', 'T2'])
features
Word embedding model: word2vec
Word embeddings address the problem mentioned earlier: the previous models ignore the relationship between a word and its context. Words that are used in similar contexts, such as apple and banana or keyboard and mouse, should lie close together in the vector space.
from gensim.models import word2vec
wpt = nltk.WordPunctTokenizer()
tokenized_corpus = [wpt.tokenize(document) for document in norm_corpus]
# Set the model parameters
feature_size = 10 # Word vector dimension
window_context = 10 # Sliding window size
min_word_count = 1 # Minimum word frequency
# gensim 4.x names this parameter vector_size; gensim 3.x used size
w2v_model = word2vec.Word2Vec(tokenized_corpus, vector_size=feature_size,
                              window=window_context, min_count=min_word_count)

To get one vector per sentence, add up the vectors of the words in the sentence position by position, then divide by the number of words.
def average_word_vectors(words, model, vocabulary, num_features):
    feature_vector = np.zeros((num_features,), dtype="float64")
    nwords = 0.
    for word in words:
        if word in vocabulary:
            nwords = nwords + 1.
            # Look up the word vector through model.wv (required in gensim 4.x)
            feature_vector = np.add(feature_vector, model.wv[word])
    if nwords:
        feature_vector = np.divide(feature_vector, nwords)
    return feature_vector

def averaged_word_vectorizer(corpus, model, num_features):
    # index_to_key replaces index2word in gensim 4.x
    vocabulary = set(model.wv.index_to_key)
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                for tokenized_sentence in corpus]
    return np.array(features)
Build the document vectors.
w2v_feature_array = averaged_word_vectorizer(corpus=tokenized_corpus, model=w2v_model,
num_features=feature_size)
pd.DataFrame(w2v_feature_array) #lstm
Each row is a document, and the columns are the components of its ten-dimensional vector.
Taking a plain average is somewhat crude. Later notes will cover improvements.
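The averaging step can be checked by hand on a tiny example. The 3-dimensional vectors below are made up for illustration and do not come from a trained model; `average_vector` is a plain Python stand-in for the numpy function above.

```python
# Hypothetical 3-dimensional vectors, made up for illustration only
toy_vectors = {
    "sky": [1.0, 0.0, 2.0],
    "blue": [3.0, 2.0, 0.0],
}

def average_vector(words, vectors, num_features):
    total = [0.0] * num_features
    n_known = 0
    for word in words:
        if word in vectors:  # skip words that have no vector
            n_known += 1
            total = [t + x for t, x in zip(total, vectors[word])]
    # Divide by the number of known words; stay at zeros if none were known
    return [t / n_known for t in total] if n_known else total

print(average_vector(["sky", "blue", "unknown"], toy_vectors, 3))
# [2.0, 1.0, 1.0]
print(average_vector(["blue", "sky"], toy_vectors, 3))
# [2.0, 1.0, 1.0]
```

Both calls give the same result, which shows one limitation of plain averaging: word order is lost, and every word counts equally.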