
Ml9 self study notes

2022-07-29 06:17:00 【19-year-old flower girl】

Text features

import pandas as pd
import numpy as np
import re
import nltk #pip install nltk

Basic preprocessing

corpus = ['The sky is blue and beautiful.',
          'Love this blue and beautiful sky!',
          'The quick brown fox jumps over the lazy dog.',
          'The brown fox is quick and the blue dog is lazy!',
          'The sky is very blue and the sky is very beautiful today',
          'The dog is lazy but the brown fox is quick!'    
]
labels = ['weather', 'weather', 'animals', 'animals', 'weather', 'animals']
corpus = np.array(corpus)
corpus_df = pd.DataFrame({'Document': corpus,
                          'Category': labels})
corpus_df = corpus_df[['Document', 'Category']]
corpus_df

The task is to classify each document by topic: animals or weather.
Running the statement below opens the NLTK downloader window. Select the stopwords package there and install it.

nltk.download()

Remove stop words, which do not help to identify the topic.

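# Load the tokenizer and the English stop words
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # Remove special characters (flags must be passed by keyword;
    # the fourth positional argument of re.sub is count)
    doc = re.sub(r'[^a-zA-Z0-9\s]', '', doc, flags=re.I)
    # Convert to lowercase and trim surrounding whitespace
    doc = doc.lower()
    doc = doc.strip()
    # Tokenize
    tokens = wpt.tokenize(doc)
    # Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # Rejoin the tokens into a document
    doc = ' '.join(filtered_tokens)
    return doc

# Apply normalize_document to every element of the corpus array
normalize_corpus = np.vectorize(normalize_document)

norm_corpus = normalize_corpus(corpus)
norm_corpus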

This is the processed result. Compare it with the original corpus.
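To see what the normalization does without downloading the NLTK data, here is a minimal standard-library sketch. The tiny `TOY_STOP_WORDS` set and the `simple_normalize` helper are hypothetical stand-ins for NLTK's English stop word list and `normalize_document`, so results on other inputs may differ from the real pipeline.

```python
import re

# Hypothetical, tiny stop word list; NLTK's English list is much longer.
TOY_STOP_WORDS = {"the", "is", "and", "this", "over", "but", "very"}

def simple_normalize(doc):
    # Keep only letters, digits and whitespace.
    doc = re.sub(r"[^a-zA-Z0-9\s]", "", doc)
    # Lowercase, then split on whitespace as a simple tokenizer.
    tokens = doc.lower().split()
    # Drop stop words and rejoin the rest into a single string.
    return " ".join(t for t in tokens if t not in TOY_STOP_WORDS)

print(simple_normalize("The sky is blue and beautiful."))  # sky blue beautiful
```

Only the words that carry the topic survive, which is the point of this step.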

Bag-of-words model

The bag-of-words model counts how often each word occurs in each document.

from sklearn.feature_extraction.text import CountVectorizer
print(norm_corpus)
# Instantiate the vectorizer
cv = CountVectorizer(min_df=0., max_df=1.)

cv.fit(norm_corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(cv.get_feature_names_out())
cv_matrix = cv.fit_transform(norm_corpus)
# Convert the sparse matrix to a dense array
cv_matrix = cv_matrix.toarray()
cv_matrix

The distinct words of these sentences form the vocabulary.
This is the encoded result: a word that occurs once in a document is encoded as 1, a word that occurs twice as 2, and a word that does not occur as 0.
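The counting itself is simple enough to reproduce by hand. The following is a minimal sketch on two made-up documents (not the corpus above), using only the standard library rather than `CountVectorizer`:

```python
from collections import Counter

# Two made-up, already normalized documents.
docs = ["sky blue beautiful", "blue sky sky"]

# Vocabulary: every distinct word, sorted alphabetically.
vocab = sorted({word for doc in docs for word in doc.split()})

# One row per document, one column per vocabulary word.
rows = []
for doc in docs:
    counts = Counter(doc.split())
    rows.append([counts[word] for word in vocab])

print(vocab)  # ['beautiful', 'blue', 'sky']
print(rows)   # [[1, 1, 1], [0, 1, 2]]
```

Each row has one entry per vocabulary word, so the matrix grows wider as the vocabulary grows.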

N-Grams Model

N-grams make up for the lack of context information in the bag-of-words model.
They take combinations of adjacent words into account; ngram_range=(2,2) means combinations of two words.
Usually only two-word combinations are used, because longer ones make the matrix larger and sparser.

bv = CountVectorizer(ngram_range=(2,2))
bv_matrix = bv.fit_transform(norm_corpus)
bv_matrix = bv_matrix.toarray()
vocab = bv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
pd.DataFrame(bv_matrix, columns=vocab)
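To make clear what a two-word combination is, here is a minimal sketch that builds the bigrams of a single document by hand. The `bigrams` helper is hypothetical and only illustrates the idea; it is not how `CountVectorizer` is implemented.

```python
def bigrams(doc):
    tokens = doc.split()
    # Pair each token with its right-hand neighbour.
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(bigrams("quick brown fox jumps"))
# ['quick brown', 'brown fox', 'fox jumps']
```

Each bigram then becomes one column of the matrix, which is why the matrix gets large and sparse quickly.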


TF-IDF Model

TF: term frequency. IDF: inverse document frequency.
If a word is rare across the corpus but occurs many times in the current document, its TF-IDF value is large, which means the word is important and discriminative for that document.

from sklearn.feature_extraction.text import TfidfVectorizer 
tv = TfidfVectorizer(min_df=0., max_df=1., use_idf=True)
tv_matrix = tv.fit_transform(norm_corpus)
tv_matrix = tv_matrix.toarray()

vocab = tv.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
pd.DataFrame(np.round(tv_matrix, 2), columns=vocab)
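The weighting can be sketched by hand on a made-up toy corpus. This sketch assumes the scheme scikit-learn documents for its defaults: raw counts as TF, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The helper `tfidf_row` is hypothetical.

```python
import math

# Made-up tokenized documents; "sky" occurs in every one of them.
docs = [["sky", "blue"], ["sky", "fox"], ["sky", "sky", "dog"]]
n_docs = len(docs)
vocab = sorted({w for d in docs for w in d})

# Document frequency: the number of documents that contain each word.
df = {w: sum(w in d for d in docs) for w in vocab}
# Smoothed IDF (assumed to match the TfidfVectorizer defaults).
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(doc):
    # Raw count times IDF, then scale the row to unit L2 norm.
    raw = [doc.count(w) * idf[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm for x in raw]

for w in vocab:
    print(w, round(idf[w], 3))
print(dict(zip(vocab, tfidf_row(docs[0]))))
```

Because "sky" occurs in every document, it gets the lowest IDF, so in the first document "blue" ends up with a larger weight than "sky".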


Similarity features

The pairwise similarity between documents can also be used as a feature.

from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(tv_matrix)
similarity_df = pd.DataFrame(similarity_matrix)
similarity_df
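Cosine similarity is the dot product of two vectors divided by the product of their lengths. The following is a minimal sketch with made-up vectors; the `cosine` helper is hypothetical and uses only the standard library.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the two vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1, 1, 0], [2, 2, 0]))  # same direction: close to 1
print(cosine([1, 0, 0], [0, 1, 0]))  # no shared terms: 0.0
```

Documents that share no terms score 0, and documents with the same term proportions score about 1 regardless of their length.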


Clustering features

In practice this does not work very well.

from sklearn.cluster import KMeans

km = KMeans(n_clusters=2)
km.fit_transform(similarity_df)
cluster_labels = km.labels_
cluster_labels = pd.DataFrame(cluster_labels, columns=['ClusterLabel'])
pd.concat([corpus_df, cluster_labels], axis=1)

Topic model

Less commonly used

from sklearn.decomposition import LatentDirichletAllocation

# n_topics was renamed to n_components in newer scikit-learn versions
lda = LatentDirichletAllocation(n_components=2, max_iter=100, random_state=42)
dt_matrix = lda.fit_transform(tv_matrix)
features = pd.DataFrame(dt_matrix, columns=['T1', 'T2'])
features

Word embedding model word2vec

Word embeddings address the problem mentioned earlier: the previous models ignore the relationship between a word and its context. Words that appear in similar contexts, such as apple and banana or keyboard and mouse, should be close to each other in the vector space.

from gensim.models import word2vec

wpt = nltk.WordPunctTokenizer()
tokenized_corpus = [wpt.tokenize(document) for document in norm_corpus]

# Set the model parameters
feature_size = 10    # Word vector dimension
window_context = 10  # Context window size
min_word_count = 1   # Minimum word frequency

# gensim >= 4.0 names this parameter vector_size (it was size in 3.x)
w2v_model = word2vec.Word2Vec(tokenized_corpus, vector_size=feature_size,
                          window=window_context, min_count=min_word_count)

For a sentence: add up the vectors of its words element by element, then divide by the number of words to get the vector of the sentence.

def average_word_vectors(words, model, vocabulary, num_features):
    # Accumulate the vectors of all in-vocabulary words, then average them
    feature_vector = np.zeros((num_features,), dtype="float64")
    nwords = 0.

    for word in words:
        if word in vocabulary:
            nwords = nwords + 1.
            # Look vectors up through model.wv (model[word] was removed in gensim 4.0)
            feature_vector = np.add(feature_vector, model.wv[word])

    if nwords:
        feature_vector = np.divide(feature_vector, nwords)

    return feature_vector


def averaged_word_vectorizer(corpus, model, num_features):
    # index_to_key is the gensim >= 4.0 name for index2word
    vocabulary = set(model.wv.index_to_key)
    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)

Build the document vectors.

w2v_feature_array = averaged_word_vectorizer(corpus=tokenized_corpus, model=w2v_model,
                                             num_features=feature_size)
pd.DataFrame(w2v_feature_array) #lstm

Each row is a document, and the columns are the components of its ten-dimensional vector.
In fact, taking the average is somewhat problematic. Improvements will come later.
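The notes do not say which problem is meant. One well-known limitation of averaging is that it discards word order, which the following minimal sketch illustrates. The two-dimensional `toy_vectors` are made up for illustration and the `average_vector` helper is hypothetical; neither comes from the trained model above.

```python
# Hypothetical 2-dimensional embeddings, made up for illustration.
toy_vectors = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}

def average_vector(tokens, vectors):
    # Collect the vectors of known words and average them per dimension.
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * len(next(iter(vectors.values())))
    return [sum(column) / len(known) for column in zip(*known)]

a = average_vector(["dog", "bites", "man"], toy_vectors)
b = average_vector(["man", "bites", "dog"], toy_vectors)
print(a == b)  # True: the average cannot tell the two sentences apart
```

Two sentences with opposite meanings get the same vector, so any model built on the averaged features cannot distinguish them.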