Fasttext learning - text classification
2022-07-29 06:11:00 【Quinn-ntmy】
Earlier text representations were mainly One-hot, Bag of Words, N-gram, and TF-IDF word vectors, but they have shortcomings:
- The resulting vectors are high-dimensional, so training takes a long time;
- They only count word occurrences and do not consider the relationships between words.
Compared with TF-IDF, FastText improves on two points:
1. FastText stacks (averages) the word embeddings into a document vector, so similar sentences fall into the same category;
2. The embedding space FastText learns is relatively low-dimensional, so it can be trained quickly.
Applying deep learning to text representation gives typical examples such as fastText, Word2Vec, and BERT.
This article mainly introduces fastText:
- An Embedding layer maps words into a dense space; the word and n-gram vectors of the whole document are then averaged to obtain a document vector, on which a softmax multi-class classification is performed. This greatly reduces model training time.
- Two main tricks are involved: character-level n-gram features and hierarchical softmax classification (a small sketch of character n-gram extraction follows this list).
- It is a three-layer neural network: an input layer, a hidden layer, and an output layer.
The input is the vectorized words, with character-level n-grams attached as extra features; the output is a specific target (the class label of the text); the hidden layer is the average of the stacked word vectors.
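To make the character-level n-gram trick concrete, here is a minimal sketch (not fastText's internal implementation) of extracting the 3-grams of a word after adding the boundary markers < and >, the convention described in the fastText papers:

def char_ngrams(word, n=3):
    """Extract character n-grams of a word, with < and > as boundary markers."""
    padded = '<' + word + '>'
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams('where'))
# ['<wh', 'whe', 'her', 'ere', 're>']

These subword features let the model share information between words with common character sequences and handle out-of-vocabulary words.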
First, let's look at the fastText network structure:
# Implement the fastText network structure with Keras
from __future__ import unicode_literals
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import GlobalAveragePooling1D
from tensorflow.keras.layers import Dense

Vocab_size = 2000
Embedding_dim = 100
Max_words = 500
Class_num = 5

def build_fastText():
    model = Sequential()  # container for the layers
    # Embedding layer: map each word index to an Embedding_dim-dimensional vector
    model.add(Embedding(Vocab_size, Embedding_dim, input_length=Max_words))
    # GlobalAveragePooling1D: average the embeddings of all words in the document
    model.add(GlobalAveragePooling1D())
    # Output layer with softmax classification (the real fastText uses hierarchical
    # softmax here) to obtain the class probability distribution
    model.add(Dense(Class_num, activation='softmax'))
    # Define the loss function, optimizer, and classification metric
    model.compile(loss='categorical_crossentropy', optimizer='SGD', metrics=['accuracy'])
    return model

if __name__ == '__main__':
    model = build_fastText()
    print(model.summary())
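As a quick check, the model above can be fitted on random dummy data. This is only a sketch with made-up shapes and labels, not part of the article's actual pipeline:

import numpy as np
from tensorflow.keras.utils import to_categorical

x_dummy = np.random.randint(0, Vocab_size, size=(32, Max_words))  # 32 fake documents of word indices
y_dummy = to_categorical(np.random.randint(0, Class_num, size=32), num_classes=Class_num)
model = build_fastText()
model.fit(x_dummy, y_dummy, epochs=1, batch_size=8)  # verifies that input/output shapes line up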
fastText text classification flow chart:
[Note]: the key knowledge points are mostly explained in the code comments.
- Data loading
import pandas as pd
from sklearn.metrics import f1_score
train_df = pd.read_csv('../data/train_set.csv', sep='\t', nrows=15000)
- Data processing
Convert the data into the format fastText requires:
train_df['label_ft'] = '__label__' + train_df['label'].astype(str)
# __label__ is the category prefix; the category itself (as a string) follows it
train_df[['text', 'label_ft']].iloc[:-5000].to_csv('train.csv', index=None, header=None, sep='\t')
# iloc provides integer-position based indexing; here all but the last 5000 rows are kept for training
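Each line of train.csv is then the raw text, a tab, and the prefixed label. A minimal sketch of inspecting one converted row (the values shown in the comment are made up):

row = train_df[['text', 'label_ft']].iloc[0]
print(row['text'][:30], row['label_ft'], sep='\t')
# e.g. "2967 6758 339 2021 ..."    __label__2   (hypothetical values)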
- Training the model
import fasttext
# fasttext.train_unsupervised() trains word vectors without supervision;
# fasttext.train_supervised() trains a supervised classifier and returns a model object
model = fasttext.train_supervised('train.csv', lr=1.0, wordNgrams=2,
                                  verbose=2, minCount=1, epoch=25, loss='hs')
# 'hs' means hierarchical softmax: when there are many categories, it speeds up the
# softmax layer by building a Huffman coding tree, the same trick used in word2vec
# minCount: word-frequency threshold; words below this value are filtered out at initialization
# verbose=0: no log output (no progress bar, loss, or accuracy); verbose=1: log with a progress bar;
# verbose=2: one log line per epoch (no progress bar)
fasttext.train_supervised() parameter description:
- input: training file path (required)
- lr: learning rate (default 0.1)
- label: category prefix (default __label__)
- lrUpdateRate: learning-rate update rate (default 100)
- dim: word vector dimension (default 100)
- ws: context window size (default 5)
- epoch: number of epochs (default 5)
- minCount: minimum word frequency (default 5)
- minCountLabel: label-frequency threshold; categories below this value are filtered out at initialization
- wordNgrams: word n-gram setting (default 1)
- loss: loss function {ns, hs, softmax} (default softmax)
- minn: minimum character n-gram length (default 0)
- maxn: maximum character n-gram length (default 0)
- thread: number of threads (default 12)
- t: sampling threshold (default 0.0001)
- silent: disable logging from the C++ extension (default 1)
- encoding: encoding of the input file (default utf-8)
- pretrainedVectors: path to a pretrained word-vector file; words that appear in it are not randomly initialized (default None)
- Model prediction and evaluation
val_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[-5000:]['text']]
print(f1_score(train_df['label'].values[-5000:].astype(str), val_pred, average='macro'))  # the macro F1 score is about 0.8214
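For a single sample, fastText's model.predict returns a tuple of (labels, probabilities), which is why the code above takes [0][0] and then strips the __label__ prefix. A small sketch (the predicted label and probability in the comments are hypothetical):

labels, probs = model.predict(train_df.iloc[-1]['text'])
print(labels, probs)              # e.g. ('__label__2',) [0.97...]  (hypothetical values)
print(labels[0].split('__')[-1])  # '2', the bare class id fed to f1_score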