Fasttext learning - text classification
2022-07-29 06:11:00 【Quinn-ntmy】
Earlier text representations were mainly One-hot, Bag of Words, N-gram, and TF-IDF word vectors, but they have shortcomings:
- The resulting vectors are high-dimensional, so training takes a long time;
- They only count word occurrences and do not consider the relationships between words.
Compared with TF-IDF, FastText improves on two points:
1. FastText stacks (averages) the word embeddings into a document vector, so similar sentences fall into the same category;
2. The embedding space FastText learns is relatively low-dimensional, so it can be trained quickly.
Applying deep learning to text representation gives typical examples such as fastText, Word2Vec, and BERT.
This article mainly introduces fastText:
- An Embedding layer maps words into a dense space; the word and n-gram vectors of the whole document are then averaged to obtain a document vector, on which a softmax multi-class classification is performed. This greatly reduces model training time.
- Two main tricks are involved: character-level n-gram features and hierarchical softmax classification (a small sketch of character n-gram extraction follows this list).
- It is a three-layer neural network: an input layer, a hidden layer, and an output layer.
The input is the vectorized words, with character-level n-grams attached as extra features; the output is a specific target (the class label of the text); the hidden layer is the average of the stacked word vectors.
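To make the character-level n-gram trick concrete, here is a minimal sketch (not fastText's internal implementation) of extracting the 3-grams of a word after adding the boundary markers < and >, the convention described in the fastText papers:

def char_ngrams(word, n=3):
    """Extract character n-grams of a word, with < and > as boundary markers."""
    padded = '<' + word + '>'
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams('where'))
# ['<wh', 'whe', 'her', 'ere', 're>']

These subword features let the model share information between words with common character sequences and handle out-of-vocabulary words.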
First, let's look at the fastText network structure:
# Implement the fastText network structure with Keras
from __future__ import unicode_literals
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import GlobalAveragePooling1D
from tensorflow.keras.layers import Dense

Vocab_size = 2000
Embedding_dim = 100
Max_words = 500
Class_num = 5

def build_fastText():
    model = Sequential()  # container for the layers
    # Embedding layer: map each word index to an Embedding_dim-dimensional vector
    model.add(Embedding(Vocab_size, Embedding_dim, input_length=Max_words))
    # GlobalAveragePooling1D: average the embeddings of all words in the document
    model.add(GlobalAveragePooling1D())
    # Output layer with softmax classification (the real fastText uses hierarchical
    # softmax here) to obtain the class probability distribution
    model.add(Dense(Class_num, activation='softmax'))
    # Define the loss function, optimizer, and classification metric
    model.compile(loss='categorical_crossentropy', optimizer='SGD', metrics=['accuracy'])
    return model

if __name__ == '__main__':
    model = build_fastText()
    print(model.summary())
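As a quick check, the model above can be fitted on random dummy data. This is only a sketch with made-up shapes and labels, not part of the article's actual pipeline:

import numpy as np
from tensorflow.keras.utils import to_categorical

x_dummy = np.random.randint(0, Vocab_size, size=(32, Max_words))  # 32 fake documents of word indices
y_dummy = to_categorical(np.random.randint(0, Class_num, size=32), num_classes=Class_num)
model = build_fastText()
model.fit(x_dummy, y_dummy, epochs=1, batch_size=8)  # verifies that input/output shapes line up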
fastText text classification flow chart:
[Note]: the key knowledge points are mostly explained in the code comments.
- Data loading
import pandas as pd
from sklearn.metrics import f1_score
train_df = pd.read_csv('../data/train_set.csv', sep='\t', nrows=15000)
- Data processing
Convert the data into the format fastText requires:
train_df['label_ft'] = '__label__' + train_df['label'].astype(str)
# __label__ is the category prefix; the category itself (as a string) follows it
train_df[['text', 'label_ft']].iloc[:-5000].to_csv('train.csv', index=None, header=None, sep='\t')
# iloc provides integer-position based indexing; here all but the last 5000 rows are kept for training
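Each line of train.csv is then the raw text, a tab, and the prefixed label. A minimal sketch of inspecting one converted row (the values shown in the comment are made up):

row = train_df[['text', 'label_ft']].iloc[0]
print(row['text'][:30], row['label_ft'], sep='\t')
# e.g. "2967 6758 339 2021 ..."    __label__2   (hypothetical values)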
- Training the model
import fasttext
# fasttext.train_unsupervised() trains word vectors without supervision;
# fasttext.train_supervised() trains a supervised classifier and returns a model object
model = fasttext.train_supervised('train.csv', lr=1.0, wordNgrams=2,
                                  verbose=2, minCount=1, epoch=25, loss='hs')
# 'hs' means hierarchical softmax: when there are many categories, it speeds up the
# softmax layer by building a Huffman coding tree, the same trick used in word2vec
# minCount: word-frequency threshold; words below this value are filtered out at initialization
# verbose=0: no log output (no progress bar, loss, or accuracy); verbose=1: log with a progress bar;
# verbose=2: one log line per epoch (no progress bar)
fasttext.train_supervised() parameter description:
- input: training file path (required)
- lr: learning rate (default 0.1)
- label: category prefix (default __label__)
- lrUpdateRate: learning-rate update rate (default 100)
- dim: word vector dimension (default 100)
- ws: context window size (default 5)
- epoch: number of epochs (default 5)
- minCount: minimum word frequency (default 5)
- minCountLabel: label-frequency threshold; categories below this value are filtered out at initialization
- wordNgrams: word n-gram setting (default 1)
- loss: loss function {ns, hs, softmax} (default softmax)
- minn: minimum character n-gram length (default 0)
- maxn: maximum character n-gram length (default 0)
- thread: number of threads (default 12)
- t: sampling threshold (default 0.0001)
- silent: disable logging from the C++ extension (default 1)
- encoding: encoding of the input file (default utf-8)
- pretrainedVectors: path to a pretrained word-vector file; words that appear in it are not randomly initialized (default None)
- Model prediction and evaluation
val_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[-5000:]['text']]
print(f1_score(train_df['label'].values[-5000:].astype(str), val_pred, average='macro'))  # the macro F1 score is about 0.8214
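For a single sample, fastText's model.predict returns a tuple of (labels, probabilities), which is why the code above takes [0][0] and then strips the __label__ prefix. A small sketch (the predicted label and probability in the comments are hypothetical):

labels, probs = model.predict(train_df.iloc[-1]['text'])
print(labels, probs)              # e.g. ('__label__2',) [0.97...]  (hypothetical values)
print(labels[0].split('__')[-1])  # '2', the bare class id fed to f1_score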