NLP - fastText
2022-06-11 22:20:00 【Yizhi code】

About fastText
FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers.
It works on standard, generic hardware.
Models can later be reduced in size to even fit on mobile devices.
- Official website: https://fasttext.cc
- GitHub: https://github.com/facebookresearch/fastText/
- Pre-trained models: https://fasttext.cc/docs/en/english-vectors.html
- apachecn FastText Chinese documentation (recommended): http://fasttext.apachecn.org/#/
Install & compile
Compiling with make (recommended)
Requires a compiler with C++11 support or later.
This produces the fasttext binary.
$ wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
$ unzip v0.9.2.zip
$ cd fastText-0.9.2
$ make
Compiling with cmake
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ mkdir build && cd build && cmake ..
$ make && make install
Verifying the installation
$ fasttext -h
usage: fasttext <command> <args>
The commands supported by fasttext are:
supervised train a supervised classifier
quantize quantize a model to reduce the memory usage
test evaluate a supervised classifier
test-label print labels with precision and recall scores
predict predict most likely labels
predict-prob predict most likely labels with probabilities
skipgram train a skipgram model
cbow train a cbow model
print-word-vectors print word vectors given a trained model
print-sentence-vectors print sentence vectors given a trained model
print-ngrams print ngrams given a trained model and word
nn query for nearest neighbors
analogies query for analogies
dump dump arguments,dictionary,input/output vectors
Installing with pip
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip3 install .
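If the installation succeeded, the Python bindings should import cleanly. A minimal sanity check:

import fasttext

# The import alone should succeed without errors; printing one of the
# training functions confirms the bindings are usable.
print(fasttext.train_supervised)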
Usage examples
This library is mainly used for two tasks:
1. Word representation learning
2. Text classification
These are described in the following two papers:
[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
1. Word representation learning
$ ./fasttext skipgram -input data.txt -output model
data.txt is the training file; its content must be UTF-8 encoded text.
Training saves two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one word per line. model.bin is a binary file that stores the model parameters, the dictionary, and all hyperparameters.
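If you installed the Python bindings, the same training can be driven from Python. A minimal sketch, assuming data.txt exists as described above (the query word "king" is just a placeholder):

import fasttext

# Train a skipgram model on a UTF-8 encoded text corpus.
model = fasttext.train_unsupervised("data.txt", model="skipgram")

# Persist the binary model (the Python-side counterpart of model.bin).
model.save_model("model.bin")

# Inspect the learned vector for a word from the corpus.
print(model.get_word_vector("king"))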
2. Getting word vectors for out-of-vocabulary words
A previously trained model can be used to compute word vectors for words outside its vocabulary. The file queries.txt contains the words you want vectors for.
The following command prints the word vectors to standard output, one word per line:
$ ./fasttext print-word-vectors model.bin < queries.txt
You can also pipe the input:
$ cat queries.txt | ./fasttext print-word-vectors model.bin
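The Python bindings expose the same capability: get_word_vector returns a vector for any string, including out-of-vocabulary words, because fastText composes it from character n-grams. A sketch, assuming model.bin was trained as in the previous step (the misspelled query is a deliberate placeholder):

import fasttext

model = fasttext.load_model("model.bin")

# Works even for words absent from the training corpus, as long as the
# model was trained with character n-grams (minn/maxn greater than 0).
print(model.get_word_vector("enviroment"))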
As an example, the provided script compiles the code, downloads the data, computes word vectors, and evaluates them on the rare-word similarity dataset RW:
$ ./word-vector-example.sh
3. Text classification
The library can also be used to train supervised text classifiers, for example for sentiment analysis.
To train a text classifier using the method of paper [2], run:
$ ./fasttext supervised -input train.txt -output model
train.txt contains one sentence per line together with its labels; each label uses __label__ as a prefix. This outputs two files: model.bin and model.vec.
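The same training is available from Python. A sketch assuming train.txt is formatted as described above (the example lines in the comment are illustrative, not from a real dataset):

import fasttext

# Each line of train.txt pairs one or more labels with a sentence, e.g.:
#   __label__positive I really enjoyed this movie
#   __label__negative the plot was dull and predictable
model = fasttext.train_supervised(input="train.txt")
model.save_model("model.bin")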
Once the model is trained, you can evaluate it on a test set by computing precision and recall at k (P@k and R@k) with the following command:
$ ./fasttext test model.bin test.txt k
k is optional and defaults to 1.
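From Python, the equivalent is model.test, which returns the number of examples along with P@k and R@k. A sketch assuming the files above:

import fasttext

model = fasttext.load_model("model.bin")
n, p_at_k, r_at_k = model.test("test.txt", k=1)
print(f"N={n}  P@1={p_at_k:.3f}  R@1={r_at_k:.3f}")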
To get the k most likely labels for a piece of text, use:
$ ./fasttext predict model.bin test.txt k
Or use predict-prob to also get the probability of each label:
$ ./fasttext predict-prob model.bin test.txt k
test.txt contains one paragraph of text per line. This prints the k most likely labels for each line to standard output. k is optional and defaults to 1. See also the classification-example.sh script, which downloads all the datasets and reproduces the results.
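In Python, a single call covers both predict and predict-prob: model.predict returns the top-k labels together with their probabilities. A sketch (the input sentence is a placeholder):

import fasttext

model = fasttext.load_model("model.bin")

# Returns a tuple of labels and a parallel array of probabilities.
# Note: the input string must not contain a newline character.
labels, probs = model.predict("I really enjoyed this movie", k=3)
for label, prob in zip(labels, probs):
    print(label, round(float(prob), 4))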
If you want to compute the vector representation of a sentence or paragraph, use:
$ ./fasttext print-sentence-vectors model.bin < text.txt
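The Python counterpart is get_sentence_vector, which produces a single fixed-size vector for the whole input. A sketch:

import fasttext

model = fasttext.load_model("model.bin")
vec = model.get_sentence_vector("fastText is a library for text classification")
print(vec.shape)  # shape is (dim,), i.e. 100 with the default settings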
You can also quantize a supervised model with the following command to reduce its memory usage:
$ ./fasttext quantize -output model
This creates a model.ftz file with a much smaller memory footprint.
All standard functionality, such as testing or prediction, works the same way on quantized models:
$ ./fasttext test model.ftz test.txt
See quantization-example.sh for a usage example.
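Quantization is also exposed on the Python model object; the cutoff and retrain arguments mirror the CLI flags listed in the next section. A sketch assuming the supervised model and training file from above:

import fasttext

model = fasttext.load_model("model.bin")

# Quantize in place; retraining after a cutoff usually recovers some accuracy.
model.quantize(input="train.txt", cutoff=100000, retrain=True)
model.save_model("model.ftz")

# Evaluation works the same way on the quantized model.
print(model.test("test.txt"))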
Full list of options
$ ./fasttext supervised
Empty input or output path.
The following arguments are mandatory:
-input training file path
-output output file path
The following arguments are optional:
-verbose verbosity level [2]
The following arguments for the dictionary are optional:
-minCount minimal number of word occurrences [1]
-minCountLabel minimal number of label occurrences [0]
-wordNgrams max length of word ngram [1]
-bucket number of buckets [2000000]
-minn min length of char ngram [0]
-maxn max length of char ngram [0]
-t sampling threshold [0.0001]
-label labels prefix [__label__]
The following arguments for training are optional:
-lr learning rate [0.1]
-lrUpdateRate change the rate of updates for the learning rate [100]
-dim size of word vectors [100]
-ws size of the context window [5]
-epoch number of epochs [5]
-neg number of negatives sampled [5]
-loss loss function {ns, hs, softmax} [softmax]
-thread number of threads [12]
-pretrainedVectors pretrained word vectors for supervised learning []
-saveOutput whether output params should be saved [0]
The following arguments for quantization are optional:
-cutoff number of words and ngrams to retain [0]
-retrain finetune embeddings if a cutoff is applied [0]
-qnorm quantizing the norm separately [0]
-qout quantizing the classifier [0]
-dsub size of each sub-vector [2]
Defaults may vary by mode. (Word-representation modes skipgram and cbow use a default -minCount of 5.)
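Most of these flags map one-to-one onto keyword arguments of the Python training functions. A sketch showing a few common ones (the values are illustrative, not recommendations):

import fasttext

model = fasttext.train_supervised(
    input="train.txt",
    lr=0.5,          # -lr
    epoch=25,        # -epoch
    wordNgrams=2,    # -wordNgrams
    dim=100,         # -dim
    loss="softmax",  # -loss
)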
2022-06-10 (Friday)