NLP - fastText
2022-06-11 22:20:00 【Yizhi code】

About fastText
FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers.
It works on standard, generic hardware.
Models can later be reduced in size to even fit on mobile devices.
- Official website: https://fasttext.cc
- GitHub: https://github.com/facebookresearch/fastText/
- Pre-trained models: https://fasttext.cc/docs/en/english-vectors.html
- apachecn fastText Chinese documentation (recommended): http://fasttext.apachecn.org/#/
Install & compile
Compile with make (recommended)
Requires C++11 or later.
This will generate the fasttext binary.
$ wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
$ unzip v0.9.2.zip
$ cd fastText-0.9.2
$ make
Compile with cmake
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ mkdir build && cd build && cmake ..
$ make && make install
Verify the installation:
$ fasttext -h
usage: fasttext <command> <args>
The commands supported by fasttext are:
supervised train a supervised classifier
quantize quantize a model to reduce the memory usage
test evaluate a supervised classifier
test-label print labels with precision and recall scores
predict predict most likely labels
predict-prob predict most likely labels with probabilities
skipgram train a skipgram model
cbow train a cbow model
print-word-vectors print word vectors given a trained model
print-sentence-vectors print sentence vectors given a trained model
print-ngrams print ngrams given a trained model and word
nn query for nearest neighbors
analogies query for analogies
dump dump arguments,dictionary,input/output vectors
Install with pip (Python bindings)
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip3 install .
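If the install succeeded, the Python bindings should be importable. A minimal sanity check (nothing is trained here; it only confirms the module and its two main entry points resolve):
import fasttext  # bindings installed by `pip3 install .`
print(fasttext.train_unsupervised)  # word-representation entry point
print(fasttext.train_supervised)    # text-classification entry point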
Usage examples
This library is mainly used for two tasks:
1. Word representation learning
2. Text classification
These are described in the following two papers:
[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
1. Word representation learning
$ ./fasttext skipgram -input data.txt -output model
data.txt is the training file; its content should be UTF-8 encoded text.
Training saves two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one word per line. model.bin is a binary file that stores the model parameters, the dictionary, and all of the hyperparameters.
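If you installed the Python bindings above, the same training run can be written in Python. A minimal sketch, assuming the same UTF-8 data.txt; the query word at the end is illustrative:
import fasttext

# Mirrors `./fasttext skipgram -input data.txt -output model`
model = fasttext.train_unsupervised('data.txt', model='skipgram')
model.save_model('model.bin')            # the Python API writes only the .bin file
print(model.get_word_vector('example'))  # 'example' is an illustrative query word
Passing model='cbow' trains a CBOW model instead.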
2. Word vectors for out-of-vocabulary words
The previously trained model can be used to compute word vectors for words outside its vocabulary. queries.txt contains the words whose vectors you want to compute.
The word vectors are printed to standard output, one word and its vector per line.
$ ./fasttext print-word-vectors model.bin < queries.txt
You can also call it through a pipe:
$ cat queries.txt | ./fasttext print-word-vectors model.bin
You can use the provided example script: it compiles the code, downloads the data, computes the word vectors, and evaluates them on the rare-word similarity dataset RW:
$ ./word-vector-example.sh
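Because the vectors are built from character n-grams, the same out-of-vocabulary lookup is available through the Python bindings. A sketch; the misspelled query word is a hypothetical example:
import fasttext

model = fasttext.load_model('model.bin')
# The vector is assembled from the word's character n-grams, so this
# works even for a word that never appeared in data.txt.
vec = model.get_word_vector('enviroment')  # hypothetical out-of-vocabulary query
print(vec.shape)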
3. Text classification
This library can also be used to train supervised text classifiers, for example for sentiment analysis.
To train a text classifier using the method described in paper [2], use the following command:
$ ./fasttext supervised -input train.txt -output model
train.txt is a text file containing one training sentence per line, together with its labels; labels use __label__ as a prefix.
This will output two files: model.bin and model.vec.
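The equivalent training call with the Python bindings, as a sketch; the label name in the comment is illustrative and assumes train.txt follows the __label__ format described above:
import fasttext

# Each line of train.txt looks like: __label__positive some training sentence
model = fasttext.train_supervised('train.txt')
model.save_model('model.bin')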
Once the model is trained, you can evaluate it by computing precision and recall at k (P@k and R@k) on a test set, using the following command:
$ ./fasttext test model.bin test.txt k
The argument k is optional and defaults to 1.
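With the Python bindings, evaluation is a single call (a sketch; model.test returns the example count together with P@k and R@k):
import fasttext

model = fasttext.load_model('model.bin')
n, p_at_k, r_at_k = model.test('test.txt', k=1)  # (N, precision@k, recall@k)
print(n, p_at_k, r_at_k)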
To get the k most likely labels for a piece of text, use:
$ ./fasttext predict model.bin test.txt k
Or use predict-prob to also get the probability of each label:
$ ./fasttext predict-prob model.bin test.txt k
test.txt contains one piece of text per line. This prints the k most likely labels for each line to standard output. k is optional and defaults to 1.
See also the classification-example.sh script, which downloads all of the datasets and reproduces the results.
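The Python counterpart of predict and predict-prob is model.predict, which returns labels and probabilities together (a sketch; the input sentence is a placeholder and must not contain a newline):
import fasttext

model = fasttext.load_model('model.bin')
labels, probs = model.predict('this is a placeholder sentence', k=3)
print(labels, probs)  # top-k labels with their probabilities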
If you want to compute the vector representation of a sentence or paragraph, you can use:
$ ./fasttext print-sentence-vectors model.bin < text.txt
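The Python bindings expose the same operation as get_sentence_vector (a sketch; the input text is a placeholder):
import fasttext

model = fasttext.load_model('model.bin')
vec = model.get_sentence_vector('a placeholder sentence')
print(vec.shape)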
You can also quantize a supervised model with the following command, to reduce its memory usage:
$ ./fasttext quantize -output model
This will create a model.ftz file with a much smaller memory footprint.
All of the standard functionality (such as test or predict) works the same way on a quantized model:
$ ./fasttext test model.ftz test.txt
See quantization-example.sh for an example.
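Quantization is also available from Python (a sketch; retrain=True assumes the original train.txt is still available for fine-tuning during compression):
import fasttext

model = fasttext.train_supervised('train.txt')
model.quantize(input='train.txt', retrain=True)  # compress the model in place
model.save_model('model.ftz')                    # quantized models use the .ftz extension
print(model.test('test.txt'))                    # testing works the same way afterwards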
Full help output
$ ./fasttext supervised
Empty input or output path.
The following arguments are mandatory:
-input training file path
-output output file path
The following arguments are optional:
-verbose verbosity level [2]
The following arguments for the dictionary are optional:
-minCount minimal number of word occurrences [1]
-minCountLabel minimal number of label occurrences [0]
-wordNgrams max length of word ngram [1]
-bucket number of buckets [2000000]
-minn min length of char ngram [0]
-maxn max length of char ngram [0]
-t sampling threshold [0.0001]
-label labels prefix [__label__]
The following arguments for training are optional:
-lr learning rate [0.1]
-lrUpdateRate change the rate of updates for the learning rate [100]
-dim size of word vectors [100]
-ws size of the context window [5]
-epoch number of epochs [5]
-neg number of negatives sampled [5]
-loss loss function {ns, hs, softmax} [softmax]
-thread number of threads [12]
-pretrainedVectors pretrained word vectors for supervised learning []
-saveOutput whether output params should be saved [0]
The following arguments for quantization are optional:
-cutoff number of words and ngrams to retain [0]
-retrain finetune embeddings if a cutoff is applied [0]
-qnorm quantizing the norm separately [0]
-qout quantizing the classifier [0]
-dsub size of each sub-vector [2]
Defaults may vary by mode. (Word-representation modes skipgram and cbow use a default -minCount of 5.)
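In the Python bindings these options become keyword arguments with the same names and defaults. A sketch with a few commonly tuned ones; the values shown are illustrative, not recommendations:
import fasttext

model = fasttext.train_supervised(
    'train.txt',
    lr=0.1,          # -lr: learning rate
    epoch=5,         # -epoch: number of passes over the data
    wordNgrams=2,    # -wordNgrams: also use bigrams
    dim=100,         # -dim: size of the vectors
    loss='softmax',  # -loss: one of ns, hs, softmax
)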