NLP - fastText
2022-06-11 22:09:00 【伊织code】

About fastText
FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers.
It works on standard, generic hardware.
Models can later be reduced in size to even fit on mobile devices.
- Website: https://fasttext.cc
- GitHub: https://github.com/facebookresearch/fastText/
- Pre-trained models: https://fasttext.cc/docs/en/english-vectors.html
- apachecn FastText documentation in Chinese (recommended): http://fasttext.apachecn.org/#/
Installation & Compilation
Compiling with make (recommended)
Requires C++11 or later.
This produces the fasttext binary.
$ wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
$ unzip v0.9.2.zip
$ cd fastText-0.9.2
$ make
Compiling with cmake
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ mkdir build && cd build && cmake ..
$ make && make install
Verifying the installation
$ fasttext -h
usage: fasttext <command> <args>
The commands supported by fasttext are:
supervised train a supervised classifier
quantize quantize a model to reduce the memory usage
test evaluate a supervised classifier
test-label print labels with precision and recall scores
predict predict most likely labels
predict-prob predict most likely labels with probabilities
skipgram train a skipgram model
cbow train a cbow model
print-word-vectors print word vectors given a trained model
print-sentence-vectors print sentence vectors given a trained model
print-ngrams print ngrams given a trained model and word
nn query for nearest neighbors
analogies query for analogies
dump dump arguments, dictionary, input/output vectors
Installing with pip
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip3 install .
Usage Examples
The library has two main use cases:
1. Learning word representations
2. Text classification
These are described in the following two papers:
[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
1. Learning word representations
$ ./fasttext skipgram -input data.txt -output model
data.txt is the training file; its content should be UTF-8 encoded text.
Training saves two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one vector per line. model.bin is a binary file that stores the model parameters, the dictionary, and all hyperparameters.
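The subword idea behind these vectors (paper [1]) can be sketched in a few lines: each word is decomposed into character n-grams, with < and > added as boundary markers; the -minn and -maxn options control the n-gram lengths. A minimal illustration, not the library's actual code:

```python
def char_ngrams(word, minn=3, maxn=6):
    """Extract character n-grams the way the subword model does:
    the word is first wrapped in the boundary markers < and >."""
    w = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams

print(char_ngrams("where", minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Note how the boundary markers make the trigram "her" inside "where" distinct from the whole word "her", which would appear as "<her>".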
2. Getting word vectors for out-of-vocabulary words
A previously trained model can be used to compute word vectors for words outside its vocabulary. Put the words you need vectors for in a file such as queries.txt.
The following command prints the word vectors to standard output, one vector per line.
$ ./fasttext print-word-vectors model.bin < queries.txt
Or call it in a pipeline:
$ cat queries.txt | ./fasttext print-word-vectors model.bin
The provided example script compiles the code, downloads data, computes word vectors, and evaluates them on the Rare Words (RW) similarity dataset:
$ ./word-vector-example.sh
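Out-of-vocabulary words get vectors because a word's vector is composed from the vectors of its character n-grams, and an unseen word still decomposes into n-grams the model has seen. A toy sketch with a made-up two-dimensional n-gram table (in the real model these vectors come from model.bin):

```python
# Hypothetical n-gram embedding table; each n-gram maps to a learned vector.
ngram_vectors = {
    "<he": [1.0, 0.0],
    "hel": [0.0, 1.0],
    "llo": [1.0, 1.0],
}

def oov_vector(ngrams, table, dim=2):
    """Sum the vectors of the n-grams the word decomposes into."""
    v = [0.0] * dim
    for g in ngrams:
        if g in table:
            for i, x in enumerate(table[g]):
                v[i] += x
    return v

print(oov_vector(["<he", "hel", "llo"], ngram_vectors))  # [2.0, 2.0]
```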
3. Text classification
The library can also train supervised text classifiers, for example for sentiment analysis.
To train a text classifier using the method described in paper [2], run:
$ ./fasttext supervised -input train.txt -output model
train.txt contains one sentence per line together with its labels; labels carry the __label__ prefix by default. This outputs two files: model.bin and model.vec.
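A minimal sketch of producing a file in this format (the labels and sentences here are made up):

```python
# One example per line; each label is prefixed with __label__.
samples = [
    ("positive", "great movie , loved it"),
    ("negative", "terrible plot and bad acting"),
]
lines = [f"__label__{label} {text}" for label, text in samples]
with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")
print(lines[0])  # __label__positive great movie , loved it
```

Multi-label examples simply carry several __label__ tokens at the start of the line.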
Once the model is trained, you can evaluate it on a test set by computing precision and recall at k (P@k and R@k) with the following command:
$ ./fasttext test model.bin test.txt k
k is optional and defaults to 1.
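What the test command reports can be sketched like this for a single example, assuming predicted is a ranked list of labels and gold is the set of true labels (the tool itself averages these numbers over the whole test set):

```python
def precision_recall_at_k(predicted, gold, k=1):
    """P@k: fraction of the top-k predicted labels that are correct.
    R@k: fraction of the gold labels recovered within the top k."""
    topk = predicted[:k]
    hits = sum(1 for label in topk if label in gold)
    return hits / k, hits / len(gold)

p, r = precision_recall_at_k(["sports", "news"], {"sports", "politics"}, k=2)
print(p, r)  # 0.5 0.5
```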
To get the k most likely labels for a piece of text, use:
$ ./fasttext predict model.bin test.txt k
Or use predict-prob to also get the probability of each label:
$ ./fasttext predict-prob model.bin test.txt k
test.txt contains one passage per line. For each line, the k most likely labels are printed to standard output; k is optional and defaults to 1.
See the classification-example.sh script for a full example; it downloads the datasets and reproduces the results.
To compute vector representations of sentences or paragraphs, use:
$ ./fasttext print-sentence-vectors model.bin < text.txt
You can also quantize a supervised model to reduce its memory usage:
$ ./fasttext quantize -output model
This creates a .ftz file with a smaller memory footprint.
All standard functionality, such as test or predict, works the same on quantized models:
$ ./fasttext test model.ftz test.txt
See quantization-example.sh for an example.
Full documentation
$ ./fasttext supervised
Empty input or output path.
The following arguments are mandatory:
-input training file path
-output output file path
The following arguments are optional:
-verbose verbosity level [2]
The following arguments for the dictionary are optional:
-minCount minimal number of word occurrences [1]
-minCountLabel minimal number of label occurrences [0]
-wordNgrams max length of word ngram [1]
-bucket number of buckets [2000000]
-minn min length of char ngram [0]
-maxn max length of char ngram [0]
-t sampling threshold [0.0001]
-label labels prefix [__label__]
The following arguments for training are optional:
-lr learning rate [0.1]
-lrUpdateRate change the rate of updates for the learning rate [100]
-dim size of word vectors [100]
-ws size of the context window [5]
-epoch number of epochs [5]
-neg number of negatives sampled [5]
-loss loss function {ns, hs, softmax} [softmax]
-thread number of threads [12]
-pretrainedVectors pretrained word vectors for supervised learning []
-saveOutput whether output params should be saved [0]
The following arguments for quantization are optional:
-cutoff number of words and ngrams to retain [0]
-retrain finetune embeddings if a cutoff is applied [0]
-qnorm quantizing the norm separately [0]
-qout quantizing the classifier [0]
-dsub size of each sub-vector [2]
Defaults may vary by mode. (Word-representation modes skipgram and cbow use a default -minCount of 5.)
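To illustrate the -bucket option: word and character n-grams are not stored in the dictionary; they are hashed into a fixed number of buckets, which keeps memory bounded at the cost of occasional collisions. A sketch using the 32-bit FNV-1a hash (the library uses a similar hash internally; treat this as an approximation, not the exact implementation):

```python
def fnv1a(s):
    """32-bit FNV-1a hash of a UTF-8 string."""
    h = 2166136261
    for b in s.encode("utf-8"):
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF
    return h

bucket = 2000000  # the -bucket default shown above
# Every n-gram, seen or unseen, lands in one of `bucket` embedding rows.
print(fnv1a("new york") % bucket)
```

Raising -bucket reduces collisions between n-grams but increases the model size proportionally.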