NLP - fastText
2022-06-11 22:09:00 【伊织code】

About fastText
FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers.
It works on standard, generic hardware.
Models can later be reduced in size to even fit on mobile devices.
- Official site: https://fasttext.cc
- GitHub: https://github.com/facebookresearch/fastText/
- Pre-trained models: https://fasttext.cc/docs/en/english-vectors.html
- apachecn: fastText documentation in Chinese (recommended): http://fasttext.apachecn.org/#/
Installation & compilation
Building with make (recommended)
Requires a compiler with C++11 support or later.
This produces the fasttext binary.
$ wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
$ unzip v0.9.2.zip
$ cd fastText-0.9.2
$ make
Building with cmake
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ mkdir build && cd build && cmake ..
$ make && make install
Verifying the installation
$ fasttext -h
usage: fasttext <command> <args>
The commands supported by fasttext are:
supervised              train a supervised classifier
quantize                quantize a model to reduce the memory usage
test                    evaluate a supervised classifier
test-label              print labels with precision and recall scores
predict                 predict most likely labels
predict-prob            predict most likely labels with probabilities
skipgram                train a skipgram model
cbow                    train a cbow model
print-word-vectors      print word vectors given a trained model
print-sentence-vectors  print sentence vectors given a trained model
print-ngrams            print ngrams given a trained model and word
nn                      query for nearest neighbors
analogies               query for analogies
dump                    dump arguments,dictionary,input/output vectors
Installing with pip
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip3 install .
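Usage examples
This library has two main use cases:
1. Word representation learning
2. Text classification
Both are described in the following two papers: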
[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
1. Word representation learning
$ ./fasttext skipgram -input data.txt -output model
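data.txt is the training file; it contains UTF-8 encoded text.
When training finishes, the program saves two files: model.bin and model.vec.
model.vec is a text file containing the learned word vectors, one per line.
model.bin is a binary file containing the model parameters, the dictionary, and all hyperparameters.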
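As a rough illustration of how model.vec can be consumed, the sketch below parses the text layout in Python. It assumes the usual layout of a header line holding the vocabulary size and the dimension, followed by one word and its values per line. The function name `parse_vec` and the sample data are made up for illustration and are not part of fastText.

```python
import io

def parse_vec(stream):
    """Parse word vectors in the textual .vec layout into a dict."""
    # Assumed header: "<vocab_size> <dim>"
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in stream:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vocab_size, dim, vectors

# Tiny made-up sample standing in for a real model.vec file.
sample = io.StringIO("2 3\nthe 0.1 0.2 0.3\ncat -0.5 0.0 0.25\n")
vocab_size, dim, vectors = parse_vec(sample)
print(vocab_size, dim, vectors["cat"])
```

For a real file, one would pass `open("model.vec", encoding="utf-8")` instead of the in-memory sample.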
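2. Obtaining word vectors for out-of-vocabulary words
The previously trained model can be used to compute word vectors for words outside the vocabulary. The file queries.txt contains the words you want vectors for.
The following command writes the word vectors to standard output, one vector per line.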
$ ./fasttext print-word-vectors model.bin < queries.txt
This can also be used with a pipe:
$ cat queries.txt | ./fasttext print-word-vectors model.bin
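Vectors for out-of-vocabulary words are possible because, as described in paper [1], a word is represented through its character n-grams. The sketch below shows only that n-gram extraction step, in a simplified form: it works on Python characters, does no hashing into buckets, and the function name `char_ngrams` and the default lengths 3 and 6 are illustrative choices rather than fastText's own code.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of a word wrapped in < and > markers."""
    wrapped = f"<{word}>"  # boundary markers distinguish prefixes and suffixes
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams

# With n = 3 only, "where" yields the trigrams listed in paper [1].
print(char_ngrams("where", 3, 3))
```

An unseen word still shares many of these n-grams with words seen during training, which is what lets the model assemble a vector for it.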
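The provided script serves as an example: it compiles the code, downloads data, computes word vectors, and evaluates them on the Rare Words (RW) similarity dataset.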
$ ./word-vector-example.sh
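3. Text classification
This library can also be used to train supervised text classifiers, for instance for sentiment analysis.
To train a text classifier using the method described in paper [2], use: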
$ ./fasttext supervised -input train.txt -output model
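train.txt is a text file containing one sentence per line together with its labels. Labels are words prefixed with __label__.
This outputs two files: model.bin and model.vec.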
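To make the training file format concrete, here is a minimal sketch that builds lines in that layout from (labels, text) pairs. The helper name `to_fasttext_line` and the example sentences are hypothetical. The sketch assumes that labels contain no whitespace.

```python
def to_fasttext_line(labels, text, prefix="__label__"):
    """Format one example as '<prefix>label ... text' on a single line."""
    tags = " ".join(prefix + label for label in labels)
    # Collapse newlines and repeated whitespace: one example must stay on one line.
    clean = " ".join(text.split())
    return f"{tags} {clean}"

examples = [
    (["positive"], "I loved this film"),
    (["negative", "boring"], "Too long,\nand nothing happens"),
]
lines = [to_fasttext_line(labels, text) for labels, text in examples]
print("\n".join(lines))
```

Writing `lines` to a UTF-8 file, one per line, would give a file shaped like the train.txt used above.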
Once the model is trained, you can evaluate it by computing the precision and recall at k (P@k and R@k) on a test set with the following command:
$ ./fasttext test model.bin test.txt k
The argument k is optional and defaults to 1.
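As a sketch of what precision and recall at k measure, the snippet below computes both over a small made-up test set. It assumes a micro-averaged definition (correct predictions divided by all predictions made, and by all gold labels, respectively). The function name and the data are illustrative and are not taken from fastText.

```python
def precision_recall_at_k(gold, predicted, k=1):
    """gold: list of label sets; predicted: list of ranked label lists."""
    correct = n_predicted = n_gold = 0
    for gold_labels, ranked in zip(gold, predicted):
        top_k = ranked[:k]  # keep only the k highest-ranked predictions
        correct += sum(1 for label in top_k if label in gold_labels)
        n_predicted += len(top_k)
        n_gold += len(gold_labels)
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return precision, recall

gold = [{"baking", "bread"}, {"equipment"}]
predicted = [["baking", "food"], ["knives", "equipment"]]
print(precision_recall_at_k(gold, predicted, k=1))
print(precision_recall_at_k(gold, predicted, k=2))
```

In this toy data, raising k from 1 to 2 leaves precision unchanged and raises recall, because the second example's correct label is only ranked second.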
To obtain the k most likely labels for a piece of text, use:
$ ./fasttext predict model.bin test.txt k
Alternatively, use predict-prob to also get the probability of each label:
$ ./fasttext predict-prob model.bin test.txt k
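test.txt contains one piece of text to classify per line. The command writes the k most likely labels for each line to standard output. The argument k is optional and defaults to 1.
See the classification-example.sh script for an example. It downloads all the datasets and reproduces the results.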
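If the output of predict-prob needs to be post-processed, something like the following sketch could be used. It assumes each output line alternates labels and probabilities separated by spaces. The sample line and the helper name `parse_predict_prob` are made up for illustration, so check the layout against your own output first.

```python
def parse_predict_prob(line, prefix="__label__"):
    """Turn '__label__a 0.9 __label__b 0.1' into [('a', 0.9), ('b', 0.1)]."""
    tokens = line.split()
    pairs = []
    # Assumed layout: label, probability, label, probability, ...
    for label, prob in zip(tokens[0::2], tokens[1::2]):
        if label.startswith(prefix):
            label = label[len(prefix):]
        pairs.append((label, float(prob)))
    return pairs

sample_output = "__label__baking 0.71 __label__bread 0.12"
print(parse_predict_prob(sample_output))
```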
If you want to compute vector representations of sentences or paragraphs, use:
$ ./fasttext print-sentence-vectors model.bin < text.txt
You can also quantize a supervised model to reduce its memory usage with the following command:
$ ./fasttext quantize -output model
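This produces a .ftz file with a smaller memory footprint.
All the standard functionality, such as test or predict, works the same way on quantized models: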
$ ./fasttext test model.ftz test.txt
You can use quantization-example.sh as an example.
Full documentation
$ ./fasttext supervised
Empty input or output path.
The following arguments are mandatory:
-input training file path
-output output file path
The following arguments are optional:
-verbose verbosity level [2]
The following arguments for the dictionary are optional:
-minCount minimal number of word occurrences [1]
-minCountLabel minimal number of label occurrences [0]
-wordNgrams max length of word ngram [1]
-bucket number of buckets [2000000]
-minn min length of char ngram [0]
-maxn max length of char ngram [0]
-t sampling threshold [0.0001]
-label labels prefix [__label__]
The following arguments for training are optional:
-lr learning rate [0.1]
-lrUpdateRate change the rate of updates for the learning rate [100]
-dim size of word vectors [100]
-ws size of the context window [5]
-epoch number of epochs [5]
-neg number of negatives sampled [5]
-loss loss function {ns, hs, softmax} [softmax]
-thread number of threads [12]
-pretrainedVectors pretrained word vectors for supervised learning []
-saveOutput whether output params should be saved [0]
The following arguments for quantization are optional:
-cutoff number of words and ngrams to retain [0]
-retrain finetune embeddings if a cutoff is applied [0]
-qnorm quantizing the norm separately [0]
-qout quantizing the classifier [0]
-dsub size of each sub-vector [2]
Defaults may vary by mode. (Word-representation modes skipgram and cbow use a default -minCount of 5.)
2022-06-10 (Fri)