NLP - fastText
2022-06-11 22:09:00 【伊织code】

About fastText
FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers.
It works on standard, generic hardware.
Models can later be reduced in size to even fit on mobile devices.
- Website: https://fasttext.cc
- GitHub: https://github.com/facebookresearch/fastText/
- Pre-trained models: https://fasttext.cc/docs/en/english-vectors.html
- apachecn FastText documentation in Chinese (recommended): http://fasttext.apachecn.org/#/
Installation & Compilation
Compiling with make (recommended)
Requires C++11 or later.
This produces the fasttext binary.
$ wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
$ unzip v0.9.2.zip
$ cd fastText-0.9.2
$ make
Compiling with cmake
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ mkdir build && cd build && cmake ..
$ make && make install
Verifying the installation
$ fasttext -h
usage: fasttext <command> <args>
The commands supported by fasttext are:
supervised train a supervised classifier
quantize quantize a model to reduce the memory usage
test evaluate a supervised classifier
test-label print labels with precision and recall scores
predict predict most likely labels
predict-prob predict most likely labels with probabilities
skipgram train a skipgram model
cbow train a cbow model
print-word-vectors print word vectors given a trained model
print-sentence-vectors print sentence vectors given a trained model
print-ngrams print ngrams given a trained model and word
nn query for nearest neighbors
analogies query for analogies
dump dump arguments, dictionary, input/output vectors
Installing with pip
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip3 install .
Usage Examples
The library has two main use cases:
1. Learning word representations
2. Text classification
These are described in the following two papers:
[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
1. Learning word representations
$ ./fasttext skipgram -input data.txt -output model
data.txt is the training file; its content should be UTF-8 encoded text.
Training saves two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one vector per line. model.bin is a binary file that stores the model parameters, the dictionary, and all hyperparameters.
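The subword idea behind these vectors (paper [1]) can be sketched in a few lines: each word is decomposed into character n-grams, with < and > added as boundary markers; the -minn and -maxn options control the n-gram lengths. A minimal illustration, not the library's actual code:

```python
def char_ngrams(word, minn=3, maxn=6):
    """Extract character n-grams the way the subword model does:
    the word is first wrapped in the boundary markers < and >."""
    w = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(w) - n + 1):
            grams.append(w[i:i + n])
    return grams

print(char_ngrams("where", minn=3, maxn=3))
# ['<wh', 'whe', 'her', 'ere', 're>']
```

Note how the boundary markers make the trigram "her" inside "where" distinct from the whole word "her", which would appear as "<her>".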
2. Getting word vectors for out-of-vocabulary words
A previously trained model can be used to compute word vectors for words outside its vocabulary. Put the words you need vectors for in a file such as queries.txt.
The following command prints the word vectors to standard output, one vector per line.
$ ./fasttext print-word-vectors model.bin < queries.txt
Or call it in a pipeline:
$ cat queries.txt | ./fasttext print-word-vectors model.bin
The provided example script compiles the code, downloads data, computes word vectors, and evaluates them on the Rare Words (RW) similarity dataset:
$ ./word-vector-example.sh
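Out-of-vocabulary words get vectors because a word's vector is composed from the vectors of its character n-grams, and an unseen word still decomposes into n-grams the model has seen. A toy sketch with a made-up two-dimensional n-gram table (in the real model these vectors come from model.bin):

```python
# Hypothetical n-gram embedding table; each n-gram maps to a learned vector.
ngram_vectors = {
    "<he": [1.0, 0.0],
    "hel": [0.0, 1.0],
    "llo": [1.0, 1.0],
}

def oov_vector(ngrams, table, dim=2):
    """Sum the vectors of the n-grams the word decomposes into."""
    v = [0.0] * dim
    for g in ngrams:
        if g in table:
            for i, x in enumerate(table[g]):
                v[i] += x
    return v

print(oov_vector(["<he", "hel", "llo"], ngram_vectors))  # [2.0, 2.0]
```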
3. Text classification
The library can also train supervised text classifiers, for example for sentiment analysis.
To train a text classifier using the method described in paper [2], run:
$ ./fasttext supervised -input train.txt -output model
train.txt contains one sentence per line together with its labels; labels carry the __label__ prefix by default. This outputs two files: model.bin and model.vec.
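A minimal sketch of producing a file in this format (the labels and sentences here are made up):

```python
# One example per line; each label is prefixed with __label__.
samples = [
    ("positive", "great movie , loved it"),
    ("negative", "terrible plot and bad acting"),
]
lines = [f"__label__{label} {text}" for label, text in samples]
with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")
print(lines[0])  # __label__positive great movie , loved it
```

Multi-label examples simply carry several __label__ tokens at the start of the line.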
Once the model is trained, you can evaluate it on a test set by computing precision and recall at k (P@k and R@k) with the following command:
$ ./fasttext test model.bin test.txt k
k is optional and defaults to 1.
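What the test command reports can be sketched like this for a single example, assuming predicted is a ranked list of labels and gold is the set of true labels (the tool itself averages these numbers over the whole test set):

```python
def precision_recall_at_k(predicted, gold, k=1):
    """P@k: fraction of the top-k predicted labels that are correct.
    R@k: fraction of the gold labels recovered within the top k."""
    topk = predicted[:k]
    hits = sum(1 for label in topk if label in gold)
    return hits / k, hits / len(gold)

p, r = precision_recall_at_k(["sports", "news"], {"sports", "politics"}, k=2)
print(p, r)  # 0.5 0.5
```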
To get the k most likely labels for a piece of text, use:
$ ./fasttext predict model.bin test.txt k
Or use predict-prob to also get the probability of each label:
$ ./fasttext predict-prob model.bin test.txt k
test.txt contains one passage per line. For each line, the k most likely labels are printed to standard output; k is optional and defaults to 1.
See the classification-example.sh script for a full example; it downloads the datasets and reproduces the results.
To compute vector representations of sentences or paragraphs, use:
$ ./fasttext print-sentence-vectors model.bin < text.txt
You can also quantize a supervised model to reduce its memory usage:
$ ./fasttext quantize -output model
This creates a .ftz file with a smaller memory footprint.
All standard functionality, such as test or predict, works the same on quantized models:
$ ./fasttext test model.ftz test.txt
See quantization-example.sh for an example.
Full documentation
$ ./fasttext supervised
Empty input or output path.
The following arguments are mandatory:
-input training file path
-output output file path
The following arguments are optional:
-verbose verbosity level [2]
The following arguments for the dictionary are optional:
-minCount minimal number of word occurrences [1]
-minCountLabel minimal number of label occurrences [0]
-wordNgrams max length of word ngram [1]
-bucket number of buckets [2000000]
-minn min length of char ngram [0]
-maxn max length of char ngram [0]
-t sampling threshold [0.0001]
-label labels prefix [__label__]
The following arguments for training are optional:
-lr learning rate [0.1]
-lrUpdateRate change the rate of updates for the learning rate [100]
-dim size of word vectors [100]
-ws size of the context window [5]
-epoch number of epochs [5]
-neg number of negatives sampled [5]
-loss loss function {ns, hs, softmax} [softmax]
-thread number of threads [12]
-pretrainedVectors pretrained word vectors for supervised learning []
-saveOutput whether output params should be saved [0]
The following arguments for quantization are optional:
-cutoff number of words and ngrams to retain [0]
-retrain finetune embeddings if a cutoff is applied [0]
-qnorm quantizing the norm separately [0]
-qout quantizing the classifier [0]
-dsub size of each sub-vector [2]
Defaults may vary by mode. (Word-representation modes skipgram and cbow use a default -minCount of 5.)
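To illustrate the -bucket option: word and character n-grams are not stored in the dictionary; they are hashed into a fixed number of buckets, which keeps memory bounded at the cost of occasional collisions. A sketch using the 32-bit FNV-1a hash (the library uses a similar hash internally; treat this as an approximation, not the exact implementation):

```python
def fnv1a(s):
    """32-bit FNV-1a hash of a UTF-8 string."""
    h = 2166136261
    for b in s.encode("utf-8"):
        h = ((h ^ b) * 16777619) & 0xFFFFFFFF
    return h

bucket = 2000000  # the -bucket default shown above
# Every n-gram, seen or unseen, lands in one of `bucket` embedding rows.
print(fnv1a("new york") % bucket)
```

Raising -bucket reduces collisions between n-grams but increases the model size proportionally.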