NLP - fastText
2022-06-11 22:20:00 【Yizhi code】

About fastText
FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers.
It works on standard, generic hardware.
Models can later be reduced in size to even fit on mobile devices.
- Official website: https://fasttext.cc
- GitHub: https://github.com/facebookresearch/fastText/
- Pre-trained models: https://fasttext.cc/docs/en/english-vectors.html
- apachecn FastText Chinese documentation (recommended): http://fasttext.apachecn.org/#/
Install & compile
Compiling with make (recommended)
Requires a compiler with C++11 support or later.
This produces the fasttext binary.
$ wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
$ unzip v0.9.2.zip
$ cd fastText-0.9.2
$ make
Compiling with cmake
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ mkdir build && cd build && cmake ..
$ make && make install
Verifying the installation
$ fasttext -h
usage: fasttext <command> <args>
The commands supported by fasttext are:
supervised train a supervised classifier
quantize quantize a model to reduce the memory usage
test evaluate a supervised classifier
test-label print labels with precision and recall scores
predict predict most likely labels
predict-prob predict most likely labels with probabilities
skipgram train a skipgram model
cbow train a cbow model
print-word-vectors print word vectors given a trained model
print-sentence-vectors print sentence vectors given a trained model
print-ngrams print ngrams given a trained model and word
nn query for nearest neighbors
analogies query for analogies
dump dump arguments,dictionary,input/output vectors
Installing with pip
$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip3 install .
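If the installation succeeded, the Python bindings should import cleanly. A minimal sanity check:

import fasttext

# The import alone should succeed without errors; printing one of the
# training functions confirms the bindings are usable.
print(fasttext.train_supervised)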
Usage examples
This library is mainly used for two tasks:
1. Word representation learning
2. Text classification
These are described in the following two papers:
[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
1. Word representation learning
$ ./fasttext skipgram -input data.txt -output model
data.txt is the training file; its content must be UTF-8 encoded text.
Training saves two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one word per line. model.bin is a binary file that stores the model parameters, the dictionary, and all hyperparameters.
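If you installed the Python bindings, the same training can be driven from Python. A minimal sketch, assuming data.txt exists as described above (the query word "king" is just a placeholder):

import fasttext

# Train a skipgram model on a UTF-8 encoded text corpus.
model = fasttext.train_unsupervised("data.txt", model="skipgram")

# Persist the binary model (the Python-side counterpart of model.bin).
model.save_model("model.bin")

# Inspect the learned vector for a word from the corpus.
print(model.get_word_vector("king"))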
2. Getting word vectors for out-of-vocabulary words
A previously trained model can be used to compute word vectors for words outside its vocabulary. The file queries.txt contains the words you want vectors for.
The following command prints the word vectors to standard output, one word per line:
$ ./fasttext print-word-vectors model.bin < queries.txt
You can also pipe the input:
$ cat queries.txt | ./fasttext print-word-vectors model.bin
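The Python bindings expose the same capability: get_word_vector returns a vector for any string, including out-of-vocabulary words, because fastText composes it from character n-grams. A sketch, assuming model.bin was trained as in the previous step (the misspelled query is a deliberate placeholder):

import fasttext

model = fasttext.load_model("model.bin")

# Works even for words absent from the training corpus, as long as the
# model was trained with character n-grams (minn/maxn greater than 0).
print(model.get_word_vector("enviroment"))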
As an example, the provided script compiles the code, downloads the data, computes word vectors, and evaluates them on the rare-word similarity dataset RW:
$ ./word-vector-example.sh
3. Text classification
The library can also be used to train supervised text classifiers, for example for sentiment analysis.
To train a text classifier using the method of paper [2], run:
$ ./fasttext supervised -input train.txt -output model
train.txt contains one sentence per line together with its labels; each label uses __label__ as a prefix. This outputs two files: model.bin and model.vec.
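The same training is available from Python. A sketch assuming train.txt is formatted as described above (the example lines in the comment are illustrative, not from a real dataset):

import fasttext

# Each line of train.txt pairs one or more labels with a sentence, e.g.:
#   __label__positive I really enjoyed this movie
#   __label__negative the plot was dull and predictable
model = fasttext.train_supervised(input="train.txt")
model.save_model("model.bin")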
Once the model is trained, you can evaluate it on a test set by computing precision and recall at k (P@k and R@k) with the following command:
$ ./fasttext test model.bin test.txt k
k is optional and defaults to 1.
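From Python, the equivalent is model.test, which returns the number of examples along with P@k and R@k. A sketch assuming the files above:

import fasttext

model = fasttext.load_model("model.bin")
n, p_at_k, r_at_k = model.test("test.txt", k=1)
print(f"N={n}  P@1={p_at_k:.3f}  R@1={r_at_k:.3f}")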
To get the k most likely labels for a piece of text, use:
$ ./fasttext predict model.bin test.txt k
Or use predict-prob to also get the probability of each label:
$ ./fasttext predict-prob model.bin test.txt k
test.txt contains one paragraph of text per line. This prints the k most likely labels for each line to standard output. k is optional and defaults to 1. See also the classification-example.sh script, which downloads all the datasets and reproduces the results.
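In Python, a single call covers both predict and predict-prob: model.predict returns the top-k labels together with their probabilities. A sketch (the input sentence is a placeholder):

import fasttext

model = fasttext.load_model("model.bin")

# Returns a tuple of labels and a parallel array of probabilities.
# Note: the input string must not contain a newline character.
labels, probs = model.predict("I really enjoyed this movie", k=3)
for label, prob in zip(labels, probs):
    print(label, round(float(prob), 4))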
If you want to compute the vector representation of a sentence or paragraph, use:
$ ./fasttext print-sentence-vectors model.bin < text.txt
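The Python counterpart is get_sentence_vector, which produces a single fixed-size vector for the whole input. A sketch:

import fasttext

model = fasttext.load_model("model.bin")
vec = model.get_sentence_vector("fastText is a library for text classification")
print(vec.shape)  # shape is (dim,), i.e. 100 with the default settings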
You can also quantize a supervised model with the following command to reduce its memory usage:
$ ./fasttext quantize -output model
This creates a model.ftz file with a much smaller memory footprint.
All standard functionality, such as testing or prediction, works the same way on quantized models:
$ ./fasttext test model.ftz test.txt
See quantization-example.sh for a usage example.
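Quantization is also exposed on the Python model object; the cutoff and retrain arguments mirror the CLI flags listed in the next section. A sketch assuming the supervised model and training file from above:

import fasttext

model = fasttext.load_model("model.bin")

# Quantize in place; retraining after a cutoff usually recovers some accuracy.
model.quantize(input="train.txt", cutoff=100000, retrain=True)
model.save_model("model.ftz")

# Evaluation works the same way on the quantized model.
print(model.test("test.txt"))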
Full list of options
$ ./fasttext supervised
Empty input or output path.
The following arguments are mandatory:
-input training file path
-output output file path
The following arguments are optional:
-verbose verbosity level [2]
The following arguments for the dictionary are optional:
-minCount minimal number of word occurrences [1]
-minCountLabel minimal number of label occurrences [0]
-wordNgrams max length of word ngram [1]
-bucket number of buckets [2000000]
-minn min length of char ngram [0]
-maxn max length of char ngram [0]
-t sampling threshold [0.0001]
-label labels prefix [__label__]
The following arguments for training are optional:
-lr learning rate [0.1]
-lrUpdateRate change the rate of updates for the learning rate [100]
-dim size of word vectors [100]
-ws size of the context window [5]
-epoch number of epochs [5]
-neg number of negatives sampled [5]
-loss loss function {ns, hs, softmax} [softmax]
-thread number of threads [12]
-pretrainedVectors pretrained word vectors for supervised learning []
-saveOutput whether output params should be saved [0]
The following arguments for quantization are optional:
-cutoff number of words and ngrams to retain [0]
-retrain finetune embeddings if a cutoff is applied [0]
-qnorm quantizing the norm separately [0]
-qout quantizing the classifier [0]
-dsub size of each sub-vector [2]
Defaults may vary by mode. (Word-representation modes skipgram and cbow use a default -minCount of 5.)
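Most of these flags map one-to-one onto keyword arguments of the Python training functions. A sketch showing a few common ones (the values are illustrative, not recommendations):

import fasttext

model = fasttext.train_supervised(
    input="train.txt",
    lr=0.5,          # -lr
    epoch=25,        # -epoch
    wordNgrams=2,    # -wordNgrams
    dim=100,         # -dim
    loss="softmax",  # -loss
)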
2022-06-10 (Friday)