Library for fast text representation and classification.

Overview

fastText

fastText is a library for efficient learning of word representations and sentence classification.

Table of contents

Resources

Models

Supplementary data

FAQ

You can find answers to frequently asked questions on our website.

Cheatsheet

We also provide a cheatsheet full of useful one-liners.

Requirements

We are continuously building and testing our library, CLI and Python bindings under various docker images using circleci.

Generally, fastText builds on modern macOS and Linux distributions. Since it uses some C++11 features, it requires a compiler with good C++11 support. These include:

  • (g++-4.7.2 or newer) or (clang-3.3 or newer)

Compilation is carried out using a Makefile, so you will need to have a working make. If you want to use cmake you need at least version 2.8.9.

One of the oldest distributions we successfully built and tested the CLI under is Debian jessie.

For the word-similarity evaluation script you will need:

  • Python 2.6 or newer
  • NumPy & SciPy

For the python bindings (see the subdirectory python) you will need:

  • Python 2.7, or 3.4 or newer
  • NumPy & SciPy
  • pybind11

One of the oldest distributions we successfully built and tested the Python bindings under is Debian jessie.

If these requirements make it impossible for you to use fastText, please open an issue and we will try to accommodate you.

Building fastText

We discuss building the latest stable version of fastText.

Getting the source code

You can find our latest stable release on the GitHub releases page.

There is also the master branch that contains all of our most recent work, but comes along with all the usual caveats of an unstable branch. You might want to use this if you are a developer or power-user.

Building fastText using make (preferred)

$ wget https://github.com/facebookresearch/fastText/archive/v0.9.2.zip
$ unzip v0.9.2.zip
$ cd fastText-0.9.2
$ make

This will produce object files for all the classes as well as the main binary fasttext. If you do not plan on using the default system-wide compiler, update the two macros defined at the beginning of the Makefile (CC and INCLUDES).

Building fastText using cmake

For now this is not part of a release, so you will need to clone the master branch.

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ mkdir build && cd build && cmake ..
$ make && make install

This will create the fasttext binary and also all relevant libraries (shared, static, PIC).

Building fastText for Python

For now this is not part of a release, so you will need to clone the master branch.

$ git clone https://github.com/facebookresearch/fastText.git
$ cd fastText
$ pip install .

For further information and an introduction, see python/README.md.

Example use cases

This library has two main use cases: word representation learning and text classification. These use cases are described in the papers [1] and [2].

Word representation learning

In order to learn word vectors, as described in [1], do:

$ ./fasttext skipgram -input data.txt -output model

where data.txt is a training file containing UTF-8 encoded text. By default the word vectors will take into account character n-grams from 3 to 6 characters. At the end of optimization the program will save two files: model.bin and model.vec. model.vec is a text file containing the word vectors, one per line. model.bin is a binary file containing the parameters of the model along with the dictionary and all hyperparameters. The binary file can be used later to compute word vectors or to restart the optimization.
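The subword scheme mentioned above can be illustrated with a short Python sketch (this is not fastText's implementation): each word is wrapped in the boundary markers `<` and `>`, and every character n-gram of length 3 to 6 is collected as a feature, along with the whole word itself.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Collect character n-grams as the README describes: the word is
    wrapped in boundary markers '<' and '>' and every substring of
    length minn..maxn is kept, plus the full bracketed token."""
    token = "<" + word + ">"
    ngrams = set()
    for n in range(minn, maxn + 1):
        for i in range(len(token) - n + 1):
            ngrams.add(token[i:i + n])
    ngrams.add(token)  # the whole word is also kept as a feature
    return ngrams

print(sorted(char_ngrams("where", minn=3, maxn=3)))
# → ['<wh', '<where>', 'ere', 'her', 're>', 'whe']
```

Note how the boundary markers let the model distinguish the trigram "her" inside "where" from the standalone word "<her>".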

Obtaining word vectors for out-of-vocabulary words

The previously trained model can be used to compute word vectors for out-of-vocabulary words. Provided you have a text file queries.txt containing words for which you want to compute vectors, use the following command:

$ ./fasttext print-word-vectors model.bin < queries.txt

This will output word vectors to the standard output, one vector per line. This can also be used with pipes:

$ cat queries.txt | ./fasttext print-word-vectors model.bin
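What makes out-of-vocabulary queries possible is that a word's vector is built from its subword vectors. The toy sketch below illustrates the idea only; the vectors here are random stand-ins, and the real model looks subwords up in hashed buckets inside model.bin rather than in a Python dictionary.

```python
import random

def char_ngrams(word, minn=3, maxn=6):
    token = "<" + word + ">"
    return {token[i:i + n]
            for n in range(minn, maxn + 1)
            for i in range(len(token) - n + 1)} | {token}

# Hypothetical subword table: in the real model these vectors come from
# model.bin; here we just draw small random vectors on demand.
random.seed(0)
DIM = 4
subword_vec = {}

def vector_for(ngram):
    if ngram not in subword_vec:
        subword_vec[ngram] = [random.uniform(-1, 1) for _ in range(DIM)]
    return subword_vec[ngram]

def oov_vector(word):
    """Average the vectors of the word's character n-grams -- this is
    how a vector can be produced for a word never seen in training."""
    grams = char_ngrams(word)
    acc = [0.0] * DIM
    for g in grams:
        for j, x in enumerate(vector_for(g)):
            acc[j] += x
    return [x / len(grams) for x in acc]

print(oov_vector("blargh"))
```

Because any string decomposes into character n-grams, this construction never fails, even for misspellings or rare morphological variants.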

See the provided scripts for an example. For instance, running:

$ ./word-vector-example.sh

will compile the code, download data, compute word vectors and evaluate them on the rare words similarity dataset RW [Thang et al. 2013].

Text classification

This library can also be used to train supervised text classifiers, for instance for sentiment analysis. In order to train a text classifier using the method described in [2], use:

$ ./fasttext supervised -input train.txt -output model

where train.txt is a text file containing a training sentence per line along with the labels. By default, we assume that labels are words prefixed by the string __label__. This will output two files: model.bin and model.vec. Once the model is trained, you can evaluate it by computing the precision and recall at k (P@k and R@k) on a test set using:

$ ./fasttext test model.bin test.txt k

The argument k is optional, and is equal to 1 by default.
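As a rough illustration (not fastText's code), precision at k is the fraction of the k predicted labels that are correct, while recall at k is the fraction of the true labels that appear among those k predictions:

```python
def precision_recall_at_k(predicted, true_labels, k=1):
    """predicted: labels ranked by score; true_labels: gold labels.
    P@k = (# correct among top k) / k
    R@k = (# correct among top k) / (# true labels)"""
    topk = predicted[:k]
    hits = sum(1 for label in topk if label in true_labels)
    return hits / k, hits / len(true_labels)

p, r = precision_recall_at_k(
    ["__label__sports", "__label__politics"],  # ranked predictions
    {"__label__sports", "__label__finance"},   # gold labels
    k=2,
)
print(p, r)  # → 0.5 0.5
```

With k=1 and a single gold label per example, P@1 and R@1 coincide, which is the common single-label classification accuracy.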

In order to obtain the k most likely labels for a piece of text, use:

$ ./fasttext predict model.bin test.txt k

or use predict-prob to also get the probability for each label:

$ ./fasttext predict-prob model.bin test.txt k

where test.txt contains a piece of text to classify per line. Doing so will print to the standard output the k most likely labels for each line. The argument k is optional and equal to 1 by default. See classification-example.sh for an example use case. In order to reproduce results from the paper [2], run classification-results.sh; this will download all the datasets and reproduce the results from Table 1.

If you want to compute vector representations of sentences or paragraphs, please use:

$ ./fasttext print-sentence-vectors model.bin < text.txt

This assumes that the text.txt file contains the paragraphs that you want to get vectors for. The program will output one vector representation per line in the file.
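A common way to build such a sentence vector, close in spirit to what fastText does, is to average the (unit-normalized) vectors of the words on the line. The sketch below uses hand-made toy vectors in place of a trained model:

```python
import math

# Toy word vectors standing in for the ones a trained model would give.
WORD_VECS = {
    "fast": [1.0, 0.0],
    "text": [0.0, 1.0],
}

def sentence_vector(sentence):
    """Average the unit-normalized vectors of the words on a line --
    a simple stand-in for what print-sentence-vectors produces."""
    acc = [0.0, 0.0]
    count = 0
    for w in sentence.split():
        v = WORD_VECS.get(w)
        if v is None:
            continue  # a real model would back off to subword n-grams
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        for j, x in enumerate(v):
            acc[j] += x / norm
        count += 1
    return [x / count for x in acc] if count else acc

print(sentence_vector("fast text"))  # → [0.5, 0.5]
```

Averaging makes the representation length-independent: a two-word line and a two-hundred-word paragraph both map to a vector of the same dimension.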

You can also quantize a supervised model to reduce its memory usage with the following command:

$ ./fasttext quantize -output model

This will create a .ftz file with a smaller memory footprint. All the standard functionality, like test or predict, works the same way on the quantized models:

$ ./fasttext test model.ftz test.txt

The quantization procedure follows the steps described in [3]. You can run the script quantization-example.sh for an example.
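The core idea behind this compression is product quantization: each embedding is split into sub-vectors of size dsub, and each sub-vector is replaced by the index of its nearest centroid in a small codebook. The sketch below uses a tiny hand-picked codebook purely for illustration; the real procedure learns its codebooks with k-means and adds the cutoff/retrain steps listed under the quantization options.

```python
def nearest(codebook, sub):
    """Index of the centroid closest to the sub-vector (squared L2)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(codebook[i], sub)))

def pq_encode(vec, codebook, dsub=2):
    """Split vec into blocks of size dsub and keep one small centroid
    id per block instead of dsub full floats -- the memory saving
    behind the .ftz format."""
    return [nearest(codebook, vec[i:i + dsub])
            for i in range(0, len(vec), dsub)]

def pq_decode(codes, codebook):
    """Reconstruct an approximate vector from the stored centroid ids."""
    out = []
    for c in codes:
        out.extend(codebook[c])
    return out

# Hand-picked toy codebook of 2-dimensional centroids (not learned).
CODEBOOK = [[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
codes = pq_encode([0.9, 1.1, 0.1, -0.2], CODEBOOK, dsub=2)
print(codes, pq_decode(codes, CODEBOOK))
# → [1, 0] [1.0, 1.0, 0.0, 0.0]
```

Storing one small integer per block instead of dsub 32-bit floats is where the footprint reduction comes from; the price is the approximation error visible in the decoded vector.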

Full documentation

Invoke a command without arguments to list available arguments and their default values:

$ ./fasttext supervised
Empty input or output path.

The following arguments are mandatory:
  -input              training file path
  -output             output file path

The following arguments are optional:
  -verbose            verbosity level [2]

The following arguments for the dictionary are optional:
  -minCount           minimal number of word occurrences [1]
  -minCountLabel      minimal number of label occurrences [0]
  -wordNgrams         max length of word ngram [1]
  -bucket             number of buckets [2000000]
  -minn               min length of char ngram [0]
  -maxn               max length of char ngram [0]
  -t                  sampling threshold [0.0001]
  -label              labels prefix [__label__]

The following arguments for training are optional:
  -lr                 learning rate [0.1]
  -lrUpdateRate       change the rate of updates for the learning rate [100]
  -dim                size of word vectors [100]
  -ws                 size of the context window [5]
  -epoch              number of epochs [5]
  -neg                number of negatives sampled [5]
  -loss               loss function {ns, hs, softmax} [softmax]
  -thread             number of threads [12]
  -pretrainedVectors  pretrained word vectors for supervised learning []
  -saveOutput         whether output params should be saved [0]

The following arguments for quantization are optional:
  -cutoff             number of words and ngrams to retain [0]
  -retrain            finetune embeddings if a cutoff is applied [0]
  -qnorm              quantizing the norm separately [0]
  -qout               quantizing the classifier [0]
  -dsub               size of each sub-vector [2]

Defaults may vary by mode. (Word-representation modes skipgram and cbow use a default -minCount of 5.)

References

Please cite [1] if using this code for learning word representations, or [2] if using it for text classification.

Enriching Word Vectors with Subword Information

[1] P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

@article{bojanowski2017enriching,
  title={Enriching Word Vectors with Subword Information},
  author={Bojanowski, Piotr and Grave, Edouard and Joulin, Armand and Mikolov, Tomas},
  journal={Transactions of the Association for Computational Linguistics},
  volume={5},
  year={2017},
  issn={2307-387X},
  pages={135--146}
}

Bag of Tricks for Efficient Text Classification

[2] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@InProceedings{joulin2017bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  booktitle={Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers},
  month={April},
  year={2017},
  publisher={Association for Computational Linguistics},
  pages={427--431},
}

FastText.zip: Compressing text classification models

[3] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, Herv{\'e} and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}

(* These authors contributed equally.)

Join the fastText community

See the CONTRIBUTING file for information about how to help out.

License

fastText is MIT-licensed.

Comments
  • fasttext installed but import fails

    Hi, I have successfully installed fasttext on Python 3.5. However, when I try to import it I get the following error:

    Using /usr/local/lib/python3.5/dist-packages
    Finished processing dependencies for fasttext==0.8.22
    [email protected]:~/GitHub/fastText$ python3.5
    Python 3.5.2 (default, Nov 23 2017, 16:37:01) 
    [GCC 5.4.0 20160609] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import fasttext
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: No module named 'fasttext'
    >>> 
    

    I have tried installing both with pip install . and python setup.py install with no luck.

    opened by ahmedahmedov 25
  • Assertion failed on ./fasttext predict

    predict command failed!

    ./fasttext predict model.bin test.txt

    Assertion failed: (counts.size() == osz_), function setTargetCounts, file src/model.cc, line 188.
    Abort trap: 6
    

    model train command was:

    ./fasttext supervised -input train.txt -output model -wordNgrams 4 -bucket 1000000 -thread 16

    Read 4223M words
    Number of words:  16577869
    Number of labels: 25
    Progress: 100.0%  words/sec/thread: 375706  lr: 0.000000  loss: 0.169518  eta: 0h0m 
    
    opened by spate141 25
  • How can we get the vector of a paragraph?

    I have tried doc2vec (from gensim, based on word2vec), with which I can extract a fixed-length vector for variable-length paragraphs. Can I do the same with fastText?

    Thank you!

    opened by xchangcheng 22
  • OS X install problem

    When I install fasttext using "pip install .", I get errors like the following:

    Failed to build fasttext
    Installing collected packages: fasttext
      Running setup.py install for fasttext ... error
        Complete output from command /miniconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/pip-req-build-i2z3pyel/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/pip-record-yg0h6noh/install-record.txt --single-version-externally-managed --compile:
        running install
        running build
        running build_py
        creating build
        creating build/lib.macosx-10.7-x86_64-3.6
        creating build/lib.macosx-10.7-x86_64-3.6/fastText
        copying python/fastText/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/fastText
        copying python/fastText/FastText.py -> build/lib.macosx-10.7-x86_64-3.6/fastText
        creating build/lib.macosx-10.7-x86_64-3.6/fastText/util
        copying python/fastText/util/util.py -> build/lib.macosx-10.7-x86_64-3.6/fastText/util
        copying python/fastText/util/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/fastText/util
        creating build/lib.macosx-10.7-x86_64-3.6/fastText/tests
        copying python/fastText/tests/test_script.py -> build/lib.macosx-10.7-x86_64-3.6/fastText/tests
        copying python/fastText/tests/__init__.py -> build/lib.macosx-10.7-x86_64-3.6/fastText/tests
        copying python/fastText/tests/test_configurations.py -> build/lib.macosx-10.7-x86_64-3.6/fastText/tests
        running build_ext
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -c /var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/tmp1upvarhx.cpp -o var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/tmp1upvarhx.o -stdlib=libc++
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -c /var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/tmp9dzh7j94.cpp -o var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/tmp9dzh7j94.o -std=c++14
        warning: include path for stdlibc++ headers not found; pass '-std=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
        1 warning generated.
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -c /var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/tmpw5pz6xr0.cpp -o var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/tmpw5pz6xr0.o -fvisibility=hidden
        warning: include path for stdlibc++ headers not found; pass '-std=libc++' on the command line to use the libc++ standard library instead [-Wstdlibcxx-not-found]
        1 warning generated.
        building 'fasttext_pybind' extension
        creating build/temp.macosx-10.7-x86_64-3.6
        creating build/temp.macosx-10.7-x86_64-3.6/python
        creating build/temp.macosx-10.7-x86_64-3.6/python/fastText
        creating build/temp.macosx-10.7-x86_64-3.6/python/fastText/pybind
        creating build/temp.macosx-10.7-x86_64-3.6/src
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c python/fastText/pybind/fasttext_pybind.cc -o build/temp.macosx-10.7-x86_64-3.6/python/fastText/pybind/fasttext_pybind.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        python/fastText/pybind/fasttext_pybind.cc:219:35: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<long long, std::__1::allocator<long long> >::size_type' (aka 'unsigned long') [-Wsign-compare]
                    for (int32_t i = 0; i < vocab_freq.size(); i++) {
                                        ~ ^ ~~~~~~~~~~~~~~~~~
        python/fastText/pybind/fasttext_pybind.cc:233:35: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<long long, std::__1::allocator<long long> >::size_type' (aka 'unsigned long') [-Wsign-compare]
                    for (int32_t i = 0; i < labels_freq.size(); i++) {
                                        ~ ^ ~~~~~~~~~~~~~~~~~~
        2 warnings generated.
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/dictionary.cc -o build/temp.macosx-10.7-x86_64-3.6/src/dictionary.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        src/dictionary.cc:181:52: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int' [-Wsign-compare]
            for (size_t j = i, n = 1; j < word.size() && n <= args_->maxn; n++) {
                                                         ~ ^  ~~~~~~~~~~~
        src/dictionary.cc:186:13: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int' [-Wsign-compare]
              if (n >= args_->minn && !(n == 1 && (i == 0 || j == word.size()))) {
                  ~ ^  ~~~~~~~~~~~
        src/dictionary.cc:198:24: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
          for (size_t i = 0; i < size_; i++) {
                             ~ ^ ~~~~~
        src/dictionary.cc:296:24: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
          for (size_t i = 0; i < size_; i++) {
                             ~ ^ ~~~~~
        src/dictionary.cc:316:25: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
          for (int32_t i = 0; i < hashes.size(); i++) {
                              ~ ^ ~~~~~~~~~~~~~
        src/dictionary.cc:318:31: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
            for (int32_t j = i + 1; j < hashes.size() && j < i + n; j++) {
                                    ~ ^ ~~~~~~~~~~~~~
        src/dictionary.cc:515:25: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<fasttext::entry, std::__1::allocator<fasttext::entry> >::size_type' (aka 'unsigned long') [-Wsign-compare]
          for (int32_t i = 0; i < words_.size(); i++) {
                              ~ ^ ~~~~~~~~~~~~~
        src/dictionary.cc:517:12: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
                (j < words.size() && words[j] == i)) {
                 ~ ^ ~~~~~~~~~~~~
        8 warnings generated.
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/main.cc -o build/temp.macosx-10.7-x86_64-3.6/src/main.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        src/main.cc:348:3: warning: code will never be executed [-Wunreachable-code]
          exit(0);
          ^~~~
        1 warning generated.
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/fasttext.cc -o build/temp.macosx-10.7-x86_64-3.6/src/fasttext.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        src/fasttext.cc:92:21: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
          for (int i = 0; i < ngrams.size(); i++) {
                          ~ ^ ~~~~~~~~~~~~~
        src/fasttext.cc:302:18: warning: comparison of integers of different signs: 'const int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
            return eosid == i1 || (eosid != i2 && norms[i1] > norms[i2]);
                   ~~~~~ ^  ~~
        src/fasttext.cc:302:34: warning: comparison of integers of different signs: 'const int' and 'size_t' (aka 'unsigned long') [-Wsign-compare]
            return eosid == i1 || (eosid != i2 && norms[i1] > norms[i2]);
                                   ~~~~~ ^  ~~
        src/fasttext.cc:323:16: warning: 'selectEmbeddings' is deprecated: selectEmbeddings is being deprecated. [-Wdeprecated-declarations]
            auto idx = selectEmbeddings(qargs.cutoff);
                       ^
        src/fasttext.h:165:3: note: 'selectEmbeddings' has been explicitly marked deprecated here
          FASTTEXT_DEPRECATED("selectEmbeddings is being deprecated.")
          ^
        src/utils.h:18:49: note: expanded from macro 'FASTTEXT_DEPRECATED'
        #define FASTTEXT_DEPRECATED(msg) __attribute__((__deprecated__(msg)))
                                                        ^
        src/fasttext.cc:322:40: warning: comparison of integers of different signs: 'const size_t' (aka 'const unsigned long') and 'int64_t' (aka 'long long') [-Wsign-compare]
          if (qargs.cutoff > 0 && qargs.cutoff < input->size(0)) {
                                  ~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~
        src/fasttext.cc:327:24: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
            for (auto i = 0; i < idx.size(); i++) {
                             ~ ^ ~~~~~~~~~~
        src/fasttext.cc:380:25: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
          for (int32_t w = 0; w < line.size(); w++) {
                              ~ ^ ~~~~~~~~~~~
        src/fasttext.cc:384:41: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
              if (c != 0 && w + c >= 0 && w + c < line.size()) {
                                          ~~~~~ ^ ~~~~~~~~~~~
        src/fasttext.cc:398:25: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
          for (int32_t w = 0; w < line.size(); w++) {
                              ~ ^ ~~~~~~~~~~~
        src/fasttext.cc:402:41: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
              if (c != 0 && w + c >= 0 && w + c < line.size()) {
                                          ~~~~~ ^ ~~~~~~~~~~~
        src/fasttext.cc:479:27: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
            for (int32_t i = 0; i < line.size(); i++) {
                                ~ ^ ~~~~~~~~~~~
        src/fasttext.cc:514:25: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
          for (int32_t i = 0; i < ngrams.size(); i++) {
                              ~ ^ ~~~~~~~~~~~~~
        src/fasttext.cc:551:5: warning: 'precomputeWordVectors' is deprecated: precomputeWordVectors is being deprecated. [-Wdeprecated-declarations]
            precomputeWordVectors(*wordVectors_);
            ^
        src/fasttext.h:180:3: note: 'precomputeWordVectors' has been explicitly marked deprecated here
          FASTTEXT_DEPRECATED("precomputeWordVectors is being deprecated.")
          ^
        src/utils.h:18:49: note: expanded from macro 'FASTTEXT_DEPRECATED'
        #define FASTTEXT_DEPRECATED(msg) __attribute__((__deprecated__(msg)))
                                                        ^
        src/fasttext.cc:585:23: warning: comparison of integers of different signs: 'std::__1::vector<std::__1::pair<float, std::__1::basic_string<char> >, std::__1::allocator<std::__1::pair<float, std::__1::basic_string<char> > > >::size_type' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
              if (heap.size() == k && similarity < heap.front().first) {
                  ~~~~~~~~~~~ ^  ~
        src/fasttext.cc:590:23: warning: comparison of integers of different signs: 'std::__1::vector<std::__1::pair<float, std::__1::basic_string<char> >, std::__1::allocator<std::__1::pair<float, std::__1::basic_string<char> > > >::size_type' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
              if (heap.size() > k) {
                  ~~~~~~~~~~~ ^ ~
        src/fasttext.cc:701:24: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int64_t' (aka 'long long') [-Wsign-compare]
          for (size_t i = 0; i < n; i++) {
                             ~ ^ ~
        src/fasttext.cc:706:26: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int64_t' (aka 'long long') [-Wsign-compare]
            for (size_t j = 0; j < dim; j++) {
                               ~ ^ ~~~
        src/fasttext.cc:718:24: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int64_t' (aka 'long long') [-Wsign-compare]
          for (size_t i = 0; i < n; i++) {
                             ~ ^ ~
        src/fasttext.cc:723:26: warning: comparison of integers of different signs: 'size_t' (aka 'unsigned long') and 'int64_t' (aka 'long long') [-Wsign-compare]
            for (size_t j = 0; j < dim; j++) {
                               ~ ^ ~~~
        19 warnings generated.
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/utils.cc -o build/temp.macosx-10.7-x86_64-3.6/src/utils.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/model.cc -o build/temp.macosx-10.7-x86_64-3.6/src/model.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/loss.cc -o build/temp.macosx-10.7-x86_64-3.6/src/loss.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        src/loss.cc:83:21: warning: comparison of integers of different signs: 'std::__1::vector<std::__1::pair<float, int>, std::__1::allocator<std::__1::pair<float, int> > >::size_type' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
            if (heap.size() == k && std_log(output[i]) < heap.front().first) {
                ~~~~~~~~~~~ ^  ~
        src/loss.cc:88:21: warning: comparison of integers of different signs: 'std::__1::vector<std::__1::pair<float, int>, std::__1::allocator<std::__1::pair<float, int> > >::size_type' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
            if (heap.size() > k) {
                ~~~~~~~~~~~ ^ ~
        src/loss.cc:257:25: warning: comparison of integers of different signs: 'int32_t' (aka 'int') and 'std::__1::vector<int, std::__1::allocator<int> >::size_type' (aka 'unsigned long') [-Wsign-compare]
          for (int32_t i = 0; i < pathToRoot.size(); i++) {
                              ~ ^ ~~~~~~~~~~~~~~~~~
        src/loss.cc:282:19: warning: comparison of integers of different signs: 'std::__1::vector<std::__1::pair<float, int>, std::__1::allocator<std::__1::pair<float, int> > >::size_type' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
          if (heap.size() == k && score < heap.front().first) {
              ~~~~~~~~~~~ ^  ~
        src/loss.cc:289:21: warning: comparison of integers of different signs: 'std::__1::vector<std::__1::pair<float, int>, std::__1::allocator<std::__1::pair<float, int> > >::size_type' (aka 'unsigned long') and 'int32_t' (aka 'int') [-Wsign-compare]
            if (heap.size() > k) {
                ~~~~~~~~~~~ ^ ~
        5 warnings generated.
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/productquantizer.cc -o build/temp.macosx-10.7-x86_64-3.6/src/productquantizer.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        src/productquantizer.cc:246:22: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<float, std::__1::allocator<float> >::size_type' (aka 'unsigned long') [-Wsign-compare]
          for (auto i = 0; i < centroids_.size(); i++) {
                           ~ ^ ~~~~~~~~~~~~~~~~~
        1 warning generated.
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/args.cc -o build/temp.macosx-10.7-x86_64-3.6/src/args.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        src/args.cc:93:23: warning: comparison of integers of different signs: 'int' and 'std::__1::vector<std::__1::basic_string<char>, std::__1::allocator<std::__1::basic_string<char> > >::size_type' (aka 'unsigned long') [-Wsign-compare]
          for (int ai = 2; ai < args.size(); ai += 2) {
                           ~~ ^ ~~~~~~~~~~~
        1 warning generated.
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/quantmatrix.cc -o build/temp.macosx-10.7-x86_64-3.6/src/quantmatrix.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/matrix.cc -o build/temp.macosx-10.7-x86_64-3.6/src/matrix.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/meter.cc -o build/temp.macosx-10.7-x86_64-3.6/src/meter.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/vector.cc -o build/temp.macosx-10.7-x86_64-3.6/src/vector.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        gcc -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/miniconda3/include -arch x86_64 -I/miniconda3/include -arch x86_64 -I/miniconda3/include/python3.6m -I/Users/ruanxiaoyi/.local/include/python3.6m -Isrc -I/miniconda3/include/python3.6m -c src/densematrix.cc -o build/temp.macosx-10.7-x86_64-3.6/src/densematrix.o -stdlib=libc++ -DVERSION_INFO="0.8.22" -std=c++14 -fvisibility=hidden
        g++ -bundle -undefined dynamic_lookup -L/miniconda3/lib -arch x86_64 -L/miniconda3/lib -arch x86_64 -arch x86_64 build/temp.macosx-10.7-x86_64-3.6/python/fastText/pybind/fasttext_pybind.o build/temp.macosx-10.7-x86_64-3.6/src/dictionary.o build/temp.macosx-10.7-x86_64-3.6/src/main.o build/temp.macosx-10.7-x86_64-3.6/src/fasttext.o build/temp.macosx-10.7-x86_64-3.6/src/utils.o build/temp.macosx-10.7-x86_64-3.6/src/model.o build/temp.macosx-10.7-x86_64-3.6/src/loss.o build/temp.macosx-10.7-x86_64-3.6/src/productquantizer.o build/temp.macosx-10.7-x86_64-3.6/src/args.o build/temp.macosx-10.7-x86_64-3.6/src/quantmatrix.o build/temp.macosx-10.7-x86_64-3.6/src/matrix.o build/temp.macosx-10.7-x86_64-3.6/src/meter.o build/temp.macosx-10.7-x86_64-3.6/src/vector.o build/temp.macosx-10.7-x86_64-3.6/src/densematrix.o -o build/lib.macosx-10.7-x86_64-3.6/fasttext_pybind.cpython-36m-darwin.so
        clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
        ld: library not found for -lstdc++
        clang: error: linker command failed with exit code 1 (use -v to see invocation)
        error: command 'g++' failed with exit status 1
    
        ----------------------------------------
    Command "/miniconda3/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/pip-req-build-i2z3pyel/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /private/var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/pip-record-yg0h6noh/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/tz/msp_r50s03s59q40s_gmhx600000gn/T/pip-req-build-i2z3pyel/
    

    And my environment is:

    Apple LLVM version 10.0.0 (clang-1000.10.44.4)
    Target: x86_64-apple-darwin18.2.0
    Thread model: posix
    InstalledDir: /Library/Developer/CommandLineTools/usr/bin
     "/Library/Developer/CommandLineTools/usr/bin/clang" -cc1 -triple x86_64-apple-macosx10.14.0 -Wdeprecated-objc-isa-usage -Werror=deprecated-objc-isa-usage -E -disable-free -disable-llvm-verifier -discard-value-names -main-file-name - -mrelocation-model pic -pic-level 2 -mthread-model posix -mdisable-fp-elim -fno-strict-return -masm-verbose -munwind-tables -target-cpu penryn -dwarf-column-info -debugger-tuning=lldb -target-linker-version 409.12 -v -resource-dir /Library/Developer/CommandLineTools/usr/lib/clang/10.0.0 -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk -I/usr/local/include -stdlib=libc++ -fdeprecated-macro -fdebug-compilation-dir /Users/ruanxiaoyi/Downloads/fastText-master -ferror-limit 19 -fmessage-length 204 -stack-protector 1 -fblocks -fencode-extended-block-signature -fobjc-runtime=macosx-10.14.0 -fcxx-exceptions -fexceptions -fmax-type-align=16 -fdiagnostics-show-option -fcolor-diagnostics -o - -x c++ -
    clang -cc1 version 10.0.0 (clang-1000.10.44.4) default target x86_64-apple-darwin18.2.0
    ignoring nonexistent directory "/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/include/c++/v1"
    ignoring nonexistent directory "/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/local/include"
    ignoring nonexistent directory "/Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/Library/Frameworks"
    #include "..." search starts here:
    #include <...> search starts here:
     /usr/local/include
     /Library/Developer/CommandLineTools/usr/include/c++/v1
     /Library/Developer/CommandLineTools/usr/lib/clang/10.0.0/include
     /Library/Developer/CommandLineTools/usr/include
     /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/usr/include
     /Library/Developer/CommandLineTools/SDKs/MacOSX10.14.sdk/System/Library/Frameworks (framework directory)
    

    Any suggestions for this problem?

    Python Build 
    opened by rxy1212 21
  • Any plan to support different weight for each class in loss function?

    Looking at the current code, it seems to me that the loss function is evaluated with the same weight for each class, which is OK for balanced data. For highly imbalanced data, are there any plans to support a different weight for each class in the loss function? I am thinking that, on the command line, one could do:

    fasttext -input XXX -output XXX -weight_class1 10 -weight_class2 1 -weight_class3 3 
    

    or simply

    fasttext -weight_balanced 
    

    if the weight is inversely proportional to the number of instances in that class?
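Pending such a flag, a common workaround is to rebalance the training file itself by oversampling minority-class lines, which approximates per-class loss weights. A minimal sketch (the `oversample` helper is hypothetical, not part of fastText):

```python
import random
from collections import defaultdict

def oversample(lines, seed=0):
    """Duplicate minority-class lines so every label appears roughly
    as often as the most frequent one (approximates class weighting)."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for line in lines:
        label = line.split()[0]  # fastText format: __label__X text...
        by_label[label].append(line)
    top = max(len(v) for v in by_label.values())
    out = []
    for label, rows in by_label.items():
        out.extend(rows)
        out.extend(rng.choices(rows, k=top - len(rows)))
    rng.shuffle(out)
    return out

lines = ["__label__a x"] * 10 + ["__label__b y"] * 2
balanced = oversample(lines)
print(sum(l.startswith("__label__b") for l in balanced))  # → 10
```

Writing the balanced lines back to a file and training on it reproduces the effect of `-weight_balanced` without any code changes to fastText.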

    opened by kuangchen 18
  • Interpreting Multilabel output

    So I loaded multilabel values for my targets. But when I use the predict_prob function, the output looks more like a conditional probability distribution than a multilabel output.

    I was assuming that each label would have an independent value between 0 and 1, but I am seeing that the probabilities across all labels add up to 1 instead.

    Can someone help me understand this output?
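This is expected with the default softmax loss: the scores form a single probability distribution over labels and must sum to 1. Independent per-label scores require a one-vs-all (sigmoid) loss, e.g. training with `-loss one-vs-all`. The numeric difference, as a plain-Python sketch:

```python
import math

def softmax(scores):
    # Competing probabilities: normalized so they always sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(scores):
    # Independent per-label probabilities: each in (0, 1), no shared budget.
    return [1 / (1 + math.exp(-s)) for s in scores]

scores = [2.0, 1.0, 0.5]  # raw model scores for three labels
print(round(sum(softmax(scores)), 6))          # sums to 1
print([round(p, 2) for p in sigmoid(scores)])  # need not sum to 1
```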

    opened by iymitchell 17
  • The memory error when loading the pre-trained model

    There is a memory error when I try to load the pre-trained model, e.g., model = fasttext.load_model('D:/download/wiki.en/wiki.en.bin').

    Since the size of this bin file is almost 9 GB and my memory size is only 4 GB, I am trying to find a memory-friendly method to load the model. Can anyone give me a clue?
    Thanks a lot!
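The .bin model has to be loaded whole, but if you only need vectors for a known set of words, the text .vec file distributed alongside it can be streamed line by line instead. A hedged sketch of that approach (the `.vec` format is: a header line with vocab size and dimension, then one word plus its vector per line; the helper below is illustrative, not a fastText API):

```python
import io

def load_some_vectors(stream, wanted):
    """Stream a fastText .vec file, keeping vectors only for `wanted` words."""
    vectors = {}
    _n_words, dim = map(int, stream.readline().split())  # header line
    for line in stream:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if word in wanted and len(values) == dim:
            vectors[word] = [float(x) for x in values]
    return vectors

# Tiny in-memory stand-in for wiki.en.vec:
demo = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")
vecs = load_some_vectors(demo, {"hello"})
print(vecs)
```

For real use, replace the `StringIO` demo with `open('wiki.en.vec', encoding='utf-8')`; memory then scales with the size of `wanted`, not with the 9 GB model.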

    opened by zhouchichun 16
  • Quantize error

    I have already trained model_1.bin with the supervised option, and when I try to quantize that model, I get the following error:

    /opt/fastText/fasttext quantize -input data.txt -output models/model_1 -verbose 3 -wordNgrams 3 -bucket 1000000 -minn 3 -maxn 6 -lr 0.010 -dim 100 -loss ns -thread 8 -epoch 10 -qnorm -retrain -cutoff 100000
    
    fasttext: src/vector.cc:71: void fasttext::Vector::addRow(const fasttext::Matrix&, int64_t): Assertion `i < A.m_' failed.
    Aborted (core dumped)
    

    Edit: If I don't use -cutoff, then I can run this without any error!

    opened by spate141 16
  • Loss - OVA model - Not predicting sigmoid output in Ubuntu 16.04

    Install Log:

    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/args.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/matrix.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/dictionary.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/loss.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/productquantizer.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/densematrix.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/quantmatrix.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/vector.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/model.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/utils.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/meter.cc
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/fasttext.cc
    src/fasttext.cc: In member function ‘void fasttext::FastText::quantize(const fasttext::Args&)’:
    src/fasttext.cc:323:16: warning: ‘std::vector fasttext::FastText::selectEmbeddings(int32_t) const’ is deprecated: selectEmbeddings is being deprecated. [-Wdeprecated-declarations]
      auto idx = selectEmbeddings(qargs.cutoff);
    src/fasttext.cc:293:22: note: declared here
      std::vector<int32_t> FastText::selectEmbeddings(int32_t cutoff) const {
    src/fasttext.cc:323:45: warning: ‘std::vector fasttext::FastText::selectEmbeddings(int32_t) const’ is deprecated: selectEmbeddings is being deprecated. [-Wdeprecated-declarations]
      auto idx = selectEmbeddings(qargs.cutoff);
    src/fasttext.cc:293:22: note: declared here
      std::vector<int32_t> FastText::selectEmbeddings(int32_t cutoff) const {
    src/fasttext.cc: In member function ‘void fasttext::FastText::lazyComputeWordVectors()’:
    src/fasttext.cc:551:5: warning: ‘void fasttext::FastText::precomputeWordVectors(fasttext::DenseMatrix&)’ is deprecated: precomputeWordVectors is being deprecated. [-Wdeprecated-declarations]
      precomputeWordVectors(*wordVectors_);
    src/fasttext.cc:534:6: note: declared here
      void FastText::precomputeWordVectors(DenseMatrix& wordVectors) {
    src/fasttext.cc:551:40: warning: ‘void fasttext::FastText::precomputeWordVectors(fasttext::DenseMatrix&)’ is deprecated: precomputeWordVectors is being deprecated. [-Wdeprecated-declarations]
      precomputeWordVectors(*wordVectors_);
    src/fasttext.cc:534:6: note: declared here
      void FastText::precomputeWordVectors(DenseMatrix& wordVectors) {
    c++ -pthread -std=c++0x -march=native -O3 -funroll-loops args.o matrix.o dictionary.o loss.o productquantizer.o densematrix.o quantmatrix.o vector.o model.o utils.o meter.o fasttext.o src/main.cc -o fasttext

    The output is not sigmoid. It's still the same as the softmax. Args: dim 100 ws 5 epoch 1 minCount 1 neg 5 wordNgrams 3 loss one-vs-all model sup bucket 1000000 minn 3 maxn 3 lrUpdateRate 100 t 0.0001

    bug 
    opened by giriannamalai 15
  • Binary model that was trained on Common crawl

    Hello! I enjoy using your library and pretrained vectors. I see that for vectors trained on Wikipedia you provide both the binary model and the pretrained vectors. However, for vectors trained on Common Crawl, you only provide the pretrained vectors. Is it possible for you to publish the binary model for them as well?

    Thanks, Alexander.

    opened by MrBoor 15
  • Running on PowerPC64LE (ppc64le)

    I am able to compile the stable (0.1.0) version of the code on a powerpc64le (IBM Minsky) without any errors/warnings. However, when I run on any dataset (e.g. the stackexchange cooking one) using just the defaults ./fasttext supervised -input ... -output ..., the program just hangs after displaying Reading ... words. I tried make debug as well. Same problem. (Details: make 4.1, Ubuntu 16.04.3 LTS.) Any ideas?

    opened by ironv 15
  • What's the status of this project?

    The last release was in April 2020, I see a lot of unsolved installation issues, and pre-built wheels are missing on https://pypi.org/project/fasttext/#files

    What is the future of this project, or is it just dead?

    opened by return42 1
  • Dependency errors

    Hi,

    We recently conducted a study to detect build dependency errors, focusing on missing dependencies and redundant dependencies. A missing dependency (MS) is a dependency that is not declared in the build script, and a redundant dependency (RD) is a dependency that is declared in the build script but not actually used. We have detected the following dependency errors in your public projects. Could you please help us check them? The data format is dependency --- target.

    MS
    0['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/densematrix.h---fasttext']
    1['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/vector.h---fasttext']
    2['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/model.h---fasttext']
    3['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/args.h---fasttext']
    4['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/meter.h---fasttext']
    5['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/fasttext.h---fasttext']
    6['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/real.h---fasttext']
    7['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/main.cc---fasttext']
    8['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/matrix.h---fasttext']
    9['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/utils.h---fasttext']
    10['/home/lv/WorkSpace/vmake_experiment/fastText-master/src/dictionary.h---fasttext']

    RD
    0['src/utils.h---productquantizer.o']
    1['src/utils.h---quantmatrix.o']
    2['src/fasttext.cc---fasttext']
    3['src/utils.h---vector.o']
    4['src/args.h---model.o']

    opened by Meiye-lj 0
  • Program running results are abnormal

    anaconda3/bin/python3.8

    import fasttext.util
    ft = fasttext.load_model('cc.zh.300.bin')
    sentence_w1 = ft.get_sentence_vector('色诫')
    print(sentence_w1)

    [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

    opened by Chu-J 0
  • Language names of Languages supported by Fasttext

    I am trying to find out the names of languages supported by Fasttext's LID tool, given these language codes listed here:

    af als am an ar arz as ast av az azb ba bar bcl be bg bh bn bo bpy br bs bxr ca cbk ce ceb ckb co cs cv cy da de diq dsb dty dv el eml en eo es et eu fa fi fr frr fy ga gd gl gn gom gu gv he hi hif hr hsb ht hu hy ia id ie ilo io is it ja jbo jv ka kk km kn ko krc ku kv kw ky la lb lez li lmo lo lrc lt lv mai mg mhr min mk ml mn mr mrj ms mt mwl my myv mzn nah nap nds ne new nl nn no oc or os pa pam pfl pl pms pnb ps pt qu rm ro ru rue sa sah sc scn sco sd sh si sk sl so sq sr su sv sw ta te tg th tk tl tr tt tyv ug uk ur uz vec vep vi vls vo wa war wuu xal xmf yi yo yue zh
    

    I tried to map the ISO codes to each language, but the set seems non-standard, mixing ISO 639-1 and ISO 639-3. Does anyone have a list of language names for these codes, or know how to find them?
    Wikipedia's list does not cover all of them either, so manual mapping did not help.
    Thanks!
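Since the code set mixes standards, one practical approach is a hand-built, partial lookup table that falls back gracefully for unmapped codes. A sketch (the entries and the `label_to_name` helper are illustrative, not part of fastText; verify each mapping before relying on it):

```python
# Partial, hand-checked mapping of fastText LID codes to language names.
# Codes without an ISO 639-1 entry fall back to ISO 639-3 / Wikipedia usage.
LID_LANG_NAMES = {
    "af": "Afrikaans",
    "ar": "Arabic",
    "de": "German",
    "en": "English",
    "fr": "French",
    "zh": "Chinese",
    "ceb": "Cebuano",  # ISO 639-3
    "war": "Waray",    # ISO 639-3
}

def label_to_name(label):
    """Map a fastText prediction label like '__label__en' to a name."""
    code = label.replace("__label__", "")
    return LID_LANG_NAMES.get(code, f"unknown ({code})")

print(label_to_name("__label__en"))   # → English
print(label_to_name("__label__xyz"))  # → unknown (xyz)
```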

    opened by AetherPrior 1
  • Lookup tables for language labels - Python module

    I'm using the prebuilt lid.176.ftz model to do simple language ID on short texts (160 chars or fewer) using the Python module.

    Is there a lookup table (dictionary) for the labels?

    eg

    {
        "en": "English", 
        "fr": "French",
         ...
    }
    

    Some of the labels fastText returns are for quite obscure languages, and I've had to trawl a lot of ISO 639 docs to establish what they refer to in order to build my own lookup table.

    Or have I simply missed something in the docs/API that tells me how to get these?

    opened by RedactedCode 0
Releases(v0.9.2)
  • v0.9.2(Apr 28, 2020)

    We are happy to announce the release of version 0.9.2.

    WebAssembly

    We are excited to release fastText bindings for WebAssembly. Classification tasks are widely used in web applications, and we believe that giving access to the complete fastText API from the browser will help our community build useful tools. See our documentation to learn more.

    Autotune: automatic hyperparameter optimization

    Finding the best hyperparameters is crucial for building efficient models. However, searching for the best hyperparameters manually is difficult. This release includes the autotune feature, which automatically finds the best hyperparameters for your dataset. You can find more information on how to use it here.
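Conceptually, autotune is a search loop that scores candidate hyperparameters against a validation file (see the autotune documentation for the exact flags and behavior). A toy sketch of that idea with a stand-in objective (the real objective would be the validation F1 of a trained model; everything below is illustrative, not the actual implementation):

```python
import itertools

SEARCH_SPACE = {
    "lr": [0.05, 0.1, 0.5, 1.0],
    "epoch": [5, 10, 25],
    "wordNgrams": [1, 2, 3],
}

def toy_score(params):
    # Stand-in for "train a model and measure validation F1";
    # this toy objective peaks at lr=0.1, epoch=10.
    return 1.0 - abs(params["lr"] - 0.1) - 0.01 * abs(params["epoch"] - 10)

def search():
    best, best_score = None, float("-inf")
    for lr, epoch, ng in itertools.product(*SEARCH_SPACE.values()):
        params = {"lr": lr, "epoch": epoch, "wordNgrams": ng}
        score = toy_score(params)
        if score > best_score:
            best, best_score = params, score
    return best

best = search()
print(best["lr"], best["epoch"])  # → 0.1 10
```

The real feature additionally budgets the search by wall-clock time and model size rather than exhausting a fixed grid.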

    Python

    fastText loves Python. In this release, we have:

    • several bug fixes for prediction functions
    • nearest neighbors and analogies for Python
    • a memory leak fix
    • website tutorials with Python examples

    The autotune feature is fully integrated with our Python API. This allows us to have a more stable autotune optimization loop from Python and to synchronize the best hyper-parameters with the _FastText model object.

    Pre-trained models tool

    We release two helper scripts:

    They can also be used directly from our Python API.

    More metrics

    When you test a trained model, you can now have more detailed results for the precision/recall metrics of a specific label or all labels.
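Per-label precision and recall reduce to counting true positives, false positives, and false negatives for that label. A plain-Python sketch of the computation (the data and helper name are hypothetical):

```python
def per_label_metrics(gold, predicted, label):
    """Precision and recall for one label, from parallel lists of labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, predicted) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, predicted) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

gold = ["__label__pos", "__label__pos", "__label__neg", "__label__neg"]
pred = ["__label__pos", "__label__neg", "__label__neg", "__label__pos"]
prec, rec = per_label_metrics(gold, pred, "__label__pos")
print(prec, rec)  # → 0.5 0.5
```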

    Paper source code

    This release contains the source code of the unsupervised multilingual alignment paper.

    Community feedback and contributions

    We want to thank our community for giving us feedback on Facebook and on GitHub.

    Source code(tar.gz)
    Source code(zip)
  • v0.9.1(Jul 4, 2019)

    We are happy to announce the release of version 0.9.1.

    New release of python module

    The main goal of this release is to merge two existing Python modules: the official fastText module, which was available on our GitHub repository, and the unofficial fasttext module, which was available on pypi.org.

    You can find an overview of the new API here, and more insight in our blog post.

    Refactoring

    This version includes a massive rewrite of internal classes. Training and testing are now split into three different classes: Model, which takes care of the computational aspect; Loss, which handles the loss and applies gradients to the output matrix; and State, which is responsible for holding the model's state inside each thread.

    This makes the code more straightforward to read and also gives a smaller memory footprint, because the data needed for loss computation is now held only once, instead of once per thread as before.
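A hedged Python illustration of that decomposition (class and method names below are illustrative, not the actual C++ API): Model and Loss are shared and hold no per-thread data, while each training thread owns a State with its mutable accumulators.

```python
class Loss:
    # Shared: loss math and output-matrix gradient logic live here once.
    def forward(self, scores, target):
        return sum((s - (i == target)) ** 2 for i, s in enumerate(scores))

class Model:
    # Shared: the computational core; it mutates only the State it is given.
    def __init__(self, loss):
        self.loss = loss

    def update(self, state, scores, target):
        state.loss_total += self.loss.forward(scores, target)
        state.n_examples += 1

class State:
    # One per training thread: mutable accumulators, nothing shared.
    def __init__(self):
        self.loss_total = 0.0
        self.n_examples = 0

model = Model(Loss())
states = [State() for _ in range(4)]  # e.g. one per training thread
model.update(states[0], [0.2, 0.8], 1)
print(states[0].n_examples, states[1].n_examples)  # → 1 0
```

Only the per-thread State is duplicated, which is why the rewrite shrinks the memory footprint.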

    Misc

    • Compilation issues fix for recent versions of Mac OS X.
    • Better unicode handling:
      • on_unicode_error argument that helps to handle unicode issues one can face with some datasets
      • bug fix related to different behaviour of pybind11's py::str class between python2 and python3
    • script for unsupervised alignment
    • public file hosting changed from aws to fbaipublicfiles
    • we added a Code of Conduct file.

    Thank you !

    As always, we want to thank you for your help and your precious feedback, which helps make this project better.

    Source code(tar.gz)
    Source code(zip)
  • v0.2.0(Dec 19, 2018)

    We are happy to announce the change of the license from BSD+patents to MIT and the release of fastText 0.2.0.

    The main purpose of this release is to set up a beta C++ API for the FastText class. The class now behaves as a computational library: we moved the display logic and some usage error handling outside of it (mainly to main.cc and fasttext_pybind.cc). It is still compatible with older versions of the class, but some methods are now marked as deprecated and will probably be removed in the next release.

    In this respect, we also introduce official support for Python. The Python binding of fastText is a client of the FastText class.

    Here is a short summary of the 104 commits since 0.1.0:

    New :

    • Introduction of the “OneVsAll” loss function for multi-label classification, which corresponds to the sum of binary cross-entropy computed independently for each label. This new loss can be used with the -loss ova or -loss one-vs-all command line option ( 8850c51b972ed68642a15c17fbcd4dd58766291d ).
    • Computation of the precision and recall metrics for each label ( be1e597cb67c069ba9940ff241d9aad38ccd37da ).
    • Removed printing functions from FastText class ( 256032b87522cdebc4850c99b204b81b3255cb2a ).
    • Better default for number of threads ( 501b9b1e4543fd2de55e4a621a9924ce7d2b5b17 ).
    • Python support ( f10ec1faea1605d40fdb79fe472cc2204f3d584c ).
    • More tests for circleci/python ( eb9703a4a7ed0f7559d6f341cc8e5d166d5e4d88, 97fcde80ea107ca52d3d778a083564619175039c, 1de0624bfaff02d91fd265f331c07a4a0a7bb857 ).
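The one-vs-all loss introduced above is the sum of independent binary cross-entropies, one per label, with a sigmoid applied to each raw score. In plain Python (scores and targets below are made up for illustration):

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def one_vs_all_loss(scores, targets):
    """Sum of binary cross-entropies, one independent term per label.
    `targets` is a 0/1 vector marking which labels apply to the example."""
    loss = 0.0
    for s, t in zip(scores, targets):
        p = sigmoid(s)
        loss += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return loss

# Two true labels out of three; the scores agree with the targets,
# so the loss is small. Wrong or uncertain scores raise it.
loss = one_vs_all_loss([3.0, 2.5, -3.0], [1, 1, 0])
print(round(loss, 3))
```

Because each term is independent, any subset of labels can be "on" at once, which is what makes this loss suitable for multi-label classification.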

    Bug fixes :

    • Normalize buffer vector in analogy queries.
    • Typo fixes and clarifications on website.
    • Improvements on python install issues : setup.py OS X compiler flags, pybind11 include.
    • Fix: getSubwords for EOS.
    • Fix: ETA time.
    • Fix: division by 0 in word analogy evaluation.
    • Fix for the infinite loop on ARM cpu.

    Operations :

    • We released more pre-trained vectors (92bc7d230959e2a94125fbe7d3b05257effb1111, 5bf8b4c615b6308d76ad39a5a50fa6c4174113ea ).

    Worth noting :

    • We added circleci build badges to the README.md
    • We modified the style to be in compliance with Facebook C++ style.
    • We added coverage option for Makefile and setup.py in order to build for measuring the coverage.

    Thank you fastText community!

    We want to thank you all for being a part of this community and sharing your passion with us. Some of these improvements would not have been possible without your help.

    Source code(tar.gz)
    Source code(zip)
Owner
Facebook Research