Topic Modelling for Humans

Overview

gensim – Topic Modelling in Python

Gensim is a Python library for topic modelling, document indexing and similarity retrieval with large corpora. The target audience is the natural language processing (NLP) and information retrieval (IR) community.

Features

  • All algorithms are memory-independent w.r.t. the corpus size (can process input larger than RAM, streamed, out-of-core),
  • Intuitive interfaces
    • easy to plug in your own input corpus/datastream (trivial streaming API)
    • easy to extend with other Vector Space algorithms (trivial transformation API)
  • Efficient multicore implementations of popular algorithms, such as online Latent Semantic Analysis (LSA/LSI/SVD), Latent Dirichlet Allocation (LDA), Random Projections (RP), Hierarchical Dirichlet Process (HDP) or word2vec deep learning.
  • Distributed computing: can run Latent Semantic Analysis and Latent Dirichlet Allocation on a cluster of computers.
  • Extensive documentation and Jupyter Notebook tutorials.

If this feature list left you scratching your head, you can first read more about the Vector Space Model and unsupervised document analysis on Wikipedia.

Installation

This software depends on NumPy and SciPy, two Python packages for scientific computing. You must have them installed prior to installing gensim.

It is also recommended you install a fast BLAS library before installing NumPy. This is optional, but using an optimized BLAS such as ATLAS or OpenBLAS is known to improve performance by as much as an order of magnitude. On OS X, NumPy picks up the BLAS that comes with it automatically, so you don’t need to do anything special.

Install the latest version of gensim:

    pip install --upgrade gensim

Or, if you have instead downloaded and unzipped the source tar.gz package:

    python setup.py install

For alternative modes of installation, see the documentation.

Gensim is being continuously tested under Python 3.6, 3.7 and 3.8. Support for Python 2.7 was dropped in gensim 4.0.0 – install gensim 3.8.3 if you must use Python 2.7.

How come gensim is so fast and memory efficient? Isn’t it pure Python, and isn’t Python slow and greedy?

Many scientific algorithms can be expressed in terms of large matrix operations (see the BLAS note above). Gensim taps into these low-level BLAS libraries, by means of its dependency on NumPy. So while gensim-the-top-level-code is pure Python, it actually executes highly optimized Fortran/C under the hood, including multithreading (if your BLAS is so configured).

Memory-wise, gensim makes heavy use of Python’s built-in generators and iterators for streamed data processing. Memory efficiency was one of gensim’s design goals, and is a central feature of gensim, rather than something bolted on as an afterthought.
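To illustrate the streaming idea, here is a minimal sketch in plain Python. It does not use gensim's own classes: the `StreamedCorpus` helper, the `token2id` mapping and the one-document-per-line file layout are assumptions made for this example only. The point is that the corpus object is an iterable that yields one sparse bag-of-words document at a time, so memory use does not grow with the corpus size.

```python
import os
import tempfile
from collections import Counter


class StreamedCorpus:
    """Yield one bag-of-words document at a time from a text file.

    Hypothetical helper for illustration: one document per line,
    tokens separated by whitespace.
    """

    def __init__(self, path, token2id):
        self.path = path
        self.token2id = token2id

    def __iter__(self):
        # Only the current line is held in memory, never the whole file.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                counts = Counter(
                    self.token2id[token]
                    for token in line.lower().split()
                    if token in self.token2id
                )
                # Sparse document: sorted (token_id, count) pairs.
                yield sorted(counts.items())


# Usage: write a tiny corpus to disk, then stream it back.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("human computer interaction\n")
    tmp.write("graph of trees and graph minors\n")
    corpus_path = tmp.name

token2id = {"human": 0, "computer": 1, "graph": 2, "trees": 3}
streamed_docs = list(StreamedCorpus(corpus_path, token2id))
os.remove(corpus_path)
```

Because the class implements `__iter__` rather than returning a list, it can be iterated over several times, which matters for algorithms that make more than one pass over the data.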

Documentation

Support

Ask open-ended or research questions on the Gensim Mailing List.

Raise bugs on GitHub but make sure you follow the issue template. Issues that are not bugs or fail to follow the issue template will be closed without inspection.


Adopters

Company | Industry | Use of Gensim
RARE Technologies | ML & NLP consulting | Creators of Gensim – this is us!
Amazon | Retail | Document similarity.
National Institutes of Health | Health | Processing grants and publications with word2vec.
Cisco Security | Security | Large-scale fraud detection.
Mindseye | Legal | Similarities in legal documents.
Channel 4 | Media | Recommendation engine.
Talentpair | HR | Candidate matching in high-touch recruiting.
Juju | HR | Provide non-obvious related job suggestions.
Tailwind | Media | Post interesting and relevant content to Pinterest.
Issuu | Media | Gensim's LDA module lies at the very core of the analysis we perform on each uploaded publication to figure out what it's all about.
Search Metrics | Content Marketing | Gensim word2vec used for entity disambiguation in Search Engine Optimisation.
12K Research | Media | Document similarity analysis on media articles.
Stillwater Supercomputing | Hardware | Document comprehension and association with word2vec.
SiteGround | Web hosting | An ensemble search engine which uses different embeddings models and similarities, including word2vec, WMD, and LDA.
Capital One | Finance | Topic modeling for customer complaints exploration.

Citing gensim

When citing gensim in academic papers and theses, please use this BibTeX entry:

@inproceedings{rehurek_lrec,
      title = {{Software Framework for Topic Modelling with Large Corpora}},
      author = {Radim {\v R}eh{\r u}{\v r}ek and Petr Sojka},
      booktitle = {{Proceedings of the LREC 2010 Workshop on New
           Challenges for NLP Frameworks}},
      pages = {45--50},
      year = 2010,
      month = May,
      day = 22,
      publisher = {ELRA},
      address = {Valletta, Malta},
      note={\url{http://is.muni.cz/publication/884893/en}},
      language={English}
}
Comments
  • NMF metrics and wikipedia

    Add clean up and fixes on top of #2361:

    opened by anotherbugmaster 113
  • File-based fast training for Any2Vec models

    Tutorial explaining the whats & hows: Jupyter notebook

    note: all preliminary discussions are in https://github.com/RaRe-Technologies/gensim/pull/2048

    This PR summarizes all my work during GSoC 2018. To better understand what's going on, follow these links:

    • My proposal: https://persiyanov.github.io/jekyll/update/2018/04/24/accepted-to-gsoc-2018.html
    • First benchmarks: https://persiyanov.github.io/jekyll/update/2018/05/28/gsoc-first-weeks.html
    • Last blog post about the almost final solution: https://persiyanov.github.io/2018/07/06/gsoc-midreport.html
    • Links to all benchmarks: https://gist.github.com/persiyanov/84b806233947e0069a243433579b35db
    • Previous PR about vocab building : https://github.com/RaRe-Technologies/gensim/pull/2078 (reverted these changes in current PR because of API design issues)
    • Previous PR about multistream training (all useful changes in this PR): https://github.com/RaRe-Technologies/gensim/pull/2048

    Summary

    This pull request proposes a new corpus_file argument for the Word2Vec, FastText and Doc2Vec models. Use corpus_file instead of the standard sentences argument if your preprocessed dataset is on disk and you want a significant speedup during model training.

    In our benchmarks, training Word2Vec on the English Wikipedia dump is 370% faster with corpus_file than with sentences (see the attached Jupyter notebook with the code).

    [Figure: word2vec_file_scaling, the chart for Word2Vec]

    Usage

    The usage is really simple. I'll provide examples for Word2Vec while the usage for FastText and Doc2Vec is identical. The corpus_file argument is supported for:

    Constructor

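    from gensim.models import Word2Vec
    from gensim.utils import save_as_line_sentence

    # Standard way: train from an iterable of tokenized sentences
    model = Word2Vec(sentences=my_corpus, <...other arguments...>)

    # New way: save your own corpus to disk in LineSentence format first...
    save_as_line_sentence(my_corpus, 'my_corpus_saved.txt')
    # ...then train directly from that file
    model = Word2Vec(corpus_file='my_corpus_saved.txt', <...other arguments...>)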
    
    

    build_vocab

    # Create the model without training
    model = Word2Vec(<...other arguments...>)
    
    # Standard way
    model.build_vocab(sentences=my_corpus, ...)
    
    # New way
    model.build_vocab(corpus_file='my_corpus_saved.txt', ...)
    

    train

    # Create the model without training
    model = Word2Vec(<...other arguments...>)
    
    # Build vocab (with `sentences` or `corpus_file` way, choose what you like)
    model.build_vocab(corpus_file='my_corpus_saved.txt')
    
    # Train the model (old way)
    model.train(sentences=my_corpus, total_examples=model.corpus_count, ...)
    
    # Train the model (new way)
    model.train(corpus_file='my_corpus_saved.txt', total_words=model.corpus_total_words, ...)
    

    That's it! Everything else remains the same as before.

    Details

    First, let me describe the standard approach to training *2Vec models:

    1. The user provides an input data stream (a Python iterable).
    2. One job_producer Python thread is created. This thread reads data from the input stream and pushes batches into a Python queue.Queue (job_queue).
    3. Several worker threads pull batches from job_queue and perform model updates. Batches are Python lists of lists of tokens. They are first translated into C structures, and then the model update is performed without the GIL.

    This approach lets model updates scale linearly, but producing batches (everything from reading the input to filling the C structures from Python objects) is the bottleneck in this pipeline.
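The producer/worker pipeline described above can be sketched with the standard library alone. This is a simplified illustration, not gensim's training code: the names `train`, `worker` and `SENTINEL` are assumptions, and the "model update" is replaced by a stand-in that merely counts tokens. In gensim the update runs in compiled code without the GIL, which is what lets the workers run in parallel.

```python
import queue
import threading

SENTINEL = None  # tells a worker that no more batches will arrive


def job_producer(stream, job_queue, batch_size, num_workers):
    """Read sentences from the input stream and push batches to the queue."""
    batch = []
    for sentence in stream:
        batch.append(sentence)
        if len(batch) == batch_size:
            job_queue.put(batch)
            batch = []
    if batch:
        job_queue.put(batch)
    # One sentinel per worker so that every worker terminates.
    for _ in range(num_workers):
        job_queue.put(SENTINEL)


def worker(job_queue, results, lock):
    """Pull batches and run a stand-in "model update" (here: count tokens)."""
    while True:
        batch = job_queue.get()
        if batch is SENTINEL:
            break
        processed = sum(len(sentence) for sentence in batch)
        with lock:
            results["tokens"] += processed


def train(stream, num_workers=3, batch_size=2):
    # A bounded queue keeps the producer from running far ahead of the workers.
    job_queue = queue.Queue(maxsize=2 * num_workers)
    results = {"tokens": 0}
    lock = threading.Lock()
    workers = [
        threading.Thread(target=worker, args=(job_queue, results, lock))
        for _ in range(num_workers)
    ]
    for thread in workers:
        thread.start()
    producer = threading.Thread(
        target=job_producer, args=(stream, job_queue, batch_size, num_workers)
    )
    producer.start()
    producer.join()
    for thread in workers:
        thread.join()
    return results["tokens"]


sentences = [["a", "b"], ["c"], ["d", "e", "f"], ["g"], ["h", "i"]]
total_tokens = train(iter(sentences))
```

Note that there is a single producer thread: however many workers are added, all batches still pass through one Python-level reader, which is the bottleneck the text refers to.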

    Clearly, we can't optimize batch generation for an arbitrary Python stream with custom user logic. Instead, we optimized only the case where the data is stored on disk in the gensim.models.word2vec.LineSentence format (one sentence per line, words separated by whitespace).

    This restriction lets us read the data directly at the C++ level without the GIL and perform model updates immediately afterwards. As a result, training scales linearly.
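For readers unfamiliar with the LineSentence layout, the following sketch shows what such a file looks like on disk. It uses only the standard library; `write_line_sentences` and `read_line_sentences` are hypothetical helpers written for this illustration (gensim's own writer is the `save_as_line_sentence` function mentioned in the usage section).

```python
import os
import tempfile


def write_line_sentences(sentences, path):
    """Write one sentence per line, tokens separated by a single space."""
    with open(path, "w", encoding="utf-8") as handle:
        for tokens in sentences:
            handle.write(" ".join(tokens) + "\n")


def read_line_sentences(path):
    """Stream the sentences back as lists of tokens."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield line.split()


corpus = [["the", "quick", "brown", "fox"], ["jumps", "over", "the", "lazy", "dog"]]
path = os.path.join(tempfile.mkdtemp(), "my_corpus_saved.txt")
write_line_sentences(corpus, path)
round_trip = list(read_line_sentences(path))
```

Because every sentence is a plain line of whitespace-separated tokens, a reader needs no Python-level parsing logic, which is what makes it feasible to move the reading step into compiled code.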

    opened by persiyanov 102
  • Use FastSS for fast kNN over Levenshtein distance

    Introduction

    The LevenshteinSimilarityIndex term similarity index in the termsim.levenshtein module implements the lexical text similarity search technique described by Charlet and Damnati (2017) in their paper describing their winning system at SemEval-2017 Task 3: Community Question Answering.

    We are showing a related semantic similarity search technique using the WordEmbeddingSimilarityIndex term similarity index in our Soft Cosine Similarity autoexample, which enjoys some popularity among our users. We would like to also advertise LevenshteinSimilarityIndex, which provides a different but equally useful kind of search. However, the current implementation uses brute-force kNN search over the Levenshtein distance to produce a term similarity matrix, which is so slow that it can take years to produce a matrix even for medium-sized corpora such as the English Wikipedia.

    Following the discussion in https://github.com/RaRe-Technologies/gensim/issues/2541, @piskvorky and I implemented indexing using the FastSS algorithm for kNN search over the Levenshtein distance in hopes that this would speed LevenshteinSimilarityIndex up by at least three orders of magnitude (1,000×), so that it can compete with WordEmbeddingSimilarityIndex. As an added bonus, using the FastSS algorithm allows us to remove our external dependence on the python-Levenshtein library.
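To give an intuition for why FastSS is so much faster than brute force, here is a minimal sketch of the idea in pure Python. It is not the code from this pull request: the `FastSSIndex` class, its `max_dist` parameter and the helper functions are assumptions made for illustration. The index stores every word under all strings obtained by deleting up to `max_dist` characters; a query generates the same deletion variants, collects the words that share a variant, and verifies only those candidates with the exact Levenshtein distance.

```python
from collections import defaultdict
from itertools import combinations


def deletion_variants(word, max_dist):
    """Return the word plus every string made by deleting up to max_dist characters."""
    variants = {word}
    for num_deletions in range(1, min(max_dist, len(word)) + 1):
        for positions in combinations(range(len(word)), num_deletions):
            skipped = set(positions)
            variants.add("".join(ch for i, ch in enumerate(word) if i not in skipped))
    return variants


def levenshtein(a, b):
    """Classic dynamic-programming edit distance, used to verify candidates."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


class FastSSIndex:
    """Hypothetical deletion-neighbourhood index over a fixed vocabulary."""

    def __init__(self, words, max_dist=2):
        self.max_dist = max_dist
        self.index = defaultdict(set)
        for word in words:
            for variant in deletion_variants(word, max_dist):
                self.index[variant].add(word)

    def query(self, word):
        """Return (distance, term) pairs within max_dist, nearest first."""
        candidates = set()
        for variant in deletion_variants(word, self.max_dist):
            candidates |= self.index.get(variant, set())
        hits = [(levenshtein(word, candidate), candidate) for candidate in candidates]
        return sorted((dist, term) for dist, term in hits if dist <= self.max_dist)


index = FastSSIndex(["hello", "help", "hell", "world", "word"], max_dist=2)
neighbours = index.query("helo")
```

The expensive distance computation runs only on the handful of candidates that share a deletion variant with the query, instead of on the whole vocabulary. Taking the first k entries of the sorted result gives a kNN search restricted to terms within `max_dist` edits.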

    Speed comparison

    Below, I show a before-and-after speed comparison of LevenshteinSimilarityIndex against the standard WordEmbeddingSimilarityIndex shown in the Soft Cosine Similarity autoexample. We measure how many kNN searches (k = 100) per second a term similarity index can perform. To produce my dictionary (253,854 terms) and word embeddings, I use the text8 corpus (100 MB). I am running the code on a Dell Inspiron 15 7559.

    Before the change

    We can see that even with our tiny corpus, the LevenshteinSimilarityIndex would take over three days to find the 100 nearest neighbors for all 253,854 terms in our vocabulary (the progress bar below estimates about 80 hours). Contrast this with the WordEmbeddingSimilarityIndex, which finishes in about four minutes even though we are using exact nearest-neighbor search and could get a further speed-up by using e.g. the Annoy index.

    $ pip install gensim==4.0.1 python-Levenshtein
    $ wget http://mattmahoney.net/dc/text8.zip
    $ unzip text8.zip
    $ python
    >>> from gensim.corpora import Dictionary
    >>> from gensim.models.word2vec import LineSentence, Word2Vec
    >>> from gensim.similarities import (
    ...     SparseTermSimilarityMatrix,
    ...     WordEmbeddingSimilarityIndex,
    ...     LevenshteinSimilarityIndex,
    ... )
    >>> 
    >>> corpus = LineSentence('text8')
    >>> dictionary = Dictionary(corpus)
    >>> w2v_model = Word2Vec(sentences=corpus)
    >>> embedding_index = WordEmbeddingSimilarityIndex(w2v_model.wv)
    >>> levenshtein_index = LevenshteinSimilarityIndex(dictionary)
    >>>
    >>> SparseTermSimilarityMatrix(embedding_index, dictionary)
    100%|███████████████████████████████| 253854/253854 [04:04<00:00, 1037.97it/s]
    >>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
      0%|                               | 124/253854 [02:24<80:18:05,  1.14s/it]
    

    After the change

    With the FastSS algorithm, the LevenshteinSimilarityIndex receives a 1,500× speed-up and is now not merely as fast as the WordEmbeddingSimilarityIndex, but about 1.5× faster. Both term similarity indexes now find the 100 nearest neighbors for all 253,854 terms in our vocabulary in under four minutes.

    $ pip install lexpy git+https://github.com/witiko/[email protected]
    $ python
    >>> from gensim.corpora import Dictionary
    >>> from gensim.models.word2vec import LineSentence, Word2Vec
    >>> from gensim.similarities import (
    ...     SparseTermSimilarityMatrix,
    ...     WordEmbeddingSimilarityIndex,
    ...     LevenshteinSimilarityIndex,
    ... )
    >>> 
    >>> corpus = LineSentence('text8')
    >>> dictionary = Dictionary(corpus)
    >>> w2v_model = Word2Vec(sentences=corpus)
    >>> embedding_index = WordEmbeddingSimilarityIndex(w2v_model.wv)
    >>> levenshtein_index = LevenshteinSimilarityIndex(dictionary)
    >>>
    >>> SparseTermSimilarityMatrix(embedding_index, dictionary)
    100%|███████████████████████████████| 253854/253854 [03:57<00:00, 1070.14it/s]
    >>> SparseTermSimilarityMatrix(levenshtein_index, dictionary)
    100%|███████████████████████████████| 253854/253854 [02:34<00:00, 1639.23it/s]
    

    Conclusion

    Using the FastSS algorithm for kNN search over the Levenshtein distance, we managed to increase the speed of the LevenshteinSimilarityIndex term similarity index by three orders of magnitude (1,500×) on the text8 corpus. As an added bonus, using the FastSS algorithm allowed us to remove our external dependence on the python-Levenshtein library. Closes #2541.

    opened by Witiko 70
  • numpy 1.19.2 incompatible with gensim 4.1.0

    Problem description

    When importing gensim, I get the following error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/mbertoni/software/miniconda3/envs/test/lib/python3.7/site-packages/gensim/__init__.py", line 11, in <module>
        from gensim import parsing, corpora, matutils, interfaces, models, similarities, utils  # noqa:F401
      File "/home/mbertoni/software/miniconda3/envs/test/lib/python3.7/site-packages/gensim/corpora/__init__.py", line 6, in <module>
        from .indexedcorpus import IndexedCorpus  # noqa:F401 must appear before the other classes
      File "/home/mbertoni/software/miniconda3/envs/test/lib/python3.7/site-packages/gensim/corpora/indexedcorpus.py", line 14, in <module>
        from gensim import interfaces, utils
      File "/home/mbertoni/software/miniconda3/envs/test/lib/python3.7/site-packages/gensim/interfaces.py", line 19, in <module>
        from gensim import utils, matutils
      File "/home/mbertoni/software/miniconda3/envs/test/lib/python3.7/site-packages/gensim/matutils.py", line 1024, in <module>
        from gensim._matutils import logsumexp, mean_absolute_difference, dirichlet_expectation
      File "gensim/_matutils.pyx", line 1, in init gensim._matutils
    ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
    

    Steps/code/corpus to reproduce

    conda create --name=test python=3.7 -y
    conda activate test
    conda install -y numpy==1.19.2
    pip install gensim
    

    Versions

    Linux-5.11.0-25-generic-x86_64-with-debian-bullseye-sid
    Python 3.7.11 (default, Jul 27 2021, 14:32:16) [GCC 7.5.0]
    Bits 64
    NumPy 1.19.2
    SciPy 1.7.1

    opened by martinobertoni 58
  • Doc2vec not parallelizing

    Doc2vec does not use all my cores despite my setting workers=8 when I instantiate it.

    My install passes the assert below:

    assert gensim.models.doc2vec.FAST_VERSION > -1
    

    Do I have to do something else?

    bug difficulty hard 
    opened by fccoelho 55
  • LSI worker getting "stuck"

    Description

    When building an LsiModel in distributed mode, one of the workers gets "stuck" while orthonormalizing the action matrix. This stalls the whole process of building the model, as the dispatcher hangs on "reached the end of input; now waiting for all remaining jobs to finish".

    Steps/Code/Corpus to Reproduce

    lsi_model = LsiModel(
            id2word=bow,
            num_topics=300,
            chunksize=5000,
            distributed=True
        )
    lsi_model.add_documents(corpus)
    

    The LSI dispatcher and workers are initialized in a separate bash script. I have tried with the number of LSI workers set to 16 and 8.

    Gensim version: 3.6.0
    Pyro4 version: 4.63

    Expected Results

    Process should run to completion

    Actual Results

    Main script output:

    [2019-01-06 04:04:09,862] [23465] [gensim.models.lsimodel] [INFO] {add_documents:462} updating model with new documents
    [2019-01-06 04:04:09,862] [23465] [gensim.models.lsimodel] [INFO] {add_documents:485} initializing 8 workers
    [2019-01-06 04:05:12,131] [23465] [gensim.models.lsimodel] [INFO] {add_documents:488} preparing a new chunk of documents
    [2019-01-06 04:05:12,135] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:492} converting corpus to csc format
    [2019-01-06 04:05:12,497] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:499} creating job #0
    [2019-01-06 04:05:12,541] [23465] [gensim.models.lsimodel] [INFO] {add_documents:503} dispatched documents up to #5000
    [2019-01-06 04:06:46,191] [23465] [gensim.models.lsimodel] [INFO] {add_documents:488} preparing a new chunk of documents
    [2019-01-06 04:06:46,200] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:492} converting corpus to csc format
    [2019-01-06 04:06:46,618] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:499} creating job #1
    [2019-01-06 04:06:46,682] [23465] [gensim.models.lsimodel] [INFO] {add_documents:503} dispatched documents up to #10000
    [2019-01-06 04:08:11,839] [23465] [gensim.models.lsimodel] [INFO] {add_documents:488} preparing a new chunk of documents
    [2019-01-06 04:08:11,843] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:492} converting corpus to csc format
    [2019-01-06 04:08:12,561] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:499} creating job #2
    [2019-01-06 04:08:12,786] [23465] [gensim.models.lsimodel] [INFO] {add_documents:503} dispatched documents up to #15000
    [2019-01-06 04:09:48,217] [23465] [gensim.models.lsimodel] [INFO] {add_documents:488} preparing a new chunk of documents
    [2019-01-06 04:09:48,230] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:492} converting corpus to csc format
    [2019-01-06 04:09:48,700] [23465] [gensim.models.lsimodel] [DEBUG] {add_documents:499} creating job #3
    [2019-01-06 04:09:48,786] [23465] [gensim.models.lsimodel] [INFO] {add_documents:503} dispatched documents up to #20000
    [2019-01-06 04:09:48,938] [23465] [gensim.models.lsimodel] [INFO] {add_documents:518} reached the end of input; now waiting for all remaining jobs to finish
    

    Output of LSI worker that is stuck:

    2019-01-06 04:04:09,867 - INFO - resetting worker #1
    2019-01-06 04:06:46,705 - INFO - worker #1 received job #208
    2019-01-06 04:06:46,705 - INFO - updating model with new documents
    2019-01-06 04:06:46,705 - INFO - using 100 extra samples and 2 power iterations
    2019-01-06 04:06:46,705 - INFO - 1st phase: constructing (500000, 400) action matrix
    2019-01-06 04:06:48,402 - INFO - orthonormalizing (500000, 400) action matrix
    

    CPU for that LSI worker has been ~100% for >24 hours.

    Versions

    Linux-4.10.0-38-generic-x86_64-with-Ubuntu-16.04-xenial
    Python 3.5.2 (default, Nov 23 2017, 16:37:01) [GCC 5.4.0 20160609]
    NumPy 1.15.2
    SciPy 1.1.0
    gensim 3.6.0
    FAST_VERSION 1

    need info 
    opened by robguinness 53
  • Loading fastText binary output to gensim like word2vec

    Facebook's recently open-sourced fastText (https://github.com/facebookresearch/fastText) improves on the word2vec SkipGram model. Its output uses a similar format of word-to-vector key-value pairs, and the similarity calculation is about the same too, but its binary output format differs from the binary format of the C version of word2vec. Do we want to support loading fastText model output in gensim? Thanks.

    need info 
    opened by phunterlau 49
  • Add nmslib indexer

    Hi, I added an nmslib indexer.

    Some research shows nmslib is better than the Annoy indexer:
    https://erikbern.com/2018/06/17/new-approximate-nearest-neighbor-benchmarks.html
    https://www.benfrederickson.com/approximate-nearest-neighbours-for-recommender-systems/

    This is my first contribution to gensim. If I missed something, please let me know.

    feature 
    opened by masa3141 47
  • Distributed Representations of Sentences and Documents

    A reasonable approximation of the method described in the paper Distributed Representations of Sentences and Documents (http://cs.stanford.edu/~quocle/paragraph_vector.pdf).

    Python isn't my first language, so I don't pretend that I did everything in the most "pythonic" way here, but I pulled out the portions of the word2vec code that needed to be modified into their own methods, then added another class extending word2vec which just modifies those few functions.

    I also don't really know what to do in terms of refactoring cython code to limit code duplication, so doc2vec has its own set of cython functions which are completely independent of the word2vec ones.

    Hope that helps. Tim

    opened by temerick 47
  • Easy import of GloVe vectors using Gensim

    A word2vec embeddings file in text format starts with a header line giving the number of lines (that is, tokens) and the number of dimensions in the file. This allows gensim to allocate memory accordingly for querying the model. Larger dimensions mean more memory is held. GloVe embedding files lack this header, so this line has to be inserted into the GloVe embeddings file.
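The header insertion described above can be sketched with the standard library alone. The function name `add_word2vec_header` and the file names are hypothetical, and the sketch assumes the usual GloVe text layout of one token per line followed by space-separated vector components.

```python
import os
import tempfile


def add_word2vec_header(glove_path, word2vec_path):
    """Copy a GloVe text file, prepending the "<num_vectors> <num_dims>" header."""
    num_vectors = 0
    num_dims = 0
    # First pass: count the vectors and read the dimensionality.
    with open(glove_path, encoding="utf-8") as source:
        for line in source:
            if num_vectors == 0:
                # The first field is the token, the rest are vector components.
                num_dims = len(line.rstrip().split(" ")) - 1
            num_vectors += 1
    # Second pass: write the header, then stream the vectors unchanged.
    with open(glove_path, encoding="utf-8") as source, \
            open(word2vec_path, "w", encoding="utf-8") as target:
        target.write(f"{num_vectors} {num_dims}\n")
        for line in source:
            target.write(line)
    return num_vectors, num_dims


# Usage with a tiny made-up embeddings file.
workdir = tempfile.mkdtemp()
glove_file = os.path.join(workdir, "glove.txt")
w2v_file = os.path.join(workdir, "glove.word2vec.txt")
with open(glove_file, "w", encoding="utf-8") as handle:
    handle.write("the 0.1 0.2 0.3\n")
    handle.write("cat 0.4 0.5 0.6\n")

shape = add_word2vec_header(glove_file, w2v_file)
with open(w2v_file, encoding="utf-8") as handle:
    header = handle.readline().strip()
```

Both passes stream the file line by line, so the conversion works for embedding files larger than RAM. The resulting file should then be loadable with `KeyedVectors.load_word2vec_format(path, binary=False)`.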

    feature 
    opened by manasRK 46
  • Implement Okapi BM25 variants in Gensim

    This pull request implements the gensim.models.bm25model module, which contains an implementation of the Okapi BM25 model and its modifications (Lucene BM25 and ATIRE) as discussed in https://github.com/RaRe-Technologies/gensim/issues/2592#issuecomment-866799145. The module acts as a replacement for the gensim.summarization.bm25model module deprecated and removed in Gensim 4. The module should supersede the gensim.models.tfidfmodel module as the baseline weighting function for information retrieval and related NLP tasks.
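For reference, the BM25 weighting itself can be sketched in a few lines of plain Python. This is an illustration of the general formula, not the implementation in this pull request: the function `bm25_scores` is hypothetical, the defaults `k1=1.5` and `b=0.75` are common choices assumed here, and the IDF uses the non-negative `log(1 + (N - n + 0.5) / (n + 0.5))` form, so its scores are not expected to match the outputs of rank-bm25 or Gensim.

```python
import math
from collections import Counter


def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every tokenized document in corpus against the tokenized query."""
    num_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / num_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        # Longer-than-average documents are penalized, controlled by b.
        length_norm = 1 - b + b * len(doc) / avg_len
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue
            n = doc_freq[term]
            # Assumed IDF variant that stays non-negative for frequent terms.
            idf = math.log(1 + (num_docs - n + 0.5) / (n + 0.5))
            tf = term_freq[term]
            # Term frequency saturates as tf grows, controlled by k1.
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


corpus = [["Hello", "world"], ["bar", "bar"], ["foo", "bar"]]
scores = bm25_scores(["Hello", "bar"], corpus)
best_document = corpus[scores.index(max(scores))]
```

The sketch recomputes document frequencies on every call and scans every document, which is exactly the kind of cost that a separate dictionary and a sparse index avoid.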

    Most implementations of BM25 such as the rank-bm25 library combine indexing with weighting and often forgo dictionary building for a speed improvement at indexing time (but a hefty penalty at retrieval time). To give an example, here is how a user would search for documents with rank-bm25:

    >>> from rank_bm25 import BM25Okapi
    >>>
    >>> corpus = [["Hello", "world"], ["bar", "bar"], ["foo", "bar"]]
    >>> bm25_model = BM25Okapi(corpus)
    >>>
    >>> query = ["Hello", "bar"]
    >>> similarities = bm25_model.get_scores(query)
    >>> similarities
    
    array([0.51082562, 0.09121886, 0.0638532 ])
    
    >>> best_document, = bm25_model.get_top_n(query, corpus, n=1)
    >>> best_document
    
    ['Hello', 'world']
    

    As you can see, the interface is convenient, but retrieval is slow due to the lack of a dictionary. Furthermore, any advanced operations such as pruning the dictionary, applying semantic matching (e.g. SCM) and query expansion (e.g. RM3), or sharding the index are unavailable.

    By contrast, the gensim.models.bm25model module separates the three operations. To give an example, here is how a user would search for documents with the gensim.models.bm25model module:

    >>> from gensim.corpora import Dictionary
    >>> from gensim.models import TfidfModel, OkapiBM25Model
    >>> from gensim.similarities import SparseMatrixSimilarity
    >>> import numpy as np
    >>>
    >>> corpus = [["Hello", "world"], ["bar", "bar"], ["foo", "bar"]]
    >>> dictionary = Dictionary(corpus)
    >>> bm25_model = OkapiBM25Model(dictionary=dictionary)
    >>> bm25_corpus = bm25_model[list(map(dictionary.doc2bow, corpus))]
    >>> bm25_index = SparseMatrixSimilarity(bm25_corpus, num_docs=len(corpus), num_terms=len(dictionary),
    ...                                     normalize_queries=False, normalize_documents=False)
    >>>
    >>> query = ["Hello", "bar"]
    >>> tfidf_model = TfidfModel(dictionary=dictionary, smartirs='bnn')  # Enforce binary weighting of queries
    >>> tfidf_query = tfidf_model[dictionary.doc2bow(query)]
    >>>
    >>> similarities = bm25_index[tfidf_query]
    >>> similarities
    
    array([0.51082563, 0.09121886, 0.0638532 ], dtype=float32)
    
    >>> best_document = corpus[np.argmax(similarities)]
    >>> best_document
    
    ['Hello', 'world']
    

    Tasks:

    • [x] Add Okapi BM25, ~~BM25L and BM25+~~ [1, 2], Lucene BM25 [3, 4], and ATIRE BM25 [3, 5].
    • [x] Add comments and docstrings to models.bm25model.
    • [x] Add comments and docstrings to similarities.docsim.
    • [x] Add BM25 to the run_topics_and_transformations autoexample.
    • [x] Add normalize_queries=True, normalize_documents=True named parameters to SparseMatrixSimilarity, DenseMatrixSimilarity, and SoftCosineSimilarity classes as discussed in https://github.com/RaRe-Technologies/gensim/pull/3304#issuecomment-1061031969 and on the Gensim mailing list. Deprecate the normalize named parameter of SoftCosineSimilarity. Add normalize_queries=False, normalize_documents=False to TF-IDF and BM25 examples.
    opened by Witiko 44
  • pip install gensim==4.2.0 raises deprecation warning

    When installing gensim in a fresh environment, I get the following warning. Sorry, it is a lot of output. The command is pip3 install gensim==4.2.0, my pip version is 22.3.1, and my Python version is 3.11 (see below).

    Because of character limits, I have cut the output down to just the warnings:

          ...
          running egg_info
          writing gensim.egg-info/PKG-INFO
          writing dependency_links to gensim.egg-info/dependency_links.txt
          writing requirements to gensim.egg-info/requires.txt
          writing top-level names to gensim.egg-info/top_level.txt
          reading manifest file 'gensim.egg-info/SOURCES.txt'
          reading manifest template 'MANIFEST.in'
          warning: no files found matching 'COPYING.LESSER'
          warning: no files found matching 'ez_setup.py'
          warning: no files found matching 'gensim/models/doc2vec_inner.c'
          adding license file 'COPYING'
          writing manifest file 'gensim.egg-info/SOURCES.txt'
          /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data' as data is deprecated, please list it in `packages`.
              !!
          
          
              ############################
              # Package would be ignored #
              ############################
              Python recognizes 'gensim.test.test_data' as an importable package,
              but it is not listed in the `packages` configuration of setuptools.
          
              'gensim.test.test_data' has been automatically added to the distribution only
              because it may contain data files, but this behavior is likely to change
              in future versions of setuptools (and therefore is considered deprecated).
          
              Please make sure that 'gensim.test.test_data' is included as a package by using
              the `packages` configuration field or the proper discovery methods
              (for example by using `find_namespace_packages(...)`/`find_namespace:`
              instead of `find_packages(...)`/`find:`).
          
              You can read more about "package discovery" and "data files" on setuptools
              documentation page.
          
          
          !!
          
            check.warn(importable)
          /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data.DTM' as data is deprecated, please list it in `packages`.
              !!
          
          
              ############################
              # Package would be ignored #
              ############################
              Python recognizes 'gensim.test.test_data.DTM' as an importable package,
              but it is not listed in the `packages` configuration of setuptools.
          
              'gensim.test.test_data.DTM' has been automatically added to the distribution only
              because it may contain data files, but this behavior is likely to change
              in future versions of setuptools (and therefore is considered deprecated).
          
              Please make sure that 'gensim.test.test_data.DTM' is included as a package by using
              the `packages` configuration field or the proper discovery methods
              (for example by using `find_namespace_packages(...)`/`find_namespace:`
              instead of `find_packages(...)`/`find:`).
          
              You can read more about "package discovery" and "data files" on setuptools
              documentation page.
          
          
          !!
          
            check.warn(importable)
          /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data.PathLineSentences' as data is deprecated, please list it in `packages`.
              !!
          
          
              ############################
              # Package would be ignored #
              ############################
              Python recognizes 'gensim.test.test_data.PathLineSentences' as an importable package,
              but it is not listed in the `packages` configuration of setuptools.
          
              'gensim.test.test_data.PathLineSentences' has been automatically added to the distribution only
              because it may contain data files, but this behavior is likely to change
              in future versions of setuptools (and therefore is considered deprecated).
          
              Please make sure that 'gensim.test.test_data.PathLineSentences' is included as a package by using
              the `packages` configuration field or the proper discovery methods
              (for example by using `find_namespace_packages(...)`/`find_namespace:`
              instead of `find_packages(...)`/`find:`).
          
              You can read more about "package discovery" and "data files" on setuptools
              documentation page.
          
          
          !!
          
            check.warn(importable)
          /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data.old_d2v_models' as data is deprecated, please list it in `packages`.
              !!
          
          
              ############################
              # Package would be ignored #
              ############################
              Python recognizes 'gensim.test.test_data.old_d2v_models' as an importable package,
              but it is not listed in the `packages` configuration of setuptools.
          
              'gensim.test.test_data.old_d2v_models' has been automatically added to the distribution only
              because it may contain data files, but this behavior is likely to change
              in future versions of setuptools (and therefore is considered deprecated).
          
              Please make sure that 'gensim.test.test_data.old_d2v_models' is included as a package by using
              the `packages` configuration field or the proper discovery methods
              (for example by using `find_namespace_packages(...)`/`find_namespace:`
              instead of `find_packages(...)`/`find:`).
          
              You can read more about "package discovery" and "data files" on setuptools
              documentation page.
          
          
          !!
          
            check.warn(importable)
          /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data.old_w2v_models' as data is deprecated, please list it in `packages`.
              !!
          
          
              ############################
              # Package would be ignored #
              ############################
              Python recognizes 'gensim.test.test_data.old_w2v_models' as an importable package,
              but it is not listed in the `packages` configuration of setuptools.
          
              'gensim.test.test_data.old_w2v_models' has been automatically added to the distribution only
              because it may contain data files, but this behavior is likely to change
              in future versions of setuptools (and therefore is considered deprecated).
          
              Please make sure that 'gensim.test.test_data.old_w2v_models' is included as a package by using
              the `packages` configuration field or the proper discovery methods
              (for example by using `find_namespace_packages(...)`/`find_namespace:`
              instead of `find_packages(...)`/`find:`).
          
              You can read more about "package discovery" and "data files" on setuptools
              documentation page.
          
          
          !!
          
            check.warn(importable)
          copying gensim/_matutils.c -> build/lib.macosx-10.9-universal2-cpython-311/gensim
          copying gensim/_matutils.pyx -> build/lib.macosx-10.9-universal2-cpython-311/gensim
          ...
          copying gensim/corpora/_mmreader.c -> build/lib.macosx-10.9-universal2-cpython-311/gensim/corpora
          copying gensim/corpora/_mmreader.pyx -> build/lib.macosx-10.9-universal2-cpython-311/gensim/corpora
          running build_ext
          building 'gensim.models.word2vec_inner' extension
          creating build/temp.macosx-10.9-universal2-cpython-311
          creating build/temp.macosx-10.9-universal2-cpython-311/gensim
          creating build/temp.macosx-10.9-universal2-cpython-311/gensim/models
          clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch arm64 -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.11/include/python3.11 -I/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include -c gensim/models/word2vec_inner.c -o build/temp.macosx-10.9-universal2-cpython-311/gensim/models/word2vec_inner.o
          In file included from gensim/models/word2vec_inner.c:706:
          In file included from /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/arrayobject.h:5:
          In file included from /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/ndarrayobject.h:12:
          In file included from /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/ndarraytypes.h:1948:
          /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: "Using deprecated NumPy API, disable it with "          "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-W#warnings]
          #warning "Using deprecated NumPy API, disable it with " \
           ^
          gensim/models/word2vec_inner.c:12424:5: error: incomplete definition of type 'struct _frame'
              __Pyx_PyFrame_SetLineNumber(py_frame, py_line);
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          gensim/models/word2vec_inner.c:457:62: note: expanded from macro '__Pyx_PyFrame_SetLineNumber'
            #define __Pyx_PyFrame_SetLineNumber(frame, lineno)  (frame)->f_lineno = (lineno)
                                                                ~~~~~~~^
          /Library/Frameworks/Python.framework/Versions/3.11/include/python3.11/pytypedefs.h:22:16: note: forward declaration of 'struct _frame'
          typedef struct _frame PyFrameObject;
                         ^
          1 warning and 1 error generated.
          error: command '/usr/bin/clang' failed with exit code 1
          [end of output]
      
      note: This error originates from a subprocess, and is likely not a problem with pip.
      ERROR: Failed building wheel for gensim
      Running setup.py clean for gensim
    Failed to build gensim
    Installing collected packages: gensim
      Running setup.py install for gensim ... error
      error: subprocess-exited-with-error
      
      × Running setup.py install for gensim did not run successfully.
      │ exit code: 1
      ╰─> [540 lines of output]
          running install
          /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
            warnings.warn(
          running build
          running build_py
          creating build
          creating build/lib.macosx-10.9-universal2-cpython-311
          creating build/lib.macosx-10.9-universal2-cpython-311/gensim
          ...
          copying gensim/corpora/svmlightcorpus.py -> build/lib.macosx-10.9-universal2-cpython-311/gensim/corpora
          copying gensim/corpora/hashdictionary.py -> build/lib.macosx-10.9-universal2-cpython-311/gensim/corpora
          running egg_info
          writing gensim.egg-info/PKG-INFO
          writing dependency_links to gensim.egg-info/dependency_links.txt
          writing requirements to gensim.egg-info/requires.txt
          writing top-level names to gensim.egg-info/top_level.txt
          reading manifest file 'gensim.egg-info/SOURCES.txt'
          reading manifest template 'MANIFEST.in'
          warning: no files found matching 'COPYING.LESSER'
          warning: no files found matching 'ez_setup.py'
          warning: no files found matching 'gensim/models/doc2vec_inner.c'
          adding license file 'COPYING'
          writing manifest file 'gensim.egg-info/SOURCES.txt'
          /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data' as data is deprecated, please list it in `packages`.
              !!
          
          
              ############################
              # Package would be ignored #
              ############################
              Python recognizes 'gensim.test.test_data' as an importable package,
              but it is not listed in the `packages` configuration of setuptools.
          
              'gensim.test.test_data' has been automatically added to the distribution only
              because it may contain data files, but this behavior is likely to change
              in future versions of setuptools (and therefore is considered deprecated).
          
              Please make sure that 'gensim.test.test_data' is included as a package by using
              the `packages` configuration field or the proper discovery methods
              (for example by using `find_namespace_packages(...)`/`find_namespace:`
              instead of `find_packages(...)`/`find:`).
          
              You can read more about "package discovery" and "data files" on setuptools
              documentation page.
          
          
          !!
          
            check.warn(importable)
          /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data.DTM' as data is deprecated, please list it in `packages`.
              !!
          
          
              ############################
              # Package would be ignored #
              ############################
              Python recognizes 'gensim.test.test_data.DTM' as an importable package,
              but it is not listed in the `packages` configuration of setuptools.
          
              'gensim.test.test_data.DTM' has been automatically added to the distribution only
              because it may contain data files, but this behavior is likely to change
              in future versions of setuptools (and therefore is considered deprecated).
          
              Please make sure that 'gensim.test.test_data.DTM' is included as a package by using
              the `packages` configuration field or the proper discovery methods
              (for example by using `find_namespace_packages(...)`/`find_namespace:`
              instead of `find_packages(...)`/`find:`).
          
              You can read more about "package discovery" and "data files" on setuptools
              documentation page.
          
          
          !!
          
            check.warn(importable)
          /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data.PathLineSentences' as data is deprecated, please list it in `packages`.
              !!
          
          
              ############################
              # Package would be ignored #
              ############################
              Python recognizes 'gensim.test.test_data.PathLineSentences' as an importable package,
              but it is not listed in the `packages` configuration of setuptools.
          
              'gensim.test.test_data.PathLineSentences' has been automatically added to the distribution only
              because it may contain data files, but this behavior is likely to change
              in future versions of setuptools (and therefore is considered deprecated).
          
              Please make sure that 'gensim.test.test_data.PathLineSentences' is included as a package by using
              the `packages` configuration field or the proper discovery methods
              (for example by using `find_namespace_packages(...)`/`find_namespace:`
              instead of `find_packages(...)`/`find:`).
          
              You can read more about "package discovery" and "data files" on setuptools
              documentation page.
          
          
          !!
          
            check.warn(importable)
          /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data.old_d2v_models' as data is deprecated, please list it in `packages`.
              !!
          
          
              ############################
              # Package would be ignored #
              ############################
              Python recognizes 'gensim.test.test_data.old_d2v_models' as an importable package,
              but it is not listed in the `packages` configuration of setuptools.
          
              'gensim.test.test_data.old_d2v_models' has been automatically added to the distribution only
              because it may contain data files, but this behavior is likely to change
              in future versions of setuptools (and therefore is considered deprecated).
          
              Please make sure that 'gensim.test.test_data.old_d2v_models' is included as a package by using
              the `packages` configuration field or the proper discovery methods
              (for example by using `find_namespace_packages(...)`/`find_namespace:`
              instead of `find_packages(...)`/`find:`).
          
              You can read more about "package discovery" and "data files" on setuptools
              documentation page.
          
          
          !!
          
            check.warn(importable)
          /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/setuptools/command/build_py.py:202: SetuptoolsDeprecationWarning:     Installing 'gensim.test.test_data.old_w2v_models' as data is deprecated, please list it in `packages`.
              !!
          
          
              ############################
              # Package would be ignored #
              ############################
              Python recognizes 'gensim.test.test_data.old_w2v_models' as an importable package,
              but it is not listed in the `packages` configuration of setuptools.
          
              'gensim.test.test_data.old_w2v_models' has been automatically added to the distribution only
              because it may contain data files, but this behavior is likely to change
              in future versions of setuptools (and therefore is considered deprecated).
          
              Please make sure that 'gensim.test.test_data.old_w2v_models' is included as a package by using
              the `packages` configuration field or the proper discovery methods
              (for example by using `find_namespace_packages(...)`/`find_namespace:`
              instead of `find_packages(...)`/`find:`).
          
              You can read more about "package discovery" and "data files" on setuptools
              documentation page.
          
          
          !!
          
            check.warn(importable)
          copying gensim/_matutils.c -> build/lib.macosx-10.9-universal2-cpython-311/gensim
          copying gensim/_matutils.pyx -> build/lib.macosx-10.9-universal2-cpython-311/gensim
         ...
          copying gensim/corpora/_mmreader.pyx -> build/lib.macosx-10.9-universal2-cpython-311/gensim/corpora
          running build_ext
          building 'gensim.models.word2vec_inner' extension
          creating build/temp.macosx-10.9-universal2-cpython-311
          creating build/temp.macosx-10.9-universal2-cpython-311/gensim
          creating build/temp.macosx-10.9-universal2-cpython-311/gensim/models
          clang -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -arch arm64 -arch x86_64 -g -I/Library/Frameworks/Python.framework/Versions/3.11/include/python3.11 -I/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include -c gensim/models/word2vec_inner.c -o build/temp.macosx-10.9-universal2-cpython-311/gensim/models/word2vec_inner.o
          In file included from gensim/models/word2vec_inner.c:706:
          In file included from /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/arrayobject.h:5:
          In file included from /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/ndarrayobject.h:12:
          In file included from /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/ndarraytypes.h:1948:
          /Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/numpy/core/include/numpy/npy_1_7_deprecated_api.h:17:2: warning: "Using deprecated NumPy API, disable it with "          "#define NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-W#warnings]
          #warning "Using deprecated NumPy API, disable it with " \
           ^
          gensim/models/word2vec_inner.c:12424:5: error: incomplete definition of type 'struct _frame'
              __Pyx_PyFrame_SetLineNumber(py_frame, py_line);
              ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
          gensim/models/word2vec_inner.c:457:62: note: expanded from macro '__Pyx_PyFrame_SetLineNumber'
            #define __Pyx_PyFrame_SetLineNumber(frame, lineno)  (frame)->f_lineno = (lineno)
                                                                ~~~~~~~^
          /Library/Frameworks/Python.framework/Versions/3.11/include/python3.11/pytypedefs.h:22:16: note: forward declaration of 'struct _frame'
          typedef struct _frame PyFrameObject;
                         ^
          1 warning and 1 error generated.
          error: command '/usr/bin/clang' failed with exit code 1
          [end of output]
      
      note: This error originates from a subprocess, and is likely not a problem with pip.
    error: legacy-install-failure
    
    × Encountered error while trying to install package.
    ╰─> gensim
    
    note: This is an issue with the package mentioned above, not pip.
    hint: See above for output from the failure.
    
    

    Versions

    Please provide the output of:

    >>> import platform; print(platform.platform())
    macOS-12.5-arm64-arm-64bit
    >>> 
    >>> import sys; print("Python", sys.version)
    Python 3.11.1 (v3.11.1:a7a450f84a, Dec  6 2022, 15:24:06) [Clang 13.0.0 (clang-1300.0.29.30)]
    >>> 
    >>> import struct; print("Bits", 8 * struct.calcsize("P"))
    Bits 64
    >>> 
    >>> import numpy; print("NumPy", numpy.__version__)
    NumPy 1.23.5
    >>> 
    >>> import scipy; print("SciPy", scipy.__version__)
    SciPy 1.9.3
    
    opened by labouz 2
  • Parameter shardsize ignored on queries

    Parameter shardsize ignored on queries

    Problem description

    When I pass the shardsize parameter to similarities.Similarity, the same value is not used when querying the index, which causes errors:

    self._similarity_index = similarities.Similarity(MODELS_PATH + f'/{model}', sim_vectors, num_features=len(self._dictionary), shardsize=50000)
    
    sims = self._similarity_index[doc_vector]
    


    PS: If I don't use the parameter shardsize, the error already occurs in the similarities.Similarity call.

    Steps/code/corpus to reproduce

    Save the .py files in the pruvo folder (package) and the .parquet file in the data folder, then run this script:

    import pandas as pd
    
    from pruvo.embedding import Corpus
    
    df = pd.read_parquet('data/preprocess.parquet')
    
    corpus = Corpus()
    corpus.add(list(df['bookingRoomType'].unique()), pre_processed=True)
    corpus.add(list(df['mappedRoomType'].unique()), pre_processed=True)
    
    w2v = corpus.train(model='word2vec')
    
    w2v_similars = corpus.get_similars('apartment 1 king bed in neverland')
    w2v_similars.head(10)
    

    Versions

    Please provide the output of:

    import platform; print(platform.platform())
    import sys; print("Python", sys.version)
    import struct; print("Bits", 8 * struct.calcsize("P"))
    import numpy; print("NumPy", numpy.__version__)
    import scipy; print("SciPy", scipy.__version__)
    import gensim; print("gensim", gensim.__version__)
    from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
    


    files.zip

    opened by MaickelHubner 0
  • Add parameter from_topn in evaluate_word_analogies

    Add parameter from_topn in evaluate_word_analogies

    With from_topn, an analogy is marked correct if the expected vector is among the from_topn most similar results, not necessarily the single most similar one.

    This is useful for evaluating vectors such as confusion vectors, where an answer counts as correct if either of the top two results matches.
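
    The idea can be sketched in plain Python, independent of gensim. This is not the evaluate_word_analogies implementation; the function name topn_accuracy and the toy candidate lists are hypothetical and serve only to illustrate how a from_topn cutoff changes the score.

```python
def topn_accuracy(ranked_predictions, expected_answers, from_topn=1):
    """Fraction of questions whose expected answer is among the first `from_topn` candidates."""
    if from_topn < 1:
        raise ValueError("from_topn must be at least 1")
    correct = 0
    for candidates, expected in zip(ranked_predictions, expected_answers):
        # Only the first `from_topn` ranked candidates are considered.
        if expected in candidates[:from_topn]:
            correct += 1
    return correct / len(expected_answers) if expected_answers else 0.0


# Hypothetical ranked candidates for three analogy questions (most similar first).
ranked = [
    ["queen", "princess", "monarch"],
    ["berlin", "paris", "lyon"],
    ["cat", "walked", "walking"],
]
expected = ["queen", "paris", "walked"]

strict = topn_accuracy(ranked, expected, from_topn=1)
relaxed = topn_accuracy(ranked, expected, from_topn=2)
print(strict, relaxed)
```

    With from_topn=1 only the first question counts as correct; with from_topn=2 all three do.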

    opened by divyanx 3
  • BUG: word2vec skipgram model wont work with numpy array

    BUG: word2vec skipgram model wont work with numpy array

    Problem description

    I have a language with 240 distinct words. Because each word fits in 1 byte, I map each word to a byte and store the sentences in NumPy uint8 arrays to minimize the memory footprint. This significantly reduces memory consumption. However, because of "gensim\models\word2vec_inner.pyx", line 542, NumPy arrays can't be used, and training throws the error "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()".

    The line in question checks whether the sentence is empty, but it does so with "if not sent:". A more generic check, "if len(sent) == 0:", would fix the problem.

    The workaround is to cast the NumPy array to a Python list. However, this significantly increases the memory footprint and is a time-consuming operation on a big dataset.


    Steps/code/corpus to reproduce

    reproduce:

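    import gensim
    import numpy as np
    
    class SentenceIterator:
        """Restartable iterable over the stored sentences."""
        def __init__(self, dataset):
            self.dataset = dataset
    
        def __iter__(self):
            for sentence in self.dataset:
                yield sentence
    
    # Each sentence is a NumPy array of word ids rather than a Python list.
    data = []
    data.append(np.array([22, 33, 44, 55, 1, 2, 3, 5, 4, 100]))
    data.append(np.array([100, 100, 100, 100, 11]))
    
    sentences = SentenceIterator(data)
    # Reported to fail with the "truth value of an array ... is ambiguous" error.
    model = gensim.models.Word2Vec(sentences, vector_size=32, window=3, workers=4, sg=1, negative=10)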
    

    PS: casting the np.array to a Python list fixes the issue, but casting is very slow on a big dataset and significantly increases the memory footprint.

    workaround:

    class SentenceIterator:
        def __init__(self, dataset):
            self.dataset = dataset
    
        def __iter__(self):
            for sentence in self.dataset:
                yield sentence.tolist()
    

    possible fix

    Change the "if not sent:" checks to "if len(sent) == 0:".
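
    The difference between the two checks can be reproduced with NumPy alone. The sketch below does not touch gensim's Cython code; the helper name is_empty_sentence is hypothetical and only illustrates the length-based check proposed above.

```python
import numpy as np


def is_empty_sentence(sent):
    """Length-based emptiness check that works for lists and NumPy arrays alike."""
    return len(sent) == 0


sentence = np.array([22, 33, 44], dtype=np.uint8)

# Truthiness of a multi-element array is ambiguous, so `not sentence` raises ValueError.
try:
    if not sentence:
        pass
    truthiness_failed = False
except ValueError:
    truthiness_failed = True

print(truthiness_failed)
print(is_empty_sentence(sentence))
print(is_empty_sentence([]))
```

    The length-based check never evaluates the array as a boolean, so it behaves the same for Python lists and NumPy arrays.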

    Versions

    Python 3.9.13 (main, Oct 13 2022, 21:23:06) [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32
    Type "help", "copyright", "credits" or "license" for more information.

    import platform; print(platform.platform())
    Windows-10-10.0.19044-SP0
    import sys; print("Python", sys.version)
    Python 3.9.13 (main, Oct 13 2022, 21:23:06) [MSC v.1916 64 bit (AMD64)]
    import struct; print("Bits", 8 * struct.calcsize("P"))
    Bits 64
    import numpy; print("NumPy", numpy.__version__)
    NumPy 1.23.4
    import scipy; print("SciPy", scipy.__version__)
    SciPy 1.9.3
    import gensim; print("gensim", gensim.__version__)
    gensim 4.2.0
    from gensim.models import word2vec;print("FAST_VERSION", word2vec.FAST_VERSION)
    FAST_VERSION 0

    opened by isimsizolan 1
  • Add tox environments to setup.cfg, fix napoleon import

    Add tox environments to setup.cfg, fix napoleon import

    • Without this change, tox -e compile,docs fails on my machine with ERROR: tox config file (either pyproject.toml, tox.ini, setup.cfg) not found. tox -e ALL can be used now.
    • Fixed the napoleon import error: Could not import extension sphinxcontrib.napoleon (exception: cannot import name 'Callable' from 'collections' (/usr/lib/python3.10/collections/__init__.py)). See https://github.com/sphinx-doc/sphinx/issues/10378#issuecomment-1107455569

    Python 3.10.6, 5.15.65-1-MANJARO

    documentation housekeeping 
    opened by sezanzeb 6
  • FastTextKeyedVectors.add_vectors is not adding vectors

    FastTextKeyedVectors.add_vectors is not adding vectors

    Problem description

    I have been trying to create a FastTextKeyedVectors and add vectors to it using either add_vector or add_vectors, but the methods are not adding anything. After looking at the implementation of those methods, I think there is an error in the check for whether a key has already been added.

    Steps/code/corpus to reproduce

    I create a FastTextKeyedVectors using the defaults used by the FastText model, then try to add vectors to it using add_vector or add_vectors:

    from gensim.models.fasttext import FastTextKeyedVectors

    wv = FastTextKeyedVectors(vector_size=2, min_n=3, max_n=6, bucket=2000000)
    wv.add_vector("test", [0.5, 0.5])
    print(wv.key_to_index)
    >> {}
    print(wv.index_to_key)
    >> []
    print(wv.vectors)
    >> []
    
    wv.add_vectors(["test"], [[0.5, 0.5]])
    print(wv.key_to_index)
    >> {}
    print(wv.index_to_key)
    >> []
    print(wv.vectors)
    >> []
    

    wv.key_to_index, wv.index_to_key and wv.vectors are all empty.

    FastTextKeyedVectors is a child of KeyedVectors, where the add_vector and add_vectors methods are implemented. add_vector does a few checks and then calls add_vectors. In add_vectors, there is an in_vocab_mask, a boolean array indicating whether each key is already present in the KeyedVectors.

    in_vocab_mask = np.zeros(len(keys), dtype=bool)
    for idx, key in enumerate(keys):
        if key in self:
            in_vocab_mask[idx] = True
    

    Since Gensim 4.0, key in wv will always return True with FastText by design. The proper way of checking if a key exists is by calling key in wv.key_to_index (See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4#10-check-if-a-word-is-fully-oov-out-of-vocabulary-for-fasttext)

    So replacing the above code by

    in_vocab_mask = np.zeros(len(keys), dtype=bool)
    for idx, key in enumerate(keys):
        if key in self.key_to_index:
            in_vocab_mask[idx] = True
    

    seems to fix the issue.

    wv = FastTextKeyedVectors(vector_size=2, min_n=3, max_n=6, bucket=2000000)
    wv.add_vectors(["test"], [[0.5, 0.5]])
    print(wv.key_to_index)
    >> {'test': 0}
    print(wv.index_to_key)
    >> ['test']
    print(wv.vectors)
    >> [[0.5 0.5]]
    

    I am not sure how FastText models are able to add vectors to FastTextKeyedVectors the proper way when training without encountering this issue as I have not looked at the training code in detail.
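
    The effect of the two membership checks can be illustrated without gensim. The class ToyFastTextVectors and its add_keys method below are hypothetical stand-ins, not gensim code; the class only mimics the behaviour described above, where key in wv is always True while key in wv.key_to_index reflects what is actually stored.

```python
class ToyFastTextVectors:
    """Hypothetical stand-in: membership is always True, as the issue describes for FastText."""

    def __init__(self):
        self.key_to_index = {}
        self.index_to_key = []

    def __contains__(self, key):
        # Any key counts as "present" because a vector could be synthesized from n-grams.
        return True

    def add_keys(self, keys, use_key_to_index):
        """Store keys that are not yet known and return how many were added."""
        added = 0
        for key in keys:
            known = key in self.key_to_index if use_key_to_index else key in self
            if not known:
                self.key_to_index[key] = len(self.index_to_key)
                self.index_to_key.append(key)
                added += 1
        return added


buggy = ToyFastTextVectors()
fixed = ToyFastTextVectors()
buggy_added = buggy.add_keys(["test"], use_key_to_index=False)
fixed_added = fixed.add_keys(["test"], use_key_to_index=True)
print(buggy_added, buggy.key_to_index)
print(fixed_added, fixed.key_to_index)
```

    With the always-True check every key looks like a duplicate and nothing is stored; checking key_to_index instead lets new keys through.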

    Versions

    Linux-5.10.0-17-amd64-x86_64-with-glibc2.31
    Python 3.10.4 (main, Mar 31 2022, 08:41:55) [GCC 7.5.0]
    Bits 64
    NumPy 1.21.6
    SciPy 1.7.3
    gensim 4.2.0
    FAST_VERSION 1

    bug 
    opened by globba 2
Releases(4.3.0)
  • 4.3.0(Dec 21, 2022)

    What's Changed

    • Allow overriding the Cython version requirement by @pabs3 in https://github.com/RaRe-Technologies/gensim/pull/3323
    • Update Python module MANIFEST by @pabs3 in https://github.com/RaRe-Technologies/gensim/pull/3343
    • Clean up references to Morfessor, tox and gensim.models.wrappers by @pabs3 in https://github.com/RaRe-Technologies/gensim/pull/3345
    • Disable the Gensim 3=>4 warning in docs by @piskvorky in https://github.com/RaRe-Technologies/gensim/pull/3346
    • pin sphinx versions, add explicit gallery_top label by @mpenkov in https://github.com/RaRe-Technologies/gensim/pull/3383
    • Declare variables prior to for loop in fastss.pyx for ANSI C compatibility by @hstk30 in https://github.com/RaRe-Technologies/gensim/pull/3378
    • Fix typo in word2vec and KeyedVectors docstrings by @dymil in https://github.com/RaRe-Technologies/gensim/pull/3365
    • Replace np.multiply with np.square and copyedit in translation_matrix.py by @dymil in https://github.com/RaRe-Technologies/gensim/pull/3374
    • Copyedit and fix outdated statements in translation matrix tutorial by @dymil in https://github.com/RaRe-Technologies/gensim/pull/3375
    • Implement Okapi BM25 variants in Gensim by @Witiko in https://github.com/RaRe-Technologies/gensim/pull/3304
    • Giving missing credit in EnsembleLDA to Alex in docs by @sezanzeb in https://github.com/RaRe-Technologies/gensim/pull/3393
    • PERF: pyemd to POT for EMD computation in wmdistance by @TLouf in https://github.com/RaRe-Technologies/gensim/pull/3327
    • Fixed bug in loss computation for Word2Vec with hierarchical softmax by @TalIfargan in https://github.com/RaRe-Technologies/gensim/pull/3397
    • fix deprecation warning from pytest by @martino-vic in https://github.com/RaRe-Technologies/gensim/pull/3354
    • Switch to Cython language level 3 by @pabs3 in https://github.com/RaRe-Technologies/gensim/pull/3344
    • Implement numpy hack in setup.py to enable install under Poetry by @jaymegordo in https://github.com/RaRe-Technologies/gensim/pull/3363
    • Fixed the broken link in readme.md by @aswin2108 in https://github.com/RaRe-Technologies/gensim/pull/3409
    • Path Coherence Model to correctly handle empty documents by @PrimozGodec in https://github.com/RaRe-Technologies/gensim/pull/3406
    • Add support for Python 3.11 and drop support for Python 3.7 by @acul3 in https://github.com/RaRe-Technologies/gensim/pull/3402
    • clarify runtime expectations by @gojomo in https://github.com/RaRe-Technologies/gensim/pull/3381
    • Fix bug that prevents loading old models by @funasshi in https://github.com/RaRe-Technologies/gensim/pull/3359
    • refactor wheel building and testing workflow by @mpenkov in https://github.com/RaRe-Technologies/gensim/pull/3410
    • Fixed FastTextKeyedVectors handling in add_vector by @globba in https://github.com/RaRe-Technologies/gensim/pull/3389
    • Flsamodel by @ERijck in https://github.com/RaRe-Technologies/gensim/pull/3398
    • Fix backwards compatibility bug in Word2Vec by @mpenkov in https://github.com/RaRe-Technologies/gensim/pull/3415
    • fix numpy hack in setup.py by @mpenkov in https://github.com/RaRe-Technologies/gensim/pull/3416
    • updated changelog for next release by @mpenkov in https://github.com/RaRe-Technologies/gensim/pull/3412

    New Contributors

    • @hstk30 made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3378
    • @TLouf made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3327
    • @TalIfargan made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3397
    • @martino-vic made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3354
    • @jaymegordo made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3363
    • @aswin2108 made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3409
    • @acul3 made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3402
    • @funasshi made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3359
    • @globba made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3389
    • @ERijck made their first contribution in https://github.com/RaRe-Technologies/gensim/pull/3398

    Full Changelog: https://github.com/RaRe-Technologies/gensim/compare/4.2.0...4.3.0

  • 4.2.0(May 1, 2022)

  • 4.1.2(Sep 18, 2021)

    4.1.2, 2021-09-17

    This is a bugfix release that addresses leftover compatibility issues with older versions of numpy and macOS.

    4.1.1, 2021-09-14

    This is a bugfix release that addresses compatibility issues with older versions of numpy.

    4.1.0, 2021-08-15

    Gensim 4.1 brings two major new functionalities:

    There are several minor changes that are not backwards compatible with previous versions of Gensim. The affected functionality is rarely used and unlikely to affect most users, so we have opted not to require a major version bump. Nevertheless, we describe them below.

    Improved parameter edge-case handling in KeyedVectors most_similar and most_similar_cosmul methods

    We now handle both positive and negative keyword parameters consistently. They may now be either:

    1. A string, in which case the value is reinterpreted as a list of one element (the string value)
    2. A vector, in which case the value is reinterpreted as a list of one element (the vector)
    3. A list of strings
    4. A list of vectors

    So you can now simply do:

    model.most_similar(positive='war', negative='peace')
    

    instead of the slightly more involved

    model.most_similar(positive=['war'], negative=['peace'])
    

    Both invocations remain correct, so you can use whichever is most convenient. If you were somehow expecting gensim to interpret the strings as a list of characters, e.g.

    model.most_similar(positive=['w', 'a', 'r'], negative=['p', 'e', 'a', 'c', 'e'])
    

    then you will need to specify the lists explicitly in gensim 4.1.
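
The normalization described above can be sketched roughly as follows. This is a minimal illustration, not gensim's actual code: `as_item_list` is a hypothetical helper, and the standard-library `array.array` stands in for the NumPy vectors that gensim works with.

```python
from array import array


def as_item_list(value):
    """Hypothetical helper: normalize a positive/negative style argument to a list."""
    if value is None:
        return []
    # A bare string or a bare vector is a single item, so wrap it in a list
    # instead of iterating over its characters or components.
    if isinstance(value, (str, array)):
        return [value]
    # Anything else is assumed to already be a sequence of strings or vectors.
    return list(value)


single = as_item_list('war')
several = as_item_list(['war', 'peace'])
vector = as_item_list(array('f', [0.1, 0.2]))
```

Under these assumptions, `'war'` and `['war']` normalize to the same list, which is why both invocations shown above are equivalent, while a list of single characters stays a list of single characters.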

    Deprecated obsolete steps parameter from doc2vec

    With the newer version, do this:

    model.infer_vector(..., epochs=123)
    

    instead of this:

    model.infer_vector(..., steps=123)
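
For code that still passes `steps`, one option is a thin wrapper that forwards the value as `epochs`. The sketch below assumes only that the model exposes `infer_vector(doc_words, epochs=...)` as in the call shown above; `infer_vector_compat` is a hypothetical name and not part of gensim.

```python
import warnings


def infer_vector_compat(model, doc_words, **kwargs):
    """Hypothetical shim: translate a legacy `steps` keyword into `epochs`."""
    if "steps" in kwargs:
        if "epochs" in kwargs:
            raise TypeError("pass either 'steps' or 'epochs', not both")
        warnings.warn(
            "'steps' is obsolete; use 'epochs' instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Forward the old value under the new keyword name.
        kwargs["epochs"] = kwargs.pop("steps")
    return model.infer_vector(doc_words, **kwargs)
```

A wrapper like this is only a stopgap while call sites are migrated; updating callers to pass `epochs` directly is the simpler long-term fix.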
    

    Plus a large number of smaller improvements and fixes, as usual.

    ⚠️ If migrating from old Gensim 3.x, read the Migration guide first.

    👍 New features

    • #3169: Implement shrink_windows argument for Word2Vec, by @M-Demay
    • #3163: Optimize word mover distance (WMD) computation, by @flowlight0
    • #3157: New KeyedVectors.vectors_for_all method for vectorizing all words in a dictionary, by @Witiko
    • #3153: Vectorize word2vec.predict_output_word for speed, by @M-Demay
    • #3146: Use FastSS for fast kNN over Levenshtein distance, by @Witiko
    • #3128: Materialize and copy the corpus passed to SoftCosineSimilarity, by @Witiko
    • #3115: Make LSI dispatcher CLI param for number of jobs optional, by @robguinness
    • #3091: LsiModel: Only log top words that actually exist in the dictionary, by @kmurphy4
    • #2980: Added EnsembleLda for stable LDA topics, by @sezanzeb
    • #2978: Optimize performance of Author-Topic model, by @horpto
    • #3000: Tidy up KeyedVectors.most_similar() API, by @simonwiles

    📚 Tutorials and docs

    🔴 Bug fixes

    • #3178: Fix Unicode string incompatibility in gensim.similarities.fastss.editdist, by @Witiko
    • #3174: Fix loading Phraser models stored in Gensim 3.x into Gensim 4.0, by @emgucv
    • #3136: Fix indexing error in word2vec_inner.pyx, by @bluekura
    • #3131: Add missing import to NMF docs and models/__init__.py, by @properGrammar
    • #3116: Fix bug where saved Phrases model did not load its connector_words, by @aloknayak29
    • #2830: Fixed KeyError in coherence model, by @pietrotrope

    ⚠️ Removed functionality & deprecations

    • #3176: Eliminate obsolete step parameter from doc2vec infer_vector and similarity_unseen_docs, by @rock420
    • #2965: Remove strip_punctuation2 alias of strip_punctuation, by @sciatro
    • #3180: Move preprocessing functions from gensim.corpora.textcorpus and gensim.corpora.lowcorpus to gensim.parsing.preprocessing, by @rock420

    🔮 Testing, CI, housekeeping

    • #3156: Update Numpy minimum version to 1.17.0, by @PrimozGodec
    • #3143: replace _mul function with explicit casts, by @mpenkov
    • #2952: Allow newer versions of the Morfessor module for the tests, by @pabs3
    • #2965: Remove strip_punctuation2 alias of strip_punctuation, by @sciatro
  • 4.1.1(Sep 14, 2021)

  • 4.1.0(Aug 29, 2021)

    4.0.1, 2021-04-01

    Bugfix release to address issues with Wheels on Windows:

    • https://github.com/RaRe-Technologies/gensim/issues/3095
    • https://github.com/RaRe-Technologies/gensim/issues/3097

    4.0.0, 2021-03-24

    ⚠️ Gensim 4.0 contains breaking API changes! See the Migration guide to update your existing Gensim 3.x code and models.

    Gensim 4.0 is a major release with lots of performance & robustness improvements, and a new website.

    Main highlights

    • Massively optimized popular algorithms the community has grown to love: fastText, word2vec, doc2vec, phrases:

      a. Efficiency

      | model    | 3.8.3: wall time / peak RAM / throughput | 4.0.0: wall time / peak RAM / throughput |
      |----------|------------------------------------------|------------------------------------------|
      | fastText | 2.9h / 4.11 GB / 822k words/s            | 2.3h / 1.26 GB / 914k words/s            |
      | word2vec | 1.7h / 0.36 GB / 1685k words/s           | 1.2h / 0.33 GB / 1762k words/s           |

      In other words, fastText now needs 3x less RAM (and is faster); word2vec has 2x faster init (and needs less RAM, and is faster); detecting collocation phrases is 2x faster. (4.0 benchmarks)

      b. Robustness. We fixed a bunch of long-standing bugs by refactoring the internal code structure (see 🔴 Bug fixes below)

      c. Simplified OOP model for easier model exports and integration with TensorFlow, PyTorch & co.

      These improvements come to you transparently, aka "for free", but see the Migration guide for some changes that break the old Gensim 3.x API. Update your code accordingly.

    • Dropped a bunch of externally contributed modules and wrappers: summarization, pivoted TFIDF, Mallet…

      • Code quality was not up to our standards. Also there was no one to maintain these modules, answer user questions, support them.

        So rather than let them rot, we took the hard decision of removing these contributed modules from Gensim. If anyone's interested in maintaining them, please fork & publish into your own repo. They can live happily outside of Gensim.

    • Dropped Python 2. Gensim 4.0 is Py3.6+. Read our Python version support policy.

      • If you still need Python 2 for some reason, stay at Gensim 3.8.3.
    • A new Gensim website – finally! 🙃

    So, a major clean-up release overall. We're happy with this tighter, leaner and faster Gensim.

    This is the direction we'll keep going forward: less kitchen-sink of "latest academic algorithms", more focus on robust engineering, targeting concrete NLP & document similarity use-cases.
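
To make the efficiency figures in the highlights easier to compare, the table values can be turned into ratios. The numbers below are copied from the 3.8.3 vs 4.0.0 benchmark table; the dictionary layout and the `improvement` helper are illustrative assumptions, not part of Gensim.

```python
# (wall time in hours, peak RAM in GB, throughput in thousand words/s),
# copied from the benchmark table in the highlights.
benchmarks = {
    "fastText": {"3.8.3": (2.9, 4.11, 822), "4.0.0": (2.3, 1.26, 914)},
    "word2vec": {"3.8.3": (1.7, 0.36, 1685), "4.0.0": (1.2, 0.33, 1762)},
}


def improvement(model):
    """Return how many times better 4.0.0 is than 3.8.3 on each metric."""
    old, new = benchmarks[model]["3.8.3"], benchmarks[model]["4.0.0"]
    return {
        "time_speedup": old[0] / new[0],     # lower wall time is better
        "ram_reduction": old[1] / new[1],    # lower peak RAM is better
        "throughput_gain": new[2] / old[2],  # higher throughput is better
    }


for name in benchmarks:
    ratios = improvement(name)
    print(name, {key: round(value, 2) for key, value in ratios.items()})
```

For fastText the RAM ratio comes out a little above 3, which matches the "3x less RAM" claim in the highlights.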

    👍 New features

    📚 Tutorials and docs

    🔴 Bug fixes

    • #2891: Fix fastText word-vectors with ngrams off, by @gojomo
    • #2907: Fix doc2vec crash for large sets of doc-vectors, by @gojomo
    • #2899: Fix similarity bug in NMSLIB indexer, by @piskvorky
    • #2899: Fix deprecation warnings in Annoy integration, by @piskvorky
    • #2901: Fix inheritance of WikiCorpus from TextCorpus, by @jenishah
    • #2940: Fix deprecations in SoftCosineSimilarity, by @Witiko
    • #2944: Fix save_facebook_model failure after update-vocab & other initialization streamlining, by @gojomo
    • #2846: Fix for Python 3.9/3.10: remove xml.etree.cElementTree, by @hugovk
    • #2973: phrases.export_phrases() doesn't yield all bigrams, by @piskvorky
    • #2942: Segfault when training doc2vec, by @gojomo
    • #3041: Fix RuntimeError in export_phrases (change defaultdict to dict), by @thalishsajeed
    • #3059: Fix race condition in FastText tests, by @sleepy-owl

    ⚠️ Removed functionality & deprecations

    🔮 Testing, CI, housekeeping

    4.0.0.rc1, 2021-03-19

    ⚠️ Gensim 4.0 contains breaking API changes! See the Migration guide to update your existing Gensim 3.x code and models.

    Gensim 4.0 is a major release with lots of performance & robustness improvements and a new website.

    Main highlights (see also 👍 Improvements below)

    • Massively optimized popular algorithms the community has grown to love: fastText, word2vec, doc2vec, phrases:

      a. Efficiency

      | model    | 3.8.3: wall time / peak RAM / throughput | 4.0.0: wall time / peak RAM / throughput |
      |----------|------------------------------------------|------------------------------------------|
      | fastText | 2.9h / 4.11 GB / 822k words/s            | 2.3h / 1.26 GB / 914k words/s            |
      | word2vec | 1.7h / 0.36 GB / 1685k words/s           | 1.2h / 0.33 GB / 1762k words/s           |

      In other words, fastText now needs 3x less RAM (and is faster); word2vec has 2x faster init (and needs less RAM, and is faster); detecting collocation phrases is 2x faster. (4.0 benchmarks)

      b. Robustness. We fixed a bunch of long-standing bugs by refactoring the internal code structure (see 🔴 Bug fixes below)

      c. Simplified OOP model for easier model exports and integration with TensorFlow, PyTorch & co.

      These improvements come to you transparently, aka "for free", but see the Migration guide for some changes that break the old Gensim 3.x API. Update your code accordingly.

    • Dropped a bunch of externally contributed modules, including summarization and pivoted TFIDF normalization.

      • Code quality was not up to our standards. Also, there was no one to maintain these modules, answer user questions, or support them.

        So rather than let them rot, we took the hard decision of removing these contributed modules from Gensim. If anyone's interested in maintaining them, please fork them into your own repo; they can live happily outside of Gensim.

    • Dropped Python 2. Gensim 4.0 is Py3.6+. Read our Python version support policy.

      • If you still need Python 2 for some reason, stay at Gensim 3.8.3.
    • A new Gensim website – finally! 🙃

    So, a major clean-up release overall. We're happy with this tighter, leaner and faster Gensim.

    This is the direction we'll keep going forward: less kitchen-sink of "latest academic algorithms", more focus on robust engineering, targeting common concrete NLP & document similarity use-cases.

    🌟 New Features

    🔴 Bug fixes

    📚 Tutorial and doc improvements

    • fix various documentation warnings (mpenkov, #3077)
    • Fix broken link in run_doc how-to (sezanzeb, #2991)
    • Point WordEmbeddingSimilarityIndex documentation to gensim.similarities (Witiko, #3003)
    • Make the link to the Gensim 3.8.3 documentation dynamic (Witiko, #2996)

    ⚠️ Removed functionality

    🔮 Miscellaneous

    4.0.0beta, 2020-10-31

    ⚠️ Gensim 4.0 contains breaking API changes! See the Migration guide to update your existing Gensim 3.x code and models.

    Gensim 4.0 is a major release with lots of performance & robustness improvements and a new website.

    Main highlights (see also 👍 Improvements below)

    • Massively optimized popular algorithms the community has grown to love: fastText, word2vec, doc2vec, phrases:

      a. Efficiency

      | model    | 3.8.3: wall time / peak RAM / throughput | 4.0.0: wall time / peak RAM / throughput |
      |----------|------------------------------------------|------------------------------------------|
      | fastText | 2.9h / 4.11 GB / 822k words/s            | 2.3h / 1.26 GB / 914k words/s            |
      | word2vec | 1.7h / 0.36 GB / 1685k words/s           | 1.2h / 0.33 GB / 1762k words/s           |

      In other words, fastText now needs 3x less RAM (and is faster); word2vec has 2x faster init (and needs less RAM, and is faster); detecting collocation phrases is 2x faster. (4.0 benchmarks)

      b. Robustness. We fixed a bunch of long-standing bugs by refactoring the internal code structure (see 🔴 Bug fixes below)

      c. Simplified OOP model for easier model exports and integration with TensorFlow, PyTorch & co.

      These improvements come to you transparently, aka "for free", but see the Migration guide for some changes that break the old Gensim 3.x API. Update your code accordingly.

    • Dropped a bunch of externally contributed modules, including summarization and pivoted TFIDF normalization.

      • Code quality was not up to our standards. Also, there was no one to maintain these modules, answer user questions, or support them.

        So rather than let them rot, we took the hard decision of removing these contributed modules from Gensim. If anyone's interested in maintaining them, please fork them into your own repo; they can live happily outside of Gensim.

    • Dropped Python 2. Gensim 4.0 is Py3.6+. Read our Python version support policy.

      • If you still need Python 2 for some reason, stay at Gensim 3.8.3.
    • A new Gensim website – finally! 🙃

    So, a major clean-up release overall. We're happy with this tighter, leaner and faster Gensim.

    This is the direction we'll keep going forward: less kitchen-sink of "latest academic algorithms", more focus on robust engineering, targeting common concrete NLP & document similarity use-cases.

    Why pre-release?

    This 4.0.0beta pre-release is for users who want cutting-edge performance and bug fixes, and for users who want to help out by testing and providing feedback on code, documentation, workflows… Please let us know on the mailing list!

    Install the pre-release with:

    pip install --pre --upgrade gensim
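
After installing, it can be useful to confirm that a pre-release was actually picked up. The sketch below is a crude check on the version string, assuming suffixes such as `beta` or `rc1` as they appear in this changelog. It is not a full PEP 440 parser, and `is_prerelease` is a hypothetical helper.

```python
import re

# Matches a trailing pre-release marker such as "beta", "rc1" or "dev0".
_PRERELEASE_SUFFIX = re.compile(
    r"(?:a|b|rc|alpha|beta|dev|pre)\.?\d*$", re.IGNORECASE
)


def is_prerelease(version):
    """Return True if the version string ends in a pre-release marker."""
    return bool(_PRERELEASE_SUFFIX.search(version))


checks = {v: is_prerelease(v) for v in ("4.0.0beta", "4.0.0.rc1", "4.0.0", "3.8.3")}
```

The installed version string itself could be read with `importlib.metadata.version("gensim")` on Python 3.8 or newer and passed to the helper.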
    

    What will change between this pre-release and a "full" 4.0 release?

    Production stability is important to Gensim, so we're improving the process of upgrading already-trained saved models. There'll be an explicit model upgrade script between each 4.n to 4.(n+1) Gensim release. Check progress here.

    👍 Improvements

    📚 Tutorials and docs

    🔴 Bug fixes

    • #2891: Fix fastText word-vectors with ngrams off, by @gojomo
    • #2907: Fix doc2vec crash for large sets of doc-vectors, by @gojomo
    • #2899: Fix similarity bug in NMSLIB indexer, by @piskvorky
    • #2899: Fix deprecation warnings in Annoy integration, by @piskvorky
    • #2901: Fix inheritance of WikiCorpus from TextCorpus, by @jenishah
    • #2940: Fix deprecations in SoftCosineSimilarity, by @Witiko
    • #2944: Fix save_facebook_model failure after update-vocab & other initialization streamlining, by @gojomo
    • #2846: Fix for Python 3.9/3.10: remove xml.etree.cElementTree, by @hugovk
    • #2973: phrases.export_phrases() doesn't yield all bigrams, by @piskvorky
    • #2942: Segfault when training doc2vec, by @gojomo

    ⚠️ Removed functionality & deprecations

    • #6: No more binary wheels for x32 platforms, by @menshikh-iv
    • #2899: Renamed overly broad similarities.index to the more appropriate similarities.annoy, by @piskvorky
    • #2958: Remove gensim.summarization subpackage, docs and test data, by @mpenkov
    • #2926: Rename num_words to topn in dtm_coherence, by @MeganStodel
    • #2937: Remove Keras dependency, by @piskvorky
    • Removed all code, methods, attributes and functions marked as deprecated in Gensim 3.8.3.
    • Removed pattern dependency (PR #3012, @mpenkov). If you need to lemmatize, do it prior to passing the corpus to gensim.
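
With the pattern dependency gone, lemmatization becomes a preprocessing step that runs before the corpus reaches Gensim. A minimal sketch of that idea follows; the toy lookup table and the names `lemmatize_corpus` and `toy_lemmatize` are made up for illustration, and a real pipeline would plug in whatever lemmatizer it prefers.

```python
def lemmatize_corpus(docs, lemmatize):
    """Lazily yield each tokenized document with every token lemmatized."""
    for tokens in docs:
        yield [lemmatize(token) for token in tokens]


# Toy stand-in for a real lemmatizer: a lookup table with a pass-through default.
TOY_LEMMAS = {"wars": "war", "was": "be", "treaties": "treaty"}


def toy_lemmatize(token):
    return TOY_LEMMAS.get(token, token)


docs = [["wars", "and", "treaties"], ["peace", "was", "signed"]]
lemmatized = list(lemmatize_corpus(docs, toy_lemmatize))
```

The resulting token lists can then be handed to Gensim as usual, for example to `gensim.corpora.Dictionary`. The generator keeps the step streaming-friendly, so it only needs to be materialized with `list()` for small corpora.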
Owner
RARE Technologies
Commercial Machine Learning & NLP