Top2Vec is an algorithm for topic modeling and semantic search.

Overview

Update: Pre-trained Universal Sentence Encoders and BERT Sentence Transformer are now available for embedding.

Top2Vec

Top2Vec is an algorithm for topic modeling and semantic search. It automatically detects topics present in text and generates jointly embedded topic, document and word vectors. Once you train the Top2Vec model you can:

  • Get number of detected topics.
  • Get topics.
  • Get topic sizes.
  • Get hierarchical topics.
  • Search topics by keywords.
  • Search documents by topic.
  • Search documents by keywords.
  • Find similar words.
  • Find similar documents.
  • Expose model with RESTful-Top2Vec.

See the paper for more details on how it works.

Benefits

  1. Automatically finds number of topics.
  2. No stop word lists required.
  3. No need for stemming/lemmatization.
  4. Works on short text.
  5. Creates jointly embedded topic, document, and word vectors.
  6. Has search functions built in.

How does it work?

The assumption the algorithm makes is that many semantically similar documents are indicative of an underlying topic. The first step is to create a joint embedding of document and word vectors. Once documents and words are embedded in a vector space the goal of the algorithm is to find dense clusters of documents, then identify which words attracted those documents together. Each dense area is a topic and the words that attracted the documents to the dense area are the topic words.

The Algorithm:

1. Create jointly embedded document and word vectors using Doc2Vec or Universal Sentence Encoder or BERT Sentence Transformer.

Documents will be placed close to other similar documents and close to the most distinguishing words.

2. Create lower dimensional embedding of document vectors using UMAP.

Document vectors in high dimensional space are very sparse, so dimension reduction helps with finding dense areas. Each point is a document vector.

3. Find dense areas of documents using HDBSCAN.

The colored areas are the dense areas of documents. Red points are outliers that do not belong to a specific cluster.

4. For each dense area, calculate the centroid of the document vectors in the original dimension; this is the topic vector.

The red points are outlier documents and do not get used for calculating the topic vector. The purple points are the document vectors that belong to a dense area, from which the topic vector is calculated.

5. Find n-closest word vectors to the resulting topic vector.

The closest word vectors in order of proximity become the topic words.
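Steps 4 and 5 can be illustrated with a minimal, pure-Python sketch. It assumes cluster labels are already available, with -1 marking outliers (the convention HDBSCAN uses), and it runs on tiny made-up 2-D vectors instead of real embeddings. The function names `centroid`, `cosine_similarity`, `topic_vectors` and `closest_words` are hypothetical and are not part of the Top2Vec API; this is not the library's implementation.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def topic_vectors(doc_vectors, labels, outlier_label=-1):
    """Step 4: one centroid per cluster, skipping outlier documents."""
    clusters = defaultdict(list)
    for vec, label in zip(doc_vectors, labels):
        if label != outlier_label:
            clusters[label].append(vec)
    return {label: centroid(vecs) for label, vecs in clusters.items()}


def closest_words(topic_vector, word_vectors, n):
    """Step 5: the n words whose vectors are most similar to the topic vector."""
    ranked = sorted(
        word_vectors,
        key=lambda w: cosine_similarity(topic_vector, word_vectors[w]),
        reverse=True,
    )
    return ranked[:n]


# Toy data (made up): 2-D vectors standing in for real embeddings.
doc_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [-1.0, -1.0]]
labels = [0, 0, 1, 1, -1]  # the last document is an outlier
word_vectors = {"space": [1.0, 0.0], "medicine": [0.0, 1.0], "orbit": [0.8, 0.3]}

topics = topic_vectors(doc_vectors, labels)
topic_words = {t: closest_words(v, word_vectors, n=2) for t, v in topics.items()}
print(topic_words)
```

The outlier document contributes to neither centroid, which mirrors the description above: only documents that belong to a dense area are used to calculate the topic vector.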

Installation

The easy way to install Top2Vec is:

pip install top2vec

To install pre-trained universal sentence encoder options:

pip install top2vec[sentence_encoders]

To install pre-trained BERT sentence transformer options:

pip install top2vec[sentence_transformers]

To install indexing options:

pip install top2vec[indexing]

Usage

from top2vec import Top2Vec

model = Top2Vec(documents)

Important parameters:

  • documents: Input corpus, should be a list of strings.

  • speed: Determines how long the model takes to train. The 'fast-learn' option is the fastest and generates the lowest quality vectors. The 'learn' option learns better quality vectors but takes longer to train. The 'deep-learn' option learns the best quality vectors but takes significant time to train.

  • workers: The number of worker threads used to train the model. A larger number leads to faster training.

Trained models can be saved and loaded.

model.save("filename")
model = Top2Vec.load("filename")

For more information view the API guide.

Pretrained Models

Doc2Vec will be used by default to generate the joint word and document embeddings. However there are also pretrained embedding_model options for generating joint word and document embeddings:

  • universal-sentence-encoder
  • universal-sentence-encoder-multilingual
  • distiluse-base-multilingual-cased

from top2vec import Top2Vec

model = Top2Vec(documents, embedding_model='universal-sentence-encoder')

For large data sets and data sets with a very distinctive vocabulary, Doc2Vec could produce better results. This option trains a Doc2Vec model from scratch. The method is language agnostic; however, multiple languages will not be aligned.

Using the universal sentence encoder options will be much faster since those are pre-trained and efficient models. The universal sentence encoder options are suggested for smaller data sets. They are also good options for large data sets that are in English or in languages covered by the multilingual model, and for data sets that are multilingual.

The distiluse-base-multilingual-cased pre-trained sentence transformer is suggested for multilingual datasets and languages that are not covered by the multilingual universal sentence encoder. The transformer is significantly slower than the universal sentence encoder options.

More information on universal-sentence-encoder, universal-sentence-encoder-multilingual, and distiluse-base-multilingual-cased.

Citation

If you would like to cite Top2Vec in your work this is the current reference:

@article{angelov2020top2vec,
      title={Top2Vec: Distributed Representations of Topics}, 
      author={Dimo Angelov},
      year={2020},
      eprint={2008.09470},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Example

Train Model

Train a Top2Vec model on the 20newsgroups dataset.

from top2vec import Top2Vec
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

model = Top2Vec(documents=newsgroups.data, speed="learn", workers=8)

Get Number of Topics

This will return the number of topics that Top2Vec has found in the data.

>>> model.get_num_topics()
77

Get Topic Sizes

This will return the number of documents most similar to each topic. Topics are in decreasing order of size.

topic_sizes, topic_nums = model.get_topic_sizes()

Returns:

  • topic_sizes: The number of documents most similar to each topic.

  • topic_nums: The unique index of every topic will be returned.
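To make "the number of documents most similar to each topic" concrete, here is a small pure-Python sketch of how such sizes could be counted. It assumes cosine similarity as the similarity measure and uses made-up 2-D vectors; `count_topic_sizes` is a hypothetical helper and not the Top2Vec implementation.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def count_topic_sizes(doc_vectors, topic_vectors):
    """Assign each document to its most similar topic and count per topic.

    Returns (sizes, nums) sorted by decreasing size; topics that attract
    no documents are left out.
    """
    counts = Counter()
    for doc in doc_vectors:
        best = max(
            range(len(topic_vectors)),
            key=lambda t: cosine_similarity(doc, topic_vectors[t]),
        )
        counts[best] += 1
    ordered = counts.most_common()  # largest count first
    sizes = [size for _, size in ordered]
    nums = [num for num, _ in ordered]
    return sizes, nums


# Toy data (made up): two topic vectors and four document vectors.
toy_topic_vectors = [[1.0, 0.0], [0.0, 1.0]]
toy_doc_vectors = [[0.1, 1.0], [0.2, 0.9], [0.3, 0.8], [1.0, 0.1]]

sizes, nums = count_topic_sizes(toy_doc_vectors, toy_topic_vectors)
print(sizes, nums)
```

In this toy run, three documents are closest to the second topic vector and one to the first, so the larger topic is listed first.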

Get Topics

This will return the topics in decreasing size.

topic_words, word_scores, topic_nums = model.get_topics(77)

Returns:

  • topic_words: For each topic the top 50 words are returned, in order of semantic similarity to topic.

  • word_scores: For each topic the cosine similarity scores of the top 50 words to the topic are returned.

  • topic_nums: The unique index of every topic will be returned.

Search Topics

We are going to search for topics most similar to medicine.

topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["medicine"], num_topics=5)

Returns:

  • topic_words: For each topic the top 50 words are returned, in order of semantic similarity to topic.

  • word_scores: For each topic the cosine similarity scores of the top 50 words to the topic are returned.

  • topic_scores: For each topic the cosine similarity to the search keywords will be returned.

  • topic_nums: The unique index of every topic will be returned.

>>> topic_nums
[21, 29, 9, 61, 48]

>>> topic_scores
[0.4468, 0.381, 0.2779, 0.2566, 0.2515]

Topic 21 was the most similar topic to "medicine", with a cosine similarity of 0.4468. (Values can range from 0, least similar, to 1, most similar.)
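The returned sequences are aligned by position, so they can be zipped together and filtered. The sketch below reuses the topic numbers and scores printed above as plain Python lists; the helper `topics_above` and the 0.3 cut-off are illustrative assumptions, not part of Top2Vec.

```python
# Values copied from the search_topics output shown above.
topic_nums = [21, 29, 9, 61, 48]
topic_scores = [0.4468, 0.381, 0.2779, 0.2566, 0.2515]


def topics_above(topic_nums, topic_scores, min_score):
    """Return (topic_num, score) pairs whose score is at least min_score."""
    return [
        (num, score)
        for num, score in zip(topic_nums, topic_scores)
        if score >= min_score
    ]


# 0.3 is an arbitrary threshold chosen only for this example.
relevant = topics_above(topic_nums, topic_scores, min_score=0.3)
print(relevant)  # [(21, 0.4468), (29, 0.381)]
```

A threshold like this is one way to keep only the topics that are reasonably close to the search keywords; a suitable value depends on the corpus and the embedding model.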

Generate Word Clouds

Using a topic number you can generate a word cloud. We are going to generate word clouds for the top 5 most similar topics to our medicine topic search from above.

topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["medicine"], num_topics=5)
for topic in topic_nums:
    model.generate_topic_wordcloud(topic)

Search Documents by Topic

We are going to search by topic 48, a topic that appears to be about science.

documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=48, num_docs=5)

Returns:

  • documents: The documents in a list, most similar first.

  • document_scores: Semantic similarity of each document to the topic, measured as the cosine similarity of the document vector and the topic vector.

  • document_ids: Unique ids of the documents. If ids were not given, the index of each document in the original corpus.

For each of the returned documents we are going to print its content, score and document number.

documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=48, num_docs=5)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print("-----------")
    print(doc)
    print("-----------")
    print()

Document: 15227, Score: 0.6322
-----------
  Evolution is both fact and theory.  The THEORY of evolution represents the
scientific attempt to explain the FACT of evolution.  The theory of evolution
does not provide facts; it explains facts.  It can be safely assumed that ALL
scientific theories neither provide nor become facts but rather EXPLAIN facts.
I recommend that you do some appropriate reading in general science.  A good
starting point with regard to evolution for the layman would be "Evolution as
Fact and Theory" in "Hen's Teeth and Horse's Toes" [pp 253-262] by Stephen Jay
Gould.  There is a great deal of other useful information in this publication.
-----------

Document: 14515, Score: 0.6186
-----------
Just what are these "scientific facts"?  I have never heard of such a thing.
Science never proves or disproves any theory - history does.

-Tim
-----------

Document: 9433, Score: 0.5997
-----------
The same way that any theory is proven false.  You examine the predicitions
that the theory makes, and try to observe them.  If you don't, or if you
observe things that the theory predicts wouldn't happen, then you have some 
evidence against the theory.  If the theory can't be modified to 
incorporate the new observations, then you say that it is false.

For example, people used to believe that the earth had been created
10,000 years ago.  But, as evidence showed that predictions from this 
theory were not true, it was abandoned.
-----------

Document: 11917, Score: 0.5845
-----------
The point about its being real or not is that one does not waste time with
what reality might be when one wants predictions. The questions if the
atoms are there or if something else is there making measurements indicate
atoms is not necessary in such a system.

And one does not have to write a new theory of existence everytime new
models are used in Physics.
-----------

...

Semantic Search Documents by Keywords

Search documents for content semantically similar to cryptography and privacy.

documents, document_scores, document_ids = model.search_documents_by_keywords(keywords=["cryptography", "privacy"], num_docs=5)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print("-----------")
    print(doc)
    print("-----------")
    print()

Document: 16837, Score: 0.6112
-----------
...
Email and account privacy, anonymity, file encryption,  academic 
computer policies, relevant legislation and references, EFF, and 
other privacy and rights issues associated with use of the Internet
and global networks in general.
...

Document: 16254, Score: 0.5722
-----------
...
The President today announced a new initiative that will bring
the Federal Government together with industry in a voluntary
program to improve the security and privacy of telephone
communications while meeting the legitimate needs of law
enforcement.
...
-----------
...

Similar Keywords

Search for similar words to space.

words, word_scores = model.similar_words(keywords=["space"], keywords_neg=[], num_words=20)
for word, score in zip(words, word_scores):
    print(f"{word} {score}")

space 1.0
nasa 0.6589
shuttle 0.5976
exploration 0.5448
planetary 0.5391
missions 0.5069
launch 0.4941
telescope 0.4821
astro 0.4696
jsc 0.4549
ames 0.4515
satellite 0.446
station 0.4445
orbital 0.4438
solar 0.4386
astronomy 0.4378
observatory 0.4355
facility 0.4325
propulsion 0.4251
aerospace 0.4226

Comments
  • numpy causing various errors

    I've been having trouble with numpy when using Top2Vec version 1.0.20 with Python 3.8.0 on Ubuntu 18.04; I experience the same problems using Python 3.7.5. I've tried installing numpy 1.0.20, numpy 1.19.5.

    See this issue for the HDBSCAN error, and this issue for the UMAP error.

    UMAP

    PicklingError:
    
    (snip)
    
    /data/.top2vec/lib/python3.8/site-packages/umap/umap_.py in fit(self, X, y)
       2571 
       2572         numba.set_num_threads(self._original_n_threads)
    -> 2573         self._input_hash = joblib.hash(self._raw_data)
       2574 
       2575         return self
    
    /data/.top2vec/lib/python3.8/site-packages/joblib/hashing.py in hash(obj, hash_name, coerce_mmap)
        259     else:
        260         hasher = Hasher(hash_name=hash_name)
    --> 261     return hasher.hash(obj)
    
    /data/.top2vec/lib/python3.8/site-packages/joblib/hashing.py in hash(self, obj, return_digest)
         61     def hash(self, obj, return_digest=True):
         62         try:
    ---> 63             self.dump(obj)
         64         except pickle.PicklingError as e:
         65             e.args += ('PicklingError while hashing %r: %r' % (obj, e),)
    
    (snip)
    
    PicklingError: ("Can't pickle <class 'numpy.dtype[float32]'>: it's not found as numpy.dtype[float32]", 'PicklingError while hashing array([[ 0.002187  , -0.00357572, -0.00279311, ...,  0.00120361,\n        -0.00115495,  0.00059189],\n       [-0.05823869,  0.01436491,  0.02220243, ...,  0.00703284,\n        -0.01716192, -0.01003473],\n       [-0.00334117,  0.00051066,  0.00269544, ...,  0.00070796,\n        -0.00202038, -0.00233051],\n       ...,\n       [ 0.00062888,  0.0027382 ,  0.0044361 , ..., -0.00229976,\n         0.00057765, -0.00033288],\n       [-0.00081269,  0.00099852, -0.00054314, ...,  0.00133646,\n        -0.00026089, -0.00150439],\n       [-0.01297437,  0.0104734 ,  0.01563089, ..., -0.00051685,\n        -0.00144138, -0.00556232]], dtype=float32): PicklingError("Can\'t pickle <class \'numpy.dtype[float32]\'>: it\'s not found as numpy.dtype[float32]")')
    

    HDBSCAN

    from top2vec import Top2Vec
    
    (snip)
    
    /data/.top2vec/lib/python3.8/site-packages/hdbscan/hdbscan_.py in <module>
         19 from scipy.sparse import csgraph
         20 
    ---> 21 from ._hdbscan_linkage import (single_linkage,
         22                                mst_linkage_core,
         23                                mst_linkage_core_vector,
    
    hdbscan/_hdbscan_linkage.pyx in init hdbscan._hdbscan_linkage()
    
    ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
    
    
    opened by AltfunsMA 12
  • Run out of memory on 1.6m point dataset with 300 dimensions.

    Hi, great work on Top2Vec. I am trying to apply it to my dataset, which has 1.6 million instances. I successfully trained Doc2Vec inside Top2Vec with the default 300 dimensions, but I run out of memory in the UMAP step within 2 minutes. I have 32 GB of memory. I also tried low_memory=True, with the same out-of-memory result.

    So I wonder how much memory UMAP will take for 2M points with 300 dimensions? And, as a precaution, how much more memory will HDBSCAN need?

    Thank you!

    opened by kongyq 11
  • How to display Top2Vec Model in HDBSCAN or UMAP ?

    Hello,

    Forgive me for the newbie question, but having successfully built and saved a Top2Vec model:

         How can a saved Top2Vec model be viewed (visually rendered) in HDBSCAN or UMAP?
    

    I may be over looking the obvious, but in reading through the documentation and Googling for answers nothing has jumped out so far.

    Most grateful,

    Chris

    opened by None-Such 10
  • TypeError: __init__() got an unexpected keyword argument 'vector_size'

    Hi,

    I created a conda env with Python 3.6 and installed top2vec.

    I then tried the example below to test the install

    from top2vec import Top2Vec
    from sklearn.datasets import fetch_20newsgroups
    newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
    model = Top2Vec(documents=newsgroups.data, speed="learn", workers=8)

    and I get the following output/error:

    2020-12-18 00:20:24,861 - top2vec - INFO - Pre-processing documents for training
    2020-12-18 00:20:31,459 - top2vec - INFO - Creating joint document/word embedding
    Traceback (most recent call last):
      File "", line 1, in
      File "conda_envs/top2vec/lib/python3.6/site-packages/top2vec/Top2Vec.py", line 285, in __init__
        self.model = Doc2Vec(**doc2vec_args)
      File "/home/.local/lib/python3.6/site-packages/gensim/models/doc2vec.py", line 634, in __init__
        **kwargs)
    TypeError: __init__() got an unexpected keyword argument 'vector_size'

    Can you please help me with it ?

    Thanks

    opened by gianfilippo 10
  • [Installation Issue] Unable to install dependencies(tensorflow-text) while installing Top2Vec

    I am trying to install top2vec but getting the following error when I do 'pip install top2vec==1.0.15'

    ERROR: Could not find a version that satisfies the requirement tensorflow-text (from top2vec) (from versions: none)
    ERROR: No matching distribution found for tensorflow-text (from top2vec)

    I have windows 10, python 3.7, x64.

    From what I understand, currently, tensorflow-text isn't available for Windows, so could you guys provide any resolution for this?

    opened by Alisha1992 10
  • ValueError: numpy.ndarray size changed, may indicate binary incompatibility.

    A few days ago, "ValueError: numpy.ndarray size changed, may indicate binary incompatibility." started occurring when executing code that had worked a week earlier without any problems.

    The same issue occurs with BERTopic (https://github.com/MaartenGr/BERTopic/issues/392), so I thought it might be helpful to link it here. For now, it seems there is no easy solution to the problem.

    opened by maciejbiesek 9
  • What would be the best way to incorporate NER?

    I'd like to use NER to embed broader terms instead of just the unigrams.

    I'm not 100% sure how the unigrams are consumed. So if I wanted to embed "New York" instead of splitting it, what does the format of the tokenizer output need to be?

    opened by datavistics 9
  • ValueError: list.remove(x): x not in list in model.hierarchical_topic_reduction()

    I created a model with

    model= Top2Vec(documents_text2, min_count = 4,
                           speed = "fast-learn", 
                           document_ids=document_ids2, 
                           workers = workers_n,keep_documents=False)
    

    Then I tried to reduce the number of topics with

    model.hierarchical_topic_reduction()

    and get this error

    model10 = model.hierarchical_topic_reduction(1000)
    Traceback (most recent call last):
    
      File "<ipython-input-12-4ee6e263e4a0>", line 1, in <module>
        model10 = model.hierarchical_topic_reduction(1000)
    
      File "C:\Users\anaconda\.conda\envs\top2vec_final\lib\site-packages\top2vec\Top2Vec.py", line 1215, in hierarchical_topic_reduction
        ix_keep.remove(most_sim)
    
    ValueError: list.remove(x): x not in list
    
    opened by p-dre 9
  • ImportError: universal-sentence-encoder is not available.

    Hi! I'm getting the above error on the following code:

    from top2vec import Top2Vec
    model = Top2Vec(documents=df['transcript'].values, speed="learn", embedding_model='universal-sentence-encoder')
    

    Full Exception Traceback:

    ---------------------------------------------------------------------------
    ImportError                               Traceback (most recent call last)
    <ipython-input-3-12fb6ba4e3a8> in <module>
          1 from top2vec import Top2Vec
          2 
    ----> 3 model = Top2Vec(documents=df['transcript'].values, speed="learn", embedding_model='universal-sentence-encoder')
    
    ~\Anaconda3\lib\site-packages\top2vec\Top2Vec.py in __init__(self, documents, min_count, embedding_model, embedding_model_path, speed, use_corpus_file, document_ids, keep_documents, workers, tokenizer, verbose)
        278             self.embedding_model = embedding_model
        279 
    --> 280             self._check_import_status()
        281 
        282             logger.info('Pre-processing documents for training')
    
    ~\Anaconda3\lib\site-packages\top2vec\Top2Vec.py in _check_import_status(self)
        642         if self.embedding_model != 'distiluse-base-multilingual-cased':
        643             if not _HAVE_TENSORFLOW:
    --> 644                 raise ImportError(f"{self.embedding_model} is not available.\n\n"
        645                                   "Try: pip install top2vec[sentence_encoders]\n\n"
        646                                   "Alternatively try: pip install tensorflow tensorflow_hub tensorflow_text")
    
    ImportError: universal-sentence-encoder is not available.
    
    Try: pip install top2vec[sentence_encoders]
    
    Alternatively try: pip install tensorflow tensorflow_hub tensorflow_text
    

    I have all of these libraries installed (see below), but this error won't go away.

    (base) C:\Users\rsiddiqui>pip install top2vec[sentence_encoders] Requirement already satisfied: top2vec[sentence_encoders] in c:\users\rsiddiqui\anaconda3\lib\site-packages (1.0.16) Requirement already satisfied: numpy in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from top2vec[sentence_encoders]) (1.18.5) Requirement already satisfied: umap-learn in c:\users\rsiddiqui\anaconda3\lib\site-packages (from top2vec[sentence_encoders]) (0.4.6) Requirement already satisfied: gensim in c:\users\rsiddiqui\anaconda3\lib\site-packages (from top2vec[sentence_encoders]) (3.8.3) Requirement already satisfied: pandas in c:\users\rsiddiqui\anaconda3\lib\site-packages (from top2vec[sentence_encoders]) (1.1.3) Requirement already satisfied: wordcloud in c:\users\rsiddiqui\anaconda3\lib\site-packages (from top2vec[sentence_encoders]) (1.8.1) Requirement already satisfied: hdbscan in c:\users\rsiddiqui\anaconda3\lib\site-packages (from top2vec[sentence_encoders]) (0.8.26) Requirement already satisfied: pynndescent>=0.4 in c:\users\rsiddiqui\anaconda3\lib\site-packages (from top2vec[sentence_encoders]) (0.5.1) Requirement already satisfied: tensorflow-text; extra == "sentence_encoders" in c:\users\rsiddiqui\anaconda3\lib\site-packages (from top2vec[sentence_encoders]) (2.4.0rc0) Requirement already satisfied: tensorflow-hub; extra == "sentence_encoders" in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from top2vec[sentence_encoders]) (0.9.0) Requirement already satisfied: tensorflow; extra == "sentence_encoders" in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from top2vec[sentence_encoders]) (2.3.1) Requirement already satisfied: numba!=0.47,>=0.46 in c:\users\rsiddiqui\anaconda3\lib\site-packages (from umap-learn->top2vec[sentence_encoders]) (0.51.2) Requirement already satisfied: scikit-learn>=0.20 in c:\users\rsiddiqui\anaconda3\lib\site-packages (from umap-learn->top2vec[sentence_encoders]) (0.23.2) Requirement 
already satisfied: scipy>=1.3.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from umap-learn->top2vec[sentence_encoders]) (1.5.4) Requirement already satisfied: smart-open>=1.8.1 in c:\users\rsiddiqui\anaconda3\lib\site-packages (from gensim->top2vec[sentence_encoders]) (3.0.0) Requirement already satisfied: six>=1.5.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from gensim->top2vec[sentence_encoders]) (1.15.0) Requirement already satisfied: Cython==0.29.14 in c:\users\rsiddiqui\anaconda3\lib\site-packages (from gensim->top2vec[sentence_encoders]) (0.29.14) Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from pandas->top2vec[sentence_encoders]) (2.8.1) Requirement already satisfied: pytz>=2017.2 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from pandas->top2vec[sentence_encoders]) (2020.4) Requirement already satisfied: pillow in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from wordcloud->top2vec[sentence_encoders]) (8.0.1) Requirement already satisfied: matplotlib in c:\users\rsiddiqui\anaconda3\lib\site-packages (from wordcloud->top2vec[sentence_encoders]) (3.2.2) Requirement already satisfied: joblib in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from hdbscan->top2vec[sentence_encoders]) (0.15.1) Requirement already satisfied: llvmlite>=0.30 in c:\users\rsiddiqui\anaconda3\lib\site-packages (from pynndescent>=0.4->top2vec[sentence_encoders]) (0.34.0) Requirement already satisfied: protobuf>=3.8.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow-hub; extra == "sentence_encoders"->top2vec[sentence_encoders]) (3.13.0) Requirement already satisfied: tensorflow-estimator<2.4.0,>=2.3.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (2.3.0) 
Requirement already satisfied: google-pasta>=0.1.8 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (0.2.0) Requirement already satisfied: wheel>=0.26 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (0.35.1) Requirement already satisfied: absl-py>=0.7.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (0.10.0) Requirement already satisfied: h5py<2.11.0,>=2.10.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (2.10.0) Requirement already satisfied: termcolor>=1.1.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.1.0) Requirement already satisfied: keras-preprocessing<1.2,>=1.1.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.1.2) Requirement already satisfied: opt-einsum>=2.3.2 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (3.3.0) Requirement already satisfied: wrapt>=1.11.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.12.1) Requirement already satisfied: grpcio>=1.8.6 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.32.0) Requirement already satisfied: gast==0.3.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == 
"sentence_encoders"->top2vec[sentence_encoders]) (0.3.3) Requirement already satisfied: astunparse==1.6.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.6.3) Requirement already satisfied: tensorboard<3,>=2.3.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (2.4.0) Requirement already satisfied: setuptools in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from numba!=0.47,>=0.46->umap-learn->top2vec[sentence_encoders]) (50.3.2) Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from scikit-learn>=0.20->umap-learn->top2vec[sentence_encoders]) (2.1.0) Requirement already satisfied: requests in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from smart-open>=1.8.1->gensim->top2vec[sentence_encoders]) (2.25.0) Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from matplotlib->wordcloud->top2vec[sentence_encoders]) (1.3.1) Requirement already satisfied: cycler>=0.10 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from matplotlib->wordcloud->top2vec[sentence_encoders]) (0.10.0) Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from matplotlib->wordcloud->top2vec[sentence_encoders]) (2.4.7) Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (0.4.2) Requirement already satisfied: werkzeug>=0.11.15 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow; extra == 
"sentence_encoders"->top2vec[sentence_encoders]) (1.0.1) Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.7.0) Requirement already satisfied: markdown>=2.6.8 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (3.3.3) Requirement already satisfied: google-auth<2,>=1.6.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.23.0) Requirement already satisfied: idna<3,>=2.5 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests->smart-open>=1.8.1->gensim->top2vec[sentence_encoders]) (2.10) Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests->smart-open>=1.8.1->gensim->top2vec[sentence_encoders]) (1.26.2) Requirement already satisfied: certifi>=2017.4.17 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests->smart-open>=1.8.1->gensim->top2vec[sentence_encoders]) (2020.11.8) Requirement already satisfied: chardet<4,>=3.0.2 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests->smart-open>=1.8.1->gensim->top2vec[sentence_encoders]) (3.0.4) Requirement already satisfied: requests-oauthlib>=0.7.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.3.0) Requirement already satisfied: cachetools<5.0,>=2.0.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow; extra == 
"sentence_encoders"->top2vec[sentence_encoders]) (4.1.1) Requirement already satisfied: pyasn1-modules>=0.2.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (0.2.8) Requirement already satisfied: rsa<5,>=3.1.4; python_version >= "3.5" in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (4.6) Requirement already satisfied: oauthlib>=3.0.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (3.1.0) Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from pyasn1-modules>=0.2.1->google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (0.4.8)

    (base) C:\Users\rsiddiqui>pip install tensorflow tensorflow_hub tensorflow_text
    Requirement already satisfied: tensorflow in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (2.3.1)
    Requirement already satisfied: tensorflow_hub in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (0.9.0)
    Requirement already satisfied: tensorflow_text in c:\users\rsiddiqui\anaconda3\lib\site-packages (2.4.0rc0)
    Requirement already satisfied: protobuf>=3.9.2 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (3.13.0)
    Requirement already satisfied: gast==0.3.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (0.3.3)
    Requirement already satisfied: termcolor>=1.1.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (1.1.0)
    Requirement already satisfied: tensorboard<3,>=2.3.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (2.4.0)
    Requirement already satisfied: grpcio>=1.8.6 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (1.32.0)
    Requirement already satisfied: tensorflow-estimator<2.4.0,>=2.3.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (2.3.0)
    Requirement already satisfied: astunparse==1.6.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (1.6.3)
    Requirement already satisfied: six>=1.12.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (1.15.0)
    Requirement already satisfied: wrapt>=1.11.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (1.12.1)
    Requirement already satisfied: google-pasta>=0.1.8 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (0.2.0)
    Requirement already satisfied: keras-preprocessing<1.2,>=1.1.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (1.1.2)
    Requirement already satisfied: wheel>=0.26 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (0.35.1)
    Requirement already satisfied: opt-einsum>=2.3.2 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (3.3.0)
    Requirement already satisfied: h5py<2.11.0,>=2.10.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (2.10.0)
    Requirement already satisfied: numpy<1.19.0,>=1.16.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (1.18.5)
    Requirement already satisfied: absl-py>=0.7.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (0.10.0)
    Requirement already satisfied: setuptools in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from protobuf>=3.9.2->tensorflow) (50.3.2)
    Requirement already satisfied: requests<3,>=2.21.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow) (2.25.0)
    Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow) (1.7.0)
    Requirement already satisfied: werkzeug>=0.11.15 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow) (1.0.1)
    Requirement already satisfied: google-auth<2,>=1.6.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow) (1.23.0)
    Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow) (0.4.2)
    Requirement already satisfied: markdown>=2.6.8 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow) (3.3.3)
    Requirement already satisfied: chardet<4,>=3.0.2 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow) (3.0.4)
    Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow) (1.26.2)
    Requirement already satisfied: idna<3,>=2.5 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow) (2.10)
    Requirement already satisfied: certifi>=2017.4.17 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow) (2020.11.8)
    Requirement already satisfied: rsa<5,>=3.1.4; python_version >= "3.5" in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow) (4.6)
    Requirement already satisfied: pyasn1-modules>=0.2.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow) (0.2.8)
    Requirement already satisfied: cachetools<5.0,>=2.0.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow) (4.1.1)
    Requirement already satisfied: requests-oauthlib>=0.7.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow) (1.3.0)
    Requirement already satisfied: pyasn1>=0.1.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from rsa<5,>=3.1.4; python_version >= "3.5"->google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow) (0.4.8)
    Requirement already satisfied: oauthlib>=3.0.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow) (3.1.0)

    opened by Rmsharks4 9
  • "embedding_model" parameter in Top2Vec is unrecognized

    The code documentation says a pretrained model can be selected with the embedding_model parameter, but the parameter is not recognized, even after I updated the library.

    opened by Prashant118 9
  • TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N

    I'm using a set of text documents (PDF documents converted into text) for topic modeling. While training the model I get the error below. It would be a great help if someone could help me sort this out.

    C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\umap\umap_.py:1678: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
      warn(
    C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\scipy\sparse\linalg\eigen\arpack\arpack.py:1590: RuntimeWarning: k >= N for N * N square matrix. Attempting to use scipy.linalg.eigh instead.
      warnings.warn("k >= N for N * N square matrix. "
    Traceback (most recent call last):
      File "c:/Users/prabo/Desktop/Topic modeling pipeline/test.py", line 27, in <module>
        model = Top2Vec(documents=df.text, speed="learn", workers=8)
      File "C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\top2vec\Top2Vec.py", line 222, in __init__
        umap_model = umap.UMAP(n_neighbors=15,
      File "C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\umap\umap_.py", line 1965, in fit
        self.embedding_ = simplicial_set_embedding(
      File "C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\umap\umap_.py", line 1033, in simplicial_set_embedding
        initialisation = spectral_layout(
      File "C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\umap\spectral.py", line 324, in spectral_layout
        eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(
      File "C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\scipy\sparse\linalg\eigen\arpack\arpack.py", line 1595, in eigsh
        raise TypeError("Cannot use scipy.linalg.eigh for sparse A with "
    TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.

    opened by dulanafdo 9
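The two warnings in the log above suggest that the corpus held no more documents than UMAP's n_neighbors=15, which is too few for the spectral initialisation. A minimal pre-check along those lines is sketched here; the helper name `has_enough_documents` is hypothetical and not part of Top2Vec, and the default threshold is simply the n_neighbors value visible in the traceback.

```python
def has_enough_documents(documents, n_neighbors=15):
    """Return True when there are more non-empty documents than n_neighbors.

    Hypothetical helper: n_neighbors=15 mirrors the value shown in the
    traceback; it is an assumption, not a documented Top2Vec limit.
    """
    # Ignore documents that are empty or contain only whitespace.
    usable = [doc for doc in documents if doc and doc.strip()]
    return len(usable) > n_neighbors


# Example: 10 short documents are too few, 16 pass the check.
print(has_enough_documents(["some text"] * 10))
print(has_enough_documents(["some text"] * 16))
```

Running a check like this before training would at least turn the low-level scipy error into an early, readable failure; splitting long PDFs into smaller documents is one way to get past it.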
  • How can I extract topics from newly added documents at inference time?

    Hey, this is an amazing project to work with. I was wondering whether there is any way to extract topics from newly added documents at inference time. Thanks in advance.

    opened by meetttttt 0
  • Stop words are included in the model and topics are generated with them

    Here are my topic_words outputs:

    0 Words: ['and' 'the' 'in' 'to' 'of' 'games' 'or' 'first' 'game' 'that' 'by' 'at' 'is' 'released' 'with' 'as' 'its' 'was' 'from' 'developed' 'for' 'it' 'series' 'video' 'were' 'produced' 'an' 'on' 'designed' 'aircraft' 'published' 'built']
    1 Words: ['series' 'games' 'an' 'was' 'by' 'with' 'and' 'first' 'published' 'in' 'is' 'from' 'released' 'of' 'to' 'as' 'the' 'it' 'at' 'were' 'designed' 'for' 'or' 'game' 'aircraft' 'its' 'on' 'built' 'that' 'produced' 'video' 'developed']

    The documentation says that no stop word removal is needed before using Top2Vec, and in a YouTube tutorial the presenter simply called Top2Vec with default parameters and got topics without stop words. Am I doing something wrong, or is this a bug?

    Thanks

    opened by cuneyttyler 0
  • AttributeError: 'CountVectorizer' object has no attribute 'get_feature_names'

    Hello,

    Following the examples in the readme I created this code:

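    import multiprocessing

    from top2vec import Top2Vec

    # df is the reporter's own DataFrame (presumably pandas) with a "cleaned_message" column.
    documents = df["cleaned_message"].tolist()
    model = Top2Vec(
        documents,
        embedding_model="universal-sentence-encoder",
        speed="learn",
        workers=multiprocessing.cpu_count() - 1,
    )

    print(f"Num topics: {model.get_num_topics()}")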

    And that is throwing the following error:

    AttributeError: 'CountVectorizer' object has no attribute 'get_feature_names'

    opened by jmorenobl 1
  • access topic/document/etc. vectors

    First of all, great package! It is awesome to use!

    I was wondering if it is possible to access individual vectors on different levels of the model. For example, if I want to extract the 3 topics that cover the most documents I would want to use a combination of between-topic spread and within-topic spread of the vectors. Is it possible to extract these from the trained model?

    thanks in advance!

    opened by SjoerdBraaksma 0
  • How to get bigram, trigram, and n-gram topic words?

    As I recall, LDA and NMF pipelines have a configuration parameter called ngram_range; setting it to (2,2) or (3,3) produces bigram or trigram topic words. Is there a similar option in Top2Vec for getting bigram, trigram, or n-gram topic words?

    opened by sivachaitanya 2
Releases(1.0.27)
  • 1.0.27(Apr 3, 2022)

    • New pre-trained transformer models available
    • Ability to use any embedding model by passing callable to embedding_model
    • New embedding_batch_size option
    • Document chunking options for long documents
    • Phrases in topics by setting ngram_vocab=True
    Source code(tar.gz)
    Source code(zip)
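The callable option in 1.0.27 can be illustrated with a toy embedding function. This is a sketch under the assumption that the callable receives a list of strings and returns a 2-D NumPy array with one row per document; the function name `hashed_bow_embedding`, its `dim` parameter, and the hashing scheme are all hypothetical and chosen only to show the interface shape, not to produce useful topics.

```python
import hashlib

import numpy as np


def hashed_bow_embedding(documents, dim=64):
    """Embed each document as an L2-normalised hashed bag-of-words vector.

    Toy stand-in for a real sentence encoder: every token is hashed into
    one of `dim` buckets and the bucket counts form the vector.
    """
    vectors = np.zeros((len(documents), dim), dtype=np.float32)
    for row, text in enumerate(documents):
        for token in text.lower().split():
            # md5 keeps the bucket assignment stable across processes.
            digest = hashlib.md5(token.encode("utf-8")).digest()
            col = int.from_bytes(digest[:4], "little") % dim
            vectors[row, col] += 1.0
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # leave empty documents as zero vectors
    return vectors / norms


# Assumed usage, based on the release note rather than verified here:
# model = Top2Vec(documents, embedding_model=hashed_bow_embedding)
print(hashed_bow_embedding(["topic models", "semantic search"]).shape)
```

A real use would wrap a pretrained encoder in the same way, keeping the input and output shapes the same.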
  • 1.0.25(Jun 23, 2021)

    Added query_documents and query_topics methods which allow for using a sequence of text such as a question, a sentence, a paragraph or a document to query documents or topics.

    Added num_topics parameter to get_documents_topics method which allows retrieving multiple topics per document.

    Source code(tar.gz)
    Source code(zip)
  • 1.0.24(Apr 1, 2021)

  • 1.0.23(Feb 12, 2021)

  • 1.0.22(Feb 12, 2021)

  • 1.0.21(Feb 5, 2021)

  • 1.0.20(Jan 9, 2021)

    Added use_embedding_model_tokenizer parameter. If set to True and if using an embedding_model other than doc2vec, use the model's tokenizer for document embedding.

    Fixed dependency issue with joblib.

    Fixed issues with wordclouds caused by negative similarity scores.

    Source code(tar.gz)
    Source code(zip)
  • 1.0.19(Dec 10, 2020)

  • 1.0.18(Dec 10, 2020)

    Added an option for indexing word vectors, which speeds up search for models with large vocabularies, specifically search_words_by_vector and similar_words.

    Added new method search_words_by_vector.

    Source code(tar.gz)
    Source code(zip)
  • 1.0.17(Dec 7, 2020)

    Added an option for indexing document vectors, which speeds up search for models with a large number of documents, specifically search_documents_by_vector, search_documents_by_keywords, and search_documents_by_documents.

    Added new method search_documents_by_vector.

    Added code to prevent hierarchical topic reduction error #79.

    Source code(tar.gz)
    Source code(zip)
  • 1.0.16(Nov 10, 2020)

    Dependencies for the universal sentence encoder and BERT sentence transformer options are now optional. They can be installed with pip install top2vec[sentence_encoders] and pip install top2vec[sentence_transformers].

    Faster cosine similarity.

    Source code(tar.gz)
    Source code(zip)
  • 1.0.15(Oct 16, 2020)

    The verbose parameter will be set to True by default.

    Fixed a bug that stopped showing logging updates after downloading pre-trained models.

    Source code(tar.gz)
    Source code(zip)
  • 1.0.12(Oct 15, 2020)

    Top2Vec now has an option to choose the embedding model with doc2vec, universal-sentence-encoder, universal-sentence-encoder-multilingual, and distiluse-base-multilingual-cased as the options.

    A get_documents_topics method was added.

    Source code(tar.gz)
    Source code(zip)
  • 1.0.11(Oct 8, 2020)

    Added a method for deleting documents from model.

    Fixed bug when using corpus_file that resulted in documents getting dropped. Fixed bug when using add_documents and delete_documents which resulted in improper ordering of topic words.

    Source code(tar.gz)
    Source code(zip)
  • 1.0.10(Aug 29, 2020)

    Fixed an issue with the UMAP install caused by a missing comma in the setup.py file. Added an optional min_count parameter; the default is still 50. All words with a total frequency lower than min_count are ignored by the model.

    Source code(tar.gz)
    Source code(zip)
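The effect of min_count can be shown with a small frequency filter. This is only an illustration of the rule stated in the release note, not the library's own code; the function name `filter_vocabulary` is hypothetical.

```python
from collections import Counter


def filter_vocabulary(tokenized_docs, min_count=50):
    """Return the set of words whose total frequency is at least min_count."""
    # Count every token occurrence across the whole corpus.
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {word for word, n in counts.items() if n >= min_count}


docs = [["game", "series"], ["game"], ["game", "aircraft"]]
# With min_count=2 only "game" (3 occurrences) survives.
print(filter_vocabulary(docs, min_count=2))
```

On small corpora the default of 50 can discard most of the vocabulary, so lowering min_count is worth trying there.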
  • 1.0.9(Jun 26, 2020)

    Added functionality to perform hierarchical topic reduction. Added the ability to add new documents to an already trained model. Added use_corpus option which may lead to faster training with very large datasets in multi-worker environments.

    Source code(tar.gz)
    Source code(zip)
  • 1.0.8(Apr 18, 2020)

    Added an option for custom document ids, which can be strings or ints. Added an option to not save documents in the model, which allows the trained model to be used as an index and makes saved models smaller. Added the ability to pass in a custom tokenizer that overrides the default. Added a verbose mode that logs the status of training. Also added the ability to search documents by multiple documents, as well as positive and negative semantic search.

    Source code(tar.gz)
    Source code(zip)
  • 1.0.7(Apr 7, 2020)

    Topic size is now defined as the number of document vectors that have the topic as their nearest topic vector. Search by topic has been modified to show only documents whose nearest topic is the queried topic, in order to avoid overlapping results from similar topics.

    Topic deduplication is added to make topics more robust.

    Source code(tar.gz)
    Source code(zip)
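The topic-size definition from 1.0.7 can be sketched in a few lines. This is a minimal illustration that assumes cosine similarity as the closeness measure and uses plain Python lists for vectors; the function names are hypothetical and this is not the library's implementation.

```python
from collections import Counter
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_topic(doc_vector, topic_vectors):
    """Index of the topic vector most similar to the document vector."""
    return max(
        range(len(topic_vectors)),
        key=lambda i: cosine(doc_vector, topic_vectors[i]),
    )


def topic_sizes(doc_vectors, topic_vectors):
    """Number of documents whose nearest topic is each topic, by topic index."""
    counts = Counter(nearest_topic(d, topic_vectors) for d in doc_vectors)
    return [counts.get(i, 0) for i in range(len(topic_vectors))]


topics = [[1.0, 0.0], [0.0, 1.0]]
docs = [[0.9, 0.1], [0.8, 0.3], [0.1, 1.0]]
print(topic_sizes(docs, topics))  # → [2, 1]
```

Because each document is counted under exactly one topic, the sizes sum to the number of documents, which is what keeps search-by-topic results from overlapping.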
  • 1.0.6(Mar 25, 2020)

Owner
Dimo Angelov
Data Scientist