Top2Vec is an algorithm for topic modeling and semantic search.

Last update: Jan 06, 2023

Overview

Update: Pre-trained Universal Sentence Encoders and a BERT Sentence Transformer are now available for embedding.

Top2Vec

Top2Vec is an algorithm for topic modeling and semantic search. It automatically detects topics present in text and generates jointly embedded topic, document and word vectors. Once you train the Top2Vec model you can:

Get number of detected topics.
Get topics.
Get topic sizes.
Get hierarchical topics.
Search topics by keywords.
Search documents by topic.
Search documents by keywords.
Find similar words.
Find similar documents.
Expose model with RESTful-Top2Vec.

See the paper for more details on how it works.

Benefits

Automatically finds number of topics.
No stop word lists required.
No need for stemming/lemmatization.
Works on short text.
Creates jointly embedded topic, document, and word vectors.
Has search functions built in.

How does it work?

The algorithm assumes that many semantically similar documents are indicative of an underlying topic. The first step is to create a joint embedding of document and word vectors. Once documents and words are embedded in a vector space, the goal of the algorithm is to find dense clusters of documents, then identify which words drew those documents together. Each dense area is a topic, and the words that drew the documents to the dense area are the topic words.

The Algorithm:

1. Create jointly embedded document and word vectors using Doc2Vec or Universal Sentence Encoder or BERT Sentence Transformer.

Documents will be placed close to other similar documents and close to the most distinguishing words.

2. Create lower dimensional embedding of document vectors using UMAP.

Document vectors in high-dimensional space are very sparse, so dimension reduction helps in finding dense areas. Each point is a document vector.

3. Find dense areas of documents using HDBSCAN.

The colored areas are the dense areas of documents. Red points are outliers that do not belong to a specific cluster.

4. For each dense area, calculate the centroid of the document vectors in the original dimension; this is the topic vector.

The red points are outlier documents and do not get used for calculating the topic vector. The purple points are the document vectors that belong to a dense area, from which the topic vector is calculated.

5. Find n-closest word vectors to the resulting topic vector.

The closest word vectors in order of proximity become the topic words.
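Steps 4 and 5 can be sketched in a few lines of plain Python. This is a toy illustration with made-up 2-dimensional vectors, not Top2Vec's own code: the helper names (`centroid`, `cosine`, `closest_words`) and all the data are hypothetical, and the cluster labels are assumed to come from a clustering step that marks outliers with -1, as HDBSCAN does.

```python
import math


def centroid(vectors):
    """Arithmetic mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_words(topic_vector, word_vectors, n):
    """Return the n words whose vectors are most similar to the topic vector."""
    ranked = sorted(
        word_vectors,
        key=lambda word: cosine(topic_vector, word_vectors[word]),
        reverse=True,
    )
    return ranked[:n]


# Toy document vectors and hypothetical cluster labels (-1 marks an outlier).
doc_vectors = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0], [0.0, 1.0]]
labels = [0, 0, 0, -1]

# Toy word vectors living in the same space as the documents.
word_vectors = {
    "nasa": [1.0, 0.1],
    "orbit": [0.7, 0.3],
    "recipe": [0.0, 1.0],
}

# Step 4: the topic vector is the centroid of the non-outlier documents.
cluster_docs = [vec for vec, label in zip(doc_vectors, labels) if label == 0]
topic_vector = centroid(cluster_docs)

# Step 5: the nearest word vectors, in order of proximity, are the topic words.
top_words = closest_words(topic_vector, word_vectors, n=2)
print(top_words)
```

The outlier document is left out of the centroid, so it has no influence on which words end up describing the topic.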

Installation

The easy way to install Top2Vec is:

pip install top2vec

To install pre-trained universal sentence encoder options:

pip install top2vec[sentence_encoders]

To install pre-trained BERT sentence transformer options:

pip install top2vec[sentence_transformers]

To install indexing options:

pip install top2vec[indexing]

Usage

from top2vec import Top2Vec

model = Top2Vec(documents)

Important parameters:

documents: Input corpus, should be a list of strings.
speed: Determines how long the model takes to train. The 'fast-learn' option is the fastest and generates the lowest-quality vectors. The 'learn' option learns better-quality vectors but takes longer to train. The 'deep-learn' option learns the best-quality vectors but takes significant time to train.
workers: The number of worker threads used to train the model. More workers lead to faster training.
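As a rough sketch, the parameters can be checked before a long training run is started. The helper `top2vec_kwargs` below is hypothetical and not part of Top2Vec; its default values are its own choices, and it only encodes the three speed options and the list-of-strings requirement described here.

```python
VALID_SPEEDS = ("fast-learn", "learn", "deep-learn")


def top2vec_kwargs(documents, speed="learn", workers=1):
    """Validate the main parameters and return them as keyword arguments.

    Hypothetical helper: it does not import or call Top2Vec itself.
    """
    if not isinstance(documents, list) or not all(
        isinstance(doc, str) for doc in documents
    ):
        raise TypeError("documents should be a list of strings")
    if speed not in VALID_SPEEDS:
        raise ValueError(f"speed must be one of {VALID_SPEEDS}, got {speed!r}")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return {"documents": documents, "speed": speed, "workers": workers}


kwargs = top2vec_kwargs(["first document", "second document"],
                        speed="deep-learn", workers=8)
print(kwargs["speed"], kwargs["workers"])

# With top2vec installed, the result could then be passed on:
# from top2vec import Top2Vec
# model = Top2Vec(**kwargs)
```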

Trained models can be saved and loaded.

model.save("filename")
model = Top2Vec.load("filename")

For more information view the API guide.

Pretrained Models

Doc2Vec will be used by default to generate the joint word and document embeddings. However, there are also pretrained embedding_model options for generating joint word and document embeddings:

universal-sentence-encoder
universal-sentence-encoder-multilingual
distiluse-base-multilingual-cased

from top2vec import Top2Vec

model = Top2Vec(documents, embedding_model='universal-sentence-encoder')

For large data sets and data sets with very unique vocabulary, doc2vec could produce better results. This will train a doc2vec model from scratch. This method is language agnostic; however, multiple languages will not be aligned.

Using the universal sentence encoder options will be much faster since those are pre-trained and efficient models. The universal sentence encoder options are suggested for smaller data sets. They are also good options for large data sets that are in English or in languages covered by the multilingual model, and they are suggested for data sets that are multilingual.

The distiluse-base-multilingual-cased pre-trained sentence transformer is suggested for multilingual datasets and languages that are not covered by the multilingual universal sentence encoder. The transformer is significantly slower than the universal sentence encoder options.

More information on universal-sentence-encoder, universal-sentence-encoder-multilingual, and distiluse-base-multilingual-cased.
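The guidance above can be condensed into a small decision helper. This is only one possible reading of it, not an official rule: the function `suggest_embedding_model` and its flags are hypothetical, and the string 'doc2vec' is assumed to be the value that selects the default Doc2Vec embedding.

```python
def suggest_embedding_model(english_only,
                            covered_by_multilingual_use=True,
                            unusual_vocabulary=False):
    """Suggest an embedding_model name from coarse traits of a corpus.

    Hypothetical helper that paraphrases the guidance in this section.
    """
    if unusual_vocabulary:
        # Training from scratch may suit very unique vocabulary better.
        return "doc2vec"
    if english_only:
        return "universal-sentence-encoder"
    if covered_by_multilingual_use:
        return "universal-sentence-encoder-multilingual"
    # Languages outside the multilingual universal sentence encoder.
    return "distiluse-base-multilingual-cased"


print(suggest_embedding_model(english_only=True))
print(suggest_embedding_model(english_only=False))
print(suggest_embedding_model(english_only=False,
                              covered_by_multilingual_use=False))
```

The suggestion is only a starting point; speed and result quality on the actual corpus remain the deciding factors.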

Citation

If you would like to cite Top2Vec in your work this is the current reference:

@article{angelov2020top2vec,
      title={Top2Vec: Distributed Representations of Topics}, 
      author={Dimo Angelov},
      year={2020},
      eprint={2008.09470},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Example

Train Model

Train a Top2Vec model on the 20newsgroups dataset.

from top2vec import Top2Vec
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

model = Top2Vec(documents=newsgroups.data, speed="learn", workers=8)

Get Number of Topics

This will return the number of topics that Top2Vec has found in the data.

>>> model.get_num_topics()
77

Get Topic Sizes

This will return the number of documents most similar to each topic. Topics are in decreasing order of size.

topic_sizes, topic_nums = model.get_topic_sizes()

Returns:

topic_sizes: The number of documents most similar to each topic.
topic_nums: The unique index of every topic will be returned.
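To make the two return values concrete, here is a toy sketch of how sizes in decreasing order could be derived from a list of per-document topic assignments. The function `topic_sizes_from_assignments` and the sample data are hypothetical; they illustrate the shape of the result, not how Top2Vec computes it internally.

```python
from collections import Counter


def topic_sizes_from_assignments(doc_topics):
    """Count documents per topic and return (sizes, topic numbers), largest first."""
    ranked = Counter(doc_topics).most_common()
    topic_nums = [topic for topic, _ in ranked]
    topic_sizes = [size for _, size in ranked]
    return topic_sizes, topic_nums


# Each entry is the topic that one document is most similar to (toy data).
doc_topics = [2, 0, 2, 1, 2, 0]

topic_sizes, topic_nums = topic_sizes_from_assignments(doc_topics)
for size, num in zip(topic_sizes, topic_nums):
    print(f"Topic {num}: {size} documents")
```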

Get Topics

This will return the topics in decreasing size.

topic_words, word_scores, topic_nums = model.get_topics(77)

Returns:

topic_words: For each topic the top 50 words are returned, in order of semantic similarity to topic.
word_scores: For each topic the cosine similarity scores of the top 50 words to the topic are returned.
topic_nums: The unique index of every topic will be returned.

Search Topics

We are going to search for topics most similar to medicine.

topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["medicine"], num_topics=5)

Returns:

topic_words: For each topic the top 50 words are returned, in order of semantic similarity to topic.
word_scores: For each topic the cosine similarity scores of the top 50 words to the topic are returned.
topic_scores: For each topic the cosine similarity to the search keywords will be returned.
topic_nums: The unique index of every topic will be returned.

>>> topic_nums
[21, 29, 9, 61, 48]

>>> topic_scores
[0.4468, 0.381, 0.2779, 0.2566, 0.2515]

Topic 21 was the most similar topic to "medicine", with a cosine similarity of 0.4468. (Higher values mean greater similarity; 1 is the maximum.)
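For reference, cosine similarity compares only the direction of two vectors. The short sketch below uses made-up vectors and a hypothetical helper `cosine_similarity` to show the range of the score: vectors pointing the same way score close to 1, unrelated (orthogonal) vectors score 0, and opposite vectors score -1.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Because vector length does not affect the score, a long and a short document pointing in the same direction are treated as equally similar to a topic.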

Generate Word Clouds

Using a topic number you can generate a word cloud. We are going to generate word clouds for the top 5 most similar topics to our medicine topic search from above.

topic_words, word_scores, topic_scores, topic_nums = model.search_topics(keywords=["medicine"], num_topics=5)
for topic in topic_nums:
    model.generate_topic_wordcloud(topic)

Search Documents by Topic

We are going to search by topic 48, a topic that appears to be about science.

documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=48, num_docs=5)

Returns:

documents: The documents in a list, most similar first.
document_scores: Semantic similarity of each document to the topic, measured as the cosine similarity of the document and topic vectors.
document_ids: Unique ids of documents. If ids were not given, the index of each document in the original corpus.

For each of the returned documents we are going to print its content, score and document number.

documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=48, num_docs=5)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print("-----------")
    print(doc)
    print("-----------")
    print()

Document: 15227, Score: 0.6322
-----------
  Evolution is both fact and theory.  The THEORY of evolution represents the
scientific attempt to explain the FACT of evolution.  The theory of evolution
does not provide facts; it explains facts.  It can be safely assumed that ALL
scientific theories neither provide nor become facts but rather EXPLAIN facts.
I recommend that you do some appropriate reading in general science.  A good
starting point with regard to evolution for the layman would be "Evolution as
Fact and Theory" in "Hen's Teeth and Horse's Toes" [pp 253-262] by Stephen Jay
Gould.  There is a great deal of other useful information in this publication.
-----------

Document: 14515, Score: 0.6186
-----------
Just what are these "scientific facts"?  I have never heard of such a thing.
Science never proves or disproves any theory - history does.

-Tim
-----------

Document: 9433, Score: 0.5997
-----------
The same way that any theory is proven false.  You examine the predicitions
that the theory makes, and try to observe them.  If you don't, or if you
observe things that the theory predicts wouldn't happen, then you have some 
evidence against the theory.  If the theory can't be modified to 
incorporate the new observations, then you say that it is false.

For example, people used to believe that the earth had been created
10,000 years ago.  But, as evidence showed that predictions from this 
theory were not true, it was abandoned.
-----------

Document: 11917, Score: 0.5845
-----------
The point about its being real or not is that one does not waste time with
what reality might be when one wants predictions. The questions if the
atoms are there or if something else is there making measurements indicate
atoms is not necessary in such a system.

And one does not have to write a new theory of existence everytime new
models are used in Physics.
-----------

...

Semantic Search Documents by Keywords

Search documents for content semantically similar to cryptography and privacy.

documents, document_scores, document_ids = model.search_documents_by_keywords(keywords=["cryptography", "privacy"], num_docs=5)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print("-----------")
    print(doc)
    print("-----------")
    print()

Document: 16837, Score: 0.6112
-----------
...
Email and account privacy, anonymity, file encryption,  academic 
computer policies, relevant legislation and references, EFF, and 
other privacy and rights issues associated with use of the Internet
and global networks in general.
...

Document: 16254, Score: 0.5722
-----------
...
The President today announced a new initiative that will bring
the Federal Government together with industry in a voluntary
program to improve the security and privacy of telephone
communications while meeting the legitimate needs of law
enforcement.
...
-----------
...

Similar Keywords

Search for similar words to space.

words, word_scores = model.similar_words(keywords=["space"], keywords_neg=[], num_words=20)
for word, score in zip(words, word_scores):
    print(f"{word} {score}")

space 1.0
nasa 0.6589
shuttle 0.5976
exploration 0.5448
planetary 0.5391
missions 0.5069
launch 0.4941
telescope 0.4821
astro 0.4696
jsc 0.4549
ames 0.4515
satellite 0.446
station 0.4445
orbital 0.4438
solar 0.4386
astronomy 0.4378
observatory 0.4355
facility 0.4325
propulsion 0.4251
aerospace 0.4226

Comments

numpy causing various errors

I've been having trouble with numpy when using Top2Vec version 1.0.20 with Python 3.8.0 on Ubuntu 18.04; I experience the same problems using Python 3.7.5. I've tried installing numpy 1.0.20, numpy 1.19.5.

See this issue for the HDBSCAN error, and this issue for the UMAP error.

UMAP

PicklingError:

(snip)

/data/.top2vec/lib/python3.8/site-packages/umap/umap_.py in fit(self, X, y)
   2571 
   2572         numba.set_num_threads(self._original_n_threads)
-> 2573         self._input_hash = joblib.hash(self._raw_data)
   2574 
   2575         return self

/data/.top2vec/lib/python3.8/site-packages/joblib/hashing.py in hash(obj, hash_name, coerce_mmap)
    259     else:
    260         hasher = Hasher(hash_name=hash_name)
--> 261     return hasher.hash(obj)

/data/.top2vec/lib/python3.8/site-packages/joblib/hashing.py in hash(self, obj, return_digest)
     61     def hash(self, obj, return_digest=True):
     62         try:
---> 63             self.dump(obj)
     64         except pickle.PicklingError as e:
     65             e.args += ('PicklingError while hashing %r: %r' % (obj, e),)

(snip)

PicklingError: ("Can't pickle <class 'numpy.dtype[float32]'>: it's not found as numpy.dtype[float32]", 'PicklingError while hashing array([[ 0.002187  , -0.00357572, -0.00279311, ...,  0.00120361,\n        -0.00115495,  0.00059189],\n       [-0.05823869,  0.01436491,  0.02220243, ...,  0.00703284,\n        -0.01716192, -0.01003473],\n       [-0.00334117,  0.00051066,  0.00269544, ...,  0.00070796,\n        -0.00202038, -0.00233051],\n       ...,\n       [ 0.00062888,  0.0027382 ,  0.0044361 , ..., -0.00229976,\n         0.00057765, -0.00033288],\n       [-0.00081269,  0.00099852, -0.00054314, ...,  0.00133646,\n        -0.00026089, -0.00150439],\n       [-0.01297437,  0.0104734 ,  0.01563089, ..., -0.00051685,\n        -0.00144138, -0.00556232]], dtype=float32): PicklingError("Can\'t pickle <class \'numpy.dtype[float32]\'>: it\'s not found as numpy.dtype[float32]")')

HDBSCAN

from top2vec import Top2Vec

(snip)

/data/.top2vec/lib/python3.8/site-packages/hdbscan/hdbscan_.py in <module>
     19 from scipy.sparse import csgraph
     20 
---> 21 from ._hdbscan_linkage import (single_linkage,
     22                                mst_linkage_core,
     23                                mst_linkage_core_vector,

hdbscan/_hdbscan_linkage.pyx in init hdbscan._hdbscan_linkage()

ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

opened by AltfunsMA 12

Run out of memory on 1.6m point dataset with 300 dimensions.

Hi, great work on Top2Vec. I am trying to apply it to my dataset, which has 1.6 million instances. I successfully trained Doc2Vec inside Top2Vec with the default 300 dimensions, but I run out of memory in the UMAP step within 2 minutes. BTW, I have 32 GB of memory. I also tried low_memory=True, with the same OOM.

So I wonder how much memory UMAP will take for 2m points with 300 dimensions? As a precaution, how much more memory will HDBSCAN need?

Thank you!
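As a point of reference for the numbers in this question, the raw document-vector matrix alone can be sized with simple arithmetic. The sketch below assumes the vectors are stored as 4-byte float32 values; it does not estimate the extra working memory that UMAP or HDBSCAN need on top of the input, which is where the out-of-memory condition is reported.

```python
def dense_matrix_bytes(n_rows, n_cols, bytes_per_value=4):
    """Size in bytes of a dense matrix (4 bytes per value assumes float32)."""
    return n_rows * n_cols * bytes_per_value


raw_bytes = dense_matrix_bytes(1_600_000, 300)
raw_gib = raw_bytes / 2**30

print(f"{raw_bytes} bytes, about {raw_gib:.2f} GiB")
```

The input matrix is a small fraction of 32 GB, which suggests that the memory pressure comes from the intermediate structures built during dimension reduction rather than from the vectors themselves.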

opened by kongyq 11

How to display Top2Vec Model in HDBSCAN or UMAP?

Hello,

Forgive me for the newbie question, but having successfully built and saved a Top2Vec model:

How can a saved Top2Vec model be viewed (visually rendered) in HDBSCAN or UMAP?

I may be overlooking the obvious, but in reading through the documentation and Googling for answers nothing has jumped out so far.

Most grateful,

Chris

opened by None-Such 10

TypeError: __init__() got an unexpected keyword argument 'vector_size'

Hi,

I created a conda env with Python 3.6 and installed top2vec.

I then tried the example below to test the install

from top2vec import Top2Vec
from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))

model = Top2Vec(documents=newsgroups.data, speed="learn", workers=8)

and I get the following output/error:

2020-12-18 00:20:24,861 - top2vec - INFO - Pre-processing documents for training
2020-12-18 00:20:31,459 - top2vec - INFO - Creating joint document/word embedding
Traceback (most recent call last):
  File "", line 1, in
  File "conda_envs/top2vec/lib/python3.6/site-packages/top2vec/Top2Vec.py", line 285, in __init__
    self.model = Doc2Vec(**doc2vec_args)
  File "/home/.local/lib/python3.6/site-packages/gensim/models/doc2vec.py", line 634, in __init__
    **kwargs)
TypeError: __init__() got an unexpected keyword argument 'vector_size'

Can you please help me with it ?

Thanks
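The traceback above loads Top2Vec from the conda environment but gensim from a path under /home/.local, so it can help to check which copy of a module Python actually resolves. The sketch below uses only the standard library; `module_origin` is a hypothetical helper name, and the module names passed to it are just examples.

```python
import importlib.util


def module_origin(name):
    """Return the file a module would be imported from, or None if not found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    return spec.origin if spec is not None else None


# 'json' is used here only because it ships with Python; in the situation
# above the names of interest would be 'gensim' and 'top2vec'.
for name in ("json", "gensim", "top2vec"):
    print(f"{name}: {module_origin(name)}")
```

If a dependency resolves to a user-level site-packages directory instead of the environment, an older copy installed there may be shadowing the version the environment expects.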

opened by gianfilippo 10

[Installation Issue] Unable to install dependencies (tensorflow-text) while installing Top2Vec

I am trying to install top2vec but getting the following error when I do 'pip install top2vec==1.0.15'

ERROR: Could not find a version that satisfies the requirement tensorflow-text (from top2vec) (from versions: none)
ERROR: No matching distribution found for tensorflow-text (from top2vec)

I have Windows 10, Python 3.7, x64.

From what I understand, currently, tensorflow-text isn't available for Windows, so could you guys provide any resolution for this?

opened by Alisha1992 10

ValueError: numpy.ndarray size changed, may indicate binary incompatibility.

A few days ago, the error "ValueError: numpy.ndarray size changed, may indicate binary incompatibility." started to occur when executing code that had worked without any problems one week earlier.

The same issue occurs with BERTopic (https://github.com/MaartenGr/BERTopic/issues/392), so I thought it would be beneficial to link it here. For now, it seems there is no easy solution to this problem.
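This message typically appears when a compiled extension was built against a different NumPy version than the one currently installed, so a useful first step is to record the versions that are actually present. The sketch below uses only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_versions` and the package list are illustrative assumptions, not a fix for the error.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(["numpy", "hdbscan", "umap-learn", "top2vec"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

Including this output in an issue report makes it easier to see whether NumPy changed underneath an already-built package.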

opened by maciejbiesek 9

What would be the best way to incorporate NER?

I'd like to use an NER to embed broader terms instead of just the unigrams.

I'm not 100% sure how the unigrams are consumed. So if I wanted to embed "New York" instead of splitting it, what does the format of the output of the tokenizer need to be?
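One way to picture the question is a tokenizer that joins known multi-word terms into single tokens. The sketch below is hypothetical: it assumes that the tokenizer argument of Top2Vec accepts a callable that takes one document string and returns a list of string tokens, which should be checked against the API guide. The phrase table stands in for the output of an NER step.

```python
import re

# Hypothetical phrase table; in practice this could be filled from NER output.
PHRASES = {("new", "york"): "new_york"}


def phrase_tokenizer(document):
    """Lowercase and split a document, merging known two-word phrases."""
    tokens = re.findall(r"[a-z0-9]+", document.lower())
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in PHRASES:
            merged.append(PHRASES[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = phrase_tokenizer("I moved to New York last year.")
print(tokens)

# With top2vec installed, and under the assumption stated above:
# model = Top2Vec(documents, tokenizer=phrase_tokenizer)
```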

opened by datavistics 9

ValueError: list.remove(x): x not in list in model.hierarchical_topic_reduction()

I created a model with

model = Top2Vec(documents_text2, min_count=4,
                speed="fast-learn",
                document_ids=document_ids2,
                workers=workers_n, keep_documents=False)

Then I tried to reduce the number of topics with

model.hierarchical_topic_reduction()

and get this error

model10 = model.hierarchical_topic_reduction(1000)
Traceback (most recent call last):

  File "<ipython-input-12-4ee6e263e4a0>", line 1, in <module>
    model10 = model.hierarchical_topic_reduction(1000)

  File "C:\Users\anaconda\.conda\envs\top2vec_final\lib\site-packages\top2vec\Top2Vec.py", line 1215, in hierarchical_topic_reduction
    ix_keep.remove(most_sim)

ValueError: list.remove(x): x not in list

opened by p-dre 9

ImportError: universal-sentence-encoder is not available.

Hi! I'm getting the above error on the following code:

from top2vec import Top2Vec

model = Top2Vec(documents=df['transcript'].values, speed="learn", embedding_model='universal-sentence-encoder')

Full Exception Traceback:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-3-12fb6ba4e3a8> in <module>
      1 from top2vec import Top2Vec
      2 
----> 3 model = Top2Vec(documents=df['transcript'].values, speed="learn", embedding_model='universal-sentence-encoder')

~\Anaconda3\lib\site-packages\top2vec\Top2Vec.py in __init__(self, documents, min_count, embedding_model, embedding_model_path, speed, use_corpus_file, document_ids, keep_documents, workers, tokenizer, verbose)
    278         self.embedding_model = embedding_model
    279 
--> 280         self._check_import_status()
    281 
    282         logger.info('Pre-processing documents for training')

~\Anaconda3\lib\site-packages\top2vec\Top2Vec.py in _check_import_status(self)
    642         if self.embedding_model != 'distiluse-base-multilingual-cased':
    643             if not _HAVE_TENSORFLOW:
--> 644                 raise ImportError(f"{self.embedding_model} is not available.\n\n"
    645                                   "Try: pip install top2vec[sentence_encoders]\n\n"
    646                                   "Alternatively try: pip install tensorflow tensorflow_hub tensorflow_text")

ImportError: universal-sentence-encoder is not available.

Try: pip install top2vec[sentence_encoders]

Alternatively try: pip install tensorflow tensorflow_hub tensorflow_text

I have all of these libraries installed (see below), but this error won't go away.

(base) C:\Users\rsiddiqui>pip install top2vec[sentence_encoders] Requirement already satisfied: top2vec[sentence_encoders] in c:\users\rsiddiqui\anaconda3\lib\site-packages (1.0.16) Requirement already satisfied: numpy in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from top2vec[sentence_encoders]) (1.18.5) Requirement already satisfied: umap-learn in c:\users\rsiddiqui\anaconda3\lib\site-packages (from top2vec[sentence_encoders]) (0.4.6) Requirement already satisfied: gensim in c:\users\rsiddiqui\anaconda3\lib\site-packages (from top2vec[sentence_encoders]) (3.8.3) Requirement already satisfied: pandas in c:\users\rsiddiqui\anaconda3\lib\site-packages (from top2vec[sentence_encoders]) (1.1.3) Requirement already satisfied: wordcloud in c:\users\rsiddiqui\anaconda3\lib\site-packages (from top2vec[sentence_encoders]) (1.8.1) Requirement already satisfied: hdbscan in c:\users\rsiddiqui\anaconda3\lib\site-packages (from top2vec[sentence_encoders]) (0.8.26) Requirement already satisfied: pynndescent>=0.4 in c:\users\rsiddiqui\anaconda3\lib\site-packages (from top2vec[sentence_encoders]) (0.5.1) Requirement already satisfied: tensorflow-text; extra == "sentence_encoders" in c:\users\rsiddiqui\anaconda3\lib\site-packages (from top2vec[sentence_encoders]) (2.4.0rc0) Requirement already satisfied: tensorflow-hub; extra == "sentence_encoders" in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from top2vec[sentence_encoders]) (0.9.0) Requirement already satisfied: tensorflow; extra == "sentence_encoders" in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from top2vec[sentence_encoders]) (2.3.1) Requirement already satisfied: numba!=0.47,>=0.46 in c:\users\rsiddiqui\anaconda3\lib\site-packages (from umap-learn->top2vec[sentence_encoders]) (0.51.2) Requirement already satisfied: scikit-learn>=0.20 in c:\users\rsiddiqui\anaconda3\lib\site-packages (from umap-learn->top2vec[sentence_encoders]) (0.23.2) Requirement already 
satisfied: scipy>=1.3.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from umap-learn->top2vec[sentence_encoders]) (1.5.4) Requirement already satisfied: smart-open>=1.8.1 in c:\users\rsiddiqui\anaconda3\lib\site-packages (from gensim->top2vec[sentence_encoders]) (3.0.0) Requirement already satisfied: six>=1.5.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from gensim->top2vec[sentence_encoders]) (1.15.0) Requirement already satisfied: Cython==0.29.14 in c:\users\rsiddiqui\anaconda3\lib\site-packages (from gensim->top2vec[sentence_encoders]) (0.29.14) Requirement already satisfied: python-dateutil>=2.7.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from pandas->top2vec[sentence_encoders]) (2.8.1) Requirement already satisfied: pytz>=2017.2 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from pandas->top2vec[sentence_encoders]) (2020.4) Requirement already satisfied: pillow in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from wordcloud->top2vec[sentence_encoders]) (8.0.1) Requirement already satisfied: matplotlib in c:\users\rsiddiqui\anaconda3\lib\site-packages (from wordcloud->top2vec[sentence_encoders]) (3.2.2) Requirement already satisfied: joblib in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from hdbscan->top2vec[sentence_encoders]) (0.15.1) Requirement already satisfied: llvmlite>=0.30 in c:\users\rsiddiqui\anaconda3\lib\site-packages (from pynndescent>=0.4->top2vec[sentence_encoders]) (0.34.0) Requirement already satisfied: protobuf>=3.8.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow-hub; extra == "sentence_encoders"->top2vec[sentence_encoders]) (3.13.0) Requirement already satisfied: tensorflow-estimator<2.4.0,>=2.3.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (2.3.0) Requirement 
already satisfied: google-pasta>=0.1.8 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (0.2.0) Requirement already satisfied: wheel>=0.26 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (0.35.1) Requirement already satisfied: absl-py>=0.7.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (0.10.0) Requirement already satisfied: h5py<2.11.0,>=2.10.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (2.10.0) Requirement already satisfied: termcolor>=1.1.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.1.0) Requirement already satisfied: keras-preprocessing<1.2,>=1.1.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.1.2) Requirement already satisfied: opt-einsum>=2.3.2 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (3.3.0) Requirement already satisfied: wrapt>=1.11.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.12.1) Requirement already satisfied: grpcio>=1.8.6 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.32.0) Requirement already satisfied: gast==0.3.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (0.3.3) 
Requirement already satisfied: astunparse==1.6.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.6.3) Requirement already satisfied: tensorboard<3,>=2.3.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (2.4.0) Requirement already satisfied: setuptools in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from numba!=0.47,>=0.46->umap-learn->top2vec[sentence_encoders]) (50.3.2) Requirement already satisfied: threadpoolctl>=2.0.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from scikit-learn>=0.20->umap-learn->top2vec[sentence_encoders]) (2.1.0) Requirement already satisfied: requests in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from smart-open>=1.8.1->gensim->top2vec[sentence_encoders]) (2.25.0) Requirement already satisfied: kiwisolver>=1.0.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from matplotlib->wordcloud->top2vec[sentence_encoders]) (1.3.1) Requirement already satisfied: cycler>=0.10 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from matplotlib->wordcloud->top2vec[sentence_encoders]) (0.10.0) Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from matplotlib->wordcloud->top2vec[sentence_encoders]) (2.4.7) Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (0.4.2) Requirement already satisfied: werkzeug>=0.11.15 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.0.1) 
Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.7.0) Requirement already satisfied: markdown>=2.6.8 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (3.3.3) Requirement already satisfied: google-auth<2,>=1.6.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.23.0) Requirement already satisfied: idna<3,>=2.5 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests->smart-open>=1.8.1->gensim->top2vec[sentence_encoders]) (2.10) Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests->smart-open>=1.8.1->gensim->top2vec[sentence_encoders]) (1.26.2) Requirement already satisfied: certifi>=2017.4.17 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests->smart-open>=1.8.1->gensim->top2vec[sentence_encoders]) (2020.11.8) Requirement already satisfied: chardet<4,>=3.0.2 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests->smart-open>=1.8.1->gensim->top2vec[sentence_encoders]) (3.0.4) Requirement already satisfied: requests-oauthlib>=0.7.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (1.3.0) Requirement already satisfied: cachetools<5.0,>=2.0.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (4.1.1) 
Requirement already satisfied: pyasn1-modules>=0.2.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (0.2.8) Requirement already satisfied: rsa<5,>=3.1.4; python_version >= "3.5" in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (4.6) Requirement already satisfied: oauthlib>=3.0.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (3.1.0) Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from pyasn1-modules>=0.2.1->google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow; extra == "sentence_encoders"->top2vec[sentence_encoders]) (0.4.8)

(base) C:\Users\rsiddiqui>pip install tensorflow tensorflow_hub tensorflow_text
Requirement already satisfied: tensorflow in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (2.3.1)
Requirement already satisfied: tensorflow_hub in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (0.9.0)
Requirement already satisfied: tensorflow_text in c:\users\rsiddiqui\anaconda3\lib\site-packages (2.4.0rc0)
Requirement already satisfied: protobuf>=3.9.2 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (3.13.0)
Requirement already satisfied: gast==0.3.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (0.3.3)
Requirement already satisfied: termcolor>=1.1.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (1.1.0)
Requirement already satisfied: tensorboard<3,>=2.3.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (2.4.0)
Requirement already satisfied: grpcio>=1.8.6 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (1.32.0)
Requirement already satisfied: tensorflow-estimator<2.4.0,>=2.3.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (2.3.0)
Requirement already satisfied: astunparse==1.6.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (1.6.3)
Requirement already satisfied: six>=1.12.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (1.15.0)
Requirement already satisfied: wrapt>=1.11.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (1.12.1)
Requirement already satisfied: google-pasta>=0.1.8 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (0.2.0)
Requirement already satisfied: keras-preprocessing<1.2,>=1.1.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (1.1.2)
Requirement already satisfied: wheel>=0.26 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (0.35.1)
Requirement already satisfied: opt-einsum>=2.3.2 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (3.3.0)
Requirement already satisfied: h5py<2.11.0,>=2.10.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (2.10.0)
Requirement already satisfied: numpy<1.19.0,>=1.16.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (1.18.5)
Requirement already satisfied: absl-py>=0.7.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorflow) (0.10.0)
Requirement already satisfied: setuptools in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from protobuf>=3.9.2->tensorflow) (50.3.2)
Requirement already satisfied: requests<3,>=2.21.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow) (2.25.0)
Requirement already satisfied: tensorboard-plugin-wit>=1.6.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow) (1.7.0)
Requirement already satisfied: werkzeug>=0.11.15 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow) (1.0.1)
Requirement already satisfied: google-auth<2,>=1.6.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow) (1.23.0)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow) (0.4.2)
Requirement already satisfied: markdown>=2.6.8 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from tensorboard<3,>=2.3.0->tensorflow) (3.3.3)
Requirement already satisfied: chardet<4,>=3.0.2 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow) (3.0.4)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow) (1.26.2)
Requirement already satisfied: idna<3,>=2.5 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow) (2.10)
Requirement already satisfied: certifi>=2017.4.17 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests<3,>=2.21.0->tensorboard<3,>=2.3.0->tensorflow) (2020.11.8)
Requirement already satisfied: rsa<5,>=3.1.4; python_version >= "3.5" in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow) (4.6)
Requirement already satisfied: pyasn1-modules>=0.2.1 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow) (0.2.8)
Requirement already satisfied: cachetools<5.0,>=2.0.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow) (4.1.1)
Requirement already satisfied: requests-oauthlib>=0.7.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow) (1.3.0)
Requirement already satisfied: pyasn1>=0.1.3 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from rsa<5,>=3.1.4; python_version >= "3.5"->google-auth<2,>=1.6.3->tensorboard<3,>=2.3.0->tensorflow) (0.4.8)
Requirement already satisfied: oauthlib>=3.0.0 in c:\users\rsiddiqui\appdata\roaming\python\python38\site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard<3,>=2.3.0->tensorflow) (3.1.0)
opened by Rmsharks4 9
"embedding_model" parameter in Top2Vec is unrecognized

The code documentation says that a pretrained model can be selected with the embedding_model parameter, but the parameter is not recognized. I have updated the library as well.

opened by Prashant118 9
TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N

I'm using a set of text documents (PDF documents converted into text) for topic modeling. While training the model I get the error below. It would be a great help if someone could help me sort this out.

C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\umap\umap_.py:1678: UserWarning: n_neighbors is larger than the dataset size; truncating to X.shape[0] - 1
  warn(
C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\scipy\sparse\linalg\eigen\arpack\arpack.py:1590: RuntimeWarning: k >= N for N * N square matrix. Attempting to use scipy.linalg.eigh instead.
  warnings.warn("k >= N for N * N square matrix. "
Traceback (most recent call last):
  File "c:/Users/prabo/Desktop/Topic modeling pipeline/test.py", line 27, in <module>
    model = Top2Vec(documents=df.text, speed="learn", workers=8)
  File "C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\top2vec\Top2Vec.py", line 222, in __init__
    umap_model = umap.UMAP(n_neighbors=15,
  File "C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\umap\umap_.py", line 1965, in fit
    self.embedding_ = simplicial_set_embedding(
  File "C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\umap\umap_.py", line 1033, in simplicial_set_embedding
    initialisation = spectral_layout(
  File "C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\umap\spectral.py", line 324, in spectral_layout
    eigenvalues, eigenvectors = scipy.sparse.linalg.eigsh(
  File "C:\Users\prabo\Desktop\Topic modeling pipeline.venv\lib\site-packages\scipy\sparse\linalg\eigen\arpack\arpack.py", line 1595, in eigsh
    raise TypeError("Cannot use scipy.linalg.eigh for sparse A with "
TypeError: Cannot use scipy.linalg.eigh for sparse A with k >= N. Use scipy.linalg.eigh(A.toarray()) or reduce k.

opened by dulanafdo 9
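
The first warning in the log above ("n_neighbors is larger than the dataset size") suggests that the corpus holds fewer documents than the n_neighbors=15 visible in the traceback, which would explain why the later eigenvector step runs out of points. Under that reading, a pre-flight check such as the following sketch would fail early with a clearer message. The helper name check_corpus_size is hypothetical and is not part of Top2Vec.

```python
def check_corpus_size(documents, n_neighbors=15):
    """Raise early if the corpus looks too small for the UMAP step.

    n_neighbors=15 is the value visible in the traceback above; this
    helper is a hypothetical illustration, not part of Top2Vec.
    """
    # Keep only entries that are strings with some non-whitespace text.
    usable = [doc for doc in documents if isinstance(doc, str) and doc.strip()]
    if len(usable) <= n_neighbors:
        raise ValueError(
            f"Only {len(usable)} usable documents; need more than "
            f"{n_neighbors} for UMAP with n_neighbors={n_neighbors}."
        )
    return usable
```

If the check fails for a handful of long PDFs, splitting each one into pages or paragraphs before training is one way to obtain more documents.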
How can I extract topics from newly added documents at inference time?

Hey, this is an amazing project to work with. I was wondering whether there is any way to extract topics from newly added documents at inference time. Thanks in advance.

opened by meetttttt 0
Stop words are included in the model and topics are generated with them

Here is my topic_words output:

0 Words: ['and' 'the' 'in' 'to' 'of' 'games' 'or' 'first' 'game' 'that' 'by' 'at' 'is' 'released' 'with' 'as' 'its' 'was' 'from' 'developed' 'for' 'it' 'series' 'video' 'were' 'produced' 'an' 'on' 'designed' 'aircraft' 'published' 'built']
1 Words: ['series' 'games' 'an' 'was' 'by' 'with' 'and' 'first' 'published' 'in' 'is' 'from' 'released' 'of' 'to' 'as' 'the' 'it' 'at' 'were' 'designed' 'for' 'or' 'game' 'aircraft' 'its' 'on' 'built' 'that' 'produced' 'video' 'developed']

The documentation says that no stop word removal is needed before using Top2Vec, and in a YouTube tutorial the presenter simply called Top2Vec without any extra parameters and got topics free of stop words. Am I doing something wrong, or is this a bug?

Thanks

opened by cuneyttyler 0
AttributeError: 'CountVectorizer' object has no attribute 'get_feature_names'
Hello,

Following the examples in the readme I created this code:

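import multiprocessing

from top2vec import Top2Vec

# df is the poster's DataFrame; its construction is not shown in the issue.
documents = df["cleaned_message"].tolist()
model = Top2Vec(
    documents,
    embedding_model="universal-sentence-encoder",
    speed="learn",
    workers=multiprocessing.cpu_count() - 1,
)
print(f"Num topics: {model.get_num_topics()}")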

And that is throwing the following error:

AttributeError: 'CountVectorizer' object has no attribute 'get_feature_names'
opened by jmorenobl 1
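
The error above is consistent with a scikit-learn release in which CountVectorizer.get_feature_names no longer exists: the method was deprecated in scikit-learn 1.0 and removed in 1.2 in favour of get_feature_names_out. Code that has to run on both sides of that change can look the method up at run time, as in this sketch. The helper feature_names and the two stand-in classes are hypothetical and exist only so the example runs without scikit-learn installed.

```python
def feature_names(vectorizer):
    """Return the vocabulary terms of a fitted vectorizer.

    Prefers the newer get_feature_names_out and falls back to the
    older get_feature_names when only that one is present.
    """
    getter = getattr(vectorizer, "get_feature_names_out", None)
    if getter is None:
        getter = vectorizer.get_feature_names
    return list(getter())


class NewStyleVectorizer:
    """Hypothetical stand-in exposing only the newer method."""

    def get_feature_names_out(self):
        return ["game", "series"]


class OldStyleVectorizer:
    """Hypothetical stand-in exposing only the older method."""

    def get_feature_names(self):
        return ["game", "series"]


new_names = feature_names(NewStyleVectorizer())
old_names = feature_names(OldStyleVectorizer())
```

When the failing call sits inside an installed package rather than in your own code, installing a scikit-learn version below 1.2 is the usual stopgap.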
access topic/document/etc. vectors

First of all, great package! It is awesome to use!

I was wondering if it is possible to access individual vectors on different levels of the model. For example, if I want to extract the 3 topics that cover the most documents, I would want to use a combination of between-topic spread and within-topic spread of the vectors. Is it possible to extract these from the trained model?

Thanks in advance!

opened by SjoerdBraaksma 0
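
One way to make the two quantities in this question concrete is sketched below, assuming the topic and document vectors have already been pulled out of the model as plain lists of floats. Recent Top2Vec versions appear to expose them as topic_vectors and document_vectors attributes, but treat those names as an assumption and check your installed version. The definitions of spread used here (mean cosine distance) are an illustrative choice, not something the library prescribes.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def within_topic_spread(topic_vector, document_vectors):
    """Mean cosine distance from a topic vector to its own documents."""
    distances = [1.0 - cosine_similarity(topic_vector, d) for d in document_vectors]
    return sum(distances) / len(distances)


def between_topic_spread(topic_vectors):
    """Mean cosine distance over all pairs of topic vectors."""
    pairs = [
        (i, j)
        for i in range(len(topic_vectors))
        for j in range(i + 1, len(topic_vectors))
    ]
    total = sum(
        1.0 - cosine_similarity(topic_vectors[i], topic_vectors[j])
        for i, j in pairs
    )
    return total / len(pairs)
```

A tight topic has a within-topic spread near 0, and well-separated topics give a larger between-topic spread.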
How can I get bigram, trigram and n-gram topic words?

I remember that in LDA and NMF there is a configuration parameter called ngram_range; setting it to (2,2) or (3,3) yields bigram or trigram topic words. Is there a similar setting in Top2Vec for getting bigram, trigram or n-gram topic words?

opened by sivachaitanya 2

Releases(1.0.27)

1.0.27(Apr 3, 2022)
New pre-trained transformer models available

Ability to use any embedding model by passing callable to embedding_model

New embedding_batch_size option

Document chunking options for long documents

Phrases in topics by setting ngram_vocab=True

1.0.26(Jul 9, 2021)

1.0.25(Jun 23, 2021)

Added query_documents and query_topics methods which allow for using a sequence of text such as a question, a sentence, a paragraph or a document to query documents or topics.

Added num_topics parameter to get_documents_topics method which allows retrieving multiple topics per document.
1.0.24(Apr 1, 2021)

Fixes #152
1.0.23(Feb 12, 2021)

Added numpy>=1.20.0 dependency.
1.0.22(Feb 12, 2021)

Numpy related bug fix and document id validation performance upgrade.
1.0.21(Feb 5, 2021)

Addressed #90, #125, #126

Added custom umap and hdbscan arg option. Fixed issue with loading model with custom tokenizer.
1.0.20(Jan 9, 2021)

Added use_embedding_model_tokenizer parameter. If set to True and if using an embedding_model other than doc2vec, use the model's tokenizer for document embedding.

Fixed dependency issue with joblib.

Fixed issues with wordclouds caused by negative similarity scores.
1.0.19(Dec 10, 2020)

Fixed bug #91
1.0.18(Dec 10, 2020)

Added option for indexing word vectors, which speeds up search for models with large vocabularies, specifically search_words_by_vector and similar_words.

Added new method search_words_by_vector.
1.0.17(Dec 7, 2020)

Added option for indexing document vectors, which speeds up search for models with a large number of documents, specifically search_documents_by_vector, search_documents_by_keywords, and search_documents_by_documents.

Added new method search_documents_by_vector.

Added code to prevent hierarchical topic reduction error #79.
1.0.16(Nov 10, 2020)

Dependencies for the universal sentence encoder and BERT sentence transformer options are now optional. They can be installed with pip install top2vec[sentence-encoders] and pip install top2vec[sentence_transformers].

Faster cosine similarity.
1.0.15(Oct 16, 2020)

The verbose parameter will be set to True by default.

Fixed a bug that stopped showing logging updates after downloading pre-trained models.
1.0.13(Oct 15, 2020)

1.0.12(Oct 15, 2020)

Top2Vec now has an option to choose the embedding model with doc2vec, universal-sentence-encoder, universal-sentence-encoder-multilingual, and distiluse-base-multilingual-cased as the options.

A get_documents_topics method was added.
1.0.11(Oct 8, 2020)

Added a method for deleting documents from model.

Fixed bug when using corpus_file that resulted in documents getting dropped. Fixed bug when using add_documents and delete_documents which resulted in improper ordering of topic words.
1.0.10(Aug 29, 2020)

Fixed an issue with the UMAP install caused by a missing comma in the setup.py file. Added an optional min_count parameter; the default is still 50. All words with a total frequency lower than min_count are ignored by the model.
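
The effect of min_count can be illustrated with a small frequency filter. This is a sketch of the idea only, assuming documents are already tokenized into lists of words; filter_vocabulary is a hypothetical name and this is not the code the library runs.

```python
from collections import Counter


def filter_vocabulary(tokenized_documents, min_count=50):
    """Return the set of words whose total frequency is at least min_count."""
    # Count every occurrence of every token across the whole corpus.
    counts = Counter(token for doc in tokenized_documents for token in doc)
    # Words seen fewer than min_count times are dropped.
    return {word for word, freq in counts.items() if freq >= min_count}


docs = [["topic", "model"], ["topic"], ["topic", "search"]]
kept = filter_vocabulary(docs, min_count=2)
```

With the default of 50, a small corpus can lose most of its vocabulary, so lowering min_count is worth considering when training on few documents.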
1.0.9(Jun 26, 2020)

Added functionality to perform hierarchical topic reduction. Added the ability to add new documents to an already trained model. Added use_corpus option which may lead to faster training with very large datasets in multi-worker environments.
1.0.8(Apr 18, 2020)

Added option for custom document ids, which can be strings or ints. Added option to not save documents in the model, which allows the trained model to be used as an index and makes saved models smaller. Added the ability to pass in a custom tokenizer that overrides the default. Added a verbose mode that logs the status of training. Also added the ability to search documents by multiple documents, positive and negative semantic search.
1.0.7(Apr 7, 2020)

Topic size is defined as the number of document vectors that have the topic as their nearest topic vector. Search by topic has been modified to show only documents that have the topic as their nearest topic, in order to avoid overlapping results from similar topics.
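
The definition of topic size above can be sketched as a nearest-topic count. The example assumes that "nearest" means highest cosine similarity and that vectors are plain lists of floats; both are assumptions made for illustration, and the function names are hypothetical rather than taken from the library.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_topic(document_vector, topic_vectors):
    """Index of the topic vector most similar to the document vector."""
    similarities = [cosine_similarity(document_vector, t) for t in topic_vectors]
    return max(range(len(similarities)), key=similarities.__getitem__)


def topic_sizes(document_vectors, topic_vectors):
    """Number of documents whose nearest topic is each topic, in topic order."""
    counts = Counter(nearest_topic(d, topic_vectors) for d in document_vectors)
    return [counts.get(i, 0) for i in range(len(topic_vectors))]


topics = [[1.0, 0.0], [0.0, 1.0]]
documents = [[0.9, 0.1], [0.8, 0.3], [0.1, 1.0]]
sizes = topic_sizes(documents, topics)
```

Because every document is counted under exactly one topic, the sizes sum to the number of documents, which is what keeps search-by-topic results from overlapping.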

Topic deduplication is added to make topics more robust.
1.0.6(Mar 25, 2020)

Top2Vec initial release.

Owner

Dimo Angelov
