BERTopic

BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

BERTopic supports guided, (semi-) supervised, and dynamic topic modeling. It even supports visualizations similar to LDAvis!

Corresponding medium posts can be found here and here.

Installation

Installation, with sentence-transformers, can be done using pip:

pip install bertopic

You may want to install additional extras depending on the transformers and language backends that you will be using. The possible installations are:

pip install bertopic[flair]
pip install bertopic[gensim]
pip install bertopic[spacy]
pip install bertopic[use]
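
Multiple backends can also be installed at once by combining the extras in a single command (standard pip extras syntax; pick only the backends you need):

pip install bertopic[flair,gensim,spacy,use]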

Getting Started

For an in-depth overview of the features of BERTopic you can check the full documentation here or you can follow along with one of the examples below:

Name Link
Topic Modeling with BERTopic Open In Colab
(Custom) Embedding Models in BERTopic Open In Colab
Advanced Customization in BERTopic Open In Colab
(semi-)Supervised Topic Modeling with BERTopic Open In Colab
Dynamic Topic Modeling with Trump's Tweets Open In Colab
Topic Modeling arXiv Abstracts Kaggle

Quick Start

We start by extracting topics from the well-known 20 newsgroups dataset containing English documents:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
 
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

After generating topics and their probabilities, we can access the frequent topics that were generated:

>>> topic_model.get_topic_info()

Topic	Count	Name
-1	4630	-1_can_your_will_any
0	693	49_windows_drive_dos_file
1	466	32_jesus_bible_christian_faith
2	441	2_space_launch_orbit_lunar
3	381	22_key_encryption_keys_encrypted

-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic 0:

>>> topic_model.get_topic(0)

[('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]

NOTE: Use BERTopic(language="multilingual") to select a model that supports 50+ languages.
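
For example, the Quick Start above would only change in how the model is constructed (reusing the same docs):

from bertopic import BERTopic

# Select an embedding model that supports 50+ languages
topic_model = BERTopic(language="multilingual")
topics, probs = topic_model.fit_transform(docs)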

Visualize Topics

After having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can visualize the topics that were generated in a way very similar to LDAvis:

topic_model.visualize_topics()

We can create an overview of the most frequent topics in a way that they are easily interpretable. Horizontal barcharts typically convey information rather well and allow for an intuitive representation of the topics:

topic_model.visualize_barchart()

Find all possible visualizations with interactive examples in the documentation here.
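
The visualization methods return Plotly figures, so the interactive charts can also be written to standalone HTML files (a small sketch continuing from the fitted model above; the filenames are arbitrary):

# Save interactive visualizations as standalone HTML files
fig = topic_model.visualize_topics()
fig.write_html("topics.html")

fig = topic_model.visualize_barchart()
fig.write_html("topic_barchart.html")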

Embedding Models

BERTopic supports many embedding models that can be used to embed the documents and words:

  • Sentence-Transformers
  • Flair
  • Spacy
  • Gensim
  • USE

Sentence-Transformers is typically used as it has shown great results embedding documents meant for semantic similarity. Simply select any from their documentation here and pass it to BERTopic:

topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")

Flair allows you to choose almost any 🤗 transformers model. Simply select any from here and pass it to BERTopic:

from flair.embeddings import TransformerDocumentEmbeddings

roberta = TransformerDocumentEmbeddings('roberta-base')
topic_model = BERTopic(embedding_model=roberta)

Click here for a full overview of all supported embedding models.
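
For instance, a spaCy pipeline can be passed in much the same way (a minimal sketch, assuming bertopic[spacy] is installed and the en_core_web_md model has been downloaded; the excluded components are only there to speed things up):

import spacy
from bertopic import BERTopic

# Keep only the components needed to produce document vectors
nlp = spacy.load("en_core_web_md", exclude=["tagger", "parser", "ner", "attribute_ruler", "lemmatizer"])
topic_model = BERTopic(embedding_model=nlp)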

Dynamic Topic Modeling

Dynamic topic modeling (DTM) is a collection of techniques aimed at analyzing the evolution of topics over time. These methods allow you to understand how a topic is represented over time. Here, we will be using all of Donald Trump's tweets to see how he talked about certain topics over time:

import re
import pandas as pd

trump = pd.read_csv('https://drive.google.com/uc?export=download&id=1xRKHaP-QwACMydlDnyFPEaFdtskJuBa6')
trump.text = trump.apply(lambda row: re.sub(r"http\S+", "", row.text).lower(), 1)
trump.text = trump.apply(lambda row: " ".join(filter(lambda x:x[0]!="@", row.text.split())), 1)
trump.text = trump.apply(lambda row: " ".join(re.sub("[^a-zA-Z]+", " ", row.text).split()), 1)
trump = trump.loc[(trump.isRetweet == "f") & (trump.text != ""), :]
timestamps = trump.date.to_list()
tweets = trump.text.to_list()

Then, we need to extract the global topic representations by simply creating and training a BERTopic model:

topic_model = BERTopic(verbose=True)
topics, probs = topic_model.fit_transform(tweets)

From these topics, we are going to generate the topic representations at each timestamp for each topic. We do this by simply calling topics_over_time and passing in his tweets, the corresponding timestamps, and the related topics:

topics_over_time = topic_model.topics_over_time(tweets, topics, timestamps, nr_bins=20)

Finally, we can visualize the topics by simply calling visualize_topics_over_time():

topic_model.visualize_topics_over_time(topics_over_time, top_n_topics=6)

Overview

For quick access to common functions, here is an overview of BERTopic's main methods:

Method Code
Fit the model .fit(docs)
Fit the model and predict documents .fit_transform(docs)
Predict new documents .transform([new_doc])
Access single topic .get_topic(topic=12)
Access all topics .get_topics()
Get topic freq .get_topic_freq()
Get all topic information .get_topic_info()
Get representative docs per topic .get_representative_docs()
Get topics per class .topics_per_class(docs, topics, classes)
Dynamic Topic Modeling .topics_over_time(docs, topics, timestamps)
Update topic representation .update_topics(docs, topics, n_gram_range=(1, 3))
Reduce nr of topics .reduce_topics(docs, topics, nr_topics=30)
Find topics .find_topics("vehicle")
Save model .save("my_model")
Load model BERTopic.load("my_model")
Get parameters .get_params()
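
As a small illustration, several of these methods can be chained on the fitted Quick Start model (a sketch; the search term is arbitrary):

from bertopic import BERTopic

# Search for topics related to a term and inspect the best match
similar_topics, similarity = topic_model.find_topics("vehicle", top_n=5)
print(topic_model.get_topic(similar_topics[0]))

# Persist the fitted model and load it back later
topic_model.save("my_model")
loaded_model = BERTopic.load("my_model")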

For an overview of BERTopic's visualization methods:

Method Code
Visualize Topics .visualize_topics()
Visualize Topic Hierarchy .visualize_hierarchy()
Visualize Topic Terms .visualize_barchart()
Visualize Topic Similarity .visualize_heatmap()
Visualize Term Score Decline .visualize_term_rank()
Visualize Topic Probability Distribution .visualize_distribution(probs[0])
Visualize Topics over Time .visualize_topics_over_time(topics_over_time)
Visualize Topics per Class .visualize_topics_per_class(topics_per_class)

Citation

To cite BERTopic in your work, please use the following bibtex reference:

@misc{grootendorst2020bertopic,
  author       = {Maarten Grootendorst},
  title        = {BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics.},
  year         = 2020,
  publisher    = {Zenodo},
  version      = {v0.9.4},
  doi          = {10.5281/zenodo.4381785},
  url          = {https://doi.org/10.5281/zenodo.4381785}
}
Comments
  • Github actions: ValueError: numpy.ndarray size changed, may indicate binary incompatibility.

    The github actions workflow is suddenly giving me the following error:

    ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject

    It seems that it has most likely to do with numpy-based binary compatibility issues (some more info here). However, I cannot seem to fix it thus far with the suggested method (setting oldest-supported-numpy in pyproject.toml).

    If you have any idea, please follow along with the full discussions here. Any help is greatly appreciated!

    opened by MaartenGr 26
  • Train and Predict BERTopic

    Hi @MaartenGr ,

    As I understand BERTopic, fit_transform() is used to train the model while transform() is for prediction. Am I right? What is the best method to train the model on data from different sources, e.g. Twitter, Reddit, Facebook comments, etc.? I want to train the model once and use it for various datasets. Should I split the data into sentences, given that some sources have very long comments (paragraphs), e.g. Reddit or news articles?
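
    (For reference, the train-once / predict-later split being asked about looks roughly like the sketch below, reusing the 20 newsgroups data from the Quick Start; the split point is arbitrary.)

    from bertopic import BERTopic
    from sklearn.datasets import fetch_20newsgroups

    docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
    train_docs, new_docs = docs[:15000], docs[15000:]

    # fit_transform() trains the model on one corpus
    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(train_docs)

    # transform() assigns the learned topics to previously unseen documents
    new_topics, new_probs = topic_model.transform(new_docs)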

    Thanks

    opened by mjavedgohar 26
  • Memory inefficient algorithm and getting error while saving the model

    I was trying to train on 20 lakh (2 million) data points and tried lots of GPU instances on AWS, with 16 GB, 32 GB, 64 GB, and 256 GB of RAM. All of them failed to train except the 256 GB instance, where training succeeded but I was unable to save the model.

    Below is the error I was getting while saving the model.

    topic_model.save("topic_model_all_20L.pt",save_embedding_model=False)

    KeyError                                  Traceback (most recent call last)
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/caching.py in save(self, key, data)
        482             # If key already exists, we will overwrite the file
    --> 483             data_name = overloads[key]
        484         except KeyError:
    KeyError: ((array(int32, 1d, C), array(int32, 1d, C), array(float32, 1d, C), array(float32, 2d, C), type(CPUDispatcher(<function alternative_cosine at 0x7f3c3ca174d0>)), array(int64, 1d, C), float64), ('x86_64-unknown-linux-gnu', 'cascadelake', '+64bit,+adx,+aes,+avx,+avx2,-avx512bf16,-avx512bitalg,+avx512bw,+avx512cd,+avx512dq,-avx512er,+avx512f,-avx512ifma,-avx512pf,-avx512vbmi,-avx512vbmi2,+avx512vl,+avx512vnni,-avx512vpopcntdq,+bmi,+bmi2,-cldemote,+clflushopt,+clwb,-clzero,+cmov,+cx16,+cx8,-enqcmd,+f16c,+fma,-fma4,+fsgsbase,+fxsr,-gfni,+invpcid,-lwp,+lzcnt,+mmx,+movbe,-movdir64b,-movdiri,-mwaitx,+pclmul,-pconfig,+pku,+popcnt,-prefetchwt1,+prfchw,-ptwrite,-rdpid,+rdrnd,+rdseed,-rtm,+sahf,-sgx,-sha,-shstk,+sse,+sse2,+sse3,+sse4.1,+sse4.2,-sse4a,+ssse3,-tbm,-vaes,-vpclmulqdq,-waitpkg,-wbnoinvd,-xop,+xsave,+xsavec,+xsaveopt,+xsaves'), ('308c49885ad3c35a475c360e21af1359caa88c78eb495fa0f5e8c6676ae5019e', 'e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855'))
    During handling of the above exception, another exception occurred:
    TypeError                                 Traceback (most recent call last)
    <ipython-input-25-32c887ac8b59> in <module>
          1 # Saving model
    ----> 2 topic_model.save("topic_model_all_20L.pt",save_embedding_model=False)
          3 print("model saved")
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/bertopic/_bertopic.py in save(self, path, save_embedding_model)
       1201                 embedding_model = self.embedding_model
       1202                 self.embedding_model = None
    -> 1203                 joblib.dump(self, file)
       1204                 self.embedding_model = embedding_model
       1205             else:
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/joblib/numpy_pickle.py in dump(value, filename, compress, protocol, cache_size)
        480             NumpyPickler(f, protocol=protocol).dump(value)
        481     else:
    --> 482         NumpyPickler(filename, protocol=protocol).dump(value)
        483 
        484     # If the target container is a file object, nothing is returned.
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in dump(self, obj)
        435         if self.proto >= 4:
        436             self.framer.start_framing()
    --> 437         self.save(obj)
        438         self.write(STOP)
        439         self.framer.end_framing()
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/joblib/numpy_pickle.py in save(self, obj)
        280             return
        281 
    --> 282         return Pickler.save(self, obj)
        283 
        284 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
        547 
        548         # Save the reduce() output and finally memoize the object
    --> 549         self.save_reduce(obj=obj, *rv)
        550 
        551     def persistent_id(self, obj):
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
        660 
        661         if state is not None:
    --> 662             save(state)
        663             write(BUILD)
        664 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/joblib/numpy_pickle.py in save(self, obj)
        280             return
        281 
    --> 282         return Pickler.save(self, obj)
        283 
        284 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
        502         f = self.dispatch.get(t)
        503         if f is not None:
    --> 504             f(self, obj) # Call unbound method with explicit self
        505             return
        506 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save_dict(self, obj)
        857 
        858         self.memoize(obj)
    --> 859         self._batch_setitems(obj.items())
        860 
        861     dispatch[dict] = save_dict
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in _batch_setitems(self, items)
        883                 for k, v in tmp:
        884                     save(k)
    --> 885                     save(v)
        886                 write(SETITEMS)
        887             elif n:
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/joblib/numpy_pickle.py in save(self, obj)
        280             return
        281 
    --> 282         return Pickler.save(self, obj)
        283 
        284 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
        547 
        548         # Save the reduce() output and finally memoize the object
    --> 549         self.save_reduce(obj=obj, *rv)
        550 
        551     def persistent_id(self, obj):
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save_reduce(self, func, args, state, listitems, dictitems, obj)
        660 
        661         if state is not None:
    --> 662             save(state)
        663             write(BUILD)
        664 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/joblib/numpy_pickle.py in save(self, obj)
        280             return
        281 
    --> 282         return Pickler.save(self, obj)
        283 
        284 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
        502         f = self.dispatch.get(t)
        503         if f is not None:
    --> 504             f(self, obj) # Call unbound method with explicit self
        505             return
        506 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save_dict(self, obj)
        857 
        858         self.memoize(obj)
    --> 859         self._batch_setitems(obj.items())
        860 
        861     dispatch[dict] = save_dict
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in _batch_setitems(self, items)
        883                 for k, v in tmp:
        884                     save(k)
    --> 885                     save(v)
        886                 write(SETITEMS)
        887             elif n:
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/joblib/numpy_pickle.py in save(self, obj)
        280             return
        281 
    --> 282         return Pickler.save(self, obj)
        283 
        284 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/pickle.py in save(self, obj, save_persistent_id)
        522             reduce = getattr(obj, "__reduce_ex__", None)
        523             if reduce is not None:
    --> 524                 rv = reduce(self.proto)
        525             else:
        526                 reduce = getattr(obj, "__reduce__", None)
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/pynndescent/pynndescent_.py in __getstate__(self)
        900     def __getstate__(self):
        901         if not hasattr(self, "_search_graph"):
    --> 902             self._init_search_graph()
        903         if not hasattr(self, "_search_function"):
        904             if self._is_sparse:
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/pynndescent/pynndescent_.py in _init_search_graph(self)
       1061                 self._distance_func,
       1062                 self.rng_state,
    -> 1063                 self.diversify_prob,
       1064             )
       1065         reverse_graph.eliminate_zeros()
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/dispatcher.py in _compile_for_args(self, *args, **kws)
        431                     e.patch_message('\n'.join((str(e).rstrip(), help_msg)))
        432             # ignore the FULL_TRACEBACKS config, this needs reporting!
    --> 433             raise e
        434 
        435     def inspect_llvm(self, signature=None):
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/dispatcher.py in _compile_for_args(self, *args, **kws)
        364                 argtypes.append(self.typeof_pyval(a))
        365         try:
    --> 366             return self.compile(tuple(argtypes))
        367         except errors.ForceLiteralArg as e:
        368             # Received request for compiler re-entry with the list of arguments
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/compiler_lock.py in _acquire_compile_lock(*args, **kwargs)
         30         def _acquire_compile_lock(*args, **kwargs):
         31             with self:
    ---> 32                 return func(*args, **kwargs)
         33         return _acquire_compile_lock
         34 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/dispatcher.py in compile(self, sig)
        861                 raise e.bind_fold_arguments(folded)
        862             self.add_overload(cres)
    --> 863             self._cache.save_overload(sig, cres)
        864             return cres.entry_point
        865 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/caching.py in save_overload(self, sig, data)
        665         """
        666         with self._guard_against_spurious_io_errors():
    --> 667             self._save_overload(sig, data)
        668 
        669     def _save_overload(self, sig, data):
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/caching.py in _save_overload(self, sig, data)
        675         key = self._index_key(sig, _get_codegen(data))
        676         data = self._impl.reduce(data)
    --> 677         self._cache_file.save(key, data)
        678 
        679     @contextlib.contextmanager
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/caching.py in save(self, key, data)
        490                     break
        491             overloads[key] = data_name
    --> 492             self._save_index(overloads)
        493         self._save_data(data_name, data)
        494 
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/caching.py in _save_index(self, overloads)
        536     def _save_index(self, overloads):
        537         data = self._source_stamp, overloads
    --> 538         data = self._dump(data)
        539         with self._open_for_write(self._index_path) as f:
        540             pickle.dump(self._version, f, protocol=-1)
    ~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/numba/core/caching.py in _dump(self, obj)
        564 
        565     def _dump(self, obj):
    --> 566         return pickle.dumps(obj, protocol=-1)
        567 
        568     @contextlib.contextmanager
    TypeError: can't pickle weakref objects
    
    
    opened by makkarss929 24
  • No loop matching the specified signature and casting was found for ufunc add

    Hi @MaartenGr, Thanks for releasing the new version of BERTopic with Guided Topic Modeling. However, I got an error message for my code

    seed_topic_list = [["flight", "air", "norwegian", "aircanada", "air canada", "sas", "stopover", "air france", "airline", "airport"],
                       ["car rental", "car", "rental center", "drover", "ecars", "cars", "car hire", "rent a car", "taxi", "cab", "ground", "chauffeur", "uber"],
                       ["room", "hotel night", "reception", "hotels", "hotel", "rooms","property", "properties", "accommodation"],
                       ["sncf", "sj", "railcard", "railway", "rail", "train", "trains"]]
    
    topic_model = BERTopic(seed_topic_list=seed_topic_list, calculate_probabilities=False)
    topics, probs= topic_model.fit_transform(data_de)
    

    The error is

    if self.seed_topic_list is not None and self.embedding_model is not None:
    --> 287             y, embeddings = self._guided_topic_modeling(embeddings)
    TypeError: No loop matching the specified signature and casting was found for ufunc add
    

    I don't think the error is caused by my "data_de", since it works well if I don't specify seed_topic_list. Any suggestions on fixing this error?

    opened by YuanyuanLi96 22
  • reduce_topics assigns many documents to -1

    From what I can see, both from experience and in the code, reduce_topics() reassigns documents to -1 frequently. Is this the expected behavior? If I'm understanding the overall picture, topic clusters are selected based on the HDBSCAN results and documents are assigned to -1 when they have a low likelihood of belonging to an identified cluster. These clusters are then aggregated and a c-TF-IDF score is calculated for the entire topic. When doing the reduction, the cosine similarity of the topic being reduced is compared with all of the other topics and the topic is assigned to the most similar one. It seems counter-intuitive that a particular document sorted into a valid cluster by HDBSCAN can then be discounted by the similarity score during the reduction. It feels like there is a mismatch between doing the initial cluster assignment in a way that captures non-symmetric groupings but then using a Euclidean-style similarity calculation to determine topic assignment. While not perfect, wouldn't it be reasonable to omit -1 as a potential assignment?

    opened by drob-xx 21
  • topic extraction from 'Quick Start' taking forever

    Hi Maarten, I've been following your Github. I installed Bertopic using conda. Then I tried to replicate your Quick Start to see if it's working as expected:

    from bertopic import BERTopic
    from sklearn.datasets import fetch_20newsgroups

    docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(docs)

    Then, at first I was getting the following (which goes on forever):

    runfile('/Users/nayza/Desktop/YTproject/AAHSA/addictionStudy_2.py', wdir='/Users/nayza/Desktop/YTproject/AAHSA')
    Ignored unknown kwarg option direction
    Ignored unknown kwarg option direction
    Ignored unknown kwarg option direction
    Ignored unknown kwarg option direction
    Ignored unknown kwarg option direction
    Ignored unknown kwarg option direction
    Ignored unknown kwarg option direction
    Ignored unknown kwarg option direction
    Traceback (most recent call last):

    But then I suspected that I needed to update my tokenizer, so I updated it from version 0.10.3 to 0.11.0. It no longer shows the 'Ignored unknown...' output, but it's taking forever to run. Plus, my Mac started to get really loud as well. Do you have an idea what might be the issue here?

    opened by nzaw96 21
  • Support of clustering plot (2D UMAP)

    Hi there, just wondering if the current version of BERTopic supports a 2D UMAP plot with clustering, like the first plot in the original post https://towardsdatascience.com/topic-modeling-with-bert-779f7db187e6

    I didn't find such a plot in the documentation, but it could be rather useful for analyzing a document collection.

    opened by karelin 19
  • GPU error


    TypeError                                 Traceback (most recent call last)
    in
    ----> 1 from bertopic import BERTopic
          2 from cuml.cluster import HDBSCAN
          3 from cuml.manifold import UMAP
          4 # Create instances of GPU-accelerated UMAP and HDBSCAN
          5 umap_model = UMAP(n_components=5, n_neighbors=15, min_dist=0.0)

    3 frames
    /usr/local/lib/python3.7/dist-packages/hdbscan/hdbscan_.py in
        507             leaf_size=40,
        508             algorithm="best",
    --> 509             memory=Memory(cachedir=None, verbose=0),
        510             approx_min_span_tree=True,
        511             gen_min_span_tree=False,

    TypeError: __init__() got an unexpected keyword argument 'cachedir'

    opened by research2023 17
  • topic -1

    I use BERTopic to analyze text from social media, dividing the docs into topics and looking for common and important words in each division. In some cases my model input is 150,000 docs, and after the transform the frequency of topic -1 is very high (35%-40%).

    So I want to know what exactly topic -1 is and what causes it...

    Thank you all

    opened by roikremer 17
  • don't save the fitted vectorizer model in the model!

    A fix to https://github.com/MaartenGr/BERTopic/issues/383

    Storing a fitted vectorizer_model makes the topic model extremely memory hungry. With this change, self._c_tf_idf is slower, but it was only ever sped up on the second run anyway, in the case of calling topics_over_time or topics_per_class after already fitting the model once.

    CountVectorizer is pretty fast already (it takes about 2 minutes on my 48,000-document dataset), so I don't think it's worth storing the entire fitted model, given the memory cost, just to make .get_feature_names() a little faster in special use cases. In most use cases it is only fitted once, and the fitting stage isn't time-sensitive either. Plus, the .transform(documents) call that had to be done anyway fits first regardless, so the speedup isn't that big in the first place.

    Also updated .get_feature_names() to .get_feature_names_out(), as the former is deprecated and will stop working with sklearn 1.2, which will be out soon.

    opened by simonfelding 17
  • Issue installing BERTTopic

    I am trying to install BERTopic in a Colab Pro+ setting (High-RAM machine)

    !pip install bertopic -q
    from bertopic import BERTopic
    

    I get the following error:

    ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
    yellowbrick 1.3.post1 requires numpy<1.20,>=1.16.0, but you have numpy 1.21.4 which is incompatible.
    datascience 0.10.6 requires folium==0.2.1, but you have folium 0.8.3 which is incompatible.
    albumentations 0.1.12 requires imgaug<0.2.7,>=0.2.5, but you have imgaug 0.2.9 which is incompatible.
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in _dep_map(self)
       3015         try:
    -> 3016             return self.__dep_map
       3017         except AttributeError:
    
    18 frames
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in __getattr__(self, attr)
       2812         if attr.startswith('_'):
    -> 2813             raise AttributeError(attr)
       2814         return getattr(self._provider, attr)
    
    AttributeError: _DistInfoDistribution__dep_map
    
    During handling of the above exception, another exception occurred:
    
    AttributeError                            Traceback (most recent call last)
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in _parsed_pkg_info(self)
       3006         try:
    -> 3007             return self._pkg_info
       3008         except AttributeError:
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in __getattr__(self, attr)
       2812         if attr.startswith('_'):
    -> 2813             raise AttributeError(attr)
       2814         return getattr(self._provider, attr)
    
    AttributeError: _pkg_info
    
    During handling of the above exception, another exception occurred:
    
    FileNotFoundError                         Traceback (most recent call last)
    <ipython-input-14-0ca5028c9978> in <module>()
          1 get_ipython().system('pip install bertopic -q')
          2 
    ----> 3 from bertopic import BERTopic
          4 force_training = True
    
    /usr/local/lib/python3.7/dist-packages/bertopic/__init__.py in <module>()
    ----> 1 from bertopic._bertopic import BERTopic
          2 
          3 __version__ = "0.9.3"
          4 
          5 __all__ = [
    
    /usr/local/lib/python3.7/dist-packages/bertopic/_bertopic.py in <module>()
         20 # Models
         21 import hdbscan
    ---> 22 from umap import UMAP
         23 from sklearn.feature_extraction.text import CountVectorizer
         24 from sklearn.metrics.pairwise import cosine_similarity
    
    /usr/local/lib/python3.7/dist-packages/umap/__init__.py in <module>()
          1 from warnings import warn, catch_warnings, simplefilter
    ----> 2 from .umap_ import UMAP
          3 
          4 try:
          5     with catch_warnings():
    
    /usr/local/lib/python3.7/dist-packages/umap/umap_.py in <module>()
         45 )
         46 
    ---> 47 from pynndescent import NNDescent
         48 from pynndescent.distances import named_distances as pynn_named_distances
         49 from pynndescent.sparse import sparse_named_distances as pynn_sparse_named_distances
    
    /usr/local/lib/python3.7/dist-packages/pynndescent/__init__.py in <module>()
         13         numba.config.THREADING_LAYER = "workqueue"
         14 
    ---> 15 __version__ = pkg_resources.get_distribution("pynndescent").version
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in get_distribution(dist)
        464         dist = Requirement.parse(dist)
        465     if isinstance(dist, Requirement):
    --> 466         dist = get_provider(dist)
        467     if not isinstance(dist, Distribution):
        468         raise TypeError("Expected string, Requirement, or Distribution", dist)
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in get_provider(moduleOrReq)
        340     """Return an IResourceProvider for the named module or requirement"""
        341     if isinstance(moduleOrReq, Requirement):
    --> 342         return working_set.find(moduleOrReq) or require(str(moduleOrReq))[0]
        343     try:
        344         module = sys.modules[moduleOrReq]
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in require(self, *requirements)
        884         included, even if they were already activated in this working set.
        885         """
    --> 886         needed = self.resolve(parse_requirements(requirements))
        887 
        888         for dist in needed:
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in resolve(self, requirements, env, installer, replace_conflicting, extras)
        778 
        779             # push the new requirements onto the stack
    --> 780             new_requirements = dist.requires(req.extras)[::-1]
        781             requirements.extend(new_requirements)
        782 
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in requires(self, extras)
       2732     def requires(self, extras=()):
       2733         """List of Requirements needed for this distro if `extras` are used"""
    -> 2734         dm = self._dep_map
       2735         deps = []
       2736         deps.extend(dm.get(None, ()))
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in _dep_map(self)
       3016             return self.__dep_map
       3017         except AttributeError:
    -> 3018             self.__dep_map = self._compute_dependencies()
       3019             return self.__dep_map
       3020 
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in _compute_dependencies(self)
       3025         reqs = []
       3026         # Including any condition expressions
    -> 3027         for req in self._parsed_pkg_info.get_all('Requires-Dist') or []:
       3028             reqs.extend(parse_requirements(req))
       3029 
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in _parsed_pkg_info(self)
       3007             return self._pkg_info
       3008         except AttributeError:
    -> 3009             metadata = self.get_metadata(self.PKG_INFO)
       3010             self._pkg_info = email.parser.Parser().parsestr(metadata)
       3011             return self._pkg_info
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in get_metadata(self, name)
       1405             return ""
       1406         path = self._get_metadata_path(name)
    -> 1407         value = self._get(path)
       1408         try:
       1409             return value.decode('utf-8')
    
    /usr/local/lib/python3.7/dist-packages/pkg_resources/__init__.py in _get(self, path)
       1609 
       1610     def _get(self, path):
    -> 1611         with open(path, 'rb') as stream:
       1612             return stream.read()
       1613 
    
    FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.7/dist-packages/numpy-1.19.5.dist-info/METADATA'
    

    The issue is usually resolved after re-running the command

    opened by asafcloud-romarketinggroup 17
  • topics_over_time gets stuck

    Hello!

    I am trying to run the topics_over_time() function, in order to later run visualize_topics_over_time(). However, when I run topics_over_time(), it runs for 1 iteration and then gets stuck.

    • I am using GPU
    • The size of my text corpus is 80,000 documents

    Please help!

    Kind regards

    opened by oscarm3l1n 0
  • ValueError: empty vocabulary; perhaps the documents only contain stop word

    Hi, I tried to reduce the outliers in my BERTopic model. I used the first and third ways described here, but I got this error in both cases: ValueError: empty vocabulary; perhaps the documents only contain stop words.

    function to delete stopword:

    # Remove stopwords (assumes `stop_words` is a predefined collection of stopwords, e.g. from NLTK or sklearn)
    def remove_stopwords(txt):
        txt_clean = [word for word in txt.split(' ') if word not in stop_words]
        new_txt_clean = ' '.join(txt_clean)
        return new_txt_clean
    

    training the model:

    from bertopic import BERTopic
    from sklearn.cluster import KMeans

    cluster_model = KMeans(n_clusters=50)
    topic_model = BERTopic(hdbscan_model=cluster_model)
    topics, probs = topic_model.fit_transform(documents)
    

    What is the reason for this error and how can I solve it?

    opened by As2066 1
  • Ways to increase representative documents for a topic?

    Hello, and apologies if this is not the right place to ask for guidance with BERTopic. I am performing topic modeling and want to get the representative docs for a topic, but most calls seem to only return the default 3 documents, when I would like to return, say, the top n most relevant documents. My intuition says to grab the topic_embeddings_ list, compare each document's embedding with it, and rank based on cosine similarity. However, I saw in another thread from a few months ago that topic_embeddings_ is only the average and not recommended, although the docs seem to indicate it is the weighted average now? For my document embeddings, from sentence-BERT for example, would I need to recalculate them to take the c-TF-IDF weights into account, similar to how you generate them in the code, to get a better similarity ranking?
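
    (A rough sketch of the ranking idea described above; this is not an official BERTopic API, just one way to rank a topic's documents against topic_embeddings_ with cosine similarity, assuming the document embeddings were computed separately.)

    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    def top_docs_for_topic(topic_model, docs, doc_embeddings, topic_id, n=10):
        # Indices of the documents assigned to this topic
        idx = [i for i, t in enumerate(topic_model.topics_) if t == topic_id]
        # Assumes the first row of topic_embeddings_ holds the -1 (outlier) topic;
        # adjust the offset if your BERTopic version maps topic ids differently
        topic_embedding = np.asarray(topic_model.topic_embeddings_[topic_id + 1]).reshape(1, -1)
        sims = cosine_similarity(doc_embeddings[idx], topic_embedding).ravel()
        ranked = np.argsort(sims)[::-1][:n]
        return [docs[idx[i]] for i in ranked]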

    Thanks and love this framework. Will def be contributing back to it :)

    opened by GeorgeDittmar 2
  • Update _bertopic.py

    Adding functionality in topics_over_time to allow users to specify how many terms under each topic they want to see at each timestep t. Right now, the default value is 5, but there are use cases where the user may want to see more than 5

    opened by nbalepur 0
  • topic_model.get_topic()

    Hi, I used this line to display 20 words for topic number 0, topic_model.get_topic(0)[:20], but only 10 words appeared for me. Is there a way to display more words?

    opened by As2066 1
  • Flexibility of Cluster (-1) - Outliers Cluster

    Hello everybody!

    I've been experimenting with BERTopic recently and, once the model is trained and I visualize the number of docs that each cluster contains, the group with the most docs by far is the -1 (outlier) cluster. Therefore, if this model goes into production, many of the docs will be considered outliers.

    1. Is there any way I can remove the clustering inside the outlier (-1) category? Maybe by assigning to the most similar cluster even though there is not enough confidence.
    2. If not, how can I reduce the -1 cluster as much as possible? Maybe with parameters such as min_cluster_size (HDBSCAN) or n_neighbors (UMAP).
    3. In the following repo, is cluster -1 counted in the evaluation with OCTIS?

    Many thanks in advance!! :+1:

    Here is my model architecture:

    from sentence_transformers import SentenceTransformer
    from hdbscan import HDBSCAN
    from bertopic import BERTopic

    # Embedding model: See [1] for more details
    embedding_model = SentenceTransformer("distiluse-base-multilingual-cased-v1")

    # Clustering model: See [2] for more details
    cluster_model = HDBSCAN(min_cluster_size = 15,
                            metric = 'euclidean',
                            cluster_selection_method = 'eom',
                            prediction_data = True)

    # BERTopic model
    topic_model = BERTopic(embedding_model = embedding_model,
                           hdbscan_model = cluster_model,
                           language = "multilingual")

    # Fit the model on a corpus
    topics, probs = topic_model.fit_transform(text)

    # Topic reduction
    topic_model.reduce_topics(text, nr_topics=30)
    opened by miguelfrutos 2
Releases(v0.12.0)
  • v0.12.0(Sep 11, 2022)

    Highlights

    • Perform online/incremental topic modeling with .partial_fit
    • Expose c-TF-IDF model for customization with bertopic.vectorizers.ClassTfidfTransformer
      • The parameters bm25_weighting and reduce_frequent_words were added to potentially improve representations.
    • Expose attributes for easier access to internal data
    • Added many tests with the intention of making development a bit more stable

    Documentation

    Fixes

    • Fixed iteratively merging topics (#632 and #648)
    • Fixed 0th topic not showing up in visualizations (#667)
    • Fixed lowercasing not being optional (#682)
    • Fixed spelling (#664 and #673)
    • Fixed 0th topic not shown in .get_topic_info by @oxymor0n in #660
    • Fixed spelling by @domenicrosati in #674
    • Added custom labels and title options to barchart by @leloykun in #694

    Online/incremental topic modeling

    Online topic modeling (sometimes called "incremental topic modeling") is the ability to learn incrementally from a mini-batch of instances. Essentially, it is a way to update your topic model with data on which it was not trained before. In Scikit-Learn, this technique is often modeled through a .partial_fit function, which is also used in BERTopic.

    At a minimum, the cluster model needs to support a .partial_fit function in order to use this feature. The default HDBSCAN model will not work as it does not support online updating.

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.cluster import MiniBatchKMeans
    from sklearn.decomposition import IncrementalPCA
    from bertopic.vectorizers import OnlineCountVectorizer
    from bertopic import BERTopic
    
    # Prepare documents
    all_docs = fetch_20newsgroups(subset="all",  remove=('headers', 'footers', 'quotes'))["data"]
    doc_chunks = [all_docs[i:i+1000] for i in range(0, len(all_docs), 1000)]
    
    # Prepare sub-models that support online learning
    umap_model = IncrementalPCA(n_components=5)
    cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
    vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)
    
    topic_model = BERTopic(umap_model=umap_model,
                           hdbscan_model=cluster_model,
                           vectorizer_model=vectorizer_model)
    
    # Incrementally fit the topic model by training on 1000 documents at a time
    for docs in doc_chunks:
        topic_model.partial_fit(docs)
    

    Only the topics for the most recent batch of documents are tracked. If you want to use online topic modeling not for a streaming setting but merely for low-memory use cases, it is advised to also update the .topics_ attribute, as variations such as hierarchical topic modeling will not work afterward otherwise:

    # Incrementally fit the topic model by training on 1000 documents at a time and tracking the topics in each iteration
    topics = []
    for docs in doc_chunks:
        topic_model.partial_fit(docs)
        topics.extend(topic_model.topics_)
    
    topic_model.topics_ = topics
    

    c-TF-IDF

    Explicitly define, use, and adjust the ClassTfidfTransformer with new parameters, bm25_weighting and reduce_frequent_words, to potentially improve the topic representation:

    from bertopic import BERTopic
    from bertopic.vectorizers import ClassTfidfTransformer
    
    ctfidf_model = ClassTfidfTransformer(bm25_weighting=True)
    topic_model = BERTopic(ctfidf_model=ctfidf_model)
    

    Attributes

    After having fitted your BERTopic instance, you can use the following attributes to have quick access to certain information, such as the topic assignment for each document in topic_model.topics_.

    | Attribute | Type | Description |
    |---|---|---|
    | topics_ | List[int] | The topics that are generated for each document after training or updating the topic model. The most recent topics are tracked. |
    | probabilities_ | List[float] | The probability of the assigned topic per document. These are only calculated if an HDBSCAN model is used for the clustering step. When calculate_probabilities=True, then it is the probabilities of all topics per document. |
    | topic_sizes_ | Mapping[int, int] | The size of each topic. |
    | topic_mapper_ | TopicMapper | A class for tracking topics and their mappings anytime they are merged, reduced, added, or removed. |
    | topic_representations_ | Mapping[int, Tuple[int, float]] | The top n terms per topic and their respective c-TF-IDF values. |
    | c_tf_idf_ | csr_matrix | The topic-term matrix as calculated through c-TF-IDF. To access its respective words, run .vectorizer_model.get_feature_names() or .vectorizer_model.get_feature_names_out() |
    | topic_labels_ | Mapping[int, str] | The default labels for each topic. |
    | custom_labels_ | List[str] | Custom labels for each topic as generated through .set_topic_labels. |
    | topic_embeddings_ | np.ndarray | The embeddings for each topic. It is calculated by taking the weighted average of word embeddings in a topic based on their c-TF-IDF values. |
    | representative_docs_ | Mapping[int, str] | The representative documents for each topic if HDBSCAN is used. |
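
    For example, after fitting a model (such as the online example above), these attributes can be inspected directly (a small sketch):

    # Topic assignment of the first ten documents
    print(topic_model.topics_[:10])

    # Number of documents per topic; the -1 key holds the outliers
    print(topic_model.topic_sizes_)

    # Words corresponding to the columns of the c-TF-IDF matrix
    words = topic_model.vectorizer_model.get_feature_names_out()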

    Source code(tar.gz)
    Source code(zip)
  • v0.11.0(Jul 11, 2022)

    Highlights

    Documentation

    • Added example for finding similar topics between two models in the tips & tricks page
    • Add multi-modal example in the tips & tricks page

    Fixes

    • Fix support for k-Means in .visualize_heatmap (#532)
    • Fix missing topic 0 in .visualize_topics (#533)
    • Fix inconsistencies in .get_topic_info (#572) and (#581)
    • Add optimal_ordering parameter to .visualize_hierarchy by @rafaelvalero in #390
    • Fix RuntimeError when used as sklearn estimator by @simonfelding in #448
    • Fix typo in visualization documentation by @dwhdai in #475
    • Fix typo in docstrings by @xwwwwww in #549
    • Support higher Flair versions

    Visualization examples

    Visualize hierarchical topic representations with .visualize_hierarchy:

    Extract a text-based hierarchical topic representation with .get_topic_tree:

    .
    └─atheists_atheism_god_moral_atheist
         ├─atheists_atheism_god_atheist_argument
         │    ├─■──atheists_atheism_god_atheist_argument ── Topic: 21
         │    └─■──br_god_exist_genetic_existence ── Topic: 124
         └─■──moral_morality_objective_immoral_morals ── Topic: 29
    

    Visualize 2D documents with .visualize_documents():

    Visualize 2D hierarchical documents with .visualize_hierarchical_documents():

    Source code(tar.gz)
    Source code(zip)
  • v0.10.0(Apr 30, 2022)

    Highlights

    • Use any dimensionality reduction technique instead of UMAP:
    from bertopic import BERTopic
    from sklearn.decomposition import PCA
    
    dim_model = PCA(n_components=5)
    topic_model = BERTopic(umap_model=dim_model)
    
    • Use any clustering technique instead of HDBSCAN:
    from bertopic import BERTopic
    from sklearn.cluster import KMeans
    
    cluster_model = KMeans(n_clusters=50)
    topic_model = BERTopic(hdbscan_model=cluster_model)
    

    Documentation

    • Add a CountVectorizer page with tips and tricks on how to create topic representations that fit your use case
    • Added pages on how to use other dimensionality reduction and clustering algorithms
    • Additional instructions on how to reduce outliers in the FAQ:
    import numpy as np
    probability_threshold = 0.01
    new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs] 
    

    Fixes

    • Fixed None being returned for probabilities when transforming unseen documents
    • Replaced all instances of arg: with Arguments: for consistency
    • Before saving a fitted BERTopic instance, we remove the stopwords in the fitted CountVectorizer model as it can get quite large due to the number of words that end in stopwords if min_df is set to a value larger than 1
    • Set "hdbscan>=0.8.28" to prevent numpy issues
      • Although this was already fixed by the new release of HDBSCAN, it is technically still possible to install 0.8.27 with BERTopic which leads to these numpy issues
    • Update gensim dependency to >=4.0.0 (#371)
    • Fix topic 0 not appearing in visualizations (#472)
    • Fix #506
    • Fix #429
    Source code(tar.gz)
    Source code(zip)
  • v0.9.4(Dec 14, 2021)

    A number of fixes, documentation updates, and small features:

    Highlights:

    • Expose diversity parameter
      • Use BERTopic(diversity=0.1) to change how diverse the words in a topic representation are (ranges from 0 to 1)
    • Improve stability of topic reduction by only computing the cosine similarity within c-TF-IDF and not the topic embeddings
    • Added property to c-TF-IDF that all IDF values should be positive (#351)
    • Major documentation overhaul (mkdocs, tutorials, FAQ, images, etc. ) (#330)
    • Additional logging for .transform (#356)

    Fixes:

    • Drop python 3.6 (#333)
    • Relax plotly dependency (#88)
    • Improve stability of .visualize_barchart() and .visualize_hierarchy()
    Source code(tar.gz)
    Source code(zip)
  • v0.9.3(Oct 17, 2021)

    Fix #282, #285, and #288.

    Fixes

    • #282
      • As it turns out the old implementation of topic mapping was still found in the transform function
    • #285
      • Fix getting all representative docs
    • Fix #288
      • A recent issue with the package pyyaml that can be found in Google Colab
    • Remove the YAMLLoadWarning each time BERTopic is imported
    import yaml
    yaml._warnings_enabled["YAMLLoadWarning"] = False
    
    Source code(tar.gz)
    Source code(zip)
  • v0.9.2(Oct 12, 2021)

    A release focused on algorithmic optimization and fixing several issues:

    Highlights:

    • Update the non-multilingual paraphrase-* models to the all-* models due to improved performance
    • Reduce necessary RAM in c-TF-IDF top 30 word extraction

    Fixes:

    • Fix topic mapping
      • When reducing the number of topics, these need to be mapped to the correct input/output which had some issues in the previous version
      • A new class was created as a way to track these mappings regardless of how many times they were executed
      • In other words, you can iteratively reduce the number of topics after training the model without the need to continuously train the model
    • Fix typo in embeddings page (#200)
    • Fix link in README (#233)
    • Fix documentation .visualize_term_rank() (#253)
    • Fix getting correct representative docs (#258)
    • Update memory FAQ with HDBSCAN pr
    Source code(tar.gz)
    Source code(zip)
  • v0.9.1(Sep 1, 2021)

    Fixes:

    • Fix TypeError when auto-reducing topics (#210)
    • Fix mapping representative docs when reducing topics (#208)
    • Fix visualization issues with probabilities (#205)
    • Fix missing normalize_frequency param in plots (#213)
    Source code(tar.gz)
    Source code(zip)
  • v0.9.0(Aug 7, 2021)

    Highlights

    • Implemented a Guided BERTopic -> Use seeds to steer the Topic Modeling
    • Get the most representative documents per topic: topic_model.get_representative_docs(topic=1)
      • This allows users to see which documents are good representations of a topic and better understand the topics that were created
    • Added normalize_frequency parameter to visualize_topics_per_class and visualize_topics_over_time in order to better compare the relative topic frequencies between topics
    • Return flat probabilities as default, only calculate the probabilities of all topics per document if calculate_probabilities is True
    • Added several FAQs

    Fixes

    • Fix loading pre-trained BERTopic model
    • Fix mapping of probabilities
    • Fix #190

    Guided BERTopic

    Guided BERTopic works in two ways:

    First, we create embeddings for each seeded topic by joining its words and passing them through the document embedder. These embeddings will be compared with the existing document embeddings through cosine similarity and assigned a label. If the document is most similar to a seeded topic, then it will get that topic's label. If it is most similar to the average document embedding, it will get the -1 label. These labels are then passed through UMAP to create a semi-supervised approach that should nudge the topic creation to the seeded topics.

    Second, we take all words in seed_topic_list and assign them a multiplier larger than 1. Those multipliers will be used to increase the IDF values of the words across all topics thereby increasing the likelihood that a seeded topic word will appear in a topic. This does, however, also increase the chance of an irrelevant topic having unrelated words. In practice, this should not be an issue since the IDF value is likely to remain low regardless of the multiplier. The multiplier is now a fixed value but may change to something more elegant, like taking the distribution of IDF values and its position into account when defining the multiplier.

    seed_topic_list = [["company", "billion", "quarter", "shrs", "earnings"],
                       ["acquisition", "procurement", "merge"],
                       ["exchange", "currency", "trading", "rate", "euro"],
                       ["grain", "wheat", "corn"],
                       ["coffee", "cocoa"],
                       ["natural", "gas", "oil", "fuel", "products", "petrol"]]
    
    topic_model = BERTopic(seed_topic_list=seed_topic_list)
    topics, probs = topic_model.fit_transform(docs)
    
    Source code(tar.gz)
    Source code(zip)
  • v0.8.1(Jun 8, 2021)

    Highlights:

    • Improved models:
      • For English documents the default is now: "paraphrase-MiniLM-L6-v2"
      • For Non-English or multi-lingual documents the default is now: "paraphrase-multilingual-MiniLM-L12-v2"
      • Both models show not only great performance but are much faster!
    • Add interactive visualizations to the plotting API documentation

    For even better performance, please use the following models:

    • English: "paraphrase-mpnet-base-v2"
    • Non-English or multi-lingual: "paraphrase-multilingual-mpnet-base-v2"

    Fixes:

    • Improved unit testing for more stability
    • Set transformers version for Flair
    Source code(tar.gz)
    Source code(zip)
  • v0.8.0(May 31, 2021)

    Mainly a visualization update to improve understanding of the topic model.

    Features

    • Additional visualizations:
      • Topic Hierarchy: topic_model.visualize_hierarchy()
      • Topic Similarity Heatmap: topic_model.visualize_heatmap()
      • Topic Representation Barchart: topic_model.visualize_barchart()
      • Term Score Decline: topic_model.visualize_term_rank()

    Improvements

    • Created bertopic.plotting library to easily extend visualizations
    • Improved automatic topic reduction by using HDBSCAN to detect similar topics
    • Sort topic ids by their frequency. -1 is the outlier class and typically contains the most documents. After that, 0 is the largest topic, 1 the second largest, etc.
    • Update MKDOCS with new visualizations

    Fixes

    • Fix typo #113, #117
    • Fix #121 by removing the following two lines:
      • https://github.com/MaartenGr/BERTopic/blob/5c6cf22776fafaaff728370781a5d33727d3dc8f/bertopic/_bertopic.py#L359-L360
    • Fix mapping of topics after reduction (it now excludes 0) (#103)
    Source code(tar.gz)
    Source code(zip)
  • v0.7.0(Apr 26, 2021)

    The two main features are (semi-)supervised topic modeling and several backends to use instead of Flair and SentenceTransformers!

    Highlights:

    • (semi-)supervised topic modeling by leveraging supervised options in UMAP
      • model.fit(docs, y=target_classes)
    • Backends:
      • Added Spacy, Gensim, USE (TFHub)
      • Use a different backend for document embeddings and word embeddings
      • Create your own backends with bertopic.backend.BaseEmbedder
      • Click here for an overview of all new backends
    • Calculate and visualize topics per class
      • Calculate: topics_per_class = topic_model.topics_per_class(docs, topics, classes)
      • Visualize: topic_model.visualize_topics_per_class(topics_per_class)
    • Several tutorials were updated and added:

    | Name | Link |
    |---|---|
    | Topic Modeling with BERTopic | Open In Colab |
    | (Custom) Embedding Models in BERTopic | Open In Colab |
    | Advanced Customization in BERTopic | Open In Colab |
    | (semi-)Supervised Topic Modeling with BERTopic | Open In Colab |
    | Dynamic Topic Modeling with Trump's Tweets | Open In Colab |

    Fixes:

    • Fixed issues with Torch req
    • Prevent saving term frequency matrix in CTFIDF class
    • Fixed DTM not working when reducing topics (#96)
    • Moved visualization dependencies to base BERTopic
      • pip install bertopic[visualization] becomes pip install bertopic
    • Allow precomputed embeddings in bertopic.find_topics() (#79):

    from bertopic import BERTopic

    # my_embedding_model and my_precomputed_embeddings are placeholders
    # for your own embedding model and pre-computed document embeddings
    model = BERTopic(embedding_model=my_embedding_model)
    model.fit(docs, my_precomputed_embeddings)
    model.find_topics(search_term)
    
  • v0.6.0(Mar 9, 2021)

    Highlights:

    • DTM: Added a basic dynamic topic modeling technique based on the global c-TF-IDF representation
      • model.topics_over_time(docs, timestamps, global_tuning=True)
    • DTM: Option to evolve topics based on t-1 c-TF-IDF representation which results in evolving topics over time
      • Only uses topics at t-1 and skips evolution if there is a gap
      • model.topics_over_time(docs, timestamps, evolution_tuning=True)
    • DTM: Function to visualize topics over time
      • model.visualize_topics_over_time(topics_over_time)
    • DTM: Add binning of timestamps
      • model.topics_over_time(docs, timestamps, nr_bins=10)
    • Add a function to get general information about topics (id, frequency, name, etc.)
      • get_topic_info()
    • Improved stability of c-TF-IDF by taking the average number of words across all topics instead of the number of documents
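
    Putting the DTM additions together, a minimal sketch assuming docs and timestamps are equally long lists of documents and their creation times:

    from bertopic import BERTopic

    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(docs)

    # Bin the timestamps and evolve topics based on the t-1 representation
    topics_over_time = topic_model.topics_over_time(docs, timestamps,
                                                    global_tuning=True,
                                                    evolution_tuning=True,
                                                    nr_bins=10)
    topic_model.visualize_topics_over_time(topics_over_time)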

    Fixes:

    • Fixed _map_probabilities() not taking into account that the outlier class has no probability and mutating the probabilities instead of copying them (#63, #64)
  • v0.5.0(Feb 8, 2021)

    Features

    • Add Flair to allow for more (custom) token/document embeddings
    • Option to use custom UMAP, HDBSCAN, and CountVectorizer
    • Added low_memory parameter to reduce memory during computation
    • Improved verbosity (shows progress bar)
    • Improved testing
    • Use the newest version of sentence-transformers as it speeds up encoding significantly
    • Return the figure of visualize_topics()
    • Expose all parameters with a single function: get_params()
    • Option to disable the saving of embedding_model, which should reduce the size of saved BERTopic models significantly
    • Add FAQ page
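
    A minimal sketch of how these options can be combined; the sub-model settings are only illustrative, not recommendations:

    from bertopic import BERTopic
    from umap import UMAP
    from hdbscan import HDBSCAN
    from sklearn.feature_extraction.text import CountVectorizer

    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric='cosine')
    hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', prediction_data=True)
    vectorizer_model = CountVectorizer(stop_words="english", ngram_range=(1, 2))

    topic_model = BERTopic(umap_model=umap_model,
                           hdbscan_model=hdbscan_model,
                           vectorizer_model=vectorizer_model,
                           low_memory=True)
    print(topic_model.get_params())  # all parameters in a single dictionary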

    Fixes

    • To simplify the API, the parameters stop_words and n_neighbors were removed. These can still be used when a custom UMAP or CountVectorizer is used.
    • Set calculate_probabilities to False by default. Calculating probabilities with HDBSCAN significantly increases computation time and memory usage, so it is better to skip them or only enable them by manually turning this on.
  • v0.4.2(Jan 10, 2021)

    Fixed the parameter embedding_model not working properly when language had been set. If you are using an older version of BERTopic, please set language to False when you want to set embedding_model.
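
    A sketch of the workaround mentioned above for those older versions (the model name is only an example):

    from bertopic import BERTopic

    # On versions before 0.4.2, disable language so embedding_model takes effect
    topic_model = BERTopic(language=False, embedding_model="paraphrase-MiniLM-L6-v2")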

  • v0.4.1(Jan 7, 2021)

  • v0.4.0(Dec 21, 2020)

    Highlights:

    • Visualize Topics similar to LDAvis
    • Added option to reduce topics after training
    • Added option to update topic representation after training
    • Added option to search topics using a search term
    • Significantly improved the stability of generating clusters
    • Finetune the topic words by selecting the most coherent words with the highest c-TF-IDF values
    • More extensive tutorials in the documentation
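
    A short sketch of the new post-training options, assuming a fitted topic_model and the original docs; exact signatures have shifted slightly across later versions, so treat this as illustrative:

    # Merge topics into roughly 30 topics after training
    topic_model.reduce_topics(docs, nr_topics=30)

    # Recompute the topic representations, e.g. with a different n-gram range
    topic_model.update_topics(docs, n_gram_range=(1, 3))

    # Search topics with a search term
    similar_topics, similarity = topic_model.find_topics("vehicle", top_n=5)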

    Notable Changes:

    • Option to select language instead of sentence-transformers models to minimize the complexity of using BERTopic
    • Improved logging (remove duplicates)
    • Check if BERTopic is fitted
    • Added TF-IDF as an embedder instead of transformer models (see tutorial)
    • Numpy for Python 3.6 will be dropped and was therefore removed from the workflow.
    • Preprocess text before passing it through c-TF-IDF
    • Merged get_topics_freq() with get_topic_freq()

    Fixes:

    • Fix error handling topic probabilities
  • v0.3.2(Nov 16, 2020)

    Fixed a bug where the topic reduction method reduced the number of topics but not to the nr_topics defined in the class. Since this was, to a certain extent, breaking the topic reduction method, a new release was necessary.

  • v0.3.1(Nov 4, 2020)

    Added the option to use custom embeddings or embeddings that you generated beforehand with whatever package you'd like to use. This allows users to further customize BERTopic to their liking.

    NOTE: I cannot guarantee that using your own embeddings would result in better performance. It is likely to swing both ways depending on the embeddings you are using. For example, if you use poorly-trained W2V embeddings then it is likely to result in a poor topic generation. Thus, it is up to the user to experiment with the embeddings that best serve their purposes.
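
    A minimal sketch of passing your own pre-computed embeddings (here with sentence-transformers, but any package that returns one vector per document works; the model name is just an example):

    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    sentence_model = SentenceTransformer("paraphrase-MiniLM-L6-v2")
    embeddings = sentence_model.encode(docs, show_progress_bar=False)

    topic_model = BERTopic()
    topics, probs = topic_model.fit_transform(docs, embeddings)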

  • v0.3.0(Oct 29, 2020)

    • transform() and fit_transform() now also return the topic probability distributions
    • Added visualize_distribution() which visualizes the topic probability distribution for a single document
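
    A minimal sketch of inspecting the distribution for a single document (on later versions, probabilities have to be enabled explicitly, see the v0.5.0 notes above):

    from bertopic import BERTopic

    topic_model = BERTopic(calculate_probabilities=True)
    topics, probs = topic_model.fit_transform(docs)
    topic_model.visualize_distribution(probs[0])  # distribution of the first document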
  • v0.2.3(Oct 17, 2020)

  • v0.2.1(Oct 11, 2020)

    Improved the calculation of the class-based TF-IDF procedure by limiting the calculation to sparse matrices. This prevents out-of-memory problems when faced with large datasets.

  • v0.1.2(Oct 1, 2020)

  • v0.1.1(Sep 24, 2020)

  • v0.1.0(Sep 24, 2020)

    • Added parameters for UMAP and HDBSCAN
    • Option to choose sentence-transformer model
    • Method for transforming unseen documents
    • Save and load trained models (UMAP and HDBSCAN)
    • Extract topics and their sizes
    • Optimized c-TF-IDF
    • Improved documentation
    • Improved topic reduction
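
    A minimal sketch of saving a trained model and loading it back later (the path is just an example):

    from bertopic import BERTopic

    topic_model.save("my_topic_model")
    loaded_model = BERTopic.load("my_topic_model")
    topics, probs = loaded_model.transform(["an unseen document about space launches"])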
Owner
Maarten Grootendorst (Data Scientist | Psychologist)