Sentence Embeddings with BERT & XLNet

Overview

Sentence Transformers: Multilingual Sentence Embeddings using BERT / RoBERTa / XLM-RoBERTa & Co. with PyTorch

This framework provides an easy method to compute dense vector representations for sentences and paragraphs (also known as sentence embeddings). The models are based on transformer networks like BERT / RoBERTa / XLM-RoBERTa etc. and are tuned specifically to produce meaningful sentence embeddings, such that sentences with similar meanings are close in vector space.

We provide an increasing number of state-of-the-art pretrained models for more than 100 languages, fine-tuned for various use-cases.

Further, this framework allows an easy fine-tuning of custom embeddings models, to achieve maximal performance on your specific task.

For the full documentation, see www.SBERT.net, as well as the publications listed under Citing & Authors below.

Installation

We recommend Python 3.6 or higher, PyTorch 1.6.0 or higher and transformers v3.1.0 or higher. The code does not work with Python 2.7.

Install with pip

Install the sentence-transformers with pip:

pip install -U sentence-transformers

Install from sources

Alternatively, you can also clone the latest version from the repository and install it directly from the source code:

pip install -e .

PyTorch with CUDA: If you want to use a GPU / CUDA, you must install PyTorch with the matching CUDA version. Follow PyTorch - Get Started for further details on how to install PyTorch.

Getting Started

See Quickstart in our documentation.

This example shows you how to use an already trained Sentence Transformer model to embed sentences for another task.

First download a pretrained model.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-distilroberta-base-v1')

Then provide some sentences to the model.

sentences = ['This framework generates embeddings for each input sentence',
    'Sentences are passed as a list of string.', 
    'The quick brown fox jumps over the lazy dog.']
sentence_embeddings = model.encode(sentences)

And that's it already. We now have a list of numpy arrays with the embeddings.

for sentence, embedding in zip(sentences, sentence_embeddings):
    print("Sentence:", sentence)
    print("Embedding:", embedding)
    print("")

Pre-Trained Models

We provide a large list of Pretrained Models for more than 100 languages. Some models are general purpose models, while others produce embeddings for specific use cases. Pre-trained models can be loaded by just passing the model name: SentenceTransformer('model_name').

» Full list of pretrained models
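 
For illustration, a hedged sketch of loading one of the multilingual models by name (the model name below is one example; check the pretrained-model list for current recommendations):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
embeddings = model.encode(['This is an English sentence.',
                           'Das ist ein deutscher Satz.'])
print(embeddings.shape)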

Training

This framework allows you to fine-tune your own sentence embedding methods, so that you get task-specific sentence embeddings. You have various options to choose from in order to get optimal sentence embeddings for your specific task.

See Training Overview for an introduction to training your own embedding models. We provide several examples showing how to train models on a wide range of datasets.
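
As a rough illustration of the training API (a minimal sketch with toy data, not a recipe for any particular task), you can fine-tune a model with InputExample pairs, a standard PyTorch DataLoader, and one of the provided losses:

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer('paraphrase-distilroberta-base-v1')

# Toy training pairs with similarity labels in [0, 1]
train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
                  InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)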

Some highlights are:

  • Support of various transformer networks including BERT, RoBERTa, XLM-R, DistilBERT, Electra, BART, ...
  • Multi-Lingual and multi-task learning
  • Evaluation during training to find optimal model
  • 10+ loss functions that allow tuning models specifically for semantic search, paraphrase mining, semantic similarity comparison, and clustering, including triplet and contrastive losses.

Performance

Our models are evaluated extensively and achieve state-of-the-art performance on various tasks. Further, the code is tuned to provide the highest possible speed.

Model                               STS benchmark   SentEval
Avg. GloVe embeddings               58.02           81.52
BERT-as-a-service avg. embeddings   46.35           84.04
BERT-as-a-service CLS-vector        16.50           84.66
InferSent - GloVe                   68.03           85.59
Universal Sentence Encoder          74.92           85.10
Sentence Transformer Models:
nli-bert-base                       77.12           86.37
nli-bert-large                      79.19           87.78
stsb-bert-base                      85.14           86.07
stsb-bert-large                     85.29           86.66
stsb-roberta-base                   85.44           -
stsb-roberta-large                  86.39           -
stsb-distilbert-base                85.16           -
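
As a sketch of how such scores can be computed for your own labeled sentence pairs (the example pairs and scores below are made up), the EmbeddingSimilarityEvaluator reports a correlation between the model's similarity scores and the gold labels:

from sentence_transformers import SentenceTransformer, InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

model = SentenceTransformer('stsb-distilbert-base')
examples = [InputExample(texts=['A man is eating food.', 'A man is eating a meal.'], label=0.9),
            InputExample(texts=['A man is eating food.', 'A plane is taking off.'], label=0.05),
            InputExample(texts=['A woman is playing violin.', 'A woman is playing an instrument.'], label=0.8)]
evaluator = EmbeddingSimilarityEvaluator.from_input_examples(examples, name='sts-sample')
print(evaluator(model))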

Application Examples

You can use this framework for computing sentence / text embeddings, semantic textual similarity, semantic search, paraphrase mining, clustering, and many more use-cases.

For all examples, see examples/applications.
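
As one concrete illustration, a hedged sketch of semantic search with the util helpers (the corpus and query below are made up):

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('paraphrase-distilroberta-base-v1')

corpus = ['A man is eating food.',
          'A woman is playing violin.',
          'Two kids are riding bikes.']
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

query_embedding = model.encode('Someone is having a meal.', convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]
for hit in hits:
    print(corpus[hit['corpus_id']], hit['score'])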

Citing & Authors

If you find this repository helpful, feel free to cite our publication Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks:

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

If you use one of the multilingual models, feel free to cite our publication Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation:

@inproceedings{reimers-2020-multilingual-sentence-bert,
    title = "Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2020",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/2004.09813",
}

If you use the code for data augmentation, feel free to cite our publication Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks:

@article{thakur-2020-AugSBERT,
    title = "Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks",
    author = "Thakur, Nandan and Reimers, Nils and Daxenberger, Johannes and  Gurevych, Iryna", 
    journal= "arXiv preprint arXiv:2010.08240",
    month = "10",
    year = "2020",
    url = "https://arxiv.org/abs/2010.08240",
}

The main contributors of this repository are:

Contact person: Nils Reimers, [email protected]

https://www.ukp.tu-darmstadt.de/

Don't hesitate to send us an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.

Comments
  • Is it Multilingual?

    Hello,

    This might be a stupid question, but I wanted to know if I can use the clustering on German sentences. Will it work with the pre-trained model or do I need to train it on German data first?

    Thanks.

    opened by SouravDutta91 44
  • Fine-tune multilingual model for domain specific vocab

    Thanks for the repository and for continuous updates.

    Wanted to check if I understood it correctly: Is it possible to continue fine-tuning one of the multilingual models for a specific domain? For example, can I take 'xlm-r-distilroberta-base-paraphrase-v1' and fine-tune it on domain-related parallel data (English-other languages) with MultipleNegativesRankingLoss?

    opened by langineer 30
  • Is it possible to encode by using multi-GPU?

    Thanks for this beautiful package, it saves a lot of work when doing semantic embedding. I am running a large database, trying to transform docs into an embedding matrix. When I was running the code, it seemed to use only a single GPU to encode the sentences. Is there any way that I could do this with multiple GPUs?

    opened by z307287280 30
  • public.ukp.informatik.tu-darmstadt.de Unreachable

    It looks like the server which hosts the pre-trained models (https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/v0.2/) has been unavailable for a few hours now.

    opened by Ganners 20
  • ModuleNotFoundError: No module named 'sentence_transformers.evaluation'

    After pip installing and trying to import SentenceTransformer I get this error: ModuleNotFoundError: No module named 'sentence_transformers.evaluation'

    When I look into the source code the only folder I have is models. I am missing evaluation, etc. Any Idea why?

    opened by DavidBegert 20
  • Fine-tune underlying language model for SBERT

    Hi,

    I'd like to use the SBERT model architecture for document similarity and topic modelling tasks. However, my data corpus is fairly domain-specific, and I suspect that SBERT will underperform as it was trained on generic Wiki/library corpora. So, I wonder if there are any recommendations around fine-tuning of the underlying language model for SBERT.

    I envision that the overall process will be following:

    1. Take pre-trained BERT model
    2. Fine tune Language Model on domain-specific corpus
    3. Then retrain SBERT model architecture on specific tasks (e.g. SNLI dataset/task)

    Curious to hear thoughts on the approach and problem definition.

    opened by vdabravolski 18
  • ModuleNotFoundError: No module named 'setuptools.command.build'

    I am trying to pip install sentence transformers on my Macbook Pro with M1 chip. I am using:

    pip install -U sentence-transformers
    

    When I run this, I get this error saying:

    ModuleNotFoundError: No module named 'setuptools.command.build'
    

    Full output:

    Defaulting to user installation because normal site-packages is not writeable
    Collecting sentence-transformers
      Using cached sentence-transformers-2.2.2.tar.gz (85 kB)
      Preparing metadata (setup.py) ... done
    Collecting transformers<5.0.0,>=4.6.0
      Using cached transformers-4.21.0-py3-none-any.whl (4.7 MB)
    Collecting tqdm
      Using cached tqdm-4.64.0-py2.py3-none-any.whl (78 kB)
    Requirement already satisfied: torch>=1.6.0 in ./Library/Python/3.8/lib/python/site-packages (from sentence-transformers) (1.12.0)
    Collecting torchvision
      Using cached torchvision-0.13.0-cp38-cp38-macosx_11_0_arm64.whl (1.2 MB)
    Requirement already satisfied: numpy in ./Library/Python/3.8/lib/python/site-packages (from sentence-transformers) (1.23.1)
    Collecting scikit-learn
      Using cached scikit_learn-1.1.1-cp38-cp38-macosx_12_0_arm64.whl (7.6 MB)
    Collecting scipy
      Using cached scipy-1.8.1-cp38-cp38-macosx_12_0_arm64.whl (28.6 MB)
    Collecting nltk
      Using cached nltk-3.7-py3-none-any.whl (1.5 MB)
    Collecting sentencepiece
      Using cached sentencepiece-0.1.96.tar.gz (508 kB)
      Preparing metadata (setup.py) ... done
    Collecting huggingface-hub>=0.4.0
      Using cached huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
    Collecting requests
      Using cached requests-2.28.1-py3-none-any.whl (62 kB)
    Collecting pyyaml>=5.1
      Using cached PyYAML-6.0.tar.gz (124 kB)
      Installing build dependencies ... done
      Getting requirements to build wheel ... done
      Preparing metadata (pyproject.toml) ... done
    Requirement already satisfied: typing-extensions>=3.7.4.3 in ./Library/Python/3.8/lib/python/site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (4.3.0)
    Requirement already satisfied: filelock in ./Library/Python/3.8/lib/python/site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (3.7.1)
    Requirement already satisfied: packaging>=20.9 in ./Library/Python/3.8/lib/python/site-packages (from huggingface-hub>=0.4.0->sentence-transformers) (21.3)
    Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
      Using cached tokenizers-0.12.1.tar.gz (220 kB)
      Installing build dependencies ... done
      Getting requirements to build wheel ... error
      error: subprocess-exited-with-error
      
      × Getting requirements to build wheel did not run successfully.
      │ exit code: 1
      ╰─> [20 lines of output]
          Traceback (most recent call last):
            File "/Users/joeyoneill/Library/Python/3.8/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 363, in <module>
              main()
            File "/Users/joeyoneill/Library/Python/3.8/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 345, in main
              json_out['return_val'] = hook(**hook_input['kwargs'])
            File "/Users/joeyoneill/Library/Python/3.8/lib/python/site-packages/pip/_vendor/pep517/in_process/_in_process.py", line 130, in get_requires_for_build_wheel
              return hook(config_settings)
            File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/build_meta.py", line 146, in get_requires_for_build_wheel
              return self._get_build_requires(
            File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/build_meta.py", line 127, in _get_build_requires
              self.run_setup()
            File "/Library/Developer/CommandLineTools/Library/Frameworks/Python3.framework/Versions/3.8/lib/python3.8/site-packages/setuptools/build_meta.py", line 142, in run_setup
              exec(compile(code, __file__, 'exec'), locals())
            File "setup.py", line 2, in <module>
              from setuptools_rust import Binding, RustExtension
            File "/private/var/folders/bg/ncfh283n4t39vqhvbd5n9ckh0000gn/T/pip-build-env-vjj6eow8/overlay/lib/python3.8/site-packages/setuptools_rust/__init__.py", line 1, in <module>
              from .build import build_rust
            File "/private/var/folders/bg/ncfh283n4t39vqhvbd5n9ckh0000gn/T/pip-build-env-vjj6eow8/overlay/lib/python3.8/site-packages/setuptools_rust/build.py", line 20, in <module>
              from setuptools.command.build import build as CommandBuild  # type: ignore[import]
          ModuleNotFoundError: No module named 'setuptools.command.build'
          [end of output]
      
      note: This error originates from a subprocess, and is likely not a problem with pip.
    error: subprocess-exited-with-error
    
    × Getting requirements to build wheel did not run successfully.
    │ exit code: 1
    ╰─> See above for output.
    
    	note: This error originates from a subprocess, and is likely not a problem with pip.
    

    Can anybody tell me what I should do or what is wrong with what I am currently doing? I factory reset my Mac and re-downloaded everything but I still get this same error. I am stumped.

    opened by joeyoneill 15
  • HTTPError: 403 Client Error:

    I get a request error and I do not know why.

    
    [W 2021-02-02 18:43:15,951] Trial 0 failed because of the following error: HTTPError('403 Client Error: Forbidden for url: https://sbert.net/models/bert-base-german-dbmdz-uncased.zip',)
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/dist-packages/optuna/_optimize.py", line 211, in _run_trial
        value_or_values = func(trial)
      File "<ipython-input-6-af5cb77f5b44>", line 40, in objective
        model = SentenceTransformer(model_name)  # distiluse-base-multilingual-cased-v2  distilbert-multilingual-nli-stsb-quora-ranking
      File "/usr/local/lib/python3.6/dist-packages/sentence_transformers/SentenceTransformer.py", line 92, in __init__
        raise e
      File "/usr/local/lib/python3.6/dist-packages/sentence_transformers/SentenceTransformer.py", line 75, in __init__
        http_get(model_url, zip_save_path)
      File "/usr/local/lib/python3.6/dist-packages/sentence_transformers/util.py", line 201, in http_get
        req.raise_for_status()
      File "/usr/local/lib/python3.6/dist-packages/requests/models.py", line 941, in raise_for_status
        raise HTTPError(http_error_msg, response=self)
    requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://sbert.net/models/bert-base-german-dbmdz-uncased.zip
    
    HTTPError: 403 Client Error: Forbidden for url: https://sbert.net/models/bert-base-german-dbmdz-uncased.zip
    
    opened by tide90 15
  • Batch cos_sim for community_detection?

    I've been experimenting with the community_detection method but noticed I quickly get OOM errors if the embeddings are too large.

    Seeing how it uses cos_sim to compute all the embedding distances, do you think it would make sense to have the option for batching? I believe you will find other bottlenecks when iterating over the entries, but at least it will complete on larger embeddings.

    opened by mmaybeno 13
  • 'torch._C.PyTorchFileReader' object has no attribute 'seek'

    Hello,

    I am using the following model for sentence similarity

    https://huggingface.co/sentence-transformers/stsb-xlm-r-multilingual/tree/main

    word_embedding_model = models.Transformer(bert_model_dir)  # , max_seq_length=512
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model], device=device_str)
    

    But, I get this error:

    Traceback (most recent call last):
    
      File "/home/work/anaconda/lib/python3.6/site-packages/torch/serialization.py", line 306, in _check_seekable
    
        f.seek(f.tell())
    
    AttributeError: 'torch._C.PyTorchFileReader' object has no attribute 'seek'
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
    
      File "/home/work/anaconda/lib/python3.6/site-packages/transformers/modeling_utils.py", line 1205, in from_pretrained
    
        state_dict = torch.load(resolved_archive_file, map_location="cpu")
    
      File "/home/work/anaconda/lib/python3.6/site-packages/torch/serialization.py", line 584, in load
    
        return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
    
      File "/home/work/anaconda/lib/python3.6/site-packages/moxing/framework/file/file_io_patch.py", line 200, in _load
    
        _check_seekable(f)
    
      File "/home/work/anaconda/lib/python3.6/site-packages/torch/serialization.py", line 309, in _check_seekable
    
        raise_err_msg(["seek", "tell"], e)
    
      File "/home/work/anaconda/lib/python3.6/site-packages/torch/serialization.py", line 302, in raise_err_msg
    
        raise type(e)(msg)
    
    AttributeError: 'torch._C.PyTorchFileReader' object has no attribute 'seek'. You can only torch.load from a file that is seekable. Please pre-load the data into a buffer like io.BytesIO and try to load from it instead.
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
    
      File "code/similarity.py", line 118, in <module>
    
        word_embedding_model = models.Transformer(bert_model_dir) #, max_seq_length=512
    
      File "/home/work/anaconda/lib/python3.6/site-packages/sentence_transformers/models/Transformer.py", line 30, in __init__
    
        self.auto_model = AutoModel.from_pretrained(model_name_or_path, config=config, cache_dir=cache_dir)
    
      File "/home/work/anaconda/lib/python3.6/site-packages/transformers/models/auto/auto_factory.py", line 381, in from_pretrained
    
        return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
    
      File "/home/work/anaconda/lib/python3.6/site-packages/transformers/modeling_utils.py", line 1208, in from_pretrained
    
        f"Unable to load weights from pytorch checkpoint file for'{pretrained_model_name_or_path}' "
    
    OSError: Unable to load weights from pytorch checkpoint file for '/home/work/user-job-dir/input/pretrained_models/stsb-xlm-r-multilingual/' at '/home/work/user-job-dir/input/pretrained_models/stsb-xlm-r-multilingual/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True. 
    

    I checked on web but could not find any solution. What could be the problem? Thank you.

    opened by deadsoul44 13
  • Getting SSL Error in downloading "distilroberta-base-paraphrase-v1" model embeddings

    I am using Google Colab with PyTorch version 1.7.0+cu101. I am getting an SSL error when I try to download the "distilroberta-base-paraphrase-v1" model.

    Code:

    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer('distilroberta-base-paraphrase-v1')

    Error

    SSLError Traceback (most recent call last) /usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw) 599 body=body, headers=headers, --> 600 chunked=chunked) 601

    24 frames SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)

    During handling of the above exception, another exception occurred:

    MaxRetryError Traceback (most recent call last) MaxRetryError: HTTPSConnectionPool(host='public.ukp.informatik.tu-darmstadt.de', port=443): Max retries exceeded with url: /reimers/sentence-transformers/v0.2/distilroberta-base-paraphrase-v1.zip (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)'),))

    During handling of the above exception, another exception occurred:

    SSLError Traceback (most recent call last) SSLError: HTTPSConnectionPool(host='public.ukp.informatik.tu-darmstadt.de', port=443): Max retries exceeded with url: /reimers/sentence-transformers/v0.2/distilroberta-base-paraphrase-v1.zip (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)'),))

    During handling of the above exception, another exception occurred:

    FileNotFoundError Traceback (most recent call last) /usr/lib/python3.6/shutil.py in rmtree(path, ignore_errors, onerror) 473 # lstat()/open()/fstat() trick. 474 try: --> 475 orig_st = os.lstat(path) 476 except Exception: 477 onerror(os.lstat, path, sys.exc_info())

    FileNotFoundError: [Errno 2] No such file or directory: '/root/.cache/torch/sentence_transformers/sbert.net_models_distilroberta-base-paraphrase-v1'

    opened by rahuliitkgp31 13
  • model.fit results in nan

    Hi,

    I want to fine-tune SBERT with pre-trained weights of 'bert-base-uncased'. I follow this tutorial: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/nli/training_nli_v2.py using MultipleNegativesRankingLoss loss function.

    When I do model.fit , the results are 'nan' everywhere.

    here is my code:

    root_model = AutoModel.from_pretrained('bert-base-uncased')
    output_dir = "/root/Automated_Assessment_(ETS)/Model/DRAFT/DRAFT_Bert_base_uncased"
    BERT_model = root_model.save_pretrained(output_dir)
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # ('onlplab/alephbert-base')
    tokenizer.save_pretrained(output_dir)

    learning_rate, batch_size, epochs = 2e-5, 8, 1

    train_dataloader = datasets.NoDuplicatesDataLoader(train_data, batch_size=batch_size)
    word_embedding_model = models.Transformer(output_dir, max_seq_length=512)
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

    train_loss = losses.MultipleNegativesRankingLoss(model)
    val_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(val_data, batch_size=batch_size)

    warmup_steps = math.ceil(len(train_dataloader) * epochs * 0.1)  # 10% of train data for warm-up
    logging.info("Warmup-steps: {}".format(warmup_steps))

    output_file = 'output/sentence_similarity' + MODEL_NAME.replace("/", "-") + '-' + datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    sb_output_path = os.path.join(ref_saved_models_path, output_file)

    model.fit(train_objectives=[(train_dataloader, train_loss)],
              evaluator=val_evaluator,
              epochs=epochs,
              evaluation_steps=int(len(train_dataloader) * 0.1),
              warmup_steps=warmup_steps,
              output_path=sb_output_path,
              use_amp=False)  # Set to True, if your GPU supports FP16 operations

    here is a screenshot of the log (attached as an image in the original issue).

    I don't understand what I am doing wrong. Could you please help me?

    opened by Abigail-gs 0
  • Dtype error when using Pooling + Dense layers with half precision

    The models.Pooling layer seems to always output a 32-bit float as its sentence_embedding. This leads to a dtype error when using a dense layer after the pooling layer while the model is in half-precision mode via model.half()

    Here is a minimal example:

    from sentence_transformers import SentenceTransformer,models
    from torch import nn
    
    word_embedding_model = models.Transformer("sentence-transformers/all-MiniLM-L6-v2")
    polling = models.Pooling(word_embedding_model.get_word_embedding_dimension(),"mean")
    dense = models.Dense(word_embedding_model.get_word_embedding_dimension(), out_features=64, activation_function=nn.Tanh())
    
    #This works as expected
    sentence_transformer_without_dense = SentenceTransformer(modules=[word_embedding_model,polling])
    sentence_transformer_without_dense.half()
    
    print(sentence_transformer_without_dense.encode("Hello World"))
    
    #This will throw an error
    sentence_transformer_with_dense = SentenceTransformer(modules=[word_embedding_model,polling,dense])
    sentence_transformer_with_dense.half()
    
    print(sentence_transformer_with_dense.encode("Hello World"))
    

    Is this the expected behaviour or a bug?

    opened by LLukas22 0
  • How can I use models.Dense() layer with DenoisingAutoEncoderLoss()?

    When creating a SentenceTransformer as follows:

    word_embedding_model = Transformer(
      model_name_or_path=model_name_or_path, # "bert-base-uncased"
      max_seq_length=max_seq_length, # 384
      cache_dir=cache_dir,
      tokenizer_args=tokenizer_args, # {"truncation": True, "padding": "max_length, "max_length": 384}
      do_lower_case=do_lower_case, # True
      tokenizer_name_or_path=tokenizer_name_or_path # "bert-base-uncased"
     )
    
    word_embedding_dimension = word_embedding_model.get_word_embedding_dimension()
    pooling_mode = "cls"
    pooling_model = Pooling(
        word_embedding_dimension=word_embedding_dimension,
        pooling_mode=pooling_mode,
    )
    
    in_features = pooling_model.get_sentence_embedding_dimension()
    out_features = config["parameters"]["num_dense_dimensions"] # 256
    dense_model = Dense(
        in_features=in_features,
        out_features=out_features,
        activation_function=nn.Tanh(),
    )
    
    modules = [
        word_embedding_model,
        pooling_model,
        dense_model,
    ]
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    cache_folder = os.path.join(cache_location)
    model = SentenceTransformer(
        modules=modules,
        device=device,
        cache_folder=cache_folder,
    )
    

    And creating the following DenoisingAutoEncoderLoss:

    train_loss = DenoisingAutoEncoderLoss(
        model=model,
        tie_encoder_decoder=tie_encoder_decoder, # True
    )
    

    With this training setting:

    train_objectives = [
        (train_dataloader, train_loss)
    ]
    evaluator = MSEEvaluator(
        source_sentences=source_sentences,
        target_sentences=target_sentences,
        teacher_model=model,
        show_progress_bar=True,
        batch_size=batch_size, # batch_size = 16
        name="job2vec",
        write_csv=True,
    )
    
    def free_memory(score, epoch, steps):
        torch.cuda.empty_cache()
        gc.collect()
    
    epochs = config["hyperparameters"]["num_epochs"]
    warmup_steps = config["hyperparameters"]["warmup_steps"]
    evaluation_steps = batch_size * 32, # batch_size = 16
    output_path = os.path.join(cache_location, "job2vec")
    save_best_model = True
    use_amp = True
    callback = free_memory,
    show_progress_bar = True
    checkpoint_path = os.path.join(cache_location, "job2vec/checkpoints")
    checkpoint_save_steps = len(train_dataloader)
    model.fit(
        train_objectives=train_objectives,
        evaluator=evaluator,
        epochs=epochs,
        warmup_steps=warmup_steps,
        evaluation_steps=evaluation_steps,
        output_path=output_path,
        save_best_model=save_best_model,
        show_progress_bar=show_progress_bar,
        use_amp=use_amp,
        callback=callback,
        checkpoint_path=checkpoint_path,
        checkpoint_save_steps=checkpoint_save_steps,
    )
    

    Then the following error occurs:

    Traceback (most recent call last):
      File "src/denoising_autoencoder.py", line 216, in <module> 
        main()
      File "src/denoising_autoencoder.py", line 213, in main
        train()
      File "src/denoising_autoencoder.py", line 209, in train
        checkpoint_save_steps=checkpoint_save_steps,
      File "venv/lib/python3.7/site-packages/sentence_transformers/SentenceTransformer.py", line 710, in fit
        loss_value = loss_model(features, labels)
      File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
        return forward_call(*input, **kwargs)
      File "venv/lib/python3.7/site-packages/sentence_transformers/losses/DenoisingAutoEncoderLoss.py", line 119, in forward
        use_cache=False
      File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
        return forward_call(*input, **kwargs)
      File "venv/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 1250, in forward
        return_dict=return_dict,
      File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
        return forward_call(*input, **kwargs)
      File "venv/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 1031, in forward
        return_dict=return_dict,
      File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
        return forward_call(*input, **kwargs)
      File "venv/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 617, in forward
        output_attentions,
      File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
        return forward_call(*input, **kwargs)
      File "venv/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 529, in forward
        output_attentions,
      File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
        return forward_call(*input, **kwargs)
      File "venv/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 433, in forward
        output_attentions,
      File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
        return forward_call(*input, **kwargs)
      File "venv/lib/python3.7/site-packages/transformers/models/bert/modeling_bert.py", line 298, in forward
        key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))
      File "venv/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
        return forward_call(*input, **kwargs)
      File "venv/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 114, in forward
        return F.linear(input, self.weight, self.bias)
    RuntimeError: mat1 and mat2 shapes cannot be multiplied (16x256 and 768x768)
    

    How can I use the models.Dense() layer with the DenoisingAutoEncoderLoss?

    opened by niquet 0
  • How to distill model with different tokenizer?

    I am trying to train word embedding models to match embeddings from a sentence transformer, and using model_distillation won't cut it, because when running student_model.fit the model uses student's smart_batching_collate so the teacher model gets wrong tokens.

    Has anybody worked on something similar? I don't see any workaround other than rewriting the SentenceTransformer.fit method, but maybe there's easier way to do this?

    opened by lambdaofgod 0
  • Community detection algorithm can loop forever

    If only one vector is passed, the community detection algorithm will loop forever.

    I suggest adding

    assert embeddings.shape[0] >= 2, "Embeddings should contain at least two vectors"
    assert embeddings.shape[0] >= min_community_size, "Number of vectors is less than specified min_community_size"
    

    checks. (Can open a pull request for this)

    opened by maiiabocharova 0
  • Override tokenizer args of sentencetransformer

    How can we apply a sliding window with the SentenceTransformer tokenizer? I want to be able to override return_overflowing_tokens=True and stride in the default tokenizer to enable the sliding window.

    opened by datashinobi 0
Releases (v2.2.2)
  • v2.2.2(Jun 26, 2022)

    huggingface_hub dropped support for Python 3.6 in version 0.5.0

    This release fixes the issue so that huggingface_hub with version 0.4.0 and Python 3.6 can still be used.

    Source code(tar.gz)
    Source code(zip)
  • v2.2.1(Jun 23, 2022)

    Version 0.8.1 of huggingface_hub introduces several changes that resulted in errors and warnings. This version of sentence-transformers fixes these issues.

    Further, several improvements have been added / merged:

    • util.community_detection was improved: 1) It works in a batched mode to save memory, 2) Overlapping clusters are no longer dropped; instead, the overlapping items are removed, 3) The parameter init_max_size was removed and replaced by a heuristic to estimate the maximum size of clusters (see the sketch after this list)
    • #1581 the training dataset names can be saved in the model card
    • #1426 fix the text summarization example
    • #1487 Recursive sentence-transformers models are now possible
    • #1522 Private models can now be loaded
    • #1551 DataLoaders can now have workers
    • #1565 Models are only checked on the hub if they don't exist in the local cache. This fixes problems caused by connectivity issues
    • #1591 Example added how to stream encode larger datasets
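
    A hedged sketch of the batched util.community_detection (the sentences and threshold are made up; the function groups embeddings whose cosine similarity exceeds the threshold):

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer('paraphrase-distilroberta-base-v1')
    sentences = ['How do I reset my password?',
                 'I forgot my password, how can I reset it?',
                 'What is the shipping cost?',
                 'How much does shipping cost?']
    embeddings = model.encode(sentences, convert_to_tensor=True)
    communities = util.community_detection(embeddings, threshold=0.75, min_community_size=2)
    for community in communities:
        print([sentences[idx] for idx in community])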
    Source code(tar.gz)
    Source code(zip)
  • v2.2.0(Feb 10, 2022)

    T5

    You can now use the encoder from T5 to learn text embeddings. You can use it like any other transformer model:

    from sentence_transformers import SentenceTransformer, models
    word_embedding_model = models.Transformer('t5-base', max_seq_length=256)
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
    

    See T5-Benchmark results - the T5 encoder is not the best model for learning text embeddings. It requires quite a lot of training data and training steps. Other models perform much better, at least in the given experiment with 560k training triplets.

    New Models

    The models from the papers Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models and Large Dual Encoders Are Generalizable Retrievers have been added:

    For benchmark results, see https://seb.sbert.net

    Private Models

    Thanks to #1406 you can now load private models from the hub:

    model = SentenceTransformer("your-username/your-model", use_auth_token=True)
    
    Source code(tar.gz)
    Source code(zip)
  • v2.1.0(Oct 1, 2021)

    This is a smaller release with some new features

    MarginMSELoss

    MarginMSELoss is a great method to train embeddings model with the help of a cross-encoder model. The details are explained here: MSMARCO - MarginMSE Training

    You pass your training data in the format:

    InputExample(texts=[query, positive, negative], label=cross_encoder.predict([query, positive]) - cross_encoder.predict([query, negative]))
    

    MultipleNegativesSymmetricRankingLoss

    MultipleNegativesRankingLoss computes the loss just in one way: Find the correct answer for a given question.

    MultipleNegativesSymmetricRankingLoss also computes the loss in the other direction: Find the correct question for a given answer.
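
    A minimal sketch of swapping in the symmetric loss (the training pairs below are made up; the data format is the same as for MultipleNegativesRankingLoss):

    from sentence_transformers import SentenceTransformer, InputExample, losses
    from torch.utils.data import DataLoader

    model = SentenceTransformer('paraphrase-distilroberta-base-v1')
    train_examples = [InputExample(texts=['What is the capital of France?', 'Paris is the capital of France.']),
                      InputExample(texts=['Who wrote Faust?', 'Faust was written by Goethe.'])]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
    train_loss = losses.MultipleNegativesSymmetricRankingLoss(model)
    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)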

    Breaking Change: CLIPModel

    The CLIPModel is now based on the transformers model.

    You can still load it like this:

    model = SentenceTransformer('clip-ViT-B-32')
    

    Older SentenceTransformers versions are no longer able to load and use the 'clip-ViT-B-32' model.

    Added files on the hub are automatically downloaded

    PR #1116 checks if you have all files in your local cache or if there are added files on the hub. If this is the case, it will automatically download them.

    SentenceTransformers.encode() can return all values

    When you set output_value=None for the encode method, all values (token_ids, token_embeddings, sentence_embedding) will be returned.
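
    For example (a sketch, assuming a model loaded as elsewhere in these notes):

    outputs = model.encode(['Hello World'], output_value=None)
    print(outputs[0].keys())  # token_ids, token_embeddings, sentence_embedding (per this release note)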

    Source code(tar.gz)
    Source code(zip)
  • v2.0.0(Jun 24, 2021)

    Models hosted on the hub

    All pre-trained models are now hosted on the Huggingface Models hub.

    Our pre-trained models can be found here: https://huggingface.co/sentence-transformers

    But you can easily share your own sentence-transformer model on the hub and have other people easily access it. Simply upload the folder and have people load it via:

    model = SentenceTransformer('[your_username]/[model_name]')
    

    For more information, see: Sentence Transformers in the Hugging Face Hub

    Breaking changes

    There should be no breaking changes. Old models can still be loaded from disk. However, if you use one of the provided pre-trained models, it will be downloaded again in version 2 of sentence transformers, as the cache path has slightly changed.

    Find sentence-transformer models on the Hub

    You can filter the hub for sentence-transformers models: https://huggingface.co/models?filter=sentence-transformers

    Add the sentence-transformers tag to your model card so that others can find your model.

    Widget & Inference API

    A widget was added to sentence-transformers models on the hub that lets you interact directly on the models website: https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2

    Further, models can now be used with the Accelerated Inference API: Send your sentences to the API and get back the embeddings from the respective model.

    Save Model to Hub

    A new method was added to the SentenceTransformer class: save_to_hub.

    Provide the model name and the model is saved on the hub.
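
    For example (a sketch; the repository name below is made up, and you need to be logged in to the Hugging Face hub):

    model.save_to_hub("my-new-sentence-model")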

    Here you find the explanation from transformers how the hub works: Model sharing and uploading

    Automatic Model Card

    When you save a model with save or save_to_hub, a README.md (also known as model card) is automatically generated with basic information about the respective SentenceTransformer model.

    New Models

    Source code(tar.gz)
    Source code(zip)
  • v1.2.1(Jun 24, 2021)

  • v1.2.0(May 12, 2021)

    Unsupervised Sentence Embedding Learning

    New methods integrated to train sentence embedding models without labeled data. See Unsupervised Learning for an overview of all existent methods.

    New methods:

    Pre-Training Methods

    • MLM: An example script to run Masked-Language-Modeling (MLM). Running MLM on your custom data before supervised training can significantly improve the performance. Further, MLM also works well for domain transfer: You first train on your custom data, and then train with e.g. NLI or STS data.

    Training Examples

    New models

    New Functions

    • SentenceTransformer.fit() Checkpoints: The fit() method now allows saving checkpoints during training at a fixed number of steps. More info
    • Pooling-mode as string: You can now pass the pooling-mode to models.Pooling() as string:
      pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), pooling_mode='mean')
      

      Valid values are mean/max/cls.

    • NoDuplicatesDataLoader: When using the MultipleNegativesRankingLoss, one should avoid having duplicate sentences in the same batch. This data loader simplifies this task and ensures that no duplicate entries are in the same batch.
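
    A minimal sketch of using the NoDuplicatesDataLoader (train_examples is a list of InputExample objects, as in the other training examples; the pairs below are made up):

    from sentence_transformers import InputExample
    from sentence_transformers.datasets import NoDuplicatesDataLoader

    train_examples = [InputExample(texts=['My first sentence', 'My second sentence']),
                      InputExample(texts=['Another pair', 'Unrelated sentence'])]
    train_dataloader = NoDuplicatesDataLoader(train_examples, batch_size=2)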
    Source code(tar.gz)
    Source code(zip)
  • v1.1.0(Apr 21, 2021)

    Unsupervised Sentence Embedding Learning

    This release integrates methods that allow learning sentence embeddings without labeled data:

    • TSDAE: TSDAE uses a denoising auto-encoder to learn sentence embeddings. The method has been presented in our recent paper and achieves state-of-the-art performance for several tasks.
    • GenQ: GenQ uses a pre-trained T5 system to generate queries for a given passage. It was presented in our recent BEIR paper and works well for domain adaptation for semantic search (https://www.sbert.net/examples/applications/semantic-search/README.html)

    New Models - SentenceTransformer

    • MSMARCO Dot-Product Models: We trained models using the dot-product instead of cosine similarity as similarity function. As shown in our recent BEIR paper, models with cosine-similarity prefer the retrieval of short documents, while models with dot-product prefer retrieval of longer documents. Now you can choose what is most suitable for your task.
    • MSMARCO MiniLM Models: We uploaded some models based on MiniLM: It uses just 384 dimensions, is faster than previous models and achieves nearly the same performance

    New Models - CrossEncoder

    New Features

    • You can now pass a default_activation_function to the CrossEncoder class, which is applied on top of the output logits generated by the class.
    • You can now pre-process images for the CLIP Model. Soon I will release a tutorial on how to fine-tune the CLIP Model with your data.
    Source code(tar.gz)
    Source code(zip)
  • v1.0.4(Apr 1, 2021)

    It was not possible to fine-tune and save the CLIPModel. This release fixes it. CLIPModel can now be saved like any other model by calling model.save(path)

    Source code(tar.gz)
    Source code(zip)
  • v1.0.3(Mar 22, 2021)

  • v1.0.2(Mar 19, 2021)

    v1.0.2 - Patch for CLIPModel, new Image Examples

    • Bugfix in CLIPModel: Too long inputs raised a RuntimeError. Now they are truncated.
    • New util function: util.paraphrase_mining_embeddings, to find most similar embeddings in a matrix
    • Image Clustering and Duplicate Image Detection examples added: more info
    Source code(tar.gz)
    Source code(zip)
  • v1.0.0(Mar 18, 2021)

    This release brings many new improvements and new features. Also, the version number scheme is updated. We now use the format x.y.z, with x for major releases, y for smaller releases with new features, and z for bugfixes.

    Text-Image-Model CLIP

    You can now encode text and images in the same vector space using the OpenAI CLIP Model. You can use the model like this:

    from sentence_transformers import SentenceTransformer, util
    from PIL import Image
    
    #Load CLIP model
    model = SentenceTransformer('clip-ViT-B-32')
    
    #Encode an image:
    img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))
    
    #Encode text descriptions
    text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])
    
    #Compute cosine similarities 
    cos_scores = util.cos_sim(img_emb, text_emb)
    print(cos_scores)
    

    More Information IPython Demo Colab Demo

    Examples how to train the CLIP model on your data will be added soon.

    New Models

    New Features

    • The Asym Model can now be used as the first model in a SentenceTransformer modules list.
    • Sorting when encoding changed: Previously, we encoded from short to long sentences. Now we encode from long to short sentences. Out-of-memory errors will then happen at the start. Also, the estimated duration of the encoding process is more precise
    • Improvement of the util.semantic_search method: It now uses the much faster torch.topk function. Further, you can define which scoring function should be used
    • New util methods: util.dot_score computes the dot product of two embedding matrices. util.normalize_embeddings will normalize embeddings to unit length
    • New parameter for the SentenceTransformer.encode method: normalize_embeddings. If set to True, embeddings are normalized to unit length. In that case the faster util.dot_score can be used instead of util.cos_sim to compute cosine similarity scores.
    • If you specify models.Transformer(do_lower_case=True) when creating a new SentenceTransformer, then all input will be lower-cased.

    New Examples

    Bugfixes

    • Encode method now correctly returns token_embeddings if output_value='token_embeddings' is defined
    • Bugfix of the LabelAccuracyEvaluator
    • Bugfix: tensors are no longer moved to the CPU if you specified encode(sent, convert_to_tensor=True). They now stay on the GPU

    Breaking changes:

    • SentenceTransformer.encode method: Removed deprecated parameters is_pretokenized and num_workers
    Source code(tar.gz)
    Source code(zip)
  • v0.4.1(Jan 4, 2021)

    Refactored Tokenization

    • Faster tokenization speed: Using batched tokenization for training & inference - Now, all sentences in a batch are tokenized simultaneously.
    • Usage of the SentencesDataset no longer needed for training. You can pass your train examples directly to the DataLoader:
    train_examples = [InputExample(texts=['My first sentence', 'My second sentence'], label=0.8),
        InputExample(texts=['Another pair', 'Unrelated sentence'], label=0.3)]
    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    
    • If you use a custom torch DataSet class: The dataset class must now return InputExample objects instead of tokenized texts
    • Class SentenceLabelDataset has been updated to the new tokenization flow: It always returns two or more InputExamples with the same label

    Asymmetric Models: Added a new models.Asym class that allows different encoding of sentences based on some tag (e.g. query vs paragraph). Minimal example:

    word_embedding_model = models.Transformer(base_model, max_seq_length=250)
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
    d1 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
    d2 = models.Dense(word_embedding_model.get_word_embedding_dimension(), 256, bias=False, activation_function=nn.Identity())
    asym_model = models.Asym({'QRY': [d1], 'DOC': [d2]})
    model = SentenceTransformer(modules=[word_embedding_model, pooling_model, asym_model])
    
    ##Your input examples have to look like this:
    inp_example = InputExample(texts=[{'QRY': 'your query'}, {'DOC': 'your document text'}], label=1)
    
    ##Encoding (Note: Mixed inputs are not allowed)
    model.encode([{'QRY': 'your query1'}, {'QRY': 'your query2'}])
    

    Inputs that have the key 'QRY' will be passed through the d1 dense layer, while inputs with the key 'DOC' go through the d2 dense layer. More documentation on how to design asymmetric models will follow soon.

    New Namespace & Models for Cross-Encoders: Cross-Encoders are now hosted at https://huggingface.co/cross-encoder. Also, new pre-trained models have been added for NLI & QNLI.

    Logging: Log messages now use a custom logger from logging thanks to PR #623. This allows you to choose which log messages you want to see from which components.

    Unit tests: A lot more unit tests have been added, which test the different components of the framework.

    Source code(tar.gz)
    Source code(zip)
  • v0.4.0(Dec 22, 2020)

    • Updated the dependencies so that it works with Huggingface Transformers version 4. Sentence-Transformers still works with huggingface transformers version 3, but an update to version 4 of transformers is recommended. Future changes might break with transformers version 3.
    • New naming of pre-trained models. Models will be named: {task}-{transformer_model}. So 'bert-base-nli-stsb-mean-tokens' becomes 'stsb-bert-base'. Models will still be available under their old names, but newer models will follow the updated naming scheme.
    • New application example for information retrieval and question answering retrieval. Together with respective pre-trained models
    Source code(tar.gz)
    Source code(zip)
  • v0.3.9(Nov 18, 2020)

    This release only includes some smaller updates:

    • Code was tested with transformers 3.5.1, requirement was updated so that it works with transformers 3.5.1
    • As some parts and models require Pytorch >= 1.6.0, requirement was updated to require at least pytorch 1.6.0. Most of the code and models will work with older pytorch versions.
    • model.encode() stored the embeddings on the GPU, which required quite a lot of GPU memory when encoding millions of sentences. The embeddings are now moved to CPU once they are computed.
    • The CrossEncoder-Class now accepts a max_length parameter to control the truncation of inputs
    • The Cross-Encoder predict method now has an apply_softmax parameter that allows applying softmax on top of a multi-class output.
    Source code(tar.gz)
    Source code(zip)
  • v0.3.8(Oct 19, 2020)

    • Added support for training and using CrossEncoders
    • Data Augmentation method AugSBERT added
    • New models trained on large-scale paraphrase data. The models work much better on our internal benchmark than previous models: distilroberta-base-paraphrase-v1 and xlm-r-distilroberta-base-paraphrase-v1
    • New model for Information Retrieval trained on MS Marco: distilroberta-base-msmarco-v1
    • Improved MultipleNegativesRankingLoss loss function: Similarity function can be changed and is now cosine similarity (was dot-product before), further, similarity scores can be multiplied by a scaling factor. This allows the usage of NTXentLoss / InfoNCE loss.
    • New MegaBatchMarginLoss, inspired from the paper ParaNMT-Paper.

    Smaller changes:

    • Update InformationRetrievalEvaluator, so that it can work with large corpora (Millions of entries). Removed the query_chunk_size parameter from the evaluator
    • SentenceTransformer.encode method detaches tensors from compute graph
    • SentenceTransformer.fit() method - Parameter output_path_ignore_not_empty deprecated. No longer checks that target folder must be empty
    Source code(tar.gz)
    Source code(zip)
  • v0.3.7(Sep 29, 2020)

    • Upgrade transformers dependency, transformers 3.1.0, 3.2.0 and 3.3.1 are working
    • Added example code for model distillation: Sentence Embeddings models can be drastically reduced to e.g. only 2-4 layers while keeping 98+% of their performance. Code can be found in examples/training/distillation
    • Transformer models can now accept two inputs ['sentence 1', 'context for sent1'], which are encoded as the two inputs for BERT.

    Minor changes:

    • Tokenization in the multi-process encoding setup now happens in the child processes, not in the parent process.
    • Added models.Normalize() to allow the normalization of embeddings to unit length
    Source code(tar.gz)
    Source code(zip)
  • v0.3.6(Sep 11, 2020)

    Huggingface Transformers version 3.1.0 had a breaking change compared to the previous version 3.0.2

    This release fixes the issue so that Sentence-Transformers is compatible with Huggingface Transformers 3.1.0. Note that this and future versions will not be compatible with transformers < 3.1.0.

    Source code(tar.gz)
    Source code(zip)
  • v0.3.5(Sep 1, 2020)

    • The old FP16 training code in model.fit() was replaced by using Pytorch 1.6.0 automatic mixed precision (AMP). When setting model.fit(use_amp=True), AMP will be used. On suitable GPUs, this leads to a significant speed-up while requiring less memory.
    • Performance improvements in paraphrase mining & semantic search by replacing np.argpartition with torch.topk
    • If a sentence-transformer model is not found, it will fall back to huggingface transformers repository and create it with mean pooling.
    • Fixing huggingface transformers to version 3.0.2. Next release will make it compatible with huggingface transformers 3.1.0
    • Several bugfixes: downloading of files, multi-GPU encoding
    Source code(tar.gz)
    Source code(zip)
  • v0.3.4(Aug 24, 2020)

    • The documentation is substantially improved and can be found at: www.SBERT.net - Feedback welcome
    • The dataset to hold training InputExamples (dataset.SentencesDataset) now uses lazy tokenization, i.e., examples are tokenized once they are needed for a batch. If you set num_workers to a positive integer in your DataLoader, tokenization will happen in a background thread. This substantially reduces the start-up time for training.
    • model.encode() also uses a PyTorch DataSet + DataLoader. If you set num_workers to a positive integer, tokenization will happen in the background, leading to faster encoding speed for large corpora.
    • Added functions and an example for multi-GPU encoding - This method can be used to encode a corpus with multiple GPUs in parallel. No multi-GPU support for training yet.
    • Removed parallel_tokenization parameters from encode & SentencesDatasets - No longer needed with lazy tokenization and DataLoader worker threads.
    • Smaller bugfixes

    Breaking changes:

    • Renamed evaluation.BinaryEmbeddingSimilarityEvaluator to evaluation.BinaryClassificationEvaluator
    Source code(tar.gz)
    Source code(zip)
  • v0.3.3(Aug 6, 2020)

    New Functions

    • Multi-process tokenization (Linux only) for the model encode function. Significant speed-up when encoding large sets
    • Tokenization of datasets for training can now run in parallel (Linux Only)
    • New example for Quora Duplicate Questions Retrieval: See examples-folder
    • Many small improvements for training better models for Information Retrieval
    • Fixed LabelSampler (can be used to get batches with certain number of matching labels. Used for BatchHardTripletLoss). Moved it to DatasetFolder
    • Added new Evaluators for ParaphraseMining and InformationRetrieval
    • evaluation.BinaryEmbeddingSimilarityEvaluator no longer assumes a 50-50 split of the dataset. It computes the optimal threshold and measures accuracy
    • model.encode - When the convert_to_numpy parameter is set, the method returns a numpy matrix instead of a list of numpy vectors
    • New function: util.paraphrase_mining to perform paraphrase mining in a corpus. For an example see examples/training_quora_duplicate_questions/
    • New function: util.information_retrieval to perform information retrieval / semantic search in a corpus. For an example see examples/training_quora_duplicate_questions/

    Breaking Changes

    • The evaluators (like EmbeddingSimilarityEvaluator) no longer accept a DataLoader as argument. Instead, the sentences and scores are passed directly. Old code that uses the previous evaluators needs to be changed; it can use the class method from_input_examples(). See examples/training_transformers/training_nli.py for how to use the new evaluators.
    Source code(tar.gz)
    Source code(zip)
  • v0.3.2(Jul 23, 2020)

    This is a minor release. There should be no breaking changes.

    • ParallelSentencesDataset: Datasets are tokenized on-the-fly, saving some start-up time
    • util.pytorch_cos_sim method: New method to compute cosine similarity with pytorch. About 100 times faster than scipy cdist. The semantic_search.py example has been updated accordingly.
    • SentenceTransformer.encode: New parameter: convert_to_tensor. If set to true, encode returns one large pytorch tensor with your embeddings
    Source code(tar.gz)
    Source code(zip)
  • v0.3.1(Jul 22, 2020)

    This is a minor update that changes some classes for training & evaluating multilingual sentence embedding methods.

    The examples for training multi-lingual sentence embeddings models have been significantly extended. See docs/training/multilingual-models.md for details. An automatic script that downloads suitable data and extends sentence embeddings to multiple languages has been added.

    The following classes/files have been changed:

    • datasets/ParallelSentencesDataset.py: The dataset with parallel sentences is encoded on-the-fly, reducing the start-up time for extending a sentence embedding model to new languages. An embedding cache can be configured to store previously computed sentence embeddings during training.

    New evaluation files:

    • evaluation/MSEEvaluator.py - breaking change. This class now expects lists of strings with parallel (translated) sentences (a rough sketch follows after this list). The old class has been renamed to MSEEvaluatorFromDataLoader.py
    • evaluation/EmbeddingSimilarityEvaluatorFromList.py - Semantic Textual Similarity data can be passed as lists of strings & scores
    • evaluation/MSEEvaluatorFromDataFrame.py - MSE Evaluation of teacher and student embeddings based on data in a data frame
    • evaluation/MSEEvaluatorFromDataLoader.py - MSE Evaluation if data is passed as a data loader
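
    A rough sketch of how the reworked MSEEvaluator might be used for multilingual distillation. The constructor arguments (source_sentences, target_sentences, teacher_model) are assumptions based on the description above, and the student model name is a placeholder.

    from sentence_transformers import SentenceTransformer
    from sentence_transformers.evaluation import MSEEvaluator

    teacher_model = SentenceTransformer('paraphrase-distilroberta-base-v1')
    student_model = SentenceTransformer('xlm-r-distilroberta-base-paraphrase-v1')  # placeholder student model

    # Parallel (translated) sentences are passed as plain lists of strings
    english_sentences = ['How are you?', 'Where is the train station?']
    german_sentences = ['Wie geht es dir?', 'Wo ist der Bahnhof?']

    # Measures the MSE between teacher embeddings of the source sentences
    # and student embeddings of the target sentences
    mse_evaluator = MSEEvaluator(english_sentences, german_sentences, teacher_model=teacher_model)
    mse_evaluator(student_model)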

    Bugfixes:

    • model.encode() failed to sort sentences by length. This has been fixed, boosting encoding speed by reducing the overhead from padding tokens.
  • v0.3.0(Jul 9, 2020)

    This release updates HuggingFace transformers to v3.0.2. transformers introduced some breaking changes to the tokenization API. This and future versions will not be compatible with HuggingFace transformers v2.

    There are no known breaking changes for existing models or existing code. Models trained with version 2 can be loaded without issues.

    New Loss Functions

    Thanks to PRs #299 and #176, several new loss functions have been added: different triplet loss functions and ContrastiveLoss (a minimal usage sketch follows below).
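
    A minimal sketch of using one of the new loss functions with model.fit; the training pairs and label values (1 = similar, 0 = dissimilar) are illustrative only.

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, SentencesDataset, losses
    from sentence_transformers.readers import InputExample

    model = SentenceTransformer('paraphrase-distilroberta-base-v1')

    # Pairs labeled 1 (similar) or 0 (dissimilar) for the contrastive objective
    train_examples = [
        InputExample(texts=['A man is eating food.', 'A man is eating a meal.'], label=1),
        InputExample(texts=['A man is eating food.', 'The weather is nice.'], label=0),
    ]
    train_dataset = SentencesDataset(train_examples, model)
    train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=2)

    train_loss = losses.ContrastiveLoss(model=model)
    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)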

  • v0.2.6(Apr 16, 2020)

    This release updates huggingface/transformers to v2.8.0.

    New Features

    • models.Transformer: The Transformer model can now load any huggingface transformers model, like BERT, RoBERTa, XLNet, XLM-R, Electra, ... It is based on the AutoModel from HuggingFace. You no longer need the architecture-specific models (like models.BERT, models.RoBERTa). It also works with community models (see the sketch after this list).
    • Multilingual Training: Code is released for making mono-lingual sentence embedding models multi-lingual. See training_multilingual.py for an example. More documentation and details will follow soon.
    • WKPooling: Added a PyTorch implementation of SBERT-WK. Note: due to an inefficient QR-decomposition implementation in PyTorch, WKPooling can only be run on the CPU, which makes it about 40 times slower than mean pooling. For some models WKPooling improves performance, for others it doesn't.
    • WeightedLayerPooling: A new pooling layer that uses representations from all transformer layers and learns a weighted sum of them. So far no improvement compared to only averaging the last layer.
    • New pre-trained models released. Every available model is documented in a Google Spreadsheet for an easier overview.
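
    A minimal sketch of building a model from the generic Transformer module; the checkpoint name is just an example.

    from sentence_transformers import SentenceTransformer, models

    # Any huggingface transformers checkpoint can be loaded via AutoModel
    word_embedding_model = models.Transformer('bert-base-uncased', max_seq_length=256)

    # Mean pooling over the token embeddings yields a fixed-size sentence vector
    pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())

    model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
    embeddings = model.encode(['This is an example sentence.'])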

    Minor changes

    • Clean-up of the examples folder.
    • Model and tokenizer arguments can now be passed to the underlying transformers models.
    • Previous versions had issues with RoBERTa and XLM-RoBERTa where the wrong special tokens were added. This is fixed now; the code relies on huggingface transformers for the correct addition of special tokens to the input sentences.

    Breaking changes

    • STSDataReader: The default parameter values have been changed, so that it expects the sentences in the first two columns and the score in the third column. If you want to load the STS benchmark dataset, you can use the STSBenchmarkDataReader (a sketch follows below).
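
    A rough sketch of the readers after this change. The folder and file names are placeholders, and the column-index parameter names (s1_col_idx, s2_col_idx, score_col_idx) are assumptions based on the reader's constructor.

    from sentence_transformers.readers import STSDataReader, STSBenchmarkDataReader

    # New defaults: sentence1, sentence2 and score in columns 0, 1 and 2
    sts_reader = STSDataReader('my_sts_data', s1_col_idx=0, s2_col_idx=1, score_col_idx=2)
    train_examples = sts_reader.get_examples('train.tsv')  # placeholder file name

    # For the original STS benchmark files, use the dedicated reader instead
    benchmark_reader = STSBenchmarkDataReader('datasets/stsbenchmark')  # placeholder folder
    dev_examples = benchmark_reader.get_examples('sts-dev.csv')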
  • v0.2.5(Jan 10, 2020)

    huggingface/transformers was updated to version 2.3.0

    Changes:

    • ALBERT works (bug was fixed in transformers). Does not yield improvements compared to BERT / RoBERTa
    • T5 added (does not run on GPU due to a bug in transformers). Does not yield improvements compared to BERT / RoBERTa
    • CamemBERT added
    • XLM-RoBERTa added
  • v0.2.4(Dec 6, 2019)

    This version updates the underlying HuggingFace transformers package to v2.2.1.

    Changes:

    • DistilBERT and ALBERT modules added
    • Pre-trained models for RoBERTa and DistilBERT uploaded
    • Some smaller bug-fixes
  • v0.2.3(Aug 20, 2019)

    No breaking changes. Just update with pip install -U sentence-transformers

    Bugfixes:

    • SentenceTransformers can now be used on Windows (previously threw an exception about invalid tensor types)
    • Outputs a warning if the sequence length for BERT / RoBERTa is too long

    Improvements:

    • A flag can be set to hide the progress bar when a dataset is converted or an evaluator is executed
  • v0.2.2(Aug 19, 2019)

    Updated pytorch-transformers to v1.1.0 and added support for the RoBERTa model.

    Bugfixes:

    • Critical bugfix for SoftmaxLoss: classifier weights were not optimized in previous versions
    • Minor fix for including the timestamp in the output folder names
  • v0.2.1(Aug 16, 2019)
