EMNLP'2021: SimCSE: Simple Contrastive Learning of Sentence Embeddings

Last update: Dec 29, 2022

Overview

SimCSE: Simple Contrastive Learning of Sentence Embeddings

This repository contains the code and pre-trained models for our paper SimCSE: Simple Contrastive Learning of Sentence Embeddings.

**************************** Updates ****************************

8/31: Our paper has been accepted to EMNLP! Please check out our updated paper (with updated numbers and baselines).
5/12: We updated our unsupervised models with new hyperparameters and better performance.
5/10: We released our sentence embedding tool and demo code.
4/23: We released our training code.
4/20: We released our model checkpoints and evaluation code.
4/18: We released our paper. Check it out!

Overview

We propose a simple contrastive learning framework that works with both unlabeled and labeled data. Unsupervised SimCSE simply takes an input sentence and predicts itself in a contrastive learning framework, with only standard dropout used as noise. Our supervised SimCSE incorporates annotated pairs from NLI datasets into contrastive learning by using entailment pairs as positives and contradiction pairs as hard negatives. The following figure is an illustration of our models.
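As a rough illustration of the objective described above, here is a minimal pure-Python sketch of an in-batch contrastive loss. The function names and toy vectors are made up for this example and are not taken from the repository; the temperature of 0.05 mirrors the --temp value used in the example scripts. For unsupervised SimCSE, view1[i] and view2[i] would be two encodings of the same sentence under different dropout masks.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(view1, view2, temp=0.05):
    """Mean in-batch contrastive loss.

    view1[i] and view2[i] form the positive pair for sentence i; every
    view2[j] with j != i acts as an in-batch negative for view1[i].
    """
    losses = []
    for i, h in enumerate(view1):
        logits = [cosine(h, other) / temp for other in view2]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Toy embeddings, made up for illustration.
view1 = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # positives close to their anchors
swapped = [[0.1, 0.9], [0.9, 0.1]]   # positives far from their anchors

loss_aligned = contrastive_loss(view1, aligned)
loss_swapped = contrastive_loss(view1, swapped)
```

The loss is small when each sentence is closest to its own second view and large when another sentence in the batch is closer.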

Getting Started

We provide an easy-to-use sentence embedding tool based on our SimCSE model (see our Wiki for detailed usage). To use the tool, first install the simcse package from PyPI

pip install simcse

Or directly install it from our code

python setup.py install

Note that if you want to enable GPU encoding, you should install a version of PyTorch that supports CUDA. See the PyTorch official website for instructions.

After installing the package, you can load our model with just two lines of code

from simcse import SimCSE
model = SimCSE("princeton-nlp/sup-simcse-bert-base-uncased")

See model list for a full list of available models.

Then you can use our model for encoding sentences into embeddings

embeddings = model.encode("A woman is reading.")

Compute the cosine similarities between two groups of sentences

sentences_a = ['A woman is reading.', 'A man is playing a guitar.']
sentences_b = ['He plays guitar.', 'A woman is making a photo.']
similarities = model.similarity(sentences_a, sentences_b)
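To make the shape of the result concrete: for N queries and M keys, the similarity call yields an N x M array of cosine similarities. The sketch below computes the same kind of matrix in pure Python on made-up vectors; cosine_matrix is a hypothetical helper written for this example, not part of the simcse package.

```python
import math

def cosine_matrix(queries, keys):
    """Return an N x M nested list of cosine similarities."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    q = [unit(v) for v in queries]
    k = [unit(v) for v in keys]
    return [[sum(a * b for a, b in zip(qv, kv)) for kv in k] for qv in q]

# Toy embeddings standing in for encoded sentences.
emb_a = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
emb_b = [[3.0, 0.0, 0.0], [0.0, 1.0, 1.0], [0.0, 0.0, 5.0]]

sims = cosine_matrix(emb_a, emb_b)   # 2 rows (queries) x 3 columns (keys)
```

Row i holds the similarities between query i and every key, so sims[i][j] compares the i-th sentence of the first group with the j-th sentence of the second.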

Or build index for a group of sentences and search among them

sentences = ['A woman is reading.', 'A man is playing a guitar.']
model.build_index(sentences)
results = model.search("He plays guitar.")

We also support faiss, an efficient similarity search library. Just install the package following instructions here and simcse will automatically use faiss for efficient search.

WARNING: We have found that faiss does not support Nvidia Ampere GPUs (3090 and A100) well. In that case, you should switch to other GPUs or install the CPU version of the faiss package.

We also provide an easy-to-build demo website to show how SimCSE can be used in sentence retrieval. The code is based on DensePhrases' repo and demo (a lot of thanks to the authors of DensePhrases).

Model List

Our released models are listed as follows. You can import these models by using the simcse package or using HuggingFace's Transformers.

Model	Avg. STS
princeton-nlp/unsup-simcse-bert-base-uncased	76.25
princeton-nlp/unsup-simcse-bert-large-uncased	78.41
princeton-nlp/unsup-simcse-roberta-base	76.57
princeton-nlp/unsup-simcse-roberta-large	78.90
princeton-nlp/sup-simcse-bert-base-uncased	81.57
princeton-nlp/sup-simcse-bert-large-uncased	82.21
princeton-nlp/sup-simcse-roberta-base	82.52
princeton-nlp/sup-simcse-roberta-large	83.76

Note that the results are slightly better than what we have reported in the current version of the paper after adopting a new set of hyperparameters (for hyperparameters, see the training section).

Naming rules: unsup and sup represent "unsupervised" (trained on Wikipedia corpus) and "supervised" (trained on NLI datasets) respectively.

Use SimCSE with Huggingface

Besides using our provided sentence embedding tool, you can also easily import our models with HuggingFace's transformers:

import torch
from scipy.spatial.distance import cosine
from transformers import AutoModel, AutoTokenizer

# Import our models. The package will take care of downloading the models automatically
tokenizer = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
model = AutoModel.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")

# Tokenize input texts
texts = [
    "There's a kid on a skateboard.",
    "A kid is skateboarding.",
    "A kid is inside the house."
]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Get the embeddings
with torch.no_grad():
    embeddings = model(**inputs, output_hidden_states=True, return_dict=True).pooler_output

# Calculate cosine similarities
# Cosine similarities are in [-1, 1]. Higher means more similar
cosine_sim_0_1 = 1 - cosine(embeddings[0], embeddings[1])
cosine_sim_0_2 = 1 - cosine(embeddings[0], embeddings[2])

print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[1], cosine_sim_0_1))
print("Cosine similarity between \"%s\" and \"%s\" is: %.3f" % (texts[0], texts[2], cosine_sim_0_2))

If you encounter any problem when directly loading the models by HuggingFace's API, you can also download the models manually from the above table and use model = AutoModel.from_pretrained({PATH TO THE DOWNLOAD MODEL}).

Train SimCSE

In the following section, we describe how to train a SimCSE model by using our code.

Requirements

First, install PyTorch by following the instructions from the official website. To faithfully reproduce our results, please use the 1.7.1 build that matches your platform and CUDA version. PyTorch versions higher than 1.7.1 should also work. For example, if you use Linux and CUDA11 (how to check CUDA version), install PyTorch by the following command,

pip install torch==1.7.1+cu110 -f https://download.pytorch.org/whl/torch_stable.html

If you instead use CUDA <11 or CPU, install PyTorch by the following command,

pip install torch==1.7.1

Then run the following script to install the remaining dependencies,

pip install -r requirements.txt

Evaluation

Our evaluation code for sentence embeddings is based on a modified version of SentEval. It evaluates sentence embeddings on semantic textual similarity (STS) tasks and downstream transfer tasks. For STS tasks, our evaluation takes the "all" setting and reports Spearman's correlation. See our paper (Appendix B) for evaluation details.
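For readers unfamiliar with the metric, Spearman's correlation is the Pearson correlation of the ranks of two score lists. The following is a small self-contained sketch with made-up gold and predicted scores; it is not the SentEval implementation, and the helper names are hypothetical. The tables in this document appear to show the correlation multiplied by 100.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

gold = [0.0, 1.5, 3.0, 4.2, 5.0]            # made-up human similarity scores
predicted = [0.10, 0.35, 0.30, 0.80, 0.95]  # made-up cosine similarities
rho = spearman(gold, predicted)
```

Only the ordering of the predicted similarities matters, which is why cosine similarities can be compared against human ratings on a different scale.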

Before evaluation, please download the evaluation datasets by running

cd SentEval/data/downstream/
bash download_dataset.sh

Then return to the root directory. You can evaluate any transformers-based pre-trained model using our evaluation code. For example,

python evaluation.py \
    --model_name_or_path princeton-nlp/sup-simcse-bert-base-uncased \
    --pooler cls \
    --task_set sts \
    --mode test

which is expected to output the results in a tabular format:

------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 75.30 | 84.67 | 80.19 | 85.40 | 80.82 |    84.26     |      80.39      | 81.58 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
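The Avg. column in this table is consistent with an unweighted mean of the seven task scores, which the short check below reproduces from the numbers shown above.

```python
# Scores copied from the example output table.
scores = {
    "STS12": 75.30, "STS13": 84.67, "STS14": 80.19, "STS15": 85.40,
    "STS16": 80.82, "STSBenchmark": 84.26, "SICKRelatedness": 80.39,
}
avg = sum(scores.values()) / len(scores)
print(f"{avg:.2f}")  # prints 81.58
```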

Arguments for the evaluation script are as follows,

--model_name_or_path: The name or path of a transformers-based pre-trained checkpoint. You can directly use the models in the above table, e.g., princeton-nlp/sup-simcse-bert-base-uncased.
--pooler: Pooling method. Now we support
- cls (default): Use the representation of [CLS] token. A linear+activation layer is applied after the representation (it's in the standard BERT implementation). If you use supervised SimCSE, you should use this option.
- cls_before_pooler: Use the representation of [CLS] token without the extra linear+activation. If you use unsupervised SimCSE, you should take this option.
- avg: Average embeddings of the last layer. If you use checkpoints of SBERT/SRoBERTa (paper), you should use this option.
- avg_top2: Average embeddings of the last two layers.
- avg_first_last: Average embeddings of the first and last layers. If you use vanilla BERT or RoBERTa, this works the best.
--mode: Evaluation mode
- test (default): The default test mode. To faithfully reproduce our results, you should use this option.
- dev: Report the development set results. Note that in STS tasks, only STS-B and SICK-R have development sets, so we only report their numbers. It also takes a fast mode for transfer tasks, so the running time is much shorter than the test mode (though numbers are slightly lower).
- fasttest: It is the same as test, but with a fast mode so the running time is much shorter, but the reported numbers may be lower (only for transfer tasks).
--task_set: What set of tasks to evaluate on (if set, it will override --tasks)
- sts (default): Evaluate on STS tasks, including STS 12~16, STS-B and SICK-R. This is the most commonly-used set of tasks to evaluate the quality of sentence embeddings.
- transfer: Evaluate on transfer tasks.
- full: Evaluate on both STS and transfer tasks.
- na: Manually set tasks by --tasks.
--tasks: Specify which dataset(s) to evaluate on. Will be overridden if --task_set is not na. See the code for a full list of tasks.
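The pooling options above can be sketched on plain Python lists. This is an illustration only, under the assumption that each layer is given as a list of per-token vectors and that padding tokens have attention mask 0; the function names are hypothetical and not the repository's. The extra linear+activation layer of the cls option and the avg_top2 option are not modeled here.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def pool(last_layer, attention_mask, pooler, first_layer=None):
    """Reduce per-token vectors of one sentence to a single vector."""
    if pooler == "cls_before_pooler":
        # The [CLS] token is the first position of the sequence.
        return list(last_layer[0])
    if pooler == "avg":
        return masked_mean(last_layer, attention_mask)
    if pooler == "avg_first_last":
        # Average the two layers token by token, then average over tokens.
        mixed = [[(a + b) / 2 for a, b in zip(f_vec, l_vec)]
                 for f_vec, l_vec in zip(first_layer, last_layer)]
        return masked_mean(mixed, attention_mask)
    raise ValueError(f"unsupported pooler in this sketch: {pooler}")

# One sentence with two real tokens and one padding token (toy numbers).
last = [[1.0, 1.0], [3.0, 5.0], [9.0, 9.0]]
first = [[0.0, 0.0], [1.0, 1.0], [7.0, 7.0]]
mask = [1, 1, 0]

cls_vec = pool(last, mask, "cls_before_pooler")
avg_vec = pool(last, mask, "avg")
avg_fl_vec = pool(last, mask, "avg_first_last", first_layer=first)
```

The padding position is excluded from both averages, so the result does not depend on how much padding a batch happens to contain.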

Training

Data

For unsupervised SimCSE, we sample 1 million sentences from English Wikipedia; for supervised SimCSE, we use the SNLI and MNLI datasets. You can run data/download_wiki.sh and data/download_nli.sh to download the two datasets.

Training scripts

We provide example training scripts for both unsupervised and supervised SimCSE. In run_unsup_example.sh, we provide a single-GPU (or CPU) example for the unsupervised version, and in run_sup_example.sh we give a multiple-GPU example for the supervised version. Both scripts call train.py for training. We explain the arguments below:

--train_file: Training file path. We support "txt" files (one line for one sentence) and "csv" files (2-column: pair data with no hard negative; 3-column: pair data with one corresponding hard negative instance). You can use our provided Wikipedia or NLI data, or you can use your own data with the same format.
--model_name_or_path: Pre-trained checkpoints to start with. For now we support BERT-based models (bert-base-uncased, bert-large-uncased, etc.) and RoBERTa-based models (roberta-base, roberta-large, etc.).
--temp: Temperature for the contrastive loss.
--pooler_type: Pooling method. It takes the same options as --pooler in the evaluation part.
--mlp_only_train: We have found that for unsupervised SimCSE, it works better to train the model with MLP layer but test the model without it. You should use this argument when training unsupervised SimCSE models.
--hard_negative_weight: If using hard negatives (i.e., there are 3 columns in the training file), this is the logarithm of the weight. For example, if the weight is 1, then this argument should be set as 0 (default value).
--do_mlm: Whether to use the MLM auxiliary objective. If True:
- --mlm_weight: Weight for the MLM objective.
- --mlm_probability: Masking rate for the MLM objective.
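To illustrate why --hard_negative_weight is described as the logarithm of the weight: adding a constant w to the logit of a hard negative multiplies its term in the softmax denominator by e^w. The sketch below shows this for a single anchor. It assumes, as an illustration rather than a statement about train.py, that only the anchor's own hard negative receives the extra weight; all names and numbers are made up.

```python
import math

def supervised_loss_for_anchor(pos_sims, neg_sims, i, temp=0.05,
                               hard_negative_weight=0.0):
    """Contrastive loss for anchor i.

    pos_sims[j]: similarity between anchor i and the positive of example j.
    neg_sims[j]: similarity between anchor i and the hard negative of example j.
    hard_negative_weight: log of the weight applied to anchor i's own hard negative.
    """
    logits = [s / temp for s in pos_sims]
    for j, s in enumerate(neg_sims):
        bonus = hard_negative_weight if j == i else 0.0
        logits.append(s / temp + bonus)
    # Numerically stable log-sum-exp over positives and hard negatives.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[i]

pos = [0.8, 0.1]   # toy similarities to the two positives in the batch
neg = [0.5, 0.0]   # toy similarities to the two hard negatives

loss_weight_1 = supervised_loss_for_anchor(pos, neg, i=0, temp=1.0,
                                           hard_negative_weight=0.0)
loss_weight_2 = supervised_loss_for_anchor(pos, neg, i=0, temp=1.0,
                                           hard_negative_weight=math.log(2))
```

With the default value 0 the weight is e^0 = 1, which matches the example in the argument description; passing log(2) counts the anchor's own hard negative twice in the denominator.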

All other arguments are standard Huggingface transformers training arguments. Some frequently used ones are --output_dir, --learning_rate, and --per_device_train_batch_size. Our example scripts also evaluate the model on the STS-B development set (you need to download the dataset as described in the evaluation section) and save the best checkpoint.

For results in the paper, we use Nvidia 3090 GPUs with CUDA 11. Using different types of devices or different versions of CUDA or other software may lead to slightly different performance.

Hyperparameters

We use the following hyperparameters for training SimCSE:

	Unsup. BERT	Unsup. RoBERTa	Sup.
Batch size	64	512	512
Learning rate (base)	3e-5	1e-5	5e-5
Learning rate (large)	1e-5	3e-5	1e-5

Convert models

Our saved checkpoints are slightly different from Huggingface's pre-trained checkpoints. Run python simcse_to_huggingface.py --path {PATH_TO_CHECKPOINT_FOLDER} to convert it. After that, you can evaluate it by our evaluation code or directly use it out of the box.

Bugs or questions?

If you have any questions related to the code or the paper, feel free to email Tianyu ([email protected]) and Xingcheng ([email protected]). If you encounter any problems when using the code, or want to report a bug, you can open an issue. Please try to specify the problem with details so we can help you better and quicker!

Citation

Please cite our paper if you use SimCSE in your work:

@inproceedings{gao2021simcse,
   title={{SimCSE}: Simple Contrastive Learning of Sentence Embeddings},
   author={Gao, Tianyu and Yao, Xingcheng and Chen, Danqi},
   booktitle={Empirical Methods in Natural Language Processing (EMNLP)},
   year={2021}
}

SimCSE Elsewhere

We thank the community's efforts for extending SimCSE!

Jianlin Su has provided a Chinese version of SimCSE.
AK391 integrated SimCSE into Huggingface Spaces with Gradio.
Nils Reimers has implemented a sentence-transformers-based training code for SimCSE.

Comments

Troubles reproducing the results

Hi, folks! Thank you very much for the hard work (^^) I have a question on how to reproduce the results -- not that I am aiming to spot the differences, just making sure that I am running the code correctly.

I use the run_run_unsup_example.sh script to train the unsupervised SimCSE. At the end of the training procedure, I run evaluation as follows: time CUDA_VISIBLE_DEVICES=0 python evaluation.py --model_name_or_path result/my-unsup-simcse-bert-base-uncased --pooler cls --task_set sts --mode test. The results table I am getting is:

------ test ------
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 46.88 | 56.47 | 58.33 | 65.43 | 58.92 |    56.71     |      55.36      | 56.87 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

I believe the table I should be comparing to is Table 5 from the paper. The relevant row shows:

∗SimCSE-BERTbase | 68.40 | 82.41 | 74.38 | 80.91 | 78.56 | 76.85 | 72.23 | 76.25|

This is far better than what I get. Can you help me understand if I am doing something wrong? I followed the main README.md file. The content of the run_run_unsup_example.sh script is:

python train.py \
    --model_name_or_path bert-base-uncased \
    --train_file data/wiki1m_for_simcse.txt \
    --output_dir result/my-unsup-simcse-bert-base-uncased \
    --num_train_epochs 1 \
    --per_device_train_batch_size 64 \
    --learning_rate 3e-5 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --eval_steps 125 \
    --pooler_type cls \
    --mlp_only_train \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --do_eval \
    --fp16 \
    "$@"

opened by ypuzikov 17

No attribute validation_file in train.py?

os.system(f'bash ./data/download_nli.sh')
os.system(
    'cd SimCSE;'
    'python train.py '
    '--model_name_or_path bert-base-uncased'
    '--train_file data/nli_for_simcse.csv '
    '--output_dir result/my-sup-simcse-bert-base-uncased '   
    '--num_train_epochs 3 '
    '--per_device_train_batch_size 128 '
    '--learning_rate 5e-5 '
    '--max_seq_length 32 '
    '--evaluation_strategy steps '
    '--metric_for_best_model stsb_spearman '
    '--load_best_model_at_end '
    '--eval_steps 125 '
    '--pooler_type cls '
    '--overwrite_output_dir '
    '--temp 0.05 '
    '--do_train '
    '--do_eval '
    '--fp16 '
    '"$@"'
)

Traceback (most recent call last):
  File "/Users/sumner/Downloads/Replication/SimCSE/train.py", line 584, in <module>
    main()
  File "/Users/sumner/Downloads/Replication/SimCSE/train.py", line 257, in main
    model_args, data_args, training_args = parser.parse_args_into_dataclasses()
  File "/Users/sumner/miniforge3/lib/python3.9/site-packages/transformers/hf_argparser.py", line 157, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 12, in __init__
  File "/Users/sumner/Downloads/Replication/SimCSE/train.py", line 179, in __post_init__
    if self.dataset_name is None and self.train_file is None and self.validation_file is None:
AttributeError: 'DataTrainingArguments' object has no attribute 'validation_file'
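One detail in the command above may be worth checking, independent of the missing attribute itself: Python joins adjacent string literals with no separator, and the '--model_name_or_path bert-base-uncased' literal has no trailing space. The sketch below reproduces just that part of the command. If --train_file is never parsed as a flag, train_file would stay None, which would explain why the condition on line 179 goes on to evaluate self.validation_file. That last step is an inference from the traceback, not something confirmed in the thread.

```python
# Adjacent string literals are concatenated exactly as written.
command = (
    'python train.py '
    '--model_name_or_path bert-base-uncased'   # no trailing space here
    '--train_file data/nli_for_simcse.csv '
)

# Roughly how a shell would split the command into arguments.
tokens = command.split()
print(tokens)
```

The model name and the next flag fuse into a single token, so no token equal to --train_file reaches the argument parser.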

opened by SumNeuron 9

Cannot reproduce the result~

Hello, and thank you for this useful code! I tried to reproduce the unsupervised BERT+SimCSE results, but failed. My environment setup is as follows:

pytorch=1.7.1
cudatoolkit=11.1
Single RTX 3090

The following script is the training script I used (exactly the same as run_unsup_example.sh).

python train.py \
    --model_name_or_path bert-base-uncased \
    --train_file data/wiki1m_for_simcse.txt \
    --output_dir result/my-unsup-simcse-bert-base-uncased \
    --num_train_epochs 1 \
    --per_device_train_batch_size 64 \
    --learning_rate 3e-5 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --eval_steps 125 \
    --pooler_type cls \
    --mlp_only_train \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --do_eval \
    --fp16 \
    "$@"

However, there is a RuntimeError when training is finished. I obtained the following evaluation results:

+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| STS12 | STS13 | STS14 | STS15 | STS16 | STSBenchmark | SICKRelatedness |  Avg. |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+
| 64.28 | 79.15 | 70.99 | 78.38 | 78.26 |    75.62     |      67.58      | 73.47 |
+-------+-------+-------+-------+-------+--------------+-----------------+-------+

I think the gap (2.8 in average) is too large. Is it because of the error? How to obtain ~76 results in STS tasks?

opened by liuh236 9

Invalid tensor shape

I keep getting the following error at the end of the first epoch: "RuntimeError: Input tensor at index 1 has invalid shape [22, 44], but expected [22, 46]". This happens on a custom dataset. However, the dataset is thoroughly cleaned and should be valid.

The error happens in: comm.py, line 231

Any idea what might be causing this?

opened by peregilk 8

Getting different testing results at different testing times

Thanks for your great work. We want to train a simcse-bert-base model. We tested the trained model twice and got different results with the following script: python evaluation.py --model_name_or_path result/my-unsup-simcse-bert-base-uncased

1st results 2nd results

The python packages

(simcse) H:\contrast\SimCSE-main\SimCSE-main>pip freeze
analytics-python==1.4.0
apex==0.1
backoff==1.10.0
bcrypt==3.2.0
certifi==2021.5.30
cffi==1.14.6
charset-normalizer==2.0.6
click==8.0.1
colorama==0.4.4
cryptography==3.4.8
cycler==0.10.0
datasets==1.4.0
dill==0.3.4
ffmpy==0.3.0
filelock==3.1.0
Flask==2.0.1
Flask-CacheBuster==1.0.0
Flask-Cors==3.0.10
Flask-Login==0.5.0
fsspec==2021.10.0
gradio==2.3.6
huggingface-hub==0.0.2
idna==3.2
importlib-metadata==4.8.1
itsdangerous==2.0.1
Jinja2==3.0.1
joblib==1.0.1
kiwisolver==1.3.2
markdown2==2.4.1
MarkupSafe==2.0.1
matplotlib==3.4.3
monotonic==1.6
multiprocess==0.70.12.2
numpy==1.21.2
packaging==21.0
pandas==1.1.5
paramiko==2.7.2
Pillow==8.3.2
prettytable==2.1.0
pyarrow==5.0.0
pycparser==2.20
pycryptodome==3.10.4
PyNaCl==1.4.0
pyparsing==2.4.7
python-dateutil==2.8.2
pytz==2021.1
PyYAML==5.4.1
regex==2021.9.24
requests==2.26.0
sacremoses==0.0.46
scikit-learn==0.24.0
scipy==1.5.4
six==1.16.0
threadpoolctl==2.2.0
tokenizers==0.9.4
torch==1.9.1+cu102
torchaudio==0.9.1
torchvision==0.10.1+cu102
tqdm==4.49.0
transformers==4.2.1
typing-extensions==3.10.0.2
urllib3==1.26.7
wcwidth==0.2.5
Werkzeug==2.0.1
wincertstore==0.2
xxhash==2.0.2
zipp==3.5.0

opened by marscrazy 8

Why the max_sequence_length is just 32

Hello, I noticed that the max_sequence_length in your code is set to 32, but most sentences in English Wikipedia have more than 32 tokens. Why is the max sequence length 32? Thank you.

opened by leoozy 8

Error when training unsupervised SimCSE

When I run run_unsup_example.sh, an error happens when training is almost finished:

Traceback (most recent call last):
  File "train.py", line 584, in <module>
    main()
  File "train.py", line 548, in main
    train_result = trainer.train(model_path=model_path)
  File "/home/v-nuochen/SimCSE/simcse/trainers.py", line 464, in train
    tr_loss += self.training_step(model, inputs)
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/transformers/trainer.py", line 1248, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/transformers/trainer.py", line 1277, in compute_loss
    outputs = model(**inputs)
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 162, in forward
    return self.gather(outputs, self.output_device)
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 174, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    for k in out))
  File "<string>", line 6, in __init__
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/transformers/file_utils.py", line 1383, in __post_init__
    for element in iterator:
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in <genexpr>
    for k in out))
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 71, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/home/v-nuochen/.local/lib/python3.6/site-packages/torch/nn/parallel/comm.py", line 230, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: Input tensor at index 7 has invalid shape [2, 2], but expected [2, 9]

100%|█████████████████████████████████████████████████████████████████████████████████▉| 1953/1954 [18:36<00:00, 1.75it/s]

Could you please tell me why?
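One possible reading of this error, offered as a sketch rather than a confirmed diagnosis: the traceback passes through torch.nn.DataParallel, which splits each batch across GPUs in the manner of torch.chunk, and each replica then produces an in-batch similarity matrix whose width depends on its own chunk size. If the final batch does not divide evenly, the replicas return matrices of different widths and the gather step cannot concatenate them. The batch sizes 65 and 45 below are hypothetical values chosen because they reproduce the shapes quoted in this and the neighbouring reports; the helper names are made up.

```python
import math

def chunk_sizes(n_items, n_chunks):
    """Sizes produced by splitting n_items the way torch.chunk does:
    chunks of ceil(n_items / n_chunks), with a smaller final chunk."""
    size = math.ceil(n_items / n_chunks)
    sizes = []
    remaining = n_items
    while remaining > 0:
        take = min(size, remaining)
        sizes.append(take)
        remaining -= take
    return sizes

def logits_shapes(n_items, n_gpus, columns_per_row=1):
    """Per-replica similarity matrix shapes: (n, n * columns_per_row).

    columns_per_row=2 models training with one hard negative per example.
    """
    return [(n, n * columns_per_row) for n in chunk_sizes(n_items, n_gpus)]

eight_gpus = logits_shapes(65, 8)         # last replica gets only 2 examples
two_gpus = logits_shapes(65, 2)           # replicas get 33 and 32 examples
two_gpus_hard_neg = logits_shapes(45, 2, columns_per_row=2)
```

Under this reading, replica 7 in the eight-GPU case returns a 2 x 2 matrix while the others return 9 x 9, matching "invalid shape [2, 2], but expected [2, 9]". Running on a single visible GPU, or arranging for the last batch to split evenly, would avoid the mismatch if this is indeed the cause.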

opened by cn-boop 8

Error while computing cosine similarity

Hello! I get the following error when comparing one sentence against many others using similarities = model.similarity(keyword, phrases). The model is loaded with model = SimCSE("princeton-nlp/sup-simcse-bert-base-uncased").

~\AppData\Roaming\Python\Python38\site-packages\simcse\tool.py in similarity(self, queries, keys, device)
    110
    111 # returns an N*M similarity array
--> 112 similarities = cosine_similarity(query_vecs, key_vecs)
    113
    114 if single_query:

~\anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in cosine_similarity(X, Y, dense_output)
   1178 # to avoid recursive import
   1179
-> 1180 X, Y = check_pairwise_arrays(X, Y)
   1181
   1182 X_normalized = normalize(X, copy=True)

~\anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     61 extra_args = len(args) - len(all_args)
     62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
     64
     65 # extra_args > 0

~\anaconda3\lib\site-packages\sklearn\metrics\pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype, accept_sparse, force_all_finite, copy)
    147 copy=copy, force_all_finite=force_all_finite,
    148 estimator=estimator)
--> 149 Y = check_array(Y, accept_sparse=accept_sparse, dtype=dtype,
    150 copy=copy, force_all_finite=force_all_finite,
    151 estimator=estimator)

~\anaconda3\lib\site-packages\sklearn\utils\validation.py in inner_f(*args, **kwargs)
     61 extra_args = len(args) - len(all_args)
     62 if extra_args <= 0:
---> 63 return f(*args, **kwargs)
     64
     65 # extra_args > 0

~\anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
    671 array = array.astype(dtype, casting="unsafe", copy=False)
    672 else:
--> 673 array = np.asarray(array, order=order, dtype=dtype)
    674 except ComplexWarning as complex_warning:
    675 raise ValueError("Complex data not supported\n"

ValueError: could not convert string to float:

opened by gaurav-95 7

The alignment computed with the function implemented by Wang and Isola differs a lot from the paper

The alignment computed with the function implemented by Wang and Isola differs a lot from your paper. I compute the alignment with that function directly and get a score of 1.21, but as shown in Fig. 3 the score in the paper is less than 0.25. Could you tell me how the alignment is computed in this paper? My code is as follows:

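import torch
import torch.nn.functional as F
from transformers import AutoModel

def align_loss(x, y, alpha=2):
    # Mean distance (raised to alpha) between the two embeddings of each pair.
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()

def uniform_loss(x, t=2):
    # Log of the mean Gaussian potential over all pairs of rows in x.
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()

def get_pair_emb(model, input_ids, attention_mask, token_type_ids):
    # NOTE: batch_size is not defined in this snippet; it is presumably set elsewhere.
    outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
    pooler_output = outputs.pooler_output
    pooler_output = pooler_output.view((batch_size, 2, pooler_output.size(-1)))
    z1, z2 = pooler_output[:, 0], pooler_output[:, 1]
    return z1, z2

def get_align(model, dataloader):
    align_all = []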
    with torch.no_grad():        
        for data in dataloader:
            input_ids = torch.cat((data['input_ids'][0],data['input_ids'][1])).cuda()
            attention_mask = torch.cat((data['attention_mask'][0],data['attention_mask'][1])).cuda()
            token_type_ids = torch.cat((data['token_type_ids'][0],data['token_type_ids'][1])).cuda()

            z1,z2 = get_pair_emb(model, input_ids, attention_mask, token_type_ids)        
            z1 = F.normalize(z1,p=2,dim=1)
            z2 = F.normalize(z2,p=2,dim=1)

            align_all.append(align_loss(z1, z2, alpha=2))
            
    return align_all

def get_unif(model, dataloader):
    unif_all = []
    with torch.no_grad():        
        for data in dataloader:
            input_ids = torch.cat((data['input_ids'][0],data['input_ids'][1])).cuda()
            attention_mask = torch.cat((data['attention_mask'][0],data['attention_mask'][1])).cuda()
            token_type_ids = torch.cat((data['token_type_ids'][0],data['token_type_ids'][1])).cuda()

            z1,z2 = get_pair_emb(model, input_ids, attention_mask, token_type_ids)        
            z1 = F.normalize(z1,p=2,dim=1)
            z2 = F.normalize(z2,p=2,dim=1)
            z = torch.cat((z1,z2))
            unif_all.append(uniform_loss(z, t=2))

    return unif_all



model = AutoModel.from_pretrained("princeton-nlp/unsup-simcse-bert-base-uncased")
model = model.cuda()
model_name = "unsup-simcse-bert-base-uncased"

align_all = get_align(model, pos_loader)

align = sum(align_all)/len(align_all)
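One thing that might be worth checking in the code above is how pairs are recovered after concatenation. If data['input_ids'][0] and data['input_ids'][1] hold the first and second sentence of each pair (an assumption; the dataloader is not shown), then concatenating them places all first sentences before all second sentences, and a row-major reshape to (batch_size, 2, hidden) groups consecutive rows rather than matching halves. The sketch below mimics that reshape on labels instead of tensors; reshape_rows is a hypothetical helper.

```python
def reshape_rows(rows, outer, inner):
    """Group a flat list of rows the way a row-major reshape
    to (outer, inner, ...) would."""
    assert len(rows) == outer * inner
    return [rows[o * inner:(o + 1) * inner] for o in range(outer)]

first = ["a0", "a1", "a2"]    # first sentence of each pair
second = ["b0", "b1", "b2"]   # second sentence of each pair
stacked = first + second      # same row order as concatenating the two halves
batch_size = len(first)

# Grouping as (batch_size, 2): consecutive rows end up paired.
groups = reshape_rows(stacked, batch_size, 2)
pairs_consecutive = [(g[0], g[1]) for g in groups]

# Grouping as (2, batch_size): each half stays intact and row i pairs with row i.
halves = reshape_rows(stacked, 2, batch_size)
pairs_by_half = list(zip(halves[0], halves[1]))
```

Here pairs_consecutive mixes unrelated sentences, while pairs_by_half keeps each sentence with its partner. If the loader follows the assumed layout, this mismatch alone could inflate the alignment score.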

opened by xbdxwyh 7

Error when I run unsupervised: RuntimeError: Input tensor at index 1 has invalid shape [32, 32], but expected [32, 33]

  File "train.py", line 591, in <module>
    main()
  File "train.py", line 555, in main
    train_result = trainer.train(model_path=model_path)
  File "/mnt/data/data/home/zhanghaoran/learn_project/SimCSE-main/simcse/trainers.py", line 464, in train
    tr_loss += self.training_step(model, inputs)
  File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/transformers/trainer.py", line 1248, in training_step
    loss = self.compute_loss(model, inputs)
  File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/transformers/trainer.py", line 1277, in compute_loss
    outputs = model(**inputs)
  File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 169, in forward
    return self.gather(outputs, self.output_device)
  File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 181, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 78, in gather
    res = gather_map(outputs)
  File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 69, in gather_map
    return type(out)((k, gather_map([d[k] for d in outputs]))
  File "<string>", line 7, in __init__
  File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/transformers/file_utils.py", line 1383, in __post_init__
    for element in iterator:
  File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 69, in <genexpr>
    return type(out)((k, gather_map([d[k] for d in outputs]))
  File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/torch/nn/parallel/_functions.py", line 75, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/mnt/data/data/home/zhanghaoran/.conda/envs/simcse/lib/python3.8/site-packages/torch/nn/parallel/comm.py", line 235, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: Input tensor at index 1 has invalid shape [32, 32], but expected [32, 33]

opened by xing-ye 6

computing alignment and uniformity

I'm following Wang and Isola to compute alignment and uniformity (using their given code in Fig 5, http://proceedings.mlr.press/v119/wang20k/wang20k.pdf) to reproduce Fig 2 in your paper but fail. What I saw is that the alignment decreases whereas the uniformity is almost unchanged, which is completely different from Fig 2. Details are below.
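For reference, the two quantities under discussion are usually written as follows, where f(x) is the L2-normalized embedding of x, p_pos is the distribution of positive pairs, and p_data is the data distribution. The exponents correspond to alpha = 2 and t = 2, the defaults in Wang and Isola's code.

```latex
\ell_{\text{align}} \triangleq
  \mathbb{E}_{(x, x^{+}) \sim p_{\text{pos}}}
  \left\| f(x) - f(x^{+}) \right\|^{2}

\ell_{\text{uniform}} \triangleq
  \log \mathbb{E}_{x, y \,\overset{\text{i.i.d.}}{\sim}\, p_{\text{data}}}
  e^{-2 \left\| f(x) - f(y) \right\|^{2}}
```

Lower values are better for both: a small alignment loss means positive pairs stay close, and a small uniformity loss means embeddings spread out over the hypersphere.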

To compute alignment and uniformity, I changed lines 66-79 of the file SimCSE/blob/main/SentEval/senteval/sts.py by adding the code from Wang and Isola:

            ...
            input1, input2, gs_scores = self.data[dataset]
            all_enc1 = []
            all_enc2 = []
            for ii in range(0, len(gs_scores), params.batch_size):
                batch1 = input1[ii:ii + params.batch_size]
                batch2 = input2[ii:ii + params.batch_size]

                # we assume get_batch already throws out the faulty ones
                if len(batch1) == len(batch2) and len(batch1) > 0:
                    enc1 = batcher(params, batch1)
                    enc2 = batcher(params, batch2)

                    all_enc1.append(enc1.detach())
                    all_enc2.append(enc2.detach())
                    ...
            
            def _norm(x, eps=1e-8):
                xnorm = torch.linalg.norm(x, dim=-1)
                xnorm = torch.max(xnorm, torch.ones_like(xnorm) * eps)
                return x / xnorm.unsqueeze(dim=-1)

            # from Wang and Isola (with a bit of modification)
            # only consider pairs with gs > 4 (from footnote 3)
            def _lalign(x, y, ok, alpha=2):
                return ((_norm(x) - _norm(y)).norm(dim=1).pow(alpha) * ok).sum() / ok.sum()
            
            def _lunif(x, t=2):
                sq_pdist = torch.pdist(_norm(x), p=2).pow(2)
                return sq_pdist.mul(-t).exp().mean().log()

            ok = (torch.Tensor(gs_scores) > 4).int()
            align = _lalign(
                torch.cat(all_enc1), 
                torch.cat(all_enc2), 
                ok).item()

            # consider all sentences (from footnote 3)
            unif = _lunif(torch.cat(all_enc1 + all_enc2)).item()
            logging.info(f'align {align}\t\t uniform {unif}')

The output (which also shows the Spearman correlation on the STS-B dev set) is

align 0.2672557830810547 uniform -2.5320491790771484 'eval_stsb_spearman': 0.6410360622426501, 'epoch': 0.01
align 0.2519586384296417 uniform -2.629746913909912 'eval_stsb_spearman': 0.6859433315879646, 'epoch': 0.02
align 0.2449202835559845 uniform -2.5870673656463623 'eval_stsb_spearman': 0.7198291431689111, 'epoch': 0.02
align 0.22248655557632446 uniform -2.557053565979004 'eval_stsb_spearman': 0.7538674335025006, 'epoch': 0.03
align 0.22624073922634125 uniform -2.6622540950775146 'eval_stsb_spearman': 0.7739112284380941, 'epoch': 0.04
align 0.22583454847335815 uniform -2.5768041610717773 'eval_stsb_spearman': 0.7459814500897265, 'epoch': 0.05
align 0.22845414280891418 uniform -2.5601420402526855 'eval_stsb_spearman': 0.7683573046863201, 'epoch': 0.06
align 0.22689573466777802 uniform -2.560364007949829 'eval_stsb_spearman': 0.7766837072148098, 'epoch': 0.06
align 0.22807720303535461 uniform -2.5539987087249756 'eval_stsb_spearman': 0.7692866256106997, 'epoch': 0.07
align 0.20026598870754242 uniform -2.50628399848938 'eval_stsb_spearman': 0.7939010002048291, 'epoch': 0.08
align 0.20466476678848267 uniform -2.535121440887451 'eval_stsb_spearman': 0.8011027122797894, 'epoch': 0.09
align 0.2030458152294159 uniform -2.5547776222229004 'eval_stsb_spearman': 0.8044623693996088, 'epoch': 0.1
align 0.20119303464889526 uniform -2.5325350761413574 'eval_stsb_spearman': 0.8070404405714893, 'epoch': 0.1
align 0.19329915940761566 uniform -2.488903522491455 'eval_stsb_spearman': 0.8220311448535872, 'epoch': 0.11
align 0.19556573033332825 uniform -2.5273373126983643 'eval_stsb_spearman': 0.8183500898254208, 'epoch': 0.12
align 0.19112755358219147 uniform -2.4959402084350586 'eval_stsb_spearman': 0.8146496522216178, 'epoch': 0.13
align 0.18491695821285248 uniform -2.4762508869171143 'eval_stsb_spearman': 0.8088527080054781, 'epoch': 0.14
align 0.19815796613693237 uniform -2.5905373096466064 'eval_stsb_spearman': 0.8333401056438776, 'epoch': 0.14
align 0.1950838416814804 uniform -2.4894299507141113 'eval_stsb_spearman': 0.8293951990138778, 'epoch': 0.15
align 0.19777807593345642 uniform -2.5985066890716553 'eval_stsb_spearman': 0.8268435050866446, 'epoch': 0.16
align 0.2016373723745346 uniform -2.616013765335083 'eval_stsb_spearman': 0.8199602019842832, 'epoch': 0.17
align 0.19906719028949738 uniform -2.57528018951416 'eval_stsb_spearman': 0.8094202934650283, 'epoch': 0.18
align 0.18731220066547394 uniform -2.517271041870117 'eval_stsb_spearman': 0.8231122818777513, 'epoch': 0.18
align 0.18802008032798767 uniform -2.508246421813965 'eval_stsb_spearman': 0.8248523275594679, 'epoch': 0.19
align 0.20015984773635864 uniform -2.4563515186309814 'eval_stsb_spearman': 0.8061084765791668, 'epoch': 0.2
align 0.2015877515077591 uniform -2.5121841430664062 'eval_stsb_spearman': 0.8113328705761889, 'epoch': 0.21
align 0.20187602937221527 uniform -2.5167288780212402 'eval_stsb_spearman': 0.8124173161634701, 'epoch': 0.22
align 0.20096932351589203 uniform -2.5201926231384277 'eval_stsb_spearman': 0.8127754107163266, 'epoch': 0.22
align 0.19966433942317963 uniform -2.5182201862335205 'eval_stsb_spearman': 0.8152261579570365, 'epoch': 0.23
align 0.19897222518920898 uniform -2.557129383087158 'eval_stsb_spearman': 0.8169452712415308, 'epoch': 0.24
...

We can see that alignment drops from 0.26 to less than 0.20 whereas uniformity is still around -2.55. It means that reducing alignment is key, not uniformity. This trend is completely different from Fig 2.

Did you also use the code from Wang and Isola like I did? If possible, could you please provide the code for reproducing alignment and uniformity?
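For readers who want to sanity-check the two metrics in isolation, here is a minimal, dependency-free sketch of alignment and uniformity as defined by Wang and Isola. It is not the SimCSE authors' reproduction code and it does not touch SentEval. The function names (`l2_normalize`, `sq_dist`, `alignment`, `uniformity`) and the toy vectors are assumptions made for illustration. Unlike the snippet above, it takes the positive pairs directly instead of masking on gs > 4.

```python
import math
from itertools import combinations


def l2_normalize(v, eps=1e-8):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = max(math.sqrt(sum(c * c for c in v)), eps)
    return [c / norm for c in v]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def alignment(xs, ys, alpha=2):
    """Mean of ||f(x) - f(y)||^alpha over positive pairs (xs[i], ys[i])."""
    xs = [l2_normalize(x) for x in xs]
    ys = [l2_normalize(y) for y in ys]
    return sum(sq_dist(x, y) ** (alpha / 2) for x, y in zip(xs, ys)) / len(xs)


def uniformity(xs, t=2):
    """Log of the mean of exp(-t * ||f(x) - f(y)||^2) over all distinct pairs."""
    xs = [l2_normalize(x) for x in xs]
    potentials = [math.exp(-t * sq_dist(a, b)) for a, b in combinations(xs, 2)]
    return math.log(sum(potentials) / len(potentials))


# Toy check: after normalization the pairs coincide, so alignment is 0.
# Two orthogonal unit vectors have squared distance 2, so with t=2 the
# uniformity is log(exp(-4)), i.e. -4 up to floating-point rounding.
pairs_a = [[1.0, 0.0], [0.0, 2.0]]
pairs_b = [[3.0, 0.0], [0.0, 1.0]]
print(alignment(pairs_a, pairs_b))           # 0.0
print(round(uniformity(pairs_a), 6))         # -4.0
```

Lower is better for both quantities. Because both functions normalize their inputs, only the direction of each embedding matters, not its scale.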

opened by lephong 6

Why one epoch for unsupervised?

Hello,

Thanks for your amazing work on SimCSE!

I was wondering why only one epoch was chosen for the unsupervised setting, while the supervised approach uses 3 epochs.

What is the reason?

Thanks in advance :)

opened by jeongwoopark0514 0

What are the negative samples if the hard negatives are removed? (training with the supervised version)

Hi, I followed your supervised example but removed the hard negatives from the dataset. In this case, where do the negative samples come from? Are the other 'sent0's or 'sent1's in the same batch used as negatives? Could you give me some guidance? Thanks a lot!
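To make the question concrete, here is a minimal sketch of a generic contrastive loss with in-batch negatives: for each 'sent0', the paired 'sent1' is the positive and every other 'sent1' in the batch acts as a negative. This is an illustration of the general technique, not code taken from the repository, and this page does not confirm that the training script behaves exactly this way once hard negatives are removed. The function names and the temperature value of 0.05 are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def in_batch_contrastive_loss(sent0_embs, sent1_embs, temp=0.05):
    """Mean cross-entropy where sent1_embs[i] is the positive for sent0_embs[i]
    and every other sent1_embs[j] (j != i) in the batch is a negative."""
    losses = []
    for i, anchor in enumerate(sent0_embs):
        logits = [cosine(anchor, cand) / temp for cand in sent1_embs]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two pairs (hypothetical 2-d embeddings).
sent0 = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]   # each positive agrees with its anchor
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each positive looks like the other row
print(in_batch_contrastive_loss(sent0, matched) < in_batch_contrastive_loss(sent0, swapped))  # True
```

In this setup a larger batch supplies more negatives per anchor, since every other row in the batch contributes one.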

opened by Rachel-Yeah-Lee 0

Can't Install on Mac M1

Hi,

I can't install SimCSE on my MacBook Pro because the version of scipy used as a dependency can't be installed on M1, I think because it doesn't support ARM. Would it be possible to change the scipy dependency to a newer version that does, or is there another workaround?

Thanks, Shawn

opened by shawnjhenry 2

Releases (0.4)

0.4 (May 12, 2021)

Update pooling methods for unsupervised models (cls -> cls-before-pooler).

Fix a faiss bug.

0.3 (May 11, 2021)

Owner

Princeton Natural Language Processing
