Longformer: The Long-Document Transformer

Overview

Longformer

Longformer and LongformerEncoderDecoder (LED) are pretrained transformer models for long documents.

***** New December 1st, 2020: LongformerEncoderDecoder *****

A LongformerEncoderDecoder (LED) model is now available. It supports seq2seq tasks with long input. With gradient checkpointing, fp16, and a 48GB GPU, the input length can be up to 16K tokens. Check the updated paper for model details and evaluation; a minimal loading sketch follows the list below.

  • Pretrained models: 1) led-base-16384, 2) led-large-16384

  • Requirements: Make sure to use the huggingface/transformers fork specified in requirements.txt. It adds support for gradient checkpointing and allows different maximum sequence lengths for the input and output. You can also run: pip install git+https://github.com/allenai/longformer.git

  • Check the script scripts/summarization.py for an example of how to use the model.
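For quick reference, here is a minimal loading sketch. It assumes a recent huggingface/transformers release that ships the LED classes (LEDTokenizer, LEDForConditionalGeneration) rather than the pinned fork above, so treat it as an illustration, not the official recipe in scripts/summarization.py:

import torch
from transformers import LEDTokenizer, LEDForConditionalGeneration

# assumes a recent huggingface/transformers release with built-in LED support
tokenizer = LEDTokenizer.from_pretrained('allenai/led-base-16384')
model = LEDForConditionalGeneration.from_pretrained('allenai/led-base-16384')

long_document = ' '.join(['Hello world!'] * 4000)  # replace with your long input text
inputs = tokenizer(long_document, return_tensors='pt', truncation=True, max_length=16384)

# put global attention on the first token, as is commonly done for LED summarization
global_attention_mask = torch.zeros_like(inputs['input_ids'])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(inputs['input_ids'],
                             attention_mask=inputs['attention_mask'],
                             global_attention_mask=global_attention_mask,
                             max_length=256)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))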

***** New July 23rd, 2020: Speed degradation *****

A significant speed degradation in huggingface/transformers was recently discovered and fixed (check this PR for details). To avoid this problem, either use the old release v2.11.0 (which doesn't support gradient checkpointing) or use the master branch. This problem should be fixed in the next huggingface/transformers release.

***** New June 29th, 2020: Easier to use Gradient checkpointing *****

Gradient checkpointing has been released with huggingface/transformers release v3.0.0. Gradient checkpointing reduces memory usage by 5x, which makes it possible to process longer sequences on smaller GPUs. To use it, try something like the following:

from transformers import LongformerModel
model = LongformerModel.from_pretrained('allenai/longformer-base-4096', gradient_checkpointing=True)

***** New June 2nd, 2020: Integrating with Huggingface + Train your own long model + Gradient checkpointing *****

  1. Longformer is now integrated into huggingface/transformers release v2.11.0. Now you can do:
from transformers import LongformerModel
model = LongformerModel.from_pretrained("allenai/longformer-base-4096")

The release also includes LongformerForQA and other task-specific LongformerFor* classes, which set global attention automatically (see the sketch after this list).

  2. We added a notebook to show how to convert an existing pretrained model into its "long" version.

  3. Gradient checkpointing has been merged into the huggingface/transformers master branch (check the PR). It can reduce memory usage significantly (5x for longformer-base-4096), allowing longer sequences on smaller GPUs.
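As a rough illustration of those task-specific classes, here is a sketch (assuming the class names in huggingface/transformers; the classification head below is untrained, so the logits are only placeholders). When no global_attention_mask is passed, LongformerForSequenceClassification puts global attention on the <s> token for you:

import torch
from transformers import LongformerTokenizer, LongformerForSequenceClassification

tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')
model = LongformerForSequenceClassification.from_pretrained('allenai/longformer-base-4096')

inputs = tokenizer('A long document ...', return_tensors='pt')
outputs = model(**inputs)  # no global_attention_mask: <s> gets global attention automatically
logits = outputs[0]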

***** New April 27th, 2020: A PyTorch implementation of the sliding window attention *****

We added a PyTorch implementation of the sliding window attention that doesn't require the custom CUDA kernel. It is limited in functionality but more convenient to use for finetuning on downstream tasks.

Advantage: supports CPU, TPU, and fp16, which aren't supported by the custom CUDA kernel.

Limitations: uses 2x more memory (though fp16 offsets that) and doesn't support dilation or autoregressive attention (neither is needed for finetuning).

Therefore, it is suitable for finetuning on downstream tasks but not a good choice for language modeling. The code snippet below and the TriviaQA scripts were updated to use this new implementation.
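To make the idea concrete, here is a toy sketch of sliding window attention (not the repository's implementation, which avoids materializing the full n^2 score matrix): each token attends only to neighbors within +/- w positions.

import torch

def toy_sliding_window_attention(q, k, v, w):
    # q, k, v: (batch, seq_len, dim); w: one-sided window size
    scores = q @ k.transpose(-1, -2) / (q.size(-1) ** 0.5)
    idx = torch.arange(q.size(1))
    band = (idx[None, :] - idx[:, None]).abs() <= w    # True inside the window
    scores = scores.masked_fill(~band, float('-inf'))  # no attention outside the window
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 16, 8)
out = toy_sliding_window_attention(q, k, v, w=2)  # each token sees at most 5 positions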

***** End new information *****

How to use

  1. Download the pretrained model
  2. Install the environment and code

    conda create --name longformer python=3.7
    conda activate longformer
    conda install cudatoolkit=10.0
    pip install git+https://github.com/allenai/longformer.git
  3. Run the model

    import torch
    from longformer.longformer import Longformer, LongformerConfig
    from longformer.sliding_chunks import pad_to_window_size
    from transformers import RobertaTokenizer
    
    config = LongformerConfig.from_pretrained('longformer-base-4096/') 
    # choose the attention mode 'n2', 'tvm' or 'sliding_chunks'
    # 'n2': for regular n2 attention
    # 'tvm': a custom CUDA kernel implementation of our sliding window attention
    # 'sliding_chunks': a PyTorch implementation of our sliding window attention
    config.attention_mode = 'sliding_chunks'
    
    model = Longformer.from_pretrained('longformer-base-4096/', config=config)
    tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
    tokenizer.model_max_length = model.config.max_position_embeddings
    
    SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000)  # long input document
    
    input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1
    
    # TVM code doesn't work on CPU. Uncomment this if `config.attention_mode = 'tvm'`
    # model = model.cuda(); input_ids = input_ids.cuda()
    
    # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention
    attention_mask[:, [1, 4, 21,]] =  2  # Set global attention based on the task. For example,
                                         # classification: the <s> token
                                         # QA: question tokens
    
    # padding seqlen to the nearest multiple of 512. Needed for the 'sliding_chunks' attention
    input_ids, attention_mask = pad_to_window_size(
            input_ids, attention_mask, config.attention_window[0], tokenizer.pad_token_id)
    
    output = model(input_ids, attention_mask=attention_mask)[0]

Model pretraining

This notebook demonstrates our procedure for training Longformer starting from the RoBERTa checkpoint. The same procedure can be followed to get a long version of other existing pretrained models.
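As a rough sketch of the key position-embedding step in that notebook (the notebook also swaps in Longformer self-attention and continues MLM pretraining, which is omitted here; variable names are illustrative):

import torch
from transformers import RobertaModel

model = RobertaModel.from_pretrained('roberta-base')
old_pos = model.embeddings.position_embeddings.weight.data  # (514, 768): 512 positions + 2 special
max_pos = 4096 + 2
new_pos = old_pos.new_empty(max_pos, old_pos.size(1))
new_pos[:2] = old_pos[:2]          # keep the two special positions
k, step = 2, old_pos.size(0) - 2
while k < max_pos:                 # 4096 is a multiple of 512, so the copies tile exactly
    new_pos[k:k + step] = old_pos[2:]
    k += step
model.embeddings.position_embeddings = torch.nn.Embedding.from_pretrained(new_pos, freeze=False)
model.config.max_position_embeddings = max_pos
# depending on the transformers version, the registered position_ids buffer may also need extending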

TriviaQA

  • Training scripts: scripts/triviaqa.py
  • Pretrained large model: here (replicates leaderboard results)
  • Instructions: scripts/cheatsheet.txt

CUDA kernel

Our custom CUDA kernel is implemented in TVM. For now, the kernel only works on GPUs and Linux. We tested it on Ubuntu, Python 3.7, CUDA10, PyTorch >= 1.2.0. If it doesn't work for your environment, please create a new issue.

Compiling the kernel: We already include the compiled binaries of the CUDA kernel, so most users won't need to compile it, but if you are interested, check scripts/cheatsheet.txt for instructions.

Known issues

Please check the repo issues for a list of known issues that we are planning to address soon. If your issue is not discussed, please create a new one.

Citing

If you use Longformer in your research, please cite Longformer: The Long-Document Transformer.

@article{Beltagy2020Longformer,
  title={Longformer: The Long-Document Transformer},
  author={Iz Beltagy and Matthew E. Peters and Arman Cohan},
  journal={arXiv:2004.05150},
  year={2020},
}

Longformer is an open-source project developed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering.

Comments
  • ImportError: cannot import name 'nvcc'


    from tvm.contrib import nvcc
    ImportError: cannot import name 'nvcc'

    I get this when trying to compile the kernel from scratch. Did I miss something in the cmake config? I can import a lot of TVM modules but not nvcc.

    My cuda version is: Cuda compilation tools, release 10.0, V10.0.130

    opened by safooray 33
  • Text Classifier using longformer


    Can we request a short example of using Longformer for long text/review classification? The current TriviaQA example is good, but more examples would encourage further use of Longformer.

    Thanks. Patrick

    opened by pchankh 14
  • RuntimeError: CUDA error: device-side assert triggered - is_global_attn = is_index_global_attn.flatten().any().item()


    I'm trying to train a new model from scratch with a sequence length of 1024 (using the huggingface implementation of Longformer), but I get the following exception at a recently added line:

    --> 150         is_global_attn = is_index_global_attn.flatten().any().item()
        151 
        152         hidden_states = hidden_states.transpose(0, 1)
    
    RuntimeError: CUDA error: device-side assert triggered
    

    I tried Reformer and it worked as expected. The Longformer config is as follows:

    LongformerConfig {
      "attention_probs_dropout_prob": 0.1,
      "attention_window": 64,
      "bos_token_id": 0,
      "eos_token_id": 2,
      "gradient_checkpointing": false,
      "hidden_act": "gelu",
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "initializer_range": 0.02,
      "intermediate_size": 3072,
      "layer_norm_eps": 1e-12,
      "max_position_embeddings": 1026,
      "model_type": "longformer",
      "num_attention_heads": 12,
      "num_hidden_layers": 6,
      "pad_token_id": 257,
      "sep_token_id": 258,
      "type_vocab_size": 2,
      "vocab_size": 261
    }
    

    Any idea what the issue is?

    opened by zarandioon 13
  • segmentation fault illegal instruction


    setup

    Ubuntu 16.04, TVM 0.7.dev1, PyTorch 1.4.0, transformers 2.11.0; everything else as in requirements.txt.

    issue

    I uncommented the line DiagonaledMM._get_function('float32', 'cuda') in diagonaled_mm_tvm.py.

    After that, when I run the code it prints "Loading tvm binary from: ./longformer/lib/lib_diagonaled_mm_float32_cuda.so ..." and then fails with either segmentation fault (core dumped) or illegal instruction (core dumped).

    other

    I tested TVM, TensorFlow, and PyTorch, and they are fine. I also followed scripts/cheatsheet.txt to regenerate lib_diagonaled_mm_float32_cuda.so, and it generated successfully.

    Any idea or suggestion?

    The code is below:

    import torch
    from longformer.longformer import Longformer, LongformerConfig
    from longformer.sliding_chunks import pad_to_window_size
    from transformers import RobertaTokenizer
    
    config = LongformerConfig.from_pretrained('longformer-base-4096/') 
    # choose the attention mode 'n2', 'tvm' or 'sliding_chunks'
    # 'n2': for regular n2 attention
    # 'tvm': a custom CUDA kernel implementation of our sliding window attention
    # 'sliding_chunks': a PyTorch implementation of our sliding window attention
    config.attention_mode = 'tvm'
    
    model = Longformer.from_pretrained('longformer-base-4096/', config=config)
    tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
    tokenizer.model_max_length = model.config.max_position_embeddings
    
    SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000)  # long input document
    
    input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1
    
    # TVM code doesn't work on CPU. Uncomment this if `config.attention_mode = 'tvm'`
    model = model.cuda(); input_ids = input_ids.cuda()
    
    # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
    attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention
    attention_mask[:, [1, 4, 21,]] =  2  # Set global attention based on the task. For example,
                                         # classification: the <s> token
                                         # QA: question tokens
    
    # padding seqlen to the nearest multiple of 512. Needed for the 'sliding_chunks' attention
    input_ids, attention_mask = pad_to_window_size(
            input_ids, attention_mask, config.attention_window[0], tokenizer.pad_token_id)
    
    output = model(input_ids, attention_mask=attention_mask)[0]
    
    opened by ProfXGiter 13
  • Using RoBERTa or LongFormer for texts with 16K tokens


    LongFormer does it by pooling all the local attentions (512) together in global attention (512 x 8 = 4096).

    This is not entirely true. There's no "pooling" of the 4096 tokens into 512. We keep all 4096 tokens. The only change is how attention is computed; instead of every token attending to every other token, we change it such that every token attends to a smaller number of surrounding tokens. This speeds up the self-attention computation (which is the bottleneck) by assuming that the attention score between certain pairs of words is zero. This doesn't change the architecture or introduce any pooling.

    We are working on some code that will make it easy to train your own long model, so you can try longer sequences. We know it is easy to get to 16K or even 32K with the RoBERTa-base architecture (you need the base model, fp16, and gradient checkpointing). For sequences longer than that, you will need to find ways to save memory depending on your application: for example, reducing the window size, reducing the size of the feed-forward layers, implementing reversible transformers, or using sinusoidal position embeddings instead of learned position embeddings.

    Originally posted by @ibeltagy in https://github.com/allenai/longformer/issues/48#issuecomment-634270401

    opened by vr25 10
  • Not able to use the embedding for calculating similarity.


    First of all, let me thank you for contributing this knowledge to us. It makes a lot of difference for beginners like me. :) Now the issue: I was trying to use Longformer to calculate the similarity between a query and a list of paragraphs retrieved from my index search. The idea is to re-rank these paragraphs based on the cosine similarity of the embeddings of the question and the individual paragraphs.

    However, once I have calculated the embedding of both query and paragraph using this code: SAMPLE_TEXT = f'{tokenizer.cls_token}{SAMPLE_TEXT}{tokenizer.eos_token}' ................................... ...................... output = model(input_ids, attention_mask=attention_mask)[0]

    I get an embedding of dimension torch.Size([1, 512, 768]), and when I try to calculate the cosine similarity on these embeddings I get the error: RuntimeError: Can't call numpy() on Variable that requires grad. Use var.detach().numpy() instead.

    I do see that the error recommends me to use var.detach().numpy() instead of numpy(). https://stackoverflow.com/questions/55466298/pytorch-cant-call-numpy-on-variable-that-requires-grad-use-var-detach-num

    However, I am unsure where I should add this line of code. I am a beginner, so please pardon me if I have raised an issue unrelated to Longformer.
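    A minimal sketch of the kind of fix the linked answer suggests (the tensors below are stand-ins for the model outputs, not the actual variables in the issue):

    import torch

    # stand-ins for the (1, seq_len, 768) model outputs for the query and a paragraph
    query_out = torch.randn(1, 512, 768, requires_grad=True)
    para_out = torch.randn(1, 512, 768, requires_grad=True)

    # take the <s> (first-token) embedding and detach it from the autograd graph before scoring
    q = query_out[:, 0, :].detach()
    p = para_out[:, 0, :].detach()
    score = torch.nn.functional.cosine_similarity(q, p).item()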

    Thanks for help :)

    opened by titu1992 10
  • help in understanding task global attention


    Hi,

    I need help understanding the concept below.

    [image omitted]

    So does this mean that the complexity is quadratic (if all tokens attend to all other tokens) for task tuning but linear otherwise?

    Thanks!

    opened by vr25 9
  • Has anyone reproduced TriviaQA result with pytorch-lightning checkpoint?


    Hi, I'm trying to reproduce the TriviaQA result following the instructions in cheatsheet.txt. I used the following instructions from it:

    // To run our pretrained TriviaQA large model (replicates the leaderboard results),
    // first download the pytorch-lightning checkpoint:
    // https://ai2-s2-research.s3-us-west-2.amazonaws.com/longformer/triviaqa-longformer-large.tar.gz
    // then run:
    python -m scripts.triviaqa \
        --train_dataset squad-wikipedia-train-4096.json \  # loaded but not used
        --dev_dataset squad-wikipedia-dev-4096.json \
        --gpus 0 --num_workers 4 \
        --max_seq_len 4096 --doc_stride -1 \
        --save_prefix triviaqa-longformer-large \  # pretrained pytorch-lightning checkpoint
        --model_path path/to/pretrained/longformer-large-4096 \  # loaded but not used
        --test  # predictions will be saved into predictions.json

    // then run the official evaluation scripts
    python -m scripts.triviaqa_utils.evaluation_utils \
        --dataset_file path/to/qa/wikipedia-dev.json \
        --prediction_file predictions.json
    // Output should be:
    // {'exact_match': 73.07644188665083, 'f1': 77.78523804802242, 'common': 7993, 'denominator': 7993, 'pred_len': 7993, 'gold_len': 7993}

    But I keep getting result {'exact_match': 0.025021894157387713, 'f1': 4.579085300341775, 'common': 7993, 'denominator': 7993, 'pred_len': 7993, 'gold_len': 7993}, which is very weird..

    I downloaded the dataset and converted both the train and dev sets into SQuAD format with the provided script, and I just changed the data and model paths to my server's settings.

    Has anyone reproduced the result f1: 77.78 with the given pytorch-lightning checkpoint?

    opened by YJYJLee 9
  • How can I train the pre-train model on chinese corpus?


    I want to train a pretrained model on a Chinese corpus, but the details are not clear to me: for example, how to make the minimal changes necessary to support Longformer's attention mechanism, and how to plug the attention pattern into a pretrained transformer model.

    opened by liangxg787 9
  • Fine-tuning Longformer for squad (out of memory)


    I have pretrained an MLM Longformer using roberta-base based on this recipe.

    Then I tried to fine-tune it for SQuAD question-answering. Here is the trainer, and the following is the run-time setting (based on here):

    python run_squad.py \
        --model_type roberta \
        --model_name_or_path pathe_to_roberta_base_mlm_trained_4096 \
        --do_train \
        --do_eval \
        --do_lower_case \
        --train_file $SQUAD_DIR/train-v1.1.json \
        --predict_file $SQUAD_DIR/dev-v1.1.json \
        --per_gpu_train_batch_size 1 \
        --learning_rate 3e-5 \
        --num_train_epochs 2.0 \
        --max_seq_length 4096 \
        --doc_stride 128 \
        --output_dir /tmp/debug_squad/

    Although I am using a V100 node (16 GPUs, 32 GB each), it always hits the GPU memory limit, as follows:

    File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/parallel_apply.py", line 60, in _worker
    output = module(*input, **kwargs)
    

    File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in call result = self.forward(*input, **kwargs) File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_roberta.py", line 642, in forward output_hidden_states=output_hidden_states, File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in call result = self.forward(*input, **kwargs) File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 762, in forward output_hidden_states=output_hidden_states, File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in call result = self.forward(*input, **kwargs) File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 439, in forward output_attentions, File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in call result = self.forward(*input, **kwargs) File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 371, in forward hidden_states, attention_mask, head_mask, output_attentions=output_attentions, File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in call result = self.forward(*input, **kwargs) File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 315, in forward hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions, File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 550, in call result = self.forward(*input, **kwargs) File "/home/aaaa/.local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 240, in forward attention_scores = attention_scores / math.sqrt(self.attention_head_size) RuntimeError: CUDA out of memory. Tried to allocate 768.00 MiB (GPU 0; 31.72 GiB total capacity; 30.25 GiB already allocated; 300.38 MiB free; 30.29 GiB reserved in total by PyTorch)

    However, using allenai/longformer-base-4096, it works. Could you please comment on what I may be missing in the above steps?

    opened by arashashari 8
  • CUDA error: device-side assert triggered, while converting BERT to Long


    Hi!

    I have apparently working code for converting a BERT model into a Longformer, but now I am trying to convert BERTeus to Longformer, which I expected to work the same way (just changing the dataset and model name/path).

    With a small training corpus (50K lines; the same issue occurs with a big one), the training starts well, but it breaks around step 20, after 3-4 epochs.

    
    2020-09-22 15:01:55.336576: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
    2020-09-22 15:01:55.338202: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
    INFO:__main__:Loading the model from tmp/bert-base-4096
    INFO:transformers.configuration_utils:loading configuration file tmp/bert-base-4096/config.json
    INFO:transformers.configuration_utils:Model config BertConfig {
      "architectures": [
        "BertForMaskedLM"
      ],
      "attention_probs_dropout_prob": 0.1,
      "attention_window": [
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512
      ],
      "gradient_checkpointing": true,
      "hidden_act": "gelu",
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "initializer_range": 0.02,
      "intermediate_size": 3072,
      "layer_norm_eps": 1e-12,
      "max_position_embeddings": 4096,
      "model_type": "bert",
      "num_attention_heads": 12,
      "num_hidden_layers": 12,
      "output_past": true,
      "pad_token_id": 3,
      "type_vocab_size": 2,
      "vocab_size": 50099
    }
    
    INFO:transformers.tokenization_utils_base:Model name 'tmp/bert-base-4096' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). Assuming 'tmp/bert-base-4096' is a path, a model identifier, or url to a directory containing tokenizer files.
    INFO:transformers.tokenization_utils_base:Didn't find file tmp/bert-base-4096/added_tokens.json. We won't load it.
    INFO:transformers.tokenization_utils_base:Didn't find file tmp/bert-base-4096/tokenizer.json. We won't load it.
    INFO:transformers.tokenization_utils_base:loading file tmp/bert-base-4096/vocab.txt
    INFO:transformers.tokenization_utils_base:loading file None
    INFO:transformers.tokenization_utils_base:loading file tmp/bert-base-4096/special_tokens_map.json
    INFO:transformers.tokenization_utils_base:loading file tmp/bert-base-4096/tokenizer_config.json
    INFO:transformers.tokenization_utils_base:loading file None
    /mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_auto.py:798: FutureWarning: The class `AutoModelWithLMHead` is deprecated and will be removed in a future version. Please use `AutoModelForCausalLM` for causal language models, `AutoModelForMaskedLM` for masked language models and `AutoModelForSeq2SeqLM` for encoder-decoder models.
      FutureWarning,
    INFO:transformers.configuration_utils:loading configuration file tmp/bert-base-4096/config.json
    INFO:transformers.configuration_utils:Model config BertConfig {
      "architectures": [
        "BertForMaskedLM"
      ],
      "attention_probs_dropout_prob": 0.1,
      "attention_window": [
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512,
        512
      ],
      "gradient_checkpointing": true,
      "hidden_act": "gelu",
      "hidden_dropout_prob": 0.1,
      "hidden_size": 768,
      "initializer_range": 0.02,
      "intermediate_size": 3072,
      "layer_norm_eps": 1e-12,
      "max_position_embeddings": 4096,
      "model_type": "bert",
      "num_attention_heads": 12,
      "num_hidden_layers": 12,
      "output_past": true,
      "pad_token_id": 3,
      "type_vocab_size": 2,
      "vocab_size": 50099
    }
    
    INFO:transformers.modeling_utils:loading weights file tmp/bert-base-4096/pytorch_model.bin
    WARNING:transformers.modeling_utils:Some weights of the model checkpoint at tmp/bert-base-4096 were not used when initializing BertForMaskedLM: ['bert.encoder.layer.0.attention.self.query_global.weight', 'bert.encoder.layer.0.attention.self.query_global.bias', 'bert.encoder.layer.0.attention.self.key_global.weight', 'bert.encoder.layer.0.attention.self.key_global.bias', 'bert.encoder.layer.0.attention.self.value_global.weight', 'bert.encoder.layer.0.attention.self.value_global.bias', 'bert.encoder.layer.1.attention.self.query_global.weight', 'bert.encoder.layer.1.attention.self.query_global.bias', 'bert.encoder.layer.1.attention.self.key_global.weight', 'bert.encoder.layer.1.attention.self.key_global.bias', 'bert.encoder.layer.1.attention.self.value_global.weight', 'bert.encoder.layer.1.attention.self.value_global.bias', 'bert.encoder.layer.2.attention.self.query_global.weight', 'bert.encoder.layer.2.attention.self.query_global.bias', 'bert.encoder.layer.2.attention.self.key_global.weight', 'bert.encoder.layer.2.attention.self.key_global.bias', 'bert.encoder.layer.2.attention.self.value_global.weight', 'bert.encoder.layer.2.attention.self.value_global.bias', 'bert.encoder.layer.3.attention.self.query_global.weight', 'bert.encoder.layer.3.attention.self.query_global.bias', 'bert.encoder.layer.3.attention.self.key_global.weight', 'bert.encoder.layer.3.attention.self.key_global.bias', 'bert.encoder.layer.3.attention.self.value_global.weight', 'bert.encoder.layer.3.attention.self.value_global.bias', 'bert.encoder.layer.4.attention.self.query_global.weight', 'bert.encoder.layer.4.attention.self.query_global.bias', 'bert.encoder.layer.4.attention.self.key_global.weight', 'bert.encoder.layer.4.attention.self.key_global.bias', 'bert.encoder.layer.4.attention.self.value_global.weight', 'bert.encoder.layer.4.attention.self.value_global.bias', 'bert.encoder.layer.5.attention.self.query_global.weight', 'bert.encoder.layer.5.attention.self.query_global.bias', 'bert.encoder.layer.5.attention.self.key_global.weight', 'bert.encoder.layer.5.attention.self.key_global.bias', 'bert.encoder.layer.5.attention.self.value_global.weight', 'bert.encoder.layer.5.attention.self.value_global.bias', 'bert.encoder.layer.6.attention.self.query_global.weight', 'bert.encoder.layer.6.attention.self.query_global.bias', 'bert.encoder.layer.6.attention.self.key_global.weight', 'bert.encoder.layer.6.attention.self.key_global.bias', 'bert.encoder.layer.6.attention.self.value_global.weight', 'bert.encoder.layer.6.attention.self.value_global.bias', 'bert.encoder.layer.7.attention.self.query_global.weight', 'bert.encoder.layer.7.attention.self.query_global.bias', 'bert.encoder.layer.7.attention.self.key_global.weight', 'bert.encoder.layer.7.attention.self.key_global.bias', 'bert.encoder.layer.7.attention.self.value_global.weight', 'bert.encoder.layer.7.attention.self.value_global.bias', 'bert.encoder.layer.8.attention.self.query_global.weight', 'bert.encoder.layer.8.attention.self.query_global.bias', 'bert.encoder.layer.8.attention.self.key_global.weight', 'bert.encoder.layer.8.attention.self.key_global.bias', 'bert.encoder.layer.8.attention.self.value_global.weight', 'bert.encoder.layer.8.attention.self.value_global.bias', 'bert.encoder.layer.9.attention.self.query_global.weight', 'bert.encoder.layer.9.attention.self.query_global.bias', 'bert.encoder.layer.9.attention.self.key_global.weight', 'bert.encoder.layer.9.attention.self.key_global.bias', 'bert.encoder.layer.9.attention.self.value_global.weight', 
'bert.encoder.layer.9.attention.self.value_global.bias', 'bert.encoder.layer.10.attention.self.query_global.weight', 'bert.encoder.layer.10.attention.self.query_global.bias', 'bert.encoder.layer.10.attention.self.key_global.weight', 'bert.encoder.layer.10.attention.self.key_global.bias', 'bert.encoder.layer.10.attention.self.value_global.weight', 'bert.encoder.layer.10.attention.self.value_global.bias', 'bert.encoder.layer.11.attention.self.query_global.weight', 'bert.encoder.layer.11.attention.self.query_global.bias', 'bert.encoder.layer.11.attention.self.key_global.weight', 'bert.encoder.layer.11.attention.self.key_global.bias', 'bert.encoder.layer.11.attention.self.value_global.weight', 'bert.encoder.layer.11.attention.self.value_global.bias']
    - This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
    - This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
    INFO:transformers.modeling_utils:All the weights of BertForMaskedLM were initialized from the model checkpoint at tmp/bert-base-4096.
    If your task is similar to the task the model of the ckeckpoint was trained on, you can already use BertForMaskedLM for predictions without further training.
    INFO:__main__:Pretraining bert-base-4096 ... 
    INFO:filelock:Lock 140392820589624 acquired on cached_lm_BertTokenizerFast_4094_valEusLong.txt.lock
    INFO:transformers.data.datasets.language_modeling:Loading features from cached file cached_lm_BertTokenizerFast_4094_valEusLong.txt [took 0.008 s]
    INFO:filelock:Lock 140392820589624 released on cached_lm_BertTokenizerFast_4094_valEusLong.txt.lock
    INFO:__main__:Loading and tokenizing training data is usually slow: trainEusLong1.txt
    INFO:filelock:Lock 140392820589456 acquired on cached_lm_BertTokenizerFast_4094_trainEusLong1.txt.lock
    INFO:transformers.data.datasets.language_modeling:Loading features from cached file cached_lm_BertTokenizerFast_4094_trainEusLong1.txt [took 0.053 s]
    INFO:filelock:Lock 140392820589456 released on cached_lm_BertTokenizerFast_4094_trainEusLong1.txt.lock
    INFO:transformers.training_args:PyTorch: setting up devices
    INFO:transformers.trainer:You are instantiating a Trainer but W&B is not installed. To use wandb logging, run `pip install wandb; wandb login` see https://docs.wandb.com/huggingface.
    INFO:transformers.trainer:***** Running Evaluation *****
    INFO:transformers.trainer:  Num examples = 70
    INFO:transformers.trainer:  Batch size = 1
    Evaluation:   0%|                                                                                                                                                 | 0/70 [00:00<?, ?it/s]/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/utils/checkpoint.py:25: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
      warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")
    Evaluation: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 70/70 [00:21<00:00,  3.22it/s]
    INFO:transformers.trainer:{'eval_loss': 12.326190962110246, 'step': 0}
    INFO:__main__:Initial eval bpc: 17.782934574086813
    INFO:transformers.trainer:***** Running training *****
    INFO:transformers.trainer:  Num examples = 388
    INFO:transformers.trainer:  Num Epochs = 501
    INFO:transformers.trainer:  Instantaneous batch size per device = 1
    INFO:transformers.trainer:  Total train batch size (w. parallel, distributed & accumulation) = 64
    INFO:transformers.trainer:  Gradient Accumulation steps = 64
    INFO:transformers.trainer:  Total optimization steps = 3000
    INFO:transformers.trainer:  Starting fine-tuning.
    Epoch:   0%|                                                                                                                                                     | 0/501 [00:00<?, ?it/sINFO:transformers.trainer:{'loss': 12.102866038680077, 'learning_rate': 6.000000000000001e-08, 'epoch': 0.16494845360824742, 'step': 1}                  | 63/388 [01:18<06:51,  1.27s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-1
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-1/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-1/pytorch_model.bin
    INFO:transformers.trainer:{'loss': 12.099215269088745, 'learning_rate': 1.2000000000000002e-07, 'epoch': 0.32989690721649484, 'step': 2}                                 | 127/388 [02:50<05:35,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-2
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-2/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-2/pytorch_model.bin
    INFO:transformers.trainer:{'loss': 12.078452616930008, 'learning_rate': 1.8e-07, 'epoch': 0.4948453608247423, 'step': 3}                                                 | 191/388 [04:24<04:14,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-3
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-3/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-3/pytorch_model.bin
    INFO:transformers.trainer:{'loss': 12.023080185055733, 'learning_rate': 2.4000000000000003e-07, 'epoch': 0.6597938144329897, 'step': 4}                                  | 255/388 [05:56<02:50,  1.28s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-4
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-4/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-4/pytorch_model.bin
    INFO:transformers.trainer:{'loss': 12.003526121377945, 'learning_rate': 3.0000000000000004e-07, 'epoch': 0.8247422680412371, 'step': 5}█████████▉                        | 319/388 [07:29<01:28,  1.29s/it]INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-5
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-5/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-5/pytorch_model.bin
    INFO:transformers.trainer:{'loss': 11.993770495057106, 'learning_rate': 3.6e-07, 'epoch': 0.9896907216494846, 'step': 6}███████████████████████████████████████████████▎ | 383/388 [09:01<00:06,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-6
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-6/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-6/pytorch_model.bin
    Iteration: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 388/388 [09:18<00:00,  1.44s/it]
    Epoch:   0%|▎                                                                                                                                        | 1/501 [09:18<77:36:08, 558.74s/it]                 INFO:transformers.trainer:{'loss': 12.672470852732658, 'learning_rate': 4.2e-07, 'epoch': 1.1649484536082475, 'step': 7}                                                   | 63/388 [01:20<06:58,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-7
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-7/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-7/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-8
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-8/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-8/pytorch_model.bin
    
    Iteration:  36%|███████████████████████████████████████████████████████▏                                                                                                 | 140/388 [03:21<05:27,  1.32s/iItINFO:transformers.trainer:{'loss': 11.813278079032898, 'learning_rate': 5.4e-07, 'epoch': 1.4948453608247423, 'step': 9}                                                  | 191/388 [04:27<04:15,  1.30s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-9
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-9/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-9/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-10
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-10/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-10/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-11
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-11/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-11/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-12
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-12/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-12/pytorch_model.bin
    Iteration: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 388/388 [09:24<00:00,  1.45s/it]
    Epoch:   0%|▌                                                                                                                                        | 2/501 [18:43<77:40:49, 560.42s/it]<00:00,  2.07s/it]INFO:transformers.trainer:{'loss': 12.117324143648148, 'learning_rate': 7.799999999999999e-07, 'epoch': 2.1649484536082473, 'step': 13}                                     | 63/388 [01:20<06:59,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-13
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-13/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-13/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-14
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-14/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-14/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-15
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-15/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-15/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-16
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-16/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-16/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-17
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-17/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-17/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-18
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-18/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-18/pytorch_model.bin
    Iteration: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 388/388 [09:24<00:00,  1.45s/it]
    Epoch:   1%|▊                                                                                                                                        | 3/501 [28:07<77:40:37, 561.52s/it]4<00:00,  2.07s/itINFO:transformers.trainer:{'loss': 11.206573352217674, 'learning_rate': 1.14e-06, 'epoch': 3.1649484536082473, 'step': 19}                                                  | 63/388 [01:20<06:58,  1.29s/it]
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-19
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-19/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-19/pytorch_model.bin
    INFO:transformers.trainer:Saving model checkpoint to tmp/checkpoint-20
    INFO:transformers.configuration_utils:Configuration saved in tmp/checkpoint-20/config.json
    INFO:transformers.modeling_utils:Model weights saved in tmp/checkpoint-20/pytorch_model.bin
    
    /pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [467,0,0], thread: [93,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    /pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [467,0,0], thread: [94,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    /pytorch/aten/src/THC/THCTensorIndex.cu:361: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDim = 2, SrcDim = 2, IdxDim = -2, IndexIsMajor = true]: block: [467,0,0], thread: [95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
    Iteration:  39%|████████████████████████████████████████████████████████████▋                                                                                             | 153/388 [03:38<05:35,  1.43s/it]
    Epoch:   1%|▊                                                                                                                                        | 3/501 [31:45<87:51:44, 635.15s/it]
    Traceback (most recent call last):
      File "BERTeus2LongB.py", line 305, in <module>
        pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)
      File "BERTeus2LongB.py", line 183, in pretrain_and_evaluate
        trainer.train(model_path=model_path)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/trainer.py", line 499, in train
        tr_loss += self._training_step(model, inputs, optimizer)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/trainer.py", line 622, in _training_step
        outputs = model(**inputs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 1083, in forward
        output_hidden_states=output_hidden_states,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 753, in forward
        input_ids=input_ids, position_ids=position_ids, token_type_ids=token_type_ids, inputs_embeds=inputs_embeds
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 182, in forward
        embeddings = inputs_embeds + position_embeddings + token_type_embeddings
    RuntimeError: CUDA error: device-side assert triggered
    

    The same run with:

    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    ...
    Epoch:   1%|▉                                                                                                                                                          | 3/501 [30:52<85:25:53, 617.58s/it]
    Traceback (most recent call last):
      File "BERTeus2LongB.py", line 305, in <module>
        pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)
      File "BERTeus2LongB.py", line 183, in pretrain_and_evaluate
        trainer.train(model_path=model_path)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/trainer.py", line 499, in train
        tr_loss += self._training_step(model, inputs, optimizer)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/trainer.py", line 622, in _training_step
        outputs = model(**inputs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 1083, in forward
        output_hidden_states=output_hidden_states,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 762, in forward
        output_hidden_states=output_hidden_states,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 430, in forward
        encoder_attention_mask,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 155, in checkpoint
        return CheckpointFunction.apply(function, preserve, *args)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/utils/checkpoint.py", line 74, in forward
        outputs = run_function(*args)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 420, in custom_forward
        return module(*inputs, output_attentions)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 371, in forward
        hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 315, in forward
        hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions,
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
        result = self.forward(*input, **kwargs)
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/transformers/modeling_bert.py", line 243, in forward
        attention_scores = attention_scores + attention_mask
    RuntimeError: CUDA error: device-side assert triggered
    (transformers) [email protected]:/mnt/datuak/gorka-tmp$ python BERTeus2LongB.py
    

    Any hint what causes this error?

    By the way, I also sometimes got this error, which I am not able to reproduce right now:

     File "BERTeus2LongB.py", line 305, in <module>
        pretrain_and_evaluate(training_args, model, tokenizer, eval_only=False, model_path=training_args.output_dir)
      ...
      File "/mnt/datuak/virtualenvs/transformers/lib/python3.6/site-packages/torch/nn/functional.py", line 1372, in linear
        output = input.matmul(weight.t())
    RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
    

    Regards, Gorka

    opened by GorkaUrbizu 7
  • Number of tokens per batch mismatch - longformer vs roberta


    I see in your conversion notebook that you suggest that the number of tokens per batch should be the same as roberta: 2^18 = 260k

    When I look at the roberta paper, it says it uses a sequence length of 512 and a batch size of 8k. This means that each batch has 512*8k = 4M tokens

    Am I missing something?

    opened by nbroad1881 1
  • Answering performance of Longformer-base on the HotpotQA dev set


    Hi,

    I only found Longformer-base's joint F1 on the HotpotQA dev set in the paper, and I would like to know if my reproduction results (Ans EM = 61.38, Ans F1 = 75.18) are expected. Could you provide some more specific metrics?

    Thank you!

    opened by zycdev 0
  • CVE-2007-4559 Patch


    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have begun a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15-year-old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsanitized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks that all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.
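    A minimal sketch of the kind of check described above (not the exact patch from the pull request):

    import os
    import tarfile

    def safe_extractall(tar: tarfile.TarFile, path: str = '.') -> None:
        """Refuse to extract members that would escape the target directory."""
        base = os.path.realpath(path)
        for member in tar.getmembers():
            target = os.path.realpath(os.path.join(path, member.name))
            if os.path.commonpath([base, target]) != base:
                raise RuntimeError(f'blocked path traversal in tar member: {member.name}')
        tar.extractall(path)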

    If you have further questions you may contact us through this project's lead researcher, Kasimir Schulz.

    opened by TrellixVulnTeam 0
  • Updated BART to Longformer-encoder-decoder (LED) converter


    Hi @ibeltagy et al., I'm pre-training BART to Portuguese and converting the pre-trained model to LED following the instructions you gave in the paper and the code at https://github.com/allenai/longformer/blob/caefee668e39cacdece7dd603a0bebf24df6d8ca/scripts/convert_bart_to_longformerencoderdecoder.py.

    The huggingface library is evolving fast; unfortunately, the code you provided is outdated and I had to implement a new version based on yours.

    I have 2 questions:

    1. Could you tell me if everything is ok or if I missed something? https://gist.github.com/erichans/af745a381b28b1c019f96997ddac4cd7
    2. Is the LEDForConditionalGeneration model uploaded to huggingface just a BART model converted to LED or is there something else?

    Thanks in advance!

    opened by erichans 0
  • Why the TVM implementation is memory efficient


    Thanks for your excellent work!

    I just want to discuss the memory reduction. It seems that the TVM implementation does not store fewer matrices (such as the query, key, and value matrices). The number of Q-K pairs is smaller than in full attention, so we get faster computation, but why does the memory reduction follow a similar trend to the time reduction? It seems the TVM kernel does not use any special technique to save memory, and the padded zero values are also int32, yet the TVM implementation is memory efficient...

    Looking forward to your reply.

    opened by jlidw 0
  • Pretraining longformer for NER on big pdf text


    Hi, I'm trying to extract entities from documents containing 50-60 pages each. Can anybody suggest a better approach for this, please? I couldn't find any NER implementation of Longformer.

    opened by ajaysurya1221 0
Releases: v0.2