:house_with_garden: Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.

Overview

FARM LOGO

(Framework for Adapting Representation Models)

Docs Build Release License Last Commit Downloads

What is it?

FARM makes Transfer Learning with BERT & Co simple, fast and enterprise-ready. It's built upon transformers and provides additional features to simplify the life of developers: Parallelized preprocessing, highly modular design, multi-task learning, experiment tracking, easy debugging and close integration with AWS SageMaker.

With FARM you can build fast proof-of-concepts for tasks like text classification, NER or question answering and transfer them easily into production.

Core features

  • Easy fine-tuning of language models to your task and domain language
  • Speed: AMP optimizers (~35% faster) and parallel preprocessing (16 CPU cores => ~16x faster)
  • Modular design of language models and prediction heads
  • Switch between heads or combine them for multitask learning
  • Full Compatibility with HuggingFace Transformers' models and model hub
  • Smooth upgrading to newer language models
  • Integration of custom datasets via Processor class
  • Powerful experiment tracking & execution
  • Checkpointing & Caching to resume training and reduce costs with spot instances
  • Simple deployment and visualization to showcase your model
| Task | BERT | RoBERTa* | XLNet | ALBERT | DistilBERT | XLMRoBERTa | ELECTRA | MiniLM |
|------|------|----------|-------|--------|------------|------------|---------|--------|
| Text classification | x | x | x | x | x | x | x | x |
| NER | x | x | x | x | x | x | x | x |
| Question Answering | x | x | x | x | x | x | x | x |
| Language Model Fine-tuning | x |  |  |  |  |  |  |  |
| Text Regression | x | x | x | x | x | x | x | x |
| Multilabel Text classif. | x | x | x | x | x | x | x | x |
| Extracting embeddings | x | x | x | x | x | x | x | x |
| LM from scratch | x |  |  |  |  |  |  |  |
| Text Pair Classification | x | x | x | x | x | x | x | x |
| Passage Ranking | x | x | x | x | x | x | x | x |
| Document retrieval (DPR) | x | x |  | x | x | x | x | x |

* including CamemBERT and UmBERTo

**NEW** Interested in doing Question Answering at scale? Check out Haystack!

Resources

Docs

Online documentation

Tutorials

Demo

Check out https://demos.deepset.ai to play around with some models

More

Installation

Recommended (because of active development):

git clone https://github.com/deepset-ai/FARM.git
cd FARM
pip install -r requirements.txt
pip install --editable .

If problems occur, please do a git pull. The --editable flag ensures that local code changes take effect immediately.

From PyPi:

pip install farm

Note: On Windows you might need pip install farm -f https://download.pytorch.org/whl/torch_stable.html to install PyTorch correctly.

Basic Usage

1. Train a downstream model

FARM offers two modes for model training:

Option 1: Run experiment(s) from config

https://raw.githubusercontent.com/deepset-ai/FARM/master/docs/img/code_snippet_experiment.png

Use cases: Training your first model, hyperparameter optimization, evaluating a language model on multiple down-stream tasks.
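
In code, running experiments from such a config boils down to a few lines. A minimal sketch, assuming the load_experiments / run_experiment helpers from the repository's experiment scripts (the config path is illustrative):

from pathlib import Path
from farm.experiment import load_experiments, run_experiment

# Load one or more experiment definitions from a JSON config file (path is illustrative)
experiments = load_experiments(Path("experiments/text_classification/my_experiment_config.json"))
for experiment in experiments:
    run_experiment(experiment)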

Option 2: Stick together your own building blocks

https://raw.githubusercontent.com/deepset-ai/FARM/master/docs/img/code_snippet_building_blocks.png

Use cases: Custom datasets, language models, prediction heads ...
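
For orientation, here is a condensed sketch of the building-block flow shown in the image above. File names, labels and hyperparameters are illustrative; see the doc_classification example in the repository for a complete, up-to-date script:

from pathlib import Path
from farm.data_handler.data_silo import DataSilo
from farm.data_handler.processor import TextClassificationProcessor
from farm.modeling.adaptive_model import AdaptiveModel
from farm.modeling.language_model import LanguageModel
from farm.modeling.optimization import initialize_optimizer
from farm.modeling.prediction_head import TextClassificationHead
from farm.modeling.tokenization import Tokenizer
from farm.train import Trainer
from farm.utils import initialize_device_settings

device, n_gpu = initialize_device_settings(use_cuda=True)

# 1. Tokenizer & Processor: convert raw files (here a TSV with "text" and "coarse_label" columns) into PyTorch datasets
tokenizer = Tokenizer.load(pretrained_model_name_or_path="bert-base-german-cased", do_lower_case=False)
processor = TextClassificationProcessor(tokenizer=tokenizer,
                                        max_seq_len=128,
                                        data_dir=Path("data/germeval18"),
                                        label_list=["OTHER", "OFFENSE"],
                                        metric="f1_macro",
                                        label_column_name="coarse_label")

# 2. DataSilo: loads train/dev/test sets and provides DataLoaders
data_silo = DataSilo(processor=processor, batch_size=32)

# 3. AdaptiveModel: language model + prediction head(s)
language_model = LanguageModel.load("bert-base-german-cased")
prediction_head = TextClassificationHead(num_labels=2)  # 2 classes: OTHER, OFFENSE
model = AdaptiveModel(language_model=language_model,
                      prediction_heads=[prediction_head],
                      embeds_dropout_prob=0.1,
                      lm_output_types=["per_sequence"],
                      device=device)

# 4. Optimizer, LR schedule & Trainer
model, optimizer, lr_schedule = initialize_optimizer(model=model,
                                                     learning_rate=2e-5,
                                                     device=device,
                                                     n_batches=len(data_silo.loaders["train"]),
                                                     n_epochs=1)
trainer = Trainer(model=model, optimizer=optimizer, data_silo=data_silo, epochs=1,
                  n_gpu=n_gpu, lr_schedule=lr_schedule, evaluate_every=100, device=device)
trainer.train()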

Metrics and parameters of your model training get automatically logged via MLflow. We provide a public MLflow server for testing and learning purposes. Check it out to see your own experiment results! Just be aware: We will start deleting all experiments on a regular schedule to ensure decent server performance for everybody!

2. Run Inference

Use a public model or your own to get predictions:

https://raw.githubusercontent.com/deepset-ai/FARM/master/docs/img/code_snippet_inference.png
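
A minimal sketch of the inference API (the model path is a placeholder for your own saved model or any public model name from the hub):

from farm.infer import Inferencer

# Placeholder path: any saved FARM model directory or public model name works here
model = Inferencer.load("saved_models/bert-german-doc-tutorial", task_type="text_classification")
basic_texts = [{"text": "Martin Müller spielt Handball in Berlin"}]
result = model.inference_from_dicts(dicts=basic_texts)
print(result)
model.close_multiprocessing_pool()  # release the preprocessing worker processes when you are done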

3. Showcase your models (API + UI)

FARM Inference UI

One docker container exposes a REST API (localhost:5000) and another one runs a simple demo UI (localhost:3000). You can use both of them individually and mount your own models. Check out the docs for details.

Advanced Usage

Once you have gotten started with FARM, there are plenty of options to customize your pipeline and boost your models. Let's highlight a few of them ...

1. Optimizers & Learning rate schedules

While FARM provides decent defaults for both, you can easily configure many other optimizers & LR schedules:

  • any optimizer from PyTorch, Apex or Transformers
  • any learning rate schedule from PyTorch or Transformers

You can configure them by passing a dict to initialize_optimizer() (see example).
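
A minimal sketch of that dict-based configuration, assuming the model, data_silo, n_epochs and device objects from the training example above (optimizer and schedule are looked up by name in PyTorch, Apex or Transformers):

from farm.modeling.optimization import initialize_optimizer

model, optimizer, lr_schedule = initialize_optimizer(
    model=model,
    learning_rate=2e-5,
    n_batches=len(data_silo.loaders["train"]),
    n_epochs=n_epochs,
    device=device,
    optimizer_opts={"name": "AdamW", "weight_decay": 0.01},                   # resolved via PyTorch here
    schedule_opts={"name": "LinearWarmup", "warmup_proportion": 0.1},         # linear warmup + decay
)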

2. Early Stopping

With early stopping, training stops once a chosen metric is no longer improving, and the best model up to that point is kept. This helps prevent overfitting on small datasets and reduces training time if your model doesn't improve any further (see example).
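
A minimal sketch, assuming the EarlyStopping API from the linked example (model, optimizer, data_silo etc. as in the training example above):

from pathlib import Path
from farm.train import EarlyStopping, Trainer

# Stop if the chosen metric has not improved for 5 evaluations and keep the best model
earlystopping = EarlyStopping(
    metric="loss", mode="min",            # metric reported during evaluation on the dev set
    save_dir=Path("saved_models/best"),   # the best model is saved here
    patience=5,
)

trainer = Trainer(
    model=model, optimizer=optimizer, data_silo=data_silo,
    epochs=n_epochs, n_gpu=n_gpu, lr_schedule=lr_schedule, device=device,
    evaluate_every=100, early_stopping=earlystopping,
)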

3. Imbalanced classes

If you do classification on imbalanced classes, consider using class weights. They change the loss function to down-weight frequent classes. You can set them when you init a prediction head:

prediction_head = TextClassificationHead(
    class_weights=data_silo.calculate_class_weights(task_name="text_classification"),
    num_labels=len(label_list))

4. Cross Validation

Get more reliable evaluation metrics on small datasets (see example).
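
A minimal sketch, assuming the DataSiloForCrossVal helper used in the linked example; how you train and aggregate per fold is up to you:

from farm.data_handler.data_silo import DataSiloForCrossVal

# Derive k train/dev/test splits from an existing DataSilo (here: 5 folds)
silos = DataSiloForCrossVal.make(data_silo, n_splits=5)
for silo in silos:
    # train one model per fold on `silo`, evaluate it,
    # then average the metrics over all folds
    ...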

5. Caching & Checkpointing

Save time if you run similar pipelines (e.g. only experimenting with model params): Store your preprocessed dataset & load it next time from cache:

data_silo = DataSilo(processor=processor, batch_size=batch_size, caching=True)

Start & stop training by saving checkpoints of the trainer:

trainer = Trainer.create_or_load_checkpoint(
            ...
            checkpoint_on_sigterm=True,
            checkpoint_every=200,
            checkpoint_root_dir=Path("/opt/ml/checkpoints/training"),
            resume_from_checkpoint="latest")

The checkpoints include the state of everything that matters (model, optimizer, lr_schedule ...) to resume training. This is particularly useful if your training crashes (e.g. because you are using spot cloud instances). You can either save checkpoints every X steps or when a SIGTERM signal is received.

6. Training on AWS SageMaker (incl. Spot Instances)

We are currently working a lot on simplifying large scale training and deployment. As a first step, we are adding support for training on AWS SageMaker. The interesting part here is the option to use Managed Spot Instances and save about 70% on costs compared to the regular EC2 instances. This is particularly relevant for training models from scratch, which we introduce in a basic version in this release and will improve over the next weeks. See this tutorial to get started with using SageMaker for training on down-stream tasks.

Core concepts

Model

AdaptiveModel = Language Model + Prediction Head(s). With this modular approach you can easily add prediction heads (multitask learning) and re-use them for different types of language models. (Learn more)

https://raw.githubusercontent.com/deepset-ai/FARM/master/docs/img/adaptive_model_no_bg_small.jpg
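
A minimal sketch of that modularity, with one language model shared by two prediction heads (the task names are illustrative and must match the tasks registered in your Processor):

import torch
from farm.modeling.adaptive_model import AdaptiveModel
from farm.modeling.language_model import LanguageModel
from farm.modeling.prediction_head import TextClassificationHead, TokenClassificationHead

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

language_model = LanguageModel.load("bert-base-cased")

# Two heads sharing one language model (multitask learning)
doc_head = TextClassificationHead(num_labels=3, task_name="topic_classification")
ner_head = TokenClassificationHead(num_labels=9, task_name="ner")

model = AdaptiveModel(language_model=language_model,
                      prediction_heads=[doc_head, ner_head],
                      embeds_dropout_prob=0.1,
                      lm_output_types=["per_sequence", "per_token"],  # one output type per head
                      device=device)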

Data Processing

Custom Datasets can be loaded by customizing the Processor. It converts "raw data" into PyTorch Datasets. Much of the heavy lifting is then handled behind the scenes to make it fast & simple to debug. (Learn more)

https://raw.githubusercontent.com/deepset-ai/FARM/master/docs/img/data_silo_no_bg_small.jpg
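
A minimal sketch of the two central Processor steps, useful for debugging your own data before wrapping everything in a DataSilo (processor can be any FARM Processor, e.g. the TextClassificationProcessor from the training sketch above):

# Step 1: raw file -> list of dicts (one dict per example)
dicts = processor.file_to_dicts("train.tsv")

# Step 2: dicts -> PyTorch dataset (plus tensor names and, since v0.6.0, ids of samples that failed preprocessing)
dataset, tensor_names, problematic_sample_ids = processor.dataset_from_dicts(dicts=dicts)

# In a normal pipeline the DataSilo performs these steps for you, in parallel and with optional caching
from farm.data_handler.data_silo import DataSilo
data_silo = DataSilo(processor=processor, batch_size=32)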

Inference Time Benchmarks

FARM has a configurable test suite for benchmarking inference times with combinations of inference engines (PyTorch, ONNX Runtime), batch size, document length, maximum sequence length, and other parameters. Here is a benchmark for Question Answering inference with the current FARM version.

FAQ

1. What language model should I use for non-English NLP? If you're working with German, French, Chinese, Japanese or Finnish you might be interested in trying out the pretrained BERT models in your language. You can see a list of the available models hosted by our friends at HuggingFace here, all of which can be accessed directly through FARM. If your language isn't one of those (or even if it is), we'd encourage you to try out XLM-RoBERTa (https://arxiv.org/pdf/1911.02116.pdf), which supports 100 different languages and shows surprisingly strong performance compared to single-language models.

2. Why do you have separate prediction heads? PredictionHeads are needed to adapt the general language understanding capabilities of the language model to a specific task. For example, the predictions of NER and document classification require very different output formats. Having separate PredictionHead classes means that a) it is very easy to re-use prediction heads on top of different language models and b) it simplifies multitask learning. The latter allows you e.g. to add proxy tasks that facilitate learning of your "true objective". Example: You want to classify documents into classes and know that some document tags (e.g. author) already provide helpful information for this task. It might help to add additional tasks for classifying these meta tags.

3. When is adaptation of a language model to a domain corpus useful? Mostly when your domain language differs a lot from the one that the original model was trained on. Example: Your corpus is from the aerospace industry and contains a lot of engineering terminology. This is very different from Wikipedia text in terms of vocabulary and semantics. We found that this can boost performance, especially if your down-stream tasks use rather small domain datasets. In contrast, if you have huge downstream datasets, the model can often adapt to the domain "on-the-fly" during downstream training.

4. How can I adapt a language model to a domain corpus? There are two main methods: you can extend the vocabulary by Tokenizer.add_tokens(["term_a", "term_b"...]) or fine-tune your model on a domain text corpus (see example).

5. How can I convert from / to HuggingFace's models? We support conversion in both directions (see example). You can also load any language model from HuggingFace's model hub by just specifying the name, e.g. LanguageModel.load("deepset/bert-base-cased-squad2")

6. How can you scale Question Answering to larger collections of documents? It's currently most common to put a fast "retriever" in front of the QA model. Check out Haystack for such an implementation and more features you need to really run QA in production.

7. How can you tailor Question Answering to your own domain? We attained high performance by training a model first on public datasets (e.g. SQuAD, Natural Questions ...) and then fine-tuning it on a few custom QA labels from the domain. Even ~2000 domain labels can give you the essential performance boost you need. Check out Haystack for more details and a QA labeling tool.

8. My GPU runs out of memory. How can I train with decent batch sizes? Use gradient accumulation! It accumulates the gradients over multiple batches before applying the parameter update. In FARM, just set the param grad_acc_steps in initialize_optimizer() and Trainer() to the number of batches you want to combine (i.e. grad_acc_steps=2 and batch_size=16 result in an effective batch size of 32).
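
A minimal sketch (values and surrounding objects are illustrative; the important part is passing the same grad_acc_steps to both calls):

from farm.modeling.optimization import initialize_optimizer
from farm.train import Trainer

# model, data_silo, n_epochs, n_gpu and device as in the usual training setup
model, optimizer, lr_schedule = initialize_optimizer(
    model=model,
    learning_rate=3e-5,
    n_batches=len(data_silo.loaders["train"]),
    n_epochs=n_epochs,
    device=device,
    grad_acc_steps=2,                  # accumulate gradients over 2 batches
)

trainer = Trainer(
    model=model, optimizer=optimizer, data_silo=data_silo,
    epochs=n_epochs, n_gpu=n_gpu, lr_schedule=lr_schedule, device=device,
    grad_acc_steps=2,                  # with batch_size=16 this gives an effective batch size of 32
)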

Acknowledgements

  • FARM is built upon parts of the great Transformers repository from HuggingFace. It utilizes their implementations of models and tokenizers.
  • FARM is a community effort! Essential pieces of it have been implemented by our FARMers out there. Thanks to all contributors!
  • The original BERT model and paper were published by Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova.

Citation

As of now there is no published paper on FARM. If you want to use or cite our framework, please include the link to this repository. If you are working with the German BERT model, you can link our blog post describing its training details and performance.

Comments
  • Getting different predictions on different runs with same ELECTRA model.

    Getting different predictions on different runs with same ELECTRA model.

    I trained an Electra Model on text classification. 3 Classes. Saved it with

        save_dir = Path("saved_models/bert-german-doc-tutorial")
        model.save(save_dir)
        processor.save(save_dir)
    

    Now I load Data I want to predict on and getting different results:

    1st run:
    labal_n          5055
    label_i           855
    label_e          8
    
    2nd run
    labal_n          4609
    label_e          990
    label_i           319
    
    3rd run
    label_i        3355
    labal_n       2510
    label_e       53
    

    How on earth can that be? Even if the model were total crap, the predictions should be deterministic - right?

    bug 
    opened by PhilipMay 38
  • Fix for CI problem - closing multiprocessing.pool again

    Fix for CI problem - closing multiprocessing.pool again

    For the background of this PR see here and the following comments: https://github.com/deepset-ai/FARM/issues/385#issuecomment-640086021

    TODO

    • [x] should we write a regression test with additional dependency to psutil? Details see below.
    • [x] what about num_processes = mp.cpu_count() - 1 vs. num_processes = mp.cpu_count()? Details see below.
    • [x] write docstrings
    • [ ] check examples and other documentation
     • [ ] should other code be changed to close the pool
    opened by PhilipMay 29
  • Adding additional custom features?

    Adding additional custom features?

    Is it possible to add additional custom features in addition to using pre-trained language models for the down-stream tasks?

    For example:

    # TWO INPUTS
    这是一只狗和这是一只红猫<div>This is a dog and that is a panda
    0 0 0 0 B-TERM 0 0 0 0 0 0 0<div>0 0 0 0 0 0 0 0 0
    
    # OUTPUT
    0 0 0 0 B-TERM 0 0 0 0 0 0 0<div>0 0 0 B-TERM 0 0 0 0 0
    
    enhancement 
    opened by echan00 27
  • Add option to use fast HF tokenizer.

    Add option to use fast HF tokenizer.

    This PR adds the option to use fast HF tokenizer.

    The reason why I need this is the following: I plan to open source a German Electra model which is lower case but does not strip accents. To do that you have to specify strip_accents=False to the tokenizer. But this option is only available for the fast tokenizer. Also see here: https://github.com/huggingface/transformers/issues/6186 and here https://github.com/google-research/electra/pull/88

    Might also solve #157

    To-do before done

    • [x] write tests
    • [x] review
    • [x] merge with #205
    • [x] fix CI problems
    opened by PhilipMay 23
  • Cannot load model from local dir

    Cannot load model from local dir

    Describe the bug I want to do this with Haystack:

    ### Inference ############
    
    # Load model
    reader = FARMReader(model_name_or_path="../../saved_models/twmkn9/albert-base-v2-squad2", use_gpu=False)
    

    I finetuned the model before and saved it to my local dir. Here the code:

    ### TRAINING #############
    # Let's take a reader as a base model
    reader = FARMReader(model_name_or_path="twmkn9/albert-base-v2-squad2", max_seq_len=512, use_gpu=False)
    
    # and fine-tune it on your own custom dataset (should be in SQuAD like format)
    train_data = "training_data"
    reader.train(data_dir=train_data, train_filename="2020-02-23_answers.json", test_file_name='TEST_answers.json', use_gpu=False, n_epochs=1, dev_split=0.1)
    

    Error message

    03/28/2020 22:25:07 - INFO - farm.utils -   device: cpu n_gpu: 0, distributed training: False, automatic mixed precision training: None
    03/28/2020 22:25:07 - INFO - farm.modeling.adaptive_model -   Found files for loading 1 prediction heads
    03/28/2020 22:25:07 - WARNING - farm.modeling.prediction_head -   Some unused parameters are passed to the QuestionAnsweringHead. Might not be a problem. Params: {"training": true, "num_labels": 2, "ph_output_type": "per_token_squad", "model_type": "span_classification", "name": "QuestionAnsweringHead"}
    03/28/2020 22:25:07 - INFO - farm.modeling.prediction_head -   Prediction head initialized with size [768, 2]
    03/28/2020 22:25:07 - INFO - farm.modeling.prediction_head -   Loading prediction head from ../../saved_models/twmkn9/albert-base-v2-squad2/prediction_head_0.bin
    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    ~/Documents/CodingProjects/NLPofTimFerrissShow/QnA_with_Tim_Haystack.py in 
          51 
          52 # Load model
    ----> 53 reader = FARMReader(model_name_or_path="../../saved_models/twmkn9/albert-base-v2-squad2", use_gpu=False)
          54 # A retriever identifies the k most promising chunks of text that might contain the answer for our question
          55 # Retrievers use some simple but fast algorithm, here: TF-IDF
    
    /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/haystack/reader/farm.py in __init__(self, model_name_or_path, context_window_size, batch_size, use_gpu, no_ans_boost, top_k_per_candidate, top_k_per_sample, max_processes, max_seq_len, doc_stride)
         79         self.inferencer = Inferencer.load(model_name_or_path, batch_size=batch_size, gpu=use_gpu,
         80                                           task_type="question_answering", max_seq_len=max_seq_len,
    ---> 81                                           doc_stride=doc_stride)
         82         self.inferencer.model.prediction_heads[0].context_window_size = context_window_size
         83         self.inferencer.model.prediction_heads[0].no_ans_boost = no_ans_boost
    
    /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/farm/infer.py in load(cls, model_name_or_path, batch_size, gpu, task_type, return_class_probs, strict, max_seq_len, doc_stride)
        139                 processor = InferenceProcessor.load_from_dir(model_name_or_path)
        140             else:
    --> 141                 processor = Processor.load_from_dir(model_name_or_path)
        142 
        143         # b) or from remote transformers model hub
    
    /Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/farm/data_handler/processor.py in load_from_dir(cls, load_dir)
        189         del config["tokenizer"]
        190 
    --> 191         processor = cls.load(tokenizer=tokenizer, processor_name=config["processor"], **config)
        192 
        193         for task_name, task in config["tasks"].items():
    
    TypeError: load() missing 1 required positional argument: 'data_dir'
    

    Expected behavior There is no error.

    Additional context I use Haystack

    To Reproduce Steps to reproduce the behavior

    System:

    • OS: Mac OS 10.14.6 (mojave)
    • GPU/CPU: CPU Intel core i5
    • FARM version: 0.4.1
    bug stale 
    opened by RobKnop 22
  • Losing Input Data in Classification Task

    Losing Input Data in Classification Task

    I am using FARM for a document classification task in a medical context. The deepset German BERT model was additionally trained on German medical Wikipedia and then trained for the task we have at hand: Given a long input string (the OCR-read content of a scanned file), predict the document class of that file.

    I mostly held onto this https://colab.research.google.com/drive/130_7dgVC3VdLBPhiEkGULHmqSlflhmVM#scrollTo=tPltDefXjSiJ tutorial and it worked fine in the first tests.

    As we are now training on the final data, the processor (?) started to simply lose data even before the training. The problem happens in the Inferencer too.

    In a pre-cleaning step, the train.tsv and test.tsv get produced like this:

    #get mukl data and convert to dataset
    df_all_data = pd.read_csv('dataset/mukl.tsv', delimiter="\t", encoding='latin-1', names=['sentence', 'label'])
    df_mukl_bert = pd.DataFrame({'text': df_all_data['sentence'], 'label': df_all_data['label']})
    
    # convert given labels to final labels
    new_labels = convert_labels_mukl(df_mukl_bert['label'])
    df_mukl_bert['label'] = new_labels
    
    # produce train and test DataFrames
    df_mukl_train, df_mukl_test = train_test_split(df_mukl_bert, test_size=0.1)
    df_mukl_test = pd.DataFrame({'text': df_mukl_test['text'],
        'label': df_mukl_test['label']}) 
    
    # write DataFrames to files
    df_mukl_train.to_csv('./dataset/train.tsv', sep='\t', index=False)
    df_mukl_test.to_csv('./dataset/test.tsv', sep='\t', index=False)
    

    After this step, i have my BERT Model ready to train and set up the basics:

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    tokenizer = Tokenizer.load(
        pretrained_model_name_or_path= Path("./saved_models/german-bert-pretrain-med"), do_lower_case=False)
    
    labels = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", "17", "18", "19", "20", "21"]
    processor = TextClassificationProcessor(tokenizer=tokenizer,
                                            max_seq_len=128,
                                            data_dir="./dataset" ,
                                            train_filename="train.tsv",
                                            label_list=labels,
                                            metric="acc",
                                            label_column_name="label" )
    
    BATCH_SIZE = 32
    data_silo = DataSilo(
        processor=processor,
        batch_size=BATCH_SIZE)
    
    

    The DataSilo does its work without any Errors, but at latest here, some of the input data gets lost. My train.tsv contains 55705 entries and my test.tsv contains 6190 entries.

    Expected Behaviour: The DataSilo logs: Loading train set from: dataset/train.tsv Got ya 15 parallel workers to convert 55705 dictionaries to pytorch datasets (chunksize = 741) ....Preprocessing... Examples in train + Examples in dev = 55705 Examples in test : 6190

    Actual Behaviour: The DataSilo logs: Loading train set from: dataset/train.tsv Got ya 15 parallel workers to convert 55518 dictionaries to pytorch datasets (chunksize = 741) ....Preprocessing... Examples in train: 49467 Examples in dev: 5871 Examples in test : 6171

    When using the exact same code with another train/test pair from another file generated from a csv, the behaviour is as expected. So I felt the reason must lie in the data. Since even the Inferencer is losing the same data, I could identify the lost texts and check them. Some of them have special characters like !, ?, @, *, _ and some start with ' but no single character occurs in all of them, and all of those characters appeared in correctly read texts before.

    So now I am out of ideas on how to fix this. Any ideas?

    bug 
    opened by FM29 20
  • Deadlock in DataSilo._get_dataset when using docker

    Deadlock in DataSilo._get_dataset when using docker

    UPDATE/SOLUTION: PSA FOR POSTERITY If you have a deadlock running FARM in a docker container, make sure you are running the container with --ipc=host to increase shared memory. SOLUTION END

    Describe the bug There seems to be a multiprocessing-related deadlock in DataSilo._get_dataset, which transforms tsv lines into dicts into datasets, chunkwise. Reading a moderately-sized training set (~18k docs) stalls with zero CPU activity after around 2/3 of the data.

    Error message This is a rather unhelpful trace just like so many multiprocessing deadlocks:

    Process ForkPoolWorker-23:
    Process ForkPoolWorker-20:
    Process ForkPoolWorker-22:
    Process ForkPoolWorker-21:
    Process ForkPoolWorker-19:
    Process ForkPoolWorker-17:
    Process ForkPoolWorker-18:
    Traceback (most recent call last):
      File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
        self.run()
      File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
        task = get()
      File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
        with self._rlock:
      File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
        return self._semlock.__enter__()
    KeyboardInterrupt
    Traceback (most recent call last):
      File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
        self.run()
      File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
        task = get()
      File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
        with self._rlock:
      File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
        return self._semlock.__enter__()
    Traceback (most recent call last):
    Traceback (most recent call last):
    KeyboardInterrupt
    Traceback (most recent call last):
      File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
        self.run()
      File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
        self.run()
      File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
        self.run()
      File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
        task = get()
      File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
        task = get()
      File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
        task = get()
      File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
        with self._rlock:
      File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
        with self._rlock:
      File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
        with self._rlock:
      File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
        return self._semlock.__enter__()
      File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
        return self._semlock.__enter__()
      File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
        return self._semlock.__enter__()
    KeyboardInterrupt
    KeyboardInterrupt
    KeyboardInterrupt
    Traceback (most recent call last):
    Traceback (most recent call last):
      File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
        self.run()
      File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
        self.run()
      File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
        task = get()
      File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
        with self._rlock:
      File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
        task = get()
      File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
        return self._semlock.__enter__()
      File "/usr/lib/python3.6/multiprocessing/queues.py", line 335, in get
        res = self._reader.recv_bytes()
      File "/usr/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
        buf = self._recv_bytes(maxlength)
      File "/usr/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
        buf = self._recv(4)
    KeyboardInterrupt
      File "/usr/lib/python3.6/multiprocessing/connection.py", line 379, in _recv
        chunk = read(handle, remaining)
    KeyboardInterrupt
     60%|██████████████████████████████████████████████████████▋                                    | 10752/17908 [04:04<02:42, 44.04 Dicts/s]
    Process ForkPoolWorker-24:
    Traceback (most recent call last):
      File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
        self.run()
      File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
        self._target(*self._args, **self._kwargs)
      File "/usr/lib/python3.6/multiprocessing/pool.py", line 108, in worker
        task = get()
      File "/usr/lib/python3.6/multiprocessing/queues.py", line 334, in get
        with self._rlock:
      File "/usr/lib/python3.6/multiprocessing/synchronize.py", line 95, in __enter__
        return self._semlock.__enter__()
    KeyboardInterrupt
    

    Expected behavior Dataset loading should successfully finish after processing the last chunk.

    Additional context I have tested that manually loading the data works:

    # This works
    train_dicts = processor.file_to_dicts("train.tsv")
    train_dataset, tensor_names = processor.dataset_from_dicts(dicts=train_dicts)
    

    Loading a subset of the first 10k docs works, too.

    One hunch is that grouper(dicts, multiprocessing_chunk_size) under some condition produces a pathological chunk size.

    To Reproduce I'll try and see if I can come up with a synthetic reproducer that doesn't include my data (which I can't give out).

    System:

    • OS: Ubuntu 18.04 with nvidia-docker2 and a CUDA 10.0 image
    • GPU/CPU: GTX 1080 / Xeon 4-core, 120GB RAM
    • FARM version: master. The system is otherwise idle: no file system contention, no excessive context switches, plenty of free RAM.
    bug 
    opened by trifle 17
  • Convert FARM gelectra Model to HuggingFace Model

    Convert FARM gelectra Model to HuggingFace Model

    Describe the bug I have trained a classification model of gelectra to classify text. Worked so far as expected. I want to integrate the model in a microservice. For this, I wanted to convert the FARM model to a HuggingFace model. To do so, I went through the docs and found the examples here. But when I run the code (same code, different directories), I get the messages below:

    Error message 05/11/2021 19:39:56 - INFO - farm.modeling.prediction_head - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex . 05/11/2021 19:39:56 - INFO - farm.modeling.language_model -
    05/11/2021 19:39:56 - INFO - farm.modeling.language_model - LOADING MODEL 05/11/2021 19:39:56 - INFO - farm.modeling.language_model - ============= 05/11/2021 19:39:56 - INFO - farm.modeling.language_model - Model found locally at /home/myname/Models/Competence/deepset/gelectra-base 05/11/2021 19:39:58 - INFO - farm.modeling.language_model - Loaded /home/myname/Models/Competence/deepset/gelectra-base 05/11/2021 19:39:58 - INFO - farm.modeling.adaptive_model - Found files for loading 1 prediction heads 05/11/2021 19:39:58 - WARNING - farm.modeling.prediction_head - layer_dims will be deprecated in future releases 05/11/2021 19:39:58 - INFO - farm.modeling.prediction_head - Prediction head initialized with size [768, 4] 05/11/2021 19:39:58 - INFO - farm.modeling.prediction_head - Using class weights for task 'text_classification': [1.0, 1.0, 1.0, 1.0] 05/11/2021 19:39:58 - INFO - farm.modeling.prediction_head - Loading prediction head from /home/myname/Models/Competence/deepset/gelectra-base/prediction_head_0.bin file /home/myname/Models/Competence/deepset/gelectra-base/config.json not found 05/11/2021 19:39:58 - INFO - farm.modeling.tokenization - Loading tokenizer of type 'ElectraTokenizer' Traceback (most recent call last): File "/home/myname/source/hslu/ba/Code/Models/HuggingFace/convert_to_transformers_classification.py", line 46, in convert_to_transformers() File "/home/myname/source/hslu/ba/Code/Models/HuggingFace/convert_to_transformers_classification.py", line 29, in convert_to_transformers transformer_model = model.convert_to_transformers()[0] File "/home/myname/anaconda3/envs/ba/lib/python3.8/site-packages/farm/modeling/adaptive_model.py", line 511, in convert_to_transformers return conv.Converter.convert_to_transformers(self) File "/home/myname/anaconda3/envs/ba/lib/python3.8/site-packages/farm/conversion/transformers.py", line 42, in convert_to_transformers transformers_model = Converter._convert_to_transformers_classification_regression(adaptive_model, File "/home/myname/anaconda3/envs/ba/lib/python3.8/site-packages/farm/conversion/transformers.py", line 178, in _convert_to_transformers_classification_regression transformers_model.classifier.load_state_dict( File "/home/myname/anaconda3/envs/ba/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1051, in load_state_dict raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format( RuntimeError: Error(s) in loading state_dict for ElectraClassificationHead: Missing key(s) in state_dict: "dense.weight", "dense.bias", "out_proj.weight", "out_proj.bias". Unexpected key(s) in state_dict: "weight", "bias".

    Process finished with exit code 1

    Expected behavior Model should convert without an error :ok_hand:

    Additional context

    To Reproduce Train a deepset/gelectra model for a classification task with FARM - convert it

    System:

    • OS: Ubuntu 20.04.2 LTS x86_64
    • GPU/CPU: NVIDIA GeForce GTX 1080 (others are too expensive :sob: )
    • FARM version: farm~=0.7.0
    bug stale 
    opened by florianbaer 15
  • Howto do evaluation on new annotated data NER?

    Howto do evaluation on new annotated data NER?

    Question Hi there,

    I trained a NER model on the conll03 dataset, and it performs quite well.

    However, I want to test it on a new domain. So we annotated some new texts and brought them into conll03 format.

    I can read these files with the model.inference_from_file(eval_file) just fine. It then proceeds to give me a dict with offsets, tokens and the respective label it would predict.

    save_dir = "saved_models/ner_dbmdz_conll"
    model = Inferencer.load(save_dir)
    result = model.inference_from_file(debby_eval_file)

    However, I would like to evaluate how well this model is doing on the new gold annotation.

    I cannot figure out how to do that with Evaluator.eval. Is that even the right approach?

    Your help is much appreciated.

    Thanks.

    enhancement question task: NER 
    opened by tnhaider 15
  • Add possibility to do cross validation split only for train and dev

    Add possibility to do cross validation split only for train and dev

    This adds the possibility to do cross validation splits only for train and dev dataset and keep test as it is (if specified). This is done by adding just one additional parameter called train_dev_split_only which is False by default. The way this has been implemented does not introduce any breaking changes.

    The original implementation (when train_dev_split_only is False) works this way:

    • concat all datasets given in sets
    • split concatenation into train and test folds (5 by default)
    • then use the dev_split value to split a dev set away from the train set

    This implementation is something I personally never saw and I would call it a bug. But since this is just my opinion, I just added this alternative way, because for my project I need it this way.

    When train_dev_split_only is True this happens:

    • concat all datasets given in sets
    • split concatenation into train and dev folds (5 by default)
    • if test is specified it always returns the same test set in all "splits"

    This is the way cross validation should work for me. If I also wanted to include the test set in cross validation I would use nested cross validation - but that is something different.

    Please give me feedback what you think about this PR. If you agree I will add a test and we are ready to merge.

    Todo

    • [ ] write test
    opened by PhilipMay 14
  • Label_ids

    Label_ids "None" upon initiating training of Adaptive Model

    When calling train method on the adaptive model, an error is thrown upon collecting the losses from the prediction head (logits_to_loss_per_head, adaptive_model) when logits and labels in prediction head are combined (logits_to_loss, prediction_head) to create a per_sample_loss because the variable "label_ids" is None and the "view()" function cannot be called upon it. The variable "label_ids" becomes None due to the fact that the function call assigning values to it "kwargs.get(self.label_tensor_name)" returns None. However, the documentation does not reveal whether and where "label_ids" or rather, "label_tensor_names" ought to be specified.

    Error message Train epoch 1/5: 0%| | 0/200 [00:00<?, ?it/s] Traceback (most recent call last): File "/home/f_weise/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/192.5728.105/helpers/pydev/pydevd.py", line 2060, in main() File "/home/f_weise/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/192.5728.105/helpers/pydev/pydevd.py", line 2054, in main globals = debugger.run(setup['file'], None, None, is_module) File "/home/f_weise/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/192.5728.105/helpers/pydev/pydevd.py", line 1405, in run return self._exec(is_module, entry_point_fn, module_name, file, globals, locals) File "/home/f_weise/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/192.5728.105/helpers/pydev/pydevd.py", line 1412, in _exec pydev_imports.execfile(file, globals, locals) # execute the script File "/home/f_weise/.local/share/JetBrains/Toolbox/apps/PyCharm-P/ch-0/192.5728.105/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile exec(compile(contents+"\n", file, 'exec'), glob, loc) File "/home/f_weise/projects/adup-watchdog/src/scripts/BertBaseCased.py", line 152, in model_training = trainer.train(model) File "/home/f_weise/projects/adup-watchdog/.venv/src/farm/farm/train.py", line 154, in train per_sample_loss = model.logits_to_loss(logits=logits, **batch) File "/home/f_weise/projects/adup-watchdog/.venv/src/farm/farm/modeling/adaptive_model.py", line 129, in logits_to_loss all_losses = self.logits_to_loss_per_head(logits, **kwargs) File "/home/f_weise/projects/adup-watchdog/.venv/src/farm/farm/modeling/adaptive_model.py", line 116, in logits_to_loss_per_head all_losses.append(head.logits_to_loss(logits=logits_for_one_head, **kwargs)) File "/home/f_weise/projects/adup-watchdog/.venv/src/farm/farm/modeling/prediction_head.py", line 263, in logits_to_loss return self.loss_fct(logits, label_ids.view(-1)) AttributeError: 'NoneType' object has no attribute 'view'

    Expected behavior I expected the variable "label_id" to be a tensor of the same lengths/shape as there are samples in the training set, such that a per_sample_loss can be calculated and the training successfully begins.

    To Reproduce

    class BertBaseCased(object):

    def get_layout_for_pytorch(self, list_of_pcs):
    
        list_of_texts_to_channels = []
        for pc in list_of_pcs:
            pc.text = re.sub('\s+', ' ', pc.text).strip() 
            pc.text = re.sub('[^a-zA-ZÜÖÄüöä\d\s:\.\,]', ' ', pc.text)
            if pc.text is "" or " ' " and pc.text in list_of_texts_to_channels:
                continue
            else:
                list_of_texts_to_channels.append([pc.text, 'sensitive' if pc.channel.class_channel_id < 100 else 'non-sensitive'])
        return list_of_texts_to_channels
    
    import logging
    logging.getLogger().setLevel(logging.INFO)
    
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print("Devices available: {}".format(device))
    
    tokenizer = BertTokenizer.from_pretrained(
        pretrained_model_name_or_path="bert-base-german-cased",
        do_lower_case=True)
    
    processor = TextClassificationProcessor(tokenizer=tokenizer,
                                            max_seq_len=50,
                                            data_dir="/home/f_weise/projects/adup-watchdog/src/scripts/",
                                            train_filename="data_train.tsv",
                                            dev_filename=None,
                                            test_filename="data_test.tsv",
                                            dev_split=0.1,
                                            columns = ["text", "label"],
                                            label_list= ["sensitive", "non-sensitive"],
                                            source_field= "label",
                                            metrics = ["acc"],
                                            use_multiprocessing = True)
    
    data_silo = DataSilo(processor=processor,batch_size= 15)
    
    MODEL_NAME_OR_PATH = "bert-base-german-cased"
    
    language_model = Bert.load(MODEL_NAME_OR_PATH)
    
    LAYER_DIMS = [768, 2]
    
    processor.add_task(name='text_classification', metric='acc', label_list=["sensitive", "non-sensitive"])
    prediction_head = TextClassificationHead(layer_dims=[768, 2])
    
    EMBEDS_DROPOUT_PROB = 0.1
    
    model = AdaptiveModel(
        language_model=language_model,
        prediction_heads=[prediction_head],
        embeds_dropout_prob=EMBEDS_DROPOUT_PROB,
        lm_output_types=["per_sequence"],
        device=device)
    
    
    LEARNING_RATE = 2e-5
    WARMUP_PROPORTION = 0.1
    N_EPOCHS = 5
    
    optimizer, warmup_linear = initialize_optimizer(
        model=model,
        learning_rate=LEARNING_RATE,
        warmup_proportion=WARMUP_PROPORTION,
        n_batches=len(data_silo.loaders["train"]),
        n_epochs=N_EPOCHS)
    
    N_GPU = 0
    
    trainer = Trainer(
        optimizer=optimizer,
        data_silo=data_silo,
        epochs=N_EPOCHS,
        n_gpu=N_GPU,
        warmup_linear=warmup_linear,
        device=device,
    )
    
    
    model_training = trainer.train(model)
    
    save_dir = "bert-german-test_version"
    model.save(save_dir)
    processor.save(save_dir)
    


    data_test_screenshot data_train_screenshot

    System:

    • OS: Mint 19.1
    • GPU/CPU: cpu
    • FARM version: farm-0.2.0
    question 
    opened by FelineWeise 14
  • Which pytorch (and other package) versions are actually required

    Which pytorch (and other package) versions are actually required

    When pip install farm is run, only certain PyTorch versions are allowed (<1.10) and mlflow versions are restricted to <=1.13.1

    Why is this? It makes it increasingly difficult to use the library with other software that expects more recent versions of those libraries.

    Is it precautionary or are there actual known problems with more recent versions (which could possibly get solved, of course)?

    enhancement 
    opened by johann-petrak 1
  • Extract embedding while using parameter

    Extract embedding while using parameter "extraction_strategy="per_token""

    Question Hello! I used the script "embeddings_extraction.py", and I input the sentence as below:

    basic_texts = [{"text": "apple is delicious fruit"}]
    model = Inferencer.load(lang_model, task_type="embeddings", gpu=use_gpu, batch_size=batch_size,
                            extraction_strategy="per_token", extraction_layer=-1, num_processes=0)
    result = model.inference_from_dicts(dicts=basic_texts)

    I set the parameter extraction_strategy="per_token", and the printed len(result[0]["vec"]) is 256. result[0]["vec"][0] is a 768-dimensional vector. I am wondering whether this 768-dimensional vector result[0]["vec"][0] is the representation of the first word "apple" or the representation of the "[CLS]" token? Thank you very much!

    Additional context Add any other context or screenshots about the question (optional).

    question stale 
    opened by JiangYanting 1
  • Need guidance on Multi label Classification

    Need guidance on Multi label Classification

    Question Need guidance on Multi label Classification

    Additional context We are working on a multi-label text classification problem to classify posts into classes such as current states (Negative low, Negative high, Positive High, Positive low) and desired states (Positive High, Positive low); currently we have labelled around 50 posts.

    As I am beginner to this, could someone guide right approaches or reference code or article to go about this.

    -Thanks

    question stale 
    opened by pbsprashanth113 2
Releases(v0.8.0)
  • v0.8.0(Jun 10, 2021)

    DPR Improvements

    • DPR - improve loading of datasets #733 @voidful
    • DPR - enable saving and loading of other model types, e.g., RoBERTa models #765 @Timoeller @julian-risch
    • DPR - fix conversion of BiAdaptiveModel #753 @bogdankostic

    torch 1.8.1 and transformers 4.6.1

    • Bump transformers version to 4.6.1 #787 @Timoeller @julian-risch
    • Bump torch version to 1.8.1 #767 @Timoeller @julian-risch

    Multi-task Learning

    Implement Multi-task Learning and added example #778 @johann-petrak

    List of Evaluation Metrics

    Allow list of metrics and add tests and pythondoc #777 @johann-petrak

    Misc

    • Reduce number of logging messages by Processor about returning problematic ids #772 @johann-petrak
    • Add farm.__version__ tag #761 @johann-petrak
    • Add value of doc_stride, max_seq_len, max_query_length in error message #784 @ftesser
    • Convert QACandidates with empty or whitespace answers to no_answers on doc level #756 @julian-risch

    • String comparison: Should replace "is" with "==" #774 @johann-petrak
    • Fix reference before assignment in DataSilo #738 @bogdankostic
    • Changing QA_input format in tutorial #735 @julian-risch
    • Fix TextPairClassificationProcessor example by adding metric #780 @julian-risch

    Source code(tar.gz)
    Source code(zip)
  • v0.7.1(Mar 31, 2021)

    A patch release focusing on bug fixes for Dense Passage Retrieval (DPR):

    • Fix saving and loading of DPR models and Processors in #746
    • Fix DPR tokenization statistics in #738
    • Fix cosine similarity in DPR training #741

    Misc

    • Fix tuple input for TextPairClassification inference #723

    Source code(tar.gz)
    Source code(zip)
  • v0.7.0(Feb 22, 2021)

    QA Confidence Scores

    In response to several requests from the community, we now provide more meaningful confidence scores for the predictions of extractive QA models. #690 #705 @julian-risch @timoeller @lalitpagaria To this end, predicted answers got a new attribute called confidence, which is in the range [0,1] and can be calibrated with the probability that a prediction is an exact match. The intuition behind the scores is the following: After calibration, if the average confidence of 100 predictions is 70%, then on average 70 of the predictions will be correct. The implementation of the calibration uses a technique called temperature scaling. The calibration can be executed on a dev set by running the eval() method in the Evaluator class and setting the parameter calibrate_conf_scores to true. This parameter is false by default as it is still an experimental feature and we continue working on it. The score attribute of predicted answers and their ranking remain unchanged so that the default behavior is unchanged. An example shows how to calibrate and use the confidence scores.

    Misc

    • Refactor Text pair handling, that also adds Text pair regression #713 @timoeller
    • Refactor Textsimilarity processor #711 @timoeller
    • Refactor Regression and inference processors #702 @timoeller
    • Fix NER probabilities #700 @brandenchan
    • Calculate squad evaluation metrics overall and separately for text answers and no answers #698 @julian-risch
    • Re-enable test_dpr_modules also for windows #697 @ftesser
    • Use Path instead of String in ONNXAdaptiveModel #694 @skiran252

    Big thanks to all contributors!

    Source code(tar.gz)
    Source code(zip)
  • v0.6.2(Jan 20, 2021)

    This is just a small patch to change the return types of offsets in our QAInferencer, see #693

    It is needed to fix RestAPI related issues where int64 cannot decoded within JSONs.

    Source code(tar.gz)
    Source code(zip)
  • v0.6.1(Jan 12, 2021)

    Patch release

    This is just a quick patch release to bugfix some input validation for Question Answering [closed] Fix/missing truncation bug #679

    Additional feature for QA

    Still, another interesting feature slipped in: We can now filter QA predictions to not contain duplicate answers. [closed] Added filter_range parameter that allows to filter answers with similar start/end indices #680

    Additional test

    [part: tokenizer][task: QA] Add integration test for QA processing #683

    Misc

    • [closed] Remove "qas" inference input wherever possible #681
    • [closed] Added parameter names to convert_from_transformers call in question_answering_crossvalidation.py #672

    Source code(tar.gz)
    Source code(zip)
  • v0.6.0(Dec 30, 2020)

    Simplification of Preprocessing

    We wanted to make preprocessing for all our tasks (e.g. QA, DPR, NER, classification) more understandable for FARM users, so that it is easier to adjust to specific use cases or extend the functionality to new tasks.

    To achieve this we followed two design choices:

    1. Avoid deeply nested calls
    2. Keep all high-level descriptions in a single place

    Question Answering Preprocessing

    We especially focussed on making QA processing more sequential and divided the code into meaningful snippets #649

    The code snippets are (see related method):

    • convert the input into FARM specific QA format
    • tokenize the questions and texts
    • split texts into passages to fit the sequence length constraint of Language Models
    • [optionally] convert labels (disabled during inference)
    • convert question, text, labels and additional information to PyTorch tensors

    Breaking changes

    1. Switching to FastTokenizers (based on HuggingFace's tokenizers project, written in Rust) as the default Tokenizer: the use_fast parameter in the Tokenizer.load() method now defaults to True. Support for slow, Python-based Tokenizers will be implemented for all tasks in the next release.
    2. The Processor.dataset_from_dicts method by default returns an additional parameter problematic_sample_ids that keeps track of which input sample caused problems during preprocessing:
    dataset, tensor_names, problematic_sample_ids = processor.dataset_from_dicts(dicts=dicts)
    

    Update to transformers version 4.1.1 and torch version 1.7.0

    Transformers comes with many new features, including model versioning, that we do not want to miss out on. #665 Model versions can now be specified like:

        model = Inferencer.load(
            model_name_or_path="deepset/roberta-base-squad2",
            revision="v2.0",
            task_type="question_answering",
        )
    

    DPR enhancements

    • MultiGPU support #619
    • Added tests #643
    • Bugfixes and smaller enhancements #629 #655 #663

    Misc

    • Cleaner logging and error handling #639
    • Benchmark automation via CML #646
    • Disable DPR tests on Windows, since they do not work with PyTorch 1.6.1 #637
    • Option to disable MLflow logger #650
    • Fix to Earlystopping and custom head #617
    • Adding probability of masking a token parameter for LM task #630

    Big thanks to all contributors! @ftesser @pashok3d @Timoeller @tanaysoni @brandenchan @bogdankostic @kolk @tholor

    Source code(tar.gz)
    Source code(zip)
  • v0.5.0(Oct 30, 2020)

    Add Dense Passage Retriever (DPR) incl. Training & Inference (#513, #601, #606)

    Happy to introduce a completely new task type to FARM: Text similarity with two separate transformer encoders

    Why? We observe a big shift in Information Retrieval from sparse methods (BM25 etc.) towards dense methods that encode queries and docs as vectors and use vector similarity to retrieve the most similar docs for a certain query. This is not only helpful for document search but also for open-domain Question Answering. Dense methods outperform sparse methods already in many domains and are especially powerful if the matching between query and passage cannot happen via "keywords" but rather relies on semantics / synonyms / context.

    What? One of the most promising methods at the moment is "Dense Passage Retrieval" from Karpukhin et al. (https://arxiv.org/abs/2004.04906). In a nutshell, DPR uses one transformer to encode the query and a second transformer to encode the passage. The two encoders project the different texts into the same vector space and are trained jointly on a similarity measure using in-batch-negatives.

    How? We introduce a new class BiAdaptiveModel that has two language models plus a prediction head. In the case of DPR, this will be one question encoder model and one passage encoder model.
    See the new example script dpr_encoder.py for training / fine-tuning a DPR model. We also have a tight integration in Haystack, where you can use it as a Retriever for open-domain Question Answering.

    Refactor conversion from / to Transformers #576

    We simplified conversion between FARM <-> Transformers. You can now run:

    # Transformers -> FARM
    model = Converter.convert_from_transformers("deepset/roberta-base-squad2", device="cpu")
    
    # FARM -> Transformers
    transformer_models = Converter.convert_to_transformers(your_adaptive_model)
    

    Note: In case your FARM AdaptiveModel has multiple prediction heads (e.g. 1x NER, 1x Text Classification), the conversion will return a list with two transformer models (both with one head respectively).

    Upgrade to Transformers 3.3.1 #579

    Transformers 3.3.1 comes with a few new interesting features, incl. support for Retrieval-Augmented Generation (RAG) which can be used to generate answers rather than extracting answers. In contrast to GPT-3, the generation is conditioned on a set of retrieved documents, and is, therefore, more suitable for most QA applications in the industry that rely on a domain corpus.
    Thanks to @lalitpagaria, we'll support RAG also in Haystack soon (see https://github.com/deepset-ai/haystack/pull/484)


    Details

    Question Answering

    • Improve Speed: Vectorize Question Answering Prediction Head #603
    • Fix removal of yes no answers #540
    • Fix QA bug that rejected spans at beginning of passage #564
    • Added warning about Natural Questions Inference #565
    • Remove loss index from QA PH #589

    Other

    • Catch empty datasets in Inferencer #605
    • Add option to set evaluation batch size #607
    • Infer model type from config #600
    • Fix random behavior when loading ELECTRA models #599
    • Fix import for Python3.6 #581
    • Fixed conversion of BertForMaskedLM to transformers #555
    • Load correct config for DistilBert model #562
    • Add passages per second calculation to benchmarks #560
    • Fix batching in ONNX forward pass #559
    • Add ONNX conversion & Inference #557

    Big thanks to all contributors! @ftesser @lalitpagaria @himanshurawlani @Timoeller @tanaysoni @brandenchan @bogdankostic @kolk @tholor

    Source code(tar.gz)
    Source code(zip)
  • v0.4.9(Sep 21, 2020)

    Minor patch: Relax PyTorch version requirements

    Installing FARM in environments where torch's GPU version was already installed via pip (e.g. torch 1.6.0+cu101) caused version trouble. This is especially annoying in Google Colab environments. Change: Allow all torch 1.6.x versions incl. 1.6.0+cu101 etc.


    Further changes:

    • Nested cross validation by @PhilipMay #508
    Source code(tar.gz)
    Source code(zip)
  • 0.4.8(Sep 14, 2020)

    Minor release

    Experimental Support for fast Rust Tokenizers (#482)

    While preprocessing is usually not the bottleneck in our pipelines, there's still significant time spent on it (~ 20 % for QA inference). We saw substantial speed-ups with HuggingFace's "FastTokenizers" that are based on Rust. We are therefore introducing a basic "experimental" implementation with this release. We are planning to stabilize it and integrate it more smoothly into the FARM processor.

    Usage:

    tokenizer = Tokenizer.load(pretrained_model_name_or_path="bert-base-german-cased",
                               do_lower_case=False, 
                               use_fast=True)
    

    Upgrade to transformers 3.1.0 (#464)

    The latest transformers release has quite interesting new features - one of them being basic support of a DPR model class (Dense Passage Retriever). This will simplify our dense passage retriever integration in Haystack and the upcoming DPR training which we plan to have in FARM.


    Details

    Question Answering

    • Add asserts on doc_stride and max_seq_len to prevent issues with sliding window #538
    • fix Natural Question inference processing #521

    Other

    • Fix logging of error msg for FastTokenizer + QA #541
    • Fix truncation warnings in tokenizer #528
    • Evaluate model on best model when doing early stopping #524
    • Bump transformers version to 3.1.0 #515
    • Add warmup run to component benchmark #504
    • Add optional s3 auth via params #511
    • Add option to use fast HF tokenizer. #482
    • CodeBERT support for embeddings #488
    • Store test eval result in variable #506
    • Fix typo f1 micro vs. macro #505

    Big thanks to all contributors! @PhilipMay @lambdaofgod @Timoeller @tanaysoni @brandenchan @bogdankostic @kolk @tholor

    Source code(tar.gz)
    Source code(zip)
  • 0.4.7(Aug 27, 2020)

    Main changes

    Support for MiniLM Model (#464)

    An interesting model from Microsoft that is up to 2.7x faster than BERT while showing similar or better performance on many tasks (Paper). We found it particularly useful for QA and published a model fine-tuned on SQuAD 2.0: deepset/minilm-uncased-squad2
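    Loading the checkpoint should work like any other QA model in FARM; a minimal sketch (batch size and gpu flag are illustrative):

    from farm.infer import QAInferencer

    nlp = QAInferencer.load("deepset/minilm-uncased-squad2",
                            task_type="question_answering",
                            batch_size=16,
                            gpu=False)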

    Benchmarks per component (#491)

    Measuring the speed of individual components in the pipeline while respecting CUDA's async behaviour. We were especially interested in analyzing how much time we spend for QA in preprocessing, language model, and prediction head. Turns out it's on average about 20% : 50% : 30%. Interestingly, there's a high variance in the prediction head depending on the relevance of the question. We will use that information to further optimize performance in the prediction head. We'll share more detailed benchmarks soon.
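    For context, timing CUDA code requires synchronizing before reading the clock because kernels launch asynchronously; a minimal sketch of that pattern (not the actual benchmark code from the repo):

    import time
    import torch

    def timed(fn, *args, **kwargs):
        # wait for pending kernels so the measurement covers the full component
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        return result, time.perf_counter() - start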

    Support for PyTorch 1.6 (#502)

    We now support PyTorch 1.6 and 1.5.1.


    Details

    Question Answering

    • Pass max_answers param to processor #503
    • Deprecate QA input dicts with [context, qas] as keys #472
    • Squad processor verbose feature #470
    • Propagate QA ground truth in Inferencer #469
    • Ensure QAInferencer always has task_type "question_answering" #460

    Other

    • Download models from (private) S3 #500
    • fix _initialize_data_loaders in data_silo #476
    • Remove torch version wildcard in requirements #489
    • Make num processes parameter consistent across inferencer and data silo #480
    • Remove rest_api_schema argument in inference_from_dicts() #474
    • farm.data_handler.utils: Add encoding to open write in split_file method #466
    • Fix and document Inferencer usage and pool handling #429
    • Remove assertions or replace with logging error #468
    • Remove baskets without features in _create_dataset. #471
    • fix bugs with regression label standardization #456

    Big thanks to all contributors! @PhilipMay @Timoeller @tanaysoni @brandenchan @bogdankostic @kolk @rohanag @lingsond @ftesser

    Source code(tar.gz)
    Source code(zip)
  • 0.4.6(Jul 10, 2020)

    Main changes

    • Upgrading to Pytorch 1.5.1 and transformers 3.0.2
    • Important bug fix for language model training from scratch
    • Bug fixes and big refactorings for Question Answering, incl. a specialized QAInferencer with dedicated In- and Output objects to simplify usage and code completion:
    from farm.infer import QAInferencer
    from farm.data_handler.inputs import QAInput, Question
    
    nlp = QAInferencer.load(
        "deepset/roberta-base-squad2",
        task_type="question_answering",
        batch_size=16,
        num_processes=0)
    
    input = QAInput(
        doc_text="My name is Lucas and I live on Mars.",
        questions=Question(text="Who lives on Mars?",
                           uid="your-id"))
    
    res = nlp.inference_from_objects([input], return_json=False)[0]
    
    # High level attributes for your query
    print(res.question)
    print(res.context)
    print(res.no_answer_gap)
    # ...
    # Attributes for individual predictions (= answers)
    pred = res.prediction[0]
    print(pred.answer)
    print(pred.answer_type)
    print(pred.answer_support)
    print(pred.offset_answer_start)
    print(pred.offset_answer_end)
    # ...
    

    Details

    Question Answering

    • Add meta attribute to QACandidate for Haystack #455
    • Fix start and end offset checks in QA #450
    • Fix offset_end character for QA #449
    • Dedicated Input Objects for QA #445
    • Question Answering improvements: cleaner code, more typed objects, better compatibility between SQuAD and Natural Questions #411, #438, #419

    Other

    • Upgrade pytorch and python versions #447
    • Upgrade transformers version #448
    • Fix randomisation of train file for training from scratch #427
    • Fix loading of saved models with class weights #431
    • Remove raising exception errors in processor #451
    • Fix bug in benchmark tests with if statement #430
    • Remove hardcoded seeds from trainer #424
    • Conditional num_training_steps setting #437
    • Add badge with link to doc page #432
    • Fix for CI problem - closing multiprocessing.pool again #403

    :man_farmer: :woman_farmer: Thanks to all contributors for making FARMer's life better! @PhilipMay, @tstadel, @brandenchan, @tanaysoni, @Timoeller, @tholor, @bogdankostic

    Source code(tar.gz)
    Source code(zip)
  • 0.4.5(Jun 24, 2020)

    Minor release including an important bug fix for Question Answering

    Important Bug Fix QA

    Fixing a bug that was introduced in 0.4.4 (#416 and #417) that resulted in returning only a single answer per document in certain situations. This caused particular trouble for open-domain QA settings like in haystack.

    Speed optimization training from scratch

    Adding multiple optimizations and bug fixes to improve training from scratch, incl.:

    • Enable usage of DistributedDataParallel
    • Enable Automatic Mixed Precision (AMP) training
    • Fix bugs in StreamingDataSilo
    • Fix bugs in Checkpointing (important for training via spot / on-demand instances)

    This helped to cut training time in our benchmark from 616 hours down to 160 hours. See #305 for details.


    Other changes:

    • Add model optimization to inference loading #415
    • Proper attribute assignment in QA with yes / no answer #414
    Source code(tar.gz)
    Source code(zip)
  • 0.4.4(Jun 18, 2020)

    ELECTRA Model

    We welcome a new language model to the FARM family that we found to be a really powerful alternative to the existing ones. ELECTRA is trained using a small generator network that replaces tokens with plausible alternatives and a discriminative model that learns to detect these replaced tokens (see the paper for details: https://arxiv.org/abs/2003.10555). This makes pretraining more efficient and improves down-stream performance on quite a few tasks.

    You can load it as usual via

    LanguageModel.load("google/electra-base-discriminator")
    

    See HF's model hub for more model variants

    Natural Questions Style QA

    With QA being our favorite and most intensely worked-on down-stream task, we are happy to support an additional style of QA in FARM (#334). In contrast to the popular SQuAD-based models, these NQ models support binary answers, i.e. questions like "Is Berlin the capital of Germany?" can be answered with "Yes" plus an additional span that the model used as a "supporting fact" to give this answer.

    The implementation leverages the option of prediction heads in FARM by having one QuestionAnsweringHead that predicts a span (like in SQuAD) and one TextClassificationHead that predicts what type of answer the model should give (current options: span, yes, no, is_impossible).
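    A rough sketch of how such a two-head model can be assembled from FARM's building blocks (the backbone and layer dims are illustrative, not the exact setup of the released model):

    from farm.modeling.adaptive_model import AdaptiveModel
    from farm.modeling.language_model import LanguageModel
    from farm.modeling.prediction_head import QuestionAnsweringHead, TextClassificationHead

    language_model = LanguageModel.load("roberta-base")
    span_head = QuestionAnsweringHead()                              # predicts answer spans (SQuAD-style)
    answer_type_head = TextClassificationHead(layer_dims=[768, 4])   # span / yes / no / is_impossible

    model = AdaptiveModel(language_model=language_model,
                          prediction_heads=[span_head, answer_type_head],
                          embeds_dropout_prob=0.1,
                          lm_output_types=["per_token", "per_sequence"],
                          device="cpu")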

    Example:

    from farm.infer import Inferencer

    QA_input = [
        {
            "qas": ["Is Berlin the capital of Germany?"],
            "context": "Berlin (/bɜːrˈlɪn/) is the capital and largest city of Germany by both area and population."
        }
    ]
    model = Inferencer.load(model_name_or_path="../models/roberta-base-squad2-nq", batch_size=16, gpu=True)
    result = model.inference_from_dicts(dicts=QA_input, return_json=False)
    print(f"Answer: {result[0].prediction[0].answer}")
    
       >> Answer: yes
    

    See this new example script for more details on training and inference.

    Note: This release includes the initial version for NQ, but we are already working on some further simplifications and improvements in #411.

    New speed benchmarking

    With inference speed being crucial for many deployments, especially for QA, we introduce a new benchmarking tool in #321. This allows us to easily compare the performance of different frameworks (e.g. ONNX vs. pytorch), parameters (e.g. batch size) and code optimizations of different FARM versions. See the readme for usage details and this spreadsheet for current results.


    A few more changes ...

    Modeling

    • Add support for Camembert-like models #396
    • Speed up in BERTLMHead by doing argmax on logits on GPU #377
    • Fix bug in BERT-style pretraining #369
    • Remove additional XLM-R tokens #360
    • ELECTRA: use gelu for pooled output of ELECTRA model #364

    Data handling

    • Option to specify text col name in TextClassificationProcessor and RegressionProcessor #387
    • Document magic data loading in TextClassificationProcessor PR #383
    • multilabel support for data_silo.calculate_class_weights #389
    • Implement Prediction Objects for Question Answering #405
    • Removing lambda function from AdaptiveModel so the class is picklable #345
    • Add target device optimisations for ONNX export #354

    Examples / Docs

    • Add script to reproduce results from COVID-QA paper #412
    • Update tutorials #348
    • Docstring Format fix #382

    Other

    • Adjust code to squad inferencing #367
    • Convert pydantic objects to regular classes #410
    • Rename top n recall to top n accuracy #409
    • Add test for embedding extraction #394
    • Quick fix CI problems with OOM and unclosed worker pool #406
    • Update transformers version to 2.11 #407
    • Managing pytorch pip find-links directive #393
    • Zero based Epoch Display in Training Progress Bar #398
    • Add stalebot #400
    • Update pytorch to 1.5.0 #392
    • Question answering accuracy test #357
    • Add init.py files for farm.conversion module #365
    • Make onnx imports optional #363
    • Make ONNXRuntime dependency optional #347

    :man_farmer: :woman_farmer: Thanks to all contributors for making FARMer's life better! @PhilipMay , @stefan-it, @ftesser , @tstadel, @renaud, @skirdey, @brandenchan, @tanaysoni, @Timoeller, @tholor, @bogdankostic

    Source code(tar.gz)
    Source code(zip)
  • 0.4.3(Apr 29, 2020)

    :1234: Changed Multiprocessing in Inferencer

    The Inferencer now has a fixed pool of processes instead of creating a new one for every inference call. This accelerates the processing a bit and solves some problems when using it in combination with frameworks like gunicorn/FastAPI etc. (#329)

    Old:

    ...
    inferencer.inference_from_dicts(dicts, num_processes=8)
    

    New:

    inferencer = Inferencer.load(model_name_or_path, num_processes=8)
    ...
    inferencer.inference_from_dicts(dicts)
    

    :fast_forward: Streaming Inferencer

    You can now also use the Inferencer in a "streaming mode". This is especially useful in production scenarios where the Inferencer is part of a bigger pipeline (e.g. consuming documents from elasticsearch) and you want to get predictions as soon as they are available (#315)

    Input: Generator yielding dicts with your text
    Output: Generator yielding your predictions

        dicts = sample_dicts_generator()  # it can be a list of dicts or a generator object
        results = inferencer.inference_from_dicts(dicts, streaming=True, multiprocessing_chunksize=20)
        for prediction in results:  # results is a generator object that yields predictions
            print(prediction)
    

    :older_woman: :older_man: "Classic" baseline models for benchmarking + S3E Pooling

    While Transformers are conquering many of the current NLP tasks, there are still quite a few tasks (e.g. some document classification) where they are complete overkill. Benchmarking Transformers against "classic" uncontextualized embedding models is a common, good practice and is now possible without switching frameworks. We added basic support for loading embedding models like GloVe, Word2vec and FastText and using them as LanguageModels in FARM (#285)

    See the example script

    We also added a new pooling method to get sentence or document embeddings from these models that can act as a strong baseline for transformer-based approaches (e.g. Sentence-BERT). The method is called S3E and was recently introduced by Wang et al. in "Efficient Sentence Embedding via Semantic Subspace Analysis" (#286)

    See the example script
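    A rough sketch of extracting baseline embeddings with such a model (the model name is assumed from the GloVe models added in #339; S3E pooling itself is shown in the example script above):

    from farm.infer import Inferencer

    model = Inferencer.load("glove-english-uncased-6B",          # assumed model name, see #339
                            task_type="embeddings",
                            extraction_strategy="reduce_mean",   # simple mean pooling as a baseline
                            extraction_layer=-1,
                            gpu=False)
    result = model.inference_from_dicts(dicts=[{"text": "Transformers can be overkill for simple tasks."}])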


    A few more changes ...

    Modeling

    • Cross-validation for Question-Answering #335
    • Add option to use max_seq_len tokens for LM Adaptation/Training-from-scratch instead of real sentences #314
    • Add english glove models #339
    • Implicitly connect heads with processor + check for connection #337

    Evaluation & Inference

    • Registration of custom evaluation reports #331
    • Standalone Evaluation with pretrained models #330
    • tqdm progress bar in inferencer #338
    • Group NER preds by sample #327
    • Fix Processor configs when loading Inferencer #318

    Other

    • Fix the IOB2 to simple tags check #324
    • Update config when saving model to include changes of parameters #323
    • Fix Issues with NER format Conversion #322
    • Fix error message in loading of Tokenizer #317
    • Less verbosity, Fix which Samples and Baskets being Thrown Away #313

    :man_farmer: :woman_farmer: Thanks to all contributors for making FARMer's life better! @brandenchan, @tanaysoni, @Timoeller, @tholor, @bogdankostic, @gsarti

    Source code(tar.gz)
    Source code(zip)
  • 0.4.2(Apr 2, 2020)

    :fast_forward: Scalable preprocessing: StreamingDataSilo

    Allows you to load data lazily from disk and preprocess a batch on-the-fly when needed during training.

    stream_data_silo = StreamingDataSilo(processor=processor, batch_size=batch_size)

    => Allows large datasets that don't fit in memory (e.g. for training from scratch)
    => Training starts directly; no initial preprocessing time needed.

    :rocket: Better Inference: Speed, scalability, standardization

    ONNX support:

    Microsoft recently added optimizations to the ONNX-runtime and reported substantial speed-ups compared to PyTorch. Since these improvements can be particularly useful for inference-heavy tasks such as QA, we added a way to export your AdaptiveModel to the ONNX format and load it into the Inferencer:

    model = AdaptiveModel(...)
    model.convert_to_onnx(Path("./onnx_model"))
    inferencer = Inferencer.load(model_name_or_path=Path("./onnx_model"))
    

    => See example
    => Speed improvements depend on device and batch size. On a Tesla V100 we measured improvements between 30% and 260% for end-to-end QA inference on a large document, and we still see more potential for optimizations.

    | Batch Size | PyTorch | ONNX | ONNX V100 optimizations | Speedup |
    |------------|---------|------|-------------------------|---------|
    | 1          | 27.5    | 12.8 | 10.6                    | 2.59    |
    | 2          | 17.5    | 11.5 | 9.1                     | 1.92    |
    | 4          | 12.5    | 10.7 | 8.3                     | 1.50    |
    | 8          | 10.6    | 10.2 | 8.2                     | 1.29    |
    | 16         | 10.5    | 10.1 | 7.8                     | 1.38    |
    | 32         | 10.1    | 9.8  | 7.8                     | 1.29    |
    | 64         | 9.9     | 9.8  | 7.8                     | 1.26    |
    | 128        | 9.9     | 9.8  | 7.7                     | 1.28    |
    | 256        | 10.0    | 9.8  | 7.9                     | 1.26    |

    Embedding extraction:

    Extracting embeddings from a model at inference time is now more similar to other inference modes.

    Old

    model = Inferencer.load(lang_model, task_type="embeddings", gpu=use_gpu, batch_size=batch_size)
    result = model.extract_vectors(dicts=basic_texts, extraction_strategy="cls_token", extraction_layer=-1)
    

    New

    model = Inferencer.load(lang_model, task_type="embeddings", gpu=use_gpu, batch_size=batch_size,
                                extraction_strategy="cls_token", extraction_layer=-1)
    result = model.inference_from_dicts(dicts=basic_texts, max_processes=1)
    

    => The preprocessing can now also utilize multiprocessing
    => It's easier to reuse other methods like Inferencer.inference_from_file()

    :left_right_arrow: New tasks: TextPairClassification & Passage ranking

    Added support for text pair classification and ranking. Both can be especially helpful in semantic search settings where you want to (re-)rank search results and will be incorporated in our haystack framework soon. Examples:
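    For instance, a rough sketch of setting up a text pair task (assuming the new TextPairClassificationProcessor keeps the usual TextClassificationProcessor arguments; the data path and labels are placeholders):

    from farm.modeling.tokenization import Tokenizer
    from farm.data_handler.processor import TextPairClassificationProcessor

    tokenizer = Tokenizer.load("bert-base-cased", do_lower_case=False)
    processor = TextPairClassificationProcessor(tokenizer=tokenizer,
                                                max_seq_len=128,
                                                data_dir="data/asnq_binary",   # placeholder path
                                                label_list=["0", "1"],         # placeholder labels
                                                metric="acc")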


    A few more changes ...

    Faster & simpler Inference

    • Make extract_vectors more compatible to other inference types #292
    • Add test for onnx qa inference. Fix bug in loading PHs for ONNX. #297
    • Add ONNX Inference for Question Answering #288
    • Improve inferencer for better multiprocessing with QA / haystack #278
    • Scalable Qa aggregation #268
    • Allow for multiple queries in QA inference when using rest_api format #246
    • Decouple n_best in QA predictions #269
    • Correct keyword argument for max_processes when used by calc_chunksize() #255
    • Add document id to QA inference #265
    • Refactor no answer handling in Question Answering #258

    Streaming Data Silo / Training from scratch

    • StreamingDataSilo for loading & preprocessing batches lazily during training #239
    • Fix dict chunking in StreamingDataSilo for LMFinetuning #284
    • Add example for training with AWS SageMaker #283
    • Fix deletion of old training checkpoints #282
    • Fix epoch number for saving a training checkpoint #281
    • Fix Train Step calculations for Checkpointing #279
    • Implement len() for StreamingDataSilo #274
    • Refactor StreamingDataSilo to support multiple train epochs #266
    • Fix serialization for saving train checkpoints #271

    Modeling

    • Add support for text pair classification (ASNQ) and ranking (MSMarco) #237
    • Add conversion of lm_finetuned to HF transformers #290
    • Added next_sentence_head in examples/lm_finetuning.py. #273
    • Quickfix loading pred head #256
    • Make use of 'language' kwarg if present in LanguageModel.load #262
    • Add the option to define the language model class manually #264
    • Fix XLMR Bug When Calculating Start of Second Sequence #240

    Examples / Tutorials / Experiments

    • Add data handling for GermEval14, add checks for correct data files #259
    • Fix separator in CoNLL_de experiment config #254
    • Use correct German conll03 data + conversion #248
    • Bugfix parameter loading through experiment configs #252
    • Add early stopping to experiment #253
    • Fix Tutorial: Add missing param in initialize_optimizer #245

    Other

    • Add Azure test pipeline #270
    • Fix progress bar in datasilo #267
    • Turn off prints and logging during testing #260
    • Pin Werkzeug version in requirements.txt #250
    • Add ConnectionError handling for MLFlow logger #236
    • Clearer message when DataSilo calculates Sequence Lengths #293
    • Add metric to text_pair_classification example #294
    • Add preprocessed CORD-19 dataset #295

    :man_farmer: :woman_farmer: Thanks to all contributors for making FARMer's life better! @brandenchan, @tanaysoni, @Timoeller, @tholor, @bogdankostic, @andra-pumnea, @PhilipMay, @ftesser, @guggio

    Source code(tar.gz)
    Source code(zip)
  • 0.4.1(Feb 3, 2020)

    :man_farmer: :arrows_counterclockwise: :hugs: Full compatibility with Transformers' models

    Open-source is more than just public code. It's a mindset of sharing, being transparent and collaborating across organizations. It's about building on the shoulders of other projects and advancing the state of technology together. That's why we built on top of the great Transformers library by huggingface and are excited to release an even deeper compatibility today that simplifies the exchange & comparison of models.

    1. Convert models from/to transformers

    model = AdaptiveModel.convert_from_transformers("deepset/bert-base-cased-squad2", device="cpu", task_type="question_answering")
    transformer_model = model.convert_to_transformers()
    

    2. Load models from their new model hub:

    LanguageModel.load("TurkuNLP/bert-base-finnish-cased-v1")
    Inferencer.load("deepset/bert-base-cased-squad2",  task_type="question_answering")
    ...
    

    :rocket: Better & Faster Training

    Thanks to @BramVanroy and @johann-petrak we got some really hot new features here:

    • Automatic Mixed Precision (AMP) Training: Speed up your training by ~35%! Model params are usually stored with FP32 precision. Some model layers don't need that precision and can be reduced to FP16, which speeds up training and reduces memory footprint. AMP is a smart way of figuring out for which params we can reduce precision without sacrificing performance (Read more). Test it by installing apex and setting "use_amp" to "O1" in one of the FARM example scripts (see the sketch after this list).

    • More flexible Optimizers & Schedulers: Choose whatever optimizer you like from PyTorch, apex or Transformers. Take your preferred learning rate schedule from Transformers or PyTorch (Read more)

    • Cross-validation: Get more reliable eval metrics on small datasets (see example)

    • Early Stopping: With early stopping, the run stops once a chosen metric is not improving any further and you take the best model up to this point. This helps prevent overfitting on small datasets and reduces training time if your model doesn't improve any further (see example).
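    A minimal sketch of switching on AMP via the optimizer setup (assuming FARM's initialize_optimizer signature; the model, layer dims and batch counts are illustrative, and apex plus a CUDA device are required):

    from farm.modeling.adaptive_model import AdaptiveModel
    from farm.modeling.language_model import LanguageModel
    from farm.modeling.optimization import initialize_optimizer
    from farm.modeling.prediction_head import TextClassificationHead

    device = "cuda"
    language_model = LanguageModel.load("bert-base-cased")
    prediction_head = TextClassificationHead(layer_dims=[768, 2])
    model = AdaptiveModel(language_model=language_model,
                          prediction_heads=[prediction_head],
                          embeds_dropout_prob=0.1,
                          lm_output_types=["per_sequence"],
                          device=device)

    # use_amp="O1" enables mixed precision via apex; use_amp=None keeps plain FP32
    model, optimizer, lr_schedule = initialize_optimizer(model=model,
                                                         learning_rate=3e-5,
                                                         device=device,
                                                         n_batches=100,   # illustrative
                                                         n_epochs=1,
                                                         use_amp="O1")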

    :fast_forward: Caching & Checkpointing

    Save time if you run similar pipelines (e.g. only experimenting with model params): Store your preprocessed dataset & load it next time from cache:

    data_silo = DataSilo(processor=processor, batch_size=batch_size, caching=True)
    

    Start & stop training by saving checkpoints of the trainer:

    trainer = Trainer.create_or_load_checkpoint(
                ...
                checkpoint_on_sigterm=True,
                checkpoint_every=200,
                checkpoint_root_dir=Path("/opt/ml/checkpoints/training"),
                resume_from_checkpoint="latest")
    

    The checkpoints include the state of everything that matters (model, optimizer, lr_schedule ...) to resume training. This is particularly useful if your training crashes (e.g. because you are using spot cloud instances).

    :cloud: Integration with AWS SageMaker & Training from scratch

    We are currently working a lot on simplifying large scale training and deployment. As a first step, we are adding support for training on AWS SageMaker. The interesting part here is the option to use Spot Instances and save about 70% of costs compared to regular instances. This is particularly relevant for training models from scratch, which we introduce in a basic version in this release and will improve over the next weeks. See this tutorial to get started with using SageMaker for training on down-stream tasks.

    :computer: Windows support

    FARM now also runs on Windows. This implies one breaking change: We now use pathlib and therefore expect all directory paths to be of type Path instead of str #172


    A few more changes ...

    Modelling

    • [enhancement] ALBERT support #169
    • [enhancement] DistilBERT support #187
    • [enhancement] XLM-Roberta support #181
    • [enhancement] Automatically infer layer dims of prediction head #195
    • [bug] Implement next_sent_pred flag #198

    QA

    • [enhancement] Encoding of QA IDs #171
    • [enhancement] Remove repeat QA preds from overlapping passages #186
    • [enhancement] More options to control predictions of Question Answering Head #183
    • [bug] Fix QA example #203

    Training

    • [enhancement] Use AMP instead of naive fp16. More optimizers. More LR Schedules. #133
    • [bug] Fix for use AMP instead of naive fp16 (#133) #180
    • [enhancement] Add early stopping and custom metrics #165
    • [enhancement] Add checkpointing for training #188
    • [enhancement] Add train loss to tqdm. add desc for data preproc. log only 2 samples #175
    • [enhancement] Allow custom functions to aggregate loss of prediction heads #220

    Eval

    • [bug] Fixed micro f1 score #179
    • [enhancement] Rename classification_report to report #173

    Data Handling

    • [enhancement] Add caching of datasets in DataSilo #177
    • [enhancement] Add option to limit number of processes in datasilo #174
    • [enhancement] Add max_multiprocessing_chunksize as a param for DataSilo #168
    • [enhancement] Issue59 - Add cross-validation for small datasets #167
    • [enhancement] Add max_samples argument to TextClassificationProcessor #204
    • [bug] Fix bug with added tokens #197

    Other

    • [other] Disable multiprocessing in lm_finetuning tests to reduce memory footprint #176
    • [bug] Fix device arg in examples #184
    • [other] Add error message to train/dev split fn #190
    • [enhancement] Add more seeds #192

    :man_farmer: :woman_farmer: Thanks to all contributors for making FARMer's life better! @brandenchan, @tanaysoni, @Timoeller, @tholor, @maknotavailable, @johann-petrak, @BramVanroy

    Source code(tar.gz)
    Source code(zip)
  • 0.3.2(Nov 28, 2019)

    :paintbrush: Fundamental Re-design of Question Answering

    We believe QA is one of the most exciting tasks for transfer learning. However, the complexity of the task lets pipelines easily become messy, complicated and slow. This is unacceptable for production settings and creates a high barrier for developers to modify or improve them.

    We put substantial effort in re-designing QA in FARM with two goals in mind: making it the simplest & fastest pipeline out there. Results:

    • :bulb: Simplicity: The pipeline is cleaner, more modular and easier to extend.
    • :rocket: Speed: Preprocessing of SQuAD 2.0 got down to 42s on an AWS p3.8xlarge (vs. ~20 min in transformers and early versions of FARM). This will not only speed up training cycles and reduce GPU costs, but also has a big impact at inference time, where most time is actually spent on preprocessing.

    See this blog post for more details and to learn about the key steps in a QA pipeline.

    :briefcase: Support of proxy servers

    Good news for our corporate users: Many of you told us that the automated downloads of datasets / models caused problems in environments with proxy servers. You can now pass the proxy details to Processor and LanguageModel in the format used by the requests library

    Example:

    proxies = {"https": "http://user:pass@proxy_server:8000"}  # host and credentials are placeholders
    
    language_model = LanguageModel.load(pretrained_model_name_or_path = "bert-base-cased", 
                                        language = "english",
                                        proxies=proxies
                                        )
    ...
    processor = BertStyleLMProcessor(data_dir="data/lm_finetune_nips", 
                                     tokenizer=tokenizer,
                                     max_seq_len=128, 
                                     max_docs=25,
                                     next_sent_pred=True,
                                     proxies = proxies,
                                    )
    

    Modelling

    • [enhancement] QA redesign #151
    • [enhancement] Add backwards compatibility for loading prediction head #159
    • [enhancement] Raise an Exception when an invalid path is supplied for loading a saved model #137
    • [bug] fix context in QA formatted preds #163
    • [bug] Fix loading custom vocab in transformers style for LM finetuning #155

    Data Handling

    • [enhancement] Allow to load dataset from dicts in DataSilo #127
    • [enhancement] Option to supply proxy server #136
    • [bug] Fix tokenizer for multiple whitespaces #156

    Inference

    • [enhancement] Change context in QA formatted preds to not split words #138

    Other

    • [enhancement] Add test for output format of QA Inferencer #149
    • [bug] Fix classification report for multilabel #150
    • [bug] Fix inference in doc_classification_cola example #147

    Thanks to all contributors for making FARMer's life better! @johann-petrak, @brandenchan, @tanaysoni, @Timoeller, @tholor, @cregouby

    Source code(tar.gz)
    Source code(zip)
  • 0.3.1(Nov 4, 2019)

    Improved Question Answering

    Aggregation over multiple passages

    When asking questions on long documents, the underlying language model needs to cut the document into multiple passages and answer the question on each of them. The outputs then need to be aggregated.

    Improved QA Inferencer

    The QA Inferencer

    • projects model predictions back to character space
    • can be used in the FARM demos UI
    • writes predictions in SQuAD style format, so you can compare the model accuracy with other frameworks

    Modelling

    • [closed] Refactor squad qa #131
    • [enhancement][**part: model**] Fix passing kwargs to LM loading (e.g. proxy) #132
    Source code(tar.gz)
    Source code(zip)
  • 0.3.0(Oct 28, 2019)

    Major Changes

    Adding Roberta & XLNet

    Welcome RoBERTa and XLNet on the FARM :tada:! We did some intense refactoring in FARM to make it easier to add more language models. However, we will only add models where we see some decent advantages. One of the next models to follow will very likely be ALBERT ...

    For now, we support RoBERTa/XLNet for (multilabel) text classification, text regression and NER. QA will follow soon.

    :warning: Breaking Change - Loading of Language models has changed: Bert.load("bert-base-cased") -> LanguageModel.load("bert-base-cased")

    Migrating to tokenizers from the transformers repo.

    Pros:

    • It's quite easy to add a tokenizer for any of the models implemented in transformers.
    • We would rather support development there than build something in parallel
    • The additional metadata during tokenization (offsets, start_of_word) is still created via tokenize_with_metadata
    • We can use encode_plus to add model specific special tokens (CLS, SEP ...)

    Cons:

    • We had to deprecate our attribute "never_split_chars" that allowed adjusting the BasicTokenizer of BERT.
    • Custom vocab is now realized by increasing vocab_size instead of replacing unused tokens

    :warning: Breaking Change - Loading of tokenizers has changed: BertTokenizer.from_pretrained("bert-base-cased") -> Tokenizer.load("bert-base-cased")

    :warning: Breaking Change - never_split_chars: is no longer supported as an argument for the Tokenizer


    Modelling:

    • [enhancement] Add Roberta, XLNet and redesign Tokenizer #125
    • [bug] fix loading of old tokenizer style #129

    Data Handling:

    • [bug] Fix name of squad labels in experiment config #121
    • [bug] change arg in squadprocessor from labels to label_list #123

    Inference:

    • [enhancement] Add option to disable multiprocessing in Inferencer(#117) #128
    • [bug] Fix logging verbosity in Inferencer (#117) #122

    Other

    • [enhancement] Tutorial update #116
    • [enhancement] Update docs for api/ui docker #118
    Source code(tar.gz)
    Source code(zip)
  • 0.2.2(Oct 14, 2019)

    Major Changes

    Parallelization of Data Preprocessing :rocket:

    Data preprocessing via the Processor is now fast while maintaining a low memory footprint. Before, the parallelization via multiprocessing was causing serious memory issues on larger data sets (e.g. for Language Model fine-tuning). Now, we run a small chunk through the whole processor (-> Samples -> Featurization -> Dataset ...). The multiprocessing is handled by the DataSilo now, which simplifies the implementation.

    With this new approach we can still easily inspect & debug all important transformations for a chunk, but only keep the resulting dataset in memory once a process has finished with a chunk.
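    A generic sketch of the chunk-wise idea (not FARM's actual implementation): each worker takes one chunk all the way to its final representation, so only the small result is kept in memory:

    from multiprocessing import Pool

    def process_chunk(chunk):
        # stand-in for dicts -> Samples -> Features -> Dataset; only the result is returned
        return [len(text) for text in chunk]

    def chunks(seq, size):
        for i in range(0, len(seq), size):
            yield seq[i:i + size]

    if __name__ == "__main__":
        all_texts = ["some text to preprocess"] * 10_000
        with Pool(processes=4) as pool:
            results = pool.map(process_chunk, list(chunks(all_texts, 1000)))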

    Multilabel classification

    We now also support multilabel classification. Prepare your data by simply setting multilabel=True in the TextClassificationProcessor and use the new MultiLabelTextClassificationHead for your model. => See an example here

    Concept of Tasks

    To further simplify multi-task learning we added the concept of "tasks". With this you can now use one TextClassificationProcessor to preprocess data for multiple tasks (e.g. using two columns in your CSV for classification). Example:

    1. Add the tasks to the Processor:
        processor = TextClassificationProcessor(...)
    
        news_categories = ["Sports", "Tech", "Politics", "Business", "Society"]
        publisher = ["cnn", "nytimes","wsj"]
    
        processor.add_task(name="category", label_list=news_categories, metric="acc", label_column_name="category_label")
        processor.add_task(name="publisher", label_list=publisher, metric="acc", label_column_name="publisher_label")
    
    2. Link the data to the right PredictionHead by supplying the task name at initialization:
    category_head = MultiLabelTextClassificationHead(layer_dims=[768, 5], task_name="category")
    publisher_head = MultiLabelTextClassificationHead(layer_dims=[768, 3], task_name="publisher")
    

    Update to transformers 2.0

    We are happy to see how huggingface's repository is growing and how they made another major step with the new 2.0 release. Since their collection of language models is awesome, we will continue building upon their language models and tokenizers. However, we will keep following a different philosophy for all other components (dataprocessing, training, inference, deployment ...) to improve usability, allow multitask learning and simplify usage in the industry.


    Modelling:

    • [enhancement] Add Multilabel Classification (#89)
    • [enhancement] Add PredictionHead for Regression task (#50)
    • [enhancement] Introduce concept of "tasks" to support of multitask training using multiple heads of the same type (e.g. for multiple text classification tasks) (#75)
    • [enhancement] Update dependency to transformers 2.0 (#106)
    • [bug] TypeError: classification_report() got an unexpected keyword argument 'target_names' #93
    • [bug] Fix issue with class weights (#82)

    Data Handling:

    • [enhancement] Chunkwise multiprocessing to reduce memory footprint in preprocessing large datasets (#88)
    • [bug] Threading Error upon building Data Silo #90
    • [bug] Multiprocessing causes data preprocessing to crash #110 (https://github.com/deepset-ai/FARM/issues/102)
    • [bug] Multiprocessing Error with PyTorch Version 1.2.0 #97
    • [bug] Windows fixes (#109)

    Inference:

    • [enhancement] excessive uncalled-for warnings when using the inferencer #104
    • [enhancement] Get probability distribution over all classes in Inference mode (#102)
    • [enhancement] Add InferenceProcessor (#72)
    • [bug] Fix classification report bug with binary doc classification

    Other:

    • [enhancement] Add more tests (#108)
    • [enhancement] do logging within run_experiment() (#37)
    • [enhancement] Improved logging (#82, #87 #105)
    • [bug] fix custom vocab for bert-base-cased (#108)

    Thanks to all contributors: @tripl3a, @busyxin, @AhmedIdr, @jinnerbichler, @Timoeller, @tanaysoni, @brandenchan , @tholor

    👩‍🌾 Happy FARMing!

    Source code(tar.gz)
    Source code(zip)
  • 0.2.0(Aug 19, 2019)

    Besides fixing various smaller bugs, we focussed in this release on two major changes:

    1. Speeding things up :rocket: :

    • By adding multiprocessing to the data preprocessing, we reduced the execution time for many tasks from hours to minutes. Since the functionality is mostly hidden in the parent class, the user doesn't have to implement anything on their own. However, this required changing the interface of the processor slightly. _dict_to_samples and _sample_to_features must now be classmethods and all objects accessed by them must be class attributes.
    • Multi-GPU support is now also available for the "building blocks mode"

    2. Making the processor more user friendly :blush: :

    • Instead of having one individual processor per dataset, we have implemented a more generic TextClassificationProcessor that you can instantiate easily for various predefined tasks (GNAD, GermEval ...) or your own dataset in CSV/TSV format
    processor = TextClassificationProcessor(tokenizer=tokenizer,
                                            max_seq_len=128,
                                            data_dir="../data/germeval18",
                                            columns=["text", "label", "unused"],
                                            label_list=["OTHER", "OFFENSE"],
                                            metrics=["f1_macro"]
                                            ) 
    

    Thanks for contributing @brandenchan @tanaysoni @tholor @Timoeller @tripl3a @Seb0 @waldemarhahn !


    Modeling:

    • [bug] Accuracy metric in LM finetuning always zero #30
    • [enhancement] Multi-GPU only enabled in experiment mode #57
    • [bug] Wrong number of total steps for linear warmup schedule #46

    Data Handling:

    • [enhancement] Unify redundant Processor; add new NERProcessor and TextClassificationProcessor
    • [enhancement] Add parallel dataprocessing #45
    • [bug] dev_size param in run-by-config is being ignored #49
    • [bug] output_dir parameter in run by config is being ignored #39
    • [bug] Error when running by config with a list of batch sizes #38

    Documentation:

    • [bug] LM finetuning example missing data #47
    • [bug] Colab Notebook referenced in readme does not work #27

    Other:

    • [enhancement] Proposition: improve dependency management with pipenv #35
    Source code(tar.gz)
    Source code(zip)
  • 0.1.2(Jul 29, 2019)
