Natural Language Processing Best Practices & Examples


In recent years, natural language processing (NLP) has seen rapid growth in quality and usability, and this has helped to drive business adoption of artificial intelligence (AI) solutions. Researchers have been applying newer deep learning methods to NLP, and data scientists have moved from traditional methods to state-of-the-art (SOTA) deep neural network (DNN) algorithms that use language models pretrained on large text corpora.

This repository contains examples and best practices for building NLP systems, provided as Jupyter notebooks and utility functions. The focus of the repository is on state-of-the-art methods and common scenarios that are popular among researchers and practitioners working on problems involving text and language.

Overview

The goal of this repository is to build a comprehensive set of tools and examples that leverage recent advances in NLP algorithms, neural architectures, and distributed machine learning systems. The content is based on our past and potential future engagements with customers as well as collaboration with partners, researchers, and the open source community.

We hope that these tools can significantly reduce the “time to market” by simplifying the path from defining the business problem to developing a solution. In addition, the example notebooks serve as guidelines and showcase best practices and usage of the tools in a wide variety of languages.

In an era of transfer learning, transformers, and deep architectures, we believe that pretrained models provide a unified solution to many real-world problems and make it easy to handle different tasks and languages. We will, therefore, prioritize such models, as they achieve state-of-the-art results on several NLP benchmarks, such as the GLUE and SQuAD leaderboards. These models can be used in applications ranging from simple text classification to sophisticated intelligent chatbots.

Note that for certain kinds of NLP problems, you may not need to build your own models. Instead, pre-built or easily customizable solutions exist that do not require any custom coding or machine learning expertise. We strongly recommend evaluating whether these can sufficiently solve your problem. If they are not applicable, or their accuracy is not sufficient, then resorting to more complex and time-consuming custom approaches may be necessary. The following cognitive services offer simple solutions for common NLP tasks:

Text Analytics is a set of pre-trained REST APIs that can be called for sentiment analysis, key phrase extraction, language detection, named entity recognition, and more. These APIs work out of the box and require minimal machine-learning expertise, but have limited customization capabilities (see the call sketch after this list).

QnA Maker is a cloud-based API service that lets you create a conversational question-and-answer layer over your existing data. Use it to build a knowledge base by extracting questions and answers from your semi-structured content, including FAQs, manuals, and documents.

Language Understanding is a SaaS service for training and deploying a model as a REST API given a user-provided training set. You can perform intent classification as well as named entity extraction simply by providing example utterances and labelling them. It supports active learning, so your model keeps learning and improving.
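As an illustration of how little code these prebuilt services require, here is a minimal sketch of calling the Text Analytics sentiment endpoint from Python. The endpoint and key values are placeholders for your own Azure resource, and the request shape follows the public v3.0 REST API:

    import requests

    # Placeholders: substitute your own Azure resource endpoint and key.
    endpoint = "https://<your-resource>.cognitiveservices.azure.com"
    key = "<your-subscription-key>"

    payload = {"documents": [
        {"id": "1", "language": "en", "text": "The new model works great!"}
    ]}

    response = requests.post(
        f"{endpoint}/text/analytics/v3.0/sentiment",
        headers={"Ocp-Apim-Subscription-Key": key},
        json=payload,
    )
    print(response.json())  # per-document sentiment labels and confidence scores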

Target Audience

Our target audience for this repository includes data scientists and machine learning engineers with varying levels of NLP knowledge, as our content is source-only and targets custom machine learning modelling. The utilities and examples provided are intended to be solution accelerators for real-world NLP problems.

Focus Areas

The repository aims to expand NLP capabilities along three separate dimensions:

Scenarios

We aim to have end-to-end examples of common tasks and scenarios such as text classification and named entity recognition.

Algorithms

We aim to support multiple models for each of the supported scenarios. Currently, transformer-based models are supported across most scenarios. We have been working on integrating the transformers package from Hugging Face, which allows users to easily load pretrained models and fine-tune them for different tasks.
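As a minimal sketch of what that integration enables (using the Hugging Face transformers API directly, not the repository's own wrappers), loading a pretrained checkpoint and running one fine-tuning step might look like this:

    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    # Load a pretrained checkpoint and its matching tokenizer.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )

    # Tokenize a toy batch and compute the classification loss.
    batch = tokenizer(
        ["great movie", "terrible movie"], padding=True, return_tensors="pt"
    )
    labels = torch.tensor([1, 0])
    loss = model(**batch, labels=labels).loss
    loss.backward()  # an optimizer step would follow in a real training loop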

Languages

We strongly subscribe to the multi-language principles laid down by Emily Bender:

  • "Natural language is not a synonym for English"
  • "English isn't generic for language, despite what NLP papers might lead you to believe"
  • "Always name the language you are working on" (Bender rule)

The repository aims to support non-English languages across all scenarios. Pre-trained models used in the repository, such as BERT and fastText, support 100+ languages out of the box. Our goal is to provide end-to-end examples in as many languages as possible. We encourage community contributions in this area.
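For instance, a single multilingual checkpoint can tokenize text in many languages out of the box; a brief sketch with the transformers package:

    from transformers import AutoTokenizer

    # One checkpoint, many languages: multilingual BERT covers 100+ languages.
    tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

    for text in ["The weather is nice", "Das Wetter ist schön", "天気がいいですね"]:
        print(tokenizer.tokenize(text))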

Content

The following is a summary of the commonly used NLP scenarios covered in the repository. Each scenario is demonstrated in one or more Jupyter notebook examples that make use of the core code base of models and repository utilities.

| Scenario | Models | Description | Languages |
|---|---|---|---|
| Text Classification | BERT, DistilBERT, XLNet, RoBERTa, ALBERT, XLM | Text classification is a supervised learning method of learning and predicting the category or class of a document given its text content. | English, Chinese, Hindi, Arabic, German, French, Japanese, Spanish, Dutch |
| Named Entity Recognition | BERT | Named entity recognition (NER) is the task of classifying words or key phrases of a text into predefined entities of interest. | English |
| Text Summarization | BERTSumExt, BERTSumAbs, UniLM (s2s-ft), MiniLM | Text summarization is a language generation task of summarizing the input text into a shorter paragraph. | English |
| Entailment | BERT, XLNet, RoBERTa | Textual entailment is the task of classifying the binary relation between two natural-language texts, a text and a hypothesis, to determine whether the text agrees with the hypothesis. | English |
| Question Answering | BiDAF, BERT, XLNet | Question answering (QA) is the task of retrieving or generating a valid answer for a given query in natural language, given a passage related to the query. | English |
| Sentence Similarity | BERT, GenSen | Sentence similarity is the process of computing a similarity score for a pair of text documents. | English |
| Embeddings | Word2Vec, fastText, GloVe | Embedding is the process of converting a word or a piece of text into a continuous vector space of real numbers, usually in low dimension. | English |
| Sentiment Analysis | Dependency Parser, GloVe | Provides an example of training and using aspect-based sentiment analysis with Azure ML and Intel NLP Architect. | English |
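To make one of these scenarios concrete, here is a hedged sketch of sentence similarity (not the repository's own utilities): scoring a pair of sentences by cosine similarity of mean-pooled BERT token embeddings.

    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def embed(sentence):
        # Mean-pool the final hidden states into one fixed-size sentence vector.
        inputs = tokenizer(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
        return hidden.mean(dim=1).squeeze(0)

    a = embed("A man is playing a guitar.")
    b = embed("Someone is playing an instrument.")
    print(torch.cosine_similarity(a, b, dim=0).item())  # similarity in [-1, 1]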

Getting Started

When solving NLP problems, it is always good to start with the prebuilt Cognitive Services. When your needs go beyond the bounds of the prebuilt services and you want custom machine learning methods, you will find this repository very useful. To get started, navigate to the Setup Guide, which lists instructions on how to set up your environment and dependencies.

Azure Machine Learning Service

Azure Machine Learning service is a cloud service used to train, deploy, automate, and manage machine learning models at the broad scale that the cloud provides. AzureML is used in notebooks across the different scenarios to enhance the efficiency of developing natural language systems at scale and for various AI model development tasks such as distributed training, hyperparameter tuning, and deployment.

To successfully run these notebooks, you will need an Azure subscription, or you can try Azure for free. Other Azure services or products may be used in the notebooks; an introduction and/or references for those are provided in the notebooks themselves.
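Most of these notebooks start by connecting to an AzureML workspace; a minimal sketch, assuming you have downloaded your workspace's config.json (AzureML Python SDK v1):

    from azureml.core import Workspace

    # Reads config.json from ./.azureml (or the working directory) and authenticates.
    ws = Workspace.from_config()
    print(ws.name, ws.resource_group, ws.location, sep="\n")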

Contributing

We hope that the open source community will contribute to the content and bring in the latest SOTA algorithms. This project welcomes contributions and suggestions. Before contributing, please see our contribution guidelines.

References

The following is a list of related repositories that we like and think are useful for NLP tasks.

| Repository | Description |
|---|---|
| Transformers | A great PyTorch library from Hugging Face with implementations of popular transformer-based models. We've been using their package extensively in this repo and greatly appreciate their effort. |
| Azure Machine Learning Notebooks | ML and deep learning examples with Azure Machine Learning. |
| AzureML-BERT | End-to-end recipes for pre-training and fine-tuning BERT using Azure Machine Learning service. |
| MASS | MASS: Masked Sequence to Sequence Pre-training for Language Generation. |
| MT-DNN | Multi-Task Deep Neural Networks for Natural Language Understanding. |
| UniLM | Unified Language Model Pre-training. |
| DialoGPT | DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation. |

Build Status

| Build | Branch | Status |
|---|---|---|
| Linux CPU | master | Build Status |
| Linux CPU | staging | Build Status |
| Linux GPU | master | Build Status |
| Linux GPU | staging | Build Status |
Comments
  • [ASK] Remove 'repo_metrics' folder


    Description

    During a team discussion, we agreed that the files in the 'repo_metrics' folder should be removed from the NLP repo and maintained in a centralized way.

    Other Comments

    enhancement 
    opened by yijingchen 31
  • Fix broken data path and add git clone cell


    Description

    The notebook 'embedding_trainer.ipynb' cannot be run end to end. Fixed the related issues:

    • Added the !git clone http://github.com/stanfordnlp/glove command inside the Jupyter Notebook. I found the experience smoother this way; however, I don't know why the author decided to leave it out. Please ask the author to verify this in case there are other risks that I'm not aware of.

    • The command cd glove && make gives an error, so I didn't add a cell for !cd glove && make. The notebook runs fine without this command. Please check with the author to see whether it is necessary to include it.

    • The data path used by from utils_nlp.dataset import stsbenchmark was updated in the util file, but this notebook hadn't been updated accordingly. I modified the path in this notebook.

    • With these fixes, the test pipeline should be able to run this notebook end to end.

    Related Issues

    https://github.com/microsoft/nlp/issues/230

    Checklist:

    • [X] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [x] I have added tests.
    • [ ] I have updated the documentation accordingly.
    opened by yijingchen 19
  • GenSen on AML deep dive notebook (sentence similarity)


    1. This notebook serves as an introduction to an end-to-end NLP solution for sentence similarity, building one of the advanced models, GenSen, on the AzureML platform. We show the advantages of AzureML when training large NLP models with GPUs.

    The notebook includes data loading and preprocessing, training the GenSen model with distributed PyTorch with Horovod on AzureML, and tuning with HyperDrive. Evaluation and deployment will be added later. In addition, comparison results for training and tuning on AML vs. a VM will be added once this initial PR is merged into staging.

    2. Provides refactored GenSen code in utils_nlp to make the model reusable.

    We provide a distributed PyTorch with Horovod implementation of the paper, along with pre-trained models and code to evaluate these models on a variety of transfer learning benchmarks. This code is based on the GitHub codebase from Maluuba, but we have refactored the code in the following aspects:

    1. Support distributed PyTorch with Horovod
    2. Clean and refactor the original code into a more structured form
    3. Change the training script (train.py) from never stopping to stopping when the validation loss reaches a local minimum (see the sketch after this list)
    4. Update the code from Python 2.7 to 3+ and PyTorch from 0.2/0.3 to 1.0.1
    5. Add some necessary comments
    6. Add some code for training on the AzureML platform
    7. Fix the bug where setting the batch size to 1 caused training to raise an error
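    For reference, a generic sketch of the validation-loss early stopping described in item 3 (illustrative only, not the actual GenSen training code):

        def fit_with_early_stopping(train_one_epoch, evaluate, max_epochs=50, patience=3):
            """Stop once validation loss has not improved for `patience` epochs."""
            best_loss, bad_epochs = float("inf"), 0
            for _ in range(max_epochs):
                train_one_epoch()           # caller-supplied training callback
                val_loss = evaluate()       # caller-supplied validation callback
                if val_loss < best_loss:
                    best_loss, bad_epochs = val_loss, 0
                else:
                    bad_epochs += 1
                    if bad_epochs >= patience:
                        break               # validation loss reached a local minimum
            return best_loss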
    opened by catherine667 16
  • [BUG] Cannot set up nlp_gpu environment


    Description

    The following errors show up while setting up the GPU environment:

        Collecting package metadata: done
        Solving environment: failed

        ResolvePackageNotFound: cudatoolkit==9.2

    How do we replicate the bug?

    Machine: Microsoft Azure Deep Learning Virtual Machine, Standard NC6
    Operating System: Windows
    Code:

        cd nlp
        python tools/generate_conda_file.py --gpu
        conda env create -n nlp_gpu -f nlp_gpu.yaml

    Expected behavior (i.e. solution)

    The installation should complete without errors.

    Other Comments

    I changed some package versions in the yaml file to this:

    • cudatoolkit>=9.2
    • tensorflow-gpu>=1.12.0

    The installation proceeds with the above configuration; however, another error occurs, shown below:

        ERROR: Command "'C:\Anaconda\envs\nlp_gpu\python.exe' -u -c 'import setuptools, tokenize;file='"'"'C:\Users\adminyijing\AppData\Local\Temp\2\pip-install-9454380z\horovod\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record 'C:\Users\adminyijing\AppData\Local\Temp\2\pip-record-su74tqwf\install-record.txt' --single-version-externally-managed --compile" failed with error code 1 in C:\Users\adminyijing\AppData\Local\Temp\2\pip-install-9454380z\horovod\

    bug 
    opened by yijingchen 13
  • [FEATURE] Check that all AzureML notebooks are tested


    Description

    We need to add a .azureml folder to provide the common AzureML subscription for the AzureML notebooks.

    How do we replicate the bug?

    The .azureml folder should contain a config.json file, which can be downloaded from the workspace and looks like this:

        {"Id": null, "Scope": "/subscriptions/[ID]/resourceGroups/nlprg/providers/Microsoft.MachineLearningServices/workspaces/[Workspace Name]"}

    Related notebooks: GenSen https://github.com/microsoft/nlp/pull/199 and BERT https://github.com/microsoft/nlp/pull/191 notebook testing

    related to https://github.com/microsoft/nlp/issues/143

    Expected behavior (i.e. solution)

    Other Comments

    #262

    bug release-blocker 
    opened by catherine667 10
  • Staging to master to add github metrics


    Description

    In order to start recording the metrics we need to merge the metrics to master.

    @irshaffe when this is in master, I need to activate it from DevOps to be executed every day; it will store the metrics so you can start populating the PowerBI dashboard. The original PowerBI dashboard for Recommenders was done with Scott (@gramhagen).

    Related Issues

    #24

    Checklist:

    • [ ] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [ ] I have added tests.
    • [ ] I have updated the documentation accordingly.
    opened by miguelgfierro 10
  • V chguan/add icml ex nlp code


    Description

    We have added the code of our ICML paper. The related files are:

    • interpreter.py and README.md files under utils_nlp\interpreter. The interpreter.py file is the main functional file we utilize. README.md is an instruction file on it.
    • explain_simple_model.ipynb and explain_BERT_model.ipynb files under scenarios\interpret_NLP_models, giving two scenarios showing how to use interpreter.py.
    • test_interpreter.py under tests\unit. This file contains 6 unit tests for interpreter.py (which, on my machine, take about 2.25s to run).
    • example.png under the utils_nlp\interpreter folder, used by README.md, and regular.json under the scenarios\interpret_NLP_models folder, used by explain_BERT_model.ipynb. I know from other pull requests that files like these are not allowed to be merged. So, can anyone help me upload these two files somewhere? Thanks for your help in advance : )

    Related Issues

    Our issue is #62.

    Checklist:

    • My code follows the code style of this project, as detailed in our contribution guidelines.
    • I have added tests.
    • [ ] I have updated the documentation accordingly (I now add README.md to utils_nlp only. What other .md files should I modify or add?).
    opened by Frozenmad 9
  • Transformers


    Description

    Related Issues

    Checklist:

    • [ ] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [ ] I have added tests.
    • [ ] I have updated the documentation accordingly.
    opened by saidbleik 8
  • Integration tests


    Description

    Integration and smoke tests

    @bethz, @jainr In the code I would like to have the sequence:

    1. create conda env
    2. run smoke
    3. run integration
    4. remove conda

    Beth told me that there might be a more elegant way of doing this. Can you please offer some guidance?

    @saidbleik @sharatsc the scheduler is not working at the moment (you might have seen the emails to devops). As a temporary solution, I thought of running this pipeline every time there is a PR to master. Feel free to propose another idea.

    Related Issues

    #25

    Checklist:

    • [ ] My code follows the code style of this project, as detailed in our contribution guidelines.
    • [ ] I have added tests.
    • [ ] I have updated the documentation accordingly.
    opened by miguelgfierro 8
  • Hlu/bert ner utils


    5/31: Notebook is updated with new dataset. Everything is ready to be reviewed! @saidbleik @miguelgfierro @yexing99


    5/29 updates: @saidbleik @miguelgfierro I made several updates based on recent discussions with Said. I still need to update the notebook with a new dataset, but the utility classes and functions are ready to be reviewed (I don't plan to make other significant changes besides addressing review comments.) Please ignore the bert_data_utils.py file for now, I need to update it for the new dataset. Some functions in the common_ner.py are from Said's sequence classification PR. I will merge this with the common.py once Said completes his PR.


    5/20 updates: @saidbleik @miguelgfierro I made another update based on our discussion last week.

    I got rid of the InputFeature class. I also tried to get rid of the InputExample class, but found it hard. If we pass the data around as tuples, there are a few possible scenarios:

    a. Single sentence data with label: (sentence_text, label)
    b. Single sentence data without label: (sentence_text,)
    c. Two sentence data with label: (sentence_1_text, sentence_2_text, label)
    d. Two sentence data without label: (sentence_1_text, sentence_2_text)

    As you can see, a and d can be confusing, unless we have different sets of code for single-sentence tasks and two-sentence tasks. I renamed InputExample to BertInputData and created a namedtuple version of it (see the sketch below). Please take a look at bert_data_utils.py.
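    A sketch of the namedtuple approach (field names are illustrative; see bert_data_utils.py for the actual definition):

        from collections import namedtuple

        # Optional fields default to None so one type covers all four cases (a-d).
        BertInputData = namedtuple("BertInputData", ["text_a", "text_b", "label"])
        BertInputData.__new__.__defaults__ = (None, None)  # text_b and label optional

        case_a = BertInputData("A single sentence.", label=1)
        case_d = BertInputData("First sentence.", "Second sentence.")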

    I'm still keeping the tokenization step outside of the classifier, but changed the tokenization utility function to output a TensorDataset instead of InputFeature. TensorDataset helps wrap multiple tensors without using InputFeature.
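    For illustration, a minimal sketch of the tokenize-then-wrap pattern using standard PyTorch types (not the PR's actual utility function):

        import torch
        from torch.utils.data import DataLoader, TensorDataset

        # Suppose tokenization produced these aligned tensors for four examples.
        input_ids = torch.randint(0, 30522, (4, 16))
        attention_mask = torch.ones(4, 16, dtype=torch.long)
        labels = torch.tensor([0, 1, 1, 0])

        # TensorDataset wraps the tensors; DataLoader handles batching and shuffling.
        dataset = TensorDataset(input_ids, attention_mask, labels)
        for ids, mask, y in DataLoader(dataset, batch_size=2, shuffle=True):
            print(ids.shape, mask.shape, y.shape)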

    I'm flexible with using or not using the configuration class.

    Let's seek more evidence to finalize these decisions as Miguel suggested.

    5/16 updates: @saidbleik @miguelgfierro I made another pass through the code. Three major changes:

    1. Consolidated some utility functions into the BertTokenClassifier class.
    2. Removed some unnecessary configurations.
    3. Added docstring.

    In general, I followed the BertSequenceClassifier Said wrote, but made a few different design decisions.

    • Use a single BertFineTuneConfig class to set all parameters. BertFineTuneConfig is initialized from a dictionary. Users can set parameters in a yaml file, then load the yaml file into a dictionary. I think this makes the code less verbose when we want to give users more control, and it also makes it easier for users to document how they run their experiments.
      I also store all configurations in the BertTokenClassifier object, in case one needs to pickle the model and use it somewhere else.
    • Keep the tokenization step outside of the classifier class. I think this is a preprocessing step and shouldn't be included in the classifier. It also helps users understand better what they are doing. We want to abstract things to improve reusability, but a sequence of smaller black boxes may help users understand the process better than one big black box.
    • Keep the InputExample and InputFeatures classes, and use the PyTorch DataLoader instead of a custom function to create batches. I think using standard data structures will make code written by different people look more consistent. There may be some initial learning curve, but it could be helpful in the long run. The fields in InputExample and InputFeatures also help users understand how BERT works.

    I will try to catch you guys to discuss these in the next couple of days. Please take a look at the updated code if you have time. Thanks!

    I still need to refine some functions and improve the formatting, but want to create this PR for people to review and comment. @miguelgfierro

    opened by hlums 8
  • [ASK] Improve user experience for long running notebooks


    Description

    Some notebooks take a long time to run. For external data scientists who want to try things out quickly and see how they work, this is not a pleasant experience. Here are some ideas for improvements:

    • Each notebook adds a note section describing the machine configuration (e.g., number of GPUs) and the estimated time to finish running the notebook, so that users won't be surprised.
    • Another idea is to set the notebook defaults to run on smaller data with smaller parameters, and then add another section guiding users to change them for a larger experiment, so they know they'll face a long running time.

    Notebook running time (Last update: 8/1/2019)

    Machine: Azure DLVM Standard_NC12 with 2 GPU

    | Scenario | Notebook Name | CPU | GPU |
    |------|--------------------------------|------|---|
    | entailment | entailment_xnli_multilingual | NA | ~20hrs |
    | name_entity_recognition | ner_wikigold_bert | ~37mins | ~6mins |
    | embeddings | embedding_trainer | ~5mins | ~5mins |
    | interpret_NLP_models | understand_models | ~4mins | ~2mins |
    | text_classification | tc_mnil_bert | ~8.2hrs | ~1.2hrs |

    enhancement 
    opened by yijingchen 7
  • Add `$schema` to `cgmanifest.json`


    This pull request adds the JSON schema for cgmanifest.json.

    FAQ

    Why?

    A JSON schema helps you ensure that your cgmanifest.json file is valid. JSON schema validation is a built-in feature in most modern IDEs like Visual Studio and Visual Studio Code. Most modern IDEs also provide code completion for JSON schemas.

    How can I validate my cgmanifest.json file?

    Most modern IDEs like Visual Studio and Visual Studio Code have a built-in feature to validate JSON files. You can also use this small script to validate your cgmanifest.json file.

    Why does it suggest camel case for the properties?

    Component Detection is able to read camel case and pascal case properties. However, the JSON schema doesn't have a case-insensitive mode. We therefore suggest camel case as it's the most common format for JSON.

    Why is the diff so large?

    To deserialize the cgmanifest.json file, we use JSON.parse(). However, to serialize the JSON again we use prettier. We found that, in general, it gave smaller diffs than the default JSON.stringify() function.

    opened by JamieMagee 0
  • This repo is missing important files


    There are important files that all Microsoft projects should have that are not present in this repository. A pull request has been opened to add the missing file(s). When the PR is merged, this issue will be closed automatically.

    Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

    Merge this pull request

    opened by microsoft-github-policy-service[bot] 1
  • Adding Microsoft SECURITY.MD


    Please accept this contribution adding the standard Microsoft SECURITY.MD :lock: file to help the community understand the security policy and how to safely report security issues. GitHub uses the presence of this file to light up security reminders and a link to the file. This pull request commits the latest official SECURITY.MD file from https://github.com/microsoft/repo-templates/blob/main/shared/SECURITY.md.

    Microsoft teams can learn more about this effort and share feedback within the open source guidance available internally.

    opened by microsoft-github-policy-service[bot] 0
  • [ASK] How to run on GPU


    Description

    I was running the code for text classification (tc_mnli_transformers.ipynb) and it keeps running on my CPU instead of my GPU. How can I change that? It's taking way too long to train as a result. Please help.
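    A first diagnostic step (an editor's note, not from the thread) is to confirm that PyTorch can see a GPU at all:

        import torch

        print(torch.cuda.is_available())  # False means PyTorch sees no usable GPU
        print(torch.cuda.device_count())  # number of visible CUDA devices

    If this prints False, the environment (drivers, CUDA toolkit, or the installed PyTorch build) needs fixing before any GPU setting in the notebook can take effect.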

    Other Comments

    opened by poko1 0
  • [ASK] transformers.abstractive_summarization_bertsum.py not importing transformers


    Description

    I ran the following code in Google Colab:

    !pip install --upgrade 
    !pip install -q git+https://github.com/microsoft/nlp-recipes.git
    !pip install jsonlines
    !pip install pyrouge
    !pip install scrapbook
    
    import os
    import shutil
    import sys
    from tempfile import TemporaryDirectory
    import torch
    import nltk
    from nltk import tokenize
    import pandas as pd
    import pprint
    import scrapbook as sb
    
    nlp_path = os.path.abspath("../../")
    if nlp_path not in sys.path:
        sys.path.insert(0, nlp_path)
    
    from utils_nlp import models
    from utils_nlp.models import transformers 
    from utils_nlp.models.transformers.abstractive_summarization_bertsum \
         import BertSumAbs, BertSumAbsProcessor
    

    It breaks on the last line and I get the following error:

    /usr/local/lib/python3.7/dist-packages/utils_nlp/models/transformers/abstractive_summarization_bertsum.py in <module>()
         15 from torch.utils.data.distributed import DistributedSampler
         16 from tqdm import tqdm
    ---> 17 from transformers import AutoTokenizer, BertModel
         18 
         19 from utils_nlp.common.pytorch_utils import (
    
    ModuleNotFoundError: No module named 'transformers'
    

    In summary, the code in abstractive_summarization_bertsum.py doesn't resolve the transformers package it imports, even though the file itself lives in the transformers folder. Is this something to be fixed on your side?

    opened by neqkir 1
  • [ASK] Error while running extractive_summarization_cnndm_transformer.ipynb


    When I run the code below:

        summarizer.fit(
            ext_sum_train,
            num_gpus=NUM_GPUS,
            batch_size=BATCH_SIZE,
            gradient_accumulation_steps=2,
            max_steps=MAX_STEPS,
            learning_rate=LEARNING_RATE,
            warmup_steps=WARMUP_STEPS,
            verbose=True,
            report_every=REPORT_EVERY,
            clip_grad_norm=False,
            use_preprocessed_data=USE_PREPROCSSED_DATA,
        )

    it gives me the following error:

    Iteration:   0%|          | 0/199 [00:00<?, ?it/s]
    
    ---------------------------------------------------------------------------
    
    TypeError                                 Traceback (most recent call last)
    
    <ipython-input-40-343cf59f0aa4> in <module>()
         12             report_every=REPORT_EVERY,
         13             clip_grad_norm=False,
    ---> 14             use_preprocessed_data=USE_PREPROCSSED_DATA
         15         )
         16 
    
    11 frames
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in fit(self, train_dataset, num_gpus, gpu_ids, batch_size, local_rank, max_steps, warmup_steps, learning_rate, optimization_method, max_grad_norm, beta1, beta2, decay_method, gradient_accumulation_steps, report_every, verbose, seed, save_every, world_size, rank, use_preprocessed_data, **kwargs)
        775             report_every=report_every,
        776             clip_grad_norm=False,
    --> 777             save_every=save_every,
        778         )
        779 
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/common.py in fine_tune(self, train_dataloader, get_inputs, device, num_gpus, max_steps, global_step, max_grad_norm, gradient_accumulation_steps, optimizer, scheduler, fp16, amp, local_rank, verbose, seed, report_every, save_every, clip_grad_norm, validation_function)
        191                 disable=local_rank not in [-1, 0] or not verbose,
        192             )
    --> 193             for step, batch in enumerate(epoch_iterator):
        194                 inputs = get_inputs(batch, device, self.model_name)
        195                 outputs = self.model(**inputs)
    
    /usr/local/lib/python3.7/dist-packages/tqdm/std.py in __iter__(self)
       1102                 fp_write=getattr(self.fp, 'write', sys.stderr.write))
       1103 
    -> 1104         for obj in iterable:
       1105             yield obj
       1106             # Update and possibly print the progressbar.
    
    /usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in __next__(self)
        519             if self._sampler_iter is None:
        520                 self._reset()
    --> 521             data = self._next_data()
        522             self._num_yielded += 1
        523             if self._dataset_kind == _DatasetKind.Iterable and \
    
    /usr/local/lib/python3.7/dist-packages/torch/utils/data/dataloader.py in _next_data(self)
        559     def _next_data(self):
        560         index = self._next_index()  # may raise StopIteration
    --> 561         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
        562         if self._pin_memory:
        563             data = _utils.pin_memory.pin_memory(data)
    
    /usr/local/lib/python3.7/dist-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
         45         else:
         46             data = self.dataset[possibly_batched_index]
    ---> 47         return self.collate_fn(data)
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in collate_fn(data)
        744             def collate_fn(data):
        745                 return self.processor.collate(
    --> 746                     data, block_size=self.max_pos_length, device=device
        747                 )
        748 
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in collate(self, data, block_size, device, train_mode)
        470         else:
        471             if train_mode is True and "tgt" in data[0] and "oracle_ids" in data[0]:
    --> 472                 encoded_text = [self.encode_single(d, block_size) for d in data]
        473                 batch = Batch(list(filter(None, encoded_text)), True)
        474             else:
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in <listcomp>(.0)
        470         else:
        471             if train_mode is True and "tgt" in data[0] and "oracle_ids" in data[0]:
    --> 472                 encoded_text = [self.encode_single(d, block_size) for d in data]
        473                 batch = Batch(list(filter(None, encoded_text)), True)
        474             else:
    
    /content/drive/My Drive/nlp-recipes/utils_nlp/models/transformers/extractive_summarization.py in encode_single(self, d, block_size, train_mode)
        539             + ["[SEP]"]
        540         )
    --> 541         src_subtoken_idxs = self.tokenizer.convert_tokens_to_ids(src_subtokens)
        542         _segs = [-1] + [i for i, t in enumerate(src_subtoken_idxs) if t == self.sep_vid]
        543         segs = [_segs[i] - _segs[i - 1] for i in range(1, len(_segs))]
    
    /usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in convert_tokens_to_ids(self, tokens)
    
    /usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in _convert_token_to_id_with_added_voc(self, token)
    
    TypeError: Can't convert 0 to PyString
    

    P.S. I am trying to run this code using Google Colab's free GPU.

    Any help is welcome :)

    opened by ToonicTie 2
Releases (2.2.0)
  • 2.2.0 (Mar 30, 2020)

    Text Summarization

    In this release, we support both abstractive and extractive text summarization.

    New Model: UniLM

    UniLM is a state-of-the-art model developed by Microsoft Research Asia (MSRA). The model is pre-trained on a large unlabeled natural language corpus (English Wikipedia and BookCorpus) and can be fine-tuned on different types of labeled data for various NLP tasks like text classification and abstractive summarization.

    Supported Models

    • unilm-large-cased
    • unilm-base-cased

    For more info about UniLM, please refer to the UniLM repository linked in the References section.

    Thanks to the UniLM team, Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon, for their great work and support for the integration.

    New Model: BERTSum

    BERTSum is an encoder architecture designed for text summarization. It can be used together with different decoders to support both extractive and abstractive summarization.

    Supported Models

    • bert-base-uncased (extractive and abstractive)
    • distilbert-base-uncased (extractive)

    Thanks to the original authors Yang Liu and Mirella Lapata for their great contribution.

    All model implementations support distributed training and multi-GPU inference. For abstractive summarization, we also support mixed-precision training and inference.

  • v2.0.0 (Dec 4, 2019)
