Blue Brain text mining toolbox for semantic search and structured information extraction

Overview

Blue Brain Search

Blue Brain Search is a text mining toolbox to perform semantic literature search and structured information extraction from text sources.

This repository originated from the Blue Brain Project efforts on exploring and mining the CORD-19 dataset.

Graphical Interface

The graphical interface is composed of widgets to be used in Jupyter notebooks.

For the graphical interface to work, the steps of the Getting Started section must have been completed successfully.

Find documents based on sentence semantic similarity

To find sentences semantically similar to the query 'Glucose is a risk factor for COVID-19' in the documents, you could just click on the blue button named Search Literature!. You could also enter the query of your choice by editing the text in the top field named Query.

The returned results are ranked by decreasing semantic similarity. This means that the first results have a similar meaning to the query. Thanks to the state-of-the-art approach based on deep learning used by Blue Brain Search, this is true even if the query and the sentences from the documents do not share the same words (e.g. they are synonyms, they have a similar meaning, ...).
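
As a rough illustration of what "ranked by decreasing semantic similarity" means, the sketch below ranks sentences by the cosine similarity between their embedding and the query embedding. The three-dimensional vectors, the sentence ids, and the function names are invented for this example. In Blue Brain Search the embeddings come from a deep learning model, and this sketch is not its actual implementation.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_sentences(query_embedding, sentence_embeddings):
    """Return sentence ids sorted by decreasing similarity to the query."""
    scores = {
        sentence_id: cosine_similarity(query_embedding, embedding)
        for sentence_id, embedding in sentence_embeddings.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings (hypothetical values, for illustration only).
query = [1.0, 0.0, 0.0]
sentences = {
    "s1": [0.0, 1.0, 0.0],  # orthogonal to the query
    "s2": [0.9, 0.1, 0.0],  # nearly parallel to the query
    "s3": [0.5, 0.5, 0.0],  # in between
}
ranking = rank_sentences(query, sentences)
print(ranking)
```

The ranking depends only on the direction of the vectors, which is why two sentences can be close to each other without sharing any word.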

Extract structured information from documents

The extraction could be done either on documents found by the search above or on the text content of a document pasted in the widget.

Found documents

To extract structured information from the found documents, you could just click on the blue button named Mine Selected Articles!.

At the moment, the returned results are named entities. For each named entity, the structured information is: the mention (e.g. 'COVID-19'), the type (e.g. 'DISEASE'), and its location up to the character in the document.
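
To make the three pieces of structured information concrete, here is a minimal sketch of how one named entity could be represented. The class and field names are hypothetical and do not come from the Blue Brain Search code base. Only the mention, the type, and the character-level location are taken from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntityMention:
    """One extracted named entity (hypothetical structure)."""

    text: str  # the mention, e.g. 'COVID-19'
    entity_type: str  # the type, e.g. 'DISEASE'
    start_char: int  # offset of the first character in the document
    end_char: int  # offset just past the last character


document = "Glucose is a risk factor for COVID-19."
start = document.index("COVID-19")
mention = EntityMention("COVID-19", "DISEASE", start, start + len("COVID-19"))

# The character offsets are enough to recover the mention from the document.
recovered = document[mention.start_char:mention.end_char]
print(mention, recovered)
```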

Pasted document content

It is also possible to extract structured information from the pasted content of a document. To switch to this mode, you could just click on the tab named Mine Text. Then, you could launch the extraction by just clicking on the blue button named Mine This Text!. You could also enter the content of your choice by editing the text field.

Getting Started

There are 9 steps, which need to be done in the following order:

  1. Prerequisites
  2. Retrieve the documents
  3. Initialize the database server
  4. Install Blue Brain Search
  5. Create the database
  6. Compute the sentence embeddings
  7. Create the mining cache
  8. Initialize the search, mining, and notebook servers
  9. Open the example notebook

Before proceeding, four things need to be noted.

First, these instructions are to reproduce the environment and results of Blue Brain Search v0.1.0. Indeed, this is the version for which the models we have trained have been publicly released.

Second, the setup of Blue Brain Search requires the launch of 4 servers (database, search, mining, notebook). The instructions are supposed to be executed on a powerful remote machine and the notebooks are supposed to be accessed from a personal local machine through the network.

Third, the ports, the Docker image names, and the Docker container names are modified (see below) to safely test the instructions on a machine where the Docker images would have already been built, the Docker containers would already run, and the servers would already run.

Fourth, if you are in a production setting, the database password and the notebook server token should be changed, the prefix test_ should be removed from the Docker image and container names, the sed commands should be omitted, and the second digit of the ports should be replaced by 8.
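
The renaming rules for a production setting can be sketched as follows. The helper names are made up for this illustration. The ports are the test values defined in the Prerequisites section, and the sketch assumes that "second digit" means the second character of the port number.

```python
def production_port(test_port):
    """Replace the second digit of a port by 8 (e.g. 8953 -> 8853)."""
    digits = str(test_port)
    return int(digits[0] + "8" + digits[2:])


def production_name(test_name, prefix="test_"):
    """Drop the test_ prefix from a Docker image or container name."""
    if test_name.startswith(prefix):
        return test_name[len(prefix):]
    return test_name


test_ports = [
    ("DATABASE_PORT", 8953),
    ("SEARCH_PORT", 8950),
    ("MINING_PORT", 8952),
    ("NOTEBOOK_PORT", 8954),
]
for variable, port in test_ports:
    print(f"export {variable}={production_port(port)}")
print(production_name("test_bbs_mysql"))
```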

Prerequisites

The instructions are written for GNU/Linux machines. However, any machine with the equivalent of git, wget, tar, cd, mv, mkdir, sed (optional), and echo could be used.

The software named Docker is also needed. To install Docker, please refer to the official Docker documentation.

An optional part is using the programming language Python and its package manager pip. To install Python and pip please refer to the official Python documentation.

Otherwise, let's start in a newly created directory.

First, download the snapshot of the DVC remote and extract it.

wget https://zenodo.org/record/4589007/files/bbs_dvc_remote.tar.gz
tar xf bbs_dvc_remote.tar.gz

Second, clone the Blue Brain Search repository for v0.1.0.

git clone --depth 1 --branch v0.1.0 https://github.com/BlueBrain/Search.git

Third, keep track of the path to the working directory, the repository directory, and the data and models directory.

export WORKING_DIRECTORY="$(pwd)"
export REPOSITORY_DIRECTORY="$WORKING_DIRECTORY/Search"
export BBS_DATA_AND_MODELS_DIR="$REPOSITORY_DIRECTORY/data_and_models"

Finally, define the configuration common to all the instructions.

export DATABASE_PORT=8953
export SEARCH_PORT=8950
export MINING_PORT=8952
export NOTEBOOK_PORT=8954

export DATABASE_PASSWORD=1234
export NOTEBOOK_TOKEN=1a2b3c4d

export USER_NAME=$(id -un)
export USER_ID=$(id -u)

export http_proxy=http://bbpproxy.epfl.ch:80/
export https_proxy=http://bbpproxy.epfl.ch:80/

Retrieve the documents

This will download and decompress the CORD-19 version corresponding to the version 73 on Kaggle. Note that the data are around 7 GB. Decompression would take around 3 minutes.

export CORD19_VERSION=2021-01-03
export CORD19_ARCHIVE=cord-19_${CORD19_VERSION}.tar.gz
export CORD19_DIRECTORY=$WORKING_DIRECTORY/$CORD19_VERSION
cd $WORKING_DIRECTORY
wget https://ai2-semanticscholar-cord-19.s3-us-west-2.amazonaws.com/historical_releases/$CORD19_ARCHIVE
tar xf $CORD19_ARCHIVE
cd $CORD19_DIRECTORY
tar xf document_parses.tar.gz

CORD-19 contains more than 400,000 publications. The next sections could run for several hours, even days, depending on the power of the machine.

For testing purposes, you might want to consider a subset of CORD-19. The following code selects around 1,400 articles about glucose and risk factors:

mv metadata.csv metadata.csv.original
pip install pandas
python
import pandas as pd
metadata = pd.read_csv('metadata.csv.original')
sample = metadata[
    metadata.title.str.contains('glucose', na=False)
    | metadata.title.str.contains('risk factor', na=False)
  ]
print('The subset contains', sample.shape[0], 'articles.')
sample.to_csv('metadata.csv', index=False)
exit()

Initialize the database server

export DATABASE_NAME=cord19
export DATABASE_URL=$HOSTNAME:$DATABASE_PORT/$DATABASE_NAME
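
Note that DATABASE_URL has the form host:port/database, without a scheme. The sketch below splits such a value into its parts. The function name and the host name myhost are placeholders chosen for this example, not part of Blue Brain Search.

```python
def split_database_url(url):
    """Split 'host:port/name' into (host, port, name)."""
    host_and_port, _, name = url.partition("/")
    host, _, port = host_and_port.rpartition(":")
    return host, int(port), name


parts = split_database_url("myhost:8953/cord19")
print(parts)
```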

This will build a Docker image where MySQL is installed.

cd $REPOSITORY_DIRECTORY
docker build \
  --build-arg http_proxy \
  --build-arg https_proxy  \
  -f docker/mysql.Dockerfile -t test_bbs_mysql .

NB: HTTP_PROXY and HTTPS_PROXY, in upper case, do not work here.

This will use this image to launch a MySQL server running in a Docker container.

docker run \
  --publish $DATABASE_PORT:3306 \
  --env MYSQL_ROOT_PASSWORD=$DATABASE_PASSWORD \
  --detach \
  --name test_bbs_mysql test_bbs_mysql

You will be asked to enter the MySQL root password defined above (DATABASE_PASSWORD).

docker exec --interactive --tty test_bbs_mysql bash
mysql -u root -p

Please replace <database name> by the value of DATABASE_NAME.

CREATE DATABASE <database name>;
CREATE USER 'guest'@'%' IDENTIFIED WITH mysql_native_password BY 'guest';
GRANT SELECT ON <database name>.* TO 'guest'@'%';
exit;

Please exit the interactive session on the test_bbs_mysql container.

exit

Install Blue Brain Search

This will build a Docker image where Blue Brain Search is installed.

cd $REPOSITORY_DIRECTORY
docker build \
  --build-arg BBS_HTTP_PROXY=$http_proxy \
  --build-arg BBS_http_proxy=$http_proxy \
  --build-arg BBS_HTTPS_PROXY=$https_proxy \
  --build-arg BBS_https_proxy=$https_proxy \
  --build-arg BBS_USERS="$USER_NAME/$USER_ID" \
  -f docker/base.Dockerfile -t test_bbs_base .

NB: At the moment, HTTP_PROXY, HTTPS_PROXY, http_proxy, and https_proxy are not working here.

This will use this image to launch an interactive session in a Docker container.

The immediate next sections will need to be run in this session.

docker run \
  --volume /raid:/raid \
  --env REPOSITORY_DIRECTORY \
  --env CORD19_DIRECTORY \
  --env WORKING_DIRECTORY \
  --env DATABASE_URL \
  --env BBS_DATA_AND_MODELS_DIR \
  --gpus all \
  --interactive \
  --tty \
  --rm \
  --user "$USER_NAME" \
  --name test_bbs_base test_bbs_base
cd $REPOSITORY_DIRECTORY
pip install .[data_and_models]

NB: The optional dependencies installed with the [data_and_models] option are only necessary if you want to execute training or inference using DVC and the models and scripts contained under data_and_models/. If this is not the case, you can omit [data_and_models] at the end of pip install.

Then, configure DVC to work with the downloaded snapshot of the DVC remote.

dvc remote add --default local $WORKING_DIRECTORY/bbs_dvc_remote

Create the database

You will be asked to enter the MySQL root password defined above (DATABASE_PASSWORD).

If you are using the CORD-19 subset of around 1,400 articles, this would take around 3 minutes.

create_database \
  --cord-data-path $CORD19_DIRECTORY \
  --db-url $DATABASE_URL

Compute the sentence embeddings

If you are using the CORD-19 subset of around 1,400 articles, this would take around 2 minutes (on 2 Tesla V100 16 GB).

export EMBEDDING_MODEL='BioBERT NLI+STS CORD-19 v1'
export BBS_SEARCH_EMBEDDINGS_PATH=$WORKING_DIRECTORY/embeddings.h5
cd $BBS_DATA_AND_MODELS_DIR/models/sentence_embedding/
dvc pull biobert_nli_sts_cord19_v1
compute_embeddings SentTransformer $BBS_SEARCH_EMBEDDINGS_PATH \
  --checkpoint biobert_nli_sts_cord19_v1 \
  --db-url $DATABASE_URL \
  --gpus 0,1 \
  --h5-dataset-name "$EMBEDDING_MODEL" \
  --n-processes 2

NB: At the moment, compute_embeddings handles more models than the search server. The supported models for the search could be found in SearchServer._get_model(...).

Create the mining cache

cd $BBS_DATA_AND_MODELS_DIR/pipelines/ner/
dvc pull $(< dvc.yaml grep -oE '\badd_er_[0-9]+\b' | xargs)
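
The grep pattern above keeps only the stage names of the form add_er_ followed by a number. The Python sketch below applies the same regular expression to a made-up excerpt. The stage names and commands in the excerpt are invented, and the real dvc.yaml in the repository will differ.

```python
import re

# Hypothetical excerpt of a dvc.yaml file (for illustration only).
DVC_YAML_EXCERPT = """\
stages:
  train_1:
    cmd: python train.py
  add_er_1:
    cmd: python add_er.py
  add_er_2:
    cmd: python add_er.py
"""


def find_add_er_stages(text):
    """Return every whole-word match of add_er_<number>, in order."""
    return re.findall(r"\badd_er_[0-9]+\b", text)


stages = find_add_er_stages(DVC_YAML_EXCERPT)
# Space-separated, like the single line that xargs hands to dvc pull.
print(" ".join(stages))
```

The script name add_er.py is not matched because the pattern requires an underscore and at least one digit after add_er.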

You will be asked to enter the MySQL root password defined above (DATABASE_PASSWORD).

If you are using the CORD-19 subset of around 1,400 articles, this would take around 4 minutes.

cd $REPOSITORY_DIRECTORY
create_mining_cache \
  --db-url $DATABASE_URL \
  --target-table-name=mining_cache

NB: By default, the logging level is set to show the INFO logs. Note also that the command cd $REPOSITORY_DIRECTORY above is essential as otherwise the mining models will not be found.

Initialize the search, mining, and notebook servers

Please exit the interactive session of the test_bbs_base container.

exit
cd $REPOSITORY_DIRECTORY

Search server

sed -i 's/ bbs_/ test_bbs_/g' docker/search.Dockerfile
docker build \
  -f docker/search.Dockerfile -t test_bbs_search .

Please export also in this environment the variables EMBEDDING_MODEL and BBS_SEARCH_EMBEDDINGS_PATH.

export BBS_SEARCH_DB_URL=$DATABASE_URL
export BBS_SEARCH_MYSQL_USER=guest
export BBS_SEARCH_MYSQL_PASSWORD=guest

export BBS_SEARCH_MODELS_PATH=$BBS_DATA_AND_MODELS_DIR/models/sentence_embedding/
export BBS_SEARCH_MODELS=$EMBEDDING_MODEL
docker run \
  --publish $SEARCH_PORT:8080 \
  --volume /raid:/raid \
  --env BBS_SEARCH_DB_URL \
  --env BBS_SEARCH_MYSQL_USER \
  --env BBS_SEARCH_MYSQL_PASSWORD \
  --env BBS_SEARCH_MODELS \
  --env BBS_SEARCH_MODELS_PATH \
  --env BBS_SEARCH_EMBEDDINGS_PATH \
  --detach \
  --name test_bbs_search test_bbs_search

Mining server

sed -i 's/ bbs_/ test_bbs_/g' docker/mining.Dockerfile
docker build \
  -f docker/mining.Dockerfile -t test_bbs_mining .
export BBS_MINING_DB_TYPE=mysql
export BBS_MINING_DB_URL=$DATABASE_URL
export BBS_MINING_MYSQL_USER=guest
export BBS_MINING_MYSQL_PASSWORD=guest
docker run \
  --publish $MINING_PORT:8080 \
  --volume /raid:/raid \
  --env BBS_MINING_DB_TYPE \
  --env BBS_MINING_DB_URL \
  --env BBS_MINING_MYSQL_USER \
  --env BBS_MINING_MYSQL_PASSWORD \
  --detach \
  --name test_bbs_mining test_bbs_mining

Notebook server

The structured information searched and extracted using the text mining tools provided by Blue Brain Search can be conveniently transformed and analyzed as a knowledge graph using the tools provided by Blue Brain Graph.

To use the complete pipeline (literature search, text mining, and transformation into a knowledge graph), you should use the proof of concept notebook BBS_BBG_poc.ipynb from our dedicated repository. To use this notebook, please follow the instructions from the dedicated README.

If you want to set up the notebook in a Docker container, please create an environment variable called NOTEBOOK_DIRECTORY and launch the following command:

export NOTEBOOK_DIRECTORY="$WORKING_DIRECTORY/Search-Graph-Examples"
docker run \
  --publish $NOTEBOOK_PORT:8888 \
  --volume /raid:/raid \
  --env NOTEBOOK_TOKEN \
  --env DB_URL \
  --env SEARCH_ENGINE_URL \
  --env TEXT_MINING_URL \
  --interactive \
  --tty \
  --rm \
  --user "$USER_NAME" \
  --workdir $NOTEBOOK_DIRECTORY \
  --name test_bbs_notebook test_bbs_base

Do not hesitate to check the Blue Brain Search-Graph-Examples repository if you encounter any issue with the notebook.

Please hit CTRL+P and then CTRL+Q to detach from the Docker container.

Open the example notebook

echo http://$HOSTNAME:$NOTEBOOK_PORT/lab/tree/BBS_BBG_poc.ipynb?token=$NOTEBOOK_TOKEN

To open the example notebook, please open the link returned above in a browser.

Voilà! You could now use the graphical interface.

Clean-up

Please note that this will DELETE EVERYTHING that was done in the previous sections of this Getting Started. This can be useful after having tried the instructions or when something went wrong.

export SERVERS='test_bbs_search test_bbs_mining test_bbs_mysql'
docker stop test_bbs_notebook $SERVERS
docker rm $SERVERS
docker rmi $SERVERS test_bbs_base
rm $BBS_SEARCH_EMBEDDINGS_PATH
rm -R $CORD19_DIRECTORY
rm $WORKING_DIRECTORY/$CORD19_ARCHIVE
rm -R $REPOSITORY_DIRECTORY

Installation (virtual environment)

We currently support the following Python versions. Make sure you are using one of them.

  • Python 3.7
  • Python 3.8
  • Python 3.9

Before installation, please make sure you have a recent pip installed (>=19.1)

pip install --upgrade pip

Then you can easily install bluesearch from PyPI:

pip install bluesearch[data_and_models]

You can also build from source if you prefer:

pip install .[data_and_models]

Installation (Docker)

We provide a Docker file, docker/Dockerfile, that allows you to build a Docker image with all dependencies of bluesearch pre-installed. Note that bluesearch itself is not installed, which needs to be done manually on each container that is spawned.

To build the docker image open a terminal in the root directory of the project and run the following command.

$ docker build -f docker/Dockerfile -t bbs .

Then, to spawn an interactive container session run

$ docker run -it --rm bbs

Documentation

We provide additional information on the package in the documentation. All the versions of our documentation, both stable and latest, can be found on Read the Docs.

If you want to manually build the documentation, you can do so using Sphinx. Make sure to install the bluesearch package with dev extras to get the necessary dependencies.

pip install -e .[dev]

Then, to generate the documentation run

cd docs
make clean && make html

You can open the resulting documentation in a browser by navigating to docs/_build/html/index.html.

Testing

We use tox to run all our tests. Running tox in the terminal will execute the following environments:

  • lint: code style and documentation checks
  • docs: test doc build
  • check-packaging: test packaging
  • py37: run unit tests (using pytest) with python3.7
  • py38: run unit tests (using pytest) with python3.8
  • py39: run unit tests (using pytest) with python3.9

Each of these environments can be run separately using the following syntax:

$ tox -e lint

This will only run the lint environment.

We provide several convenience tox environments that are not run automatically and have to be triggered by hand:

  • format
  • benchmarks

The format environment will reformat all source code using isort and black.

The benchmarks environment will run pre-defined pytest benchmarks. Currently these benchmarks only test various servers and therefore need to know the server URLs. These can be passed to tox via the following environment variables:

export EMBEDDING_SERVER=http://<url>:<port>
export MINING_SERVER=http://<url>:<port>
export MYSQL_SERVER=<url>:<port>
export SEARCH_SERVER=http://<url>:<port>

If a server URL is not defined, then the corresponding tests will be skipped.
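
Before launching the benchmarks, it can help to check which of these variables are set. The sketch below is a small stand-alone helper written for this purpose. Its function name is hypothetical, and it is not how the test suite itself decides to skip tests.

```python
import os

# Environment variables read by the benchmarks, as listed above.
SERVER_VARIABLES = (
    "EMBEDDING_SERVER",
    "MINING_SERVER",
    "MYSQL_SERVER",
    "SEARCH_SERVER",
)


def split_by_availability(environ):
    """Return (defined, missing) variable names for the given environment."""
    defined = [name for name in SERVER_VARIABLES if environ.get(name)]
    missing = [name for name in SERVER_VARIABLES if not environ.get(name)]
    return defined, missing


defined, missing = split_by_availability(os.environ)
print("Servers with a URL defined:", defined)
print("Servers whose benchmarks will be skipped:", missing)
```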

It is also possible to provide additional positional arguments to pytest using the following syntax:

$ tox -e benchmarks -- <positional arguments>

for example:

$ tox -e benchmarks -- \
  --benchmark-histogram=my_histograms/benchmarks \
  --benchmark-max-time=1.5 \
  --benchmark-min-rounds=1

See pytest --help for additional options.

Funding & Acknowledgment

This project was supported by funding to the Blue Brain Project, a research center of the Ecole polytechnique fédérale de Lausanne, from the Swiss government's ETH Board of the Swiss Federal Institutes of Technology.

COPYRIGHT (c) 2021 Blue Brain Project/EPFL

Comments
  • #355 Integrate Sentence Embedding training and fine-tuning in DVC pipeline.

    Fixes #355.

    Note

    Before running the whole pipeline ([email protected]_nli_sts_cord19_v1), please read this comment.

    Description

    After #343, we found out that some reproducibility issues are in fact related only to torch.save(). The newly released torch==1.9.0 contains the patch that resolves this reproducibility issue.

    First step of this PR was to check that training and fine-tuning our Sentence Embedding model is now reproducible. It is indeed the case (see How to check reproducibility? to reproduce the experiment).

    As it is now reproducible, this PR handles the integration of those training and fine-tuning steps into the DVC pipeline. What is done during this PR:

    • Rename scripts to training_transformers for more clarity
    • Update the dvc.yaml file, which now contains two new steps (i.e. training_transformers and fine_tuning_transformers)
    • Update requirements.txt and setup.py with the new release of torch==1.9.0.

    Small question still to answer: Should we remove build.sh and the reference of it in the README.md as now the training is handled by DVC ?

    How to check reproducibility?

    The first step of this PR was to check that training and fine-tuning our Sentence Embedding model is now reproducible. Here are the steps followed:

    cd data_and_models/pipelines/sentence_embedding/scripts/
    # Create a copy of sentences-filtered_11-527-877.txt with a sample of all sentences
    sed -n '1,100000p' sentences-filtered_11-527-877.txt > sentences-filtered_11-527-877_sample.txt
    
    # Make sure you have the latest version of torch (1.9.0) and the right version of transformers
    pip install --upgrade torch
    pip install transformers==3.4.0 
    
    # Launch the scripts after changing the TRAIN environment variable in the script and also the output directory contained under TEMP environment variable
    ./build.sh
    

    You need to launch this script twice. Because the output directory is part of the training arguments, the final binary files that store those arguments will differ if the two runs use different output directories. However, the binary file containing the weights is now fully reproducible.
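
    One way to compare the two runs is to hash every output file and list the ones that differ. The sketch below is only an illustration of that idea. The function names are hypothetical, and the output directories are whatever was set through TEMP for each run.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def differing_files(run_a, run_b):
    """Return relative paths whose content differs between two directories."""
    run_a, run_b = Path(run_a), Path(run_b)
    names_a = {p.relative_to(run_a) for p in run_a.rglob("*") if p.is_file()}
    names_b = {p.relative_to(run_b) for p in run_b.rglob("*") if p.is_file()}
    different = names_a ^ names_b  # files present in only one run
    for name in names_a & names_b:
        if sha256_of_file(run_a / name) != sha256_of_file(run_b / name):
            different.add(name)
    return sorted(str(name) for name in different)
```

    Calling differing_files on the two output directories should then list only the files that store the training arguments, and not the file containing the weights.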

    How to test?

    Please provide here instructions on how to test the changes introduced by this PR. (if some changes cannot be tested by automated tests)

    Checklist

    • [x] This PR refers to an issue from the issue tracker. (if it is not the case, please create an issue first).
    • [x] Unit tests added. (if needed)
    • [x] Documentation and whatsnew.rst updated. (if needed)
    • [X] setup.py and requirements.txt updated with new dependencies. (if needed)
    • [X] Type annotations added. (if a function is added or modified)
    • [x] All CI tests pass.
    🦉 dvc 
    opened by EmilieDel 27
  • [BBS 199] Upgrade torch + transformers and investigate MP start method

    JIRA: BBS-199

    TODO

    • [x] Change requirements.txt

    • [x] Check manually whether multiprocessing inside of compute_embeddings works (we do not have unit tests that actually run multiprocessing). @pafonta feel free to find a failure case.

      • There is an issue: compute_embeddings does not work correctly with newer versions of transformers. See huggingface/transformers#8801.
    • [ ] Build base docker image (will do it once merged)

    opened by jankrepl 21
  • Add knowledge graph building process steps

    Hello,

    With @annakristinkaufmann, we have added the process steps for knowledge graph building to the BBS BBG PoC notebook.

    We are proposing the variable table as a way to have the two notebook sections connected. This variable could of course be renamed.

    opened by pafonta 20
  • Add knowledge graph data model and RDF graph

    Hello,

    This is a first iteration on building a knowledge graph from the output of the NERs and REs.

    For this first iteration, the data model represents and enables the semantic search of the recognized entities and their provenance. Real example data are represented with this RDF data model and are loaded in data structure understanding RDF and operations on it.

    The next iteration will improve the semantic representation of the data.

    opened by pafonta 18
  • Use individual spaCy with transformer backbones for all NER models

    🚀 Feature

    In light of what we discussed in PR #328, and in particular looking at the results shown in this comparison table, we should operate the following changes to the NER models in data_and_models/pipelines/ner.

    • [x] All NER models should use a transformer backbone. Now they are using tok2vec.
    • [x] All NER models should initialize the weights of this backbone using the pre-trained weights of CORD19 NLI+STS v1. Also, the weights should not be frozen during the fine-tuning on the NER task.
    • [x] There should be one distinct NER model (= spaCy pipeline) for each entity type we support. Note that currently this is not the case, as e.g. model2 is used to extract 3 different entity types (see table here).
    • [x] Unlike the experiments of PR #328, the spaCy pipeline should also include the rule-based entity extraction component (Note: for the moment let's keep using add_er.py, then in the future we'll improve that with #310).
    • [x] All the evaluation results (token and entity based, Prec, Rec, F1) obtained before and after this PR should be collected in a table.
    🔤 named-entity-recognition 
    opened by FrancescoCasalegno 17
  • First draft for NER models improvement processes

    Context

    As we have been requested, it is of the highest importance not only that our NER models improve their accuracy, but also that we implement features and define processes that make it as seamless as possible to improve our NER models, by allowing users to address the two following use cases.

    1. Add support for new entity types.
    2. Correct errors observed in predictions.

    Ideas for this process

    • Get inspired by prodigy process here:
    • Is the new entity type a sub-type of an already existing entity type? (e.g. MAMMAL is sub-type of ANIMAL)
      • If Yes, then redirect this problem to Ontology Linking and Blue Graph
    • How do we provide estimates on how many training samples will be needed? Can we do this iteratively e.g. using #276 learning curves?
    • Shall we always train a "statistical model" or consider using an EntityRuler?
    • In any case, maybe at least some training samples are needed for testing? How many?

    Actions

    • [x] Create draft of process to add support for new entity types.
    • [x] Create draft of process to correct errors observed in predictions.
    🔤 named-entity-recognition 
    opened by FrancescoCasalegno 16
  • CI aka Jenkins config

    The config lies here: bbsearch-jenkins

    So currently our CI does (more or less) the following things:

    1. Install package via setup.py i.e. pip install --upgrade .[dev]
    2. Run unit tests pytest

    I wanted to discuss addition of multiple other "code quality" tools but at the same time I do not want to impose some overkill requirements.

    • flake8 - non-zero exit code if PEP8 not satisfied
    • pydocstyle - non-zero exit code if docstrings missing/wrongly formatted w.r.t numpydoc
    • pytest-coverage - either just printing the coverage at the end (--cov-report=term) or if we want to be more extreme we can impose a minimum coverage --cov-fail-under=MIN

    Please feel free to share your opinions on this. @Stannislav @EmilieDel @FrancescoCasalegno

    opened by jankrepl 13
  • Test reproducibility of spaCy training

    Context

    We have already seen (see BBS-198) that torch training results (i.e. model weights) are not bitwise reproducible.

    On the other hand, spaCy training has been reproducible until now. But we have been using tok2vec as a backbone, not transformer, so can this change the situation?

    Actions

    • [x] Test if spacy NER training is reproducible when using a transformer backbone instead of tok2vec.
    • [x] If No to the previous question, is the training also not reproducible with a frozen transformer.
    • [x] If training appears not to be reproducible, ask spaCy developers if this is expected, as it seems to contrast with the multiple times that "reproducible" is mentioned in their docs.
    opened by FrancescoCasalegno 12
  • Compare runtimes of spaCy NER pipelines using CPU and GPU

    Description

    While adopting a transformer backbone for our spaCy NER models may be beneficial in terms of accuracy (see #335), this may also imply slower runtime with respect to using a simpler tok2vec.

    To run on GPUs using spaCy it seems that only 2 things are needed (see here for complete guide).

    1. pip install --upgrade spacy[<cuda_version>];
    2. specify spacy.require_gpu() before any spacy.load(some_model).

    The GPU version should be checked before installing the right spacy[<cuda_version>]. For instance, given

    $ nvcc --version
    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2019 NVIDIA Corporation
    Built on Wed_Oct_23_19:24:38_PDT_2019
    Cuda compilation tools, release 10.2, V10.2.89
    

    we should pip install --upgrade spacy[cuda102], since we have cuda v10.2.
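
    The mapping from the nvcc output to the extra can be sketched in Python as follows. The function name is hypothetical, and the cuda<major><minor> naming is extrapolated from the single cuda102 example above, so the resulting extra should be checked against the spaCy installation guide.

```python
import re

# Sample output copied from the nvcc --version call shown above.
NVCC_OUTPUT = """\
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
"""


def spacy_extra_from_nvcc(output):
    """Return e.g. 'cuda102' for CUDA release 10.2, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", output)
    if match is None:
        return None
    major, minor = match.groups()
    return f"cuda{major}{minor}"


extra = spacy_extra_from_nvcc(NVCC_OUTPUT)
print(f"pip install --upgrade spacy[{extra}]")
```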

    Actions

    Collect results of runtimes of inference using spaCy pipelines in a variety of settings:

    • [x] running on CPU vs GPU
    • [x] using backbone tok2vec vs transformer
    • [x] with pipeline = ["transformer", "ner"] vs pipeline = ["transformer", "tagger", "attribute_ruler", "lemmatizer", "parser", "ner", "entity_ruler"]
    optimization 🔤 named-entity-recognition 
    opened by FrancescoCasalegno 12
  • [BBS-293] Migrate spaCy 2.x -> 3.x

    Fixes #274.

    Description

    PR progress

    The detailed scope is available here: https://github.com/BlueBrain/Search/issues/274#issuecomment-801108330. See especially the section "Out of scope" for links to follow-up work.

    PR changes

    • Removed the installation of Prodigy in pipelines/ner/Dockerfile. Now spacy train is used.
    • Removed the patch from #268. Patched issue fixed in spaCy >= 3.0.4.
    • Migrated add_pipe(EntityRuler, ...) to spaCy 3. Now add_pipe(...) has a new API.
    • Migrated the NER training to spacy train and spaCy 3. Now Prodigy is no more needed to train NER models.
    • Convert .jsonl annotations to .spacy files for spaCy 3. See #308 for migrating to .spacy permanently.
    • Use the default configuration of spaCy 3 for NER training. See #309 for using the config.cfg from Prodigy.
    • Upgraded en_core_web_sm and scispaCy models for spaCy 3.
    • Upgraded spaCy to >= 3.0.4 and scispaCy to 0.4.0.

    Note on Prodigy for spaCy 3

    At the moment, Prodigy has not been released for spaCy 3 yet. The progress on the release of the new Prodigy could be followed here and here.

    Note on the NER performances

    The base models from scispaCy need to be upgraded (v0.2.5 to v0.4.0) to be loadable with spaCy 3.

    Improving the performances of the trained models is not part of the PR. See #309, #294, and #295 instead.

    There are performance changes compared to master (scores for 86844fe0fa3d9731c5bb0b4f9b43a3eeb57fe80d):

    • According to the F1 score from pipeline/ner/eval.py:

      • 8 entities have an increase, especially cell_compartment (+12 %), organism (+7.2 %), and drug (+ 5.8 %)
      • 1 entity has a negligible decrease: protein (-0.73 %)
      • 1 entity has a concerning decrease: pathway (-19 %)
    • According to the F1 score from Prodigy or spaCy:

      • 5 entities have an increase, especially organism (+31 %)
      • 1 entity has a negligible decrease: pathway (-0.58 %)
      • 3 entities have a concerning decrease: cell_compartment (-8.7 %), protein (-7.0 %), and drug (-2.5 %)

    There are probably issues with 3 of the 5 entities mentioned above. Indeed:

    • pathway has a catastrophic decrease according to eval.py (-19 %) but not spaCy (- 0.58 %)
    • cell_compartment has a huge increase according to eval.py (+12 %) but has the opposite according to spaCy (-8.7 %)
    • drug has an increase according to eval.py (+5.8 %) but has a decrease according to spaCy (-2.5%)

    The changes seem to be caused by data sampling and domain adaptation issues. See #321, 3rd and 4th points.

    Here are the full changes according to the F1 score from pipeline/ner/eval.py:

    | entity           | before |  now | delta |     % |
    | ---------------- | -----: | ---: | ----: | ----: |
    | cell_compartment |   0.65 | 0.72 |  0.08 |    12 |
    | cell_type        |   0.64 | 0.64 |  0.00 |   0.0 |
    | chemical         |   0.54 | 0.55 |  0.00 |   0.4 |
    | disease          |   0.69 | 0.70 |  0.01 |   1.3 |
    | drug             |   0.60 | 0.64 |  0.03 |   5.8 |
    | organ            |   0.53 | 0.55 |  0.02 |   3.7 |
    | organism         |   0.54 | 0.58 |  0.04 |   7.2 |
    | pathway          |   0.58 | 0.47 | -0.11 |   -19 |
    | protein          |   0.53 | 0.53 | -0.00 | -0.73 |

    Here are the full changes according to the F1 score from Prodigy or spaCy:

    | entity           | before |  now | delta |     % |
    | ---------------- | -----: | ---: | ----: | ----: |
    | cell_compartment |   0.88 | 0.81 | -0.08 |  -8.7 |
    | cell_type        |   0.77 | 0.78 |  0.01 |   1.7 |
    | chemical         |   0.57 | 0.63 |  0.06 |  10.5 |
    | disease          |   0.88 | 0.92 |  0.04 |   4.6 |
    | drug             |   0.77 | 0.75 | -0.02 |  -2.5 |
    | organ            |   0.81 | 0.85 |  0.04 |   5.0 |
    | organism         |   0.68 | 0.89 |  0.21 |    31 |
    | pathway          |   0.84 | 0.84 | -0.00 | -0.58 |
    | protein          |   0.80 | 0.74 | -0.06 |  -7.0 |
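
    The % column appears to be the change relative to the score before the migration. That reading is an assumption, sketched below with a hypothetical helper. Because the before and now columns are rounded to two decimals, recomputing from them only approximates the reported percentages (for instance, 0.88 to 0.81 gives about -8.0 rather than the reported -8.7).

```python
def relative_change_percent(before, now):
    """Return the change from `before` to `now` as a percentage of `before`."""
    return (now - before) / before * 100


# organism, F1 score from Prodigy or spaCy: 0.68 before, 0.89 now.
organism_change = relative_change_percent(0.68, 0.89)
print(f"{organism_change:+.0f} %")
```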

    Note on reproducibility

    The order of lines in model-best/vocab/strings.json changes between dvc repro -f calls. This leads to changes in pipeline/ner/dvc.lock where the NER models are declared.

    Otherwise, the outputs of the other parts of the pipeline (convert_annotations_*, add_er_*, eval_*) do not change. This therefore implies that the performances of the trained models do not change.

    A patch has been applied in d8b9f7407641f977a0091d61788a96c97b48fa4a to order strings.json deterministically. See also #327 for removing the patch once spaCy has fixed the issue upstream.
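
For illustration only, a deterministic ordering could be obtained along the lines of the sketch below. This is not the patch from the commit above: it assumes that strings.json contains a plain JSON array of strings, and the helper name `sort_strings_file` is hypothetical.

```python
import json
from pathlib import Path


def sort_strings_file(path):
    """Rewrite a JSON array of strings in sorted order so that reruns give identical files."""
    path = Path(path)
    # Assumption: the file holds a flat JSON list of strings.
    strings = json.loads(path.read_text(encoding="utf-8"))
    ordered = sorted(strings)
    path.write_text(
        json.dumps(ordered, indent=2, ensure_ascii=False) + "\n",
        encoding="utf-8",
    )
    return ordered
```

Because the content is sorted before being written, two runs producing the same set of strings yield byte-identical files, so the hashes recorded in dvc.lock stay stable.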

    How to test?

    1. The following should execute without errors:
    git clone https://github.com/BlueBrain/Search
    cd Search
    git checkout bbs_293
    
    docker build <options> -f data_and_models/pipelines/ner/Dockerfile -t <image> .
    docker run -it --rm <options> --name <container> <image>
    
    dvc pull data_and_models/annotations/ner/*.dvc
    
    cd data_and_models/pipelines/ner/
    # This takes around 30 mins.
    dvc repro -f
    
    2. The following, executed after the block above, should return nothing:
    git diff
    

    Checklist

    • [ ] All checkable items from https://github.com/BlueBrain/Search/issues/274#issuecomment-801108330 are checked.
    • [x] This PR refers to an issue from the issue tracker.
    • [x] Documentation and whatsnew.rst updated.
    • [x] setup.py and requirements.txt updated with new dependencies.
    • [x] All CI tests pass.

    Tests failing

    1. add_pipe API change:
    ValueError: [E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got <spacy.pipeline.ner.EntityRecognizer object at 0x151ece910>
     - If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.
     - If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.
     - If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.
    

    Change of syntax with an entity_ruler:

    spaCy<3:

    er = spacy.pipeline.EntityRuler(model, patterns=self.to_list())
    model.add_pipe(er, **add_pipe_kwargs)
    

    spaCy>=3:

    er = model.add_pipe("entity_ruler",
                        config={'validate': True, **add_pipe_kwargs})
    er.add_patterns(self.to_list())
    
    2. deepcopy of spaCy model:
    TypeError: self.c cannot be converted to a Python object for pickling
    

    Raised from tests/test_mining/test_attribute.py, line 1225:

    extractor.ee_model = deepcopy(extractor.ee_model)
    
    new feature optimization 🦉 dvc dependencies 🔤 named-entity-recognition 
    opened by EmilieDel 12
  • [BBS-269] Parse chemprot dataset

    Description

    This PR introduces a script to parse the ChemProt dataset into TSV files compatible with the training of a BioBERT model (see GitHub repo).

    Part of the ticket BBS-269.

    opened by EmilieDel 12
  • add remote relation extraction

    Fixes #{issue-id-number}.

    Description

    Please provide here a summary of the changes introduced by this PR.

    How to test?

    Please provide here instructions on how to test the changes introduced by this PR. (if some changes cannot be tested by automated tests)

    Checklist

    • [ ] This PR refers to an issue from the issue tracker. (if it is not the case, please create an issue first).
    • [ ] Unit tests added. (if needed)
    • [ ] Documentation and whatsnew.rst updated. (if needed)
    • [ ] setup.py and requirements.txt updated with new dependencies. (if needed)
    • [ ] Type annotations added. (if a function is added or modified)
    • [ ] All CI tests pass.
    opened by drsantos89 0
  • add ner k8s

    Description

    Adds a function to perform and store the output of NER. NER is run remotely using the deployment on Kubernetes. It supports both ML and RULE-based approaches. The NER output and model version are stored in ES.

    Notes

    • One function, handle_conflits, is currently found in two repositories (this current PR and one repo on GitLab). It might be interesting to keep it only in BlueSearch and import it into the other repository.
    • The remote models are relatively slow when running 1 sample at a time (~1-2 paragraphs/s for ML and ~15 for RULE). Testing the remote model speed with locust revealed a maximum achievable throughput of ~15 and ~50 for the ML and RULE models, respectively. A multiprocessing option was hence added.
    • pool.apply_async copies the arguments to a new memory location. The client object is not serializable, so it needs to be created inside the function if required.
    • By default, the function only updates the paragraphs whose NER field is empty or was produced by an outdated model version. There is an option to force the update of every paragraph.
    • The JSON output of both models is saved in an ES field of type flattened. ("This data type can be useful for indexing objects with a large or unknown number of unique keys. Only one field mapping is created for the whole JSON object, which can help prevent a mappings explosion (https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html#mapping-limit-settings) from having too many distinct field mappings.")
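
The default update rule described in the notes above (only empty or outdated paragraphs, unless forced) could be sketched as follows. The field names `ner` and `ner_version` and the function `needs_update` are hypothetical and chosen for illustration; the actual names used in the PR may differ.

```python
def needs_update(paragraph, model_version, force=False):
    """Decide whether NER should be (re)run on a paragraph document.

    `paragraph` is a dict standing in for an ES document. The keys "ner"
    and "ner_version" are assumptions, not the real index mapping.
    """
    if force:
        # Forced mode: every paragraph is processed again.
        return True
    if not paragraph.get("ner"):
        # No NER output stored yet.
        return True
    # NER output exists: rerun only if it came from another model version.
    return paragraph.get("ner_version") != model_version


paragraphs = [
    {"id": 1, "ner": None, "ner_version": None},
    {"id": 2, "ner": [{"entity": "DISEASE"}], "ner_version": "v1"},
    {"id": 3, "ner": [{"entity": "DRUG"}], "ner_version": "v2"},
]
to_update = [p["id"] for p in paragraphs if needs_update(p, "v2")]
print(to_update)
```

Keeping the decision in a small pure function makes it easy to unit test without a running Elasticsearch instance.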

    How to test?

    tests/unit/k8s/test_ner.py

    Checklist

    • [x] Unit tests added. (if needed)
    • [x] Type annotations added. (if a function is added or modified)
    • [x] All CI tests pass.
    opened by drsantos89 0
  • Create JSONL configuration file for topic filtering

    Context

    • We have set up a pipeline stage that is able to determine the relevance of an article w.r.t. some user-given topics configuration.
    • Currently, the only config file we have behaves as a wildcard *.
    • We should instead have specific topic inclusion criteria for each archive type (arXiv, PMC, ...).
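
To make the idea concrete, here is a minimal sketch of what a per-archive JSONL configuration and its use might look like. The schema (keys `archive` and `topics`), the topic values, and the helper functions are all hypothetical; they do not describe the configuration format actually used by the pipeline.

```python
import json

# Hypothetical JSONL content: one JSON object per line, one rule per archive.
RULES_JSONL = """\
{"archive": "arxiv", "topics": ["q-bio.NC"]}
{"archive": "pmc", "topics": ["neuroscience", "brain"]}
"""


def load_rules(text):
    """Parse JSONL text into a mapping from archive name to accepted topics."""
    rules = {}
    for line in text.splitlines():
        if line.strip():  # skip empty lines
            rule = json.loads(line)
            rules[rule["archive"]] = set(rule["topics"])
    return rules


def is_relevant(archive, article_topics, rules):
    """Accept an article if its archive accepts "*" or one of its topics."""
    accepted = rules.get(archive, set())
    return "*" in accepted or bool(accepted & set(article_topics))


rules = load_rules(RULES_JSONL)
print(is_relevant("arxiv", ["q-bio.NC", "cs.LG"], rules))
```

In this sketch, the current wildcard behaviour corresponds to a rule whose topics list contains "*", while archives without any rule are rejected.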

    Actions

    • [ ] Compile a JSON configuration file for topic filtering.
    🌪️ db-filter 
    opened by FrancescoCasalegno 0
  • feature/add-abstract-to-paragraphs-table

    Fixes #637.

    Description

    Adds the abstract to the paragraphs table so it can be searchable using semantic search.

    How to test?

    The abstract field was added to the tests in tests/unit/entrypoint/database/test_add_es.py.

    Checklist

    • [x] This PR refers to an issue from the issue tracker. (if it is not the case, please create an issue first).
    • [x] Unit tests added. (if needed)
    • [ ] Documentation and whatsnew.rst updated. (if needed)
    • [x] setup.py and requirements.txt updated with new dependencies. (if needed)
    • [x] Type annotations added. (if a function is added or modified)
    • [x] All CI tests pass.
    opened by drsantos89 0
  • Unify abstract and section paragraphs

    We should be able to assign an embedding to the abstract. However, currently the abstract is a separate field/attribute of the Article class.
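
One possible design, given purely as an illustration, is to expose the abstract through the same iterator as the section paragraphs, so that downstream code (e.g. embedding computation) handles both uniformly. The `Article` class below is a simplified stand-in with assumed attribute names, not the project's actual class, and the "Abstract" section label is an assumption too.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    """Simplified stand-in for an article; attribute names are assumptions."""

    title: str
    abstract: list = field(default_factory=list)  # list of paragraph texts
    section_paragraphs: list = field(default_factory=list)  # (section, text) pairs

    def iter_paragraphs(self):
        """Yield (section, text) pairs, with abstract paragraphs first."""
        for text in self.abstract:
            yield ("Abstract", text)
        yield from self.section_paragraphs


article = Article(
    title="T",
    abstract=["A1"],
    section_paragraphs=[("Introduction", "P1")],
)
paragraphs = list(article.iter_paragraphs())
print(paragraphs)
```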

    Todos

    • [x] Decide on the best design
    • [x] Implement it

    References

    • https://github.com/BlueBrain/Search/pull/593
    🗄️ database 
    opened by jankrepl 1
  • Feature/embedings k8s

    Fixes #623

    Description

    Adds a function that uses a local model to compute embeddings for the paragraphs that do not have one yet.

    How to test?

    Check that embeddings are present in the database. See test/unit/k8s/test_add_embeddings.py.

    Checklist

    • [x] This PR refers to an issue from the issue tracker. (if it is not the case, please create an issue first).
    • [x] Unit tests added. (if needed)
    • [x] setup.py and requirements.txt updated with new dependencies. (if needed)
    • [x] Type annotations added. (if a function is added or modified)
    • [x] All CI tests pass.
    opened by drsantos89 0
Releases(v0.0.10)
Owner
The Blue Brain Project
Open Source Software produced and used by the Blue Brain Project