💫 Industrial-strength Natural Language Processing (NLP) in Python

Overview

spaCy: Industrial-strength NLP

spaCy is a library for advanced Natural Language Processing in Python and Cython. It's built on the very latest research, and was designed from day one to be used in real products.

spaCy comes with pretrained pipelines and currently supports tokenization and training for 60+ languages. It features state-of-the-art speed and neural network models for tagging, parsing, named entity recognition, text classification and more, multi-task learning with pretrained transformers like BERT, as well as a production-ready training system and easy model packaging, deployment and workflow management. spaCy is commercial open-source software, released under the MIT license.

💫 Version 3.0 out now! Check out the release notes here.


📖 Documentation

| Documentation | |
| --- | --- |
| ⭐️ spaCy 101 | New to spaCy? Here's everything you need to know! |
| 📚 Usage Guides | How to use spaCy and its features. |
| 🚀 New in v3.0 | New features, backwards incompatibilities and migration guide. |
| 🪐 Project Templates | End-to-end workflows you can clone, modify and run. |
| 🎛 API Reference | The detailed reference for spaCy's API. |
| 📦 Models | Download trained pipelines for spaCy. |
| 🌌 Universe | Plugins, extensions, demos and books from the spaCy ecosystem. |
| 👩‍🏫 Online Course | Learn spaCy in this free and interactive online course. |
| 📺 Videos | Our YouTube channel with video tutorials, talks and more. |
| 🛠 Changelog | Changes and version history. |
| 💝 Contribute | How to contribute to the spaCy project and code base. |

💬 Where to ask questions

The spaCy project is maintained by @honnibal, @ines, @svlandeg and @adrianeboyd. Please understand that we won't be able to provide individual support via email. We also believe that help is much more valuable if it's shared publicly, so that more people can benefit from it.

| Type | Platforms |
| --- | --- |
| 🚨 Bug Reports | GitHub Issue Tracker |
| 🎁 Feature Requests & Ideas | GitHub Discussions |
| 👩‍💻 Usage Questions | GitHub Discussions · Stack Overflow |
| 🗯 General Discussion | GitHub Discussions |

Features

  • Support for 60+ languages
  • Trained pipelines for different languages and tasks
  • Multi-task learning with pretrained transformers like BERT
  • Support for pretrained word vectors and embeddings
  • State-of-the-art speed
  • Production-ready training system
  • Linguistically-motivated tokenization
  • Components for named entity recognition, part-of-speech-tagging, dependency parsing, sentence segmentation, text classification, lemmatization, morphological analysis, entity linking and more
  • Easily extensible with custom components and attributes
  • Support for custom models in PyTorch, TensorFlow and other frameworks
  • Built in visualizers for syntax and NER
  • Easy model packaging, deployment and workflow management
  • Robust, rigorously evaluated accuracy

📖 For more details, see the facts, figures and benchmarks.

Install spaCy

For detailed installation instructions, see the documentation.

  • Operating system: macOS / OS X · Linux · Windows (Cygwin, MinGW, Visual Studio)
  • Python version: Python 3.6+ (only 64 bit)
  • Package managers: pip · conda (via conda-forge)

pip

Using pip, spaCy releases are available as source packages and binary wheels. Before you install spaCy and its dependencies, make sure that your pip, setuptools and wheel are up to date.

pip install -U pip setuptools wheel
pip install spacy

To install additional data tables for lemmatization and normalization you can run pip install spacy[lookups] or install spacy-lookups-data separately. The lookups package is needed to create blank models with lemmatization data, and to lemmatize in languages that don't yet come with pretrained models and aren't powered by third-party libraries.
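
For example, a minimal sketch of lookup-based lemmatization on a blank pipeline (assuming spacy-lookups-data is installed; the example text is arbitrary):

import spacy

# Blank English pipeline with a lookup lemmatizer; the tables come from
# the spacy-lookups-data package installed above.
nlp = spacy.blank("en")
nlp.add_pipe("lemmatizer", config={"mode": "lookup"})
nlp.initialize()

doc = nlp("The cats were running")
print([token.lemma_ for token in doc])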

When using pip it is generally recommended to install packages in a virtual environment to avoid modifying system state:

python -m venv .env
source .env/bin/activate
pip install -U pip setuptools wheel
pip install spacy

conda

You can also install spaCy from conda via the conda-forge channel. For the feedstock including the build recipe and configuration, check out this repository.

conda install -c conda-forge spacy

Updating spaCy

Some updates to spaCy may require downloading new statistical models. If you're running spaCy v2.0 or higher, you can use the validate command to check if your installed models are compatible and if not, print details on how to update them:

pip install -U spacy
python -m spacy validate

If you've trained your own models, keep in mind that your training and runtime inputs must match. After updating spaCy, we recommend retraining your models with the new version.

📖 For details on upgrading from spaCy 2.x to spaCy 3.x, see the migration guide.

📦 Download model packages

Trained pipelines for spaCy can be installed as Python packages. This means that they're a component of your application, just like any other module. Models can be installed using spaCy's download command, or manually by pointing pip to a path or URL.

| Documentation | |
| --- | --- |
| Available Pipelines | Detailed pipeline descriptions, accuracy figures and benchmarks. |
| Models Documentation | Detailed usage and installation instructions. |
| Training | How to train your own pipelines on your data. |

# Download best-matching version of specific model for your spaCy installation
python -m spacy download en_core_web_sm

# pip install .tar.gz archive or .whl from path or URL
pip install /Users/you/en_core_web_sm-3.0.0.tar.gz
pip install /Users/you/en_core_web_sm-3.0.0-py3-none-any.whl
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz

Loading and using models

To load a model, use spacy.load() with the model name or a path to the model data directory.

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("This is a sentence.")
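
Once loaded, the pipeline's annotations are available on the returned Doc. A quick sketch (the exact output depends on the installed model):

# inspect the annotations produced by the pipeline above
for token in doc:
    print(token.text, token.pos_, token.dep_)

for ent in doc.ents:
    print(ent.text, ent.label_)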

You can also import a model directly via its full name and then call its load() method with no arguments.

import spacy
import en_core_web_sm

nlp = en_core_web_sm.load()
doc = nlp("This is a sentence.")

📖 For more info and examples, check out the models documentation.

Compile from source

The other way to install spaCy is to clone its GitHub repository and build it from source. This is the common approach if you want to make changes to the code base. You'll need a development environment: a Python distribution including header files, a compiler, pip, virtualenv and git. The compiler is usually the trickiest part, and how to install one depends on your system.

| Platform | |
| --- | --- |
| Ubuntu | Install system-level dependencies via apt-get: sudo apt-get install build-essential python-dev git |
| Mac | Install a recent version of XCode, including the so-called "Command Line Tools". macOS and OS X ship with Python and git preinstalled. |
| Windows | Install a version of the Visual C++ Build Tools or Visual Studio Express that matches the version that was used to compile your Python interpreter. |

For more details and instructions, see the documentation on compiling spaCy from source and the quickstart widget to get the right commands for your platform and Python version.

git clone https://github.com/explosion/spaCy
cd spaCy

python -m venv .env
source .env/bin/activate

# make sure you are using the latest pip
python -m pip install -U pip setuptools wheel

pip install -r requirements.txt
pip install --no-build-isolation --editable .

To install with extras:

pip install --no-build-isolation --editable .[lookups,cuda102]

🚦 Run tests

spaCy comes with an extensive test suite. In order to run the tests, you'll usually want to clone the repository and build spaCy from source. This will also install the required development dependencies and test utilities defined in the requirements.txt.

Alternatively, you can run pytest on the tests from within the installed spacy package. Don't forget to also install the test utilities via spaCy's requirements.txt:

pip install -r requirements.txt
python -m pytest --pyargs spacy
Comments
  • Japanese Model

    Japanese Model

    Feature description

    I'd like to add a Japanese model to spaCy. (Let me know if this should be discussed in #3056 instead - I thought it best to just tag it in for now.)

    The Ginza project exists, but currently it's a repackaging of spaCy rather than a model to use with normal spaCy, and I think some of the resources it uses may be tricky to integrate from a licensing perspective.

    My understanding is that the main parts of a model now are 1. the dependency model, 2. NER, and 3. word vectors. Notes on each of those:

    1. Dependencies. For dependency info we can use UD Japanese GSD. UD BCCWJ is bigger but the corpus has licensing issues. GSD is rather small but probably enough to be usable (8k sentences). I have trained it with spaCy and there were no conversion issues.

    2. NER. I don't know of a good dataset for this; Christopher Manning mentioned the same problem two years ago. I guess I could make one based on Wikipedia - I think some other spaCy models use data produced by Nothman et al's method, which skipped Japanese to avoid dealing with segmentation, so that might be one approach. (A reasonable question here is: what do people use for NER in Japanese? Most tokenizer dictionaries, including Unidic, have entity-like information and make it easy to add your own entries, so that's probably the most common approach.)

    3. Vectors. Using JA Wikipedia is no problem. I haven't worked with the Common Crawl before and I'm not sure I have the hardware for it, but if I could get some help on it that's also an option.

    So, how does that sound? If there are no issues with that I'll look into creating an NER dataset.

    enhancement models lang / ja 
    opened by polm 182
  • 💫 spaCy v2.0.0 alpha – details, feedback & questions (plus stickers!)

    💫 spaCy v2.0.0 alpha – details, feedback & questions (plus stickers!)

    We're very excited to finally publish the first alpha pre-release of spaCy v2.0. It's still an early release and (obviously) not intended for production use. You might come across a NotImplementedError – see the release notes for the implementation details that are still missing.

    This thread is intended for general discussion, feedback and all questions related to v2.0. If you come across more complex bugs, feel free to open a separate issue.

    Quickstart & overview

    The most important new features

    • New neural network models for English (15 MB) and multi-language NER (12 MB), plus GPU support via Chainer's CuPy.
    • Strings mapped to hash values instead of integer IDs. This means they will always match – even across models.
    • Improved saving and loading, consistent serialization API across objects, plus Pickle support.
    • Built-in displaCy visualizers with Jupyter notebook support.
    • Improved language data with support for lazy loading and multi-language models. Alpha tokenization for Norwegian Bokmål, Japanese, Danish and Polish. Lookup-based lemmatization for English, German, French, Spanish, Italian, Hungarian, Portuguese and Swedish.
    • Revised API for Matcher and language processing pipelines.
    • Trainable document vectors and contextual similarity via convolutional neural networks.
    • Various bug fixes and almost completely re-written documentation.

    Installation

    spaCy v2.0.0-alpha is available on pip as spacy-nightly. If you want to test the new version, we recommend setting up a clean environment first. To install the new model, you'll have to download it with its full name, using the --direct flag.

    pip install spacy-nightly
    python -m spacy download en_core_web_sm-2.0.0-alpha --direct   # English
    python -m spacy download xx_ent_wiki_sm-2.0.0-alpha --direct   # Multi-language NER
    
    import spacy
    nlp = spacy.load('en_core_web_sm')
    
    import en_core_web_sm
    nlp = en_core_web_sm.load()
    

    Alpha models for German, French and Spanish are coming soon!

    Now on to the fun part – stickers!

    stickers

    We just got our first delivery of spaCy stickers and want to share them with you! There's only one small favour we'd like to ask. The part we're currently behind on is the tests – this includes our test suite as well as in-depth testing of the new features and usage examples. So here's the idea:

    • Find something that's currently not covered in the test suite and doesn't require the models, and write a test for it - for example, language-specific tokenization tests.
    • Alternatively, find examples from the docs that haven't been added to the tests yet and add them. Plus points if the examples don't actually work – this means you've either discovered a bug in spaCy, or a bug in the docs! 🎉

    Submit a PR with your test to the develop branch – if the test covers a bug and currently fails, mark it with @pytest.mark.xfail. For more info, see the test suite docs. Once your pull request is accepted, send us your address via email or private message on Gitter and we'll mail you stickers.
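
    As an illustration, a minimal sketch of such a test (the en_tokenizer fixture follows the existing test suite; the hyphen behaviour asserted here is an assumption, hence the xfail marker):

    import pytest

    @pytest.mark.xfail
    def test_en_tokenizer_splits_hyphen(en_tokenizer):
        # expect "Hello", "-", "world"; remove the xfail if this already passes
        tokens = en_tokenizer("Hello-world")
        assert len(tokens) == 3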

    If you can't find anything, don't have time or can't be bothered, that's fine too. Posting your feedback on spaCy v2.0 here counts as well. To be honest, we really just want to mail out stickers 😉

    help wanted 🌙 nightly meta 
    opened by ines 109
  • Build from source with MinGW

    Build from source with MinGW

    I am trying to build from source under MinGW. I noticed that Cython seems to have trouble with relative imports sometimes, but not all the time. I am not using virtualenv, as I have installed the dependencies into my system; I am not sure if that might have something to do with this. Anyway, this is what I am encountering:

    First I ran into this:

    Error compiling Cython file:
    ------------------------------------------------------------
    ...
    from ..vocab cimport Vocab
    ^
    ------------------------------------------------------------
    
    spacy/serialize/packer.pxd:1:0: 'vocab.pxd' not found
    
    Error compiling Cython file:
    ------------------------------------------------------------
    ...
    from ..vocab cimport Vocab
    ^
    ------------------------------------------------------------
    
    spacy/serialize/packer.pxd:1:0: 'vocab/Vocab.pxd' not found
    

    So I edited it to use an absolute path for the module:

    --- a/spacy/serialize/packer.pxd
    +++ b/spacy/serialize/packer.pxd
    @@ -1,4 +1,4 @@
    -from ..vocab cimport Vocab
    +from spacy.vocab cimport Vocab
    

    and the compile then succeeded. I also had to do the same for spacy/syntax/transition_system.pxd and spacy/tokens/doc.pxd. I was able to compile the following DLLs:

    $ ls spacy/*.dll
    spacy/_ml-cpython-34m.dll      spacy/morphology-cpython-34m.dll
    spacy/_theano-cpython-34m.dll  spacy/orth-cpython-34m.dll
    spacy/attrs-cpython-34m.dll    spacy/parts_of_speech-cpython-34m.dll
    spacy/cfile-cpython-34m.dll    spacy/strings-cpython-34m.dll
    spacy/gold-cpython-34m.dll     spacy/tagger-cpython-34m.dll
    spacy/lexeme-cpython-34m.dll   spacy/tokenizer-cpython-34m.dll
    spacy/matcher-cpython-34m.dll  spacy/vocab-cpython-34m.dll
    

    Now I am having trouble with

    spacy/syntax/ner.cpp: In function 'int __pyx_f_5spacy_6syntax_3ner_13BiluoPushDown_preprocess_gold(__pyx_obj_5spacy_6syntax_3ner_BiluoPushDown*, __pyx_obj_5spacy_4gold_GoldParse*)':
    spacy/syntax/ner.cpp:3532:38: error: no match for 'operator=' (operand types are '__pyx_t_5spacy_24syntax_dot_transition_system_Transition' and '__pyx_t_5spacy_6syntax_17transition_system_Transition')
         (__pyx_v_gold->c.ner[__pyx_v_i]) = __pyx_t_4;
    

    which looks like it could be an issue related to imported names?

    I wonder if you have seen this kind of problem before. I use up-to-date Msys2/MinGW packages. My versions:

    $ python3 --version
    Python 3.4.3
    $ cython --version
    Cython version 0.23.beta1
    
    opened by htzh 78
  • 💫 Better, faster and more customisable matcher

    💫 Better, faster and more customisable matcher

    Related issues: #1567, #1711, #1819, #1939, #1945, #1951, #2042

    We're currently in the process of rewriting the match loop, fixing long-standing issues and making it easier to extend the Matcher and PhraseMatcher. The community contributions by @GregDubbin and @savkov have already made a big difference – we can't wait to get it all ready and shipped.

    This issue discusses some of the planned new features and additions to the match patterns API, including matching by custom extension attributes (Token._.), regular expressions, set membership and rich comparison for numeric values.

    New features

    Custom extension attributes

    spaCy v2.0 introduced custom extension attributes on the Doc, Span and Token. Custom attributes make it easier to attach arbitrary data to the built-in objects, and let users take advantage of spaCy's data structures and the Doc object as the "single source of truth". However, not being able to match on custom attributes was quite limiting (see #1499, #1825).

    The new patterns spec will allow an _ key (the extension attribute namespace) on token patterns, which can map to a dictionary keyed by the attribute names:

    Token.set_extension('is_fruit', getter=lambda token: token.text in ('apple', 'banana'))
    
    pattern = [{'LEMMA': 'have'}, {'_': {'is_fruit': True}}]
    matcher.add('HAVING_FRUIT', None, pattern)
    

    Both regular attribute extensions (with a default value) and property extensions (with a getter) will be supported and can be combined for more exact matches.

    pattern = [{'_': {'is_fruit': True, 'fruit_color': 'red', 'fruit_rating': 5}}]
    

    Rich comparison for numeric values

    Token patterns already allow specifying a LENGTH (the token's character length). However, matching tokens between five and ten characters long previously required adding six copies of the exact same pattern, one per length, introducing unnecessary overhead. Numeric attributes can now also specify a dictionary with the predicate (e.g. '>' or '<=') mapped to the value. For example:

    pattern = [{'ENT_TYPE': 'ORG', 'LENGTH': 5}]          # exact length
    pattern = [{'ENT_TYPE': 'ORG', 'LENGTH': {'>=': 5}}]  # length with predicate
    

    The second pattern above will match a token with the entity type ORG that's 5 or more characters long. Combined with custom attributes, this allows very powerful queries combining both linguistic features and numeric data:

    # match a token based on custom numeric attributes
    pattern = [{'_': {'fruit_rating': {'>': 7}, 'fruit_weight': {'>=': 100, '<': 300}}}]
    
    # match a verb with ._.sentiment_score >= 0.5 and one token on each side
    pattern = [{}, {'POS': 'VERB', '_': {'sentiment_score': {'>=': 0.5}}}, {}]
    

    Defining predicates and values as a dictionary instead of a single string like '>=5' allows us to avoid string parsing, and lets spaCy handle custom attributes without requiring the user to specify their types upfront. (While we know the type of the built-in LENGTH attribute, spaCy has no way of knowing whether the value '<3' of a custom attribute should be interpreted as "less than 3", or the heart emoticon.)
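
    For illustration (the attribute names here are made up):

    # with a dictionary, the intent is explicit and needs no string parsing
    pattern = [{'_': {'mood': '<3'}}]        # exact match on the string '<3'
    pattern = [{'_': {'rating': {'<': 3}}}]  # numeric comparison: less than 3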

    Set membership

    This is another feature that has been requested before and will now be much easier to implement. Similar to the predicate mapping for numeric values, token attributes can now also be defined as dictionaries. The keys IN or NOT_IN can be used to indicate set membership and non-membership.

    pattern = [{'LEMMA': {'IN': ['like', 'love']}}, 
               {'LOWER': {'IN': ['apples', 'bananas']}}]
    

    The above pattern will match a token with the lemma "like" or "love", followed by a token whose lowercase form is either "apples" or "bananas". For example, "loving apples" or "likes bananas". Lists can be used for all non-boolean values, including custom _ attributes:

    # verb or conjunction followed by custom is_fruit token
    pattern = [{'POS': {'IN': ['VERB', 'CONJ', 'CCONJ']}}, 
               {'_': {'is_fruit': True, 'fruit_color': {'NOT_IN': ['red', 'yellow']}}}]
    
    # set membership of numeric custom attributes
    pattern = [{'_': {'is_customer': True, 'customer_year': {'IN': [2018, 2017, 2016]}}}]
    
    # combination of predicates and non-membership
    pattern = [{'_': {'custom_count': {'<=': 100, 'NOT_IN': [12, 66, 79]}}}]
    

    Regular expressions

    Using regular expressions within token patterns is already possible via custom binary flags (see #1567). However, this has some inconvenient limitations – including the patterns not being JSON-serializable. If the solution is to add binary flags, spaCy might as well take care of that. The following example is based on the work by @savkov (see #1833):

    pattern = [{'ORTH': {'REGEX': '^([Uu](\\.?|nited) ?[Ss](\\.?|tates))'}},
               {'LOWER': 'president'}]
    

    'REGEX' as an operator (instead of a top-level property that only matches on the token's text) allows defining rules for any string value, including custom attributes:

    # match tokens with fine-grained POS tags starting with 'V'
    pattern = [{'TAG': {'REGEX': '^V'}}]
    
    # match custom attribute values with regular expressions
    pattern = [{'_': {'country': {'REGEX': '^([Uu](\\.?|nited) ?[Ss](\\.?|tates))'}}}]
    

    New operators

    TL;DR: The new patterns spec will allow two ways of defining properties – attribute values for exact matches and dictionaries using operators for more fine-grained matches.

    {
        PROPERTY: value,                  # exact match
        PROPERTY: {OPERATOR: value, ...}  # match with operators
    }
    

    The following operators can be used within dictionaries describing attribute values:

    | Operator | Value type | Description | Example |
    | --- | --- | --- | --- |
    | ==, >=, <=, >, < | int, float | Attribute value is equal, greater or equal, smaller or equal, greater or smaller. | 'LENGTH': {'>': 10} |
    | IN | any | Attribute value is member of a list. | 'LEMMA': {'IN': ['like', 'love']} |
    | NOT_IN | any | Attribute value is not member of a list. | 'POS': {'NOT_IN': ['NOUN', 'PROPN']} |
    | REGEX | unicode | Attribute value matches regular expression. | 'TAG': {'REGEX': '^V'} |
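
    Putting it together, a small runnable sketch of these dictionary operators, using the Matcher API as it later shipped (v3-style Matcher.add signature; LOWER is used so no trained components are required):

    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.blank("en")
    matcher = Matcher(nlp.vocab)
    pattern = [{"LOWER": {"IN": ["like", "love"]}},
               {"LOWER": {"IN": ["apples", "bananas"]}}]
    matcher.add("FRUIT_PREFERENCE", [pattern])

    doc = nlp("I like apples and love bananas")
    for match_id, start, end in matcher(doc):
        print(doc[start:end].text)  # "like apples", "love bananas"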

    API improvements and bug fixes

    See @honnibal's comments in #1945 and the feature/better-faster-matcher branch for more details and implementation examples.

    Other fixes

    • [x] #1711: Remove hard-coded length limit of 10 on the PhraseMatcher.
    • [x] #1939: Fix pickling of PhraseMatcher.
    • [x] #1951: Matcher.pipe should yield matches instead of Doc objects.
    • [ ] #2042: Support deleting rules in PhraseMatcher.
    • [x] Accept "TEXT" as an alternative to "ORTH" (for consistency).
    enhancement feat / matcher perf / speed 
    opened by ines 64
  • v2 standard pipeline running 10x slower

    v2 standard pipeline running 10x slower

    Your Environment

    Info about spaCy

    • Python version: 2.7.13
    • Platform: Linux-4.10.0-38-generic-x86_64-with-debian-stretch-sid
    • spaCy version: 2.0.0
    • Models: en

    I just updated to v2.0. Not sure what changed, but the exact same pipeline of documents called in the standard nlp = spacy.load('en'); nlp(u"string") way is now 10x slower.

    usage perf / speed 
    opened by hodsonjames 56
  • PR for testing Thinc 703

    PR for testing Thinc 703

    Description

    dummy PR for testing purposes only - should not be merged

    PR to test the CI with the Thinc branch for https://github.com/explosion/thinc/pull/703

    🔮 thinc 
    opened by svlandeg 55
  • Use in Apache Spark / English() object cannot be pickled

    Use in Apache Spark / English() object cannot be pickled

    For spaCy to work out of the box with Apache Spark, the language models need to be pickled so that they can be initialised on the master node and then sent to the workers.

    This currently doesn't work with plain pickle, failing as follows:

    >>> from __future__ import unicode_literals, print_function
    >>> from spacy.en import English
    >>> import pickle
    >>> nlp = English()
    >>> nlpp = pickle.dumps(nlp)
    Traceback (most recent call last):
    [...]
    TypeError: can't pickle Vocab objects
    

    Apache Spark ships with a package called cloudpickle, which is meant to support a wider set of Python constructs, but serialisation with cloudpickle also fails, resulting in a segmentation fault:

    >>> from pyspark import cloudpickle
    >>> pickled_nlp = cloudpickle.dumps(nlp)
    >>> nlpp = pickle.loads(pickled_nlp)
    >>> nlpp('test text')
    Segmentation fault
    

    By default Apache Spark uses pickle, but can be told to use cloudpickle instead.

    Currently a feasible workaround is lazy loading of the language models on the worker nodes:

    global nlp
    def lazyloaded_nlp(s):
        global nlp
        try:
            return nlp(s)        # nlp has already been loaded on this worker
        except NameError:
            nlp = English()      # first call on this worker: load the model lazily
            return nlp(s)
    

    The above works. Nevertheless, I wonder if it would be possible to make the English() object pickleable? If it's not too difficult on your end, having the language models pickleable would provide a better out-of-the-box experience for Apache Spark users.

    enhancement 🌙 nightly 
    opened by aeneaswiener 53
  • Segmentation fault training NER with large number of training examples  #1757 #1335

    Segmentation fault training NER with large number of training examples #1757 #1335

    Re-opening this as a new issue specifically related to NER ~~batch size~~ training with many examples. Relates to #1757 #1335 (which appear to be closed).

    Training NER on 500+ examples throws segmentation fault error:

    Process finished with exit code 139 (interrupted by signal 11: SIGSEGV)

    Anybody found a solution/workaround for this? Thanks!

    Info about spaCy:

    • Python version: 3.6.3
    • spaCy version: 2.0.5
    • Models: en, en_core_sm
    • Platform: MacOS

    bug training feat / ner 
    opened by nikeqiang 43
  • Timeout Downloading Models

    Timeout Downloading Models

    How to reproduce the behaviour

    My GitHub Action tries to download models as follows:

    python -m spacy download en_core_web_lg
    

    But it sometimes gives timeout errors:

    ERROR: Could not install packages due to an OSError: HTTPSConnectionPool(host='objects.githubusercontent.com', port=443): Max retries exceeded with url: /github-production-release-asset-2e65be/84940268/ee782580-63d4-11eb-9a2f-4a14ddffedbb?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20211103%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20211103T074829Z&X-Amz-Expires=300&X-Amz-Signature=4a4170665e395bcd6d5c55886d9fdc8d982870ee5954f34ef0d681b9ded628a2&X-Amz-SignedHeaders=host&actor_id=0&key_id=0&repo_id=84940268&response-content-disposition=attachment%3B%20filename%3Den_core_web_lg-3.0.0-py3-none-any.whl&response-content-type=application%2Foctet-stream (Caused by ReadTimeoutError("HTTPSConnectionPool(host='objects.githubusercontent.com', port=443): Read timed out. (read timeout=15)"))
    

    Not sure if this is related to https://github.com/explosion/spaCy/issues/5260.

    Your Environment

    • Operating System: Github Runners (Ubuntu, Windows, and Mac)
    • Python Version Used: 3.7
    • spaCy Version Used: 3.0.0
    • Environment Information: pip
    install models third-party 
    opened by lalitpagaria 42
  • 💫 Participating in CoNLL 2018 Universal Dependencies evaluation (Team spaCy?)

    💫 Participating in CoNLL 2018 Universal Dependencies evaluation (Team spaCy?)

    Update 06/06/2018. Best way to run the CoNLL experiments is:

    git clone https://github.com/explosion/spaCy -b develop
    cd spaCy
    make
    ./dist/spacy.pex ud-train --help
    

    The Conference on Computational Natural Language Learning (CoNLL) 2017 shared task is a great standard for evaluating parsing algorithms. Unlike previous parsing evaluations, CoNLL 2017 is end-to-end: from raw text to dependencies, across many languages. While we missed the 2017 evaluation, I'd like to participate in 2018.

    To participate in CoNLL 2018, we would need to:

    • Adapt tokenizers to match UD tokenization more closely.

    • Add pipeline component for statistical lemmatization, to improve lemmatizer coverage across languages.

    • Add pipeline component to predict morphological tags.

    • Support joint segmentation and tagging or parsing, for languages like Chinese.

    All of these are great goals, regardless of the competition! However, it's a lot of work, especially the tokenization, which really needs speakers of the various languages.

    Even if we don't get everything done in time to participate in the official evaluation, it will be a great step for spaCy to publish accuracy figures using the official evaluation software and methodology. This will allow direct comparison against other systems, and make quality control across languages much easier.

    What would be really awesome is if we got a few people working on this together, so we could participate as "Team spaCy". Ideally we'd have people taking ownership of some of the main languages, e.g. French, Spanish, German, Chinese, Japanese etc. It's much easier to work on a specific language that you're familiar with. The official evaluation will consider all languages equally, but I'm okay with having low accuracy on, like, Ancient Greek or Dothraki.

    The official testing period will run April 30 to June 26. However, we can get started right away by working with the CoNLL 2017 data.

    To get started, I've made a quick script to run an experiment, which I've been testing on the English data. You can run it by building the feature/better-gold branch, and running the examples/training/conllu.py script like so:

    python examples/training/conllu.py en ~/data/ud-treebanks-conll2017/UD_English/en-ud-train.conllu ~/data/ud-treebanks-conll2017/UD_English/en-ud-train.txt  ~/data/ud-treebanks-conll2017/UD_English/en-ud-dev.conllu ~/data/ud-treebanks-conll2017/UD_English/en-ud-dev.txt /tmp/dev.conllu
    

    This will write you an output file /tmp/dev.conllu after each training epoch, which you can pass into the official CoNLL 2017 evaluation scorer. Scores currently suck, as there are various things to tweak and fix --- but at least the evaluation runs.

    enhancement help wanted 
    opened by honnibal 41
  • 💫 Entity Linking in spaCy

    💫 Entity Linking in spaCy

    Feature description

    With @honnibal & @ines we have been discussing adding an Entity Linking module to spaCy. This module would run on top of NER results and disambiguate & link tagged mentions to a knowledge base. We are thinking of implementing this in a few different phases (a code sketch of the knowledge-base API follows the list below):

    1. Implement an efficient encoding of a knowledge base + all APIs / interfaces, to integrate with the current processing pipeline. We would take the following components of EL into account:
      • Candidate generation
      • Encoding document context
      • Encoding local context
      • Type prediction
      • Coreference resolution / ensuring global consistency
    2. Implement a model that links English texts to English Wikipedia entries
    3. Implement a cross-lingual model that links non-English texts to English Wikipedia entries
    4. Fine-tune WP linking models to be able to ship them as such
    5. Implement support in Prodigy to perform custom EL annotations for your specific project
    6. Test / implement the models on a different domain & non-wikipedia knowledge base
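
    As a rough sketch of what the knowledge-base encoding in phase 1 could look like, here is the KnowledgeBase API that later shipped in spaCy (the entity IDs, frequencies and vectors are toy values):

    import spacy
    from spacy.kb import KnowledgeBase

    nlp = spacy.blank("en")
    kb = KnowledgeBase(vocab=nlp.vocab, entity_vector_length=3)

    # hypothetical Wikidata-style entries with toy vectors
    kb.add_entity(entity="Q42", freq=50, entity_vector=[1.0, 0.0, 0.0])
    kb.add_entity(entity="Q613", freq=10, entity_vector=[0.0, 1.0, 0.0])

    # candidate generation: map a mention string to candidate entities
    kb.add_alias(alias="Douglas Adams", entities=["Q42"], probabilities=[0.9])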

    Notes

    As some prior research, we compiled some notes on this project & its requirements: https://drive.google.com/file/d/1UYnPgx3XjhUx48uNNQ3kZZoDzxoinSBF. This contains more details on the EL components and implementation phases.

    Feedback requested

    We will start implementing the APIs soon, but we would love to hear your ideas, suggestions, requests with respect to this new functionality first!

    enhancement feat / ner 
    opened by svlandeg 40
  • Fix inconsistency in displaCy docs about page option

    Fix inconsistency in displaCy docs about page option

    Description

    The page option, which wraps the output SVG in HTML, is true by default for serve but not for render. The render docs were wrong though, so this updates them.
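
    For reference, a minimal sketch of the two call styles (assuming en_core_web_sm is installed):

    import spacy
    from spacy import displacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("This is a sentence.")

    html = displacy.render(doc, style="dep", page=True)  # page defaults to False for render
    # displacy.serve(doc, style="dep")  # serve wraps the output in a page by default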

    Types of change

    Minor docs fix

    Checklist

    • [x] I confirm that I have the right to submit this contribution under the project's MIT license.
    • [ ] I ran the tests, and all new and existing tests passed.
    • [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
    docs feat / visualizers 
    opened by polm 0
  • nlp.rehearse with textcat and tok2vec listener

    nlp.rehearse with textcat and tok2vec listener

    Description

    Related to #12044

    When using nlp.rehearse on a textcat pipeline with a tok2vec listener, it throws ValueError: [E953] Mismatched IDs received by the Tok2Vec listener. This is not the case when using an inline tok2vec listener. (The same goes for when using textcat_multilabel)

    This PR aims to fix the issue, however it is still WIP and currently only contains the failing unit tests.

    Types of change

    bug fix

    Checklist

    • [x] I confirm that I have the right to submit this contribution under the project's MIT license.
    • [ ] I ran the tests, and all new and existing tests passed.
    • [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
    bug feat / textcat feat / tok2vec 
    opened by thomashacker 0
  • Mismatched IDs error when using nlp.rehearse on textcat

    Mismatched IDs error when using nlp.rehearse on textcat

    Discussed in https://github.com/explosion/spaCy/discussions/10861

    Using nlp.rehearse on a textcat pipeline with a tok2vec listener results in ValueError: [E953] Mismatched IDs. This is not the case when using tok2vec directly within the textcat component. The same goes for textcat_multilabel.

    Originally posted by nashcaps2255 on May 27, 2022: "I have a textcat multilabel model which I am trying to update with nlp.rehearse to alleviate the catastrophic forgetting problem."

    import spacy
    from spacy.training import Example

    # file_ is the open file of "text|label" lines from the report
    nlp = spacy.load('my_model')

    examples = []
    for line in file_:
        text, label = line.split("|")
        doc = nlp(text)
        gold_dict = {"cats": {label: float(1)}}
        example = Example.from_dict(doc, gold_dict)
        examples.append(example)

    optimizer = nlp.resume_training()
    nlp.rehearse(examples, sgd=optimizer)
    

    Results in:

    ValueError: [E953] Mismatched IDs received by the Tok2Vec listener: 179568814531392983158587824 vs. 2172509679243279887229
    
    bug training feat / textcat 
    opened by thomashacker 0
  • Delete unused imports for StringStore

    Delete unused imports for StringStore

    Description

    This PR removes unused imports for StringStore from lexeme and tokenizer.

    Types of change

    Checklist

    • [x] I confirm that I have the right to submit this contribution under the project's MIT license.
    • [x] I ran the tests, and all new and existing tests passed.
    • [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
    opened by tetsuok 0
  • Memory leak when processing a large number of documents with Spacy transformers

    Memory leak when processing a large number of documents with Spacy transformers

    I have a spaCy distilbert transformer model trained for NER. When I use this model for predictions on a large corpus of documents, the RAM usage spikes up very quickly and then keeps increasing over time, until I run out of memory and my process gets killed. I am running this on a CPU AWS machine (m5.12xlarge). I see the same behavior when using the en_core_web_trf model.

    The following code can be used to reproduce the error with the en_core_web_trf model:

    
    import sys, pickle, time, os
    import spacy
    
    print(f"CPU Count: {os.cpu_count()}")
    
    model = spacy.load("en_core_web_trf")
    
    ## Docs are English text documents with average character length of 2479, std dev 3487, max 69000
    docs = pickle.load( open( "memory_analysis/data/docs.p", "rb" ) )
    print(len(docs))
    
    for i, body in (enumerate(docs)):
        if i==10000:
            break
        ## Spacy prediction 
        list( model.pipe([body], disable=["tok2vec", "parser", "attribute_ruler", "lemmatizer"] ))
        if i%400==0:
            print(f"Doc number: {i}")
    

    Environment:

    spacy-transformers==1.1.8
    spacy==3.4.3
    torch==1.12.1
    

    Additional info: I notice that the model's vocab length and the cached string store grow with the number of processed documents as well, although I'm unsure if this is causing the memory leak. I tried periodically reloading the model, but that does not help either.

    Using Memray for memory usage analysis:

    python3 -m memray run -o memory_usage_trf_max.bin memory_analysis.py
    python3 -m memray flamegraph memory_usage_trf_max.bin
    
    opened by saketsharmabmb 0
  • Fix required maximum version of typing-extensions

    Fix required maximum version of typing-extensions

    Description

    This PR fixes the required maximum version of typing-extensions.

    Currently it is bounded to <4.2.0: typing_extensions>=3.7.4.1,<4.2.0; python_version < "3.8"

    This PR sets the upper bound to all compatible versions, until the next major release <5.0.0.

    Required:

    • [ ] https://github.com/explosion/confection/pull/20
    • [ ] https://github.com/explosion/thinc/pull/833

    See:

    • https://github.com/explosion/spaCy/issues/12034

    See issue in pydantic:

    • https://github.com/pydantic/pydantic/issues/4885

    See fixing PR in pydantic (typing-extensions>=4.2.0), which will be incompatible with your requirement typing_extensions>=3.7.4,<4.2.0; python_version < "3.8":

    • https://github.com/pydantic/pydantic/pull/4886

    Types of change

    Checklist

    • [ ] I confirm that I have the right to submit this contribution under the project's MIT license.
    • [ ] I ran the tests, and all new and existing tests passed.
    • [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.
    install third-party 
    opened by albertvillanova 1
Releases (v3.0.9)
  • v3.0.9(Dec 16, 2022)

    This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.

    🔴 Bug fixes

    • #11331, #11701: Clean up warnings in spaCy and its test suite.
    • #11845: Don't raise an error in displaCy for unset spans keys.
    • #11864: Add smart_open requirement and update deprecated options.
    • #11899: Fix spacy init config --gpu for environments without spacy-transformers.
    • #11933: Update for compatibility with NumPy v1.24+ integer conversions.
    • #11935: Restore missing error messages for beam search.

    👥 Contributors

    @adrianeboyd, @honnibal, @ines, @polm, @svlandeg

  • v2.3.9(Dec 16, 2022)

    This release addresses future compatibility with NumPy v1.24+.

    🔴 Bug fixes

    • #11940: Update for compatibility with NumPy v1.24+ integer conversions.

    👥 Contributors

    @adrianeboyd, @honnibal, @ines, @svlandeg

  • v3.4.4(Dec 14, 2022)

    This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.

    🔴 Bug fixes

    • #11845: Don't raise an error in displaCy for unset spans keys.
    • #11860: Fix spancat for docs with zero suggestions.
    • #11864: Add smart_open requirement and update deprecated options.
    • #11899: Fix spacy init config --gpu for environments without spacy-transformers.
    • #11933: Update for compatibility with NumPy v1.24+ integer conversions.
    • #11934: Add strings when initializing from labels in EditTreeLemmatizer.
    • #11935: Restore missing error messages for beam search.

    👥 Contributors

    @adrianeboyd, @danieldk, @honnibal, @ines, @polm, @svlandeg

  • v3.3.2(Dec 16, 2022)

    This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.

    🔴 Bug fixes

    • #10911, #11194: Improve speed in precomputable_biaffine by avoiding concatenation.
    • #11276, #11331, #11701: Clean up warnings in spaCy and its test suite.
    • #11845: Don't raise an error in displaCy for unset spans keys.
    • #11860: Fix spancat for docs with zero suggestions.
    • #11864: Add smart_open requirement and update deprecated options.
    • #11899: Fix spacy init config --gpu for environments without spacy-transformers.
    • #11933: Update for compatibility with NumPy v1.24+ integer conversions.
    • #11934: Add strings when initializing from labels in EditTreeLemmatizer.
    • #11935: Restore missing error messages for beam search.

    👥 Contributors

    @adrianeboyd, @danieldk, @honnibal, @ines, @polm, @svlandeg

  • v3.2.5(Dec 16, 2022)

    This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.

    🔴 Bug fixes

    • #10573: Remove Click pin following Typer updates.
    • #11331, #11701: Clean up warnings in spaCy and its test suite.
    • #11845: Don't raise an error in displaCy for unset spans keys.
    • #11860: Fix spancat for docs with zero suggestions.
    • #11864: Add smart_open requirement and update deprecated options.
    • #11899: Fix spacy init config --gpu for environments without spacy-transformers.
    • #11933: Update for compatibility with NumPy v1.24+ integer conversions.
    • #11935: Restore missing error messages for beam search.

    👥 Contributors

    @adrianeboyd, @honnibal, @ines, @polm, @svlandeg

  • v3.1.7(Dec 16, 2022)

    This bug fix release is primarily to avoid deprecation warnings and future incompatibility with NumPy v1.24+.

    🔴 Bug fixes

    • #10573: Remove Click pin following Typer updates.
    • #11331, #11701: Clean up warnings in spaCy and its test suite.
    • #11845: Don't raise an error in displaCy for unset spans keys.
    • #11860: Fix spancat for docs with zero suggestions.
    • #11864: Add smart_open requirement and update deprecated options.
    • #11899: Fix spacy init config --gpu for environments without spacy-transformers.
    • #11933: Update for compatibility with NumPy v1.24+ integer conversions.
    • #11935: Restore missing error messages for beam search.

    👥 Contributors

    @adrianeboyd, @honnibal, @ines, @polm, @svlandeg

  • v3.4.3(Nov 10, 2022)

    ✨ New features and improvements

    • Extend Typer support to v0.7.x (#11720).

    🔴 Bug fixes

    • #11640: Handle docs with no entities in EntityLinker.
    • #11688: Restore custom doc extension values in Doc.to_json() for attributes set by getters.
    • #11706: Remove incorrect warning for pipeline_package.load().
    • #11735: Improve spacy project requirements checks for unsupported specifiers and requirements lines.
    • #11745: Revert modifications to spacy.load(disable=) that could enable currently disabled components.

    👥 Contributors

    @aaronzipp, @adrianeboyd, @honnibal, @ines, @polm, @rmitsch, @ryndaniels, @svlandeg, @thomashacker

  • v3.4.2(Oct 20, 2022)

    ✨ New features and improvements

    • NEW: Luganda language support (#10847).
    • NEW: Latin language support (#11349).
    • NEW: spacy.ConsoleLogger.v2 optionally saves training logs to JSONL (#11214).
    • NEW: New operators for the DependencyMatcher to include matching parents or children to the left or the right of the node (#10371).
    • Prebuilt Python 3.11 wheels are now available for all spaCy dependencies distributed by @explosion.
    • Support pydantic v1.10 and mypy 0.980+, drop mypy support for Python 3.6 (#11546, #11635).
    • Support CuPy v11 and add extras for cuda11x and cuda-autodetect (using cupy-wheel) (#11279).
    • Support custom attributes for tokens and spans in Doc.to_json() and Doc.from_json() (#11125).
    • Make the enable and disable options for spacy.load() more consistent (#11459).
    • Allow a single string argument for disable/enable/exclude for spacy.load() (#11406).
    • New --url flag for spacy info to print the direct download URL for a pipeline (#11175).
    • Add a check for missing requirements in the spacy project CLI (#11226).
    • Add a Levenshtein distance function (#11418).
    • Improvements to the spacy debug data CLI for spancat data (#11504).
    • Allow overriding spacy_version in spacy package metadata (#11552).
    • Improve the error message when using the wrong command for spacy project assets (#11458).
    • Ensure parent directories are created when storing the results of the spacy pretrain command (#11210).
    • Extend support to newer versions of natto-py for the ko extra (#11222).

    📦 Trained pipelines updates

    This release includes updated English pipelines for spaCy v3.4 with improved NER performance. The updates in en_core_web_* v3.4.1 address issues related to training from data with partial named entity annotation, which led to lower NER recall in English pipeline versions v3.0.0–v3.4.0. In particular, entities that appear in the sections of the OntoNotes training data without NER annotation were not predicted consistently by the earlier pipeline versions, such as names and places that are frequent in the Biblical sections, e.g., "David" and "Egypt" (see #7493).

    Use spacy download to update your English pipelines to the newest version. If you'd prefer to keep using an earlier version, you can specify the version directly with e.g. spacy download -d en_core_web_sm-3.4.0. You can check that you are using the new version (v3.4.1) with spacy validate:

    NAME                     SPACY            VERSION
    en_core_web_md           >=3.4.0,<3.5.0   3.4.1     ✔
    

    🔴 Bug fixes

    • #11275: Fix Dutch noun chunks to skip overlapping spans.
    • #11276: Fix regex invalid escape sequences.
    • #11312: Better handling of unexpected types in SetPredicate.
    • #11460: Fix config validation failures caused by NVTX pipeline wrappers.
    • #11506: Avoid unwanted side effects in Doc.__init__.
    • #11540: Preserve missing entity annotation in augmenters.
    • #11592: Fix issues with DVC commands.
    • #11631: Fix initialization for pymorphy2_lookup lemmatizer mode for Russian and Ukrainian.

    ⚠️ Backwards incompatibilities

    • If you're using a custom component that does not return a Doc type, an error will now be raised (#11424).
    • If you're using a dot in a factory name, an error is raised as this is not supported (#11336).

    👥 Contributors

    @adrianeboyd, @bdura, @danieldk, @diyclassics, @DSLituiev, @GabrielePicco, @honnibal, @ines, @JulesBelveze, @kadarakos, @ljvmiranda921, @ninjalu, @pmbaumgartner, @polm, @radandreicristian, @richardpaulhudson, @rmitsch, @shadeMe, @stefawolf, @svlandeg, @thomashacker, @tobiusaolo, @tzussman , @yasufumy

  • v2.3.8(Oct 19, 2022)

  • v3.4.1(Jul 26, 2022)

    🔴 Bug fixes

    • Fix issue #11137: Fix compatibility with CuPy v9.x.

    👥 Contributors

    @adrianeboyd, @danieldk, @honnibal, @ines, @lll-lll-lll-lll, @Lucaterre, @MaartenGr, @mr-bjerre, @polm, @radenkovic

  • v3.4.0(Jul 12, 2022)

    ✨ New features and improvements

    • Support for mypy 0.950+ and pydantic v1.9 (#10786).
    • Prebuilt linux aarch64 wheels are now available for all spaCy dependencies distributed by @explosion.
    • Min/max {n,m} operator for Matcher patterns (#10981).
    • Language updates:
      • Improve tokenization for Cyrillic combining diacritics (#10837).
      • Improve English tokenizer exceptions for contractions with this/that/these/those (#10873).
    • Improved speed of vector lookups (#10992).
    • For the parser, use C saxpy/sgemm provided by the Ops implementation in order to use Accelerate through thinc-apple-ops (#10773).
    • Improved speed of Example.get_aligned_parse and Example.get_aligned (#10952).
    • Improved speed of StringStore lookups (#10938).
    • Updated spacy project clone to try both main and master branches by default (#10843).
    • Added confidence threshold for named entity linker (#11016).
    • Improved handling of Typer optional default values for init_config_cli (#10788).
    • Added cycle detection in parser projectivization methods (#10877).
    • Added counts for NER labels in debug data (#10960).
    • Support for adding NVTX ranges to TrainablePipe components (#10965).
    • Support env variable SPACY_NUM_BUILD_JOBS to specify the number of build jobs to run in parallel with pip (#11073).

    📦 Trained pipelines updates

    We have added new pipelines for Croatian that use the trainable lemmatizer and floret vectors.

    | Package | UPOS | Parser LAS | NER F |
    | --- | ---: | ---: | ---: |
    | hr_core_news_sm | 96.6 | 77.5 | 76.1 |
    | hr_core_news_md | 97.3 | 80.1 | 81.8 |
    | hr_core_news_lg | 97.5 | 80.4 | 83.0 |

    🙏 Special thanks to @gtoffoli for help with the new pipelines!

    The English pipelines have new word vectors:

    | Package | Model Version | TAG | Parser LAS | NER F |
    | --- | --- | ---: | ---: | ---: |
    | en_core_web_md | v3.3.0 | 97.3 | 90.1 | 84.6 |
    | en_core_web_md | v3.4.0 | 97.2 | 90.3 | 85.5 |
    | en_core_web_lg | v3.3.0 | 97.4 | 90.1 | 85.3 |
    | en_core_web_lg | v3.4.0 | 97.3 | 90.2 | 85.6 |

    All CNN pipelines have been extended to add whitespace augmentation.

    🔴 Bug fixes

    • Fix issue #10960: Support hyphens in NER labels.
    • Fix issue #10994: Fix horizontal spacing for spans in displaCy.
    • Fix issue #11013: Check for any token with a vector in Doc.has_vector, distinguish 0-vectors and missing vectors in similarity warnings.
    • Fix issue #11056: Don't use get_array_module in textcat.
    • Fix issue #11092: Fix vertical alignment for spans in displaCy.

    🚀 Notes about upgrading from v3.3

    • Doc.has_vector now matches Token.has_vector and Span.has_vector: it returns True if at least one token in the doc has a vector rather than checking only whether the vocab contains vectors.

    📖 Documentation and examples

    • spaCy universe additions:
      • Aim-spacy: An Aim-based spaCy experiment tracker.
      • Asent: Fast, flexible and transparent sentiment analysis.
      • spaCy fishing: Named entity disambiguation and linking on Wikidata in spaCy with Entity-Fishing.
      • spacy-report: Generates interactive reports for spaCy models.

    👥 Contributors

    @adrianeboyd, @danieldk, @ericholscher, @gorarakelyan, @honnibal, @ines, @jademlc, @kadarakos, @KennethEnevoldsen, @koaning, @Lucaterre, @maxTarlov, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @rmitsch, @sadovnychyi, @shadeMe, @shen-qin, @single-fingal, @svlandeg, @victorialslocum, @Zackere

  • v3.3.1(Jun 7, 2022)

    🔴 Bug fixes

    • Fix issue #9575: Fix Entity Linker with tokenization mismatches between gold and predicted Doc objects.
    • Fix issue #10685: Fix serialization of SpanGroup objects that share the same name within one SpanGroups container.
    • Fix issue #10718: Remove debug print statements in walk_head_nodes to avoid acquiring the GIL.
    • Fix issue #10741: Make the StringStore.__getitem__ return type dependent on its parameter type.
    • Fix issue #10734: Support removal of overlapping terms in PhraseMatcher.
    • Fix issue #10772: Override SpanGroups.setdefault to also support Iterable[SpanGroup] as the default.
    • Fix issue #10817: Ensure that the term ROOT is in the glossary.
    • Fix issue #10830: Better errors for Doc.has_annotation and Matcher.
    • Fix issue #10864: Avoid pickling Doc inputs passed to Language.pipe().
    • Fix issue #10898: Fix schemas import in Doc.

    ⚠️ Backward incompatibilities

    • Before this release, a validation bug allowed the configuration of a pipeline component to override the name of the pipeline itself through the name attribute. For example, the following pipeline component:

      [components.transformer]
      factory = "transformer"
      name = "custom_transformer_name"
      

      would be registered erroneously as custom_transformer_name. Such overrides are now ignored and a warning is emitted (#10779). From spaCy v3.3.1 onwards, this component will be registered as transformer.

    👥 Contributors

    @adrianeboyd, @danieldk, @freddyheppell, @honnibal, @ines, @kadarakos, @ldorigo, @ljvmiranda921, @maxTarlov, @pmbaumgartner, @polm, @pypae, @richardpaulhudson, @rmitsch, @shadeMe, @single-fingal, @svlandeg

  • v3.3.0(Apr 29, 2022)

    ✨ New features and improvements

    📦 Trained pipelines

    v3.3 introduces trained pipelines for Finnish, Korean and Swedish, which feature the trainable lemmatizer and floret vectors. Due to the use of Bloom embeddings and subwords, the pipelines have compact vectors with no out-of-vocabulary words.

    | Package | Language | UPOS | Parser LAS | NER F |
    | --- | --- | ---: | ---: | ---: |
    | fi_core_news_sm | Finnish | 92.5 | 71.9 | 75.9 |
    | fi_core_news_md | Finnish | 95.9 | 78.6 | 80.6 |
    | fi_core_news_lg | Finnish | 96.2 | 79.4 | 82.4 |
    | ko_core_news_sm | Korean | 86.1 | 65.6 | 71.3 |
    | ko_core_news_md | Korean | 94.7 | 80.9 | 83.1 |
    | ko_core_news_lg | Korean | 94.7 | 81.3 | 85.3 |
    | sv_core_news_sm | Swedish | 95.0 | 75.9 | 74.7 |
    | sv_core_news_md | Swedish | 96.3 | 78.5 | 79.3 |
    | sv_core_news_lg | Swedish | 96.3 | 79.1 | 81.1 |

    🙏 Special thanks to @aajanki, @thiippal (Finnish) and Elena Fano (Swedish) for their help with the new pipelines!

    The new trainable lemmatizer is used for Danish, Dutch, Finnish, German, Greek, Italian, Korean, Lithuanian, Norwegian, Polish, Portuguese, Romanian and Swedish.

    | Model | v3.2 Lemma Acc | v3.3 Lemma Acc |
    | --- | ---: | ---: |
    | da_core_news_md | 84.9 | 94.8 |
    | de_core_news_md | 73.4 | 97.7 |
    | el_core_news_md | 56.5 | 88.9 |
    | fi_core_news_md | - | 86.2 |
    | it_core_news_md | 86.6 | 97.2 |
    | ko_core_news_md | - | 90.0 |
    | lt_core_news_md | 71.1 | 84.8 |
    | nb_core_news_md | 76.7 | 97.1 |
    | nl_core_news_md | 81.5 | 94.0 |
    | pl_core_news_md | 87.1 | 93.7 |
    | pt_core_news_md | 76.7 | 96.9 |
    | ro_core_news_md | 81.8 | 95.5 |
    | sv_core_news_md | - | 95.5 |

    🔴 Bug fixes

    • Fix issue #5447: Avoid overlapping arcs when using displaCy in manual mode.
    • Fix issue #9443: Fix Scorer.score_cats for missing labels.
    • Fix issue #9669: Fix entity linker batching.
    • Fix issue #9903: Handle _ value for UPOS in CoNLL-U converter.
    • Fix issue #9904: Fix textcat loss scaling.
    • Fix issue #9956: Compare all Span attributes consistently.
    • Fix issue #10073: Add "spans" to the output of doc.to_json.
    • Fix issue #10086: Add tokenizer option to allow Matcher handling for all special cases.
    • Fix issue #10189: Allow Example to align whitespace annotation.
    • Fix issue #10302: Fix check for NER annotation in MISC in CoNLL-U converter.
    • Fix issue #10324: Fix Tok2Vec for empty batches.
    • Fix issue #10347: Update basic functionality for rehearse.
    • Fix issue #10394: Fix Vectors.n_keys for floret vectors.
    • Fix issue #10400: Use meta in util.load_model_from_config.
    • Fix issue #10451: Fix Example.get_matching_ents.
    • Fix issue #10460: Fix initial special cases for Tokenizer.explain.
    • Fix issue #10521: Stream large assets on download in spaCy projects.
    • Fix issue #10536: Handle unknown tags in KoreanTokenizer tag map.
    • Fix issue #10551: Add automatic vector deduplication for init vectors.

    🚀 Notes about upgrading from v3.2

    • To see the speed improvements for the Tagger architecture, edit your configs to switch from spacy.Tagger.v1 to spacy.Tagger.v2 and then run init fill-config.
    • Span comparisons involving ordering (<, <=, >, >=) now take all span attributes into account (start, end, label, and KB ID) so spans may be sorted in a slightly different order (#9956).
    • Annotation on whitespace tokens is handled in the same way as annotation on non-whitespace tokens during training in order to allow custom whitespace annotation (#10189).
    • Doc.from_docs now includes Doc.tensor by default and supports excluding fields with an exclude argument in the same format as Doc.to_bytes. The supported exclude fields are spans, tensor and user_data (see the sketch below).
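    For example, a minimal sketch of the new exclude argument (the pipeline and texts are illustrative):

    ```python
    import spacy
    from spacy.tokens import Doc

    nlp = spacy.blank("en")
    docs = list(nlp.pipe(["First text.", "Second text."]))
    # Doc.from_docs now copies Doc.tensor by default; pass exclude to drop
    # fields, using the same field names as Doc.to_bytes.
    merged = Doc.from_docs(docs, exclude=["tensor", "user_data"])
    print(merged.text)
    ```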

    👥 Contributors

    @aajanki, @adrianeboyd, @apjanco, @bdura, @BramVanroy, @danieldk, @danmysak, @davidberenstein1957, @DuyguA, @fonfonx, @gremur, @HaakonME, @harmbuisman, @honnibal, @ines, @internaut, @jfainberg, @jnphilipp, @jsnfly, @kadarakos, @koaning, @ljvmiranda921, @martinjack, @mgrojo, @nrodnova, @ofirnk, @orglce, @pepemedigu, @philipvollet, @pmbaumgartner, @polm, @richardpaulhudson, @ryndaniels, @SamEdwardes, @Schero1994, @shadeMe, @single-fingal, @svlandeg, @thebugcreator, @thomashacker, @umaxfun, @y961996

    Source code(tar.gz)
    Source code(zip)
  • v3.1.6(Mar 30, 2022)

    🔴 Bug fixes

    • Fix issue #10564: Restrict supported Click versions as a workaround for incompatibilities between Click v8.1.0 and Typer v0.4.0.

    👥 Contributors

    @adrianeboyd, @honnibal, @ines

    Source code(tar.gz)
    Source code(zip)
  • v3.2.4(Mar 29, 2022)

    🔴 Bug fixes

    • Fix issue #10564: Restrict supported Click versions as a workaround for incompatibilities between Click v8.1.0 and Typer v0.4.0.

    👥 Contributors

    @adrianeboyd, @honnibal, @ines

    Source code(tar.gz)
    Source code(zip)
  • v3.2.3(Mar 1, 2022)

  • v3.1.5(Mar 1, 2022)

    🔴 Bug fixes

    • Fix issue #9593: Use metaclass to subclass errors for easier pickling.
    • Fix issue #9654: Fix spancat for empty docs and zero suggestions.
    • Fix issue #9979: Fix type of Lexeme.rank.
    • Fix issue #10324: Fix Tok2Vec for empty batches.

    👥 Contributors

    @adrianeboyd, @BramVanroy, @brucewlee, @danieldk, @honnibal, @ines, @ljvmiranda921, @polm, @svlandeg, @vgautam, @xxyzz

    Source code(tar.gz)
    Source code(zip)
  • v3.0.8(Mar 1, 2022)

  • v3.2.2(Feb 11, 2022)

    ✨ New features and improvements

    • Improved parser and ner speeds on long documents (see technical details in #10019).
    • Support for spancat components in debug data.
    • Support for ENT_IOB as a Matcher token pattern key (see the sketch after this list).
    • Extended and improved types for many classes.
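    A minimal sketch of ENT_IOB as a token pattern key, assuming a pipeline with an NER component (en_core_web_sm) is installed:

    ```python
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)
    # Match a token beginning an entity followed by a token inside an entity.
    matcher.add("TWO_TOKEN_ENT", [[{"ENT_IOB": "B"}, {"ENT_IOB": "I"}]])
    doc = nlp("Angela Merkel visited New York City")
    print([doc[start:end].text for _, start, end in matcher(doc)])
    ```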

    🔴 Bug fixes

    • Fix issue #9735: Make floret murmurhash endian-neutral.
    • Fix issue #9738: Support string IOB values for ENT_IOB.
    • Fix issue #9746: Updates to avoid "dictionary size changed during iteration" runtime errors.
    • Fix issue #9960: Warn about entities that cross sentence boundaries in debug data.
    • Fix issue #9979: Fix type for Lexeme.rank.
    • Fix issue #10026: Check for 0-size assets in spacy project.
    • Fix issue #10051: Consistently return scalars from similarity methods.
    • Fix issue #10052: Fix spaces in Doc.from_docs() for empty docs.
    • Fix issue #10079: Fix label detection in debug data for components with custom names.
    • Fix issue #10109: Add types to Underscore and DependencyMatcher and improve types in Language, Matcher and PhraseMatcher.
    • Fix issue #10130: Fix Tokenizer.explain when infixes appear as prefixes.
    • Fix issue #10143: Use simple suggester in spancat initialization.
    • Fix issue #10164: Support IS_SENT_END in Doc.has_annotation.
    • Fix issue #10192: Detect invalid package names in spacy package.
    • Fix issue #10223: Support mixed case in package names.
    • Fix issue #10234: Fix type in PhraseMatcher.

    📖 Documentation and examples

    • Various documentation updates.
    • New spaCy version tags in spaCy universe.
    • New Dockerfile for repeatable website builds and easier local development.
    • New additions to spaCy universe:
      • Augmenty: a text augmentation library
      • Healthsea: an end-to-end spaCy pipeline for exploring health supplement effects
      • spacy-wrap: wrap fine-tuned transformers in spaCy pipelines
      • spacypdfreader: easy PDF to text to spaCy text extraction
      • textnets: text analysis with networks

    👥 Contributors

    @adrianeboyd, @antonpibm, @ColleterVi, @danieldk, @DuyguA, @ezorita, @HaakonME, @honnibal, @ines, @jboynyc, @KennethEnevoldsen, @ljvmiranda921, @mrshu, @pmbaumgartner, @polm, @ramonziai, @richardpaulhudson, @ryndaniels, @svlandeg, @thiippal, @thomashacker, @yoavxyoav

    Source code(tar.gz)
    Source code(zip)
  • v3.2.1(Dec 7, 2021)

    ✨ New features and improvements

    • NEW: doc_cleaner component for removing doc.tensor, doc._.trf_data or other Doc attributes at the end of the pipeline to reduce the size of output docs (see the sketch after this list).
    • NEW: ENT_ID and ENT_KB_ID to Matcher pattern attributes.
    • Support kb_id for entities in displaCy from Doc input.
    • Add Span.sents property for spans spanning over more than one sentence.
    • Add EntityRuler.remove to remove patterns by id.
    • Make the Tagger neg_prefix configurable.
    • Use Language.pipe in Language.evaluate for more efficient processing.
    • Test suite updates: move regression tests into core test modules with pytest markers for issue numbers, extend tests for languages with alpha support.
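    A minimal sketch of the doc_cleaner component (shown on a blank pipeline for illustration; in practice you would add it to a pipeline that actually fills the attributes you want to clear):

    ```python
    import spacy

    nlp = spacy.blank("en")
    # Clear doc.tensor at the end of the pipeline: the config maps attribute
    # paths to the values they should be replaced with.
    nlp.add_pipe("doc_cleaner", config={"attrs": {"tensor": None}})
    doc = nlp("The tensor is cleared after processing.")
    print(doc.tensor)  # None
    ```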

    🔴 Bug fixes

    • Fix issue #9638: Make JsonlCorpus path optional again.
    • Fix issue #9654: Fix spancat for empty docs and zero suggestions.
    • Fix issue #9658: Improve error message for incorrect .jsonl paths in EntityRuler.
    • Fix issue #9674: Fix language-specific factory handling in package CLI.
    • Fix issue #9694: Convert labels to strings for README in package CLI.
    • Fix issue #9697: Exclude strings from source vector checks.
    • Fix issue #9701: Allow Scorer.score_spans to handle predicted docs with missing annotation.
    • Fix issue #9722: Initialize parser from reference parse rather than aligned example.
    • Fix issue #9764: Set annotations more efficiently in tagger and morphologizer.

    👥 Contributors

    @adrianeboyd, @danieldk, @DuyguA, @honnibal, @ines, @ljvmiranda921, @narayanacharya6, @nrodnova, @Pantalaymon, @polm, @richardpaulhudson, @svlandeg, @thiippal, @Vishnunkumar

    Source code(tar.gz)
    Source code(zip)
  • v3.2.0(Nov 5, 2021)

    ✨ New features and improvements

    • NEW: Registered scoring functions for each component in the config.
    • NEW: nlp() and nlp.pipe() accept Doc input, which simplifies setting custom tokenization or extensions before processing (see the sketch below).
    • NEW: Support for floret vectors, which combine fastText subwords with Bloom embeddings for compact, full-coverage vectors.
    • New overwrite config settings for entity_linker, morphologizer, tagger, sentencizer and senter.
    • New extend config setting for the morphologizer, controlling whether existing feature types are preserved.
    • Support for a wider range of language codes in spacy.blank() including IETF language tags, for example fra for French and zh-Hans for Chinese.
    • New package spacy-loggers for additional loggers.
    • New Irish lemmatizer.
    • New Portuguese noun chunks and updated Spanish noun chunks.
    • Language updates for Bulgarian, Catalan, Sinhala, Tagalog, Tigrinya and Vietnamese.
    • Japanese reading and inflection from sudachipy are annotated as Token.morph features.
    • Additional morph_micro_p/r/f scores for morphological features from Scorer.score_morph_per_feat().
    • LIKE_URL attribute includes the tokenizer URL pattern.
    • --n-save-epoch option for spacy pretrain.
    • Trained pipelines:
      • New transformer pipeline for Japanese ja_core_news_trf, thanks to @hiroshi-matsuda-rit and the spaCy Japanese community!
      • Updates for Catalan data, tokenizer and lemmatizer, thanks to @cayorodriguez, Carme Armentano and @TeMU-BSC!
      • Transformer pipelines are trained using spacy-transformers v1.1, with improved IO and more options for model config and output.
      • Universal Dependencies corpora updated to v2.8.
      • Trailing space added as a tok2vec feature, improving the performance for many components, especially fine-grained tagging and sentence segmentation.
      • English attribute ruler patterns updated to improve Token.pos and Token.morph.

    For more details, see the New in v3.2 usage guide.
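    For example, a minimal sketch of two of the additions, passing a premade Doc to nlp() and creating blank pipelines from IETF language tags:

    ```python
    import spacy
    from spacy.tokens import Doc

    nlp = spacy.blank("en")
    # nlp() and nlp.pipe() now accept Doc input, so custom tokenization can be
    # set up before the pipeline runs.
    words = ["Custom", "tokens", "here", "!"]
    doc = nlp(Doc(nlp.vocab, words=words))

    # spacy.blank() now also accepts IETF language tags.
    nlp_fr = spacy.blank("fra")      # same as spacy.blank("fr")
    nlp_zh = spacy.blank("zh-Hans")  # same as spacy.blank("zh")
    ```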

    🔴 Bug fixes

    • Fix issue #8972: Fix pickling for Japanese, Korean and Vietnamese tokenizers.
    • Fix issue #9032: Retain alignment between doc and context for Language.pipe(as_tuples=True) for multiprocessing with custom error handlers.
    • Fix issue #9136: Ignore prefixes when applying suffix patterns in Tokenizer.
    • Fix issue #9584: Use metaclass to subclass errors to allow better pickling.

    ⚠️ Backwards incompatibilities

    • In the Tokenizer, prefixes are now removed before suffix matches are applied, which may lead to minor differences in the output. In particular, the default tokenization of °[cfk]. is now ° c . instead of ° c. for most languages.
    • The tokenizer classes ChineseTokenizer, JapaneseTokenizer, KoreanTokenizer, ThaiTokenizer and VietnameseTokenizer require Vocab rather than Language in __init__.
    • In DocBin, user data is now always serialized according to the store_user_data option, see #9190.

    👥 Contributors

    @adrianeboyd, @Avi197, @baxtree, @BramVanroy, @cayorodriguez, @DuyguA, @fgaim, @honnibal, @ines, @Jette16, @jimregan, @polm, @rspeer, @rumeshmadhusanka, @svlandeg, @syrull, @thomashacker

    Source code(tar.gz)
    Source code(zip)
  • v3.1.4(Oct 29, 2021)

    ✨ New features and improvements

    • NEW: Binary wheels for Python 3.10.
    • NEW: Improve performance on Apple M1 with AppleOps: pip install spacy[apple].
    • GPU profiling with spacy.models_with_nvtx_range.v1.
    • Full mypy integration in the CI and many type fixes across the code base.
    • Added custom Protocol classes in ty.py to define behavior of pipeline components.
    • Support for entity linking visualization in displacy.
    • Allow overriding vars in spacy project assets.
    • Standalone train function to run training from Python scripts just like the spacy train CLI (see the sketch after this list).
    • Support for spacy-transformers>=1.1.0 with improved IO.
    • Support for thinc>=8.0.11 with improved gradient clipping.

    🔴 Bug fixes

    • Fix issue #5507: Improve UX for multiprocessing on GPU.
    • Fix issue #9137: Fix serialization for KnowledgeBase.set_entities.
    • Fix issue #9244: Fix vectors for 0-length spans.
    • Fix issue #9247: Improve UX for the DocBin constructor.
    • Fix issue #9254: Allow unicode in a spacy project title.
    • Fix issue #9263: Make added patterns consistent in the DependencyMatcher.
    • Fix issue #9305: Restore tokenization timing during evaluation.
    • Fix issue #9335: Sync vocab in vectors and sourced components.
    • Fix issue #9387: Ensure lemmas are consistent for Catalan, Dutch, French, Russian and Ukrainian.
    • Fix issue #9404: Create consistent default textcat and textcat_multilabel configurations.
    • Fix issue #9437: Improve UX around Doc object creation.
    • Fix issue #9465: Fix minor issues with convert CLI.
    • Fix issue #9500: Include .pyi files in the distributed package.

    📖 Documentation and examples

    • Various updates to the documentation.
    • New additions to the spaCy universe:
      • deplacy: CUI-based dependency visualizer
      • ipymarkup: Visualizations for NER and syntax trees
      • PhruzzMatcher: Find fuzzy matches
      • spacy-huggingface-hub: Push spaCy pipelines to the Hugging Face Hub
      • spaCyOpenTapioca: Entity Linking on Wikidata
      • spacy-clausie: Clause-based information extraction system
      • "Applied Natural Language Processing in the Enterprise": Book by Ankur A. Patel
      • "Introduction to spaCy 3": Free course by Dr. W.J.B. Mattingly

    👥 Contributors

    @adrianeboyd, @connorbrinton, @danieldk, @DuyguA, @honnibal, @ines, @Jette16, @ljvmiranda921, @mjvallone, @philipvollet, @polm, @rspeer, @ryndaniels, @shigapov, @svlandeg, @thomashacker

    Source code(tar.gz)
    Source code(zip)
  • v3.1.3(Sep 20, 2021)

    ✨ New features and improvements

    • WandbLogger.v3 now supports optional run_name and entity parameters.
    • Improved UX when providing invalid pos values for a Doc or Token.

    🔴 Bug fixes

    • Fix issue #9001: Pass alignments to Matcher callbacks.
    • Fix issue #9009: Include component factories in third-party dependencies resolver.
    • Fix issue #9012: Correct type of config in create_pipe.
    • Fix issue #9014: Allow typer 0.4 to provide support for both Click 7 and Click 8.
    • Fix issue #9033: Fix verbs list for French tokenizer exceptions.
    • Fix issue #9059: Pass overrides to subcommands in spacy project workflows.
    • Fix issue #9074: Improve UX around repo and path arguments in spacy project.
    • Fix issue #9084: Fix inference of epoch_resume in spacy pretrain.
    • Fix issue #9163: Handle spacy-legacy in spacy package dependency detection.
    • Fix issue #9211: Include only runtime-relevant dependencies in spacy package.

    📖 Documentation and examples

    • Various updates to the documentation.
    • A few additions and updates to the spaCy universe.
    • Extended the developer documentation with information about the listener pattern, the StringStore and the Vocab.

    👥 Contributors

    @adrianeboyd, @davidefiocco, @davidstrouk, @filipematos95, @honnibal, @ines, @j-frei, @Joozty, @kwhumphreys, @mjhajharia, @mylibrar, @polm, @rspeer, @shigapov, @svlandeg, @thomashacker

    Source code(tar.gz)
    Source code(zip)
  • v3.1.2(Aug 20, 2021)

    ✨ New features and improvements

    • NEW: Provide scores for the SpanCategorizer predictions.
    • NEW: Broader compatibility with type checkers thanks to .pyi stub files.
    • NEW: Auto-detect package dependencies in spacy package.
    • New INTERSECTS operator for the Matcher (see the sketch after this list).
    • More debugging info for spacy project push and pull commands.
    • Allow passing in a precomputed array for speeding up multiple Span.as_doc calls.
    • The default da transformer is now the same as the one from the trained pipelines (Maltehb/danish-bert-botxo).
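    A minimal sketch of the INTERSECTS operator, assuming a pipeline with morphological annotation (en_core_web_sm) is installed:

    ```python
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)
    # INTERSECTS matches if the token's list-valued attribute shares at least
    # one element with the given list.
    pattern = [{"MORPH": {"INTERSECTS": ["Number=Sing", "Number=Plur"]}}]
    matcher.add("HAS_NUMBER", [pattern])
    doc = nlp("The cats sat on the mat")
    print([doc[start:end].text for _, start, end in matcher(doc)])
    ```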

    🔴 Bug fixes

    • Fix issue #8767: Fix offsets of empty and out-of-bounds spans.
    • Fix issue #8774: Ensure debug data runs correctly with a custom tokenizer.
    • Fix issue #8784: Fix incorrect ISSUBSET and ISSUPERSET in schema and docs.
    • Fix issue #8796: Respect the no_skip value for spacy project run.
    • Fix issue #8810: Make ConsoleLogger flush after each logging line.
    • Fix issue #8819: Pass exclude when serializing the vocab.
    • Fix issue #8830: Avoid adding sourced vectors hashes if not necessary.
    • Fix issue #8970: Fix allow_overlap default for span categorizer scoring.
    • Fix issue #8982: Add glossary entry for _SP.
    • Fix issue #9007: Fix span categorizer training on nested entities.

    👥 Contributors

    @adrianeboyd, @bbieniek, @DuyguA, @ezorita, @HLasse, @honnibal, @ines, @kabirkhan, @kevinlu1248, @ldorigo, @Ledenel, @nsorros, @polm, @svlandeg, @swfarnsworth, @themrmax, @thomashacker

    Source code(tar.gz)
    Source code(zip)
  • v3.0.7(Jul 23, 2021)

    ✨ New features and improvements

    • Alpha tokenization support for Azerbaijani.
    • Updates for French stop words.

    🔴 Bug fixes

    • Fix issue #7629: Fix scoring normalization.
    • Fix issue #7886: Fix unknown tokens percentage in debug data.
    • Fix issue #7907: Update load_lookups return type and docstring.
    • Fix issue #7930: Make EntityLinker robust for nO=None.
    • Fix issue #7925: Skip vector ngram backoff if minn is not set.
    • Fix issue #7973: Fix debug model for transformers.
    • Fix issue #7988: Preserve existing ENT_KB_ID in ner annotation.
    • Fix issue #7992: Fix span offsets for Matcher(as_spans) on spans.
    • Fix issue #8004: Handle errors while multiprocessing.
    • Fix issue #8009: Fix Doc.from_docs() for all empty docs.
    • Fix issue #8012: Fix ensemble textcat with listener.
    • Fix issue #8054: Add ENT_ID and NORM to DocBin strings.
    • Fix issue #8055: Handle partial entities in Span.as_doc.
    • Fix issue #8062: Make all Span attrs writable.
    • Fix issue #8066: Update debug data for textcat.
    • Fix issue #8069: Custom warning if DocBin is too large.
    • Fix issue #8113: Support to/from_bytes for KnowledgeBase and EntityLinker.
    • Fix issue #8116: Fix offsets in Span.get_lca_matrix.
    • Fix issue #8132: Remove unsupported attrs from attrs.IDS.
    • Fix issue #8158: Ensure tolerance is passed on in spacy.batch_by_words.v1.
    • Fix issue #8169: Fix EntityRuler bug where ent_ids returned None for phrases.
    • Fix issue #8208: Address missing config overrides post load of models.
    • Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
    • Fix issue #8216: Don't add duplicate patterns in EntityRuler.
    • Fix issue #8244: Use context manager when reading model file.
    • Fix issue #8245: Fix other open calls without context managers.
    • Fix issue #8265: Address mypy errors.
    • Fix issue #8299: Restrict pymorphy2 requirement to pymorphy2 mode in Russian and Ukrainian lemmatizers.
    • Fix issue #8335: Raise error if deps not provided with heads in Doc.
    • Fix issue #8368: Preserve whitespace in Span.lemma_.
    • Fix issue #8396: Make JsonlReader path optional.
    • Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
    • Fix issue #8423: Update validate CLI to fix compat and ignore warnings.
    • Fix issue #8426: Fix setting empty entities in Example.from_dict.
    • Fix issue #8487: Fix span offsets and keys in Doc.from_docs.
    • Fix issue #8584: Raise an error for textcat with <2 labels.
    • Fix issue #8551: Fix duplicate spacy package CLI opts.

    👥 Contributors

    @adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @fhopp, @frascuchon, @graue70, @ines, @jenojp, @jhroy, @jklaise, @juliensalinas, @meghanabhange, @michael-k, @narayanacharya6, @polm, @sevdimali, @svlandeg, @ZeeD

    Source code(tar.gz)
    Source code(zip)
  • v3.1.1(Jul 20, 2021)

    ✨ New features and improvements

    • Alpha tokenization support for Ancient Greek.
    • Implementation of a noun_chunk iterator for Dutch.
    • Support for black & flake8 as pre-commit hooks.
    • New spacy.ngram_range_suggester.v1 for suggesting a range of n-gram sizes for the spancat component (see the sketch below).
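    A minimal sketch of configuring spancat with the new suggester (the sizes are illustrative):

    ```python
    import spacy

    nlp = spacy.blank("en")
    # Suggest every span of 1 to 3 tokens as a candidate for the span categorizer.
    nlp.add_pipe(
        "spancat",
        config={
            "suggester": {
                "@misc": "spacy.ngram_range_suggester.v1",
                "min_size": 1,
                "max_size": 3,
            }
        },
    )
    ```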

    🔴 Bug fixes

    • Fix issue #8638: Fix Azerbaijani initialization.
    • Fix issue #8639: Use 0-vector for OOV lexemes.
    • Fix issue #8640: Update lexeme ranks for loaded vectors.
    • Fix issue #8651: Fix ru and uk multiprocessing (with spawn).
    • Fix issue #8663: Preserve existing meta information with spacy package.
    • Fix issue #8718: Ensure that replace_pipe takes disabled components into account.

    👥 Contributors

    @adrianeboyd, @honnibal, @ines, @jmyerston, @julien-talkair, @KennethEnevoldsen, @mariosasko, @mylibrar, @polm, @rynoV, @svlandeg, @thomashacker, @yohasebe

    Source code(tar.gz)
    Source code(zip)
  • v3.1.0(Jul 7, 2021)

    ✨ New features and improvements

    For more details, see the New in v3.1 usage guide.

    📦 New trained pipelines

    | Package | Language | UPOS | Parser LAS | NER F |
    | ---------------- | -------- | ---: | ---------: | ----: |
    | ca_core_news_sm | Catalan | 98.2 | 87.4 | 79.8 |
    | ca_core_news_md | Catalan | 98.3 | 88.2 | 84.0 |
    | ca_core_news_lg | Catalan | 98.5 | 88.4 | 84.2 |
    | ca_core_news_trf | Catalan | 98.9 | 93.0 | 91.2 |
    | da_core_news_trf | Danish | 98.0 | 85.0 | 82.9 |

    ⚠️ Upgrading from v3.0

    • Due to the use of configs with extensive versioning, v3.0 pipelines should be compatible with v3.1; however, you may see slight differences in performance. Test your v3.0 pipeline with v3.1 against your test suite, and if the performance is identical, extend the spacy_version in your model package meta to ">=3.0.0,<3.2.0". If you run into degraded performance, retrain your pipeline with v3.1.
    • Use spacy init fill-config to update a v3.0 config for v3.1.
    • When sourcing a pipeline component that requires static vectors, it is now required to include the source model's vectors in [initialize.vectors].
    • Logger warnings have been converted to Python warnings. Use warnings.filterwarnings or the new helper method spacy.errors.filter_warning(action, error_msg='') to manage warnings (see the sketch below).

    For more information, see Notes on upgrading from v3.0.
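    For example, a minimal sketch of filtering a single warning (the code W036 is illustrative):

    ```python
    import warnings
    from spacy.errors import filter_warning

    # spaCy warning messages start with their code, e.g. "[W036]", so a regex
    # anchored at the start of the message targets one warning.
    warnings.filterwarnings("ignore", message=r"\[W036\]")
    # Or use the new helper, with the same action values:
    filter_warning("ignore", error_msg="W036")
    ```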

    🔴 Bug fixes

    • Fix issue #7036: Use a context manager when reading model.
    • Fix issue #7629: Fix scoring normalization.
    • Fix issue #7799: Ensure spacy ray command works.
    • Fix issue #7807: Show warning if entity ruler runs without patterns.
    • Fix issue #7886: Fix unknown tokens percentage in debug data.
    • Fix issue #7930: Make EntityLinker robust for nO=None.
    • Fix issue #7925: Skip vector ngram backoff if minn is not set.
    • Fix issue #7973: Fix debug model for transformers.
    • Fix issue #7988: Preserve existing ENT_KB_ID in ner annotation.
    • Fix issue #8004: Handle errors while multiprocessing.
    • Fix issue #8009: Fix Doc.from_docs() for all empty docs.
    • Fix issue #8012: Fix ensemble textcat with listener.
    • Fix issue #8054: Add ENT_ID and NORM to DocBin strings.
    • Fix issue #8055: Handle partial entities in Span.as_doc.
    • Fix issue #8062: Make all Span attrs writable.
    • Fix issue #8066: Update debug data for textcat.
    • Fix issue #8069: Custom warning if DocBin is too large.
    • Fix issue #8099: Update Vietnamese tokenizer.
    • Fix issue #8113: Support to/from_bytes for KnowledgeBase and EntityLinker.
    • Fix issue #8116: Fix offsets in Span.get_lca_matrix.
    • Fix issue #8132: Remove unsupported attrs from attrs.IDS.
    • Fix issue #8158: Ensure tolerance is passed on in spacy.batch_by_words.v1.
    • Fix issue #8169: Fix EntityRuler bug where ent_ids returned None for phrases.
    • Fix issue #8208: Address missing config overrides post load of models.
    • Fix issue #8212: Add all symbols in Unicode Currency Symbols to currency characters.
    • Fix issue #8216: Don't add duplicate patterns in EntityRuler.
    • Fix issue #8265: Address mypy errors.
    • Fix issue #8335: Raise error if deps not provided with heads in Doc.
    • Fix issue #8368: Preserve whitespace in Span.lemma_.
    • Fix issue #8388: Don't clobber vectors when loading components from source models.
    • Fix issue #8421: Fix non-deterministic deduplication in Greek lemmatizer.
    • Fix issue #8426: Fix setting empty entities in Example.from_dict.
    • Fix issue #8441: Add correct types for Language.pipe return values.
    • Fix issue #8487: Fix span offsets and keys in Doc.from_docs.
    • Fix issue #8559: Fix vectors check for sourced components.
    • Fix issue #8584: Raise an error for textcat with <2 labels.

    👥 Contributors

    @aajanki, @adrianeboyd, @bodak, @bryant1410, @dhruvrnaik, @explosion-bot, @fhopp, @frascuchon, @graue70, @gtoffoli, @honnibal, @ines, @jacopofar, @jenojp, @jhroy, @jklaise, @juliensalinas, @kevinlu1248, @ldorigo, @mathcass, @meghanabhange, @michael-k, @narayanacharya6, @NirantK, @nsorros, @polm, @sevdimali, @svlandeg, @themrmax, @xadrianzetx, @yohasebe, @ZeeD

    Source code(tar.gz)
    Source code(zip)
  • v2.3.7(Jun 4, 2021)

  • v2.3.6(May 18, 2021)

    ✨ New features and improvements

    • Add base support for Amharic.
    • Add noun chunk iterator for Danish.
    • Updates to French, Portuguese and Romanian stop words.

    🔴 Bug fixes

    • Fix issue #6705: Fix deserialization of null token_match and url_match for the tokenizer.
    • Fix issue #6712: Prevent overlapping noun chunks for Spanish.
    • Fix issue #6745: Fix minibatch iterator when size iterator is finished.
    • Fix issue #6759: Skip 0-length matches in the Matcher.
    • Fix issue #6771: Support IS_SENT_START in the PhraseMatcher.
    • Fix issue #6772: Fix Span.text for empty spans.
    • Fix issue #6820: Improve Doc.char_span alignment_mode handling.
    • Fix issue #6857: Remove --no-cache-dir when downloading models.
    • Fix issue #8115: Fix offsets in Span.get_lca_matrix.

    👥 Contributors

    Thanks to @alexcombessie, @AMArostegui, @bryant1410, @Cristianasp, @garethsparks, @jenojp, @jganseman, @jumasheff, @lorenanda, @ophelielacroix, @thomasbird, @timgates42, @tupui and @yosiasz for the pull requests and contributions.

    Source code(tar.gz)
    Source code(zip)
  • v3.0.6(Apr 23, 2021)

    ✨ New features and improvements

    • New assemble CLI command for assembling a pipeline from a config without training.
    • Add support for match alignments in the Matcher to align matched tokens with matcher patterns (see the sketch after this list).
    • Add support for training from streamed corpora.
    • Add support for W&B data and model checkpoint logging and versioning in spacy.WandbLogger.v2.
    • Extend Scorer.score_spans to support overlapping and unlabeled spans.
    • Update debug data for new v3 components.
    • Improve language data for Italian.
    • Various improvements to error handling and UX.
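    A minimal sketch of match alignments in the Matcher (the pattern and text are illustrative):

    ```python
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.blank("en")
    matcher = Matcher(nlp.vocab)
    # A pattern with an operator, so several tokens can map to one pattern token.
    pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "*"}, {"LOWER": "world"}]
    matcher.add("GREETING", [pattern])
    doc = nlp("hello , , world")
    # with_alignments=True also returns, for each matched token, the index of
    # the pattern token that matched it.
    for match_id, start, end, alignments in matcher(doc, with_alignments=True):
        print(doc[start:end].text, alignments)
    ```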

    🔴 Bug fixes

    • Fix issue #7408: Add vocab kwarg to spacy.load.
    • Fix issue #7419: Exclude user hooks in displacy conversion.
    • Fix issue #7421: Update --code usage in CLI commands.
    • Fix issue #7424: Preserve sent starts on retokenization without parse.
    • Fix issue #7440: Fix pymorphy2 lookup lemmatizer.
    • Fix issue #7471: Improve warnings related to listening components.
    • Fix issue #7488: Fix upstream check in pretraining.
    • Fix issue #7489: Support callbacks entry points.
    • Fix issue #7497: Merge doc.spans in Doc.from_docs().
    • Fix issue #7528: Preserve user data for DependencyMatcher on spans.
    • Fix issue #7557: Fix __add__ method for PRFScore.
    • Fix issue #7574: Fix conversion of custom extension data in Span.as_doc and Doc.from_docs.
    • Fix issue #7620: Fix replace_listeners in configs.
    • Fix issue #7626: Fix vectors data on GPU.
    • Fix issue #7630: Update NEL for entities crossing sentence boundaries.
    • Fix issue #7631: Fix parser sourcing in NER converter.
    • Fix issue #7642: Fix handling of hyphen string value in config files.
    • Fix issue #7655: Fix sent starts when converting from v2 JSON training format.
    • Fix issue #7674: Fix handling of unknown tokens in StaticVectors.
    • Fix issue #7690: Fix pickling of Lemmatizer.
    • Fix issue #7749: Update Tokenizer.explain for special cases in v3.
    • Fix issue #7755: Fix config parsing of ints/strings.
    • Fix issue #7836: Fix tokenizer cache flushing.
    • Fix issue #7847: Fix handling of boolean values in Example.from_dict for sent starts.

    📖 Documentation and examples

    • Add documentation for legacy functions and architectures.
    • Add documentation for pretrained pipeline design.
    • Add more details about pipe and multiprocessing.
    • Fix various typos and inconsistencies.

    👥 Contributors

    Thanks to @alvaroabascar, @armsp, @AyushExel, @BramVanroy, @broaddeep, @bryant1410, @bsweileh, @dpalmasan, @Findus23, @graue70, @jaidevd, @koaning, @langdonholmes, @m0canu1, @meghanabhange, @paoloq, @plison, @richardpaulhudson, @SamEdwardes, @Stannislav for the pull requests and contributions!

    Source code(tar.gz)
    Source code(zip)