State of the Art Natural Language Processing

Overview

Spark NLP: State of the Art Natural Language Processing

build Maven Central PyPI version Anaconda-Cloud License

Spark NLP is a Natural Language Processing library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment. Spark NLP comes with 1100+ pretrained pipelines and models in more than 192 languages. It supports state-of-the-art transformers such as BERT, XLNet, ELMO, ALBERT, and Universal Sentence Encoder that can be used seamlessly in a cluster. It also offers Tokenization, Word Segmentation, Part-of-Speech Tagging, Named Entity Recognition, Dependency Parsing, Spell Checking, Multi-class Text Classification, Multi-class Sentiment Analysis, Machine Translation (180+ languages), Summarization and Question Answering (Google T5), and many more NLP tasks.

Project's website

Take a look at our official Spark NLP page, http://nlp.johnsnowlabs.com/, for user documentation and examples.

Community support

  • Slack For live discussion with the Spark NLP community and the team
  • GitHub Bug reports, feature requests, and contributions
  • Discussions Engage with other community members, share ideas, and show off how you use Spark NLP!
  • Medium Spark NLP articles
  • YouTube Spark NLP video tutorials

Table of contents

Features

  • Tokenization
  • Trainable Word Segmentation
  • Stop Words Removal
  • Token Normalizer
  • Document Normalizer
  • Stemmer
  • Lemmatizer
  • NGrams
  • Regex Matching
  • Text Matching
  • Chunking
  • Date Matcher
  • Sentence Detector
  • Deep Sentence Detector (Deep learning)
  • Dependency parsing (Labeled/unlabeled)
  • Part-of-speech tagging
  • Sentiment Detection (ML models)
  • Spell Checker (ML and DL models)
  • Word Embeddings (GloVe and Word2Vec)
  • BERT Embeddings (TF Hub models)
  • ELMO Embeddings (TF Hub models)
  • ALBERT Embeddings (TF Hub models)
  • XLNet Embeddings
  • Universal Sentence Encoder (TF Hub models)
  • BERT Sentence Embeddings (42 TF Hub models)
  • Sentence Embeddings
  • Chunk Embeddings
  • Unsupervised keywords extraction
  • Language Detection & Identification (up to 375 languages)
  • Multi-class Sentiment analysis (Deep learning)
  • Multi-label Sentiment analysis (Deep learning)
  • Multi-class Text Classification (Deep learning)
  • Neural Machine Translation
  • Text-To-Text Transfer Transformer (Google T5)
  • Named entity recognition (Deep learning)
  • Easy TensorFlow integration
  • GPU Support
  • Full integration with Spark ML functions
  • +710 pre-trained models in +192 languages!
  • +450 pre-trained pipelines in +192 languages!
  • Multi-lingual NER models: Arabic, Bengali, Chinese, Danish, Dutch, English, Finnish, French, German, Hebrew, Italian, Japanese, Korean, Norwegian, Persian, Polish, Portuguese, Russian, Spanish, Swedish, and Urdu.

Requirements

To use Spark NLP you need the following:

  • Java 8
  • Apache Spark 2.4.x (or Apache Spark 2.3.x)

Quick Start

This is a quick example of how to use a Spark NLP pre-trained pipeline in Python and PySpark:

$ java -version
# should be Java 8 (Oracle or OpenJDK)
$ conda create -n sparknlp python=3.6 -y
$ conda activate sparknlp
$ pip install spark-nlp==2.7.3 pyspark==2.4.7

In Python console or Jupyter Python3 kernel:

# Import Spark NLP
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.pretrained import PretrainedPipeline
import sparknlp

# Start Spark Session with Spark NLP
# The start() function has two parameters: gpu and spark23
# sparknlp.start(gpu=True) will start the session with GPU support
# sparknlp.start(spark23=True) is when you have Apache Spark 2.3.x installed
spark = sparknlp.start()

# Download a pre-trained pipeline
pipeline = PretrainedPipeline('explain_document_dl', lang='en')

# Your testing dataset
text = """
The Mona Lisa is a 16th century oil painting created by Leonardo.
It's held at the Louvre in Paris.
"""

# Annotate your testing dataset
result = pipeline.annotate(text)

# What's in the pipeline
list(result.keys())
Output: ['entities', 'stem', 'checked', 'lemma', 'document',
'pos', 'token', 'ner', 'embeddings', 'sentence']

# Check the results
result['entities']
Output: ['Mona Lisa', 'Leonardo', 'Louvre', 'Paris']
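
The same pretrained pipeline can also annotate a Spark DataFrame via transform(); a minimal sketch (the DataFrame here is illustrative):

# Annotate a DataFrame instead of a plain string
df = spark.createDataFrame([("The Mona Lisa is in the Louvre.",)]).toDF("text")
annotated = pipeline.transform(df)  # adds one output column per annotator
annotated.select("entities.result").show(truncate=False)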

For more examples, you can visit our dedicated repository to showcase all Spark NLP use cases!

Apache Spark Support

Spark NLP 2.7.3 has been built on top of Apache Spark 2.4.x and fully supports Apache Spark 2.3.x:

Spark NLP Apache Spark 2.3.x Apache Spark 2.4.x
2.7.x YES YES
2.6.x YES YES
2.5.x YES YES
2.4.x Partially YES
1.8.x Partially YES
1.7.x YES NO
1.6.x YES NO
1.5.x YES NO

NOTE: Starting with the 2.5.4 release, we support both Apache Spark 2.4.x and Apache Spark 2.3.x at the same time.

Find out more about Spark NLP versions from our release notes.

Databricks Support

Spark NLP 2.7.3 has been tested and is compatible with the following runtimes:

  • 6.2
  • 6.2 ML
  • 6.3
  • 6.3 ML
  • 6.4
  • 6.4 ML
  • 6.5
  • 6.5 ML

EMR Support

Spark NLP 2.7.3 has been tested and is compatible with the following EMR releases:

  • 5.26.0
  • 5.27.0

Full list of EMR releases.

Usage

Spark Packages

Command line (requires internet connection)

This library has been uploaded to the spark-packages repository.

The benefit of spark-packages is that it makes the library available for both Scala/Java and Python.

To use the most recent version on Apache Spark 2.4.x, just add --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3 to your spark command:

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3
spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3

This can also be used to create a SparkSession manually by using the spark.jars.packages option in both Python and Scala.

NOTE: To use Spark NLP with GPU you can use the dedicated GPU package com.johnsnowlabs.nlp:spark-nlp-gpu_2.11:2.7.3

NOTE: To use Spark NLP on Apache Spark 2.3.x you should instead use the following packages:

  • CPU: com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:2.7.3
  • GPU: com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:2.7.3

NOTE: In case you are using large pretrained models like UniversalSentenceEncoder, you need to have the following set in your SparkSession:

spark-shell --driver-memory 16g --conf spark.kryoserializer.buffer.max=1000M --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3

Scala

Our package is deployed to maven central. To add this package as a dependency in your application:

Maven

spark-nlp on Apache Spark 2.4.x:

<!-- https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp -->
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.11</artifactId>
    <version>2.7.3</version>
</dependency>

spark-nlp-gpu:

<!-- https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu -->
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu_2.11</artifactId>
    <version>2.7.3</version>
</dependency>

spark-nlp on Apache Spark 2.3.x:

<!-- https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-spark23 -->
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-spark23_2.11</artifactId>
    <version>2.7.3</version>
</dependency>

spark-nlp-gpu:

<!-- https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu-spark23 -->
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
    <version>2.7.3</version>
</dependency>

SBT

spark-nlp on Apache Spark 2.4.x:

// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "2.7.3"

spark-nlp-gpu:

// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu" % "2.7.3"

spark-nlp on Apache Spark 2.3.x:

// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-spark23
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-spark23" % "2.7.3"

spark-nlp-gpu:

// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-gpu-spark23
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-gpu-spark23" % "2.7.3"

Maven Central: https://mvnrepository.com/artifact/com.johnsnowlabs.nlp

Python

Python without explicit Pyspark installation

Pip/Conda

If you installed pyspark through pip/conda, you can install spark-nlp through the same channel.

Pip:

pip install spark-nlp==2.7.3

Conda:

conda install -c johnsnowlabs spark-nlp

PyPI spark-nlp package / Anaconda spark-nlp package

Then you'll have to create a SparkSession either from Spark NLP:

import sparknlp

spark = sparknlp.start()

or manually:

spark = SparkSession.builder \
    .appName("Spark NLP")\
    .master("local[4]")\
    .config("spark.driver.memory","16G")\
    .config("spark.driver.maxResultSize", "0") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3")\
    .config("spark.kryoserializer.buffer.max", "1000M")\
    .getOrCreate()

If using local jars, you can use spark.jars instead with a comma-delimited list of jar files. For cluster setups, you will of course have to put the jars in a location reachable by all driver and executor nodes.
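
If you go the spark.jars route, a minimal sketch looks like the following (the local JAR path is a hypothetical placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Spark NLP") \
    .master("local[4]") \
    .config("spark.driver.memory", "16G") \
    .config("spark.jars", "/opt/jars/spark-nlp-assembly-2.7.3.jar") \
    .config("spark.kryoserializer.buffer.max", "1000M") \
    .getOrCreate()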

Quick example:

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# Create or get a Spark session

spark = sparknlp.start()

sparknlp.version()
spark.version

# Download, load, and annotate a text with a pre-trained pipeline

pipeline = PretrainedPipeline('recognize_entities_dl', 'en')
result = pipeline.annotate('The Mona Lisa is a 16th century oil painting created by Leonardo')
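
result is a plain dict keyed by the pipeline's output columns, so the entities can be read directly:

# e.g. ['Mona Lisa', 'Leonardo'] for the sentence above
result['entities']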

Compiled JARs

Build from source

spark-nlp

  • FAT-JAR for CPU on Apache Spark 2.4.x
sbt assembly
  • FAT-JAR for GPU on Apache Spark 2.4.x
sbt -Dis_gpu=true assembly
  • FAT-JAR for CPU on Apache Spark 2.3.x
sbt -Dis_spark23=true assembly
  • FAT-JAR for GPU on Apache Spark 2.3.x
sbt -Dis_gpu=true -Dis_spark23=true assembly

Using the jar manually

If for some reason you need to use the JAR, you can either download the Fat JARs provided here or download them from Maven Central.

To add JARs to spark programs use the --jars option:

spark-shell --jars spark-nlp.jar

The preferred way to use the library when running spark programs is using the --packages option as specified in the spark-packages section.

Apache Zeppelin

Use either one of the following options:

  • Add the following Maven Coordinates to the interpreter's library list
com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3
  • Add the path to a pre-built jar from here to the interpreter's library list, making sure the jar is available on the driver's path

Python in Zeppelin

Apart from the previous step, install the Python module through pip:

pip install spark-nlp==2.7.3

Or you can install spark-nlp from inside Zeppelin by using Conda:

python.conda install -c johnsnowlabs spark-nlp

Configure Zeppelin properly and use cells with %spark.pyspark or whichever interpreter name you chose.

Finally, in the Zeppelin interpreter settings, make sure zeppelin.python is set to the Python you want to use and installed the pip library with (e.g. python3).

An alternative option is to set SPARK_SUBMIT_OPTIONS (in zeppelin-env.sh) and make sure --packages is there as shown earlier, since it handles both the Scala and Python sides of the installation.
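
For example, in zeppelin-env.sh (a sketch, assuming the same package coordinates used throughout this page):

export SPARK_SUBMIT_OPTIONS="--packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3"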

Jupyter Notebook (Python)

The easiest way to get this done is by making Jupyter Notebook run using pyspark as follows:

export SPARK_HOME=/path/to/your/spark/folder
export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3

Alternatively, you can combine the --jars option for pyspark with pip install spark-nlp.

If not using pyspark at all, you'll have to follow the instructions pointed to here.

Google Colab Notebook

Google Colab is perhaps the easiest way to get started with spark-nlp. It requires no installation or setup other than having a Google account.

Run the following code in a Google Colab notebook and start using spark-nlp right away.

import os

# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

# Install pyspark
! pip install --ignore-installed pyspark==2.4.7

# Install Spark NLP
! pip install --ignore-installed spark-nlp==2.7.3

# Quick SparkSession start
import sparknlp
spark = sparknlp.start()

print("Spark NLP version")
sparknlp.version()
print("Apache Spark version")
spark.version

Here is a live demo on Google Colab that performs sentiment analysis and NER using pretrained spark-nlp models.

Databricks Cluster

  1. Create a cluster if you don't have one already

  2. On a new cluster or an existing one, add the following to the Advanced Options -> Spark tab:

spark.kryoserializer.buffer.max 1000M
spark.serializer org.apache.spark.serializer.KryoSerializer

  3. Check the Enable autoscaling local storage box to have persistent local storage

  4. In the Libraries tab inside your cluster, follow these steps:

    4.1. Install New -> PyPI -> spark-nlp -> Install

    4.2. Install New -> Maven -> Coordinates -> com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3 -> Install

  5. Now you can attach your notebook to the cluster and use Spark NLP!

S3 Cluster

With no Hadoop configuration

If your distributed storage is S3 and you don't have a standard Hadoop configuration (i.e., fs.defaultFS), you need to specify where in the cluster's distributed storage you want to store Spark NLP's tmp files. First, decide where you want to put your application.conf file:

import com.johnsnowlabs.util.ConfigLoader
ConfigLoader.setConfigPath("/somewhere/to/put/application.conf")

Then put the following content in that application.conf:

sparknlp {
  settings {
    cluster_tmp_dir = "somewhere in s3n:// path to some folder"
  }
}
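
For example, a concrete value might look like the following (the bucket and folder are hypothetical placeholders):

sparknlp {
  settings {
    cluster_tmp_dir = "s3n://my-bucket/sparknlp/tmp"
  }
}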

Pipelines and Models

Pipelines

Spark NLP offers 450+ pre-trained pipelines in 192 languages.

English pipelines:

Pipeline Name Build lang
Explain Document ML explain_document_ml 2.4.0 en
Explain Document DL explain_document_dl 2.4.3 en
Recognize Entities DL recognize_entities_dl 2.4.3 en
Recognize Entities DL recognize_entities_bert 2.4.3 en
OntoNotes Entities Small onto_recognize_entities_sm 2.4.0 en
OntoNotes Entities Large onto_recognize_entities_lg 2.4.0 en
Match Datetime match_datetime 2.4.0 en
Match Pattern match_pattern 2.4.0 en
Match Chunk match_chunks 2.4.0 en
Match Phrases match_phrases 2.4.0 en
Clean Stop clean_stop 2.4.0 en
Clean Pattern clean_pattern 2.4.0 en
Clean Slang clean_slang 2.4.0 en
Check Spelling check_spelling 2.4.0 en
Check Spelling DL check_spelling_dl 2.5.0 en
Analyze Sentiment analyze_sentiment 2.4.0 en
Analyze Sentiment DL analyze_sentimentdl_use_imdb 2.5.0 en
Analyze Sentiment DL analyze_sentimentdl_use_twitter 2.5.0 en
Dependency Parse dependency_parse 2.4.0 en

Quick example:

import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP

SparkNLP.version()

val testData = spark.createDataFrame(Seq(
  (1, "Google has announced the release of a beta version of the popular TensorFlow machine learning library"),
  (2, "Donald John Trump (born June 14, 1946) is the 45th and current president of the United States")
)).toDF("id", "text")

val pipeline = PretrainedPipeline("explain_document_dl", lang="en")

val annotation = pipeline.transform(testData)

annotation.show()
/*
import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline
import com.johnsnowlabs.nlp.SparkNLP
2.5.0
testData: org.apache.spark.sql.DataFrame = [id: int, text: string]
pipeline: com.johnsnowlabs.nlp.pretrained.PretrainedPipeline = PretrainedPipeline(explain_document_dl,en,public/models)
annotation: org.apache.spark.sql.DataFrame = [id: int, text: string ... 10 more fields]
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| id|                text|            document|               token|            sentence|             checked|               lemma|                stem|                 pos|          embeddings|                 ner|            entities|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|  1|Google has announ...|[[document, 0, 10...|[[token, 0, 5, Go...|[[document, 0, 10...|[[token, 0, 5, Go...|[[token, 0, 5, Go...|[[token, 0, 5, go...|[[pos, 0, 5, NNP,...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 0, 5, Go...|
|  2|The Paris metro w...|[[document, 0, 11...|[[token, 0, 2, Th...|[[document, 0, 11...|[[token, 0, 2, Th...|[[token, 0, 2, Th...|[[token, 0, 2, th...|[[pos, 0, 2, DT, ...|[[word_embeddings...|[[named_entity, 0...|[[chunk, 4, 8, Pa...|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
*/

annotation.select("entities.result").show(false)

/*
+----------------------------------+
|result                            |
+----------------------------------+
|[Google, TensorFlow]              |
|[Donald John Trump, United States]|
+----------------------------------+
*/

Please check out our Models Hub for the full list of pre-trained pipelines with examples, demos, benchmarks, and more.

Models

Spark NLP offers 710+ pre-trained models in 192 languages.

Some of the selected languages: Afrikaans, Arabic, Armenian, Basque, Bengali, Breton, Bulgarian, Catalan, Czech, Dutch, English, Esperanto, Finnish, French, Galician, German, Greek, Hausa, Hebrew, Hindi, Hungarian, Indonesian, Irish, Italian, Japanese, Latin, Latvian, Marathi, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Slovak, Slovenian, Somali, Southern Sotho, Spanish, Swahili, Swedish, Tswana, Turkish, Ukrainian, Zulu

English Models:

Model Name Build Lang
LemmatizerModel (Lemmatizer) lemma_antbnc 2.0.2 en
PerceptronModel (POS) pos_anc 2.0.2 en
PerceptronModel (POS UD) pos_ud_ewt 2.2.2 en
NerCrfModel (NER with GloVe) ner_crf 2.4.0 en
NerDLModel (NER with GloVe) ner_dl 2.4.3 en
NerDLModel (NER with BERT) ner_dl_bert 2.4.3 en
NerDLModel (OntoNotes with GloVe 100d) onto_100 2.4.0 en
NerDLModel (OntoNotes with GloVe 300d) onto_300 2.4.0 en
SymmetricDeleteModel (Spell Checker) spellcheck_sd 2.0.2 en
NorvigSweetingModel (Spell Checker) spellcheck_norvig 2.0.2 en
ViveknSentimentModel (Sentiment) sentiment_vivekn 2.0.2 en
DependencyParser (Dependency) dependency_conllu 2.0.8 en
TypedDependencyParser (Dependency) dependency_typed_conllu 2.0.8 en

Embeddings:

Model Name Build Lang
WordEmbeddings (GloVe) glove_100d 2.4.0 en
BertEmbeddings bert_base_uncased 2.4.0 en
BertEmbeddings bert_base_cased 2.4.0 en
BertEmbeddings bert_large_uncased 2.4.0 en
BertEmbeddings bert_large_cased 2.4.0 en
ElmoEmbeddings elmo 2.4.0 en
UniversalSentenceEncoder (USE) tfhub_use 2.4.0 en
UniversalSentenceEncoder (USE) tfhub_use_lg 2.4.0 en
AlbertEmbeddings albert_base_uncased 2.5.0 en
AlbertEmbeddings albert_large_uncased 2.5.0 en
AlbertEmbeddings albert_xlarge_uncased 2.5.0 en
AlbertEmbeddings albert_xxlarge_uncased 2.5.0 en
XlnetEmbeddings xlnet_base_cased 2.5.0 en
XlnetEmbeddings xlnet_large_cased 2.5.0 en

Classification:

Model Name Build Lang
ClassifierDL (with tfhub_use) classifierdl_use_trec6 2.5.0 en
ClassifierDL (with tfhub_use) classifierdl_use_trec50 2.5.0 en
SentimentDL (with tfhub_use) sentimentdl_use_imdb 2.5.0 en
SentimentDL (with tfhub_use) sentimentdl_use_twitter 2.5.0 en
SentimentDL (with glove_100d) sentimentdl_glove_imdb 2.5.0 en

Quick online example:

# Python: load NER model trained by deep learning approach and GloVe word embeddings
ner_dl = NerDLModel.pretrained('ner_dl')
# Python: load NER model trained by deep learning approach and BERT word embeddings
ner_bert = NerDLModel.pretrained('ner_dl_bert')
// Scala: load French POS tagger model trained on Universal Dependencies
val french_pos = PerceptronModel.pretrained("pos_ud_gsd", lang="fr")
// Scala: load Italian LemmatizerModel
val italian_lemma = LemmatizerModel.pretrained("lemma_dxc", lang="it")
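
The classification models listed above load the same way. A minimal Python sketch, assuming an upstream pipeline that already produces a sentence_embeddings column:

from sparknlp.annotator import ClassifierDLModel

# Load a pretrained multi-class classifier from the Classification table above
classifier = ClassifierDLModel.pretrained("classifierdl_use_trec6") \
    .setInputCols(["sentence_embeddings"]) \
    .setOutputCol("class")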

Quick offline example:

  • Loading PerceptronModel annotator model inside Spark NLP Pipeline
val french_pos = PerceptronModel.load("/tmp/pos_ud_gsd_fr_2.0.2_2.4_1556531457346/")
      .setInputCols("document", "token")
      .setOutputCol("pos")

Please check out our Models Hub for the full list of pre-trained models with examples, demos, benchmarks, and more.

Examples

Need more examples? Check out our dedicated Spark NLP Showcase repository, which covers all Spark NLP use cases!

In addition, don't forget to check out Spark NLP in Action, built with Streamlit.

All examples: spark-nlp-workshop

FAQ

Check our Articles and Videos page here

Citation

We have published a paper that you can cite for the Spark NLP library:

@article{KOCAMAN2021100058,
    title = {Spark NLP: Natural language understanding at scale},
    journal = {Software Impacts},
    pages = {100058},
    year = {2021},
    issn = {2665-9638},
    doi = {https://doi.org/10.1016/j.simpa.2021.100058},
    url = {https://www.sciencedirect.com/science/article/pii/S2665963821000063},
    author = {Veysel Kocaman and David Talby},
    keywords = {Spark, Natural language processing, Deep learning, Tensorflow, Cluster},
    abstract = {Spark NLP is a Natural Language Processing (NLP) library built on top of Apache Spark ML. It provides simple, performant & accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. Spark NLP comes with 1100+ pretrained pipelines and models in more than 192+ languages. It supports nearly all the NLP tasks and modules that can be used seamlessly in a cluster. Downloaded more than 2.7 million times and experiencing 9x growth since January 2020, Spark NLP is used by 54% of healthcare organizations as the world’s most widely used NLP library in the enterprise.}
}

Contributing

We appreciate any sort of contributions:

  • ideas
  • feedback
  • documentation
  • bug reports
  • NLP training and testing corpora
  • development and testing

Clone the repo and submit your pull-requests! Or directly create issues in this repo.

Contact

[email protected]

John Snow Labs

http://johnsnowlabs.com

Comments
  • spark-nlp won't download pretrained model on Hadoop Cluster

    Description

    I am using the code below to get word embeddings using a BERT model.

    from sparknlp.pretrained import PretrainedPipeline
    from sparknlp.annotator import *
    from sparknlp.common import *
    from sparknlp.base import *
    # needed for SparkSession.builder and Pipeline used below
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    
    spark = SparkSession.builder\
        .master("yarn")\
        .config("spark.locality.wait", "0")\
        .config("spark.kryoserializer.buffer.max", "2000M")\
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.0.0")\
        .config("spark.sql.autoBroadcastJoinThreshold", -1)\
        .config("spark.sql.codegen.aggregate.map.twolevel.enabled", "false")\
        .getOrCreate()
    
    document_assembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")
    
    sentence_detector = SentenceDetector() \
        .setInputCols(["document"]) \
        .setOutputCol("sentence") \
        .setLazyAnnotator(False)
    
    embeddings = BertSentenceEmbeddings.pretrained("labse", "xx") \
          .setInputCols("sentence") \
          .setOutputCol("embeddings")
    nlp_pipeline = Pipeline(stages=[document_assembler, sentence_detector, embeddings])
    pipeline_model = nlp_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))
    

    The script works great in Spark local development mode, but when I deployed the script on the Hadoop cluster (using YARN as the resource manager) I get the following error:

    labse download started this may take some time.
    Traceback (most recent call last):
      File "testing_bert_hadoop.py", line 138, in <module>
        embeddings = BertSentenceEmbeddings.pretrained("labse", "xx") \
      File "/usr/local/lib/python3.6/site-packages/sparknlp/annotator.py", line 1969, in pretrained
        return ResourceDownloader.downloadModel(BertSentenceEmbeddings, name, lang, remote_loc)
      File "/usr/local/lib/python3.6/site-packages/sparknlp/pretrained.py", line 32, in downloadModel
        file_size = _internal._GetResourceSize(name, language, remote_loc).apply()
      File "/usr/local/lib/python3.6/site-packages/sparknlp/internal.py", line 192, in __init__
        "com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize", name, language, remote_loc)
      File "/usr/local/lib/python3.6/site-packages/sparknlp/internal.py", line 129, in __init__
        self._java_obj = self.new_java_obj(java_obj, *args)
      File "/usr/local/lib/python3.6/site-packages/sparknlp/internal.py", line 139, in new_java_obj
        return self._new_java_obj(java_class, *args)
      File "/hadoop/yarn/local/usercache/livy/appcache/application_1623058160826_0016/container_e199_1623058160826_0016_01_000001/pyspark.zip/pyspark/ml/wrapper.py", line 63, in _new_java_obj
      File "/hadoop/yarn/local/usercache/livy/appcache/application_1623058160826_0016/container_e199_1623058160826_0016_01_000001/py4j-0.10.6-src.zip/py4j/java_gateway.py", line 1160, in __call__
      File "/hadoop/yarn/local/usercache/livy/appcache/application_1623058160826_0016/container_e199_1623058160826_0016_01_000001/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
      File "/hadoop/yarn/local/usercache/livy/appcache/application_1623058160826_0016/container_e199_1623058160826_0016_01_000001/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
    py4j.protocol.Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
    : java.lang.NoSuchMethodError: org.json4s.jackson.JsonMethods$.parse$default$3()Z
    	at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.parseJson(ResourceMetadata.scala:61)
    	at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:90)
    	at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$$anonfun$readResources$1.applyOrElse(ResourceMetadata.scala:89)
    	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    	at scala.collection.Iterator$$anon$14.next(Iterator.scala:541)
    	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
    	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
    	at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
    	at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:183)
    	at scala.collection.mutable.ListBuffer.$plus$plus$eq(ListBuffer.scala:45)
    	at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
    	at scala.collection.AbstractIterator.to(Iterator.scala:1336)
    	at scala.collection.TraversableOnce$class.toList(TraversableOnce.scala:294)
    	at scala.collection.AbstractIterator.toList(Iterator.scala:1336)
    	at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:92)
    	at com.johnsnowlabs.nlp.pretrained.ResourceMetadata$.readResources(ResourceMetadata.scala:84)
    	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:70)
    	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:81)
    	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:159)
    	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:399)
    	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:496)
    	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    	at py4j.Gateway.invoke(Gateway.java:282)
    	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    	at py4j.commands.CallCommand.execute(CallCommand.java:79)
    	at py4j.GatewayConnection.run(GatewayConnection.java:214)
    	at java.lang.Thread.run(Thread.java:745)
    
    

    I tried to manually update the jars json4s-native, json4s-scalap, and many others, but the error still persists.

    Expected Behavior

    The pretrained pipeline should be downloaded and loaded into the pipeline_model variable

    Current Behavior

    Gives the above-mentioned error while running on the Hadoop cluster

    Possible Solution

    I tried to manually update the jars json4s-native, json4s-scalap, and many others, but the error still persists. Maybe I am lacking some knowledge or misunderstanding the problem.

    Context

    I was trying to get word embeddings using the LaBSE model for a classification problem

    Your Environment

    • Spark NLP version 3.0.0 on all nodes
    • Apache Spark version 2.3.0.2.6.5.1175-1
    • Java version OpenJDK Runtime Environment (build 1.8.0_292-b10) OpenJDK 64-Bit Server VM (build 25.292-b10, mixed mode)
    • Setup and installation: Spark comes by default with the Hadoop installation
    • Operating System and version: CentOS 7
    • Cluster Manager: Ambari (HDP 2.6.5.1175-1)

    Please do let me know if you need any more info. Thanks.

    question 
    opened by DanielOX 39
  • TypeError: 'JavaPackage' object is not callable

    Get "TypeError: 'JavaPackage' object is not callable " error whenever trying to call any annotators.

    Description

    Platform: Ubuntu 16.04 LTS on Windows 10's Windows Subsystem for Linux (WSL). Python: Python 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 18:10:19). Pyspark: installed via pip (i.e. Python without explicit Spark installation). spark-nlp: pip install --index-url https://test.pypi.org/simple/ spark-nlp==1.5.4

    Tried running the following, but all returned the same "TypeError: 'JavaPackage' object is not callable" error. There seems to be a similar bug, "Python annotators should be loadable on its own #91", that was closed some time ago, but it still happened to me.

    from pyspark.sql import SparkSession

    spark = SparkSession \
        .builder \
        .config("spark.driver.extraClassPath", "lib/sparknlp.jar") \
        .getOrCreate()

    from sparknlp.annotator import *
    from sparknlp.common import *
    from sparknlp.base import *

    documentAssembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")

    lemmatizer = Lemmatizer() \
        .setInputCols(["token"]) \
        .setOutputCol("lemma") \
        .setDictionary("./lemmas001.txt")

    normalizer = Normalizer() \
        .setInputCols(["token"]) \
        .setOutputCol("normalized")

    Here are the errors:

    === from documentassembler ==============================================

      File "<stdin>", line 1, in <module>
        documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
      File "/home/quickt2/anaconda3/lib/python3.6/site-packages/pyspark/__init__.py", line 105, in wrapper
        return func(self, **kwargs)
      File "/home/quickt2/anaconda3/lib/python3.6/site-packages/sparknlp/base.py", line 175, in __init__
        super(DocumentAssembler, self).__init__(classname="com.johnsnowlabs.nlp.DocumentAssembler")
      File "/home/quickt2/anaconda3/lib/python3.6/site-packages/pyspark/__init__.py", line 105, in wrapper
        return func(self, **kwargs)
      File "/home/quickt2/anaconda3/lib/python3.6/site-packages/sparknlp/base.py", line 20, in __init__
        self._java_obj = self._new_java_obj(classname, self.uid)
      File "/home/quickt2/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py", line 63, in _new_java_obj
        return java_obj(*java_args)

    TypeError: 'JavaPackage' object is not callable

    === from lemmatizer ====================================================

    Traceback (most recent call last):

      File "<stdin>", line 1, in <module>
        lemmatizer = Lemmatizer().setInputCols(["token"]).setOutputCol("lemma").setDictionary("./lemmas001.txt")
      File "/home/quickt2/anaconda3/lib/python3.6/site-packages/pyspark/__init__.py", line 105, in wrapper
        return func(self, **kwargs)
      File "/home/quickt2/anaconda3/lib/python3.6/site-packages/sparknlp/annotator.py", line 281, in __init__
        super(Lemmatizer, self).__init__(classname="com.johnsnowlabs.nlp.annotators.Lemmatizer")
      File "/home/quickt2/anaconda3/lib/python3.6/site-packages/pyspark/__init__.py", line 105, in wrapper
        return func(self, **kwargs)
      File "/home/quickt2/anaconda3/lib/python3.6/site-packages/sparknlp/annotator.py", line 95, in __init__
        self._java_obj = self._new_java_obj(classname, self.uid)
      File "/home/quickt2/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py", line 63, in _new_java_obj
        return java_obj(*java_args)

    TypeError: 'JavaPackage' object is not callable

    === from normalizer ====================================================

    Traceback (most recent call last):

      File "<stdin>", line 1, in <module>
        normalizer = Normalizer().setInputCols(["token"]).setOutputCol("normalized")
      File "/home/quickt2/anaconda3/lib/python3.6/site-packages/pyspark/__init__.py", line 105, in wrapper
        return func(self, **kwargs)
      File "/home/quickt2/anaconda3/lib/python3.6/site-packages/sparknlp/annotator.py", line 198, in __init__
        super(Normalizer, self).__init__(classname="com.johnsnowlabs.nlp.annotators.Normalizer")
      File "/home/quickt2/anaconda3/lib/python3.6/site-packages/pyspark/__init__.py", line 105, in wrapper
        return func(self, **kwargs)
      File "/home/quickt2/anaconda3/lib/python3.6/site-packages/sparknlp/annotator.py", line 95, in __init__
        self._java_obj = self._new_java_obj(classname, self.uid)
      File "/home/quickt2/anaconda3/lib/python3.6/site-packages/pyspark/ml/wrapper.py", line 63, in _new_java_obj
        return java_obj(*java_args)

    TypeError: 'JavaPackage' object is not callable

    opened by bigheadming 32
  • Version Compatibility between sparkNLP 2.5.3 and spark 2.3.x

    • Apache Spark version 2.3.2.3.1.5.0-152
    • Spark NLP version 1.7.3
    • Apache Spark setup (OS, docker, jupyter, zeppelin, Cloudera, Databricks, EMR, etc.): Cloudera
    • How did you install Spark NLP: Quoting the IT team – “we don't install packages from source because doing so would not allow us to pass a umask value to the package during installation and thus making it only importable by the root user so we install via pip, specifically using the pip module in ansible, in order to pass the needed umask value”
    • Java version : 1.8.0_121
    • Python/Scala version : Python 3.6.5
    • Does anything else work in Apache Spark while only the Spark NLP related part fails? Not sure; I'm working on Linux and assuming it is connected to a Hadoop system that lets me code on Spark.

    Code snippet:

    import os
    import sys
    sys.path.append('../../')
    
    print(sys.version)
    
    from sparknlp.pretrained import ResourceDownloader
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import *
    
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    
    spark = SparkSession.builder \
        .appName("ner")\
        .master("local[*]")\
        .config("spark.driver.memory","4G")\
        .config("spark.driver.maxResultSize", "2G")\
        .config("spark.driver.extraClassPath", "/hadoop/anaconda3.6/lib/python3.6/site-packages/sparknlp/lib/sparknlp.jar")\
        .config("spark.kryoserializer.buffer.max", "500m")\
        .getOrCreate()
    
    downloader = ResourceDownloader()
    
    
    l = [(1,'Thanks for calling to ESI'),(2,'How can i help you'),(3,'Please reach out to us on mail')]
    
    data = spark.createDataFrame(l, ['docID','text'])
    
    #Working fine
    document_assembler = DocumentAssembler() \
        .setInputCol("text")
    
    #Working fine
    sentence_detector = SentenceDetector() \
        .setInputCols(["document"]) \
        .setOutputCol("sentence")
    
    #Working fine
    tokenizer = Tokenizer() \
        .setInputCols(["sentence"]) \
        .setOutputCol("token")
    
    #Working fine
    lemma = LemmatizerModel.load("/user/elxxx/emma_mod").setInputCols(["token"]).setOutputCol("lemma")
    
    #Working fine
    pos = PerceptronModel.load("/user/elxxx/pos_anc_mod/").setInputCols(["document","token"]).setOutputCol("pos")
    
    #Working fine
    nor_sweet = NorvigSweetingModel.load("/user/elxxx/spell_nor_mod").setInputCols(["token"]).setOutputCol("corrected")
    
    #Working fine
    sent_viv = ViveknSentimentModel.load("/user/elxxx/sent_vivek_mod").setInputCols(["sentence","token"]).setOutputCol("sentiment")
    
    
    # Error: WordEmbeddingsModel not defined
    embed = WordEmbeddingsModel.load("/user/elxxx/wordEmbedMod").setStoragePath("/user/elxxx/wordEmbedMod/glove.6B.100d.txt", "TEXT")\
          .setDimension(100)\
          .setStorageRef("glove_100d") \
          .setInputCols("document", "token") \
          .setOutputCol("embeddings")
    
    #Similar issue with other modules
    #Error: BertEmbeddingsModel not defined
    #bert = BertEmbeddings.load ("/user/elxxx/bert").setInputCols("sentence", "token") .setOutputCol("bert").
    

    We replaced the previous sparkNLP.jar with the new sparkNLP fat JAR provided by @maziyarpanahi (renamed to sparkNLP.jar). It seems to have had some conflict with the Jackson jar, which might be the reason Spark crashed.

    Could you help us configure Spark NLP for our version of Spark, given there are jar files that support the compatibility? Happy to fill you in with more details if needed.

    question Requires more input 
    opened by akash166d 28
  • Problematic frame: C  [libtensorflow_framework.so.1+0x744da9]  _GLOBAL__sub_I_loader.cc+0x99

    Description

    I have to run a Spark job, which uses the recognize_entities_dl pretrained pipeline, on a Mesos (dockerized) cluster. The command is as follows:

    /opt/spark/spark-2.4.5-bin-hadoop2.7/bin/spark-submit --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.0,com.couchbase.client:spark-connector_2.11:2.3.0 --master mesos://zk://remote_ip:2181/mesos --deploy-mode client --class tags_extraction.tags_extraction_eng /opt/sparkscala_2.11-0.1.jar

    This is the code:

    val (sparkSession, sc) = start_spark_session()
    
    def start_spark_session(): (SparkSession, SparkContext) = {
    
      val sparkSession = SparkSession.builder()
          .master("mesos://zk://remote-ip:32181/mesos")
          .config("spark.mesos.executor.home", "/opt/spark/spark-2.4.5-bin-hadoop2.7")
    
          .config("spark.jars",
            "/opt/sparkscala_2.11-0.1.jar," +
              "https://repo1.maven.org/maven2/com/couchbase/client/java-client/2.7.6/java-client-2.7.6.jar," +
              "https://repo1.maven.org/maven2/com/couchbase/client/core-io/1.7.6/core-io-1.7.6.jar," +
              "https://repo1.maven.org/maven2/com/couchbase/client/spark-connector_2.11/2.3.0/spark-connector_2.11-2.3.0.jar," +
              "https://repo1.maven.org/maven2/io/opentracing/opentracing-api/0.31.0/opentracing-api-0.31.0.jar," +
              "https://repo1.maven.org/maven2/io/reactivex/rxjava/1.3.8/rxjava-1.3.8.jar," +
              "https://repo1.maven.org/maven2/io/reactivex/rxscala_2.11/0.26.5/rxscala_2.11-0.26.5.jar," +
    
              //I tried them both and they give the same error
              "https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-assembly-2.5.0.jar"+
              "https://repo1.maven.org/maven2/com/johnsnowlabs/nlp/spark-nlp_2.11/2.5.0/spark-nlp_2.11-2.5.0.jar"
          )
          .config("spark.executor.extraLibraryPath",
            "/sparkscala_2.11-0.1.jar" +
              "/java-client-2.7.6.jar" +
              "/core-io-1.7.6.jar" +
              "/spark-connector_2.11-2.3.0.jar" +
              "/opentracing-api-0.31.0.jar" +
              "/rxjava-1.3.8.jar" +
              "/rxscala_2.11-0.26.5.jar" +
              "/core-1.1.2.jar" +
              "/spark-streaming-kafka-0-10_2.11-2.4.5.jar" +
              "/spark-sql-kafka-0-10_2.11-2.4.5.jar" +
              "/kafka-clients-2.4.0.jar" +
              "/kafka_2.11-2.4.1.jar" +
              "/spark-nlp-assembly-2.5.0.jar" +
              "/spark-nlp_2.11-2.5.0.jar"
          )
          .getOrCreate()
    
        sparkSession.sparkContext.setLogLevel("DEBUG")
    
        val sc = sparkSession.sparkContext
        sc.getConf.getAll.foreach(println)
    
        (sparkSession, sc)
      }
    
    
    def main(args: Array[String]) {
      
        val feeds_df = sparkSession.read.couchbase(schema = feedSchema, options = Map("bucket" -> "feeds"))
      
        val pipeline = new PretrainedPipeline("recognize_entities_dl", "en")
       
        println("PIPELINE LOADED") // not printed
    
        val feeds_tags = pipeline.transform(feeds_df)
          .selectExpr("author_id", "id", "category", "text", "entities.result as tags")
    
        feeds_tags.printSchema()
        println(feeds_tags)
        println(feeds_tags.getClass.toString)
        println(SizeEstimator.estimate(feeds_tags))
         println("COUNT", feeds_tags.count)
    
        feeds_tags.show()
    
        sparkSession.close()
      }
    
    }
    

    While the pipeline is being downloaded, this error is raised when loading stage 4:

    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGILL (0x4) at pc=0x00007f8c09bc2da9, pid=4192, tid=0x00007f8d51343700
    #
    # JRE version: OpenJDK Runtime Environment (8.0_252-b09) (build 1.8.0_252-8u252-b09-1~16.04-b09)
    # Java VM: OpenJDK 64-Bit Server VM (25.252-b09 mixed mode linux-amd64 compressed oops)
    # Problematic frame:
    # C  [libtensorflow_framework.so.1+0x744da9]  _GLOBAL__sub_I_loader.cc+0x99
    #
    # Core dump written. Default location: /var/lib/mesos/slaves/fb88a3ad-d32c-41ae-be67-36517a272bcb-S0/frameworks/fb88a3ad-d32c-41ae-be67-36517a272bcb-0000/executors/ct:1591367792198:0:tags_extraction_eng:/runs/2a41d953-7343-4dd5-a59b-2e253f0cda55/core or core.4192
    #
    # An error report file with more information is saved as:
    # /var/lib/mesos/slaves/fb88a3ad-d32c-41ae-be67-36517a272bcb-S0/frameworks/fb88a3ad-d32c-41ae-be67-36517a272bcb-0000/executors/ct:1591367792198:0:tags_extraction_eng:/runs/2a41d953-7343-4dd5-a59b-2e253f0cda55/hs_err_pid4192.log
    #
    # If you would like to submit a bug report, please visit:
    #   http://bugreport.java.com/bugreport/crash.jsp
    # The crash happened outside the Java Virtual Machine in native code.
    # See problematic frame for where to report the bug.
    #
    

    Expected Behavior

    Download the pretrained pipeline with val pipeline = new PretrainedPipeline("recognize_entities_dl", "en")

    Current Behavior

    Driver's stdout:

    (spark.repl.local.jars,file:///root/.ivy2/jars/com.johnsnowlabs.nlp_spark-nlp_2.11-2.5.0.jar,file:///root/.ivy2/jars/com.couchbase.client_spark-connector_2.11-2.3.0.jar,file:///root/.ivy2/jars/com.typesafe_config-1.3.0.jar,file:///root/.ivy2/jars/org.rocksdb_rocksdbjni-6.5.3.jar,file:///root/.ivy2/jars/org.apache.hadoop_hadoop-aws-3.2.0.jar,file:///root/.ivy2/jars/com.amazonaws_aws-java-sdk-core-1.11.603.jar,file:///root/.ivy2/jars/com.amazonaws_aws-java-sdk-s3-1.11.603.jar,file:///root/.ivy2/jars/com.github.universal-automata_liblevenshtein-3.0.0.jar,file:///root/.ivy2/jars/com.navigamez_greex-1.0.jar,file:///root/.ivy2/jars/org.json4s_json4s-ext_2.11-3.5.3.jar,file:///root/.ivy2/jars/org.tensorflow_tensorflow-1.15.0.jar,file:///root/.ivy2/jars/net.sf.trove4j_trove4j-3.0.3.jar,file:///root/.ivy2/jars/commons-logging_commons-logging-1.1.3.jar,file:///root/.ivy2/jars/org.apache.httpcomponents_httpclient-4.5.9.jar,file:///root/.ivy2/jars/software.amazon.ion_ion-java-1.0.2.jar,file:///root/.ivy2/jars/com.fasterxml.jackson.dataformat_jackson-dataformat-cbor-2.6.7.jar,file:///root/.ivy2/jars/org.apache.httpcomponents_httpcore-4.4.11.jar,file:///root/.ivy2/jars/commons-codec_commons-codec-1.11.jar,file:///root/.ivy2/jars/com.amazonaws_aws-java-sdk-kms-1.11.603.jar,file:///root/.ivy2/jars/com.amazonaws_jmespath-java-1.11.603.jar,file:///root/.ivy2/jars/com.fasterxml.jackson.core_jackson-databind-2.6.7.2.jar,file:///root/.ivy2/jars/com.fasterxml.jackson.core_jackson-annotations-2.6.0.jar,file:///root/.ivy2/jars/com.fasterxml.jackson.core_jackson-core-2.6.7.jar,file:///root/.ivy2/jars/com.google.code.findbugs_annotations-3.0.1.jar,file:///root/.ivy2/jars/com.google.protobuf_protobuf-java-util-3.0.0-beta-3.jar,file:///root/.ivy2/jars/com.google.protobuf_protobuf-java-3.0.0-beta-3.jar,file:///root/.ivy2/jars/it.unimi.dsi_fastutil-7.0.12.jar,file:///root/.ivy2/jars/org.projectlombok_lombok-1.16.8.jar,file:///root/.ivy2/jars/org.slf4j_slf4j-api-1.7.21.jar,file:///root/.ivy2/jars/net.jcip_jcip-annotations-1.0.jar,file:///root/.ivy2/jars/com.google.code.findbugs_jsr305-3.0.1.jar,file:///root/.ivy2/jars/com.google.code.gson_gson-2.3.jar,file:///root/.ivy2/jars/dk.brics.automaton_automaton-1.11-8.jar,file:///root/.ivy2/jars/joda-time_joda-time-2.9.5.jar,file:///root/.ivy2/jars/org.joda_joda-convert-1.8.1.jar,file:///root/.ivy2/jars/org.tensorflow_libtensorflow-1.15.0.jar,file:///root/.ivy2/jars/org.tensorflow_libtensorflow_jni-1.15.0.jar,file:///root/.ivy2/jars/com.couchbase.client_java-client-2.7.6.jar,file:///root/.ivy2/jars/com.couchbase.client_dcp-client-0.23.0.jar,file:///root/.ivy2/jars/io.reactivex_rxscala_2.11-0.26.5.jar,file:///root/.ivy2/jars/org.apache.logging.log4j_log4j-api-2.2.jar,file:///root/.ivy2/jars/com.couchbase.client_core-io-1.7.6.jar,file:///root/.ivy2/jars/io.reactivex_rxjava-1.3.8.jar,file:///root/.ivy2/jars/io.opentracing_opentracing-api-0.31.0.jar)
    (spark.sql.execution.arrow.enabled,true)
    (spark.couchbase.nodes,couchbase://remote_ip)
    (com.couchbase.connectTimeout,300000)
    (spark.jars,/opt/sparkscala_2.11-0.1.jar,https://repo1.maven.org/maven2/com/couchbase/client/java-client/2.7.6/java-client-2.7.6.jar,https://repo1.maven.org/maven2/com/couchbase/client/core-io/1.7.6/core-io-1.7.6.jar,https://repo1.maven.org/maven2/com/couchbase/client/spark-connector_2.11/2.3.0/spark-connector_2.11-2.3.0.jar,https://repo1.maven.org/maven2/io/opentracing/opentracing-api/0.31.0/opentracing-api-0.31.0.jar,https://repo1.maven.org/maven2/io/reactivex/rxjava/1.3.8/rxjava-1.3.8.jar,https://repo1.maven.org/maven2/io/reactivex/rxscala_2.11/0.26.5/rxscala_2.11-0.26.5.jar,https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-assembly-2.5.0.jar)
    (spark.executor.id,driver)
    (spark.driver.port,41651)
    (spark.couchbase.bucket.feeds,)
    (spark.couchbase.bucket.users,)
    (spark.driver.memory,1g)
    (spark.serializer,org.apache.spark.serializer.KryoSerializer)
    (com.couchbase.username,apps)
    (spark.cores.max,1)
    (spark.sql.tungsten.enabled,true)
    (spark.driver.host,mesos-slave)
    (spark.executor.memory,1g)
    (spark.couchbase.bucket.action_sink,)
    (com.couchbase.password,password)
    (spark.master,mesos://zk://remote_ip:2181/mesos)
    (com.couchbase.socketConnect,300000)
    (spark.mesos.executor.home,/opt/spark/spark-2.4.5-bin-hadoop2.7)
    (spark.submit.deployMode,client)
    (spark.app.name,tags_extraction_eng)
    (spark.app.id,fb88a3ad-d32c-41ae-be67-36517a272bcb-0005)
    (spark.ui.showConsoleProgress,true)
    (spark.worker.cleanup.enabled,true)
    (spark.executor.extraLibraryPath,/sparkscala_2.11-0.1.jar/java-client-2.7.6.jar/core-io-1.7.6.jar/spark-connector_2.11-2.3.0.jar/opentracing-api-0.31.0.jar/rxjava-1.3.8.jar/rxscala_2.11-0.26.5.jar/core-1.1.2.jar/spark-streaming-kafka-0-10_2.11-2.4.5.jar/spark-sql-kafka-0-10_2.11-2.4.5.jar/kafka-clients-2.4.0.jar/kafka_2.11-2.4.1.jar/spark-nlp-assembly-2.5.0.jar/spark-nlp_2.11-2.5.0.jar)
    
    recognize_entities_dl download started this may take some time.
    Approximate size to download 159 MB
    Download done! Loading the resource.
    
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGILL (0x4) at pc=0x00007f8c09bc2da9, pid=4192, tid=0x00007f8d51343700
    #
    # JRE version: OpenJDK Runtime Environment (8.0_252-b09) (build 1.8.0_252-8u252-b09-1~16.04-b09)
    # Java VM: OpenJDK 64-Bit Server VM (25.252-b09 mixed mode linux-amd64 compressed oops)
    # Problematic frame:
    # C  [libtensorflow_framework.so.1+0x744da9]  _GLOBAL__sub_I_loader.cc+0x99
    #
    # Core dump written. Default location: /var/lib/mesos/slaves/fb88a3ad-d32c-41ae-be67-36517a272bcb-S0/frameworks/fb88a3ad-d32c-41ae-be67-36517a272bcb-0000/executors/ct:1591367792198:0:tags_extraction_eng:/runs/2a41d953-7343-4dd5-a59b-2e253f0cda55/core or core.4192
    #
    # An error report file with more information is saved as:
    # /var/lib/mesos/slaves/fb88a3ad-d32c-41ae-be67-36517a272bcb-S0/frameworks/fb88a3ad-d32c-41ae-be67-36517a272bcb-0000/executors/ct:1591367792198:0:tags_extraction_eng:/runs/2a41d953-7343-4dd5-a59b-2e253f0cda55/hs_err_pid4192.log
    #
    # If you would like to submit a bug report, please visit:
    #   http://bugreport.java.com/bugreport/crash.jsp
    # The crash happened outside the Java Virtual Machine in native code.
    # See problematic frame for where to report the bug.
    #
    

    Executor's Logs:

    ...
    20/06/05 14:40:01 INFO CoarseGrainedExecutorBackend: Got assigned task 17
    20/06/05 14:40:01 INFO Executor: Running task 0.0 in stage 14.0 (TID 17)
    20/06/05 14:40:01 INFO TorrentBroadcast: Started reading broadcast variable 26
    20/06/05 14:40:01 INFO MemoryStore: Block broadcast_26_piece0 stored as bytes in memory (estimated size 2.2 KB, free 362.9 MB)
    20/06/05 14:40:01 INFO TorrentBroadcast: Reading broadcast variable 26 took 11 ms
    20/06/05 14:40:01 INFO MemoryStore: Block broadcast_26 stored as values in memory (estimated size 3.7 KB, free 362.9 MB)
    20/06/05 14:40:01 INFO HadoopRDD: Input split: file:/root/cache_pretrained/recognize_entities_dl_en_2.4.3_2.4_1584626752821/stages/4_NerDLModel_d4424c9af5f4/metadata/part-00000:0+408
    20/06/05 14:40:01 INFO TorrentBroadcast: Started reading broadcast variable 25
    20/06/05 14:40:01 INFO MemoryStore: Block broadcast_25_piece0 stored as bytes in memory (estimated size 23.1 KB, free 362.8 MB)
    20/06/05 14:40:01 INFO TorrentBroadcast: Reading broadcast variable 25 took 25 ms
    20/06/05 14:40:01 INFO MemoryStore: Block broadcast_25 stored as values in memory (estimated size 322.8 KB, free 362.5 MB)
    20/06/05 14:40:01 INFO Executor: Finished task 0.0 in stage 14.0 (TID 17). 1209 bytes result sent to driver
    20/06/05 14:40:01 INFO CoarseGrainedExecutorBackend: Got assigned task 18
    20/06/05 14:40:01 INFO Executor: Running task 0.0 in stage 15.0 (TID 18)
    20/06/05 14:40:01 INFO TorrentBroadcast: Started reading broadcast variable 28
    20/06/05 14:40:01 INFO MemoryStore: Block broadcast_28_piece0 stored as bytes in memory (estimated size 2.2 KB, free 362.5 MB)
    20/06/05 14:40:01 INFO TorrentBroadcast: Reading broadcast variable 28 took 13 ms
    20/06/05 14:40:01 INFO MemoryStore: Block broadcast_28 stored as values in memory (estimated size 3.7 KB, free 362.5 MB)
    20/06/05 14:40:01 INFO HadoopRDD: Input split: file:/root/cache_pretrained/recognize_entities_dl_en_2.4.3_2.4_1584626752821/stages/4_NerDLModel_d4424c9af5f4/metadata/part-00000:0+408
    20/06/05 14:40:01 INFO TorrentBroadcast: Started reading broadcast variable 27
    20/06/05 14:40:01 INFO MemoryStore: Block broadcast_27_piece0 stored as bytes in memory (estimated size 23.1 KB, free 362.5 MB)
    20/06/05 14:40:01 INFO TorrentBroadcast: Reading broadcast variable 27 took 11 ms
    20/06/05 14:40:01 INFO MemoryStore: Block broadcast_27 stored as values in memory (estimated size 322.8 KB, free 362.2 MB)
    20/06/05 14:40:01 INFO Executor: Finished task 0.0 in stage 15.0 (TID 18). 1166 bytes result sent to driver
    20/06/05 14:40:01 INFO CoarseGrainedExecutorBackend: Got assigned task 19
    20/06/05 14:40:01 INFO Executor: Running task 0.0 in stage 16.0 (TID 19)
    20/06/05 14:40:01 INFO TorrentBroadcast: Started reading broadcast variable 30
    20/06/05 14:40:01 INFO MemoryStore: Block broadcast_30_piece0 stored as bytes in memory (estimated size 2.4 KB, free 362.2 MB)
    20/06/05 14:40:01 INFO TorrentBroadcast: Reading broadcast variable 30 took 11 ms
    20/06/05 14:40:01 INFO MemoryStore: Block broadcast_30 stored as values in memory (estimated size 3.9 KB, free 362.2 MB)
    20/06/05 14:40:01 INFO HadoopRDD: Input split: file:/root/cache_pretrained/recognize_entities_dl_en_2.4.3_2.4_1584626752821/stages/4_NerDLModel_d4424c9af5f4/fields/datasetParams/part-00005:0+95
    20/06/05 14:40:01 INFO TorrentBroadcast: Started reading broadcast variable 29
    20/06/05 14:40:01 INFO MemoryStore: Block broadcast_29_piece0 stored as bytes in memory (estimated size 23.1 KB, free 362.1 MB)
    20/06/05 14:40:01 INFO TorrentBroadcast: Reading broadcast variable 29 took 17 ms
    20/06/05 14:40:01 INFO MemoryStore: Block broadcast_29 stored as values in memory (estimated size 322.8 KB, free 361.8 MB)
    20/06/05 14:40:01 INFO Executor: Finished task 0.0 in stage 16.0 (TID 19). 765 bytes result sent to driver
    20/06/05 14:40:01 INFO CoarseGrainedExecutorBackend: Got assigned task 20
    20/06/05 14:40:01 INFO Executor: Running task 0.0 in stage 17.0 (TID 20)
    20/06/05 14:40:01 INFO TorrentBroadcast: Started reading broadcast variable 31
    20/06/05 14:40:01 INFO MemoryStore: Block broadcast_31_piece0 stored as bytes in memory (estimated size 2.4 KB, free 362.1 MB)
    20/06/05 14:40:01 INFO TorrentBroadcast: Reading broadcast variable 31 took 19 ms
    20/06/05 14:40:01 INFO MemoryStore: Block broadcast_31 stored as values in memory (estimated size 3.9 KB, free 362.2 MB)
    20/06/05 14:40:01 INFO HadoopRDD: Input split: file:/root/cache_pretrained/recognize_entities_dl_en_2.4.3_2.4_1584626752821/stages/4_NerDLModel_d4424c9af5f4/fields/datasetParams/part-00007:0+95
    20/06/05 14:40:01 INFO Executor: Finished task 0.0 in stage 17.0 (TID 20). 765 bytes result sent to driver
    20/06/05 14:40:01 INFO CoarseGrainedExecutorBackend: Got assigned task 21
    20/06/05 14:40:01 INFO Executor: Running task 1.0 in stage 17.0 (TID 21)
    20/06/05 14:40:01 INFO HadoopRDD: Input split: file:/root/cache_pretrained/recognize_entities_dl_en_2.4.3_2.4_1584626752821/stages/4_NerDLModel_d4424c9af5f4/fields/datasetParams/part-00011:0+2000
    20/06/05 14:40:01 INFO Executor: Finished task 1.0 in stage 17.0 (TID 21). 2146 bytes result sent to driver
    20/06/05 14:40:01 INFO CoarseGrainedExecutorBackend: Got assigned task 22
    20/06/05 14:40:01 INFO Executor: Running task 2.0 in stage 17.0 (TID 22)
    20/06/05 14:40:01 INFO HadoopRDD: Input split: file:/root/cache_pretrained/recognize_entities_dl_en_2.4.3_2.4_1584626752821/stages/4_NerDLModel_d4424c9af5f4/fields/datasetParams/part-00011:2000+831
    20/06/05 14:40:01 INFO Executor: Finished task 2.0 in stage 17.0 (TID 22). 808 bytes result sent to driver
    20/06/05 14:40:01 INFO CoarseGrainedExecutorBackend: Got assigned task 23
    20/06/05 14:40:01 INFO Executor: Running task 3.0 in stage 17.0 (TID 23)
    20/06/05 14:40:01 INFO HadoopRDD: Input split: file:/root/cache_pretrained/recognize_entities_dl_en_2.4.3_2.4_1584626752821/stages/4_NerDLModel_d4424c9af5f4/fields/datasetParams/part-00009:0+95
    20/06/05 14:40:01 INFO Executor: Finished task 3.0 in stage 17.0 (TID 23). 765 bytes result sent to driver
    I0605 14:40:04.482619  4374 exec.cpp:445] Executor asked to shutdown
    I0605 14:40:04.482844  4374 executor.cpp:184] Received SHUTDOWN event
    I0605 14:40:04.482877  4374 executor.cpp:800] Shutting down
    I0605 14:40:04.482920  4374 executor.cpp:913] Sending SIGTERM to process tree at pid 4382
    20/06/05 14:40:04 ERROR CoarseGrainedExecutorBackend: Executor self-exiting due to : Driver mesos-slave:41651 disassociated! Shutting down.
    I0605 14:40:04.489429  4374 executor.cpp:926] Sent SIGTERM to the following process trees:
    [ 
    -+- 4382 sh -c LD_LIBRARY_PATH="/sparkscala_2.11-0.1.jar/java-client-2.7.6.jar/core-io-1.7.6.jar/spark-connector_2.11-2.3.0.jar/opentracing-api-0.31.0.jar/rxjava-1.3.8.jar/rxscala_2.11-0.26.5.jar/core-1.1.2.jar/spark-streaming-kafka-0-10_2.11-2.4.5.jar/spark-sql-kafka-0-10_2.11-2.4.5.jar/kafka-clients-2.4.0.jar/kafka_2.11-2.4.1.jar/spark-nlp-assembly-2.5.0.jar/spark-nlp_2.11-2.5.0.jar:$LD_LIBRARY_PATH" "/opt/spark/spark-2.4.5-bin-hadoop2.7/./bin/spark-class" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://[email protected]:41651 --executor-id 0 --cores 1 --app-id fb88a3ad-d32c-41ae-be67-36517a272bcb-0005 --hostname mesos-slave 
     \--- 4383 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/spark/spark-2.4.5-bin-hadoop2.7/conf/:/opt/spark/spark-2.4.5-bin-hadoop2.7/jars/* -Xmx1024m org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://[email protected]:41651 --executor-id 0 --cores 1 --app-id fb88a3ad-d32c-41ae-be67-36517a272bcb-0005 --hostname mesos-slave 
    ]
    I0605 14:40:04.489470  4374 executor.cpp:930] Scheduling escalation to SIGKILL in 88secs from now
    20/06/05 14:40:04 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL TERM
    20/06/05 14:40:04 INFO DiskBlockManager: Shutdown hook called
    20/06/05 14:40:04 INFO CouchbaseConnection: Performing Couchbase SDK Shutdown
    20/06/05 14:40:04 INFO ShutdownHookManager: Shutdown hook called
    20/06/05 14:40:04 INFO ShutdownHookManager: Deleting directory /var/lib/mesos/slaves/fb88a3ad-d32c-41ae-be67-36517a272bcb-S0/frameworks/fb88a3ad-d32c-41ae-be67-36517a272bcb-0005/executors/0/runs/50383a32-eafb-45cd-ab6b-3be4f5d790a4/spark-e87c68df-00c0-4d18-acc5-684a42cab22b
    20/06/05 14:40:04 INFO ConfigurationProvider: Closed bucket feeds
    20/06/05 14:40:04 INFO Node: Disconnected from Node remote_ip/datanode1
    I0605 14:40:04.540186  4379 executor.cpp:998] Command terminated with signal Terminated (pid: 4382)
    20/06/05 14:40:04 INFO CoreEnvironment: Shutdown IoPool: success 
    20/06/05 14:40:04 INFO CoreEnvironment: Shutdown kvIoPool: success 
    20/06/05 14:40:04 INFO CoreEnvironment: Shutdown viewIoPool: success 
    20/06/05 14:40:04 INFO CoreEnvironment: Shutdown queryIoPool: success 
    20/06/05 14:40:04 INFO CoreEnvironment: Shutdown searchIoPool: success 
    20/06/05 14:40:04 INFO CoreEnvironment: Shutdown Core Scheduler: success 
    20/06/05 14:40:04 INFO CoreEnvironment: Shutdown Runtime Metrics Collector: success 
    20/06/05 14:40:04 INFO CoreEnvironment: Shutdown Latency Metrics Collector: success 
    20/06/05 14:40:04 INFO CoreEnvironment: Shutdown analyticsIoPool: success 
    20/06/05 14:40:04 INFO CoreEnvironment: Shutdown Netty: success 
    20/06/05 14:40:04 INFO CoreEnvironment: Shutdown Tracer: success 
    20/06/05 14:40:04 INFO CoreEnvironment: Shutdown OrphanReporter: success 
    I0605 14:40:05.542169  4381 process.cpp:927] Stopped the socket accept loop
    

    Your Environment

    Docker environment:

    • 1 Mesos Master Container
    • 1 Mesos Worker Container
    • 1 Chronos Container

    Versions:

    • Spark NLP version: 2.5.0
    • Apache Spark version: 2.4.5
    • Java version (java -version):
      JRE version: OpenJDK Runtime Environment (8.0_252-b09) (build 1.8.0_252-8u252-b09-1~16.04-b09)
      Java VM: OpenJDK 64-Bit Server VM (25.252-b09 mixed mode linux-amd64 compressed oops)
    • Docker Container's Operating System and version:
    NAME="Ubuntu"
    VERSION="16.04.6 LTS (Xenial Xerus)"
    ID=ubuntu
    ID_LIKE=debian
    PRETTY_NAME="Ubuntu 16.04.6 LTS"
    VERSION_ID="16.04"
    HOME_URL="http://www.ubuntu.com/"
    SUPPORT_URL="http://help.ubuntu.com/"
    BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
    VERSION_CODENAME=xenial
    
    wont-fix 
    opened by FedericoF93 25
  • 'JavaPackage' object is not callable when 'PretrainedPipeline('explain_document_ml', 'en')'

    'JavaPackage' object is not callable when 'PretrainedPipeline('explain_document_ml', 'en')'

    TypeError                                 Traceback (most recent call last)
    in ()
    ----> 1 pipline = PretrainedPipeline('explain_document_ml', 'en')

    /home/bioxcel/anaconda3/lib/python3.7/site-packages/sparknlp/pretrained.py in __init__(self, name, lang, remote_loc)
         89
         90     def __init__(self, name, lang='en', remote_loc=None):
    ---> 91         self.model = ResourceDownloader().downloadPipeline(name, lang, remote_loc)
         92         self.light_model = LightPipeline(self.model)
         93

    /home/bioxcel/anaconda3/lib/python3.7/site-packages/sparknlp/pretrained.py in downloadPipeline(name, language, remote_loc)
         50     def downloadPipeline(name, language, remote_loc=None):
         51         print(name + " download started this may take some time.")
    ---> 52         file_size = _internal._GetResourceSize(name, language, remote_loc).apply()
         53         if file_size == "-1":
         54             print("Can not find the model to download please check the name!")

    /home/bioxcel/anaconda3/lib/python3.7/site-packages/sparknlp/internal.py in __init__(self, name, language, remote_loc)
         68         super(_ClearCache, self).__init__("com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.clearCache", name, language, remote_loc)
         69
    ---> 70
         71 class _GetResourceSize(ExtendedJavaWrapper):
         72     def __init__(self, name, language, remote_loc):

    /home/bioxcel/anaconda3/lib/python3.7/site-packages/sparknlp/internal.py in __init__(self, java_obj, *args)
          9         super(ExtendedJavaWrapper, self).__init__(java_obj)
         10         self.sc = SparkContext._active_spark_context
    ---> 11         self._java_obj = self.new_java_obj(java_obj, *args)
         12         self.java_obj = self._java_obj
         13

    /home/bioxcel/anaconda3/lib/python3.7/site-packages/sparknlp/internal.py in new_java_obj(self, java_class, *args)
         19
         20     def new_java_obj(self, java_class, *args):
    ---> 21         return self._new_java_obj(java_class, *args)
         22
         23     def new_java_array(self, pylist, java_class):

    /opt/spark-2.4.3-bin-hadoop2.7/python/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)
         65             java_obj = getattr(java_obj, name)
         66         java_args = [_py2java(sc, arg) for arg in args]
    ---> 67         return java_obj(*java_args)
         68
         69     @staticmethod

    TypeError: 'JavaPackage' object is not callable
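
    In general, this TypeError appears when the Spark NLP jar never makes it onto the JVM classpath, so the Python wrapper resolves the com.johnsnowlabs class name to an empty JavaPackage. A minimal sketch of starting a session with the jar attached (the coordinates and version below are illustrative, not taken from the report):

    import sparknlp

    # sparknlp.start() creates a SparkSession with the matching spark-nlp
    # jar on the classpath, which avoids the empty-JavaPackage situation.
    spark = sparknlp.start()

    # Manual alternative (illustrative coordinates):
    # from pyspark.sql import SparkSession
    # spark = SparkSession.builder \
    #     .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3") \
    #     .getOrCreate()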

    invalid 
    opened by vasudhajain0 25
  • Why do you use hadoop-aws 3.2? Spark 2.4 doesn't ship with Hadoop 3.2, which makes it very difficult to work with, as we already use hadoop-aws 2.7.4

    Why do you use hadoop-aws 3.2? Spark 2.4 doesn't ship with Hadoop 3.2, which makes it very difficult to work with, as we already use hadoop-aws 2.7.4

    Description

    Expected Behavior

    Current Behavior

    Possible Solution

    Steps to Reproduce

    1. With hadoop-aws 2.7.3 already installed, pulling in hadoop-aws 3.2 conflicts with the existing Hadoop jars and the AWS SDK (see the sketch below).
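
    One commonly suggested workaround, sketched here under the assumption of a Spark 2.4 / Hadoop 2.7 cluster (the exact coordinates are illustrative), is to pin hadoop-aws to the cluster's own Hadoop line when building the session instead of letting a 3.2 build leak in:

    from pyspark.sql import SparkSession

    # Illustrative coordinates: keep hadoop-aws on the 2.7.x line that
    # matches the Hadoop shipped with the Spark 2.4 binaries.
    spark = SparkSession.builder \
        .appName("s3a-example") \
        .config("spark.jars.packages",
                "com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.3,"
                "org.apache.hadoop:hadoop-aws:2.7.4") \
        .getOrCreate()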

    Context

    Your Environment

    • Spark NLP version:
    • Apache Spark version:
    • Java version (java -version):
    • Setup and installation (Pypi, Conda, Maven, etc.):
    • Operating System and version:
    • Link to your project (if any):
    question 
    opened by appunni-dishq 23
  • Problem with spark-nlp

    Problem with spark-nlp

    Hi! I'm using this example to create my own sentiment classifier, but when I execute the code below, I get an error.

    use = BertEmbeddings.load('/home/mahdi/workTable/dataset/bert/') \
                        .setInputCols(["document"])\
                        .setOutputCol("sentence_embeddings")\
                        .setPoolingLayer(-2)
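
    As an aside, BertEmbeddings emits token-level vectors; naming its output column sentence_embeddings does not pool them into one vector per document. If sentence-level vectors are the goal, Spark NLP's SentenceEmbeddings annotator can average the token vectors. A sketch, assuming the Bert output column above is renamed to "bert":

    from sparknlp.annotator import SentenceEmbeddings

    # Pools the token-level "bert" vectors into one vector per document.
    sentence_embeddings = SentenceEmbeddings() \
        .setInputCols(["document", "bert"]) \
        .setOutputCol("sentence_embeddings") \
        .setPoolingStrategy("AVERAGE")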
    

    I tested it with UniversalSentenceEncoder but got the same error.

    The error:

    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGILL (0x4) at pc=0x00007fac59e78da9, pid=1736, tid=0x00007fad517fb700
    #
    # JRE version: OpenJDK Runtime Environment (8.0_252-b09) (build 1.8.0_252-8u252-b09-1~18.04-b09)
    # Java VM: OpenJDK 64-Bit Server VM (25.252-b09 mixed mode linux-amd64 compressed oops)
    # Problematic frame:
    # C  [libtensorflow_framework.so.1+0x744da9]  _GLOBAL__sub_I_loader.cc+0x99
    #
    # Core dump written. Default location: /home/mahdi/workTable/core or core.1736
    

    At first I used standalone cluster mode with one master and 3 slaves, each with 4 GB of memory and 4 cores. Then I tried one master and one slave, each with 10 GB of memory and 6 cores, but I still got the same error.

    My spark initialization:

    import findspark
    findspark.init()

    from pyspark import SparkConf
    from pyspark.sql import SparkSession, SQLContext
    import sparknlp

    conf = SparkConf()
    conf.set("spark.driver.memory", "19g")
    conf.set("spark.cores.max", "16")
    conf.set("spark.executor.memory", "9700m")
    conf.set("spark.executor.cores", "8")
    conf.set("spark.executor.instances", "8")
    conf.set("spark.rpc.message.maxSize", "1024")
    conf.set("spark.driver.extraJavaOptions", "-Djava.io.tmpdir=/home/mahdi/workTable/temp/")
    conf.set("spark.executor.extraJavaOptions", "-Djava.io.tmpdir=/home/mahdi/workTable/temp/")
    
    
    spark = SparkSession.builder.master("spark://172.18.16.74:7077").appName("Sentiment Analysis").config(conf=conf)\
                                .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.5.4")\
                                .getOrCreate()
    sc = spark.sparkContext
    sqlContext = SQLContext(sc)
    
    print("Spark version : " ,spark.version)
    print("Spark-NLP version : " ,sparknlp.version())
    # Spark version :  2.4.5
    # Spark-NLP version :  2.5.4
    

    How can I fix it?

    Thanks for your help :)
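
    For what it's worth, a SIGILL inside libtensorflow_framework.so (as in the crash above) usually means the bundled TensorFlow binary uses a CPU instruction set, commonly AVX, that the executor's CPU lacks. One quick way to check on Linux (a sketch, not taken from the report):

    # Prints True if the CPU advertises AVX; SIGILL in TensorFlow's native
    # library on AVX-less CPUs is a classic cause of this kind of crash.
    with open("/proc/cpuinfo") as f:
        cpuinfo = f.read()
    print("avx" in cpuinfo)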

    Requires more input Stale 
    opened by m-developer96 23
  • Could not initialize class com.johnsnowlabs.util.ConfigHelper$

    Could not initialize class com.johnsnowlabs.util.ConfigHelper$

    Receiving an error when trying to load a pretrained model from HDFS.

    Description

    In HDFS, I loaded the offline pre-trained model file(s). Applying or using it in code, e.g. bert = BertEmbeddings.load(), throws the error "Could not initialize class com.johnsnowlabs.util.ConfigHelper"

    Expected Behavior

    It should load the pre-trained model from the uncompressed file in HDFS.

    Current Behavior

    Receiving an error message: Py4JJavaError: An error occurred while calling None.com.johnsnowlabs.nlp.embeddings.BertEmbeddings. : java.lang.NoClassDefFoundError: Could not initialize class com.johnsnowlabs.util.ConfigHelper$

    Possible Solution

    The reference to the offline model might be wrong, or something needs to be updated in the config.

    Steps to Reproduce

    1. Import all Spark NLP libs:
       from sparknlp.base import *
       from sparknlp.annotator import *
       from sparknlp.common import *
       import sparknlp
    2. Start the session: spark = sparknlp.start()
    3. document_assembler = DocumentAssembler() \
         .setInputCol("text") \
         .setOutputCol("document")
    4. Load the pretrained model from the HDFS path:
       bert = BertEmbeddings.load("/user/xxx/bert_base_cased_en_2.4.0_2.4_1580579557778") \
         .setInputCols(["document"]) \
         .setOutputCol("bert") \
         .setCaseSensitive(False) \
         .setPoolingLayer(0)
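
    Since "Could not initialize class" means a static initializer failed rather than the class being absent from the jar, a mismatch between the spark-nlp build and the cluster's Spark is a common suspect. A minimal sanity check (a sketch) is to print both versions side by side:

    import sparknlp

    spark = sparknlp.start()
    print("Apache Spark:", spark.version)    # 2.3.2.3.1.0.0-78 per the environment below
    print("Spark NLP:", sparknlp.version())  # 2.4.5 per the environment below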

    Context

    Trying to apply ClassifierDL with word embeddings and sentence embeddings (USE). ClassifierDL is new to me; fixing this issue will enable its use for many different applications.

    Your Environment

    • Spark NLP version (sparknlp.version()): 2.4.5
    • Apache Spark version (spark.version): 2.3.2.3.1.0.0-78
    • Java version java -version: openjdk version "1.8.0_282", OpenJDK Runtime Environment (build 1.8.0_282-b08), OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)
    • Setup and installation (Pypi, Conda, Maven, etc.): Pyspark
    • Operating System and version: Hadoop Cluster
    • Link to your project (if any):

    Thank you for the help.

    Requires more input Stale 
    opened by beginneruser2021 22
  • Encountering java.lang.NullPointerException when displaying Bert transformations

    Encountering java.lang.NullPointerException when displaying Bert transformations

    Hello,

    My setup is a single laptop running Kubuntu 20.10 (Linux kernel 5.8.0-55-generic) on an Intel Core i5-7200U CPU (4 cores) with 5.7 GB of RAM available.

    On this modest machine, I am trying to learn how to set up a standalone spark cluster and submit a job with PySpark that uses SparkNLP.

    I have based my work on https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/blogposts/3.NER_with_BERT.ipynb

    I have installed Spark 3.0.2 on this machine in the home directory and set SPARK_HOME in my environment variables as necessary. Once done, I ran the start-master.sh script from Spark's sbin directory and it launched the master successfully. Then I launched a worker on the same machine, and it registered with the master with 4 cores and 4.7 GB of RAM. On this setup, I was able to successfully run the Pi approximation example from Spark's website.

    Now, on this machine, I created another directory and set up a virtual environment. pip packages installed in this venv: numpy==1.20.3, py4j==0.10.9, pyspark==3.0.2, spark-nlp==3.1.0, sparknlp==1.0.0
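
    One detail worth flagging in that package list: the PyPI distribution for Spark NLP is spark-nlp, while the separately named sparknlp distribution (also installed above) can shadow the sparknlp module that spark-nlp provides. A quick sanity check (a sketch):

    import sparknlp

    # If spark-nlp won, this prints its location and version (3.1.0 in this
    # setup); if the unrelated "sparknlp" distribution shadows it, the
    # version call may be missing or behave oddly.
    print(sparknlp.__file__)
    print(sparknlp.version())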

    I launched Python 3.8.6 from this virtual environment and ran the following script:

    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder\
        .master("spark://rajan-X556URK:7077")\
        .appName("nerexample")\
        .config("spark.driver.memory", "4G")\
        .config("spark.executor.memory", "4G")\
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.0")\
        .getOrCreate() 
    
    import sparknlp
    from sparknlp.annotator import *
    from sparknlp.base import *
    
    from urllib.request import urlretrieve
    
    urlretrieve('https://github.com/JohnSnowLabs/spark-nlp/raw/master/src/test/resources/conll2003/eng.train',
               'eng.train')
    
    urlretrieve('https://github.com/JohnSnowLabs/spark-nlp/raw/master/src/test/resources/conll2003/eng.testa',
               'eng.testa') 
    
    bert_annotator = BertEmbeddings.pretrained('small_bert_L2_128', 'en') \
     .setInputCols(["sentence",'token'])\
     .setOutputCol("bert")\
     .setBatchSize(8)
    
    from sparknlp.training import CoNLL
    
    test_data = CoNLL().readDataset(spark, '/home/w/Assignments/ner/eng.testa')
    
    test_data = bert_annotator.transform(test_data)

    test_data.show(3)
    

    Right when I execute the test_data.show() line, I get a NullPointerException.
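
    A useful isolation step here (a sketch reusing the names from the script above, since show() is the first action that actually executes the transform) is to confirm that the CoNLL read alone succeeds:

    from sparknlp.training import CoNLL

    # Reader only: if this succeeds but showing the transformed DataFrame
    # fails, the NPE is raised inside the BERT transform, not the data load.
    raw = CoNLL().readDataset(spark, '/home/w/Assignments/ner/eng.testa')
    raw.select("sentence", "token").show(3)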

    Following is the log from the stderr file of this worker:

    Spark Executor Command: "/usr/lib/jvm/java-11-openjdk-amd64/bin/java" "-cp" "/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/conf/:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/jars/*" "-Xmx4096M" "-Dspark.driver.port=34205" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://[email protected]:34205" "--executor-id" "0" "--hostname" "192.168.2.103" "--cores" "4" "--app-id" "app-20210611204208-0009" "--worker-url" "spark://[email protected]:44535"
    ========================================
    
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    21/06/11 20:42:09 INFO CoarseGrainedExecutorBackend: Started daemon with process name: [email protected]
    21/06/11 20:42:09 INFO SignalUtils: Registered signal handler for TERM
    21/06/11 20:42:09 INFO SignalUtils: Registered signal handler for HUP
    21/06/11 20:42:09 INFO SignalUtils: Registered signal handler for INT
    21/06/11 20:42:09 WARN Utils: Your hostname, rajan-X556URK resolves to a loopback address: 127.0.1.1; using 192.168.2.103 instead (on interface wlp3s0)
    21/06/11 20:42:09 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
    WARNING: An illegal reflective access operation has occurred
    WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/jars/spark-unsafe_2.12-3.0.2.jar) to constructor java.nio.DirectByteBuffer(long,int)
    WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
    WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
    WARNING: All illegal access operations will be denied in a future release
    21/06/11 20:42:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    21/06/11 20:42:10 INFO SecurityManager: Changing view acls to: w
    21/06/11 20:42:10 INFO SecurityManager: Changing modify acls to: w
    21/06/11 20:42:10 INFO SecurityManager: Changing view acls groups to: 
    21/06/11 20:42:10 INFO SecurityManager: Changing modify acls groups to: 
    21/06/11 20:42:10 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(w); groups with view permissions: Set(); users  with modify permissions: Set(w); groups with modify permissions: Set()
    21/06/11 20:42:10 INFO TransportClientFactory: Successfully created connection to /192.168.2.103:34205 after 95 ms (0 ms spent in bootstraps)
    21/06/11 20:42:10 INFO SecurityManager: Changing view acls to: w
    21/06/11 20:42:10 INFO SecurityManager: Changing modify acls to: w
    21/06/11 20:42:10 INFO SecurityManager: Changing view acls groups to: 
    21/06/11 20:42:10 INFO SecurityManager: Changing modify acls groups to: 
    21/06/11 20:42:10 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(w); groups with view permissions: Set(); users  with modify permissions: Set(w); groups with modify permissions: Set()
    21/06/11 20:42:10 INFO TransportClientFactory: Successfully created connection to /192.168.2.103:34205 after 3 ms (0 ms spent in bootstraps)
    21/06/11 20:42:10 INFO DiskBlockManager: Created local directory at /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/blockmgr-d8c52b91-a0ef-49fb-8712-7116a0410c3b
    21/06/11 20:42:11 INFO MemoryStore: MemoryStore started with capacity 2.2 GiB
    21/06/11 20:42:11 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://[email protected]:34205
    21/06/11 20:42:11 INFO WorkerWatcher: Connecting to worker spark://[email protected]:44535
    21/06/11 20:42:11 INFO ResourceUtils: ==============================================================
    21/06/11 20:42:11 INFO ResourceUtils: Resources for spark.executor:
    
    21/06/11 20:42:11 INFO ResourceUtils: ==============================================================
    21/06/11 20:42:11 INFO TransportClientFactory: Successfully created connection to /192.168.2.103:44535 after 31 ms (0 ms spent in bootstraps)
    21/06/11 20:42:11 INFO WorkerWatcher: Successfully connected to spark://[email protected]:44535
    21/06/11 20:42:11 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
    21/06/11 20:42:11 INFO Executor: Starting executor ID 0 on host 192.168.2.103
    21/06/11 20:42:11 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 35511.
    21/06/11 20:42:11 INFO NettyBlockTransferService: Server created on 192.168.2.103:35511
    21/06/11 20:42:11 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
    21/06/11 20:42:11 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(0, 192.168.2.103, 35511, None)
    21/06/11 20:42:11 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(0, 192.168.2.103, 35511, None)
    21/06/11 20:42:11 INFO BlockManager: Initialized BlockManager: BlockManagerId(0, 192.168.2.103, 35511, None)
    21/06/11 20:42:11 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.google.protobuf_protobuf-java-util-3.0.0-beta-3.jar with timestamp 1623424325166
    21/06/11 20:42:11 INFO TransportClientFactory: Successfully created connection to /192.168.2.103:34205 after 3 ms (0 ms spent in bootstraps)
    21/06/11 20:42:12 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.google.protobuf_protobuf-java-util-3.0.0-beta-3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp16731401106283909142.tmp
    21/06/11 20:42:12 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-19307619201623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.protobuf_protobuf-java-util-3.0.0-beta-3.jar
    21/06/11 20:42:12 INFO Executor: Fetching spark://192.168.2.103:34205/files/net.jcip_jcip-annotations-1.0.jar with timestamp 1623424325166
    21/06/11 20:42:12 INFO Utils: Fetching spark://192.168.2.103:34205/files/net.jcip_jcip-annotations-1.0.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp4135900916397329853.tmp
    21/06/11 20:42:12 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/1155917211623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./net.jcip_jcip-annotations-1.0.jar
    21/06/11 20:42:12 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.google.code.findbugs_annotations-3.0.1.jar with timestamp 1623424325166
    21/06/11 20:42:12 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.google.code.findbugs_annotations-3.0.1.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp5390838511657707315.tmp
    21/06/11 20:42:12 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/10453638051623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.code.findbugs_annotations-3.0.1.jar
    21/06/11 20:42:12 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.johnsnowlabs.nlp_tensorflow-cpu_2.12-0.3.1.jar with timestamp 1623424325166
    21/06/11 20:42:12 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.johnsnowlabs.nlp_tensorflow-cpu_2.12-0.3.1.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp12163369454458897115.tmp
    21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-6753754811623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.johnsnowlabs.nlp_tensorflow-cpu_2.12-0.3.1.jar
    21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/org.projectlombok_lombok-1.16.8.jar with timestamp 1623424325166
    21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/org.projectlombok_lombok-1.16.8.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp546613331331570155.tmp
    21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/15471060871623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.projectlombok_lombok-1.16.8.jar
    21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.typesafe_config-1.3.0.jar with timestamp 1623424325166
    21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.typesafe_config-1.3.0.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp501578203029232760.tmp
    21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-6243396901623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.typesafe_config-1.3.0.jar
    21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/net.sf.trove4j_trove4j-3.0.3.jar with timestamp 1623424325166
    21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/net.sf.trove4j_trove4j-3.0.3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp4294334457124108819.tmp
    21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-9179969801623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./net.sf.trove4j_trove4j-3.0.3.jar
    21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/org.json4s_json4s-ext_2.12-3.5.3.jar with timestamp 1623424325166
    21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/org.json4s_json4s-ext_2.12-3.5.3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp5724117738489913536.tmp
    21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-1785968311623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.json4s_json4s-ext_2.12-3.5.3.jar
    21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.google.code.findbugs_jsr305-3.0.1.jar with timestamp 1623424325166
    21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.google.code.findbugs_jsr305-3.0.1.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp6507586328711846510.tmp
    21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-19147812741623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.code.findbugs_jsr305-3.0.1.jar
    21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/org.joda_joda-convert-1.8.1.jar with timestamp 1623424325166
    21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/org.joda_joda-convert-1.8.1.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp11192836114627213928.tmp
    21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-18183925021623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.joda_joda-convert-1.8.1.jar
    21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/dk.brics.automaton_automaton-1.11-8.jar with timestamp 1623424325166
    21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/dk.brics.automaton_automaton-1.11-8.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp17414383452524692686.tmp
    21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/18002895341623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./dk.brics.automaton_automaton-1.11-8.jar
    21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.navigamez_greex-1.0.jar with timestamp 1623424325166
    21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.navigamez_greex-1.0.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp1310093016529474953.tmp
    21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/444129991623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.navigamez_greex-1.0.jar
    21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.google.code.gson_gson-2.3.jar with timestamp 1623424325166
    21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.google.code.gson_gson-2.3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp16952031653904177164.tmp
    21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-20852710581623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.code.gson_gson-2.3.jar
    21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/it.unimi.dsi_fastutil-7.0.12.jar with timestamp 1623424325166
    21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/it.unimi.dsi_fastutil-7.0.12.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp5122682618647664079.tmp
    21/06/11 20:42:15 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-5370007131623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./it.unimi.dsi_fastutil-7.0.12.jar
    21/06/11 20:42:15 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.johnsnowlabs.nlp_spark-nlp_2.12-3.1.0.jar with timestamp 1623424325166
    21/06/11 20:42:15 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.johnsnowlabs.nlp_spark-nlp_2.12-3.1.0.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp4638237247886531412.tmp
    21/06/11 20:42:16 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-3144268511623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.johnsnowlabs.nlp_spark-nlp_2.12-3.1.0.jar
    21/06/11 20:42:16 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.github.universal-automata_liblevenshtein-3.0.0.jar with timestamp 1623424325166
    21/06/11 20:42:16 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.github.universal-automata_liblevenshtein-3.0.0.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp18408146982236201037.tmp
    21/06/11 20:42:16 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/19900329611623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.github.universal-automata_liblevenshtein-3.0.0.jar
    21/06/11 20:42:16 INFO Executor: Fetching spark://192.168.2.103:34205/files/org.slf4j_slf4j-api-1.7.21.jar with timestamp 1623424325166
    21/06/11 20:42:16 INFO Utils: Fetching spark://192.168.2.103:34205/files/org.slf4j_slf4j-api-1.7.21.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp18433314345265653010.tmp
    21/06/11 20:42:16 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/13339163381623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.slf4j_slf4j-api-1.7.21.jar
    21/06/11 20:42:16 INFO Executor: Fetching spark://192.168.2.103:34205/files/org.rocksdb_rocksdbjni-6.5.3.jar with timestamp 1623424325166
    21/06/11 20:42:16 INFO Utils: Fetching spark://192.168.2.103:34205/files/org.rocksdb_rocksdbjni-6.5.3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp15154651623340219296.tmp
    21/06/11 20:42:16 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/19889744071623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.rocksdb_rocksdbjni-6.5.3.jar
    21/06/11 20:42:16 INFO Executor: Fetching spark://192.168.2.103:34205/files/joda-time_joda-time-2.9.5.jar with timestamp 1623424325166
    21/06/11 20:42:16 INFO Utils: Fetching spark://192.168.2.103:34205/files/joda-time_joda-time-2.9.5.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp6878914735123238495.tmp
    21/06/11 20:42:16 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-7077374021623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./joda-time_joda-time-2.9.5.jar
    21/06/11 20:42:16 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.amazonaws_aws-java-sdk-bundle-1.11.603.jar with timestamp 1623424325166
    21/06/11 20:42:16 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.amazonaws_aws-java-sdk-bundle-1.11.603.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp11428706676980878857.tmp
    21/06/11 20:42:17 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/11445123081623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.amazonaws_aws-java-sdk-bundle-1.11.603.jar
    21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/files/com.google.protobuf_protobuf-java-3.0.0-beta-3.jar with timestamp 1623424325166
    21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/files/com.google.protobuf_protobuf-java-3.0.0-beta-3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp2249757634022456047.tmp
    21/06/11 20:42:17 INFO Utils: Copying /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-12346780511623424325166_cache to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.protobuf_protobuf-java-3.0.0-beta-3.jar
    21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/org.json4s_json4s-ext_2.12-3.5.3.jar with timestamp 1623424325166
    21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/org.json4s_json4s-ext_2.12-3.5.3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp4902258414204843486.tmp
    21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/6839329141623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.json4s_json4s-ext_2.12-3.5.3.jar
    21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.json4s_json4s-ext_2.12-3.5.3.jar to class loader
    21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/dk.brics.automaton_automaton-1.11-8.jar with timestamp 1623424325166
    21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/dk.brics.automaton_automaton-1.11-8.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp7723995488492432875.tmp
    21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/6345908611623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./dk.brics.automaton_automaton-1.11-8.jar
    21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./dk.brics.automaton_automaton-1.11-8.jar to class loader
    21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/net.jcip_jcip-annotations-1.0.jar with timestamp 1623424325166
    21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/net.jcip_jcip-annotations-1.0.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp332592907644172826.tmp
    21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/14461652401623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./net.jcip_jcip-annotations-1.0.jar
    21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./net.jcip_jcip-annotations-1.0.jar to class loader
    21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/org.projectlombok_lombok-1.16.8.jar with timestamp 1623424325166
    21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/org.projectlombok_lombok-1.16.8.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp1709548010051135733.tmp
    21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/3280036381623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.projectlombok_lombok-1.16.8.jar
    21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.projectlombok_lombok-1.16.8.jar to class loader
    21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/net.sf.trove4j_trove4j-3.0.3.jar with timestamp 1623424325166
    21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/net.sf.trove4j_trove4j-3.0.3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp12992547080912692118.tmp
    21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/18958713891623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./net.sf.trove4j_trove4j-3.0.3.jar
    21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./net.sf.trove4j_trove4j-3.0.3.jar to class loader
    21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/org.joda_joda-convert-1.8.1.jar with timestamp 1623424325166
    21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/org.joda_joda-convert-1.8.1.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp16024886356109174200.tmp
    21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-21432645511623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.joda_joda-convert-1.8.1.jar
    21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.joda_joda-convert-1.8.1.jar to class loader
    21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.github.universal-automata_liblevenshtein-3.0.0.jar with timestamp 1623424325166
    21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.github.universal-automata_liblevenshtein-3.0.0.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp11719577668617794252.tmp
    21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-19939684301623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.github.universal-automata_liblevenshtein-3.0.0.jar
    21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.github.universal-automata_liblevenshtein-3.0.0.jar to class loader
    21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.navigamez_greex-1.0.jar with timestamp 1623424325166
    21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.navigamez_greex-1.0.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp807262545140534729.tmp
    21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/18948526941623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.navigamez_greex-1.0.jar
    21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.navigamez_greex-1.0.jar to class loader
    21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/joda-time_joda-time-2.9.5.jar with timestamp 1623424325166
    21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/joda-time_joda-time-2.9.5.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp12279758982734310357.tmp
    21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-5516510511623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./joda-time_joda-time-2.9.5.jar
    21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./joda-time_joda-time-2.9.5.jar to class loader
    21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.google.protobuf_protobuf-java-3.0.0-beta-3.jar with timestamp 1623424325166
    21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.google.protobuf_protobuf-java-3.0.0-beta-3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp12934127082379220752.tmp
    21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-4898677941623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.protobuf_protobuf-java-3.0.0-beta-3.jar
    21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.protobuf_protobuf-java-3.0.0-beta-3.jar to class loader
    21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.johnsnowlabs.nlp_spark-nlp_2.12-3.1.0.jar with timestamp 1623424325166
    21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.johnsnowlabs.nlp_spark-nlp_2.12-3.1.0.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp7551843316349076899.tmp
    21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/17717935161623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.johnsnowlabs.nlp_spark-nlp_2.12-3.1.0.jar
    21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.johnsnowlabs.nlp_spark-nlp_2.12-3.1.0.jar to class loader
    21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.google.code.gson_gson-2.3.jar with timestamp 1623424325166
    21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.google.code.gson_gson-2.3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp8987546978536014081.tmp
    21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-7546975391623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.code.gson_gson-2.3.jar
    21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.code.gson_gson-2.3.jar to class loader
    21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/it.unimi.dsi_fastutil-7.0.12.jar with timestamp 1623424325166
    21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/it.unimi.dsi_fastutil-7.0.12.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp14662829152554853125.tmp
    21/06/11 20:42:17 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-20180996401623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./it.unimi.dsi_fastutil-7.0.12.jar
    21/06/11 20:42:17 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./it.unimi.dsi_fastutil-7.0.12.jar to class loader
    21/06/11 20:42:17 INFO Executor: Fetching spark://192.168.2.103:34205/jars/org.rocksdb_rocksdbjni-6.5.3.jar with timestamp 1623424325166
    21/06/11 20:42:17 INFO Utils: Fetching spark://192.168.2.103:34205/jars/org.rocksdb_rocksdbjni-6.5.3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp9949668037197273689.tmp
    21/06/11 20:42:18 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/5078754801623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.rocksdb_rocksdbjni-6.5.3.jar
    21/06/11 20:42:18 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.rocksdb_rocksdbjni-6.5.3.jar to class loader
    21/06/11 20:42:18 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.johnsnowlabs.nlp_tensorflow-cpu_2.12-0.3.1.jar with timestamp 1623424325166
    21/06/11 20:42:18 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.johnsnowlabs.nlp_tensorflow-cpu_2.12-0.3.1.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp9212643548963030178.tmp
    21/06/11 20:42:19 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/694347761623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.johnsnowlabs.nlp_tensorflow-cpu_2.12-0.3.1.jar
    21/06/11 20:42:19 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.johnsnowlabs.nlp_tensorflow-cpu_2.12-0.3.1.jar to class loader
    21/06/11 20:42:19 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.google.code.findbugs_jsr305-3.0.1.jar with timestamp 1623424325166
    21/06/11 20:42:19 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.google.code.findbugs_jsr305-3.0.1.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp6313052555132831521.tmp
    21/06/11 20:42:19 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-11647417711623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.code.findbugs_jsr305-3.0.1.jar
    21/06/11 20:42:19 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.code.findbugs_jsr305-3.0.1.jar to class loader
    21/06/11 20:42:19 INFO Executor: Fetching spark://192.168.2.103:34205/jars/org.slf4j_slf4j-api-1.7.21.jar with timestamp 1623424325166
    21/06/11 20:42:19 INFO Utils: Fetching spark://192.168.2.103:34205/jars/org.slf4j_slf4j-api-1.7.21.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp16533770977053225215.tmp
    21/06/11 20:42:19 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/18776259231623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.slf4j_slf4j-api-1.7.21.jar
    21/06/11 20:42:19 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./org.slf4j_slf4j-api-1.7.21.jar to class loader
    21/06/11 20:42:19 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.amazonaws_aws-java-sdk-bundle-1.11.603.jar with timestamp 1623424325166
    21/06/11 20:42:19 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.amazonaws_aws-java-sdk-bundle-1.11.603.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp17738135895720612151.tmp
    21/06/11 20:42:19 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/13928342451623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.amazonaws_aws-java-sdk-bundle-1.11.603.jar
    21/06/11 20:42:19 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.amazonaws_aws-java-sdk-bundle-1.11.603.jar to class loader
    21/06/11 20:42:19 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.typesafe_config-1.3.0.jar with timestamp 1623424325166
    21/06/11 20:42:19 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.typesafe_config-1.3.0.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp15020909164608666214.tmp
    21/06/11 20:42:19 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-4682533391623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.typesafe_config-1.3.0.jar
    21/06/11 20:42:19 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.typesafe_config-1.3.0.jar to class loader
    21/06/11 20:42:19 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.google.protobuf_protobuf-java-util-3.0.0-beta-3.jar with timestamp 1623424325166
    21/06/11 20:42:19 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.google.protobuf_protobuf-java-util-3.0.0-beta-3.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp16503352824305074337.tmp
    21/06/11 20:42:19 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/-8807534571623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.protobuf_protobuf-java-util-3.0.0-beta-3.jar
    21/06/11 20:42:19 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.protobuf_protobuf-java-util-3.0.0-beta-3.jar to class loader
    21/06/11 20:42:19 INFO Executor: Fetching spark://192.168.2.103:34205/jars/com.google.code.findbugs_annotations-3.0.1.jar with timestamp 1623424325166
    21/06/11 20:42:19 INFO Utils: Fetching spark://192.168.2.103:34205/jars/com.google.code.findbugs_annotations-3.0.1.jar to /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/fetchFileTemp1442695003069020584.tmp
    21/06/11 20:42:19 INFO Utils: /tmp/spark-a819910c-fda9-401a-b6e9-a88810001756/executor-8497494f-01a3-4bb3-a836-c15131d0ca98/spark-aef159f2-4977-4e17-9deb-eac747de8d62/12936857421623424325166_cache has been previously copied to /home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.code.findbugs_annotations-3.0.1.jar
    21/06/11 20:42:19 INFO Executor: Adding file:/home/w/Assignments/ner/spark-3.0.2-bin-hadoop2.7/work/app-20210611204208-0009/0/./com.google.code.findbugs_annotations-3.0.1.jar to class loader
    21/06/11 20:42:36 INFO CoarseGrainedExecutorBackend: Got assigned task 0
    21/06/11 20:42:36 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
    21/06/11 20:42:36 INFO TorrentBroadcast: Started reading broadcast variable 1 with 1 pieces (estimated total size 4.0 MiB)
    21/06/11 20:42:36 INFO TransportClientFactory: Successfully created connection to /192.168.2.103:32947 after 4 ms (0 ms spent in bootstraps)
    21/06/11 20:42:36 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.4 KiB, free 2.2 GiB)
    21/06/11 20:42:36 INFO TorrentBroadcast: Reading broadcast variable 1 took 128 ms
    21/06/11 20:42:36 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 4.2 KiB, free 2.2 GiB)
    21/06/11 20:42:37 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/metadata/part-00000:0+443
    21/06/11 20:42:37 INFO TorrentBroadcast: Started reading broadcast variable 0 with 1 pieces (estimated total size 4.0 MiB)
    21/06/11 20:42:37 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 23.6 KiB, free 2.2 GiB)
    21/06/11 20:42:37 INFO TorrentBroadcast: Reading broadcast variable 0 took 18 ms
    21/06/11 20:42:37 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 198.4 KiB, free 2.2 GiB)
    21/06/11 20:42:37 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1414 bytes result sent to driver
    21/06/11 20:42:37 INFO CoarseGrainedExecutorBackend: Got assigned task 1
    21/06/11 20:42:37 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
    21/06/11 20:42:37 INFO CoarseGrainedExecutorBackend: Got assigned task 2
    21/06/11 20:42:37 INFO CoarseGrainedExecutorBackend: Got assigned task 3
    21/06/11 20:42:37 INFO Executor: Running task 1.0 in stage 1.0 (TID 2)
    21/06/11 20:42:37 INFO CoarseGrainedExecutorBackend: Got assigned task 4
    21/06/11 20:42:37 INFO Executor: Running task 2.0 in stage 1.0 (TID 3)
    21/06/11 20:42:37 INFO Executor: Running task 3.0 in stage 1.0 (TID 4)
    21/06/11 20:42:37 INFO TorrentBroadcast: Started reading broadcast variable 3 with 1 pieces (estimated total size 4.0 MiB)
    21/06/11 20:42:37 INFO MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 2.4 KiB, free 2.2 GiB)
    21/06/11 20:42:37 INFO TorrentBroadcast: Reading broadcast variable 3 took 15 ms
    21/06/11 20:42:37 INFO MemoryStore: Block broadcast_3 stored as values in memory (estimated size 4.1 KiB, free 2.2 GiB)
    21/06/11 20:42:37 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00005:0+111532
    21/06/11 20:42:37 INFO TorrentBroadcast: Started reading broadcast variable 2 with 1 pieces (estimated total size 4.0 MiB)
    21/06/11 20:42:37 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00004:0+111799
    21/06/11 20:42:37 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00009:0+111710
    21/06/11 20:42:37 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 23.6 KiB, free 2.2 GiB)
    21/06/11 20:42:37 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00003:0+111815
    21/06/11 20:42:37 INFO TorrentBroadcast: Reading broadcast variable 2 took 20 ms
    21/06/11 20:42:37 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 198.4 KiB, free 2.2 GiB)
    21/06/11 20:42:38 INFO Executor: Finished task 3.0 in stage 1.0 (TID 4). 66763 bytes result sent to driver
    21/06/11 20:42:38 INFO Executor: Finished task 1.0 in stage 1.0 (TID 2). 66496 bytes result sent to driver
    21/06/11 20:42:38 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 66779 bytes result sent to driver
    21/06/11 20:42:38 INFO Executor: Finished task 2.0 in stage 1.0 (TID 3). 66674 bytes result sent to driver
    21/06/11 20:42:38 INFO CoarseGrainedExecutorBackend: Got assigned task 5
    21/06/11 20:42:38 INFO Executor: Running task 4.0 in stage 1.0 (TID 5)
    21/06/11 20:42:38 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00006:0+111573
    21/06/11 20:42:38 INFO CoarseGrainedExecutorBackend: Got assigned task 6
    21/06/11 20:42:38 INFO Executor: Running task 5.0 in stage 1.0 (TID 6)
    21/06/11 20:42:38 INFO CoarseGrainedExecutorBackend: Got assigned task 7
    21/06/11 20:42:38 INFO Executor: Running task 6.0 in stage 1.0 (TID 7)
    21/06/11 20:42:38 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00007:0+111394
    21/06/11 20:42:38 INFO CoarseGrainedExecutorBackend: Got assigned task 8
    21/06/11 20:42:38 INFO Executor: Running task 7.0 in stage 1.0 (TID 8)
    21/06/11 20:42:38 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00001:0+111321
    21/06/11 20:42:38 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00008:0+111429
    21/06/11 20:42:38 INFO Executor: Finished task 7.0 in stage 1.0 (TID 8). 66350 bytes result sent to driver
    21/06/11 20:42:38 INFO CoarseGrainedExecutorBackend: Got assigned task 9
    21/06/11 20:42:38 INFO Executor: Running task 8.0 in stage 1.0 (TID 9)
    21/06/11 20:42:38 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00011:0+111491
    21/06/11 20:42:38 INFO Executor: Finished task 6.0 in stage 1.0 (TID 7). 66242 bytes result sent to driver
    21/06/11 20:42:38 INFO CoarseGrainedExecutorBackend: Got assigned task 10
    21/06/11 20:42:38 INFO Executor: Finished task 4.0 in stage 1.0 (TID 5). 66494 bytes result sent to driver
    21/06/11 20:42:38 INFO Executor: Finished task 5.0 in stage 1.0 (TID 6). 66315 bytes result sent to driver
    21/06/11 20:42:38 INFO Executor: Running task 9.0 in stage 1.0 (TID 10)
    21/06/11 20:42:38 INFO CoarseGrainedExecutorBackend: Got assigned task 11
    21/06/11 20:42:38 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00010:0+111524
    21/06/11 20:42:38 INFO CoarseGrainedExecutorBackend: Got assigned task 12
    21/06/11 20:42:38 INFO Executor: Running task 10.0 in stage 1.0 (TID 11)
    21/06/11 20:42:38 INFO Executor: Running task 11.0 in stage 1.0 (TID 12)
    21/06/11 20:42:38 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00000:0+111679
    21/06/11 20:42:38 INFO HadoopRDD: Input split: file:/home/w/cache_pretrained/small_bert_L2_128_en_2.6.0_2.4_1598344320681/fields/vocabulary/part-00002:0+111457
    21/06/11 20:42:38 INFO Executor: Finished task 8.0 in stage 1.0 (TID 9). 66412 bytes result sent to driver
    21/06/11 20:42:38 INFO Executor: Finished task 11.0 in stage 1.0 (TID 12). 66600 bytes result sent to driver
    21/06/11 20:42:38 INFO Executor: Finished task 9.0 in stage 1.0 (TID 10). 66445 bytes result sent to driver
    21/06/11 20:42:38 INFO Executor: Finished task 10.0 in stage 1.0 (TID 11). 66378 bytes result sent to driver
    21/06/11 20:42:55 INFO CoarseGrainedExecutorBackend: Got assigned task 13
    21/06/11 20:42:55 INFO Executor: Running task 0.0 in stage 2.0 (TID 13)
    21/06/11 20:42:55 INFO TorrentBroadcast: Started reading broadcast variable 6 with 1 pieces (estimated total size 4.0 MiB)
    21/06/11 20:42:55 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 77.1 KiB, free 2.2 GiB)
    21/06/11 20:42:55 INFO TorrentBroadcast: Reading broadcast variable 6 took 14 ms
    21/06/11 20:42:55 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 376.3 KiB, free 2.2 GiB)
    21/06/11 20:42:58 INFO CodeGenerator: Code generated in 392.898976 ms
    21/06/11 20:42:58 INFO CodeGenerator: Code generated in 50.294749 ms
    21/06/11 20:42:58 INFO CodeGenerator: Code generated in 85.842712 ms
    21/06/11 20:42:58 INFO CodeGenerator: Generated method too long to be JIT compiled: org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.serializefromobject_doConsume_0$ is 20081 bytes
    21/06/11 20:42:58 INFO CodeGenerator: Code generated in 257.430603 ms
    21/06/11 20:42:59 INFO CodeGenerator: Code generated in 166.091418 ms
    21/06/11 20:42:59 INFO TorrentBroadcast: Started reading broadcast variable 4 with 1 pieces (estimated total size 4.0 MiB)
    21/06/11 20:42:59 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 333.3 KiB, free 2.2 GiB)
    21/06/11 20:42:59 INFO TorrentBroadcast: Reading broadcast variable 4 took 8 ms
    21/06/11 20:42:59 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 3.4 MiB, free 2.2 GiB)
    21/06/11 20:42:59 INFO TorrentBroadcast: Started reading broadcast variable 5 with 5 pieces (estimated total size 20.0 MiB)
    21/06/11 20:42:59 INFO MemoryStore: Block broadcast_5_piece3 stored as bytes in memory (estimated size 4.0 MiB, free 2.2 GiB)
    21/06/11 20:42:59 INFO MemoryStore: Block broadcast_5_piece2 stored as bytes in memory (estimated size 4.0 MiB, free 2.2 GiB)
    21/06/11 20:42:59 INFO MemoryStore: Block broadcast_5_piece4 stored as bytes in memory (estimated size 1039.2 KiB, free 2.2 GiB)
    21/06/11 20:42:59 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 4.0 MiB, free 2.2 GiB)
    21/06/11 20:42:59 INFO MemoryStore: Block broadcast_5_piece1 stored as bytes in memory (estimated size 4.0 MiB, free 2.2 GiB)
    21/06/11 20:42:59 INFO TorrentBroadcast: Reading broadcast variable 5 took 126 ms
    21/06/11 20:43:00 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 17.5 MiB, free 2.2 GiB)
    21/06/11 20:43:00 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 13)
    java.lang.NullPointerException
    	at com.johnsnowlabs.ml.tensorflow.TensorflowWrapper.getTFHubSession(TensorflowWrapper.scala:109)
    	at com.johnsnowlabs.ml.tensorflow.TensorflowBert.tag(TensorflowBert.scala:90)
    	at com.johnsnowlabs.ml.tensorflow.TensorflowBert.$anonfun$calculateEmbeddings$1(TensorflowBert.scala:223)
    	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
    	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
    	at scala.collection.Iterator.toStream(Iterator.scala:1415)
    	at scala.collection.Iterator.toStream$(Iterator.scala:1414)
    	at scala.collection.AbstractIterator.toStream(Iterator.scala:1429)
    	at scala.collection.TraversableOnce.toSeq(TraversableOnce.scala:303)
    	at scala.collection.TraversableOnce.toSeq$(TraversableOnce.scala:303)
    	at scala.collection.AbstractIterator.toSeq(Iterator.scala:1429)
    	at com.johnsnowlabs.ml.tensorflow.TensorflowBert.calculateEmbeddings(TensorflowBert.scala:221)
    	at com.johnsnowlabs.nlp.embeddings.BertEmbeddings.$anonfun$batchAnnotate$2(BertEmbeddings.scala:237)
    	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
    	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
    	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
    	at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
    	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
    	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
    	at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:198)
    	at com.johnsnowlabs.nlp.embeddings.BertEmbeddings.batchAnnotate(BertEmbeddings.scala:229)
    	at com.johnsnowlabs.nlp.HasBatchedAnnotate.$anonfun$batchProcess$1(HasBatchedAnnotate.scala:41)
    	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
    	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
    	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
    	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
    	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:345)
    	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:872)
    	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:872)
    	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    	at org.apache.spark.scheduler.Task.run(Task.scala:127)
    	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:462)
    	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
    	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:465)
    	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    	at java.base/java.lang.Thread.run(Thread.java:829)
    21/06/11 20:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 14
    21/06/11 20:43:00 INFO Executor: Running task 0.1 in stage 2.0 (TID 14)
    21/06/11 20:43:00 ERROR Executor: Exception in task 0.1 in stage 2.0 (TID 14)
    java.lang.NullPointerException
    	[stack trace identical to task 0.0 above, from TensorflowWrapper.getTFHubSession down to Thread.run]
    21/06/11 20:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 15
    21/06/11 20:43:00 INFO Executor: Running task 0.2 in stage 2.0 (TID 15)
    21/06/11 20:43:00 ERROR Executor: Exception in task 0.2 in stage 2.0 (TID 15)
    java.lang.NullPointerException
    	[stack trace identical to task 0.0 above]
    21/06/11 20:43:00 INFO CoarseGrainedExecutorBackend: Got assigned task 16
    21/06/11 20:43:00 INFO Executor: Running task 0.3 in stage 2.0 (TID 16)
    21/06/11 20:43:01 ERROR Executor: Exception in task 0.3 in stage 2.0 (TID 16)
    java.lang.NullPointerException
    	[stack trace identical to task 0.0 above]
    21/06/11 20:43:04 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
    21/06/11 20:43:04 INFO MemoryStore: MemoryStore cleared
    21/06/11 20:43:04 ERROR CoarseGrainedExecutorBackend: RE
    

    I am a novice and this is probably a trivial issue, but I am raising it nonetheless since I couldn't find a solution anywhere.

    Thanks!
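    For context, here is a minimal sketch of the kind of driver code that produces a log like the one above. It is not the reporter's code: the model name small_bert_L2_128 is read off the cache paths in the log, and the rest of the pipeline is assumed.

    # Hypothetical reproduction sketch: small_bert_L2_128 appears in the
    # HadoopRDD input splits of the log above; the surrounding pipeline is assumed.
    import sparknlp
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, BertEmbeddings
    from pyspark.ml import Pipeline

    spark = sparknlp.start()

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    embeddings = BertEmbeddings.pretrained("small_bert_L2_128", "en") \
        .setInputCols(["document", "token"]) \
        .setOutputCol("embeddings")

    data = spark.createDataFrame([["Spark NLP ships with small BERT models."]]).toDF("text")
    pipeline = Pipeline(stages=[document, tokenizer, embeddings])
    pipeline.fit(data).transform(data).show()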

    bug-fix fixed-next-release 
    opened by havellay 22
  • Using Fat Jars behind company's firewall not viable.

    Using Fat Jars behind company's firewall not viable.

    Description

    I have started this conversation:

    https://spark-nlp.slack.com/archives/CA118BWRM/p1617225602087300

    and based on the response, I tried the Fat Jars on my work laptop. With the Fat Jars it moved past the session-start step, but it fell short at sentence detection, and there are big differences between spark-nlp 2.7.x and 3.0.x, as detailed below:

    1.1. On Spark NLP version 2.7.5: I got a timeout when the company's VPN is enabled (on my work macOS laptop):

    spark = SparkSession.builder\
        .appName("Spark NLP")\
        .master("local[4]")\
        .config("spark.driver.memory","16G")\
        .config("spark.driver.maxResultSize", "0")\
        .config("spark.kryoserializer.buffer.max", "2000M")\
        .config("spark.jars", "/Users/filotio/Downloads/spark-nlp-assembly-2.7.5.jar")\
        .getOrCreate()
     spark
    

    Apache Spark version: 2.4.4
    Spark NLP version: 2.7.5

    sentence_detector_dl download started this may take some time.

    Py4JJavaError                             Traceback (most recent call last)
    in
          1 sentencerDL = SentenceDetectorDLModel
    ----> 2     .pretrained("sentence_detector_dl", "en")
          3     .setInputCols(["document"])
          4     .setOutputCol("sentences")
          5

    ~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/annotator.py in pretrained(name, lang, remote_loc)
       3095     def pretrained(name="sentence_detector_dl", lang="en", remote_loc=None):
       3096         from sparknlp.pretrained import ResourceDownloader
    -> 3097         return ResourceDownloader.downloadModel(SentenceDetectorDLModel, name, lang, remote_loc)
       3098
       3099

    ~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/pretrained.py in downloadModel(reader, name, language, remote_loc, j_dwn)
         30     def downloadModel(reader, name, language, remote_loc=None, j_dwn='PythonResourceDownloader'):
         31         print(name + " download started this may take some time.")
    ---> 32         file_size = _internal._GetResourceSize(name, language, remote_loc).apply()
         33         if file_size == "-1":
         34             print("Can not find the model to download please check the name!")

    ~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/internal.py in init(self, name, language, remote_loc)
        190     def init(self, name, language, remote_loc):
        191         super(_GetResourceSize, self).init(
    --> 192             "com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize", name, language, remote_loc)
        193
        194

    ~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/internal.py in init(self, java_obj, *args)
        127         super(ExtendedJavaWrapper, self).init(java_obj)
        128         self.sc = SparkContext._active_spark_context
    --> 129         self._java_obj = self.new_java_obj(java_obj, *args)
        130         self.java_obj = self._java_obj
        131

    ~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/sparknlp/internal.py in new_java_obj(self, java_class, *args)
        137
        138     def new_java_obj(self, java_class, *args):
    --> 139         return self._new_java_obj(java_class, *args)
        140
        141     def new_java_array(self, pylist, java_class):

    ~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)
         65             java_obj = getattr(java_obj, name)
         66         java_args = [_py2java(sc, arg) for arg in args]
    ---> 67         return java_obj(*java_args)
         68
         69     @staticmethod

    ~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/py4j/java_gateway.py in call(self, *args)
       1255         answer = self.gateway_client.send_command(command)
       1256         return_value = get_return_value(
    -> 1257             answer, self.gateway_client, self.target_id, self.name)
       1258
       1259         for temp_arg in temp_args:

    ~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
         61     def deco(*a, **kw):
         62         try:
    ---> 63             return f(*a, **kw)
         64         except py4j.protocol.Py4JJavaError as e:
         65             s = e.java_exception.toString()

    ~/opt/anaconda3/envs/spark-nlp/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
        326                 raise Py4JJavaError(
        327                     "An error occurred while calling {0}{1}{2}.\n".
    --> 328                     format(target_id, ".", name), value)
        329             else:
        330                 raise Py4JError(

    Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
    : com.amazonawsShadedAmazonClientException: Unable to execute HTTP request: Connect to auxdata.johnsnowlabs.com.s3.amazonaws.com:443 timed out
    	at com.amazonawsShadedhttp.AmazonHttpClient.executeHelper(AmazonHttpClient.java:454)
    	at com.amazonawsShadedhttp.AmazonHttpClient.execute(AmazonHttpClient.java:232)
    	at com.amazonawsShadedservices.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
    	at com.amazonawsShadedservices.s3.AmazonS3Client.getObject(AmazonS3Client.java:1111)
    	at com.amazonawsShadedservices.s3.AmazonS3Client.getObject(AmazonS3Client.java:984)
    	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:69)
    	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:81)
    	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:159)
    	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:401)
    	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:501)
    	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    	at py4j.Gateway.invoke(Gateway.java:282)
    	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    	at py4j.commands.CallCommand.execute(CallCommand.java:79)
    	at py4j.GatewayConnection.run(GatewayConnection.java:238)
    	at java.lang.Thread.run(Thread.java:748)
    Caused by: org.apache.httpShadedconn.ConnectTimeoutException: Connect to auxdata.johnsnowlabs.com.s3.amazonaws.com:443 timed out
    	at org.apache.httpShadedconn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:551)
    	at org.apache.httpShadedimpl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:180)
    	at org.apache.httpShadedimpl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:294)
    	at org.apache.httpShadedimpl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:641)
    	at org.apache.httpShadedimpl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:480)
    	at org.apache.httpShadedimpl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
    	at org.apache.httpShadedimpl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
    	at com.amazonawsShadedhttp.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
    	... 21 more

    1.2. However, once I disable the company's VPN, the above call to SentenceDetectorDLModel works!

    2.1. Using Spark NLP version 3.0.1, I get a NullPointerException back:

    spark = SparkSession.builder\
        .appName("Spark NLP")\
        .master("local[4]")\
        .config("spark.driver.memory","16G")\
        .config("spark.driver.maxResultSize", "0")\
        .config("spark.kryoserializer.buffer.max", "2000M")\
        .config("spark.jars", "/Users/filotio/Downloads/spark-nlp-assembly-3.0.1.jar")\
        .getOrCreate()
     spark
    

    Apache Spark version: 3.1.1
    Spark NLP version: 3.0.1

    sentence_detector_dl download started this may take some time.

    Py4JJavaError                             Traceback (most recent call last)
    in
          1 sentencerDL = SentenceDetectorDLModel
    ----> 2     .pretrained("sentence_detector_dl", "en")
          3     .setInputCols(["document"])
          4     .setOutputCol("sentences")
          5

    ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/annotator.py in pretrained(name, lang, remote_loc)
       3107     def pretrained(name="sentence_detector_dl", lang="en", remote_loc=None):
       3108         from sparknlp.pretrained import ResourceDownloader
    -> 3109         return ResourceDownloader.downloadModel(SentenceDetectorDLModel, name, lang, remote_loc)
       3110
       3111

    ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/pretrained.py in downloadModel(reader, name, language, remote_loc, j_dwn)
         30     def downloadModel(reader, name, language, remote_loc=None, j_dwn='PythonResourceDownloader'):
         31         print(name + " download started this may take some time.")
    ---> 32         file_size = _internal._GetResourceSize(name, language, remote_loc).apply()
         33         if file_size == "-1":
         34             print("Can not find the model to download please check the name!")

    ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/internal.py in init(self, name, language, remote_loc)
        190     def init(self, name, language, remote_loc):
        191         super(_GetResourceSize, self).init(
    --> 192             "com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize", name, language, remote_loc)
        193
        194

    ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/internal.py in init(self, java_obj, *args)
        127         super(ExtendedJavaWrapper, self).init(java_obj)
        128         self.sc = SparkContext._active_spark_context
    --> 129         self._java_obj = self.new_java_obj(java_obj, *args)
        130         self.java_obj = self._java_obj
        131

    ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/sparknlp/internal.py in new_java_obj(self, java_class, *args)
        137
        138     def new_java_obj(self, java_class, *args):
    --> 139         return self._new_java_obj(java_class, *args)
        140
        141     def new_java_array(self, pylist, java_class):

    ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)
         64             java_obj = getattr(java_obj, name)
         65         java_args = [_py2java(sc, arg) for arg in args]
    ---> 66         return java_obj(*java_args)
         67
         68     @staticmethod

    ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/java_gateway.py in call(self, *args)
       1303         answer = self.gateway_client.send_command(command)
       1304         return_value = get_return_value(
    -> 1305             answer, self.gateway_client, self.target_id, self.name)
       1306
       1307         for temp_arg in temp_args:

    ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
        109     def deco(*a, **kw):
        110         try:
    --> 111             return f(*a, **kw)
        112         except py4j.protocol.Py4JJavaError as e:
        113             converted = convert_exception(e.java_exception)

    ~/opt/anaconda3/envs/sparknlp/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
        326                 raise Py4JJavaError(
        327                     "An error occurred while calling {0}{1}{2}.\n".
    --> 328                     format(target_id, ".", name), value)
        329             else:
        330                 raise Py4JError(

    Py4JJavaError: An error occurred while calling z:com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize.
    : java.lang.NullPointerException
    	at com.amazonaws.ShadedByJSLClientConfiguration.getProxyUsernameEnvironment(ClientConfiguration.java:874)
    	at com.amazonaws.ShadedByJSLClientConfiguration.getProxyUsername(ClientConfiguration.java:902)
    	at com.amazonaws.ShadedByJSLhttp.settings.HttpClientSettings.getProxyUsername(HttpClientSettings.java:90)
    	at com.amazonaws.ShadedByJSLhttp.settings.HttpClientSettings.isAuthenticatedProxy(HttpClientSettings.java:182)
    	at com.amazonaws.ShadedByJSLhttp.apache.client.impl.ApacheHttpClientFactory.addProxyConfig(ApacheHttpClientFactory.java:96)
    	at com.amazonaws.ShadedByJSLhttp.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:75)
    	at com.amazonaws.ShadedByJSLhttp.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:38)
    	at com.amazonaws.ShadedByJSLhttp.AmazonHttpClient.<init>(AmazonHttpClient.java:324)
    	at com.amazonaws.ShadedByJSLhttp.AmazonHttpClient.<init>(AmazonHttpClient.java:308)
    	at com.amazonaws.ShadedByJSLAmazonWebServiceClient.<init>(AmazonWebServiceClient.java:229)
    	at com.amazonaws.ShadedByJSLAmazonWebServiceClient.<init>(AmazonWebServiceClient.java:181)
    	at com.amazonaws.ShadedByJSLservices.s3.AmazonS3Client.<init>(AmazonS3Client.java:617)
    	at com.amazonaws.ShadedByJSLservices.s3.AmazonS3Client.<init>(AmazonS3Client.java:597)
    	at com.amazonaws.ShadedByJSLservices.s3.AmazonS3Client.<init>(AmazonS3Client.java:575)
    	at com.amazonaws.ShadedByJSLservices.s3.AmazonS3Client.<init>(AmazonS3Client.java:542)
    	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.client$lzycompute(S3ResourceDownloader.scala:45)
    	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.client(S3ResourceDownloader.scala:36)
    	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.downloadMetadataIfNeed(S3ResourceDownloader.scala:69)
    	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.resolveLink(S3ResourceDownloader.scala:81)
    	at com.johnsnowlabs.nlp.pretrained.S3ResourceDownloader.getDownloadSize(S3ResourceDownloader.scala:159)
    	at com.johnsnowlabs.nlp.pretrained.ResourceDownloader$.getDownloadSize(ResourceDownloader.scala:401)
    	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader$.getDownloadSize(ResourceDownloader.scala:501)
    	at com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize(ResourceDownloader.scala)
    	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    	at java.lang.reflect.Method.invoke(Method.java:498)
    	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    	at py4j.Gateway.invoke(Gateway.java:282)
    	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    	at py4j.commands.CallCommand.execute(CallCommand.java:79)
    	at py4j.GatewayConnection.run(GatewayConnection.java:238)
    	at java.lang.Thread.run(Thread.java:748)

    2.2. If I disable the company's VPN, I get the same NullPointerException as above (2.1).

    Expected Behavior

    I would like to use your code behind the company's firewall, and more importantly from AWS SageMaker. I test it first on my work laptop, so I would like to have it working there as well.

    Current Behavior

    Not working. I got a temporary healthcare license, which expires in a couple of days, and so far I have not been able to run any of your code behind the company's firewall. After setting up the Spark NLP session using the Fat Jars, it fails as soon as I use a pretrained model such as:

    sentencerDL = SentenceDetectorDLModel \
        .pretrained("sentence_detector_dl", "en") \
        .setInputCols(["document"]) \
        .setOutputCol("sentences")

    Possible Solution

    I like the idea of using Fat Jars, but I need them to be functional.
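    A possible workaround worth noting here (an editor's suggestion, not part of the original report): the pretrained() call is what reaches out to S3, so behind a firewall you can download and unzip the model from the Models Hub by hand and load it offline with load(), which saved Spark NLP models support. A minimal sketch, assuming a hypothetical local path:

    # Hedged offline-loading sketch; the local path is hypothetical and must
    # point at a model downloaded and unzipped from the Models Hub beforehand.
    from sparknlp.annotator import SentenceDetectorDLModel

    sentencerDL = SentenceDetectorDLModel \
        .load("/Users/filotio/Downloads/sentence_detector_dl_en") \
        .setInputCols(["document"]) \
        .setOutputCol("sentences")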

    Steps to Reproduce

    Tested on my work macOS Catalina (latest version) laptop, using the installation instructions from https://nlp.johnsnowlabs.com/docs/en/install#python, for both:

    $ java -version
    $ conda create -n sparknlp python=3.7 -y
    $ conda activate sparknlp
    $ pip install spark-nlp==3.0.1 pyspark==3.1.1
    $ pip install jupyter
    $ jupyter notebook

    and

    $ java -version
    $ conda create -n spark-nlp python=3.7 -y
    $ conda activate spark-nlp
    $ pip install spark-nlp==2.7.5 pyspark==2.4.4
    $ pip install jupyter
    $ jupyter notebook

    Pretty much follow the code from: https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Public/9.SentenceDetectorDL.ipynb#scrollTo=KvNuyGXpD7Nt

    but using the Fat Jars instead:

    spark = SparkSession.builder \
        .appName("Spark NLP") \
        .master("local[4]") \
        .config("spark.driver.memory","16G") \
        .config("spark.driver.maxResultSize", "0") \
        .config("spark.kryoserializer.buffer.max", "2000M") \
        .config("spark.jars", "/Users/filotio/Downloads/spark-nlp-assembly-3.0.1.jar") \
        .getOrCreate()

    and the moment I hit this code:

    sentencerDL = SentenceDetectorDLModel \
        .pretrained("sentence_detector_dl", "en") \
        .setInputCols(["document"]) \
        .setOutputCol("sentences")

    I get the above errors (a NullPointerException for spark-nlp 3.0.x and a timeout for spark-nlp 2.7.x).

    Context

    Your Environment

    • Spark NLP version (sparknlp.version()): 3.0.1
    • Apache Spark version (spark.version): 3.1.1
    • Java version (java -version): openjdk version "1.8.0_282" OpenJDK Runtime Environment (build 1.8.0_282-bre_2021_01_20_16_37-b00) OpenJDK 64-Bit Server VM (build 25.282-b00, mixed mode)
    • Conda: latest release
    • Operating System and version: macOS Catalina, latest release
    opened by Octavian-act 22
  • Tensorflow lib core dumped

    Tensorflow lib core dumped

    When I try to use a pretrained model, I get a core dump. The error is below.

    2020-08-05 14:35:59 INFO  HadoopRDD:54 - Input split: hdfs://namenode:9000/models/recognize_entities_dl/stages/4_NerDLModel_d4424c9af5f4/fields/datasetParams/part-00011:0+2831
    2020-08-05 14:35:59 INFO  Executor:54 - Finished task 4.0 in stage 16.0 (TID 32). 765 bytes result sent to driver
    2020-08-05 14:35:59 INFO  TaskSetManager:54 - Finished task 4.0 in stage 16.0 (TID 32) in 38 ms on localhost (executor driver) (5/7)
    2020-08-05 14:35:59 INFO  Executor:54 - Finished task 5.0 in stage 16.0 (TID 33). 765 bytes result sent to driver
    2020-08-05 14:35:59 INFO  TaskSetManager:54 - Finished task 5.0 in stage 16.0 (TID 33) in 47 ms on localhost (executor driver) (6/7)
    2020-08-05 14:35:59 INFO  Executor:54 - Finished task 6.0 in stage 16.0 (TID 34). 2146 bytes result sent to driver
    2020-08-05 14:35:59 INFO  TaskSetManager:54 - Finished task 6.0 in stage 16.0 (TID 34) in 55 ms on localhost (executor driver) (7/7)
    2020-08-05 14:35:59 INFO  TaskSchedulerImpl:54 - Removed TaskSet 16.0, whose tasks have all completed, from pool 
    2020-08-05 14:35:59 INFO  DAGScheduler:54 - ResultStage 16 (first at Feature.scala:120) finished in 0.110 s
    2020-08-05 14:35:59 INFO  DAGScheduler:54 - Job 16 finished: first at Feature.scala:120, took 0.119676 s
    2020-08-05 14:35:59 INFO  MemoryStore:54 - Block broadcast_31 stored as values in memory (estimated size 8.4 KB, free 361.2 MB)
    2020-08-05 14:35:59 INFO  MemoryStore:54 - Block broadcast_31_piece0 stored as bytes in memory (estimated size 440.0 B, free 361.2 MB)
    2020-08-05 14:35:59 INFO  BlockManagerInfo:54 - Added broadcast_31_piece0 in memory on 82a79ae5305b:45455 (size: 440.0 B, free: 365.8 MB)
    2020-08-05 14:35:59 INFO  SparkContext:54 - Created broadcast 31 from broadcast at Feature.scala:87
    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    #  SIGILL (0x4) at pc=0x00007f2dae59ada9, pid=846, tid=0x00007f2e5dad5700
    #
    # JRE version: OpenJDK Runtime Environment (8.0_171-b11) (build 1.8.0_171-8u171-b11-1~bpo8+1-b11)
    # Java VM: OpenJDK 64-Bit Server VM (25.171-b11 mixed mode linux-amd64 compressed oops)
    # Problematic frame:
    # C  [libtensorflow_framework.so.1+0x744da9]  _GLOBAL__sub_I_loader.cc+0x99
    #
    # Core dump written. Default location: //core or core.846
    #
    # An error report file with more information is saved as:
    # //hs_err_pid846.log
    #
    # If you would like to submit a bug report, please visit:
    #   http://bugreport.java.com/bugreport/crash.jsp
    # The crash happened outside the Java Virtual Machine in native code.
    # See problematic frame for where to report the bug.
    

    Steps to Reproduce

    1. Clone the repo https://github.com/miloradtrninic/entity/
    2. Run docker compose from the cloned directory
    3. Download recognize_entities_dl for offline usage (recognize_entities_dl_en_2.4.3_2.4_1584626752821)
    4. Unzip the model on the local computer
    5. docker exec namenode mkdir models
    6. docker cp recognize_entities_dl/ namenode:/models/recognize_entities_dl
    7. docker exec namenode hdfs dfs -mkdir /models
    8. docker exec namenode hdfs dfs -put /models/recognize_entities_dl/ /models/
    9. Run ./submit.sh b=1 d=1 e=1 a=/ (it gives 1 GB to the driver and executors and builds the project with sbt assembly)

    Context

    I am getting a core dump on a simple execution of the Spark NLP framework.

    It seems a lot like #923, but I think I have provided a reproducible environment.

    This issue, along with #985, is blocking me completely from using the library and proceeding with my master's thesis. Until it is fixed, can you provide some Docker images you know it works on? I get this same issue when I use offline models for the spark-nlp starter project in this environment.

    Your Environment

    • Spark version: 2.4.0
    • Spark NLP version: 2.4.5
    • Java version (java -version): 1.8
    • Setup and installation (Pypi, Conda, Maven, etc.): SBT
    • Operating System and version: Linux
    • Link to your project (if any): https://github.com/miloradtrninic/entity/
    wont-fix 
    opened by miloradtrninic 18
  • SPARKNLP-713 Modifies Default Values GraphExtraction

    SPARKNLP-713 Modifies Default Values GraphExtraction

    Description

    Modifies the default values of the explodeEntities and mergeEntities parameters

    Motivation and Context

    Setting these parameters to true by default makes this annotator produce an output, which keeps users from thinking it does not work. A sketch of the affected parameters follows below.
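    For reference, a hedged sketch of what this change means in user code; the column names follow the documented GraphExtraction examples and are assumptions of this sketch, not part of the PR:

    # Hedged sketch: the two parameters this PR turns on by default; before,
    # they had to be set by hand to get any output from the annotator.
    from sparknlp.annotator import GraphExtraction

    graph_extraction = GraphExtraction() \
        .setInputCols(["document", "token", "ner"]) \
        .setOutputCol("graph") \
        .setMergeEntities(True) \
        .setExplodeEntities(True)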

    How Has This Been Tested?

    Screenshots (if appropriate):

    Types of changes

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] Code improvements with no or little impact
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)

    Checklist:

    • [x] My code follows the code style of this project.
    • [ ] My change requires a change to the documentation.
    • [ ] I have updated the documentation accordingly.
    • [ ] I have read the CONTRIBUTING page.
    • [ ] I have added tests to cover my changes.
    • [x] All new and existing tests passed.
    opened by danilojsl 0
  • SPARKNLP-607: Implement HubertForCTC

    SPARKNLP-607: Implement HubertForCTC

    Description

    This PR adds an Annotator to load HubertForCTC models.

    Motivation and Context

    With more speech-to-text models coming out, we want to support a wider range of models.

    How Has This Been Tested?

    Added new tests for the annotator on the Python and Scala sides.

    Types of changes

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [ ] Code improvements with no or little impact
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)
    opened by DevinTDHa 0
  • Need guidance to finetune BertSentenceEmbedding using  domain specific pair of sentences

    Need guidance to finetune BertSentenceEmbedding using domain specific pair of sentences

    This is not a proper feature request; rather, I need guidance on building a customized model with BertSentenceEmbeddings on top of a pretrained model (for example small_bert_L2_128). I will use a domain-specific dataset to fine-tune that model. Please share the recommended approach from a Spark NLP perspective.

    Feature request 
    opened by srimantacse 0
  • Relocating public examples back to the main repository

    Relocating public examples back to the main repository

    We are relocating all examples related to the public Spark NLP back to the example directory. The reasons for this decision:

    • It is reasonable to have some examples under an example directory, like many other libraries
    • The public examples are abandoned in spark-nlp-workshop and not maintained by any specific team
    • The spark-nlp-workshop has become extremely hard to navigate. It's not easy for a new user to know where to start, and I don't see any sign that it will get better
    • Having all our examples in the main repository will allow us to keep them all compatible with each release (and version them via tags as well)
    • This also encourages us to have more examples in different languages, as the people maintaining the workshop mostly know Python
    documentation new-feature DON'T MERGE 
    opened by maziyarpanahi 0
  • Spark NLP 427 release candidate

    Spark NLP 427 release candidate

    • https://github.com/JohnSnowLabs/spark-nlp/pull/13280
    • https://github.com/JohnSnowLabs/spark-nlp/pull/13282
    • https://github.com/JohnSnowLabs/spark-nlp/pull/13283
    • https://github.com/JohnSnowLabs/spark-nlp/pull/13284
    enhancement documentation bug-fix models_hub DON'T MERGE 
    opened by maziyarpanahi 0
Releases(4.2.6)
  • 4.2.6(Dec 21, 2022)


    :star: Improvements

    • Updating Spark & PySpark dependencies from 3.2.1 to 3.2.3 in provided scripts and in all the documentation

    :bug: Bug Fixes

    • Fix the broken TypedDependencyParserApproach and TypedDependencyParserModel annotators used in Python (this bug was introduced in the 4.2.5 release; see the sketch after this list)
    • Fix the broken Python API documentation
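    For anyone verifying the first fix, a minimal hedged sketch of loading the repaired annotator from Python; the pretrained names below are the documented public ones, and the column wiring is this sketch's assumption:

    # Hedged sketch for verifying the TypedDependencyParserModel fix from Python;
    # the pretrained model names are the documented public ones.
    from sparknlp.annotator import DependencyParserModel, TypedDependencyParserModel

    dep_parser = DependencyParserModel.pretrained("dependency_conllu") \
        .setInputCols(["sentence", "pos", "token"]) \
        .setOutputCol("dependency")

    typed_parser = TypedDependencyParserModel.pretrained("dependency_typed_conllu") \
        .setInputCols(["dependency", "pos", "token"]) \
        .setOutputCol("dependency_type")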

    :book: Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==4.2.6
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.6
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.6
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.6
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.6
    

    M1

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.6
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.6
    

    AArch64

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.6
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.6
    

    Maven

    spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>4.2.6</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>4.2.6</version>
    </dependency>
    

    spark-nlp-m1:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-m1_2.12</artifactId>
        <version>4.2.6</version>
    </dependency>
    

    spark-nlp-aarch64:

    <!-- https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64 -->
    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-aarch64_2.12</artifactId>
        <version>4.2.6</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.6.jar

    • GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.6.jar

    • M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.6.jar

    • AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-4.2.6.jar

    What's Changed

    Contributors

    @gadde5300 @diatrambitas @Cabir40 @josejuanmartinez @danilojsl @jsl-builder @DevinTDHa @maziyarpanahi @dcecchini @agsfer

    Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/4.2.5...4.2.6

    Source code(tar.gz)
    Source code(zip)
  • 4.2.5(Dec 16, 2022)


    :loudspeaker: Overview

    Spark NLP 4.2.5 🚀 comes with a new CamemBERT for sequence classification annotator (multi-class & multi-label), new pipeline validation for LightPipeline in Python, 26 updated notebooks for using the latest TensorFlow and Transformers libraries, support for the new Databricks 11.3 runtime, support for the new EMR 6.8 and 6.9 versions (the only EMR versions with Spark 3.3), over 400 state-of-the-art multi-lingual pretrained models, and bug fixes.

    Do not forget to visit the Models Hub, with over 11,700 free and open-source models & pipelines. As always, we would like to thank our community for their feedback, questions, and feature requests. 🎉


    :star: New Features & improvements

    • NEW: Introducing the CamemBertForSequenceClassification annotator in Spark NLP 🚀. CamemBertForSequenceClassification can load CamemBERT models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using CamembertForSequenceClassification for PyTorch or TFCamembertForSequenceClassification for TensorFlow in HuggingFace 🤗 (a short usage sketch follows this list)
    • NEW: Add AnnotatorType validation in the Spark NLP LightPipeline. Currently, a misconfiguration of inputCols in an annotator in a pipeline raises an exception when using the transform method, but in LightPipeline it only outputs empty values. This behavior can confuse users; this change introduces a validation that now raises an exception in LightPipeline too.
      • Add outputAnnotatorType for all annotators in Python
      • Add inputAnnotatorTypes and outputAnnotatorType requirement validation for all subclasses derived from AnnotatorApproach and AnnotatorModel
      • Adding AnnotatorType validation in LightPipeline
    • NEW: Migrate 26 notebooks to import external Transformer models into Spark NLP. These notebooks now come with the latest TensorFlow 2.11.0 and HuggingFace 4.25.1 releases. The notebooks also have TF signatures with data input types explicitly set to guarantee model sanity once imported into Spark NLP
    • Add validation for the number and type of columns set in the TFNerDLGraphBuilder annotator, to help avoid incorrectly defined columns when using Spark NLP annotators in Python
    • Add more details to Alphabet error message in EntityRuler annotator to better guide users
    • Add instructions on how to resolve RocksDB incompatibilities when using Spark NLP with an M1 machine
    • Welcoming support for new Databricks runtimes:
      • 11.3
      • 11.3 ML
      • 11.3 GPU
    • Welcoming support for new EMR versions:
      • 6.8.0
      • 6.9.0
    • Refactor and implement better error handling in ResourceDownloader. This change removes getObjectFromS3, allowing the AWS SDK to raise the corresponding error. In addition, this change also refactors ResourceDownloader to reflect the intention of each credential type in the downloader
    • Implement full build and test of all unit tests based on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x major releases
    • Upgrade sbt-assembly to 1.2.0, which comes with lots of performance improvements. This benefits those who are trying to package Spark NLP as a Fat JAR
    • Update sbt to 1.8.0 with improvements and bug fixes, but mostly for CVEs fixes:
    • Use the new withIncludeScala in assemblyOption instead of value
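
    For illustration, a minimal sketch of the new annotator in a Python pipeline; the default pretrained() model and the column names are assumptions for the example, not taken from these notes:

    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, CamemBertForSequenceClassification

    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")

    tokenizer = Tokenizer() \
        .setInputCols(["document"]) \
        .setOutputCol("token")

    # Loads the default pretrained CamemBERT sequence-classification model;
    # models imported from HuggingFace plug in the same way
    seq_classifier = CamemBertForSequenceClassification.pretrained() \
        .setInputCols(["document", "token"]) \
        .setOutputCol("class")

    pipeline = Pipeline(stages=[document_assembler, tokenizer, seq_classifier])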

    :bug: Bug Fixes

    • Fix an issue with the BigTextMatcher Annotator, where it would not match entities with overlapping definitions. For Example, if both lung and lung cancer are defined, lung would not be matched in a given text. This was due to an abstraction error of one of the subclasses of the BigTextMatcher during construction of the underlying data structure
    • Fix indexing issue for RegexTokenizer annotator. If the document was split into sentences, the index of the sentence inside the document was not taken into consideration for the indexes of the tokens. This would lead to further issues down the pipeline, where tokens would be filtered while unpacking them for other Annotators
    • Refactor the Resolvers object in Spark NLP's dependency to avoid the conflict with the Resolvers inside the new sbt

    🛑 Known Issues

    • TypedDependencyParserModel annotator fails in Python in this release (will be fixed in 4.2.6 release next week)

    Models

    Spark NLP 4.2.5 comes with 400+ state-of-the-art pre-trained transformer models in many languages.

    Featured Models

    | Model | Name | Lang |
    |:---------------------|:-------------------|:---|
    | RoBertaForSequenceClassification | roberta_classifier_autotrain_neurips_chanllenge_1287149282 | en |
    | RoBertaForSequenceClassification | roberta_classifier_autonlp_imdb_rating_625417974 | en |
    | RoBertaForSequenceClassification | RoBertaForSequenceClassification | bn |
    | RoBertaForSequenceClassification | roberta_classifier_autotrain_citizen_nlu_hindi_1370952776 | hi |
    | RoBertaForSequenceClassification | roberta_classifier_detect_acoso_twitter | es |
    | RoBertaForQuestionAnswering | roberta_qa_deepset_base_squad2 | en |
    | RoBertaForQuestionAnswering | roberta_qa_icebert | is |
    | RoBertaForQuestionAnswering | roberta_qa_mrm8488_base_bne_finetuned_s_c | es |
    | RoBertaForQuestionAnswering | roberta_qa_base_bne_squad2 | es |
    | BertEmbeddings | bert_embeddings_rbt3 | zh |
    | BertEmbeddings | bert_embeddings_base_it_cased | it |
    | BertEmbeddings | bert_embeddings_base_indonesian_522m | id |
    | BertEmbeddings | bert_embeddings_base_german_uncased | de |
    | BertEmbeddings | [bert_embeddings_base_japanese_char](https://nlp.johnsnowlabs.com/2022/12/02/bert_embeddings_base_japanese_char_ja.html) | ja |
    | BertEmbeddings | [bert_embeddings_bangla_base](https://nlp.johnsnowlabs.com/2022/12/02/bert_embeddings_bangla_base_bn.html) | bn |
    | BertEmbeddings | [bert_embeddings_base_arabertv01](https://nlp.johnsnowlabs.com/2022/12/02/bert_embeddings_base_arabertv01_ar.html) | ar |

    Spark NLP covers the following languages:

    English ,Multilingual ,Afrikaans ,Afro-Asiatic languages ,Albanian ,Altaic languages ,American Sign Language ,Amharic ,Arabic ,Argentine Sign Language ,Armenian ,Artificial languages ,Atlantic-Congo languages ,Austro-Asiatic languages ,Austronesian languages ,Azerbaijani ,Baltic languages ,Bantu languages ,Basque ,Basque (family) ,Belarusian ,Bemba (Zambia) ,Bengali, Bangla ,Berber languages ,Bihari ,Bislama ,Bosnian ,Brazilian Sign Language ,Breton ,Bulgarian ,Catalan ,Caucasian languages ,Cebuano ,Celtic languages ,Central Bikol ,Chichewa, Chewa, Nyanja ,Chilean Sign Language ,Chinese ,Chuukese ,Colombian Sign Language ,Congo Swahili ,Croatian ,Cushitic languages ,Czech ,Danish ,Dholuo, Luo (Kenya and Tanzania) ,Dravidian languages ,Dutch ,East Slavic languages ,Eastern Malayo-Polynesian languages ,Efik ,Esperanto ,Estonian ,Ewe ,Fijian ,Finnish ,Finnish Sign Language ,Finno-Ugrian languages ,French ,French-based creoles and pidgins ,Ga ,Galician ,Ganda ,Georgian ,German ,Germanic languages ,Gilbertese ,Greek (modern) ,Greek languages ,Gujarati ,Gun ,Haitian, Haitian Creole ,Hausa ,Hebrew (modern) ,Hiligaynon ,Hindi ,Hiri Motu ,Hungarian ,Icelandic ,Igbo ,Iloko ,Indic languages ,Indo-European languages ,Indo-Iranian languages ,Indonesian ,Irish ,Isoko ,Isthmus Zapotec ,Italian ,Italic languages ,Japanese ,Japanese ,Kabyle ,Kalaallisut, Greenlandic ,Kannada ,Kaonde ,Kinyarwanda ,Kirundi ,Kongo ,Korean ,Kwangali ,Kwanyama, Kuanyama ,Latin ,Latvian ,Lingala ,Lithuanian ,Louisiana Creole ,Lozi ,Luba-Katanga ,Luba-Lulua ,Lunda ,Lushai ,Luvale ,Macedonian ,Malagasy ,Malay ,Malayalam ,Malayo-Polynesian languages ,Maltese ,Manx ,Marathi (Marāṭhī) ,Marshallese ,Mexican Sign Language ,Mon-Khmer languages ,Morisyen ,Mossi ,Multiple languages ,Ndonga ,Nepali ,Niger-Kordofanian languages ,Nigerian Pidgin ,Niuean ,North Germanic languages ,Northern Sotho, Pedi, Sepedi ,Norwegian ,Norwegian Bokmål ,Norwegian Nynorsk ,Nyaneka ,Oromo ,Pangasinan ,Papiamento ,Persian (Farsi) ,Peruvian Sign Language ,Philippine languages ,Pijin ,Pohnpeian ,Polish ,Portuguese ,Portuguese-based creoles and pidgins ,Punjabi (Eastern) ,Romance languages ,Romanian ,Rundi ,Russian ,Ruund ,Salishan languages ,Samoan ,San Salvador Kongo ,Sango ,Semitic languages ,Serbo-Croatian ,Seselwa Creole French ,Shona ,Sindhi ,Sino-Tibetan languages ,Slavic languages ,Slovak ,Slovene ,Somali ,South Caucasian languages ,South Slavic languages ,Southern Sotho ,Spanish ,Spanish Sign Language ,Sranan Tongo ,Swahili ,Swati ,Swedish ,Tagalog ,Tahitian ,Tai ,Tamil ,Telugu ,Tetela ,Tetun Dili ,Thai ,Tigrinya ,Tiv ,Tok Pisin ,Tonga (Tonga Islands) ,Tonga (Zambia) ,Tsonga ,Tswana ,Tumbuka ,Turkic languages ,Turkish ,Tuvalu ,Tzotzil ,Ukrainian ,Umbundu ,Uralic languages ,Urdu ,Venda ,Venezuelan Sign Language ,Vietnamese ,Wallisian ,Walloon ,Waray (Philippines) ,Welsh ,West Germanic languages ,West Slavic languages ,Western Malayo-Polynesian languages ,Wolaitta, Wolaytta ,Wolof ,Xhosa ,Yapese ,Yiddish ,Yoruba ,Yucatec Maya, Yucateco ,Zande (individual language) ,Zulu

    The complete list of all 11700+ models & pipelines in 230+ languages is available on Models Hub

    :notebook: New Notebooks

    | Spark NLP | Notebooks | Colab |
    |:------------|:-------------|:----------|
    | CamemBertForSequenceClassification | HuggingFace in Spark NLP - CamemBertForSequenceClassification | Open In Colab |

    :notebook: Updated Notebooks

    The following notebooks have been updated to use the latest releases of the TensorFlow 2.11 and Hugging Face 4.25 libraries

    | Spark NLP | Notebooks | Colab |
    |:------------|:-------------|:----------|
    | BertEmbeddings | HuggingFace in Spark NLP - BERT | Open In Colab |
    | BertSentenceEmbeddings | HuggingFace in Spark NLP - BERT Sentence | Open In Colab |
    | DistilBertEmbeddings | HuggingFace in Spark NLP - DistilBERT | Open In Colab |
    | CamemBertEmbeddings | HuggingFace in Spark NLP - CamemBERT | Open In Colab |
    | RoBertaEmbeddings | HuggingFace in Spark NLP - RoBERTa | Open In Colab |
    | DeBertaEmbeddings | HuggingFace in Spark NLP - DeBERTa | Open In Colab |
    | XlmRoBertaEmbeddings | HuggingFace in Spark NLP - XLM-RoBERTa | Open In Colab |
    | AlbertEmbeddings | HuggingFace in Spark NLP - ALBERT | Open In Colab |
    | BertForTokenClassification | HuggingFace in Spark NLP - BertForTokenClassification | Open In Colab |
    | DistilBertForTokenClassification | HuggingFace in Spark NLP - DistilBertForTokenClassification | Open In Colab |
    | AlbertForTokenClassification | HuggingFace in Spark NLP - AlbertForTokenClassification | Open In Colab |
    | RoBertaForTokenClassification | HuggingFace in Spark NLP - RoBertaForTokenClassification | Open In Colab |
    | XlmRoBertaForTokenClassification | HuggingFace in Spark NLP - XlmRoBertaForTokenClassification | Open In Colab |
    | CamemBertForTokenClassification | HuggingFace in Spark NLP - CamemBertForTokenClassification | Open In Colab |
    | CamemBertForSequenceClassification | HuggingFace in Spark NLP - CamemBertForSequenceClassification | Open In Colab |
    | BertForSequenceClassification | HuggingFace in Spark NLP - BertForSequenceClassification | Open In Colab |
    | DistilBertForSequenceClassification | HuggingFace in Spark NLP - DistilBertForSequenceClassification | Open In Colab |
    | AlbertForSequenceClassification | HuggingFace in Spark NLP - AlbertForSequenceClassification | Open In Colab |
    | RoBertaForSequenceClassification | HuggingFace in Spark NLP - RoBertaForSequenceClassification | Open In Colab |
    | XlmRoBertaForSequenceClassification | HuggingFace in Spark NLP - XlmRoBertaForSequenceClassification | Open In Colab |
    | AlbertForQuestionAnswering | HuggingFace in Spark NLP - AlbertForQuestionAnswering | Open In Colab |
    | BertForQuestionAnswering | HuggingFace in Spark NLP - BertForQuestionAnswering | Open In Colab |
    | DeBertaForQuestionAnswering | HuggingFace in Spark NLP - DeBertaForQuestionAnswering | Open In Colab |
    | DistilBertForQuestionAnswering | HuggingFace in Spark NLP - DistilBertForQuestionAnswering | Open In Colab |
    | RoBertaForQuestionAnswering | HuggingFace in Spark NLP - RoBertaForQuestionAnswering | Open In Colab |
    | XlmRobertaForQuestionAnswering | HuggingFace in Spark NLP - XlmRobertaForQuestionAnswering | Open In Colab |


    :book: Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==4.2.5
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.5
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.5
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.5
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.5
    

    M1

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.5
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.5
    

    AArch64

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.5
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.5
    

    Maven

    spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>4.2.5</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>4.2.5</version>
    </dependency>
    

    spark-nlp-m1:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-m1_2.12</artifactId>
        <version>4.2.5</version>
    </dependency>
    

    spark-nlp-aarch64:

    <!-- https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64 -->
    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-aarch64_2.12</artifactId>
        <version>4.2.5</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.5.jar

    • GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.5.jar

    • M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.5.jar

    • AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-4.2.5.jar

    What's Changed

    Contributors

    @Damla-Gurbaz @Cabir40 @josejuanmartinez @danilojsl @mhnavid @DevinTDHa @jsl-builder @KshitizGIT @suvrat-joshi @maziyarpanahi @agsfer

    New Contributors

    • @mhnavid made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/12977

    Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/4.2.4...4.2.5

    Source code(tar.gz)
    Source code(zip)
  • 4.2.4(Nov 28, 2022)


    :loudspeaker: Overview

    Spark NLP 4.2.4 🚀 comes with new support for GCP storage to automatically download and load models & pipelines by setting the cache_pretrained path, an update to TensorFlow 2.7.4 with security patch fixes, many documentation improvements, and bug fixes.

    Do not forget to visit Models Hub with 11400+ free and open-source models & pipelines. As always, we would like to thank our community for their feedback, questions, and feature requests. 🎉


    :star: New Features & improvements

    • Introducing support for GCP storage to automatically download and load pre-trained models/pipelines via the cache_pretrained directory (see the sketch after this list)
    • Update to TensorFlow 2.7.4 with bug and CVEs fixes. Details about bugs and CVEs fixes: https://github.com/JohnSnowLabs/spark-nlp/commit/417e2a1ff2b0bca2d2046c4d4740f52ce770689f
    • Improve error handling while importing external TensorFlow models into Spark NLP
    • Improve error messages when importing external models from remote storages like DBFS, S3, and HDFS
    • Update documentation on how to use testDataset param in NerDLApproach, ClassifierDLApproach, MultiClassifierDLApproach, and SentimentDLApproach
    • Update installation instructions for the Apple M1 chip
    • Add support for future decoder-encoder models with 2 separate models
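
    For illustration, a minimal sketch of pointing cache_pretrained at GCP storage when creating the session. The bucket and project id below are hypothetical, and the cluster is assumed to already have the GCS Hadoop connector and credentials configured:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("Spark NLP") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.4") \
        .config("spark.jsl.settings.gcp.project_id", "my-gcp-project") \
        .config("spark.jsl.settings.pretrained.cache_folder", "gs://my-bucket/cache_pretrained") \
        .getOrCreate()

    # Pretrained models/pipelines are now downloaded to, and loaded from, the bucket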

    🐛 Bug Fixes

    • Add missing setPreservePosition in NerConverter
    • Add missing inputAnnotatorTypes to BigTextMatcher, ViveknSentimentModel, and NerConverter annotators
    • Fix the wrong example code provided for LemmatizerModel in Models Hub
    • Fix the t5_grammar_error_corrector model to be compatible with Spark NLP 4.0+
    • Fix provided notebook to import Longformer models from Hugging Face into Spark NLP

    :notebook: New Notebooks

    | Spark NLP | Notebooks | Colab |
    |:------------|:-------------|:----------|
    | Spark NLP Conf | Download and Load Model from GCP Storage | Open In Colab |
    | LongformerEmbeddings | HuggingFace in Spark NLP - Longformer | Open In Colab |


    :book: Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==4.2.4
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.4
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.4
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.4
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.4
    

    M1

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.4
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.4
    

    AArch64

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.4
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.4
    

    Maven

    spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>4.2.4</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>4.2.4</version>
    </dependency>
    

    spark-nlp-m1:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-m1_2.12</artifactId>
        <version>4.2.4</version>
    </dependency>
    

    spark-nlp-aarch64:

    <!-- https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64 -->
    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-aarch64_2.12</artifactId>
        <version>4.2.4</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.4.jar

    • GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.4.jar

    • M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.4.jar

    • AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-4.2.4.jar

    What's Changed

    • release note v4.2.2 by @Cabir40 in https://github.com/JohnSnowLabs/spark-nlp/pull/13091
    • added languages by @ahmedlone127 in https://github.com/JohnSnowLabs/spark-nlp/pull/13097
    • [skip ci] Create PR 4.2.2-healthcare-docs-8fde8ce2327dce2fb89db1742eec8ca121eee0de-3 by @jsl-builder in https://github.com/JohnSnowLabs/spark-nlp/pull/13084
    • FEATURE NMH-139: Add annotator to existing model [skip-test] by @KshitizGIT in https://github.com/JohnSnowLabs/spark-nlp/pull/13096
    • Add Visual NLP 4.2 to compatible versions in models.json by @pabla in https://github.com/JohnSnowLabs/spark-nlp/pull/13099
    • Add new demos 25 by @agsfer in https://github.com/JohnSnowLabs/spark-nlp/pull/13100
    • Docs/alab 4.3.0 by @diatrambitas in https://github.com/JohnSnowLabs/spark-nlp/pull/13104
    • Added content for installation in OpenShift by @suvrat-joshi in https://github.com/JohnSnowLabs/spark-nlp/pull/13105
    • Update subtabs by @agsfer in https://github.com/JohnSnowLabs/spark-nlp/pull/13110
    • Release Notes Updated by @Cabir40 in https://github.com/JohnSnowLabs/spark-nlp/pull/13111
    • Updated old hc snippets by @ArshaanNazir in https://github.com/JohnSnowLabs/spark-nlp/pull/13092
    • Added content for healthcare nlp integration by @suvrat-joshi in https://github.com/JohnSnowLabs/spark-nlp/pull/13115
    • Added some content for troubleshooting section by @suvrat-joshi in https://github.com/JohnSnowLabs/spark-nlp/pull/13116
    • Docs/alab 2479 add content for model testing page by @rpranab in https://github.com/JohnSnowLabs/spark-nlp/pull/13114
    • Update oncology.md by @agsfer in https://github.com/JohnSnowLabs/spark-nlp/pull/13146
    • SPARKNLP-656 & SPARKNLP-657: Updated Documentation by @DevinTDHa in https://github.com/JohnSnowLabs/spark-nlp/pull/13108
    • SPARKNLP-658 Update EngineError message by @maziyarpanahi in https://github.com/JohnSnowLabs/spark-nlp/pull/13109
    • SPARKNLP-661: Add missing setPreservePosition in NerConverter by @DevinTDHa in https://github.com/JohnSnowLabs/spark-nlp/pull/13112
    • fixed Wrong Example code provided for LemmatizerModel #13125 by @ahmedlone127 in https://github.com/JohnSnowLabs/spark-nlp/pull/13126
    • SPARKNLP-620 Provide GCP Support for Cache Folder by @danilojsl in https://github.com/JohnSnowLabs/spark-nlp/pull/13141
    • SPARKNLP-669 Adding missing inputAnnotatorTypes by @danilojsl in https://github.com/JohnSnowLabs/spark-nlp/pull/13144
    • SPARKNLP-665 Updating to TensorFlow 2.7.4 by @maziyarpanahi in https://github.com/JohnSnowLabs/spark-nlp/pull/13152
    • SPARKNLP-671 incorporate the exception into the error message by @maziyarpanahi in https://github.com/JohnSnowLabs/spark-nlp/pull/13153
    • Models hub by @maziyarpanahi in https://github.com/JohnSnowLabs/spark-nlp/pull/13160
    • Release/424 release candidate by @maziyarpanahi in https://github.com/JohnSnowLabs/spark-nlp/pull/13163

    Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/4.2.3...4.2.4

    Source code(tar.gz)
    Source code(zip)
  • 4.2.1(Nov 28, 2022)


    :loudspeaker: Overview

    Spark NLP 4.2.1 🚀 comes with new multi-lingual support for Word Segmentation, used mostly (but not only) for Chinese, Japanese, and Korean; Automatic Speech Recognition (ASR) pipelines in the LightPipeline arsenal for faster computation on smaller datasets without Apache Spark (e.g. the RESTful API use case); support for processed audio files of type Double in addition to Float for Wav2Vec2; 230+ state-of-the-art Transformer Vision (ViT) pretrained pipelines for 1-line Image Classification; and bug fixes.

    Do not forget to visit Models Hub with 11400+ free and open-source models & pipelines. As always, we would like to thank our community for their feedback, questions, and feature requests. 🎉


    :star: New Features & improvements

    • NEW: Support for multi-lingual WordSegmenter. Add the enableRegexTokenizer feature in WordSegmenter to support word segmentation within mixed and multi-lingual content (see the sketch after this list) https://github.com/JohnSnowLabs/spark-nlp/pull/12854
    • NEW: Add support for Audio/ASR (Wav2Vec2) support to LightPipeline https://github.com/JohnSnowLabs/spark-nlp/pull/12895
    • NEW: Add support for Double type in addition to Float type to AudioAssembler annotator https://github.com/JohnSnowLabs/spark-nlp/pull/12904
    • Improve error handling in fullAnnotateImage for LightPipeline https://github.com/JohnSnowLabs/spark-nlp/pull/12868
    • Add SpanBertCoref annotator to all docs https://github.com/JohnSnowLabs/spark-nlp/pull/12889
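
    For illustration, a minimal sketch of the multi-lingual WordSegmenter; the pretrained model name is just an example, and the setEnableRegexTokenizer setter is an assumption based on the feature name above:

    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import WordSegmenterModel

    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")

    # Segments Chinese tokens while a regex tokenizer handles any embedded
    # non-Chinese (e.g. Latin-script) content in the same text
    word_segmenter = WordSegmenterModel.pretrained("wordseg_pku", "zh") \
        .setInputCols(["document"]) \
        .setOutputCol("token") \
        .setEnableRegexTokenizer(True)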

    Bug Fixes

    • Fix fullAnnotate in LightPipeline when fed with a list, which started to fail in the 4.2.0 release
    • Fix exception in ContextSpellCheckerModel when updateVocabClass is used with append set to true https://github.com/JohnSnowLabs/spark-nlp/pull/12875
    • Fix exception in Chunker annotator https://github.com/JohnSnowLabs/spark-nlp/pull/12901

    :notebook: New Notebooks

    | Spark NLP | Notebooks | Colab |
    |:------------|:-------------|:----------|
    | SpanBertCorefModel | Coreference Resolution with SpanBertCorefModel | Open In Colab |
    | WordSegmenter | Train and inference multi-lingual Word Segmenter | Open In Colab |


    Models

    Spark NLP 4.2.1 comes with 230+ state-of-the-art pre-trained Transformer Vision (ViT) pipelines:

    Featured Pipelines

    | Pipeline | Name | Lang |
    |:---------------------|:-------------------|:---|
    | PretrainedPipeline | pipeline_image_classifier_vit_base_patch16_224_finetuned_eurosat | en |
    | PretrainedPipeline | pipeline_image_classifier_vit_base_beans_demo_v5 | en |
    | PretrainedPipeline | pipeline_image_classifier_vit_animal_classifier_huggingface | en |
    | PretrainedPipeline | pipeline_image_classifier_vit_Infrastructures | en |
    | PretrainedPipeline | pipeline_image_classifier_vit_blocks | en |
    | PretrainedPipeline | pipeline_image_classifier_vit_beer_whisky_wine_detection | en |
    | PretrainedPipeline | pipeline_image_classifier_vit_base_xray_pneumonia | en |
    | PretrainedPipeline | pipeline_image_classifier_vit_baseball_stadium_foods | en |
    | PretrainedPipeline | pipeline_image_classifier_vit_dog_vs_chicken | en |

    Check out 460+ Transformer Vision (ViT) models & pipelines on Models Hub - Image Classification

    Spark NLP covers the following languages:

    English ,Multilingual ,Afrikaans ,Afro-Asiatic languages ,Albanian ,Altaic languages ,American Sign Language ,Amharic ,Arabic ,Argentine Sign Language ,Armenian ,Artificial languages ,Atlantic-Congo languages ,Austro-Asiatic languages ,Austronesian languages ,Azerbaijani ,Baltic languages ,Bantu languages ,Basque ,Basque (family) ,Belarusian ,Bemba (Zambia) ,Bengali, Bangla ,Berber languages ,Bihari ,Bislama ,Bosnian ,Brazilian Sign Language ,Breton ,Bulgarian ,Catalan ,Caucasian languages ,Cebuano ,Celtic languages ,Central Bikol ,Chichewa, Chewa, Nyanja ,Chilean Sign Language ,Chinese ,Chuukese ,Colombian Sign Language ,Congo Swahili ,Croatian ,Cushitic languages ,Czech ,Danish ,Dholuo, Luo (Kenya and Tanzania) ,Dravidian languages ,Dutch ,East Slavic languages ,Eastern Malayo-Polynesian languages ,Efik ,Esperanto ,Estonian ,Ewe ,Fijian ,Finnish ,Finnish Sign Language ,Finno-Ugrian languages ,French ,French-based creoles and pidgins ,Ga ,Galician ,Ganda ,Georgian ,German ,Germanic languages ,Gilbertese ,Greek (modern) ,Greek languages ,Gujarati ,Gun ,Haitian, Haitian Creole ,Hausa ,Hebrew (modern) ,Hiligaynon ,Hindi ,Hiri Motu ,Hungarian ,Icelandic ,Igbo ,Iloko ,Indic languages ,Indo-European languages ,Indo-Iranian languages ,Indonesian ,Irish ,Isoko ,Isthmus Zapotec ,Italian ,Italic languages ,Japanese ,Japanese ,Kabyle ,Kalaallisut, Greenlandic ,Kannada ,Kaonde ,Kinyarwanda ,Kirundi ,Kongo ,Korean ,Kwangali ,Kwanyama, Kuanyama ,Latin ,Latvian ,Lingala ,Lithuanian ,Louisiana Creole ,Lozi ,Luba-Katanga ,Luba-Lulua ,Lunda ,Lushai ,Luvale ,Macedonian ,Malagasy ,Malay ,Malayalam ,Malayo-Polynesian languages ,Maltese ,Manx ,Marathi (Marāṭhī) ,Marshallese ,Mexican Sign Language ,Mon-Khmer languages ,Morisyen ,Mossi ,Multiple languages ,Ndonga ,Nepali ,Niger-Kordofanian languages ,Nigerian Pidgin ,Niuean ,North Germanic languages ,Northern Sotho, Pedi, Sepedi ,Norwegian ,Norwegian Bokmål ,Norwegian Nynorsk ,Nyaneka ,Oromo ,Pangasinan ,Papiamento ,Persian (Farsi) ,Peruvian Sign Language ,Philippine languages ,Pijin ,Pohnpeian ,Polish ,Portuguese ,Portuguese-based creoles and pidgins ,Punjabi (Eastern) ,Romance languages ,Romanian ,Rundi ,Russian ,Ruund ,Salishan languages ,Samoan ,San Salvador Kongo ,Sango ,Semitic languages ,Serbo-Croatian ,Seselwa Creole French ,Shona ,Sindhi ,Sino-Tibetan languages ,Slavic languages ,Slovak ,Slovene ,Somali ,South Caucasian languages ,South Slavic languages ,Southern Sotho ,Spanish ,Spanish Sign Language ,Sranan Tongo ,Swahili ,Swati ,Swedish ,Tagalog ,Tahitian ,Tai ,Tamil ,Telugu ,Tetela ,Tetun Dili ,Thai ,Tigrinya ,Tiv ,Tok Pisin ,Tonga (Tonga Islands) ,Tonga (Zambia) ,Tsonga ,Tswana ,Tumbuka ,Turkic languages ,Turkish ,Tuvalu ,Tzotzil ,Ukrainian ,Umbundu ,Uralic languages ,Urdu ,Venda ,Venezuelan Sign Language ,Vietnamese ,Wallisian ,Walloon ,Waray (Philippines) ,Welsh ,West Germanic languages ,West Slavic languages ,Western Malayo-Polynesian languages ,Wolaitta, Wolaytta ,Wolof ,Xhosa ,Yapese ,Yiddish ,Yoruba ,Yucatec Maya, Yucateco ,Zande (individual language) ,Zulu

    The complete list of all 11000+ models & pipelines in 230+ languages is available on Models Hub


    :book: Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==4.2.1
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.1
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.1
    

    M1

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.1
    

    Maven

    spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>4.2.1</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>4.2.1</version>
    </dependency>
    

    spark-nlp-m1:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-m1_2.12</artifactId>
        <version>4.2.1</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.1.jar

    • GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.1.jar

    • M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.1.jar

    • AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-4.2.1.jar

    What's Changed

    Contributors

    @Meryem1425 @muhammetsnts @jsl-models @josejuanmartinez @DevinTDHa @ArshaanNazir @C-K-Loan @KshitizGIT @agsfer @diatrambitas @danilojsl @Damla-Gurbaz @maziyarpanahi @jsl-builder

    New Contributors

    • @ArshaanNazir made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/12881

    Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/4.2.0...4.2.1

    Source code(tar.gz)
    Source code(zip)
  • 4.2.3(Nov 10, 2022)


    :loudspeaker: Overview

    Spark NLP 4.2.3 🚀 comes with new improvements to the CoNLLGenerator annotator, a new way to pass rules to the RegexMatcher annotator, unified control over the number of columns in setInputCols between Scala and Python, new documentation for the new IAnnotation feature for those using Spark NLP in Scala, and bug fixes.

    Do not forget to visit Models Hub with 11400+ free and open-source models & pipelines. As always, we would like to thank our community for their feedback, questions, and feature requests. 🎉


    :star: New Features & improvements

    • Add a metadata sentence key parameter to select which metadata field to use as the sentence in the CoNLLGenerator annotator
    • Include escaping in the CoNLLGenerator annotator when writing to CSV and preserve special-character tokens
    • Add rules and delimiter parameters to the RegexMatcher annotator to support strings as input in addition to a file, for example:
    from sparknlp.annotator import RegexMatcher

    regexMatcher = RegexMatcher() \
          .setRules(["\\d{4}\\/\\d\\d\\/\\d\\d,date", "\\d{2}\\/\\d\\d\\/\\d\\d,short_date"]) \
          .setDelimiter(",") \
          .setInputCols(["sentence"]) \
          .setOutputCol("regex") \
          .setStrategy("MATCH_ALL")
    
    • Implement new control over the number of accepted columns in Python. This syncs the behavior between Scala and Python when the user sets more columns than allowed inside setInputCols while using Spark NLP in Python
    • Add documentation for the new IAnnotation feature for Scala users

    Bug Fixes

    • Fix NotSerializableException when the WordEmbeddings annotator is used on a K8s cluster while setEnableInMemoryStorage is set to true
    • Fix a bug in the RegexTokenizer annotator when it outputs the wrong indexes if the pattern includes splits that are not followed by a space
    • Fix training module failing on EMR due to a bad Apache Spark version detection. The use of the following classes was fixed on EMR: CoNLL(), CoNLLU(), POS(), and PubTator()
    • Fix a bug in the CoNLLGenerator annotator where the token has non-int metadata
    • Fix the wrong SentencePiece model's name required for DeBertaForQuestionAnswering and DeBertaEmbeddings when importing models
    • Fix NaNs result in some ViTForImageClassification models/pipelines

    :notebook: New Notebooks


    :book: Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==4.2.3
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.3
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.3
    

    M1

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.3
    

    AArch64

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-aarch64_2.12:4.2.3
    

    Maven

    spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>4.2.3</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>4.2.3</version>
    </dependency>
    

    spark-nlp-m1:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-m1_2.12</artifactId>
        <version>4.2.3</version>
    </dependency>
    

    spark-nlp-aarch64:

    <!-- https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-aarch64 -->
    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-aarch64_2.12</artifactId>
        <version>4.2.3</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.3.jar

    • GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.3.jar

    • M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.3.jar

    • AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-4.2.3.jar

    What's Changed

    • Models hub legal by @josejuanmartinez in https://github.com/JohnSnowLabs/spark-nlp/pull/12999
    • Models hub finance by @josejuanmartinez in https://github.com/JohnSnowLabs/spark-nlp/pull/13000
    • Embed React and ReactDOM instead of packages from unpkg [skip test] by @pabla in https://github.com/JohnSnowLabs/spark-nlp/pull/13002
    • updated OCR release notes by @albertoandreottiATgmail in https://github.com/JohnSnowLabs/spark-nlp/pull/13010
    • Compat tables by @albertoandreottiATgmail in https://github.com/JohnSnowLabs/spark-nlp/pull/13012
    • Updating s3 link for dependency_conllu model by @luca-martial in https://github.com/JohnSnowLabs/spark-nlp/pull/13016
    • Add new demos by @agsfer in https://github.com/JohnSnowLabs/spark-nlp/pull/13020
    • Add new demos 24 by @agsfer in https://github.com/JohnSnowLabs/spark-nlp/pull/13022
    • Updated legre_contract_doc_parties_en and finre_work_experience_en mo… by @bunyamin-polat in https://github.com/JohnSnowLabs/spark-nlp/pull/13023
    • Docs/alab update documentation 410 by @diatrambitas in https://github.com/JohnSnowLabs/spark-nlp/pull/13024
    • Doc fix scala and open source by @ArshaanNazir in https://github.com/JohnSnowLabs/spark-nlp/pull/13008
    • Update 2022-10-22-finclf_bert_sentiment_analysis_lt.md by @gadde5300 in https://github.com/JohnSnowLabs/spark-nlp/pull/13026
    • add alab image by @agsfer in https://github.com/JohnSnowLabs/spark-nlp/pull/13030
    • Docs/alab update documentation 410 by @diatrambitas in https://github.com/JohnSnowLabs/spark-nlp/pull/13034
    • SPARKNLP 643 detecting spark version in a safer way by @maziyarpanahi in https://github.com/JohnSnowLabs/spark-nlp/pull/13035
    • Docs/alab update documentation 410 by @diatrambitas in https://github.com/JohnSnowLabs/spark-nlp/pull/13041
    • Added content for exporting visual NER project ad updated few other sections by @suvrat-joshi in https://github.com/JohnSnowLabs/spark-nlp/pull/13042
    • Bump model card Spark NLP HC version to 4.2.1 by @luca-martial in https://github.com/JohnSnowLabs/spark-nlp/pull/13027
    • SPARKNLP-642: Fix indexing issue for regex splits without space by @DevinTDHa in https://github.com/JohnSnowLabs/spark-nlp/pull/13032
    • Update ALAB by @agsfer in https://github.com/JohnSnowLabs/spark-nlp/pull/13045
    • Serializable Issue K8s Word Embeddings by @danilojsl in https://github.com/JohnSnowLabs/spark-nlp/pull/13001
    • FEATURE NMH-133: Rename products in search [skip-test] by @KshitizGIT in https://github.com/JohnSnowLabs/spark-nlp/pull/12998
    • Fix sorting in the versions drop-down [skip test] by @pabla in https://github.com/JohnSnowLabs/spark-nlp/pull/13049
    • Add tooltips for Unidirectional and Bidirectional models [skip test] by @pabla in https://github.com/JohnSnowLabs/spark-nlp/pull/13064
    • FEATURE NMH-134: Rebranding products [skip-test] by @KshitizGIT in https://github.com/JohnSnowLabs/spark-nlp/pull/13065
    • Adding Control for Annotators with One Column by @danilojsl in https://github.com/JohnSnowLabs/spark-nlp/pull/12997
    • Update 2022-10-18-legre_confidentiality_en.md by @gadde5300 in https://github.com/JohnSnowLabs/spark-nlp/pull/13059
    • Update 2022-09-28-legre_indemnifications_en.md by @gadde5300 in https://github.com/JohnSnowLabs/spark-nlp/pull/13058
    • Fix a bug in Vision Transformer annotator that results in NaNs for some models by @ahmedlone127 in https://github.com/JohnSnowLabs/spark-nlp/pull/13048
    • Bug fix and enhancements for CoNLLGenerator annotator by @maziyarpanahi in https://github.com/JohnSnowLabs/spark-nlp/pull/13053
    • SPARKNLP-621: Add string support to RegexMatcher in addition to a file by @DevinTDHa in https://github.com/JohnSnowLabs/spark-nlp/pull/13060
    • Add ScalaDoc for IAnnotation by @danilojsl in https://github.com/JohnSnowLabs/spark-nlp/pull/13061
    • doc fix in old hc md files by @ArshaanNazir in https://github.com/JohnSnowLabs/spark-nlp/pull/13025
    • Release/423 release candidate by @maziyarpanahi in https://github.com/JohnSnowLabs/spark-nlp/pull/13036

    Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/4.2.2...4.2.3

    Source code(tar.gz)
    Source code(zip)
  • 4.2.2(Oct 27, 2022)


    :loudspeaker: Overview

    Spark NLP 4.2.2 🚀 comes with support for DBFS, HDFS, and S3, in addition to local file systems, when importing external models from TF Hub and Hugging Face; unified LightPipeline APIs for Image Classification across Scala, Java, and Python; the new fullAnnotateImage for Scala and fullAnnotateImageJava for Java; LightPipeline support for Question Answering pre-trained pipelines; and bug fixes.

    Do not forget to visit Models Hub with 11400+ free and open-source models & pipelines. As always, we would like to thank our community for their feedback, questions, and feature requests. 🎉


    :star: New Features & improvements

    • Add support for importing TensorFlow SavedModel from remote storage like DBFS, S3, and HDFS. Starting with this release, you can import models saved from TF Hub and HuggingFace directly from remote storage (see the sketch after this list)
    • Add support for fullAnnotate in LightPipeline for the path of images in Scala
    • Add fullAnnotate method in PretrainedPipeline for Scala
    • Add fullAnnotateJava method in PretrainedPipeline for Java
    • Add fullAnnotateImage to PretrainedPipeline for Scala
    • Add fullAnnotateImageJava to PretrainedPipeline for Java
    • Add support for Question Answering in fullAnnotate method in PretrainedPipeline
    • Add Predicted Entities to all Vision Transformers (ViT) models and pipelines
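
    For illustration, a minimal sketch of importing an exported TensorFlow SavedModel directly from S3 (the bucket path is hypothetical; dbfs:/ and hdfs:/ paths work the same way):

    from sparknlp.annotator import BertEmbeddings

    # The folder must contain the exported SavedModel plus its assets
    # (e.g. the vocabulary), exactly as in the local import workflow
    bert = BertEmbeddings.loadSavedModel("s3://my-bucket/export/bert-base-cased", spark) \
        .setInputCols(["document", "token"]) \
        .setOutputCol("embeddings")

    # Save once in Spark NLP format so it can later be reloaded with .load()
    bert.write().overwrite().save("./bert_base_cased_spark_nlp")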

    Bug Fixes

    • Unify the annotatorType name in Python and Scala for Spark schema in Annotation, AnnotationImage, and AnnotationAudio
    • Fix missing indexes in the RecursiveTokenizer annotator affecting downstream NLP tasks in the pipeline

    :notebook: New Notebooks

    | Spark NLP | Notebooks | Colab |
    |:------------|:-------------|:----------|
    | WordSegmenter | Import External SavedModel From Remote | Open In Colab |


    :book: Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==4.2.2
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.2
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.2
    

    M1

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.2
    

    Maven

    spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>4.2.2</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>4.2.2</version>
    </dependency>
    

    spark-nlp-m1:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-m1_2.12</artifactId>
        <version>4.2.2</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.2.jar

    • GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.2.jar

    • M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.2.jar

    • AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-4.2.2.jar

    What's Changed

    Contributors

    @galiph @agsfer @pabla @josejuanmartinez @Cabir40 @maziyarpanahi @Meryem1425 @danilojsl @jsl-builder @jsl-models @ahmedlone127 @DevinTDHa @jdobes-cz @Damla-Gurbaz @Mary-Sci

    New Contributors

    • @Mary-Sci made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/12978

    Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/4.2.1...4.2.2

    Source code(tar.gz)
    Source code(zip)
  • 4.2.0(Sep 27, 2022)


    :loudspeaker: Overview

    For the first time ever we are delighted to announce Automatic Speech Recognition (ASR) support in Spark NLP by using state-of-the-art Wav2Vec2 models at scale 🚀. This release also comes with Table Question Answering by TAPAS, CamemBERT for Token Classification, support for an external test dataset during training of all classifiers, much faster EntityRuler, 3000+ state-of-the-art models, and other enhancements and bug fixes!

    We are also celebrating crossing 11000+ free and open-source models & pipelines in our Models Hub. 🎉 As always, we would like to thank our community for their feedback, questions, and feature requests.


    :star: New Features & improvements

    • NEW: Introducing Wav2Vec2ForCTC annotator in Spark NLP 🚀. Wav2Vec2ForCTC can load Wav2Vec2 models for the Automatic Speech Recognition (ASR) task. Wav2Vec2 is a multi-modal model that combines speech and text; it's the first multi-modal model of its kind we welcome in Spark NLP. This annotator is compatible with all the models trained/fine-tuned by using Wav2Vec2ForCTC for PyTorch or TFWav2Vec2ForCTC for TensorFlow in HuggingFace 🤗 (https://github.com/JohnSnowLabs/spark-nlp/pull/12767). See the sketch after this list.

    wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

    • NEW: Introducing TapasForQuestionAnswering annotator in Spark NLP 🚀. TapasForQuestionAnswering can load TAPAS Models with a cell selection head and optional aggregation head on top for question-answering tasks on tables (linear layers on top of the hidden-states output to compute logits and optional logits_aggregation), e.g. for SQA, WTQ or WikiSQL-supervised tasks. TAPAS is a BERT-based model specifically designed (and pre-trained) for answering questions about tabular data. This annotator is compatible with all the models trained/fine-tuned by using TapasForQuestionAnswering for PyTorch or TFTapasForQuestionAnswering for TensorFlow models in HuggingFace 🤗

    TAPAS: Weakly Supervised Table Parsing via Pre-training

    • NEW: Introducing CamemBertForTokenClassification annotator in Spark NLP 🚀. CamemBertForTokenClassification can load CamemBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using CamembertForTokenClassification for PyTorch or TFCamembertForTokenClassification for TensorFlow in HuggingFace 🤗 (https://github.com/JohnSnowLabs/spark-nlp/pull/12752)
    • Implementing setTestDataset to evaluate metrics on an external dataset during training of text classifiers in Spark NLP. This feature is similar to NerDLApproach, where metrics are calculated on each epoch, and has been added to the following multi-class/multi-label text classifier annotators: ClassifierDLApproach, SentimentDLApproach, and MultiClassifierDLApproach (https://github.com/JohnSnowLabs/spark-nlp/pull/12796)
    • Refactoring and improving EntityRuler annotator inference, making it up to 24x faster, especially when used with a long list of labels/entities. We speed up the inference process by implementing the Aho-Corasick algorithm to match patterns in a string. This requires the following changes when using EntityRuler: https://github.com/JohnSnowLabs/spark-nlp/pull/12634
    • Add support for S3 storage in the cache_folder where models are downloaded, extracted, and loaded from. Previously, we only supported all local file systems, HDFS, and DBFS. This new feature is especially useful for users on Kubernetes clusters with no access to HDFS or any other distributed file systems (https://github.com/JohnSnowLabs/spark-nlp/pull/12707)
    • Implementing lookaround functionalities in DocumentNormalizer annotator. Currently, DocumentNormalizer has both lookahead and lookbehind functionalities. To extend support for more complex normalizations, especially within the clinical text we are introducing the lookaround feature (https://github.com/JohnSnowLabs/spark-nlp/pull/12735)
    • Implementing setReplaceEntities param to NerOverwriter annotator to replace all the NER labels (entities) with the given new labels (entities) (https://github.com/JohnSnowLabs/spark-nlp/pull/12745)
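
    For illustration, a minimal sketch of an ASR pipeline built around the new annotator. The default pretrained() model and the input DataFrame layout (a column of processed audio floats named audio_content) are assumptions for the example, not taken from these notes:

    from pyspark.ml import Pipeline
    from sparknlp.base import AudioAssembler
    from sparknlp.annotator import Wav2Vec2ForCTC

    # Wraps a column of processed audio floats into Spark NLP's AUDIO annotation type
    audio_assembler = AudioAssembler() \
        .setInputCol("audio_content") \
        .setOutputCol("audio_assembler")

    # Loads the default pretrained Wav2Vec2 CTC model for English ASR
    speech_to_text = Wav2Vec2ForCTC.pretrained() \
        .setInputCols(["audio_assembler"]) \
        .setOutputCol("text")

    pipeline = Pipeline(stages=[audio_assembler, speech_to_text])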

    Bug Fixes

    • Fix a bug in generating the NerDL graph by using TF v2. The previous graph generated by the TFGraphBuilder annotator resulted in an exception when the length of the sequence was 1. This issue has been resolved and the new graphs created by TFGraphBuilder won't have this issue anymore (https://github.com/JohnSnowLabs/spark-nlp/pull/12636)
    • Fix a bug introduced in the 4.0.0 release affecting Transformer-based Word Embeddings annotators. In the 4.0.0 release, the following annotators were migrated to BatchAnnotate to improve their performance, especially on GPU. However, a bug was introduced in sentence indices which, when combined with SentenceEmbeddings for text classification tasks (ClassifierDLApproach, SentimentDLApproach, and MultiClassifierDLApproach), resulted in low accuracy: AlbertEmbeddings, CamemBertEmbeddings, DeBertaEmbeddings, DistilBertEmbeddings, LongformerEmbeddings, RoBertaEmbeddings, XlmRoBertaEmbeddings, and XlnetEmbeddings (https://github.com/JohnSnowLabs/spark-nlp/pull/12641)
    • Add support for a list of questions and contexts in LightPipeline. Previously, only one context and question at a time were supported in LightPipeline for Question Answering annotators. We have added support to fullAnnotate and annotate to receive two lists of questions and contexts (https://github.com/JohnSnowLabs/spark-nlp/pull/12653)
    • Fix division by zero exception in the GPT2Transformer annotator when the setDoSample param was set to true (https://github.com/JohnSnowLabs/spark-nlp/pull/12661)
    • Fix AttributeError when PretrainedPipeline is used in Python with ImageAssembler as one of the stages (https://github.com/JohnSnowLabs/spark-nlp/pull/12813)

    :notebook: New Notebooks

    | Spark NLP | Notebooks | Colab |
    |:------------|:-------------|:----------|
    | Wav2Vec2ForCTC | Automatic Speech Recognition in Spark NLP | Open In Colab |
    | ViTForImageClassification | HuggingFace in Spark NLP - ViTForImageClassification | Open In Colab |
    | CamemBertForTokenClassification | HuggingFace in Spark NLP - CamemBertForTokenClassification | Open In Colab |
    | ClassifierDLApproach | ClassifierDL Train and Evaluate | Open In Colab |
    | MultiClassifierDLApproach | MultiClassifierDL Train and Evaluate | Open In Colab |
    | SentimentDLApproach | SentimentDL Train and Evaluate | Open In Colab |
    | Pretrained/cache_folder | Download & Load Models From S3 | Open In Colab |
    | EntityRuler | EntityRuler | Open In Colab |
    | EntityRuler | EntityRuler Alphabet | Open In Colab |
    | EntityRuler | EntityRuler LightPipeline | Open In Colab |
    | EntityRuler | EntityRuler Without Storage | Open In Colab |
    | DocumentNormalizer | Apply Lookaround Patterns | Open In Colab |


    Models

    Spark NLP 4.2.0 comes with 3000+ state-of-the-art pre-trained transformer models in many languages.

    Featured Models

    | Model | Name | Lang |
    |:---------------------|:-------------------|:---|
    | Wav2Vec2ForCTC | asr_wav2vec2_base_100h_by_facebook | en |
    | Wav2Vec2ForCTC | asr_wav2vec2_base_960h_by_facebook | en |
    | Wav2Vec2ForCTC | asr_wav2vec2_large_960h | en |
    | Wav2Vec2ForCTC | asr_wav2vec2_large_xlsr_53_german_by_facebook | de |
    | Wav2Vec2ForCTC | asr_wav2vec2_large_xlsr_53_french_by_facebook | fr |
    | Wav2Vec2ForCTC | asr_wav2vec2_large_xlsr_53_polish_by_facebook | nl |
    | Wav2Vec2ForCTC | asr_wav2vec2_base_10k_voxpopuli | hu |
    | Wav2Vec2ForCTC | asr_wav2vec2_base_10k_voxpopuli | fi |
    | Wav2Vec2ForCTC | asr_wav2vec2_base_10k_voxpopuli | it |
    | Wav2Vec2ForCTC | asr_wav2vec2_large_xlsr_japanese_hiragana | ja |

    Check out 2000+ Wav2Vec2 models & pipelines on Models Hub - Automatic Speech Recognition (ASR)

    Spark NLP covers the following languages:

    English ,Multilingual ,Afrikaans ,Afro-Asiatic languages ,Albanian ,Altaic languages ,American Sign Language ,Amharic ,Arabic ,Argentine Sign Language ,Armenian ,Artificial languages ,Atlantic-Congo languages ,Austro-Asiatic languages ,Austronesian languages ,Azerbaijani ,Baltic languages ,Bantu languages ,Basque ,Basque (family) ,Belarusian ,Bemba (Zambia) ,Bengali, Bangla ,Berber languages ,Bihari ,Bislama ,Bosnian ,Brazilian Sign Language ,Breton ,Bulgarian ,Catalan ,Caucasian languages ,Cebuano ,Celtic languages ,Central Bikol ,Chichewa, Chewa, Nyanja ,Chilean Sign Language ,Chinese ,Chuukese ,Colombian Sign Language ,Congo Swahili ,Croatian ,Cushitic languages ,Czech ,Danish ,Dholuo, Luo (Kenya and Tanzania) ,Dravidian languages ,Dutch ,East Slavic languages ,Eastern Malayo-Polynesian languages ,Efik ,Esperanto ,Estonian ,Ewe ,Fijian ,Finnish ,Finnish Sign Language ,Finno-Ugrian languages ,French ,French-based creoles and pidgins ,Ga ,Galician ,Ganda ,Georgian ,German ,Germanic languages ,Gilbertese ,Greek (modern) ,Greek languages ,Gujarati ,Gun ,Haitian, Haitian Creole ,Hausa ,Hebrew (modern) ,Hiligaynon ,Hindi ,Hiri Motu ,Hungarian ,Icelandic ,Igbo ,Iloko ,Indic languages ,Indo-European languages ,Indo-Iranian languages ,Indonesian ,Irish ,Isoko ,Isthmus Zapotec ,Italian ,Italic languages ,Japanese ,Japanese ,Kabyle ,Kalaallisut, Greenlandic ,Kannada ,Kaonde ,Kinyarwanda ,Kirundi ,Kongo ,Korean ,Kwangali ,Kwanyama, Kuanyama ,Latin ,Latvian ,Lingala ,Lithuanian ,Louisiana Creole ,Lozi ,Luba-Katanga ,Luba-Lulua ,Lunda ,Lushai ,Luvale ,Macedonian ,Malagasy ,Malay ,Malayalam ,Malayo-Polynesian languages ,Maltese ,Manx ,Marathi (Marāṭhī) ,Marshallese ,Mexican Sign Language ,Mon-Khmer languages ,Morisyen ,Mossi ,Multiple languages ,Ndonga ,Nepali ,Niger-Kordofanian languages ,Nigerian Pidgin ,Niuean ,North Germanic languages ,Northern Sotho, Pedi, Sepedi ,Norwegian ,Norwegian Bokmål ,Norwegian Nynorsk ,Nyaneka ,Oromo ,Pangasinan ,Papiamento ,Persian (Farsi) ,Peruvian Sign Language ,Philippine languages ,Pijin ,Pohnpeian ,Polish ,Portuguese ,Portuguese-based creoles and pidgins ,Punjabi (Eastern) ,Romance languages ,Romanian ,Rundi ,Russian ,Ruund ,Salishan languages ,Samoan ,San Salvador Kongo ,Sango ,Semitic languages ,Serbo-Croatian ,Seselwa Creole French ,Shona ,Sindhi ,Sino-Tibetan languages ,Slavic languages ,Slovak ,Slovene ,Somali ,South Caucasian languages ,South Slavic languages ,Southern Sotho ,Spanish ,Spanish Sign Language ,Sranan Tongo ,Swahili ,Swati ,Swedish ,Tagalog ,Tahitian ,Tai ,Tamil ,Telugu ,Tetela ,Tetun Dili ,Thai ,Tigrinya ,Tiv ,Tok Pisin ,Tonga (Tonga Islands) ,Tonga (Zambia) ,Tsonga ,Tswana ,Tumbuka ,Turkic languages ,Turkish ,Tuvalu ,Tzotzil ,Ukrainian ,Umbundu ,Uralic languages ,Urdu ,Venda ,Venezuelan Sign Language ,Vietnamese ,Wallisian ,Walloon ,Waray (Philippines) ,Welsh ,West Germanic languages ,West Slavic languages ,Western Malayo-Polynesian languages ,Wolaitta, Wolaytta ,Wolof ,Xhosa ,Yapese ,Yiddish ,Yoruba ,Yucatec Maya, Yucateco ,Zande (individual language) ,Zulu

    The complete list of all 11000+ models & pipelines in 230+ languages is available on Models Hub


    :book: Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==4.2.0
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.2.0
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.2.0
    

    M1

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.2.0
    

    Maven

    spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>4.2.0</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>4.2.0</version>
    </dependency>
    

    spark-nlp-m1:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-m1_2.12</artifactId>
        <version>4.2.0</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.2.0.jar

    • GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.2.0.jar

    • M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.2.0.jar

    • AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-4.2.0.jar

    What's Changed

    Contributors

    @maziyarpanahi @suvrat-joshi @danilojsl @josejuanmartinez @ahmedlone127 @Damla-Gurbaz @vankov @xusliebana @DevinTDHa @jsl-builder @Cabir40 @muhammetsnts @wolliq @Meryem1425 @pabla @C-K-Loan @rpranab @agsfer

    Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/4.1.0...4.2.0


    This discussion was created from the release John Snow Labs Spark-NLP 4.2.0: Wav2Vec2 for Automatic Speech Recognition (ASR), TAPAS for Table Question Answering, CamemBERT for Token Classification, new evaluation metrics for external datasets in all classifiers, much faster EntityRuler, over 3000+ state-of-the-art multi-lingual models & pipelines, and many more!

    Source code(tar.gz)
    Source code(zip)
  • 4.1.0(Aug 24, 2022)


    Overview

    An Image is Worth 16x16 Words!

    For the first time ever we are delighted to announce support for Image Classification in Spark NLP by using state-of-the-art Vision Transformer (ViT) models at scale. This release comes with official support for AWS Graviton and ARM64 processors, new Databricks and EMR support, and 1000+ state-of-the-art models.

    Spark NLP 4.1 also celebrates crossing 8000+ free and open-source models & pipelines available on Models Hub. 🎉 As always, we would like to thank our community for their feedback, questions, and feature requests.


    :star: New Features & improvements

    • NEW: Introducing ViTForImageClassification annotator in Spark NLP 🚀. ViTForImageClassification can load Vision Transformer ViT Models with an image classification head on top (a linear layer on top of the final hidden state of the [CLS] token) e.g. for ImageNet. This annotator is compatible with all the models trained/fine-tuned by using ViTForImageClassification for PyTorch or TFViTForImageClassification for TensorFlow models in HuggingFace 🤗 (https://github.com/JohnSnowLabs/spark-nlp/pull/11536)

    An overview of the ViT model structure as introduced in Google Research’s original 2021 paper

    # Assumes a running Spark session, e.g. spark = sparknlp.start()
    import sparknlp
    from sparknlp.base import *
    from sparknlp.annotator import *
    from pyspark.ml import Pipeline
    
    # Read raw images into a Spark DataFrame
    data_df = spark.read.format("image") \
                .load(path="images/")
    
    # Prepare images for Spark NLP annotators
    image_assembler = ImageAssembler() \
                .setInputCol("image") \
                .setOutputCol("image_assembler")
    
    # Load the default pretrained ViT image-classification model
    image_classifier = ViTForImageClassification \
        .pretrained() \
        .setInputCols("image_assembler") \
        .setOutputCol("class")
    
    pipeline = Pipeline(stages=[
        image_assembler,
        image_classifier,
    ])
    
    model = pipeline.fit(data_df)
    
    • NEW: Support for AWS Graviton/Graviton2 With up to 3x Better Price-Performance. For the first time, Spark NLP supports Graviton and ARM64 (ARMv8 and above) processors. (https://github.com/JohnSnowLabs/spark-nlp/pull/10939)
    • NEW: Introducing TFNerDLGraphBuilder annotator. TFNerDLGraphBuilder can be used to automatically detect the parameters of a needed NerDL graph and generate the graph within a pipeline when the default NER graphs are not suitable for your training datasets; see the sketch after this list. TFNerDLGraphBuilder supports local, DBFS, and S3 file systems. (https://github.com/JohnSnowLabs/spark-nlp/pull/10564)
    • Allow passing confidence scores from all XXXForTokenClassification annotators to NerConverter. It is now possible to access the confidence scores coming from the following annotators in NerConverter metadata (similar to NerDLModel): AlbertForTokenClassification, BertForTokenClassification, DeBertaForTokenClassification, DistilBertForTokenClassification, LongformerForTokenClassification, RoBertaForTokenClassification, XlmRoBertaForTokenClassification, and XlnetForTokenClassification
    • Introducing PushToHub Python class to easily push public models & pipelines to Models Hub
    • Introducing fullAnnotateImage in the existing LightPipeline to support the ImageAssembler and ViTForImageClassification annotators in a Spark NLP pipeline. fullAnnotateImage accepts paths to images hosted locally, on DBFS, or on S3.
    from sparknlp.base import LightPipeline
    
    light_pipeline = LightPipeline(model)
    annotations_result = light_pipeline.fullAnnotateImage("images/hippopotamus.JPEG")
    
    • Welcoming a new EMR 6.x series to our Spark NLP family:
      • EMR 6.7.0 (now supports Apache Spark 3.2.1, Apache Hive 3.1.3, HUDI 0.11, PrestoDB 0.272, and Trino 0.378.)
    • Welcoming 3 new Databricks runtimes to our Spark NLP family:
    • Welcoming a new AWS Graviton-enabled Databricks runtime:
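
    To see how the pieces fit together, here is a minimal sketch of TFNerDLGraphBuilder inside a training pipeline. It follows the Graph Builder notebook; the column names and the local ./ner_graphs folder are illustrative, and the stages producing the sentence, token, and embeddings columns are omitted:

    from sparknlp.annotator import *

    graph_folder = "./ner_graphs"  # local, DBFS, and S3 paths are supported

    # Detect the parameters of the needed NerDL graph from the training data
    # and generate the graph before NerDLApproach starts training
    graph_builder = TFNerDLGraphBuilder() \
        .setInputCols(["sentence", "token", "embeddings"]) \
        .setLabelColumn("label") \
        .setGraphFile("auto") \
        .setGraphFolder(graph_folder)

    # Point NerDLApproach at the same folder so it picks up the generated graph;
    # the builder stage must run before NerDLApproach in the pipeline
    ner = NerDLApproach() \
        .setInputCols(["sentence", "token", "embeddings"]) \
        .setLabelColumn("label") \
        .setOutputCol("ner") \
        .setGraphFolder(graph_folder)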

    Models

    Spark NLP 4.1.0 comes with 1000+ state-of-the-art pre-trained transformer models for Image Classification, Token Classification, and Sequence Classification in many languages.

    Featured Models

    | Model | Name | Lang |
    |:---------------------|:-------------------|:---|
    | ViTForImageClassification | image_classifier_vit_base_patch16_224 | en |
    | ViTForImageClassification | image_classifier_vit_base_patch16_384 | en |
    | ViTForImageClassification | image_classifier_vit_base_patch32_384 | en |
    | ViTForImageClassification | image_classifier_vit_base_xray_pneumonia | en |
    | ViTForImageClassification | image_classifier_vit_finetuned_chest_xray_pneumonia | en |
    | ViTForImageClassification | image_classifier_vit_food | en |
    | ViTForImageClassification | image_classifier_vit_base_food101 | en |
    | ViTForImageClassification | image_classifier_vit_autotrain_dog_vs_food | en |
    | ViTForImageClassification | image_classifier_vit_baseball_stadium_foods | en |
    | ViTForImageClassification | image_classifier_vit_south_indian_foods | en |
    | ViTForImageClassification | image_classifier_vit_denver_nyc_paris | en |
    | ViTForImageClassification | image_classifier_vit_CarViT | en |

    Check out 240 (ViT) models on Models Hub - Image Classification

    Spark NLP covers the following languages:

    English ,Multilingual ,Afrikaans ,Afro-Asiatic languages ,Albanian ,Altaic languages ,American Sign Language ,Amharic ,Arabic ,Argentine Sign Language ,Armenian ,Artificial languages ,Atlantic-Congo languages ,Austro-Asiatic languages ,Austronesian languages ,Azerbaijani ,Baltic languages ,Bantu languages ,Basque ,Basque (family) ,Belarusian ,Bemba (Zambia) ,Bengali, Bangla ,Berber languages ,Bihari ,Bislama ,Bosnian ,Brazilian Sign Language ,Breton ,Bulgarian ,Catalan ,Caucasian languages ,Cebuano ,Celtic languages ,Central Bikol ,Chichewa, Chewa, Nyanja ,Chilean Sign Language ,Chinese ,Chuukese ,Colombian Sign Language ,Congo Swahili ,Croatian ,Cushitic languages ,Czech ,Danish ,Dholuo, Luo (Kenya and Tanzania) ,Dravidian languages ,Dutch ,East Slavic languages ,Eastern Malayo-Polynesian languages ,Efik ,Esperanto ,Estonian ,Ewe ,Fijian ,Finnish ,Finnish Sign Language ,Finno-Ugrian languages ,French ,French-based creoles and pidgins ,Ga ,Galician ,Ganda ,Georgian ,German ,Germanic languages ,Gilbertese ,Greek (modern) ,Greek languages ,Gujarati ,Gun ,Haitian, Haitian Creole ,Hausa ,Hebrew (modern) ,Hiligaynon ,Hindi ,Hiri Motu ,Hungarian ,Icelandic ,Igbo ,Iloko ,Indic languages ,Indo-European languages ,Indo-Iranian languages ,Indonesian ,Irish ,Isoko ,Isthmus Zapotec ,Italian ,Italic languages ,Japanese ,Japanese ,Kabyle ,Kalaallisut, Greenlandic ,Kannada ,Kaonde ,Kinyarwanda ,Kirundi ,Kongo ,Korean ,Kwangali ,Kwanyama, Kuanyama ,Latin ,Latvian ,Lingala ,Lithuanian ,Louisiana Creole ,Lozi ,Luba-Katanga ,Luba-Lulua ,Lunda ,Lushai ,Luvale ,Macedonian ,Malagasy ,Malay ,Malayalam ,Malayo-Polynesian languages ,Maltese ,Manx ,Marathi (Marāṭhī) ,Marshallese ,Mexican Sign Language ,Mon-Khmer languages ,Morisyen ,Mossi ,Multiple languages ,Ndonga ,Nepali ,Niger-Kordofanian languages ,Nigerian Pidgin ,Niuean ,North Germanic languages ,Northern Sotho, Pedi, Sepedi ,Norwegian ,Norwegian Bokmål ,Norwegian Nynorsk ,Nyaneka ,Oromo ,Pangasinan ,Papiamento ,Persian (Farsi) ,Peruvian Sign Language ,Philippine languages ,Pijin ,Pohnpeian ,Polish ,Portuguese ,Portuguese-based creoles and pidgins ,Punjabi (Eastern) ,Romance languages ,Romanian ,Rundi ,Russian ,Ruund ,Salishan languages ,Samoan ,San Salvador Kongo ,Sango ,Semitic languages ,Serbo-Croatian ,Seselwa Creole French ,Shona ,Sindhi ,Sino-Tibetan languages ,Slavic languages ,Slovak ,Slovene ,Somali ,South Caucasian languages ,South Slavic languages ,Southern Sotho ,Spanish ,Spanish Sign Language ,Sranan Tongo ,Swahili ,Swati ,Swedish ,Tagalog ,Tahitian ,Tai ,Tamil ,Telugu ,Tetela ,Tetun Dili ,Thai ,Tigrinya ,Tiv ,Tok Pisin ,Tonga (Tonga Islands) ,Tonga (Zambia) ,Tsonga ,Tswana ,Tumbuka ,Turkic languages ,Turkish ,Tuvalu ,Tzotzil ,Ukrainian ,Umbundu ,Uralic languages ,Urdu ,Venda ,Venezuelan Sign Language ,Vietnamese ,Wallisian ,Walloon ,Waray (Philippines) ,Welsh ,West Germanic languages ,West Slavic languages ,Western Malayo-Polynesian languages ,Wolaitta, Wolaytta ,Wolof ,Xhosa ,Yapese ,Yiddish ,Yoruba ,Yucatec Maya, Yucateco ,Zande (individual language) ,Zulu

    The complete list of all 8000+ models & pipelines in 230+ languages is available on Models Hub

    New Notebooks

    | Notebook |
    |:---------|
    | Graph Builder |
    | Graph ViTForImageClassification |


    :book: Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==4.1.0
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x (Scala 2.12):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.1.0
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.1.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.1.0
    

    M1

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.1.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.1.0
    

    Maven

    spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, and 3.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>4.1.0</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>4.1.0</version>
    </dependency>
    

    spark-nlp-m1:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-m1_2.12</artifactId>
        <version>4.1.0</version>
    </dependency>
    

    spark-nlp-aarch64:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-aarch64_2.12</artifactId>
        <version>4.1.0</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.1.0.jar

    • GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.1.0.jar

    • M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.1.0.jar

    • AArch64 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-aarch64-assembly-4.1.0.jar

    What's Changed

    New Contributors

    • @paulk-asert made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/11128
    • @cayorodriguez made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/10376

    Contributors

    @josejuanmartinez @jsl-models @maziyarpanahi @DevinTDHa @agsfer @rpranab @vankov @cayorodriguez @paulk-asert @Ahmetemintek @muhammetsnts @jsl-builder @Cabir40 @diatrambitas @galiph @ahmedlone127 @pabla @Damla-Gurbaz

    Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/4.0.2...4.1.0

  • 4.0.2 (Jul 19, 2022)


    Overview

    We are pleased to release Spark NLP 🚀 4.0.2! This release comes with full compatibility with the newly released Apache Spark 3.3.0 and official support for Databricks' new 11.1 Beta runtimes (Apache Spark 3.3.0, Scala 2.12).

    As always, we would like to thank our community for their feedback, questions, and feature requests.


    New Features

    • Welcoming new Databricks runtimes based on Spark/PySpark 3.3.0 to our Spark NLP family:
      • Databricks 11.1 Beta
      • Databricks 11.1 ML Beta
      • Databricks 11.1 ML Beta GPU
    • SentenceDetector now comes with a new parameter customBoundsStrategy for returning custom bounds https://github.com/JohnSnowLabs/spark-nlp/pull/10567

    Example

    with setCustomBounds([r"\.", ";"])

    This is a sentence. This one uses custom bounds; As is this one;
    

    Without the new flag, this will result in

    ["This is a sentence", "This one uses custom bounds", "As is this one"]
    

    With the new flag:

    .setCustomBounds([r"\.", ";"])
    .setCustomBoundsStrategy("append")
    

    the result will be

    ["This is a sentence.", "This one uses custom bounds;", "As is this one;"]
    

    Similarly with prepend:

    1. This is a list
    1.1 This is a subpoint
    2. Second thing
    2.2 Second subthing
    
    .setCustomBounds([r"\n[\d\.]+"])
    .setCustomBoundsStrategy("prepend")
    

    the result will be

    [
        "1. This is a list",
        "1.1 This is a subpoint",
        "2. Second thing",
        "2.2 Second subthing"
    ]
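
    Putting it all together, here is a minimal sketch (assuming a running Spark session started with sparknlp.start()) of the append strategy inside a pipeline:

    import sparknlp
    from sparknlp.base import *
    from sparknlp.annotator import *
    from pyspark.ml import Pipeline

    spark = sparknlp.start()

    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")

    # "append" keeps each matched bound attached to the end of its sentence
    sentence_detector = SentenceDetector() \
        .setInputCols(["document"]) \
        .setOutputCol("sentence") \
        .setCustomBounds([r"\.", ";"]) \
        .setCustomBoundsStrategy("append")

    pipeline = Pipeline(stages=[document_assembler, sentence_detector])

    data = spark.createDataFrame(
        [["This is a sentence. This one uses custom bounds; As is this one;"]]
    ).toDF("text")

    pipeline.fit(data).transform(data).select("sentence.result").show(truncate=False)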
    

    Bug Fixes

    • Fix bug that attempts to create spark session on executors when using GraphExtraction in Spark/PySpark 3.3 https://github.com/JohnSnowLabs/spark-nlp/pull/9905

    Models and Pipelines

    Spark NLP 4.0.2 comes with 620+ state-of-the-art pre-trained transformer models in 21 languages including multi-lingual models.

    Featured Models

    | Model | Name | Lang |
    |:---------------------|:-------------------|:---|
    | BertForQuestionAnswering | electra_qa_BioM_Base_SQuAD2_BioASQ8B | en |
    | BertForQuestionAnswering | bert_qa_multilingual_base_cased_chines | zh |
    | BertForQuestionAnswering | bert_qa_deep_pavlov_full | ru |
    | BertForQuestionAnswering | bert_qa_firmanindolanguagemodel | id |
    | BertForQuestionAnswering | bert_qa_kcbert_base_finetuned_squad | ko |
    | BertForQuestionAnswering | bert_qa_mbert_finetuned_mlqa_de_hi_dev | xx |
    | BertForQuestionAnswering | bert_qa_modelontquad | tr |
    | BertForQuestionAnswering | bert_qa_newsqa_el_4 | el |
    | BertForQuestionAnswering | bert_qa_testpersianqa | fa |
    | BertForQuestionAnswering | bert_qa_arabert_finetuned_arcd | ar |
    | BertForTokenClassification | bert_ner_NER_legal_de_Sahajtomar | de |
    | BertForTokenClassification | bert_ner_NER_en_vi_it_es_tinparadox | xx |
    | BertForTokenClassification | bert_ner_NER_CAMELBERT | ar |
    | BertForTokenClassification | bert_ner_Swedish_NER | sv |
    | BertForTokenClassification | bert_ner_bert_base_chinese_ner | zh |
    | BertForTokenClassification | bert_ner_bert_base_hu_cased_ner | hu |
    | BertForTokenClassification | bert_ner_bert_base_indonesian_NER | id |
    | BertForTokenClassification | bert_ner_bert_base_irish_cased_v1_finetuned_ner | ga |
    | BertForTokenClassification | bert_ner_bert_base_pt_archive | pt |
    | BertForTokenClassification | bert_ner_bert_base_spanish_wwm_uncased_finetuned_NER_medical | es |

    The complete list of all 6900+ models & pipelines in 230+ languages is available on Models Hub


    📖 Documentation & Articles


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==4.0.2
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.2
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.2
    

    M1

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.2
    

    Maven

    spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>4.0.2</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>4.0.2</version>
    </dependency>
    

    spark-nlp-m1:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-m1_2.12</artifactId>
        <version>4.0.2</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.0.2.jar

    • GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.0.2.jar

    • M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.0.2.jar

    What's Changed

    Contributors

    @gadde5300 @danilojsl @hsaglamlar @Cabir40 @ahmedlone127 @muhammetsnts @KshitizGIT @maziyarpanahi @albertoandreottiATgmail @DevinTDHa @luca-martial @Damla-Gurbaz @jsl-models @Meryem1425

    New Contributors

    • @hsaglamlar made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/10544

    Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/4.0.1...4.0.2

  • 4.0.1 (Jul 1, 2022)


    Overview

    We are pleased to release Spark NLP 🚀 4.0.1! This release adds support for the newly released Apache Spark 3.3.0, which brings improved join query performance via Bloom filters, increased pandas API coverage, and many other improvements. In addition, Spark NLP comes with official support for Databricks Runtime 11, other enhancements, and bug fixes.

    As always, we would like to thank our community for their feedback, questions, and feature requests.


    Features & Enhancements

    • Full support for Apache Spark & PySpark 3.3.0
    • Add Apache Spark 3.3.0 to Google Colab and Kaggle setup scripts
    • New -g option for the Google Colab and Kaggle setup scripts on GPU devices to upgrade libcudnn8 to 8.1.0 and solve the cuDNN issue on GPU
    • Welcoming new Databricks runtimes based on Spark/PySpark 3.3.0 to our Spark NLP family:
      • Databricks 11.0 LTS
      • Databricks 11.0 LTS ML
      • Databricks 11.0 LTS ML GPU

    Bug Fixes

    • Fix the error caused by PySpark 3.3.0 in CoNLL, CoNLLU, POS, and PubTator annotators as training helpers
    • Fix and re-upload Dependency and Type Dependency parser pre-trained models
    • Update pre-trained pipelines that had issues on PySpark 3.2 and 3.3

    Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==4.0.1
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.1
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.1
    

    M1

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.1
    

    Maven

    spark-nlp on Apache Spark 3.0.x, 3.1.x, 3.2.x, 3.3.x (Scala 2.12):

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>4.0.1</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>4.0.1</version>
    </dependency>
    

    spark-nlp-m1:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-m1_2.12</artifactId>
        <version>4.0.1</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.0.1.jar

    • GPU on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.0.1.jar

    • M1 on Apache Spark 3.0.x/3.1.x/3.2.x/3.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.0.1.jar

    What's Changed

    Contributors

    @muhammetsnts @jsl-models @Meryem1425 @Damla-Gurbaz @jsl-builder @rpranab @danilojsl @josejuanmartinez @Cabir40 @DevinTDHa @agsfer @suvrat-joshi @ahmedlone127 @albertoandreottiATgmail @KshitizGIT @mahmoodbayeshi @maziyarpanahi

    New Contributors

    • @ahmedlone127 made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/9887

    Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/4.0.0...4.0.1

  • 4.0.0 (Jun 15, 2022)


    Overview

    We are very excited to release Spark NLP 4.0.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community! 🎉

    This release comes with:

    • Official support for Apple silicon M1 chips (for the first time)
    • Official support for Spark/PySpark 3.2
    • Support for the oneAPI Deep Neural Network Library (oneDNN) to improve TensorFlow on CPU by up to 97%
    • Optimized transformer-based embeddings on GPU, increasing performance by up to +700%
    • Brand new, modern extractive transformer-based Question Answering (QA) annotators for tasks like SQuAD, based on the ALBERT, BERT, DistilBERT, DeBERTa, RoBERTa, Longformer, and XLM-RoBERTa architectures
    • 1000+ state-of-the-art models
    • WordEmbeddingsModel now working in clusters without HDFS/DBFS/S3, such as Kubernetes
    • New Databricks and EMR support
    • New NER models achieving the highest F1 scores in Spark NLP
    • Many more enhancements and bug fixes!

    Please note that Spark NLP 4.0.0 drops support for Spark 2.3 and 2.4 (Scala 2.11). Starting with 4.0.0, we only support Spark/PySpark 3.x on Scala 2.12.

    As always, we would like to thank our community for their feedback, questions, and feature requests.


    Major features and improvements

    • NEW: Support for the oneAPI Deep Neural Network Library (oneDNN) optimizations to improve TensorFlow on CPU. Enabling oneDNN can improve some transformer-based models by up to 97%. By default, the oneDNN optimizations are turned off. To enable them, set the environment variable TF_ENABLE_ONEDNN_OPTS. On Linux systems, for instance: export TF_ENABLE_ONEDNN_OPTS=1
    • NEW: Optimizing batch processing for transformer-based Word Embeddings on a GPU device. These optimizations can result in performance improvements up to +700% (more details in the Benchmarks section)
    • NEW: Official support for Apple silicon M1 on macOS devices. Starting with Spark NLP 4.0.0, you can use the spark-nlp-m1 package on Apple silicon M1 macOS machines
    • NEW: Introducing AlbertForQuestionAnswering annotator in Spark NLP 🚀. AlbertForQuestionAnswering can load ALBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using AlbertForQuestionAnswering for PyTorch or TFAlbertForQuestionAnswering for TensorFlow models in HuggingFace 🤗
    • NEW: Introducing BertForQuestionAnswering annotator in Spark NLP 🚀. BertForQuestionAnswering can load BERT & ELECTRA Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using BertForQuestionAnswering and ElectraForQuestionAnswering for PyTorch or TFBertForQuestionAnswering and TFElectraForQuestionAnswering for TensorFlow models in HuggingFace 🤗
    • NEW: Introducing DeBertaForQuestionAnswering annotator in Spark NLP 🚀. DeBertaForQuestionAnswering can load DeBERTa v2&v3 Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using DebertaV2ForQuestionAnswering for PyTorch or TFDebertaV2ForQuestionAnswering for TensorFlow models in HuggingFace 🤗
    • NEW: Introducing DistilBertForQuestionAnswering annotator in Spark NLP 🚀. DistilBertForQuestionAnswering can load DistilBERT Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using DistilBertForQuestionAnswering for PyTorch or TFDistilBertForQuestionAnswering for TensorFlow models in HuggingFace 🤗
    • NEW: Introducing LongformerForQuestionAnswering annotator in Spark NLP 🚀. LongformerForQuestionAnswering can load Longformer Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using LongformerForQuestionAnswering for PyTorch or TFLongformerForQuestionAnswering for TensorFlow models in HuggingFace 🤗
    • NEW: Introducing RoBertaForQuestionAnswering annotator in Spark NLP 🚀. RoBertaForQuestionAnswering can load RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using RobertaForQuestionAnswering for PyTorch or TFRobertaForQuestionAnswering for TensorFlow models in HuggingFace 🤗
    • NEW: Introducing XlmRoBertaForQuestionAnswering annotator in Spark NLP 🚀. XlmRoBertaForQuestionAnswering can load XLM-RoBERTa Models with a span classification head on top for extractive question-answering tasks like SQuAD (a linear layer on top of the hidden-states output to compute span start logits and span end logits). This annotator is compatible with all the models trained/fine-tuned by using XLMRobertaForQuestionAnswering for PyTorch or TFXLMRobertaForQuestionAnswering for TensorFlow models in HuggingFace 🤗
    • NEW: Introducing MultiDocumentAssembler annotator for cases where multiple inputs need to be converted to DOCUMENT, such as in the XXXForQuestionAnswering annotators (see the sketch after this list)
    • NEW: Introducing SpanBertCorefModel annotator for Coreference Resolution on BERT and SpanBERT models, an implementation of a SpanBERT-based coreference resolution model based on the paper BERT for Coreference Resolution: Baselines and Analysis
    • NEW: Introducing the enableInMemoryStorage parameter in the WordEmbeddingsModel annotator. By enabling this parameter, the annotator no longer requires distributed storage to unpack indices and performs everything in memory
    • Official support for Apache Spark and PySpark 3.2.x on Scala 2.12. Spark NLP by default is shipped for Spark 3.2.x and supports Spark/PySpark 3.0.x and 3.1.x in addition
    • Unifying all supported Apache Spark packages on Maven into spark-nlp for CPU, spark-nlp-gpu for GPU, and spark-nlp-m1 for new Apple silicon M1 on macOS. The need for Apache Spark specific packages like spark-nlp-spark32 has been removed.
    • Adding a new param to sparknlp.start() function in Python and Scala for Apple silicon M1 on macOS (m1=True)
    • Upgrade TensorFlow to 2.7.1 and start supporting Apple silicon M1
    • Upgrade RocksDB with new enhancements and support for Apple silicon M1
    • Upgrade SentencePiece tokenizer TF ops to 2.7.1
    • Upgrade SentencePiece JNI to v0.1.96 and provide support for Apple silicon M1 on macOS
    • Upgrade to Scala 2.12.15
    • Update Colab, Kaggle, and SageMaker scripts
    • Refactor the entire Python module in Spark NLP to make the development and maintenance easier
    • Refactor unit tests in Python and migrate to pytest
    • Welcoming 6x new Databricks runtimes to our Spark NLP family:
      • Databricks 10.4 LTS
      • Databricks 10.4 LTS ML
      • Databricks 10.4 LTS ML GPU
      • Databricks 10.5
      • Databricks 10.5 ML
      • Databricks 10.5 ML GPU
    • Welcoming a new EMR 6.x series to our Spark NLP family:
      • EMR 6.6.0 (Apache Spark 3.2.0 / Hadoop 3.2.1)
    • Migrate T5Transformer to TensorFlow v2 architecture by re-uploading all the existing models
    • Support for 2 inputs in LightPipeline with MultiDocumentAssembler
    • Add new default NerDL graph for xsmall DeBERTa embeddings model (384 dimensions)
    • Adding annotateJava method to PretrainedPipeline class in Java to facilitate the use of LightPipelines
    • Allow changing case sensitivity. Previously, the user could not set the setCaseSensitive param. This allows users to change this value if the model was saved/uploaded with the wrong case-sensitivity parameter (BERT, ALBERT, DistilBERT, RoBERTa, DeBERTa, XLM-RoBERTa, and Longformer for XXXForSequenceClassification and XXXForTokenClassification)
    • Keep the accuracy reported by ClassifierDL and SentimentDL during training between 0.0 and 1.0
    • Preserve the original form of the token in BPE Tokenizer used in RoBERTa annotators (used in embeddings, sequence and token classification)
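
    To illustrate the new extractive QA annotators together with MultiDocumentAssembler, here is a minimal sketch; bert_base_cased_qa_squad2 is used for illustration, and any XXXForQuestionAnswering model from Models Hub should work the same way:

    import sparknlp
    from sparknlp.base import *
    from sparknlp.annotator import *
    from pyspark.ml import Pipeline

    spark = sparknlp.start()

    # Convert the question and its context into two DOCUMENT outputs
    document_assembler = MultiDocumentAssembler() \
        .setInputCols(["question", "context"]) \
        .setOutputCols(["document_question", "document_context"])

    # The span classification head predicts the start/end of the answer span
    span_classifier = BertForQuestionAnswering.pretrained("bert_base_cased_qa_squad2") \
        .setInputCols(["document_question", "document_context"]) \
        .setOutputCol("answer")

    pipeline = Pipeline(stages=[document_assembler, span_classifier])

    data = spark.createDataFrame(
        [["What is my name?", "My name is Clara and I live in Berkeley."]]
    ).toDF("question", "context")

    pipeline.fit(data).transform(data).select("answer.result").show(truncate=False)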

    Performance Improvements (Benchmarks)

    We have introduced two major performance improvements for GPU and CPU devices in the Spark NLP 4.0.0 release.

    The following benchmarks have been done by using a single Dell Server with the following specs:

    • GPU: Tesla P100 PCIe 12GB - CUDA Version: 11.3 - Driver Version: 465.19.01
    • CPU: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz - 40 Cores
    • Memory: 80G

    GPU

    We have improved our batch processing approach for transformer-based Word Embeddings to improve their performance on a GPU device. These optimizations result in performance improvements of up to +700%. The following table compares the improved transformer models on GPU against Spark NLP 3.4.x:

    | Model on GPU | Spark NLP 3.4.3 vs. 4.0.0 |
    | ----------------- |:-------------------------:|
    | RoBERTa base | +560% (6.6x) |
    | RoBERTa Large | +332% (4.3x) |
    | Albert Base | +587% (6.9x) |
    | Albert Large | +332% (4.3x) |
    | DistilBERT | +659% (7.6x) |
    | XLM-RoBERTa Base | +638% (7.4x) |
    | XLM-RoBERTa Large | +365% (4.7x) |
    | XLNet Base | +449% (5.5x) |
    | XLNet Large | +267% (3.7x) |
    | DeBERTa Base | +713% (8.1x) |
    | DeBERTa Large | +477% (5.8x) |
    | Longformer Base | +52% (1.5x) |

    Chart: Spark NLP 3.4 vs. Spark NLP 4.0 on GPU

    CPU

    The oneAPI Deep Neural Network Library (oneDNN) optimizations are now available in Spark NLP 4.0.0, which uses TensorFlow 2.7.1. You can enable those CPU optimizations by setting the environment variable TF_ENABLE_ONEDNN_OPTS=1.

    Intel has been collaborating with Google to optimize its performance on Intel Xeon processor-based platforms using Intel oneAPI Deep Neural Network (oneDNN), an open-source, cross-platform performance library for DL applications. TensorFlow optimizations are enabled via oneDNN to accelerate key performance-intensive operations such as convolution, matrix multiplication, and batch normalization.

    The following table compares the last release, Spark NLP 3.4.3, on CPU against Spark NLP 4.0.0 on CPU with oneDNN enabled.

    | Model on CPU | 3.4.x vs. 4.0.0 with oneDNN |
    | ----------------- |:------------------------:|
    | BERT Base | +47% |
    | BERT Large | +42% |
    | RoBERTa Base | +51% |
    | RoBERTa Large | +61% |
    | Albert Base | +83% |
    | Albert Large | +58% |
    | DistilBERT | +80% |
    | XLM-RoBERTa Base | +82% |
    | XLM-RoBERTa Large | +72% |
    | XLNet Base | +50% |
    | XLNet Large | +27% |
    | DeBERTa Base | +59% |
    | DeBERTa Large | +56% |
    | CamemBERT Base | +97% |
    | CamemBERT Large | +65% |
    | Longformer Base | +63% |

    Chart: Spark NLP 3.4 on CPU vs. Spark NLP 4.0 on CPU with oneDNN


    Bug Fixes

    • Fix the default pre-trained model for DeBertaForTokenClassification in Scala and Python
    • Remove a requirement in DocumentNormalizer so that consecutive stage processing can produce empty text annotations without breaking the pipeline
    • Fix WordSegmenterModel outputting the wrong order of tokens. The regex that groups the tagging format was refactored to preserve the order of segmented outputs (tokens)
    • Fix encoding sentences not respecting the max sequence length given by a user in XlmRobertaSentenceEmbeddings
    • Fix encoding sentences by using SentencePiece to calculate the correct tokens indexing
    • Fix SentencePiece serialization issue when XlmRoBertaEmbeddings and XlmRoBertaSentenceEmbeddings annotators are used from a Fat JAR on GPU
    • Remove non-existing parameters from DocumentAssembler in Python

    Updated Requirements

    • Java 8 (still supported) or 11
    • Apache Spark 3.x (3.0, 3.1, and 3.2)
    • NVIDIA® GPU drivers version 450.80.02 or higher
    • CUDA® Toolkit 11.2
    • cuDNN SDK 8.1.0
    • Scala 2.12.15

    Backward Compatibility

    • Deprecate support for Spark/PySpark 2.3, Spark/PySpark 2.4, and Scala 2.11 https://github.com/JohnSnowLabs/spark-nlp/pull/8319
    • The start() functions in Python and Scala will no longer have spark23, spark24, and spark32 parameters. The default sparknlp.start() works on PySpark 3.0.x, 3.1.x, and 3.2.x without the need for any Spark-related flags
    • Some models/pipelines which were trained or saved by using Spark and PySpark 2.3/2.4 will no longer work on Spark NLP 4.0.0
    • Remove json4s-ext dependency to allow the support for all Apache Spark major releases in one build

    Models and Pipelines

    Spark NLP 4.0.0 comes with 1000+ state-of-the-art pre-trained transformer models in many languages.

    New NER Models

    The nerdl_conll_deberta_large NER model beats the previous highest F1 score on the CoNLL03 dev set by 1%

    | Model | Name | Lang | Dev F1 |
    |:---------------------|:-------------------|:---|:----|
    | NerDLModel | nerdl_conll_deberta_large | en | 96% |
    | NerDLModel | nerdl_conll_elmo | en | 95.6% |
    | NerDLModel | nerdl_conll_deberta_base | en | 94% |

    Featured Models

    | Model | Name | Lang |
    |:---------------------|:-------------------|:---|
    | AlbertForQuestionAnswering | albert_base_qa_squad2 | en |
    | DebertaForQuestionAnswering | deberta_v3_xsmall_qa_squad2 | en |
    | DistilBertForQuestionAnswering | distilbert_base_cased_qa_squad2 | en |
    | LongformerForQuestionAnswering | longformer_base_base_qa_squad2 | en |
    | RoBertaForQuestionAnswering | roberta_base_qa_squad2 | en |
    | XlmRoBertaForQuestionAnswering | xlm_roberta_base_qa_squad2 | en |
    | DistilBertForQuestionAnswering | distilbert_qa_multi_finedtuned_squad | pt |
    | BertForQuestionAnswering | bert_qa_bert_large_cased_squad_v1.1_portuguese | pt |
    | BertForQuestionAnswering | bert_qa_chinese_pert_base_mrc | zh |
    | BertForQuestionAnswering | bert_qa_arap_qa_bert | ar |
    | BertForQuestionAnswering | bert_qa_ainize_klue_bert_base_mrc | ko |
    | BertForQuestionAnswering | bert_qa_Part_1_mBERT_Model_E1 | xx |
    | BertForQuestionAnswering | bert_qa_qacombination_bert_el_Danastos | el |

    Spark NLP covers the following languages:

    English ,Multilingual ,Afrikaans ,Afro-Asiatic languages ,Albanian ,Altaic languages ,American Sign Language ,Amharic ,Arabic ,Argentine Sign Language ,Armenian ,Artificial languages ,Atlantic-Congo languages ,Austro-Asiatic languages ,Austronesian languages ,Azerbaijani ,Baltic languages ,Bantu languages ,Basque ,Basque (family) ,Belarusian ,Bemba (Zambia) ,Bengali, Bangla ,Berber languages ,Bihari ,Bislama ,Bosnian ,Brazilian Sign Language ,Breton ,Bulgarian ,Catalan ,Caucasian languages ,Cebuano ,Celtic languages ,Central Bikol ,Chichewa, Chewa, Nyanja ,Chilean Sign Language ,Chinese ,Chuukese ,Colombian Sign Language ,Congo Swahili ,Croatian ,Cushitic languages ,Czech ,Danish ,Dholuo, Luo (Kenya and Tanzania) ,Dravidian languages ,Dutch ,East Slavic languages ,Eastern Malayo-Polynesian languages ,Efik ,Esperanto ,Estonian ,Ewe ,Fijian ,Finnish ,Finnish Sign Language ,Finno-Ugrian languages ,French ,French-based creoles and pidgins ,Ga ,Galician ,Ganda ,Georgian ,German ,Germanic languages ,Gilbertese ,Greek (modern) ,Greek languages ,Gujarati ,Gun ,Haitian, Haitian Creole ,Hausa ,Hebrew (modern) ,Hiligaynon ,Hindi ,Hiri Motu ,Hungarian ,Icelandic ,Igbo ,Iloko ,Indic languages ,Indo-European languages ,Indo-Iranian languages ,Indonesian ,Irish ,Isoko ,Isthmus Zapotec ,Italian ,Italic languages ,Japanese ,Japanese ,Kabyle ,Kalaallisut, Greenlandic ,Kannada ,Kaonde ,Kinyarwanda ,Kirundi ,Kongo ,Korean ,Kwangali ,Kwanyama, Kuanyama ,Latin ,Latvian ,Lingala ,Lithuanian ,Louisiana Creole ,Lozi ,Luba-Katanga ,Luba-Lulua ,Lunda ,Lushai ,Luvale ,Macedonian ,Malagasy ,Malay ,Malayalam ,Malayo-Polynesian languages ,Maltese ,Manx ,Marathi (Marāṭhī) ,Marshallese ,Mexican Sign Language ,Mon-Khmer languages ,Morisyen ,Mossi ,Multiple languages ,Ndonga ,Nepali ,Niger-Kordofanian languages ,Nigerian Pidgin ,Niuean ,North Germanic languages ,Northern Sotho, Pedi, Sepedi ,Norwegian ,Norwegian Bokmål ,Norwegian Nynorsk ,Nyaneka ,Oromo ,Pangasinan ,Papiamento ,Persian (Farsi) ,Peruvian Sign Language ,Philippine languages ,Pijin ,Pohnpeian ,Polish ,Portuguese ,Portuguese-based creoles and pidgins ,Punjabi (Eastern) ,Romance languages ,Romanian ,Rundi ,Russian ,Ruund ,Salishan languages ,Samoan ,San Salvador Kongo ,Sango ,Semitic languages ,Serbo-Croatian ,Seselwa Creole French ,Shona ,Sindhi ,Sino-Tibetan languages ,Slavic languages ,Slovak ,Slovene ,Somali ,South Caucasian languages ,South Slavic languages ,Southern Sotho ,Spanish ,Spanish Sign Language ,Sranan Tongo ,Swahili ,Swati ,Swedish ,Tagalog ,Tahitian ,Tai ,Tamil ,Telugu ,Tetela ,Tetun Dili ,Thai ,Tigrinya ,Tiv ,Tok Pisin ,Tonga (Tonga Islands) ,Tonga (Zambia) ,Tsonga ,Tswana ,Tumbuka ,Turkic languages ,Turkish ,Tuvalu ,Tzotzil ,Ukrainian ,Umbundu ,Uralic languages ,Urdu ,Venda ,Venezuelan Sign Language ,Vietnamese ,Wallisian ,Walloon ,Waray (Philippines) ,Welsh ,West Germanic languages ,West Slavic languages ,Western Malayo-Polynesian languages ,Wolaitta, Wolaytta ,Wolof ,Xhosa ,Yapese ,Yiddish ,Yoruba ,Yucatec Maya, Yucateco ,Zande (individual language) ,Zulu

    The complete list of all 6000+ models & pipelines in 230+ languages is available on Models Hub

    New Notebooks

    Import hundreds of models in different languages to Spark NLP

    | Spark NLP | HuggingFace Notebooks | Colab |
    | :------------ | :------------- | :---------- |
    | AlbertForQuestionAnswering | HuggingFace in Spark NLP - AlbertForQuestionAnswering | Open In Colab |
    | BertForQuestionAnswering | HuggingFace in Spark NLP - BertForQuestionAnswering | Open In Colab |
    | DeBertaForQuestionAnswering | HuggingFace in Spark NLP - DeBertaForQuestionAnswering | Open In Colab |
    | DistilBertForQuestionAnswering | HuggingFace in Spark NLP - DistilBertForQuestionAnswering | Open In Colab |
    | LongformerForQuestionAnswering | HuggingFace in Spark NLP - LongformerForQuestionAnswering | Open In Colab |
    | RoBertaForQuestionAnswering | HuggingFace in Spark NLP - RoBertaForQuestionAnswering | Open In Colab |
    | XlmRobertaForQuestionAnswering | HuggingFace in Spark NLP - XlmRobertaForQuestionAnswering | Open In Colab |

    You can visit Import Transformers in Spark NLP for more info


    Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==4.0.0
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x, 3.1.x, and 3.2.x (Scala 2.12):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:4.0.0
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:4.0.0
    

    M1

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-m1_2.12:4.0.0
    

    Maven

    spark-nlp on Apache Spark 3.0.x, 3.1.x, and 3.2.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>4.0.0</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>4.0.0</version>
    </dependency>
    

    spark-nlp-m1:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-m1_2.12</artifactId>
        <version>4.0.0</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.0.x/3.1.x/3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-4.0.0.jar

    • GPU on Apache Spark 3.0.x/3.1.x/3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-4.0.0.jar

    • M1 on Apache Spark 3.0.x/3.1.x/3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-m1-assembly-4.0.0.jar

    What's Changed

    Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/3.4.4...4.0.0

    Contributors

    @vankov @mahmoodbayeshi @Ahmetemintek @DevinTDHa @albertoandreottiATgmail @KshitizGIT @jsl-models @gokhanturer @josejuanmartinez @murat-gunay @rpranab @wolliq @bunyamin-polat @pabla @danilojsl @agsfer @Meryem1425 @gadde5300 @muhammetsnts @Damla-Gurbaz @maziyarpanahi @jsl-builder @Cabir40 @suvrat-joshi

  • 3.4.4 (May 6, 2022)


    Overview

    We are very excited to release Spark NLP 🚀 3.4.4! This release comes with a new DeBERTa for Token Classification annotator compatible with existing or fine-tuned models on HuggingFace 🤗, a new annotator for CamemBERT embeddings models, up to 18x faster UniversalSentenceEncoder on GPU devices, up to 400% faster Tokenizer when using a list of exceptions, and new state-of-the-art NER, French embeddings, DistilBERT embeddings, and ALBERT embeddings models!

    As always, we would like to thank our community for their feedback, questions, and feature requests.


    New Features

    • NEW: Introducing DeBertaForTokenClassification annotator in Spark NLP 🚀. DeBertaForTokenClassification can load DeBERTa v2&v3 models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using DebertaV2ForTokenClassification for PyTorch or TFDebertaV2ForTokenClassification for TensorFlow models in HuggingFace https://github.com/JohnSnowLabs/spark-nlp/pull/8082
    • NEW: Introducing CamemBertEmbeddings annotator in Spark NLP 🚀 (https://github.com/JohnSnowLabs/spark-nlp/pull/8237). CamemBERT is a state-of-the-art language model for French based on the RoBERTa architecture, pretrained on the French subcorpus of the newly available multilingual corpus OSCAR. For further information or requests, please visit the CamemBERT website. See the sketch after this list for a quick start.
    • Add support for batching rows to improve UniversalSentenceEncoder on GPU devices. This new feature will increase GPU speed by 2x to 18x depending on the distribution of sentences https://github.com/JohnSnowLabs/spark-nlp/pull/8234
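
    As a quick start, here is a minimal sketch of the new embeddings annotator; camembert_base is one of the French models listed below, and the surrounding stages follow the usual Spark NLP pattern:

    import sparknlp
    from sparknlp.base import *
    from sparknlp.annotator import *
    from pyspark.ml import Pipeline

    spark = sparknlp.start()

    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")

    tokenizer = Tokenizer() \
        .setInputCols(["document"]) \
        .setOutputCol("token")

    # French CamemBERT embeddings for each token
    embeddings = CamemBertEmbeddings.pretrained("camembert_base", "fr") \
        .setInputCols(["document", "token"]) \
        .setOutputCol("embeddings")

    pipeline = Pipeline(stages=[document_assembler, tokenizer, embeddings])

    data = spark.createDataFrame([["J'aime Spark NLP !"]]).toDF("text")
    pipeline.fit(data).transform(data).select("embeddings.embeddings").show()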

    Bug Fixes & Enhancements

    • Optimizing Tokenizer performance by up to 400% when there is an exceptions list. We have improved the exceptions list to scale to a large number of exceptions without impacting the overall performance https://github.com/JohnSnowLabs/spark-nlp/pull/7881
    • Support latest PySpark releases in Colab, Kaggle, and SageMaker scripts https://github.com/JohnSnowLabs/spark-nlp/pull/8028
    • Fix a bug that caused get input/output/LazyAnnotator to return None https://github.com/JohnSnowLabs/spark-nlp/pull/8043
    • Fix DeBertaForSequenceClassification in Python failing to load pretrained models https://github.com/JohnSnowLabs/spark-nlp/pull/8060
    • Fix missing Lemma and POS models from the 3.4.3 release

    Dependencies

    • Removing outdated trove4j dependency in favour of native Java modules https://github.com/JohnSnowLabs/spark-nlp/pull/8236
    • Upgrade the base Apache Spark to 2.4.8, 3.0.3, and 3.2.1
    • Upgrade Typesafe Config to 1.4.2
    • Upgrade sbt to 1.6.2

    Models

    Spark NLP 3.4.4 comes with 160+ state-of-the-art multi-lingual pretrained models. Some of the featured models:

    New DeBERTa Token Classification Models

    New fine-tuned DeBERTa v3 models for token classification over the CoNLL03 and OntoNotes datasets, reaching state-of-the-art metrics.

    | Model | Name | Lang | F1 Dev |
    |:----------------|:-----------|:-----|:-----|
    | DeBertaForTokenClassification | deberta_v3_large_token_classifier_conll03 | en | 0.97 |
    | DeBertaForTokenClassification | deberta_v3_base_token_classifier_conll03 | en | 0.96 |
    | DeBertaForTokenClassification | deberta_v3_small_token_classifier_conll03 | en | 0.95 |
    | DeBertaForTokenClassification | deberta_v3_xsmall_token_classifier_conll03 | en | 0.93 |
    | DeBertaForTokenClassification | deberta_v3_large_token_classifier_ontonotes | en | 0.89 |
    | DeBertaForTokenClassification | deberta_v3_base_token_classifier_ontonotes | en | 0.88 |
    | DeBertaForTokenClassification | deberta_v3_small_token_classifier_ontonotes | en | 0.87 |
    | DeBertaForTokenClassification | deberta_v3_xsmall_token_classifier_ontonotes | en | 0.86 |

    New CamemBERT Models

    | Model | Name | Lang |
    |:----------------|:-----------|:-----|
    | CamemBertEmbeddings | camembert_large | fr |
    | CamemBertEmbeddings | camembert_base | fr |
    | CamemBertEmbeddings | camembert_base_ccnet_4gb | fr |
    | CamemBertEmbeddings | camembert_base_ccnet | fr |
    | CamemBertEmbeddings | camembert_base_oscar_4gb | fr |
    | CamemBertEmbeddings | camembert_base_wikipedia_4gb | fr |

    New DistilBERT Embeddings Models

    | Model | Name | Lang |
    |:----------------|:-----------|:-----|
    | DistilBertEmbeddings | distilbert_embeddings_distilbert_base_fr_cased | fr |
    | DistilBertEmbeddings | distilbert_embeddings_marathi_distilbert | mr |
    | DistilBertEmbeddings | distilbert_embeddings_distilbert_base_indonesian | id |
    | DistilBertEmbeddings | distilbert_embeddings_javanese_distilbert_small | jv |
    | DistilBertEmbeddings | distilbert_embeddings_malaysian_distilbert_small | ms |
    | DistilBertEmbeddings | distilbert_embeddings_distilbert_base_ar_cased | ar |

    New ALBERT Embeddings Models

    | Model | Name | Lang |
    |:----------------|:-----------|:-----|
    | AlbertEmbeddings | albert_embeddings_fralbert_base | fr |
    | AlbertEmbeddings | albert_embeddings_albert_base_arabic | ar |
    | AlbertEmbeddings | albert_embeddings_marathi_albert_v2 | mr |
    | AlbertEmbeddings | albert_embeddings_albert_fa_base_v2 | fa |
    | AlbertEmbeddings | albert_embeddings_albert_large_bahasa_cased | ms |
    | AlbertEmbeddings | albert_embeddings_marathi_albert | mr |

    The complete list of all 5000+ models & pipelines in 200+ languages is available on Models Hub.

    New Notebooks

    Import CamemBERT models to Spark NLP 🚀

    | Spark NLP | HuggingFace Notebooks | Colab |
    | :------------ | :------------- | :---------- |
    | CamemBertEmbeddings | HuggingFace in Spark NLP - CamemBERT | Open In Colab |

    You can visit Import Transformers in Spark NLP for more info


    Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==3.4.4
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.4
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.4
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.4
    

    spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.4
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.4
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.4
    

    spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.4
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.4
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.4
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.4
    

    spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.4
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.4
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.4
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.4
    

    Maven

    spark-nlp on Apache Spark 3.0.x and 3.1.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>3.4.4</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>3.4.4</version>
    </dependency>
    

    spark-nlp on Apache Spark 3.2.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark32_2.12</artifactId>
        <version>3.4.4</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark32_2.12</artifactId>
        <version>3.4.4</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.4.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark24_2.11</artifactId>
        <version>3.4.4</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
        <version>3.4.4</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark23_2.11</artifactId>
        <version>3.4.4</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
        <version>3.4.4</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.4.4.jar

    • GPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.4.4.jar

    • CPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark32-assembly-3.4.4.jar

    • GPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark32-assembly-3.4.4.jar

    • CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.4.4.jar

    • GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.4.4.jar

    • CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.4.4.jar

    • GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.4.4.jar

    What's Changed

    Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/3.4.3...3.4.4

    New Contributors

    • @aymanechilah made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6956

    Contributors

    @xusliebana @Ahmetemintek @jsl-models @Meryem1425 @mahmoodbayeshi @aymanechilah @DevinTDHa @agsfer @rpranab @C-K-Loan @maziyarpanahi @Damla-Gurbaz @danilojsl @luca-martial @muhammetsnts @josejuanmartinez @bunyamin-polat @galiph @jsl-builder @albertoandreottiATgmail

  • 3.4.3 (Apr 12, 2022)


    Overview

    We are very excited to release Spark NLP 🚀 3.4.3! This release comes with a new DeBERTa for Sequence Classification annotator compatible with existing or fine-tuned models on HuggingFace 🤗, a new sigmoid activation function in addition to softmax to support multi-label models in all ForSequenceClassification annotators, new features in SentenceDetectorDL, CoNLLU, and Lemmatizer, more than 600 new multi-lingual models for DeBERTa, BERT, DistilBERT, fastText, Lemmatizer, and Part of Speech, and other improvements!

    As always, we would like to thank our community for their feedback, questions, and feature requests.


    New Features

    • NEW: Introducing DeBertaForSequenceClassification annotator in Spark NLP 🚀. DeBertaForSequenceClassification can load DeBERTa v2&v3 models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using DebertaForSequenceClassification for PyTorch or TFDebertaForSequenceClassification for TensorFlow models in HuggingFace https://github.com/JohnSnowLabs/spark-nlp/pull/7713
    • New multi-label feature in all ForSequenceClassification annotators. The following annotators now have the option to switch to a sigmoid activation function instead of softmax for the output layer (see the sketch after this list): AlbertForSequenceClassification, BertForSequenceClassification, DeBertaForSequenceClassification, DistilBertForSequenceClassification, LongformerForSequenceClassification, RoBertaForSequenceClassification, XlmRoBertaForSequenceClassification, and XlnetForSequenceClassification https://github.com/JohnSnowLabs/spark-nlp/pull/7479
    • New minLength, maxLength, splitLength, customBounds, and useCustomBoundsOnly parameters in SentenceDetectorDL https://github.com/JohnSnowLabs/spark-nlp/pull/7214
    • New impossiblePenultimates in SentenceDetectorDLModel https://github.com/JohnSnowLabs/spark-nlp/pull/7685
    • New feature to set names for columns in CoNLLU class: textCol, documentCol, sentenceCol, formCol, uposCol, xposCol, and lemmaCol https://github.com/JohnSnowLabs/spark-nlp/pull/7344
    • New formCol and lemmaCol parameters in Lemmatizer annotator https://github.com/JohnSnowLabs/spark-nlp/pull/7344
    • Add new functionality to download and extract models from S3 via direct link https://github.com/JohnSnowLabs/spark-nlp/pull/7682
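
    For example, here is a minimal sketch of the new multi-label option, assuming the activation parameter is exposed as setActivation and using an illustrative pretrained model name:

    import sparknlp
    from sparknlp.base import *
    from sparknlp.annotator import *
    from pyspark.ml import Pipeline

    spark = sparknlp.start()

    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")

    tokenizer = Tokenizer() \
        .setInputCols(["document"]) \
        .setOutputCol("token")

    # "sigmoid" makes the output layer emit independent per-class scores
    # (multi-label); the default "softmax" keeps single-label behavior
    classifier = DistilBertForSequenceClassification \
        .pretrained("distilbert_base_sequence_classifier_imdb") \
        .setInputCols(["document", "token"]) \
        .setOutputCol("class") \
        .setActivation("sigmoid")

    pipeline = Pipeline(stages=[document_assembler, tokenizer, classifier])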

    Enhancements

    • Fix and train new English spell checker models for Spark NLP 3.4.1 on Spark 3.x and 2.x
    • Update SentenceDetector Python and Scala documentation
    • Add a missing notebook to demonstrate training a WordSegmenterApproach annotator for word segmentation https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/jupyter/training/chinese/word-segmentation/WordSegmenter_train_chinese_segmentation.ipynb

    Models

    New DeBERTa Classification Models

    New fine-tuned DeBERTa v3 models for text classification over IMDB reviews in English and Urdu, AG News categories in English, and Allocine French reviews.

    | Model | Name | Lang |
    |:----------------|:-----------|:-----|
    | DeBertaForSequenceClassification | mdeberta_v3_base_sequence_classifier_imdb | ur |
    | DeBertaForSequenceClassification | mdeberta_v3_base_sequence_classifier_allocine | fr |
    | DeBertaForSequenceClassification | deberta_v3_xsmall_sequence_classifier_imdb | en |
    | DeBertaForSequenceClassification | deberta_v3_small_sequence_classifier_imdb | en |
    | DeBertaForSequenceClassification | deberta_v3_base_sequence_classifier_imdb | en |
    | DeBertaForSequenceClassification | deberta_v3_large_sequence_classifier_imdb | en |
    | DeBertaForSequenceClassification | deberta_v3_xsmall_sequence_classifier_ag_news | en |
    | DeBertaForSequenceClassification | deberta_v3_small_sequence_classifier_ag_news | en |

    New BERT Models

    Spark NLP now has up to 250 state-of-the-art BERT models in 27 languages including Arabic, Bengali, Chinese, Dutch, English, Finnish, French, German, Greek, Hindi, Italian, Japanese, Javanese, Korean, Marathi, Panjabi, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Telugu, Turkish, Urdu, Vietnamese, and Multi-lingual.

    | Model | Name | Lang |
    |:----------------|:-----------|:-----|
    | BertEmbeddings | bert_embeddings_ARBERT | ar |
    | BertEmbeddings | bert_embeddings_German_MedBERT | de |
    | BertEmbeddings | bert_embeddings_bangla_bert_base | bn |
    | BertEmbeddings | bert_embeddings_bert_base_5lang_cased | zh |
    | BertEmbeddings | bert_embeddings_bert_base_5lang_cased | fr |
    | BertEmbeddings | bert_embeddings_bert_base_hi_cased | hi |
    | BertEmbeddings | bert_embeddings_bert_base_it_cased | it |
    | BertEmbeddings | bert_embeddings_bert_base | ko |
    | BertEmbeddings | bert_embeddings_bert_base_tr_cased | tr |
    | BertEmbeddings | bert_embeddings_bert_base_ur_cased | ur |
    | BertEmbeddings | bert_embeddings_bert_base_vi_cased | vi |

    New fastText Models

    Over 128 new Word2Vec models in 128 languages built with fastText word embeddings.

    | Model | Name | Lang |
    |:----------------|:-----------|:-----|
    | WordEmbeddingsModel | w2v_cc_300d | hi |
    | WordEmbeddingsModel | w2v_cc_300d | azb |
    | WordEmbeddingsModel | w2v_cc_300d | bo |
    | WordEmbeddingsModel | w2v_cc_300d | diq |
    | WordEmbeddingsModel | w2v_cc_300d | cy |
    | WordEmbeddingsModel | w2v_cc_300d | ckb |
    | WordEmbeddingsModel | w2v_cc_300d | el |
    | WordEmbeddingsModel | w2v_cc_300d | es |

    New Lemmatizer and Part of Speech Models

    234 new Lemmatizer and Part of Speech models in 62 languages based on the new Universal Dependencies treebank 2.9 release.

    | Model | Name | Lang |
    |:----------------|:-----------|:-----|
    | LemmatizerModel | lemma_afribooms | af |
    | LemmatizerModel | lemma_alksnis | lt |
    | LemmatizerModel | lemma_alpino | nl |
    | LemmatizerModel | lemma_arcosg | gd |
    | LemmatizerModel | lemma_ancora | es |
    | LemmatizerModel | lemma_ancora | ca |
    | PerceptronModel | pos_mtg | te |
    | PerceptronModel | pos_ttb | ta |
    | PerceptronModel | pos_vtb | vi |
    | PerceptronModel | pos_cac | cs |
    | PerceptronModel | pos_btb | bg |
    | PerceptronModel | pos_afribooms | af |

    The complete list of all 4800+ models & pipelines in 200+ languages is available on Models Hub.


    Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==3.4.3
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.3
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.3
    

    spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.3
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.3
    

    spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.3
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.3
    

    spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.3
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.3
    

    Maven

    spark-nlp on Apache Spark 3.0.x and 3.1.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>3.4.3</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>3.4.3</version>
    </dependency>
    

    spark-nlp on Apache Spark 3.2.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark32_2.12</artifactId>
        <version>3.4.3</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark32_2.12</artifactId>
        <version>3.4.3</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.4.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark24_2.11</artifactId>
        <version>3.4.3</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
        <version>3.4.3</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark23_2.11</artifactId>
        <version>3.4.3</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
        <version>3.4.3</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.4.3.jar

    • GPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.4.3.jar

    • CPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark32-assembly-3.4.3.jar

    • GPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark32-assembly-3.4.3.jar

    • CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.4.3.jar

    • GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.4.3.jar

    • CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.4.3.jar

    • GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.4.3.jar

    What's Changed

    Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/3.4.2...3.4.3

    New Contributors

    • @snosrap made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/7484
    • @gokhanturer made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/7654
    • @suvrat-joshi made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/7671

    @vankov @gokhanturer @egenc @Cabir40 @xusliebana @suvrat-joshi @murat-gunay @snosrap @gadde5300 @jsl-models @Meryem1425 @DevinTDHa @agsfer @rpranab @diatrambitas @maziyarpanahi @Damla-Gurbaz @luca-martial @muhammetsnts @josejuanmartinez @bunyamin-polat @jsl-builder @albertoandreottiATgmail

    Source code(tar.gz)
    Source code(zip)
  • 3.4.2(Mar 10, 2022)


    Overview

    We are pleased to release Spark NLP 🚀 3.4.2! This release comes with a new DeBERTa transformer for word embeddings, new caching to speed up training Word2Vec and Doc2Vec, new English and multi-lingual state-of-the-art models, and bug fixes!

    As always, we would like to thank our community for their feedback, questions, and feature requests.


    New Features

    • Introducing DeBertaEmbeddings annotator. DeBERTa (Decoding-enhanced BERT with disentangled attention) improves the BERT and RoBERTa models using two novel techniques. Compared to RoBERTa-Large, a DeBERTa model trained on half of the training data performs consistently better on a wide range of NLP tasks, achieving improvements on MNLI by +0.9% (90.2% vs. 91.1%), on SQuAD v2.0 by +2.3% (88.4% vs. 90.7%) and RACE by +3.6% (83.2% vs. 86.8%). This annotator is compatible with all the models trained/fine-tuned by using DebertaV2Model for PyTorch or TFDebertaV2Model for TensorFlow models (DeBERTa-v2 & DeBERTa-v3) in HuggingFace (see the example after this list)
    • Introducing a new param enableCaching in Doc2VecApproach to speed up the training
    • Introducing a new param enableCaching in Word2VecApproach to speed up the training
    • Support Databricks runtime 10.3, 10.3 ML, and 10.3 ML & GPU
    • Support EMR emr-5.34.0 and emr-6.5.0
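    A short sketch of the new features above. The Word2VecApproach setter name for enableCaching is assumed from the param name, and the document/token columns come from earlier pipeline stages:

    from sparknlp.annotator import DeBertaEmbeddings, Word2VecApproach

    # New DeBertaEmbeddings annotator, loading one of the models listed below
    embeddings = DeBertaEmbeddings.pretrained("deberta_v3_base", "en") \
        .setInputCols(["document", "token"]) \
        .setOutputCol("embeddings")

    # New enableCaching param to speed up Word2Vec training
    word2vec = Word2VecApproach() \
        .setInputCols(["token"]) \
        .setOutputCol("word_embeddings") \
        .setEnableCaching(True)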

    Bug Fixes

    • Fix bestModelMetric param when the set value was ignored https://github.com/JohnSnowLabs/spark-nlp/pull/6978

    New Notebooks

    Import DeBERTa models to Spark NLP 🚀

    | Spark NLP | HuggingFace Notebooks | Colab |
    |:------------|:-------------|:----------|
    | DeBertaEmbeddings | HuggingFace in Spark NLP - DeBERTa | Open In Colab |

    You can visit Import Transformers in Spark NLP for more info


    Models

    New state-of-the-art DeBERTa models:

    | Model | Name | Lang |
    |:----------------|:-----------|:-----|
    | DeBertaEmbeddings | deberta_v3_xsmall | en |
    | DeBertaEmbeddings | deberta_v3_small | en |
    | DeBertaEmbeddings | deberta_v3_base | en |
    | DeBertaEmbeddings | deberta_v3_large | en |
    | DeBertaEmbeddings | mdeberta_v3_base | xx |


    Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==3.4.2
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.2
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.2
    

    spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.2
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.2
    

    spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.2
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.2
    

    spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.2
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.2
    

    Maven

    spark-nlp on Apache Spark 3.0.x and 3.1.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>3.4.2</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>3.4.2</version>
    </dependency>
    

    spark-nlp on Apache Spark 3.2.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark32_2.12</artifactId>
        <version>3.4.2</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark32_2.12</artifactId>
        <version>3.4.2</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.4.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark24_2.11</artifactId>
        <version>3.4.2</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
        <version>3.4.2</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark23_2.11</artifactId>
        <version>3.4.2</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
        <version>3.4.2</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.4.2.jar

    • GPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.4.2.jar

    • CPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark32-assembly-3.4.2.jar

    • GPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark32-assembly-3.4.2.jar

    • CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.4.2.jar

    • GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.4.2.jar

    • CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.4.2.jar

    • GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.4.2.jar

    What's Changed

    Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/3.4.1...3.4.2

    New Contributors

    • @mahmoodbayeshi made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6835
    • @bunyamin-polat made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6969

    @agsfer @KshitizGIT @gadde5300 @kolia1985 @jsl-models @rpranab @josejuanmartinez @bunyamin-polat @maziyarpanahi @jsl-builder @Damla-Gurbaz @xusliebana @mahmoodbayeshi @luca-martial @dependabot @muhammetsnts @albertoandreottiATgmail

    Source code(tar.gz)
    Source code(zip)
  • 3.4.1(Feb 8, 2022)


    Overview

    We are pleased to release Spark NLP 🚀 3.4.1! This release comes with a TF session warmup in 3 annotators where the first inference was slower than the rest, a new param to choose which F1 to track when saving the best model during NerDL training, new T5 models for tasks such as text-to-SQL and grammar correction, new multi-lingual state-of-the-art models, and other bug fixes!

    As always, we would like to thank our community for their feedback, questions, and feature requests.


    New Features & Enhancements

    • Implement TF Session warmup for MarianTransformer, T5Transformer, and GPT2Transformer annotators. The first inference for these annotators used to take 15-20 seconds; with the session warmup, every inference, including the first, now takes the same amount of time https://github.com/JohnSnowLabs/spark-nlp/pull/6773
    • Add bestModelMetric param to choose between Micro-average or Macro-average F1 for the best model (see the example after this list) https://github.com/JohnSnowLabs/spark-nlp/pull/6749
    • Add trimWhitespace and preservePosition params to RegexTokenizer https://github.com/JohnSnowLabs/spark-nlp/pull/6806
    • Add a new setSentenceMatch param to EntityRuler to match entities across documents/sentences and not just tokens https://github.com/JohnSnowLabs/spark-nlp/pull/6841
    • Add support for using the spark32 and real_time_output flags in the sparknlp.start() function at the same time https://github.com/JohnSnowLabs/spark-nlp/pull/6822
    • Allow users to set tasks in the T5Transformer annotator
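    A sketch of the new bestModelMetric param together with useBestModel in NerDLApproach; the accepted metric values ("micro"/"macro") are an assumption here:

    from sparknlp.annotator import NerDLApproach

    ner = NerDLApproach() \
        .setInputCols(["sentence", "token", "embeddings"]) \
        .setLabelColumn("label") \
        .setOutputCol("ner") \
        .setUseBestModel(True) \
        .setBestModelMetric("macro")  # track Macro-average F1; the value format is an assumption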

    Bug Fixes

    • Fix random NullPointerException when using TensorFlow models without Kryo serialization https://github.com/JohnSnowLabs/spark-nlp/pull/6741
    • Fix RecursiveTokenizerModel not being readable in a saved Pipeline https://github.com/JohnSnowLabs/spark-nlp/pull/6748
    • Fix ContextSpellCheckerApproach not being trained on Databricks https://github.com/JohnSnowLabs/spark-nlp/pull/6750
    • Fix ContextSpellCheckerModel producing a wrong order of tokens when it's used with Sentence Detectors https://github.com/JohnSnowLabs/spark-nlp/pull/6799
    • Fix GraphExtraction when fullAnnotate and document are used at the same time https://github.com/JohnSnowLabs/spark-nlp/pull/6845
    • Fix Word2VecModel being cast to Doc2VecModel by mistake https://github.com/JohnSnowLabs/spark-nlp/pull/6849
    • Fix broken sentence indexing in BertEmbeddings that impacted SentenceEmbeddings for text classification https://github.com/JohnSnowLabs/spark-nlp/pull/6867
    • Fix missing setExceptionsPath param in Tokenizer when it's used in Python https://github.com/JohnSnowLabs/spark-nlp/pull/6868
    • Fix the wrong metric being mentioned when useBestModel was enabled. The documentation said Micro-averaged F1, but it was in fact Macro-averaged F1 (the option to choose which metric to track is now available as well)
    • Update broken slow unit tests https://github.com/JohnSnowLabs/spark-nlp/pull/6767

    Models

    New state-of-the-art models in English, French, Vietnamese, Dutch, and Indian languages (Assamese, Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu)

    Featured Pretrained Models

    | Model | Name | Lang |
    |:----------------|:-----------|:-----|
    | T5Transformer | t5_informal_to_formal_styletransfer | en |
    | T5Transformer | t5_formal_to_informal_styletransfer | en |
    | T5Transformer | t5_passive_to_active_styletransfer | en |
    | T5Transformer | t5_active_to_passive_styletransfer | en |
    | T5Transformer | t5_grammar_error_corrector | en |
    | T5Transformer | t5_small_wikiSQL | en |
    | LongformerEmbeddings | clinical_longformer | en |
    | AlbertEmbeddings | albert_indic | xx |
    | DistilBertEmbeddings | distilbert_base_cased | vi |
    | BertForSequenceClassification | bert_sequence_classifier_news_sentiment | de |
    | BertForSequenceClassification | bert_sequence_classifier_emotion | en |
    | DistilBertForTokenClassification | distilbert_token_classifier_typo_detector | en |
    | DistilBertForTokenClassification | distilbert_base_token_classifier_masakhaner | xx |
    | WordEmbeddingsModel | word2vec_wiki_1000 | fr |
    | WordEmbeddingsModel | word2vec_wac_200 | fr |
    | WordEmbeddingsModel | w2v_cc_300d | fr |


    Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==3.4.1
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.1
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.1
    

    spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.1
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.1
    

    spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.1
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.1
    

    spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.1
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.1
    

    Maven

    spark-nlp on Apache Spark 3.0.x and 3.1.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>3.4.1</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>3.4.1</version>
    </dependency>
    

    spark-nlp on Apache Spark 3.2.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark32_2.12</artifactId>
        <version>3.4.1</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark32_2.12</artifactId>
        <version>3.4.1</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.4.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark24_2.11</artifactId>
        <version>3.4.1</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
        <version>3.4.1</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark23_2.11</artifactId>
        <version>3.4.1</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
        <version>3.4.1</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.4.1.jar

    • GPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.4.1.jar

    • CPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark32-assembly-3.4.1.jar

    • GPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark32-assembly-3.4.1.jar

    • CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.4.1.jar

    • GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.4.1.jar

    • CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.4.1.jar

    • GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.4.1.jar

    What's Changed

    Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/3.4.0...3.4.1

    New Contributors

    • @Cabir40 made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6685
    • @rpranab made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6830
    • @Meryem1425 made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6828
    • @Damla-Gurbaz made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6847

    @diatrambitas @egenc @xyutech @Cabir40 @xusliebana @murat-gunay @KshitizGIT @jsl-models @Meryem1425 @HashamUlHaq @DevinTDHa @agsfer @rpranab @C-K-Loan @maziyarpanahi @Damla-Gurbaz @luca-martial @danilojsl @wolliq @muhammetsnts @pabla @josejuanmartinez @jsl-builder @albertoandreottiATgmail

    Source code(tar.gz)
    Source code(zip)
  • 3.4.0(Jan 5, 2022)


    Overview

    We are very excited to release Spark NLP 3.4.0! This has been one of the biggest releases we have ever done and we are so proud to share this with our community at the dawn of 2022! 🎉

    Spark NLP 3.4.0 extends support for Apache Spark 3.2.x releases on Scala 2.12. We now support all 5 major Apache Spark and PySpark releases of 2.3.x, 2.4.x, 3.0.x, 3.1.x, and 3.2.x at once, helping our community migrate from earlier Apache Spark versions to newer releases without worrying about Spark NLP end-of-life support. We also extend support for new Databricks and EMR instances on Spark 3.2.x clusters.

    This release also comes with a brand new GPT2Transformer using OpenAI GPT-2 models for prediction at scale, new ALBERT, XLNet, RoBERTa, XLM-RoBERTa, and Longformer annotators to use existing or fine-tuned models for Sequence Classification, new distributed and trainable Word2Vec annotators, new state-of-the-art transformer models in many languages, a new param to useBestModel in NerDL during training, bug fixes, and lots more!

    As always, we would like to thank our community for their feedback, questions, and feature requests.


    Major features and improvements

    • NEW: Introducing GPT2Transformer annotator in Spark NLP 🚀 for Text Generation purposes. GPT2Transformer uses OpenAI GPT-2 models from HuggingFace 🤗 for prediction at scale in Spark NLP 🚀 . GPT-2 is a transformer model trained on a very large corpus of English data in a self-supervised fashion. This means it was trained on the raw texts only, with no humans labelling them in any way (which is why it can use lots of publicly available data) with an automatic process to generate inputs and labels from those texts. More precisely, it was trained to guess the next word in sentences
    • NEW: Introducing RoBertaForSequenceClassification annotator in Spark NLP 🚀. RoBertaForSequenceClassification can load RoBERTa Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using RobertaForSequenceClassification for PyTorch or TFRobertaForSequenceClassification for TensorFlow models in HuggingFace 🤗
    • NEW: Introducing XlmRoBertaForSequenceClassification annotator in Spark NLP 🚀. XlmRoBertaForSequenceClassification can load XLM-RoBERTa Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using XLMRobertaForSequenceClassification for PyTorch or TFXLMRobertaForSequenceClassification for TensorFlow models in HuggingFace 🤗
    • NEW: Introducing LongformerForSequenceClassification annotator in Spark NLP 🚀. LongformerForSequenceClassification can load Longformer Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using LongformerForSequenceClassification for PyTorch or TFLongformerForSequenceClassification for TensorFlow models in HuggingFace 🤗
    • NEW: Introducing AlbertForSequenceClassification annotator in Spark NLP 🚀. AlbertForSequenceClassification can load ALBERT Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using AlbertForSequenceClassification for PyTorch or TFAlbertForSequenceClassification for TensorFlow models in HuggingFace 🤗
    • NEW: Introducing XlnetForSequenceClassification annotator in Spark NLP 🚀. XlnetForSequenceClassification can load XLNet Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using XLNetForSequenceClassification for PyTorch or TFXLNetForSequenceClassification for TensorFlow models in HuggingFace 🤗
    • NEW: Introducing trainable and distributed Word2Vec annotators based on Word2Vec in Spark ML. You can train Word2Vec in a cluster on multiple machines to handle large-scale datasets and use the trained model for token-level classifications such as NerDL (see the example after this list)
    • Introducing useBestModel param in NerDLApproach annotator. This param in the NerDLApproach preserves and restores the model that has achieved the best performance at the end of the training. The priority is metrics from testDataset (micro F1), metrics from validationSplit (micro F1), and if none is set it will keep track of loss during the training
    • Support Apache Spark and PySpark 3.2.x on Scala 2.12. Spark NLP by default is shipped for Spark 3.0.x/3.1.x, but now you have spark-nlp-spark32 and spark-nlp-gpu-spark32 packages
    • Adding a new param to sparknlp.start() function in Python for Apache Spark 3.2.x (spark32=True)
    • Update Colab and Kaggle scripts for faster setup. We no longer need to remove Java 11 in order to install Java 8 since Spark NLP works on Java 11. This makes the installation of Spark NLP on Colab and Kaggle as fast as pip install spark-nlp pyspark==3.1.2
    • Add new scripts/notebook to generate custom TensorFlow graphs for ContextSpellCheckerApproach annotator
    • Add a new graphFolder param to ContextSpellCheckerApproach annotator. This param allows training ContextSpellChecker from a custom-made TensorFlow graph
    • Support DBFS file system in graphFolder param. Starting with Spark NLP 3.4.0, you can point NerDLApproach or ContextSpellCheckerApproach to a TF graph hosted on Databricks
    • Add a new feature to all classifiers (ForTokenClassification and ForSequenceClassification) to retrieve classes from the pretrained models
    sequenceClassifier = XlmRoBertaForSequenceClassification \
          .pretrained('xlm_roberta_base_sequence_classifier_ag_news', 'en') \
          .setInputCols(['token', 'document']) \
          .setOutputCol('class')
    
    print(sequenceClassifier.getClasses())
    
    #Sports, Business, World, Sci/Tech
    
    • Add inputFormats param to DateMatcher and MultiDateMatcher annotators. DateMatcher and MultiDateMatcher can now define a list of acceptable input date patterns to search for in the text, while outputFormat defines the single pattern used for the output.
    date_matcher = DateMatcher() \
        .setInputCols(['document']) \
        .setOutputCol("date") \
        .setInputFormats(["yyyy", "yyyy/dd/MM", "MM/yyyy"]) \
        .setOutputFormat("yyyyMM") \
        .setSourceLanguage("en")
    # setOutputFormat was previously called setDateFormat
    
    
    • Enable batch processing in T5Transformer and MarianTransformer annotators
    • Add Schema to readDataset in CoNLL() class
    • Welcoming 6x new Databricks runtimes to our Spark NLP family:
      • Databricks 10.0
      • Databricks 10.0 ML GPU
      • Databricks 10.1
      • Databricks 10.1 ML GPU
      • Databricks 10.2
      • Databricks 10.2 ML GPU
    • Welcoming 3x new EMR releases to our Spark NLP family:
      • EMR 5.33.1 (Apache Spark 2.4.7 / Hadoop 2.10.1)
      • EMR 6.3.1 (Apache Spark 3.1.1 / Hadoop 3.2.1)
      • EMR 6.4.0 (Apache Spark 3.1.2 / Hadoop 3.2.1)
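    Putting two of these features together, here is a sketch of starting a session for Apache Spark 3.2.x and training the new distributed Word2Vec. The vector size and the training_df DataFrame are illustrative, and the Word2VecApproach setter names are assumed to mirror Spark ML's Word2Vec params:

    import sparknlp
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer, Word2VecApproach
    from pyspark.ml import Pipeline

    # New in 3.4.0: spark32=True starts a session built for Apache Spark 3.2.x
    spark = sparknlp.start(spark32=True)

    document_assembler = DocumentAssembler() \
        .setInputCol("text") \
        .setOutputCol("document")

    tokenizer = Tokenizer() \
        .setInputCols(["document"]) \
        .setOutputCol("token")

    # Trainable, distributed Word2Vec for token-level tasks such as NerDL
    word2vec = Word2VecApproach() \
        .setInputCols(["token"]) \
        .setOutputCol("embeddings") \
        .setVectorSize(100)

    pipeline = Pipeline(stages=[document_assembler, tokenizer, word2vec])
    model = pipeline.fit(training_df)  # training_df: any DataFrame with a "text" column (hypothetical)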

    Bug Fixes

    • Fix a race condition in cluster mode where, on first use, the TF session was accessed as many times as there are available cores on the Driver machine. Loading a model multiple times at once results in higher disk usage, and IO may become a bottleneck for larger models, especially on machines with slower disks. Thanks to @jerrychenhf for finding this issue and offering a solution https://github.com/JohnSnowLabs/spark-nlp/pull/6575
    • Fix a performance issue introduced in the 3.3.3 release for the T5Transformer and MarianTransformer annotators. While adding support for ignored tokens, we accidentally introduced a bug that degraded the performance of these two annotators (sometimes up to 2x slower). Please update to 3.4.0 if you are using either of these annotators https://github.com/JohnSnowLabs/spark-nlp/pull/6605
    • Fix a bug in model resolution by not filtering based on the timestamp
    • Fix configProtoBytes param type in Python https://github.com/JohnSnowLabs/spark-nlp/pull/6549
    • Fix missing DefaultParamsReadable in RegexTokenizer annotator https://github.com/JohnSnowLabs/spark-nlp/pull/6653
    • Fix missing models lemma_antbnc, sentiment_vivekn, and spellcheck_norvig for Spark 3.x
    • Fix missing pipelines clean_slang, check_spelling, match_chunks, and match_datetime for Spark 3.x
    • Fix saveModel in TrainingHelper
    • Fix Keyword/Yake module naming in Scala https://github.com/JohnSnowLabs/spark-nlp/pull/6562

    Models Hub

    Models Hub now comes with new features to easily filter and find your desired models & pipelines by:

    • NLP Task
    • Natural Language
    • Spark NLP version


    In addition, you can also filter models & pipelines by:

    • Models or Pipelines (finally! 😃 )
    • Tags used inside Model's card
    • Or even by predicted entities (which labels/classes a model can predict)


    As always, you can host your own pre-trained models & pipelines easily accessible to you for free & forever! 🚀


    Models and Pipelines

    Spark NLP 3.4.0 comes with state-of-the-art pre-trained transformer models. Models Hub supports over 15 NLP tasks: Named Entity Recognition, Text Classification, Sentiment Analysis, Translation, Question Answering, Summarization, Sentence Detection, Embeddings, Language Detection, Stop Words Removal, Word Segmentation, Part of Speech Tagging, Lemmatization, Spell Check, Dependency Parser, and Text Generation

    Featured Models

    | Model | Name | Lang |
    |:---------------------|:-------------------|:---|
    | GPT2Transformer | gpt2_distilled | en |
    | GPT2Transformer | gpt2 | en |
    | GPT2Transformer | gpt2_medium | en |
    | GPT2Transformer | gpt2_large | en |
    | XlmRoBertaForSequenceClassification | xlm_roberta_base_sequence_classifier_imdb | en |
    | XlmRoBertaForSequenceClassification | xlm_roberta_base_sequence_classifier_allocine | fr |
    | XlmRoBertaForSequenceClassification | xlm_roberta_base_sequence_classifier_ag_news | en |
    | RoBertaForSequenceClassification | roberta_base_sequence_classifier_imdb | en |
    | RoBertaForSequenceClassification | roberta_base_sequence_classifier_ag_news | en |
    | AlbertForSequenceClassification | albert_base_sequence_classifier_ag_news | en |
    | AlbertForSequenceClassification | albert_base_sequence_classifier_imdb | en |
    | LongformerForSequenceClassification | longformer_base_sequence_classifier_ag_news | en |
    | LongformerForSequenceClassification | longformer_base_sequence_classifier_imdb | en |
    | BertForSequenceClassification | bert_sequence_classifier_sentiment | it |
    | BertForSequenceClassification | bert_sequence_classifier_finbert_tone | en |
    | BertForSequenceClassification | bert_sequence_classifier_toxicity | ru |
    | XlnetForSequenceClassification | xlnet_base_sequence_classifier_imdb | en |
    | XlnetForSequenceClassification | xlnet_base_sequence_classifier_ag_news | en |
    | RoBertaForTokenClassification | roberta_token_classifier_bne_capitel_ner | es |
    | RoBertaForTokenClassification | roberta_token_classifier_icelandic_ner | is |
    | RoBertaForTokenClassification | roberta_token_classifier_ticker | en |
    | RoBertaForTokenClassification | roberta_token_classifier_pos_tagger | id |
    | RoBertaForTokenClassification | roberta_token_classifier_timex_semeval | en |
    | XlmRoBertaForTokenClassification | xlm_roberta_large_token_classifier_masakhaner | xx |
    | XlmRoBertaForTokenClassification | xlm_roberta_base_token_classifier_ner | tr |
    | XlmRoBertaForTokenClassification | xlm_roberta_large_token_classifier_ner | id |
    | XlmRoBertaForTokenClassification | xlm_roberta_large_token_classifier_conll03 | de |
    | XlmRoBertaForTokenClassification | xlm_roberta_large_token_classifier_hrl | xx |
    | BertForTokenClassification | bert_hi_en_ner | hi |
    | BertForTokenClassification | bert_token_classifier_scandi_ner | xx |
    | BertForTokenClassification | bert_token_classifier_hi_en_ner | hi |
    | BertForTokenClassification | bert_token_classifier_dutch_udlassy_ner | nl |
    | BertForTokenClassification | bert_token_classifier_chinese_ner | zh |
    | DistilBertEmbeddings | distilbert_uncased | te |
    | XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_swahili | sw |
    | BertEmbeddings | bert_base_finnish_uncased | fr |
    | BertEmbeddings | bert_base_finnish_cased | fi |
    | BertEmbeddings | electra_medal_acronym | en |
    | ClassifierDLModel | classifierdl_urduvec_fakenews | ur |
    | ClassifierDLModel | classifierdl_bert_news | ur |
    | NerDLModel | nerdl_restaurant_100d | en |
    | Word2VecModel | word2vec_gigaword_wiki_300 | en |
    | Word2VecModel | word2vec_gigaword_300 | en |

    Spark NLP covers the following languages:

    English, Multilingual, Afrikaans, Afro-Asiatic languages, Albanian, Altaic languages, American Sign Language, Amharic, Arabic, Argentine Sign Language, Armenian, Artificial languages, Atlantic-Congo languages, Austro-Asiatic languages, Austronesian languages, Azerbaijani, Baltic languages, Bantu languages, Basque, Basque (family), Belarusian, Bemba (Zambia), Bengali, Bangla, Berber languages, Bihari, Bislama, Bosnian, Brazilian Sign Language, Breton, Bulgarian, Catalan, Caucasian languages, Cebuano, Celtic languages, Central Bikol, Chichewa, Chewa, Nyanja, Chilean Sign Language, Chinese, Chuukese, Colombian Sign Language, Congo Swahili, Croatian, Cushitic languages, Czech, Danish, Dholuo, Luo (Kenya and Tanzania), Dravidian languages, Dutch, East Slavic languages, Eastern Malayo-Polynesian languages, Efik, Esperanto, Estonian, Ewe, Fijian, Finnish, Finnish Sign Language, Finno-Ugrian languages, French, French-based creoles and pidgins, Ga, Galician, Ganda, Georgian, German, Germanic languages, Gilbertese, Greek (modern), Greek languages, Gujarati, Gun, Haitian, Haitian Creole, Hausa, Hebrew (modern), Hiligaynon, Hindi, Hiri Motu, Hungarian, Icelandic, Igbo, Iloko, Indic languages, Indo-European languages, Indo-Iranian languages, Indonesian, Irish, Isoko, Isthmus Zapotec, Italian, Italic languages, Japanese, Kabyle, Kalaallisut, Greenlandic, Kannada, Kaonde, Kinyarwanda, Kirundi, Kongo, Korean, Kwangali, Kwanyama, Kuanyama, Latin, Latvian, Lingala, Lithuanian, Louisiana Creole, Lozi, Luba-Katanga, Luba-Lulua, Lunda, Lushai, Luvale, Macedonian, Malagasy, Malay, Malayalam, Malayo-Polynesian languages, Maltese, Manx, Marathi (Marāṭhī), Marshallese, Mexican Sign Language, Mon-Khmer languages, Morisyen, Mossi, Multiple languages, Ndonga, Nepali, Niger-Kordofanian languages, Nigerian Pidgin, Niuean, North Germanic languages, Northern Sotho, Pedi, Sepedi, Norwegian, Norwegian Bokmål, Norwegian Nynorsk, Nyaneka, Oromo, Pangasinan, Papiamento, Persian (Farsi), Peruvian Sign Language, Philippine languages, Pijin, Pohnpeian, Polish, Portuguese, Portuguese-based creoles and pidgins, Punjabi (Eastern), Romance languages, Romanian, Rundi, Russian, Ruund, Salishan languages, Samoan, San Salvador Kongo, Sango, Semitic languages, Serbo-Croatian, Seselwa Creole French, Shona, Sindhi, Sino-Tibetan languages, Slavic languages, Slovak, Slovene, Somali, South Caucasian languages, South Slavic languages, Southern Sotho, Spanish, Spanish Sign Language, Sranan Tongo, Swahili, Swati, Swedish, Tagalog, Tahitian, Tai, Tamil, Telugu, Tetela, Tetun Dili, Thai, Tigrinya, Tiv, Tok Pisin, Tonga (Tonga Islands), Tonga (Zambia), Tsonga, Tswana, Tumbuka, Turkic languages, Turkish, Tuvalu, Tzotzil, Ukrainian, Umbundu, Uralic languages, Urdu, Venda, Venezuelan Sign Language, Vietnamese, Wallisian, Walloon, Waray (Philippines), Welsh, West Germanic languages, West Slavic languages, Western Malayo-Polynesian languages, Wolaitta, Wolaytta, Wolof, Xhosa, Yapese, Yiddish, Yoruba, Yucatec Maya, Yucateco, Zande (individual language), Zulu

    The complete list of all 4100+ models & pipelines in 230+ languages is available on Models Hub

    Backward Compatibility

    • The parameter dateFormat in DateMatcher and MultiDateMatcher annotators has been renamed to outputFormat:
    
    # previously
    .setDateFormat("yyyy/MM/dd")
    
    # after 3.4.0 release
    .setOutputFormat("yyyy/MM/dd")
    
    
    • Deprecating xling TF Hub models for UniversalSentenceEncoder annotator (there are CMLM models available which outperform xling models with support for more languages)
    • Deprecating Finnish old BERT models (there are newer models available now)

    New Notebooks

    Import hundreds of models in different languages to Spark NLP

    | Spark NLP | HuggingFace Notebooks | Colab |
    |:------------|:-------------|:----------|
    | AlbertForSequenceClassification | HuggingFace in Spark NLP - AlbertForSequenceClassification | Open In Colab |
    | RoBertaForSequenceClassification | HuggingFace in Spark NLP - RoBertaForSequenceClassification | Open In Colab |
    | XlmRoBertaForSequenceClassification | HuggingFace in Spark NLP - XlmRoBertaForSequenceClassification | Open In Colab |
    | XlnetForSequenceClassification | HuggingFace in Spark NLP - XlnetForSequenceClassification | Open In Colab |

    You can visit Import Transformers in Spark NLP for more info

    New Word2Vec notebook

    | Spark NLP | Jupyter Notebook |
    |:------------|:-------------|
    | Word2VecApproach | Train Word2Vec and NER models |


    Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==3.4.0
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.4.0
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.4.0
    

    spark-nlp on Apache Spark 3.2.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark32_2.12:3.4.0
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark32_2.12:3.4.0
    

    spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.4.0
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.4.0
    

    spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.4.0
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.4.0
    

    Maven

    spark-nlp on Apache Spark 3.0.x and 3.1.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>3.4.0</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>3.4.0</version>
    </dependency>
    

    spark-nlp on Apache Spark 3.2.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark32_2.12</artifactId>
        <version>3.4.0</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark32_2.12</artifactId>
        <version>3.4.0</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.4.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark24_2.11</artifactId>
        <version>3.4.0</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
        <version>3.4.0</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark23_2.11</artifactId>
        <version>3.4.0</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
        <version>3.4.0</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.4.0.jar

    • GPU on Apache Spark 3.0.x/3.1.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.4.0.jar

    • CPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark32-assembly-3.4.0.jar

    • GPU on Apache Spark 3.2.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark32-assembly-3.4.0.jar

    • CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.4.0.jar

    • GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.4.0.jar

    • CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.4.0.jar

    • GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.4.0.jar

    What's Changed

    Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/3.3.4...3.4.0

    New Contributors

    • @galiph made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6528
    • @Ahmetemintek made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6531
    • @xyutech made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6547
    • @KshitizGIT made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6550
    • @luca-martial made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6642
    • @Cabir40 made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6685

    @vankov @xyutech @Cabir40 @murat-gunay @Ahmetemintek @KshitizGIT @gadde5300 @jsl-models @DevinTDHa @agsfer @diatrambitas @maziyarpanahi @luca-martial @danilojsl @wolliq @muhammetsnts @pabla @josejuanmartinez @jsl-builder @galiph @albertoandreottiATgmail

    Source code(tar.gz)
    Source code(zip)
  • 3.3.4(Nov 25, 2021)


    Patch release

    • Fix ClassCastException error in pretrained function for DistilBertForSequenceClassification in Python https://github.com/JohnSnowLabs/spark-nlp/pull/6513

    Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==3.3.4
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.4
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.4
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.4
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.4
    

    spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.4
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.4
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.4
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.4
    

    spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.4
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.4
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.4
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.4
    

    Maven

    spark-nlp on Apache Spark 3.0.x and 3.1.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>3.3.4</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>3.3.4</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.4.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark24_2.11</artifactId>
        <version>3.3.4</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
        <version>3.3.4</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark23_2.11</artifactId>
        <version>3.3.4</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
        <version>3.3.4</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.3.4.jar

    • GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.3.4.jar

    • CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.3.4.jar

    • GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.3.4.jar

    • CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.3.4.jar

    • GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.3.4.jar

    What's Changed

    • Update documentation of ChunkKeyPhraseExtraction by @vankov in https://github.com/JohnSnowLabs/spark-nlp/pull/6508
    • Fixes new instantiation in scala section by @josejuanmartinez in https://github.com/JohnSnowLabs/spark-nlp/pull/6469
    • Fix the wrong name for DistilBertForSequenceClassification in Python by @maziyarpanahi in https://github.com/JohnSnowLabs/spark-nlp/pull/6513
    • Release/334 release candidate by @maziyarpanahi in https://github.com/JohnSnowLabs/spark-nlp/pull/6514

    Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/3.3.3...3.3.4

    Source code(tar.gz)
    Source code(zip)
  • 3.3.3(Nov 22, 2021)


    Overview

    (knock, knock, knock) Penny? Yes, this is a very special release if you are obsessed with the number 3 as much as we are! So we are pleased to announce Spark NLP 🚀 3.3.3 release! 🎉 🎊 🎈

    This release comes with a new DistilBertForSequenceClassification annotator for existing or fine-tuned DistilBERT models for Text Classification on HuggingFace, a new distributed and trainable Doc2Vec annotator based on the Word2Vec implementation in Spark ML, improved BertEmbeddings and BertSentenceEmbeddings on a single machine with a GPU device where the DataFrame has 1 sentence per row or the input column is set to document, new state-of-the-art fine-tuned DistilBERT models for Sequence Classification, enhancements, bug fixes, and more!

    As always, we would like to thank our community for their feedback, questions, and feature requests.


    New Features and Enhancements

    • NEW: Introducing DistilBertForSequenceClassification annotator in Spark NLP 🚀. DistilBertForSequenceClassification can load DistilBERT Models with a sequence classification/regression head on top (a linear layer on top of the pooled output) e.g. for multi-class document classification tasks. This annotator is compatible with all the models trained/fine-tuned by using DistilBertForSequenceClassification or TFDistilBertForSequenceClassification in HuggingFace 🤗
    • NEW: Introducing trainable and distributed Doc2Vec annotators based on Word2Vec in Spark ML (see the example after this list)
    • Improving BertEmbeddings on a single machine with a GPU device when the DataFrame has a single document/sentence per row
    • Improving BertSentenceEmbeddings on a single machine with a GPU device when the DataFrame has a single document/sentence per row
    • Add a new feature to the CoNLL() class, allowing it to read multiple CoNLL files at the same time into a single DataFrame
    • Add support for Long type in label column for ClassifierDLApproach and SentimentDLApproach
    • Add a script to set up AWS SageMaker, thanks to @xegulon
    • Add instructions to set up Amazon Linux 2
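    A minimal sketch of the new trainable Doc2Vec annotator and the multi-file CoNLL reader. The folder path is hypothetical, reading a folder is assumed to pick up all CoNLL files in it, and an active spark session is assumed:

    from sparknlp.annotator import Doc2VecApproach
    from sparknlp.training import CoNLL

    # Trainable, distributed Doc2Vec; the token column comes from a Tokenizer stage
    doc2vec = Doc2VecApproach() \
        .setInputCols(["token"]) \
        .setOutputCol("sentence_embeddings")

    # CoNLL() can now read multiple CoNLL files into a single DataFrame
    training_data = CoNLL().readDataset(spark, "/data/conll/")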

    Bug Fixes

    • Improve model and pipeline resolution in Spark NLP so the wrong models/pipelines are no longer downloaded regardless of their Apache Spark version
    • Fix MarianTransformer bug on empty sequences
    • Fix TFInvalidArgumentException in MarianTransformer for sequences longer than 512
    • Fix MarianTransformer multi-lingual models and pipelines such as opus_mt_mul_en
    • Fix a bug in DateMatcher and MultiDateMatcher when detecting month from subwords by mistake
    • Add the missing lemma_antbnc model to Models Hub
    • Add the missing sentiment_vivekn model to Models Hub
    • Add the missing spellcheck_norvig model to Models Hub

    Models

    New state-of-the-art fine-tuned DistilBERT models for Sequence Classification:

    Featured Pretrained Models

    | Model | Name | Lang | Build |
    |:---------------------|:-------------------|:-----|:-----------------|
    | DistilBertForSequenceClassification | distilbert_sequence_classifier_sst2 | en | 3.3.3 |
    | DistilBertForSequenceClassification | distilbert_sequence_classifier_policy | en | 3.3.3 |
    | DistilBertForSequenceClassification | distilbert_sequence_classifier_industry | en | 3.3.3 |
    | DistilBertForSequenceClassification | distilbert_sequence_classifier_emotion | en | 3.3.3 |
    | DistilBertForSequenceClassification | distilbert_sequence_classifier_banking77 | en | 3.3.3 |
    | DistilBertForSequenceClassification | distilbert_multilingual_sequence_classifier_allocine | fr | 3.3.3 |
    | DistilBertForSequenceClassification | distilbert_base_sequence_classifier_imdb | ur | 3.3.3 |
    | DistilBertForSequenceClassification | distilbert_base_sequence_classifier_imdb | en | 3.3.3 |
    | DistilBertForSequenceClassification | distilbert_base_sequence_classifier_amazon_polarity | en | 3.3.3 |
    | DistilBertForSequenceClassification | distilbert_base_sequence_classifier_ag_news | en | 3.3.3 |
    | Doc2VecModel | doc2vec_gigaword_300 | en | 3.3.3 |
    | Doc2VecModel | doc2vec_gigaword_wiki_300 | en | 3.3.3 |

    The complete list of all 4000+ models & pipelines in 200+ languages is available on Models Hub.

    New Notebooks

    | Spark NLP | Notebooks | Colab |
    |:------------|:-------------|:----------|
    | DistilBertForSequenceClassification | HuggingFace in Spark NLP - DistilBertForSequenceClassification | Open In Colab |
    | Doc2Vec | Train Doc2Vec for Text Classification | Open In Colab |


    Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==3.3.3
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.3
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.3
    

    spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.3
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.3
    

    spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.3
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.3
    

    Maven

    spark-nlp on Apache Spark 3.0.x and 3.1.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>3.3.3</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>3.3.3</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.4.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark24_2.11</artifactId>
        <version>3.3.3</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
        <version>3.3.3</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark23_2.11</artifactId>
        <version>3.3.3</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
        <version>3.3.3</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.3.3.jar

    • GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.3.3.jar

    • CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.3.3.jar

    • GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.3.3.jar

    • CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.3.3.jar

    • GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.3.3.jar

    What's Changed

    Full Changelog: https://github.com/JohnSnowLabs/spark-nlp/compare/3.3.2...3.3.3

    New Contributors

    • @xegulon made their first contribution in https://github.com/JohnSnowLabs/spark-nlp/pull/6449

    @DevinTDHa @diatrambitas @xegulon @egenc @gadde5300 @jsl-models @murat-gunay @josejuanmartinez @maziyarpanahi @jsl-builder @wolliq @xusliebana @agsfer @danilojsl @vankov @muhammetsnts @albertoandreottiATgmail

  • 3.3.2(Nov 3, 2021)


    Overview

We are pleased to release Spark NLP 🚀 3.3.2! This release comes with a new BertForSequenceClassification annotator for existing or fine-tuned models on HuggingFace, a new logging feature for training with Comet.ml, new state-of-the-art fine-tuned BERT models for Sequence Classification, and bug fixes!

    As always, we would like to thank our community for their feedback, questions, and feature requests.


    New Features

• Introducing BertForSequenceClassification annotator. BertForSequenceClassification can load BERT Models with a sequence classification/regression head on top (a linear layer on top of the pooled output), e.g. for multi-class document classification tasks (see the sketch below). This annotator is compatible with all the models trained/fine-tuned by using BertForSequenceClassification (PyTorch) or TFBertForSequenceClassification (TensorFlow) in HuggingFace 🤗
    • New support for Comet.ml in Spark NLP to build better models faster.

Comet enables data scientists and teams to track, compare, explain, and optimize experiments and models across the model's entire lifecycle, from training to production. With just two lines of code, you can start building better models today.

    Comet SparkNLP Integration Notebook
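For context, here is a minimal sketch (assembled for illustration, not taken from these notes; the model name comes from the Featured Pretrained Models table below) of plugging the new annotator into a Spark ML pipeline:

```python
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertForSequenceClassification
from pyspark.ml import Pipeline

document = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

tokenizer = Tokenizer() \
    .setInputCols(["document"]) \
    .setOutputCol("token")

# Load a fine-tuned sequence classification model from Models Hub
# (name taken from the table below).
classifier = BertForSequenceClassification \
    .pretrained("bert_base_sequence_classifier_imdb", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("class")

pipeline = Pipeline(stages=[document, tokenizer, classifier])
```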


    Bug Fixes and Enhancements

    • Fix a missing batchSize param in NerDLModel that degraded GPU performance by not allowing users to change the default batchSize
    • Fix NerDLApproach logs format on Databricks
• Fix the EntityRulerApproach name in imports
    • Fix missing EntityRulerModel in ResourceDownloader
    • Faster Colab setup script for pyspark 3.0.x and 3.1.x on Java 11

    Models

New state-of-the-art fine-tuned BERT models for Sequence Classification in English, French, German, Spanish, Japanese, Turkish, and Russian, plus multilingual models.

    Featured Pretrained Models

| Model | Name | Build | Lang |
|:---------------------|:-------------------|:-----------------|:-----|
| BertForSequenceClassification | bert_multilingual_sequence_classifier_allocine | 3.3.2 | fr |
| BertForSequenceClassification | bert_large_sequence_classifier_imdb | 3.3.2 | en |
| BertForSequenceClassification | bert_base_sequence_classifier_imdb | 3.3.2 | en |
| BertForSequenceClassification | bert_base_sequence_classifier_ag_news | 3.3.2 | en |
| BertForSequenceClassification | bert_base_sequence_classifier_dbpedia_14 | 3.3.2 | en |
| BertForSequenceClassification | bert_sequence_classifier_turkish_sentiment | 3.3.2 | tr |
| BertForSequenceClassification | bert_sequence_classifier_sentiment | 3.3.2 | de |
| BertForSequenceClassification | bert_sequence_classifier_rubert_sentiment | 3.3.2 | ru |
| BertForSequenceClassification | bert_sequence_classifier_multilingual_sentiment | 3.3.2 | xx |
| BertForSequenceClassification | bert_sequence_classifier_japanese_sentiment | 3.3.2 | ja |
| BertForSequenceClassification | bert_sequence_classifier_finbert | 3.3.2 | en |
| BertForSequenceClassification | bert_sequence_classifier_dehatebert_mono | 3.3.2 | en |
| BertForSequenceClassification | bert_sequence_classifier_beto_sentiment_analysis | 3.3.2 | es |
| BertForSequenceClassification | bert_sequence_classifier_beto_emotion_analysis | 3.3.2 | es |

    The complete list of all 4000+ models & pipelines in 200+ languages is available on Models Hub.

    New Notebooks

Spark NLP | Notebooks | Colab
:------------ | :-------------| :----------
BertForSequenceClassification | HuggingFace in Spark NLP - BertForSequenceClassification | Open In Colab
Comet.ml | Comet SparkNLP Integration Notebook | Open In Colab


    Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==3.3.2
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.2
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.2
    

    spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.2
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.2
    

    spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.2
    

    GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.2
    

    Maven

    spark-nlp on Apache Spark 3.0.x and 3.1.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>3.3.2</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>3.3.2</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.4.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark24_2.11</artifactId>
        <version>3.3.2</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
        <version>3.3.2</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark23_2.11</artifactId>
        <version>3.3.2</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
        <version>3.3.2</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.3.2.jar

    • GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.3.2.jar

    • CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.3.2.jar

    • GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.3.2.jar

    • CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.3.2.jar

    • GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.3.2.jar

  • 3.3.1(Oct 18, 2021)


    Overview

We are pleased to release Spark NLP 🚀 3.3.1! This release comes with a new EntityRuler annotator, better compatibility between the TokenClassification annotators and other annotators in a Spark NLP pipeline, new state-of-the-art XLM-RoBERTa models in African languages, and bug fixes!

    As always, we would like to thank our community for their feedback, questions, and feature requests.


    New Features

• Introducing the EntityRuler annotator, which receives either a JSON or CSV ontology file that maps entities to patterns. You can implement a purely rule-based entity recognition system with EntityRuler; it can be saved as a Model and reused in other pipelines to annotate your documents against your knowledge base (see the sketch below).

    Access EntityRuler Documentation
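Below is a minimal sketch (for illustration only; the patterns file name, its contents, and the sample data are assumptions) of training and saving an EntityRuler:

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, EntityRulerApproach
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

# "patterns.json" is a placeholder for a JSON ontology file that maps
# entities to patterns, as described above.
entity_ruler = EntityRulerApproach() \
    .setInputCols(["document", "token"]) \
    .setOutputCol("entity") \
    .setPatternsResource("patterns.json")

data = spark.createDataFrame([["John Snow lives in Winterfell."]]).toDF("text")
model = Pipeline(stages=[document, tokenizer, entity_ruler]).fit(data)

# The fitted EntityRuler can be saved and reused in other pipelines.
model.write().overwrite().save("entity_ruler_pipeline")
```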


    Bug Fixes

    • Fix compatibility issue between NerOverwriter and AlbertForTokenClassification, BertForTokenClassification, DistilBertForTokenClassification, LongformerForTokenClassification, RoBertaForTokenClassification, XlmRoBertaForTokenClassification, XlnetForTokenClassification annotators
    • Fix a bug in ContextSpellCheckerApproach annotator failing to find an appropriate TF graph
    • Fix a bug in ContextSpellCheckerModel not being able to load a trained model
    • Fix token alignment with token pieces in BertEmbeddings resulting in missing vectors with Unicode characters
    • Add the missing pretrained NER models for the XlmRoBertaForTokenClassification annotator
    • Add the missing pretrained NER models for the LongformerForTokenClassification annotator

    Backward compatibility

• Renaming YakeModel to YakeKeywordExtraction to reflect the actual purpose of this annotator more clearly.

    Models and Pipelines

    New state-of-the-art XLM-RoBERTa models in Luganda, Naija, Yoruba, Hausa, Kinyarwanda, Wolof, Igbo, Amharic, Swahili, and Luo.

    New Transformer Models

| Model | Name | Build | Lang |
|:---------------------|:-------------------|:-----------------|:-----|
| XlmRoBertaSentenceEmbeddings | sent_xlm_roberta_base_finetuned_yoruba | 3.3.1 | yo |
| XlmRoBertaSentenceEmbeddings | sent_xlm_roberta_base_finetuned_wolof | 3.3.1 | wo |
| XlmRoBertaSentenceEmbeddings | sent_xlm_roberta_base_finetuned_naija | 3.3.1 | pcm |
| XlmRoBertaSentenceEmbeddings | sent_xlm_roberta_base_finetuned_swahili | 3.3.1 | sw |
| XlmRoBertaSentenceEmbeddings | sent_xlm_roberta_base_finetuned_luganda | 3.3.1 | lg |
| XlmRoBertaSentenceEmbeddings | sent_xlm_roberta_base_finetuned_kinyarwanda | 3.3.1 | rw |
| XlmRoBertaSentenceEmbeddings | sent_xlm_roberta_base_finetuned_hausa | 3.3.1 | ha |
| XlmRoBertaSentenceEmbeddings | sent_xlm_roberta_base_finetuned_igbo | 3.3.1 | ig |
| XlmRoBertaSentenceEmbeddings | sent_xlm_roberta_base_finetuned_amharic | 3.3.1 | am |
| XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_yoruba | 3.3.1 | yo |
| XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_wolof | 3.3.1 | wo |
| XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_swahili | 3.3.1 | sw |
| XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_naija | 3.3.1 | pcm |
| XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_luo | 3.3.1 | lou |

    The complete list of all 4000+ models & pipelines in 200+ languages is available on Models Hub.

    New Notebooks

| Spark NLP | Jupyter Notebooks |
|:------------ | :-------------|
| EntityRuler | EntityRuler |
| EntityRuler | EntityRuler_LightPipeline |
| EntityRuler | EntityRuler_Whitout_Storage |


    Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==3.3.1
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.1
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.1
    

    spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.1
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.1
    

    spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.1
    

    GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.1
    

    Maven

    spark-nlp on Apache Spark 3.0.x and 3.1.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>3.3.1</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>3.3.1</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.4.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark24_2.11</artifactId>
        <version>3.3.1</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
        <version>3.3.1</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark23_2.11</artifactId>
        <version>3.3.1</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
        <version>3.3.1</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.3.1.jar

    • GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.3.1.jar

    • CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.3.1.jar

    • GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.3.1.jar

    • CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.3.1.jar

    • GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.3.1.jar

  • 3.3.0(Sep 29, 2021)


    Overview

We are very excited to release Spark NLP 🚀 3.3.0! This release comes with new ALBERT, XLNet, RoBERTa, XLM-RoBERTa, and Longformer annotators for existing or fine-tuned Token Classification models on HuggingFace 🤗, up to 50x faster saving of Spark NLP models & pipelines, no more 2G limitation for the size of imported TensorFlow models, lots of new functions to filter and display pretrained models & pipelines inside Spark NLP, bug fixes, and more!

    We are proud to say Spark NLP 3.3.0 is still compatible across all major releases of Apache Spark used locally, by all Cloud providers such as EMR, and all managed services such as Databricks. The major releases of Apache Spark include Apache Spark 3.0.x/3.1.x (spark-nlp), Apache Spark 2.4.x (spark-nlp-spark24), and Apache Spark 2.3.x (spark-nlp-spark23).

    As always, we would like to thank our community for their feedback, questions, and feature requests.


    Major features and improvements

• NEW: Starting with the Spark NLP 3.3.0 release, there is no size limitation when you import TensorFlow models! You can now import TF Hub & HuggingFace models larger than 2 gigabytes.
• NEW: Up to 50x faster saving of Spark NLP models and pipelines! We have improved the way we package the TensorFlow SavedModel when saving Spark NLP models & pipelines. For instance, it used to take up to 10 minutes to save the xlm_roberta_base model before Spark NLP 3.3.0, and now it takes only up to 15 seconds (see the saving sketch after this list)!
    • NEW: Introducing AlbertForTokenClassification annotator in Spark NLP 🚀. AlbertForTokenClassification can load ALBERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using AlbertForTokenClassification or TFAlbertForTokenClassification in HuggingFace 🤗
• NEW: Introducing XlnetForTokenClassification annotator in Spark NLP 🚀. XlnetForTokenClassification can load XLNet Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using XLNetForTokenClassification or TFXLNetForTokenClassification in HuggingFace 🤗
    • NEW: Introducing RoBertaForTokenClassification annotator in Spark NLP 🚀. RoBertaForTokenClassification can load RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using RobertaForTokenClassification or TFRobertaForTokenClassification in HuggingFace 🤗
    • NEW: Introducing XlmRoBertaForTokenClassification annotator in Spark NLP 🚀. XlmRoBertaForTokenClassification can load XLM-RoBERTa Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using XLMRobertaForTokenClassification or TFXLMRobertaForTokenClassification in HuggingFace 🤗
    • NEW: Introducing LongformerForTokenClassification annotator in Spark NLP 🚀. LongformerForTokenClassification can load Longformer Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using LongformerForTokenClassification or TFLongformerForTokenClassification in HuggingFace 🤗
    • NEW: Introducing new ResourceDownloader functions to easily look for pretrained models & pipelines inside Spark NLP (Python and Scala). You can filter models or pipelines via language, version, or the name of the annotator
    from sparknlp.pretrained import *
    
    # display and filter all available pretrained pipelines
    ResourceDownloader.showPublicPipelines()
    ResourceDownloader.showPublicPipelines(lang="en")
    ResourceDownloader.showPublicPipelines(lang="en", version="3.2.0")
    
# display and filter all available pretrained models
    ResourceDownloader.showPublicModels()
    ResourceDownloader.showPublicModels("NerDLModel", "3.2.0")
    ResourceDownloader.showPublicModels("NerDLModel", "en")
    ResourceDownloader.showPublicModels("XlmRoBertaEmbeddings", "xx")
    +--------------------------+------+---------+
    | Model                    | lang | version |
    +--------------------------+------+---------+
    | xlm_roberta_base         |  xx  | 3.1.0   |
    | twitter_xlm_roberta_base |  xx  | 3.1.0   |
    | xlm_roberta_xtreme_base  |  xx  | 3.1.3   |
    | xlm_roberta_large        |  xx  | 3.3.0   |
    +--------------------------+------+---------+
    
    # remove all the downloaded models & pipelines to free up storage
    ResourceDownloader.clearCache()
    
    # display all available annotators that can be saved as a Model
    ResourceDownloader.showAvailableAnnotators()
    
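As a quick illustration of the faster saving, here is a minimal sketch (assembled for illustration; the file paths are placeholders) that builds a small pipeline around xlm_roberta_base, the model the notes above cite for the speed-up, and saves and reloads it:

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, XlmRoBertaEmbeddings
from pyspark.ml import Pipeline, PipelineModel

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

# xlm_roberta_base used to take up to 10 minutes to save; with the new
# TensorFlow SavedModel packaging it takes only seconds.
embeddings = XlmRoBertaEmbeddings.pretrained("xlm_roberta_base", "xx") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("embeddings")

data = spark.createDataFrame([["Saving pipelines is much faster now."]]).toDF("text")
pipeline_model = Pipeline(stages=[document, tokenizer, embeddings]).fit(data)

# Save, then reload the fitted pipeline later.
pipeline_model.write().overwrite().save("/tmp/xlm_roberta_pipeline")
restored = PipelineModel.load("/tmp/xlm_roberta_pipeline")
```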

    Bug Fixes

    • Fix a bug in RoBertaEmbeddings when all special tokens were identical
    • Fix a bug in RoBertaEmbeddings when a special token contained valid regex
• Fix a bug that led to a memory leak inside the NorvigSweeting spell checker. This issue caused failures in pretrained pipelines such as explain_document_ml and explain_document_dl with some inputs
    • Fix the wrong types being assigned to minCount and classCount in Python for ContextSpellCheckerApproach annotator
    • Fix explain_document_ml pretrained pipeline for Spark NLP 3.x on Apache Spark 2.x
    • Fix WordSegmenterModel wordseg_best model for Thai language
    • Fix WordSegmenterModel wordseg_large model for Chinese language

    Models and Pipelines

    Spark NLP 3.3.0 comes with:

• New ALBERT, RoBERTa, XLNet, and XLM-RoBERTa models for Token Classification
    • New XLM-RoBERTa models in Luganda, Kinyarwanda, Igbo, Hausa, and Amharic languages

    New Transformer Models

| Model | Name | Build | Lang |
|:---------------------|:-------------------|:-----------------|:-----|
| RoBertaForTokenClassification | roberta_large_token_classifier_ontonotes | 3.3.0 | en |
| RoBertaForTokenClassification | roberta_large_token_classifier_conll03 | 3.3.0 | en |
| RoBertaForTokenClassification | roberta_base_token_classifier_ontonotes | 3.3.0 | en |
| RoBertaForTokenClassification | roberta_base_token_classifier_conll03 | 3.3.0 | en |
| RoBertaForTokenClassification | distilroberta_base_token_classifier_ontonotes | 3.3.0 | en |
| RoBertaForTokenClassification | roberta_token_classifier_zwnj_base_ner | 3.3.0 | fa |
| XlmRoBertaForTokenClassification | xlm_roberta_token_classifier_ner_40_lang | 3.3.0 | xx |
| AlbertForTokenClassification | albert_xlarge_token_classifier_conll03 | 3.3.0 | en |
| AlbertForTokenClassification | albert_large_token_classifier_conll03 | 3.3.0 | en |
| AlbertForTokenClassification | albert_base_token_classifier_conll03 | 3.3.0 | en |
| XlnetForTokenClassification | xlnet_large_token_classifier_conll03 | 3.3.0 | en |
| XlnetForTokenClassification | xlnet_base_token_classifier_conll03 | 3.3.0 | en |
| XlmRoBertaEmbeddings | xlm_roberta_large | 3.3.0 | xx |
| XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_luganda | 3.3.0 | lg |
| XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_kinyarwanda | 3.3.0 | rw |
| XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_igbo | 3.3.0 | ig |
| XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_hausa | 3.3.0 | ha |
| XlmRoBertaEmbeddings | xlm_roberta_base_finetuned_amharic | 3.3.0 | am |

    The complete list of all 3700+ models & pipelines in 200+ languages is available on Models Hub.

    New Notebooks

    Import hundreds of models in different languages to Spark NLP

Spark NLP | HuggingFace Notebooks | Colab
:------------ | :-------------| :----------
AlbertForTokenClassification | HuggingFace in Spark NLP - AlbertForTokenClassification | Open In Colab
RoBertaForTokenClassification | HuggingFace in Spark NLP - RoBertaForTokenClassification | Open In Colab
XlmRoBertaForTokenClassification | HuggingFace in Spark NLP - XlmRoBertaForTokenClassification | Open In Colab


    Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==3.3.0
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.3.0
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.3.0
    

    spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.3.0
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.3.0
    

    spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.3.0
    

    GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.3.0
    

    Maven

    spark-nlp on Apache Spark 3.0.x and 3.1.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>3.3.0</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>3.3.0</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.4.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark24_2.11</artifactId>
        <version>3.3.0</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
        <version>3.3.0</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark23_2.11</artifactId>
        <version>3.3.0</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
        <version>3.3.0</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.3.0.jar

    • GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.3.0.jar

    • CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.3.0.jar

    • GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.3.0.jar

    • CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.3.0.jar

    • GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.3.0.jar

  • 3.2.3(Sep 15, 2021)


    Overview

We are pleased to release Spark NLP 🚀 3.2.3! This release comes with new and complete documentation for all Transformers and Trainable annotators in Spark NLP, new Japanese NER and Embeddings models, new multilingual Transformer models, code enhancements, and bug fixes.

    As always, we would like to thank our community for their feedback, questions, and feature requests.


    New Features

• Add a delimiter feature to the CoNLL() class to support other delimiters in CoNLL files (see the sketch after this list) https://github.com/JohnSnowLabs/spark-nlp/pull/5934
    • Add support for IOB in addition to IOB2 format in GraphExtraction annotator https://github.com/JohnSnowLabs/spark-nlp/pull/6101
    • Change YakeModel output type from KEYWORD to CHUNK to have more available features after the YakeModel annotator such as Chunk2Doc or ChunkEmbeddings https://github.com/JohnSnowLabs/spark-nlp/pull/6065
    • Welcoming Databricks Runtime 9.0, 9.0 ML, and 9.0 ML with GPU
• A new and complete Transformer page including, for each annotator:
      • description
      • default model's name
      • link to Models Hub
      • link to notebook on Spark NLP Workshop
      • link to Python APIs
      • link to Scala APIs
      • link to source code and unit test
      • Examples in Python and Scala for
        • Prediction
        • Training
        • Raw Embeddings
• A new and complete Training page covering:
      • Training Datasets
      • Text Processing
      • Spell Checkers
      • Token Classification
      • Text Classification
      • External Trainable Models
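As mentioned in the first item of this list, the CoNLL() reader can now use a custom delimiter. A minimal sketch (assuming the option is exposed as a delimiter argument; the file path is a placeholder for a tab-separated CoNLL file):

```python
import sparknlp
from sparknlp.training import CoNLL

spark = sparknlp.start()

# Read a CoNLL file whose columns are separated by tabs instead of spaces.
training_data = CoNLL(delimiter="\t").readDataset(spark, "path/to/train.conll")
training_data.select("text", "label").show(3, truncate=False)
```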

    Bug Fixes & Enhancements

• Fix the default language for the XlmRoBertaSentenceEmbeddings pretrained model in Python https://github.com/JohnSnowLabs/spark-nlp/pull/6057
• Fix a SentenceEmbeddings issue that concatenated all sentences instead of handling each corresponding sentence https://github.com/JohnSnowLabs/spark-nlp/pull/6060
• Fix GraphExtraction usage in LightPipeline https://github.com/JohnSnowLabs/spark-nlp/pull/6101
• Fix a compatibility issue in the explain_document_ml pipeline
• Better import process for a corrupted merges file in the Longformer tokenizer https://github.com/JohnSnowLabs/spark-nlp/pull/6083

    Models and Pipelines

New models in Spanish, Greek, Swedish, Dutch, German, French, Romanian, and Japanese:

    BERT Embeddings (Word and Sentence)

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| BertEmbeddings | bert_base_uncased_legal | 3.2.2 | en |
| BertEmbeddings | bert_base_uncased | 3.2.2 | es |
| BertEmbeddings | bert_base_cased | 3.2.2 | es |
| BertEmbeddings | bert_base_uncased | 3.2.2 | el |
| BertEmbeddings | bert_base_cased | 3.2.2 | sv |
| BertEmbeddings | bert_base_cased | 3.2.2 | nl |
| BertSentenceEmbeddings | sent_bert_base_uncased_legal | 3.2.2 | en |
| BertSentenceEmbeddings | sent_bert_base_uncased | 3.2.2 | es |
| BertSentenceEmbeddings | sent_bert_base_cased | 3.2.2 | es |
| BertSentenceEmbeddings | sent_bert_base_uncased | 3.2.2 | el |
| BertSentenceEmbeddings | sent_bert_base_cased | 3.2.2 | sv |
| BertSentenceEmbeddings | sent_bert_base_cased | 3.2.2 | nl |
| BertSentenceEmbeddings | sent_bert_base_cased | 3.2.2 | de |

    Other multilingual models

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| WordEmbeddingsModel | japanese_cc_300d | 3.2.2 | ja |
| NerDLModel | ner_ud_gsd_cc_300d | 3.2.2 | ja |
| NerDLModel | ner_ud_gsd_xlm_roberta_base | 3.2.2 | ja |
| BertForTokenClassification | bert_token_classifier_ner_ud_gsd | 3.2.2 | ja |
| BertForTokenClassification | bert_token_classifier_ner_btc | 3.2.2 | en |
| ClassifierDLModel | classifierdl_bert_sentiment | 3.2.2 | de |
| ClassifierDLModel | classifierdl_bert_sentiment | 3.2.2 | fr |

    The complete list of all 3700+ models & pipelines in 200+ languages is available on Models Hub.


Models Hub for the community, by the community

    Serve Your Spark NLP Models for Free! You can host and share your Spark NLP models & pipelines publicly with everyone to reuse them with one line of code!

    Models Hub is open to everyone to upload their models and pipelines, showcase their work, and share them with others.

    Please visit the following page for more information: https://modelshub.johnsnowlabs.com/


    Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==3.2.3
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.3
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.2.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.2.3
    

    spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.2.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.2.3
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.2.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.2.3
    

    spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.2.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.2.3
    

    GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.2.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.2.3
    

    Maven

    spark-nlp on Apache Spark 3.0.x and 3.1.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>3.2.3</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>3.2.3</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.4.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark24_2.11</artifactId>
        <version>3.2.3</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
        <version>3.2.3</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark23_2.11</artifactId>
        <version>3.2.3</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
        <version>3.2.3</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.2.3.jar

    • GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.2.3.jar

    • CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.2.3.jar

    • GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.2.3.jar

    • CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.2.3.jar

    • GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.2.3.jar

  • 3.2.2(Sep 1, 2021)


    Overview

We are pleased to release Spark NLP 🚀 3.2.2! This release opens Models Hub to our community to host their models and pipelines for free, and comes with new RoBERTa and XLM-RoBERTa Sentence Embeddings, over 40 new models and pipelines in 20+ languages, bug fixes, and more!

    As always, we would like to thank our community for their feedback, questions, and feature requests.


    New Features

    • A new RoBertaSentenceEmbeddings annotator for sentence embeddings used in SentimentDL, ClassifierDL, and MultiClassifierDL annotators
    • A new XlmRoBertaSentenceEmbeddings annotator for sentence embeddings used in SentimentDL, ClassifierDL, and MultiClassifierDL annotators
    • Add support for AWS MFA via Spark NLP configuration
• Add new AWS configs to the Spark NLP configuration for using a private S3 bucket to store logs for training models or to access TF graphs needed in NerDLApproach (see the sketch after this list):
      • spark.jsl.settings.aws.credentials.access_key_id
      • spark.jsl.settings.aws.credentials.secret_access_key
      • spark.jsl.settings.aws.credentials.session_token
      • spark.jsl.settings.aws.s3_bucket
      • spark.jsl.settings.aws.region
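A minimal sketch (the key values and bucket name are placeholders) of setting the new AWS configs listed above while creating the SparkSession:

```python
from pyspark.sql import SparkSession

# The spark.jsl.settings.aws.* keys are the new configs listed above;
# all values here are placeholders, not real credentials.
spark = SparkSession.builder \
    .appName("Spark NLP") \
    .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.2") \
    .config("spark.jsl.settings.aws.credentials.access_key_id", "MY_ACCESS_KEY_ID") \
    .config("spark.jsl.settings.aws.credentials.secret_access_key", "MY_SECRET_ACCESS_KEY") \
    .config("spark.jsl.settings.aws.credentials.session_token", "MY_SESSION_TOKEN") \
    .config("spark.jsl.settings.aws.s3_bucket", "my-private-logs-bucket") \
    .config("spark.jsl.settings.aws.region", "us-east-1") \
    .getOrCreate()
```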

    Models Hub for the community, by the community

    Serve Your Spark NLP Models for Free! You can host and share your Spark NLP models & pipelines publicly with everyone to reuse them with one line of code!

    We are opening Models Hub to everyone to upload their models and pipelines, showcase their work, and share them with others.

    Please visit the following page for more information: https://modelshub.johnsnowlabs.com/



    Bug Fixes & Enhancements

    • Improve loading merges file for RoBERTa tokenizer
• Remove the batchSize param from the broadcast in XlmRoBertaEmbeddings so it can be set after the annotator is created
    • Preserve previously generated metadata in BertSentenceEmbeddings annotator
    • Set elmo as a default poolingLayer in ElmoEmbeddings
    • Fix special tokens ids in XlmRoBertaEmbeddings annotator
    • Fix distilbert_base_token_classifier_ontonotes model
    • Fix distilbert_base_token_classifier_conll03 model
    • Fix distilbert_base_token_classifier_few_nerd model
    • Fix distilbert_token_classifier_persian_ner model
    • Fix ner_conll_longformer_base_4096 model

    Models and Pipelines

Spark NLP 3.2.2 comes with new Turkish text classification pipelines, BERT Expert Word and Sentence embeddings (such as Wiki Books and PubMed), a new BERT model for 17 Indian languages, and Sentence Detection models for 15 new languages.

    Pipelines

| Name | Build | Lang |
|:-------------------|:-----------------|:------|
| classifierdl_berturk_cyberbullying_pipeline | 3.1.3 | tr |
| classifierdl_bert_news_pipeline | 3.1.3 | de |
| classifierdl_electra_questionpair_pipeline | 3.2.0 | en |
| classifierdl_bert_news_pipeline | 3.2.0 | tr |

    Named Entity Recognition

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| NerDLModel | ner_conll_elmo | 3.2.2 | en |
| NerDLModel | ner_conll_albert_base_uncased | 3.2.2 | en |
| NerDLModel | ner_conll_albert_large_uncased | 3.2.2 | en |
| NerDLModel | ner_conll_xlnet_base_cased | 3.2.2 | en |

    BERT Embeddings

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| BertEmbeddings | bert_muril | 3.2.0 | xx |
| BertEmbeddings | bert_wiki_books_sst2 | 3.2.0 | en |
| BertEmbeddings | bert_wiki_books_squad2 | 3.2.0 | en |
| BertEmbeddings | bert_wiki_books_qqp | 3.2.0 | en |
| BertEmbeddings | bert_wiki_books_qnli | 3.2.0 | en |
| BertEmbeddings | bert_wiki_books_mnli | 3.2.0 | en |
| BertEmbeddings | bert_wiki_books | 3.2.0 | en |
| BertEmbeddings | bert_pubmed_squad2 | 3.2.0 | en |
| BertEmbeddings | bert_pubmed | 3.2.0 | en |
| BertSentenceEmbeddings | sent_bert_wiki_books_sst2 | 3.2.0 | en |
| BertSentenceEmbeddings | sent_bert_wiki_books_squad2 | 3.2.0 | en |
| BertSentenceEmbeddings | sent_bert_wiki_books_qqp | 3.2.0 | en |
| BertSentenceEmbeddings | sent_bert_wiki_books_qnli | 3.2.0 | en |
| BertSentenceEmbeddings | sent_bert_wiki_books_mnli | 3.2.0 | en |
| BertSentenceEmbeddings | sent_bert_wiki_books | 3.2.0 | en |
| BertSentenceEmbeddings | sent_bert_pubmed_squad2 | 3.2.0 | en |
| BertSentenceEmbeddings | sent_bert_pubmed | 3.2.0 | en |
| BertSentenceEmbeddings | sent_bert_muril | 3.2.0 | xx |

    Sentence Detection

Yiddish, Ukrainian, Telugu, Tamil, Somali, Sindhi, Russian, Punjabi, Nepali, Marathi, Malayalam, Kannada, Indonesian, Gujarati, Bosnian

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | yi |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | uk |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | te |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | ta |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | so |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | sd |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | ru |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | pa |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | ne |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | mr |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | ml |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | kn |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | id |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | gu |
| SentenceDetectorDLModel | sentence_detector_dl | 3.2.0 | bs |

    The complete list of all 3700+ models & pipelines in 200+ languages is available on Models Hub.


    Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==3.2.2
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.2
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.2.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.2.2
    

    spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.2.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.2.2
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.2.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.2.2
    

    spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.2.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.2.2
    

    GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.2.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.2.2
    

    Maven

    spark-nlp on Apache Spark 3.0.x and 3.1.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>3.2.2</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>3.2.2</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.4.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark24_2.11</artifactId>
        <version>3.2.2</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
        <version>3.2.2</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark23_2.11</artifactId>
        <version>3.2.2</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
        <version>3.2.2</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.2.2.jar

    • GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.2.2.jar

    • CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.2.2.jar

    • GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.2.2.jar

    • CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.2.2.jar

    • GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.2.2.jar

  • 3.2.1(Aug 11, 2021)


    Patch release

    • Fix unsupported model error in pretrained function for LongformerEmbeddings, BertForTokenClassification, and DistilBertForTokenClassification https://github.com/JohnSnowLabs/spark-nlp/issues/5947

    Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==3.2.1
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.1
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.2.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.2.1
    

    spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.2.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.2.1
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.2.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.2.1
    

    spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.2.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.2.1
    

    GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.2.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.2.1
    

    Maven

    spark-nlp on Apache Spark 3.0.x and 3.1.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>3.2.1</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>3.2.1</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.4.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark24_2.11</artifactId>
        <version>3.2.1</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
        <version>3.2.1</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark23_2.11</artifactId>
        <version>3.2.1</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
        <version>3.2.1</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.2.1.jar

    • GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.2.1.jar

    • CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.2.1.jar

    • GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.2.1.jar

    • CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.2.1.jar

    • GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.2.1.jar

  • 3.2.0(Aug 10, 2021)


    Overview

We are very excited to release Spark NLP 🚀 3.2.0! This is a big release with new Longformer models for long documents, BertForTokenClassification & DistilBertForTokenClassification for existing or fine-tuned models on HuggingFace, GraphExtraction & GraphFinisher to find relevant relationships between words, support for multilingual date matching, new Pydoc for the Python APIs, and much more!

    As always, we would like to thank our community for their feedback, questions, and feature requests.


    Major features and improvements

    • NEW: Introducing LongformerEmbeddings annotator. Longformer is a transformer model for long documents. Longformer is a BERT-like model started from the RoBERTa checkpoint and pretrained for MLM on long documents. It supports sequences of length up to 4,096.

    We have trained two NER models based on Longformer Base and Large embeddings:

| Model | Accuracy | F1 Test | F1 Dev |
|:------|:----------|:------|:--------|
| ner_conll_longformer_base_4096 | 94.75% | 90.09 | 94.22 |
| ner_conll_longformer_large_4096 | 95.79% | 91.25 | 94.82 |

• NEW: Introducing BertForTokenClassification annotator. BertForTokenClassification can load BERT Models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks (see the sketch after this list). This annotator is compatible with all the models trained/fine-tuned by using BertForTokenClassification or TFBertForTokenClassification in HuggingFace 🤗
• NEW: Introducing DistilBertForTokenClassification annotator. DistilBertForTokenClassification can load DistilBERT models with a token classification head on top (a linear layer on top of the hidden-states output) e.g. for Named-Entity-Recognition (NER) tasks. This annotator is compatible with all the models trained/fine-tuned by using DistilBertForTokenClassification or TFDistilBertForTokenClassification in HuggingFace 🤗
• NEW: Introducing GraphExtraction and GraphFinisher annotators to extract a dependency graph between entities. The GraphExtraction class takes e.g. extracted entities from a NerDLModel and creates a dependency tree that describes how the entities relate to each other. For that, a triple store format is used: nodes represent the entities and the edges represent the relations between those entities. The graph can then be used to find relevant relationships between words
• NEW: Introducing support for multilingual DateMatcher and MultiDateMatcher annotators. These two annotators support English, French, Italian, Spanish, German, and Portuguese
    • NEW: Introducing new Python APIs and fully documented Pydoc
• NEW: Introducing new Spark NLP configurations via spark.conf() by deprecating application.conf usage. You can easily change Spark NLP configurations in SparkSession. For more examples please visit Spark NLP Configuration
• Add support for Amazon S3 to the log_folder Spark NLP config and the outputLogsPath param in the NerDLApproach, ClassifierDLApproach, MultiClassifierDLApproach, and SentimentDLApproach annotators
    • Added cache_folder, log_folder, and cluster_tmp_dir to sparknlp.start() function to set Spark NLP configurations
    • Added examples to all Spark NLP Scaladoc
    • Added examples to all Spark NLP Pydoc
    • Welcoming new Databricks runtimes to our Spark NLP family:
      • Databricks 8.4 ML & GPU
• Fix sparknlp.version() returning a wrong version
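To make a few of the items above concrete, here is a minimal sketch (the folder paths are placeholders; the model name comes from the token classification table below) that starts a session with the new sparknlp.start() parameters and loads one of the new token classifiers:

```python
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertForTokenClassification
from pyspark.ml import Pipeline

# cache_folder and log_folder are the new sparknlp.start() parameters;
# both values here are placeholders, not defaults. S3 paths are supported
# for logs per the note above.
spark = sparknlp.start(cache_folder="/opt/sparknlp_cache",
                       log_folder="s3://my-bucket/ner-logs")

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")

# One of the new fine-tuned NER models (see the tables below).
ner = BertForTokenClassification \
    .pretrained("bert_base_token_classifier_conll03", "en") \
    .setInputCols(["document", "token"]) \
    .setOutputCol("ner")

pipeline = Pipeline(stages=[document, tokenizer, ner])
```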

    Models and Pipelines

    Spark NLP 3.2.0 comes with new LongformerEmbeddings, BertForTokenClassification, and DistilBertForTokenClassification annotators.

    New Longformer Models

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| LongformerEmbeddings | longformer_base_4096 | 3.2.0 | en |
| LongformerEmbeddings | longformer_large_4096 | 3.2.0 | en |

    Featured NerDL Models

    New NER models for CoNLL (4 entities) and OntoNotes (18 entities) trained by using BERT, RoBERTa, DistilBERT, XLM-RoBERTa, and Longformer Embeddings:

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| NerDLModel | ner_ontonotes_roberta_base | 3.2.0 | en |
| NerDLModel | ner_ontonotes_roberta_large | 3.2.0 | en |
| NerDLModel | ner_ontonotes_distilbert_base_cased | 3.2.0 | en |
| NerDLModel | ner_conll_bert_base_cased | 3.2.0 | en |
| NerDLModel | ner_conll_distilbert_base_cased | 3.2.0 | en |
| NerDLModel | ner_conll_roberta_base | 3.2.0 | en |
| NerDLModel | ner_conll_roberta_large | 3.2.0 | en |
| NerDLModel | ner_conll_xlm_roberta_base | 3.2.0 | en |
| NerDLModel | ner_conll_longformer_base_4096 | 3.2.0 | en |
| NerDLModel | ner_conll_longformer_large_4096 | 3.2.0 | en |

    BERT and DistilBERT for Token Classification

    New BERT and DistilBERT fine-tuned for the Named Entity Recognition (NER) in English, Persian, Spanish, Swedish, and Turkish:

| Model | Name | Build | Lang |
|:-----------------------------|:-------------------|:-----------------|:------|
| BertForTokenClassification | bert_base_token_classifier_conll03 | 3.2.0 | en |
| BertForTokenClassification | bert_large_token_classifier_conll03 | 3.2.0 | en |
| BertForTokenClassification | bert_base_token_classifier_ontonote | 3.2.0 | en |
| BertForTokenClassification | bert_large_token_classifier_ontonote | 3.2.0 | en |
| BertForTokenClassification | bert_token_classifier_parsbert_armanner | 3.2.0 | fa |
| BertForTokenClassification | bert_token_classifier_parsbert_ner | 3.2.0 | fa |
| BertForTokenClassification | bert_token_classifier_parsbert_peymaner | 3.2.0 | fa |
| BertForTokenClassification | bert_token_classifier_turkish_ner | 3.2.0 | tr |
| BertForTokenClassification | bert_token_classifier_spanish_ner | 3.2.0 | es |
| BertForTokenClassification | bert_token_classifier_swedish_ner | 3.2.0 | sv |
| BertForTokenClassification | bert_base_token_classifier_few_nerd | 3.2.0 | en |
| DistilBertForTokenClassification | distilbert_base_token_classifier_few_nerd | 3.2.0 | en |
| DistilBertForTokenClassification | distilbert_base_token_classifier_conll03 | 3.2.0 | en |
| DistilBertForTokenClassification | distilbert_base_token_classifier_ontonotes | 3.2.0 | en |
| DistilBertForTokenClassification | distilbert_token_classifier_persian_ner | 3.2.0 | fa |

    The complete list of all 3700+ models & pipelines in 200+ languages is available on Models Hub.

    New Notebooks

    Import hundreds of models in different languages to Spark NLP

Spark NLP | HuggingFace Notebooks | Colab
:------------ | :-------------| :----------
LongformerEmbeddings | HuggingFace in Spark NLP - Longformer | Open In Colab
BertForTokenClassification | HuggingFace in Spark NLP - BertForTokenClassification | Open In Colab
DistilBertForTokenClassification | HuggingFace in Spark NLP - DistilBertForTokenClassification | Open In Colab

    You can visit Import Transformers in Spark NLP for more info

    New Multilingual DateMatcher and MultiDateMatcher

Spark NLP | Jupyter Notebooks
:------------ | :-------------
MultiDateMatcher | Date Matcher in English
MultiDateMatcher | Date Matcher in French
MultiDateMatcher | Date Matcher in German
MultiDateMatcher | Date Matcher in Italian
MultiDateMatcher | Date Matcher in Portuguese
MultiDateMatcher | Date Matcher in Spanish
GraphExtraction | Graph Extraction Intro
GraphExtraction | Graph Extraction
GraphExtraction | Graph Extraction Explode Entities


    Deprecation

    The use of application.conf has been deprecated in Spark NLP 3.2.0 release. You can set those configurations via Spark Conf during SparkSession creation. For the full list and examples please visit the Spark NLP Configuration.


    Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==3.2.0
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.2.0
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.2.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.2.0
    

    spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.2.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.2.0
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.2.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.2.0
    

    spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.2.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.2.0
    

    GPU

spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.2.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.2.0
    

    Maven

    spark-nlp on Apache Spark 3.0.x and 3.1.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>3.2.0</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>3.2.0</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.4.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark24_2.11</artifactId>
        <version>3.2.0</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
        <version>3.2.0</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark23_2.11</artifactId>
        <version>3.2.0</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
        <version>3.2.0</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.2.0.jar

    • GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.2.0.jar

    • CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.2.0.jar

    • GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.2.0.jar

    • CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.2.0.jar

    • GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.2.0.jar

    Source code(tar.gz)
    Source code(zip)
  • 3.1.3 (Jul 20, 2021)


    Overview

    We are pleased to release Spark NLP 🚀 3.1.3! This release brings notebooks to easily import BERT and ALBERT models from TF Hub into Spark NLP, new multilingual NER models for 40 languages based on a fine-tuned XLM-RoBERTa model, and new state-of-the-art document/sentence embedding models for English and 100+ languages!

    As always, we would like to thank our community for their feedback, questions, and feature requests.


    New Features

    • Support for importing BERT models from TF Hub into Spark NLP
    • Support for importing BERT sentence embedding models from TF Hub into Spark NLP
    • Support for importing ALBERT models from TF Hub into Spark NLP
    • Welcoming the new Databricks 8.4 / 8.4 ML/GPU runtimes to the Spark NLP family

    New Models

    We have trained multilingual NER models using the entire XTREME (40 languages) and WikiNER (8 languages) datasets.

    Multilingual Named Entity Recognition:

    | Model | Name | Build | Lang |
    |:-----------------------------|:-------------------|:-----------------|:------|
    | NerDLModel | ner_xtreme_xlm_roberta_xtreme_base | 3.1.3 | xx |
    | NerDLModel | ner_xtreme_glove_840B_300 | 3.1.3 | xx |
    | NerDLModel | ner_wikiner_xlm_roberta_base | 3.1.3 | xx |
    | NerDLModel | ner_wikiner_glove_840B_300 | 3.1.3 | xx |
    | NerDLModel | ner_mit_movie_simple_distilbert_base_cased | 3.1.3 | en |
    | NerDLModel | ner_mit_movie_complex_distilbert_base_cased | 3.1.3 | en |
    | NerDLModel | ner_mit_movie_complex_bert_base_cased | 3.1.3 | en |

    An XLM-RoBERTa base model fine-tuned by randomly masking 15% of the XTREME dataset:

    | Model | Name | Build | Lang |
    |:-----------------------------|:-------------------|:-----------------|:------|
    | XlmRoBertaEmbeddings | xlm_roberta_xtreme_base | 3.1.3 | xx |

    New Universal Sentence Encoder trained with CMLM (English & 100+ languages):

    These models extend the BERT transformer architecture, which is why they are used with BertSentenceEmbeddings.

    | Model | Name | Build | Lang |
    |:-----------------------------|:-------------------|:-----------------|:------|
    | BertSentenceEmbeddings | sent_bert_use_cmlm_en_base | 3.1.3 | en |
    | BertSentenceEmbeddings | sent_bert_use_cmlm_en_large | 3.1.3 | en |
    | BertSentenceEmbeddings | sent_bert_use_cmlm_multi_base | 3.1.3 | xx |
    | BertSentenceEmbeddings | sent_bert_use_cmlm_multi_base_br | 3.1.3 | xx |
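
    Since these models run through BertSentenceEmbeddings, loading one works like any other sentence embedding model (a minimal sketch, assuming a standard document column):

    from sparknlp.annotator import BertSentenceEmbeddings

    embeddings = BertSentenceEmbeddings.pretrained("sent_bert_use_cmlm_en_base", "en") \
        .setInputCols(["document"]) \
        .setOutputCol("sentence_embeddings")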


    Benchmark

    We trained ClassifierDL on a news dataset using BERT base, BERT large, and the new Universal Sentence Encoder models trained with CMLM (which extend the BERT transformer architecture):

    (120k training examples, 10 epochs, 512 max sequence length, NVIDIA Tesla P100)

    | Model | Accuracy | F1 | Duration |
    |:-----------------------------|:-------------------|:-----------------|:------|
    | tfhub_use | 0.90 | 0.89 | 10 min |
    | tfhub_use_lg | 0.91 | 0.90 | 24 min |
    | sent_bert_base_cased | 0.92 | 0.90 | 35 min |
    | sent_bert_large_cased | 0.93 | 0.91 | 75 min |
    | sent_bert_use_cmlm_en_base | 0.934 | 0.91 | 36 min |
    | sent_bert_use_cmlm_en_large | 0.945 | 0.92 | 72 min |

    The complete list of all 3700+ models & pipelines in 200+ languages is available on Models Hub.


    Bug Fixes

    • Fix serialization issue in NorvigSweetingModel
    • Fix the issue with BertSentenceEmbeddings model in TF v2
    • Update ArrayType structure to fix Finisher failing to clean up some annotators

    New Notebooks

    Spark NLP | TF Hub Notebooks :------------ | :-------------| BertEmbeddings | TF Hub in Spark NLP - BERT BertSentenceEmbeddings | TF Hub in Spark NLP - BERT Sentence AlbertEmbeddings | TF Hub in Spark NLP - ALBERT


    Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==3.1.3
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.3
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.1.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.1.3
    

    spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.1.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.1.3
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.1.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.1.3
    

    spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.1.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.1.3
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.1.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.1.3
    

    Maven

    spark-nlp on Apache Spark 3.0.x and 3.1.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>3.1.3</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>3.1.3</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.4.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark24_2.11</artifactId>
        <version>3.1.3</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
        <version>3.1.3</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark23_2.11</artifactId>
        <version>3.1.3</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
        <version>3.1.3</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.1.3.jar

    • GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.1.3.jar

    • CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.1.3.jar

    • GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.1.3.jar

    • CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.1.3.jar

    • GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.1.3.jar

    Source code(tar.gz)
    Source code(zip)
  • 3.1.2 (Jul 7, 2021)


    Overview

    We are pleased to release Spark NLP 🚀 3.1.2! We have a new and much-improved XLNet annotator with support for HuggingFace 🤗 models in Spark NLP. We managed to make XlnetEmbeddings almost 5x faster on GPU compared to prior releases!

    As always, we would like to thank our community for their feedback, questions, and feature requests.


    New Features

    • Migrate XlnetEmbeddings to TensorFlow v2. This allows importing HuggingFace XLNet models into Spark NLP
    • Migrate XlnetEmbeddings to BatchAnnotate to allow better performance on accelerated hardware such as GPUs
    • Dynamically extract special tokens from the SentencePiece model in XlmRoBertaEmbeddings
    • Add setIncludeAllConfidenceScores param in NerDLModel to switch between emitting confidence scores for every label or only for the predicted label (see the sketch after this list)
    • Fully updated the Annotators page with full examples in Python and Scala
    • Fully updated the Transformers page for all the transformers in Spark NLP
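
    A minimal sketch of the new confidence-score switch (the pretrained model name and column names are illustrative):

    from sparknlp.annotator import NerDLModel

    ner = NerDLModel.pretrained("ner_dl", "en") \
        .setInputCols(["sentence", "token", "embeddings"]) \
        .setOutputCol("ner") \
        .setIncludeConfidence(True) \
        .setIncludeAllConfidenceScores(True)  # emit scores for every label, not just the predicted one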

    Bug Fixes & Enhancements

    • Fix issue with SymmetricDeleteModel
    • Fix issue with encoding unknown bytes in RoBertaEmbeddings
    • Fix issue with multi-lingual UniversalSentenceEncoder models
    • Sync params between Python and Scala for ContextSpellChecker
      • change setWordMaxDist to setWordMaxDistance in Scala
      • change setLMClasses to setLanguageModelClasses in Scala
      • change setBlackListMinFreq to setCompoundCount in Scala
      • change setClassThreshold to setClassCount in Scala
      • change setWeights to setWeightedDistPath in Scala
      • change setInitialBatchSize to setBatchSize in Python
    • Sync params between Python and Scala for ViveknSentimentApproach
      • change setCorpusPrune to setPruneCorpus in Scala
    • Sync params between Python and Scala for RegexMatcher
      • change setRules to setExternalRules in Scala
    • Sync params between Python and Scala for WordSegmenterApproach
      • change setPosCol to setPosColumn
      • change setIterations to setNIterations
    • Sync params between Python and Scala for PerceptronApproach
      • change setPosCol to setPosColumn
    • Fix typos in docs: https://github.com/JohnSnowLabs/spark-nlp/pull/5766 and https://github.com/JohnSnowLabs/spark-nlp/pull/5775 thanks to @brollb

    Performance Improvements

    Spark NLP 3.1.2 introduces a new batch annotation technique for the XlnetEmbeddings annotator that radically improves inference performance. From now on, the batchSize for this annotator means the number of rows fed into the model per prediction rather than sentences per row, so you can tune the throughput to fully utilize accelerated hardware such as GPUs.
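
    A minimal sketch of tuning the new row-based batchSize on a GPU (the value 16 is only an example; choose it based on your hardware and memory):

    from sparknlp.annotator import XlnetEmbeddings

    embeddings = XlnetEmbeddings.pretrained("xlnet_base_cased", "en") \
        .setInputCols(["sentence", "token"]) \
        .setOutputCol("embeddings") \
        .setBatchSize(16)  # number of rows fed to the model per batch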


    Backward compatibility

    We have migrated XlnetEmbeddings to TensorFlow v2; models from releases prior to 3.1.2 won't work after this release. We have already updated the models and uploaded them to Models Hub. You can use pretrained(), which takes care of this automatically, or make sure you download the new models manually.


    Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==3.1.2
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.2
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.1.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.1.2
    

    spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.1.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.1.2
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.1.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.1.2
    

    spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.1.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.1.2
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.1.2
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.1.2
    

    Maven

    spark-nlp on Apache Spark 3.0.x and 3.1.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>3.1.2</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>3.1.2</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.4.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark24_2.11</artifactId>
        <version>3.1.2</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
        <version>3.1.2</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark23_2.11</artifactId>
        <version>3.1.2</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
        <version>3.1.2</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.1.2.jar

    • GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.1.2.jar

    • CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.1.2.jar

    • GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.1.2.jar

    • CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.1.2.jar

    • GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.1.2.jar

    Source code(tar.gz)
    Source code(zip)
  • 3.1.1 (Jun 23, 2021)


    Overview

    We are pleased to release Spark NLP 🚀 3.1.1! We have a new and much-improved ALBERT annotator with support for HuggingFace 🤗 models in Spark NLP. We managed to make AlbertEmbeddings almost 7x faster on GPU compared to prior releases!

    As always, we would like to thank our community for their feedback, questions, and feature requests.


    New Features

    • Migrate AlbertEmbeddings to TensorFlow v2. This allows importing HuggingFace ALBERT models into Spark NLP
    • Migrate AlbertEmbeddings to BatchAnnotate to allow better performance on accelerated hardware such as GPUs
    • Enable real-time stdout/stderr for child processes via sparknlp.start(). Thanks to PySpark 3.x, calling sparknlp.start(real_time_output=True) now surfaces Spark NLP outputs (such as metrics during training) right in your Jupyter, Colab, and Kaggle notebooks (see the sketch after this list)
    • Complete examples for all annotators in Scaladoc APIs https://github.com/JohnSnowLabs/spark-nlp/pull/5668
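
    A minimal sketch of the new real-time output option (requires PySpark 3.x):

    import sparknlp

    # Stream stdout/stderr from the JVM child process into the notebook,
    # e.g. to watch training metrics as they are produced
    spark = sparknlp.start(real_time_output=True)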

    Bug Fixes & Enhancements

    • Fix YakeModel issue with empty token https://github.com/JohnSnowLabs/spark-nlp/pull/5683 thanks to @shaddoxac
    • Fix getAnchorDateMonth method in DateMatcher and MultiDateMatcher https://github.com/JohnSnowLabs/spark-nlp/pull/5693
    • Fix the broken PubTator class in Python https://github.com/JohnSnowLabs/spark-nlp/pull/5702
    • Fix relative dates in DateMatcher and MultiDateMatcher such as day after tomorrow or day before yesterday https://github.com/JohnSnowLabs/spark-nlp/pull/5706
    • Add isPaddedToken param to PubTator https://github.com/JohnSnowLabs/spark-nlp/pull/5702
    • Fix an issue with the logger inside sessions on some setups https://github.com/JohnSnowLabs/spark-nlp/pull/5715
    • Add signatures to TF session to handle inputs/outputs more dynamically in BertEmbeddings, DistilBertEmbeddings, RoBertaEmbeddings, and XlmRoBertaEmbeddings https://github.com/JohnSnowLabs/spark-nlp/pull/5715
    • Fix XlmRoBertaEmbeddings issue with init_all_tables https://github.com/JohnSnowLabs/spark-nlp/pull/5715
    • Add the missing YakeModel to the annotators module
    • Add missing random seed param to ClassifierDLApproach, MultiClassifierDLApproach, and SentimentDLApproach https://github.com/JohnSnowLabs/spark-nlp/pull/5697
    • Make the Java Exceptions appear before Py4J exceptions for ease of debugging in Python https://github.com/JohnSnowLabs/spark-nlp/pull/5709
    • Make sure batchSize set in NerDLModel is the same internally to feed TensorFlow https://github.com/JohnSnowLabs/spark-nlp/pull/5716
    • Fix a typo in documentation https://github.com/JohnSnowLabs/spark-nlp/pull/5664 thanks to @roger-yu-ds

    Performance Improvements

    Spark NLP 3.1.1 introduces a new batch annotation technique for the AlbertEmbeddings annotator that radically improves inference performance. From now on, the batchSize for this annotator means the number of rows fed into the model per prediction rather than sentences per row, so you can tune the throughput to fully utilize accelerated hardware such as GPUs.

    Performance gains in Spark NLP 3.1.1 compared to Spark NLP 2.x/3.0.x:

    (Performed on a Databricks cluster)

    | Spark NLP 2.x/3.0.x vs. 3.1.1 | CPU | GPU |
    |------------------|-------------------------|------------------------|
    | ALBERT Base | 22% | 340% |
    | ALBERT Large | 20% | 770% |

    We will update this benchmark table in future pre-releases.


    Backward compatibility

    We have migrated AlbertEmbeddings to TensorFlow v2; models from releases prior to 3.1.1 won't work after this release. We have already updated the models and uploaded them to Models Hub. You can use pretrained(), which takes care of this automatically, or make sure you download the new models manually.


    Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==3.1.1
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.1
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.1.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.1.1
    

    spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.1.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.1.1
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.1.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.1.1
    

    spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.1.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.1.1
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.1.1
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.1.1
    

    Maven

    spark-nlp on Apache Spark 3.0.x and 3.1.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>3.1.1</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>3.1.1</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.4.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark24_2.11</artifactId>
        <version>3.1.1</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
        <version>3.1.1</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark23_2.11</artifactId>
        <version>3.1.1</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
        <version>3.1.1</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.1.1.jar

    • GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.1.1.jar

    • CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.1.1.jar

    • GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.1.1.jar

    • CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.1.1.jar

    • GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.1.1.jar

    Source code(tar.gz)
    Source code(zip)
  • 3.1.0 (Jun 7, 2021)


    Overview

    We are very excited to release Spark NLP 🚀 3.1.0! This is one of our biggest releases, packed with models, pipelines, and groundwork for future features, and we are proud to share it with our community.

    Spark NLP 3.1.0 comes with 2600+ new pretrained models and pipelines in 200+ languages, new DistilBERT, RoBERTa, and XLM-RoBERTa annotators, support for HuggingFace 🤗 (Autoencoding) models in Spark NLP, and extended support for new Databricks and EMR instances.

    As always, we would like to thank our community for their feedback, questions, and feature requests.


    Major features and improvements

    • NEW: Introducing the DistilBertEmbeddings annotator. DistilBERT is a small, fast, cheap, and light Transformer model trained by distilling BERT base. It has 40% fewer parameters than bert-base-uncased and runs 60% faster while preserving over 95% of BERT’s performance
    • NEW: Introducing RoBERTaEmbeddings annotator. RoBERTa (Robustly Optimized BERT-Pretraining Approach) models deliver state-of-the-art performance on NLP/NLU tasks and a sizable performance improvement on the GLUE benchmark. With a score of 88.5, RoBERTa reached the top position on the GLUE leaderboard
    • NEW: Introducing XlmRoBERTaEmbeddings annotator. XLM-RoBERTa (Unsupervised Cross-lingual Representation Learning at Scale) is a large multi-lingual language model, trained on 2.5TB of filtered CommonCrawl data with 100 different languages. It also outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +13.8% average accuracy on XNLI, +12.3% average F1 score on MLQA, and +2.1% average F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 11.8% in XNLI accuracy for Swahili and 9.2% for Urdu over the previous XLM model
    • NEW: Introducing support for HuggingFace exported models in equivalent Spark NLP annotators. Starting with this release, you can easily use the saved_model feature in HuggingFace and import any BERT, DistilBERT, RoBERTa, or XLM-RoBERTa model into Spark NLP within a few lines of code (see the sketch after this list). We will work on the remaining annotators and extend this support with each release. For more information please visit this discussion
    • NEW: Migrate MarianTransformer to BatchAnnotate so you can control throughput and fully utilize accelerated hardware such as GPUs
    • Upgrade to TensorFlow v2.4.1 with native support for Java to take advantage of many optimizations for CPU/GPU and new features/models introduced in TF v2.x
    • Update to CUDA11 and cuDNN 8.0.2 for GPU support
    • Implement ModelSignatureManager to automatically detect inputs, outputs, save and restore tensors from SavedModel in TF v2. This allows Spark NLP 3.1.x to extend support for external Encoders such as HuggingFace and TF Hub (coming soon!)
    • Implement a new BPE tokenizer for RoBERTa and XLM models. This tokenizer will use the custom tokens from Tokenizer or RegexTokenizer and generates token pieces, encodes, and decodes the results
    • Welcoming new Databricks runtimes to our Spark NLP family:
      • Databricks 8.1 ML & GPU
      • Databricks 8.2 ML & GPU
      • Databricks 8.3 ML & GPU
    • Welcoming a new EMR 6.x series to our Spark NLP family:
      • EMR 6.3.0 (Apache Spark 3.1.1 / Hadoop 3.2.1)
    • Added examples to Spark NLP Scaladoc
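
    As a hedged sketch of the HuggingFace import flow mentioned above (the local paths are placeholders; the exact export steps are in the linked discussion and notebooks):

    from sparknlp.annotator import BertEmbeddings

    # Assumes the HuggingFace model was exported with its saved_model feature,
    # e.g. model.save_pretrained("./bert_hf", saved_model=True)
    bert = BertEmbeddings.loadSavedModel("./bert_hf/saved_model/1", spark) \
        .setInputCols(["sentence", "token"]) \
        .setOutputCol("embeddings")

    # Persist as a Spark NLP model for later use via BertEmbeddings.load(...)
    bert.write().overwrite().save("./bert_spark_nlp")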

    Models and Pipelines

    Spark NLP 3.1.0 comes with 2600+ new pretrained models and pipelines in 200+ languages, available for Windows, Linux, and macOS users.

    Featured Transformers

    | Model | Name | Build | Lang |
    |:-----------------------------|:-------------------|:-----------------|:------|
    | BertEmbeddings | bert_base_dutch_cased | 3.1.0 | nl |
    | BertEmbeddings | bert_base_german_cased | 3.1.0 | de |
    | BertEmbeddings | bert_base_german_uncased | 3.1.0 | de |
    | BertEmbeddings | bert_base_italian_cased | 3.1.0 | it |
    | BertEmbeddings | bert_base_italian_uncased | 3.1.0 | it |
    | BertEmbeddings | bert_base_turkish_cased | 3.1.0 | tr |
    | BertEmbeddings | bert_base_turkish_uncased | 3.1.0 | tr |
    | BertEmbeddings | chinese_bert_wwm | 3.1.0 | zh |
    | BertEmbeddings | bert_base_chinese | 3.1.0 | zh |
    | DistilBertEmbeddings | distilbert_base_cased | 3.1.0 | en |
    | DistilBertEmbeddings | distilbert_base_uncased | 3.1.0 | en |
    | DistilBertEmbeddings | distilbert_base_multilingual_cased | 3.1.0 | xx |
    | RoBertaEmbeddings | roberta_base | 3.1.0 | en |
    | RoBertaEmbeddings | roberta_large | 3.1.0 | en |
    | RoBertaEmbeddings | distilroberta_base | 3.1.0 | en |
    | XlmRoBertaEmbeddings | xlm_roberta_base | 3.1.0 | xx |
    | XlmRoBertaEmbeddings | twitter_xlm_roberta_base | 3.1.0 | xx |

    Featured Translation Models

    | Model | Name | Build | Lang |
    |:-----------------------------|:-------------------|:-----------------|:------|
    | MarianTransformer | Chinese to Vietnamese | 3.1.0 | xx |
    | MarianTransformer | Chinese to Ukrainian | 3.1.0 | xx |
    | MarianTransformer | Chinese to Dutch | 3.1.0 | xx |
    | MarianTransformer | Chinese to English | 3.1.0 | xx |
    | MarianTransformer | Chinese to Finnish | 3.1.0 | xx |
    | MarianTransformer | Chinese to Italian | 3.1.0 | xx |
    | MarianTransformer | Yoruba to English | 3.1.0 | xx |
    | MarianTransformer | Yapese to French | 3.1.0 | xx |
    | MarianTransformer | Waray to Spanish | 3.1.0 | xx |
    | MarianTransformer | Ukrainian to English | 3.1.0 | xx |
    | MarianTransformer | Hindi to Urdu | 3.1.0 | xx |
    | MarianTransformer | Italian to Ukrainian | 3.1.0 | xx |
    | MarianTransformer | Italian to Icelandic | 3.1.0 | xx |

    Transformers in Spark NLP

    Import hundreds of models in different languages to Spark NLP

    Spark NLP | HuggingFace Notebooks
    :------------ | :-------------
    BertEmbeddings | HuggingFace in Spark NLP - BERT
    BertSentenceEmbeddings | HuggingFace in Spark NLP - BERT Sentence
    DistilBertEmbeddings | HuggingFace in Spark NLP - DistilBERT
    RoBertaEmbeddings | HuggingFace in Spark NLP - RoBERTa
    XlmRoBertaEmbeddings | HuggingFace in Spark NLP - XLM-RoBERTa

    The complete list of all 3700+ models & pipelines in 200+ languages is available on Models Hub.


    Backward compatibility

    • We have updated our MarianTransformer annotator to be compatible with TF v2 models. This change is not compatible with previous models/pipelines; however, we have updated and uploaded all the models and pipelines for the 3.1.x release. You can either call MarianTransformer.pretrained(MODEL_NAME), which automatically downloads a compatible model, or visit Models Hub to download compatible models for offline use via MarianTransformer.load(PATH) (see the sketch below)
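
    A minimal sketch of both options (opus_mt_en_fr is one of the Models Hub translation models; the local path is a placeholder):

    from sparknlp.annotator import MarianTransformer

    # Option 1: auto-download a TF v2-compatible model
    marian = MarianTransformer.pretrained("opus_mt_en_fr", "xx") \
        .setInputCols(["document"]) \
        .setOutputCol("translation")

    # Option 2: load a model downloaded from Models Hub for offline use
    # marian = MarianTransformer.load("/path/to/opus_mt_en_fr")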

    Documentation


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==3.1.0
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.1.0
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.1.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.1.0
    

    spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.1.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.1.0
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.1.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.1.0
    

    spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.1.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.1.0
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.1.0
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.1.0
    

    Maven

    spark-nlp on Apache Spark 3.0.x and 3.1.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>3.1.0</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>3.1.0</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.4.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark24_2.11</artifactId>
        <version>3.1.0</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
        <version>3.1.0</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark23_2.11</artifactId>
        <version>3.1.0</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
        <version>3.1.0</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.1.0.jar

    • GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.1.0.jar

    • CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.1.0.jar

    • GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.1.0.jar

    • CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.1.0.jar

    • GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.1.0.jar

    Source code(tar.gz)
    Source code(zip)
  • 3.0.3 (May 6, 2021)


    Overview

    We are glad to release Spark NLP 3.0.3! We have added some new features to our T5 Transformer annotator to help with longer and more accurate text generation, trained some new multi-lingual models and pipelines in Farsi, Hebrew, Korean, and Turkish, and fixed some bugs in this release.

    As always, we would like to thank our community for their feedback, questions, and feature requests.


    New Features

    • Add 6 new features to T5Transformer for longer and better text generation (see the sketch after this list)
      • doSample: Whether or not to use sampling; use greedy decoding otherwise
      • temperature: The value used to modulate the next-token probabilities
      • topK: The number of highest probability vocabulary tokens to keep for top-k-filtering
      • topP: If set to float < 1, only the most probable tokens with probabilities that add up to top_p or higher are kept for generation
      • repetitionPenalty: The parameter for repetition penalty. 1.0 means no penalty. See CTRL: A Conditional Transformer Language Model for Controllable Generation paper for more details
      • noRepeatNgramSize: If set to int > 0, all ngrams of that size can only occur once
    • Spark NLP 3.0.3 is compatible with the new Databricks 8.2 (ML) runtime
    • Spark NLP 3.0.3 is compatible with the new EMR 5.33.0 (with Zeppelin 0.9.0) release
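
    A minimal sketch of the new generation controls (values are illustrative; t5_small is one of the pretrained T5 models, and the setters follow the usual Spark-style naming for the six parameters above):

    from sparknlp.annotator import T5Transformer

    t5 = T5Transformer.pretrained("t5_small", "en") \
        .setTask("summarize:") \
        .setInputCols(["document"]) \
        .setOutputCol("summary") \
        .setDoSample(True) \
        .setTemperature(0.7) \
        .setTopK(50) \
        .setRepetitionPenalty(1.2) \
        .setNoRepeatNgramSize(3)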

    Bug Fixes

    • Fix ChunkEmbeddings Array out of bounds exception https://github.com/JohnSnowLabs/spark-nlp/pull/2796
    • Fix pretrained tfhub_use_multi and tfhub_use_multi_lg models in UniversalSentenceEncoder https://github.com/JohnSnowLabs/spark-nlp/pull/2827
    • Fix anchorDateMonth in Python, which resulted in 1 additional month, and fix case sensitivity in relative dates such as next friday vs. next Friday https://github.com/JohnSnowLabs/spark-nlp/pull/2848

    Models and Pipelines

    New multilingual models and pipelines for Farsi, Hebrew, Korean, and Turkish

    | Model | Name | Build | Lang |
    |:-----------------------------|:-------------------|:-----------------|:------|
    | ClassifierDLModel | classifierdl_bert_news | 3.0.2 | tr |
    | UniversalSentenceEncoder | tfhub_use_multi | 3.0.0 | xx |
    | UniversalSentenceEncoder | tfhub_use_multi_lg | 3.0.0 | xx |

    | Pipeline | Name | Build | Lang |
    |:-----------------------------|:-------------------|:-----------------|:------|
    | PretrainedPipeline | recognize_entities_dl | 3.0.0 | fa |
    | PretrainedPipeline | explain_document_lg | 3.0.2 | he |
    | PretrainedPipeline | explain_document_lg | 3.0.2 | ko |

    The complete list of all 1100+ models & pipelines in 192+ languages is available on Models Hub.


    Documentation and Notebooks


    Installation

    Python

    #PyPI
    
    pip install spark-nlp==3.0.3
    

    Spark Packages

    spark-nlp on Apache Spark 3.0.x and 3.1.x (Scala 2.12 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:3.0.3
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.0.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu_2.12:3.0.3
    

    spark-nlp on Apache Spark 2.4.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.0.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark24_2.11:3.0.3
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.0.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark24_2.11:3.0.3
    

    spark-nlp on Apache Spark 2.3.x (Scala 2.11 only):

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.0.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-spark23_2.11:3.0.3
    

    GPU

    spark-shell --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.0.3
    
    pyspark --packages com.johnsnowlabs.nlp:spark-nlp-gpu-spark23_2.11:3.0.3
    

    Maven

    spark-nlp on Apache Spark 3.0.x and 3.1.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp_2.12</artifactId>
        <version>3.0.3</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu_2.12</artifactId>
        <version>3.0.3</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.4.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark24_2.11</artifactId>
        <version>3.0.3</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark24_2.11</artifactId>
        <version>3.0.3</version>
    </dependency>
    

    spark-nlp on Apache Spark 2.3.x:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-spark23_2.11</artifactId>
        <version>3.0.3</version>
    </dependency>
    

    spark-nlp-gpu:

    <dependency>
        <groupId>com.johnsnowlabs.nlp</groupId>
        <artifactId>spark-nlp-gpu-spark23_2.11</artifactId>
        <version>3.0.3</version>
    </dependency>
    

    FAT JARs

    • CPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-3.0.3.jar

    • GPU on Apache Spark 3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-assembly-3.0.3.jar

    • CPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark24-assembly-3.0.3.jar

    • GPU on Apache Spark 2.4.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark24-assembly-3.0.3.jar

    • CPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-spark23-assembly-3.0.3.jar

    • GPU on Apache Spark 2.3.x: https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-gpu-spark23-assembly-3.0.3.jar

    Source code(tar.gz)
    Source code(zip)