Overview

Stanford CoreNLP


Stanford CoreNLP provides a set of natural language analysis tools written in Java. It can take raw human language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize and interpret dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases or word dependencies, and indicate which noun phrases refer to the same entities. It was originally developed for English, but now also provides varying levels of support for (Modern Standard) Arabic, (mainland) Chinese, French, German, and Spanish.

Stanford CoreNLP is an integrated framework, which makes it very easy to apply a bunch of language analysis tools to a piece of text. Starting from plain text, you can run all the tools with just two lines of code. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.

Stanford CoreNLP is a set of stable and well-tested natural language processing tools, widely used by various groups in academia, industry, and government. The tools variously use rule-based, probabilistic machine learning, and deep learning components.
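As a minimal sketch of the "two lines of code" claim, a pipeline can be built and run roughly as follows. This assumes the CoreNLP jar and the default English models jar are on the classpath; the class and method names here are our own illustration, not part of CoreNLP.

```java
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class TwoLines {
    public static List<String> lemmasOf(String text) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props); // line one: build the pipeline
        CoreDocument doc = new CoreDocument(text);
        pipeline.annotate(doc);                                // line two: run every annotator
        return doc.tokens().stream().map(t -> t.lemma()).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(lemmasOf("The dogs were running."));
    }
}
```

Adding further annotators (ner, parse, coref, ...) to the `annotators` property enables the corresponding analyses without any code changes.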

The Stanford CoreNLP code is written in Java and licensed under the GNU General Public License (v3 or later). Note that this is the full GPL, which allows many free uses, but not its use in proprietary software that you distribute to others.

Build Instructions

Several times a year we distribute a new version of the software, which corresponds to a stable commit.

During the time between releases, one can always use the latest, under-development version of our code.

Here are some helpful instructions to use the latest code:

Provided build

Sometimes we will provide updated jars here which have the latest version of the code.

At present, the jars provided here correspond to our most recent release, though you can always build the very latest from GitHub HEAD yourself.

Build with Ant

  1. Make sure you have Ant installed, details here: http://ant.apache.org/
  2. Compile the code with this command: cd CoreNLP ; ant
  3. Then run this command to build a jar with the latest version of the code: cd CoreNLP/classes ; jar -cf ../stanford-corenlp.jar edu
  4. This will create a new jar called stanford-corenlp.jar in the CoreNLP folder, which contains the latest code.
  5. The dependencies that work with the latest code are in CoreNLP/lib and CoreNLP/liblocal, so make sure to include those in your CLASSPATH.
  6. When using the latest version of the code, make sure to download the latest versions of the corenlp-models, english-extra-models, and english-kbp-models jars and include them in your CLASSPATH. If you are processing languages other than English, make sure to download the latest version of the models jar for the language you are interested in.

Build with Maven

  1. Make sure you have Maven installed, details here: https://maven.apache.org/
  2. If you run this command in the CoreNLP directory: mvn package, it should run the tests and build this jar file: CoreNLP/target/stanford-corenlp-4.4.0.jar
  3. When using the latest version of the code make sure to download the latest versions of the corenlp-models, english-extra-models, and english-kbp-models and include them in your CLASSPATH. If you are processing languages other than English, make sure to download the latest version of the models jar for the language you are interested in.
  4. If you want to use Stanford CoreNLP as part of a Maven project, you need to install the models jars into your Maven repository. Below is a sample command for installing the Spanish models jar; for other languages, just change the language name in the command. To install stanford-corenlp-models-current.jar you will need to set -Dclassifier=models. Here is the sample command for Spanish:

     mvn install:install-file -Dfile=/location/of/stanford-spanish-corenlp-models-current.jar -DgroupId=edu.stanford.nlp -DartifactId=stanford-corenlp -Dversion=4.4.0 -Dclassifier=models-spanish -Dpackaging=jar
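For released versions, the artifacts can also be pulled straight from Maven Central; a typical dependency declaration is sketched below (the version shown matches the 4.4.0 release discussed above, and the `models` classifier artifact carries the default English models):

```xml
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>4.4.0</version>
</dependency>
<dependency>
  <groupId>edu.stanford.nlp</groupId>
  <artifactId>stanford-corenlp</artifactId>
  <version>4.4.0</version>
  <classifier>models</classifier>
</dependency>
```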

Models

The models jars that correspond to the latest code can be found in the table below.

Some of the larger (English) models -- like the shift-reduce parser and WikiDict -- are not distributed with our default models jar. These require downloading the English (extra) and English (kbp) jars. Resources for other languages require usage of the corresponding models jar.

The best way to get the models is to use git-lfs and clone them from Hugging Face Hub.

For instance, to get the French models, run the following commands:

# Make sure you have git-lfs installed
# (https://git-lfs.github.com/)
git lfs install

git clone https://huggingface.co/stanfordnlp/corenlp-french

The jars can be directly downloaded from the links below or the Hugging Face Hub page as well.

Language          Model Jar            Last Updated
Arabic            download (HF Hub)    4.4.0
Chinese           download (HF Hub)    4.4.0
English (extra)   download (HF Hub)    4.4.0
English (KBP)     download (HF Hub)    4.4.0
French            download (HF Hub)    4.4.0
German            download (HF Hub)    4.4.0
Hungarian         download (HF Hub)    4.4.0
Italian           download (HF Hub)    4.4.0
Spanish           download (HF Hub)    4.4.0

Thank you to Hugging Face for helping with our hosting!

Useful resources

You can find releases of Stanford CoreNLP on Maven Central.

You can find more explanation and documentation on the Stanford CoreNLP homepage.

For information about making contributions to Stanford CoreNLP, see the file CONTRIBUTING.md.

Questions about CoreNLP can either be posted on StackOverflow with the tag stanford-nlp, or on the mailing lists.

Comments
  • An Issue in importing StanfordCoreNLP library in an Android Studio project

    An Issue in importing StanfordCoreNLP library in an Android Studio project

    I am developing an Android application (I am a beginner). I want to use the Stanford CoreNLP 3.8.0 library in my app to extract the part of speech, the lemma, the parse, and so on from user sentences. I have tried a simple Java program in NetBeans by following this YouTube tutorial https://www.youtube.com/watch?v=9IZsBmHpK3Y, and it works perfectly. The jar files that I imported into the NetBeans project are stanford-corenlp-3.8.0.jar and stanford-corenlp-3.8.0-models.jar.

    And this is the java source code:

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;
    
    import java.util.List;
    import java.util.Properties;
    
    public class CoreNlpExample {
    
        public static void main(String[] args) {
    
            // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    
            // read some text in the text variable
            String text = "What is the Weather in Bangalore right now?";
    
            // create an empty Annotation just with the given text
            Annotation document = new Annotation(text);
    
            // run all Annotators on this text
            pipeline.annotate(document);
    
            List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
    
            for (CoreMap sentence : sentences) {
                // traversing the words in the current sentence
                // a CoreLabel is a CoreMap with additional token-specific methods
                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    // this is the text of the token
                    String word = token.get(CoreAnnotations.TextAnnotation.class);
                    // this is the POS tag of the token
                    String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                    // this is the NER label of the token
                    String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
    
                    System.out.println(String.format("Print: word: [%s] pos: [%s] ne: [%s]", word, pos, ne));
                }
            }
        }
    }
    

    I wanted to try the same code in Android Studio and display the result in a textview, but I am facing a problem with adding these external libraries in my Android Studio 3.0.1 project.

    I have read on some websites that I need to reduce the size of the jar files, and I did that and made sure that the reduced jars still work fine in the NetBeans project. But I am still facing problems in Android Studio, and this is the error that I am getting:

    java.lang.VerifyError: Rejecting class edu.stanford.nlp.pipeline.StanfordCoreNLP that attempts to sub-type erroneous class edu.stanford.nlp.pipeline.AnnotationPipeline (declaration of 'edu.stanford.nlp.pipeline.StanfordCoreNLP' appears in /data/app/com.example.fatimah.nlpapplication-bhlUJOCUwLhSbkWE7NBERA==/split_lib_dependencies_apk.apk)

    Any suggestions on how I can fix this and import Stanford library successfully?

    opened by ftoom235 52
  • Use JaFaMa for faster math, and optimize critical code paths

    Use JaFaMa for faster math, and optimize critical code paths

    These changes substantially cut down the processing time; by several hours when I process all of Wikipedia. Feel free to benchmark on your own data.

    The first commit uses JaFaMa instead of java.lang.Math, which is 2-3x faster for exp and log: http://blog.element84.com/improving-java-math-perf-with-jafama.html In some places I switched back to log1p, because the runtimes of log and log1p in JaFaMa are similar, and log1p offers better precision for small values of x than log(1+x).

    The other patches optimize the crucial code around the Viterbi algorithm:

    • HotSpot optimizes better if large functions with multiple loops are split into multiple methods (as they can be recompiled independently).
    • It pays off to save repeated nested array lookups (e.g. array[i][j] in a loop over j; move array_i = array[i] outside of the loop and use array_i[j] inside).
    • I also add a cache to avoid recomputing the open tags set in TTags.

    All of these may appear to be trivial changes, but once you benchmark you will see how much this improves the run time.
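The nested-array point above can be sketched in isolation; this is a hypothetical micro-example, not the actual CoreNLP code:

```java
public class LookupHoisting {
    // Repeated a[i] lookup on every inner iteration.
    static double sumSlow(double[][] a) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            for (int j = 0; j < a[i].length; j++) {
                s += a[i][j];
            }
        }
        return s;
    }

    // Hoist the row reference out of the inner loop; same result,
    // but one array dereference per inner iteration instead of two.
    static double sumFast(double[][] a) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double[] ai = a[i];
            for (int j = 0; j < ai.length; j++) {
                s += ai[j];
            }
        }
        return s;
    }

    public static void main(String[] args) {
        double[][] a = {{1, 2}, {3, 4, 5}};
        System.out.println(sumSlow(a) == sumFast(a));
    }
}
```

Whether the JIT already performs this hoisting depends on the surrounding code, which is why benchmarking (as the author suggests) is the only reliable check.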

    Processing the first 20000 articles with tokenize,ssplit,pos, doing some further processing such as my own lemmatization based on hunspell, and then loading them into a Lucene index took 08:51 minutes with the CoreNLP master branch, and only 04:38 minutes with my patches (sloppy benchmark only). I consider this a substantial speedup, because Wikipedia is 5.3 million articles, and it still needed 19 hours to build the full-text index, but it used to take almost two days...

    opened by kno10 39
  • Could the project switch to using log4j for logs?

    Could the project switch to using log4j for logs?

    I see a lot of logs printed to System.out or System.err. Would it be possible to use a library like log4j http://logging.apache.org/log4j/2.x/ and use log.error, log.warning, log.info, log.debug instead? That would make it easier for users of the StanfordCoreNLP to manage which logs should be printed by choosing the log level of the project.

    enhancement 
    opened by Asimov4 33
  • Quote Annotation - AnnotationException StringIndexOutOfBoundsException

    Quote Annotation - AnnotationException StringIndexOutOfBoundsException

    Hello,

    I had a situation with text that had this: ""=

    It seems to throw an error when I try running the pipeline with quote annotation on this small fragment. Just wanted to verify that it was an issue.

    Thank you.

    opened by allenkim 29
  • Parsing fails on AssertionError when using OpenIE (v3.9.2)

    Parsing fails on AssertionError when using OpenIE (v3.9.2)

    Happens with the following sentence, under version 3.9.2, only when adding openIE annotator:

    It was a long and stern face, but with eyes that twinkled in a kindly way.

    stack trace:

        java.lang.AssertionError
            at edu.stanford.nlp.naturalli.Util.cleanTree(Util.java:324)
            at edu.stanford.nlp.naturalli.OpenIE.annotateSentence(OpenIE.java:463)
            at edu.stanford.nlp.naturalli.OpenIE.lambda$annotate$2(OpenIE.java:547)
            at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
            at edu.stanford.nlp.naturalli.OpenIE.annotate(OpenIE.java:547)
            at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
            at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:637)
            at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:629)

    to replicate:

            Properties props = new Properties();
            props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    
            String text = "It was a long and stern face, but with eyes that twinkled in a kindly way.";
    
            CoreDocument document = new CoreDocument(text);
            pipeline.annotate(document);
    

    It works fine if openie is disabled, with other sentences, or when using https://corenlp.run/, so it looks like it's fixed in later versions, but I did not verify that locally as I can't upgrade at the moment anyway.

    advice much appreciated

    opened by manzurola 29
  • Stanford CoreNLP server not responding

    Stanford CoreNLP server not responding

    I have been trying to use the CoreNLP server using various python packages including Stanza. I am always running into the same problem that I do not hear back from the server.

    I downloaded a copy of CoreNLP from the website. I then try to start a server from the terminal and go to my localhost as described here. Based on the documentation I should see something when I go to http://localhost:9000/, but nothing loads up.

    Here are the commands I use:

    cd stanford-corenlp-full-2018-10-05/
    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
    

    Here is the output of running the commands above:

    Samarths-MacBook-Pro-2:stanford-corenlp-full-2018-10-05 samarthbhandari$ java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
    [main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
    [main] INFO CoreNLP - setting default constituency parser
    [main] INFO CoreNLP - warning: cannot find edu/stanford/nlp/models/srparser/englishSR.ser.gz
    [main] INFO CoreNLP - using: edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz instead
    [main] INFO CoreNLP - to use shift reduce parser download English models jar from:
    [main] INFO CoreNLP - http://stanfordnlp.github.io/CoreNLP/download.html
    [main] INFO CoreNLP -     Threads: 8
    [main] INFO CoreNLP - Starting server...
    [main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000
    

    I then go to http://localhost:9000/, but nothing loads. As I mentioned above, I have been trying to do the same thing using some of the Python packages and observed similar behavior.

    Here is a stack overflow post related to server not responding using Stanza.

    OS: MacOS 10.15.4 Python: 3.7.7 Java: 1.8

    cantreproduce 
    opened by samarth12 25
  • [MEMORY] Possibly use float instead of double in models/weights

    [MEMORY] Possibly use float instead of double in models/weights

    double arrays are a large portion of the heap.

    There are some places with 2d double arrays with dimensions like:

        345k x 16, 150k x 24, 80k x 46: CRFClassifier.weights
        100k x 1000: Classifier.saved in DependencyParser
        60k x 50: Classifier.E, .eg2E
        1000 x 2400: Classifier.W1, .wg2W1

    Most are weights of some sort, making me wonder if they could be stored in less than 64bit each.

    The obvious step would be to use float[], halving the memory use of this portion.

    Another would be to encode weights in something else, for example a small integer and scale that into a float again when using the weight.

    Machine learning models often use fp16 or even fp8 to store weights; there are Java implementations of float -> short -> float (with fp16 semantics stored in a 16-bit short),

    like https://android.googlesource.com/platform/frameworks/base/+/master/core/java/android/util/Half.java with https://android.googlesource.com/platform/libcore/+/master/luni/src/main/java/libcore/util/FP16.java

    or https://stackoverflow.com/questions/6162651/half-precision-floating-point-in-java

    The latter approach would need some performance testing, as each time a weight is used it would have to be converted first.


    I saw that some models serialize themselves using ObjectStreams, that would need an adapter to deserialize to double[] first and then array-cast it to float[].

    Like in CRFClassifier.loadClassifier
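A rough sketch of that adapter idea (hypothetical names, not actual CoreNLP code): after reading the serialized double[] weights, narrow them once into float[] to halve the memory footprint.

```java
public class WeightNarrowing {
    // Down-cast deserialized double weights to float, halving memory.
    // Lossy: floats keep only ~7 significant decimal digits.
    static float[] toFloats(double[] weights) {
        float[] out = new float[weights.length];
        for (int i = 0; i < weights.length; i++) {
            out[i] = (float) weights[i];
        }
        return out;
    }
}
```

For model weights, which are typically trained with far less than seven digits of meaningful precision, this narrowing is usually harmless, but it should be validated against held-out accuracy numbers.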

    opened by lambdaupb 25
  • TokenSequenceParser ignoring tail of patterns mentioned in rules

    TokenSequenceParser ignoring tail of patterns mentioned in rules

    The following function in the TokenSequenceParser class ignores the tail of patterns defined in rules for tokensregex:

        private String getStringFromTokens(Token head, Token tail, boolean includeSpecial) {
            StringBuilder sb = new StringBuilder();
            for (Token p = head; p != tail; p = p.next) {
                if (includeSpecial) {
                    appendSpecialTokens(sb, p.specialToken);
                }
                sb.append(p.image);
            }
            return sb.toString();
        }

    E.g., ([{lemma:/([a-zA-Z]{2,}_)?[a-zA-Z]{2,}[0-9]{2,}/}]) gets converted to ([{lemma:/([a-zA-Z]{2,}_)?[a-zA-Z]{2,}[0-9]{2,}/}] while being read, and does not produce the intended matches.

    opened by ankitsingh2 23
  • Exception thrown for operation attempted on unknown vertex

    Exception thrown for operation attempted on unknown vertex

    CoreNLP version 4.5.0, using pos, lemma, and depparse. I run the pipeline within Spark (Scala). I lazily initialise the CoreNLP pipeline and broadcast it to each executor using lazy instantiation wrapped in a case object. I also force the text fragment not to be split, as it is intended to be a single sentence already. The objective here is to do dependency analysis on the sentence and run some semgraph rules against it. We got a case where it throws an exception like this:

    Caused by: edu.stanford.nlp.semgraph.UnknownVertexException: Operation attempted on unknown vertex happens/VBZ'''' in graph -> observed/VBD (root)
      -> 24/CD (nsubj)
        -> response/NN (nmod:in)
          -> In/IN (case)
          -> CoV/NNP (nmod:to)
            -> to/IN (case)
            -> SARS/NNP (compound)
            -> ‐/SYM (dep)
            -> ‐/SYM (dep)
            -> peptides/NNS (dep)
              -> 2/CD (nummod)
      -> ,/, (punct)
      -> we/PRP (nsubj)
      -> unexpectedly/RB (advmod)
      -> associated/VBN (ccomp)
        -> that/IN (mark)
        -> sirolimus/NN (nsubj:pass)
        -> was/VBD (aux:pass)
        -> significantly/RB (advmod)
        -> release/NN (obl:with)
          -> with/IN (case)
          -> a/DT (det)
          -> proinflammatory/JJ (amod)
          -> cytokine/NN (compound)
          -> levels/NNS (nmod:including)
            -> including/VBG (case)
            -> higher/JJR (amod)
            -> α/NN (nmod:of)
              -> of/IN (case)
              -> TNF/NN (compound)
              -> ‐/SYM (dep)
              -> IL/NN (conj:and)
                -> and/CC (cc)
            -> IL/NN (nmod:of)
            -> 1β/NN (nmod)
              -> ‐/SYM (dep)
      -> ./. (punct)
    
    	at edu.stanford.nlp.semgraph.SemanticGraph.parentPairs(SemanticGraph.java:730)
    	at edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT$1.advance(GraphRelation.java:325)
    	at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.initialize(GraphRelation.java:1103)
    	at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.<init>(GraphRelation.java:1084)
    	at edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT$1.<init>(GraphRelation.java:310)
    	at edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT.searchNodeIterator(GraphRelation.java:310)
    	at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.resetChildIter(NodePattern.java:339)
    	at edu.stanford.nlp.semgraph.semgrex.SemgrexMatcher.resetChildIter(SemgrexMatcher.java:80)
    	at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.resetChildIter(CoordinationPattern.java:168)
    	at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.resetChildIter(CoordinationPattern.java:168)
    	at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.resetChildIter(CoordinationPattern.java:168)
    	at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.resetChild(NodePattern.java:363)
    	at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.goToNextNodeMatch(NodePattern.java:457)
    	at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.matches(NodePattern.java:574)
    	at edu.stanford.nlp.semgraph.semgrex.SemgrexMatcher.find(SemgrexMatcher.java:193)
    	at az.bikg.nlp.etl.common.nlp.Pattern.go$3(Pattern.scala:200)
    	at az.bikg.nlp.etl.common.nlp.Pattern.$anonfun$findCauseEffectMatches$6(Pattern.scala:268)
    	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
    	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
    	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
    	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
    	at az.bikg.nlp.etl.common.nlp.Pattern.findCauseEffectMatches(Pattern.scala:266)
    	at az.bikg.nlp.etl.steps.ERs$.findRelations(ERs.scala:107)
    	at az.bikg.nlp.etl.steps.ERs$.findRelationsSpark(ERs.scala:229)
    	at az.bikg.nlp.etl.steps.ERs$.$anonfun$extractERs$1(ERs.scala:242)
    	... 28 more
    

    Am I doing anything wrong that causes this exception? It didn't happen with version 4.4.0.

    opened by mkarmona 22
  • Are these latest Chinese models significantly worse than the Stanford online parser?

    Are these latest Chinese models significantly worse than the Stanford online parser?

    I tested the latest Chinese CoreNLP 3.9.2 models and found the results quite horrible. Here are a few examples:

        我的朋友: always tags "我的" as one NN token.
        我的狗吃苹果: "我的狗" tagged as one NN token.
        他的狗吃苹果: "狗吃" tagged as one NN token.
        高质量就业成时代: "就业" tagged as VV.

    When I compared them with the results from http://nlp.stanford.edu:8080/parser/index.jsp, surprisingly, those examples all come out right. Why is that? Are the models different? Is there a bug in the new 3.9.2 models?

    opened by lingvisa 21
  • pos-tagger cannot load models from stanford-corenlp-3.5.2-models.jar

    pos-tagger cannot load models from stanford-corenlp-3.5.2-models.jar

    I use Stanford CoreNLP in Java as a Maven dependency. I want to use the MaxentTagger with a model supplied in the stanford-corenlp-3.5.2-models package. The problem is that I cannot access this model through the classpath.

    My code is

    tagger = new MaxentTagger("/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
    

    The file "/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" exists in the jar and should be loaded through classpath, but the following exception is thrown

    Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: Unrecoverable error while loading a tagger model
        at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(MaxentTagger.java:770)
        at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:298)
        at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:263)
        at cz.zcu.kiv.nlp.semeval.cwi.features.POSFeature.<init>(POSFeature.java:24)
        at cz.zcu.kiv.nlp.semeval.cwi.CWIModel.train(CWIModel.java:60)
        at cz.zcu.kiv.nlp.semeval.cwi.TrainingCrossValidation.main(TrainingCrossValidation.java:51)
    Caused by: java.io.IOException: Unable to resolve "/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" as either class path, filename or URL
        at  edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:481)
        at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(MaxentTagger.java:765)
        ... 5 more
    

    If I copy the model out of the jar and use e.g.

    tagger = new MaxentTagger("./english-left3words-distsim.tagger");
    

    then everything works perfectly.

    The problem is probably in the class IOUtils, method findStreamInClasspathOrFileSystem(String name).

    In the line

    InputStream is = IOUtils.class.getClassLoader().getResourceAsStream(name);
    

    the returned classloader is probably the JarClassLoader which loaded the library (stanford-corenlp-3.5.2.jar) and it does not have access to other libraries.

    This theory is supported by the following code

    InputStream stream = POSFeature.class.getResourceAsStream("/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
    System.out.println("Stream == null: " + (stream == null));
    
    tagger = new MaxentTagger("/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
    

    which outputs

    Stream == null: false
    Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: Unrecoverable error while loading a tagger model
    ...
    Caused by: java.io.IOException: Unable to resolve "/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" as either class path, filename or URL
    
    opened by konkol 21
  • Why is there no description of how to set up the models jar with a build tool?

    Why is there no description of how to set up the models jar with a build tool?

    The README describes how to install the models jars manually, but not how to set them up with a build tool (e.g. Gradle). However, I tried this approach ( https://stackoverflow.com/a/68859054/3809427 ) and it worked. Why not document this useful method?

    opened by lamrongol 3
  • EntityMentions returns null instead of empty list

    EntityMentions returns null instead of empty list

    I ran into an issue where, if an empty string is passed in, getting the entityMentions returns null instead of an empty list, which I would have figured to be standard practice.

    Example code:

    StanfordCoreNLP processor = new StanfordCoreNLP(props);
    CoreDocument nlpDocument = new CoreDocument("");
    
    processor.annotate(nlpDocument);
    List<CoreEntityMention> entities = nlpDocument.entityMentions(); // <== returns null
    

    Just wanted to know if any light can be shed on this. If this is the expected behavior, then I will do my best to note it in the documentation.

    opened by cholojuanito 2
  • 'email' tokenizing as 'em, ail, and '

    'email' tokenizing as 'em, ail, and '

    In the following sentence (from Twitter), 'email' is being tokenized as 'em, ail, and '. This is obviously incorrect. What can be done to stop this split?

    • It's official (according to the AP) it's 'email' not 'e-mail' and 'website' not 'web-site'!

    I have the following parameters set:

    tokenize.language: English
    tokenize.whitespace: false (because we want tokens like it's to separate into it and 's)
    tokenize.keepeol: false
    tokenize.verbose: false
    tokenize.options: invertible=true,splitAssimilations=false,splitHyphenated=false,splitForwardSlash=true,untokenizable=allKeep,strictTreebank3=true,normalizeSpace=false,ellipses=original

    opened by saxtell-cb 6
  • parsing '`'

    parsing '`'

    curl 'http://localhost:9000/?properties={%22annotators%22%3A%22lemma%22%2C%22outputFormat%22%3A%22json%22}' -d '`'
    

    Gives me:

    {
      "sentences": [
        {
          "index": 0,
          "tokens": [
            {
              "index": 1,
              "word": "`",
              "originalText": "`",
              "lemma": "`",
              "characterOffsetBegin": 0,
              "characterOffsetEnd": 1,
              "pos": "``",
              "before": "",
              "after": ""
            }
          ]
        }
      ]
    }
    

    With the standard English model. Is this expected? I'm particularly surprised at the POS.

    opened by AntonOfTheWoods 5
  • Stanford CoreNLP slower after upgrade from 3.7.0 to 4.5.1

    Stanford CoreNLP slower after upgrade from 3.7.0 to 4.5.1

    A unit test that runs in a loop calling pipeline.annotate(document) appears to be taking about 50% longer. Our configuration properties didn't change during the upgrade, but maybe some new properties have been added in 4.5.1? Below is what we have. Is there a way to determine which annotator is using more time now?

    customAnnotatorClass.tokensregex=edu.stanford.nlp.pipeline.TokensRegexAnnotator
    sutime.binders=0
    tokensregex.rules= .... (omitted)
    ssplit.eolonly=false
    customAnnotatorClass.tokenOverride_en= .... (omitted)
    annotators=tokenize, ssplit, tokenOverride_en, pos, lemmaOverride_en, ner, tokensregex, entitymentions, parse
    language=en
    tokenize.whitespace=false
    customAnnotatorClass.lemmaOverride_en=.... (omitted)
    tokenize.options=untokenizable=allKeep,americanize=false
    ssplit.isOneSentence=true
    nermention.acronyms=true

    opened by dsbanks99 1
Releases (v4.5.1)
  • v4.5.1 (Aug 30, 2022)

    CoreNLP 4.5.1

    Bugfixes!

    • Fix tokenizer regression: 4.5.0 will tokenize ",5" as one word https://github.com/stanfordnlp/CoreNLP/commit/974383ab7336a254d260264885186dd77df0cf81
    • Use a LinkedHashMap in the PTBTokenizer instead of Properties. Keeps the option processing order predictable. https://github.com/stanfordnlp/CoreNLP/issues/1289 https://github.com/stanfordnlp/CoreNLP/commit/655018895e2f2870ce721de42d31b845fa991335
    • Fix \r\n not being properly processed on Windows: #1291 https://github.com/stanfordnlp/CoreNLP/commit/9889f4ef4ee9feb8b70f577db8353c8d6c896ae3
    • Handle one half of surrogate character pairs in the tokenizer w/o crashing https://github.com/stanfordnlp/CoreNLP/issues/1298 https://github.com/stanfordnlp/CoreNLP/commit/1b12faa64b9ea85f808b27ab74ccf9f79ccb01f4
    • Attempt to fix semgrex "Unknown vertex" errors which have plagued CoreNLP for years in hard to track down circumstances: https://github.com/stanfordnlp/CoreNLP/issues/1296 https://github.com/stanfordnlp/CoreNLP/issues/1229 https://github.com/stanfordnlp/CoreNLP/issues/1169 https://github.com/stanfordnlp/CoreNLP/commit/f99b5ab87f073118a971c4d1e39df85ab9abbab1
    Source code(tar.gz)
    Source code(zip)
  • v4.5.0 (Jul 22, 2022)

    CoreNLP 4.5.0

    The main features are improved lemmatization of English, improved tokenization of both English and non-English flex-based languages, and some updates to tregex, tsurgeon, and semgrex.

    • All PTB and German tokens normalized now in PTBLexer (previously only German umlauts). This makes the tokenizer 2% slower, but should avoid issues with resume' for example https://github.com/stanfordnlp/CoreNLP/commit/d46fecd93c6964f635efe85d9b7c327ee8880fb9

    • log4j removed entirely from public CoreNLP (internal "research" branch still has a use) https://github.com/stanfordnlp/CoreNLP/commit/f05cb54ec0a4f3c90395771817f44a81eb549baf

    • Fix NumberFormatException showing up in NER models: https://github.com/stanfordnlp/CoreNLP/issues/547 https://github.com/stanfordnlp/CoreNLP/commit/5ee2c391104109a338a28f35c647b7684b00ad41

    • Fix "seconds" in the lemmatizer: https://github.com/stanfordnlp/CoreNLP/commit/e7a073bde9ba7bbdb40ba81ed96d379455629e44

    • Fix double escaping of & in the online demos: https://github.com/stanfordnlp/CoreNLP/commit/8413fa1fc432aa2a13cbb4a296352bb9bad4d0cb

    • Report the cause of an error if "tregex" is asked for but no parse annotator is added: https://github.com/stanfordnlp/CoreNLP/commit/4db80c051322697c983ecda873d8d38f808cb96c

    • Merge ssplit and cleanxml into the tokenize annotator (done in a backwards compatible manner): https://github.com/stanfordnlp/CoreNLP/pull/1259

    • Custom tregex pattern, ROOT tregex pattern, and tsurgeon operation for simultaneously moving a subtree and pruning anything left behind, used for processing the Italian VIT treebank in stanza: https://github.com/stanfordnlp/CoreNLP/pull/1263

    • Refactor tokenization of punctuation, filenames, and other entities common to all languages, not just English: https://github.com/stanfordnlp/CoreNLP/commit/3c40ba32ca51af02936b907d03406e2158883f7b https://github.com/stanfordnlp/CoreNLP/commit/58a2288239f631df47fac3eed105fe78c08b1a5d https://github.com/stanfordnlp/CoreNLP/commit/8b97d64e48e6d4161f62a8635d2bb4cee2e95553

    • Improved tokenization of number patterns, names with apostrophes such as Sh'reyan, non-American phone numbers, invisible commas https://github.com/stanfordnlp/CoreNLP/commit/9476a8eb724e01df4b05bce38789dd8a7e61397c https://github.com/stanfordnlp/CoreNLP/commit/6193934af8ae0abb0b4c6a2522d7efdfa426e5b3 https://github.com/stanfordnlp/CoreNLP/commit/afb1ea89c874acd58bab584f1e29a059c44dfd20 https://github.com/stanfordnlp/CoreNLP/commit/7c84960df4ac9d391ef37855572e2f8bc301ee17

    • Significant lemmatizer improvements: adjectives & adverbs, along with various other special cases https://github.com/stanfordnlp/CoreNLP/pull/1266

    • Include graph & semgrex indices in the results for a semgrex query (will make the results more usable) https://github.com/stanfordnlp/CoreNLP/commit/45b47e245c367663bba2e81a26ea7c29262ad0d8

    • Trim words in the NER training process. Spaces can still occur inside a word, but stray whitespace won't ruin the performance of the models https://github.com/stanfordnlp/CoreNLP/commit/0d9e9c829bfa75bb661cccea03fc682a0f955f0d

    • Fix NBSP in the Chinese segmenter https://github.com/stanfordnlp/stanza/issues/1052 https://github.com/stanfordnlp/CoreNLP/pull/1279
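    Since ssplit and cleanxml now run inside the tokenize annotator (done backwards-compatibly, per the merge above), a pipeline that listed them explicitly should behave the same either way. A hypothetical properties sketch; the exact accepted spellings are assumptions, not taken from the release notes:

```properties
# Pre-4.5.0 style, still accepted after the merge:
annotators = tokenize, ssplit, pos, lemma, ner

# 4.5.0+ style: sentence splitting happens inside tokenize,
# and ssplit.* options such as the one below are still honored.
annotators = tokenize, pos, lemma, ner
ssplit.eolonly = true
```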

  • v4.4.0 (Jan 25, 2022)

    Enhancements

    • added a -preTokenized option, which assumes text is already tokenized on whitespace and sentence-split on newlines

    • tsurgeon CLI - python side added to stanza
      https://github.com/stanfordnlp/CoreNLP/pull/1240

    • sutime WORKDAY definition https://github.com/stanfordnlp/CoreNLP/commit/0dfb11817c2b46a532985c24289e128fbb81a2c0
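    The -preTokenized convention above (tokens separated by whitespace, one sentence per line) can be sketched in a few lines of plain Java. This illustrates the input convention only; it is not the CoreNLP implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PreTokenized {
    // Splits text the way a pre-tokenized corpus is assumed to be laid out:
    // one sentence per line, tokens separated by whitespace.
    public static List<List<String>> split(String text) {
        List<List<String>> sentences = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (line.trim().isEmpty()) continue; // ignore blank lines
            sentences.add(Arrays.asList(line.trim().split("\\s+")));
        }
        return sentences;
    }

    public static void main(String[] args) {
        // Two sentences, tokens already separated by spaces.
        System.out.println(split("The cat sat .\nIt purred ."));
    }
}
```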

    Fixes

    • rebuilt Italian dependency parser using CoreNLP predicted tags

    • XML security issue: https://github.com/stanfordnlp/CoreNLP/pull/1241

    • NER server security issue: https://github.com/stanfordnlp/CoreNLP/commit/5ee097dbede547023e88f60ed3f430ff09398b87

    • fix infinite loop in tregex: https://github.com/stanfordnlp/CoreNLP/pull/1238

    • JSON UTF-8 output on Windows https://github.com/stanfordnlp/CoreNLP/pull/1231 https://github.com/stanfordnlp/stanza/issues/894

    • fix openie crash in certain unusual graphs https://github.com/stanfordnlp/CoreNLP/pull/1230 https://github.com/stanfordnlp/CoreNLP/issues/1082

    • fix nondeterministic results in certain SemanticGraph structures https://github.com/stanfordnlp/CoreNLP/pull/1228 https://github.com/stanfordnlp/CoreNLP/commit/cc806f265292977b69fd55f36408fe5ad3a695a0

    • workaround for NLTK sending % unescaped to the server https://github.com/stanfordnlp/CoreNLP/issues/1226 https://github.com/stanfordnlp/CoreNLP/commit/20fe1e996455b1c1434022d6e7f0b8524f41f253

    • make TimingTest function on Windows https://github.com/stanfordnlp/CoreNLP/commit/4aafb84f6ea5c0102c921a503cbfb8e3d34f3e22
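    Bugs like the Windows UTF-8 JSON issue above usually come down to relying on the platform default charset; passing StandardCharsets.UTF_8 explicitly sidesteps the whole class of problem. A stdlib-only sketch, not the actual CoreNLP fix:

```java
import java.nio.charset.StandardCharsets;

public class Utf8Out {
    // Encodes text as UTF-8 regardless of the platform default charset.
    // The failure mode is typically calling text.getBytes() or using
    // new FileWriter(...) without a charset, which on Windows has
    // historically meant a legacy code page rather than UTF-8.
    public static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(encode("na\u00efve").length); // 6: the "ï" takes two bytes
    }
}
```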

  • v4.3.2 (Nov 18, 2021)

  • v4.3.1 (Oct 22, 2021)

    Fixes

    • fixes a character offset issue with StatTok
    • fixes a path issue with the default Hungarian properties
    • adds Hungarian and Italian to the demo
    • fixes an umlaut issue
  • v4.3.0 (Oct 6, 2021)

    Overview

    This release adds new European languages, improvements to the parsers and tokenizers, and other misc. fixes.

    Enhancements

    • Hungarian pipeline
    • Italian pipeline
    • Improvements to English tokenizer
    • Better memory usage by dependency parser

    Fixes

    • issue with umlaut handling in German #1184
  • v4.2.2 (May 14, 2021)

    This release includes some small fixes to version 4.2.1.

    It includes:

    • demo fixes for 4.2.2, resolving cache issues with demo resources
    • small fix to RegexNERSequenceClassifier issue allowing AnswerAnnotation to be overwritten
  • v4.2.1 (May 5, 2021)

    • Fix some server links being http instead of https https://github.com/stanfordnlp/CoreNLP/issues/1146

    • Improve MWE expressions in the enhanced dependency conversion https://github.com/stanfordnlp/CoreNLP/commit/1ef9ef9c75e6948eed10092bf6d1c49c49cfabaa

    • Add the ability for the command-line semgrex processor to handle multiple calls in one process https://github.com/stanfordnlp/CoreNLP/commit/c9d50ef9cb2e1851257d06cda55b1456d69145b7

    • Fix interaction between discarding tokens in ssplit and assigning NER tags https://github.com/stanfordnlp/CoreNLP/commit/a803bc357c32841beb3919f2e4dc22a1375dca4d

    • Reduce the size of the SR parser models (not a huge amount, but some) https://github.com/stanfordnlp/CoreNLP/pull/1142

    • Various QuoteAnnotator bug fixes https://github.com/stanfordnlp/CoreNLP/pull/1135 https://github.com/stanfordnlp/CoreNLP/issues/1134 https://github.com/stanfordnlp/CoreNLP/pull/1121 https://github.com/stanfordnlp/CoreNLP/issues/1118 https://github.com/stanfordnlp/CoreNLP/commit/9f1b015ea91f1db6dce6ab7f35aacb9cdc33e463 https://github.com/stanfordnlp/CoreNLP/issues/1147

    • Switch to newer istack implementation https://github.com/stanfordnlp/CoreNLP/pull/1133 and newer protobuf https://github.com/stanfordnlp/CoreNLP/pull/1150

    • Add a CoNLL-U output format to some of the segmenter code, useful for testing with the official test scripts https://github.com/stanfordnlp/CoreNLP/commit/c70ddec9736e9d3c7effd4660f63e363caeb333d

    • Fix Turkish locale enums https://github.com/stanfordnlp/CoreNLP/pull/1126 https://github.com/stanfordnlp/stanza/issues/580

    • Use StringBuilder instead of StringBuffer where possible https://github.com/stanfordnlp/CoreNLP/pull/1010
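    The StringBuilder change above is a standard Java cleanup: StringBuffer synchronizes every call, while StringBuilder offers the same API without locking, which is all single-threaded code needs. A minimal illustrative sketch:

```java
public class Concat {
    // Joins parts with a separator using StringBuilder;
    // same API as StringBuffer, minus the per-call synchronization.
    public static String join(String[] parts, char sep) {
        StringBuilder sb = new StringBuilder();
        for (String part : parts) {
            if (sb.length() > 0) sb.append(sep);
            sb.append(part);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(new String[]{"a", "b", "c"}, ',')); // prints "a,b,c"
    }
}
```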

  • v4.2.0 (Nov 17, 2020)

    Overview

    This release features a collection of small bug fixes and updates. It is the first release built directly from the GitHub repo.

    Enhancements

    • Upgrade libraries (EJML, JUnit, JFlex)
    • Add character offsets to Tregex responses from server
    • Improve cleaning of treebanks for English models
    • Speed up loading of Wikidict annotator
    • New utility for tagging CoNLL-U files in place
    • Command line tool for processing TokensRegex

    Fixes

    • Output single token NER entities in inline XML output format
    • Add currency symbol part of speech training data
    • Fix issues with tree binarizing
  • v4.0.0 (May 4, 2020)

    Overview

    The latest release of Stanford CoreNLP includes a major overhaul of tokenization and a large collection of new parsing and tagging models. There are also miscellaneous enhancements and fixes.

    Enhancements

    • UD v2.0 tokenization standard for English, French, German, and Spanish. That means "new" LDC tokenization for English (splitting on most hyphens) and not escaping parentheses or turning quotes etc. into ASCII sequences by default.
    • Upgrade options for normalizing special chars (quotes, parentheses, etc.) in PTBTokenizer
    • Have WhitespaceTokenizer support same newline processing as PTBTokenizer
    • New mwt annotator for handling multiword tokens in French, German, and Spanish.
    • New models with more training data and better performance for tagging and parsing in English, French, German, and Spanish.
    • Add French NER
    • New Chinese segmentation based off CTB9
    • Improved handling of double codepoint characters
    • Easier syntax for specifying language specific pipelines and NER pipeline properties
    • Improved CoNLL-U processing
    • Improved speed and memory performance for CRF training
    • Tregex support in CoreSentence
    • Updated library dependencies
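    For pipelines that relied on the old escaped output, PTBTokenizer options can dial the normalization back. A hedged sketch, assuming the long-standing PTBTokenizer option names still apply in 4.0; consult the tokenizer javadoc for the authoritative list:

```properties
# Hypothetical sketch: restore pre-4.0-style normalization,
# assuming these long-standing PTBTokenizer option names.
tokenize.options = normalizeParentheses=true,normalizeOtherBrackets=true,latexQuotes=true,americanize=true
```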

    Fixes

    • NPE while simultaneously tokenizing on whitespace and sentence splitting on newlines
    • NPE in EntityMentionsAnnotator during language check
    • NPE in CorefMentionAnnotator while aligning coref mentions with titles and entity mentions
    • NPE in NERCombinerAnnotator in certain configurations of models on/off
    • Incorrect handling of eolonly option in ArabicSegmenterAnnotator
    • Apply named entity granularity change prior to coref mention detection
    • Incorrect handling of keeping newline tokens when using Chinese segmenter on Windows
    • Incorrect handling of reading in German treebank files
    • SR parser crashes when given bad training input
    • New PTBTokenizer known abbreviations: "Tech.", "Amb.". Fix legacy tokenizer hack special casing 'Alex.' for 'Alex. Brown'
    • Fix ancient bug in printing constituency tree with multiple roots.
    • Fix parser from failing on word "STOP" because it treated it as a special word