Stanford CoreNLP provides a set of natural language analysis tools written in Java

Overview

Stanford CoreNLP


Stanford CoreNLP provides a set of natural language analysis tools written in Java. It can take raw human language text input and give the base forms of words and their parts of speech; identify whether they are names of companies, people, etc.; normalize and interpret dates, times, and numeric quantities; mark up the structure of sentences in terms of phrases or word dependencies; and indicate which noun phrases refer to the same entities. It was originally developed for English, but now also provides varying levels of support for (Modern Standard) Arabic, (mainland) Chinese, French, German, and Spanish. Stanford CoreNLP is an integrated framework, which makes it easy to apply a set of language analysis tools to a piece of text. Starting from plain text, you can run all the tools with just two lines of code. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications. Stanford CoreNLP is a set of stable and well-tested natural language processing tools, widely used by various groups in academia, industry, and government. The tools variously use rule-based, probabilistic machine learning, and deep learning components.

The Stanford CoreNLP code is written in Java and licensed under the GNU General Public License (v3 or later). Note that this is the full GPL, which allows many free uses, but not its use in proprietary software that you distribute to others.

Build Instructions

Several times a year we distribute a new version of the software, which corresponds to a stable commit.

Between releases, you can always use the latest, under-development version of our code.

Here are some helpful instructions to use the latest code:

Provided build

Sometimes we will provide updated jars here which have the latest version of the code.

At present, the current released version of the code is our most recent released jar, though you can always build the very latest from GitHub HEAD yourself.

Build with Ant

  1. Make sure you have Ant installed, details here: http://ant.apache.org/
  2. Compile the code with this command: cd CoreNLP ; ant
  3. Then run this command to build a jar with the latest version of the code: cd CoreNLP/classes ; jar -cf ../stanford-corenlp.jar edu
  4. This will create a new jar called stanford-corenlp.jar in the CoreNLP folder which contains the latest code
  5. The dependencies that work with the latest code are in CoreNLP/lib and CoreNLP/liblocal, so make sure to include those in your CLASSPATH.
  6. When using the latest version of the code make sure to download the latest versions of the corenlp-models, english-models, and english-models-kbp and include them in your CLASSPATH. If you are processing languages other than English, make sure to download the latest version of the models jar for the language you are interested in.

Build with Maven

  1. Make sure you have Maven installed, details here: https://maven.apache.org/
  2. Run this command in the CoreNLP directory: mvn package. It should run the tests and build this jar file: CoreNLP/target/stanford-corenlp-4.4.0.jar
  3. When using the latest version of the code make sure to download the latest versions of the corenlp-models, english-extra-models, and english-kbp-models and include them in your CLASSPATH. If you are processing languages other than English, make sure to download the latest version of the models jar for the language you are interested in.
  4. If you want to use Stanford CoreNLP as part of a Maven project you need to install the models jars into your Maven repository. Below is a sample command for installing the Spanish models jar. For other languages just change the language name in the command. To install stanford-corenlp-models-current.jar you will need to set -Dclassifier=models. Here is the sample command for Spanish: mvn install:install-file -Dfile=/location/of/stanford-spanish-corenlp-models-current.jar -DgroupId=edu.stanford.nlp -DartifactId=stanford-corenlp -Dversion=4.4.0 -Dclassifier=models-spanish -Dpackaging=jar

Models

The models jars that correspond to the latest code can be found in the table below.

Some of the larger (English) models -- like the shift-reduce parser and WikiDict -- are not distributed with our default models jar. These require downloading the English (extra) and English (kbp) jars. Resources for other languages require usage of the corresponding models jar.

The best way to get the models is to use git-lfs and clone them from Hugging Face Hub.

For instance, to get the French models, run the following commands:

# Make sure you have git-lfs installed
# (https://git-lfs.github.com/)
git lfs install

git clone https://huggingface.co/stanfordnlp/corenlp-french

The jars can be directly downloaded from the links below or the Hugging Face Hub page as well.

Language        | Model Jar         | Last Updated
Arabic          | download (HF Hub) | 4.4.0
Chinese         | download (HF Hub) | 4.4.0
English (extra) | download (HF Hub) | 4.4.0
English (KBP)   | download (HF Hub) | 4.4.0
French          | download (HF Hub) | 4.4.0
German          | download (HF Hub) | 4.4.0
Hungarian       | download (HF Hub) | 4.4.0
Italian         | download (HF Hub) | 4.4.0
Spanish         | download (HF Hub) | 4.4.0

Thank you to Hugging Face for helping with our hosting!

Useful resources

You can find releases of Stanford CoreNLP on Maven Central.

You can find more explanation and documentation on the Stanford CoreNLP homepage.

For information about making contributions to Stanford CoreNLP, see the file CONTRIBUTING.md.

Questions about CoreNLP can either be posted on StackOverflow with the tag stanford-nlp, or on the mailing lists.

Comments
  • An Issue in importing StanfordCoreNLP library in an Android Studio project


    I am developing an Android application (I am a beginner). I want to use the Stanford CoreNLP 3.8.0 library in my app to extract the part of speech, the lemma, the parse, and so on from the user's sentences. I have tried a simple Java program in NetBeans by following this YouTube tutorial https://www.youtube.com/watch?v=9IZsBmHpK3Y, and it is working perfectly. The jar files that I imported into the NetBeans project are: stanford-corenlp-3.8.0.jar and stanford-corenlp-3.8.0-models.jar.

    And this is the java source code:

    import edu.stanford.nlp.ling.CoreAnnotations;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;
    
    import java.util.List;
    import java.util.Properties;
    
    public class CoreNlpExample {
    
        public static void main(String[] args) {
    
            // creates a StanfordCoreNLP object, with POS tagging, lemmatization, NER, parsing, and coreference resolution
            Properties props = new Properties();
            props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    
            // read some text in the text variable
            String text = "What is the Weather in Bangalore right now?";
    
            // create an empty Annotation just with the given text
            Annotation document = new Annotation(text);
    
            // run all Annotators on this text
            pipeline.annotate(document);
    
            List<CoreMap> sentences = document.get(CoreAnnotations.SentencesAnnotation.class);
    
            for (CoreMap sentence : sentences) {
                // traversing the words in the current sentence
                // a CoreLabel is a CoreMap with additional token-specific methods
                for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
                    // this is the text of the token
                    String word = token.get(CoreAnnotations.TextAnnotation.class);
                    // this is the POS tag of the token
                    String pos = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
                    // this is the NER label of the token
                    String ne = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
    
                    System.out.println(String.format("Print: word: [%s] pos: [%s] ne: [%s]", word, pos, ne));
                }
            }
        }
    }
    

    I wanted to try the same code in Android Studio and display the result in a textview, but I am facing a problem with adding these external libraries in my Android Studio 3.0.1 project.

    I have read on some websites that I need to reduce the size of the jar files. I did that and made sure that the reduced jars still work in the NetBeans project. But I am still facing problems in Android Studio, and this is the error that I am getting:

    java.lang.VerifyError: Rejecting class edu.stanford.nlp.pipeline.StanfordCoreNLP that attempts to sub-type erroneous class edu.stanford.nlp.pipeline.AnnotationPipeline (declaration of 'edu.stanford.nlp.pipeline.StanfordCoreNLP' appears in /data/app/com.example.fatimah.nlpapplication-bhlUJOCUwLhSbkWE7NBERA==/split_lib_dependencies_apk.apk)

    Any suggestions on how I can fix this and import Stanford library successfully?

    opened by ftoom235 52
  • Use JaFaMa for faster math, and optimize critical code paths


    These changes substantially cut down the processing time; by several hours when I process all of Wikipedia. Feel free to benchmark on your own data.

    The first commit uses JaFaMa instead of java.lang.Math; JaFaMa is 2-3x faster for exp and log (see http://blog.element84.com/improving-java-math-perf-with-jafama.html ). In some places I switched back to log1p, because the runtimes of log and log1p in JaFaMa are similar, and log1p offers better precision for small values of x than log(1+x).
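
To illustrate the precision point about log1p, here is a minimal sketch. It uses java.lang.Math rather than JaFaMa (an assumption made so the example needs no extra dependency), and the class and method names are hypothetical.

```java
final class Log1pDemo {
    /** Naive form: 1 + x is rounded to a double before the logarithm is taken. */
    static double naive(double x) {
        return Math.log(1.0 + x);
    }

    /** log1p evaluates ln(1 + x) without first forming the sum 1 + x. */
    static double stable(double x) {
        return Math.log1p(x);
    }

    public static void main(String[] args) {
        // 1e-17 is below half an ulp of 1.0, so 1.0 + x rounds back to 1.0.
        double x = 1e-17;
        System.out.println("log(1 + x) = " + naive(x));
        System.out.println("log1p(x)   = " + stable(x));
    }
}
```

For such a small x the naive form loses the input entirely, while log1p returns a value close to x itself.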

    The other patches optimize the crucial code around the Viterbi algorithm:

    • HotSpot optimizes better if large functions with multiple loops are split into multiple methods (as they can be recompiled independently).
    • It pays off to save repeated nested array lookups (e.g. array[i][j] in a loop over j; move array_i = array[i] outside of the loop and use array_i[j] inside).
    • I also add a cache to avoid recomputing the open tags set in Ttags.
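
The nested-array point above can be sketched as follows. This is only an illustration of the hoisting pattern, not code from the patch; the class and method names are hypothetical, and any speedup should be measured on your own workload.

```java
final class RowHoisting {
    /** Baseline: indexes table[i][j] on every inner iteration. */
    static double sumNested(double[][] table) {
        double total = 0.0;
        for (int i = 0; i < table.length; i++) {
            for (int j = 0; j < table[i].length; j++) {
                total += table[i][j];
            }
        }
        return total;
    }

    /** Hoisted: fetches the row once, then indexes the one-dimensional row. */
    static double sumHoisted(double[][] table) {
        double total = 0.0;
        for (int i = 0; i < table.length; i++) {
            final double[] row = table[i];
            for (int j = 0; j < row.length; j++) {
                total += row[j];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] table = {{1, 2, 3}, {4, 5, 6}};
        System.out.println(sumNested(table) + " " + sumHoisted(table));
    }
}
```

Both methods add the same elements in the same order, so they return identical results; only the number of array dereferences in the inner loop differs.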

    All of these may appear to be trivial changes, but once you benchmark you will see how much this improves the run time.

    Processing the first 20000 articles with tokenize,ssplit,pos, doing some further processing such as my own lemmatization based on hunspell, and then loading them into a Lucene index took 08:51 minutes with the CoreNLP master branch, and only 04:38 minutes with my patches (sloppy benchmark only). I consider this a substantial speedup: Wikipedia has 5.3 million articles, and it still needed 19 hours to build the full-text index, but it used to take almost two days.

    opened by kno10 39
  • Could the project switch to using log4j for logs?


    I see a lot of logs printed to System.out or System.err. Would it be possible to use a library like log4j http://logging.apache.org/log4j/2.x/ and use log.error, log.warning, log.info, log.debug instead? That would make it easier for users of the StanfordCoreNLP to manage which logs should be printed by choosing the log level of the project.
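
As a rough sketch of what level-based filtering buys the user, the example below uses java.util.logging from the JDK in place of log4j (a substitution made so the example has no dependencies). The class, handler, and messages are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

final class LogLevelDemo {
    /** Collects the messages that reach the handler, instead of printing them. */
    static final class ListHandler extends Handler {
        final List<String> messages = new ArrayList<>();

        @Override
        public void publish(LogRecord record) {
            messages.add(record.getLevel() + ": " + record.getMessage());
        }

        @Override
        public void flush() {}

        @Override
        public void close() {}
    }

    /** Emits one message per level and returns those that pass the threshold. */
    static List<String> run(Level threshold) {
        Logger log = Logger.getAnonymousLogger();
        log.setUseParentHandlers(false); // keep the demo output off the console
        ListHandler handler = new ListHandler();
        log.addHandler(handler);
        log.setLevel(threshold);

        log.severe("model file missing");
        log.warning("falling back to default parser");
        log.info("pipeline ready");
        log.fine("loaded 3 annotators");
        return handler.messages;
    }

    public static void main(String[] args) {
        System.out.println(run(Level.WARNING));
    }
}
```

With the threshold at WARNING only the first two messages are kept; lowering it to FINE keeps all four, without touching the code that emits them.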

    enhancement 
    opened by Asimov4 33
  • Quote Annotation - AnnotationException StringIndexOutOfBoundsException


    Hello,

    I had a situation with text that had this: ""=

    It seems to throw an error when I try running the pipeline with quote annotation on this small fragment. Just wanted to verify that it was an issue.

    Thank you.

    opened by allenkim 29
  • Parsing fails on AssertionError when using OpenIE (v3.9.2)


    Happens with the following sentence, under version 3.9.2, only when adding openIE annotator:

    It was a long and stern face, but with eyes that twinkled in a kindly way.

    stack trace:

    java.lang.AssertionError
        at edu.stanford.nlp.naturalli.Util.cleanTree(Util.java:324)
        at edu.stanford.nlp.naturalli.OpenIE.annotateSentence(OpenIE.java:463)
        at edu.stanford.nlp.naturalli.OpenIE.lambda$annotate$2(OpenIE.java:547)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1540)
        at edu.stanford.nlp.naturalli.OpenIE.annotate(OpenIE.java:547)
        at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:637)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:629)

    to replicate:

            Properties props = new Properties();
            props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie");
            StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    
            String text = "It was a long and stern face, but with eyes that twinkled in a kindly way.";
    
            CoreDocument document = new CoreDocument(text);
            pipeline.annotate(document);
    

    works fine if openie is disabled, with other sentences, or when using https://corenlp.run/ so looks like it's fixed in later versions but I did not verify it locally as I can't upgrade at the moment anyway.

    advice much appreciated

    opened by manzurola 29
  • Stanford CoreNLP server not responding


    I have been trying to use the CoreNLP server using various python packages including Stanza. I am always running into the same problem that I do not hear back from the server.

    I downloaded a copy of CoreNLP from the website. I then try to start a server from the terminal and go to my localhost as described here. Based on the documentation I should see something when I go to http://localhost:9000/, but nothing loads up.

    Here are the commands I use:

    cd stanford-corenlp-full-2018-10-05/
    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
    

    Here is the output of running the commands above:

    Samarths-MacBook-Pro-2:stanford-corenlp-full-2018-10-05 samarthbhandari$ java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000
    [main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
    [main] INFO CoreNLP - setting default constituency parser
    [main] INFO CoreNLP - warning: cannot find edu/stanford/nlp/models/srparser/englishSR.ser.gz
    [main] INFO CoreNLP - using: edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz instead
    [main] INFO CoreNLP - to use shift reduce parser download English models jar from:
    [main] INFO CoreNLP - http://stanfordnlp.github.io/CoreNLP/download.html
    [main] INFO CoreNLP -     Threads: 8
    [main] INFO CoreNLP - Starting server...
    [main] INFO CoreNLP - StanfordCoreNLPServer listening at /0:0:0:0:0:0:0:0:9000
    

    I then go to http://localhost:9000/, but nothing loads. As I mentioned above, I originally tried to do the same thing using some of the Python packages and observed similar behavior.

    Here is a stack overflow post related to server not responding using Stanza.

    OS: MacOS 10.15.4
    Python: 3.7.7
    Java: 1.8
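
When debugging a situation like this, it can help to check from code whether anything answers HTTP at the address at all, independent of a browser or a Python client. The following is a minimal sketch using only the JDK; the class and method names are hypothetical, and it says nothing about why a particular server does not respond.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class ServerProbe {
    /** Returns true if an HTTP response (any status code) arrives within the timeout. */
    static boolean responds(String address, int timeoutMillis) {
        HttpURLConnection connection = null;
        try {
            connection = (HttpURLConnection) URI.create(address).toURL().openConnection();
            connection.setConnectTimeout(timeoutMillis);
            connection.setReadTimeout(timeoutMillis);
            // getResponseCode() returns -1 when the reply is not valid HTTP.
            return connection.getResponseCode() != -1;
        } catch (IOException e) {
            // Connection refused, timed out, or otherwise unreachable.
            return false;
        } finally {
            if (connection != null) {
                connection.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        // Address taken from the commands above; adjust the port if yours differs.
        System.out.println(responds("http://localhost:9000/", 2000));
    }
}
```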

    cantreproduce 
    opened by samarth12 25
  • [MEMORY] Possibly use float instead of double in models/weights


    double

    double arrays are a large portion of the heap.

    There are some places with 2d double arrays with dimensions like

    • 345k x 16, 150k x 24, 80k x 46: CRFClassifier.weights
    • 100k x 1000: Classifier.saved in DependencyParser
    • 60k x 50: Classifier.E, .eg2E
    • 1000 x 2400: Classifier.W1, .wg2W1

    Most are weights of some sort, making me wonder if they could be stored in less than 64bit each.

    The obvious step would be to use float[], halving the memory use of this portion.

    Another would be to encode weights in something else, for example a small integer and scale that into a float again when using the weight.

    Machine Learning models often use fp16 or even fp8 to store weights, there are java implementations of float -> short -> float (with fp16 semantics stored in a 16bit short)

    like https://android.googlesource.com/platform/frameworks/base/+/master/core/java/android/util/Half.java with https://android.googlesource.com/platform/libcore/+/master/luni/src/main/java/libcore/util/FP16.java

    or https://stackoverflow.com/questions/6162651/half-precision-floating-point-in-java

    The latter approach would need some performance testing, as each time a weight is used it would have to be converted first.


    I saw that some models serialize themselves using ObjectStreams, that would need an adapter to deserialize to double[] first and then array-cast it to float[].

    Like in CRFClassifier.loadClassifier
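
The adapter idea can be sketched as below. Java cannot cast a double[] to a float[], so the narrowing has to copy element by element. This is a self-contained illustration with hypothetical names; it does not reproduce the actual CRFClassifier serialization format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;

final class WeightNarrowing {
    /** Copies a (possibly ragged) double matrix into a float matrix, element by element. */
    static float[][] toFloat(double[][] weights) {
        float[][] narrowed = new float[weights.length][];
        for (int i = 0; i < weights.length; i++) {
            narrowed[i] = new float[weights[i].length];
            for (int j = 0; j < weights[i].length; j++) {
                narrowed[i][j] = (float) weights[i][j]; // may lose precision
            }
        }
        return narrowed;
    }

    /** Stand-in for an existing model file that stores the weights as double[][]. */
    static byte[] serialize(double[][] weights) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(weights);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Adapter: reads the old double[][] layout, keeps only the float copy. */
    static float[][] loadAsFloat(byte[] serialized) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(serialized))) {
            return toFloat((double[][]) in.readObject());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        float[][] w = loadAsFloat(serialize(new double[][] {{0.5, -1.25}, {3.0}}));
        System.out.println(w[0][1]);
    }
}
```

Note that the double array still exists briefly during loading, so peak memory at load time is not reduced by this approach; only the steady-state footprint is.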

    opened by lambdaupb 25
  • TokenSequenceParser ignoring tail of patterns mentioned in rules


    The following function in the TokenSequenceParser class ignores the tail of patterns defined in rules for tokensregex:

    private String getStringFromTokens(Token head, Token tail, boolean includeSpecial) {
        StringBuilder sb = new StringBuilder();
        for (Token p = head; p != tail; p = p.next) {
            if (includeSpecial) {
                appendSpecialTokens(sb, p.specialToken);
            }
            sb.append(p.image);
        }
        return sb.toString();
    }

    E.g. ([{lemma:/([a-zA-Z]{2,}_)?[a-zA-Z]{2,}[0-9]{2,}/}]) gets converted to ([{lemma:/([a-zA-Z]{2,}_)?[a-zA-Z]{2,}[0-9]{2,}/}] while reading, and so does not provide the intended matches.
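
The effect of the `p != tail` loop condition can be reproduced in isolation. The sketch below uses a hypothetical, stripped-down Token class (only `image` and `next`) and contrasts the exclusive loop with an inclusive variant; it is an illustration of the reported behavior, not a proposed patch to TokenSequenceParser.

```java
final class TokenSpan {
    /** Hypothetical stand-in for a parser token: its text plus a link to the next token. */
    static final class Token {
        final String image;
        Token next;

        Token(String image) {
            this.image = image;
        }
    }

    /** Builds a linked chain of tokens and returns its head. */
    static Token chain(String... images) {
        Token head = null;
        Token previous = null;
        for (String image : images) {
            Token token = new Token(image);
            if (head == null) {
                head = token;
            } else {
                previous.next = token;
            }
            previous = token;
        }
        return head;
    }

    /** Returns the last token of a chain. */
    static Token last(Token head) {
        Token p = head;
        while (p.next != null) {
            p = p.next;
        }
        return p;
    }

    /** Same loop shape as in the report: stops before appending the tail. */
    static String joinExclusive(Token head, Token tail) {
        StringBuilder sb = new StringBuilder();
        for (Token p = head; p != tail; p = p.next) {
            sb.append(p.image);
        }
        return sb.toString();
    }

    /** Inclusive variant: appends the tail, then stops. */
    static String joinInclusive(Token head, Token tail) {
        StringBuilder sb = new StringBuilder();
        for (Token p = head; p != null; p = p.next) {
            sb.append(p.image);
            if (p == tail) {
                break;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Token head = chain("(", "[", "]", ")");
        Token tail = last(head);
        System.out.println(joinExclusive(head, tail));
        System.out.println(joinInclusive(head, tail));
    }
}
```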

    opened by ankitsingh2 23
  • Exception thrown for operation attempted on unknown vertex


    CoreNLP version 4.5.0 using pos lemma depparse. I run the pipeline within Spark (Scala). I lazy initialise the CoreNLP pipeline and I broadcast the pipeline to each executor using lazy instantiation wrapped in a case object. Also I force not to split the text fragment as it is intended to be a sentence already. The objective here is to do dependency analysis on the sentence and run some semgraph rules against it. We got a case where it throws an exception like this

    Caused by: edu.stanford.nlp.semgraph.UnknownVertexException: Operation attempted on unknown vertex happens/VBZ'''' in graph -> observed/VBD (root)
      -> 24/CD (nsubj)
        -> response/NN (nmod:in)
          -> In/IN (case)
          -> CoV/NNP (nmod:to)
            -> to/IN (case)
            -> SARS/NNP (compound)
            -> ‐/SYM (dep)
            -> ‐/SYM (dep)
            -> peptides/NNS (dep)
              -> 2/CD (nummod)
      -> ,/, (punct)
      -> we/PRP (nsubj)
      -> unexpectedly/RB (advmod)
      -> associated/VBN (ccomp)
        -> that/IN (mark)
        -> sirolimus/NN (nsubj:pass)
        -> was/VBD (aux:pass)
        -> significantly/RB (advmod)
        -> release/NN (obl:with)
          -> with/IN (case)
          -> a/DT (det)
          -> proinflammatory/JJ (amod)
          -> cytokine/NN (compound)
          -> levels/NNS (nmod:including)
            -> including/VBG (case)
            -> higher/JJR (amod)
            -> α/NN (nmod:of)
              -> of/IN (case)
              -> TNF/NN (compound)
              -> ‐/SYM (dep)
              -> IL/NN (conj:and)
                -> and/CC (cc)
            -> IL/NN (nmod:of)
            -> 1β/NN (nmod)
              -> ‐/SYM (dep)
      -> ./. (punct)
    
    	at edu.stanford.nlp.semgraph.SemanticGraph.parentPairs(SemanticGraph.java:730)
    	at edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT$1.advance(GraphRelation.java:325)
    	at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.initialize(GraphRelation.java:1103)
    	at edu.stanford.nlp.semgraph.semgrex.GraphRelation$SearchNodeIterator.<init>(GraphRelation.java:1084)
    	at edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT$1.<init>(GraphRelation.java:310)
    	at edu.stanford.nlp.semgraph.semgrex.GraphRelation$DEPENDENT.searchNodeIterator(GraphRelation.java:310)
    	at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.resetChildIter(NodePattern.java:339)
    	at edu.stanford.nlp.semgraph.semgrex.SemgrexMatcher.resetChildIter(SemgrexMatcher.java:80)
    	at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.resetChildIter(CoordinationPattern.java:168)
    	at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.resetChildIter(CoordinationPattern.java:168)
    	at edu.stanford.nlp.semgraph.semgrex.CoordinationPattern$CoordinationMatcher.resetChildIter(CoordinationPattern.java:168)
    	at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.resetChild(NodePattern.java:363)
    	at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.goToNextNodeMatch(NodePattern.java:457)
    	at edu.stanford.nlp.semgraph.semgrex.NodePattern$NodeMatcher.matches(NodePattern.java:574)
    	at edu.stanford.nlp.semgraph.semgrex.SemgrexMatcher.find(SemgrexMatcher.java:193)
    	at az.bikg.nlp.etl.common.nlp.Pattern.go$3(Pattern.scala:200)
    	at az.bikg.nlp.etl.common.nlp.Pattern.$anonfun$findCauseEffectMatches$6(Pattern.scala:268)
    	at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:293)
    	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    	at scala.collection.TraversableLike.flatMap(TraversableLike.scala:293)
    	at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:290)
    	at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
    	at az.bikg.nlp.etl.common.nlp.Pattern.findCauseEffectMatches(Pattern.scala:266)
    	at az.bikg.nlp.etl.steps.ERs$.findRelations(ERs.scala:107)
    	at az.bikg.nlp.etl.steps.ERs$.findRelationsSpark(ERs.scala:229)
    	at az.bikg.nlp.etl.steps.ERs$.$anonfun$extractERs$1(ERs.scala:242)
    	... 28 more
    

    Am I doing something wrong that causes this exception? It didn't happen with version 4.4.0.

    opened by mkarmona 22
  • Are these latest Chinese models significantly worse than the Stanford online parser?


    I tested the latest Chinese CoreNLP version, 3.9.2, and found the results quite poor. Here are a few examples:

    • 我的朋友: always tags "我的" as one NN token.
    • 我的狗吃苹果: '我的狗' tagged as one NN token.
    • 他的狗吃苹果: '狗吃' tagged as one NN token.
    • 高质量就业成时代: '就业' tagged as VV.

    When I compared them with the results from http://nlp.stanford.edu:8080/parser/index.jsp, surprisingly, the results for those examples were all correct. Why is that? Are the models different? Is there a bug in the new 3.9.2 model?

    opened by lingvisa 21
  • pos-tagger cannot load models from stanford-corenlp-3.5.2-models.jar


    I use Stanford Core NLP in Java as a Maven dependency. I want to use the MaxentTagger with a model supplied in the stanford-corenlp-3.5.2-models package. The problem is that I cannot access this model through classpath.

    My code is

    tagger = new MaxentTagger("/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
    

    The file "/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" exists in the jar and should be loaded through classpath, but the following exception is thrown

    Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: Unrecoverable error while loading a tagger model
        at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(MaxentTagger.java:770)
        at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:298)
        at edu.stanford.nlp.tagger.maxent.MaxentTagger.<init>(MaxentTagger.java:263)
        at cz.zcu.kiv.nlp.semeval.cwi.features.POSFeature.<init>(POSFeature.java:24)
        at cz.zcu.kiv.nlp.semeval.cwi.CWIModel.train(CWIModel.java:60)
        at cz.zcu.kiv.nlp.semeval.cwi.TrainingCrossValidation.main(TrainingCrossValidation.java:51)
    Caused by: java.io.IOException: Unable to resolve "/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" as either class path, filename or URL
        at  edu.stanford.nlp.io.IOUtils.getInputStreamFromURLOrClasspathOrFileSystem(IOUtils.java:481)
        at edu.stanford.nlp.tagger.maxent.MaxentTagger.readModelAndInit(MaxentTagger.java:765)
        ... 5 more
    

    If I copy the model out of the jar and use e.g.

    tagger = new MaxentTagger("./english-left3words-distsim.tagger");
    

    then everything works perfectly.

    The problem is probably in the class IOUtils, method findStreamInClasspathOrFileSystem(String name).

    In the line

    InputStream is = IOUtils.class.getClassLoader().getResourceAsStream(name);
    

    the returned classloader is probably the JarClassLoader which loaded the library (stanford-corenlp-3.5.2.jar) and it does not have access to other libraries.

    This theory is supported by the following code

    InputStream stream = POSFeature.class.getResourceAsStream("/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
    System.out.println("Stream == null: " + (stream == null));
    
    tagger = new MaxentTagger("/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");
    

    which outputs

    Stream == null: false
    Exception in thread "main" edu.stanford.nlp.io.RuntimeIOException: Unrecoverable error while loading a tagger model
    ...
    Caused by: java.io.IOException: Unable to resolve "/edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger" as either class path, filename or URL
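
One detail that may be worth ruling out separately from the class-loader theory: per the Java API documentation, Class.getResourceAsStream strips a leading '/' before delegating to the class loader, whereas ClassLoader.getResourceAsStream expects a name without a leading slash. The sketch below demonstrates that difference on a throwaway directory; the file name models/demo.tagger and all class and method names are hypothetical, and whether this explains the report above is not confirmed.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;

final class ResourceLookup {
    /** Returns true if the loader can open the named resource. */
    static boolean canLoad(ClassLoader loader, String name) {
        try (InputStream in = loader.getResourceAsStream(name)) {
            return in != null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Builds a loader over a temp directory containing models/demo.tagger (placeholder bytes). */
    static ClassLoader demoLoader() {
        try {
            Path root = Files.createTempDirectory("resource-demo");
            Path dir = Files.createDirectories(root.resolve("models"));
            Files.write(dir.resolve("demo.tagger"), new byte[] {1, 2, 3});
            return new URLClassLoader(new URL[] {root.toUri().toURL()});
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = demoLoader();
        System.out.println("no leading slash: " + canLoad(loader, "models/demo.tagger"));
        System.out.println("leading slash:    " + canLoad(loader, "/models/demo.tagger"));
    }
}
```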
    
    opened by konkol 21
  • Why is there no description of how to set up the models jar with a build tool?


    The README describes how to install the models jars, but it does not describe a method using build tools (e.g. Gradle). However, I tried this way ( https://stackoverflow.com/a/68859054/3809427 ) and succeeded. Why not document this useful method?

    opened by lamrongol 3
  • EntityMentions returns null instead of empty list


    I ran into an issue where, if an empty string is passed in, getting the entityMentions returns null instead of an empty list, which I figure would be standard practice.

    Example code:

    StanfordCoreNLP nlpProcessor = new StanfordCoreNLP(props);
    CoreDocument nlpDocument = new CoreDocument("");
    
    nlpProcessor.annotate(nlpDocument);
    List<CoreEntityMention> entities = nlpDocument.entityMentions(); // <== returns null
    

    Just wanted to know if there is any light that can be shed on this. If this is expected behavior then I will try my best to document it in the documentation
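
Until the behavior is clarified, callers can guard against the null on their side. A minimal sketch, with a hypothetical helper name that is not part of CoreNLP:

```java
import java.util.Collections;
import java.util.List;

final class NullSafe {
    /** Returns the given list, or an immutable empty list if it is null. */
    static <T> List<T> orEmpty(List<T> maybeNull) {
        return maybeNull == null ? Collections.emptyList() : maybeNull;
    }

    public static void main(String[] args) {
        List<String> mentions = null; // stands in for a call that returned null
        for (String mention : orEmpty(mentions)) {
            System.out.println(mention); // never reached; the loop is simply skipped
        }
        System.out.println(orEmpty(mentions).size());
    }
}
```

In the example from this report, that would mean wrapping the result of entityMentions() in the helper before iterating over it.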

    opened by cholojuanito 2
  • 'email' tokenizing as 'em, ail, and '


    In the following sentence (from Twitter), 'email' is being tokenized as 'em, ail, and '. This is obviously incorrect. What can be done to stop this split?

    • It's official (according to the AP) it's 'email' not 'e-mail' and 'website' not 'web-site'!

    I have the following parameters set:

    tokenize.language: English
    tokenize.whitespace: false (because we want tokens like it's to separate into it and 's)
    tokenize.keepeol: false
    tokenize.verbose: false
    tokenize.options: invertible=true,splitAssimilations=false,splitHyphenated=false,splitForwardSlash=true,untokenizable=allKeep,strictTreebank3=true,normalizeSpace=false,ellipses=original

    opened by saxtell-cb 6
  • parsing '`'


    curl 'http://localhost:9000/?properties={%22annotators%22%3A%22lemma%22%2C%22outputFormat%22%3A%22json%22}' -d '`'
    

    Gives me:

    {
      "sentences": [
        {
          "index": 0,
          "tokens": [
            {
              "index": 1,
              "word": "`",
              "originalText": "`",
              "lemma": "`",
              "characterOffsetBegin": 0,
              "characterOffsetEnd": 1,
              "pos": "``",
              "before": "",
              "after": ""
            }
          ]
        }
      ]
    }
    

    With the standard English model. Is this expected? I'm particularly surprised at the POS.
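
For readers who want to build the same request URL from code rather than by hand, here is a minimal sketch using the JDK's URLEncoder. The class and method names are hypothetical. Unlike the curl command above, it also percent-encodes the braces.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

final class ServerQuery {
    /** Appends the JSON properties to the base URL as a percent-encoded query parameter. */
    static String propertiesUrl(String baseUrl, String propertiesJson) {
        return baseUrl + "/?properties="
            + URLEncoder.encode(propertiesJson, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String json = "{\"annotators\":\"lemma\",\"outputFormat\":\"json\"}";
        System.out.println(propertiesUrl("http://localhost:9000", json));
    }
}
```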

    opened by AntonOfTheWoods 5
  • Stanford CoreNLP slower after upgrade from 3.7.0 to 4.5.1


    A unit test that runs in a loop calling pipeline.annotate(document) appears to be taking about 50% longer. Our configuration properties didn't change during the upgrade, but maybe some new properties have been added in 4.5.1? Below is what we have. Is there a way to determine which annotator is using more time now?

    customAnnotatorClass.tokensregex=edu.stanford.nlp.pipeline.TokensRegexAnnotator
    sutime.binders=0
    tokensregex.rules= .... (omitted)
    ssplit.eolonly=false
    customAnnotatorClass.tokenOverride_en= .... (omitted)
    annotators=tokenize, ssplit, tokenOverride_en, pos, lemmaOverride_en, ner, tokensregex, entitymentions, parse
    language=en
    tokenize.whitespace=false
    customAnnotatorClass.lemmaOverride_en=.... (omitted)
    tokenize.options=untokenizable=allKeep,americanize=false
    ssplit.isOneSentence=true
    nermention.acronyms=true
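
One generic way to see where time goes is to run the stages one at a time and record the wall-clock time of each. The sketch below shows the idea with plain functions standing in for annotators; it does not use any CoreNLP API, and all names are hypothetical. Timings from a single run are noisy, so a real measurement should repeat the loop and discard warm-up iterations.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

final class StageTimer {
    /** Runs each named stage in order and records its elapsed time in nanoseconds. */
    static <T> Map<String, Long> time(Map<String, UnaryOperator<T>> stages, T input) {
        Map<String, Long> elapsed = new LinkedHashMap<>();
        T current = input;
        for (Map.Entry<String, UnaryOperator<T>> stage : stages.entrySet()) {
            long start = System.nanoTime();
            current = stage.getValue().apply(current);
            elapsed.put(stage.getKey(), System.nanoTime() - start);
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Placeholder stages; in practice each would wrap one processing step.
        Map<String, UnaryOperator<String>> stages = new LinkedHashMap<>();
        stages.put("trim", String::trim);
        stages.put("lowercase", String::toLowerCase);
        System.out.println(time(stages, "  Some Input Text  "));
    }
}
```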

    opened by dsbanks99 1
Releases (v4.5.1)
  • v4.5.1 (Aug 30, 2022)

    CoreNLP 4.5.1

    Bugfixes!

    • Fix tokenizer regression: 4.5.0 will tokenize ",5" as one word https://github.com/stanfordnlp/CoreNLP/commit/974383ab7336a254d260264885186dd77df0cf81
    • Use a LinkedHashMap in the PTBTokenizer instead of Properties. Keeps the option processing order predictable. https://github.com/stanfordnlp/CoreNLP/issues/1289 https://github.com/stanfordnlp/CoreNLP/commit/655018895e2f2870ce721de42d31b845fa991335
    • Fix \r\n not being properly processed on Windows: #1291 https://github.com/stanfordnlp/CoreNLP/commit/9889f4ef4ee9feb8b70f577db8353c8d6c896ae3
    • Handle one half of surrogate character pairs in the tokenizer w/o crashing https://github.com/stanfordnlp/CoreNLP/issues/1298 https://github.com/stanfordnlp/CoreNLP/commit/1b12faa64b9ea85f808b27ab74ccf9f79ccb01f4
    • Attempt to fix semgrex "Unknown vertex" errors which have plagued CoreNLP for years in hard to track down circumstances: https://github.com/stanfordnlp/CoreNLP/issues/1296 https://github.com/stanfordnlp/CoreNLP/issues/1229 https://github.com/stanfordnlp/CoreNLP/issues/1169 https://github.com/stanfordnlp/CoreNLP/commit/f99b5ab87f073118a971c4d1e39df85ab9abbab1
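
The point about predictable option order can be illustrated with a small sketch: a LinkedHashMap iterates in insertion order, whereas Properties is hash-based and makes no ordering guarantee. The parser below is a hypothetical example of "k=v,k=v" option parsing, not the PTBTokenizer code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class OptionOrder {
    /** Parses "k=v,k=v" into a map that iterates in the order the options were written. */
    static Map<String, String> parse(String options) {
        Map<String, String> parsed = new LinkedHashMap<>();
        for (String option : options.split(",")) {
            String trimmed = option.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            int eq = trimmed.indexOf('=');
            if (eq < 0) {
                parsed.put(trimmed, "true"); // bare flag, treated as enabled
            } else {
                parsed.put(trimmed.substring(0, eq), trimmed.substring(eq + 1));
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        System.out.println(parse("untokenizable=allKeep,americanize=false,invertible"));
    }
}
```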
  • v4.5.0 (Jul 22, 2022)

    CoreNLP 4.5.0

    Main features are improved lemmatization of English, improved tokenization of both English and non-English flex-based languages, and some updates to tregex, tsurgeon, and semgrex

    • All PTB and German tokens normalized now in PTBLexer (previously only German umlauts). This makes the tokenizer 2% slower, but should avoid issues with resume' for example https://github.com/stanfordnlp/CoreNLP/commit/d46fecd93c6964f635efe85d9b7c327ee8880fb9

    • log4j removed entirely from public CoreNLP (internal "research" branch still has a use) https://github.com/stanfordnlp/CoreNLP/commit/f05cb54ec0a4f3c90395771817f44a81eb549baf

    • Fix NumberFormatException showing up in NER models: https://github.com/stanfordnlp/CoreNLP/issues/547 https://github.com/stanfordnlp/CoreNLP/commit/5ee2c391104109a338a28f35c647b7684b00ad41

    • Fix "seconds" in the lemmatizer: https://github.com/stanfordnlp/CoreNLP/commit/e7a073bde9ba7bbdb40ba81ed96d379455629e44

    • Fix double escaping of & in the online demos: https://github.com/stanfordnlp/CoreNLP/commit/8413fa1fc432aa2a13cbb4a296352bb9bad4d0cb

    • Report the cause of an error if "tregex" is asked for but no parse annotator is added: https://github.com/stanfordnlp/CoreNLP/commit/4db80c051322697c983ecda873d8d38f808cb96c

    • Merge ssplit and cleanxml into the tokenize annotator (done in a backwards compatible manner): https://github.com/stanfordnlp/CoreNLP/pull/1259

    • Custom tregex pattern, ROOT tregex pattern, and tsurgeon operation for simultaneously moving a subtree and pruning anything left behind, used for processing the Italian VIT treebank in stanza: https://github.com/stanfordnlp/CoreNLP/pull/1263

    • Refactor tokenization of punctuation, filenames, and other entities common to all languages, not just English: https://github.com/stanfordnlp/CoreNLP/commit/3c40ba32ca51af02936b907d03406e2158883f7b https://github.com/stanfordnlp/CoreNLP/commit/58a2288239f631df47fac3eed105fe78c08b1a5d https://github.com/stanfordnlp/CoreNLP/commit/8b97d64e48e6d4161f62a8635d2bb4cee2e95553

    • Improved tokenization of number patterns, names with apostrophes such as Sh'reyan, non-American phone numbers, invisible commas https://github.com/stanfordnlp/CoreNLP/commit/9476a8eb724e01df4b05bce38789dd8a7e61397c https://github.com/stanfordnlp/CoreNLP/commit/6193934af8ae0abb0b4c6a2522d7efdfa426e5b3 https://github.com/stanfordnlp/CoreNLP/commit/afb1ea89c874acd58bab584f1e29a059c44dfd20 https://github.com/stanfordnlp/CoreNLP/commit/7c84960df4ac9d391ef37855572e2f8bc301ee17

    • Significant lemmatizer improvements: adjectives & adverbs, along with some various other special cases https://github.com/stanfordnlp/CoreNLP/pull/1266

    • Include graph & semgrex indices in the results for a semgrex query (will make the results more usable) https://github.com/stanfordnlp/CoreNLP/commit/45b47e245c367663bba2e81a26ea7c29262ad0d8

    • Trim words in the NER training process. Spaces can still be inside a word, but random whitespace won't ruin the performance of the models https://github.com/stanfordnlp/CoreNLP/commit/0d9e9c829bfa75bb661cccea03fc682a0f955f0d

    • Fix NBSP in the Chinese segmenter https://github.com/stanfordnlp/stanza/issues/1052 https://github.com/stanfordnlp/CoreNLP/pull/1279

  • v4.4.0(Jan 25, 2022)

    Enhancements

    • added -preTokenized option which will assume text should be tokenized on white space and sentence split on newline

    • tsurgeon CLI - python side added to stanza
      https://github.com/stanfordnlp/CoreNLP/pull/1240

    • sutime WORKDAY definition https://github.com/stanfordnlp/CoreNLP/commit/0dfb11817c2b46a532985c24289e128fbb81a2c0
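The behavior described for the -preTokenized option (tokens split on white space, sentences split on newlines) can be illustrated in plain Java. This is a sketch of the described behavior under stated assumptions, not CoreNLP's implementation: the class `PreTokenizedSketch` and its `split` method are hypothetical names, and skipping blank lines is a choice made here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; it mimics the described behavior without using CoreNLP.
final class PreTokenizedSketch {

  // One sentence per line, one token per whitespace-separated chunk.
  static List<List<String>> split(String text) {
    List<List<String>> sentences = new ArrayList<>();
    // "\\R" matches any line break, including "\r\n" as a single break.
    for (String line : text.split("\\R")) {
      String trimmed = line.trim();
      if (trimmed.isEmpty()) {
        continue; // assumption: blank lines do not produce empty sentences
      }
      sentences.add(Arrays.asList(trimmed.split("\\s+")));
    }
    return sentences;
  }

  public static void main(String[] args) {
    String text = "Stanford CoreNLP is here .\nIt runs on Java .";
    for (List<String> sentence : split(text)) {
      System.out.println(sentence);
    }
  }
}
```

Input prepared this way keeps an external tokenization intact, since no token is ever split further or merged with its neighbor.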

    Fixes

    • rebuilt Italian dependency parser using CoreNLP predicted tags

    • XML security issue: https://github.com/stanfordnlp/CoreNLP/pull/1241

    • NER server security issue: https://github.com/stanfordnlp/CoreNLP/commit/5ee097dbede547023e88f60ed3f430ff09398b87

    • fix infinite loop in tregex: https://github.com/stanfordnlp/CoreNLP/pull/1238

    • json utf-8 output on windows https://github.com/stanfordnlp/CoreNLP/pull/1231 https://github.com/stanfordnlp/stanza/issues/894

    • fix openie crash in certain unusual graphs https://github.com/stanfordnlp/CoreNLP/pull/1230 https://github.com/stanfordnlp/CoreNLP/issues/1082

    • fix nondeterministic results in certain SemanticGraph structures https://github.com/stanfordnlp/CoreNLP/pull/1228 https://github.com/stanfordnlp/CoreNLP/commit/cc806f265292977b69fd55f36408fe5ad3a695a0

    • workaround for NLTK sending % unescaped to the server https://github.com/stanfordnlp/CoreNLP/issues/1226 https://github.com/stanfordnlp/CoreNLP/commit/20fe1e996455b1c1434022d6e7f0b8524f41f253

    • make TimingTest function on Windows https://github.com/stanfordnlp/CoreNLP/commit/4aafb84f6ea5c0102c921a503cbfb8e3d34f3e22
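The unescaped % issue above comes down to URL decoding: a literal "%" must be sent as "%25", and a decoder that meets a bare "%" rejects the input. The sketch below shows both sides with the standard library, assuming Java 10 or later for the `Charset` overloads. The class `PercentEscapingDemo` and the fallback of returning the raw text are assumptions for illustration, not a description of the workaround CoreNLP actually applies.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of CoreNLP.
final class PercentEscapingDemo {

  // What a well-behaved client does: escape the text before sending it.
  static String encode(String text) {
    return URLEncoder.encode(text, StandardCharsets.UTF_8);
  }

  // A lenient server-side decode. If the body contains a bare "%", URLDecoder
  // throws IllegalArgumentException; this sketch then treats the body as raw text.
  static String decodeLeniently(String body) {
    try {
      return URLDecoder.decode(body, StandardCharsets.UTF_8);
    } catch (IllegalArgumentException e) {
      return body;
    }
  }

  public static void main(String[] args) {
    String escaped = encode("50% off");
    System.out.println(escaped);
    System.out.println(decodeLeniently(escaped));
    System.out.println(decodeLeniently("50% off"));
  }
}
```

Clients that call the server directly can avoid the problem entirely by encoding the request text before sending it.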

  • v4.3.2(Nov 18, 2021)

  • v4.3.1(Oct 22, 2021)

    Fixes

    • character offset issue with StatTok
    • fixes path issue with default Hungarian properties
    • adds Hungarian and Italian to demo
    • fixes umlaut issue
  • v4.3.0(Oct 6, 2021)

    Overview

    This release adds new European languages, improvements to the parsers and tokenizers, and other misc. fixes.

    Enhancements

    • Hungarian pipeline
    • Italian pipeline
    • Improvements to English tokenizer
    • Better memory usage by dependency parser

    Fixes

    • issue with umlaut handling in German #1184
  • v4.2.2(May 14, 2021)

    This release includes some small fixes to version 4.2.1.

    It includes:

    • demo fixes for 4.2.2, resolving cache issues with demo resources
    • small fix to RegexNERSequenceClassifier issue allowing AnswerAnnotation to be overwritten
  • v4.2.1(May 5, 2021)

    Fix the server having some links http instead of https https://github.com/stanfordnlp/CoreNLP/issues/1146

    Improve MWE expressions in the enhanced dependency conversion https://github.com/stanfordnlp/CoreNLP/commit/1ef9ef9c75e6948eed10092bf6d1c49c49cfabaa

    Add the ability for the command line semgrex processor to handle multiple calls in one process https://github.com/stanfordnlp/CoreNLP/commit/c9d50ef9cb2e1851257d06cda55b1456d69145b7

    Fix interaction between discarding tokens in ssplit and assigning NER tags https://github.com/stanfordnlp/CoreNLP/commit/a803bc357c32841beb3919f2e4dc22a1375dca4d

    Reduce the size of the sr parser models (not a huge amount, but some) https://github.com/stanfordnlp/CoreNLP/pull/1142

    Various QuoteAnnotator bug fixes https://github.com/stanfordnlp/CoreNLP/pull/1135 https://github.com/stanfordnlp/CoreNLP/issues/1134 https://github.com/stanfordnlp/CoreNLP/pull/1121 https://github.com/stanfordnlp/CoreNLP/issues/1118 https://github.com/stanfordnlp/CoreNLP/commit/9f1b015ea91f1db6dce6ab7f35aacb9cdc33e463 https://github.com/stanfordnlp/CoreNLP/issues/1147

    Switch to newer istack implementation https://github.com/stanfordnlp/CoreNLP/pull/1133

    Newer protobuf https://github.com/stanfordnlp/CoreNLP/pull/1150

    Add a conllu output format to some of the segmenter code, useful for testing with the official test scripts https://github.com/stanfordnlp/CoreNLP/commit/c70ddec9736e9d3c7effd4660f63e363caeb333d

    Fix Turkish locale enums https://github.com/stanfordnlp/CoreNLP/pull/1126 https://github.com/stanfordnlp/stanza/issues/580

    Use StringBuilder instead of StringBuffer where possible https://github.com/stanfordnlp/CoreNLP/pull/1010
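The Turkish locale fix listed above relates to a common Java pitfall: `String.toUpperCase()` uses the default locale, and under a Turkish locale "i" becomes the dotted capital "İ" (U+0130), so `Enum.valueOf` no longer finds the constant. The sketch below shows the pitfall and the usual remedy of passing `Locale.ROOT`. The enum `OutputFormat` and the class name are hypothetical, and whether the linked pull request uses exactly this remedy is an assumption.

```java
import java.util.Locale;

// Hypothetical demo class and enum; not part of CoreNLP.
final class LocaleSafeEnumDemo {

  enum OutputFormat { TEXT, JSON, INLINE_XML }

  // Locale-independent lookup: Locale.ROOT always maps 'i' to 'I'.
  static OutputFormat parse(String name) {
    return OutputFormat.valueOf(name.toUpperCase(Locale.ROOT));
  }

  // What the same conversion produces when the Turkish locale is in effect.
  static String upperTurkish(String name) {
    return name.toUpperCase(Locale.forLanguageTag("tr"));
  }

  public static void main(String[] args) {
    System.out.println(parse("inline_xml"));
    // Contains U+0130, so it would not match any constant in valueOf.
    System.out.println(upperTurkish("inline_xml"));
  }
}
```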

  • v4.2.0(Nov 17, 2020)

    Overview

    This release features a collection of small bug fixes and updates. It is the first release built directly from the GitHub repo.

    Enhancements

    • Upgrade libraries (EJML, JUnit, JFlex)
    • Add character offsets to Tregex responses from server
    • Improve cleaning of treebanks for English models
    • Speed up loading of Wikidict annotator
    • New utility for tagging CoNLL-U files in place
    • Command line tool for processing TokensRegex

    Fixes

    • Output single token NER entities in inline XML output format
    • Add currency symbol part of speech training data
    • Fix issues with tree binarizing
  • v4.0.0(May 4, 2020)

    Overview

    The latest release of Stanford CoreNLP includes a major overhaul of tokenization and a large collection of new parsing and tagging models. There are also miscellaneous enhancements and fixes.

    Enhancements

    • UD v2.0 tokenization standard for English, French, German, and Spanish. That means "new" LDC tokenization for English (splitting on most hyphens) and not escaping parentheses or turning quotes etc. into ASCII sequences by default.
    • Upgrade options for normalizing special chars (quotes, parentheses, etc.) in PTBTokenizer
    • Have WhitespaceTokenizer support same newline processing as PTBTokenizer
    • New mwt annotator for handling multiword tokens in French, German, and Spanish.
    • New models with more training data and better performance for tagging and parsing in English, French, German, and Spanish.
    • Add French NER
    • New Chinese segmentation based off CTB9
    • Improved handling of double codepoint characters
    • Easier syntax for specifying language specific pipelines and NER pipeline properties
    • Improved CoNLL-U processing
    • Improved speed and memory performance for CRF training
    • Tregex support in CoreSentence
    • Updated library dependencies
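"Double codepoint characters" in the list above refers to characters outside the Basic Multilingual Plane, which Java strings store as two `char` values (a surrogate pair). The sketch below shows the difference between counting `char` values and counting code points, and how to detect half of a pair on its own. The class `CodePointDemo` and its methods are hypothetical names for illustration and do not reproduce CoreNLP's tokenizer logic.

```java
// Hypothetical demo class; not part of CoreNLP.
final class CodePointDemo {

  // Number of Unicode code points, which can be smaller than String.length().
  static int codePointLength(String s) {
    return s.codePointCount(0, s.length());
  }

  // True if the string contains a high or low surrogate without its partner.
  static boolean hasUnpairedSurrogate(String s) {
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (Character.isHighSurrogate(c)) {
        if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
          i++; // skip the low half of a well-formed pair
        } else {
          return true;
        }
      } else if (Character.isLowSurrogate(c)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    String emoji = "\uD83D\uDE00"; // one code point stored as two chars
    System.out.println(emoji.length() + " chars, " + codePointLength(emoji) + " code point");
    System.out.println(hasUnpairedSurrogate("a\uD83Db"));
  }
}
```

Code that indexes text by `char` position can split such a pair in the middle, which is why character offsets and tokenizer rules need to treat the pair as a unit.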

    Fixes

    • NPE while simultaneously tokenizing on whitespace and sentence splitting on newlines
    • NPE in EntityMentionsAnnotator during language check
    • NPE in CorefMentionAnnotator while aligning coref mentions with titles and entity mentions
    • NPE in NERCombinerAnnotator in certain configurations of models on/off
    • Incorrect handling of eolonly option in ArabicSegmenterAnnotator
    • Apply named entity granularity change prior to coref mention detection
    • Incorrect handling of keeping newline tokens when using Chinese segmenter on Windows
    • Incorrect handling of reading in German treebank files
    • SR parser crashes when given bad training input
    • New PTBTokenizer known abbreviations: "Tech.", "Amb.". Fix legacy tokenizer hack special casing 'Alex.' for 'Alex. Brown'
    • Fix ancient bug in printing constituency tree with multiple roots.
    • Fix parser failing on the word "STOP" because it was treated as a special word