Python implementation of TextRank for phrase extraction and summarization of text documents

Overview

PyTextRank

PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension, used to:

  • extract the top-ranked phrases from text documents
  • run low-cost extractive summarization of text documents
  • help infer links from unstructured text into structured data

Background

One of the goals for PyTextRank is to provide support (eventually) for entity linking, in contrast to the more commonplace usage of named entity recognition. These approaches can be used together in complementary ways to improve the results overall.

The introduction of graph algorithms -- notably, eigenvector centrality -- provides a more flexible and robust basis for integrating additional techniques that enhance the natural language work being performed. The entity linking aspects here are still a work-in-progress scheduled for a later release.

Internally PyTextRank constructs a lemma graph to represent links among the candidate phrases (e.g., unrecognized entities) and their supporting language. Generally speaking, any means of enriching that graph prior to phrase ranking will tend to improve results. Possible ways to enrich the lemma graph include coreference resolution and semantic relations, as well as leveraging knowledge graphs in the general case.
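
For a rough picture of how ranking over a lemma graph behaves, here is a toy sketch using networkx (the nodes and edges are invented for illustration, not PyTextRank's internals):

import networkx as nx

# toy lemma graph: nodes are lemmas, edges link co-occurring terms
g = nx.Graph()
g.add_edges_from([
    ("linear", "constraint"), ("linear", "equation"),
    ("constraint", "system"), ("equation", "system"),
    ("minimal", "set"), ("set", "solution"),
])

# rank nodes with PageRank, an eigenvector-centrality-style measure, as in TextRank
for lemma, rank in sorted(nx.pagerank(g).items(), key=lambda x: x[1], reverse=True):
    print(f"{rank:.4f}  {lemma}")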

For example, WordNet and DBpedia both provide means for inferring links among entities, and purpose-built knowledge graphs can be applied for specific use cases. These can help enrich a lemma graph even in cases where links are not explicit within the text. Consider a paragraph that mentions cats and kittens in different sentences: an implied semantic relation exists between the two nouns since the lemma kitten is a hyponym of the lemma cat -- such that an inferred link can be added between them.
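
As a rough sketch of that kind of inference, assuming NLTK's WordNet interface and a networkx graph (the names and structure are illustrative, not PyTextRank's internals):

import networkx as nx
from nltk.corpus import wordnet as wn  # requires nltk.download("wordnet")

# toy lemma graph with two nouns that never co-occur in the same sentence
lemma_graph = nx.Graph()
lemma_graph.add_nodes_from(["cat", "kitten"])

# if one synset appears in the other's hypernym closure, add an inferred link;
# note: whether this test fires depends on WordNet's actual taxonomy
cat = wn.synset("cat.n.01")
kitten = wn.synset("kitten.n.01")

if cat in kitten.closure(lambda s: s.hypernyms()):
    lemma_graph.add_edge("cat", "kitten", kind="inferred")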

This has an additional benefit of linking parsed and annotated documents into more structured data, and can also be used to support knowledge graph construction.

The TextRank algorithm used here is based on research published in:
"TextRank: Bringing Order into Text"
Rada Mihalcea, Paul Tarau
Empirical Methods in Natural Language Processing (2004)

Several modifications in PyTextRank improve on the algorithm originally described in the paper:

  • fixed a bug: see Java impl, 2008
  • use lemmatization in place of stemming
  • include verbs in the graph (but not in the resulting phrases)
  • leverage preprocessing via noun chunking and named entity recognition
  • provide extractive summarization based on ranked phrases

This implementation was inspired by the Williams 2016 talk on text summarization. Note that while much better approaches exist for summarizing text, questions linger about some of the top contenders -- see: 1, 2. Arguably, having alternatives such as this one allows for cost trade-offs.

Installation

Prerequisites: Python 3.x and spaCy.

To install from PyPI:

pip install pytextrank
python -m spacy download en_core_web_sm

If you install directly from this Git repo, be sure to install the dependencies as well:

pip install -r requirements.txt
python -m spacy download en_core_web_sm

Usage

import spacy
import pytextrank

# example text
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."

# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")

# add PyTextRank to the spaCy pipeline
# (importing pytextrank registers the "textrank" factory with spaCy 3.x)
nlp.add_pipe("textrank", last=True)

doc = nlp(text)

# examine the top-ranked phrases in the document
for p in doc._.phrases:
    print("{:.4f} {:5d}  {}".format(p.rank, p.count, p.text))
    print(p.chunks)
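
The pipeline component also provides the low-cost extractive summarization mentioned above, via its summary() method (the parameter values here are arbitrary):

# summarize the document, based on its top-ranked phrases
for sent in doc._.textrank.summary(limit_phrases=15, limit_sentences=5):
    print(sent)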

For other example usage, see the PyTextRank wiki. If you need to troubleshoot any problems, please check the project's GitHub issues.

For related course materials and training, please check for calendar updates in the article "Natural Language Processing in Python".

Let us know if you find this package useful, tell us about use cases, describe what else you would like to see integrated, etc. For inquiries about consulting work in machine learning, natural language, knowledge graph, and other AI applications, contact Derwen, Inc.

Testing

To run the unit tests:

coverage run -m unittest discover

To generate a coverage report and upload it to the codecov.io reporting site:

coverage report
bash <(curl -s https://codecov.io/bash) -t @.cc_token

Test coverage reports can be viewed at https://codecov.io/gh/DerwenAI/pytextrank

License and Copyright

Source code for PyTextRank plus its logo, documentation, and examples have an MIT license which is succinct and simplifies use in commercial applications.

All materials herein are Copyright © 2016-2021 Derwen, Inc.

Attribution

Please use the following BibTeX entry for citing PyTextRank if you use it in your research or software. Citations are helpful for the continued development and maintenance of this library.

@software{PyTextRank,
  author = {Paco Nathan},
  title = {{PyTextRank, a Python implementation of TextRank for phrase extraction and summarization of text documents}},
  year = 2016,
  publisher = {Derwen},
  url = {https://github.com/DerwenAI/pytextrank}
}

TODOs

  • kglab integration
  • generate MkDocs
  • MyPy and PyLint coverage
  • include more unit tests
  • show examples of spacy-wordnet to enrich the lemma graph
  • leverage neuralcoref to enrich the lemma graph
  • generate a phrase graph, with entity linking into Wikidata, etc.
  • fix Sphinx errors, generate docs

Kudos

Many thanks to our contributors: @louisguitton, @anna-droid-beep, @kavorite, @htmartin, @williamsmj, @mattkohl, @vanita5, @HarshGrandeur, @mnowotka, @kjam, @dvsrepo, @SaiThejeshwar, @laxatives, @dimmu, @JasonZhangzy1757, @jake-aft, @junchen1992, @Ankush-Chander, @shyamcody, @chikubee, encouragement from the wonderful folks at spaCy, plus general support from Derwen, Inc.


Comments
  • Example file throws KeyError: 1255


    Have not been able to get either the long form (from wiki) or short form (from github readme) files to work successfully.

    The file @ https://github.com/DerwenAI/pytextrank/blob/master/example.py throws a KeyError: 1255 when run. Output for this is below.

    I have been able to get the example from the github page working but only for very small strings. Anything larger than a few words throws a KeyError with varying number depending on the length of the string.

    Can't figure out the issue even using all input (txt files) from the example on the wiki page and changing the spacy version to various releases from 2.0.0 to present.


    KeyError                                  Traceback (most recent call last)
    <ipython-input> in <module>()
         31     text = f.read()
         32
    ---> 33 doc = nlp(text)
         34
         35 print("pipeline", nlp.pipe_names)

    /home/pete/.local/lib/python3.5/site-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
        433         if not hasattr(proc, "__call__"):
        434             raise ValueError(Errors.E003.format(component=type(proc), name=name))
    --> 435         doc = proc(doc, **component_cfg.get(name, {}))
        436         if doc is None:
        437             raise ValueError(Errors.E005.format(name=name))

    /usr/local/lib/python3.5/dist-packages/pytextrank/pytextrank.py in PipelineComponent(self, doc)
        530         """
        531         self.doc = doc
    --> 532         Doc.set_extension("phrases", force=True, default=self.calc_textrank())
        533         Doc.set_extension("textrank", force=True, default=self)
        534

    /usr/local/lib/python3.5/dist-packages/pytextrank/pytextrank.py in calc_textrank(self)
        389
        390         for chunk in self.doc.noun_chunks:
    --> 391             self.collect_phrases(chunk)
        392
        393         for ent in self.doc.ents:

    /usr/local/lib/python3.5/dist-packages/pytextrank/pytextrank.py in collect_phrases(self, chunk)
        345         if key in self.seen_lemma:
        346             node_id = list(self.seen_lemma.keys()).index(key)
    --> 347             rank = self.ranks[node_id]
        348             phrase.sq_sum_rank += rank
        349             compound_key.add(key)

    KeyError: 1255

    bug 
    opened by oldskewlcool 17
  • A question on keyphrases that are subsets of others and overlapping `Spans`


    I think the current implementation returns keyphrases that are potential subsets of each other, that this is due to the use of noun_chunks and ents, and that this is not the desired output. Specifically, if a document has an entity that is a superset (as far as span start and end is concerned) of a noun chunk (or vice-versa), and both contain a key token, then both will be returned as keyphrases.

While also possibly linked to the issue of entity linkage (which I'd love to know more about!), this can simply be a matter of defining "entity" boundaries and a "duplication" issue, as demonstrated by the example below with "Seoul's Four Seasons hotel" and "Four Seasons", where I believe one keyphrase is enough and having both is confusing.

    Am I missing something? Is this the desired logic?

    Example:

    from spacy.util import filter_spans
    import pytextrank
    import en_core_web_sm
    
    nlp = en_core_web_sm.load()
    nlp.add_pipe("textrank", last=True);
    
    # from dat/lee.txt
    text = """
    After more than four hours of tight play and a rapid-fire endgame, Google's artificially intelligent Go-playing computer system has won a second contest against grandmaster Lee Sedol, taking a two-games-to-none lead in their historic best-of-five match in downtown Seoul.  The surprisingly skillful Google machine, known as AlphaGo, now needs only one more win to claim victory in the match. The Korean-born Lee Sedol will go down in defeat unless he takes each of the match's last three games. Though machines have beaten the best humans at chess, checkers, Othello, Scrabble, Jeopardy!, and so many other games considered tests of human intellect, they have never beaten the very best at Go. Game Three is set for Saturday afternoon inside Seoul's Four Seasons hotel.  The match is a way of judging the suddenly rapid progress of artificial intelligence. One of the machine-learning techniques at the heart of AlphaGo has already reinvented myriad online services inside Google and other big-name Internet companies, helping to identify images, recognize commands spoken into smartphones, improve search engine results, and more. Meanwhile, another AlphaGo technique is now driving experimental robotics at Google and places like the University of California at Berkeley. This week's match can show how far these technologies have come - and perhaps how far they will go.  Created in Asia over 2,500 year ago, Go is exponentially more complex than chess, and at least among humans, it requires an added degree of intuition. Lee Sedol is widely-regarded as the top Go player of the last decade, after winning more international titles than all but one other player. He is currently ranked number five in the world, and according to Demis Hassabis, who leads DeepMind, the Google AI lab that created AlphaGo, his team chose the Korean for this all-important match because they wanted an opponent who would be remembered as one of history's great players.  Although AlphaGo topped Lee Sedol in the match's first game on Wednesday afternoon, the outcome of Game Two was no easier to predict. In his 1996 match with IBM's Deep Blue supercomputer, world chess champion Gary Kasparov lost the first game but then came back to win the second game and, eventually, the match as a whole. It wasn't until the following year that Deep Blue topped Kasparov over the course of a six-game contest. The thing to realize is that, after playing AlphaGo for the first time on Wednesday, Lee Sedol could adjust his style of play - just as Kasparov did back in 1996. But AlphaGo could not. Because this Google creation relies so heavily on machine learning techniques, the DeepMind team needs a good four to six weeks to train a new incarnation of the system. And that means they can't really change things during this eight-day match.  "This is about teaching and learning," Hassabis told us just before Game Two. "One game is not enough data to learn from - for a machine - and training takes an awful lot of time."
    """
    
    doc = nlp(text)
    
    key_spans = []
    for phrase in doc._.phrases:
        for chunk in phrase.chunks:
            key_spans.append(chunk)
    
    print(len(key_spans))
    
    full_set = set([p.text for p in doc._.phrases])
    
    print(full_set)
    
    print(len(filter_spans(key_spans)))
    
    sub_set = set([pytextrank.util.default_scrubber(p) for p in filter_spans(key_spans)])
    
    print(sub_set)
    
    print(full_set - sub_set)
    
    print(sub_set - full_set)
    

    Possible solution?

    all_spans = list(self.doc.noun_chunks) + list(self.doc.ents)
    filtered_spans = filter_spans(all_spans)
    filtered_phrases = self._collect_phrases(filtered_spans, self.ranks) # replacing all_phrases
    

    instead of

    nc_phrases: typing.Dict[Span, float] = self._collect_phrases(self.doc.noun_chunks, self.ranks)
    ent_phrases: typing.Dict[Span, float] = self._collect_phrases(self.doc.ents, self.ranks)
    all_phrases: typing.Dict[Span, float] = { **nc_phrases, **ent_phrases }
    

    see https://github.com/DerwenAI/pytextrank/blob/29339027b905844af0064ed9a0326e2578f21bf6/pytextrank/base.py#L362

    Note:

    • My understanding is that self._get_min_phrases is doing something else.
    • spacy.util.filter_spans simply keeps the (first) longest spans, which might not be the best solution; see the sketch below.
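
    To illustrate that behavior, a small standalone sketch (token indices assume spaCy's default English tokenizer):

    import spacy
    from spacy.util import filter_spans

    nlp = spacy.blank("en")
    doc = nlp("Seoul's Four Seasons hotel")

    long_span = doc[0:5]   # "Seoul's Four Seasons hotel"
    short_span = doc[2:4]  # "Four Seasons"

    # filter_spans keeps the longest spans; on equal length, the first one
    print(filter_spans([short_span, long_span]))  # -> [Seoul's Four Seasons hotel]
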
    enhancement 
    opened by DayalStrub 11
  • Errors importing from pytextrank


    Hi! I'm working on a project connected with NLP and was happy to find out that there is such a tool as PyTextRank. However, I've encountered an issue at the very beginning trying to just import package to run the example code given here. The error that I get is the following:

    ----> from pytextrank import json_iter, parse_doc, pretty_print
    ImportError: cannot import name 'json_iter'
    ----> from pytextrank import parse_doc
    ImportError: cannot import name 'parse_doc'
    

    I've tried running it in an IPython console and in a Jupyter Notebook, with the same result in both. I've installed PyTextRank with pip; my versions are Python 3.5.4, spaCy 2.1.8, networkx 2.4, graphviz 0.13.2.

    question 
    opened by Erin59 9
  • NotImplementedError: [E894] The 'noun_chunks' syntax iterator is not implemented for language 'ru'.


    It seems that nlp.add_pipe("textrank") requires noun_chunks, which raises NotImplementedError for language models where noun_chunks has not been implemented. I got NotImplementedError with the "ru_core_news_lg" and "ru_core_news_sm" spaCy models.

    The proposal is to make the use of "noun chunks" optional to prevent such errors.
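
    A minimal sketch of the kind of guard this proposal implies (standalone, not the library's actual fix):

    import spacy

    nlp = spacy.load("ru_core_news_sm")
    doc = nlp("пример текста")

    try:
        chunks = list(doc.noun_chunks)
    except NotImplementedError:
        # no noun_chunks syntax iterator for this language;
        # fall back to named entities only
        chunks = list(doc.ents)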

    bug 
    opened by gremur 8
  • How to use this?


    Hi there, I've been looking at your code and example for a long time and I still have no idea how to use this.

    I have documents in string format, what JSON format should they have if I want to use the stages as in the examples?

    I find there's a crucial piece of information missing from the documentation: how to use this package with a simple document in string format (or a list of strings representing sentences), since I don't know beforehand what JSON format I have to convert my text to in order to use the stage pipeline.

    Cheers

    question 
    opened by romanovzky 8
  • Error: Can't find factory for 'textrank' for language English....


    Hi there,

    Does anyone know how to fix the errors below when running the example code?

    Thanks.

    Traceback (most recent call last):
      File "test.py", line 14, in <module>
        nlp.add_pipe("textrank")
      File "/usr/local/lib64/python3.6/site-packages/spacy/language.py", line 773, in add_pipe
        validate=validate,
      File "/usr/local/lib64/python3.6/site-packages/spacy/language.py", line 639, in create_pipe
        raise ValueError(err)
    ValueError: [E002] Can't find factory for 'textrank' for language English (en). This usually happens when spaCy calls nlp.create_pipe with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator @Language.component (for function components) or @Language.factory (for class components).

    Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, parser, beam_parser, entity_linker, ner, beam_ner, entity_ruler, lemmatizer, tagger, morphologizer, senter, sentencizer, textcat, textcat_multilabel, en.lemmatizer

    question 
    opened by r76941156 7
  • Differences between 2.1.0 and 3.0.0


    Are the changes between the two versions of pytextrank documented anywhere?

    The queries seem to be giving different results, so I would like to understand if that is because of changes to spaCy or to the algorithm itself?

    Thank you for your help.

    question howto 
    opened by debraj135 7
  • Keyword extraction


    Hi there, I'm working on a project extracting keywords from a German text. Is there a tutorial on how to extract keywords using pytextrank?

    Best regards,

    question 
    opened by danielp3011 7
  • AttributeError: 'DiGraph' object has no attribute 'edge'


    Fixed by changing the code in pytextrank (line 307) from:

    try:
        graph.edge[pair[0]][pair[1]]["weight"] += 1.0
    except KeyError:
        graph.add_edge(pair[0], pair[1], weight=1.0)

    to:

    if "edge" in dir(graph):
        graph.edge[pair[0]][pair[1]]["weight"] += 1.0
    else:
        graph.add_edge(pair[0], pair[1], weight=1.0)

    opened by Vickoh 7
  • Add biasedtextrank module.


    Hey @ceteri, I have added a basic version of biased TextRank.

    It takes into account a "focus" as well as a "bias" to augment the ranking in favour of the focus. As per the paper, it should add bias to the graph based on a similarity calculation between the "focus" and the nodes, but this version just assigns the "bias" to focus terms while leaving other nodes unbiased.

    Let me know your ideas so that we can improve upon this version.

    @louisguitton

    opened by Ankush-Chander 6
  • IndexError: list index out of range


    Hi,

    I'm getting the following error when trying to run pytextrank with my own data. Is there a way to fix this?

    app_1 | Traceback (most recent call last):
    app_1 |   File "index.py", line 26, in <module>
    app_1 |     for rl in pytextrank.normalize_key_phrases(path_stage1, ranks):
    app_1 |   File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 581, in normalize_key_phrases
    app_1 |     for rl in collect_entities(sent, ranks, stopwords, spacy_nlp):
    app_1 |   File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 485, in collect_entities
    app_1 |     w_ranks, w_ids = find_entity(sent, ranks, ent.text.split(" "), 0)
    app_1 |   File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 454, in find_entity
    app_1 |     return find_entity(sent, ranks, ent, i + 1)
    app_1 |   File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 454, in find_entity
    app_1 |     return find_entity(sent, ranks, ent, i + 1)
    app_1 |   File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 454, in find_entity
    app_1 |     return find_entity(sent, ranks, ent, i + 1)
    app_1 |   [Previous line repeated 137 more times]
    app_1 |   File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 451, in find_entity
    app_1 |     w = sent[i + j]
    app_1 | IndexError: list index out of range

    wontfix 
    opened by rabinneslo 6
  • [Snyk] Security upgrade setuptools from 39.0.1 to 65.5.1


    This PR was automatically created by Snyk using the credentials of a real user.


    Snyk has created this PR to fix one or more vulnerable packages in the `pip` dependencies of this project.

    Changes included in this PR

    • Changes to the following files to upgrade the vulnerable dependencies to a fixed version:
      • requirements-dev.txt
    ⚠️ Warning
    pymdown-extensions 8.0 requires Markdown, which is not installed.
    mkdocs-material 8.0.1 requires mkdocs, which is not installed.
    mkdocs-material 8.0.1 requires mkdocs-material-extensions, which is not installed.
    mkdocs-material 8.0.1 requires markdown, which is not installed.
    mkdocs-material 8.0.1 has requirement pymdown-extensions>=9.0, but you have pymdown-extensions 8.0.
    
    

    Vulnerabilities that will be fixed

    By pinning:

    Severity | Priority Score (*) | Issue | Upgrade | Breaking Change | Exploit Maturity
    ---------|--------------------|-------|---------|-----------------|-----------------
    low severity | 441/1000 (Why? Recently disclosed, Has a fix available, CVSS 3.1) | Regular Expression Denial of Service (ReDoS) SNYK-PYTHON-SETUPTOOLS-3113904 | setuptools: 39.0.1 -> 65.5.1 | No | No Known Exploit

    (*) Note that the real score may have changed since the PR was raised.

    Some vulnerabilities couldn't be fully fixed and so Snyk will still find them when the project is tested again. This may be because the vulnerability existed within more than one direct dependency, but not all of the affected dependencies could be upgraded.

    Check the changes in this PR to ensure they won't cause issues with your project.


    Note: You are seeing this because you or someone else with access to this repository has authorized Snyk to open fix PRs.

    For more information: 🧐 View latest project report

    🛠 Adjust project settings

    📚 Read more about Snyk's upgrade and patch logic


    Learn how to fix vulnerabilities with free interactive lessons:

    🦉 Regular Expression Denial of Service (ReDoS)

    opened by ceteri 0
  • [Snyk] Security upgrade setuptools from 39.0.1 to 65.5.1


    Snyk has created this PR to fix one or more vulnerable packages in the `pip` dependencies of this project.

    Changes included in this PR

    • Changes to the following files to upgrade the vulnerable dependencies to a fixed version:
      • requirements-dev.txt
    ⚠️ Warning
    pymdown-extensions 8.0 requires Markdown, which is not installed.
    mkdocs-material 8.0.1 requires mkdocs, which is not installed.
    mkdocs-material 8.0.1 requires markdown, which is not installed.
    mkdocs-material 8.0.1 requires mkdocs-material-extensions, which is not installed.
    mkdocs-material 8.0.1 has requirement pymdown-extensions>=9.0, but you have pymdown-extensions 8.0.
    
    

    Vulnerabilities that will be fixed

    By pinning:

    Severity | Priority Score (*) | Issue | Upgrade | Breaking Change | Exploit Maturity
    ---------|--------------------|-------|---------|-----------------|-----------------
    low severity | 441/1000 (Why? Recently disclosed, Has a fix available, CVSS 3.1) | Regular Expression Denial of Service (ReDoS) SNYK-PYTHON-SETUPTOOLS-3113904 | setuptools: 39.0.1 -> 65.5.1 | No | No Known Exploit

    (*) Note that the real score may have changed since the PR was raised.

    Some vulnerabilities couldn't be fully fixed and so Snyk will still find them when the project is tested again. This may be because the vulnerability existed within more than one direct dependency, but not all of the affected dependencies could be upgraded.

    Check the changes in this PR to ensure they won't cause issues with your project.


    Note: You are seeing this because you or someone else with access to this repository has authorized Snyk to open fix PRs.

    For more information: 🧐 View latest project report

    🛠 Adjust project settings

    📚 Read more about Snyk's upgrade and patch logic


    Learn how to fix vulnerabilities with free interactive lessons:

    🦉 Regular Expression Denial of Service (ReDoS)

    opened by snyk-bot 0
  • suggestion: allow "wildcard" POS for stopwords

    The current approach, which specifies stopwords as lemma: [POS], presents two issues:

    1. There are some terms on which POS taggers fail. For example, spaCy labels "AI" (artificial intelligence) as PROPN.
    2. If I create software to be used by people without linguistic knowledge, I cannot expect them to know about POS.

    As a work-around, it is necessary to specify all POS tags, which is rather inelegant.
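
    For instance, with the stopwords configuration argument, every tag for the lemma must currently be enumerated (the tag list below is illustrative):

    import spacy
    import pytextrank

    nlp = spacy.load("en_core_web_sm")

    # work-around today: list each POS tag the lemma might receive;
    # a wildcard entry would collapse this to a single item
    nlp.add_pipe("textrank", config={
        "stopwords": {"ai": ["NOUN", "PROPN", "ADJ", "VERB", "X"]},
    })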

    opened by arc12 0
  • "ValueError: [E002] Can't find factory for 'textrank' for language English (en)." - incompatibility with SpaCy 3.3.1?

    I'm trying to use this package for the first time and followed the README:

    !pip install pytextrank
    !python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("textrank")
    

    This throws an error at the last line:

    ValueError: [E002] Can't find factory for 'textrank' for language English (en). This usually happens when spaCy calls 'nlp.create_pipe' with a custom component name that's not registered on the current language class. If you're using a Transformer, make sure to install 'spacy-transformers'. If you're using a custom component, make sure you've added the decorator '@Language.component' (for function components) or '@Language.factory' (for class components).

    Available factories: attribute_ruler, tok2vec, merge_noun_chunks, merge_entities, merge_subtokens, token_splitter, doc_cleaner, parser, beam_parser, lemmatizer, trainable_lemmatizer, entity_linker, ner, beam_ner, entity_ruler, tagger, morphologizer, senter, sentencizer, textcat, spancat, future_entity_ruler, span_ruler, textcat_multilabel, en.lemmatizer`

    Is this an incompatibility with spaCy version 3.3.1, or have I overlooked something crucial? Which spaCy version do you recommend? (I restarted the kernel after installing pytextrank.)

    question 
    opened by lisabecker-ml6 1
  • Is `biasedtextrank` implemented?


    https://github.com/DerwenAI/pytextrank/blob/9ab64507a26f946191504598f86021f511245cd7/pytextrank/base.py#L305

    self.focus_tokens is initialized to an empty set but I don't see where it is parameterized?

    e.g.

    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("biasedtextrank")
    focus = "my example focus"
    doc = nlp(text)
    

    At what point can I inform the model of the focus?
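
    A guess at the intended call, based on the change_focus() method mentioned in the v3.1.1 release notes below (argument names unverified):

    # assumption: change_focus() re-biases the ranks and refreshes doc._.phrases
    doc._.textrank.change_focus(focus)
    print(doc._.phrases[:5])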

    question kg 
    opened by Ayenem 4
  • ZeroDivisionError: division by zero in _calc_discounted_normalised_rank


    Hi,

    I use this library together with spaCy for extracting the most important words. However, when using the Catalan model of spaCy, the algorithm gives the following error:

    `File "/code/app.py", line 20, in getNlpEntities

    entities = runTextRankEntities(hl, contents['contents'], algorithm, num)
    

    File "/code/nlp/textRankEntities.py", line 51, in runTextRankEntities

    doc = nlp(joined_content)
    

    File "/usr/local/lib/python3.9/site-packages/spacy/language.py", line 1022, in call

    error_handler(name, proc, [doc], e)
    

    File "/usr/local/lib/python3.9/site-packages/spacy/util.py", line 1617, in raise_error

    raise e
    

    File "/usr/local/lib/python3.9/site-packages/spacy/language.py", line 1017, in call

    doc = proc(doc, **component_cfg.get(name, {}))  # type: ignore[call-arg]
    

    File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 253, in call

    doc._.phrases = doc._.textrank.calc_textrank()
    

    File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 363, in calc_textrank

    nc_phrases = self._collect_phrases(self.doc.noun_chunks, self.ranks)
    

    File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 548, in _collect_phrases

    return {
    

    File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 549, in

    span: self._calc_discounted_normalised_rank(span, sum_rank)
    

    File "/usr/local/lib/python3.9/site-packages/pytextrank/base.py", line 592, in _calc_discounted_normalised_rank

    phrase_rank = math.sqrt(sum_rank / (len(span) + non_lemma))
    

    ZeroDivisionError: division by zero`
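
    A minimal sketch of one possible guard (an assumption, not the project's actual fix):

    import math

    def discounted_normalised_rank(sum_rank: float, span_len: int, non_lemma: int) -> float:
        denom = span_len + non_lemma
        # an empty denominator would otherwise raise ZeroDivisionError
        return math.sqrt(sum_rank / denom) if denom > 0 else 0.0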

    bug help wanted good first issue 
    opened by sumitkumarjethani 2
Releases (v3.2.4)
  • v3.2.4(Jul 27, 2022)

    2022-07-27

    • better support for "ru" and other languages without noun_chunks support in spaCy
    • updated example notebook to illustrate TopicRank algorithm
    • made the node bias setting case-independent for Biased Textrank algorithm; kudos @Ankush-Chander
    • updated summarization tests; kudos @tomaarsen
    • reworked some unit tests to be less brittle, less dependent on specific spaCy point releases

    What's Changed

    • updated docs and example to show TopicRank by @ceteri in https://github.com/DerwenAI/pytextrank/pull/211
    • working on #204 by @ceteri in https://github.com/DerwenAI/pytextrank/pull/212
    • Prevent exception on TopicRank when there are no noun_chunks by @tomaarsen in https://github.com/DerwenAI/pytextrank/pull/219
    • Biasedrank case fix by @Ankush-Chander in https://github.com/DerwenAI/pytextrank/pull/217
    • Docs update by @ceteri in https://github.com/DerwenAI/pytextrank/pull/221
    • rework some unit tests by @ceteri in https://github.com/DerwenAI/pytextrank/pull/222

    Full Changelog: https://github.com/DerwenAI/pytextrank/compare/v3.2.3...v3.2.4

  • v3.2.3(Mar 6, 2022)

    2022-03-06

    • handles missing noun_chunks in some language models (e.g., "ru") #204
    • add TopicRank algorithm; kudos @tomaarsen
    • improved test suite; fixed tests for newer spacy releases; kudos @tomaarsen

    What's Changed

    • [Snyk] Security upgrade mistune from 0.8.4 to 2.0.1 by @snyk-bot in https://github.com/DerwenAI/pytextrank/pull/201
    • Improved test suite; fixed tests by @tomaarsen in https://github.com/DerwenAI/pytextrank/pull/205
    • Updated Copyright year from 2021 to 2022 by @tomaarsen in https://github.com/DerwenAI/pytextrank/pull/206
    • update API reference docs by @ceteri in https://github.com/DerwenAI/pytextrank/pull/207
    • Inclusion of the TopicRank Keyphrase Extraction algorithm by @tomaarsen in https://github.com/DerwenAI/pytextrank/pull/208
    • Prep release by @ceteri in https://github.com/DerwenAI/pytextrank/pull/210

    New Contributors

    • @snyk-bot made their first contribution in https://github.com/DerwenAI/pytextrank/pull/201

    Full Changelog: https://github.com/DerwenAI/pytextrank/compare/v3.2.2...v3.2.3

  • v3.2.2(Oct 10, 2021)

    What's Changed

    • prep next release by @ceteri in https://github.com/DerwenAI/pytextrank/pull/189
    • warning about the deprecated code in archive by @ceteri in https://github.com/DerwenAI/pytextrank/pull/190
    • fixes chunk to be between sent_start and sent_end in BaseTextRank.calc_sent_dist by @clabornd in https://github.com/DerwenAI/pytextrank/pull/191
    • Update by @ceteri in https://github.com/DerwenAI/pytextrank/pull/198
    • add more scrubber examples and documentation by @dayalstrub-cma in https://github.com/DerwenAI/pytextrank/pull/197
    • kudos by @ceteri in https://github.com/DerwenAI/pytextrank/pull/199
    • prep PyPi release by @ceteri in https://github.com/DerwenAI/pytextrank/pull/200

    New Contributors

    • @clabornd made their first contribution in https://github.com/DerwenAI/pytextrank/pull/191
    • @dayalstrub-cma made their first contribution in https://github.com/DerwenAI/pytextrank/pull/197

    Full Changelog: https://github.com/DerwenAI/pytextrank/compare/v3.2.1...v3.2.2

  • v3.2.1(Jul 24, 2021)

  • v3.2.0(Jul 17, 2021)

    2021-07-17

    Various updates to support spaCy 3.1.x, which changes some interfaces.

    • NB: THE SCRUBBER UPDATE WILL BREAK PREVIOUS RELEASES
    • allow Span as scrubber argument, to align with spaCy 3.1.x; kudos @Ankush-Chander
    • add lgtm code reviews (slow, not integrating into GitHub PRs directly)
    • evaluating grayskull to generate a conda-forge recipe
    • add use of pipdeptree to analyze dependencies
    • use KG from biblio.ttl to generate bibliography
    • fixed overlooked comment from earlier code; kudos @debraj135
    • add visualisation using altair; kudos @louisguitton
    • add scrubber usage in sample notebook; kudos @Ankush-Chander
    • integrating use of MkRefs to generate semantic reference pages in docs
  • v3.1.1(Mar 25, 2021)

    2021-03-25

    • fix the span length calculation in explanation notebook; kudos @Ankush-Chander
    • add BiasedTextRank by @Ankush-Chander (many thanks!)
    • add conda environment.yml plus instructions
    • use bandit to check for security issues
    • use codespell to check for spelling errors
    • add pre-commit checks in general
    • update doc._.phrases in the call to change_focus() so the summarization will sync with the latest focus
  • v3.1.0(Mar 12, 2021)

    2021-03-12

    • rename master branch to main
    • add a factory class that assigns each doc its own Textrank object; kudos @Ankush-Chander
    • refactor the stopwords feature as a constructor argument
    • add get_unit_vector() method to expose the characteristic unit vector
    • add calc_sent_dist() method to expose the sentence distance measures (for summarization)
    • include a unit test for summarization
    • updated contributor instructions
    • pylint coverage for code checking
    • linking definitions and citations in source code apidocs to our online docs
    • updated links on PyPi
  • v3.0.1(Feb 27, 2021)

  • v3.0.0(Feb 14, 2021)

    2021-02-14

    • THIS WILL BREAK THINGS!!!
    • support for spaCy 3.0.x; kudos @Lord-V15
    • full integration of PositionRank
    • migrated all unit tests to pytest
    • removed use of logger for debugging, introducing icecream instead
  • v2.1.0(Jan 31, 2021)

    2021-01-31

    • add PositionRank by @louisguitton (many thanks!)
    • fixes chunk in explain_summ.ipynb by @anna-droid-beep
    • add option preserve_order in TextRank.summary by @kavorite
    • tested with spaCy 2.3.5
  • v2.0.3(Sep 15, 2020)

    2020-09-15

    • try-catch ZeroDivisionError in summary method -- kudos @shyamcody
    • tested with updated dependencies: spaCy 2.3.x and NetworkX 2.5
  • v2.0.2(Jun 28, 2020)

  • v2.0.1(Mar 2, 2020)

    2020-03-02

    • fix KeyError issue for pre Python 3.6
    • integrated codecov.io
    • added PyTextRank to the spaCy uniVerse
    • fixed README.md instructions to download en_core_web_sm
  • v2.0.0(Nov 5, 2019)

    • refactored library to run as a spaCy extension
    • supports multiple languages
    • significantly faster, with less memory required
    • better extraction of top-ranked phrases
    • changed license to MIT
    • uses lemma-based stopwords for more precise control
    • WIP toward integration with knowledge graph use cases
  • v1.2.1(Nov 1, 2019)

  • v1.2.0(Nov 1, 2019)

  • v1.1.1(Sep 15, 2017)

  • v1.1.0(Jun 7, 2017)

    Replaced TextBlob usage with spaCy for improved parsing results. Updated the other Python dependencies. Also added better handling for UTF-8.

  • v1.0.1(May 1, 2017)

  • v1.0.0(Mar 13, 2017)
