Autoregressive Entity Retrieval

Overview

The GENRE (Generative ENtity REtrieval) system, as presented in Autoregressive Entity Retrieval, implemented in PyTorch.

@inproceedings{decao2020autoregressive,
  title={Autoregressive Entity Retrieval},
  author={Nicola De Cao and Gautier Izacard and Sebastian Riedel and Fabio Petroni},
  booktitle={International Conference on Learning Representations},
  url={https://openreview.net/forum?id=5k8F6UU39V},
  year={2021}
}

The mGENRE system, as presented in Multilingual Autoregressive Entity Linking.

@inproceedings{decao2020multilingual,
  title={Multilingual Autoregressive Entity Linking}, 
  author={Nicola De Cao and Ledell Wu and Kashyap Popat and Mikel Artetxe and 
          Naman Goyal and Mikhail Plekhanov and Luke Zettlemoyer and 
          Nicola Cancedda and Sebastian Riedel and Fabio Petroni},
  booktitle={arXiv pre-print 2103.12528},
  url={https://arxiv.org/abs/2103.12528},
  year={2021},
}

Please consider citing our works if you use code from this repository.

In a nutshell, (m)GENRE uses a sequence-to-sequence approach to entity retrieval (e.g., linking), based on a fine-tuned BART architecture (or mBART for the multilingual version). (m)GENRE performs retrieval by generating the unique entity name conditioned on the input text, using constrained beam search to generate only valid identifiers. Here is an example of generation for Wikipedia page retrieval for open-domain question answering:
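
For example, a minimal sketch of such a call with the fairseq page-retrieval checkpoint and the KILT titles trie described below (paths are placeholders and the output is omitted):

import pickle
from genre.trie import Trie
from genre.fairseq_model import GENRE

# load the prefix tree (trie) over all Wikipedia titles in KILT
with open("../data/kilt_titles_trie_dict.pkl", "rb") as f:
    trie = Trie.load_from_dict(pickle.load(f))

# load the page-level document retrieval model
model = GENRE.from_pretrained("models/fairseq_wikipage_retrieval").eval()

# generate Wikipedia page titles for the query, constrained to valid titles
model.sample(
    sentences=["Which American president was involved in the Watergate scandal?"],
    prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)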

For end-to-end entity linking, GENRE re-generates the input text annotated with a markup:
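
A minimal sketch of such a call with the huggingface end-to-end linking checkpoint (the same API used in the issues below); the model re-generates the input with { mention } [ entity ] markup, and the example output shown in the last comment is only illustrative:

from genre.hf_model import GENRE
from genre.entity_linking import get_end_to_end_prefix_allowed_tokens_fn_hf as get_prefix_allowed_tokens_fn

model = GENRE.from_pretrained("models/hf_e2e_entity_linking_aidayago").eval()

sentences = ["In 1921, Einstein received a Nobel Prize."]

# constrain decoding so that mentions and entity names form valid markup
prefix_allowed_tokens_fn = get_prefix_allowed_tokens_fn(model, sentences)

model.sample(
    sentences,
    prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
)
# e.g. "In 1921, { Einstein } [ Albert Einstein ] received a { Nobel Prize } [ Nobel Prize in Physics ]."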

GENRE achieves state-of-the-art results on multiple datasets.

mGENRE performs multilingual entity linking in 100+ languages, treating languages as latent variables and marginalizing over them:

Main dependencies

  • python>=3.7
  • pytorch>=1.6
  • fairseq>=0.10 (optional, for training GENRE) NOTE: fairseq is going through changes without backward compatibility. Install fairseq from source and use this commit for reproducibility. See here for the current PR that should fix fairseq/master.
  • transformers>=4.2 (optional, for inference with GENRE)

Examples & Usage

For a full review of the (m)GENRE API, see the examples in the repository.

GENRE

After importing and loading the model and a prefix tree (trie), you can generate predictions (in this example, for entity disambiguation) with a simple call like:

import pickle
from genre.trie import Trie
from genre.fairseq_model import GENRE

# load the prefix tree (trie)
with open("../data/kilt_titles_trie_dict.pkl", "rb") as f:
    trie = Trie.load_from_dict(pickle.load(f))

# load the model
model = GENRE.from_pretrained("models/fairseq_entity_disambiguation_aidayago").eval()

# generate Wikipedia titles
model.sample(
    sentences=["Einstein was a [START_ENT] German [END_ENT] physicist."],
    prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
)
[[{'text': 'Germany', 'score': tensor(-0.1856)},
  {'text': 'Germans', 'score': tensor(-0.5461)},
  {'text': 'German Empire', 'score': tensor(-2.1858)}]]
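
If the full KILT titles trie is not available, a prefix tree over a custom set of candidate titles can be built directly with the model's encoder (the same pattern used in the entity-linking examples in the comments below); a minimal sketch with an illustrative candidate list:

from genre.trie import Trie

# " {}" adds the leading space expected by the BPE; [1:] drops the leading special token
candidates = ["Germany", "Germans", "German Empire"]
custom_trie = Trie([
    model.encode(" {}".format(title))[1:].tolist()
    for title in candidates
])

model.sample(
    sentences=["Einstein was a [START_ENT] German [END_ENT] physicist."],
    prefix_allowed_tokens_fn=lambda batch_id, sent: custom_trie.get(sent.tolist()),
)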

mGENRE

Making predictions with mGENRE is very similar, but we additionally need to map (title, language_ID) to Wikidata IDs and (optionally) marginalize over predictions of the same entity:

import pickle
from genre.trie import Trie, MarisaTrie
from genre.fairseq_model import mGENRE

with open("../data/lang_title2wikidataID-normalized_with_redirect.pkl", "rb") as f:
    lang_title2wikidataID = pickle.load(f)

# memory efficient prefix tree (trie) implemented with `marisa_trie`
with open("../data/titles_lang_all105_marisa_trie_with_redirect.pkl", "rb") as f:
    trie = pickle.load(f)

# load the model
model = mGENRE.from_pretrained("../models/fairseq_multilingual_entity_disambiguation").eval()

# generate Wikipedia titles and language IDs
model.sample(
    sentences=["[START] Einstein [END] era un fisico tedesco."],
    # Italian for "[START] Einstein [END] was a German physicist."
    prefix_allowed_tokens_fn=lambda batch_id, sent: [
        e for e in trie.get(sent.tolist()) if e < len(model.task.target_dictionary)
    ],
    text_to_id=lambda x: max(lang_title2wikidataID[
        tuple(reversed(x.split(" >> ")))
    ], key=lambda y: int(y[1:])),
    marginalize=True,
)
[[{'id': 'Q937',
   'texts': ['Albert Einstein >> it',
    'Alberto Einstein >> it',
    'Einstein >> it'],
   'scores': tensor([-0.0808, -1.4619, -1.5765]),
   'score': tensor(-0.0884)},
  {'id': 'Q60197',
   'texts': ['Alfred Einstein >> it'],
   'scores': tensor([-1.4337]),
   'score': tensor(-3.2058)},
  {'id': 'Q15990626',
   'texts': ['Albert Einstein (disambiguation) >> en'],
   'scores': tensor([-1.0998]),
   'score': tensor(-3.6478)}]]
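
The text_to_id callable above maps a generated string of the form "title >> language" to a Wikidata ID via the lang_title2wikidataID dictionary; a minimal illustration of what it does (assuming the key exists in the mapping):

# "Albert Einstein >> it"  ->  key ("it", "Albert Einstein")
prediction = "Albert Einstein >> it"
key = tuple(reversed(prediction.split(" >> ")))

# several Wikidata IDs can map to the same (language, title) pair;
# as in the text_to_id above, keep the one with the largest numeric part
wikidata_id = max(lang_title2wikidataID[key], key=lambda qid: int(qid[1:]))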

Models & Datasets

For GENRE, use this script to download all models and this one to download all datasets. See here for the list of all individual models for each task, both for pytorch fairseq and for huggingface transformers. See the example on how to download additional optional files, such as the prefix tree (trie) for KILT Wikipedia.

For mGENRE we only have one model, available here. See the example on how to download additional optional files, such as the prefix tree (trie) for Wikipedia in all languages and the mapping between titles and Wikidata IDs.

A pre-trained mBART model covering 125 languages is available here.

Troubleshooting

If the genre module cannot be found, preface the python command with PYTHONPATH=. so that Python can locate the package from the repository root.
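
Equivalently, the repository root can be put on the Python path from inside a script before importing genre; a small sketch (the path is a placeholder):

import sys
sys.path.insert(0, "/path/to/GENRE")  # repository root that contains the genre/ package

from genre.fairseq_model import GENRE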

Licence

GENRE is licensed under the CC-BY-NC 4.0 license. The text of the license can be found here.

Comments
  • Finetuning mGENRE using fairseq-train  -  please ensure architectures match

    I processed my Icelandic dataset into KILT format and then into fairseq binary format using the code in the repo (with sentencepiece and mGENRE's dictionary), with the goal of fine-tuning mGENRE.

    The model I'm using is model.pt from fairseq_multilingual_entity_disambiguation. I used scripts_genre/train.sh and when I ran it I got the following error:

    RuntimeError: Error(s) in loading state_dict for BARTModel: Unexpected key(s) in state_dict: "encoder.layer_norm.weight", "encoder.layer_norm.bias", "decoder.layer_norm.weight", "decoder.layer_norm.bias".

    Exception: Cannot load model parameters from checkpoint mgenre/model.pt; please ensure that the architectures match.

    I'm able to train GENRE using the same train.sh script (for data I processed using BPE) without a problem. But not mGENRE.

    I tried changing the --arch parameter to mbart_large, but still get the same error.

    Any idea what parameters I need to change to make calling fairseq-train work?

    opened by Valdegg 11
  • Invalid prediction - no wikipedia entity

    Hi, I use the end-to-end entity linking model of GENRE. Unfortunately, for some predictions I get entity names that do not appear in Wikipedia.

    Code:

    from genre.hf_model import GENRE
    from genre.entity_linking import get_end_to_end_prefix_allowed_tokens_fn_hf as get_prefix_allowed_tokens_fn
    
    model = GENRE.from_pretrained("models/hf_e2e_entity_linking_aidayago").eval()
    
    sentences = ["For some people he's the John Travolta of early 80's art."]
    
    prefix_allowed_tokens_fn = get_prefix_allowed_tokens_fn(model, sentences)
    
    
    print(model.sample(
        sentences,
        prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
    ))
    
    

    Output:

    [[{'text': "For some people he's the { John Travolta } [ John Trapolta ] of early 80's art.", 'score': tensor(-0.6125)}, {'text': "For some { people } [ People (magazine) ] he's the { John Travolta } [ John Trapolta ] of early 80's { art } [ Art ].", 'score': tensor(-0.7216)}, {'text': "For some { people } [ People (magazine) ] he's the { John Travolta } [ John Trapolta ] of early 80's art.", 'score': tensor(-0.7357)}, {'text': "For some people he's the { John Travolta } [ John Trapolta ] of early 80's { art } [ Art ].", 'score': tensor(-0.7769)}, {'text': "For some { people } [ People (magazine) ] he's the { John Travolta } [ John Trapolta ] of early 80's { art } [ Visual arts ].", 'score': tensor(-0.8873)}]]
    

    "John Trapolta" does not exist neither does John Trapolta. If I understood the paper correct, the model should only output valid wikipedia entities, right? Can you help me out what I did wrong?

    Cheers!

    opened by schwabmi 10
  • A bug on mGENRE.from_pretrained

    model = mGENRE.from_pretrained("fairseq_multilingual_entity_disambiguation").eval()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/local/home/wendaxu/GENRE/genre/fairseq_model.py", line 146, in from_pretrained
        x = hub_utils.from_pretrained(
      File "/local/home/wendaxu/anaconda3/envs/genre/lib/python3.8/site-packages/fairseq/hub_utils.py", line 70, in from_pretrained
        models, args, task = checkpoint_utils.load_model_ensemble_and_task(
      File "/local/home/wendaxu/anaconda3/envs/genre/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 279, in load_model_ensemble_and_task
        state = load_checkpoint_to_cpu(filename, arg_overrides)
      File "/local/home/wendaxu/anaconda3/envs/genre/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 231, in load_checkpoint_to_cpu
        setattr(args, arg_name, arg_val)
    AttributeError: 'NoneType' object has no attribute 'bpe'

    opened by xu1998hz 10
  • RuntimeError when using get_entity_spans

    This error comes out in an unpredictable way when using get_entity_spans

    # entity wikipedia link
    entity_spans = get_entity_spans(
        model,
        sentences,
        mention_trie=Trie([
            model.encode(" {}".format(e))[1:].tolist()
            for e in ["Einstein", "Nobel Prize"]
        ]),
        mention_to_candidates_dict={
            "Einstein": ["Albert Einstein", "Einstein (surname)"],
            "Nobel Prize": ["Nobel Prize in Physics", "Nobel Prize in Medicine"],
        }
    )
    print(get_markdown(sentences, entity_spans)[0])
    
    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    <ipython-input-48-2c88416e3c2b> in <module>
          9     mention_to_candidates_dict={
         10         "Einstein": ["Albert Einstein", "Einstein (surname)"],
    ---> 11         "Nobel Prize": ["Nobel Prize in Physics", "Nobel Prize in Medicine"],
         12     }
         13 )
    
    ~/SageMaker/hf-experiments/src/genre/genre/utils.py in get_entity_spans_hf(model, input_sentences, mention_trie, candidates_trie, mention_to_candidates_dict, redirections)
        197 
        198     return get_entity_spans_finalize(
    --> 199         input_sentences, output_sentences, redirections=redirections
        200     )
        201 
    
    ~/SageMaker/hf-experiments/src/genre/genre/utils.py in get_entity_spans_finalize(input_sentences, output_sentences, redirections)
        228                     status = "m"
        229                 else:
    --> 230                     raise RuntimeError
        231 
        232             elif status == "m":
    
    RuntimeError: 
    

    Normally I would get the usual output:

    In 1921, [Einstein](https://en.wikipedia.org/wiki/Albert_Einstein) received a Nobel Prize.
    
    
    opened by loretoparisi 10
  • Regarding training of Entity Disambiguation model reported in GENRE paper on a new data

    Hi Nicola,

    I want to train the Entity Disambiguation model reported in the GENRE paper (not mGENRE) from scratch on my own data. Can you please tell me the steps to generate the training data in the format expected by GENRE? Which scripts should I use, and in what order?

    I can see scripts: https://github.com/facebookresearch/GENRE/blob/main/scripts_genre/preprocess_fairseq.sh (what values to provide for arguments --source-lang, --target-lang, --srcdict, --tgtdict as I am not working in the multi-lingual setting)

    https://github.com/facebookresearch/GENRE/blob/main/scripts_genre/train.sh (Again what values to provide for arguments --source-lang, --target-lang, --task).

    opened by dineshkh 9
  • Potential bugs in evaluate_kilt_dataset.py

    Hi,

    Thanks for your work. I want to run evaluate_kilt_dataset.py for evaluation purposes.

    I found two problems:

    1. https://github.com/facebookresearch/GENRE/blob/a87df17ca61899b391c8af751644d457e3520f59/scripts_genre/evaluate_kilt_dataset.py#L43

    Should it be: iter_ = tqdm(batch_it(dataset, batch_size), desc="Evaluating")?

    2. Model behavior differs a lot when changing the batch_size

    Specifically, with batch_size = 1, --candidates, and otherwise default parameters, the performance on ace2004-test-kilt is f1=0.897, prec=0.929, rec=0.868. With batch_size = 64, --candidates, and otherwise default parameters, the performance on ace2004-test-kilt is f1=0.0524, prec=0.0544, rec=0.0506. I also tried other batch sizes; each time the performance is different, so I guess there are potential bugs in this script.

    Could you please take a look at it, and could you provide the command to reproduce the entity disambiguation performance reported in the ICLR 2021 paper? Thanks a lot.

    opened by hitercs 9
  • issue in huggingface prefix_allowed_tokens_fn

    Hello,

    I tried to use the constrained beam search with huggingface and realized that @nicola-decao has added support for this via prefix_allowed_tokens_fn in huggingface generation. However, I occasionally get a generated token that is not in the constraint. For example, given the constraint {2: [3, 6, 47], 3: [6], 6: [47], 47: [3]} and [2, 6] as input_ids, I get a number other than 47, which is the only possible output under this constraint. Is there any way I can solve this, or is there anything I'm missing?

    Thanks!
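
    For reference, here is a minimal sketch (not code from this repository) of how a transition constraint like the one above is typically expressed as a prefix_allowed_tokens_fn for huggingface generate; the token ids are purely illustrative:

    allowed_next = {2: [3, 6, 47], 3: [6], 6: [47], 47: [3]}

    def prefix_allowed_tokens_fn(batch_id, input_ids):
        # huggingface calls this at every decoding step with the ids generated so far;
        # after [2, 6] it should return [47], so only token 47 can be scored next
        last_token = input_ids[-1].item()
        # fall back to all constrained ids if the last token has no entry
        return allowed_next.get(last_token, list(allowed_next))

    # passed to generation as: model.generate(input_ids, prefix_allowed_tokens_fn=prefix_allowed_tokens_fn)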

    opened by amy-hyunji 8
  • 4 point interval between my  finetune model and shared model

    I am doing the entity disambiguation task. The model you published, "fairseq_entity_disambiguation_aidayago", gets this result (without candidates and without the trie built specially for AIDA). This is almost the same as issue #26, but the aida-test-kilt result is a little different: #26 reports 87.89, while here it is 87.92.

    [screenshot: evaluation results of the shared model]

    With the same code, my fine-tuned model gets this result (using the same Wikipedia trie, 'kilt_titles_trie_dict.pkl'). For the test dataset aida-test-kilt, there is a ~4 point gap.

    [screenshot: evaluation results of my fine-tuned model]

    Although fairseq provides the default seed=1, each time I rerun train.sh the resulting model is different and has a different test result, but most runs are similar to the picture above, always with a ~4 point gap compared with the shared model.

    Here is my finetune shell code.

    • Using the shared model 'fairseq_entity_disambiguation_blink' as the pretrained model.
    • Based on /GENRE/tree/main/scripts_genre/train.sh, I just changed the file paths and set --max-update 10000 and --total-num-update 10000, so the number of fine-tuning steps is the same as mentioned in the paper (10k).
    • Removed --reset-meters and --reset-optimizer, so the parameters are initialized from fairseq_entity_disambiguation_blink and the optimizer state is not reset.
    • The code saves 3 checkpoints: 50, 51, and checkpoint_last. I tested each of them and the best result is still around 83% for aida-test-kilt. The dict.source.txt and dict.target.txt are the same as in 'fairseq_entity_disambiguation_blink', which is just the renamed 'dict.txt' file from the bart.large model (I copied the two files from 'fairseq_entity_disambiguation_blink' to the fine-tuned model).
    DATASET=GENRE-main/datasets/aida
    BASED_MODEL=fairseq_entity_disambiguation_blink
    NAME=fairseq_aida_basedon_blink_10k_default
    STEP=10000
    
    fairseq-train $DATASET/bin/ \
        --save-dir GENRE-main/models/$NAME \
        --tensorboard-logdir tensorboard_logs/$NAME \
        --restore-file GENRE-main/models/$BASED_MODEL/model.pt \
        --arch bart_large  \
        --task translation  \
        --criterion label_smoothed_cross_entropy  \
        --source-lang source  \
        --target-lang target  \
        --truncate-source  \
        --label-smoothing 0.1  \
        --max-tokens 1024  \
        --update-freq 1  \
        --max-update $STEP  \
        --required-batch-size-multiple 1  \
        --dropout 0.1  \
        --attention-dropout 0.1  \
        --relu-dropout 0.0  \
        --weight-decay 0.01  \
        --optimizer adam  \
        --adam-betas "(0.9, 0.999)"  \
        --adam-eps 1e-08  \
        --clip-norm 0.1  \
        --lr-scheduler polynomial_decay  \
        --lr 3e-05  \
        --total-num-update $STEP  \
        --warmup-updates 500  \
        --ddp-backend no_c10d  \
        --num-workers 20  \ 
        --share-all-embeddings \
        --layernorm-embedding \
        --share-decoder-input-output-embed  \
        --skip-invalid-size-inputs-valid-test  \
        --log-format json  \
        --log-interval 10  \
        --patience 200  \
    

    Here is my running environment: pytorch-1.6.0, cuda10/cudnn7, python 3.7.7. I want to know whether something is wrong with my fine-tuning procedure, since 83 vs. 87 is a really big gap.

    Thanks for your work again~

    opened by HuiBinR 7
  • Disambiguate "George W. Bush" and "George H. W. Bush"

    To correctly disambiguate the two US presidents, "George W. Bush" and "George H. W. Bush", I have tried several approaches using the hf_e2e_entity_linking_aidayago and hf_e2e_entity_linking_wiki_abs models:

    # Example: End-to-End Entity Linking
    # wikipedia aidayago
    model = GENRE.from_pretrained(os.path.join(cache_dir,"hf_e2e_entity_linking_aidayago")).eval()
    # or wikipedia
    wiki_model = GENRE.from_pretrained(os.path.join(cache_dir,"hf_e2e_entity_linking_wiki_abs")).eval()
    

    w/ mention_trie, mention_to_candidates_dict

    sentences = ["George Bush was the 43rd president of United States"]
    entity_spans = get_entity_spans(
        model,
        sentences,
        mention_trie=Trie([
            model.encode(" {}".format(e))[1:].tolist()
            for e in ["George Bush"]
        ]),
         mention_to_candidates_dict={
            "George Bush": ["George W. Bush", "George H. W. Bush"]
        }
    )
    print(get_markdown(sentences, entity_spans)[0])
    

    Result: WRONG 👎🏾

    [George Bush](https://en.wikipedia.org/wiki/George_H._W._Bush) was the 43rd president of United States
    
    sentences = ["George Bush was the 43rd president of United States"]
    entity_spans = get_entity_spans(
        model,
        sentences,
        mention_trie=Trie([
            model.encode(" {}".format(e))[1:].tolist()
            for e in ["George Bush"]
        ]),
        mention_to_candidates_dict={
            "George Bush": ["George W. Bush"]
        }
    )
    print(get_markdown(sentences, entity_spans)[0])
    

    Result: CORRECT 👍🏾 🥇

    [George Bush](https://en.wikipedia.org/wiki/George_W._Bush) was the 43rd president of United States
    

    w/ candidates_trie

    sentences = ["George Bush was the 43rd president of the United States from 2001 to 2009"]
    
    prefix_allowed_tokens_fn = get_prefix_allowed_tokens_fn(
        model,
        sentences,
        candidates_trie=Trie([
            model.encode(" }} [ {} ]".format(e))[1:].tolist()
            for e in ["George W. Bush", "George H. W. Bush"]
        ])
    )
    out = model.sample(
        sentences,
        prefix_allowed_tokens_fn=prefix_allowed_tokens_fn,
    )
    print(out)
    

    Result: WRONG (+ possible bug...) 👎🏾

    [[{'text': 'George { Bush } [ George H. W. Bush ] was the 43rd president of the United States from 2001 to 2009', 'logprob': tensor(-0.7810)}], [{'text': 'George { Bush } [ George H. W. Bush ] was the 43rd president of the United States from 2001 to { 2009 } [ and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and ...
    

    but using

    sentences = ["George Bush was the 41rd president of the United States from 1989 to 2003"]
    entity_spans = get_entity_spans(
        wiki_model,
        sentences,
        mention_trie=Trie([
            model.encode(" {}".format(e))[1:].tolist()
            for e in ["George Bush"]
        ]),
        mention_to_candidates_dict={
            "George Bush": ["George W. Bush", "George H. W. Bush"]
        }
    )
    print(get_markdown(sentences, entity_spans)[0])
    

    CORRECT Result: 👍🏾 🥇

    [George Bush](https://en.wikipedia.org/wiki/George_H._W._Bush) was the 41rd president of the United States from 2001 to 2009
    

    and

    sentences = ["George Bush was the 43rd president of the United States from 2001 to 2009"]
    entity_spans = get_entity_spans(
        wiki_model,
        sentences,
        mention_trie=Trie([
            model.encode(" {}".format(e))[1:].tolist()
            for e in ["George Bush"]
        ]),
        mention_to_candidates_dict={
            "George Bush": ["George W. Bush", "George H. W. Bush"]
        }
    )
    print(get_markdown(sentences, entity_spans)[0])
    

    WRONG Result: 👎🏾

    [George Bush](https://en.wikipedia.org/wiki/George_H._W._Bush) was the 43rd president of the United States from 2001 to 2009
    

    I therefore tried the KILT Wikipedia trie with the hf_entity_disambiguation_aidayago model:

    with open(os.path.join(cache_dir,"kilt_titles_trie_dict.pkl"), "rb") as f:
        trie = Trie.load_from_dict(pickle.load(f))
    dmodel = GENRE.from_pretrained(os.path.join(cache_dir,"hf_entity_disambiguation_aidayago")).eval()
    sentences = ["[START_ENT] George Bush [END_ENT] was the 43rd president of United States"]
    out = dmodel.sample(
        sentences,
        prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
    )
    print(out)
    

    WRONG Result 👎🏾

    [[{'text': 'George H. W. Bush', 'logprob': tensor(-0.0866)}], [{'text': 'George W. Bush', 'logprob': tensor(-0.7922)}], [{'text': 'George H. W. Bush vomiting incident', 'logprob': tensor(-1.6464)}], [{'text': 'George H. W. Bush 1988 presidential campaign', 'logprob': tensor(-1.6795)}], [{'text': 'George H. W. Bush Supreme Court candidates', 'logprob': tensor(-2.1032)}]]
    

    and for the model hf_wikipage_retrieval:

    smodel = GENRE.from_pretrained(os.path.join(cache_dir,"hf_wikipage_retrieval")).eval()
    sentences = ["[START_ENT] George Bush [END_ENT] was the 43rd president of United States"]
    out = smodel.sample(
        sentences,
        prefix_allowed_tokens_fn=lambda batch_id, sent: trie.get(sent.tolist()),
    )
    print(out)
    

    WRONG Result 👎🏾

    [[{'text': 'George H. W. Bush', 'logprob': tensor(-0.0850)}], [{'text': 'George W. Bush', 'logprob': tensor(-0.7715)}], [{'text': 'George H. W. Bush Supreme Court candidates', 'logprob': tensor(-1.3834)}], [{'text': 'George H. W. Bush 1988 presidential campaign', 'logprob': tensor(-1.4211)}], [{'text': 'George H. W. Bush vomiting incident', 'logprob': tensor(-2.1070)}]]
    

    So to recap, I got only two cases where the best prediction is the desired one:

    sentences = ["George Bush was the 43rd president of United States"]
    entity_spans = get_entity_spans(
        model,
        sentences,
        mention_trie=Trie([
            model.encode(" {}".format(e))[1:].tolist()
            for e in ["George Bush"]
        ]),
        mention_to_candidates_dict={
            "George Bush": ["George W. Bush"]
        }
    )
    print(get_markdown(sentences, entity_spans)[0])
    

    Result: CORRECT 👍🏾 🥇

    [George Bush](https://en.wikipedia.org/wiki/George_W._Bush) was the 43rd president of United States
    

    and

    sentences = ["George Bush was the 41rd president of the United States from 1989 to 2003"]
    entity_spans = get_entity_spans(
        wiki_model,
        sentences,
        mention_trie=Trie([
            model.encode(" {}".format(e))[1:].tolist()
            for e in ["George Bush"]
        ]),
        mention_to_candidates_dict={
            "George Bush": ["George W. Bush", "George H. W. Bush"]
        }
    )
    print(get_markdown(sentences, entity_spans)[0])
    

    CORRECT Result: 👍🏾 🥇

    [George Bush](https://en.wikipedia.org/wiki/George_H._W._Bush) was the 41rd president of the United States from 2001 to 2009
    

    Questions

    • Am I missing some other ways to disambiguate?
    • Assuming that both George W. Bush and George H. W. Bush are in the knowledge graph, does the model need additional context, such as other mention_to_candidates_dict or mention_trie values?
    • Supposing that one president (like George W. Bush) were missing, how would I add his Wikipedia entity page?
    • Are the "wrong" cases formally correct or am I missing something?

    Thanks a lot!

    opened by loretoparisi 7
  • Best practice to link already known entities

    I'm running some examples from this page with the hf_e2e_entity_linking_aidayago model. I have already found the entities, so I only want to link them to Wikipedia given a list of hints. I have encountered some problems:

    • When I put more than one mention in mention_trie I don't see all the results, for example in this case:
    from genre.entity_linking import get_end_to_end_prefix_allowed_tokens_fn_hf as get_prefix_allowed_tokens_fn
    from genre.utils import get_entity_spans_hf as get_entity_spans
    model = GENRE.from_pretrained("../models/hf_e2e_entity_linking_aidayago").eval()
    
    sentences = ["In 1921, Einstein received a Nobel Prize."]
    
    get_entity_spans(
        model,
        sentences,
        mention_trie=Trie([
            model.encode(" {}".format(e))[1:].tolist()
            for e in ["Einstein", "Nobel Prize"]
        ]),
        mention_to_candidates_dict={
            "Einstein": ["Albert Einstein", "Einstein (surname)"],
            "Nobel Prize": ["Nobel Prize in Physics", "Nobel Prize in Medicine"],
        }
    )
    

    My output is:

    [[(9, 8, 'Albert_Einstein')]]
    

    If I remove Einstein from both mention_trie and mention_to_candidates_dict, the result is:

    [[(29, 11, 'Nobel_Prize_in_Physics')]]
    

    But in the example shown in the README, both entities should be visible.

    • Moreover, I'm encountering some problems (maybe a bug) in the definition of the entities to search, like here:
    sentences = ["George Walker Bush (born July 6, 1946) is an American politician and businessman who served as the 43rd president of the United States from 2001 to 2009. A member of the Republican Party, Bush previously served as the 46th governor of Texas from 1995 to 2000. He was born into the Bush family; his father, George H. W. Bush, was the 41st president of the United States from 1989 to 1993."]
    
    get_entity_spans(
        model,
        sentences,
        mention_trie=Trie([
            model.encode(" {}".format(e))[1:].tolist()
            for e in ["George Walker Bush"]
        ]),
        mention_to_candidates_dict={
            "George Walker Bush": ["George W. Bush", "George H. W. Bush"],
        }
    )
    

    The result is:

    ---------------------------------------------------------------------------
    RuntimeError                              Traceback (most recent call last)
    <ipython-input-12-2ae034f807cd> in <module>
          1 sentences = ["George Walker Bush (born July 6, 1946) is an American politician and businessman who served as the 43rd president of the United States from 2001 to 2009. A member of the Republican Party, Bush previously served as the 46th governor of Texas from 1995 to 2000. He was born into the Bush family; his father, George H. W. Bush, was the 41st president of the United States from 1989 to 1993."]
          2 
    ----> 3 get_entity_spans(
          4     model,
          5     sentences,
    
    ~/genre/genre/utils.py in get_entity_spans_hf(model, input_sentences, mention_trie, candidates_trie, mention_to_candidates_dict, redirections)
        176     redirections=None,
        177 ):
    --> 178     return _get_entity_spans(
        179         model,
        180         input_sentences,
    
    ~/genre/utils.py in _get_entity_spans(model, input_sentences, prefix_allowed_tokens_fn, redirections)
        141     )
        142 
    --> 143     return get_entity_spans_finalize(
        144         input_sentences, output_sentences, redirections=redirections
        145     )
    
    ~/genre/utils.py in get_entity_spans_finalize(input_sentences, output_sentences, redirections)
        218                     status = "m"
        219                 else:
    --> 220                     raise RuntimeError
        221 
        222             elif status == "m":
    
    RuntimeError: 
    

    So my question is: if I already have the entities and only want to link them to Wikipedia (I also have candidate Wikipedia pages, so I only need to disambiguate them), which is the best function (get_prefix_allowed_tokens_fn, get_entity_spans, ...?) and how do I have to declare my entities and my candidates to avoid these problems?

    Thank you! It seems to be very promising work! 😊

    opened by paulthemagno 6
  • 'super' object has no attribute 'generate'... Issues following example...

    Hey, I have been having a few issues with getting the basic example to work on colab (python version = 3.7.10)

    If I install GENRE from the cloned repo and install fairseq from the branch you suggest, I can't import GENRE from genre.fairseq_model.py:

    from genre.fairseq_model import GENRE
    

    gives

    ---------------------------------------------------------------------------
    TypeError                                 Traceback (most recent call last)
    <ipython-input-19-8d627c96b455> in <module>()
          1 import pickle
    ----> 2 from genre.fairseq_model import GENRE
          3 from genre.entity_linking import get_end_to_end_prefix_allowed_tokens_fn_fairseq as get_prefix_allowed_tokens_fn
          4 from genre.utils import get_entity_spans_fairseq as get_entity_spans
          5 model = GENRE.from_pretrained("../models/fairseq_e2e_entity_linking_aidayago").eval()
    
    2 frames
    /content/gdrive/MyDrive/entity_linking_demo/fairseq/fairseq/criterions/__init__.py in <module>()
         22     CRITERION_DATACLASS_REGISTRY,
         23 ) = registry.setup_registry(
    ---> 24     "--criterion", base_class=FairseqCriterion, default="cross_entropy"
         25 )
         26 
    
    TypeError: cannot unpack non-iterable NoneType object
    

    So then I tried installing GENRE from the cloned repo and !pip install fairseq and I got further

    import pickle
    import genre
    from genre.trie import Trie
    from genre.fairseq_model import GENRE
    with open("kilt_titles_trie_dict.pkl", "rb") as f:
        trie = Trie.load_from_dict(pickle.load(f))
    model = GENRE.from_pretrained("fairseq_entity_disambiguation_aidayago").eval()
    model.sample(sentences=["Einstein was a [START_ENT] German [END_ENT] physicist."])
    

    Which gives...

    1042301B [00:00, 1105195.62B/s]
    456318B [00:00, 601446.35B/s]
    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-6-f9cae36978b8> in <module>()
          6     trie = Trie.load_from_dict(pickle.load(f))
          7 model = GENRE.from_pretrained("fairseq_entity_disambiguation_aidayago").eval()
    ----> 8 model.sample(sentences=["Einstein was a [START_ENT] German [END_ENT] physicist."])
    
    1 frames
    /content/drive/MyDrive/entity_linking_demo/GENRE/genre/fairseq_model.py in sample(self, sentences, beam, verbose, text_to_id, marginalize, marginalize_lenpen, max_len_a, max_len_b, **kwargs)
         41             max_len_a=max_len_a,
         42             max_len_b=max_len_b,
    ---> 43             **kwargs,
         44         )
         45         outputs = [
    
    /content/drive/MyDrive/entity_linking_demo/GENRE/genre/fairseq_model.py in generate(self, *args, **kwargs)
         90 
         91     def generate(self, *args, **kwargs) -> List[List[Dict[str, torch.Tensor]]]:
    ---> 92         return super(BARTHubInterface, self).generate(*args, **kwargs)
         93 
         94 
    
    AttributeError: 'super' object has no attribute 'generate'
    

    I'm probably doing something pretty stupid, but I'm just trying to follow the examples as I read them. I got the same AttributeError ('super' object has no attribute 'generate') when I tried a couple of different models. Any advice?

    opened by epb378 6
  • colab script to run GENRE

    Hi, I wrote a script to run GENRE in colab: https://gist.github.com/raven44099/32babed67e122427ec36e1fafd142c08#file-entitylinking_genre_colab_minimalcode-ipynb

    Because I searched the issues for how to run GENRE in google colab, I thought I should share this minimal code here. Hope you like it.

    opened by raven44099 0
  • mGENRE finetuning issue

    Thank you for the great work! I want to fine-tune the mGENRE model on a new dataset to get better quality, but your model differs from widespread ordinary models that have a clear structure and just a forward method. Can you please clarify a little what the training process of this model looks like, so that I can start fine-tuning it?

    opened by SergeyPetrakov 0
  • Fail to Reproduce the dev score of GENRE Document Retrieval

    Hi, I was trying to reproduce the page-level document retrieval results of GENRE, but my dev score is significantly lower than that of the model you provided, fairseq_wikipage_retrieval.

    Here are my details for training:

    Training set: Following Section 4.1 in the paper, I mix and shuffle the BLINK & 8 KILT jsonl training files into a single file, and use the scripts convert_kilt_to_fairseq.py & preprocess_fairseq.sh to process the training file.

    Dev set: I just concatenate all 11 KILT dev jsonl files into one single jsonl file, then use the same process mentioned above.

    Training hyperparameters: I use the script train.sh for training. I set keep-best-checkpoints=1 to save the model that performs best on the dev set.

    Following Appendix A.3, I notice that 128 GPUs were used with max-tokens=1024 and update-freq=1. I use 16 GPUs for training, so I set max-tokens=8192 to keep the total max tokens per update at 128*1024.

    Here are the dev results on KILT of the model you provided, fairseq_wikipage_retrieval, and of my own reproduced model.

    | model_name | fever | aidayago2 | wn | cweb | trex | structured_zeroshot | nq | hotpotqa | triviaqa | eli5 | wow |
    |---|---|---|---|---|---|---|---|---|---|---|---|
    | genre_fairseq_wikipage_retrieval (provided) | 0.846907 | 0.927467 | 0.876914 | 0.705305 | 0.7968 | 0.948443 | 0.642228 | 0.518214 | 0.71114 | 0.134705 | 0.563196 |
    | My reproduced model | 0.826217 | 0.927048 | 0.874264 | 0.713342 | 0.716 | 0.864125 | 0.576665 | 0.399821 | 0.701064 | 0.13935 | 0.570727 |

    The results for TREX, structured_zeroshot, NQ, and HotpotQA are lower than for the model you provided. Could you give me some help in finding out what is wrong?

    Thank you very much. @nicola-decao

    opened by ma787639046 7
  • Setup to reproduce E2E-EL results / Wikipedia model worse than fine-tuned model?

    I am trying to reproduce the end-to-end entity linking results from the paper (Table 2). I have read appendix A.2 and issues #30 and #37 and tried to follow them as closely as possible, but I fail to get the same numbers as in the paper, or even numbers close to them. In particular, I get very bad results from the model without fine-tuning.

    I would very much appreciate it if you could tell me what I am doing wrong. If you could explain the exact setup with which you got the results in Table 2, this would not only help me reproduce the results, but also allow other users of your system to get the best performance.

    Candidate sets

    I got from issues #30 and #37 that you use candidate sets and a mention trie. I first create the candidate sets and then create a mention trie with all mentions which have at least 1 candidate.

    In appendix A.2 you write that you used the candidate sets from Kolitsas et al. with additions of the table computed by Hoffart et al.

    I downloaded the data by Kolitsas et al. from the link provided in their GitHub repository (https://github.com/dalab/end2end_neural_el, file prob_yago_crosswikis_wikipedia_p_e_m.txt), and the data by Hoffart et al. (https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/ambiverse-nlu/aida/downloads, file aida_means.tsv).

    Since you mentioned in issue #30 that you removed all mentions that do not start with an uppercase letter, and in issue #37 that you probably removed mentions that start with punctuation, I do that too.

    The following code produces my mention_to_candidates_dict:

    import re
    import string
    import itertools
    import pickle
    
    
    def read_dalab_candidates():
        for line in open("data/dalab/prob_yago_crosswikis_wikipedia_p_e_m.txt"):
            line = line[:-1]
            columns = line.split("\t")
            mention = columns[0]
            for column in columns[2:]:
                if len(column.strip()) == 0:
                    continue
                values = column.split(",")
                candidate = ",".join(values[2:])
                candidate = candidate.replace("_", " ")
                yield mention, candidate
    
    
    def hex2int(hexa: str) -> int:
        return int(hexa, 16)
    
    
    def replace_unicode(u_str):
        matches = set(re.findall("\\\\u....", u_str))
        for match in matches:
            u_str = u_str.replace(match, chr(hex2int(match[2:])))
        return u_str
    
    
    PUNCTUATION_CHARS = set(string.punctuation)
    
    
    def filter_mention(mention):
        if mention[0].islower():
            return True
        if mention[0] in PUNCTUATION_CHARS:
            return True
        return False
    
    
    def read_aida_candidates():
        for line in open("data/aida/aida_means.tsv"):
            line = line[:-1]
            values = line.split("\t")
            mention = replace_unicode(values[0][1:-1])
            candidate = replace_unicode(values[1]).replace("_", " ")
            yield mention, candidate
    
    
    if __name__ == "__main__":
        mention_candidates_dict = {}
        for mention, candidate in itertools.chain(read_dalab_candidates(), read_aida_candidates()):
            if filter_mention(mention):
                continue
            if mention not in mention_candidates_dict:
                mention_candidates_dict[mention] = set()
            mention_candidates_dict[mention].add(candidate)
        for mention in mention_candidates_dict:
            mention_candidates_dict[mention] = sorted(mention_candidates_dict[mention])
        with open("data/mention_candidates_dict.pkl", "wb") as f:
            pickle.dump(mention_candidates_dict, f)
    

    Mention trie

    The following code creates a mention trie with all mentions from the mention_to_candidates_dict.

    import sys
    import pickle
    from tqdm import tqdm
    from genre.fairseq_model import GENRE
    from genre.trie import Trie
    
    
    if __name__ == "__main__":
        sys.setrecursionlimit(10000)
        model_path = "models/fairseq_e2e_entity_linking_wiki_abs"
        model = GENRE.from_pretrained(model_path).eval()
        with open("data/mention_to_candidates_dict.pkl", "rb") as f:
            mention_to_candidates_dict = pickle.load(f)
        mention_trie = Trie()
        for mention in tqdm(mention_to_candidates_dict):
            encoded = model.encode(" {}".format(mention))[1:].tolist()
            mention_trie.add(encoded)
        out_file = "data/mention_trie.pkl"
        with open(out_file, "wb") as f:
            pickle.dump(mention_trie, f)
    

    Results

    I use the mention_trie and mention_to_candidates_dict to restrict the beam search, as in the example.

    When an article is too long, I split it iteratively into 2, 3, 4, ... parts, until the parts are short enough to be processed by the model. I split the article into sentences first (using spaCy), and then concatenate n/k sentences to create the k parts.
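
    For illustration, a small sketch of this splitting strategy (the spaCy pipeline name and the length check are placeholders for whatever the model's input limit requires):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # any sentence splitter works here

    def split_article(text, is_short_enough, max_parts=20):
        # split into sentences, then try k = 1, 2, 3, ... parts of roughly n/k
        # sentences each until every part satisfies the length check
        sentences = [s.text for s in nlp(text).sents]
        parts = [text]
        for k in range(1, max_parts + 1):
            size = -(-len(sentences) // k)  # ceil(len(sentences) / k)
            parts = [
                " ".join(sentences[i:i + size])
                for i in range(0, len(sentences), size)
            ]
            if all(is_short_enough(part) for part in parts):
                break
        return parts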

    The results are much worse than the results reported in the paper. With the Wikipedia model (fairseq_e2e_entity_linking_wiki_abs) I get an F1-score of ~33% (28% precision, 40% recall) on the MSNBC benchmark. With the model fine-tuned on AidaYago (fairseq_e2e_entity_linking_aidayago) the result is better, with ~68% F1-score (72% precision, 64% recall).

    Since you write in the paper that you "[...] considered only mentions that have entities in the KB" and in issue #30 that you "... used the entity universe from https://github.com/dalab/end2end_neural_el", I also tried to restrict candidates to entities from the file entities_universe.txt from that source. However, the results are worse, with ~26% F1-score for the Wikipedia model and ~41% F1-score for the fine-tuned model.

    Example

    Let's look at an example. The text is the first half of the first article in the MSNBC benchmark.

    import pickle
    import sys
    
    from genre.utils import get_entity_spans_fairseq as get_entity_spans
    from genre.fairseq_model import GENRE
    from genre.utils import get_markdown
    
    
    if __name__ == "__main__":
        model_path = "models/fairseq_e2e_entity_linking_wiki_abs"
        dict_path = "data/mention_to_candidates_dict.pkl"
        trie_path = "data/mention_trie.pkl"
        model = GENRE.from_pretrained(model_path).eval()
        with open(trie_path, "rb") as f:
            mention_trie = pickle.load(f)
        with open(dict_path, "rb") as f:
            mention_to_candidates_dict = pickle.load(f)
    
        text = """Home Depot CEO Nardelli quits Home-improvement retailer's chief executive had been criticized over pay ATLANTA - Bob Nardelli abruptly resigned Wednesday as chairman and chief executive of The Home Depot Inc. after a six-year tenure that saw the world’s largest home improvement store chain post big profits but left investors disheartened by poor stock performance. Nardelli has also been under fire by investors for his hefty pay and is leaving with a severance package valued at about $210 million. He became CEO in December 2000 after being passed over for the top job at General Electric Co., where Nardelli had been a senior executive. Home Depot said Nardelli was being replaced by Frank Blake, its vice chairman, effective immediately. Blake’s appointment is permanent, Home Depot spokesman Jerry Shields said. What he will be paid was not immediately disclosed, Shields said. The company declined to make Blake available for comment, and a message left for Nardelli with his secretary was not immediately returned. Before Wednesday’s news, Home Depot’s stock had been down more than 3 percent on a split-adjusted basis since Nardelli took over. Nardelli’s sudden departure was stunning in that he told The Associated Press as recently as Sept. 1 that he had no intention of leaving, and a key director also said that the board was pleased with Nardelli despite the uproar by some investors. Asked in that interview if he had thought of hanging up his orange apron and leaving Home Depot, Nardelli said unequivocally that he hadn’t. Asked what he thought he would be doing 10 years from now, Nardelli said, “Selling hammers.” For The Home Depot? “Absolutely,” he said at the time. Home Depot said Nardelli’s decision to resign was by mutual agreement with the Atlanta-based company. “We are very grateful to Bob for his strong leadership of The Home Depot over the past six years. Under Bob’s tenure, the company made significant and necessary investments that greatly improved the company’s infrastructure and operations, expanded our markets to include wholesale distribution and new geographies, and undertook key strategic initiatives to strengthen the company’s foundation for the future,” Home Depot’s board said in a statement. Nardelli was a nuts-and-bolts leader, a former college football player and friend of President Bush. He helped increase revenue and profits at Home Depot and increase the number of stores the company operates to more than 2,000. Home Depot’s earnings per share have increased by approximately 150 percent over the last five years."""
    
        sentences = [text]
        entity_spans = get_entity_spans(
            model,
            sentences,
            mention_trie=mention_trie,
            mention_to_candidates_dict=mention_to_candidates_dict
        )
        markdown = get_markdown(sentences, entity_spans)[0]
        print(markdown)
    

    Output: Home Depot CEO Nardelli quits Home-improvement retailer's chief executive had been criticized over pay ATLANTA - Bob Nardelli abruptly resigned Wednesday as chairman and chief executive of The Home Depot Inc. after a six-year tenure that saw the world’s largest home improvement store chain post big profits but left investors disheartened by poor stock performance. Nardelli has also been under fire by investors for his hefty pay and is leaving with a severance package valued at about $210 million. He became CEO in December 2000 after being passed over for the top job at General Electric Co., where Nardelli had been a senior executive. Home Depot said Nardelli was being replaced by Frank Blake, its vice chairman, effective immediately. Blake’s appointment is permanent, Home Depot spokesman Jerry Shields said. What he will be paid was not immediately disclosed, Shields said. The company declined to make Blake available for comment, and a message left for Nardelli with his secretary was not immediately returned. Before Wednesday’s news, Home Depot’s stock had been down more than 3 percent on a split-adjusted basis since Nardelli took over. Nardelli’s sudden departure was stunning in that he told The Associated Press as recently as Sept. 1 that he had no intention of leaving, and a key director also said that the board was pleased with Nardelli despite the uproar by some investors. Asked in that interview if he had thought of hanging up his orange apron and leaving Home Depot, Nardelli said unequivocally that he hadn’t. Asked what he thought he would be doing 10 years from now, Nardelli said, “Selling hammers.” For The Home Depot? “Absolutely,” he said at the time. Home Depot said Nardelli’s decision to resign was by mutual agreement with the Atlanta-based company. “We are very grateful to Bob for his strong leadership of The Home Depot over the past six years. Under Bob’s tenure, the company made significant and necessary investments that greatly improved the company’s infrastructure and operations, expanded our markets to include wholesale distribution and new geographies, and undertook key strategic initiatives to strengthen the company’s foundation for the future,” Home Depot’s board said in a statement. Nardelli was a nuts-and-bolts leader, a former college football player and friend of President Bush. He helped increase revenue and profits at Home Depot and increase the number of stores the company operates to more than 2,000. Home Depot’s earnings per share have increased by approximately 150 percent over the last five years.

    There are many wrong predictions:

    • Mismatching boundaries ("The Home Depot" vs. "The Home Depot Inc.", "General Electric" vs. "General Electric Co.")
    • Stopwords (The company, The Associated Press, He)
    • Numbers (2,000, 150)
    • Disambiguation errors (Bob)
    • Other weird stuff (President Bush)

    Clearly we can't expect the model to predict everything correctly, but the amount of mistakes is unexpected and does not fit with the good results reported in the paper.

    The result of the fine-tuned model looks better:

    Output: Home Depot CEO Nardelli quits Home-improvement retailer's chief executive had been criticized over pay ATLANTA - Bob Nardelli abruptly resigned Wednesday as chairman and chief executive of The Home Depot Inc. after a six-year tenure that saw the world’s largest home improvement store chain post big profits but left investors disheartened by poor stock performance. Nardelli has also been under fire by investors for his hefty pay and is leaving with a severance package valued at about $210 million. He became CEO in December 2000 after being passed over for the top job at General Electric Co., where Nardelli had been a senior executive. Home Depot said Nardelli was being replaced by Frank Blake, its vice chairman, effective immediately. Blake’s appointment is permanent, Home Depot spokesman Jerry Shields said. What he will be paid was not immediately disclosed, Shields said. The company declined to make Blake available for comment, and a message left for Nardelli with his secretary was not immediately returned. Before Wednesday’s news, Home Depot’s stock had been down more than 3 percent on a split-adjusted basis since Nardelli took over. Nardelli’s sudden departure was stunning in that he told The Associated Press as recently as Sept. 1 that he had no intention of leaving, and a key director also said that the board was pleased with Nardelli despite the uproar by some investors. Asked in that interview if he had thought of hanging up his orange apron and leaving Home Depot, Nardelli said unequivocally that he hadn’t. Asked what he thought he would be doing 10 years from now, Nardelli said, “Selling hammers.” For The Home Depot? “Absolutely,” he said at the time. Home Depot said Nardelli’s decision to resign was by mutual agreement with the Atlanta-based company. “We are very grateful to Bob for his strong leadership of The Home Depot over the past six years. Under Bob’s tenure, the company made significant and necessary investments that greatly improved the company’s infrastructure and operations, expanded our markets to include wholesale distribution and new geographies, and undertook key strategic initiatives to strengthen the company’s foundation for the future,” Home Depot’s board said in a statement. Nardelli was a nuts-and-bolts leader, a former college football player and friend of President Bush. He helped increase revenue and profits at Home Depot and increase the number of stores the company operates to more than 2,000. Home Depot’s earnings per share have increased by approximately 150 percent over the last five years.

    Questions

    Could you please help me (and other users of your system) in answering the following questions to clarify the best-performing setup:

    1. I understand the paper to say that in Table 2, the results on all benchmarks except AIDA are from the Wikipedia model (out-of-domain scenario), and only the results on the AIDA benchmark are from the fine-tuned model (in-domain scenario). Is that correct?
    2. Do you have an idea why in my case the Wikipedia model is so much worse than the fine-tuned model? I get this result consistently on MSNBC, AIDA (here it is expected), KORE50, Spotlight, another benchmark with news articles and even on a benchmark with (excerpts of) Wikipedia articles (where the Wikipedia model should clearly be better).
    3. What exactly do you use as a knowledge base (that is, the set of entities)? The entities_universe.txt, all Wikipedia article titles, or something else? Is it the same for the in-domain and out-of-domain experiments?
    4. What exactly do you use as candidate sets? In the paper you write that you combine the sets from Kolitsas et al. and Hoffart et al., but in issue #37 it seems you only use the sets by Kolitsas et al. Do the candidate sets in the in-domain and out-of-domain experiments differ?
    5. I experience that the Wikipedia model links numbers (e.g., 2000, 12,580.35) and stop words (e.g., The, It), all resulting in false positives. Do you filter more mentions than lowercase mentions, mentions starting with punctuation and mentions with non-English characters (as you say in issue #37)?
    6. How do you split longer texts? Here I mean texts that are too long to be processed by the model, resulting in None, [] (empty list), or truncated beams which are shorter than the input texts. I experience that the predictions of the model can differ a lot with different contexts (single sentences, paragraphs, texts spanning multiple paragraphs, or full articles).

    Best regards, Matthias

    opened by hertelm 2
  • wiki-redirects.txt file and tuto for preprocessing mgenre data

    Hello

    I'm trying to preprocess a Wikipedia dump for custom mGENRE training, but I don't have access to the {}wiki-redirects.txt file (with {} being the language of the dump).

    This file is processed in preprocess_wikidata with the step option set to "redirects" to generate a pkl dictionary which is then used in process_anchor. It is looked for in the wikipedia_redirect/target folder.

    I couldn't find any script to generate this redirect file from a Wikipedia dump, nor any explanation of the file's format, so I couldn't recreate the script myself.

    Similarly, I haven't found a tutorial explaining how to chain the different mGENRE preprocessing scripts in order to create the datasets and start training. I think I understood the role of each script and the order in which to execute them, but I wouldn't mind having an explanation from start to finish.

    Thank you for your answers.

    opened by Denescor 0
Releases
  • v0.1.3 (Jun 7, 2022)

  • v0.1.2 (Nov 4, 2021)

    What's Changed

    • Add integration test for document retrieval example by @ynouri in https://github.com/facebookresearch/GENRE/pull/24
    • Corrected bug in setting BOS/EOS by @nicola-decao in https://github.com/facebookresearch/GENRE/commit/40ce90aaf0421eaf7f6b1bd23604bb70ec1301f1

    Full Changelog: https://github.com/facebookresearch/GENRE/compare/v0.1.1...v0.1.2

  • v0.1.1 (Apr 5, 2021)

Owner
Meta Research