Enhancing Keyphrase Extraction from Academic Articles with their Reference Information

Overview

Dataset and source code for the paper "Enhancing Keyphrase Extraction from Academic Articles with their Reference Information".

This project analyzes the impact of introducing the titles of a paper's references on the performance of keyphrase extraction from scientific literature. It uses three datasets: SemEval-2010, PubMed and LIS-2000, located in the Dataset folder. Five extraction methods are compared: two unsupervised methods (TF-IDF and TextRank) and three supervised methods (Naive Bayes, CRF and BiLSTM-CRF). The first four are traditional keyphrase extraction methods, located in the ML folder; the last one is a deep learning method, located in the DL folder.
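
To make the idea concrete, here is a minimal sketch of how corpus variants with and without reference titles could be assembled from a single dataset record. The build_corpus_variants helper and the variant names are hypothetical illustrations rather than the project's actual preprocessing code; the field names follow the dataset description below.

def build_corpus_variants(paper):
    """Build text variants from one JSON record (fields described under
    'Dataset Description'); purely illustrative, not the real pipeline."""
    title = paper.get("title", "")
    abstract = paper.get("abstract", "")
    refs = paper.get("references", [])
    # The references field is assumed to hold only the titles of the cited papers.
    ref_text = " ".join(refs) if isinstance(refs, list) else str(refs)
    return {
        "T": title,                                # title only
        "T+A": f"{title} {abstract}",              # title + abstract
        "T+R": f"{title} {ref_text}",              # title + reference titles
        "T+A+R": f"{title} {abstract} {ref_text}", # all three combined
    }

Any of the extractors above (TF-IDF, TextRank, etc.) can then be run on each variant and the results compared to measure the contribution of the reference information.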

Directory structure

Keyphrase_Extraction:                 Root directory
│  dl.bat:                            Batch commands to run the deep learning model
│  ml.bat:                            Batch commands to run the traditional models
│ 
├─Dataset:                            Store experimental datasets
│      SemEval-2010:                  Contains 244 scientific papers 
│      PubMed:                        Contains 1316 scientific papers
│      LIS-2000:                      Contains 2000 scientific papers
│ 
├─DL:                                 Store the source code of the deep learning model
│  │  build_path.py:                  Create file paths for saving preprocessed data
│  │  crf.py:                         Source code of the CRF algorithm implementation (using the PyTorch framework)
│  │  main.py:                        The main function of running the program
│  │  model.py:                       Source code of BiLSTM-CRF model
│  │  preprocess.py:                  Source code of preprocessing function
│  │  textrank.py:                    Source code of TextRank algorithm implementation
│  │  tf_idf.py:                      Source code of TF-IDF algorithm implementation
│  │  utils.py:                       Some auxiliary functions
│  ├─models:                          Parameter configuration of deep learning models
│  └─datas
│        tags:                        Label settings for sequence labeling
│ 
└─ML:                                 Store the source code of the traditional models
    │  build_path.py:                 Create file paths for saving preprocessed data
    │  configs.py:                    Path configuration file
    │  crf.py:                        Source code of the CRF algorithm implementation (using the CRF++ toolkit)
    │  evaluate.py:                   Source code for result evaluation
    │  naivebayes.py:                 Source code of the Naive Bayes algorithm implementation (using the KEA-3.0 toolkit)
    │  preprocessing.py:              Source code of preprocessing function
    │  textrank.py:                   Source code of TextRank algorithm implementation
    │  tf_idf.py:                     Source code of TF-IDF algorithm implementation
    │  utils.py:                      Some auxiliary functions
    ├─CRF++:                          CRF++ Toolkit
    └─KEA-3.0:                        KEA-3.0 Toolkit

Dataset Description

The Dataset folder includes the following three JSON files:

  • SemEval-2010: The SemEval-2010 Task 5 dataset; it contains 244 scientific papers and is available at https://semeval2.fbk.eu/semeval2.php?location=data.
  • PubMed: Contains 1316 scientific papers from PubMed (https://github.com/boudinfl/ake-datasets/tree/master/datasets/PubMed).
  • LIS-2000: Contains 2000 scientific papers from journals in Library and Information Science (LIS).

Each line of each JSON file includes the following fields (see the loading sketch after this list):

  • title (T): The title of the paper.
  • abstract (A): The abstract of the paper.
  • introduction (I): The introduction of the paper.
  • conclusion (C): The conclusion of the paper.
  • body1 (Fp): The first sentence of each paragraph.
  • body2 (Lp): The last sentence of each paragraph.
  • full_text (F): The full text of the paper.
  • references (R): The list of references; only the title of each reference is provided.
  • keywords (K): The keywords of the paper, annotated manually.
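
As a quick illustration, the sketch below reads the first record from one of these files, assuming one JSON object per line as described above (the file name is a placeholder):

import json

# Placeholder file name; point this at the actual JSON file in the Dataset folder.
with open("Dataset/PubMed.json", "r", encoding="utf-8") as f:
    for line in f:
        paper = json.loads(line)       # one paper per line
        print(paper["title"])          # title (T)
        print(paper["references"])     # reference titles (R)
        print(paper["keywords"])       # manually annotated keywords (K)
        break                          # only inspect the first record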

Quick Start

To make it easy to reproduce the experimental results, the project uses bat batch files to run the programs uniformly (Windows only). The dl.bat file is the batch command for running the deep learning model, and the ml.bat file is the batch command for running the traditional algorithms.

How does it work?

In a Windows environment, press Win + R and enter cmd to open a command prompt, then switch to the project's root directory (Keyphrase_Extraction). Enter dl.bat to run the deep learning model, or ml.bat to run the traditional algorithms; each command produces the corresponding keyphrase extraction results.

Experimental results

The following tables show the influence of reference information on the keyphrase extraction results of TF-IDF, TextRank, NB, CRF and BiLSTM-CRF.

Table 1: Keyphrase extraction performance of multiple corpora constructed using different logical structure texts on the SemEval-2010 dataset.

Table 2: Keyphrase extraction performance of multiple corpora constructed using different logical structure texts on the PubMed dataset.

Table 3: Keyphrase extraction performance of multiple corpora constructed using different logical structure texts on the LIS-2000 dataset.

Note: The yellow, green and blue bold fonts in the tables mark the largest P, R and F1 values obtained from different corpora with the same model, respectively.

Dependency packages

Before running this project, make sure the following Python packages are installed in your environment:

  • pytorch 1.7.1
  • nltk 3.5
  • numpy 1.19.2
  • pandas 1.1.3
  • tqdm 4.50.2

Citation

Please cite the following paper if you use this code and dataset in your work.

Chengzhi Zhang, Lei Zhao, Mengyuan Zhao, Yingyi Zhang. Enhancing Keyphrase Extraction from Academic Articles with their Reference Information. Scientometrics, 2021. (in press) [arXiv]
