Data, model training, and evaluation code for "PubTables-1M: Towards a universal dataset and metrics for training and evaluating table extraction models".

Overview

PubTables-1M

This repository contains training and evaluation code for the paper "PubTables-1M: Towards a universal dataset and metrics for training and evaluating table extraction models".

The goal of PubTables-1M is to create a large, detailed, high-quality dataset for training and evaluating a wide variety of models for the tasks of table detection, table structure recognition, and functional analysis. It contains:

  • 460,589 annotated document pages containing tables for table detection.
  • 947,642 fully annotated tables including text content and complete location (bounding box) information for table structure recognition and functional analysis.
  • Full bounding boxes in both image and PDF coordinates for all table rows, columns, and cells (including blank cells), as well as other annotated structures such as column headers and projected row headers.
  • Rendered images of all tables and pages.
  • Bounding boxes and text for all words appearing in each table and page image.
  • Additional cell properties not used in the current model training.

Additionally, cells in the headers are canonicalized and we implement multiple quality control steps to ensure the annotations are as free of noise as possible. For more details, please see our paper.

News

10/21/2021: The full PubTables-1M dataset has been officially released on Microsoft Research Open Data.

Getting the Data

PubTables-1M is available for download from Microsoft Research Open Data.

It comes in 5 tar.gz files:

  • PubTables-1M-Image_Page_Detection_PASCAL_VOC.tar.gz
  • PubTables-1M-Image_Page_Words_JSON.tar.gz
  • PubTables-1M-Image_Table_Structure_PASCAL_VOC.tar.gz
  • PubTables-1M-Image_Table_Words_JSON.tar.gz
  • PubTables-1M-PDF_Annotations_JSON.tar.gz

To download from the command line:

  1. Visit the dataset home page with a web browser and click Download in the top left corner. This will create a link to download the dataset from Azure with a unique access token for you that looks like https://msropendataset01.blob.core.windows.net/pubtables1m?[SAS_TOKEN_HERE].
  2. You can then use the command line tool azcopy to download all of the files with the following command:
azcopy copy "https://msropendataset01.blob.core.windows.net/pubtables1m?[SAS_TOKEN_HERE]" "/path/to/your/download/folder/" --recursive

Then unzip each of the archives from the command line using:

tar -xzvf yourfile.tar.gz

Code Installation

Create a conda environment from the yml file and activate it as follows

conda env create -f environment.yml
conda activate tables-detr

Model Training

The code trains models for 2 different sets of table extraction tasks:

  1. Table Detection
  2. Table Structure Recognition + Functional Analysis

For a detailed description of these tasks and the models, please refer to the paper.

Sample training commands:

cd src
python main.py --data_root_dir /path/to/detection --data_type detection
python main.py --data_root_dir /path/to/structure --data_type structure

GriTS metric evaluation

GriTS metrics proposed in the paper can be evaluated once you have trained a model. We consider the model trained in the previous step. This script calculates all 4 variations presented in the paper. Based on the model, one can tune which variation to use. The table words dir path is not required for all variations but we use it in our case as PubTables1M contains this information.

python main.py --data_root_dir /path/to/structure --model_load_path /path/to/model --table_words_dir /path/to/table/words --mode grits

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.

Owner
Microsoft
Open source projects and samples from Microsoft
Microsoft
Minimal fastai code needed for working with pytorch

fastai_minima A mimal version of fastai with the barebones needed to work with Pytorch #all_slow Install pip install fastai_minima How to use This lib

Zachary Mueller 14 Oct 21, 2022
This is the official implement of paper "ActionCLIP: A New Paradigm for Action Recognition"

This is an official pytorch implementation of ActionCLIP: A New Paradigm for Video Action Recognition [arXiv] Overview Content Prerequisites Data Prep

268 Jan 09, 2023
Narya API allows you track soccer player from camera inputs, and evaluate them with an Expected Discounted Goal (EDG) Agent

Narya The Narya API allows you track soccer player from camera inputs, and evaluate them with an Expected Discounted Goal (EDG) Agent. This repository

Paul Garnier 121 Dec 30, 2022
Knowledge Distillation Toolbox for Semantic Segmentation

SegDistill: Toolbox for Knowledge Distillation on Semantic Segmentation Networks This repo contains the supported code and configuration files for Seg

9 Dec 12, 2022
The official pytorch implementation of our paper "Is Space-Time Attention All You Need for Video Understanding?"

TimeSformer This is an official pytorch implementation of Is Space-Time Attention All You Need for Video Understanding?. In this repository, we provid

Facebook Research 1k Dec 31, 2022
Trajectory Variational Autoencder baseline for Multi-Agent Behavior challenge 2022

MABe_2022_TVAE: a Trajectory Variational Autoencoder baseline for the 2022 Multi-Agent Behavior challenge This repository contains jupyter notebooks t

Andrew Ulmer 15 Nov 08, 2022
Expert Finding in Legal Community Question Answering

Expert Finding in Legal Community Question Answering Arian Askari, Suzan Verberne, and Gabriella Pasi. Expert Finding in Legal Community Question Answ

Arian Askari 3 Oct 31, 2022
Light-SERNet: A lightweight fully convolutional neural network for speech emotion recognition

Light-SERNet This is the Tensorflow 2.x implementation of our paper "Light-SERNet: A lightweight fully convolutional neural network for speech emotion

Arya Aftab 29 Nov 12, 2022
GeneDisco is a benchmark suite for evaluating active learning algorithms for experimental design in drug discovery.

GeneDisco is a benchmark suite for evaluating active learning algorithms for experimental design in drug discovery.

22 Dec 12, 2022
Implement slightly different caffe-segnet in tensorflow

Tensorflow-SegNet Implement slightly different (see below for detail) SegNet in tensorflow, successfully trained segnet-basic in CamVid dataset. Due t

Tseng Kuan Lun 364 Oct 27, 2022
A curated list of references for MLOps

A curated list of references for MLOps

Larysa Visengeriyeva 9.3k Jan 07, 2023
This is the official pytorch implementation for the paper: Instance Similarity Learning for Unsupervised Feature Representation.

ISL This is the official pytorch implementation for the paper: Instance Similarity Learning for Unsupervised Feature Representation, which is accepted

19 May 04, 2022
A PyTorch implementation for Unsupervised Domain Adaptation by Backpropagation(DANN), support Office-31 and Office-Home dataset

DANN A PyTorch implementation for Unsupervised Domain Adaptation by Backpropagation Prerequisites Linux or OSX NVIDIA GPU + CUDA (may CuDNN) and corre

8 Apr 16, 2022
CondNet: Conditional Classifier for Scene Segmentation

CondNet: Conditional Classifier for Scene Segmentation Introduction The fully convolutional network (FCN) has achieved tremendous success in dense vis

ycszen 31 Jul 22, 2022
Magic tool for managing internet connection in local network by @zalexdev

Megacut ✂️ A new powerful Python3 tool for managing internet on a local network Installation git clone https://github.com/stryker-project/megacut cd m

Stryker 12 Dec 15, 2022
Code for the paper Relation Prediction as an Auxiliary Training Objective for Improving Multi-Relational Graph Representations (AKBC 2021).

Relation Prediction as an Auxiliary Training Objective for Knowledge Base Completion This repo provides the code for the paper Relation Prediction as

Facebook Research 85 Jan 02, 2023
Personal project about genus-0 meshes, spherical harmonics and a cow

How to transform a cow into spherical harmonics ? Spot the cow, from Keenan Crane's blog Context In the field of Deep Learning, training on images or

3 Aug 22, 2022
Implement A3C for Mujoco gym envs

pytorch-a3c-mujoco Disclaimer: my implementation right now is unstable (you ca refer to the learning curve below), I'm not sure if it's my problems. A

Andrew 70 Dec 12, 2022
Code accompanying "Dynamic Neural Relational Inference" from CVPR 2020

Code accompanying "Dynamic Neural Relational Inference" This codebase accompanies the paper "Dynamic Neural Relational Inference" from CVPR 2020. This

Colin Graber 48 Dec 23, 2022
A hifiasm fork for metagenome assembly using Hifi reads.

hifiasm_meta - de novo metagenome assembler, based on hifiasm, a haplotype-resolved de novo assembler for PacBio Hifi reads.

44 Jul 10, 2022