SciFive: a text-to-text transformer model for biomedical literature

Last update: Dec 24, 2022

Overview

SciFive

SciFive provides a text-to-text framework for biomedical language and natural language in NLP. Built on the T5 framework and described in the paper SciFive: a text-to-text transformer model for biomedical literature, SciFive achieves state-of-the-art and competitive results on multiple biomedical natural language tasks.

Google Cloud Storage

Our base Google Cloud Storage URI is gs://scifive

As described in our paper, we release 6 versions of SciFive, each benchmarked to achieve state-of-the-art results on different biomedical tasks. They are all available in our Google Cloud bucket, and we are also working on releasing the models on HuggingFace.

Instructions for accessing Cloud Storage from the command line with the Python-based tool gsutil are described here

gsutil URIs for the 6 SciFive models:

SciFive Pubmed+PMC Base: gs://scifive/models/pubmed_pmc/base
SciFive Pubmed+PMC Large: gs://scifive/models/pubmed_pmc/large
SciFive Pubmed Base: gs://scifive/models/pubmed/base
SciFive Pubmed Large: gs://scifive/models/pubmed/large
SciFive PMC Base: gs://scifive/models/pmc/base
SciFive PMC Large: gs://scifive/models/pmc/large

gsutil URIs for the pretraining data:

Pubmed: gs://scifive/pretrain/pubmed
PMC: gs://scifive/pretrain/pmc
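
The URIs listed above follow a regular pattern, so the download commands can be generated rather than typed by hand. The following is a minimal sketch, assuming gsutil is installed and on the PATH. The helper names `model_uri` and `download_command` and the `./checkpoints` destination are hypothetical and are not part of the SciFive code. The script only prints the commands and does not run them.

```python
import shlex
from typing import List

# Bucket layout taken from the URIs listed above.
BUCKET = "gs://scifive"
CORPORA = ("pubmed_pmc", "pubmed", "pmc")
SIZES = ("base", "large")


def model_uri(corpus: str, size: str) -> str:
    """Return the Cloud Storage URI of one SciFive checkpoint (hypothetical helper)."""
    if corpus not in CORPORA:
        raise ValueError(f"unknown corpus: {corpus!r}")
    if size not in SIZES:
        raise ValueError(f"unknown size: {size!r}")
    return f"{BUCKET}/models/{corpus}/{size}"


def download_command(uri: str, dest: str) -> List[str]:
    """Build a recursive, parallel gsutil copy command as an argument list."""
    return ["gsutil", "-m", "cp", "-r", uri, dest]


if __name__ == "__main__":
    # Print one command per checkpoint; "./checkpoints" is an assumed destination.
    for corpus in CORPORA:
        for size in SIZES:
            cmd = download_command(model_uri(corpus, size), "./checkpoints")
            print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` to perform the download.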

Example

Below is an example of how to use SciFive from HuggingFace to generate MedNLI outputs. We also publish a SciFive checkpoint fine-tuned on MedNLI so that the experiments can be reproduced.

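import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the SciFive checkpoint fine-tuned on MedNLI
tokenizer = AutoTokenizer.from_pretrained("razent/SciFive-large-Pubmed_PMC-MedNLI")
model = AutoModelForSeq2SeqLM.from_pretrained("razent/SciFive-large-Pubmed_PMC-MedNLI")

# Use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

sent_1 = "In the ED, initial VS revealed T 98.9, HR 73, BP 121/90, RR 15, O2 sat 98% on RA."
sent_2 = "The patient is hemodynamically stable"
# The task prefix and field names form the text-to-text input
text = f"mednli: sentence1: {sent_1} sentence2: {sent_2}"

# Tokenize, padding (and truncating if needed) to a fixed length of 256 tokens
encoding = tokenizer(text, padding="max_length", max_length=256, truncation=True, return_tensors="pt")
input_ids = encoding["input_ids"].to(device)
attention_mask = encoding["attention_mask"].to(device)

# The label is short, so a few generated tokens are enough
outputs = model.generate(
    input_ids=input_ids, attention_mask=attention_mask,
    max_length=8,
    early_stopping=True
)

# Decode the generated token ids back into the label text
for output in outputs:
    line = tokenizer.decode(output, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print(line)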
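
When running many sentence pairs, it can help to isolate the input formatting and the post-processing of the decoded text. The following is a minimal sketch: `build_mednli_input` reproduces the prompt format used in the example above, while `normalize_label` and the `MEDNLI_LABELS` tuple are hypothetical helpers. It assumes the checkpoint emits one of the three standard MedNLI label names as plain text, which should be verified against actual model output.

```python
from typing import Tuple

# Standard MedNLI label names; that the checkpoint emits exactly these
# strings is an assumption, not something confirmed here.
MEDNLI_LABELS: Tuple[str, ...] = ("entailment", "contradiction", "neutral")


def build_mednli_input(premise: str, hypothesis: str) -> str:
    """Format a sentence pair with the same prefix and fields as the example above."""
    return f"mednli: sentence1: {premise} sentence2: {hypothesis}"


def normalize_label(decoded: str) -> str:
    """Map decoded model text to a known label, or 'unknown' if it does not match."""
    label = decoded.strip().lower()
    return label if label in MEDNLI_LABELS else "unknown"


if __name__ == "__main__":
    pairs = [
        ("BP 121/90, HR 73, O2 sat 98% on RA.", "The patient is hemodynamically stable"),
    ]
    for premise, hypothesis in pairs:
        print(build_mednli_input(premise, hypothesis))
```

The strings returned by `build_mednli_input` can be passed to the tokenizer as a list to run a batch, and `normalize_label` can be applied to each decoded output line.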

HuggingFace

SciFive Pubmed+PMC: Base | Large
SciFive Pubmed: Base | Large
SciFive PMC: Base | Large

Datasets

All of the fine-tuning datasets, already pre-processed into text-to-text format, are also available at this link

📊 Expected Results

Citations

If you use the SciFive models or our code in a publication, please cite:

@misc{phan2021scifive,
      title={SciFive: a text-to-text transformer model for biomedical literature}, 
      author={Long N. Phan and James T. Anibal and Hieu Tran and Shaurya Chanana and Erol Bahadroglu and Alec Peltekian and Grégoire Altan-Bonnet},
      year={2021},
      eprint={2106.03598},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}


Owner

Long Phan
