Analysis of Antarctica sequencing samples contaminated with SARS-CoV-2

Overview

Analysis of SARS-CoV-2 reads in sequencing data from 2018-2019 Antarctica samples in PRJNA692319

The samples analyzed here are described in this pre-print by Istvan Csabai and co-workers, which reports SARS-CoV-2 reads in samples from Antarctica that were sequenced in China. I was originally alerted to the pre-print by Carl Zimmer on Dec-23-2021. Istvan Csabai and co-workers subsequently posted a second pre-print that also analyzes the host reads.

Repeating key parts of the analysis

The code in this repo independently repeats some of the analyses.

To run the analysis, build the conda environment in environment.yml and then run the pipeline defined in Snakefile. To do this on the Hutch cluster, use run.bash:

sbatch -c 16 run.bash
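
For local runs outside the cluster, the equivalent steps are sketched below; this is only a sketch, and the environment name antarctica_analysis is hypothetical (use the name actually defined in environment.yml):

# build and activate the conda environment defined in environment.yml
conda env create -f environment.yml
conda activate antarctica_analysis  # hypothetical name; see environment.yml

# run the Snakemake pipeline with 16 cores (run.bash presumably wraps a similar call)
snakemake -j 16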

The results are placed in the ./results/ subdirectory. Most of the results files are not tracked due to file-size limitations, but the following key files are tracked:

  • results/alignment_counts.csv gives the number of reads aligning to SARS-CoV-2 for each sample. This confirms that three accessions (SRR13441704, SRR13441705, and SRR13441708) have most of the SARS-CoV-2 reads, although a few other samples also have some.
  • results/variant_analysis.csv reports all variants found in the samples relative to Wuhan-Hu-1.
  • results/variant_analysis_to_outgroup.csv reports the variants found in the samples that represent mutations from Wuhan-Hu-1 towards the two closest bat coronavirus relatives, RaTG13 and BANAL-20-52. Note that some of the reads contain three key mutations relative to Wuhan-Hu-1 (C8782T, C18060T, and T28144C) that move the sequence closer to the bat coronavirus relatives; a sketch of one way to examine these sites appears after this list. These mutations define one of the two plausible progenitors of all currently known human SARS-CoV-2 sequences (see Kumar et al. (2021) and Bloom (2021)).
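
To give a concrete sense of the kind of analysis behind these files, here is a minimal sketch of aligning one accession to Wuhan-Hu-1, counting aligned reads, and inspecting the pileup at the three key sites. It assumes sra-tools, minimap2, and samtools are installed, that the reads are paired-end, and that wuhan-hu-1.fa has been downloaded with the FASTA header NC_045512.2; the actual pipeline in Snakefile may differ in its details.

# download the reads for one accession (sra-tools)
prefetch SRR13441704
fasterq-dump SRR13441704   # yields SRR13441704_1.fastq and SRR13441704_2.fastq

# align to Wuhan-Hu-1, then sort and index the alignments
minimap2 -ax sr wuhan-hu-1.fa SRR13441704_1.fastq SRR13441704_2.fastq | samtools sort -o SRR13441704.bam
samtools index SRR13441704.bam

# count primary aligned reads (flag 0x904 excludes unmapped, secondary, and supplementary)
samtools view -c -F 0x904 SRR13441704.bam

# show the pileup at the three key sites (8782, 18060, 28144)
for pos in 8782 18060 28144; do
    samtools mpileup -f wuhan-hu-1.fa -r NC_045512.2:${pos}-${pos} SRR13441704.bam
done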

Archived links after initially hearing about pre-print

I archived the following links on Dec-23-2021 after hearing about the pre-print from Carl Zimmer:

Deletion of some samples from SRA

On Jan-3-2022, I received an e-mail from one of the pre-print authors, Istvan Csabai, saying that three of the samples (apparently the ones with the most SARS-CoV-2 reads) had been removed from the SRA. He also noted that bioRxiv had declined to post their pre-print without explanation; the file he attached indicates the submission ID was BIORXIV-2021-472446v1. I confirmed that three of the accessions had indeed been removed from the SRA, as shown in the following archived links:

I also e-mailed Richard Sever at bioRxiv to ask why the pre-print was rejected, explaining that I had repeated and validated its key findings. Richard Sever said he could not give details about the pre-print review process, but that in the future the authors could appeal if they thought the rejection was unfounded.

Details from Istvan Csabai

On Jan-4-2022, I chatted with Istvan Csabai. He had contacted the authors of the original Antarctica sequencing study (PRJNA692319), and shared their reply with me. Those authors had prepped the samples in early 2019 and submitted them to Sangon Biotech for sequencing in December 2019, getting the results back in early January 2020.

Second pre-print from Csabai and restoration of deleted files

Istvan Csabai then worked on a second pre-print that analyzed the host reads and reported various findings, including co-contamination with African green monkey (Vero?) and human DNA. He sent me drafts on Jan-16-2022 and Jan-24-2022; I provided comments on both drafts and agreed to be listed in the Acknowledgments.
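
The second pre-print's methods are its own, but as a generic illustration, host co-contamination of this sort can be screened for by taxonomic read classification, for example with kraken2 (the database path $KRAKEN_DB and the file names below are placeholders):

# classify paired reads against a kraken2 database that includes human and
# other vertebrate genomes; the per-taxon report will show reads assigned
# to Homo sapiens, Chlorocebus, etc.
kraken2 --db $KRAKEN_DB --paired --report SRR13441704.kreport --output SRR13441704.kraken SRR13441704_1.fastq SRR13441704_2.fastq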

On Feb-3-2022, Istvan Csabai told me that the second pre-print had also been rejected by bioRxiv. Because I had previously contacted Richard Sever when I heard the first pre-print was rejected, I suggested that Istvan CC me on an e-mail to Richard Sever appealing the rejection, which he did. Unfortunately, Richard Sever declined the appeal, so Istvan instead posted the pre-print on Research Square.

At that point on Feb-3-2022, I also re-checked the three deleted accessions (SRR13441704, SRR13441705, and SRR13441708). To my surprise, all three were publicly available again. Here are archived links demonstrating that they were again available:

I confirmed that the restored accessions were identical to the deleted ones.
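
As an illustration only (the actual comparison may have been done differently), one simple check is to dump the restored FASTQ files and compare order-independent checksums against a copy downloaded before the deletion; the directory names below are hypothetical:

# dump reads from the restored accession into a separate directory
fasterq-dump SRR13441704 -O restored/

# sorting the lines first makes the checksums insensitive to read order
sort restored/SRR13441704_1.fastq | md5sum
sort pre_deletion/SRR13441704_1.fastq | md5sum   # hypothetical copy saved before the deletion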

Inquiry to authors of PRJNA692319

On Feb-8-2022, I e-mailed the Chinese authors of the paper to ask about the sample deletion and restoration. They e-mailed back almost immediately and confirmed what they had told Istvan: they had extracted the DNA from their samples in December 2019 and sequenced them with Sangon Biotech (Shanghai). They suspect that contamination of the samples happened at Sangon Biotech. They deleted the three most contaminated samples from the Sequence Read Archive, and do not know why the samples were then "un-deleted."
