Bulk2Space is a spatial deconvolution method based on deep learning frameworks

Overview

Bulk2Space

Spatially resolved single-cell deconvolution of bulk transcriptomes using Bulk2Space

python 3.8

Bulk2Space is a spatial deconvolution method based on deep learning frameworks, which converts bulk transcriptomes into spatially resolved single-cell expression profiles.

Installation

Bulk2Space requires Python 3.8 or later. If you have Python 3.6 or Python 3.7 installed, consider installing Anaconda and creating a new environment:

conda create -n bulk2space python=3.8.5
conda activate bulk2space

cd bulk2space
pip install -r requirements.txt 
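
Before installing the requirements, it can help to confirm that the active interpreter meets the version requirement (taken here as Python 3.8 or newer). The following is a minimal sketch; the helper name `meets_requirement` is hypothetical and not part of Bulk2Space.

```python
import sys

# Minimum version assumed from this README; the helper is illustrative only
# and is not part of the Bulk2Space code base.
MIN_VERSION = (3, 8)


def meets_requirement(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:3])
    status = "OK" if meets_requirement() else "too old for Bulk2Space"
    print(f"Python {current}: {status}")
```

Running this inside the activated `bulk2space` environment should report the interpreter created by the `conda create` command above.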

Usage

Run the demo data

If you use spatial barcoding-based data (such as 10x Genomics or ST) as the spatial reference, run the following command:

python bulk2space.py --project_name test1 --data_path example_data/demo1 --input_sc_meta_path demo1_sc_meta.csv --input_sc_data_path demo1_sc_data.csv --input_bulk_path demo1_bulk.csv --input_st_data_path demo1_st_data.csv --input_st_meta_path demo1_st_meta.csv --BetaVAE_H --epoch 10 --spot_data True

Otherwise, if you use image-based in situ hybridization data (such as MERFISH, SeqFISH, or STARmap) as the spatial reference, run the following command:

python bulk2space.py --project_name test2 --data_path example_data/demo2 --input_sc_meta_path demo2_sc_meta.csv --input_sc_data_path demo2_sc_data.csv --input_bulk_path demo2_bulk.csv --input_st_data_path demo2_st_data.csv --input_st_meta_path demo2_st_meta.csv --BetaVAE_H --epoch 10 --spot_data False
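
The two demo commands differ only in the project name, the data directory, the file prefix, and the value of `--spot_data`. As a rough sketch of that pattern, the snippet below assembles both command lines from the flags shown above. The helper `build_command` is hypothetical and not part of Bulk2Space; it assumes the input files follow the `<prefix>_*.csv` naming used by the demo data.

```python
import shlex


def build_command(project_name, data_path, prefix, spot_data, epoch=10):
    """Assemble the bulk2space.py argument list used by the demo commands.

    Hypothetical helper: it only mirrors the flags shown in this README.
    """
    return [
        "python", "bulk2space.py",
        "--project_name", project_name,
        "--data_path", data_path,
        "--input_sc_meta_path", f"{prefix}_sc_meta.csv",
        "--input_sc_data_path", f"{prefix}_sc_data.csv",
        "--input_bulk_path", f"{prefix}_bulk.csv",
        "--input_st_data_path", f"{prefix}_st_data.csv",
        "--input_st_meta_path", f"{prefix}_st_meta.csv",
        "--BetaVAE_H",
        "--epoch", str(epoch),
        "--spot_data", str(bool(spot_data)),
    ]


# Spot-based reference (demo1) versus image-based reference (demo2).
demo1 = build_command("test1", "example_data/demo1", "demo1", spot_data=True)
demo2 = build_command("test2", "example_data/demo2", "demo2", spot_data=False)

print(shlex.join(demo1))
print(shlex.join(demo2))
```

To launch a run from the repository root, either list could be passed to `subprocess.run(..., check=True)`.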

Run your own data

When using your own data, make sure that:

  • the bulk.csv file contains one column of gene expression values, with one row per gene

                Sample
    Gene1       5.22
    Gene2       3.67
    ...         ...
    GeneN       15.76
  • the sc_meta.csv file contains two columns, cell name and cell type. The column names must be exactly Cell and Cell_type

                Cell        Cell_type
    Cell_1      Cell_1      T cell
    Cell_2      Cell_2      B cell
    ...         ...         ...
    Cell_n      Cell_n      Monocyte
  • the st_meta.csv file contains at least two columns of spatial coordinates. The column names must be exactly xcoord and ycoord

                         xcoord      ycoord
    Cell_1 / Spot_1      1.2         5.2
    Cell_2 / Spot_2      5.4         4.3
    ...                  ...         ...
    Cell_n / Spot_n      11.3        6.3
  • the sc_data.csv and st_data.csv files are gene expression matrices
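
The column-name requirements above can be checked before a run. The following is a minimal sketch using only the standard library; the helper `missing_columns` is hypothetical and not part of Bulk2Space. It assumes comma-separated files whose first row is the header, and the in-memory sample values are made up for illustration.

```python
import csv
import io

# Required column names as stated in the list above.
REQUIRED = {
    "sc_meta": ("Cell", "Cell_type"),
    "st_meta": ("xcoord", "ycoord"),
}


def missing_columns(handle, required):
    """Return the required column names absent from the CSV header row."""
    header = next(csv.reader(handle), [])
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


# Tiny in-memory stand-ins for real files (illustrative values only).
good_sc_meta = io.StringIO(",Cell,Cell_type\nCell_1,Cell_1,T cell\n")
bad_st_meta = io.StringIO(",x,y\nSpot_1,1.2,5.2\n")

print(missing_columns(good_sc_meta, REQUIRED["sc_meta"]))  # []
print(missing_columns(bad_st_meta, REQUIRED["st_meta"]))   # ['xcoord', 'ycoord']
```

For files on disk, the same function can be called on a handle opened with `open("sc_meta.csv", newline="")`.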

Then you will get your results in the output_data folder.

For more details, see the user guide in the documentation.

About

The Bulk2Space manuscript is under major revision. If you have any questions, please contact Jie Liao at [email protected], Jingyang Qian at [email protected], or Yin Fang at [email protected].

Comments
  • Data availability

    Hey team, thanks for coming up with this useful tool. I'm looking to follow your tutorial on hypothalamus deconvolution, and it seems the lcm.gz data file on your GitHub only contains a single file, without all the processed count matrices and the cell metadata table. Is that supposed to be the case? If so, I wonder how I should process this single file to generate the input data I need. Thanks for any heads up!

    opened by loganminhdang 6
  • Cannot locate the bulk2space.py script and directory after installation

    Hi, I'm writing to seek your assistance on an issue I'm having. After installing the conda environment, I cannot locate the bulk2space directory, which should contain the bulk2space Python script to run the algorithm. The installation also seems incomplete: after I manually retrieved the Python script from your GitHub page, I received the following error message:

        Traceback (most recent call last):
          File "bulk2space.py", line 2, in <module>
            from utils.tool import *
        ModuleNotFoundError: No module named 'utils'

    I would appreciate any guidance. Thanks!

    opened by loganminhdang 5
  • Preprocessed PDAC data

    Hello,

    I am trying to understand how to use bulk2space by going through the tutorials. I am currently going through the first tutorial with the PDAC datasets. I would like to know how you generated the preprocessed files "st_data" and "st_meta".

    I went to the original data from Moncada et al. (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE111672) but I don't know which files you used from there to make the above preprocessed files. Could you clarify that and explain a bit more in detail how you generated "st_data" and "st_meta"? This will be helpful to understand how to process other reference datasets.

    opened by AlexUOM 4
  • the question of "quick start" section

    Dear professors, we are very sorry to bother you. We recently downloaded bulk2space and ran it on the demo1 test data, but we don't know why there is no result output, or whether the data were written normally. After the run, the Bulk2space-1.0.0-Py3.8egg appeared empty. Some information is as follows. Could you help check it when your schedule allows, or point us to any other step-by-step guidance?

    opened by coconutll 2
  • Only obtain three cell types such as tumor cell, macrophages and neutrophils from bulk data?

    Hi, thanks for coming up with this useful tool. I have bulk RNAseq data and scRNA-seq data from the same patient, both generated by our lab. I want to convert bulk transcriptomes into spatially resolved single-cell expression profiles. Here are my questions:

    1. Why do I only obtain three cell types (tumor cells, macrophages, and neutrophils) from the bulk data, when my scRNA reference contains many other cell types such as fibroblasts, T cells, and B cells?
    2. How should I normalize my bulk data?

    Thanks, Qi.

    opened by zhangqi234 2
  • convert bulk transcriptomes into spatially resolved single-cell expression profile

    Hi, I'm new to bulk2space, and I only have bulk RNAseq data from mouse brain which was made by our lab. I want to convert bulk transcriptomes into spatially resolved single-cell expression profile. I know how to convert bulk RNAseq data into single cell data. Here are my questions:

    1. How to get the spatial information from my bulk RNAseq data, do I have to do some experiments about spatial information by Laser capture microdissection (LCM) technology?
    2. Since my bulk RNAseq data are from brain tissue, the tissue contains many layers of cells. How do you distinguish between different layers of cells? Or do I have to do bulk RNAseq on single layers?

    Thanks, Echo.

    opened by Echoloria 2
  • Cannot import CascadeForestClassifier from deepforest

    I am running the bulk2space.py script via Python 3.8.5. The deepforest package is installed and imports successfully, but I am still receiving the following error message:

    ImportError: cannot import name 'CascadeForestClassifier' from 'deepforest'

    I would appreciate any help you could offer.

    opened by sarah-chapin 2
  • Effect of Irrelevant Bulk RNA-Seq Sample and Selection of Optimal Projects for Test Data

    Hi,

    Thank you very much for putting together this code.

    I would like to better understand when Bulk2Space might help versus when there are limits to applicability to Bulk2Space, following a journal club presentation where I learned more about the paper and method.

    I apologize that I am not sure how best to precisely ask my question, but I have tried to use a few examples to try and give a sense of what I am asking about.

    Example 1 (Exact Code for Concrete Test):

    In the spirit of a GitHub “issue,” I tried to start with concrete examples for discussion based upon issue #8 .

    I have attached a summary of that analysis (PDAC_Test.pdf), and I have also attached any input files not already provided on this repository.

    However, when I changed the bulk RNA-Seq gene symbols in order to use the same gene symbols for both the PDAC example and the demo1 example, I lost the Ductal cells in the PDAC example, even though it otherwise still used only files derived from the same samples used for the PDAC example. I also have some more detailed notes in the uploaded PDF.

    Nevertheless, if that might possibly help the discussion, I have provided those.

    If there are any other relatively small files that it would help to upload to GitHub, then I would also be happy to add those. For example, I also ran the analysis with epoch_num=1000 instead of epoch_num=3500. I am currently not providing those results, but my impression is that they look qualitatively similar in terms of cancer cell and ductal cell assignments (for all of the provided PDAC files).

    Example 2 (Theoretical Question):

    Is it possible to run bulk2spatial as described below?

    1) Use bulk RNA-Seq + scRNA-Seq + spatial data that all come from Patient A.

    2) Export model from Patient A.

    3) Only provide bulk RNA-Seq data from patient B, and test how predictions from model defined on Patient A compare to scRNA-Seq and spatial data generated for Patient B.

    • Additionally, if I understand correctly, then I think an image of the tissue for Patient B cannot be provided. If so, I think the shape of the tissue section for Patient B can’t be known, and I would guess the spatial coordinates from Bulk2Space might not be directly applicable to interpret Patient B. However, if I might be misunderstanding anything, then please let me know.

    Example 3 (Summary Questions):

    Am I correctly understanding that consecutive slides are often used in the paper? For example, the 2 slices in Figure S17f already have different shapes, and it looks like a projection of estimations on the histology image for slide 2 was not (or could not be?) provided.

    Data from different patients would be even more different. So, is it reasonable and/or correct to say that there is a preference to use all 3 data types generated from the same experiment? Even if the exact slice is not the same, the true composition of the multiple data types can hopefully be as close as possible?

    For example, I am not sure if the difference is sufficiently extreme, but let’s say Patient A has histology like the “Inflammation” sample in Figure 6 and Patient B has histology like the “Cancer” sample in Figure 6. If you didn’t have a spatial transcriptomics (ST) dataset for Patient B, then I think use of the ST data from Patient A might not be of much benefit to Patient B. Do you think that is a fair conclusion?

    Similarly, if your training sample had 90% tumor, then I would expect limitations in looking at the projection from a spatial transcriptomics project where the tissues had a very different percent tumor, such as closer to 20% tumor. I would also expect there often could be a challenge in even knowing the general shape of an independent/unrelated tumor sample, and I believe that you should not be able to know the spatial information for the tumor cells within an independent tissue without a more direct measurement.

    I am not sure if the points above might also possibly relate to the shift in the frequency of cancer cells per spot with the reduced/matching gene symbols in the uploaded PDF for Example 1.

    However, if I am then understanding correctly, then might that be at least somewhat contradictory to what I believe is a recommendation to use public data in issue #7? If I might be misunderstanding anything, then please let me know.

    Thank you very much for your help!

    Sincerely, Charles

    Code.zip demo1_bulk-FALSE_PDAC_LABEL.csv demo1_bulk-FALSE_PDAC_LABEL-MATCHING_SUBSET.csv pdac_bulk-MATCHING_SUBSET.csv

    SC Cell_Type_Counts.pdf SC Cell_Type_Correlation.pdf ST Spot_Deconvolution.pdf ST Cancer_Cells_per_Spot.pdf

    PDAC_Test.pdf

    opened by cwarden45 0
  • Confused about the train/test steps

    Dear Professors,

    Thanks for coming up with this great tool. However, I'm confused about how to use it based on the tutorials. In the PDAC deconvolution tutorial, only the train_vae function is used, whereas the demo1 tutorial, for example, additionally uses the load_vae_and_generate function with the .pth VAE model produced by train_vae.

    So here is my question, assuming I only focus on the first step, transforming bulk RNA to single-cell RNA (i.e., without considering the further scRNA to spatial RNA step):

    If I have e.g., two bulkRNAseq from 2-month-old and 7-month-old mice lung cancer tissue, say bulkA and bulkB. I also have one single-cell RNA reference, say scRNAref. When I deconvolute bulkA using scRNAref to a new, bulk2space-generated scRNA data (name it "generated-scRNA from bulkA"), I will get a .pth vae model (name it "A.pth"). Next, when I'd like to deconvolute bulkB, which step should I use? Should I 1) use "load_vae_and_generate" function that use the previous A.pth model, or 2) use "train_vae" function that will generate a new B.pth model?

    I believe this is crucial because it directly determines how we use this tool. In CIBERSORT, we provide only two variables, the bulkRNAseq and the reference immune cell expression profile. The reference does not change most of the time, so we just feed CIBERSORT many bulkRNAseq datasets and it returns many generated immune cell expression dataframes. Simple and easy. But in Bulk2space, we get a new .pth model every time if we follow step 2, and to be honest, I don't know what this .pth model is for if we don't follow step 1 and use it to load and generate a new scRNA dataset.

    Besides issues above, if we use step 1), there'll also be problems. What if bulkA and bulkB are from different status of tissues as the example above? I see that in the article, you mentioned that "the state of each cell type still fluctuates within a relatively stable high-dimensional space". But if bulkA was from a pre-cancerous tissue, and bulkB was from a cancerous tissue, would bulk2space still work fine? This is important because if we'd like to deconvolute bulkRNAseq from longitudinal dataset, for example, a series of bulkRNAseq data from 10 timepoints along cancer progression that contains normal, pre-cancerous, turning stage and finally cancerous tissue, or a series of bulkRNAseq data from different development stages of liver, what is the correct way of using bulk2space if I want single-cell RNA dataset from bulkRNA? Would bulk2space still work under this scenario?

    Also, does bulk2space require that the scRNA ref and the bulkRNA come from tissue of similar status? For example, can bulk2space deconvolute bulkRNA derived from cancerous lung using reference scRNA derived from normal lung?

    Actually, I've tried step 1 (i.e., the same model) on my longitudinal dataset, but the results seemed nearly identical in the distribution of cell types that bulk2space returned (there should be some difference, at least in immune cell types, since I'm deconvoluting bulkRNA from normal and cancer tissues using the same scRNA ref). Another key issue is that I don't know whether the generated sc_cell_type and sc_data dataframes can be treated as a standard Seurat object for the standard analysis pipeline (filtering nfeature and nCount, scaling, centering, PCA, UMAP, or newly assigning cell types with the FindMarkers function, etc.; I've actually tried this, but PCA, tSNE, and UMAP can't separate the cell types well), and whether different scRNA datasets generated by bulk2space can be integrated into a single Seurat object as normal single-cell data can.

    Thank you so much. It would be a great help if the experts in your team who developed this nice tool could answer the questions above.

    opened by Bennylikescoding 1
  • β-VAE algorithm in the paper

    Hello, author. In Figure 1b of your paper, I don't understand why β-VAE can analyze the proportion of cells of each cell type. I have studied this algorithm carefully, and its input and output should correspond, so I don't understand why the input cell type is changed into the output of a single cell. Could you please explain, or tell me what the input data of this step is?

    opened by wxpbioinfo 0
  • Question: Scalability

    Good day,

    I am eager to test this excellent tool on our data. I have seen in the tutorial and demo data that the vignette uses only one bulk RNA sample as well as an ST experiment.

    Is it possible to scale up and process several bulk RNA samples and ST experiments in one go? And for the inferred single-cell data derived from the bulk, can those be integrated across multiple biological replicates, as if they were true scRNA-seq data?

    Thanks in advance!

    opened by ccruizm 2
  • model.train_df_and_spatial_deconvolution error

    Hi, thanks for coming up with this useful tool. When I ran the model.train_df_and_spatial_deconvolution function to decompose ST data into spatially resolved single-cell transcriptomics data, I got the following error: "pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False". I don't know what caused this error.

    opened by zhangqi234 7
Releases(v1.0.0)
Owner
Dr. FAN, Xiaohui
single-cell omics; spatial transcriptomics; TCM network biology