Last updated: 2020-02-05

Checks: 7 0

Knit directory: BUSpaRse_notebooks/

This reproducible R Markdown analysis was created with workflowr (version 1.6.0). The Checks tab describes the reproducibility checks that were applied when the results were created. The Past versions tab lists the development history.

R Markdown file: up-to-date

Great! Since the R Markdown file has been committed to the Git repository, you know the exact version of the code that produced these results.

Environment: empty

Great job! The global environment was empty. Objects defined in the global environment can affect the analysis in your R Markdown file in unknown ways. For reproduciblity it’s best to always run the code in an empty environment.

Seed: set.seed(20181214)

The command set.seed(20181214) was run prior to running the code in the R Markdown file. Setting a seed ensures that any results that rely on randomness, e.g. subsampling or permutations, are reproducible.

Session information: recorded

Great job! Recording the operating system, R version, and package versions is critical for reproducibility.

Cache: none

Nice! There were no cached chunks for this analysis, so you can be confident that you successfully produced the results during this run.

File paths: relative

Great job! Using relative paths to the files within your workflowr project makes it easier to run your code on other machines.

Repository version: 47f0c7c

Great! You are using Git for version control. Tracking code development and connecting the code version to the results is critical for reproducibility. The version displayed above was the version of the Git repository at the time these results were generated.

Note that you need to be careful to ensure that all relevant files for the analysis have been committed to Git prior to generating the results (you can use wflow_publish or wflow_git_commit). workflowr only checks the R Markdown file, but you know if there are other scripts or data files that it depends on. Below is the status of the Git repository when the results were generated:


Ignored files:
    Ignored:    .Rhistory
    Ignored:    .Rproj.user/
    Ignored:    BUSpaRse_notebooks.Rproj
    Ignored:    data/fastqs/
    Ignored:    data/hgmm_100_fastqs.tar
    Ignored:    data/hgmm_1k_fastqs.tar
    Ignored:    data/hgmm_1k_v3_fastqs.tar
    Ignored:    data/hgmm_1k_v3_fastqs/
    Ignored:    data/hs_cdna.fa.gz
    Ignored:    data/mm_cdna.fa.gz
    Ignored:    data/mm_cdna97.fa.gz
    Ignored:    data/neuron_10k_v3_fastqs.tar
    Ignored:    data/neuron_10k_v3_fastqs/
    Ignored:    data/retina/
    Ignored:    data/whitelist_v2.txt
    Ignored:    data/whitelist_v3.txt
    Ignored:    output/hs_mm_tr_index.idx
    Ignored:    output/mm_cDNA_introns_97.idx
    Ignored:    output/mm_cDNA_introns_97_collapse.idx
    Ignored:    output/mm_tr_index.idx
    Ignored:    output/mm_tr_index97.idx
    Ignored:    output/out_hgmm1k/
    Ignored:    output/out_hgmm1k_v3/
    Ignored:    output/out_retina/

Untracked files:
    Untracked:  .ipynb_checkpoints/
    Untracked:  SRR7244429_1.fastq
    Untracked:  SRR7244429_2.fastq
    Untracked:  analysis/dropseq_retina.Rmd
    Untracked:  analysis/ee_ie_junctions.Rmd
    Untracked:  analysis/junction.Rmd
    Untracked:  analysis/junction2.Rmd
    Untracked:  human_mouse_transcript_to_gene.tsv
    Untracked:  output/ee_ie/
    Untracked:  output/neuron10k/
    Untracked:  output/neuron10k_collapse/
    Untracked:  output/neuron10k_junction/
    Untracked:  output/neuron10k_velocity/
    Untracked:  output/out_hgmm1k2/
    Untracked:  output/tr2g_hgmm.tsv
    Untracked:  output/tr2g_mm97.tsv
    Untracked:  problem5.ipynb
    Untracked:  test/

Note that any generated files, e.g. HTML, png, CSS, etc., are not included in this status report because it is ok for generated content to have uncommitted changes.

These are the previous versions of the R Markdown and HTML files. If you’ve configured a remote Git repository (see ?wflow_git_remote), click on the hyperlinks in the table below to view them.

File	Version	Author	Date	Message
Rmd	47f0c7c	Lambda Moses	2020-02-05	Added link to colab version
html	6634135	Lambda Moses	2019-06-23	Build site.
Rmd	013799b	Lambda Moses	2019-06-23	Added explanation of kallisto outputs
html	f65450b	Lambda Moses	2019-06-22	Build site.
Rmd	cf90f32	Lambda Moses	2019-06-22	Changed site name and slightly changed 10xv2 notebook
html	68bd8fb	Lambda Moses	2019-06-22	Build site.
Rmd	ce7a44d	Lambda Moses	2019-06-22	Changed title
html	f5e971d	Lambda Moses	2019-06-22	Build site.
html	9fa20ae	Lambda Moses	2019-06-22	Build site.
Rmd	dba27cc	Lambda Moses	2019-06-22	Adapted notebook to new version of bustools
Rmd	06e9869	Lambda Moses	2019-03-01	Adapted old notebooks to new version of BUSpaRse
html	06e9869	Lambda Moses	2019-03-01	Adapted old notebooks to new version of BUSpaRse
html	c465ce1	Lambda Moses	2019-02-14	Build site
Rmd	f22beac	Lambda Moses	2019-02-14	Added CellRanger whiltelist clarification
Rmd	000215e	Lambda Moses	2019-02-14	Clarified git cloning this repo and resolved swapped code chunks for output.sorted.bus
html	71513e5	Lambda Moses	2019-02-14	Good site with figures and reproducibility metrics
Rmd	95f2951	Lambda Moses	2019-02-14	Added head of bus file
html	7184968	Lambda Moses	2019-02-14	Build site.
Rmd	8cd81e0	Lambda Moses	2019-02-14	Finally saved the figures
html	8cd81e0	Lambda Moses	2019-02-14	Finally saved the figures
html	45c5d4a	Lambda Moses	2019-02-14	Build site.
Rmd	92aa915	Lambda Moses	2019-02-14	Named chunks with images
html	87d15f5	Lambda Moses	2019-02-14	Build site.
Rmd	8d9fe9a	Lambda Moses	2019-02-14	More detailed explanations for kallisto bus workshop
html	b7b21a0	Lambda Moses	2019-02-02	Build site.
Rmd	c7ecb56	Lambda Moses	2019-02-02	Added BUSpaRse installation link and note
html	a0becdf	Lambda Moses	2018-12-19	Build site.
html	93c1053	Lambda Moses	2018-12-19	Build site.
Rmd	9731fd6	Lambda Moses	2018-12-19	Corrected typo
html	85a5770	Lambda Moses	2018-12-14	Build site.
Rmd	ca1d6ce	Lambda Moses	2018-12-14	Clean up
html	6276894	Lambda Moses	2018-12-14	Build site.
html	db70187	Lambda Moses	2018-12-14	Build site.
Rmd	b67fc92	Lambda Moses	2018-12-14	Added 10xv3 notebook and further elaboration to 10xv2 notebook
html	fff442d	Lambda Moses	2018-12-14	Build site.
Rmd	695e202	Lambda Moses	2018-12-14	Changed name to BUSpaRse
html	09a56c1	Lambda Moses	2018-12-14	Build site.
Rmd	d288e19	Lambda Moses	2018-12-14	Cache rather than skip evaluation
html	30c9aa3	Lambda Moses	2018-12-14	Build site.
Rmd	fd3d5ae	Lambda Moses	2018-12-14	Don’t collapse output
html	2da3b35	Lambda Moses	2018-12-14	Build site.
Rmd	1fc3e91	Lambda Moses	2018-12-14	Publish 10xv2 notebook
Rmd	074d55f	Lambda Moses	2018-12-14	Initial commit, already with 10xv2 notebook

In this vignette, we process fastq data from scRNA-seq (10x v2 chemistry) with to make a sparse matrix that can be used in downstream analysis with command line tools kallisto and bustools, as described in the kallisto bus paper. Then we will start a standard downstream analysis with Seurat.

Setup

If you would like to rerun this notebook, you can git clone this repository or use the Google Colab version.

Install packages

This notebook demonstrates the use of command line tools kallisto and bustools. Please use kallisto >= 0.46, whose binary can be downloaded here. The binary of bustools can be found here.

After you download the binary, you should decompress the file (if it is tar.gz) with tar -xzvf file.tar.gz in the bash terminal, and add the directory containing the binary to PATH by export PATH=$PATH:/foo/bar, where /foo/bar is the directory of interest. Then you can directly invoke the binary on the command line as we will do in this notebook.

We will be using the R packages below. BUSpaRse is on Bioconductor >= 3.10. For Mac users, see the installation note for BUSpaRse. BUSpaRse will be used to generate the transcript to gene file for bustools and to read output of bustools into R. We will also use Seurat version 3 which is now on CRAN.

The package DropletUtils will be used to estimate the number of real cells as opposed to empty droplets. It’s on Bioconductor, and here is how it should be installed:

if (!require(BiocManager)) {
  install.packages("BiocManager")
}
BiocManager::install(c("DropletUtils", "BUSpaRse"))

The other R packages below are on CRAN, and can be installed with install.packages.

library(BUSpaRse)
library(Seurat)
library(tidyverse)
library(DropletUtils)
library(Matrix)
theme_set(theme_bw())

Download data

The data set we are using here is 1k 1:1 Mixture of Fresh Frozen Human (HEK293T) and Mouse (NIH3T3) Cells from the 10x website. First, we download the fastq files (6.34 GB).

if (!file.exists("./data/hgmm_1k_fastqs.tar")) {
  download.file("http://cf.10xgenomics.com/samples/cell-exp/2.1.0/hgmm_1k/hgmm_1k_fastqs.tar", destfile = "./data/hgmm_1k_fastqs.tar", quiet = TRUE)
}

Then untar this file

cd ./data
tar -xvf ./hgmm_1k_fastqs.tar

#> fastqs/
#> fastqs/hgmm_1k_S1_L001_I1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L001_R1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L001_R2_001.fastq.gz
#> fastqs/hgmm_1k_S1_L002_I1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L002_R1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L002_R2_001.fastq.gz
#> fastqs/hgmm_1k_S1_L003_I1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L003_R1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L003_R2_001.fastq.gz
#> fastqs/hgmm_1k_S1_L004_I1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L004_R1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L004_R2_001.fastq.gz
#> fastqs/hgmm_1k_S1_L005_I1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L005_R1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L005_R2_001.fastq.gz
#> fastqs/hgmm_1k_S1_L006_I1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L006_R1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L006_R2_001.fastq.gz
#> fastqs/hgmm_1k_S1_L007_I1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L007_R1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L007_R2_001.fastq.gz
#> fastqs/hgmm_1k_S1_L008_I1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L008_R1_001.fastq.gz
#> fastqs/hgmm_1k_S1_L008_R2_001.fastq.gz

Generate the gene count matrix

Build the `kallisto` index

Here we use kallisto to pseudoalign the reads to the transcriptome and then to create the bus file to be converted to a sparse matrix. The first step is to build an index of the transcriptome. This data set has both human and mouse cells, so we need both human and mouse transcriptomes. The transcriptomes downloaded here are from Ensembl version 94, released in October 2018.

# Human transcriptome
if (!file.exists("./data/hs_cdna.fa.gz")) {
  download.file("ftp://ftp.ensembl.org/pub/release-99/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz", "./data/hs_cdna.fa.gz", method = "wget", quiet = TRUE)
}
# Mouse transcriptome
if (!file.exists("./data/mm_cdna.fa.gz")) {
  download.file("ftp://ftp.ensembl.org/pub/release-99/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.cdna.all.fa.gz", "./data/mm_cdna.fa.gz", method = "wget", quiet = TRUE)
}

# This chunk is in bash
kallisto version

#> kallisto, version 0.46.0

Actually, we don’t need to unzip the fasta files

if (!file.exists("./output/hs_mm_tr_index.idx")) {
  system("kallisto index -i ./output/hs_mm_tr_index.idx ./data/hs_cdna.fa.gz ./data/mm_cdna.fa.gz")
}

Run `kallisto bus`

Here we will generate the bus file. Here bus stands for Barbode, UMI, Set (i.e. equivalent class). In text form, it is a table whose first column is the barcode. The second column is the UMI that are associated with the barcode. The third column is the index of the equivalence class reads with the UMI maps to (equivalence class will be explained later). The fourth column is count of reads with this barcode, UMI, and equivalence class combination, which is ignored as one UMI should stand for one molecule. See this paper for more detail.

These are the technologies supported by kallisto bus:

system("kallisto bus --list", intern = TRUE)

#> Warning in system("kallisto bus --list", intern = TRUE): running command
#> 'kallisto bus --list' had status 1

#>  [1] "List of supported single-cell technologies"
#>  [2] ""                                          
#>  [3] "short name       description"              
#>  [4] "----------       -----------"              
#>  [5] "10xv1            10x version 1 chemistry"  
#>  [6] "10xv2            10x version 2 chemistry"  
#>  [7] "10xv3            10x version 3 chemistry"  
#>  [8] "CELSeq           CEL-Seq"                  
#>  [9] "CELSeq2          CEL-Seq version 2"        
#> [10] "DropSeq          DropSeq"                  
#> [11] "inDrops          inDrops"                  
#> [12] "SCRBSeq          SCRB-Seq"                 
#> [13] "SureCell         SureCell for ddSEQ"       
#> [14] ""                                          
#> attr(,"status")
#> [1] 1

Here we have 8 samples. Each sample has 3 files: I1 means sample index, R1 means barcode and UMI, and R2 means the piece of cDNA. The -i argument specifies the index file we just built. The -o argument specifies the output directory. The -x argument specifies the sequencing technology used to generate this data set. The -t argument specifies the number of threads used. I ran this on a server and used 8 threads.

# This chunk is in bash
cd ./data
kallisto bus -i ../output/hs_mm_tr_index.idx -o ../output/out_hgmm1k -x 10xv2 -t8 \
./fastqs/hgmm_1k_S1_L001_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L001_R2_001.fastq.gz \
./fastqs/hgmm_1k_S1_L002_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L002_R2_001.fastq.gz \
./fastqs/hgmm_1k_S1_L003_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L003_R2_001.fastq.gz \
./fastqs/hgmm_1k_S1_L004_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L004_R2_001.fastq.gz \
./fastqs/hgmm_1k_S1_L005_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L005_R2_001.fastq.gz \
./fastqs/hgmm_1k_S1_L006_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L006_R2_001.fastq.gz \
./fastqs/hgmm_1k_S1_L007_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L007_R2_001.fastq.gz \
./fastqs/hgmm_1k_S1_L008_R1_001.fastq.gz ./fastqs/hgmm_1k_S1_L008_R2_001.fastq.gz

#> 
#> [index] k-mer length: 31
#> [index] number of targets: 309,785
#> [index] number of k-mers: 209,728,067
#> [index] number of equivalence classes: 1,288,400
#> [quant] will process sample 1: ./fastqs/hgmm_1k_S1_L001_R1_001.fastq.gz
#>                                ./fastqs/hgmm_1k_S1_L001_R2_001.fastq.gz
#> [quant] will process sample 2: ./fastqs/hgmm_1k_S1_L002_R1_001.fastq.gz
#>                                ./fastqs/hgmm_1k_S1_L002_R2_001.fastq.gz
#> [quant] will process sample 3: ./fastqs/hgmm_1k_S1_L003_R1_001.fastq.gz
#>                                ./fastqs/hgmm_1k_S1_L003_R2_001.fastq.gz
#> [quant] will process sample 4: ./fastqs/hgmm_1k_S1_L004_R1_001.fastq.gz
#>                                ./fastqs/hgmm_1k_S1_L004_R2_001.fastq.gz
#> [quant] will process sample 5: ./fastqs/hgmm_1k_S1_L005_R1_001.fastq.gz
#>                                ./fastqs/hgmm_1k_S1_L005_R2_001.fastq.gz
#> [quant] will process sample 6: ./fastqs/hgmm_1k_S1_L006_R1_001.fastq.gz
#>                                ./fastqs/hgmm_1k_S1_L006_R2_001.fastq.gz
#> [quant] will process sample 7: ./fastqs/hgmm_1k_S1_L007_R1_001.fastq.gz
#>                                ./fastqs/hgmm_1k_S1_L007_R2_001.fastq.gz
#> [quant] will process sample 8: ./fastqs/hgmm_1k_S1_L008_R1_001.fastq.gz
#>                                ./fastqs/hgmm_1k_S1_L008_R2_001.fastq.gz
#> [quant] finding pseudoalignments for the reads ... done
#> [quant] processed 63,252,296 reads, 52,281,921 reads pseudoaligned

See what the outputs are

list.files("./output/out_hgmm1k/")

#> [1] "matrix.ec"       "output.bus"      "run_info.json"   "transcripts.txt"

Explaining the output:

matrix.ec: A text file with two columns. The first column is the 0 based index of equivalence classes. The second column is the set of transcripts (denoted by 0 based index based on order of appearance in the transcriptome fasta file) present in the corresponding equivalence class.
output.bus: The data represented in bus format. This is a binary file, so don’t use something like read.table to read it into R.
run_info.json: Information about the call to kallisto bus, including the command used, number and percentage of reads pseudoaligned, version of kallisto used, and etc.
transcript.txt: A text file with one column, which is the transcripts present in the data, in the same order as in the transcriptome fasta file.

Run `bustools`

Map transcripts to genes

For the sparse matrix, most people are interested in how many UMIs per gene per cell, we here we will quantify this from the bus output, and to do so, we need to find which gene corresponds to each transcript. Remember in the output of kallisto bus, there’s the file transcripts.txt. Those are the transcripts in the transcriptome index.

Remember that we downloaded transcriptome FASTA files from Ensembl just now. In FASTA files, each entry is a sequence with a name. In Ensembl FASTA files, the sequence name has genome annotation of the corresponding sequence, so we can extract transcript IDs and corresponding gene IDs and gene names from there.

tr2g <- transcript2gene(fasta_file = c("./data/hs_cdna.fa.gz", "./data/mm_cdna.fa.gz"),
                        kallisto_out_path = "./output/out_hgmm1k")

#> Reading FASTA file.
#> Reading FASTA file.

#> Sorting transcripts

head(tr2g)

#>           transcript              gene gene_name
#> 1: ENST00000434970.2 ENSG00000237235.2     TRDD2
#> 2: ENST00000415118.1 ENSG00000223997.1     TRDD1
#> 3: ENST00000448914.1 ENSG00000228985.1     TRDD3
#> 4: ENST00000631435.1 ENSG00000282253.1     TRBD1
#> 5: ENST00000632684.1 ENSG00000282431.1     TRBD1
#> 6: ENST00000390583.1 ENSG00000211923.1  IGHD3-10

bustools requires tr2g to be written into a tab delimited file of a specific format: No headers, first column is transcript ID, and second column is the corresponding gene ID. Transcript IDs must be in the same order as in the kallisto index.

# Write tr2g to format required by bustools
save_tr2g_bustools(tr2g, "./output/tr2g_hgmm.tsv")

A whitelist that contains all the barcodes known to be present in the kit is provided by 10x and comes with CellRanger. A CellRanger installation is required, though we will not run CellRanger here.

# Copy v2 chemistry whitelist to working directory
cp ~/cellranger-3.1.0/cellranger-cs/3.1.0/lib/python/cellranger/barcodes/737K-august-2016.txt \
./data/whitelist_v2.txt

Then we’re ready to make the gene count matrix. First, bustools runs barcode error correction on the bus file. Then, the corrected bus file is sorted by barcode, UMI, and equivalence classes. Then the UMIs are counted and the counts are collapsed into gene level. Here the | is pipe in bash, just like the magrittr pipe %>% in R, that pipes the output of one command to the next.

mkdir ./output/out_hgmm1k/genecount ./tmp
bustools correct -w ./data/whitelist_v2.txt -p ./output/out_hgmm1k/output.bus | \
bustools sort -T tmp/ -t 4 -p - | \
bustools count -o ./output/out_hgmm1k/genecount/genes -g ./output/tr2g_hgmm.tsv \
-e ./output/out_hgmm1k/matrix.ec -t ./output/out_hgmm1k/transcripts.txt --genecounts -

#> Found 737280 barcodes in the whitelist
#> Number of hamming dist 1 barcodes = 20550336
#> Processed 52281921 bus records
#> In whitelist = 50825261
#> Corrected = 348752
#> Uncorrected = 1107908
#> Read in 51174013 BUS records

See what the outputs are

list.files("./output/out_hgmm1k/genecount")

#> [1] "genes"              "genes.barcodes.txt" "genes.genes.txt"   
#> [4] "genes.mtx"

Here we have text files for barcodes and gene names, and an mtx file for the sparse gene count matrix.

Explore the data

Now we can load the matrix into R for analysis.

res_mat <- read_count_output("./output/out_hgmm1k/genecount",
                             name = "genes", tcc = FALSE)

Remove empty droplets

Cool, so now we have the sparse matrix. What does it look like?

dim(res_mat)

#> [1]  77202 362207

The number of genes is as expected for two species. There’re way more cells than we expect, which is about 1000. So what’s going on?

How many UMIs per barcode?

tot_counts <- Matrix::colSums(res_mat)
summary(tot_counts)

#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#>     0.00     1.00     1.00    76.04     8.00 74292.00

The vast majority of “cells” have only a few UMI detected. Those are empty droplets. 10x claims to have cell capture rate of up to 65%, but in practice, depending on how many cells are in fact loaded, the rate can be much lower. A commonly used method to estimate the number of empty droplets is barcode ranking knee and inflection points, as those are often assumed to represent transition between two components of a distribution. While more sophisticated method exist (e.g. see emptyDrops in DropletUtils), for simplicity, we will use the barcode ranking method here. However, whichever way we go, we don’t have the ground truth.

# Compute barcode rank
bc_rank <- barcodeRanks(res_mat)

qplot(bc_rank$total, bc_rank$rank, geom = "line") +
  geom_vline(xintercept = metadata(bc_rank)$knee, color = "blue", linetype = 2) +
  geom_vline(xintercept = metadata(bc_rank)$inflection, color = "green", linetype = 2) +
  annotate("text", y = 1000, x = 1.5 * c(metadata(bc_rank)$knee, metadata(bc_rank)$inflection),
           label = c("knee", "inflection"), color = c("blue", "green")) +
  scale_x_log10() +
  scale_y_log10() +
  labs(y = "Barcode rank", x = "Total UMI count")

#> Warning: Transformation introduced infinite values in continuous x-axis

Version	Author	Date
f65450b	Lambda Moses	2019-06-22
9fa20ae	Lambda Moses	2019-06-22
06e9869	Lambda Moses	2019-03-01
71513e5	Lambda Moses	2019-02-14
7184968	Lambda Moses	2019-02-14
8cd81e0	Lambda Moses	2019-02-14

The inflection point looks like a reasonable number of cells.

# Filter the matrix
res_mat <- res_mat[, tot_counts > metadata(bc_rank)$inflection]
dim(res_mat)

#> [1] 77202  1095

Cell species

How many cells are from humans and how many from mice? The number of cells with mixed species indicates doublet rate.

gene_species <- ifelse(str_detect(rownames(res_mat), "^ENSMUSG"), "mouse", "human")
mouse_inds <- gene_species == "mouse"
human_inds <- gene_species == "human"
# mark cells as mouse or human
cell_species <- tibble(n_mouse_umi = Matrix::colSums(res_mat[mouse_inds,]),
                       n_human_umi = Matrix::colSums(res_mat[human_inds,]),
                       tot_umi = Matrix::colSums(res_mat),
                       prop_mouse = n_mouse_umi / tot_umi,
                       prop_human = n_human_umi / tot_umi)

# Classify species based on proportion of UMI, with cutoff of 90%
cell_species <- cell_species %>% 
  mutate(species = case_when(
    prop_mouse > 0.9 ~ "mouse",
    prop_human > 0.9 ~ "human",
    TRUE ~ "mixed"
  ))

ggplot(cell_species, aes(n_human_umi, n_mouse_umi, color = species)) +
  geom_point(size = 0.5)

Version	Author	Date
9fa20ae	Lambda Moses	2019-06-22
71513e5	Lambda Moses	2019-02-14
7184968	Lambda Moses	2019-02-14
8cd81e0	Lambda Moses	2019-02-14

Great, looks like the vast majority of cells are not mixed.

cell_species %>% 
  dplyr::count(species) %>% 
  mutate(proportion = n / ncol(res_mat))

#> # A tibble: 3 x 3
#>   species     n proportion
#>   <chr>   <int>      <dbl>
#> 1 human     566    0.517  
#> 2 mixed       3    0.00274
#> 3 mouse     526    0.480

Great, only about 0.3% of cells here are doublets, which is lower than the ~1% 10x lists. Doublet rate tends to be lower when cell concentration is lower. However, doublets can still be formed with cells from the same species, so the number of mixed species “cells” is only a lower bound of doublet rate.

Dimension reduction

seu <- CreateSeuratObject(res_mat, min.cells = 3) %>% 
  NormalizeData(verbose = FALSE) %>% 
  ScaleData(verbose = FALSE) %>% 
  FindVariableFeatures(verbose = FALSE)

# Add species to meta data
seu <- AddMetaData(seu, metadata = cell_species$species, col.name = "species")

See how number of total counts and number of genes expressed are distributed.

VlnPlot(seu, c("nCount_RNA", "nFeature_RNA"), group.by = "species",
        pt.size = 0.1)

Version	Author	Date
9fa20ae	Lambda Moses	2019-06-22
71513e5	Lambda Moses	2019-02-14
7184968	Lambda Moses	2019-02-14
8cd81e0	Lambda Moses	2019-02-14

Another QC plot

ggplot(seu@meta.data, aes(nCount_RNA, nFeature_RNA, color = species)) +
  geom_point(alpha = 0.7, size = 0.5) +
  labs(x = "Total UMI counts per cell", y = "Number of genes detected")

Version	Author	Date
db70187	Lambda Moses	2018-12-14
2da3b35	Lambda Moses	2018-12-14

The mixed species doublets do look different from human and mouse cells.

seu <- RunPCA(seu, verbose = FALSE, npcs = 30)
ElbowPlot(seu, ndims = 30)

Version	Author	Date
9fa20ae	Lambda Moses	2019-06-22
06e9869	Lambda Moses	2019-03-01
71513e5	Lambda Moses	2019-02-14
7184968	Lambda Moses	2019-02-14
8cd81e0	Lambda Moses	2019-02-14

DimPlot(seu, reduction = "pca", pt.size = 0.5, group.by = "species")

Version	Author	Date
9fa20ae	Lambda Moses	2019-06-22
06e9869	Lambda Moses	2019-03-01
71513e5	Lambda Moses	2019-02-14
7184968	Lambda Moses	2019-02-14
8cd81e0	Lambda Moses	2019-02-14

The first PC separates species, as expected. Also as expected, the doublets are in between human and mouse cells in this plot.

seu <- RunTSNE(seu, dims = 1:20, check_duplicates = FALSE)
DimPlot(seu, reduction = "tsne", pt.size = 0.5, group.by = "species")

Version	Author	Date
9fa20ae	Lambda Moses	2019-06-22
06e9869	Lambda Moses	2019-03-01
71513e5	Lambda Moses	2019-02-14
7184968	Lambda Moses	2019-02-14
8cd81e0	Lambda Moses	2019-02-14

The species separate, and the few doublets form its own cluster, as expected.

sessionInfo()

#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-redhat-linux-gnu (64-bit)
#> Running under: CentOS Linux 7 (Core)
#> 
#> Matrix products: default
#> BLAS/LAPACK: /usr/lib64/R/lib/libRblas.so
#> 
#> locale:
#> [1] en_US.UTF-8
#> 
#> attached base packages:
#> [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] Matrix_1.2-18               DropletUtils_1.6.1         
#>  [3] SingleCellExperiment_1.8.0  SummarizedExperiment_1.16.1
#>  [5] DelayedArray_0.12.2         BiocParallel_1.20.1        
#>  [7] matrixStats_0.55.0          Biobase_2.46.0             
#>  [9] GenomicRanges_1.38.0        GenomeInfoDb_1.22.0        
#> [11] IRanges_2.20.2              S4Vectors_0.24.3           
#> [13] BiocGenerics_0.32.0         forcats_0.4.0              
#> [15] stringr_1.4.0               dplyr_0.8.3                
#> [17] purrr_0.3.3                 readr_1.3.1                
#> [19] tidyr_1.0.2                 tibble_2.1.3               
#> [21] ggplot2_3.2.1               tidyverse_1.3.0            
#> [23] Seurat_3.1.2                BUSpaRse_1.0.0             
#> [25] workflowr_1.6.0            
#> 
#> loaded via a namespace (and not attached):
#>   [1] utf8_1.1.4               reticulate_1.14          R.utils_2.9.2           
#>   [4] tidyselect_1.0.0         RSQLite_2.2.0            AnnotationDbi_1.48.0    
#>   [7] htmlwidgets_1.5.1        grid_3.6.0               Rtsne_0.15              
#>  [10] munsell_0.5.0            codetools_0.2-16         ica_1.0-2               
#>  [13] future_1.16.0            withr_2.1.2              colorspace_1.4-1        
#>  [16] knitr_1.27               rstudioapi_0.10          ROCR_1.0-7              
#>  [19] gbRd_0.4-11              listenv_0.8.0            labeling_0.3            
#>  [22] Rdpack_0.11-1            git2r_0.26.1             GenomeInfoDbData_1.2.2  
#>  [25] farver_2.0.3             bit64_0.9-7              rhdf5_2.30.1            
#>  [28] rprojroot_1.3-2          vctrs_0.2.2              generics_0.0.2          
#>  [31] xfun_0.12                BiocFileCache_1.10.2     R6_2.4.1                
#>  [34] rsvd_1.0.2               locfit_1.5-9.1           AnnotationFilter_1.10.0 
#>  [37] bitops_1.0-6             assertthat_0.2.1         promises_1.1.0          
#>  [40] SDMTools_1.1-221.2       scales_1.1.0             gtable_0.3.0            
#>  [43] npsurv_0.4-0             globals_0.12.5           ensembldb_2.10.2        
#>  [46] rlang_0.4.3              zeallot_0.1.0            splines_3.6.0           
#>  [49] rtracklayer_1.46.0       lazyeval_0.2.2           broom_0.5.4             
#>  [52] plyranges_1.6.6          yaml_2.2.0               reshape2_1.4.3          
#>  [55] modelr_0.1.5             GenomicFeatures_1.38.1   backports_1.1.5         
#>  [58] httpuv_1.5.2             tools_3.6.0              gplots_3.0.1.2          
#>  [61] RColorBrewer_1.1-2       ggridges_0.5.2           Rcpp_1.0.3              
#>  [64] plyr_1.8.5               progress_1.2.2           zlibbioc_1.32.0         
#>  [67] RCurl_1.98-1.1           prettyunits_1.1.1        openssl_1.4.1           
#>  [70] pbapply_1.4-2            cowplot_1.0.0            zoo_1.8-7               
#>  [73] haven_2.2.0              ggrepel_0.8.1            cluster_2.1.0           
#>  [76] fs_1.3.1                 magrittr_1.5             data.table_1.12.8       
#>  [79] reprex_0.3.0             lmtest_0.9-37            RANN_2.6.1              
#>  [82] whisker_0.4              ProtGenerics_1.18.0      fitdistrplus_1.0-14     
#>  [85] hms_0.5.3                lsei_1.2-0               evaluate_0.14           
#>  [88] XML_3.99-0.3             readxl_1.3.1             gridExtra_2.3           
#>  [91] compiler_3.6.0           biomaRt_2.42.0           KernSmooth_2.23-16      
#>  [94] crayon_1.3.4             R.oo_1.23.0              htmltools_0.4.0         
#>  [97] later_1.0.0              RcppParallel_4.4.4       lubridate_1.7.4         
#> [100] DBI_1.1.0                dbplyr_1.4.2             MASS_7.3-51.5           
#> [103] rappdirs_0.3.1           cli_2.0.1                R.methodsS3_1.7.1       
#> [106] gdata_2.18.0             metap_1.0                igraph_1.2.4.2          
#> [109] pkgconfig_2.0.3          GenomicAlignments_1.22.1 plotly_4.9.1            
#> [112] xml2_1.2.2               dqrng_0.2.1              XVector_0.26.0          
#> [115] rvest_0.3.5              bibtex_0.4.2.2           digest_0.6.23           
#> [118] sctransform_0.2.1        RcppAnnoy_0.0.14         tsne_0.1-3              
#> [121] Biostrings_2.54.0        rmarkdown_2.1            cellranger_1.1.0        
#> [124] leiden_0.3.2             edgeR_3.28.0             uwot_0.1.5              
#> [127] curl_4.3                 Rsamtools_2.2.1          gtools_3.8.1            
#> [130] lifecycle_0.1.0          nlme_3.1-143             jsonlite_1.6            
#> [133] Rhdf5lib_1.8.0           limma_3.42.0             fansi_0.4.1             
#> [136] viridisLite_0.3.0        askpass_1.1              BSgenome_1.54.0         
#> [139] pillar_1.4.3             lattice_0.20-38          httr_1.4.1              
#> [142] survival_3.1-8           glue_1.3.1               png_0.1-7               
#> [145] bit_1.1-15.1             HDF5Array_1.14.1         stringi_1.4.5           
#> [148] blob_1.2.1               caTools_1.18.0           memoise_1.1.0           
#> [151] irlba_2.3.3              future.apply_1.4.0       ape_5.3

1k 1:1 Mixture of Fresh Frozen Human (HEK293T) and Mouse (NIH3T3) Cells (10x v2 chemistry)

Lambda Moses

2020-02-05