Additional Topics - qualtile.clipping - scale.factor - Python and Bioc - multiplexing - multi batch - FAQ

This vignette provides additional information on dsb for the topics below:

  1. Integrating dsb with Bioconductor
  2. Integrating dsb with python/Scanpy
  3. Using dsb with data lacking isotype controls
  4. Integrating dsb with sample multiplexing experiments
  5. advanced usage - return internal stats used by dsb
  6. outlier clipping with the quantile.clipping argument
  7. using a different background scaling method
  8. Frequently Asked Questions

Integrating dsb with Bioconductor

Rather than Seurat you may wish to use the SingleCellExperiment class to use Bioconductor packages. To use Bioconductor’s semantics, we store raw protein values in an ‘alternative Experiment’ in a SingleCellExperiment object containing RNA counts.

suppressMessages(library(SingleCellExperiment))
sce = SingleCellExperiment(assays = list(counts = cell.rna.raw), colData = cellmd)
# define the dsb normalized values as logcounts to use a common SingleCellExperiment / Bioconductor convention
adt = SummarizedExperiment(
  assays = list(
    'counts' = as.matrix(cell.adt.raw),
    'logcounts' = as.matrix(cell.adt.dsb)
    )
  )
altExp(sce, "CITE") = adt

Using dsb in Python

NEW: Python users are encouraged to checkout the muon python multimodal framework to use dsb from within python about muon dsb wrapper for python in muon

dsb is available directly from within muon, for example see the snippet below from muon documentation

import numpy as np
import pandas as pd
import scanpy as sc
import muon as mu
from muon import prot as pt
# see muon documentation for example and data
mdata = mu.read_10x_mtx(os.path.join(data_dir, "filtered_feature_bc_matrix"))
mdata_raw = mu.read_10x_mtx(os.path.join(data_dir, "raw_feature_bc_matrix"))
prot = mdata.mod['prot']
pt.pp.dsb(mdata, raw=mdata_raw, empty_droplets=droplets)
isotypes = mdata_raw['prot'].var_names[29:32].values isotypes
pt.pp.dsb(mdata, mdata_raw, empty_counts_range=(1.5, 2.8), isotype_controls=isotypes, random_state=1)

You can also use dsb normalized values with the AnnData class in Python by using reticulate to create the AnnData object from dsb denoised and normalized protein values as well as raw RNA data. Anndata are not structured as separate assays; we therefore need to merge the RNA and protein data into the same matrix.

See interoperability between Scanpy Bioconductor and Seurat

library(reticulate); sc = import("scanpy")

# merge dsb-normalized protein and raw RNA data 
combined_dat = rbind(cell.rna.raw, cell.adt.dsb)
s[["combined_data"]] = CreateAssayObject(data = combined_dat)

# create Anndata Object 
adata_seurat = sc$AnnData(
    X   = t(GetAssayData(s,assay = "combined_data")),
    obs = seurat@meta.data,
    var = GetAssay(seurat)[[]]
    )

Using dsb with data lacking isotype controls

If isotype controls are not included, you can run dsb correcting ambient background without cell denoising. We only recommend setting denoise.counts = FALSE if isotype controls were not included in the experiment which results in not defining the technical component of each cell’s protein library. The values of the normalized matrix returned are the number of standard deviations above the expected ambient noise captured by empty droplets.

# suggested workflow if isotype controls are not included 
dsb_rescaled = DSBNormalizeProtein(cell_protein_matrix = cells_citeseq_mtx,
                                   empty_drop_matrix = empty_drop_citeseq_mtx, 
                                   # do not denoise each cell's technical component
                                   denoise.counts = FALSE)

We strongly recommend using isotype controls, however if these are not available, the background mean for each cell inferred via a per-cell gaussian mixture model (µ1) can theoretically be used alone to define the cell’s technical component, however this assumes the background mean has no expected biological variation. In our data the background mean had weak but significant correlation with the foreground mean (µ2) across single cells (see the paper). Isotype controls anchor the component of the background mean associated with noise.


dsb_rescaled = dsb::DSBNormalizeProtein(cell_protein_matrix = cells_citeseq_mtx,
                                        empty_drop_matrix = empty_drop_citeseq_mtx, 
                                        # denoise with background mean only 
                                        denoise.counts = TRUE, 
                                        use.isotype.control = FALSE)

Using dsb with sample multiplexing experiments

In multiplexing experiments with cell superloading, demultiplexing functions define a “negative” cell population which can then be used to define background droplets for dsb. Multiplexing / Demultiplexing methods and functions compatible with dsb:
HTODemux (Seurat)
deMULTIplex (Multiseq)
demuxlet

In our data, dsb normalized values were nearly identically distributed when dsb was run with background defined by demultiplexing functions or protein library size (see the paper).

Note– we must load the raw output from cell ranger! This is essential; the filtered output are the cells estimated by Cell Ranger. there is not sufficient background in the filtered output for demultiplexing functions like HTODemux which need a negative population and that’s the population of droplets needed by dsb to estimate the ambient component. A good way to use demultiplexing with dsb and improve your demultiplexing calls is to use the min.genes argument in the Seurat::Read10X function to partially threshold out some background drops yet still retain sufficient (often > 80,000 droplets per 10X lane depending on experiment) from which to estimate the background. This balances memory strain when demultiplexing tens of thousands of cells with requirements of the Seurat::HTODemux function to have sufficient empty drops to estimate the background population of each Hash antibody. Increasing the number of drops used in demultiplexing will result in more droplets defined by the function as “negative” which can increase the confidence in the estimate of background used by dsb

# raw = Read10X see above -- path to cell ranger outs/raw_feature_bc_matrix ; 

# partial thresholding to slightly subset negative drops include all with 5 unique mRNAs
seurat_object = CreateSeuratObject(raw, min.genes = 5)

# demultiplex (positive.quantile can be tuned to dataset depending on size)
seurat_object = HTODemux(seurat_object, assay = "HTO", positive.quantile = 0.99)
Idents(seurat_object) = "HTO_classification.global"

# subset empty drop/background and cells 
neg_object = subset(seurat_object, idents = "Negative")
singlet_object = subset(seurat_object, idents = "Singlet")

## (QC the negative object to filter out cells with high RNA content)
# quick example below, different crteria can be used
# this step depends on dataset; see main vignette for more principled filtering 
neg_object = subset(seurat_object, idents = "Negative", nGene < 80)


# non sparse CITEseq data store more efficiently in a regular matrix
neg_adt_matrix = GetAssayData(neg_object, assay = "CITE", slot = 'counts') %>% as.matrix()
positive_adt_matrix = GetAssayData(singlet_object, assay = "CITE", slot = 'counts') %>% as.matrix()

# normalize the data with dsb
dsb_norm_prot = DSBNormalizeProtein(
                           cell_protein_matrix = cells_mtx_rawprot,
                           empty_drop_matrix = negative_mtx_rawprot,
                           denoise.counts = TRUE,
                           use.isotype.control = TRUE,
                           isotype.control.name.vec = rownames(cells_mtx_rawprot)[30:32])

# now add the normalized dat back to the object (the singlets defined above as "object")
singlet_object[["CITE"]] = CreateAssayObject(data = dsb_norm_prot)

# proceed with same tutorial workflow shown above. 

Advanced usage - return internal stats used by dsb

If you want to look at internal stats calculated by dsb you can do so by setting return.stats = TRUE.

library(dsb)
result.list = 
  DSBNormalizeProtein(
    cell_protein_matrix = cells_citeseq_mtx[ ,1:50], 
    empty_drop_matrix = empty_drop_citeseq_mtx, 
    denoise.counts = TRUE, 
    use.isotype.control = TRUE, 
    isotype.control.name.vec = rownames(cells_citeseq_mtx)[67:70], 
    return.stats = TRUE
  )
#> [1] "correcting ambient protein background noise"
#> [1] "some proteins with low background variance detected check raw and normalized distributions.  protein stats can be returned with return.stats = TRUE"
#> [1] "CD185_PROT"
#> [1] "fitting models to each cell for dsb technical component and removing cell to cell technical noise"
#> [1] "returning results in a list; normalized matrix accessed x$dsb_normalized_matrix"

The results are provided as a list

names(result.list)
#> [1] "dsb_normalized_matrix" "technical_stats"       "protein_stats"

Different protein level stats available:

names(result.list$protein_stats)
#> [1] "raw cell matrix stats"       "dsb normalized matrix stats"
#> [3] "background_mean"             "background_sd"

We can also extract the dsb technical component and variables used to derive the technical component for example:

head(result.list$technical_stats)
#>                          MouseIgG1kappaisotype_PROT MouseIgG2akappaisotype_PROT
#> GAACGGATCGAATCCA_H1B1ln2                  0.7409978                   2.1388457
#> ATAGACCAGTCCATAC_H1B1ln3                 -0.6802394                   1.4410413
#> GAAGCAGAGTATCTCG_H1B1ln4                 -0.6802394                  -0.1319484
#> CTACCCAAGCTACCGC_H1B1ln6                 -0.6802394                  -0.1319484
#> AGCCTAAAGGATATAC_H1B1ln2                 -0.6802394                   1.4410413
#> CGGACTGGTCCGCTGA_H1B1ln6                  0.7409978                  -1.0293939
#>                          Mouse IgG2bkIsotype_PROT RatIgG2bkIsotype_PROT
#> GAACGGATCGAATCCA_H1B1ln2                2.0081678            -0.4511315
#> ATAGACCAGTCCATAC_H1B1ln3                0.1938776             0.7393731
#> GAAGCAGAGTATCTCG_H1B1ln4               -0.8412382            -0.4511315
#> CTACCCAAGCTACCGC_H1B1ln6                0.1938776            -0.4511315
#> AGCCTAAAGGATATAC_H1B1ln2                0.1938776             0.7393731
#> CGGACTGGTCCGCTGA_H1B1ln6               -0.8412382            -0.4511315
#>                          cellwise_background_mean dsb_technical_component
#> GAACGGATCGAATCCA_H1B1ln2              0.424416035             -1.69894383
#> ATAGACCAGTCCATAC_H1B1ln3              0.129457815             -0.62075991
#> GAAGCAGAGTATCTCG_H1B1ln4              0.008699038              0.77617240
#> CTACCCAAGCTACCGC_H1B1ln6              0.239007906              0.30128399
#> AGCCTAAAGGATATAC_H1B1ln2             -0.170614869             -0.24429265
#> CGGACTGGTCCGCTGA_H1B1ln6             -0.039058826             -0.06709057

outlier clipping with the quantile.clipping argument

By default setting quanitle.clipping = TRUE sets cells values for a given protein above the 99.95th percentile or below the 0.1 percentile of that protein’s expression to be thees quantile values. That value is optimized to remove a few high and low magnitude outliers but can be set to adapt to the number of cells in the dataset.

dsb_norm_prot = DSBNormalizeProtein(
                           cell_protein_matrix = cells_citeseq_mtx,
                           empty_drop_matrix = empty_drop_citeseq_mtx,
                           denoise.counts = TRUE,
                           use.isotype.control = TRUE,
                           isotype.control.name.vec = rownames(cells_citeseq_mtx)[67:70],
                           # implement Quantile clipping 
                           quantile.clipping = TRUE
                           # high and low otlier quantile across proteins to clip
                           # the `quantile.clip` parameter can be adjusted: 
                           quantile.clip = c(0.001, 0.9995) 
                           )

Using a different background scaling method

To enable subtraction of the ambient background mean without dividing by the ambient background standard deviation, one can set scale.factor = 'mean.subtract'. This method may be more appropriate for datasets staining with very large panels with lowly titrated antibodies. Can be used if the range of values for some proteins looks vary large or if minimal background detected.


dsb_norm_prot = DSBNormalizeProtein(
                           cell_protein_matrix = cells_citeseq_mtx,
                           empty_drop_matrix = empty_drop_citeseq_mtx,
                           denoise.counts = TRUE,
                           use.isotype.control = TRUE,
                           isotype.control.name.vec = rownames(cells_citeseq_mtx)[67:70],
                           quantile.clipping = TRUE,
                           scale.factor = 'mean.subtract'
                           )

Frequently Asked Questions

I get the error “Error in quantile.default(x, seq(from = 0, to = 1, length = n)): missing values and NaN’s not allowed if ‘na.rm’ is FALSE” What should I do? - This error occurs during denoising, (denoise = TRUE) when you have antibodies with 0 counts or close to 0 across all cells. To get rid of this error, check the distributions of the antibodies with e.g. apply(cells_protein_matrix, 1, quantile) to find the protein(s) with basically no counts, then remove these from BOTH the empty drops and the cells. Please refer to these links:
https://github.com/niaid/dsb/issues/6
https://github.com/niaid/dsb/issues/26

I am getting a memory limit error when I use dsb
This is likely because you are using too many barcodes in the negative population–please follow the vignette; if you are using Cell Ranger counts the raw barcode matrix is all the theoretical barcodes, you need to narrow in on the barcodes for which you have empty drops that captured ADT and cell containing droplets.

I get a “problem too large or memory exhausted error when I try to convert to a regular R matrix - See above and see issue 10 on the dsb github. CITE-seq protein counts don’t need a sparse representation-very likely this error is because there are too many negative droplets defined (i.e. over 1 million). You should be able to normalize datasets with 100,000+ cells and similar numbers of negative droplets (or less) on a normal 16GB laptop. By further narrowing in on the major background distribution, one should be able to convert the cells and background to a normal R matrix which should run successfully.
https://github.com/niaid/dsb/issues/10

the range of dsb normalized values is large is this normal? In nearly all cases encountered thus far, the large range of values for a protein (e.g. ranging from -50 to 50) are caused by just a few outlier cells, most often a few cells with low negative values for the protein. We have now provided a quantile clipping option in dsb to address these outlier cells. Users can also investigate these cells to see if they have very high values for isotype control proteins and they can possibly be removed from the dataset. https://github.com/niaid/dsb/issues/22
https://github.com/niaid/dsb/issues/9

check for outliers in dsb normalized values

# find outliers 
pheatmap::pheatmap(apply(dsb_norm_prot, 1, function(x){
  quantile(x,c(0.9999, 0.99, 0.98, 0.95, 0.0001, 0.01, 0.1))
  }))

We can address this by clipping the max and min values as above in the quantile clipping section.

What are the minimum number of proteins required to use dsb dsb is compatible with datasets with any number of proteins with step I alone. To remove ambient background noise simply set denoise.counts = FALSE. We have validated the algorithm assumptions on datasets with 14 phenotyping antibodies and 3 isotype controls. With less proteins than this, we recommend users return the internal stats calculated by dsb and check correlations of the variables as shown below. If these values are reasonably correlated, it indicates the model assumptions of dsb could be valid.


dsb_object = DSBNormalizeProtein(cell_protein_matrix = dsb::cells_citeseq_mtx, 
                                 empty_drop_matrix = dsb::empty_drop_citeseq_mtx,
                                 denoise.counts = TRUE, 
                                 isotype.control.name.vec = rownames(dsb::cells_citeseq_mtx)[67:70], 
                                 return.stats = TRUE)
d = as.data.frame(dsb_object$dsb_stats)

# test correlation of background mean with the inferred dsb technical component 
cor(d$cellwise_background_mean, d$dsb_technical_component)

# test average isotype control value correlation with the background mean 
isotype_names = rownames(dsb::cells_citeseq_mtx)[67:70]
cor(rowMeans(d[,isotype_names]), d$cellwise_background_mean)

How do I know whether I should set the denoise.counts argument to TRUE vs FALSE?
In the vast majority of cases we recommend setting denoise.counts = TRUE and use.isotype.control = TRUE (this is the package default). The only reason not to use this argument is if the model assumptions used to define the technical component are not expected to be met by the particular experiment: with denoise.counts = TRUE dsb models the negative protein population (µ1) for each cell with a two-component Gaussian mixture, making the conservative assumption that cells in the experiment should be negative for a subset of the measured proteins. If you expect all cells in your experiment be positive for all of the proteins measured, this may not be an optimal assumption. Model assumptions were validated on datasets measuring less than 20 to more than 200 proteins.

I have multiple “lanes” of 10X data from the same pool of cells, how should I run the workflow above?

Droplets derived from the same pool of stained cells partitioned across multiple lanes should ideally be normalized together, though pooling background droplets derived from independent staining reactions may still produce good results if the same staining panel was used with the same experimental conditions. To do this, you should merge the raw output of each lane, then run step 1 in the workflow-note that since the cell barcode names are the same for each lane in the raw output, you need to add a string to each barcode to identify the lane of origin to make the barcodes have unique names; here is one way to do that: First, add each 10X lane raw output from Cell Ranger into a separate directory in a folder “data”
data.
|_10xlane1
  |_outs
    |_raw_feature_bc_matrix
|_10xlane2
  |_outs
    |_raw_feature_bc_matrix

library(Seurat) # for Read10X helper function

# path_to_reads = here("data/")
umi.files = list.files(path_to_reads, full.names=T, pattern = "10x" )
umi.list = lapply(umi.files, function(x) Read10X(data.dir = paste0(x,"/outs/raw_feature_bc_matrix/")))
prot = rna = list()
for (i in 1:length(umi.list)) {
  prot[[i]] = umi.list[[i]]`Antibody Capture`
  rna[[i]] = umi.list[[i]]`Gene Expression`
  colnames(prot[[i]]) = paste0(colnames(prot[[i]]),"_", i )
  colnames(rna[[i]]) = paste0(colnames(rna[[i]]),"_", i )
}  
prot = do.call(cbind, prot)
rna = do.call(cbind, rna)
# proceed with step 1 in tutorial - define background and cell containing drops for dsb

I have 2 batches, should I combine them into a single batch or normalize each batch separately? - (See issue 12 on the dsb github) How much batch variation there is depends on how much experiment-specific and expected biological variability there is between the batches. In the dataset used in the preprint, if we normalized with all background drops and cells in a single normalization, the resulting dsb normalized values were highly concordant with when we normalized each batch separately, this held true with either definition of background drops used (i.e. based on thresholding with the library size or based on hashing-see below). One could try both and see which mitigates the batch variation the most. See https://github.com/niaid/dsb/issues/12 for example code. If there are significant batch to batch variations from an experiment with the same antibody panel on the same type of cells, we recommend starting out with simple linear model based correction with limma as a starting point (type ?limma::removeBatchEffect into R console for more information).