Articles
https://doi.org/10.1038/s41592-019-0619-0

Fast, sensitive and accurate integration
of single-cell data with Harmony
Ilya Korsunsky 1,2,3,4, Nghia Millard1,2,3,4, Jean Fan 5, Kamil Slowikowski1,2,3,4,
Fan Zhang 1,2,3,4, Kevin Wei2, Yuriy Baglaenko 1,2,3,4, Michael Brenner2, Po-ru Loh
Soumya Raychaudhuri 1,2,3,4,6*

1,3,4

and

The emerging diversity of single-cell RNA-seq datasets allows for the full transcriptional characterization of cell types across a
wide variety of biological and clinical conditions. However, it is challenging to analyze them together, particularly when datasets
are assayed with different technologies, because biological and technical differences are interspersed. We present Harmony
(https://github.com/immunogenomics/harmony), an algorithm that projects cells into a shared embedding in which cells group
by cell type rather than dataset-specific conditions. Harmony simultaneously accounts for multiple experimental and biological
factors. In six analyses, we demonstrate the superior performance of Harmony to previously published algorithms while requiring fewer computational resources. Harmony enables the integration of ~106 cells on a personal computer. We apply Harmony
to peripheral blood mononuclear cells from datasets with large experimental differences, five studies of pancreatic islet cells,
mouse embryogenesis datasets and the integration of scRNA-seq with spatial transcriptomics data.

R

ecent technological advances1 enable unbiased single-cell
transcriptional profiling of thousands of cells in one experiment. Projects such as the Human Cell Atlas2 (HCA) and
Accelerating Medicines Partnership3–5 exemplify the growing body
of reference datasets of primary human tissues. While individual
experiments incrementally expand our understanding of cell types,
a comprehensive catalog of healthy and diseased cells will require
the ability to integrate multiple datasets across donors, studies and
technological platforms. Moreover, in translational research, joint
analyses across tissues and clinical conditions will be essential to
identify disease-expanded populations. Since meaningful biological variation in single-cell RNAseq datasets from different studies
is often confounded by data source6, investigators have developed
unsupervised multi-dataset integration algorithms7–10. These methods embed cells from diverse experimental conditions and biological contexts into a common reduced dimensional embedding to
enable shared cell-type identification across datasets.
Here, we introduce Harmony, an algorithm for robust, scalable
and flexible multi-dataset integration to meet four key challenges of
unsupervised scRNAseq joint embedding: scaling to large datasets,
identification of both broad populations and fine-grained subpopulations, flexibility to accommodate complex experimental design,
and the power to integrate across modalities. We apply Harmony
to a diverse range of examples, including cell lines, peripheral blood
mononuclear cells (PBMCs) assayed with different technologies,
a meta-analysis of pancreatic islet cells from multiple donors and
studies, longitudinal samples from mouse embryogenesis, and
cross-modality integration of scRNA-seq data with spatial transcriptomics data. Harmony is available as an R package on github
(https://github.com/immunogenomics/harmony), with functions
for standalone and Seurat7 pipeline analyses.

Results

Harmony iteratively learns a cell-specific linear correction function. Harmony begins with a low-dimensional embedding of cells,
such as principal components analysis (PCA), (Supplementary Note
1 and Methods). Using this embedding, Harmony first groups cells
into multi-dataset clusters (Fig. 1a). We use soft clustering to assign
cells to potentially multiple clusters, to account for smooth transitions between cell states. These clusters serve as surrogate variables,
rather than to identify discrete cell types. We developed a new soft
k-means clustering algorithm that favors clusters with cells from
multiple datasets (Methods). Clusters disproportionately containing
cells from a small subset of datasets are penalized by an information
theoretic metric. Harmony allows for multiple different penalties to
accommodate multiple technical or biological factors, such as different batches and tissue sources. Soft clustering preserves discrete
and continuous topologies while avoiding local minima that might
result from maximizing representation too quickly across multiple
datasets. After clustering, each dataset has cluster-specific centroids
(Fig. 1b) that are used to compute cluster-specific linear correction
factors (Fig. 1c). Since clusters correspond to cell types and states,
cluster-specific correction factors correspond to individual cell-type
and cell-state specific correction factors. In this way, Harmony learns
a simple linear adjustment function that is sensitive to intrinsic cellular phenotypes. Finally, each cell is assigned a cluster-weighted
average of these terms and corrected by its cell-specific linear factor
(Fig. 1d). Since each cell may be in multiple clusters, each cell has
a potentially unique correction factor. Harmony iterates these four
steps until cell cluster assignments are stable.
Quantifying performance in cell-line data. We first assessed
Harmony using three carefully controlled datasets, to evaluate

Center for Data Sciences, Brigham and Women’s Hospital, Boston, MA, USA. 2Divisions of Genetics and Rheumatology, Department of Medicine, Brigham
and Women’s Hospital and Harvard Medical School, Boston, MA, USA. 3Department of Biomedical Informatics, Harvard Medical School, Boston, MA,
USA. 4Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA. 5Department of Chemistry and Chemical
Biology, Harvard University, Cambridge, MA, USA. 6Versus Arthritis Centre for Genetics and Genomics, Centre for Musculoskeletal Research, Manchester
Academic Health Science Centre, The University of Manchester, Manchester, UK. *e-mail: soumya@broadinstitute.org
1

Nature Methods | VOL 16 | December 2019 | 1289–1296 | www.nature.com/naturemethods

1289

Articles
Dataset

NATuRe MeTHods
Cell type
Iterate until convergence

Cl
u
Clu
ste

Cl

Cl
u

Cl
u

Get cluster centroids
for each dataset

3
ter
us

Get dataset correction
factors for each cluster

r2
ste

3
ter
us

Clus
ter

4

Cl
u

Clus
ter

Clus
ter
1

3
ter
us

r2
ste

r4

Soft assign cells to
clusters, favoring mixed
dataset representation

1

Clust
er
4

4

r3
ste

r2
ste

d

Clus
ter

1

r2
ste

c

Clus
ter

Cl

b

Cl

Clu
ste
r1

Cl
u

a

Move cells based on
soft cluster membership

Fig. 1 | Overview of Harmony algorithm. PCA embeds cells into a space with reduced dimensionality. Harmony accepts the cell coordinates in this reduced
space and runs an iterative algorithm to adjust for dataset specific effects. a, Harmony uses fuzzy clustering to assign each cell to multiple clusters, while
a penalty term ensures that the diversity of datasets within each cluster is maximized. b, Harmony calculates a global centroid for each cluster, as well
as dataset-specific centroids for each cluster. c, Within each cluster, Harmony calculates a correction factor for each dataset based on the centroids.
d, Finally, Harmony corrects each cell with a cell-specific factor: a linear combination of dataset correction factors weighted by the cell’s soft cluster
assignments made in step a. Harmony repeats steps a to d until convergence. The dependence between cluster assignment and dataset diminishes with
each round. Datasets are represented with colors, cell types with different shapes.

performance on both integration (mixing of datasets) and accuracy (no mixing of cell types). Perfect integration can be achieved
by mixing all cells, regardless of cellular identity. Similarly, high
accuracy can be achieved by partitioning cells into broad clusters
without mixing datasets in small neighborhoods. In this situation,
broad cellular states are defined, but fine-grained cellular substates
and subtypes are confounded by the originating dataset. To quantify
integration and accuracy of this embedding, we defined an objective
metric: the local inverse Simpson’s Index (LISI, see Methods) in the
local neighborhood of each cell. To assess integration, we employ
‘integration LISI’ (iLISI, see Fig. 2a), which defines the effective
number of datasets in a neighborhood. Neighborhoods represented
by only a single dataset get an iLISI of 1, while neighborhoods with
an equal number of cells from two datasets get an iLISI of 2. Note
that even under ideal mixing, if the datasets have different numbers
of cells, iLISI would be less than 2. To assess accuracy, we use ‘celltype LISI’ (cLISI, see Fig. 2b), the same mathematical measure, but
applied to cell type instead of dataset labels. Accurate integration
should maintain a cLISI of 1, reflecting a separation of unique cell
types throughout the embedding. An erroneous embedding would
include neighborhoods with a cLISI of 2, indicating that neighbors
have two different types of cell.
We begin with three datasets from two cell lines: (1) pure Jurkat,
(2) pure 293T and (3) a 50/50 mix11. These datasets are ideal for
illustration and for assessment, as each cell can be unambiguously
labeled Jurkat or 293T (Supplementary Fig. 1a). A thorough integration would mix the 1,799 Jurkat cells from the mixture dataset
with 3,255 cells from the pure Jurkat dataset and the 1,565 293T
cells from the mixture dataset with the 2,859 from the pure 293T
dataset. Thus, we expect the average iLISI to range from 1, reflecting no integration, to 1.8 (= 1/[(1,799/(1,799 + 2,859))2 + (3,255/
(1,799 + 3,255))2]) for Jurkat cells and 1.5 (= 1/[(1,565/
(1,565 + 2,859))2 + (2,859/(1,565 + 2,859))2]) for 293T cells, reflecting maximal accurate integration. Application of a standard PCA
pipeline followed by UMAP embedding demonstrates that the
cells group broadly by dataset and cell type. This is both visually
apparent and quantified (Fig. 2c,d) with high accuracy reflected
by a low cLISI (median iLISI 1.00, 95% confidence interval (CI)
[1.00, 1.00]). However, the iLISI (median iLISI 1.01, 95% CI [1.00,
1.61]) is also low, reflecting imperfect integration and ample structure within each cell type reflecting the dataset of origin. After
1290

Harmony, cells from the 50/50 dataset are appropriately mixed into
the pure datasets (Fig. 2e). The increased iLISI (median iLISI 1.59,
95% CI [1.27, 1.97]) reflects the mixing of datasets, while the low
cLISI (Fig. 2f, median iLISI 1.00, 95% CI [1.00, 1.02]) reflects the
accurate separation of Jurkat from 293T cells. iLISI and cLISI provide a quantitative way to compare the integration and accuracy of
multiple algorithms. We repeated the integration and LISI analyses with MNN Correct, BBKNN, MultiCCA and Scanorama and
showed that they produced embeddings with statistically inferior
integration (Supplementary Results, Fig. 1b and Table 1).
This benchmark demonstrates the two key metrics for assessing mixing and accuracy and shows that Harmony performs well
on both metrics in a well-controlled analysis of cell-line datasets.
A potential pitfall of LISI is that it is sensitive to datasets of vastly
different sizes. In such a situation, most neighborhoods can be
dominated by a single dataset and LISI values become difficult to
interpret (Supplementary Note 2).
Harmony scales to large data. We evaluated Harmony’s computational performance, measuring both total runtime and maximum memory usage. To demonstrate Harmony’s scalability versus
other methods, we downsampled HCA data12 (528,688 cells from
16 donors and two tissues) to create five benchmark datasets with
500,000, 250,000, 125,000, 60,000 and 30,000 cells. We reported
the runtime and memory (Supplementary Tables 2 and 3) for
all benchmarks. Harmony runtime scaled well for all datasets
(Fig. 3a), ranging from 4 min on 30,000 cells to 68 min on 500,000
cells, 30 to 200 times faster than MultiCCA and MNN Correct.
The runtimes for Harmony, BBKNN and Scanorama were comparable for datasets with up to 125,000 cells. Harmony required
dramatically less memory (Fig. 3b) compared to other algorithms:
only 0.9 gigabytes (GB) on 30,000 cells and 7.2 GB on 500,000 cells.
At 125,000 cells, Harmony required 30–50 times less memory
than Scanorama, MNN Correct and Seurat MultiCCA; these other
methods could not scale beyond 125,000 cells. Harmony returned
substantially more integrated embeddings (Fig. 3c) than did other
competing algorithms (Supplementary Results), allowing for the
identification of shared cell types (Fig. 3d) across tissues and
donors. These results demonstrate that Harmony is computationally efficient and capable of analyzing even large datasets (105–106
cells) on personal computers.

Nature Methods | VOL 16 | December 2019 | 1289–1296 | www.nature.com/naturemethods

Articles

NATuRe MeTHods
a

b

iLISI measures integration

cLISI measures error

iLISI = 1
iLISI = 2

iLISI = 1.4

iLISI = 1
Poorly integrated
datasets

cLISI = 1

cLISI = 2

iLISI = 1.8

cLISI = 2

Well-integrated
datasets

Different cell types
group together

cLISI = 1
Different cell types
group separately

Dataset Cell type

c

d

iLISI
1.00
5.0

UMAP 2

2.5

1.25

1.50

1.75

1.00

1.25

1.50

1.75

2.00

Jurkat cells

HEK293T cells

Pure Jurkat
dataset

Pure
HEK293T
dataset

cLISI

2.00

0.0
50:50 mixture
dataset

−2.5

−5.0
−5

0

5

10

−5

0

5

10

UMAP 1

e

f

iLISI
1.00

1.25

1.50

1.75

−5

0

5

10

cLISI
1.00

2.00

1.25

1.50

1.75

2.00

0

5

10

15

7.5

UMAP 2

5.0
2.5
0.0
−2.5
−5.0
−10

15

−10

−5

UMAP 1

Fig. 2 | Quantitative assessment of dataset mixing and cell-type accuracy with cell-line datasets. a, iLISI measures the degree of mixing among datasets
in an embedding, ranging from 1 in an unmixed space to B, the number of datasets in the analysis, in a well mixed space. b, cLISI measures integration
accuracy using the same formulation but computed on cell-type labels instead. An accurate embedding has a cLISI close to 1 for every neighborhood,
reflecting separation of different cell types. n = 3,255 Jurkat and n = 2,859 human embryonic kidney 293T (HEK293T) cells from pure (purple and yellow)
cell-line datasets and n = 1,799 Jurkat and n = 1,565 HEK293T cells from a mixed (green) cell-line dataset were analyzed together. c,d, Before Harmony
integration, cells grouped by dataset (c) and known cell type (d). iLISI (c) and cLISI (d) were computed for every cell’s neighborhood and summarized with
density plots. e,f, After Harmony integration, cells from the mixture dataset are mixed into the other datasets (e), achieved by mixing Jurkat with Jurkat
cells and HEK293T with HEK293T cells (f). iLISI (e) and cLISI (f) were re-computed in the Harmony embedding.

Identification of broad and fine-grained PBMC subpopulations.
To assess how Harmony might perform under more challenging scenarios, we gathered three datasets of human PBMCs, each
assayed on the Chromium 10X platform but prepared with different protocols: 3′ end v1 (3pV1), 3′ end v2 (3pV2) and 5′ end (5p)
chemistries. After pooling all the cells together, we performed a
joint analysis. Before integration, cells group primarily by dataset
(Fig. 4a, median iLISI 1.00, 95% CI [1.00, 1.00]).

Harmony clustered the cells into biologically coherent groups
(Supplementary Fig. 6a) and removed dataset-specific variation
within each cluster. In the final integrated space, the datasets are
well mixed (Fig. 4b, median iLISI 1.96, 95% CI [1.36, 2.56]), more so
than with other methods (median iLISI 1.02, 95% CI [1.00, 2.11]).
We confirmed that >83% of cells had a significantly (false discovery
rate (FDR) < 5%) higher iLISI value in Harmony than with any other
algorithm (Supplementary Fig. 6b,c). To assess accuracy, within each

Nature Methods | VOL 16 | December 2019 | 1289–1296 | www.nature.com/naturemethods

1291

Articles

NATuRe MeTHods

a

b

Total runtime
50

Maximum memory usage
120

MultiCCA

MultiCCA
MNN Correct

Memory (GB)

Time (h)

40
MNN Correct

30
20

Scanorama

10

Scanorama

64

32

0
30,000

c

60,000

125,000
250,000
Number of cells

Harmony

16
8
0

Harmony

BBKNN

BBKNN

500,000

30,000

60,000

125,000

250,000

d

Tissue mixing after Harmony (500,000 cells)

Cell types after Harmony (500,000 cells)

Cord blood

10

500,000

Number of cells

Naive B

Bone marrow

pDC

5

Mem B

pre−B

HSC/MPP

Eryth

Plasma

UMAP 2

CD4 naive
Mk

0

DC

GMP

Treg
pre−T

PPBP mono

−5

CD8 naive

CD4 mem

CD14 mono

CD8 effector

CD16 mono

−10

NK

−10

−5

0

5

10

−10

−5

0

5

10

UMAP 1

Fig. 3 | Computational efficiency benchmarks. BBKNN, Scanorama, MNN Correct and MultiCCA are compared on five downsampled HCA datasets of
increasing sizes. a,b, Total runtime (a) and maximum memory (b) required to analyze each dataset are shown. Scanorama, MultiCCA and MNN Correct
were terminated for excessive memory requests on the 250,000 and 500,000 cell datasets. c, The mixing between tissues in the Harmony embedding of
the 500,000 cell analysis is visualized here. This includes n = 239,794 cells from cord blood of n = 8 independent donors and n = 260,206 cells from bone
marrow of n = 8 independent donors. d, In the Harmony embedding, clustered cell populations are labeled by canonical markers: pre-T cells, CD4 naive
T cells, CD4 memory T cells, T-regs, CD8 naive T cells, CD8 effector T cells, NK cells, pre-B cells, naive B cells, memory B cells, plasma cells, plasmacytoid
dendritic cells (pDC), conventional dendritic cells (DC), granulocyte macrophage progenitor (GMP), CD16− monocytes (CD14 mono), CD16+ monocytes
(CD16 mono), a population of monocytes also positive for megakaryocyte markers (PPBP mono), megakaryocytes (Mk), erythroid progenitors (Eryth) and
a cluster of hematopoietic stem cells and multipotent progenitor cells (HSC/MPP).

dataset, we separately annotated (Methods) broad cell clusters with
canonical markers of major expected populations (Supplementary
Fig. 6d): monocytes (CD14+ or CD16+), dendritic cells (FCER1A+),
B cells (CD20+), T cells (CD3+), megakaryocytes (PPBP+) and natural killer (NK) cells (CD3−/GNLY+) before clustering. We observed
that Harmony (median cLISI 1.00, 95% CI [1.00, 1.15]) retained
differences among cell types, similar to other integration methods
(median cLISI 1.00, 95% CI [1.00, 1.28]). The greater dataset integration, compared to other algorithms, affords a unique opportunity to identify fine-grained cell subtypes (Supplementary Fig. 6e).
Using canonical markers (Fig. 4e), we identified shared subpopulations of cells (Fig. 4f) including naive CD4 T (CD4+/CCR7+),
effector memory CD4 T (CD4+/CCR7−), Treg (CD4+/FOXP3+),
memory CD8 (CD8+/GZMK−), effector CD8 T (CD8+/GZMK+),
naive B (CD20+/CD27−) and memory B cells (CD20+/CD27+). In
the embeddings produced by other algorithms, the median iLISI
failed to exceed 1.1 (Supplementary Table 5). Accordingly, the
subtypes identified above reside in dataset-specific, rather than
dataset-mixed clusters (Supplementary Fig. 7). We were able to
maintain high quality results even as we downsampled whole populations of cells to create imbalanced datasets with nonoverlapping
cell types (Supplementary Results and Supplementary Figs. 8–10).
In both the full and downsampled PBMC datasets, Harmony was
1292

robust to the choice of parameters, particularly the diversity penalty
(Supplementary Results and Supplementary Fig. 11). These results
show that Harmony successfully accounts for technical variation
among different protocols and integrates many different cell types
while preserving large-scale and fine-grained structures in the data,
even with nonoverlapping populations among datasets.
Simultaneous integration across donors and technologiesidentifies rare pancreas islet subtypes. We considered a more complex experimental design, in which integration must be performed
simultaneously over more than one source of variation. We gathered
human pancreatic islet cells from five independent studies13–17, each
generated with a different technological platform. Integration across
these platforms is particularly challenging because of the large
spread in number of cells per dataset and number of unique genes
measured in each cell (Supplementary Fig. 12). Moreover, within
two of the datasets13,14, the authors noted considerable donor-specific effects. A successful integration of these studies must account
for the effects of both different technologies and the 36 donors used
in these studies. These effects may affect cell types in different ways.
Harmony is the only currently available integration algorithm specifically designed for single-cell data that is able to explicitly integrate over more than one variable.

Nature Methods | VOL 16 | December 2019 | 1289–1296 | www.nature.com/naturemethods

Articles

NATuRe MeTHods
a

b

Before integration
3ʹ V1

10

5ʹ

0

UMAP 2

UMAP 2

10

After Harmony integration

0
−5

3ʹ V2

−10

5

−10
−15

−10

−5

0

5

−10

10

−5

UMAP 1

c

Percentage cell
expressing

CD4
CCR7
GZMA
FOXP3
CD8A
MS4A1
CD27

0

5

UMAP 1

d
10

CD8T

CD4 naive

0

5

50

cd4naive
cd4naive
cd4naive
cd4mem
cd4mem
cd4mem
treg
treg
treg
cd8eff
cd8eff
cd8eff
bnaive
bnaive
bnaive
bmem
bmem
bmem

75

UMAP 2

25

0

Treg
CD4 mem

pDC
CD16+ monocyte
Megakaryocyte

Monocyte

HSC

aDC

CD8 eff

−5

NK

−10

B naive
−10

−5

B mem
0

5

UMAP 1

Fig. 4 | Fine-grained subpopulation identification in PBMCs across technologies. Three PBMC datasets were assayed with 10X, using different library
construction protocols: 5′ (orange, n = 7,697 cells), 3′ V1 (purple, n = 4,809 cells) and 3′ V2 (green, n = 8,380 cells). a, Before integration, cells group by
dataset. b, After Harmony integration, datasets are mixed together. c,d, Using canonical markers (c), we identified (d) five shared subtypes of T cell and
two shared subtypes of B cell.

As before, we assess cell-type accuracy cLISI with canonical cell
types identified independently within each dataset (Supplementary
Fig. 13a): alpha (GCG+), beta (MAFA+), gamma (PPY+), delta
(SST+), acinar (PRSS1+), ductal (KRT19+), endothelial (CDH5+),
stellate (COL1A2+) and immune (PTPRC+). Since there are two
integration variables, we assess both donor iLISI and technology iLISI. Before integration, PCA separates cells by technology
(Fig. 5a, median iLISI 1.00 95% CI [1.00, 1.06]), donor (Fig. 5b,
median iLISI 1.42 95% CI [1.00, 5.50]) and cell type (median cLISI
1.00 95% CI [1.00, 1.48]). The wide range of donor iLISI reflects
that in the CEL-seq, CEL-seq2 and Fluidigm C1 datasets, many
donors were well mixed before integration. Harmony integrates
cells (Supplementary Fig. 13b) by both technology (Fig. 5c, median
iLISI 2.17 95% CI [1.02, 3.91]) and donor (Fig. 5d, median iLISI
5.05 95% CI [1.24, 10.05]), while merging cells correctly according to previously defined types (accuracy >98%, Supplementary
Fig. 13c). We also benchmarked these data with BBKNN,
Scanorama, MNN Correct and MultiCCA, as well with limma18,
which can correct over multiple levels. Only Harmony was able to
mix both the donors and technologies substantially (Supplementary
Results and Supplementary Fig. 14).
Harmony was able to discern rare (<2%) cell subtypes (Fig.
5e) across the five datasets (Fig. 5f). We labeled previously
described subtypes using canonical markers: activated stellate cells (PDGFRA+), quiescent stellate cells (RGS5+), mast cells
(BTK+), macrophages (C1QC+) and beta cells under endoplasmic reticulum (ER) stress (Fig. 5g). Beta ER stress cells may represent a dysfunctional population, marked by ER stress genes
DDIT3 (FDR = 2.1 × 10−187), ATF3 (FDR = 3.2 × 10−79), ATF4
(FDR = 1.9 × 10−76) and PPP1R15A (FDR = 1.8 × 10−145). This
cluster has significantly lower expression of genes key to beta
cell identity19 and function20: PDX1 (FDR = 1.1 × 10−24), MAFA

(FDR = 4.0 × 10−30), INSM1 (FDR = 3.2 × 10−4) and NEUROD1
(FDR = 9.6 × 10−42) (Fig. 5h). Further, Sachdeva et al.21 suggest that
PDX1 deficiency makes beta cells less functional and exposes them
to ER stress induced apoptosis.
We also observed an alpha cell subset that to our knowledge,
has not been previously described. This cluster was also enriched
with genes involved in ER stress (Fig. 5i, DDIT3 (FDR = 2.6 × 10−29),
ATF3 (FDR = 5.5 × 10−55), ATF4 (FDR = 2.4 × 10−74) and PPP1R15A
(FDR = 6.7 × 10−113)). Similar to the beta ER stress population,
these alpha cells also expressed significantly lower levels of genes
necessary for proper function22,23: GCG (FDR = 4.2 × 10−13),
ISL1 (FDR = 1.0 × 10−22), ARX (FDR = 3.8 × 10−26) and MAFB
(FDR = 1.0× 10−71) (Fig. 5j). A recent study24 reported ER stress in
alpha cells in mice and linked the stress to dysfunctional glucagon
secretion. Moreover, we found that the proportions of alpha and
beta ER stress cells are significantly correlated (Spearman r = 0.46,
P = 0.004, Fig. 5k) across donors in all datasets. These results suggest alpha cell injury might parallel beta cell dysfunction in humans
during diabetes25. This rare population was nearly impossible to
identify in the original studies, because it was either too rare or
obscured by donor-specific variation (Supplementary Fig. 15).
Harmony integrates time course developmental trajectories. We
next evaluated Harmony’s ability to integrate datasets with smooth
trajectories, rather than discrete cell types. We analyzed 15,875
cells from eight time points of mouse hematopoiesis26, from E6.75
to E8.5 and mixed gastrulation. This dataset presents a new challenge to Harmony: each sample was taken at a different developmental time point. Thus, no sample has all cell types represented
(Supplementary Fig. 16a). Cells initially grouped by sample ID
rather than by cell type (Supplementary Fig. 16b). After Harmony
integration, samples were well mixed and grouped largely by

Nature Methods | VOL 16 | December 2019 | 1289–1296 | www.nature.com/naturemethods

1293

Articles

NATuRe MeTHods
Before integration

a

c

Colored by donor

d

Colored by technology

Colored by donor

Smart−seq2

CEL−seq2
10

After Harmony

b

Colored by technology

10

CEL−seq

UMAP 2

5
0

0

−5
−10

−10
inDrop

Fluidigm C1
−10

0

−10

10

0

10

−10

0

10

−10

0

10

UMAP 1

f
Beta
Alpha

10

0

Beta ER stress

UMAP 2

Delta

40%

Mast
Macrophage

−5

Activated stellate

0
UMAP 1

10

Quiescent stellate
Mast cell
Macrophage
Beta ER stress
Alpha ER stress
Activated stellate

Ductal
−10

Key alpha
function genes

ATF3

2

3

0

1

2

0

1

2

3

4

3

0 1 2 3 4

MAFA

ATF4

0 1 2 3 4 5

DDIT3

0

1

NEUROD1

2

5

3

1

2

3

4

5

0

1

PPP1R15A

0 1 2 3 4 5

2

0

1

2

3

GCG

4

0

2.5

5

DDIT3

3

4

0

PDX1

0

1

2

3

1

2

7.5
ISL1

3

0

1

2

PPP1R15A

0

1

2

3

k
25

ARX

20%
0%

Acinar

j

Alpha ER
stress genes

INSM1

ATF4

0

−10

1

60%

Quiescent stellate

Gamma

i

Key beta
function genes

80%

Alpha ER stress
0

h

Beta ER
stress genes
ATF3

Endothelial

Density

5

g

100%

Percentage beta
cells ER stress

e

r = 0.46

20

P = 0.004

15
10
5
0
0

5
10
15
20
Percentage alpha
ER stress cells

3

MAFB

4

0

1

2

3

4

log2(CP10K + 1)

Fig. 5 | Integration of pancreatic islet cells by both donor and technology. The n = 14,746 Human pancreatic islet cells from 36 donors were assayed
on five different technologies: inDrop (n = 4 donors, n = 8,569 cells), Fluidigm C1 (n = 13 donors, n = 638 cells), Smart-Seq2 (n = 10 donors, n = 2,355
cells), CEL-seq (n = 5 donors, n = 946 cells) and CEL-seq2 (n = 4 donors, n = 2,238 cells). a,b, Cells initially group by (a) technology, denoted by different
colors and (b) donor, denoted by shades of colors. c,d, Harmony integrates cells simultaneously across (c) technology and (d) donor. e, Clustering in the
Harmony embedding. f, Proportion of the different cell types. g,h, We compare the log expression (normalized to counts-per-ten-thousand, CP10K) of
ER stress genes (g) and beta endocrine function genes (h) in the beta ER stress cell population (green, n = 306 cells) to that in the beta cell population
(orange, n = 3,374 cells). i,j, Similarly, we compare the expression ER stress genes (i) and alpha endocrine function genes (j) in the alpha ER stress
cell population (purple, 392 cells) to the alpha cell population (green, n = 3,978 cells). Differential expression analysis was performed with two-tailed
moderated t-tests. FDR was computed using the Benjamini–Hochberg procedure. k, The abundances of the two ER stress populations were correlated
across n = 36 donors. Correlation was computed using Spearman’s rho and the nomial P value estimated with algorithm AS 89. Only one test was
performed, so no multiple hypothesis correction was applied.

previously defined cell type (Supplementary Fig. 16c). Harmony
preserved the continuous structure of the developmental cells
rather than erroneously clustering cells into discrete groups.
Harmony’s soft clustering was critical to preserve the smoothness:
it assigned cells from transitional populations to more clusters and
cells in more terminal populations to fewer clusters (Supplementary
Fig. 16d). With corrected Harmony embeddings, we performed
trajectory analysis with DDRTree (Supplementary Fig. 16e). We
recovered a branching trajectory structure that correctly captures
the progression from common mesoderm and hematoendothelial
progenitor populations to differentiated endothelial and erythroid
populations. Harmony preserved the separation between the two
blood progenitor populations but not in the three erythroid populations (Supplementary Fig. 16e). This suggests that these distinct
erythroid populations may be an artifact of batch rather than biology. Benchmarking this analysis, (Supplementary Results), we
found that except for MNN Correct, other algorithms failed to
maintain smoothness or incorrectly mixed distinct progenitor populations (Supplementary Fig. 17).
Harmony integrates dissociated scRNAseq with spatially
resolved datasets. Now we consider the integration of cells measured with different modalities, measuring different aspects of
1294

a cell’s biology. This type of integration is especially powerful, in
that it allows inference of features measured in one modality but
not the other. Harmony integrates a spatially resolved multiplexed
error robust fluorescence in situ hybridization (MERFISH) dataset with a dissociated scRNAseq dataset from the mouse hypothalamic preoptic region (Bregma +0.5 to −0.6 mm), using 154 shared
genes (Fig. 6a). From the embedding, we infer the spatial localization of unmeasured genes to identify previously unreported spatial
patterns in neuronal transcription factors. We downloaded two
datasets published by Moffit et al.27,28: 30,370 cells from six mice
measured with dissociated scRNAseq (10×) and 64,373 cells from
one mouse measured with MERFISH. The spatially resolved dataset is a combination of 135 genes assayed with MERFISH and 20
assayed with two-color single molecule FISH (smFISH). The dissociated 10× data gives information about the full transcriptome
(22,067 genes post quality control) but without any spatial information. In contrast, MERFISH was applied in this dataset to only a
targeted number of genes. The primary challenge in this analysis is
the limited number of overlapping features. Successful integration
will merge similar cell types and allow inference of spatial patterns
of the remaining 21,913 genes.
Before Harmony, cells group primarily by modality (median iLISI
1.00, 95% CI [1.00, 1.04], see Supplementary Fig. 18a). Harmony

Nature Methods | VOL 16 | December 2019 | 1289–1296 | www.nature.com/naturemethods

Articles

NATuRe MeTHods
a

b Harmony integrated space

Dissociated scRNAseq
(30,370 cells, 22,067 genes)

10
0

10X
Cells
enzyme

Oil

−20

Spatially resolved transcriptomics
(64,373 cells, 155 genes)

10
0
MERFISH

−10
−20
−20

−10

0

UMAP 2

Hypothalamic
preoptic region

−10

Astrocytes
Endothelial
Ependymal
Excitatory
Fibroblast
Immature
oligodendrocyte
Inhibitory
Macrophage
Mature
oligodendrocyte
Microglia
Mural/pericyte
Newly formed
oligodendrocyte

10

UMAP 1

Fig. 6 | Harmony integrates spatially resolved transcriptomic with dissociated scRNAseq datasets. a, Cells from the hypothalamic preoptic region of
mouse brain were assayed in parallel with two technologies. The full transcriptome of n = 30,370 dissociated cells from n = 6 animals was profiled with
10X. 155 genes were profiled in situ for n = 64,373 cells from n = 1 animal on intact tissue with MERFISH. Scale bar, 2 mm. b, Harmony integrated cells from
the two modalities into a shared embedding, correctly merging the 12 previously identified cell types.

mixes the two modalities, 10× and MERFISH (median iLISI 1.15,
95% CI [1.00, 1.99], see Fig. 6b and Supplementary Fig. 18b) and
merges cells based on their previous cell-type labels, covering 12
major neuronal and glial populations (93.7% prediction accuracy
after integration versus 94% before integration, see Supplementary
Fig. 18c). As an independent validation, we show that the Harmony
embedding agrees with the fine-grained cluster analysis in the original paper (Supplementary Results and Supplementary Fig. 18d,e).
We next used kernel k-nearest-neighbor (kNN) imputation to
predict gene expression in MERFISH dataset cells, based on their
nearest 10X dataset neighbors (Methods). We used fivefold crossvalidation to evaluate performance and found that we predicted
150 of 154 genes accurately, performing consistently better on
highly expressed genes (Supplementary Results and Supplementary
Fig. 18f). The original paper identified key transcription factors that
distinguished spatially distributed neuronal subtypes. After applying
Harmony, we can easily identify spatially autocorrelated transcription factors. We measured the spatial autocorrelation (Moran’s I)
of all known vertebrate transcription factors documented in
JASPAR29 (Supplementary Tables 6 and 7). As expected, all factors
associated with a neuronal subtype were significantly (FDR < 5%)
spatially autocorrelated. In addition, we found 50 transcription
factors significantly (FDR < 5%, Moran’s I > 0.1) autocorrelated in
excitatory neurons and 37 in inhibitory neurons. We confirmed
that these new factors were positive markers for at least one neuronal subtype. In fact, spatial correlation was significantly correlated
(Spearman’s r = 0.71, P = 2.6 × 10−62) with the area under the curve
of that gene’s best subtype (Supplementary Fig. 18g).
We next followed up on Satb1, a chromatin organizer linked with
survival of mouse cortical interneurons30, as it was strongly autocorrelated in inhibitory neurons (Moran’s I = 0.44). The predicted Satb1
expression was localized to anterior slices (Supplementary Fig. 18h).
To validate this localization, we compared predicted expression to
matched images of Satb1 expression from the Allen Brain Atlas31.
These images were chosen for similar anterior–posterior position
(between Bregma +0.8 to −0.2 mm) and ventricular morphology
(in green). The measured Satb1 expression follows qualitatively
similar patterns of predicted expression.
These 154 genes were carefully selected by the authors to be
biologically relevant to cell types in the hypothalamic preoptic
region. Using correlation informed downsampling, we were able
to achieve similar results with only 60 representative anchor genes
(Supplementary Results and Supplementary Fig. 18i,j).

Discussion

Harmony accepts a matrix of cells and covariate labels for each cell
as input. Its output is a matrix of adjusted coordinates with the same
dimensions as the input matrix. Hence, Harmony should be applied
first before a full analysis pipeline is employed. Downstream analyses, such as clustering, trajectory analysis and visualization, can then
use adjusted Harmony embeddings. Harmony does not alter the
expression values of individual genes to account for dataset-specific
differences. We recommend using a batch-aware approach, such as
mixed effects linear models, for differential expression analysis.
Harmony uses clusters to partition the large nonlinear embeddings in which cells sit into smaller linear regions. With discrete clustering, it is possible to over-discretize the space and
lose smooth transitions between populations (for examples, see
Supplementary Note 3). Harmony avoids over-discretization by
using soft clustering. With this strategy, in the mouse embryogenesis dataset, Harmony effectively modeled both terminal populations and transition states, retaining smooth transitions and
bifurcation events from a common progenitor into endothelial and
hematopoietic lineages.
With the advent of high throughput single-cell technologies,
it is now possible to assay different interacting facets of a cell’s
biology. Effective synthesis of datasets across multiple modalities
may help reveal such interactions. Harmony embedded spatially
resolved MERFISH cells with dissociated 10× cells, despite limited
overlap of genes and dramatically different detection efficiencies
between these technologies. Consequently, we were able to analyze
spatial patterns of gene expression on genes that were never measured with MERFISH. The ability of Harmony to integrate 10X and
MERFISH and impute unmeasured gene expression relies on the
quality of the overlapping set of anchor genes. Selecting the optimal anchor genes is an important open question. From our crossvalidation and minimal gene set analyses, we found two trends that
may help guide spatial probeset design. First, we can remove correlated genes from the anchors with minimal loss in integration
quality. Second, less abundant genes are more difficult to predict.
Thus, a spatial probeset containing genes with lower abundance
that are also maximally nonredundant should enable imputation of
a large number of genes through Harmony with an appropriately
matched scRNAseq dataset.
It is common practice to apply batch-sensitive gene preprocessing steps before application of single-cell integration algorithms.
In particular, some investigators scale gene expression values

Nature Methods | VOL 16 | December 2019 | 1289–1296 | www.nature.com/naturemethods

1295

Articles

NATuRe MeTHods

within datasets separately, before pooling cells into a single matrix.
While this strategy may make it easier to integrate certain datasets
(Supplementary Results and Supplementary Fig. 19a,b) in which all
cell populations are present across all datasets, it may increase error
when datasets consist of overlapping but not identical populations
(Supplementary Fig. 19c). Hence, we do not use this scaling strategy
in this article. In our analysis pipelines, we avoid all batch-sensitive
preprocessing steps. We simply concatenate the data and perform
PCA on the combined dataset.
Harmony models and removes the effects of known sources
of variation. However, unwanted variation may also arise from
unknown sources. Methods such as SVA32 and PEER33 infer and
remove latent sources of variation with linear models in bulk transcriptomics. In future work, we plan to extend Harmony to identify
and remove unwanted latent effects.
With the rise of automated classification algorithms, it is feasible
that a user may be able to assign information about cell types before
integration. For example, in the past, our group has used a classifier
to define soft or hard identities to assign probabilistic classifications
to single cells with linear discriminant analysis3,34. Harmony has an
advanced option to initialize soft cluster assignments using such
probabilistic cell-type assignments (see HarmonyMatrix). If the previous assignments are helpful, Harmony will converge to an accurate
solution faster. If the previous assignments are inaccurate, Harmony
can reassign cells appropriately during the clustering steps.
The Harmony framework lays the groundwork for several exciting future applications. First, it can be extended to accurately model
gene counts. This will allow users to apply Harmony preprocessing for methods that require full gene expression profiles, rather
than low-dimensional cell embeddings, such as RNA velocity35.
Next, we envision specializing Harmony to quickly map cells to a
multi-billion cell reference that covers a comprehensive set of tissues, organisms and clinical conditions. This application will enable
rapid comparison of cells from a single experiment against a larger
reference, and in the process annotate known and new cell types
and states in seconds.

Online content

Any methods, additional references, Nature Research reporting
summaries, source data, extended data, supplementary information, acknowledgements, peer review information, details of author
contributions and competing interests, and statements of code and
data availability are available at https://doi.org/10.1038/s41592-0190619-0.
Received: 5 November 2018; Accepted: 6 September 2019;
Published online: 18 November 2019

References

1. Svensson, V., Vento-Tormo, R. & Teichmann, S. A. Exponential scaling of
single-cell RNA-seq in the past decade. Nat. Protocols 13, 599–604 (2018).
2. Regev, A. et al. The human cell atlas. eLife 6, e27041 (2017).
3. Zhang, F. et al. Defining inflammatory cell states in rheumatoid arthritis joint
synovial tissues by integrating single-cell transcriptomics and mass cytometry.
Nat. Immunol. 20, 928–942 (2019).
4. Arazi, A. et al. The immune cell landscape in kidneys of lupus nephritis
patients. Nat. Immunol. 20, 902–914 (2019).
5. Der, E. et al. Tubular cell and keratinocyte single-cell transcriptomics
applied to lupus nephritis reveal type I IFN and fibrosis relevant pathways.
Nat. Immunol. 20, 915–927 (2019).
6. Hicks, S. C., Townes, F. W., Teng, M. & Irizarry, R. A. Missing data and
technical variability in single-cell RNA-sequencing experiments. Biostatistics
19, 562–578 (2017).

1296

7. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating
single-cell transcriptomic data across different conditions, technologies, and
species. Nat. Biotechnol. 36, 411–420 (2018).
8. Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in
single-cell RNA-sequencing data are corrected by matching mutual nearest
neighbors. Nat. Biotechnol. 36, 421–427 (2018).
9. Hie, B. L., Bryson, B. & Berger, B. Efficient integration of heterogeneous
single-cell transcriptomes using Scanorama. Nat. Biotechnol. 37,
685–691 (2018).
10. Polanski, K. et al. BBKNN: fast batch alignment of single cell transcriptomes.
Bioinformatics https://doi.org/10.1093/bioinformatics/btz625 (2019).
11. Zheng, G. X. Y. et al. Massively parallel digital transcriptional profiling of
single cells. Nat. Commun. 8, 14049 (2017).
12. Li, B. et al. HCA Data Portal: census of immune cells (Human Cell Atlas,
2019).
13. Segerstolpe, A. et al. Single-cell transcriptome profiling of human pancreatic
islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).
14. Baron, M. et al. A single-cell transcriptomic map of the human and mouse
pancreas reveals inter- and intra-cell population structure. Cell Syst. 3,
346–360 (2016).
15. Lawlor, N. et al. Single-cell transcriptomes identify human islet cell
signatures and reveal cell-type-specific expression changes in type 2 diabetes.
Genome Res. 27, 208–222 (2017).
16. Grun, D. et al. De novo prediction of stem cell identity using single-cell
transcriptome data. Cell Stem Cell 19, 266–277 (2016).
17. Muraro, M. J. et al. A single-cell transcriptome atlas of the human pancreas.
Cell Syst. 3, 385–394 (2016).
18. Ritchie, M. E. et al. Limma powers differential expression analyses for
RNA-sequencing and microarray studies. Nucleic Acids Res. 43,
e47 (2015).
19. Gao, T. et al. Pdx1 maintains β cell identity and function by repressing an α
cell program. Cell Metab. 19, 259–271 (2014).
20. Jia, S. et al. Insm1 cooperates with neurod1 and foxa2 to maintain mature
pancreatic β-cell function. EMBO J. 34, 1417–1433 (2015).
21. Sachdeva, M. M. et al. Pdx1 (MODY4) regulates pancreatic beta
cell susceptibility to ER stress. Proc. Natl Acad. Sci. USA 106,
19090–19095 (2009).
22. Katoh, M. C. et al. MafB is critical for glucagon production and secretion in
mouse pancreatic α cells in vivo. Mol. Cell. Biol. 38, e00504–e00517 (2018).
23. Liu, J. et al. Islet-1 regulates arx transcription during pancreatic islet α-cell
development. J. Biol. Chem. 286, 15352–15360 (2011).
24. Akiyama, M. et al. X-box binding protein 1 is essential for insulin regulation
of pancreatic α-cell function. Diabetes 62, 2439–2449 (2013).
25. Burcelin, R., Knauf, C. & Cani, P. D. Pancreatic alpha-cell dysfunction in
diabetes. Diabetes Metab. 34, S49–S55 (2008).
26. Pijuan-Sala, B. et al. A single-cell molecular map of mouse gastrulation and
early organogenesis. Nature 566, 490–495 (2019).
27. Moffitt, J. R.et al. Molecular, spatial, and functional single-cell profiling of the
hypothalamic preoptic region. Science 362, eaau5324 (2018).
28. Moffitt, J. et al. Data from: Molecular, Spatial and Functional Single-cell
Profiling of the Hypothalamic Preoptic Region (Dryad, Dataset, 2018);
https://doi.org/10.5061/dryad.8t8s248
29. Khan, A. et al. JASPAR 2018: update of the open-access database of
transcription factor binding profiles and its web framework. Nucleic Acids
Res. 46, D260–D266 (2018).
30. Close, J. et al. Satb1 is an activity-modulated transcription factor required for
the terminal differentiation and connectivity of medial ganglionic eminencederived cortical interneurons. J. Neurosci. 32, 17690–17705 (2012).
31. Lein, E. S. et al. Genome-wide atlas of gene expression in the adult mouse
brain. Nature 445, 168–176 (2007).
32. Leek, J. T. & Storey, J. D. Capturing heterogeneity in gene expressionstudies
by surrogate variable analysis. PloS Genet. 3, e161 (2007).
33. Stegle, O., Parts, L., Piipari, M., Winn, J. & Durbin, R. Using probabilistic
estimation of expression residuals (PEER) to obtain increased power
and interpretability of gene expression analyses. Nature Protocols 7,
500–507 (2012).
34. Mizoguchi, F. et al. Functionally distinct disease-associated fibroblast subsets
in rheumatoid arthritis. Nat. Commun. 9, 789 (2018).
35. Manno, G. L. et al. RNA velocity of single cells. Nature 560, 494–498 (2018).
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in
published maps and institutional affiliations.
© The Author(s), under exclusive licence to Springer Nature America, Inc. 2019

Nature Methods | VOL 16 | December 2019 | 1289–1296 | www.nature.com/naturemethods

Articles

NATuRe MeTHods
Methods

Harmony. Overview. The Harmony algorithm inputs a PCA embedding (Z) of cells,
along
  with their batch assignments (ϕ), and returns a batch corrected embedding
^ . This algorithm, summarized as Algorithm 1 below, iterates between two
Z
complementary
stages: maximum diversity clustering (Algorithm 2) and a mixture
I
model based linear batch correction (Algorithm 3). The clustering step uses the
^ to compute a soft assignment of cells to clusters,
batch corrected embedding Z
encoded in the matrix R. The correction step uses these soft clusters to compute
a new corrected embedding from the original one. Efficient implementations of
Harmony, including the clustering and correction subroutines, are available as part
of an R package at https://github.com/immunogenomics/harmony.
^ to regress out confounder
Note that the correction procedure uses Z, not Z
effects. In this way, we restrict correction to a linear model of the original
^ of the last iteration
embedding. An alternative approach would use the output Z
^ would be the result of a
as input to the correction procedure. Thus, the final Z
series of linear corrections of the original embedding. While this allows for more
expressive transformations, we found that in practice, this can over correct the
data. Our choice to limit the transformation reflects the notion in the introduction.
Namely, if we had perfect knowledge of the cell types before correction, we would
linearly regress out batch within each cell type.
Algorithm 1 Harmony
function harmonize (Z, ϕ)
^
Z
   Z
I
  repeat
^ ϕ)
   R ← cluster ( Z;
^ ← correct (Z,
I R, ϕ)
   Z
  until convergence
^
  return Z
Glossary. For reference, we define all data structures used in all Harmony functions.
For each one, we define its dimensions and possible values, as well as an intuitive
description of what it means in context. The dimensions are stated in terms of d,
the dimensionality of the embedding (for example, number of PCs); B, the number
of batches; N, the number of samples; Nb, the number of samples in batch b; and K,
the number of clusters.
Z 2 Rd ´ N The input embedding, to be corrected in Harmony. This is often
PCAIembeddings of cells.
^ 2 Rd ´ N The integrated embedding, output by Harmony.
Z
RI 2 ½0; 1K ´ N The soft cluster assignment matrix of cells (columns) to clusters
I Each column is a probability distribution and thus sums to 1.
(rows).
ϕ 2 f0; 1gB ´ N One-hot assignment matrix of cells (columns) to batches (rows).
I b 2 ½0; 1B Frequency of batches.
Pr
OI 2 ½0; 1K ´ B The observed co-occurence matrix of cells in clusters (rows) and
I (columns).
batches
E 2 ½0; 1K ´ B The expected co-occurence matrix of cells in clusters and batches,
I the assumption of independence between cluster and batch assignment.
under
Y 2 ½0; 1d ´ K Cluster centroid locations in the k-means clustering algorithm.
I
Assumptions about input data. In this article, we always use Harmony on a lowdimensional embedding of the cells. To be clear about the properties of this lowdimensional space, we explicitly state three assumptions:
1.

2.

3.

Cells are embedded into a low-dimensional space as the result of PCA. The
PCA embedding captures the variation of gene expression in a compact
orthonormal space. For this reason, the default input to Harmony is now a
matrix of gene expression data, normalized for library size. We then perform
PCA on this high-dimensional matrix and use the eigenvalue-scaled eigenvectors as the low-dimensional embedding input to Harmony.
Gene expression has been normalized for library size. In RNA-seq, each cell
will be sequenced to a different depth, which results in different library sizes
for each cell. It is best practice to account for this source of technical variation
before performing PCA. In this article, we use the standard log countsper-10,000 (CP10K) transformation, described in the Methods. As a result of
this depth transformation, expression values are turned into relative frequencies inside each cell. Thus, it is impossible for every gene to be upregulated in
one group of cells.
The low-dimensional nearest-neighbor structure induced by Euclidean
distance should be preserved with common similarity metrics such as cosine
similarity and correlation. This can be easily checked by computing sparse
nearest-neighbor graphs and comparing the adjacency matrices. Cells with
less than 20% overlapping neighbors can be removed as outliers. A common
way to violate this assumption is to simulate cells around the origin (that is,
all embeddings equal 0). We do not find this to be the case in real scRNAseq
data. This assumption is common to integration methods that use cosine
distance to compare cells, such as MNN Correct and Scanorama.

Maximum diversity clustering. We developed a clustering algorithm to maximize
the diversity among batches within clusters. We present this method as follows.
Nature Methods | www.nature.com/naturemethods

First, we review a previously published objective function for soft k-means
clustering. We then add a diversity maximizing regularization term to this
objective function, and derive this regularization term as the penalty on statistical
dependence between two random variables: batch membership and cluster
assignment. We then derive and present pseudocode for an algorithm to optimize
the objective function. Finally, we explain key details of the implementation.
Background: entropy regularization for soft k-means. The basic objective function
for classical k-means clustering, in which each cell belongs to exactly one cluster, is
defined by the distance from cells to their assigned centroids.
X
min
Rki kZi � Yk k2
ð1Þ
R;Y
i;k

s:t:8i 8k Rki 2 f0; 1g

Above, Z is some feature space of the data, shared by centroids Y. Rk,i can
take values 0 or 1, denoting membership of cell i in cluster k. To transform
this into a soft clustering objective, we follow the direction of ref. 36 and add an
entropy regularization term over R, weighted by a hyperparameter σ. Now, Rki
can take values between 0 and 1, so long as for a given cell i, the sum over cluster
memberships Σk Rki equals 1. That is, Ri must be a proper probability distribution
with support [1,K].
X
min
Rki kZi � Yk k2 þσRki log Rki
ð2Þ
R;Y
i;k

s:t: 8i 8k Rki > 0; 8i

K
X
k¼1

Rki ¼ 1

As σ approaches 0, this penalty approaches hard clustering. As σ approaches
infinity, the entropy of R outweighs the data-centroid distances. In this case, each
data point is assigned equally to all clusters.
Objective function for maximum diversity clustering. The full objective function for
Harmony’s clustering builds on the previous section. In addition to soft assignment
regularization, the function below penalizes clusters with low batch-diversity, for
all defined batch variables. This penalty, derived in the following section, depends
on the cluster and batch identities Ω(R, ϕi) = Σi,k Rki log(Oki/Eki)ϕi.
 
X
Oki
ϕ
min
Rki kZi � Yk k2 þσRki log Rki þ σθRki log
ð3Þ
R;Y
Eki i
i;k
s:t: 8i 8k Rki > 0; 8i

K
X
k¼1

Rki ¼ 1

For each batch variable, we add a new parameter θ. Here θ decides the degree
of penalty for dependence between batch membership and cluster assignment.
When ∀fθ = 0, the problem reverts back to (2), with no penalty on dependence. As
θ increases, the objective function favors more independence between batch f and
cluster assignment. As θ approaches infinity, it will yield a degenerate solution. In
this case, each cluster has an equivalent distribution across batch f. However, the
distances between cells and centroids may be large. Finally, σ is added to this term
for notational convenience in the gradient calculations.
We found that this clustering works best when we compute the cosine, rather
than Euclidean distance, between Z and Y. Haghverdi et al.8 showed that the
squared Euclidean distance is equivalent to cosine distance when the vectors are
log2 normalized. Therefore, assuming that all Zi and Yk have a unity log2 norm, the
squared Euclidean distance above can be re-written as a dot product.
 
X
�

Oki
ϕ
min
Rki 2 1 � YkT Zi þ σRki log Rki þ σθRki log
ð4Þ
R;Y
Eki i
i;k
s:t: 8i 8k Rki > 0; 8i

K
X
k¼1

Rki ¼ 1

Cluster diversity score. Here, we discuss and derive the diversity penalty term
Ω(.), defined in the previous section. For simplicity, we discuss diversity with
respect to a single batch variable, as the multiple batch penalty terms are additive
in the objective function. The goal of Ω(.) is to penalize statistical dependence
between batch identity and cluster assignment. In statistics, dependence between
two discrete random variables is typically measured with the χ2 statistic. This test
considers the frequencies with which different values of the two random variables
are observed together. The observed co-occurrence counts (O) are compared to
the counts expected under independence (E). For practical reasons, we do not use

Articles

NATuRe MeTHods

the χ2 statistic directly. Instead, we use the Kullback Leibler divergence (DKL), an
information theoretic distance between two distributions. In this section, we define
the O and E distributions, as well the DKL penalty, in the context of the probabilistic
cluster assignment matrix R.

Next, we use the probability constraint

X

Okb ¼ N

1i2b

i

Okb ¼

X

Rki
Nb

!

k¼1

1¼
Nb
N

1i2b Rki

ð5Þ

i

Ekb ¼ NPrðk; bÞ
Ekb ¼ NPrðkÞPrðbÞ
Ekb ¼ N

X Rki
i

Nb

!

ð6Þ

 
Okb
Obk log
Ekb
b¼1 k¼1

B X
K
X

"
B X
K
N
X
X
b¼1 k¼1

i¼1

#

1i2b Rki log

 
Okb
Ekb

 
N X
K X
B
X
Okb
DKL ðEjjOÞ ¼
1i2b Rki log
Ekb
i¼1 k¼1 b¼1
DKL ðEjjOÞ ¼

 
Okb
ϕ
Rki log
Ekb i
i¼1 k¼1

N X
K
X

ð7Þ

Optimization. Optimization of equation (4) admits an expectation-maximization
framework, iterating between cluster assignment (R) and centroid (Y) estimation.
Cluster assignment R. Using the same strategy as ref. 36, we solve for the optimal
assignment Ri for each cell i. First, we set up the Lagrangian with dual parameter λ
and solve for the partial derivative with respect to each cluster k.
 
K
X
�

Oki
LðRi ; λÞ ¼
Rki 2 1 � YkT Zi þ σRki logRki þ σθRki log
Rki
k¼1
"
!
#
K
X
þλ
Rki � 1
k¼1

 


δLðRi ; λÞ
Oki
þλ
¼ 0 ¼ 2 1 � YkT Zi þ σ þ σlogRki þ σθ log
δRki
Eki
logRki ¼ �

Eki



λ
1


exp � � 1 ¼
PK Oki θ
σ
2ð1�YkT Zi Þ
exp
�
k¼1 Eki
σ

Finally, we substitute expðλ=σ � 1Þ to remove the dependency of Rki on the
I
dual parameter λ.


2ð1�YkT Zi Þ
exp �
σ
1
0 
 2
2 1�Y^T Zi
O^ki
k
A
exp@�

 2
Oki
Eki

PK

E^ki

ð8Þ

σ

The denominator term above makes sure that Ri sums to one. In practice
(Algorithm 2), we compute the numerator and divide by the sum.

Next, we define the KL divergence in terms of R. Note that both O and E
depend on R. However, in the derivation below, we expand one of the O terms.
This serves a functional purpose in the optimization procedure, described later.
Intuitively, in the update step of R for a single cell, we compute O and E on all the
other cells. In this way, we decide how to assign the single cell to clusters given the
current distribution of batches among ation clusters.

DKL ðEjjOÞ ¼

k¼1

 
 �

2 1 � YkT Zi
λ
exp � � 1
exp �
σ
σ

^
k¼1

Nb X
Ekb ¼
Rki
N i

DKL ðEjjOÞ ¼


K 
X
Oki θ

Rki ¼

Nb
N

Rki ¼ 1 to solve for expðλ=σ � 1Þ.
I

K I
X
1¼
Rki

Okb ¼ NPrðk; bÞ
Okb ¼ NPrðkjbÞPrðbÞ

K
P

k¼1



 
2 1 � YkT Zi
Oki
λ
�
� 1 � θlog
σ
Eki
σ

 
 θ  

2 1 � YkT Zi
Oki
λ
exp � � 1
Rki ¼
exp �
σ
Eki
σ

Centroid estimation Y. Our clustering algorithm uses cosine distance instead
of Euclidean distance. In the context of k-means clustering, this approach was
pioneered by Dhillon et al.37. We adopt their centroid estimation procedure for our
algorithm. Instead of just computing the mean position of all cells that belong in
cluster k, this approach then log2 normalizes each centroid vector to make it unit
length. Note that normalizing the sum over cells is equivalent to normalizing the
mean of the cells. In the soft clustering case, this summation is an expected value
of the cell positions, under the distribution defined by R. That is, re-normalizing
R.k for cluster k gives the probability of each cell belonging to cluster k. Again, this
re-normalization is a scalar factor that is irrelevant once we log2 normalize the
centroids. Thus, the unnormalized expectation of centroid position for cluster k
would be Yk ¼ ER:k Z = ΣiRkiZi. In vector form, for all centroids, this is Y = ZRT. The
I of the cluster centroids is given by this summation followed by log2
final position
normalization of each centroid. This procedure is implemented in Algorithm 2 in
the section {Compute Cluster Centroids}.
Algorithm 2 Maximum Diversity Clustering
^ ϕ)
function cluster ( Z;
I Centroids}
  {Initialize Cluster
^ KÞ
kmeansðZ;
   Y
I i
1:::N 
  for
do  
Y;i =Y;i 2
    Y;iI
 
^I;i
^;i =Z
^;i 
Z
    Z
2
   E I R1PrbT
T
Rϕ
   OI
I
  repeat
   for all Update Blocks do
    in ← cells to update in block
    {Compute O and E on left out data}
E � Rin 1PrbT
     E
O � Rin ϕTin
     OI
I
    {Update and Normalize R}


T^
exp � 2ð1�Yσ Zin Þ
     Rin
ðE þ 1=O þ 1Þθ ϕin
     ΩI
Rin  Ω
     RIin
Rin  diagð1T Rin Þ�1 
     RIin
I
    {Compute O and E with full data}
E þ Rin 1PrbT
     E
O þ Rin ϕTin
     OI
I
w   {Compute Cluster Centroids}
ZRT
    Y
I i ← 1…N do 
   for
 
Y;i =Y;i 2
     Y;i
  until convergence
I
  return R

⊳ L2 Normalization

⊳ Ri sum to one

⊳ L2 Normalization

Nature Methods | www.nature.com/naturemethods

Articles

NATuRe MeTHods
Implementation details. The update steps of R and Y derived above form the core of
maximum diversity clustering, outlined as Algorithm 2. This section explains the
other implementation details of this pseudocode. Again, for simplicity, we discuss
details related to diversity penalty terms θ, ϕ, O and E for each single batch variable
independently.
Block updates of R. Unlike in regular k-means, the optimization procedure above
for R cannot be faithfully parallelized, as the values of O and E change with R. The
exact solution therefore depends an online procedure. For speed, we can coarse
grain this procedure and update R in small blocks (for example, 5% of the data).
Meanwhile, O and E are computed on the held out data. In practice, this approach
succeeds in minimizing the objective for sufficiently small block size. In the
algorithm, these blocks are included as the Update Blocks in the for loop.
Centroid initialization. We initialize cluster centroids using regular k-means
clustering, implemented in the base R k-means function. We use ten random restarts
and keep the best one. We then log2 normalize the centroids to prepare them for
spherical k-means clustering in Algorithm 2, maximum diversity clustering.
Regularization for smoother penalty. The diversity penalty term (Ebk/Obk)θ can tend
toward infinity if there are no cells from batch b assigned to cluster k. This extreme
penalty can erroneously force cells into an inappropriate cluster.
protect
against
 To 
θ
1þEbk
this, we add 1 to O and E to ensure that the fraction is stable: 1þO
bk

I
θ discounting. The diversity penalty, weighted by θ enforces an even mixing of
cells from a batch among all clusters. This assumption is more likely to break
for a batch with few cells. The smaller the batch, the more likely it is, through
a sampling argument, that some cell types are not represented in the batch.
Spreading such a batch across all clusters would result in erroneous clustering. To
prevent such a situation, we allot each batch its own θb term, scaled to the number
of cells in the batch.
"


 #
Nb 2
θb ¼ θmax 1 � exp �
Kτ

Above, θmax is thenondiscounted θ value,
 for a large enough batch. The
multiplicative factor 1 � expð�Nb =KτÞ2 ranges from 0 to 1. This factor scales
exponentially for small
I values of batch size Nb and plateaus for sufficiently large
Nb. The hyperparameter τ can be interpreted as the minimum number of cells that
should be assigned to each cluster from a single batch. By default, we use values
between τ = 5 and τ = 20.
K, the number of clusters. The number of clusters K used in Harmony soft
clustering should be set to a value that reflects the size and complexity of the
dataset. Too few clusters will not capture the number of biologically distinct cell
types and states. Too many clusters will give too much weight to batch-specific
outliers and prevent effective integration. As a heuristic, we assume that the
datasets have at most 100 distinct cell types and that each cluster should have at
least 30 cells. We set the default number of clusters, K, to lie between these two
extremes, for N cells.


N
K ¼ min 100;
30

Linear mixture model correction. In this section, we refer to all effects to be
integrated out of the original embedding as batch effects. This does not indicate
that these effects are purely technical. This terminology is only meant for
convenience.
Mixture of experts model. Once we define batch-diverse clusters, we would
like to remove batch-specific variation from each cluster. We achieve this with a
variation on the original mixture of experts model by Jordan and Jacobs38.
In the context of Harmony, each cell is probabilistically assigned to a small set
of experts. This assignment was computed previously in the clustering step.
Conditioned on a cluster/expert, mixture of experts assumes a linear relationship
between the response and independent variables. Thus, we condition on
cluster/expert k and define a Gaussian probability distribution for the
response variables.




ð9Þ
Zi j ϕ*i ; Ri ¼ k  N Zi jWkT ϕ*i ; σ 2k I

Here, the mean is a function of the independent variables (μk = WkTϕ*i), while
the covariance is not (σk2I). Note that the design matrix above (ϕ*) is not the same
as the one used in the clustering step (ϕ). We augment the original design matrix
ϕ to include an intercept term: ϕ* = 1 || ϕ. These intercept terms capture batchindependent (that is, cell type) variation in each cluster or expert. We can also

Nature Methods | www.nature.com/naturemethods

achieve more complex behavior, such as reference mapping (Reference mapping)
by modifying ϕ*.
2
3
1

1
6 ϕ 11    ϕ1N 7
7
6
ϕ* ¼ 1jjϕ ¼ 6 .
..
... 7
5
4 ..
.
ϕB1    ϕBN

With this generative formulation, we can solve for the parameters (Wk) of the
linear model for each cluster or expert independently.

�1
ð10Þ
Wk ¼ ϕ* diagðRk Þϕ*T ϕ* diagðRk ÞZ T

Above, diag(Rk) is the diagonal matrix of cluster membership terms for cluster
k. Z is the matrix of original PCA embeddings
2
3
0
Rk1   
6 .
..
... 7
diagðRk Þ ¼ 4 ..
5
.
0    RkN

Each column in Wk corresponds to a PC dimension, from PC1 to PCd. The
first row of Wk corresponds to batch-independent intercept terms. The subsequent
rows (1 to B) correspond to one-hot encoded batch assignments from the original
design matrix ϕ.
To get the cell-specific correction values, we take the expectation of Wk with
respect to the cluster assignment probability distribution R. In particular, each cell
is modeled by a batch-independent intercept, represented in the first row of Wk
(Wk[0,.]) and its batch-dependent terms represented by the remaining rows: Wk[1:B,.].
Zi ¼

X
k

h
i
T
Rki Wk½0; þ Wk½1:B;
ϕ*i½1:B þ ϵi

ð11Þ

We split Wk into two parts because we want to retain the intercept terms
 and
^i , we
remove the batch-dependent terms. To get the corrected embeddings Z
(Zi).
subtract the batch-specific term (Σk WTk[1:B,.]ϕi) from the original embedding
I
^ i ¼ Zi �
Z

X

T
Rki Wk½1:B;
ϕ*i½1:B

k

ð12Þ

What remains is the cell-type specific intercept Wk[0,.] and the cell-specific
residual ϵi.
X
^i ¼
Z
Rki Wk½0; þ ϵi
ð13Þ
k

Unfortunately, for the design matrix ϕ*, the formulation in equation (13) does
not have a solution. This is because ϕ* is not full rank and thus ϕ* diag(Rk) ϕ*T
is not invertible. This singularity arises from the fact that the sum of a one-hot
encoded categorical variable is equal to the intercept. To address this colinearity,
we penalize nonintercept terms in Wk with an log2 norm, akin to ridge regression.
This shrinks the Wk terms to 0. Just as in ridge regression, instead of inverting ϕ*
diag(Rk) ϕ*T, we invert ϕ* diag(Rk) ϕ*T + λI, which is not singular. The solution for
Wk now becomes

�1
Wk ¼ ϕ* diagðRk Þϕ*T þ λI ϕ* diagðRk ÞZT

ð14Þ

λ is the ridge penalty hyperparameter. Larger values of λ will shrink Wk
more toward 0. We adopt the strategy to set λ0 = 0, thus not penalizing the batchindependent intercept, and 8b2½1:B λb to be small enough to make the solution
tractable. In practice, we set 8
I b2½1:B λb ¼ 1. These correction steps are summarized
in Algorithm 3.
I
Algorithm 3 Mixture of Experts Correct
function correct (Z, R, ϕ)
^
Z
    Z
1jjϕ
    ϕI*
I k ← 1…K do
   for
     Wk ¼ ðϕ* diagðRk Þϕ*T þ λIÞ�1 ϕ* diagðRk ÞZ T
I k½0;
0
     W
^I ¼ Z � W T ϕ* diagðRk Þ
     Z
k
^
   return
I Z
Reference mapping. In Supplementary Fig. 13, we used Harmony to map two query
datasets onto a reference dataset. We achieved this by modifying ϕ* so that batch
terms for reference cells does not get modeled or corrected. For every cell i in a
reference dataset, set ϕi* = [1, 0, …, 0]. This makes Harmony explain cell i in terms
of an intercept and nothing else. Since intercept terms do not get removed, cell i
^i).
never changes its embedding (that is, Zi ¼ Z
I

Articles
Caveat. This section assumes the modeled data are orthogonal and each normally
distributed. This is not true for the log2 normalized data used in spherical
clustering. Regression in this space requires the estimation and interpolation of
rotation matrices, a difficult problem. We instead perform batch correction in the
^ are then log2-normalized for the next
unnormalized space. The corrected data Z
iteration of clustering.
Performance and benchmarking. LISI metric. Assessing the degree of mixing
during batch correction and dataset integration is an open problem. Several
groups have proposed methods to quantify the diversity of batches within local
neighborhoods, defined by kNN graphs, of the embedded space. Buttner et al.39
provide a statistical test to evaluate the degree of mixing, while Azizi et al.40 report
the entropy of these distributions. Our metric for local diversity is related to these
approaches, in that we start with a kNN graph. However, our approach considers
two problems that these do not.
First, the metric should be more sensitive to local distances. For example, a
neighborhood of 100 cells can be equally mixed among four batches. However,
within the neighborhood, the cells may be clustered by batch. The second problem
is one of interpretation. kBET provides a statistical test to assess the significance of
mixing, but it is not clear whether all neighborhoods should be significantly mixed
when the datasets have vastly different cell-type proportions. Azizi et al.40 use
entropy as a measure of diversity, but it is not clear how to interpret the number of
bits required to encode a neighborhood distribution.
Our diversity score, the LISI addresses both points. To be sensitive to local
diversity, we build Gaussian kernel-based distributions of neighborhoods.
This gives distance-based weights to cells in the neighborhood. The current
implementation computes these local distributions using a fixed perplexity (default
30), which has been shown to be a smoother function than fixing the number
of neighbors. We address the second issue of interpretation using the inverse
Simpson’s Index (1/Σb=1Bp(b)). The probabilities here refer to the batch probabilities
in the local neighborhood distributions described above. This index is the expected
number of cells needed to be sampled before two are drawn from the same batch.
If the neighborhood consists of only one batch, then only one draw is needed.
If it is an equal mix of two batches, two draws are required on average. Thus,
this index reports the effective number of batches in a local neighborhood. Our
diversity score, LISI, combines these two features: perplexity-based neighborhood
construction and the Inverse Simpson’s Index. LISI assigns a diversity score to each
cell. This score is the effective number of batches in that cell’s neighborhood. Code
to compute LISI is available as at https://github.com/immunogenomics/LISI. The
major shortcoming of LISI is in situations with datasets with vastly different sizes.
We demonstrate what happens in this situation in Supplementary Note 2.
Significance between embeddings. In benchmarking against other algorithms, we
compared the LISI scores of Harmony to LISI scores of embeddings from other
algorithms. We did this with a standard bootstrap estimation framework: first,
build an empirical distribution of Harmony LISI values for each cell; then count
the frequency with which the observed benchmark LISI is more extreme than
bootstrapped LISI values. During the resampling phase, we reran PCA, Harmony
and LISI for every bootstrap sample. For P value computation, we used a twotailed empirical approach, adding 1 to the observed counts to avoid P = 0. This
d i be the observed LISI value
calculation is described in the equation below. Let LISI
for cell i for a benchmarking algorithm. Let LISIib toI be LISI value for the bth
bootstrap value for cell i. We compute P as


PB
P
1 þ min
; Bb¼1 1 c
b¼1 1Lc
b
b
ISIi > LISIi
LISIi < LISIi
P¼2
Bþ1
Time and memory. We performed execution time and maximum memory usage
benchmarks on all analyses. All jobs were run on Linux servers and allotted six
cores and 120 GB of memory. The machines were equipped with Intel Xeon E52690 v.3 processors. To evaluate execution time and maximal memory usage,
we used the Linux time utility (/usr/bin/time on our systems) with the -v flag to
record memory usage. Execution time was recorded from the {Elapsed time} field.
Maximum memory usage was recorded from the {Maximum resident set size} field.
Cell-type prediction accuracy. In addition to cLISI, we measured the accuracy of an
embedding with cell-type prediction. Briefly, we predict each cell’s type based on its
neighboring cells and compute accuracy as the frequency of correct predictions. In
detail, we compute the 30 nearest neighbors of a cell, based on cosine distance. We
then weigh them with an radial basis function kernel with σ = 0.1 width. Finally,
we take the weighted sum over the neighbor’s cell types and return the most likely
cell type. Formally, for cell i, its 30 nearest cells nearest-neighbor (NN)(i), an
embedding matrix Z and cell-type labels vector T, compute the most probable
cell-type label t:

 !
Zi =kZi k � jZj =kZjk2
X
´ 1Tj ¼t
Ti ¼ argmax
exp �
ð15Þ
σ
t
j2NNðiÞ

NATuRe MeTHods
Gene expression prediction. We predicted gene expression in the 10X + MERFISH
analysis using a similar approach to cell-type prediction above. Instead of
predicting cell-type labels T, we predicted gene expression yg. We predict the
expression ygi will take a weighted average of the observed normalized gene
expression values of the cell is nearest neighbors.

  !
Zi =Zi � Zj =Zj 2
X
´ ygj
ygi ¼ argmax
exp �
ð16Þ
σ
t
j2NNðiÞ
Five-fold cross-validation. In the 10X + MERFISH analysis, we predicted gene
expression of the 154 intersecting measured genes with a five-fold cross-validation
framework. In this analysis, we removed a randomly sampled one fifth of the genes
from both datasets before normalization. We then performed normalization, scaling,
PCA and Harmony integration from the remaining four fifths of the genes and
predicted the expression of the original removed genes. We repeated this procedure
for each of the five holdout genesets. We then repeated the entire procedure ten
times, selecting a new random partition in each iteration. We considered the mean
of these ten iterations as the predicted value of the gene. We calculated the Pearson
correlation between the predicted and measured expression of each gene.
Neuronal subtype correlation. To calculate the correlations between the 10X
predicted clusters and the MERFISH clusters, we first subsetted the predicted
gene expression matrix to the 154 intersecting 10X/MERFISH genes; we then
concatenated the predicted 10X neuronal subtypes to the predicted gene expression
matrix and the known neuronal subtypes to the MERFISH dataset. In both the
predicted gene expression matrix and the MERFISH dataset, we calculated the
average expression of the predicted 10X neuronal subtypes and known neuronal
subtypes, respectively. Finally, we calculated the Spearman correlation between the
average expression of each predicted 10X inhibitory/excitatory subtype and known
MERFISH inhibitory/excitatory subtypes.
Analysis details. Preprocessing scRNAseq data. We downloaded raw read or unique
molecular identifier matrices for all datasets, from their respective sources. The
one exception was the 3pV1 dataset from the PBMC analysis. These data were
originally quantified with the hg19 reference, while the other two PBMC datasets
were quantified with GRCh38. Thus, we downloaded the fastq files from the 10X
website (Supplementary Table 8). We quantified gene expression counts using Cell
Ranger11,41 v.2.1.0 with GRCh38. From the raw count matrices, we used a standard
data normalization procedure, laid out below, for all analyses, unless otherwise
specified. Except for the log2 normalization and within-batch variable gene
detection, this procedure follows the standard guidelines of the Seurat single-cell
analysis platform.
We filtered cells with fewer than 500 genes or more than 20% mitochondrial
reads. In the pancreas datasets, we filtered cells with the same thresholds used
in Butler et al.7: 1,750 genes for Cel-Seq, 2500 genes for Cel-Seq2, no filter for
Fluidigm C1, 2500 genes for Smart-Seq2 and 500 genes for inDrop. We then library
normalized each cell to 10,000 reads, by multiplicative scaling, then log scaled
the normalized data. We then identified the top 1,000 variable genes, ranked by
coefficient of variation, within in each dataset. We pooled these genes to form the
variable gene set of the analysis. Using only the variable genes, we mean centered
and variance 1 scaled the genes across the cells. Note that this was done in the
aggregate matrix, with all cells, rather than within each dataset separately. With
these values, we performed truncated singular value decomposition keeping the
top 30 eigenvectors. Finally, we multiplied the cell embeddings by the eigenvalues
to avoid giving eigenvectors equal variance.
Spatial analysis. The raw count 10X and MERFISH datasets were downloaded from
their Gene Expression Omnibus and Dryad repositories, respectively. MERFISH
metadata was downloaded as a supplementary table from the Moffitt et al.
publication27. Figures were generated using the six mice from the 10X dataset and
the first naive female mouse in the MERFISH dataset. For purposes of prediction
all cells labeled as ‘Ambiguous’ or ‘Unstable’ were removed from both datasets, as
these represent cells with a low number of unique genes expressed, a low library
size or a nondeterministic clustering. All genes in the 10X dataset that were not
expressed in any cells were removed. In the MERFISH dataset, the gene blanks
were removed for integration and the Fos gene was removed (no numerical values).
In the 10X dataset, we normalized the counts by dividing the raw counts
within each cell by the total number of transcripts within that cell, scaling these
counts by 10,000, offsetting the counts by 1 and log transforming the counts.
As the MERFISH dataset counts were already normalized by volume, we only
performed log transformation on the counts with an offset of 1. We then combined
the two datasets by subsetting the 10X dataset to the 154 intersecting genes,
concatenating the normalized count matrices and scaling each gene through
z-score transformation.
Visualization. We used the UMAP algorithm42,43 to visualize cells in a twodimensional space. For all analyses, UMAP was run with the following parameters:
k = 30 nearest neighbors, correlation based distance and min_dist = 0.1.
Nature Methods | www.nature.com/naturemethods

Articles

NATuRe MeTHods
Comparison to other algorithms. We used the provided packages or source code
provided by the four comparison algorithm publications.
MNN Correct. We used the mnnCorrect function, with default parameters, in the
scran R package44, v.1.9.4. As input, we provided a matrix of PCA embeddings.
Seurat multiCCA. We followed the suggested integration pipeline in the Seurat
R package7, v.2.3.4. This included the RunMultiCCA, MetageneBicorPlot,
CalcVarExpRatio, SubsetData (subset.name = ‘var.ratio.pca’, accept.low = 0.5)
and AlignSubspace functions. Unless specified, we used all default parameters
for these functions. We used the same number of canonical components as the
number of PCs used as input to Harmony, MNN Correct and Scanorama. Unlike
the integration examples, we did not scale data within datasets separately, unless
otherwise specified. We scaled data on the pooled count matrix instead.
Scanorama. We used the assemble function, with precomputed PCs, from the
primary github repository (brianhie/scanorama). We set knn = 30 and sigma = 1,
to match the default comparable MNN Correct parameters. All other parameters
were kept at default values. We did not use the correct function, as this included
both preprocessing and integration of the data. For more equitable comparisons,
we tried to use the same preprocessing pipelines for all methods and only
compared only the integration steps.
BBKNN. We downloaded the BBKNN software from the primary github repository
(Teichlab/bbknn) and followed the suggested integration pipelines, using the
bbknn and scanpy umap functions. For the bbknn function, we used k = 5 and
trim = 20 for all analyses except for the HCA datasets, in which we used k = 10 and
trim = 30, to accommodate the larger number of cells. All other parameters were
kept at default values. Because BBKNN by design constructs a batch-balanced
neighborhood graph, LISI should be biased toward these neighbors. On the other
hand, BBKNN does not learn a low-dimensional embedding aside from UMAP.
Therefore, we were forced to use the potentially biased neighborhood graph
provided by the method. We felt that this was appropriate because one step of
BBKNN is to prune this graph before the UMAP step. Thus, we used the pruned
BBKNN graph to compute LISI.
Harmony parameters. By default, we set the following parameters for Harmony:
θ = 2, K = 100, τ = 0, σ = 0.1, block_size = 0.05, ϵcluster = 10−5, ϵharmony = 10−4, max_
itercluster = 200, max_iterHarmony = 10, λ =:1. For the pancreas analysis, we set τ = 5. We
set donors to be the primary covariate (θ = 2) and technology secondary (θ = 4). In
the spatial analysis, we used θ = 3 and λ = 0.05.
Identification of alpha and beta ER stress subpopulations. We identified the alpha and
beta ER stress clusters in Fig. 5 by performing downstream analysis, specified in this
section, on the integrated joint embedding produced by Harmony. After Harmony
integration, we performed clustering analysis to find new subtypes. Clustering was
done on the trimmed shared nearest-neighbor graph with the Louvain algorithm45,
as implemented in the Seurat package BuildSNN and RunModularityClustering
functions. We used the parameters resolution = 0.8, k = 30 and nn.eps = 0. We
identified several clusters within the alpha, beta and ductal cell populations. For
each cluster, we performed differential expression analysis within the defined cell
type. That is, we compared alpha clusters to all other alpha cells. For differential
expression, we used the R Limma package18 on the normalized data. We included
technology and library complexity (log number of unique genes) as covariates in the
linear models. We used the top 100 over-expressed genes for each cluster, weighted
by the t-statistic, to perform pathway enrichment with the enrichR46,47 R package,
using the three Gene Ontology genesets48,49. The ductal subpopulation was enriched
for ribosomal genes; we did not follow up on this cluster.
Labeling cells with canonical markers. In the cell-line, PBMC and pancreas analyses,
we labeled cells within individual datasets using canonical markers. We did this by
using the standard preprocessing pipeline for each dataset, clustering (Louvain, as
above) and identifying clusters specific for the canonical markers for that analysis.
We used a similar strategy to identify fine-grained subpopulations of PBMCs and
in the HCA 500,000 cell dataset. In these cases, we clustered in the joint embedding
produced by Harmony, then looked for clusters that specifically expressed expected
canonical markers.
Identification of spatially autocorrelated transcription factors. To identify which
of our predicted genes were transcription factors, we downloaded all known
vertebrate transcription factors from JASPAR in MEME format29. We used this list
of transcription factors as a reference for our predicted genes list. We assessed the
degree of spatial localization of each transcription factor using Moran’s I statistic.
Due to the large number of cells, we found the calculation of a dense weights
matrix of all pairwise distances between cells to be untenable. Therefore, after
taking each slice and centering their respective cell’s X and Y centroids around 0,
we calculated a sparse weights matrix by finding each MERFISH cell’s 30 nearest
MERFISH neighbors (using centered X centroids, centered Y centroids and slice
Z position as locations for each cell) excluding itself and applying a radial basis
Nature Methods | www.nature.com/naturemethods

function kernel on the distances of each MERFISH cell’s nearest neighbors with
bandwidth σ = 60. For each set of nearest neighbors, we divided each distance
by the sum of the distances to obtain probabilities to use as weights, and input
these weights into a sparse matrix. Calculations then proceeded as normal from
the equation, using the sparse distance matrix as the weights matrix. Genes with
significant spatial autocorrelation were evaluated at FDR 0.05. Formally, for gene g,
a nearest-neighbor map NN and a three-dimensional spatial embedding matrix X,
the nearest-neighbor formulation of Moran’s I is

I¼P

PN P

�

wij ¼ exp �


 !
jXi � Xj j2
σ

N

i¼1

i;j wij

�

j2NNðiÞ wij ygi � yg
2
PN �
i¼1 ygi � yg

ygj � yg



ð17Þ

ð18Þ

ISH Allen Brain Atlas access. To look at the coarse ISH data from Allen Brain
Atlas, we used the ISH data portal31. We searched for our gene of interest (Satb1)
and selected experiment number 79488931, with coronal slices. To estimate the
anterior-posterior position of each slice, we used the online tool to map each
slice to an Atlas reference slice. Each reference slice has a position from the
Bregma, searchable in their online Allen Mouse Brain volumetric Atlas: https://
scalablebrainatlas.incf.org/mouse/ABA12. The representative image in Fig. 6a is
slice 62 (282) in the volumetric atlas. We also used the Atlas annotations online
to estimate the location of anatomical structures (that is, hypothalamus and
ventricles) in each slice. We downloaded 13 to 16 from experiment 79488931 and
focused on a 2 × 2 cm2 region around the hypothalamus.
Statistics. Statistical analysis of LISI score comparisons is described in Significance
between embeddings. Differential expression tests in the pancreatic islet cell
analysis were performed with two-tailed moderated t-tests, implemented in the
LIMMA R package. Correlation analyses in the pancreatic islet cell and MERFISH
analyses were performed using the Spearman correlation.
Reporting Summary. Further information on research design is available in the
Nature Research Reporting Summary linked to this article.

Data availability

All data analyzed in this article are publicly available through online sources. We
included links to all data sources in Supplementary Table 8.

Code availability

Harmony and LISI are available as R packages on https://github.com/
immunogenomics/harmony and https://github.com/immunogenomics/lisi.
Scripts to reproduce results of the primary analyses will be made available on
https://github.com/immunogenomics/harmony2019. Additionally, vignettes
are included as Supplementary Notes. Supplementary Note 1 provides a detailed
walkthrough of Harmony, connecting theoretical algorithm components to their
code implementations. Supplementary Note 2 demonstrates the LISI metric and
how to evaluate its statistical significance. Supplementary Note 1 uses Harmony
with simulated datasets.

References

36. Mao, Q., Wang, L., Goodison, S. & Sun, Y. Dimensionality reduction via
graph structure learning. In Proc. 21th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, KDD 2015,
765–774 (ACM, 2015).
37. Dhillon, I. S. & Modha, D. S. Concept decompositions for large sparse text
data using clustering. Mach. Learn. 42, 143–175 (2001).
38. Jordan, M. I. & Jacobs, R. A. Hierarchical mixtures of experts and the EM
algorithm. Neural Comput. 6, 181–214 (1994).
39. Buttner, M., Miao, Z., Wolf, F. A., Teichmann, S. A. & Theis, F. J. A test
metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16,
43–49 (2019).
40. Azizi, E. et al. Single-cell map of diverse immune phenotypes in the breast
tumor microenvironment. Cell 174, 1293–1308 (2018).
41. Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29,
15–21 (2013).
42. McInnes, L. & Healy, J. UMAP: uniform manifold approximation and
projection for dimension reduction. Preprint at https://arxiv.org/
abs/1802.03426 (2018).
43. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using
UMAP. Nat. Biotechnol. 37, 38–44 (2019).
44. Lun, A. T. L., McCarthy, D. J. & Marioni, J. C. A step-by-step workflow for
low-level analysis of single-cell RNA-seq data with bioconductor. F1000 Res.
5, 2122 (2016).

Articles
45. Blondel, V. D., Guillaume, J.-L., Lambiotte, R. & Lefebvre, E. Fast unfolding
of communities in large networks. J. Stat. Mech.: Theory Exp. 2008,
P10008 (2008).
46. Chen, E. Y. et al. Enrichr: interactive and collaborative HTML5 gene list
enrichment analysis tool. BMC Bioinformatics 14, 128 (2013).
47. Kuleshov, M. V. et al. Enrichr: a comprehensive gene set enrichment
analysis web server 2016 update. Nucleic Acids Res. 44,
W90–W97 (2016).
48. The Gene Ontology Consortium. Expansion of the gene ontology
knowledgebase and resources. Nucleic Acids Res. 45, D331–D338 (2017).
49. Ashburner, M. et al. Gene ontology: tool for the unification of biology. the
gene ontology consortium. Nat. Genet. 25, 25–29 (2000).

Acknowledgements

This work was supported in part by funding from the National Institutes of Health
(grant nos. UH2AR067677 and U19AI111224 and no. 1R01AR063759 (to S.R.) and T32
AR007530-31 (to I.K.)). We thank members of the Raychaudhuri and Brenner labs for
comments and discussion. I.K. and K.W. were funded as part of a collaborative research
agreement with F. Hoffmann-La Roche Ltd (Basel, Switzerland), to S.R. and M.B.B.

NATuRe MeTHods
Author contributions

S.R. and I.K. conceived the research. I.K. led computational work under the guidance
of S.R., assisted by N.M., P.L., J.F. and K.S. All authors participated in interpretation and
writing the manuscript.

Competing interests

I.K. does paid bioinformatics consulting through Brilyant LLC.

Additional information

Supplementary information is available for this paper at https://doi.org/10.1038/
s41592-019-0619-0.
Correspondence and requests for materials should be addressed to S.R.
Peer review information Nicole Rusk was the primary editor on this article and
managed its editorial process and peer review in collaboration with the rest of the
editorial team.
Reprints and permissions information is available at www.nature.com/reprints.

Nature Methods | www.nature.com/naturemethods

