How to Make Custom-Colored Dendrogram Ends in R

mjs · July 29, 2020

Today I describe how to color the terminal ends of a dendrogram based on some metadata variable you want to define.

Most of the projects I work on involve some sort of clustering analysis. For one of them, I wanted to color the ends of a dendrogram by some variable from my metadata, to visualize whether that variable followed the clustering. There exist excellent packages in R like ggdendro that allow you to either plot colored bars under dendrograms to represent how groups cluster or color the terminal segments by the cluster itself.

That said, I still haven’t found an easy way to change the color of the terminal ends of the dendrogram itself based on user-defined metadata, which I personally think can be more visually appealing in some situations. This tutorial describes how I figured out how to do it and provides reproducible code if you are hoping to do the same thing!

Dendrogram Basics

Before I start, what is a dendrogram, anyway?

A dendrogram is a graphical representation of hierarchical clustering. Clusters can be built top-down (divisive) or bottom-up (agglomerative); in R, the most common route is applying hclust(), which works bottom-up, to a distance matrix. Dendrograms are built by connecting nodes to branches or other nodes, resulting in a tree-like figure that shows how individual things are related to each other based on multiple variables.
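To make this concrete, here is a minimal sketch with base R only. The five points and their names are made up for illustration; they are not part of the iris workflow that follows.

```r
# Five made-up 2-D points: two tight pairs plus one outlier (illustrative data only)
pts <- matrix(c(0, 0,
                0, 1,
                5, 5,
                5, 6,
                20, 20),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("a", "b", "c", "d", "e"), c("x", "y")))

d  <- dist(pts, method = "euclidean")  # pairwise distances between points
hc <- hclust(d, method = "complete")   # bottom-up clustering on those distances

hc$merge             # which points or clusters were joined at each step
hc$height            # the distance at which each join happened
hc$labels[hc$order]  # leaf labels in left-to-right plotting order
```

Passing as.dendrogram(hc) to plot() would draw the tree; the merge heights become the y-values of the horizontal bars.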

Let’s say we want to compare how individual irises relate to each other in the well-known iris data set that ships with R. This dataframe contains four numeric vectors (Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width) as well as one factor (Species). We could easily construct and plot a dendrogram incorporating all these numeric data with base R, but what if we want to color the terminal segments by the species of iris to visualize whether Species follows the clustering determined by hclust()?

Step 1: Install Packages

For this tutorial, you’ll want to load three R packages: tidyverse for data manipulation and visualization, ggdendro to extract dendrogram segment data into a dataframe, and RColorBrewer to make an automatic custom color palette for your dendrogram ends. If you would like to make your dendrogram interactive, be sure to load plotly as well.

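# p_load() installs any missing packages and then loads them (pacman itself must already be installed)
pacman::p_load(tidyverse, ggdendro, RColorBrewer, plotly)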
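If you would rather see what is missing before anything gets installed, a small base R check along these lines can help. This is only a sketch; the variable names are my own.

```r
# Packages this tutorial relies on
pkgs <- c("tidyverse", "ggdendro", "RColorBrewer", "plotly")

# system.file() returns "" for a package that is not installed
is_installed <- vapply(pkgs, function(p) nzchar(system.file(package = p)), logical(1))
missing_pkgs <- pkgs[!is_installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All packages are available.")
}
```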

Step 2: Load Data

Now we’ll want to load the iris dataframe into our environment. Typically, we have sample names mapped to each observation, so we will want to create our own (sample_name) right at the start.

With microbial community data, I usually work with two objects: a giant matrix of ASV abundances by sample_name, and metadata associated with each sample. To simulate this, we will separate iris into numeric_data, from which we will calculate distance and construct a dendrogram, and metadata, which consists simply of the species of iris for each sample. For this workflow, it is important to have a sample_name identifier for each observation; it will be the basis of merging everything at the end.

# label rows with unique sample_name
dat <- iris %>%
  mutate(sample_name = paste("iris", seq_len(nrow(iris)), sep = "_")) # create unique sample ID

# save non-numeric metadata in separate dataframe
metadata <- dat %>%
  select(sample_name, Species) 

# extract numeric vectors for distance matrix
numeric_data <- dat %>%
  select(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, sample_name)

# check data 
head(numeric_data)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width sample_name
## 1          5.1         3.5          1.4         0.2      iris_1
## 2          4.9         3.0          1.4         0.2      iris_2
## 3          4.7         3.2          1.3         0.2      iris_3
## 4          4.6         3.1          1.5         0.2      iris_4
## 5          5.0         3.6          1.4         0.2      iris_5
## 6          5.4         3.9          1.7         0.4      iris_6
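Because everything is merged on sample_name later, it can be worth checking the identifiers right away. The sketch below rebuilds them with base R only; metadata_check and numeric_check are stand-in names of my own, not objects used elsewhere in this tutorial.

```r
# Rebuild the IDs with base R only, mirroring the step above
sample_name <- paste("iris", seq_len(nrow(iris)), sep = "_")

metadata_check <- data.frame(sample_name = sample_name, Species = iris$Species)
numeric_check  <- data.frame(iris[, 1:4], sample_name = sample_name)

# every ID must be unique, otherwise later joins would duplicate rows
stopifnot(anyDuplicated(sample_name) == 0)

# both tables must describe exactly the same samples
stopifnot(setequal(metadata_check$sample_name, numeric_check$sample_name))

# no missing values in the columns used for the distance matrix
stopifnot(!anyNA(numeric_check[, 1:4]))
```

With real data, where sample names come from separate files, the setequal() check is the one most likely to catch a problem.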

Step 3: Normalize Data and Create Dendrogram

Before we make the dendrogram, we will calculate a distance matrix based on numeric_data using dist(). It is good practice to normalize your data before doing this calculation, because otherwise variables measured on larger scales dominate the Euclidean distance; I will therefore rescale each variable to run from 0 to 1 (min-max normalization).

After we do that, we can create a distance matrix (dist_matrix) and generate a dendrogram from our normalized data.

# normalize data to values from 0 to 1 
numeric_data_norm <- numeric_data %>%
  select(sample_name, everything()) %>%
  pivot_longer(cols = -sample_name, values_to = "value", names_to = "type") %>%
  group_by(type) %>%
  mutate(value_norm = (value - min(value)) / (max(value) - min(value))) %>% # min-max scale each variable to 0-1
  ungroup() %>%
  select(sample_name, type, value_norm) %>% # keep type explicitly; it supplies the column names below
  pivot_wider(names_from = "type", values_from = "value_norm") %>%
  column_to_rownames("sample_name") # row names become the labels of the distance matrix

# create dendrogram from distance matrix of normalized data
dist_matrix <- dist(numeric_data_norm, method = "euclidean")
dendrogram <- as.dendrogram(hclust(dist_matrix, method = "complete"))
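As a cross-check, the same normalization and clustering can be sketched in base R. The helper min_max() and the objects norm_base, d_base, and hc_base are names I made up for this sketch.

```r
# Min-max scale one numeric vector to the range 0-1
min_max <- function(v) (v - min(v)) / (max(v) - min(v))

norm_base <- as.data.frame(lapply(iris[, 1:4], min_max))
rownames(norm_base) <- paste("iris", seq_len(nrow(iris)), sep = "_")

# each column should now span exactly 0 to 1
sapply(norm_base, range)

d_base  <- dist(norm_base, method = "euclidean")
hc_base <- hclust(d_base, method = "complete")

# with complete linkage, the final merge height equals the largest pairwise distance
all.equal(max(hc_base$height), max(d_base))
```

Note that min_max() divides by zero if a column is constant, so columns with no variation should be dropped before scaling.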

Step 4: Extract Dendrogram Segment Data Using ggdendro

Now let’s quickly take a look at what our dendrogram looks like using base R:

plot(dendrogram)

Okay, it’s not very pretty, but bear with me. This is a useful visual to show how we will extract the coordinate data from the dendrogram object with ggdendro::dendro_data() to make a better figure. Every dendrogram is plotted by adding individual segments between points on an x and y grid.

When we apply ggdendro::dendro_data() and look at the extracted segment data, we see there are four vectors for every dendrogram: x, y, xend, and yend. Every horizontal or vertical line you see in the base R figure is ultimately constructed from one row of the following dataframe:

# extract dendrogram segment data
dendrogram_data <- dendro_data(dendrogram)
dendrogram_segments <- dendrogram_data$segments # contains all dendrogram segment data

head(dendrogram_segments)
##           x         y      xend      yend
## 1 54.982910 1.6511874 18.886719 1.6511874
## 2 18.886719 1.6511874 18.886719 0.6103705
## 3 18.886719 0.6103705 10.773438 0.6103705
## 4 10.773438 0.6103705 10.773438 0.4096452
## 5 10.773438 0.4096452  4.296875 0.4096452
## 6  4.296875 0.4096452  4.296875 0.2548251

We will work with two dataframes: dendrogram_segments, containing all the segments, and dendrogram_ends, containing only the terminal branches of the figure. As the plot above shows, the terminal branches are the single segments that reach the bottom of the plot, so filtering for segments that end at a height of 0 (i.e., yend == 0) keeps only those:

# get terminal dendrogram segments
dendrogram_ends <- dendrogram_segments %>%
  filter(yend == 0) %>% # filter for terminal dendrogram ends
  left_join(dendrogram_data$labels, by = "x") %>% # .$labels contains the row names from dist_matrix (i.e., sample_name)
  rename(sample_name = label) %>%
  left_join(metadata, by = "sample_name") # dataframe now contains only terminal dendrogram segments and merged metadata associated with each iris

Looking at dendrogram_ends, we now have a dataframe with vectors containing the dendrogram coordinate data matched to the sample_name and Species vector. We are now ready to start plotting in ggplot2!

head(dendrogram_ends)
##   x        y.x xend yend y.y sample_name   Species
## 1 1 0.15339027    1    0   0    iris_101 virginica
## 2 2 0.10971611    2    0   0    iris_116 virginica
## 3 3 0.06047157    3    0   0    iris_137 virginica
## 4 4 0.06047157    4    0   0    iris_149 virginica
## 5 5 0.07148290    5    0   0    iris_142 virginica
## 6 6 0.07148290    6    0   0    iris_146 virginica
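The join above works because each leaf sits at an integer x position in dendrogram order, as the x column of the output suggests. Here is a base R sketch of the same bookkeeping, assuming leaves are placed at x = 1, 2, ..., n from left to right; leaf_table and meta are hypothetical names.

```r
# Rebuild the normalized data and the dendrogram with base R
norm_base <- as.data.frame(lapply(iris[, 1:4],
                                  function(v) (v - min(v)) / (max(v) - min(v))))
rownames(norm_base) <- paste("iris", seq_len(nrow(iris)), sep = "_")
dend <- as.dendrogram(hclust(dist(norm_base), method = "complete"))

# leaf labels in plotting order, paired with assumed integer x positions
leaf_table <- data.frame(x = seq_along(labels(dend)),
                         sample_name = labels(dend))

# attach metadata by sample_name, as in the join above
meta <- data.frame(sample_name = rownames(norm_base), Species = iris$Species)
leaf_table <- merge(leaf_table, meta, by = "sample_name", sort = FALSE)
leaf_table <- leaf_table[order(leaf_table$x), ]

head(leaf_table)
```

A quick sanity check for your own data: dendrogram_ends should have exactly one row per sample. If it has fewer, either the yend == 0 filter or the join on x missed some leaves.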

Step 5: Generate a Custom Color Palette for Dendrogram Ends Based on Metadata Variables using RColorBrewer (Optional)

If you want to dynamically create a named vector of colors based on how many unique values the metadata vector of interest contains, you can run this code. In this example, our metadata contains only three species of iris, so this could be done manually fairly quickly. However, if your dataset has many more unique groups than that, as is common with microbial community data, you will probably want to automate this process.

# Generate custom color palette for dendrogram ends based on metadata attribute
unique_vars <- levels(factor(dendrogram_ends$Species)) %>% 
  as.data.frame() %>% rownames_to_column("row_id") 

# count number of unique variables
color_count <- length(unique(unique_vars$.))
# RColorBrewer
get_palette <- colorRampPalette(brewer.pal(n = 8, name = "Set1"))

# produce RColorBrewer palette based on number of unique variables in metadata
palette <- get_palette(color_count) %>% 
  as.data.frame() %>%
  rename("color" = ".") %>%
  rownames_to_column(var = "row_id")
color_list <- left_join(unique_vars, palette, by = "row_id") %>%
  select(-row_id)
species_color <- as.character(color_list$color)
names(species_color) <- color_list$.

If you don’t want to bother with the above code for this tutorial, you could manually create a named character vector as an alternative:

# Alternatively, create a custom named vector for iris species color:
species_color <- c("setosa" = "#E41A1C", "versicolor" = "#CB6651", "virginica" =  "#F781BF")
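For reference, the same idea can be sketched more compactly in base R. The helper make_group_palette() is hypothetical, and the eight hex codes are hard-coded on the assumption that they match brewer.pal(n = 8, name = "Set1"), so the sketch runs without RColorBrewer.

```r
# First eight colors of the ColorBrewer "Set1" palette, hard-coded here
# (assumed equal to brewer.pal(n = 8, name = "Set1"))
set1 <- c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3",
          "#FF7F00", "#FFFF33", "#A65628", "#F781BF")

# hypothetical helper: one interpolated color per unique group, named by group
make_group_palette <- function(groups, base_colors = set1) {
  lv <- levels(factor(groups))
  setNames(colorRampPalette(base_colors)(length(lv)), lv)
}

species_color_sketch <- make_group_palette(iris$Species)
species_color_sketch
```

One design note: colorRampPalette() interpolates evenly across the base colors, so with only a few groups the result can be less distinct than the base palette itself. When you have eight groups or fewer, indexing the base colors directly may give clearer contrast.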

Step 6: Plot your Custom-Colored Dendrogram!

Now it’s time to plot our dendrogram! You will want to define two geom_segment layers: one plotting all the segment data extracted in Step 4, which are uncolored, and one for just the terminal branches of the dendrogram, which is what we will color with species_color from the previous step. If you wrap this plot with plotly (see below), I recommend adding an extra text aesthetic to control which information is displayed in the tooltip. ggplot2 will warn that it ignores the unknown text aesthetic; that is expected.

p <- ggplot() +
  # all segments, uncolored
  geom_segment(data = dendrogram_segments,
               aes(x = x, y = y, xend = xend, yend = yend)) +
  # terminal segments only, colored by Species; the text aes is used only by plotly
  geom_segment(data = dendrogram_ends,
               aes(x = x, y = y.x, xend = xend, yend = yend, color = Species,
                   text = paste("sample name: ", sample_name, "<br>",
                                "species: ", Species))) +
  scale_color_manual(values = species_color) +
  scale_y_reverse() +
  coord_flip() + # flipped x and y coordinates for aesthetic reasons
  theme_bw() +
  theme(legend.position = "none") +
  ylab("Distance") +
  ggtitle("Iris dendrogram")
  
p
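If some ends come out grey, the usual cause is a group with no matching name in the color vector, since a named vector passed to scale_color_manual() is matched to groups by name. A small base R check, sketched here with variable names of my own, can catch that before plotting:

```r
# named color vector as in the previous step
species_color <- c("setosa" = "#E41A1C", "versicolor" = "#CB6651", "virginica" = "#F781BF")

groups_in_data <- unique(as.character(iris$Species))

# groups with no color assigned would not get one of your chosen colors
uncolored <- setdiff(groups_in_data, names(species_color))
# colors defined for groups that never occur are harmless, but may signal a typo
unused <- setdiff(names(species_color), groups_in_data)

if (length(uncolored) > 0) {
  warning("No color defined for: ", paste(uncolored, collapse = ", "))
}
```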

If you want to get really fancy, you can wrap your ggplot with plotly to make your dendrogram interactive! Be sure to specify tooltip = "text" to control which information is displayed.

p.lotly <- ggplotly(p, tooltip = "text")


And there you have it - dendrogram ends dynamically colored by your metadata! I hope you found this helpful. If you have any questions, comments, or suggestions, please feel free to comment below or contact me directly! :]
