Hierarchical clustering on continuous heterogeneous variables with different range/scales in R

Question

I would like to use R to perform hierarchical clustering with two groups of variables describing the same samples. One group is microarray gene expression data (for specific genes) that have been normalized and batch-effect corrected. The other group consists of quantitative clinical parameters that describe the same samples. However, these clinical variables have not been normalized or transformed in any way (i.e. they are raw continuous values).

For example, one of these variables could range from 2 to 35, whereas another ranges from 0.1 to 0.9, etc.

Thus, as my ultimate goal is to implement hierarchical clustering and use both groups simultaneously (merged in a matrix/dataframe), in order to inspect which of these clinical variables cluster with specific genes, etc.:

1) Is an initial transformation of the clinical variables necessary before merging them with the genes and performing the clustering? For example, a log2 transformation, which has also been applied to part of my gene expression data.

2) Or would row scaling (that is, scaling across all features in the input data) account for this discrepancy?

3) For a similar analysis/approach, like constructing a correlation plot of all the variables above, would a simple scaling be sufficient?

David C. · Accepted Answer · 2017-01-28T01:55:07

Without having seen your gene expression data, I can only provide you some general suggestions based on your description, in the context of the 3 questions you asked:

1) You should definitely check the distribution of each group. In R, you may use one or more of the following functions to visualize the distribution:

x <- as.numeric(as.matrix(expression_data)) # flatten to a numeric vector; hist() and density() do not accept a data frame
hist(x)              # histogram
plot(density(x))     # density plot; alternative to the histogram
qqnorm(x); qqline(x) # normal QQ plot with a reference line

Since my understanding is that one group of your expression data is log2-transformed, that group should be roughly normally distributed (i.e. a bell-shaped histogram and points close to a straight line in the QQ plot), although this is worth checking rather than assuming. Whether to transform the group that has not yet been transformed depends on what you want to do with the data. For instance, if you want to use a t-test to compare the two groups, a transformation is advisable for strongly skewed data, because the t-test carries a normality assumption. With regard to hierarchical clustering, if you decide to use both groups in a single clustering analysis, it makes little sense to keep one group transformed and the other not.
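As a rough illustration of the check described above, the following sketch simulates a right-skewed variable and compares it before and after a log2 transformation. The data are simulated, and the variable names and the `skewness` helper are hypothetical, not part of the original question or of any package.

```r
set.seed(42)

# Hypothetical right-skewed clinical variable (simulated, not real data)
clin_raw <- rlnorm(200, meanlog = 1, sdlog = 0.8)
clin_log <- log2(clin_raw)  # log2 requires strictly positive values

# Hypothetical helper: sample skewness, near 0 for a symmetric distribution
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(clin_raw)
skew_log <- skewness(clin_log)
cat("skewness raw:", round(skew_raw, 2), " log2:", round(skew_log, 2), "\n")

# Visual check with QQ plots (only drawn in an interactive session)
if (interactive()) {
  par(mfrow = c(1, 2))
  qqnorm(clin_raw, main = "raw");  qqline(clin_raw)
  qqnorm(clin_log, main = "log2"); qqline(clin_log)
}
```

A skewness much closer to zero after the transformation, together with a straighter QQ plot, suggests the log2 scale is the more reasonable one for that variable. Variables that contain zeros or negative values would need a different treatment.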

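2) Scaling by features is a reasonable approach. Here is a clustering lecture from a Utah State Univ. stats course, with an example. If you decide to use the heatmap() function in R, its scale argument (scale = "row" or scale = "column") is an option for you. Note that this argument only rescales the values shown in the image; the dendrograms are computed from the matrix as supplied, so apply scale() to the data beforehand if the clustering itself should use scaled values.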
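A minimal sketch of feature-wise scaling followed by hierarchical clustering, assuming samples are in rows and features in columns. The data are simulated; the object names, the Euclidean distance and the average linkage are illustrative choices, not something prescribed by the question.

```r
set.seed(1)
n <- 30  # number of samples

# Hypothetical data: 4 genes on a log2 scale and 2 raw clinical variables
genes <- matrix(rnorm(n * 4, mean = 8, sd = 1), nrow = n,
                dimnames = list(NULL, paste0("gene", 1:4)))
clinical <- cbind(clin_a = runif(n, 2, 35),
                  clin_b = runif(n, 0.1, 0.9))
merged <- cbind(genes, clinical)  # samples in rows, features in columns

# Centre each feature to mean 0 and scale it to standard deviation 1
z <- scale(merged, center = TRUE, scale = TRUE)

# Cluster the features (columns), so transpose before computing distances
hc <- hclust(dist(t(z)), method = "average")
groups <- cutree(hc, k = 2)  # cut the tree into 2 groups of features

if (interactive()) plot(hc)  # dendrogram of genes and clinical variables
```

After scaling, every feature contributes on the same footing to the Euclidean distance, so a variable ranging from 2 to 35 no longer dominates one ranging from 0.1 to 0.9. A correlation-based distance such as as.dist(1 - cor(merged)) is another common choice when the aim is to see which variables behave alike.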

3) I don't think there is a definitive answer to your third question. It depends on how many features you have and what analyses you will be doing downstream. Similar to question 1, I would argue that simple scaling may be sufficient for visualizing your data by hierarchical clustering. However, do keep in mind that if you decide to fit a linear model (which is very common with microarray gene expression data), you might want to consider more sophisticated data scaling.
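For the correlation plot specifically, one point can be checked directly: Pearson correlation does not change under centring and scaling, and Spearman correlation does not change under a monotone transformation such as log2. The sketch below illustrates this on simulated data; all object names are hypothetical.

```r
set.seed(2)
n <- 30

# Hypothetical merged data: one log2-scale gene and two raw clinical variables
merged <- cbind(gene1  = rnorm(n, mean = 8, sd = 1),
                clin_a = runif(n, 2, 35),
                clin_b = runif(n, 0.1, 0.9))

# Pearson correlation before and after feature-wise scaling
r_raw    <- cor(merged)
r_scaled <- cor(scale(merged))
same_pearson <- isTRUE(all.equal(r_raw, r_scaled))

# Spearman correlation before and after a log2 transformation
s_raw <- cor(merged, method = "spearman")
s_log <- cor(log2(merged), method = "spearman")
same_spearman <- isTRUE(all.equal(s_raw, s_log))

cat("Pearson unchanged by scaling:", same_pearson, "\n")
cat("Spearman unchanged by log2:  ", same_spearman, "\n")
```

So for a correlation matrix, simple scaling neither helps nor hurts. What can matter is a nonlinear transformation (for example log2) when Pearson correlation is used, or switching to a rank-based correlation when some variables are strongly skewed.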
