Import datasets

# install.packages(c("mixOmics","scales","knitr"))
library(mixOmics)
library(RColorBrewer)
library(knitr)
X <- as.matrix(read.csv2("genes.csv"))
y <- read.csv2("cellType.csv",header = F)$V1

PLS, sPLS and sPLS-DA

The package comes from (Lê Cao, Boitard, and Besse 2011) and the data come from (Zheng et al. 2017).

As a reminder, each axis of a PLS problem solves the following optimization problem :

\[ \max_{\substack{\boldsymbol{u}^T\boldsymbol{u}=1\\\boldsymbol{v}^T\boldsymbol{v} =1}} cov(\boldsymbol{Xu},\boldsymbol{Yv})= \max_{\substack{\boldsymbol{u}^T\boldsymbol{u}=1\\\boldsymbol{v}^T\boldsymbol{v} =1}} \boldsymbol{u}^T\boldsymbol{X}^T\boldsymbol{Y}\boldsymbol{v}. \]

The deflation permits to repeat that problem over successive axes in orthonormal ways.

The sPLS is the \(L_1\) modified optimization problem, like for the LASSO, which constraints the weights to be smaller than the unconstrained problem (PLS).

sPLS-DA is a PLS method which account to answer to 2 questions in plus than the classical PLS model :

So, the sPLS-DA permits to select variables on different axes which will discriminate the different classes. For this we will have to tune two parameters :

Remark : In the sPLS problem we do not have to tune \(keep_X\) because we fix it to the number of variables.

Back to the 10X dataset!

We have decided to deal with the 10X dataset and to treat firstly the case of 4 populations which are easily separable, which are :

So, first of all we create new datasets with only the considered cells. We have to standardize the X dataset :

Tune the parameters

The tricky part of the algorithm is to tune the different parameters \(n_{comp}\) and \(keep_X\).

The common way is to tune \(keep_X\) on the \(1^{st}\) component, then do the deflation, then tune \(keep_X\) on the \(2^{nd}\) component and re-apply up to a coherent value for \(n_{comp}\).

To fix \(n_{comp}\) we will most of times use the rule of thumb : \(n_{comp}\)=K-1, where K is the number of classes in the dataset.

Here K=4 so we will build

\[n_{comp}=3,\]

different components.

Now we have to find a way of selecting the number of variables per axis using a validation criterion. We will use here the quality of classification in a cross-validation based method.

More precisely

Tune \(keep_{X_1}\) : the number of variables selected along the first axis

Import the function crossValidate_splsDA from file crossValidate_splsDA.R.

Please use that function to perform cross-validation, plot the error and discuss the results.

… it took a few times but you found certainly that \(keep_X \in (2,7)\) are quite good possibilities. Tell us if you found anything different…

Assume you have selected the parameter :

\[ keep_{X_1}=7. \]

Get \(keep_{X_2}\) and \(keep_{X_3}\)

As we do not want to waste to much time, we have performed the other cross-validation procedures. The code is here :

# ncomp <- 2
# previsous.keepX <- 7
# errors <- crossValidate_splsDA(X_4_pop,y_4_pop,
#                                ncomp=ncomp,
#                                keepXs=keepXs,
#                                previsous.keepX=previsous.keepX,
#                                n.folds=20)
# plot(errors,pch=16,type="b")
# 
# ncomp <- 3
# previsous.keepX <- c(7,4)
# errors <- crossValidate_splsDA(X_4_pop,y_4_pop,
#                                ncomp=ncomp,
#                                keepXs=keepXs,
#                                previsous.keepX=previsous.keepX,
#                                n.folds=20)
# plot(errors,pch=16,type="b")

Which permits to select the following values :

\[(keep_{X_2},keep_{X_3})=(5,6)\]

Check the final model

As we have selected the parameters which gives a good level of satisfaction in terms of prediction, we want to check out what we selected.

Model builing

Build the desired model and plot the components. Discuss about the interest of the components to discriminate the classes.

Selected variables

Find the positions of the genes whch are selected in that study

PlotVar representation

It permits to represent the weights in two-dimensionnal figures. Look for that function in the Help and plot it.

What are the problems of that representation ?

Contribution plots

Those plots are quite fancy to check the weights of each variable selected along its component. Look for that function in the Help and plot it.

cim representation

An interpretable way of representing the selected genes is to use cim representation from mixOmics. Plot the selected genes and only them ;).

How can you comment that figure ?

Can you find information redundant with previous figures ?

Variances explained

Plot the variance explained by each axis

Are the variances increasing or decreasing ?

Is that necessarly the case

Predict functions

PLS-based methods permit to construct a classification model.

For example, if we want to test a few predictions :

Do the same with the all dataset

No cheat here but ask for help if needed!

References

Lê Cao, Kim-Anh, Simon Boitard, and Philippe Besse. 2011. “Sparse Pls Discriminant Analysis: Biologically Relevant Feature Selection and Graphical Displays for Multiclass Problems.” BMC Bioinformatics 12 (1). BioMed Central: 253.

Zheng, Grace XY, Jessica M Terry, Phillip Belgrader, Paul Ryvkin, Zachary W Bent, Ryan Wilson, Solongo B Ziraldo, et al. 2017. “Massively Parallel Digital Transcriptional Profiling of Single Cells.” Nature Communications 8. Nature Publishing Group: 14049.