Simulated Data with Noises

Supporting Information:

Simulated data with noises:

Results and discussion:

The results for every approach applied to the simulated data with noise are listed in Table 1 and Table 2. For the SIMU1 data, all four approaches achieved good prediction performance. For the SIMU2 data, PLS failed to achieve good performance because of the influence of the outliers, whereas RPLS and MCOVS-PLS correctly diagnosed the outliers (samples 46-50). For the SIMU3 data, similar results were obtained; here, however, it was IVE-PLS rather than RPLS that gave better prediction performance than PLS. Table 1 therefore suggests that MCOVS-PLS can take both directions (i.e., sample space and feature space) into account simultaneously and thus yields good prediction ability in both situations.

For the SIMU4 data, six approaches, comprising the single approaches and their simple combinations, were used to build robust and reliable PLS models. PLS again showed the poorest prediction ability among all the modeling approaches. IVE-PLS gave better results than RPLS, which may indicate that the influence of variable selection is far larger than that of RPLS. In the IVE-PLS model, 22 redundant variables entered the established model. Compared with the IVE-PLS results for the SIMU3 data, this seems to indicate that the presence of outliers causes the model to select excess redundant variables to compensate for the errors caused by these outliers. Similarly, RPLS failed to detect all five true outliers; only samples 47 and 48 were correctly diagnosed. This indicates that the redundant variables interfere with outlier detection in RPLS.

For the two combination approaches, results similar to those for the simulated data without noise were obtained. However, IVE-PLS+RPLS detected more true outliers than RPLS+IVE-PLS, so it is somewhat surprising that RPLS+IVE-PLS nevertheless produced better prediction results than IVE-PLS+RPLS. Among all the modeling approaches, MCOVS-PLS yielded the best prediction results. It not only diagnosed all five true outliers but also selected only useful variables.

Table 1 Prediction results for the first three simulated data sets using different modeling approaches

Methods / RMSECV^a / Q2 / LVs^b / Vars / Outliers
SIMU1
PLS / 0.1861 / 0.9985 / 5 / 100 / ---
IVE-PLS / 0.1853 / 0.9985 / 5 / 100 / ---
RPLS / 0.1870 / 0.9985 / 5 / 100 / ---
MCOVS-PLS / 0.1849 / 0.9985 / 5 / 100 / ---
SIMU2
PLS / 0.5520 / 0.9875 / 5 / 100 / ---
RPLS / 0.1840 / 0.9986 / 5 / 100 / 46-50
MCOVS-PLS / 0.1844 / 0.9986 / 5 / 100 / 46-50
SIMU3
PLS / 0.4713 / 0.9904 / 13 / 100+100 / ---
IVE-PLS / 0.1623 / 0.9988 / 3 / 91+7 / ---
MCOVS-PLS / 0.1618 / 0.9988 / 3 / 92+6 / ---

^a Root mean square error of cross-validation. ^b Number of latent variables in the PLS model.
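The two figures of merit reported in the tables can be illustrated with a short sketch. The source does not state the formulas it used, so the definitions below are an assumption based on the conventional ones: RMSECV is the square root of the mean squared cross-validated prediction error, and Q2 is one minus the ratio of the prediction error sum of squares (PRESS) to the total sum of squares around the mean response. The function name `rmsecv_and_q2` and its arguments are hypothetical.

```python
import math


def rmsecv_and_q2(y_true, y_cv_pred):
    """Return (RMSECV, Q2) from observed responses and their
    cross-validated predictions.

    Assumed (conventional) definitions, not confirmed by the source:
      PRESS  = sum_i (y_i - yhat_cv_i)^2
      RMSECV = sqrt(PRESS / n)
      Q2     = 1 - PRESS / sum_i (y_i - mean(y))^2
    """
    n = len(y_true)
    if n == 0 or n != len(y_cv_pred):
        raise ValueError("y_true and y_cv_pred must be non-empty and equal in length")

    mean_y = sum(y_true) / n
    # Prediction error sum of squares over the held-out predictions
    press = sum((y - p) ** 2 for y, p in zip(y_true, y_cv_pred))
    # Total sum of squares around the mean response
    tss = sum((y - mean_y) ** 2 for y in y_true)
    if tss == 0:
        raise ValueError("Q2 is undefined when all responses are identical")

    return math.sqrt(press / n), 1.0 - press / tss


# Toy illustration with made-up numbers (not data from this study)
rmsecv, q2 = rmsecv_and_q2([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

Under these definitions a single large residual inflates PRESS, raising RMSECV and lowering Q2 together, which is the pattern seen for plain PLS on the data sets that contain outliers.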

Table 2 Prediction results for SIMU4 using different modeling approaches

Methods / RMSECV / Q2 / LVs / Vars / Outliers
PLS / 0.8306 / 0.9720 / 2 / 100+100 / ---
IVE-PLS / 0.3234 / 0.9957 / 13 / 94+22 / ---
RPLS / 0.7142 / 0.9773 / 4 / 100+100 / 2,8,30,47,48
IVE-PLS+RPLS / 0.2828 / 0.9961 / 13 / 94+22 / 22,47,48,49,50
RPLS+IVE-PLS / 0.2586 / 0.9969 / 8 / 93+17 / 2,8,30,47,48
MCOVS-PLS / 0.1812 / 0.9986 / 5 / 82+0 / 46-50
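The outlier columns can be compared against the true outliers (samples 46-50, as stated in the discussion) with a simple set intersection. The sketch below only re-counts the entries of Table 2; the helper `parse_samples` and the variable names are hypothetical.

```python
def parse_samples(text):
    """Expand a table entry such as '46-50' or '2,8,30,47,48' into a set of ints."""
    samples = set()
    for token in text.replace(",", " ").split():
        if "-" in token:
            low, high = token.split("-")
            samples.update(range(int(low), int(high) + 1))
        else:
            samples.add(int(token))
    return samples


# True outliers as given in the discussion of the simulated data
TRUE_OUTLIERS = parse_samples("46-50")

# Flagged samples copied from the Outliers column of Table 2
flagged = {
    "RPLS": "2,8,30,47,48",
    "IVE-PLS+RPLS": "22,47,48,49,50",
    "RPLS+IVE-PLS": "2,8,30,47,48",
    "MCOVS-PLS": "46-50",
}

# For each method, the flagged samples that are true outliers
hits = {
    method: sorted(parse_samples(entry) & TRUE_OUTLIERS)
    for method, entry in flagged.items()
}

for method, found in hits.items():
    print(f"{method}: {len(found)} of {len(TRUE_OUTLIERS)} true outliers {found}")
```

This reproduces the counts quoted in the text: two true outliers for RPLS and RPLS+IVE-PLS, four for IVE-PLS+RPLS, and all five for MCOVS-PLS.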

Figure Captions

Figure 1 Standardized residual versus score distance for the simulated data (SIMU2) using two latent variables in RPLS. Five extreme samples were correctly detected as y outliers.

Figure 2 Standardized residual versus score distance for the simulated data (SIMU4) using five latent variables in RPLS. Five extreme samples were detected as y outliers or leverage points; however, only two true outliers (47 and 48) were correctly detected.

Figure 3 RMSECV versus the number of variables eliminated using the IVE-PLS approach for the SIMU4 data set.

Figure 4 Standardized residual versus score distance for the simulated data (SIMU4) using two latent variables in IVE-PLS+RPLS. Five extreme samples were detected as y outliers. After the elimination of certain redundant variables, RPLS correctly detected four true outliers (samples 47-50).

Figure 5 RMSECV versus the number of variables eliminated using the RPLS+IVE-PLS approach for the SIMU4 data set.

Figure 6 RMSECV versus the number of variables eliminated using the MCOVS-PLS approach for the SIMU4 data set.

Figure 7 Pathway analysis of the outlying molecules detected in different chemical spaces of descriptors. Each row shows the 5 outliers detected in the current iteration. Each iteration removes the two least important descriptors. The outliers in each row are ranked by residual, from smallest to largest.

Boiling point data:

The outliers detected by different approaches

RPLS and RPLS+IVE-PLS: 210 75 46 211 129 138 119 170 130 45 220 214 139 122 121 111 213 1 173 85

IVE-PLS+RPLS: 210 46 3 211 1 43 172 220 130 138 119 139 44 129 122 111 213 121 173 85

MCOVS-PLS: 43 115 65 119 215 44 46 85 111 121 122 129 130 138 139 170 172 173 218 227

The variables selected by different approaches

IVE-PLS and IVE-PLS+RPLS: 2 8 11 16 17 22 23 27 34 37 39 42 44 54 56 59 60 61

RPLS+IVE-PLS: 2 9 10 13 17 18 20 28 35 39 40 47 60 61

MCOVS-PLS: 17 21 22 23 24 27 28 34 37 42 54 56 59 60

Figure 8 Boiling point data: pathway analysis of the outlying molecules detected in different chemical spaces of descriptors. Each row shows the outliers detected in the current iteration. Each iteration removes the two least important descriptors. The outliers in each row are ranked by residual, from smallest to largest.