Results of PLS1

PLS1 is a regression model that uses a matrix X (the predictors) to predict a single vector y (the predictand). It is a well-established algorithm, widely used in several fields, including analytical chemistry and process monitoring (see literature).
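As an illustration of where the quantities plotted on this page (scores t and u, weights w, predictions) come from, the following is a minimal sketch of the classical NIPALS formulation of PLS1. It assumes that X and y are already centred and uses NumPy; the function name `pls1_nipals` is hypothetical and the code is not taken from CuBatch, whose implementation may differ.

```python
import numpy as np

def pls1_nipals(X, y, n_lv):
    """Fit a PLS1 model with n_lv latent variables (X and y assumed centred)."""
    X = np.array(X, dtype=float)   # working copies, deflated at each step
    y = np.array(y, dtype=float)
    n, p = X.shape
    T = np.zeros((n, n_lv))        # X scores t
    U = np.zeros((n, n_lv))        # y scores u
    W = np.zeros((p, n_lv))        # weights w
    P = np.zeros((p, n_lv))        # X loadings
    q = np.zeros(n_lv)             # y loadings (inner regression coefficients)
    for a in range(n_lv):
        w = X.T @ y                # direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = X @ w
        tt = t @ t
        p_a = X.T @ t / tt
        q[a] = y @ t / tt
        U[:, a] = y / q[a]         # for a single y, u is y rescaled by q
        X -= np.outer(t, p_a)      # deflate X
        y -= q[a] * t              # deflate y
        T[:, a], W[:, a], P[:, a] = t, w, p_a
    b = W @ np.linalg.solve(P.T @ W, q)   # regression vector: y_hat = X @ b
    return T, U, W, b
```

With as many latent variables as X has independent columns, the predictions `X @ b` coincide with the ordinary least-squares fit; fewer latent variables give the regularised models whose diagnostics are described below.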

Several plots may be of interest, depending on the purpose of the analysis.
Some of them are available for both X and Y,

while others are specific to X:

or to Y:

Some other plots become available only in specific cases:

The number of models computed and validated depends on the number of Y variables that were originally selected.
The predicted variable to inspect is chosen directly in the 'Results' menu (red mark).
Whenever the variable name is not in the plot's title, it appears in the InfoBox as the name of the Y data.

The choice between X and Y plots can be made in the following submenu (yellow).
The subsequent levels of menus are those that allow the actual choice of the type of plot.
 

NB Some plots need the number of latent variables to be specified (e.g. D-statistic). This is done via the requester shown on the side.

Whenever projections are present, the corresponding results are given precedence over the calibration (which can optionally be displayed) and the validation.


 


Score/Weights

Weights and scores can be displayed in one-, two- or three-dimensional plots (1D, 2D or 3D).

After the desired menu is selected, an "nPLS1 Plot Control" window opens:

The "Axes" frame

It allows choosing which LV, if any, is to be plotted along which axis:

  1. 1D plots: the X-axis shows the scalars of the selected mode and the Y-axis the scores/weights. Only one menu is present in the "Axes" frame.
    Choosing "All" plots all the components at the same time.
  2. 2D plots: the first menu from the left refers to the X-axis and the second to the Y-axis. The "All" option is not available.
  3. 3D plots: the first menu from the left refers to the X-axis, the central one to the Y-axis and the last to the Z-axis.
    The "All" option is not available.


The "Validation & prediction" frame

Its content varies depending on the type of validation (if any) and on whether the model is applied to external data:

Choosing to plot replicates/predictions deactivates the "All" option in the 1D plots.
NB Due to the intrinsic rotational indeterminacy, the replicates normally form an "arc" centred on the origin of the axes and do not "cluster" around a specific point. The option to display them was kept because a suitable rotation to a common model is expected to be implemented in future versions of the software.


The "Display options" frame and the preferences menu

The first menu in the frame allows choosing the type of marker:

The second menu specifies whether a line should link the points (Continuous) or not (Discrete).

The submenus in the 'Preferences' menu are activated/deactivated depending on the choices made on the main control window:

The other submenus are:

The changes become visible only after the 'Plot' button is pressed.


Explained Variation, PRESS, RMSE
 

Explained Variation (EV, expressed as a % of the total variation in the set), Prediction REsiduals Sum of Squares (PRESS) and Root Mean Squared Error (RMSE) all reflect the goodness of the fit. In particular:

EV% = 100 (1 - PRESS/TSS)

where the PRESS refers to the calibration or to the validation samples/batches and TSS stands for Total Sum of Squares.
The RMSE is linked to the PRESS by the relation:

RMSE = sqrt(PRESS/n)

where n is the number of "non-missing" elements in the array.
They therefore give the same type of information, albeit in different measurement units.
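The relation between these three quantities can be sketched numerically as follows, assuming EV% = 100 (1 - PRESS/TSS) and RMSE = sqrt(PRESS/n), with TSS taken as the sum of squares of the already preprocessed (e.g. centred) measured values. The function name `fit_statistics` is hypothetical and missing elements are represented here by NaN; CuBatch may handle both points differently.

```python
import math

def fit_statistics(measured, predicted):
    """Return (PRESS, EV%, RMSE) computed over the non-missing elements."""
    # Keep only the pairs whose measured value is not missing (NaN).
    pairs = [(m, p) for m, p in zip(measured, predicted) if not math.isnan(m)]
    n = len(pairs)
    press = sum((m - p) ** 2 for m, p in pairs)   # residual sum of squares
    tss = sum(m ** 2 for m, _ in pairs)           # total sum of squares
    ev = 100.0 * (1.0 - press / tss)              # explained variation in %
    rmse = math.sqrt(press / n)                   # same units as the data
    return press, ev, rmse

measured = [1.0, -1.0, 2.0, float("nan"), -2.0]
predicted = [0.5, -1.0, 2.0, 0.0, -1.5]
press, ev, rmse = fit_statistics(measured, predicted)
print(press, ev, rmse)
```

In this example four elements are non-missing, so PRESS = 0.5, TSS = 10 and EV = 95%; the RMSE carries the same information in the units of the data.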

These plots are available for both X and y.
The plots on X are also defined when new data are projected on a model. These values are not available for y, because the measured y of the new data is not known.

To plot the overall explained variation, select the following menu:

The other submenus will have the EV% plotted against a specific axis (see figure on the side). Subsets can be chosen, as for the residual sum of squares.


When the model is validated using full cross-validation, the EV can be computed either on the complete model or on the predictions of the samples left out at each step. The choice is made through a requester that appears when subsets are selected in the first mode:

The default is "Calibration".
A similar choice is to be made when test set validation is employed and the explained variation is displayed versus a mode other than the first. In this case the user is asked whether the residuals shall refer to the calibration or to the test set.
Again the default is "Calibration".


Residuals sum of squares (Q-statistics)

This plot involves many selections, besides the choice of the model's rank:
 

1) Select the mode to plot against in the 'Results' -> 'Var. name' -> 'X plots' -> 'Residuals' menu

 

NB Although the model is computed on the matricised form of X, the residuals are plotted according to the original size and number of dimensions of the array. To obtain only two modes, the array must be matricised before computing the model (see Edit -> Reshape).
 

2) Choose subsets in various modes

When the model is validated using full cross-validation, the residuals can be computed either on the complete model or on the predictions of the samples left out at each step. The choice is made through a requester that appears when subsets are selected in the first mode:

The default is "Calibration".
A similar choice is to be made when test set validation is employed and the Q-statistic is displayed versus a mode other than the first. In this case the user is asked whether the residuals shall refer to the calibration or to the test set.
Again the default is "Calibration".

If more than one sample is selected and the desired mode is not the first, one plot per sample is displayed. The sample label can be seen by clicking on the plot with the left mouse button.
The confidence limits for the Q-statistic are computed on the basis of Jackson and Mudholkar and displayed as light blue solid/dashed lines. The NOC batches/samples (the calibration set) are used to compute these limits.
NB It must be kept in mind that the residuals in the (n)PLS case are unlikely to be normally distributed. Although the confidence limits are computed on the basis of the first moments of the residuals, so that this is partially accounted for, their meaningfulness should be judged with great care.

Note: if subsets are selected in modes other than the first, the non-normalised residuals plot actually displays the contribution plot to the Q-statistic.
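The Q-statistic and a Jackson and Mudholkar type limit can be sketched as follows. This is a minimal illustration under stated assumptions, not the CuBatch code: the residuals of each sample are assumed to be unfolded into one row of a matrix, the theta terms are taken as the sums of the first three powers of the eigenvalues of the residual covariance of the NOC (calibration) set, and the function names are hypothetical.

```python
from statistics import NormalDist
import numpy as np

def q_statistic(E):
    """Residual sum of squares of each sample (one sample per row of E)."""
    return np.sum(np.asarray(E, dtype=float) ** 2, axis=1)

def jackson_mudholkar_limit(E_cal, alpha=0.95):
    """Approximate upper limit for Q from the calibration residuals E_cal."""
    E_cal = np.asarray(E_cal, dtype=float)
    S = np.cov(E_cal, rowvar=False)                  # residual covariance
    eig = np.clip(np.linalg.eigvalsh(S), 0.0, None)  # guard against tiny negatives
    th1, th2, th3 = (np.sum(eig ** k) for k in (1, 2, 3))
    h0 = 1.0 - 2.0 * th1 * th3 / (3.0 * th2 ** 2)
    c = NormalDist().inv_cdf(alpha)                  # standard normal quantile
    term = (c * np.sqrt(2.0 * th2 * h0 ** 2) / th1
            + 1.0
            + th2 * h0 * (h0 - 1.0) / th1 ** 2)
    return th1 * term ** (1.0 / h0)

# Synthetic residuals standing in for the NOC set (assumption for the example).
rng = np.random.default_rng(1)
E_noc = rng.normal(size=(500, 10))
limit_95 = jackson_mudholkar_limit(E_noc, alpha=0.95)
fraction_inside = np.mean(q_statistic(E_noc) < limit_95)
print(limit_95, fraction_inside)
```

For residuals that really are normally distributed, roughly 95% of the calibration samples fall below the 95% limit; the further the actual residuals are from normality, the less this can be relied upon, which is the caveat raised in the NB above.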


Slab-wise congruence

The congruence (i.e. the cosine) between the real data and the model is plotted versus the scalars in the selected mode.
A congruence of 1 means that the corresponding slab is perfectly recovered by the model, although, due to noise, this limit is hardly ever attained.
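A minimal sketch of this computation, assuming the data and the model reconstruction are available as NumPy arrays of the same shape and contain no missing values (the function name `slabwise_congruence` is hypothetical):

```python
import numpy as np

def slabwise_congruence(X, X_model, mode):
    """Cosine between data and model for each slab along the given mode."""
    # Bring the selected mode to the front and unfold each slab into a row.
    D = np.moveaxis(np.asarray(X, dtype=float), mode, 0)
    M = np.moveaxis(np.asarray(X_model, dtype=float), mode, 0)
    D = D.reshape(D.shape[0], -1)
    M = M.reshape(M.shape[0], -1)
    inner = np.sum(D * M, axis=1)
    norms = np.linalg.norm(D, axis=1) * np.linalg.norm(M, axis=1)
    return inner / norms

# Example: a samples x emission x excitation array, congruence per emission slab.
rng = np.random.default_rng(2)
model = rng.normal(size=(5, 66, 8))
data = model + 0.1 * rng.normal(size=model.shape)   # model plus noise
print(slabwise_congruence(data, model, mode=1).shape)
```

Because the cosine is insensitive to the overall scale of a slab, a value close to 1 indicates that the shape of the slab is recovered, not necessarily its magnitude.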

When the model is validated using full cross-validation, the congruence can be computed either on the complete model or on the predictions of the samples left out at each step. The choice is made through a requester:

Figures a) and b) show the congruence for the 66 emission wavelengths of the Fluorescence data set in a 4 LV model for variables 'DOPA' and 'Tyro'.
When using nPLS1, different models can be obtained on the same X depending on the predicted variable. It is therefore not surprising that some systematic variation (likely connected to other compounds) is left in the residuals (figure b) and that the various compounds lead to different congruence profiles (here for the emission mode).

 

a) 'DOPA'
b) 'Tyro'


D-statistic

The D-statistic is Hotelling's T² statistic computed in the reduced space of R components instead of on x with its JK (or JKL, etc.) variables; in other words, it is the Mahalanobis distance of a given sample/batch from the origin of the axes in the model space.
It is used as a diagnostic tool especially in the field of Multivariate Statistical Process Control (MSPC): if the D-statistic for a certain batch is larger than a threshold determined on the basis of an F-distribution, the batch is considered faulty (post-batch analysis). The D-statistic limits (set at 95% and 99%) depend on the number of samples in the NOC (Normal Operating Conditions) data and on the rank of the model.
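The distance itself can be sketched as follows, assuming the scores of the NOC samples are available as an I x R matrix and are centred on the origin (the function name `d_statistic` is hypothetical). The F-distribution threshold is deliberately left out of this sketch.

```python
import numpy as np

def d_statistic(T_noc, T_new=None):
    """Mahalanobis distance of each score vector from the origin.

    T_noc: (I x R) scores of the NOC samples, used to estimate the covariance.
    T_new: optional scores of new samples/batches; defaults to the NOC scores.
    """
    T_noc = np.asarray(T_noc, dtype=float)
    T_new = T_noc if T_new is None else np.asarray(T_new, dtype=float)
    S = np.atleast_2d(np.cov(T_noc, rowvar=False))   # R x R score covariance
    S_inv = np.linalg.inv(S)
    # t_i' S^-1 t_i for every row t_i of T_new
    return np.einsum("ir,rs,is->i", T_new, S_inv, T_new)

rng = np.random.default_rng(4)
scores = rng.normal(size=(30, 3))
scores -= scores.mean(axis=0)          # centred NOC scores (assumption)
print(d_statistic(scores).round(2))
```

A sample with a large value lies far from the bulk of the NOC samples within the model space, whereas the Q-statistic measures the distance from the model space itself.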

[1] Nomikos P., MacGregor J.F., "Monitoring batch processes using multiway principal component analysis", AIChE Journal, Vol 40, n°8, 1994, 1361-1373
[2] Westerhuis J.A., Gurden S.P., Smilde A.K., "Generalized contribution plots in multivariate statistical process monitoring", Chemometrics and Intelligent Laboratory Systems, Vol 51, 2000, 96-114
[3] Nomikos P., MacGregor J.F., "Multi-way partial least squares in monitoring batch processes", Chemometrics and Intelligent Laboratory Systems, Vol 30, 1995, 97-108

 


Residual and Model Landscape

Residuals in particular are very powerful diagnostic tools. Most models work under the assumption that the residuals are independent and identically distributed, possibly according to a normal distribution centred on 0.
The presence of systematic variation in the residuals may reflect an inappropriate choice of the rank or, more simply, an inadequacy of the model in explaining the data at hand. As nPLS1 does not maximise the explained variation on the X array, the residuals can retain some of the systematic variation, which makes their use somewhat more difficult.

A three LV model on the fluorescence data (rank 4) for predicting 'Tyro' yields very systematic residuals.

A four LV model on the same data, again predicting 'Tyro', yields better predictions; the residuals are relatively non-systematic and of much smaller magnitude.

   


Predicted vs Measured

This plot shows the predicted values versus the measured ones.
The display options and the number of LVs to use can be chosen via the "nPLS1 Plot Control" window that opens after clicking on the menu.
Replicates and predictions can also be plotted when available.
This plot is not available when new data is projected on the model.
The green bisecting line represents the ideal case, i.e. predictions equal to the measured values.
The black labels always identify the model, while the red ones mark the predictions from the leave-one-out procedure (as in the figure) or the replicates from any resampling method.


t vs u

The nPLS (as well as PLS) regression coefficients define a linear relationship between the scores t of the X array and the scores u of the Y array (for (n)PLS1, the y vector).
The "t vs u" plot can help in determining the correct number of components: when the relation between predictors and predictands becomes random (e.g. the t vs u plot resembles a scatter-shot), there is likely "no model" between X and y. It may then be better to use fewer LVs than the number displayed in the plot, so as to avoid overfitting.
The display options are available via the standard "nPLS1 Plot Control" window.
 

a) there is an evident correlation between the scores in X and those in y

b) the u scores seem to vary independently of the t scores. The linear model (represented by the line) does not capture any systematic variation: 6 LVs are too many.


Predictions plot

This plot is equivalent to the predicted vs measured plot, except that the predictions for the new data projected on the model lie on the diagonal (i.e. for these samples only, the predicted value is also used as the x-coordinate).
 


D-statistic on-line

The D-statistic can be computed in an on-line fashion by filling in the incomplete sample/batch. It is then possible to detect the occurrence of a fault during the evolution of the batch itself.
There are several options for filling in the batches; CuBatch supports two: 'zero' and 'current deviation'. In the first, the remainder of the sample/batch is treated as if it proceeded like the NOC samples/batches; in the second, it is assumed that the difference of the current batch from the NOC samples/batches remains constant for the rest of the batch. For more information, see the literature.
The fill-in method is requested every time the plot is opened in the 'advanced' mode and never in the 'plant' mode.
The two figures show two possible evolutions: in figure a) no fault occurs; in figure b) the fault occurs at the very beginning of the batch.

a) b)
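The two fill-in options can be sketched as follows. The sketch assumes that a batch is stored as a variables x time array, that the NOC behaviour is summarised by the mean trajectory of the NOC batches, and that `k` is the index of the last measured time point; the function name `fill_in` and these conventions are assumptions made for the example, not the CuBatch implementation.

```python
import numpy as np

def fill_in(batch, noc_mean, k, method="zero"):
    """Complete a batch observed up to time index k (inclusive).

    batch, noc_mean: arrays of shape (variables, time).
    method: 'zero' (future deviations from the NOC trajectory are zero) or
            'current deviation' (the deviation at time k persists).
    """
    batch = np.asarray(batch, dtype=float)
    noc_mean = np.asarray(noc_mean, dtype=float)
    filled = noc_mean.copy()                  # future follows the NOC trajectory
    filled[:, :k + 1] = batch[:, :k + 1]      # keep what has been measured
    if method == "current deviation":
        deviation = batch[:, k] - noc_mean[:, k]
        filled[:, k + 1:] += deviation[:, None]
    elif method != "zero":
        raise ValueError("method must be 'zero' or 'current deviation'")
    return filled

noc_mean = np.array([[1.0, 2.0, 3.0, 4.0],
                     [0.0, 0.5, 1.0, 1.5]])
batch = noc_mean + 2.0                         # constant offset from NOC
print(fill_in(batch, noc_mean, k=1, method="zero"))
print(fill_in(batch, noc_mean, k=1, method="current deviation"))
```

The completed batch is then projected on the model as if it were a finished batch, and the D-statistic (or the Q-statistic) is recomputed at every new time point.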


Q-statistic on-line
 

The Q-statistic can be computed (like the D-statistic) in an on-line fashion by adequately filling in the incomplete batches (see the D-statistic on-line section or the literature for more details).
This plot is started from the 'On-line' menu indicated by the light blue circle.

The evolution of the Q-statistic is displayed versus the time scalars.
The confidence limits for the RSS-online are based on the work of Jackson and Mudholkar.
This plot is available (like the other "on-line" plots) only if the last mode is named 'Time' (case insensitive).
 


SPE

The SPE is the Sum of Squares of the Residuals calculated only at time t.
The evolution of the SPE is plotted versus the scalars in the time mode.
The confidence limits are again based on the work of Jackson and Mudholkar.
This plot is available (like the other "on-line" plots) only if the last mode is named 'Time' (case insensitive).
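The difference between the SPE and the residual sum of squares of a whole batch can be sketched as follows, assuming the residuals of one batch are stored as a variables x time array (the function name `spe_profile` is hypothetical):

```python
import numpy as np

def spe_profile(E):
    """SPE at each time point of one batch; E has shape (variables, time)."""
    E = np.asarray(E, dtype=float)
    return np.sum(E ** 2, axis=0)       # sum over the variables only

residuals = np.array([[1.0, 2.0, 0.0],
                      [3.0, 0.0, 1.0]])
spe = spe_profile(residuals)
print(spe)          # one value per time point
print(spe.sum())    # residual sum of squares of the whole batch
```

Each SPE value uses only the residuals at that time point, so summing the profile over time gives the residual sum of squares of the complete batch.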