Results of PARAFAC

PARAFAC is a decomposition method that can be used both for exploratory analysis and for curve resolution.
The same model can also be applied to monitoring schemes for batch processes.
Quantitative determination (i.e. regression) is possible in theory, but it has not been implemented yet.

For more information see literature.

Several plots are of interest depending on the purpose of the analysis.

Some other plots become available only in specific cases:

NB Some plots allow for several models to be present; when this is not the case (like for the Score/Loadings plots), the user is requested to choose which model to plot:

Whenever projections are present, the corresponding results are given precedence over the calibration results (which can optionally be displayed) and the validation results.


 


Score/Loadings

Loadings and scores can be displayed in one-, two- or three-dimensional plots (1D, 2D or 3D).

If the model has too few components, some plot dimensions are inactive (e.g. the 3D plots are not available for a two-component PARAFAC model).
After the desired menu is selected, a "PARAFAC Plot Control" window opens:

The "Axes" frame

It allows the user to choose which component, if any, is plotted along which axis:

  1. 1D plots: the X-axis shows the scalars of the selected mode and the scores/loadings are on the Y-axis. Only one menu is present in the "Axes" frame.
    Choosing "All" plots all the components at the same time.
     
  2. 2D plots: the first menu from the left refers to the X-axis and the second to the Y-axis. The "All" option is not available.
  3. 3D plots: the first menu from the left refers to the X-axis, the central one to the Y-axis and the last to the Z-axis.
    The "All" option is not available.

NB. Unless the PARAFAC model was calculated with orthogonality constraints, the axes in the plot are not actually orthogonal (read Kiers... for more information).


The "Validation & prediction" frame

Its content varies depending on the type of validation (if any) and on whether the model is applied to external data:

Choosing to plot replicates/predictions deactivates the "All" option in the 1D plots.


The "Display options" frame and the preferences menu

The first menu in the frame allows the user to choose the type of marker:

The second menu specifies whether a line should link the points (Continuous) or not (Discrete).

The submenus in the 'Preferences' menu are activated/deactivated depending on the choices made on the main control window:

The other submenus are:

The changes become visible only after the 'Plot' button is pressed.


Explained Variation, PRESS, RMSE
 

Explained Variation (EV, expressed as a % of the total variation in the set), Prediction Residual Sum of Squares (PRESS) and Root Mean Squared Error (RMSE) all reflect the goodness of fit. In particular:

EV% = 100 * (1 - PRESS / TSS)

where the PRESS refers to the calibration or to the validation samples/batches and TSS stands for Total Sum of Squares.
The RMSE is linked to the PRESS by the relation:

RMSE = sqrt(PRESS / n)

where n is the number of "non-missing" elements in the array.
They therefore give the same type of information, albeit in different measurement units.
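
As an illustration only, the three quantities can be sketched in a few lines of Python. The sketch assumes the usual definitions EV% = 100 * (1 - PRESS / TSS) and RMSE = sqrt(PRESS / n), takes TSS as the sum of squares of the (preprocessed) data, and marks missing elements with None; the function name fit_statistics is hypothetical and is not part of CuBatch.

```python
import math

def fit_statistics(data, model):
    """Return (PRESS, EV%, RMSE) for two flat lists of equal length.

    Assumption: missing data elements are stored as None and are skipped,
    so n counts only the "non-missing" elements.
    """
    pairs = [(x, m) for x, m in zip(data, model) if x is not None]
    n = len(pairs)
    press = sum((x - m) ** 2 for x, m in pairs)   # residual sum of squares
    tss = sum(x ** 2 for x, _ in pairs)           # total sum of squares
    ev = 100.0 * (1.0 - press / tss)              # explained variation in %
    rmse = math.sqrt(press / n)                   # same information, data units
    return press, ev, rmse

# Four elements, one of them missing
data = [1.0, 2.0, None, 4.0]
model = [1.0, 2.0, 3.0, 3.0]
press, ev, rmse = fit_statistics(data, model)
print(press, round(ev, 2), round(rmse, 4))
```

Whether the PRESS is a calibration or a validation figure depends only on which model values are passed in: fitted values in the first case, predictions for the left-out or test samples/batches in the second.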

To plot the overall explained variation the following menu has to be selected

The other submenus will have the EV% plotted against a specific axis (see figure on the side). Subsets can be chosen, as for the residual sum of squares.

When the model is validated using full cross-validation, the EV can be computed either on the complete model or on the predictions of the samples left out at each step. The choice is made through a requester that appears when subsets are selected in the first mode:

The default is "Calibration".
A similar choice is to be made when displaying the explained variation versus a mode other than the first when test set validation is employed. In this case the user is asked whether the residuals shall refer to the calibration or to the test set.
Again the default is "Calibration".


Residuals sum of squares (Q-statistics)

This plot involves many selections, besides the choice of the model's rank:
 

1) Select mode to plot against in the 'Results' -> 'Residuals' menu


2) Choose subsets in the various modes.

When the model is validated using full cross-validation, the residuals can be computed either on the complete model or on the predictions of the samples left out at each step. The choice is made through a requester that appears when subsets are selected in the first mode:

The default is "Calibration".
A similar choice is to be made when displaying the Q-statistic versus a mode other than the first when test set validation is employed. In this case the user is asked whether the residuals shall refer to the calibration or to the test set.
Again the default is "Calibration".

If more than one sample is selected and the desired mode is not the first, one plot per sample is displayed. It is possible to see the sample label by clicking on the plot with the left mouse button.
The confidence limits for the Q-statistic are computed on the basis of Jackson and Mudholkar and displayed as light blue solid/dashed lines. The NOC batches/samples (the calibration set) are used to compute these limits.

Note: if subsets are selected in modes other than the first, the non-normalised residuals plot actually displays the contribution plot to the Q-statistic.
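
The following Python sketch illustrates the Q-statistic of one sample/batch and the approximate limit usually attributed to Jackson and Mudholkar. It assumes that the eigenvalues of the covariance matrix of the NOC residuals are available; how CuBatch obtains them for a PARAFAC model is not described here, and the function names are hypothetical.

```python
from statistics import NormalDist

def q_statistic(residuals):
    """Sum of squared residuals of one sample/batch (flat list)."""
    return sum(e ** 2 for e in residuals)

def jackson_mudholkar_limit(eigenvalues, alpha=0.95):
    """Approximate upper confidence limit for Q.

    eigenvalues : eigenvalues of the residual covariance (assumed input)
    alpha       : confidence level, e.g. 0.95 or 0.99
    """
    theta1 = sum(eigenvalues)
    theta2 = sum(l ** 2 for l in eigenvalues)
    theta3 = sum(l ** 3 for l in eigenvalues)
    h0 = 1.0 - 2.0 * theta1 * theta3 / (3.0 * theta2 ** 2)
    c = NormalDist().inv_cdf(alpha)   # standard normal quantile
    term = (c * (2.0 * theta2 * h0 ** 2) ** 0.5 / theta1
            + 1.0
            + theta2 * h0 * (h0 - 1.0) / theta1 ** 2)
    return theta1 * term ** (1.0 / h0)

q = q_statistic([1.0, -2.0, 2.0])
limit_95 = jackson_mudholkar_limit([0.5, 0.3, 0.2], 0.95)
limit_99 = jackson_mudholkar_limit([0.5, 0.3, 0.2], 0.99)
print(q, limit_95 < limit_99)
```

A sample/batch whose Q value lies above the chosen limit is poorly described by the model.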


Slab-wise congruence

The congruence (i.e. the cosine) between the real data and the model is plotted versus the scalars in the selected modes.
A congruence of 1 means that the corresponding slab is perfectly recovered by the model, although, because of noise, this limit is hardly ever attained.
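
A minimal Python sketch of the slab-wise congruence is given here for illustration. It assumes a three-way array stored as nested lists (samples, then rows, then columns) and slabs taken along the first mode; the function names are hypothetical.

```python
import math

def congruence(x, m):
    """Cosine between two slabs given as flat lists of equal length."""
    num = sum(a * b for a, b in zip(x, m))
    den = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in m))
    return num / den

def flatten(slab):
    """Unfold a two-way slab (list of rows) into a flat list."""
    return [value for row in slab for value in row]

def slabwise_congruence(data, model):
    """One congruence value per slab of the first mode."""
    return [congruence(flatten(xs), flatten(ms)) for xs, ms in zip(data, model)]

# Two samples of size 2 x 2: the first is recovered up to a scale factor,
# the second only partially
data = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 0.0], [0.0, 1.0]]]
model = [[[2.0, 4.0], [6.0, 8.0]], [[1.0, 0.0], [0.0, 0.0]]]
values = slabwise_congruence(data, model)
print([round(v, 4) for v in values])
```

Because the congruence is a cosine, it is insensitive to the overall scale of a slab: the first sample in the example reaches 1 even though the model values are twice the data values.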

When the model is validated using full cross-validation, the congruence can be computed either on the complete model or on the predictions of the samples left out at each step. The choice is made through a requester:

The default is "Calibration".
A similar choice is to be made when displaying the congruence versus a mode other than the first when test set validation is employed. In this case the user is asked whether the congruence shall refer to the calibration or to the test set.
Again the default is "Calibration".

Figure a) shows the congruence for the 15 samples of the Fluorescence data set for the rank 3 model. Sample 8 is not well described by this model; the analysis of the scores and of the concentrations shows that this sample contains only one compound, namely the one described by the fourth component, which is not present here.

Figure b) shows the congruence for the different emission wavelengths, also for the rank 3 model. The lowest emission wavelengths are not well described, and this lack of fit is systematic: the four factor model (not displayed here) describes these wavelengths much better (although still quite far from a congruence of 1).

a) b)


D-statistic

The D-statistic is the Hotelling T2 statistic computed in a reduced space with R components instead of on x with its JK (or JKL, etc.) variables; in other words, it is the Mahalanobis distance of a sample/batch from the origin of the axes in the model space.
It is used as a diagnostic tool, especially in the field of Multivariate Statistical Process Control (MSPC): if the D-statistic for a certain batch is larger than a threshold determined on the basis of an F-distribution, the batch is considered faulty (post-batch analysis). The D-statistic limits (set at 95 and 99%) depend on the number of samples in the NOC (Normal Operating Conditions) data and on the rank of the model.
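
As a sketch only, the D-statistic can be written in Python as the quadratic form t' S^-1 t, where t holds the R scores of the sample/batch and S is the covariance of the NOC scores. The code assumes that S is taken about the origin, S = T'T / (I - 1), and uses the limit in the form commonly quoted from Nomikos and MacGregor, R (I^2 - 1) / (I (I - R)) times the F quantile with (R, I - R) degrees of freedom. The F quantile must be supplied by the user (e.g. from tables), and all function names are hypothetical.

```python
def d_statistic(t, noc_scores):
    """Mahalanobis distance of the score vector t from the origin."""
    n, r = len(noc_scores), len(t)
    # Covariance of the NOC scores about the origin: S = T'T / (I - 1)
    s = [[sum(row[a] * row[b] for row in noc_scores) / (n - 1)
          for b in range(r)] for a in range(r)]
    # Solve S z = t by Gauss-Jordan elimination with partial pivoting
    aug = [s[a][:] + [t[a]] for a in range(r)]
    for col in range(r):
        pivot = max(range(col, r), key=lambda i: abs(aug[i][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(r):
            if i != col:
                factor = aug[i][col] / aug[col][col]
                aug[i] = [u - factor * v for u, v in zip(aug[i], aug[col])]
    z = [aug[a][r] / aug[a][a] for a in range(r)]
    return sum(ti * zi for ti, zi in zip(t, z))

def d_limit(f_quantile, n, r):
    """Limit for D given the F quantile with (r, n - r) degrees of freedom."""
    return r * (n * n - 1) / (n * (n - r)) * f_quantile

# Four NOC samples described by a two-component model
noc_scores = [[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
d = d_statistic([1.0, 2.0], noc_scores)
print(round(d, 6), d_limit(1.0, 4, 2))
```

The dependence of the limit on the number of NOC samples and on the rank of the model enters only through the factor in d_limit and through the degrees of freedom of the F quantile.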
 

[1] Nomikos, P.; MacGregor, J.F., "Monitoring batch processes using multiway principal component analysis", AIChE Journal, Vol 40, n°8, 1994, 1361-1373
[2] Westerhuis, J.A.; Gurden, S.P.; Smilde, A.K., "Generalized contribution plots in multivariate statistical process monitoring", Chemometrics and Intelligent Laboratory Systems, Vol 51, 2000, 96-114


Residual and Model Landscape

Residuals in particular are very powerful diagnostic tools. Most models work under the assumption that the residuals are independent and identically distributed, possibly according to a normal distribution centred at 0.
The presence of systematic variation in the residuals may reflect an inappropriate choice of the rank or, more simply, an inadequacy of the model in explaining the data at hand.

A three-factor model on the fluorescence data (rank 4) yields very systematic residuals: the model is inadequate.

A four-factor model on the same data yields better predictions and the residuals are relatively non-systematic.

The residuals look less systematic and are much smaller than in the three-factor model: the ridge marked in red is some scattering that could not be removed via the pretreatment.
This part of the fluorescence data, not being trilinear, cannot be modelled by PARAFAC. Fortunately it is very small and the model can still be used.


IMP

The Identity Match Plot is available only when leave-one-out cross-validation has been used.
The scores obtained in prediction (i.e. by projecting the left-out sample/batch on the model computed on the others) are plotted versus the scores of the complete model.
Because of the uniqueness property of PARAFAC the scores should be identical, and this plot is an excellent diagnostic tool for identifying outliers. See literature...
A "PARAFAC Plot Control" window is opened (with an empty "Validation frame") to choose the display options and which factor's scores are to be plotted.


RIP

The Resample Influence Plot is currently available for the leave-one-out validation case only.
The calling menu makes it possible to decide which mode (apart from the first, which should refer to the samples/batches) the plot refers to.
It shows the MSE (Mean Squared Error) of the loadings in a specific mode versus the sum of squares of the residuals of the left-out sample/batch when this is projected on the model computed on the remaining samples/batches.
The samples/batches in the top right corner yield high residuals and, when eliminated, lead to very different loadings. This may be a strong indication that these samples/batches are outliers.
More is to be found in the literature.

NB. The calculation of the correct MSE requires solving an optimisation problem, and this procedure may be very expensive. Therefore a requester asks the user whether she/he wants to proceed.
The same function also calculates the risk.


Risk plot
This plot is available only when resampling methods (leave one out or bootstrap) have been used to validate the model.
The risk function for dimension F is defined as:

Thanks to the uniqueness properties of PARAFAC, the models obtained by leaving out one or more samples should be "identical". The sum of the congruences between two different models (the rth replicate of leave-one-out or bootstrap and the complete model) should be equal to F if they yield the same factors (provided that the permutational indeterminacy has been removed). (RF)^-1 is a normalisation factor.

The figure on the side shows the risk plot for models computed on the Fluorescence data set with 1 to 5 components.
The 5-component model is less "stable" (i.e. the loadings of the extracted components vary more depending on the composition of the data set employed to compute the model). It is not visible in this figure, but the risk for 6 components is even higher (~0.038). The minimum is attained for 4 components, which is the correct dimensionality for the problem at hand (Fluorescence data set).


D-statistic on-line

The D-statistic can be computed in an on-line fashion by filling in the incomplete sample/batch. It is then possible to detect the occurrence of a fault during the evolution of the batch itself.
There are several options for filling in the batches; CuBatch supports two: 'zero' and 'current deviation'. In the first, the sample/batch is treated as if it proceeded like the NOC samples/batches; in the second, it is assumed that the difference of the current batch from the NOC samples/batches remains constant for the rest of the batch. For more information see literature.
The fill-in method is asked for every time the plot is requested in the 'advanced' mode and never in the 'plant' mode.
The two figures show two possible evolutions: in figure a) no fault occurs, in figure b) the fault occurs at the very beginning of the batch.
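
The two fill-in options can be illustrated with the following Python sketch. It assumes that "proceeding like the NOC samples/batches" means following the mean NOC trajectory, and that a batch is stored as one list of J variables per time point; the function fill_in and its arguments are hypothetical and do not reproduce the CuBatch implementation.

```python
def fill_in(observed, noc_mean, method="zero"):
    """Complete a running batch to its full duration.

    observed : measurements so far, one list of J variables per time point
    noc_mean : mean NOC trajectory, K lists of J variables (assumed reference)
    method   : 'zero' or 'current deviation'
    """
    if method not in ("zero", "current deviation"):
        raise ValueError("unknown fill-in method: " + method)
    completed = [row[:] for row in observed]
    # Deviation from the NOC mean at the last observed time point
    last_dev = [x - m for x, m in zip(observed[-1], noc_mean[len(observed) - 1])]
    for mean_row in noc_mean[len(observed):]:
        if method == "zero":
            # Future deviations are zero: the batch follows the NOC mean
            completed.append(mean_row[:])
        else:
            # The current deviation persists for the rest of the batch
            completed.append([m + d for m, d in zip(mean_row, last_dev)])
    return completed

noc_mean = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
observed = [[1.5, 9.0]]   # only the first time point has been measured
filled_zero = fill_in(observed, noc_mean, "zero")
filled_dev = fill_in(observed, noc_mean, "current deviation")
print(filled_zero)
print(filled_dev)
```

The completed batch can then be projected on the model as if it were a finished batch, which gives the scores needed for the on-line D-statistic at the current time point.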

a) b)


Q-statistic on-line
 

The Q-statistic can be computed (like the D-statistic) in an on-line fashion by adequately filling in the incomplete batches (see the D-statistic on-line section or the literature for more details).
This plot is started by the menu 'Residuals'->'Mode #: Time'->'On-line'

The evolution of the Q-statistic is displayed versus the time scalars.
The confidence limits for the RSS-online are based on the work of Jackson and Mudholkar.
This plot is available (like the other "on-line" plots) only if the last mode is named 'Time' (case insensitive).
 


SPE

The SPE is the Sum of Squares of the Residuals calculated only at time t.
The evolution of the SPE is plotted versus the scalars in the time mode.
The confidence limits are again based on the work of Jackson and Mudholkar.
This plot is available (like the other "on-line" plots) only if the last mode is named 'Time' (case insensitive).
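
For illustration, the SPE of one batch can be sketched in Python as follows, assuming that the residuals are stored as one list of J variables per time point; the function name spe is hypothetical.

```python
def spe(residuals):
    """Sum of squared residuals at each time point for one batch."""
    return [sum(e * e for e in row) for row in residuals]

# Residuals of one batch: three time points, two variables
residuals = [[1.0, -1.0], [0.0, 2.0], [0.5, 0.5]]
spe_values = spe(residuals)
print(spe_values, sum(spe_values))
```

Summing the SPE over all time points gives the sum of squares of all the residuals of the batch.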