**NEW TOOLS DEVELOPED TO IMPROVE THE PREDICTIVE CAPACITY OF MULTIVARIATE STATISTICAL ANALYSIS MODELS**

Multivariate statistical analysis (MA) methods such as partial least
squares (PLS) are potentially powerful tools for developing predictive
models of very complex phenomena or systems when first-principles models
cannot be used.

These techniques cover many categories of approaches, including the
aforementioned PLS as well as neural networks, linear and non-linear
multiple regression analysis, principal components regression, etc.
Pragmathic mainly uses the Umetri Simca-p software for MA model development.

Unfortunately, applying MA to process operation databases for model
development often delivers disappointing results: poor-quality models with
low predictive capacity. Predictive capacity is often expressed by the Q
coefficient (rather than the regression coefficient R).
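To make the distinction between R and Q concrete, here is a minimal sketch. It assumes the common definitions R² = 1 − SS_res/SS_tot on the fitting data and Q² = 1 − PRESS/SS_tot on held-out data, and it uses a single-factor least-squares line as a stand-in for PLS. The function names and the choice of 7 interleaved folds are illustrative assumptions, not Simca-p's actual procedure.

```python
from statistics import fmean

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for one factor."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def r2(x, y):
    """Goodness of fit: 1 - SS_res / SS_tot on the data used for fitting."""
    slope, intercept = fit_line(x, y)
    my = fmean(y)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / sum((yi - my) ** 2 for yi in y)

def q2(x, y, n_folds=7):
    """Goodness of prediction: 1 - PRESS / SS_tot, PRESS from held-out rows."""
    my = fmean(y)
    press = 0.0
    for k in range(n_folds):
        # Hypothetical fold assignment: every n_folds-th row is held out.
        train = [(xi, yi) for i, (xi, yi) in enumerate(zip(x, y)) if i % n_folds != k]
        test = [(xi, yi) for i, (xi, yi) in enumerate(zip(x, y)) if i % n_folds == k]
        slope, intercept = fit_line(*zip(*train))
        press += sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in test)
    return 1.0 - press / sum((yi - my) ** 2 for yi in y)
```

Because each held-out row is predicted by a model that never saw it, Q² computed this way cannot exceed R² for a least-squares fit, which is why it is the more demanding indicator.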

We believe that unsatisfactory results are mostly a consequence of some unfavourable
inherent characteristics of the input data, characteristics that would lead to poor models whatever the MA method used. We believe the
corollary is also true. These unfavourable characteristics are the following.

- A key parameter is not measured or recorded in the plant archive
databases.

Nothing magical can be done here other than finding the factor and
measuring it.

- Inappropriate compression specifications in the IT archive system
lead to overcompression and a significant loss of information.

This is an important issue given the popularity of data archives. To
handle it, we have developed a computer-based analysis algorithm that
calculates the amount of information lost by a recorded parameter under
the compression criteria used, and that establishes the optimum
specifications to apply for a desired maximum level of information loss.
Information retention is expressed as the percentage of the original
variance that is kept in the archive. A value of 99.9% has often led to
good compression ratios for what we consider an acceptable level of loss
(0.1% of the variance) for any future MA application.
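The idea can be illustrated with a small sketch. It is not the algorithm described above: it assumes a simple deadband compressor (archive systems often use other schemes), a hold-last-value reconstruction, and it assumes that "variance kept" can be computed as one minus the mean squared reconstruction error divided by the original variance. All function names are hypothetical.

```python
from statistics import pvariance

def deadband_compress(signal, deadband):
    """Keep a sample only when it moves more than `deadband` from the last kept one."""
    kept = [(0, signal[0])]
    for i, value in enumerate(signal[1:], start=1):
        if abs(value - kept[-1][1]) > deadband:
            kept.append((i, value))
    return kept

def reconstruct(kept, n):
    """Rebuild an n-sample series by holding each kept value until the next one."""
    out, j = [], 0
    for i in range(n):
        if j + 1 < len(kept) and kept[j + 1][0] <= i:
            j += 1
        out.append(kept[j][1])
    return out

def variance_retained(signal, deadband):
    """Percentage of the original variance that survives compression (assumed definition)."""
    rebuilt = reconstruct(deadband_compress(signal, deadband), len(signal))
    mse = sum((s - r) ** 2 for s, r in zip(signal, rebuilt)) / len(signal)
    return 100.0 * (1.0 - mse / pvariance(signal))

def largest_deadband(signal, candidates, target=99.9):
    """Largest candidate deadband that still retains `target` percent of the variance."""
    ok = [d for d in candidates if variance_retained(signal, d) >= target]
    return max(ok) if ok else None
```

Running `largest_deadband` over a grid of candidate settings for each archived tag gives the loosest compression that still meets the chosen retention target; the compression ratio for that setting is `len(signal) / len(deadband_compress(signal, d))`.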

- The effects of some important parameters are not always simply
linear.

Non-linearities are usually handled by the analyst through a lengthy,
iterative procedure. Non-linearities must first be suspected, and many
kinds of transformations can then be tried. When a large number of
variables is examined, which is often the case, this becomes a very
tedious or arbitrary task that depends on the analyst's intuition. We
have developed a tool that efficiently identifies any statistically
significant non-linearities and the appropriate transformation to
introduce. This algorithm can treat large databases easily and rapidly.
First, a procedure numerically evaluates some derivative values; the
result is a fingerprint that indicates the appropriate transformation to
apply.
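The text does not specify which derivatives are evaluated or how the fingerprint is read, so the following is only one possible interpretation, offered as a sketch. It assumes a positive factor x and a response that behaves like c·x^p: the finite-difference slope dy/dx then varies as x^(p−1), so the slope of log|dy/dx| against log x yields p, which can be mapped to a candidate transformation (p near 0 suggests log x). The function names, bin count and tolerance are hypothetical.

```python
import math
from statistics import fmean

def power_fingerprint(x, y, n_bins=10):
    """Estimate p in y ~ c * x**p from finite-difference slopes (assumes x > 0)."""
    pairs = sorted(zip(x, y))
    size = len(pairs) // n_bins
    # Bin averages damp measurement noise before differencing.
    bx = [fmean(p[0] for p in pairs[k * size:(k + 1) * size]) for k in range(n_bins)]
    by = [fmean(p[1] for p in pairs[k * size:(k + 1) * size]) for k in range(n_bins)]
    log_x, log_slope = [], []
    for k in range(n_bins - 1):
        slope = (by[k + 1] - by[k]) / (bx[k + 1] - bx[k])
        if slope != 0:
            log_x.append(math.log(0.5 * (bx[k] + bx[k + 1])))
            log_slope.append(math.log(abs(slope)))
    # Least-squares slope of log|dy/dx| against log(x) equals p - 1.
    mx, ms = fmean(log_x), fmean(log_slope)
    num = sum((u - mx) * (v - ms) for u, v in zip(log_x, log_slope))
    den = sum((u - mx) ** 2 for u in log_x)
    return num / den + 1.0

def suggest_transform(p, tol=0.25):
    """Map the estimated exponent to a named transformation of x."""
    for name, target in (("log(x)", 0.0), ("sqrt(x)", 0.5),
                         ("x (linear)", 1.0), ("x**2", 2.0)):
        if abs(p - target) <= tol:
            return name
    return f"x**{p:.2f}"
```

A real tool would also need a significance test on the estimated exponent before recommending a transformation; that step is omitted here.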

- Lags combined with dilution effects are often present, and lag values
can change regularly with time (for example, when the production rate
of a process is modified).

Lags and dilution effects (as encountered in a mixing tank, for example)
are among the factors most detrimental to the quality (in terms of the R
and Q coefficients) of the model being developed. We use a new approach
that greatly improves the way variable lags and mixing effects are both
treated. First, we observe that a lag is the consequence of a volume or
capacity element located between a factor and the parameter to model. If
the flow rate through this volume changes, the lag changes with it, but
the volume itself obviously stays constant. Instead of introducing a time
lag value in the model, we therefore use a capacity lag term, a constant
related to the physical configuration of the system under examination.
To summarize, a significant lagged factor X is replaced by a transformed
factor Xe defined as:

Here, the a_{i} are coefficients that account for any mixing effects by
introducing several X values taken at different locations in the system,
as given by the capacity terms Qi. The Qi values are found by statistical
analysis. Values of X are extracted from the archive for each position Qi
by translating Qi into a time ti with the help of a user-specified
parameter that gives a good indication of any velocity changes in the
system with capacity Qi.
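The defining equation is not reproduced above, so the sketch below rests on an assumed reading of the description: Xe(t) = Σ a_i · X(t − ti), with each time lag obtained as ti = Qi / W(t), where W stands for the user-specified velocity indicator. The function name, the argument names and the sampling interval `dt` are hypothetical.

```python
def capacity_lagged(x, w, capacities, weights, dt=1.0):
    """Assumed form: Xe[n] = sum_i a_i * x[n - shift_i], shift_i = round(Q_i / (w[n] * dt)).

    x          -- archived values of the factor X, one per sample
    w          -- velocity indicator W at each sample (must be > 0)
    capacities -- capacity lag terms Q_i (constants of the physical system)
    weights    -- mixing coefficients a_i
    Returns None for samples whose required history is not yet available.
    """
    xe = []
    for n in range(len(x)):
        total = 0.0
        for q_i, a_i in zip(capacities, weights):
            shift = round(q_i / (w[n] * dt))   # capacity translated into a lag in samples
            if shift > n:                      # not enough history at the start of the record
                total = None
                break
            total += a_i * x[n - shift]
        xe.append(total)
    return xe
```

With a constant capacity of 4 units, the lag is 2 samples while W = 2 and 4 samples once W drops to 1, which is the behaviour described above: the time lag follows the flow while the capacity term stays fixed. Using the instantaneous W is a simplification; integrating W backwards in time until the volume Qi has been displaced would be a more rigorous translation.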

Once a database has been appropriately conditioned with the previous
algorithms, we believe MA model development has a much higher chance of
producing a high-quality model.

**Application example: pulp freeness prediction**

Here is an example of the development of a predictive model for the
freeness of a 4-line thermomechanical pulp (TMP) process (data obtained
from a TMP computer process simulation). The freeness is measured at the
entrance of the primary screening over a 2-day period. All 4 lines are in
operation for the first 12 hours, followed by 2 lines, then 3 lines and
finally 4 lines again, each for a period of 12 hours. Since the freeness
obtained is an average over the lines, changing the number of lines in
production corresponds to a change in the lag factor and in any dilution
effect, which occurs mostly in the latency chest. Hence, in the preceding
equations, the term W would be represented by the number of lines in
production.

**Results with the "usual" approach**

Using our typical approach, two factors with a significant effect on the
freeness values were identified: the refiner dilution flow rate and the
refiner specific energy. An average lag time of 45 minutes was identified
for both parameters. The database built to develop the MA model with the
Simca-p PLS method contained 7 columns: the freeness, three lagged
dilution flow rate columns for lag times of 35, 45 and 55 minutes, and
three lagged specific energy columns with the same lag values.

The database for the Simca-p PLS analysis comprised about 5500 rows of
data. The first 75% of the data was used to develop the model, and the
last 25% served as the prediction set.
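The construction of this 7-column database and its chronological split can be sketched as follows. The lags are expressed in samples, and the sampling interval of the archive is not stated here, so the conversion from minutes to samples is left to the caller. Function names are hypothetical.

```python
def lagged_rows(y, factors, lags):
    """Rows of [y[n], then f[n - lag] for each factor f and each lag].

    Samples without enough history for the largest lag are skipped.
    """
    start = max(lags)
    return [
        [y[n]] + [f[n - lag] for f in factors for lag in lags]
        for n in range(start, len(y))
    ]

def chronological_split(rows, train_fraction=0.75):
    """First part for model development, remainder for the prediction set."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]
```

For example, assuming one sample per minute, `lagged_rows(freeness, [dilution, energy], [35, 45, 55])` yields rows of 7 values in the column order described above. Splitting by position rather than at random keeps the prediction set strictly later in time than the training set.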

The resulting model performs poorly, with R^{2} and Q^{2} values of
0.273.

**Results with the new data conditioning approach**

Using our new algorithms, Qi values were identified by statistical
analysis for both input variables, and the a_{i} coefficients were then
calculated to obtain two new variables Xe, one for the dilution flow rate
and one for the specific energy. Using Simca-p again with this newly
conditioned database of 3 columns, the new model is much better, with
R^{2} and Q^{2} values of 0.806.