NEW TOOLS DEVELOPED TO IMPROVE THE PREDICTIVE CAPACITY OF MULTIVARIATE STATISTICAL ANALYSIS MODELS

Multivariate statistical analysis (MA) methods such as PLS (partial least squares) are potentially powerful tools for developing predictive models of very complex phenomena or systems when first-principles models cannot be used.

These techniques cover many categories of approaches: the aforementioned PLS, but also neural networks, linear and non-linear multiple regression analysis, principal components regression, etc. Pragmathic mainly uses the Umetri Simca-p software for model development in MA.

Unfortunately, applying MA to process operation databases often delivers disappointing results: poor-quality models with low predictive capacity, which is usually expressed by the Q coefficient (rather than the regression coefficient R).

We believe that unsatisfactory results are mostly a consequence of unfavourable characteristics inherent to the input data, characteristics that would lead to poor models whatever the MA method used. We believe the converse is also true. These unfavourable characteristics are the following.

  1. A key parameter is not measured or recorded in the plant archive databases.

Nothing magical can be done here: the missing factor must be found and measured.

  2. Wrong specifications for data compression in the IT archive system lead to overcompression and a significant loss of information.

This is an important issue given the popularity of data archives. To handle it, we have developed a computer-based analysis algorithm that calculates the amount of information lost by a recorded parameter under the compression criteria used, and that establishes the optimum specifications to apply for a desired maximum level of information loss. The loss of information is expressed as the percentage of the original variance that is kept in the archive. A value of 99.9% has often led to good compression ratios for what we consider an acceptable level of loss (0.1% of the variance) for any future MA application.
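The idea of scoring a compression setting by the variance it keeps can be sketched as follows. This is only an illustration, not our production algorithm: the deadband rule, the hold-last-value reconstruction, the definition of "variance kept" as 1 minus the mean squared reconstruction error over the original variance, and all function names are assumptions made for the example.

```python
from statistics import pvariance


def deadband_compress(values, tol):
    """Keep a sample only when it departs from the last kept value by more than tol."""
    kept = [(0, values[0])]
    for i, v in enumerate(values[1:], start=1):
        if abs(v - kept[-1][1]) > tol:
            kept.append((i, v))
    return kept


def reconstruct(kept, n):
    """Rebuild an n-sample series by holding each kept value until the next one."""
    out = []
    for k, (i, v) in enumerate(kept):
        end = kept[k + 1][0] if k + 1 < len(kept) else n
        out.extend([v] * (end - i))
    return out


def variance_kept(values, tol):
    """Percentage of the original variance retained (assumed definition)."""
    rebuilt = reconstruct(deadband_compress(values, tol), len(values))
    mse = sum((a - b) ** 2 for a, b in zip(values, rebuilt)) / len(values)
    return 100.0 * (1.0 - mse / pvariance(values))


def loosest_tolerance(values, candidates, target=99.9):
    """Largest candidate tolerance that still keeps at least `target` percent."""
    ok = [t for t in candidates if variance_kept(values, t) >= target]
    return max(ok) if ok else None
```

Calling `loosest_tolerance(history, [0.001, 0.01, 0.5])` on an uncompressed history of a tag returns the coarsest of the listed deadbands that still meets the 99.9% target, or `None` if none does.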

  3. The effects of some important parameters are not always simply linear.

Non-linearities are usually handled by the analyst through a lengthy iterative procedure. They must first be suspected, and then many kinds of transformations can be tried. When a large number of variables are examined, which is often the case, this becomes a tedious or arbitrary task that depends on the analyst's intuition. We have developed a tool that efficiently identifies any statistically significant non-linearity and the appropriate transformation to introduce. This algorithm can treat large databases easily and rapidly. It first evaluates some derivative values numerically; the result is a fingerprint that indicates the appropriate transformation to apply.
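One way a derivative-based fingerprint could work is sketched below. It is a hypothetical illustration, not the tool described above: it assumes a single positive factor x, estimates the exponent p in |dy/dx| ~ x**p from binned finite differences, and maps p to a transformation through a small made-up catalogue. It does not include any test of statistical significance.

```python
import math

# Hypothetical catalogue: exponent p of |dy/dx| ~ x**p -> transformation of x.
FINGERPRINTS = {
    0.0: "none (linear)",
    -1.0: "log(x)",
    -0.5: "sqrt(x)",
    1.0: "x**2",
    -2.0: "1/x",
}


def slope_exponent(x, y, bins=20):
    """Estimate p in |dy/dx| ~ x**p from binned finite differences (x must be > 0)."""
    pairs = sorted(zip(x, y))
    size = len(pairs) // bins
    # Averaging within bins damps measurement noise before differencing.
    xm = [sum(p[0] for p in pairs[k * size:(k + 1) * size]) / size for k in range(bins)]
    ym = [sum(p[1] for p in pairs[k * size:(k + 1) * size]) / size for k in range(bins)]
    log_x, log_slope = [], []
    for k in range(bins - 1):
        dx = xm[k + 1] - xm[k]
        dy = ym[k + 1] - ym[k]
        if dx > 0 and dy != 0:
            log_x.append(math.log(0.5 * (xm[k] + xm[k + 1])))
            log_slope.append(math.log(abs(dy / dx)))
    # Ordinary least-squares slope of log|dy/dx| against log x.
    mx = sum(log_x) / len(log_x)
    ms = sum(log_slope) / len(log_slope)
    sxx = sum((a - mx) ** 2 for a in log_x)
    return sum((a - mx) * (b - ms) for a, b in zip(log_x, log_slope)) / sxx


def suggest_transformation(x, y):
    """Return the catalogue entry whose exponent is closest to the estimate."""
    p = slope_exponent(x, y)
    return FINGERPRINTS[min(FINGERPRINTS, key=lambda q: abs(q - p))]
```

Because the estimate relies on one regression per variable rather than on trying each transformation in turn, a scan of this kind stays cheap when many variables are examined.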

  4. Lags combined with dilution effects are often present, and lag values can change regularly with time (for example, when the production rate of a process is modified).

Lags and dilution effects (met in a mixing tank, for example) are among the factors most detrimental to the quality (in terms of the R and Q coefficients) of the model to develop. We use a new approach that greatly improves the way variable lags and mixing effects are both treated. First, we observe that a lag is a consequence of a volume or capacity element located between a factor and the parameter to model. If the flow rate through this volume changes, the lag changes with it, but the volume itself obviously stays constant. Instead of introducing a time lag value in the model, we therefore use a capacity lag term, a constant value related to the physical configuration of the system under examination. To summarize, a significant lagged factor X is replaced by a transformed factor Xe that is equal to:

Here, the ai are coefficients that account for any mixing effects by introducing several X values met at different locations in the system, as given by the capacity terms Qi. The Qi values are found by statistical analysis. Values of X are extracted from the archive for each position Qi by translating Qi into a time ti, with the help of a user-specified parameter that gives a good indication of any velocity changes in the system with capacity Qi.
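The capacity lag idea can be sketched as follows, under stated assumptions. The sketch assumes that Xe is a weighted sum of the form Xe(t) = sum over i of ai * X(t - ti), and that each ti is the time needed for the cumulative throughput to reach Qi. It also assumes that the user-specified velocity indicator (named `w` here, and taken to play the role of the term W in the application example) is sampled at a fixed interval `dt` and is never negative. The function names are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def capacity_lagged(x, w, q, dt=1.0):
    """Value of x that entered the system one capacity q earlier, given rate w (w >= 0)."""
    cum = list(accumulate(r * dt for r in w))  # cumulative throughput at each sample
    out = []
    for c in cum:
        if c - q < cum[0]:
            out.append(None)  # not enough history to look back a full capacity q
        else:
            # First sample whose cumulative throughput reaches c - q.
            out.append(x[bisect_left(cum, c - q)])
    return out


def transformed_factor(x, w, capacities, coeffs, dt=1.0):
    """Assumed form: Xe(t) = sum_i a_i * X(t - t_i), with t_i derived from Q_i and w."""
    lagged = [capacity_lagged(x, w, q, dt) for q in capacities]
    xe = []
    for row in zip(*lagged):
        if None in row:
            xe.append(None)
        else:
            xe.append(sum(a * v for a, v in zip(coeffs, row)))
    return xe
```

With a constant rate of 1, a capacity of 6 gives a lag of 6 samples; doubling the rate to 2 halves the lag to 3 samples while the capacity term stays the same, which is the behaviour described above.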

Once a database has been appropriately conditioned with the previous algorithms, we believe MA model development has a much better chance of producing a higher-quality model.

Application example: pulp freeness prediction

Here is an example of the development of a predictive model for the freeness of a 4-line thermomechanical pulp (TMP) process (data obtained from a TMP computer process simulation). The freeness is measured at the entrance of the primary screening over a 2-day period. All 4 lines are in operation for the first 12 hours, followed by 2 lines, 3 lines and finally 4 lines again, each for a period of 12 hours. Since the freeness obtained is an average of the 4 lines, changing the number of lines in production changes the lag and any dilution effect, met mostly in the latency chest. Hence, in the preceding equations, the term W would be represented by the number of lines in production.

  1. Results with the "usual" approach

Using our typical approach, 2 factors with a significant effect on the freeness values were identified: the refiners' dilution flowrate and the refiners' specific energy. An average lag time of 45 minutes was identified for both parameters. The database built to develop the MA model with the Simca-p PLS method contained 7 columns: the freeness, three lagged dilution flowrates for lag times of 35, 45 and 55 minutes, and three lagged specific energy columns with the same lag values.
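A 7-column table of this kind could be assembled along the following lines. This is a minimal sketch: the sampling interval `sample_minutes` is an assumed parameter (it is not stated here), and the function names are hypothetical.

```python
def lagged_column(series, lag_minutes, sample_minutes):
    """Shift a series forward in time by a fixed lag, padding the start with None."""
    steps = round(lag_minutes / sample_minutes)
    return [None] * steps + list(series[:max(0, len(series) - steps)])


def build_lagged_table(freeness, dilution, energy, lags=(35, 45, 55), sample_minutes=1.0):
    """Rows of (freeness, 3 lagged dilution values, 3 lagged specific energy values)."""
    columns = [list(freeness)]
    for factor in (dilution, energy):
        for lag in lags:
            columns.append(lagged_column(factor, lag, sample_minutes))
    # Drop the first rows, where at least one lagged value is still undefined.
    return [row for row in zip(*columns) if None not in row]
```

Each row pairs a freeness measurement with the factor values recorded 35, 45 and 55 minutes earlier, so the lag is fixed in time regardless of the production rate.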

The database for the Simca-p PLS analysis contained about 5500 lines of data. The first 75% of the data was used to develop the model, and the last 25% served as the prediction set.

The model obtained shows poor results with R2 and Q2 values of 0.273.
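The chronological split and a score of the R2 type can be sketched as below. This is an illustration only: the same formula, 1 minus the residual sum of squares over the total sum of squares, gives R2 when applied to the rows used for fitting and a Q2-like figure when applied to held-out rows. Simca-p computes its own Q2 by cross-validation, so the held-out version here is an assumed stand-in, and the function names are hypothetical.

```python
def chronological_split(rows, train_fraction=0.75):
    """First part of the rows for model development, the remainder for prediction."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def explained_fraction(observed, predicted):
    """1 - SS_residual / SS_total for paired observed and predicted values."""
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot
```

Splitting by position rather than at random keeps the prediction set strictly later in time than the development set, which matters when lags are present in the data.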

  2. Results with the new data conditioning approach

Using our new algorithms, Qi values were identified by statistical analysis for both input variables, and the ai coefficients were then calculated to obtain 2 new variables Xe, one for the dilution and one for the specific energy. Using Simca-p again with this new conditioned database composed of 3 columns, the new model is much better, with R2 and Q2 values of 0.806.

 

 

These new tools, complementary to MA, are still under development and testing. Do not hesitate to contact us for more information on this material or if you would like to try these algorithms on your own cases.

 


Copyright © 2014 Pragmathic Inc.
Updated Feb 19, 2014
