Machine Learning for Process Control. Part 3: Neural Networks Robustness Verification.

  • Writer: Ivan Nemov
  • Jun 2
  • 7 min read
  1. Problem Statement


MLP networks are universal approximators [1], and thanks to that property they are popular in solving regression problems. Regression applications represent a significant portion of industrial use-cases, including soft sensors, data-driven optimisation, forecasting, anomaly detection, time-series analysis, etc. One of the big challenges faced by the industry is ensuring that such applications are robust, i.e. able to maintain their function under varying inputs. This is especially important for ML applications integrated with process control systems, where ML models can be used in closed-loop automatic response. In 2013 it was found that neural networks are susceptible to adversarial attacks [2], where a specifically designed yet unnoticeably small perturbation of the input signal can result in a drastically different output (Fig. 1).

Fig. 1 - Adversarial examples in computer vision and speech recognition [3].

Such small adversarial perturbations may occur naturally due to some unlikely but still possible combination of values in the input data, e.g. from sensors. They may also be injected into raw signals with malicious intent. Regardless of whether it is a safety or a security issue, this vulnerability is tightly associated with MLP training (fitting the training data) and the network's ability to generalise well or, on the contrary, overfit the data. One can reproduce overfitting behaviour by approximating noisy data with a high-degree polynomial function: it tightly fits the original data points but changes rapidly, i.e. is oversensitive to the input, between them (Fig. 2).

Fig. 2 – Overfitting example using a high-degree polynomial function.
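The behaviour in Fig. 2 is easy to reproduce in a few lines of numpy. This is a minimal sketch with assumed toy data (a noisy sine wave), not the data behind the figure:

```python
import numpy as np

# Toy data: noisy samples of a smooth function (illustrative assumption).
rng = np.random.default_rng(42)
x = np.linspace(0.0, 1.0, 15)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=x.size)

# A polynomial of degree ~ number of points is forced to interpolate the
# noise, producing large swings between the sample points.
coeffs = np.polyfit(x, y, deg=14)
x_fine = np.linspace(0.0, 1.0, 500)
y_fit = np.polyval(coeffs, x_fine)
print(f"max |fit| between points: {np.abs(y_fit).max():.2f}")  # typically far above the data range
```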

In this whitepaper we explore some key theoretical concepts of neural network robustness and how they can be applied in practice.


Abbreviations used in the post are explained in the Abbreviations section at the end.



  2. Theoretical Foundation of MLP Robustness


Lipschitz constant and its lower and upper bounds

MLP sensitivity to input data can be evaluated quantitatively using the Lipschitz constant, which has proved to be an effective way of assessing network robustness, generalisation and resilience to adversarial inputs [4]. The Lipschitz constant provides a limit on how rapidly a function (such as an MLP network) can change. If a function f(x) is Lipschitz continuous with constant L, then for any two inputs x_1 and x_2:

$$ |f(x_1) - f(x_2)| \le L\,|x_1 - x_2| \qquad (1) $$

In other words, the Lipschitz constant is the largest absolute value of the function derivative. MLP networks are generally multiple-input-multiple-output functions where inputs and outputs are represented as vectors x and f(x). Considering that, expression (1) can be rewritten using vector α-norms:

$$ \|f(x_1) - f(x_2)\|_\alpha \le L_\alpha\,\|x_1 - x_2\|_\alpha \qquad (2) $$

The 2-norm (α = 2) is normally used to calculate the Lipschitz constant of MLP networks; the corresponding induced matrix norm is known as the spectral norm. The 2-norm of a vector x is its Euclidean length:

$$ \|x\|_2 = \sqrt{x_1^2 + x_2^2 + \dots + x_n^2} \qquad (3) $$

So, for a single-output MLP we can define the Lipschitz constant as:

$$ L = \max_{x_1 \ne x_2} \frac{|f(x_1) - f(x_2)|}{\|x_1 - x_2\|_2} \qquad (4) $$

This so-called true Lipschitz constant of the function f(x) has no analytical calculation method in the case of neural networks. This is because the number of neuron activation patterns grows exponentially with the number of neurons, and finding the largest gradient requires a search across all of them. That is why, for practical purposes, computationally less expensive lower and upper bounds of the Lipschitz constant are used.


The upper bound of the Lipschitz constant, L_upper, is the product of the spectral norms of the weight matrices of each layer [5]:

$$ L_{upper} = \prod_{i=1}^{n} \|W_i\|_2 \qquad (5) $$

This follows from the MLP function form illustrated in Fig. 3, considering that the Lipschitz constant of the activation function σ is 1:

$$ f(x) = W_n\,\sigma\big(W_{n-1}\,\sigma(\dots\,\sigma(W_1 x + b_1)\,\dots) + b_{n-1}\big) + b_n $$

Fig. 3 – MLP structure.
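Expression (5) is cheap to evaluate once the network is trained. A minimal sketch, assuming the network is a scikit-learn MLPRegressor (whose coefs_ attribute holds the fitted weight matrices):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def lipschitz_upper_bound(mlp: MLPRegressor) -> float:
    """Upper bound (5): product of spectral norms of the layer weight matrices."""
    # For a 2D array, np.linalg.norm(..., ord=2) returns the largest singular value.
    return float(np.prod([np.linalg.norm(W, ord=2) for W in mlp.coefs_]))
```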

The geometrical interpretation of the matrix spectral norm is how much the matrix can stretch a unit vector: a unit circle on the 2D plane is mapped to an ellipse whose longest semi-axis equals the spectral norm, as illustrated by the example in Fig. 4:

Fig. 4 – Illustration of matrix A 2-norm via geometrical interpretation.
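This stretching interpretation is easy to verify numerically. The matrix A below is an arbitrary example, not the one from Fig. 4:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])            # arbitrary example matrix

# Sample unit vectors on the 2D unit circle and measure the maximum stretch.
theta = np.linspace(0.0, 2.0 * np.pi, 10_000)
circle = np.stack([np.cos(theta), np.sin(theta)])
max_stretch = np.linalg.norm(A @ circle, axis=0).max()

print(max_stretch)                    # ~ largest singular value of A
print(np.linalg.norm(A, ord=2))       # spectral norm, same value
```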

The lower bound of the Lipschitz constant, L_lower, can be found empirically from MLP training data using expression (2) in the general case and expression (4) for single-output MLP networks [4].
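A minimal sketch of that estimate for a single-output network, applying expression (4) over all training pairs; X_train (2D array of inputs) and y_train (1D array of outputs) are assumed names:

```python
import numpy as np

def lipschitz_lower_bound(X: np.ndarray, y: np.ndarray) -> float:
    """L_lower: largest slope |y_i - y_j| / ||x_i - x_j||_2 over all data pairs."""
    best = 0.0
    for i in range(len(X) - 1):
        dx = np.linalg.norm(X[i + 1:] - X[i], axis=1)  # distances to later samples
        dy = np.abs(y[i + 1:] - y[i])                  # output differences
        mask = dx > 1e-12                              # skip duplicated inputs
        if mask.any():
            best = max(best, float((dy[mask] / dx[mask]).max()))
    return best
```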


Interpretation of Lipschitz constant

Before implementing a trained MLP network in a production environment, it is important to evaluate how sensitive it is to input variations and compare that sensitivity with the lower and upper Lipschitz constant bounds. The sensitivity is calculated empirically as the maximum local Lipschitz constant, L_empirical, over random samples.
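A minimal sketch of that calculation: sample random points inside the input domain and take the largest local gradient norm, estimated by central differences. The names predict (the model's prediction function), lo and hi (per-input domain limits) are assumptions:

```python
import numpy as np

def local_lipschitz(predict, x, eps=1e-4):
    """Local Lipschitz estimate: 2-norm of the central-difference gradient."""
    grad = [(predict((x + eps * e).reshape(1, -1))[0]
             - predict((x - eps * e).reshape(1, -1))[0]) / (2 * eps)
            for e in np.eye(len(x))]
    return float(np.linalg.norm(grad))

def empirical_lipschitz(predict, lo, hi, n_samples=10_000, seed=0):
    """L_empirical: maximum local estimate over random samples in [lo, hi]."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lo, hi, size=(n_samples, len(lo)))
    return max(local_lipschitz(predict, x) for x in X)
```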


The lower bound L_lower indicates the steepest gradient in the training data set. A trained MLP network must have an empirical Lipschitz constant L_empirical close to the lower bound:

$$ L_{empirical} \approx L_{lower} \qquad (6) $$

This naturally follows from the MLP being trained to reproduce the data patterns from the training dataset. Where L_empirical is smaller than L_lower, it can be attributed to the presence of noise in the training dataset which is rejected by the trained MLP, and therefore serves as an indication of good generalisation. Alternatively, it can happen due to the MLP underfitting the training data, and therefore L_empirical should always be assessed along with MLP accuracy and the regression coefficient. If L_empirical is substantially larger than L_lower, then the MLP is less smooth than the training dataset, which is a warning sign. When estimating both values, it must be ensured that the input data belongs to the same domain (same input variation limits).


The upper bound L_upper is a theoretical measure indicating how quickly the MLP output can potentially change with respect to the inputs when all neurons are fully activated and no limits are imposed on the input domain. L_upper depends only on the MLP weights. The empirical Lipschitz constant L_empirical is always smaller than L_upper because not all neurons in the MLP are fully active; the opposite would indicate a computational error:

$$ L_{empirical} < L_{upper} \qquad (7) $$

Sometimes L_upper can be smaller than L_lower, which indicates that the MLP clearly underfits the training data. But normally there is a gap between the upper and lower Lipschitz constant bounds. This gap grows exponentially with the depth of the network (number of layers) and logarithmically with the network width (number of neurons in a layer) [6]. It is therefore hard to establish a rule of thumb for what L_empirical to L_upper ratio is deemed safe.


In shallow MLP networks (1 hidden layer), if L_empirical ≈ L_upper, it shows that the empirical sampling captured the high-sensitivity regions and the maximum theoretical rate of change is not far from the actually observed MLP behaviour. This gives more reason to trust the network's generalisation and robustness.


In deep MLP networks (2 and more hidden layers), if L_empirical ≈ L_upper and L_empirical ≫ L_lower, it indicates that a high-sensitivity case materialised during the empirical test and further investigation is required.


  3. Practical Aspects of MLP Robustness


MLP robustness can be built into the network design before training, and then monitored in production. In this section some techniques are explored in application to the "LNG C5+ mol" MLP network of the NGL extraction surrogate model.

Prior to training, some parameters can be adjusted to limit the Lipschitz upper bound to an acceptable level. Fig. 5 illustrates the impact of the activation function type and the L2 regularisation penalty α on the upper Lipschitz bound.

Fig. 5 – Lipschitz constants and bounds vs L2 regularisation penalty and activation function type.

The L2 regularisation penalty α is a coefficient used in the MLP training cost function to minimise the weights in the MLP weight matrices: the larger α, the more penalty for having large weights. Because the main objective of the MLP training cost function is to find weights that result in the best fit of the training data, L2 regularisation competes directly with MLP network accuracy metrics. Fig. 6 shows that increasing α beyond some point results in a progressive loss of accuracy. Hence, an optimal value of the L2 regularisation penalty α must be selected to keep the Lipschitz constant upper bound reasonably low while not causing a significant loss of MLP accuracy. For the "LNG C5+ mol" network the trade-off is achieved at α = 0.1 for both tanh and relu activation functions.

Fig. 6 – L2 regularisation impact on MLP regression coefficient.
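A hedged sketch of the study behind Fig. 5 and Fig. 6: sweep α and the activation type, recording the upper bound and the regression coefficient for each combination. The data arrays, the network size and the α grid are assumptions, and lipschitz_upper_bound() is the helper sketched earlier:

```python
from sklearn.neural_network import MLPRegressor

# X_train, y_train, X_test, y_test are assumed to be prepared beforehand.
for activation in ("tanh", "relu"):
    for alpha in (1e-4, 1e-3, 1e-2, 1e-1, 1.0):
        mlp = MLPRegressor(hidden_layer_sizes=(32,), activation=activation,
                           alpha=alpha, max_iter=5000, random_state=0)
        mlp.fit(X_train, y_train)
        print(f"{activation:5s} alpha={alpha:<6} "
              f"L_upper={lipschitz_upper_bound(mlp):8.2f} "
              f"R2={mlp.score(X_test, y_test):.3f}")
```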

Even though each activation function has Lipschitz constant 1, the way activations interact with the weight matrices affects the total Lipschitz constant upper bound of the MLP. They do not increase it, but they influence how the network responds to inputs and propagates gradients, and therefore affect how much of the Lipschitz capacity is used. For example, tanh squashes large input values, making the network locally less sensitive in certain directions. However, in combination with weak L2 regularisation this leads to a comparatively larger Lipschitz constant upper bound than for the relu activation function (Fig. 5).


After finding the optimal L2 regularisation penalty and activation function type, the trained MLP can be explored around the highest-gradient point found during the empirical Lipschitz constant estimation. This process is illustrated in Fig. 7, where the "LNG C5+ mol" MLP local Lipschitz constants are plotted in 3D space as a function of two inputs, while the remaining two of the four inputs are held constant at the values corresponding to the highest-gradient point. Working through all combinations of two inputs, six plots are generated. A sanity check for discontinuities and sharp edges on these plots helps to confirm MLP stability.

Fig. 7 – Distribution of local Lipschitz estimates around max dY/dX point.
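A minimal sketch of the scan behind one such plot, reusing local_lipschitz() from the L_empirical sketch above; x_star (the highest-gradient point), lo, hi and the choice of varied inputs are assumptions:

```python
import numpy as np

i, j = 0, 1                                  # the pair of inputs being varied
g_i = np.linspace(lo[i], hi[i], 50)
g_j = np.linspace(lo[j], hi[j], 50)

L_grid = np.empty((g_i.size, g_j.size))
for a, v_i in enumerate(g_i):
    for b, v_j in enumerate(g_j):
        x = x_star.copy()
        x[i], x[j] = v_i, v_j                # remaining inputs stay at x_star values
        L_grid[a, b] = local_lipschitz(mlp.predict, x)

# L_grid can now be surface-plotted; discontinuities or sharp edges are warning signs.
```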

When deployed in production, it is also possible to check local Lipschitz constant values around the current operating point and compare them with an effective Lipschitz constant, chosen as L_empirical with a multiplier greater than 1 applied. This verification ensures that abnormal MLP behaviour that was not seen in the offline robustness assessment but manifests itself in production is still not missed and any adverse impact is prevented. In this way even a missed adversarial attack vulnerability can be mitigated.
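A hedged sketch of such a production-time guard, reusing local_lipschitz() from above; the multiplier of 1.5 is an illustrative assumption, not a recommended value:

```python
L_effective = 1.5 * L_empirical              # multiplier > 1, per the text above

def robustness_check(predict, x_current) -> bool:
    """True if the local sensitivity at the operating point is within the limit."""
    return local_lipschitz(predict, x_current) <= L_effective
```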


Overall, the outcome of this process of assuring MLP robustness is called a robustness certificate. It is a mathematically provable guarantee that the MLP output will remain stable under certain bounded input perturbations. The Lipschitz certificate is one of the commonly used forms of such a guarantee.



Abbreviations

ML: Machine Learning
MLP: Multi-Layer Perceptron neural network


References


[1] A. von Felbert. The Universal Approximation Theorem.

[2] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, R. Fergus. Intriguing Properties of Neural Networks. International Conference on Learning Representations, 2014.

[3] Y. Gong, C. Poellabauer. An Overview of Vulnerabilities of Voice Controlled Systems. 2018.

[4] G. Khromov, S. P. Singh. Some Fundamental Aspects about Lipschitz Continuity of Neural Networks. 2024.

[5] K. Gupta. Stability Quantification of Neural Networks. Neural and Evolutionary Computing, Université Paris-Saclay, 2023.

[6] P. Geuchen, T. Heindl, D. Stöger, F. Voigtlaender. Upper and Lower Bounds for the Lipschitz Constant of Random Neural Networks. 2024.

