Frédéric Kihm et al. / Procedia Structural Integrity 38 (2022) 12–29
Before running any regression analysis, a useful preliminary step is to explore the data by measuring the linear correlation between the various features using the Pearson correlation coefficient (Friedman, Hastie, Tibshirani (2001)). This includes the correlation of the output with all the inputs and the correlations between the inputs. For an effective use of regression, it is advisable to verify that there is no hidden correlation among the input variables, and therefore no or little multicollinearity. Furthermore, a good level of correlation between the inputs and the output justifies investigating a linear regression model.

A linear regression model establishes the relation between the variables, which can be expressed mathematically as (James, Gareth, et al (2013)):

y = β₀ + β₁x₁ + ⋯ + βₚxₚ + ε   (1)

where y is the dependent variable, x₁, …, xₚ the independent variables, β₀, β₁, …, βₚ the intercept and slope coefficients, and ε an irreducible noise term that the model accommodates.

The linear regression equation is trained using Ordinary Least Squares (Friedman, Hastie, Tibshirani (2001)). This approach treats the data as a matrix and uses linear algebra operations to estimate the optimal values of the coefficients by minimizing the sum of the squared residuals. The results of an Ordinary Least Squares analysis include:
• the coefficients of the independent variables and the constant (intercept) term of the equation;
• R-squared: the percentage of variation in the dependent variable that is explained by the independent variables;
• Adj. R-squared: similar to R-squared but adjusted for the number of variables in the regression, so that it increases only when an additional variable actually improves the model;
• Prob(F-Statistic): the probability of observing a value of the test statistic equal to or more extreme than the one observed in the sample, under the assumption that the null hypothesis, “all the regression coefficients are zero”, is true.
Having Prob(F-Statistic) close to zero means that the regression is meaningful overall.
• The p-values, available for each independent variable, test the null hypothesis “the regression coefficient for this variable is zero”. A p-value lower than the significance level (usually 5%) indicates that the variable is statistically significant.

These results contain the properties of the predictive statistical model together with several indicators to assess its validity. Note that once the model is applied, the residuals are expected to follow a normal distribution (Montgomery (2017)).

2.6. Check the results

Using the methodology described above, a fatigue predictive model can be derived. Input signals collected from the CAN bus and analogue sensors are processed to produce time-correlated damage values, which are used as inputs to the model; the model then produces a fatigue damage estimate at a given location in the vehicle. The model was obtained from a relatively small sample of data and needs to be validated. Cross-validation is a technique for evaluating a model by dividing the original sample into a training part, used to train the model, and a testing part, used to evaluate it (James, Gareth, et al (2013)). In k-fold cross-validation, the original sample is randomly partitioned into k equally sized subsamples. Taking k = 10 gives a 10-fold cross-validation, which we used in our evaluation. Of the k parts, one is retained as the validation data for testing the model, and the remaining k−1 parts are used as training data. The cross-validation process is then repeated k times (the folds), with each of the k parts used exactly once as the validation data. The k performance measures from the folds can then be combined into a single performance measure. In this paper, the average is used, giving the relative absolute error (RAE).
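The preliminary correlation check can be illustrated with a short sketch. The feature names and values below are synthetic stand-ins, not the paper's actual channel-level damage data; the point is only to show how the Pearson correlation matrix flags multicollinearity between inputs and a usable input–output correlation:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200
# Synthetic stand-ins for per-channel damage features (illustrative only)
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)
damage = 2.0 * x1 + 0.5 * x3 + 0.1 * rng.normal(size=n)  # output variable

# Pearson correlation matrix: rows/columns are x1, x2, x3, damage
corr = np.corrcoef(np.vstack([x1, x2, x3, damage]))

# corr[0, 1] close to 1  -> multicollinearity: consider dropping x1 or x2.
# corr[0, 3] close to 1  -> a linear regression model is worth investigating.
print(np.round(corr, 2))
```

In practice one would inspect this matrix (or a heat map of it) before deciding which inputs to keep for the regression.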
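Fitting Eq. (1) by Ordinary Least Squares amounts to a small linear-algebra problem. The sketch below, on synthetic data (coefficient values are illustrative, not from the paper), appends an intercept column, solves for the coefficients that minimize the sum of squared residuals, and computes R-squared:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))                 # two independent variables x1, x2
beta_true = np.array([1.5, -0.8])           # illustrative slope coefficients
y = 0.3 + X @ beta_true + 0.1 * rng.normal(size=n)   # Eq. (1) with noise ε

# Ordinary Least Squares: add an intercept column and solve the
# least-squares problem, minimizing the sum of squared residuals.
A = np.column_stack([np.ones(n), X])
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)     # [β0, β1, β2] estimates

# R-squared: fraction of variation in y explained by the regression
residuals = y - A @ beta_hat
r_squared = 1.0 - (residuals @ residuals) / ((y - y.mean()) @ (y - y.mean()))
print(beta_hat.round(2), round(float(r_squared), 3))
```

A statistics package (e.g. a regression summary report) would additionally supply the Adj. R-squared, Prob(F-Statistic) and per-coefficient p-values listed above.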
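The 10-fold cross-validation loop can be sketched as follows, again on synthetic data. Since the paper does not spell out the RAE formula, the sketch assumes one common definition: the total absolute prediction error normalised by that of a mean-only predictor, averaged over the folds:

```python
import numpy as np

def relative_absolute_error(y_true, y_pred):
    # RAE (assumed definition): absolute error relative to a mean-only predictor
    return np.abs(y_true - y_pred).sum() / np.abs(y_true - y_true.mean()).sum()

rng = np.random.default_rng(1)
n, k = 200, 10
X = rng.normal(size=(n, 3))                               # synthetic inputs
y = 0.5 + X @ np.array([2.0, -1.0, 0.5]) + 0.2 * rng.normal(size=n)

# Randomly partition the sample into k equally sized folds
folds = np.array_split(rng.permutation(n), k)

scores = []
for i in range(k):
    test_idx = folds[i]                                   # validation fold
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])

    # Train an OLS model on the k-1 remaining folds
    A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
    beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)

    # Evaluate on the held-out fold
    A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
    scores.append(relative_absolute_error(y[test_idx], A_test @ beta))

# Combine the k fold scores into a single performance measure (the average)
print(round(float(np.mean(scores)), 3))
```

Each observation is used exactly once for validation, so the averaged score reflects performance on data the model was not trained on.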
