PSI - Issue 57

Olivier Vo Van et al. / Procedia Structural Integrity 57 (2024) 104–111 O. VoVan / Fatigue Design 2023 00 (2023) 000–000


Finally, the total cumulative damage is computed over the 6 years and then interpolated to obtain a damage value for each rail segment defined in the machine learning model. One thus obtains a map of temperature-induced damage that is fully dissociated from the time domain, which avoids accounting for other exogenous phenomena, such as corrosion or other forms of material decay, that do not depend on temperature variations.
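As an illustration of this accumulation and interpolation step, the following minimal sketch sums yearly damage values and interpolates them at segment positions. The source does not specify the interpolation scheme or its axis, so the one-dimensional piecewise-linear interpolation along a line, the kilometric positions, the damage values and all function names are assumptions made for this example only.

```python
from bisect import bisect_right


def total_damage(yearly_damage):
    """Sum the per-year damage values into one cumulative value per location."""
    return [sum(per_year) for per_year in yearly_damage]


def interpolate(x_known, d_known, x_query):
    """Piecewise-linear interpolation, clamped to the end values outside the known range."""
    if x_query <= x_known[0]:
        return d_known[0]
    if x_query >= x_known[-1]:
        return d_known[-1]
    j = bisect_right(x_known, x_query)          # first known position strictly right of x_query
    x0, x1 = x_known[j - 1], x_known[j]
    w = (x_query - x0) / (x1 - x0)              # relative position between the two neighbours
    return (1 - w) * d_known[j - 1] + w * d_known[j]


# Hypothetical inputs: damage per year (6 years, arbitrary units) at three known positions.
known_km = [0.0, 10.0, 30.0]
yearly = [[0.125] * 6, [0.25] * 6, [0.0625] * 6]
cumulative = total_damage(yearly)               # one value per known position

# Hypothetical segment positions used by the machine learning model.
segments_km = [5.0, 10.0, 20.0]
segment_damage = [interpolate(known_km, cumulative, s) for s in segments_km]
print(segment_damage)
```

Because the yearly values are summed before interpolation, the resulting per-segment value carries no time index, which matches the time-independent damage map described above.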

2.2. Machine learning model

Notation. Once the fatigue damage due to temperature variations has been computed, the objective is to assess the correlation between damage and crack initiation at the surface or subsurface of the rail. Formally, the description of a rail segment is modeled as a random variable $X = (X^{(1)}, \dots, X^{(d)}) \in \mathbb{R}^d$. Each marginal of $X$ describes one characteristic of the rail or track, such as its geometry or the nature of the steel. $Y$ is a binary random variable valued in $\{-1, +1\}$ describing the presence of squats between the years 2008 and 2016. The link between the two random variables $X$ and $Y$ is typically described by the so-called regression function $\eta(x) = \mathbb{P}\{Y = +1 \mid X = x\}$. Assessing correlation in a high-dimensional problem is known to be complex, and the methods that address it are still at the research stage [24]. Thus, an estimate of $\eta$, denoted $\hat{\eta}_D$, is computed over a training set $D$ using the Random Forest algorithm [1]. This training dataset is assumed to contain $N$ independent and identically distributed copies of $(X, Y)$, $D = (X_i, Y_i)_{i \in [1, N]}$. We then call scoring function any function $s : \mathbb{R}^d \to \mathbb{R}$ whose highest values are expected to be associated, with greater probability, with the occurrence of the event "$Y = +1$". Lastly, the dataset $D$ can alternatively be described through a matrix $x_D$ of size $N \times d$, where the rows correspond to rail segments and the columns correspond to variables.

Methodology. The relationship between rail properties and defect appearance is derived from an analysis of $\hat{\eta}_D$, namely the variable analysis. This analysis aims to rank the different variables according to their impact on the prediction. In other words, assessing feature importance assigns each feature a numerical value that represents its degree of influence in the model. These scores make it possible to quantify the contribution of a particular feature and to compare it with the contributions of other features. If a variable has a higher importance score than another, it is more useful to the learned model than the other. The importance of a variable $X^{(i)}$ is denoted $\mathrm{Imp}(X^{(i)})$. To ensure sufficiently robust results, this analysis requires a meticulous and methodical approach. Preliminary steps are essential to obtain a model that will subsequently be interpreted:

1. Feature selection: The study excluded variables that showed high correlation, as they might be redundant. Including such variables could bias the importance measure, as several variables could be associated with the same phenomenon. This selection was done manually through a rank correlation analysis (specifically Spearman correlation, as shown in Figure 1). Additionally, we relied on expert analysis to distinguish spurious correlations from suspected redundancy.
2. Regularization: To prevent overfitting, an inherent problem of Random Forest, regularization is applied by pruning each decision tree [14]. It is carried out in the simplest way: the number of samples in each leaf is set such that the accuracy, measured using AUC, is similar on the train and test datasets.
3. Cross-validation and fairness considerations: Fairness is an increasing concern in machine learning and has many definitions [13]. In the present context, a fair estimation of $\eta$ could mean that each segment has the same odds regardless of the way it is monitored. In practice, this can be challenging, as asset maintenance may have underlying graph structures and complex dependencies, for instance due to regional habits. To reduce such effects and prevent overfitting, cross-validation is performed such that homogeneous assets (for instance the rail segments of one line, say Paris to Lyon) are grouped in the same training set.
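The regularization and group-wise cross-validation steps above, followed by the variable ranking they prepare, can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the authors' pipeline: the data, the `line_id` grouping, the candidate leaf sizes and the use of impurity-based `feature_importances_` as the importance measure are all assumptions, since the source does not state which importance measure or library was used.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GroupKFold

# Synthetic stand-in for the rail data (all values hypothetical).
rng = np.random.default_rng(0)
N, d = 600, 5
X = rng.normal(size=(N, d))                             # segment descriptors
y = np.where(X[:, 0] + rng.normal(size=N) > 0, 1, -1)   # squat label in {-1, +1}
line_id = rng.integers(0, 6, size=N)                    # hypothetical line of each segment


def score_positive(forest, X):
    """Estimated P(Y = +1 | X = x), i.e. the forest's estimate of eta."""
    pos_col = list(forest.classes_).index(1)
    return forest.predict_proba(X)[:, pos_col]


def mean_auc(min_leaf, n_splits=3):
    """Mean train and test AUC over folds that never split a line."""
    train_auc, test_auc = [], []
    for tr, te in GroupKFold(n_splits=n_splits).split(X, y, groups=line_id):
        forest = RandomForestClassifier(
            n_estimators=50, min_samples_leaf=min_leaf, random_state=0
        ).fit(X[tr], y[tr])
        train_auc.append(roc_auc_score(y[tr], score_positive(forest, X[tr])))
        test_auc.append(roc_auc_score(y[te], score_positive(forest, X[te])))
    return float(np.mean(train_auc)), float(np.mean(test_auc))


# Regularization: pick the leaf size whose train and test AUC are closest.
candidates = [1, 5, 20, 50]
results = {m: mean_auc(m) for m in candidates}
best_leaf = min(candidates, key=lambda m: abs(results[m][0] - results[m][1]))

# Refit with the chosen leaf size and rank the variables.
final = RandomForestClassifier(
    n_estimators=50, min_samples_leaf=best_leaf, random_state=0
).fit(X, y)
importance = final.feature_importances_     # impurity-based; one possible choice of Imp
ranking = np.argsort(importance)[::-1]      # most influential variable first
print(best_leaf, ranking)
```

`GroupKFold` keeps all segments sharing a `line_id` on the same side of each split, which is one way to realise the grouping of homogeneous assets. Selecting purely on the smallest train/test gap favours large leaves, so in practice the test AUC itself would also be monitored.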

Fig. 1: Spearman correlation matrix before variable selection. Highly correlated variables, whose coefficients are greater than 0.5 or less than −0.5, are removed.
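The thresholding rule stated in the caption can be sketched as follows. In the study the selection was done manually and complemented by expert analysis, so the greedy rule below (keep a variable only if its absolute Spearman coefficient with every variable already kept does not exceed 0.5) is a simplified stand-in, and the synthetic data and function names are assumptions for illustration.

```python
import numpy as np
from scipy.stats import rankdata


def spearman_matrix(X):
    """Spearman correlation matrix: Pearson correlation of the column-wise ranks."""
    ranks = rankdata(X, axis=0)
    return np.corrcoef(ranks, rowvar=False)


def drop_correlated(X, threshold=0.5):
    """Greedily keep columns whose |rho| with all previously kept columns is <= threshold."""
    rho = spearman_matrix(X)
    kept = []
    for j in range(X.shape[1]):
        if all(abs(rho[j, k]) <= threshold for k in kept):
            kept.append(j)
    return kept, rho


# Hypothetical data: columns 1 and 3 are monotone functions of column 0 plus small noise.
rng = np.random.default_rng(0)
n = 500
x0 = rng.normal(size=n)
x1 = x0 + 0.05 * rng.normal(size=n)         # strongly positively correlated with x0
x2 = rng.normal(size=n)                     # independent of the others
x3 = -(x0 ** 3) + 0.01 * rng.normal(size=n) # strongly negatively rank-correlated with x0
X = np.column_stack([x0, x1, x2, x3])

kept, rho = drop_correlated(X, threshold=0.5)
print(kept)
```

The outcome of a greedy rule depends on the column order, which is one reason a manual, expert-guided choice among correlated variables may be preferable.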
