Kris Hectors et al. / Procedia Structural Integrity 75 (2025) 102–110
regression, gradient boosting regression, ridge regression, partial least squares (PLS) regression, linear support vector machines, K-nearest neighbours, random forest, and voting regression.

4.1. Data analysis and preprocessing

The two-dimensional profile data initially undergo the same preprocessing procedures as outlined in Section 3.2. After this initial preprocessing, the profile data are not translated to the real notch section diameter; instead, the notch root remains aligned with the origin. The reason for this is twofold. Firstly, Fig. 9 shows no significant correlation between the real section diameter and Kt. Secondly, translating the notch negatively affects the performance of the model. By aligning the notch root at the origin, the input vector primarily captures the geometry of the notch shape itself: its curvature, angle, and irregularities near the root, which are strongly correlated with Kt. If the input vector includes values that are also influenced by factors not strongly correlated with Kt (such as the absolute vertical position determined by the real section diameter), these uncorrelated variations act as "noise". This noise would make it harder for the model to identify and learn the true patterns linking the relevant shape features to Kt, thereby impacting the model's performance. By keeping the notch root aligned at the origin, the model input focuses on the geometric features of the notch profile that are most critical for stress concentration. Furthermore, only the notch height values are retained for training. Fig. 11 illustrates how the notch profile is vectorized. The vector h = (h₁, h₂, …, hₙ₋₁, hₙ) serves as the input to the machine learning models. To address the limited training dataset and mitigate overfitting risk, data augmentation was implemented by mirroring the scanned profiles across the vertical axis. This technique preserves the original Kt values while generating new, distinct vectorized sequences.
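The mirroring augmentation described above can be sketched as follows; the profile values and Kt labels below are purely illustrative, not data from the study:

```python
import numpy as np

# Hypothetical vectorized profiles: each row is a height sequence
# (h_1, ..., h_n) with the notch root (zero height) aligned at the origin.
profiles = np.array([
    [0.10, 0.04, 0.00, 0.05, 0.12],
    [0.09, 0.03, 0.00, 0.03, 0.08],
])
kt = np.array([2.1, 1.8])  # illustrative Kt labels, one per profile

# Mirror each profile across the vertical axis: reversing the sequence
# yields a new, physically valid sample with an unchanged Kt value.
profiles_aug = np.vstack([profiles, profiles[:, ::-1]])
kt_aug = np.concatenate([kt, kt])
# Dataset size is doubled: profiles_aug has 4 rows, kt_aug has 4 labels.
```

Because a mirrored notch concentrates stress identically, the label is simply duplicated alongside the reversed sequence.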
This effectively doubles the available dataset size without compromising the physical relevance of the samples. Each sequence was normalized by dividing its 'height from notch root' values by the average of its outer left and right values. Thus, the normalized height of point i in the sequence, expressed as ĥᵢ, is determined as ĥᵢ = 2hᵢ/(h₁ + hₙ) with i ∈ [1, n], in accordance with the notation defined in Fig. 11. To ensure consistency across samples, all sequences were resampled to a uniform length and interpoint distance. Z-score normalization was subsequently applied, transforming the features to zero mean and unit standard deviation. This prevents sequences with larger ranges from disproportionately influencing the analysis. Notably, the original specimens exhibited varying sampling resolutions; higher-resolution scans were therefore resampled onto average locations common to all lower-resolution scans, establishing a standardized profile length throughout the dataset.

To assess feature relevance, recursive feature elimination (RFE) with leave-one-out cross-validation (LOOCV) was performed on the preprocessed dataset. The RFE procedure conducted 100 training iterations, systematically eliminating features that contributed minimally to model performance. Fig. 12 displays the RFE results, which reveal that one feature, corresponding to the profile origin with zero height across all profiles, was eliminated in nearly every iteration. Features in proximity to the notch root were consistently retained, confirming their strong correlation with the stress concentration factor Kt. Generally, features more distant from the notch root were eliminated with greater frequency, indicating their diminished influence on the prediction. Endpoint features exhibited higher importance due to their direct relationship with the notch height. Features eliminated in more than 75% of the iterations (nine features in total) were excluded from further analysis.
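The normalization and resampling chain can be illustrated with a minimal sketch; the two short sequences and the target length below are hypothetical placeholders, and the Z-score step is shown per feature across samples as described in the text:

```python
import numpy as np

# Two hypothetical height sequences (h_1, ..., h_n); the root height is zero.
H = np.array([
    [0.40, 0.15, 0.00, 0.12, 0.36],
    [0.30, 0.10, 0.00, 0.09, 0.26],
])

# Normalize each sequence by the average of its outer left/right values:
# h_hat_i = 2 * h_i / (h_1 + h_n)
H_hat = 2.0 * H / (H[:, [0]] + H[:, [-1]])

# Resample every sequence onto a common length and interpoint distance
n_target = 9
x_old = np.linspace(0.0, 1.0, H_hat.shape[1])
x_new = np.linspace(0.0, 1.0, n_target)
H_res = np.vstack([np.interp(x_new, x_old, row) for row in H_hat])

# Z-score normalization per feature: zero mean, unit standard deviation
mu = H_res.mean(axis=0)
sigma = H_res.std(axis=0)
sigma[sigma == 0.0] = 1.0  # guard constant columns (e.g. the zero-height root)
Z = (H_res - mu) / sigma
```

The guard on constant columns mirrors the observation that the root feature carries zero height in every profile, which is why RFE eliminates it almost every iteration.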
The performance of models trained on this reduced feature set will be compared against models trained on the complete dataset to evaluate the efficacy of feature elimination.

4.2. Training and hyperparameter tuning

To optimize the models used in this study, their hyperparameters were tuned using a randomized search approach coupled with leave-one-out cross-validation (LOOCV). This process involved conducting 90 iterations of a random search within a predefined hyperparameter space to pinpoint the most effective hyperparameter set. LOOCV was employed to validate the performance of each hyperparameter combination identified during the randomized search. In this validation technique, for a training set of n samples, n − 1 samples are used to train the model; the remaining sample serves as the validation instance. This procedure is repeated n times, with each sample in the training set taking a turn as the validation instance. While LOOCV offers the benefit of lower bias, it can also elevate the risk of the model overfitting to the training data. To best utilize our limited dataset, we initially partitioned it, allocating 85% of the samples for the training set
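The tuning scheme above can be sketched with scikit-learn; the synthetic data, the ridge estimator, and the alpha range are illustrative assumptions (the study tunes several model types over its own predefined spaces, and runs 90 search iterations rather than the 20 used here for brevity):

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut, RandomizedSearchCV, train_test_split

# Synthetic stand-in for the profile feature vectors and Kt targets
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, -0.5, 0.3, 0.0, 0.2]) + 0.05 * rng.normal(size=20)

# 85% of the samples are allocated to the training set, as in the text
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.85, random_state=0)

# Randomized search over a predefined hyperparameter space, with each
# candidate validated by leave-one-out cross-validation
search = RandomizedSearchCV(
    Ridge(),
    param_distributions={"alpha": loguniform(1e-3, 1e2)},
    n_iter=20,
    cv=LeaveOneOut(),
    scoring="neg_mean_squared_error",
    random_state=0,
)
search.fit(X_tr, y_tr)
```

With n training samples, each candidate is fitted n times, once per held-out sample, which is what gives LOOCV its low bias at the cost of heavier computation and a higher overfitting risk on small datasets.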