
Franco Ciminelli et al. / Procedia Structural Integrity 78 (2026) 921–928


2. Methodology

The proposed methodology is structured into four main phases, which are detailed in the following subsections. An overview of the process is illustrated in Fig. 1.

1. Data processing and sampling: Data are synthetically generated through an algorithm that replicates the logic of the Guidelines. Stratified and balanced sampling is performed, followed by the selection of relevant features for model training.

2. Pre-training: A pre-training phase is carried out using AutoGluon to identify the most suitable machine learning algorithms. The dataset providing the best performance is selected based on learning curve analysis.

3. Training and model evaluation: The selected synthetic dataset is used for full model training. Model performance is evaluated through confusion matrix analysis and classification reports.

4. Experimental validation: The trained model is applied to real-world cases from the ANSFISA dataset to predict the seismic CoA.

Fig. 1. Overview of the proposed methodology for synthetic data generation, model selection, training, and application to real-world cases.
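The confusion matrix and per-class evaluation mentioned in phase 3 can be illustrated with a minimal pure-Python sketch. The labels and predictions below are illustrative placeholders only, not results from this study:

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows = true class, columns = predicted class."""
    idx = {c: i for i, c in enumerate(labels)}
    m = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        m[idx[t]][idx[p]] += 1
    return m

def per_class_report(y_true, y_pred, labels):
    """Precision and recall per class, as in a classification report."""
    m = confusion_matrix(y_true, y_pred, labels)
    report = {}
    for i, c in enumerate(labels):
        tp = m[i][i]
        pred_c = sum(row[i] for row in m)   # column sum: predicted as c
        true_c = sum(m[i])                  # row sum: actually c
        report[c] = {
            "precision": tp / pred_c if pred_c else 0.0,
            "recall": tp / true_c if true_c else 0.0,
        }
    return report

# Illustrative CoA-S labels (1..5); values are made up for the example.
y_true = [1, 2, 2, 3, 4, 5, 5]
y_pred = [1, 2, 3, 3, 4, 5, 4]
cm = confusion_matrix(y_true, y_pred, labels=[1, 2, 3, 4, 5])
rep = per_class_report(y_true, y_pred, labels=[1, 2, 3, 4, 5])
```

In practice a library such as scikit-learn provides equivalent `confusion_matrix` and `classification_report` functions; the sketch above only makes the row/column convention explicit.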

2.1. Data processing and sampling

The data processing and sampling phase represents the starting point for the predictive analysis of the Seismic Attention Class (CoA-S). In this phase, synthetic datasets are generated using an algorithm developed by some of the Authors, originally designed to perform statistical analyses on all types of assets covered by the Guidelines and, consequently, on infrastructure networks (Ciminelli et al., 2025). The goal is to replicate the logic and criteria defined by the national Guidelines.

From the conceptual analysis of the Guidelines, it emerged that a bridge can be described as an array of integers, where:

• each position in the array corresponds to a structural or functional parameter of the asset;
• each numeric value represents a specific modality selected from a finite and codified set of options.

This approach allows real infrastructure to be transformed into a data structure compatible with computational models. The encoded parameters are processed by the algorithm, which replicates the decision-making logic of the Guidelines. The output of the process is the CoA-S, which is also represented as a numeric value ranging from 1 to 5. The parameters involved in the seismic risk assessment, the possible classification options, and their corresponding numerical codes, along with the values assigned to the CoA-S, are summarized in Table 1.

The synthetic data are then subjected to a pre-processing phase, which includes stratified and balanced sampling operations (Trost, 1986). These techniques allow for:

• reducing the total number of samples, thereby limiting the computational burden during model training;
• avoiding overfitting by ensuring a balanced distribution of classes;
• preserving adequate representativeness of structural variables and risk conditions.

Starting from the complete dataset, composed of 1'497'600 synthetic bridges (Ciminelli et al., 2025), several subsets of increasing size were generated.
These were labeled according to their size as "Small" for 1'000 bridges, "Moderate" for 5'000, "Medium" for 10'000, "Large" for 20'000, "Very Large" for 25'000, "Huge" for 50'000, and "Maximum" for 100'000 bridges. All these reduced datasets were compared with the full dataset, used as a reference, to verify that the distributions of classes and parameters remained representative under the adopted sampling strategy.
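The integer-array encoding described above can be sketched as follows. The parameter names, modality codes, and classification rule here are purely hypothetical placeholders, not the actual Table 1 values or the Guidelines' decision logic:

```python
# Toy sketch of the bridge-as-integer-array encoding.
# Parameter names and modality codes are ILLUSTRATIVE, not Table 1 values.
MODALITIES = {
    "span_length":  {"short": 0, "medium": 1, "long": 2},       # hypothetical
    "material":     {"masonry": 0, "concrete": 1, "steel": 2},  # hypothetical
    "seismic_zone": {"low": 0, "moderate": 1, "high": 2},       # hypothetical
}
PARAM_ORDER = ["span_length", "material", "seismic_zone"]

def encode_bridge(bridge: dict) -> list:
    """Map a bridge description to a fixed-order array of integer codes."""
    return [MODALITIES[p][bridge[p]] for p in PARAM_ORDER]

def toy_coa_s(codes: list) -> int:
    """Toy stand-in for the Guidelines' decision logic: CoA-S in 1..5."""
    return min(5, 1 + sum(codes))  # NOT the real rule

bridge = {"span_length": "long", "material": "masonry", "seismic_zone": "high"}
codes = encode_bridge(bridge)
coa_s = toy_coa_s(codes)
```

Each bridge thus becomes a short integer vector plus a CoA-S label in 1–5, which is the form consumed by the machine learning models in the following phases.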
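The balanced subsampling step (equal representation per CoA-S class) can be sketched in plain Python; class labels and pool sizes below are illustrative, not the actual dataset proportions:

```python
import random
from collections import defaultdict

def balanced_sample(records, label_key, per_class, seed=42):
    """Draw an equal number of records from each class (balanced sampling)."""
    by_class = defaultdict(list)
    for r in records:
        by_class[r[label_key]].append(r)
    rng = random.Random(seed)
    sample = []
    for label, items in sorted(by_class.items()):
        sample.extend(rng.sample(items, min(per_class, len(items))))
    return sample

# Illustrative pool: 5 CoA-S classes with deliberately skewed frequencies.
pool = [{"coa_s": c} for c in [1]*500 + [2]*300 + [3]*120 + [4]*60 + [5]*20]
subset = balanced_sample(pool, "coa_s", per_class=20)
# Each class contributes exactly 20 records -> 100 in total.
```

Capping every class at the same count both shrinks the training set and removes the class imbalance that would otherwise bias the classifier toward the most frequent CoA-S values.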
