PSI - Issue 77

Hugo Mesquita Vasconcelos et al. / Procedia Structural Integrity 77 (2026) 601–610

636 seconds, a standard deviation of 600 seconds, and a median of 420 seconds. The distribution of total hours per class is presented in Figure 1. Twelve classes were defined according to the AIS classification: Background, Tug, Cargo, Dredger, Fishing, Passenger Ship, Rescue, Pleasure Craft, Tanker, Sailing, Pilot Vessel, and Other. The dataset is markedly imbalanced: Tug is the most represented class with 739 samples, followed by Cargo with 273 samples, while Pilot Vessel contains only four recordings. This imbalance reflects real maritime traffic patterns and thus mirrors the conditions expected in operational deployments. No file or class was discarded, since maintaining the natural distribution of vessel types provides a realistic basis for evaluating model performance under non-uniform class priors.

Fig. 1. Distribution of total hours per class in the obtained subset.

2.4. Preprocessing

The raw waveforms were first normalized to zero mean and unit variance to ensure numerical stability and consistency across all recordings. After normalization, each original WAV file was divided into one-second clips, producing over one million segments. Of these, 630,283 clips were retained as containing relevant acoustic information. Relevance was determined through amplitude analysis referenced to the background-labeled files: clips that carried a vessel label but consisted solely of background noise were excluded, reducing the likelihood of misclassifying non-vessel sounds as vessel activity, a strategy proposed by Domingos et al. (2022b). From every one-second clip, three spectro-temporal representations were computed: Mel, Gammatone, and Constant-Q Transform (CQT) spectrograms. These representations have been used extensively in auditory and bioacoustic research because of their perceptual correspondence with human hearing mechanisms and their ability to capture frequency-dependent features relevant to vessel detection, as supported by Domingos et al. (2022a). The three spectrograms were then stacked along a common channel dimension, forming a single image-like tensor in which each representation occupies one channel. This multi-channel configuration parallels the structure of RGB images and allows subsequent models to exploit complementary information across spectral scales. To prevent data leakage, clips originating from the same recording were assigned exclusively to one of the predefined dataset splits. This ensured that temporal segments from the same recording could not appear across different splits, preserving the integrity of the generalization assessment. The dataset was divided into 80% training, 10% validation, and 10% testing splits. All spectrogram tensors were precomputed and stored, allowing subsequent training procedures to proceed efficiently and reproducibly.
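The normalization, segmentation, and leakage-free splitting steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the per-recording normalization, the dropping of an incomplete final clip, and the assignment of the 80/10/10 fractions at the recording level are all assumptions made for illustration.

```python
import random
import statistics


def normalize(waveform):
    """Scale a waveform (list of floats) to zero mean and unit variance.

    Assumption: statistics are computed per recording.
    """
    mean = statistics.fmean(waveform)
    std = statistics.pstdev(waveform)
    if std == 0.0:
        # A constant signal has no variance; return zeros instead of dividing by zero.
        return [0.0 for _ in waveform]
    return [(x - mean) / std for x in waveform]


def segment(waveform, sample_rate):
    """Cut a waveform into non-overlapping one-second clips.

    Assumption: an incomplete final clip is dropped.
    """
    n_clips = len(waveform) // sample_rate
    return [waveform[i * sample_rate:(i + 1) * sample_rate] for i in range(n_clips)]


def split_by_recording(recording_ids, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole recordings (never individual clips) to train/val/test.

    Every clip inherits the split of its parent recording, so segments of
    one recording cannot appear in two different splits.
    """
    ids = sorted(set(recording_ids))
    random.Random(seed).shuffle(ids)
    n_train = int(fractions[0] * len(ids))
    n_val = int(fractions[1] * len(ids))
    return {
        "train": set(ids[:n_train]),
        "val": set(ids[n_train:n_train + n_val]),
        "test": set(ids[n_train + n_val:]),
    }
```

Splitting on recording identifiers rather than on clips is what enforces the leakage constraint; the resulting clip-level proportions will only approximate 80/10/10 when recordings differ in length.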
During precomputation, the dataset’s pronounced class imbalance was addressed through a controlled oversampling procedure. The number of samples per class was adjusted by replicating underrepresented classes according to inverse frequency ratios, ensuring that all vessel categories were proportionally represented within the training split. The oversampling was implemented by capturing additional one-second segments that began slightly before and after the original segment start times, thereby generating the number of samples required to achieve a balanced distribution across classes. As a result, the training dataset increased from approximately 500,000 to 3,860,728 one-second samples, while the validation and test splits were left non-oversampled, with 61,192 and 65,275 samples, respectively.
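One way to picture the shifted-segment oversampling is the following Python sketch. It is a hypothetical illustration under stated assumptions, not the procedure actually used: the helper names, the choice of the largest class as the balancing target, the fixed shift step, and the alternation of shifts after and before the original start are all assumed, since the text does not specify them.

```python
import math


def replication_factors(class_counts):
    """Inverse-frequency factors: copies needed per clip to reach the largest class.

    Assumption: the most frequent class is the balancing target.
    """
    target = max(class_counts.values())
    return {label: math.ceil(target / n) for label, n in class_counts.items()}


def shifted_starts(start, n_copies, recording_len, clip_len, step):
    """Return up to n_copies clip start offsets, in samples.

    The original start comes first, followed by offsets that alternate
    slightly after and slightly before it. Offsets that would place the
    clip outside the recording are skipped.
    """
    starts = [start]
    k = 1
    max_k = recording_len // step + 1  # beyond this, every shift is out of range
    while len(starts) < n_copies and k <= max_k:
        for candidate in (start + k * step, start - k * step):
            in_range = 0 <= candidate <= recording_len - clip_len
            if in_range and len(starts) < n_copies:
                starts.append(candidate)
        k += 1
    return starts
```

For a clip from a minority class, `replication_factors` gives the number of copies to request and `shifted_starts` gives where to cut them. Because the shifted clips overlap the original, this scheme should only be applied inside the training split, which is consistent with the validation and test splits being left untouched.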
