Carpanese Pietro et al. / Procedia Structural Integrity 44 (2023) 1980–1987
3.2. Description of the CNNs and results

After creating the database of labeled images, the CNNs were implemented. As mentioned before, three algorithms were developed: one for the prediction of the height, one for the material, and the third for the construction period. Since the three scripts are quite similar, the code pipeline is explained once, keeping in mind that the algorithms were trained on the same dataset of pictures but with different labels for the three cases. The specific differences among the three codes are pointed out where necessary.

Because much prior work already exists on Convolutional Neural Networks, it is common practice to pre-train a CNN on a very large dataset (e.g., ImageNet) and then use it as the initialization for a specific task. One of the most widely used techniques for this purpose is Transfer Learning, also adopted in this study, which uses a CNN as a fixed feature extractor and then fine-tunes it. In more detail, in this work a CNN is first pre-trained on ImageNet, and then its last fully-connected (FC) layer is removed. The rest of the CNN is treated as a fixed feature extractor, and a linear classifier is added and trained specifically on the new dataset. Fine-tuning thus corresponds to a "network surgery": the final set of FC layers of the pre-trained CNN is cut off, and the head is replaced with a new set of FC layers. This is justified by the observation that the first layers of a CNN learn generic features (e.g., edge detectors or color-blob detectors) that are useful for many tasks, while subsequent layers become progressively more specific to the details of the classes contained in the original dataset.

The main steps of the code are explained below. First, the images of the dataset are imported and preprocessed. Data augmentation is then applied to the training set through random rotations, zooms, translations, shears, and flips.
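The augmentation step can be sketched as follows, assuming the Keras `ImageDataGenerator` API; the transformation ranges shown are illustrative assumptions, not the values used in the study:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation applied to the training set only: random rotations, zooms,
# translations (width/height shifts), shears, and horizontal flips.
# All ranges below are illustrative choices, not values from the paper.
train_aug = ImageDataGenerator(
    rotation_range=30,
    zoom_range=0.15,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.15,
    horizontal_flip=True,
    fill_mode="nearest",
)

# Validation/evaluation images are not augmented.
val_aug = ImageDataGenerator()

# Example: apply one random transform to a dummy 224x224 RGB image.
dummy = np.random.rand(224, 224, 3)
augmented = train_aug.random_transform(dummy)
print(augmented.shape)  # (224, 224, 3)
```

In a full pipeline these generators would feed batches from the image directories (e.g., via `flow_from_directory`), which is the generator initialization described next.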
Subsequently, the training, validation, and evaluation generators are initialized so that they can load batches of images. Then the network surgery is implemented: the VGG16 architecture (Simonyan and Zisserman, 2015) is loaded with ImageNet pre-training weights, omitting the FC layers. In place of the removed FC layers, a new head is built from Flatten, Dense, and Dropout layers, with the output of the VGG16 base model serving as the input to these new layers.

After the layers of the base model are set as non-trainable, the model is compiled. First, the optimizer is defined: in this work, SGD (stochastic gradient descent with momentum) is chosen, with a small learning rate of 0.0001 and a momentum of 0.9. As regards the loss function, binary cross-entropy is chosen when two classes are to be predicted (i.e., for the CNNs predicting height and material), whereas categorical cross-entropy is adopted when the prediction concerns more than two classes (i.e., when the construction period is to be predicted). At this point, the head of the network is trained, updating the weights of the new FC layers only, so that they are initialized with learned values. For all three algorithms, the number of epochs is set to 50.

Once the head FC layers are trained and initialized, the final set of convolutional layers can be unfrozen and made trainable. In particular, only the final convolutional block of VGG16 is unfrozen, i.e., the last three convolutional layers and the last pooling layer. The model is then trained again, fine-tuning the final convolutional block together with the new FC head. The number of epochs is set to 20 in order not to overfit. Table 2 shows the precision, recall, and F1-score values after this second training process for all three CNNs.
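The network surgery and two-stage training described above can be sketched with the Keras VGG16 application as follows; the head sizes (256 units, 0.5 dropout) are illustrative assumptions, not values reported in the paper:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import SGD

def build_model(n_classes, weights="imagenet"):
    # Load VGG16 pre-trained on ImageNet, omitting the FC head.
    base = VGG16(weights=weights, include_top=False, input_shape=(224, 224, 3))
    base.trainable = False  # stage 1: freeze the whole convolutional base

    # New head: Flatten -> Dense -> Dropout -> classifier (sizes illustrative).
    x = Flatten()(base.output)
    x = Dense(256, activation="relu")(x)
    x = Dropout(0.5)(x)
    # Binary output (height, material) vs. multi-class (construction period).
    if n_classes == 2:
        out = Dense(1, activation="sigmoid")(x)
        loss = "binary_crossentropy"
    else:
        out = Dense(n_classes, activation="softmax")(x)
        loss = "categorical_crossentropy"

    model = Model(inputs=base.input, outputs=out)
    model.compile(optimizer=SGD(learning_rate=1e-4, momentum=0.9),
                  loss=loss, metrics=["accuracy"])
    return model, base

def unfreeze_block5(model, base, n_classes):
    # Stage 2: unfreeze only VGG16's final convolutional block
    # (block5_conv1..3 and block5_pool) and recompile before fine-tuning.
    base.trainable = True
    for layer in base.layers:
        layer.trainable = layer.name.startswith("block5")
    model.compile(optimizer=SGD(learning_rate=1e-4, momentum=0.9),
                  loss="binary_crossentropy" if n_classes == 2
                  else "categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Stage 1 (`model.fit` for 50 epochs) would train only the new head; after calling `unfreeze_block5`, a second `model.fit` for 20 epochs fine-tunes the last block together with the head.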
Precision measures the fraction of positive predictions that are correct (true positives out of all predicted positives); recall measures the fraction of actual positive cases that the classifier correctly identifies; lastly, the F1-score is the harmonic mean of the two. As can be seen, these metrics are generally promising and even excellent for height and material prediction. However, for the 1919–1945 and Post-1980 construction periods, the metrics indicate potentially inaccurate predictions. This is mainly due to the currently low number of images in the dataset labeled with these two categories, as shown in Table 1.

Once the three models are trained and fine-tuned on the dataset and their specific weights are obtained, they can be loaded to predict building height, material, and construction period from new building images. The image to be analyzed must be loaded and preprocessed so as to be comparable with the images in the dataset. Its class label can then be predicted by loading the fine-tuned model and performing inference to extract the top prediction. This operation can be repeated for multiple buildings, in particular for all buildings in the area of interest (whose pictures have been collected through the Google Street View API). Finally, the new information is added to the GeoDataFrame retrieved with OSM.
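As a concrete reference for these definitions, the per-class precision, recall, and F1-score can be computed from predicted and true labels as follows (a pure-Python sketch, not the evaluation code used in the study; the label strings are hypothetical):

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Per-class precision, recall, and F1-score.

    precision = TP / (TP + FP): fraction of predicted positives that are correct.
    recall    = TP / (TP + FN): fraction of actual positives that are found.
    F1        = harmonic mean of precision and recall.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Example with hypothetical construction-period labels:
true_labels = ["Pre-1919", "1919-1945", "1919-1945", "Post-1980"]
pred_labels = ["Pre-1919", "1919-1945", "Post-1980", "Post-1980"]
p, r, f = precision_recall_f1(true_labels, pred_labels, "1919-1945")
print(p, r, f)  # 1.0 0.5 0.6666666666666666
```

This also illustrates why an under-represented class hurts these metrics: with only two true "1919-1945" samples, a single missed prediction already halves the recall.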