
2.2. Machine learning algorithm

ML algorithms are sets of instructions that strive to discover patterns and relationships between data features and the output (Berthold et al., 2010). In particular, the ML regression algorithms used in this study can be divided into two categories based on how the algorithm learns the data distribution: distance-based and tree-based. In distance-based regression algorithms, predictions are made from the distances between data points. In regression trees, predictions are made from trees that are grown by repeatedly dividing the dataset into subsets based on impurity measures, such as cross-entropy and the Gini index (James et al., 2013), until the final subsets contain observations of the same class or category.

2.2.1. K-nearest neighbor

K-nearest neighbors (KNN) predicts the output for a new data point by observing the outputs of similar data points that were used to train the model (Cunningham & Delany, 2021). Similarity is measured by metrics such as the Euclidean distance, which is the square root of the sum of the squared differences between two points. The prediction is the average of the outputs of the k nearest neighbors, where k is a user-defined parameter. A KNN model is trained with a range of values for k, and the k value that yields the closest predictions is chosen.

2.2.2. Support Vector Machine

Support vector machine (SVM) attempts to find a hyperplane that best fits the given training data (Vapnik, 1995). For n features, SVM maps the data into a higher-dimensional space using kernel functions such as the radial basis function (RBF) kernel and attempts to find a hyperplane that fits all training data with minimum error. Mathematically, SVM finds a function f(x) that deviates from the target variable by no more than a specified error margin ε for each training point, while at the same time being as flat as possible.

2.2.3. Random Forest

Random forests (RF) make use of regression trees to make predictions (Breiman, 2001). Each tree is built from features selected at random from the dataset. Features are selected with replacement (i.e. two different trees can have features in common), which helps to avoid overfitting. Each tree makes a prediction for a new data point, and the final prediction is the average of the predictions made by all trees. Because of this inherent randomness, RF can be used for datasets irrespective of their distribution.

2.2.4. eXtreme Gradient Boosting

Like RF, eXtreme Gradient Boosting (XGBoost) makes use of trees to make predictions. However, XGBoost builds its trees sequentially. The prediction of each tree is compared with the actual value, and the error is minimized using regularization methods (Chen & Guestrin, 2016). A new tree is built from the errors of the previous tree, and its prediction is added to that of the previous trees. In this sequence, new trees are built and the error at each step is minimized. The rate at which the error is reduced at each step is known as the learning rate. The learning rate and the regularization methods are user-defined parameters of the algorithm.
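As an illustration of the four regression algorithms described above, the sketch below instantiates and fits each of them using the scikit-learn and xgboost Python packages. The data (X_train, y_train) and the hyperparameter values are placeholders for illustration only; they are not the settings used in this study.

```python
# Minimal sketch of the four regression algorithms (scikit-learn / xgboost APIs).
# X_train and y_train are placeholder data; hyperparameter values are illustrative.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X_train = rng.random((100, 4))   # placeholder feature matrix
y_train = rng.random(100)        # placeholder target vector

models = {
    # Distance-based: prediction is the average of the k nearest neighbours.
    "KNN": KNeighborsRegressor(n_neighbors=5),
    # SVM regression with an RBF kernel and epsilon-insensitive error margin.
    "SVM": SVR(kernel="rbf", epsilon=0.1),
    # Ensemble of trees built on random feature subsets; prediction is their average.
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
    # Trees built sequentially; each new tree fits the errors of the previous ones.
    "XGBoost": XGBRegressor(n_estimators=200, learning_rate=0.1, reg_lambda=1.0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.predict(X_train[:3]))
```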
2.3. Hyperparameter tuning

For each ML algorithm discussed in the previous section, there is a set of user-defined parameters, known as hyperparameters (i.e. k for KNN, the kernel and ε for SVM, the number of trees in RF, and the learning rate and regularization for XGBoost). An ML model's performance depends on the hyperparameters selected to train it. Hyperparameter tuning is an iterative procedure that finds the set of user-defined parameters that results in the best-performing model. Hyperparameter tuning algorithms, such as grid search and random search with cross-validation (e.g. GridSearchCV), help find the best set of parameters for an ML model (Liashchynskyi & Liashchynskyi, 2019).
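The sketch below shows one possible realization of this step with scikit-learn's GridSearchCV, tuning k for the KNN regressor. The candidate grid, scoring metric, 5-fold split, and placeholder data are illustrative assumptions, not the configuration reported in this study.

```python
# Sketch of hyperparameter tuning with grid search and cross-validation (GridSearchCV).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X_train, y_train = rng.random((100, 4)), rng.random(100)  # placeholder data

# Candidate values of k; the grid is illustrative only.
param_grid = {"n_neighbors": [2, 3, 5, 7, 10]}

search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    cv=5,                              # 5-fold cross-validation
    scoring="neg_mean_squared_error",  # less negative score = lower MSE = better model
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```

The same pattern applies to the other algorithms by swapping the estimator and the parameter grid (e.g. epsilon for SVR, n_estimators for RF, learning_rate for XGBoost).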
