SVMs are supervised learning algorithms that can be used for both classification and regression applications (Elmousalami 2019a, 2020). An SVM seeks the separating hyperplane that maximizes the margin between the two classes, as shown in Fig. 4. The class boundaries of a linear SVM are defined by the following equation (Vapnik 1979):
$$ {\text{Linear}}\,{\text{SVM}} = \left\{ {\begin{array}{*{20}l} {W \cdot X_{i} + b \ge 1 ,} \hfill & {{\text{if}}\;y_{i} \ge 0} \hfill \\ {W \cdot X_{i} + b < - 1 ,} \hfill & {{\text{if}}\;y_{i} < 0 } \hfill \\ \end{array} } \right. $$
(1)
For i = 1, 2, 3, …, m, a nonnegative slack variable (\( \xi_{i} \)) is added to handle the non-separable cases, as displayed in Eq. (2):
$$ y_{i} \left( {W \cdot X_{i} + b} \right) \ge 1 - \xi_{i} ,\quad i = 1,2,3, \ldots , m $$
(2)
Fig. 4 Linear support vector machine
Accordingly, the objective function will be as shown in Eq. (3):
$$ {\text{Min}}\;\frac{1}{2} w \cdot w^{T} + C \mathop \sum \limits_{i = 1}^{m} \xi_{i} $$
(3)
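As an illustrative sketch only (the synthetic data, library choice, and parameter values below are assumptions, not taken from this study), the penalty C in Eq. (3) corresponds to the C argument of scikit-learn's LinearSVC: larger values of C penalize margin violations (the slack variables) more heavily.

```python
# Minimal sketch: a linear SVM with slack penalty C (scikit-learn, synthetic data).
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# C weights the slack term of Eq. (3); max_iter is raised to ensure convergence.
clf = LinearSVC(C=1.0, max_iter=10_000).fit(X, y)
print("weights W:", clf.coef_, "bias b:", clf.intercept_)
```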
Decision tree (DT) is a statistical learning algorithm that hierarchically divides the collected data into logical rules (Elmousalami 2019b; Breiman et al. 1984), as shown in Fig. 5. A splitting algorithm is applied recursively to formulate each node of the tree. Classification and regression trees (CART) and the C4.5/C5.0 algorithms are the most common tree models in the research and practitioner communities. The model is applied to both classification and continuous prediction applications (Curram and Mingers 1994). A DT can interpret the data and the feature importance through the logical statement generated at each tree node. However, DT is not a robust or stable algorithm against noisy and missing data (Perner et al. 2001).
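A minimal sketch (assumed setup, not the study's configuration) of a CART-style tree in scikit-learn, showing the two interpretability outputs mentioned above: the hierarchical if/else rules and the feature importances.

```python
# Illustrative sketch: fit a decision tree and inspect its rules and importances.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

print(export_text(tree))          # the logical rules generated at each node
print(tree.feature_importances_)  # feature importance derived from the splits
```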
Fig. 5 Additive function concept
Logistic regression (logit regression) is a predictive regression analysis appropriate for a dichotomous (binary) dependent variable (Hosmer et al. 2013). It is used to explain the data and to describe the relationship between one binary dependent variable and one or more independent variables. The data should contain no outliers, and there should be no high correlations (multicollinearity) among the predictors (Tabachnick and Fidell 2013). Mathematically, logistic regression can be defined as follows:
$$ P = \frac{1}{{1 + e^{{ - \left( {a + bX} \right)}} }} $$
(4)
where P is the classification probability, e is the base of the natural logarithm, and a and b are the parameters of the model. Adding more predictors to the model can result in overfitting, which reduces the model's generalizability and increases its complexity.
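The following short sketch evaluates Eq. (4) directly; the parameter values a and b are made up for illustration.

```python
# Minimal sketch of Eq. (4): the logistic function mapping a linear score
# a + b*X to a classification probability P.
import numpy as np

def logistic_probability(X, a, b):
    """P = 1 / (1 + exp(-(a + b*X)))"""
    return 1.0 / (1.0 + np.exp(-(a + b * X)))

print(logistic_probability(np.array([-2.0, 0.0, 2.0]), a=0.5, b=1.2))
```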
The KNN algorithm builds a nonparametric classifier (Altman 1992; Weinberger et al. 2006). KNN is an instance-based learner used for classification or regression applications. An object is classified by a majority vote of its neighbors in the training set; if K = 1, the case is simply assigned to the class of its nearest neighbor. Many distance functions can be applied to measure the similarity among instances, such as the Euclidean, Manhattan, and Minkowski distances (Singh et al. 2013).
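As a sketch under assumed settings (synthetic data, K = 5), the distance metric is exposed directly in scikit-learn's KNeighborsClassifier: the Minkowski distance with p = 2 is the Euclidean distance, and p = 1 gives the Manhattan distance.

```python
# Illustrative sketch: KNN classification by majority vote of the K nearest neighbours.
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=150, n_features=3, n_informative=3,
                           n_redundant=0, random_state=1)
knn = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2).fit(X, y)
print(knn.predict(X[:5]))  # each case takes the majority class of its 5 nearest neighbours
```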
The Gaussian Naive Bayes classifier is a classification algorithm that assumes independence among the predictors (Patil and Sherekar 2013). Naive Bayes is useful for very large datasets and can outperform even highly sophisticated classification methods. Bayes' theorem computes the posterior probability P(c|x) from P(c), P(x), and P(x|c), as shown in Eq. (5):
$$ P(c|x) = \frac{{P\left( {x|c} \right)P\left( c \right)}}{P\left( x \right)} $$
(5)
where P(c|x) represents the posterior probability of the target class (c, target) given the input predictors (x, attributes); P(c) is the prior probability of the target class; P(x|c) is the likelihood, i.e., the probability of the predictors given the class; and P(x) is the prior probability of the predictors. The Naive Bayes algorithm computes the likelihood and the posterior probability for each class, and the class with the highest posterior probability is the prediction outcome (Kohavi 1996).
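A brief sketch (assumed synthetic data) of the Gaussian variant in scikit-learn, showing the per-class posteriors of Eq. (5) and the highest-posterior prediction.

```python
# Illustrative sketch: Gaussian Naive Bayes applies Eq. (5) under the assumption
# that predictors are conditionally independent given the class.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=5, random_state=2)
nb = GaussianNB().fit(X, y)

print(nb.predict_proba(X[:3]))  # posterior P(c|x) for each class
print(nb.predict(X[:3]))        # class with the highest posterior probability
```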
ANNs are computational systems biologically inspired by the design of natural neural networks (NNN). Key abilities of ANNs are generalization, categorization, prediction, and association (LeCun et al. 2015). ANNs can dynamically capture nonlinear relationships and patterns between the objects and subjects of knowledge (Elmousalami et al. 2018b). A feedforward network such as the multilayer perceptron (MLP) applies an input vector (x), a weight matrix (W), an output vector (Y), and a bias vector (b). It can be formulated as in Eq. (6) and Fig. 6:
$$ Y = f\left({W \cdot x + b} \right) $$
(6)
where f(·) is a nonlinear activation function.
Fig. 6 Multilayer perceptron network (MLP)
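A minimal NumPy sketch of the single-layer form of Eq. (6); the shapes, weights, and choice of ReLU as the activation f are illustrative assumptions.

```python
# Sketch of Eq. (6): one feedforward layer Y = f(W·x + b) with a nonlinear activation f.
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

x = np.array([0.5, -1.2, 3.0])                      # input vector x
W = np.random.default_rng(0).normal(size=(4, 3))    # weight matrix W (4 hidden units)
b = np.zeros(4)                                     # bias vector b

Y = relu(W @ x + b)                                 # output vector Y of the layer
print(Y)
```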
Ensemble methods and fusion learning are data mining techniques that fuse several ML algorithms, such as ANNs, DT, and SVM, to boost the overall performance and accuracy (Hansen and Salamon 1990). Each single ML model used in the ensemble is called a base learner, and the final decision is taken by the ensemble model. An additive function over K base learners predicts the final output, as given in Eq. (7):
$$ \hat{y}_{i} = \mathop \sum \limits_{k = 1}^{K} f_{k} \left( {X_{i} } \right),\quad f_{k} \in F $$
(7)
where \( \hat{y}_{i} \) represents the predicted dependent variable, each \( f_{k} \) is an independent tree structure with its leaf weights w, and F is the space of regression trees. Ensemble methods include several approaches such as bagging, voting, stacking, and boosting (Elmousalami 2019c, 2020). Ensemble learning models deal effectively with complex data structures, high-dimensional data, and small sample sizes (Breiman 1996; Dietterich 2000; Kuncheva 2004).
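The additive form of Eq. (7) can be sketched with K independently fitted regression trees; the data, the subsampling scheme, and the 1/K scaling below are illustrative assumptions (a bagging-style average rather than the study's exact ensemble).

```python
# Sketch of Eq. (7): the prediction is an additive combination of K tree base learners.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

K = 5
trees = []
for k in range(K):
    idx = rng.choice(200, size=150, replace=True)   # subsample for the k-th base learner
    trees.append(DecisionTreeRegressor(max_depth=3, random_state=k).fit(X[idx], y[idx]))

# y_hat_i = sum_k f_k(X_i); each output is scaled by 1/K so the sum stays on the target scale.
y_hat = sum(tree.predict(X) for tree in trees) / K
print(y_hat[:5])
```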
Breiman (1999) proposed the bagging technique shown in Fig. 7a. Bagging applies bootstrap aggregating to train several base learners for variance reduction (Breiman 1996): groups of training data are drawn with replacement to train each base learner. Random forest (RF) is a special case of the bagging ensemble techniques; RF draws bootstrap subsamples to randomly build a forest of trees, as shown in Fig. 7b (Breiman 2001). Using adaptive resampling, the boosting method enhances the performance of weak base learners (Schapire 1990), as shown in Fig. 7c. The adaptive boosting algorithm (AdaBoost) was proposed by Schapire et al. (1998). AdaBoost serially draws the data for each base learner using adaptive weights over all instances; these adaptive weights guide the algorithm to minimize the prediction error and the misclassified cases (Bauer and Kohavi 1999).
Fig. 7 a Bagging, b RF, and c boosting
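The three schemes of Fig. 7 can be contrasted in a few lines; the dataset and ensemble sizes below are assumptions for illustration, and each scikit-learn estimator uses a decision tree as its default base learner.

```python
# Illustrative sketch: bagging (Fig. 7a), random forest (Fig. 7b), and AdaBoost (Fig. 7c).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=4)

models = {
    "bagging": BaggingClassifier(n_estimators=50, random_state=4),            # bootstrap-trained trees
    "random forest": RandomForestClassifier(n_estimators=50, random_state=4), # bagging + random feature subsets
    "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=4),          # sequential reweighting of instances
}
for name, model in models.items():
    print(name, model.fit(X, y).score(X, y))
```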
Extreme gradient boosting (XGBoost) is a gradient boosting tree algorithm. XGBoost uses parallel computing to learn faster and diminish the computational complexity (Chen and Guestrin 2016). A regularization term is added to the additive tree model to avoid overfitting, as shown in Eq. (8):
$$ L\left( \phi \right) = \mathop \sum \limits_{i} l\left( {\hat{y}_{i} ,y_{i} } \right) + \mathop \sum \limits_{k = 1}^{K} \varOmega \left( {f_{k} } \right),\quad {\text{where}}\;\varOmega \left( f \right) = \gamma T + \frac{1}{2}\lambda \left\| w \right\|^{2} $$
(8)
where L represents a differentiable convex cost function (Friedman 2001). Moreover, XGBoost assigns a default direction to its tree branches to handle missing data in the training dataset, so little effort is required to clean the training data. Stochastic gradient boosting (SGB) is a hybrid of boosting and bagging (Breiman 1996). SGB iteratively improves the model's performance by injecting randomization into the selected data subsets, which improves the fitting accuracy and reduces the computational cost (Schapire et al. 1998).
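A sketch under assumed settings (not the study's configuration): the gamma and reg_lambda arguments of the xgboost package's XGBClassifier correspond to the γ and λ terms of Eq. (8), and scikit-learn's GradientBoostingClassifier with subsample < 1 is one stochastic gradient boosting variant.

```python
# Illustrative sketch: XGBoost with the Eq. (8) regularization terms, and an SGB variant.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier  # requires the separate xgboost package

X, y = make_classification(n_samples=400, n_features=8, random_state=5)

xgb = XGBClassifier(n_estimators=100, gamma=1.0, reg_lambda=1.0, n_jobs=-1)
sgb = GradientBoostingClassifier(n_estimators=100, subsample=0.5, random_state=5)

print("XGBoost:", xgb.fit(X, y).score(X, y))
print("SGB:", sgb.fit(X, y).score(X, y))
```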
The extremely randomized trees algorithm (extra trees) is a tree-based ensemble method that can be applied to both supervised classification and regression cases (Vert 2004). Extra trees randomizes both the cut-point choice and the attribute selection during tree node splitting. The key advantage of the extra trees algorithm is this tree-structure randomization, which enables the algorithm to be tuned for optimal parameter selection. Moreover, extra trees has a high computational efficiency based on a bias/variance analysis (Vert 2004).
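For completeness, a one-line fit of scikit-learn's ExtraTreesClassifier (synthetic data and settings assumed), which randomizes both the split attribute and the cut-point as described above.

```python
# Illustrative sketch: extremely randomized trees.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=6)
extra = ExtraTreesClassifier(n_estimators=100, random_state=6).fit(X, y)
print(extra.score(X, y))
```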
In ML, many parameters are estimated and improved during the learning process. By contrast, a hyperparameter is a variable whose value is set before training, and the performance of an ML algorithm depends on these tuned hyperparameters. The objective of hyperparameter optimization is to maximize the predictive accuracy by finding the optimal hyperparameters for each ML algorithm. Manual search, random search, grid search, Bayesian optimization, and evolutionary optimization are the most common techniques for ML hyperparameter optimization. However, manual search, random search, and grid search are brute-force techniques that require a very large number of trials to cover all possible combinations and reach the optimal hyperparameters (Bergstra et al. 2011). On the other hand, Bayesian optimization and evolutionary optimization are automatic hyperparameter optimization methods that select the optimal parameters with less human intervention (Shahriari et al. 2015), and they can mitigate the curse of dimensionality. Therefore, this study used genetic algorithms to select the globally optimal setting for each model before the training stage. Starting with a random population, the iterative process of selecting the fittest individuals and producing the next generation stops once the best-known solution is satisfactory to the user. The objective function is defined as the maximization of the classification accuracy (Acc, Eq. 9) for each classifier. Classification accuracy (Acc) is the ratio between the correctly classified instances and the total number of samples, as in Eq. (9):
$$ {\text{Acc}} = \frac{{{\text{TP}} + {\text{TN}}}}{{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}} $$
(9)
where TP is the true positive; FP, the false positive; TN, the true negative; and FN, the false negative. The domain space is defined as the range of all possible hyperparameters for each algorithm, as shown in Table 2. This study applied a decision tree algorithm as the base learner for all ensemble methods. Accordingly, the proposed ensemble models and the decision tree are classified as tree-based models sharing the same parameters, as shown in Table 2. The maximum number of iterations is set to 10,000.
Table 2 Optimal hyperparameters settings
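A minimal genetic-algorithm sketch of the hyperparameter search described above, maximizing cross-validated accuracy (Eq. 9) for a tree-based model. The search space, population size, number of generations, and mutation rate are illustrative assumptions, not the study's exact GA configuration.

```python
# Illustrative GA sketch: selection, crossover, and mutation over tree hyperparameters.
import random
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=7)
random.seed(7)

def random_individual():
    return {"max_depth": random.randint(2, 20), "min_samples_split": random.randint(2, 20)}

def fitness(ind):
    # Fitness = mean cross-validated classification accuracy (Eq. 9).
    model = DecisionTreeClassifier(**ind, random_state=7)
    return cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()

population = [random_individual() for _ in range(10)]
for generation in range(5):
    population.sort(key=fitness, reverse=True)        # selection: keep the fittest
    parents = population[:4]
    children = []
    while len(children) < len(population) - len(parents):
        a, b = random.sample(parents, 2)
        child = {"max_depth": a["max_depth"],          # crossover of parent genes
                 "min_samples_split": b["min_samples_split"]}
        if random.random() < 0.3:                      # mutation
            child["max_depth"] = random.randint(2, 20)
        children.append(child)
    population = parents + children

best = max(population, key=fitness)
print("best hyperparameters:", best, "accuracy:", fitness(best))
```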
To compare the machine learning algorithms, identical blind validation cases are used to test the algorithms' performance. The dataset has been divided into a training set (80%) and a validation set (20%), where the validation cases are excluded from the training data to ensure generalization capability. This study applied a tenfold cross-validation (10 CV) approach using the validation dataset (20% of the whole dataset). K-fold cross-validation improves the reliability of the validation process when the dataset is limited.
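The evaluation protocol can be sketched as follows (synthetic data and model assumed; the 80/20 split and the tenfold cross-validation on the held-out portion mirror the description above).

```python
# Illustrative sketch: 80/20 split with 10-fold cross-validation on the validation set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=8)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=8)

model = RandomForestClassifier(n_estimators=100, random_state=8)
model.fit(X_train, y_train)                               # training set (80%)

# Tenfold CV on the held-out 20% (cross_val_score refits a clone on each fold).
scores = cross_val_score(model, X_val, y_val, cv=10, scoring="accuracy")
print(scores.mean(), scores.std())
```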
Classification accuracy (Acc), specificity, and sensitivity are scalar measures of classification performance. Moreover, the receiver operating characteristic (ROC) curve is a graphical measure of a classification algorithm's performance (Tharwat 2018). The ROC curve is a two-dimensional graph in which the true positive rate (TPR) is represented on the y-axis and the false positive rate (FPR) on the x-axis (Sokolova et al. 2006a, b; Zou 2002):
$$ {\text{TPR}} = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}} $$
(10)
$$ {\text{FPR}} = \frac{\text{FP}}{{{\text{TN}} + {\text{FP}}}} $$
(11)
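The following sketch computes TPR (Eq. 10), FPR (Eq. 11), the ROC curve, and the AUC discussed below from predicted scores; the labels, scores, and 0.5 threshold are illustrative assumptions.

```python
# Illustrative sketch: TPR, FPR, ROC curve, and AUC with scikit-learn.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve

y_true  = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])   # classifier scores
y_pred  = (y_score >= 0.5).astype(int)                           # a single threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TPR:", tp / (tp + fn), "FPR:", fp / (tn + fp))            # Eqs. (10) and (11)

fpr, tpr, thresholds = roc_curve(y_true, y_score)                # TPR/FPR across all thresholds
print("AUC:", roc_auc_score(y_true, y_score))
```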
Based on the ROC, perfect classification occurs when the classifier's curve passes through the upper left corner of the graph; at that corner point, all positive and negative samples are correctly classified. Therefore, the steeper curve has the better performance. The area under the ROC curve (AUC) is used to compare different classifiers through a single scalar value. The AUC score ranges between zero and one, and since a random classifier scores 0.5, no realistic classifier has an AUC lower than 0.5 (Metz 1978; Bradley 1997). The ROC curve for each classifier must be plotted to show its performance against different thresholds. In addition, the cost function is represented in the following equation:
$$ {\text{Error}} = \frac{1}{N}\mathop \sum \limits_{i = 1}^{N} L\left\{ {\hat{Y}^{\left( i \right)} \ne Y^{\left( i \right)} } \right\} $$
(12)
where Error is the misclassification rate, N is the number of cases, \( \hat{Y}^{\left( i \right)} \) is the predicted value, \( Y^{\left( i \right)} \) is the actual value, and L is the 0/1 loss function. In the current study, weights are added to the error formula (Eq. 12) to emphasize the cases in which the well is actually stuck but the model predicts a non-stuck case. To handle such cases, the weights defined in Eq. (13) are combined with Eq. (12) to formulate the modified error in Eq. (14):
$$ W^{\left( i \right)} = \left\{ {\begin{array}{*{20}l} 1 \hfill & { {\text{if}}\quad X^{\left( i \right) } {\text{is}}\,{\text{nonstuck}}\,{\text{case}}} \hfill \\ {10} \hfill & {{\text{if}}\quad X^{\left( i \right)} \,{\text{is}}\,{\text{stuck}}\,{\text{case}}} \hfill \\ \end{array} } \right. $$
(13)
where \( X^{\left( i \right) } \) is the actual classification of the oil well stuck case.
$$ {\text{Modified}}\,{\text{Error}} = \frac{1}{{\sum W^{\left( i \right)} }}\mathop \sum \limits_{i = 1}^{N} W^{\left( i \right)} L\left\{ {\hat{Y}^{\left( i \right)} \ne Y^{\left( i \right)} } \right\} $$
(14)
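As a short illustration of Eqs. (12)-(14), the sketch below evaluates the weighted misclassification error, with the weight of 10 on actual stuck cases taken from Eq. (13); the label and prediction vectors are made up for illustration.

```python
# Illustrative sketch: modified (weighted) misclassification error, Eqs. (12)-(14).
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # 1 = stuck case, 0 = non-stuck case
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

weights = np.where(y_true == 1, 10, 1)                        # Eq. (13)
loss = (y_pred != y_true).astype(float)                       # 0/1 loss L
modified_error = np.sum(weights * loss) / np.sum(weights)     # Eq. (14)
print(modified_error)
```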