Random Forest


Key takeaways

  1. Random Forest is an ensemble learning method that combines multiple decision trees to make predictions.

  2. It is a powerful and versatile algorithm used for both classification and regression tasks.

  3. Random Forest creates a collection of decision trees, where each tree is trained on a different subset of the training data and uses random feature subsets for making decisions.

  4. The random feature subsets help reduce overfitting and improve the generalization of the model.

  5. Classic Random Forest implementations tolerate missing values (for example via surrogate splits or proximity-based imputation) and can maintain accuracy even when a sizeable portion of the data is missing.

  6. It can handle large datasets with high-dimensional feature spaces and performs well in the presence of irrelevant or noisy features.

  7. Random Forest provides a measure of feature importance, allowing you to identify the most influential features in your dataset.

  8. It can handle both numerical and categorical features without requiring extensive data preprocessing.

  9. Random Forest is resistant to overfitting and tends to generalize well, even without extensive hyperparameter tuning.

  10. Random Forest is a popular choice in machine learning due to its robustness, scalability, and ability to handle complex problems with high accuracy.

Interview Questions

  1. What is Random Forest, and how does it work?

  2. What are the advantages of using Random Forest over a single decision tree?

  3. How does Random Forest handle missing values and categorical variables?

  4. How do you determine the optimal number of trees in a Random Forest?

  5. What is the concept of feature importance in Random Forest, and how is it calculated?

  6. What is the difference between bagging and boosting? How is Random Forest related to bagging?

  7. How does Random Forest handle overfitting, and what techniques can be used to mitigate it?

  8. Can Random Forest be used for regression problems? If so, how?

  9. How does Random Forest handle imbalanced datasets?

  10. What are some applications of Random Forest in real-world scenarios?

  11. Can you explain the concept of out-of-bag (OOB) error in Random Forest?

  12. How does Random Forest handle collinear features or feature interactions?

  13. What are the limitations of Random Forest?

  14. What are the hyperparameters in Random Forest, and how do they affect the model?

  15. How can you tune the hyperparameters of a Random Forest model?

  16. Can Random Forest handle text data or high-dimensional data?

  17. How can you assess the performance of a Random Forest model?

  18. What is the computational complexity of building a Random Forest model?

  19. Are there any assumptions or prerequisites for using Random Forest?

  20. Can you explain the difference between Random Forest and other ensemble methods like AdaBoost or Gradient Boosting?

Solutions

What is Random Forest, and how does it work?

Random Forest is an ensemble learning method that combines multiple decision trees to make predictions. Each decision tree is built on a random subset of the training data and uses a random subset of features. When making predictions, the final outcome is determined by aggregating the predictions of all the individual trees. For classification tasks, the most common prediction is selected through voting, while for regression tasks, the predictions are averaged.
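
As a toy illustration of the aggregation step, the sketch below uses made-up per-tree predictions (the arrays are hypothetical, not from a trained model) and combines them by majority vote for classification and by averaging for regression:

```python
import numpy as np

# Hypothetical class predictions from 3 trees for 5 samples (classification).
tree_preds = np.array([
    [0, 1, 1, 0, 2],
    [0, 1, 0, 0, 2],
    [1, 1, 1, 0, 2],
])

# Classification: take a majority vote per sample (per column).
majority = np.array([np.bincount(col).argmax() for col in tree_preds.T])
print(majority)  # -> [0 1 1 0 2]

# Regression: average the per-tree continuous outputs instead.
tree_outputs = np.array([[2.1, 3.0], [1.9, 3.4], [2.0, 3.2]])
print(tree_outputs.mean(axis=0))  # -> averages [2.0, 3.2]
```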

What are the advantages of using Random Forest over a single decision tree?

Advantages of using Random Forest over a single decision tree:

  • Lower variance and less overfitting, because many de-correlated trees are averaged.
  • Higher and more stable accuracy that is less sensitive to noise or to small changes in the training data.
  • Built-in feature importance scores and an out-of-bag error estimate at no extra cost.
  • Greater robustness to outliers and irrelevant features.
  • The trade-off is reduced interpretability and higher training and prediction cost.

How does Random Forest handle missing values and categorical variables?

Conceptually, Random Forest tolerates missing values reasonably well: trees can route incomplete samples with surrogate splits, and Breiman's original algorithm offers proximity-based imputation. In practice, scikit-learn's implementation expects a fully numeric matrix, so missing values are usually imputed (for example with SimpleImputer) before fitting. Categorical variables can be split on directly in some implementations, but in scikit-learn they must first be encoded (ordinal or one-hot encoding) because the trees split on numeric thresholds.

How do you determine the optimal number of trees in a Random Forest?

The optimal number of trees in a Random Forest can be determined using techniques such as cross-validation or out-of-bag (OOB) error estimation. By evaluating the performance of the model with different numbers of trees, you can identify the point where adding more trees no longer improves the model's accuracy. Typically, a higher number of trees improves the model's stability and robustness but may come with diminishing returns in terms of accuracy improvement.
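
A minimal sketch of this idea, assuming scikit-learn and the bundled breast-cancer dataset: the out-of-bag error is tracked as the number of trees grows, and the sweep values are arbitrary examples:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# Track how the out-of-bag error changes as more trees are added.
for n in (25, 50, 100, 200, 400):
    rf = RandomForestClassifier(n_estimators=n, oob_score=True,
                                random_state=42, n_jobs=-1)
    rf.fit(X, y)
    print(f"{n:4d} trees -> OOB error {1 - rf.oob_score_:.4f}")
```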

What is the concept of feature importance in Random Forest, and how is it calculated?

Feature importance in Random Forest quantifies how much each feature contributes to the model's predictions. The most common measure is mean decrease in impurity (Gini importance): every time a feature is used for a split, the resulting reduction in impurity is recorded, weighted by the number of samples reaching that node, and the totals are averaged over all trees. Features with the largest accumulated impurity reduction are ranked as most important. Permutation importance, which measures how much performance drops when a feature's values are randomly shuffled, is a common alternative that is less biased toward features with many distinct values.
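
A short sketch using scikit-learn's impurity-based feature_importances_ on the bundled Iris dataset (dataset choice and settings are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(data.data, data.target)

# feature_importances_ holds the mean decrease in impurity per feature,
# normalized so the values sum to 1 across all features.
for name, imp in sorted(zip(data.feature_names, rf.feature_importances_),
                        key=lambda t: t[1], reverse=True):
    print(f"{name:20s} {imp:.3f}")
```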

What is the difference between bagging and boosting? How is Random Forest related to bagging?

Bagging (bootstrap aggregating) trains each base model independently on a bootstrap sample of the data and combines their predictions, which mainly reduces variance. Boosting trains base models sequentially, with each new model focusing on the examples the previous ones got wrong, which mainly reduces bias. Random Forest is a specific implementation of bagging in which the base models are decision trees and an extra source of randomness is added: each split considers only a random subset of features. The trees' predictions are combined through voting (for classification) or averaging (for regression).
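
As a rough sketch of the relationship, the snippet below compares plain bagging of decision trees (scikit-learn's BaggingClassifier, whose default base estimator is a decision tree) with a Random Forest; the dataset and settings are arbitrary examples:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Plain bagging: bootstrap samples only (default base estimator is a decision tree).
bagging = BaggingClassifier(n_estimators=100, random_state=0)

# Random Forest: bagging plus a random feature subset at every split.
forest = RandomForestClassifier(n_estimators=100, random_state=0)

print("bagging:", cross_val_score(bagging, X, y, cv=5).mean())
print("forest: ", cross_val_score(forest, X, y, cv=5).mean())
```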

How does Random Forest handle overfitting, and what techniques can be used to mitigate it?

Random Forest handles overfitting by averaging the predictions of many de-correlated decision trees, which lowers variance compared to a single deep tree. Techniques to mitigate overfitting further include:

  • Increasing the number of trees (n_estimators) until performance stabilizes.
  • Limiting tree complexity with max_depth, min_samples_split, or min_samples_leaf.
  • Lowering max_features so the individual trees are less correlated.
  • Monitoring cross-validation or out-of-bag error while tuning, and stopping when the gap between training and validation performance grows.

Can Random Forest be used for regression problems? If so, how?

Yes, Random Forest can be used for regression problems. For regression, the final prediction in Random Forest is obtained by averaging the predictions of all the individual decision trees. Each tree predicts a continuous value, and the final prediction is the average of these individual predictions.
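
A minimal regression sketch with scikit-learn's RandomForestRegressor on the bundled diabetes dataset (the dataset and hyperparameters are just examples):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# RandomForestRegressor averages the continuous predictions of its trees.
reg = RandomForestRegressor(n_estimators=200, random_state=42)
reg.fit(X_train, y_train)
pred = reg.predict(X_test)

print("MSE:", mean_squared_error(y_test, pred))
print("R^2:", r2_score(y_test, pred))
```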

How does Random Forest handle imbalanced datasets?

Out of the box, Random Forest tends to favor the majority class on imbalanced data, because each bootstrap sample roughly mirrors the original class distribution. Common ways to handle imbalance include setting class weights (class_weight='balanced' or 'balanced_subsample' in scikit-learn) so that errors on the minority class are penalized more, resampling the training data (oversampling the minority class or undersampling the majority class), drawing balanced bootstrap samples for each tree, and evaluating with metrics such as precision, recall, F1, or ROC-AUC instead of plain accuracy.
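
A small sketch of the class-weighting approach, using a synthetic imbalanced dataset generated purely for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# A synthetic ~95/5 imbalanced binary problem.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight='balanced' re-weights classes inversely to their frequency;
# 'balanced_subsample' recomputes those weights for every bootstrap sample.
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced_subsample",
                            random_state=0)
rf.fit(X_train, y_train)
print(classification_report(y_test, rf.predict(X_test)))
```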

What are some applications of Random Forest in real-world scenarios?

Common real-world applications include credit scoring and fraud detection in finance, disease diagnosis and patient-risk prediction in healthcare, customer churn prediction and recommendation in marketing, land-cover classification in remote sensing, and feature ranking as a preprocessing step for other models.

Can you explain the concept of out-of-bag (OOB) error in Random Forest?

In Random Forest, during the training process of each decision tree, a random subset of the training data is selected. The remaining samples that were not selected in the subset are referred to as the out-of-bag (OOB) samples. These OOB samples are not used for training the specific tree but can be used to estimate the model's performance.

The OOB error is calculated by evaluating each tree on its respective OOB samples. The predictions made by the tree on the OOB samples are compared to the true labels of those samples. The OOB error is the average error across all trees. It serves as an unbiased estimate of the model's performance on unseen data, without the need for an additional validation set.

How does Random Forest handle collinear features or feature interactions?

Random Forest is robust to collinear features, meaning that it can handle highly correlated predictors. Due to the random selection of features at each split, the trees in a Random Forest consider different subsets of features, which helps in reducing the impact of collinearity. The model can still make accurate predictions by relying on other informative features that are not highly correlated.

Regarding feature interactions, Random Forest can capture them naturally. When building decision trees, the algorithm considers multiple features and their combinations to make split decisions. As a result, Random Forest can detect and leverage complex interactions between features to improve prediction accuracy.

What are the limitations of Random Forest?

Key limitations include: the ensemble is far less interpretable than a single tree; training and prediction are slower and more memory-hungry than simpler models, especially with many trees or large datasets; impurity-based feature importances are biased toward features with many distinct values; in regression the model cannot extrapolate beyond the range of the training targets; and very sparse, very high-dimensional data (such as raw text) is often handled more efficiently by linear models or gradient boosting.

What are the hyperparameters in Random Forest, and how do they affect the model?

Random Forest has several hyperparameters that can be tuned to optimize the model's performance. Some important ones are:

  • n_estimators: the number of trees; more trees give more stable predictions at higher computational cost.
  • max_depth: the maximum depth of each tree; deeper trees fit more detail but overfit more easily.
  • max_features: the number of features considered at each split; smaller values de-correlate the trees.
  • min_samples_split and min_samples_leaf: minimum sample counts for splitting a node or forming a leaf; larger values regularize the trees.
  • bootstrap and max_samples: whether and how much of the training data is subsampled for each tree.
  • class_weight: per-class weights, useful on imbalanced data.

How can you tune the hyperparameters of a Random Forest model?

Hyperparameter tuning involves selecting the values that give the best generalization performance. Common approaches include:

  • Grid search (GridSearchCV): exhaustively evaluating every combination in a predefined grid with cross-validation.
  • Randomized search (RandomizedSearchCV): sampling a fixed number of random combinations, which scales better to large search spaces.
  • Bayesian or other model-based optimization, typically via external libraries.
  • Using the out-of-bag error as a cheap validation signal, particularly for choosing n_estimators.
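
A minimal grid-search sketch with scikit-learn's GridSearchCV; the grid values and dataset are illustrative, not recommendations:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

# Small illustrative grid; real searches usually cover more values.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 5, 10],
    "max_features": ["sqrt", 0.5],
}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```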

Can Random Forest handle text data or high-dimensional data?

Random Forest can handle both text data and high-dimensional data, with some caveats:

  • Text data must first be converted to numeric features (for example bag-of-words or TF-IDF vectors); the forest then trains on the resulting matrix. For very large, sparse vocabularies, linear models or gradient boosting are often more efficient.
  • High-dimensional data is handled reasonably well because each split considers only a random subset of features, which acts as implicit feature selection. Feature selection or dimensionality reduction beforehand can still speed up training and improve accuracy when most features are irrelevant.
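
A toy sketch of the text case, assuming a TF-IDF vectorizer feeding a Random Forest inside a scikit-learn pipeline; the documents and labels below are made up:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Text must be vectorized first; the forest then works on the numeric matrix.
docs = ["free money now", "meeting at noon", "win a free prize", "project status update"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam (toy labels)

clf = make_pipeline(TfidfVectorizer(),
                    RandomForestClassifier(n_estimators=100, random_state=0))
clf.fit(docs, labels)
print(clf.predict(["free prize meeting"]))
```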

How can you assess the performance of a Random Forest model?

To assess the performance of a Random Forest model, several evaluation tools can be used:

  • Classification: accuracy, precision, recall, F1-score, ROC-AUC, and the confusion matrix.
  • Regression: mean squared error (MSE), mean absolute error (MAE), and R².
  • The out-of-bag (OOB) error, which gives a built-in estimate of generalization without a separate validation set.
  • Cross-validation, for a more reliable estimate than a single train/test split.

It is important to select evaluation metrics that are appropriate for the specific problem and consider the balance between different metrics based on the application's requirements.
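
A short sketch that scores one model under several metrics with cross-validation (the dataset and metric choice are illustrative):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

X, y = load_breast_cancer(return_X_y=True)
rf = RandomForestClassifier(n_estimators=200, random_state=0)

# Evaluate the same model under several metrics with 5-fold cross-validation.
scores = cross_validate(rf, X, y, cv=5, scoring=["accuracy", "f1", "roc_auc"])
for metric in ("test_accuracy", "test_f1", "test_roc_auc"):
    print(metric, scores[metric].mean())
```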

What is the computational complexity of building a Random Forest model?

The computational complexity of building a Random Forest depends mainly on the number of trees (n_estimators), the number of samples (n), and the number of features evaluated at each split. Building a single decision tree typically costs on the order of O(m * n * log(n)) for m features, so a forest that considers every feature at each split costs roughly O(n_estimators * m * n * log(n)). In practice each split only evaluates max_features of the m features (commonly the square root of m for classification), which lowers the per-tree cost proportionally.

Because the trees are independent, they can be trained in parallel (n_jobs in scikit-learn), and each tree sees only a bootstrap sample and restricted feature subsets, so wall-clock training time is usually much better than the worst-case bound suggests. Prediction cost grows with the number of trees and their depth, which is why very large forests can be slow to score.

Are there any assumptions or prerequisites for using Random Forest?

Random Forest makes fewer assumptions than many other machine learning algorithms (no linearity, normality, or feature-scaling requirements). A few practical considerations remain:

  • The training data should be representative of the data the model will see; trees cannot extrapolate beyond it.
  • Features must be numeric in most implementations, so categorical variables need encoding and missing values typically need imputation.
  • Enough data is needed for bootstrap samples and feature subsets to be informative.
  • Defaults such as impurity-based importances and equal class weights may need adjustment when classes are imbalanced or feature cardinalities differ widely.

Random Forest is a versatile algorithm that can handle various types of data and does not make strict assumptions about the underlying data distribution. However, it is always recommended to carefully analyze and preprocess the data before applying Random Forest to ensure accurate and reliable results.

Can you explain the difference between Random Forest and other ensemble methods like AdaBoost or Gradient Boosting?

  1. Random Forest:

    • Random Forest is an ensemble learning method that combines multiple decision trees to make predictions.

    • Each tree in the Random Forest is built independently using a random subset of the training data and a random subset of features at each split.

    • The final prediction in Random Forest is determined by aggregating the predictions of all the individual trees through voting (for classification) or averaging (for regression).

  2. AdaBoost (Adaptive Boosting):

    • AdaBoost is an ensemble method that iteratively builds a sequence of weak learners (typically decision trees) to create a strong learner.

    • Each weak learner is trained on a modified version of the training data, where the samples that were previously misclassified are given higher weights.

    • The final prediction in AdaBoost is a weighted combination of the predictions of all the weak learners, with higher weights given to more accurate models.

  3. Gradient Boosting:

    • Gradient Boosting is another ensemble method that combines multiple weak learners to create a strong learner.

    • Unlike Random Forest and AdaBoost, Gradient Boosting builds the weak learners (typically decision trees) sequentially, where each subsequent tree corrects the mistakes made by the previous trees.

    • The model is trained by minimizing a loss function using gradient descent, where the gradient is computed based on the errors made by the previous trees.

    • The final prediction in Gradient Boosting is the sum of the predictions from all the weak learners, each multiplied by a learning rate that controls the contribution of each tree.

Key differences:

  • Training: Random Forest builds its trees independently (and can do so in parallel), while AdaBoost and Gradient Boosting build trees sequentially, each one depending on the previous ones.
  • Goal: bagging-style methods like Random Forest mainly reduce variance; boosting methods mainly reduce bias.
  • Sensitivity: boosting concentrates on hard or mislabeled examples, so it can be more accurate but also more sensitive to noisy data, whereas Random Forest is generally more robust with less tuning.

Each ensemble method has its strengths and weaknesses, and the choice depends on the specific problem and the characteristics of the data. Random Forest is known for its robustness and ability to handle high-dimensional data, AdaBoost is effective at handling difficult examples, and Gradient Boosting is often used for its high predictive accuracy and ability to handle complex relationships.
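
A rough side-by-side sketch of the three ensembles using their scikit-learn implementations (the dataset and n_estimators value are arbitrary; scores will vary with data and settings):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

models = {
    "Random Forest":     RandomForestClassifier(n_estimators=200, random_state=0),
    "AdaBoost":          AdaBoostClassifier(n_estimators=200, random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=200, random_state=0),
}
# Compare mean 5-fold cross-validated accuracy.
for name, model in models.items():
    print(f"{name:18s} {cross_val_score(model, X, y, cv=5).mean():.3f}")
```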

Python Application

Using Sklearn

  1. Import the necessary libraries. (A code sketch covering all of these steps follows the list.)

  2. Load the Iris dataset using the load_iris() function.

  3. Split the dataset into features (X) and target (y).

  4. Split the data into training and testing sets using the train_test_split() function.

  5. Create a Random Forest classifier with 100 trees using RandomForestClassifier from the ensemble module.

  6. Train the classifier on the training data using the fit() method.

  7. Make predictions on the testing data using the predict() method.

  8. Evaluate the model's accuracy by comparing the predicted labels with the actual labels using the accuracy_score() function.

  9. Print the accuracy score.
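
A minimal sketch of the nine steps above, using scikit-learn; the test-set fraction and random seed are arbitrary choices:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load the Iris dataset and separate features (X) from the target (y).
iris = load_iris()
X, y = iris.data, iris.target

# Hold out a test set for evaluation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create and train a Random Forest with 100 trees.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predict on the held-out data and report accuracy.
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
```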

RandomForestClassifier()

RandomForestClassifier is a class in the scikit-learn (sklearn) library that implements the Random Forest algorithm for classification tasks. It is a powerful ensemble learning method that combines multiple decision trees to make predictions.

Here's a detailed explanation of the RandomForestClassifier class and its important parameters:

  1. Initialization: the classifier is created by instantiating RandomForestClassifier(...) with the desired settings; every parameter has a sensible default.

  2. Parameters: the most commonly tuned ones are n_estimators (number of trees), criterion ("gini" or "entropy"), max_depth, min_samples_split, min_samples_leaf, max_features, bootstrap, oob_score, class_weight, n_jobs, and random_state.

  3. Methods and attributes: fit(X, y) trains the forest, predict(X) returns class labels, predict_proba(X) returns class probabilities, and score(X, y) returns mean accuracy; after fitting, feature_importances_ and (with oob_score=True) oob_score_ are available.

The RandomForestClassifier class in scikit-learn provides an efficient and flexible implementation of the Random Forest algorithm for classification tasks. It handles various parameters to control the behavior of the model and offers methods to train the model, make predictions, and evaluate its performance.
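
An illustrative sketch of the constructor parameters and common methods; the parameter values shown are defaults or examples, not tuned recommendations:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Common constructor parameters (values here are illustrative).
clf = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    criterion="gini",      # split quality measure ("gini" or "entropy")
    max_depth=None,        # grow trees until leaves are pure or too small to split
    min_samples_split=2,   # minimum samples needed to split an internal node
    min_samples_leaf=1,    # minimum samples required at a leaf
    max_features="sqrt",   # features considered at each split
    bootstrap=True,        # sample with replacement for each tree
    oob_score=True,        # estimate accuracy on out-of-bag samples
    n_jobs=-1,             # train trees in parallel
    random_state=0,
)

# Key methods and attributes.
clf.fit(X, y)
print(clf.predict(X[:3]))         # class labels
print(clf.predict_proba(X[:3]))   # class probabilities
print(clf.score(X, y))            # mean accuracy
print(clf.oob_score_)             # out-of-bag accuracy estimate
print(clf.feature_importances_)   # impurity-based importances
```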

From Scratch
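
Below is a minimal from-scratch sketch of the ensemble logic: bootstrap sampling, per-split feature subsampling, and majority voting. To stay short it reuses scikit-learn's DecisionTreeClassifier as the base learner; a fully from-scratch version would also implement the decision tree itself. The class name SimpleRandomForest and all settings are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


class SimpleRandomForest:
    """Bootstrap-aggregated decision trees with random feature subsets."""

    def __init__(self, n_estimators=100, max_features="sqrt", random_state=None):
        self.n_estimators = n_estimators
        self.max_features = max_features
        self.random_state = random_state

    def fit(self, X, y):
        rng = np.random.default_rng(self.random_state)
        n_samples = X.shape[0]
        self.trees_ = []
        for _ in range(self.n_estimators):
            # Bootstrap sample: draw n_samples rows with replacement.
            idx = rng.integers(0, n_samples, size=n_samples)
            tree = DecisionTreeClassifier(
                max_features=self.max_features,           # random feature subset per split
                random_state=int(rng.integers(0, 2**31 - 1)),
            )
            tree.fit(X[idx], y[idx])
            self.trees_.append(tree)
        return self

    def predict(self, X):
        # Collect every tree's prediction, then take a majority vote per sample.
        all_preds = np.array([tree.predict(X) for tree in self.trees_])
        return np.array([np.bincount(col).argmax() for col in all_preds.T])


if __name__ == "__main__":
    from sklearn.datasets import load_iris
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    X, y = load_iris(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)
    forest = SimpleRandomForest(n_estimators=100, random_state=42).fit(X_tr, y_tr)
    print("Accuracy:", accuracy_score(y_te, forest.predict(X_te)))
```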