Decision Forest Classification and Regression (DF)

Decision Forest (DF) classification and regression algorithms are based on an ensemble of tree-structured classifiers known as decision trees. A decision forest is built using the general technique of bagging (bootstrap aggregation) combined with a random choice of features. A decision tree is a binary tree graph: its internal (split) nodes represent a decision function used to select the child node at the prediction stage, and its leaf (terminal) nodes represent the corresponding response values, which are the result of the prediction from the tree. For more details, see [Breiman84] and [Breiman2001].

Operation   | Computational methods | Programming Interface
Training    | Dense, Hist           | train(…), train_input, train_result
Inference   | Dense, Hist           | infer(…), infer_input, infer_result

Mathematical formulation

Training

Given \(n\) feature vectors \(X=\{x_1=(x_{11},\ldots,x_{1p}),\ldots,x_n=(x_{n1},\ldots,x_{np})\}\) of size \(p\), their non-negative observation weights \(W=\{w_1,\ldots,w_n\}\) and \(n\) responses \(Y=\{y_1,\ldots,y_n\}\),

  • \(y_i \in \{0, \ldots, C-1\}\), where \(C\) is the number of classes, for classification

  • \(y_i \in \mathbb{R}\), for regression

the problem is to build a decision forest classification or regression model.

During the training stage, \(B\) independent classification or regression trees are created using the following:

  1. New training set generated from the original one by sampling uniformly and with replacement (bootstrapping).

  2. Impurity metric \(I\) and impurity reduction \(\Delta I\) for splitting the tree’s nodes, calculated as follows:
    • Gini impurity for classification:

      • without observation weights: \(I(D)=1-\sum_{i=1}^{C}{p_i^2},\) where \(p_i\) is the fraction of observations in subset \(D\) that belong to the \(i\)-th class.

      • with observation weights: \(I(D)=1-\sum_{i=1}^{C}{p_i^2},\) where \(p_i\) is the weighted fraction of observations in subset \(D\) that belong to the \(i\)-th class, computed as follows:

        \[p_i=\frac{\sum_{d \in D:\, y_d=i} w_d}{\sum_{d \in D} w_d}\]

        where \(w_d\) is the weight of observation \(d\).

    • Mean-Square Error (MSE) for regression:

      • without observation weights: \(I(D)=\frac{1}{N} \sum_{i=1}^{N}{(y_i - \bar{y})^2},\) where \(N=|D|\) and \(\bar{y}=\frac{1}{N} \sum_{i=1}^{N}y_i\).

      • with observation weights: \(I(D)=\frac{1}{W(D)} \sum_{i=1}^{N}w_i{(y_i - \bar{y})^2},\) where \(N=|D|\), \(\bar{y}=\frac{1}{W(D)} \sum_{i=1}^{N}w_{i}y_{i}\), \(W(D)=\sum_{i=1}^{N}w_{i}\), and \(w_i\) is the weight of observation \(i\).

    • \(\Delta I\) is computed as follows:

      \[\Delta I= I - \left(\frac{N_{\mathrm{left}}}{N_{\mathrm{parent}}} I_{\mathrm{left}} + \frac{N_{\mathrm{right}}}{N_{\mathrm{parent}}} I_{\mathrm{right}}\right)\]

      where \(N_{\mathrm{left}}\) and \(N_{\mathrm{right}}\) are the numbers of observations in the nodes on the corresponding sides of the split (a short computational sketch of these impurity formulas follows this list).
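
As an illustration of these formulas, the sketch below shows how the weighted Gini impurity and the impurity decrease \(\Delta I\) could be evaluated for one candidate split. The helper names and the plain std::vector representation are assumptions made for this example only; they are not part of the library API.

    #include <cstdint>
    #include <vector>

    // Weighted Gini impurity I(D) = 1 - sum(p_i^2) of a subset described
    // by the per-class sums of observation weights.
    double gini_impurity(const std::vector<double>& class_weight_sums) {
        double total = 0.0;
        for (double w : class_weight_sums)
            total += w;
        if (total == 0.0)
            return 0.0;
        double sum_sq = 0.0;
        for (double w : class_weight_sums) {
            const double p = w / total; // weighted fraction p_i of class i
            sum_sq += p * p;
        }
        return 1.0 - sum_sq;
    }

    // Impurity decrease for a split of a parent node into left/right children.
    double impurity_decrease(double parent_impurity,
                             double left_impurity, std::int64_t n_left,
                             double right_impurity, std::int64_t n_right) {
        const double n_parent = static_cast<double>(n_left + n_right);
        return parent_impurity - (n_left / n_parent) * left_impurity
                               - (n_right / n_parent) * right_impurity;
    }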

Let \(S=(X,Y)\) be the set of observations. Given the training parameters, such as the number of trees in the forest (\(B\)), the fraction of observations used for the training of one tree (\(f\)), and the number of features to try as a possible split per node (\(m\)), the algorithm does the following for each tree (\(1, \ldots, B\)):

  1. Generate a bootstrapped set of observations with \(f \cdot |S|\) elements in it (see the sampling sketch after this list).

  2. Start with the tree whose depth is equal to \(0\).

  3. For each terminal node \(t\) in the tree:
    • Choose randomly without replacement \(m\) feature indices \(J_t \subseteq \{0, 1, \ldots, p-1\}\).

    • For each \(j \in J_t\), find the best split \(s_{j,t}\) that partitions subset \(D_t\) and maximizes impurity decrease \(\Delta I_t\).

    • Get the best split \(s_t\) that maximizes impurity decrease \(\Delta I_t\) among all \(s_{j,t}\) splits.

    • Split the current node into two based on the best split.

  4. Stop when a termination criterion is met.
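
A minimal sketch of step 1 above, drawing a bootstrapped set of observation indices by sampling uniformly with replacement. std::mt19937 is used here purely for illustration; the library relies on its own engines, as described under Random Numbers Generation below.

    #include <cstdint>
    #include <random>
    #include <vector>

    // Draw floor(f * n) observation indices uniformly with replacement.
    std::vector<std::int64_t> bootstrap_indices(std::int64_t n, double f, std::uint32_t seed) {
        std::mt19937 rng(seed);
        std::uniform_int_distribution<std::int64_t> pick(0, n - 1);
        std::vector<std::int64_t> indices(static_cast<std::size_t>(f * n));
        for (auto& index : indices)
            index = pick(rng); // the same observation may be drawn several times
        return indices;
    }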

Termination Criteria

The library supports the following termination criteria to stop growing the tree:
  • Minimal number of observations in a leaf node. Node \(t\) is not processed if the subset of observations is smaller than the predefined value. Splits that produce nodes with the number of observations smaller than that value are not allowed.

  • Maximal tree depth. Node \(t\) is not processed if its depth in the tree reaches the predefined maximal value.

  • Impurity threshold. Node \(t\) is not processed if its \(I\) value is smaller than the predefined threshold.
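
These criteria correspond to descriptor parameters listed in the Programming Interface section. A minimal configuration sketch is shown below; the chosen values are illustrative only, and the unversioned method and task alias names (method::dense, task::classification) are assumed to be available.

    #include "oneapi/dal/algo/decision_forest.hpp"

    namespace df = oneapi::dal::decision_forest;

    // Limit tree growth through the documented termination criteria.
    const auto desc = df::descriptor<float, df::method::dense, df::task::classification>{}
        .set_class_count(2)
        .set_tree_count(100)
        .set_min_observations_in_leaf_node(5) // minimal number of observations in a leaf
        .set_max_tree_depth(10)               // maximal tree depth
        .set_impurity_threshold(0.0);         // impurity threshold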

Random Numbers Generation

To create bootstrap sets and choose feature indices in a performant way, the training algorithm requires a source of random numbers capable of producing sequences of random numbers in parallel.

Initialization of the engine in the decision forest is based on the scheme below:

The state of the engine is updated once the training of the decision forest model is completed. The library provides support for retrieving the instance of the engine with the updated state, which can be used in other computations. The update of the state is engine-specific and depends on the parallelization technique used, as defined earlier:

  • Family: the updated state is the set of states that represent individual engines in the family.

  • Leapfrog: the updated state is the state of the sequence with the rightmost position on the sequence. The example below demonstrates the idea for the case of 2 subsequences (‘x’ and ‘o’) of the random number sequence:

  • SkipAhead: the updated state is the state of the independent sequence with the rightmost position on the sequence. The example below demonstrates the idea for the case of 2 subsequences (‘x’ and ‘o’) of the random number sequence:
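
The sketch below illustrates the two partitioning techniques on a plain sequence of 8 values shared by 2 streams; it is independent of any particular engine API.

    #include <cstddef>
    #include <vector>

    int main() {
        // A single number sequence shared by 2 streams ('x' and 'o').
        const std::vector<int> sequence = { 1, 2, 3, 4, 5, 6, 7, 8 };
        const std::size_t streams = 2;

        // Leapfrog: stream k takes elements k, k + streams, k + 2 * streams, ...
        std::vector<std::vector<int>> leapfrog(streams);
        for (std::size_t i = 0; i < sequence.size(); ++i)
            leapfrog[i % streams].push_back(sequence[i]);
        // leapfrog[0] = {1, 3, 5, 7}, leapfrog[1] = {2, 4, 6, 8}

        // SkipAhead: stream k skips ahead by k * block and takes a contiguous block.
        const std::size_t block = sequence.size() / streams;
        std::vector<std::vector<int>> skipahead(streams);
        for (std::size_t i = 0; i < sequence.size(); ++i)
            skipahead[i / block].push_back(sequence[i]);
        // skipahead[0] = {1, 2, 3, 4}, skipahead[1] = {5, 6, 7, 8}

        return 0;
    }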

Additional Characteristics Calculated by the Decision Forest

Decision forests can produce additional characteristics, such as an estimate of the generalization error and an importance measure (relative decisive power) of each of the \(p\) features (variables).

Out-of-bag Error

An estimate of the generalization error based on the training data can be calculated as follows:

  • For classification:
    • For each vector \(x_i\) in the dataset \(X\), predict its label \(\hat{y}_i\) as the label that receives the majority of votes from the trees that contain \(x_i\) in their OOB set.

    • Calculate the OOB error of the decision forest \(T\) as the average of misclassifications:

      \[OOB(T) = \frac{1}{|D'|}\sum_{y_i \in D'} I\{y_i \ne \hat{y}_i\},\ \text{where } D'=\bigcup_{b=1}^{B}\overline{D_b}.\]

    • If the OOB error value per observation is required, calculate the prediction error for \(x_i\): \(OOB(x_i) = I\{y_i \ne \hat{y}_i\}\)

  • For regression:
    • For each vector \(x_i\) in the dataset \(X\), predict its response \(\hat{y}_i\) as the mean of predictions from the trees that contain \(x_i\) in their OOB set:

      \(\hat{y}_i = \frac{1}{|B_i|}\sum_{b=1}^{|B_i|}\hat{y}_{ib}\), where \(B_i=\{T_b : x_i \in \overline{D_b}\}\) and \(\hat{y}_{ib}\) is the result of predicting \(x_i\) with \(T_b\).

    • Calculate the OOB error of the decision forest \(T\) as the Mean-Square Error (MSE):

      \[OOB(T) = \frac{1}{|D'|}\sum_{y_i \in D'} (y_i-\hat{y}_i)^{2},\ \text{where } D'=\bigcup_{b=1}^{B}\overline{D_b}\]

    • If the OOB error value per observation is required, calculate the prediction error for \(x_i\):

      \[OOB(x_i) = (y_i-\hat{y}_i)^{2}\]
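
A minimal sketch of both OOB error formulas, assuming the OOB predictions \(\hat{y}_i\) have already been collected; plain std::vector containers stand in for the library tables in this example.

    #include <cstdint>
    #include <vector>

    // Classification: average misclassification rate over the OOB set D'.
    double oob_error_classification(const std::vector<std::int64_t>& y_true,
                                    const std::vector<std::int64_t>& y_oob_pred) {
        double errors = 0.0;
        for (std::size_t i = 0; i < y_true.size(); ++i)
            errors += (y_true[i] != y_oob_pred[i]) ? 1.0 : 0.0;
        return errors / y_true.size();
    }

    // Regression: mean squared error over the OOB set D'.
    double oob_error_regression(const std::vector<double>& y_true,
                                const std::vector<double>& y_oob_pred) {
        double sum_sq = 0.0;
        for (std::size_t i = 0; i < y_true.size(); ++i) {
            const double diff = y_true[i] - y_oob_pred[i];
            sum_sq += diff * diff;
        }
        return sum_sq / y_true.size();
    }
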
Variable Importance

There are two main types of variable importance measures:

  • Mean Decrease Impurity importance (MDI).

    Importance of the \(j\)-th variable for predicting \(Y\) is the sum of weighted impurity decreases \(p(t) \Delta i(s_t, t)\) for all nodes \(t\) that use \(x_j\), averaged over all \(B\) trees in the forest:

    \[MDI\left(j\right)=\frac{1}{B}\sum _{b=1}^{B} \sum _{t\in {T}_{b}:v\left({s}_{t}\right)=j}p\left(t\right)\Delta i\left({s}_{t},t\right),\]

    where \(p\left(t\right)=\frac{|{D}_{t}|}{|D|}\) is the fraction of observations reaching node \(t\) in the tree \(T_b\), and \(v(s_t)\) is the index of the variable used in split \(s_t\) .

  • Mean Decrease Accuracy (MDA).

    Importance of the \(j\)-th variable for predicting \(Y\) is the average increase in the OOB error over all trees in the forest when the values of the \(j\)-th variable are randomly permuted in the OOB set. For that reason, this latter measure is also known as permutation importance.

    In more detail, the library calculates MDA importance as follows:

    • Let \(\pi (X,j)\) be the set of feature vectors where the \(j\)-th variable is randomly permuted over all vectors in the set.

    • Let \(E_b\) be the OOB error calculated for \(T_b\) on its out-of-bag dataset \(\overline{D_b}\).

    • Let \(E_{b,j}\) be the OOB error calculated for \(T_b\) using \(\pi\left(\overline{X_b},j\right)\), that is, its out-of-bag dataset \(\overline{D_b}\) permuted on the \(j\)-th variable. Then

      • \(\delta_{b,j}=E_{b,j}-E_{b}\) is the OOB error increase for the tree \(T_b\).

      • \(\text{Raw MDA}\left(j\right)=\frac{1}{B}\sum_{b=1}^{B}\delta_{b,j}\) is the MDA importance.

      • \(\text{Scaled MDA}\left(j\right)=\frac{\text{Raw MDA}\left(j\right)}{\sigma_{j}/\sqrt{B}}\), where \(\sigma_{j}^{2}\) is the variance of \(\delta_{b,j}\).
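
A compact sketch of the raw and scaled MDA values for a single variable \(j\), given the per-tree OOB errors \(E_b\) and \(E_{b,j}\); the names used here are illustrative and not part of the library API.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct mda_result { double raw; double scaled; };

    // e_b[b]  : OOB error of tree b on its own OOB set.
    // e_bj[b] : OOB error of tree b after permuting variable j in its OOB set.
    mda_result mda_importance(const std::vector<double>& e_b,
                              const std::vector<double>& e_bj) {
        const std::size_t tree_count = e_b.size();
        std::vector<double> delta(tree_count);
        double mean = 0.0;
        for (std::size_t b = 0; b < tree_count; ++b) {
            delta[b] = e_bj[b] - e_b[b]; // OOB error increase delta_{b,j}
            mean += delta[b];
        }
        mean /= tree_count;              // raw MDA(j)

        double variance = 0.0;
        for (double d : delta)
            variance += (d - mean) * (d - mean);
        variance /= tree_count;          // variance of delta_{b,j}

        const double scaled = mean / (std::sqrt(variance) / std::sqrt(static_cast<double>(tree_count)));
        return { mean, scaled };
    }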

Training method: Dense

In the dense training method, all possible split variants for each feature (from the feature subset selected for the current node) are evaluated when computing the best split.

Training method: Hist

In the hist training method, only a selected subset of splits is considered when computing the best split. This subset of splits is computed for each feature at the initialization stage of the algorithm. After computing the subset of splits, each value from the initially provided data is substituted with the value of the corresponding bin. Bins are continuous intervals between the selected splits.
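
The sketch below illustrates the binning idea on a single feature: bin borders are taken from quantiles of the observed values, and each value is then replaced by the index of its bin. It is an illustration of the concept only, not the library's internal implementation.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // Build up to max_bins - 1 split points (bin borders) from the feature values.
    std::vector<double> make_bin_borders(std::vector<double> values, std::int64_t max_bins) {
        std::sort(values.begin(), values.end());
        std::vector<double> borders;
        for (std::int64_t b = 1; b < max_bins; ++b) {
            const std::size_t pos = values.size() * b / max_bins; // quantile position
            borders.push_back(values[pos]);
        }
        return borders;
    }

    // Replace a feature value with the index of the bin it falls into.
    std::int64_t to_bin(double value, const std::vector<double>& borders) {
        return std::upper_bound(borders.begin(), borders.end(), value) - borders.begin();
    }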

Inference methods: Dense and Hist

The dense and hist inference methods perform prediction in the same way:

  1. For classification, \(y_i \in \{0, \ldots, C-1\}\), where \(C\) is the number of classes, the tree ensemble model predicts the output by selecting the response \(y\) that is voted for by the majority of the trees in the forest.

  2. For regression, the tree ensemble model uses the mean of the \(B\) trees’ results to predict the output, i.e. \(\hat{y}=\frac{1}{B} \sum_{k=1}^{B}{f_k(x_i)},\) where \(f_k\) are the regression trees in the forest. In other words, each tree maps an observation to the corresponding leaf’s score, and the scores are averaged.
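
A sketch of both prediction rules for a single observation, given the per-tree outputs; the plain containers are assumptions of this example.

    #include <cstdint>
    #include <vector>

    // Classification: unweighted majority vote over the per-tree predicted classes.
    std::int64_t predict_class(const std::vector<std::int64_t>& tree_votes,
                               std::int64_t class_count) {
        std::vector<std::int64_t> counts(static_cast<std::size_t>(class_count), 0);
        for (auto c : tree_votes)
            ++counts[static_cast<std::size_t>(c)];
        std::int64_t best = 0;
        for (std::int64_t c = 1; c < class_count; ++c)
            if (counts[c] > counts[best])
                best = c;
        return best;
    }

    // Regression: mean of the per-tree predictions.
    double predict_value(const std::vector<double>& tree_predictions) {
        double sum = 0.0;
        for (double p : tree_predictions)
            sum += p;
        return sum / tree_predictions.size();
    }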

Programming Interface

All types and functions in this section are declared in the oneapi::dal::decision_forest namespace and are available via inclusion of the oneapi/dal/algo/decision_forest.hpp header file.

Enum classes

enum class error_metric_mode
error_metric_mode::none

Do not compute error metric.

error_metric_mode::out_of_bag_error

Train produces a \(1 \times 1\) table with the cumulative prediction error for out-of-bag observations.

error_metric_mode::out_of_bag_error_per_observation

Train produces an \(n \times 1\) table with the prediction error for out-of-bag observations.

enum class variable_importance_mode
variable_importance_mode::none

Do not compute variable importance.

variable_importance_mode::mdi

Mean Decrease Impurity. Computed as the sum of weighted impurity decreases for all nodes where the variable is used, averaged over all trees in the forest.

variable_importance_mode::mda_raw

Mean Decrease Accuracy (permutation importance). For each tree, the prediction error on the out-of-bag portion of the data is computed (error rate for classification, MSE for regression). The same is done after permuting each predictor variable. The differences between the two are then averaged over all trees.

variable_importance_mode::mda_scaled

Mean Decrease Accuracy (permutation importance). This is MDA_Raw value scaled by its standard deviation.

enum class infer_mode
infer_mode::class_labels

Infer produces an \(n \times 1\) table with the predicted labels.

infer_mode::class_probabilities

Infer produces an \(n \times c\) table with the predicted class probabilities for each observation.

enum class voting_mode
voting_mode::weighted

The final prediction is combined through a weighted majority voting.

voting_mode::unweighted

The final prediction is combined through a simple majority voting.

Descriptor

template<typename Float = detail::descriptor_base<>::float_t, typename Method = detail::descriptor_base<>::method_t, typename Task = detail::descriptor_base<>::task_t>
class descriptor
Template Parameters
  • Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.

  • Method – Tag-type that specifies an implementation of the algorithm. Can be method::v1::dense or method::v1::hist.

  • Task – Tag-type that specifies the type of the problem to solve. Can be task::v1::classification or task::v1::regression.

Public Methods

auto &set_observations_per_tree_fraction(double value)
auto &set_impurity_threshold(double value)
auto &set_min_weight_fraction_in_leaf_node(double value)
auto &set_min_impurity_decrease_in_split_node(double value)
auto &set_tree_count(std::int64_t value)
auto &set_features_per_node(std::int64_t value)
auto &set_max_tree_depth(std::int64_t value)
auto &set_min_observations_in_leaf_node(std::int64_t value)
auto &set_min_observations_in_split_node(std::int64_t value)
auto &set_max_leaf_nodes(std::int64_t value)
auto &set_max_bins(std::int64_t value)
auto &set_min_bin_size(std::int64_t value)
auto &set_error_metric_mode(error_metric_mode value)
auto &set_memory_saving_mode(bool value)
auto &set_bootstrap(bool value)
auto &set_variable_importance_mode(variable_importance_mode value)
template<typename T = Task, typename None = detail::enable_if_classification_t<T>>
auto &set_class_count(std::int64_t value)
template<typename T = Task, typename None = detail::enable_if_classification_t<T>>
auto &set_infer_mode(infer_mode value)
template<typename T = Task, typename None = detail::enable_if_classification_t<T>>
auto &set_voting_mode(voting_mode value)
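
For example, a hist-method classification descriptor that requests class probabilities, the out-of-bag error metric, and MDI variable importance might be configured as follows. The values are illustrative, and the unversioned method and task alias names are assumed to be available.

    #include "oneapi/dal/algo/decision_forest.hpp"

    namespace df = oneapi::dal::decision_forest;

    const auto desc = df::descriptor<double, df::method::hist, df::task::classification>{}
        .set_class_count(3)
        .set_tree_count(200)
        .set_features_per_node(4)
        .set_max_bins(256)   // hist-specific binning parameters
        .set_min_bin_size(1)
        .set_bootstrap(true)
        .set_error_metric_mode(df::error_metric_mode::out_of_bag_error)
        .set_variable_importance_mode(df::variable_importance_mode::mdi)
        .set_infer_mode(df::infer_mode::class_probabilities)
        .set_voting_mode(df::voting_mode::unweighted);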

Method tags

struct dense

Tag-type that denotes dense computational method.

struct hist

Tag-type that denotes hist computational method.

using by_default = dense

Alias tag-type for dense computational method.

Task tags

struct classification

Tag-type that parameterizes entities used for solving classification problem.

struct regression

Tag-type that parameterizes entities used for solving regression problem.

using by_default = classification

Alias tag-type for classification task.

Model

template<typename Task = task::by_default>
class model
Template Parameters

Task – Tag-type that specifies the type of the problem to solve. Can be task::v1::classification or task::v1::regression.

Constructors

model()

Creates a new instance of the class with the default property values.

Properties

std::int64_t tree_count = 100

The number of trees in the forest.

Getter & Setter
std::int64_t get_tree_count() const
Invariants
tree_count > 0
std::int64_t class_count = 2

The class count. Used with oneapi::dal::decision_forest::task::v1::classification only.

Getter & Setter
template <typename T = Task, typename None = detail::enable_if_classification_t<T>> std::int64_t get_class_count() const

Training train(...)

Input

template<typename Task = task::by_default>
class train_input
Template Parameters

Task – Tag-type that specifies the type of the problem to solve. Can be task::v1::classification or task::v1::regression.

Constructors

train_input(const table &data, const table &labels)

Creates a new instance of the class with the given data and labels property values.

Properties

const table &data = table{}

The training set \(X\).

Getter & Setter
const table & get_data() const
auto & set_data(const table &value)
const table &labels = table{}

Vector of labels \(y\) for the training set \(X\).

Getter & Setter
const table & get_labels() const
auto & set_labels(const table &value)

Result

template<typename Task = task::by_default>
class train_result
Template Parameters

Task – Tag-type that specifies the type of the problem to solve. Can be task::v1::classification or task::v1::regression.

Constructors

train_result()

Creates a new instance of the class with the default property values.

Properties

const model<Task> &model = model<Task>{}

The trained Decision Forest model.

Getter & Setter
const model< Task > & get_model() const
auto & set_model(const model< Task > &value)
const table &oob_err = table{}

A \(1 \times 1\) table containing the cumulative out-of-bag error value. Computed when error_metric_mode is set to error_metric_mode::out_of_bag_error.

Getter & Setter
const table & get_oob_err() const
auto & set_oob_err(const table &value)
const table &oob_err_per_observation = table{}

An \(n \times 1\) table containing the out-of-bag error value per observation. Computed when error_metric_mode is set to error_metric_mode::out_of_bag_error_per_observation.

Getter & Setter
const table & get_oob_err_per_observation() const
auto & set_oob_err_per_observation(const table &value)
const table &var_importance = table{}

A \(1 \times p\) table containing the variable importance value for each feature. Computed when variable_importance_mode != variable_importance_mode::none.

Getter & Setter
const table & get_var_importance() const
auto & set_var_importance(const table &value)

Operation

template<typename Descriptor>
decision_forest::train_result train(const Descriptor &desc, const decision_forest::train_input &input)
Template Parameters

Descriptor – Decision Forest algorithm descriptor decision_forest::descriptor.

Preconditions
input.data.is_empty == false
input.labels.is_empty == false
input.labels.column_count == 1
input.data.row_count == input.labels.row_count
desc.get_bootstrap() == true || (desc.get_bootstrap() == false && desc.get_variable_importance_mode() != variable_importance_mode::mda_raw && desc.get_variable_importance_mode() != variable_importance_mode::mda_scaled)
desc.get_bootstrap() == true || (desc.get_bootstrap() == false && desc.get_error_metric_mode() == error_metric_mode::none)
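
A minimal training call that satisfies these preconditions might look as follows. The tables are assumed to be created elsewhere, the names are illustrative, and the train free function is assumed to be callable as oneapi::dal::train, as suggested by the Operation signature above.

    #include "oneapi/dal/algo/decision_forest.hpp"

    namespace dal = oneapi::dal;
    namespace df = dal::decision_forest;

    df::model<df::task::classification> train_forest(const dal::table& data,
                                                     const dal::table& labels) {
        const auto desc = df::descriptor<float, df::method::dense, df::task::classification>{}
            .set_class_count(2)
            .set_tree_count(100)
            .set_bootstrap(true) // required for the out-of-bag error metric
            .set_error_metric_mode(df::error_metric_mode::out_of_bag_error);

        const auto input = df::train_input<df::task::classification>{ data, labels };
        const auto result = dal::train(desc, input);

        const dal::table oob = result.get_oob_err(); // 1 x 1 cumulative OOB error
        return result.get_model();
    }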

Inference infer(...)

Input

template<typename Task = task::by_default>
class infer_input
Template Parameters

Task – Tag-type that specifies the type of the problem to solve. Can be task::v1::classification or task::v1::regression.

Constructors

infer_input(const model<Task> &trained_model, const table &data)

Creates a new instance of the class with the given model and data property values.

Properties

const model<Task> &model = model<Task>{}

The trained Decision Forest model.

Getter & Setter
const model< Task > & get_model() const
auto & set_model(const model< Task > &value)
const table &data = table{}

The dataset for inference \(X'\).

Getter & Setter
const table & get_data() const
auto & set_data(const table &value)

Result

template<typename Task = task::by_default>
class infer_result
Template Parameters

Task – Tag-type that specifies the type of the problem to solve. Can be task::v1::classification or task::v1::regression.

Constructors

infer_result()

Creates a new instance of the class with the default property values.

Properties

const table &labels = table{}

The \(n \times 1\) table with the predicted labels.

Getter & Setter
const table & get_labels() const
auto & set_labels(const table &value)
const table &probabilities

An \(n \times c\) table with the predicted class probabilities for each observation.

Getter & Setter
template <typename T = Task, typename None = detail::enable_if_classification_t<T>> const table & get_probabilities() const
template <typename T = Task, typename None = detail::enable_if_classification_t<T>> auto & set_probabilities(const table &value)

Operation

template<typename Descriptor>
decision_forest::infer_result infer(const Descriptor &desc, const decision_forest::infer_input &input)
Template Parameters

Descriptor – Decision Forest algorithm descriptor decision_forest::descriptor.

Preconditions
input.data.is_empty == false
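
A matching inference call might look like this, continuing the illustrative names from the training sketch above; the infer free function is assumed to be callable as oneapi::dal::infer.

    #include "oneapi/dal/algo/decision_forest.hpp"

    namespace dal = oneapi::dal;
    namespace df = dal::decision_forest;

    dal::table classify(const df::model<df::task::classification>& trained_model,
                        const dal::table& new_data) {
        const auto desc = df::descriptor<float, df::method::dense, df::task::classification>{}
            .set_class_count(2)
            .set_infer_mode(df::infer_mode::class_labels);

        const auto input = df::infer_input<df::task::classification>{ trained_model, new_data };
        const auto result = dal::infer(desc, input);

        return result.get_labels(); // n x 1 table with the predicted labels
    }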