Principal Components Analysis (PCA)

Principal Component Analysis (PCA) is an algorithm for exploratory data analysis and dimensionality reduction. PCA transforms a set of feature vectors of possibly correlated features to a new set of uncorrelated features, called principal components. Principal components are the directions of the largest variance, that is, the directions where the data is mostly spread out.

Operation

Computational methods

Programming Interface

Training

Covariance

SVD

train(…)

train_input

train_result

Inference

Covariance

SVD

infer(…)

infer_input

infer_result

Mathematical formulation

Training

Given the training set \(X = \{ x_1, \ldots, x_n \}\) of \(p\)-dimensional feature vectors and the number of principal components \(r\), the problem is to compute \(r\) principal directions (\(p\)-dimensional eigenvectors [Lang87]) for the training set. The eigenvectors can be grouped into the \(r \times p\) matrix \(T\) that contains one eigenvector in each row.

Training method: Covariance

This method uses eigenvalue decomposition of the covariance matrix to compute the principal components of the datasets. The method relies on the following steps:

  1. Computation of the covariance matrix

  2. Computation of the eigenvectors and eigenvalues

  3. Formation of the matrices storing the results

Covariance matrix computation is performed in the following way:

  1. Compute the vector-column of sums \(s_i = \sum_{j=1}^n x_{i,j}, \quad 1 \leq i \leq p\).

  2. Compute the cross-product \(P = X^TX - s^Ts\).

  3. Compute the covariance matrix \(\Sigma = \frac{1}{n - 1} P\).

To compute eigenvalues \(\lambda_i\) and eigenvectors \(\upsilon_i\), the implementer can choose an arbitrary method such as [Ping14].

The final step is to sort the set of pairs \((\lambda_i, \upsilon_i)\) in the descending order by \(\lambda_i\) and form the resulting matrix \(T = (\upsilon_{i,1}, \cdots, \upsilon_{i,r}), \quad 1 \leq i \leq p\). Additionally, the means and variances of the initial dataset are returned.

Training method: SVD

This method uses singular value decomposition of the dataset to compute its principal components. The method relies on the following steps:

  1. Computation of the singular values and singular vectors

  2. Formation of the matrices storing the results

To compute singular values \(\lambda_i\) and singular vectors \(u_i\) and \(v_i\), the implementer can choose an arbitrary method such as [Demmel90].

The final step is to sort the set of pairs \((\lambda_i, v_i)\) in the descending order by \(\lambda_i\) and form the resulting matrix \(T = (v_{i,1}, \cdots, v_{i,r}), \quad 1 \leq i \leq p\). Additionally, the means and variances of the initial dataset are returned.

Sign-flip technique

Eigenvectors computed by some eigenvalue solvers are not uniquely defined due to sign ambiguity. To get the deterministic result, a sign-flip technique should be applied. One of the sign-flip techniques proposed in [Bro07] requires the following modification of matrix \(T\):

\[\hat{T}_i = T_i \cdot \mathrm{sgn}(\max_{1 \leq j \leq p } |{T}_{ij}|), \quad 1 \leq i \leq r,\]

where \(T_i\) is \(i\)-th row, \(T_{ij}\) is the element in the \(i\)-th row and \(j\)-th column, \(\mathrm{sgn}(\cdot)\) is the signum function,

\[\begin{split}\mathrm{sgn}(x) = \begin{cases} -1, & x < 0, \\ 0, & x = 0, \\ 1, & x > 0. \end{cases}\end{split}\]

Inference

Given the inference set \(X' = \{ x_1', \ldots, x_m' \}\) of \(p\)-dimensional feature vectors and the \(r \times p\) matrix \(T\) produced at the training stage, the problem is to transform \(X'\) to the set \(X'' = \{ x_1'', \ldots, x_m'' \}\), where \(x_{j}''\) is an \(r\)-dimensional feature vector, \(1 \leq j \leq m\).

The feature vector \(x_{j}''\) is computed through applying linear transformation [Lang87] defined by the matrix \(T\) to the feature vector \(x_{j}'\),

(1)\[x_{j}'' = T x_{j}', \quad 1 \leq j \leq m.\]

Inference methods: Covariance and SVD

Covariance and SVD inference methods compute \(x_{j}''\) according to (1).

Usage example

Training

pca::model<> run_training(const table& data) {
   const auto pca_desc = pca::descriptor<float>{}
      .set_component_count(5)
      .set_deterministic(true);

   const auto result = train(pca_desc, data);

   print_table("means", result.get_means());
   print_table("variances", result.get_variances());
   print_table("eigenvalues", result.get_eigenvalues());
   print_table("eigenvectors", result.get_eigenvectors());

   return result.get_model();
}

Inference

table run_inference(const pca::model<>& model,
                  const table& new_data) {
   const auto pca_desc = pca::descriptor<float>{}
      .set_component_count(model.get_component_count());

   const auto result = infer(pca_desc, model, new_data);

   print_table("labels", result.get_transformed_data());
}

Examples

Batch Processing:

Batch Processing:

Batch Processing:

Programming Interface

All types and functions in this section are declared in the oneapi::dal::pca namespace and be available via inclusion of the oneapi/dal/algo/pca.hpp header file.

Descriptor

template<typename Float = detail::descriptor_base<>::float_t, typename Method = detail::descriptor_base<>::method_t, typename Task = detail::descriptor_base<>::task_t>
class descriptor
Template Parameters
  • Float – The floating-point type that the algorithm uses for intermediate computations. Can be float or double.

  • Method – Tag-type that specifies an implementation of algorithm. Can be method::v1::cov or method::v1::svd.

  • Task – Tag-type that specifies type of the problem to solve. Can be task::v1::dim_reduction.

Constructors

descriptor(std::int64_t component_count = 0)

Creates a new instance of the class with the given component_count property value.

Public Methods

auto &set_component_count(int64_t value)
auto &set_deterministic(bool value)

Method tags

struct cov

Tag-type that denotes Covariance computational method.

struct svd

Tag-type that denotes SVD computational method.

using by_default = cov

Alias tag-type for Covariance computational method.

Task tags

struct dim_reduction

Tag-type that parameterizes entities used for solving dimensionality reduction problem.

using by_default = dim_reduction

Alias tag-type for dimensionality reduction task.

Model

template<typename Task = task::by_default>
class model
Template Parameters

Task – Tag-type that specifies type of the problem to solve. Can be task::v1::dim_reduction.

Constructors

model()

Creates a new instance of the class with the default property values.

Properties

const table &eigenvectors = table{}

An \(r \times p\) table with the eigenvectors. Each row contains one eigenvector.

Getter & Setter
const table & get_eigenvectors() const
auto & set_eigenvectors(const table &value)

Training train(...)

Input

template<typename Task = task::by_default>
class train_input
Template Parameters

Task – Tag-type that specifies type of the problem to solve. Can be task::v1::dim_reduction.

Constructors

train_input(const table &data)

Creates a new instance of the class with the given data property value.

Properties

const table &data = table{}

An \(n \times p\) table with the training data, where each row stores one feature vector.

Getter & Setter
const table & get_data() const
auto & set_data(const table &data)

Result

template<typename Task = task::by_default>
class train_result
Template Parameters

Task – Tag-type that specifies type of the problem to solve. Can be task::v1::dim_reduction.

Constructors

train_result()

Creates a new instance of the class with the default property values.

Properties

const table &eigenvectors = table{}

An \(r \times p\) table with the eigenvectors. Each row contains one eigenvector.

Getter & Setter
const table & get_eigenvectors() const
Invariants
eigenvectors == model.eigenvectors
const model<Task> &model = model<Task>{}

The trained PCA model.

Getter & Setter
const model< Task > & get_model() const
auto & set_model(const model< Task > &value)
const table &eigenvalues = table{}

A \(1 \times r\) table that contains the eigenvalues for for the first r features.

Getter & Setter
const table & get_eigenvalues() const
auto & set_eigenvalues(const table &value)
const table &variances = table{}

A \(1 \times r\) table that contains the variances for the first r features.

Getter & Setter
const table & get_variances() const
auto & set_variances(const table &value)
const table &means = table{}

A \(1 \times r\) table that contains the mean values for the first r features.

Getter & Setter
const table & get_means() const
auto & set_means(const table &value)

Operation

template<typename Descriptor>
pca::train_result train(const Descriptor &desc, const pca::train_input &input)
Template Parameters
  • desc – PCA algorithm descriptor pca::desc

  • input – Input data for the training operation

Preconditions
input.data.has_data == true
input.data.column_count >= desc.component_count
Postconditions
result.means.row_count == 1
result.means.column_count == desc.component_count
result.variances.row_count == 1
result.variances.column_count == desc.component_count
result.variances[i] >= 0.0
result.eigenvalues.row_count == 1
result.eigenvalues.column_count == desc.component_count
result.model.eigenvectors.row_count == 1
result.model.eigenvectors.column_count == desc.component_count

Inference infer(...)

Input

template<typename Task = task::by_default>
class infer_input
Template Parameters

Task – Tag-type that specifies type of the problem to solve. Can be task::v1::dim_reduction.

Constructors

infer_input(const model<Task> &trained_model, const table &data)

Creates a new instance of the class with the given model and data property values.

Properties

const model<Task> &model = model<Task>{}

The trained PCA model.

Getter & Setter
const model< Task > & get_model() const
auto & set_model(const model< Task > &value)
const table &data = table{}

The dataset for inference \(X'\).

Getter & Setter
const table & get_data() const
auto & set_data(const table &value)

Result

template<typename Task = task::by_default>
class infer_result
Template Parameters

Task – Tag-type that specifies type of the problem to solve. Can be task::v1::dim_reduction.

Constructors

infer_result()

Creates a new instance of the class with the default property values.

Properties

const table &transformed_data = table{}

An \(n \times r\) table that contains data projected to the r principal components.

Getter & Setter
const table & get_transformed_data() const
auto & set_transformed_data(const table &value)

Operation

template<typename Descriptor>
pca::infer_result infer(const Descriptor &desc, const pca::infer_input &input)
Template Parameters
  • desc – PCA algorithm descriptor pca::desc

  • input – Input data for the inference operation

Preconditions
input.data.has_data == true
input.model.eigenvectors.row_count == desc.component_count
input.model.eigenvectors.column_count == input.data.column_count
Postconditions
result.transformed_data.row_count == input.data.row_count
result.transformed_data.column_count == desc.component_count