Principal component analysis with Python scikit-learn

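Principal component analysis (PCA) is a dimensionality-reduction technique that combines multi-dimensional features (explanatory variables) into a smaller number of composite features. It produces a first principal component, which acts as an overall summary feature along the direction of greatest variance, followed by a second and later components that are each orthogonal to the ones before. The combined features are computed from the correlation matrix of the original features.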

f:id:letitride:20200725100412p:plain:w500

Applying PCA to the scatter plot in the upper left produces the scatter plot in the upper right.
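The idea that the components come from the correlation matrix can be sketched with plain NumPy. This is a minimal illustration, not how scikit-learn implements PCA internally. The function name `pca_via_correlation` and the toy data are assumptions made for this example.

```python
import numpy as np

def pca_via_correlation(X, n_components=2):
    """Illustrative PCA: eigendecomposition of the feature correlation matrix."""
    # Standardize each column (zero mean, unit variance).
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Correlation matrix of the features (features are columns).
    corr = np.corrcoef(X, rowvar=False)
    # eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order].T  # one row per principal component
    scores = Z @ components.T         # standardized data projected onto the components
    return components, scores

# Hypothetical toy data: two strongly correlated features.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2.0 * x + rng.normal(scale=0.5, size=200)])

components, scores = pca_via_correlation(X)
print(components @ components.T)          # close to the identity matrix
print(np.corrcoef(scores, rowvar=False))  # off-diagonal entries close to 0
```

The two printed matrices show the properties described above: the components are orthogonal to each other, and the projected features are no longer correlated.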

PCA on the breast cancer data

Inspecting the features

The breast cancer dataset has 30 features. Plotting a histogram of each feature per target class shows how well that feature separates the classes.

import matplotlib.pyplot as plt
import mglearn
import numpy as np
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
fig, axes = plt.subplots(15, 2, figsize=(10,20))
# Split the samples by class (target 0 = malignant, 1 = benign)
malignant = cancer.data[cancer.target == 0]
benign = cancer.data[cancer.target == 1]

ax = axes.ravel()

for i in range(30):
    # Share the bin edges between both classes so the histograms are comparable
    _, bins = np.histogram(cancer.data[:, i], bins=50)
    ax[i].hist(malignant[:, i], bins=bins, color=mglearn.cm3(0), alpha=.5)
    ax[i].hist(benign[:, i], bins=bins, color=mglearn.cm3(2), alpha=.5)
    ax[i].set_title(cancer.feature_names[i])
    ax[i].set_yticks(())
ax[0].set_xlabel("Feature magnitude")
ax[0].set_ylabel("Frequency")
ax[0].legend(["malignant", "benign"], loc="best")
fig.tight_layout()

f:id:letitride:20200725101119p:plain:w500

Features such as mean perimeter separate the two classes fairly well, whereas mean fractal dimension turns out to be of little use for telling malignant from benign tumors.

Preprocessing for PCA

Standardizing the features

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(cancer.data)
X_scaled = scaler.transform(cancer.data)
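To make the standardization step concrete, here is a small NumPy sketch of the arithmetic: subtract each column's mean and divide by its standard deviation. The toy matrix is hypothetical. The sketch uses the population standard deviation (ddof=0), which is also what StandardScaler uses.

```python
import numpy as np

# Hypothetical toy matrix: rows are samples, columns are features on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 700.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)            # population standard deviation (ddof=0)
X_manual = (X - mean) / std

print(X_manual.mean(axis=0))   # approximately 0 for every column
print(X_manual.std(axis=0))    # approximately 1 for every column
```

Without this step, features with large numeric ranges would dominate the variance and therefore the principal components.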

Principal component analysis

Specify n_components=2 to keep the first two principal components.

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(X_scaled)
X_pca = pca.transform(X_scaled)
print("Original shape: {}".format(str(X_scaled.shape)))
print("Reduced shape: {}".format(str(X_pca.shape)))

Checking the dimensionality reduction

Original shape: (569, 30)
Reduced shape: (569, 2)
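Going from 30 dimensions to 2 discards some information. One way to gauge how much is kept is the `explained_variance_ratio_` attribute of the fitted PCA object. The following is a minimal sketch that repeats the steps above in compact form. The variable name `ratio` is introduced here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

cancer = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(cancer.data)

pca = PCA(n_components=2).fit(X_scaled)
ratio = pca.explained_variance_ratio_   # fraction of total variance per component
print("Explained variance ratio: {}".format(ratio))
print("Total retained: {:.3f}".format(ratio.sum()))
```

The values are sorted in descending order, so the first entry always belongs to the first principal component.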

Checking the results

Although the target variable was never passed to PCA's fit or transform, the features have been combined into components that separate the target classes.

import mglearn

plt.figure(figsize=(8,8))
mglearn.discrete_scatter(X_pca[:, 0], X_pca[:, 1], cancer.target)
plt.legend(cancer.target_names, loc="best")
plt.gca().set_aspect("equal")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")

f:id:letitride:20200725102303p:plain:w500

The vectors describing how much each original feature contributes to each component are stored in components_.

print("PCA components:\n{}".format(pca.components_))

PCA components:
[[ 0.21890244  0.10372458  0.22753729  0.22099499  0.14258969  0.23928535
   0.25840048  0.26085376  0.13816696  0.06436335  0.20597878  0.01742803
   0.21132592  0.20286964  0.01453145  0.17039345  0.15358979  0.1834174
   0.04249842  0.10256832  0.22799663  0.10446933  0.23663968  0.22487053
   0.12795256  0.21009588  0.22876753  0.25088597  0.12290456  0.13178394]
 [-0.23385713 -0.05970609 -0.21518136 -0.23107671  0.18611302  0.15189161
   0.06016536 -0.0347675   0.19034877  0.36657547 -0.10555215  0.08997968
  -0.08945723 -0.15229263  0.20443045  0.2327159   0.19720728  0.13032156
   0.183848    0.28009203 -0.21986638 -0.0454673  -0.19987843 -0.21935186
   0.17230435  0.14359317  0.09796411 -0.00825724  0.14188335  0.27533947]]
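Each row of components_ is a unit-length direction in the original feature space, and the rows are orthogonal to each other. The sketch below checks these properties and shows that, with the default whiten=False, transform amounts to centering the data and projecting it onto those rows. The random data is hypothetical and only used to keep the example small.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical random data, used only to illustrate the properties.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

pca = PCA(n_components=2).fit(X)

# Rows of components_ are unit-length and mutually orthogonal.
gram = pca.components_ @ pca.components_.T
print(np.allclose(gram, np.eye(2)))

# With the default whiten=False, transform centers the data and projects it.
manual = (X - pca.mean_) @ pca.components_.T
print(np.allclose(manual, pca.transform(X)))
```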

Visualize the contributions as a heat map.

plt.matshow(pca.components_, cmap="viridis")
plt.yticks([0,1],["First component", "Second component"])
plt.colorbar()
plt.xticks(range(len(cancer.feature_names)), cancer.feature_names, rotation=60, ha="left")
plt.xlabel("Feature")
plt.ylabel("Principal component")
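Alongside the heat map, it can help to list the features with the largest absolute weight in each component. The helper `top_features` below is a hypothetical name introduced for this sketch. For brevity it is applied only to the first five columns of the components_ output printed earlier, so the ranking covers just those five features.

```python
import numpy as np

# First five columns of the components_ output above, with their feature names.
feature_names = np.array(["mean radius", "mean texture", "mean perimeter",
                          "mean area", "mean smoothness"])
components = np.array([
    [0.21890244, 0.10372458, 0.22753729, 0.22099499, 0.14258969],
    [-0.23385713, -0.05970609, -0.21518136, -0.23107671, 0.18611302],
])

def top_features(components, feature_names, k=2):
    """Return, per component, the k features with the largest absolute loading."""
    order = np.argsort(-np.abs(components), axis=1)[:, :k]
    return [list(feature_names[row]) for row in order]

for i, names in enumerate(top_features(components, feature_names), start=1):
    print("Component {}: {}".format(i, names))
```

To rank all 30 features, pass pca.components_ and cancer.feature_names instead of the truncated arrays.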