An Exploration of PCA (Principal Component Analysis) Using Python — Step b

Aior · Jul 7, 2023

What Is PCA?

Principal Component Analysis (PCA) is one of the most widely used dimensionality reduction methods in data science. It projects a high-dimensional dataset onto fewer dimensions while preserving most of its variance, which helps with visualisation, faster model training and noise filtering.

At AIOR we use PCA frequently on customer projects — narrowing feature sets in anomaly detection pipelines, extracting meaningful summaries from manufacturing sensor data, supporting customer segmentation studies. This guide walks through PCA in Python step by step.

Mathematical Background

The core idea of PCA: find the directions in which the data varies most (variance directions) and use those as new axes. These new axes are called principal components. The first component captures the largest variance; the second is orthogonal to the first and captures the next-largest variance, and so on.

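Mathematically, PCA is an eigendecomposition of the data's covariance matrix: the eigenvectors are the principal directions and the eigenvalues are the variances along them. Sorting the eigenvalues in descending order and projecting the centred data onto the top k eigenvectors performs the dimensionality reduction.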
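To make the eigendecomposition concrete, here is a minimal sketch that does PCA by hand with NumPy and compares it with scikit-learn on the iris data. The variable names (X_std, eigvals, eigvecs, X_proj) are illustrative only, and the comparison ignores sign, since the sign of an eigenvector is arbitrary.

Code:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardised iris data (mean 0 per feature)
X_std = StandardScaler().fit_transform(load_iris().data)

# Covariance matrix of the features (columns); np.cov divides by n - 1
cov = np.cov(X_std, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]          # re-sort to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the top k eigenvectors
k = 2
X_proj = X_std @ eigvecs[:, :k]

# Reference result from scikit-learn
pca_ref = PCA(n_components=k).fit(X_std)
X_ref = pca_ref.transform(X_std)

print("eigenvalues:", eigvals.round(4))
print("variances match:", np.allclose(eigvals[:k], pca_ref.explained_variance_))
print("projection matches up to sign:", np.allclose(np.abs(X_proj), np.abs(X_ref)))
```

scikit-learn itself computes PCA through an SVD rather than an explicit covariance matrix, so this is an equivalent formulation for illustration, not a description of the library's internals.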

Data Preparation

When features are measured on different scales, standardise the data before PCA; otherwise the features with the largest numeric range dominate the covariance matrix and therefore the leading components:

Code:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load sample data (iris dataset)
from sklearn.datasets import load_iris
data = load_iris()
X = data.data
y = data.target
features = data.feature_names

# Standardise (mean=0, std=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Original shape: {X.shape}")
print(f"Feature names: {features}")
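The effect of scaling can be checked directly. The sketch below fits PCA on the raw and on the standardised iris data and compares how much variance the first component claims in each case; the names X_raw, ratio_raw and ratio_std are illustrative.

Code:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_raw = load_iris().data
X_std = StandardScaler().fit_transform(X_raw)

# Same model, fitted once on raw and once on standardised features
ratio_raw = PCA().fit(X_raw).explained_variance_ratio_
ratio_std = PCA().fit(X_std).explained_variance_ratio_

print("per-feature variance (raw):", X_raw.var(axis=0, ddof=1).round(3))
print("PC1 share, raw:         ", round(float(ratio_raw[0]), 4))
print("PC1 share, standardised:", round(float(ratio_std[0]), 4))
```

On raw data the first component mostly follows whichever feature has the largest variance, so its share looks inflated. After standardisation every feature starts with equal weight.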

Applying PCA

With scikit-learn it's extremely simple:

Code:

# Keep all components
pca = PCA(n_components=None)
X_pca = pca.fit_transform(X_scaled)

# Explained variance ratio
print("Explained variance per component:")
for i, ev in enumerate(pca.explained_variance_ratio_):
    print(f"  PC{i+1}: {ev:.4f} ({ev*100:.2f}%)")

print(f"\nCumulative: {np.cumsum(pca.explained_variance_ratio_)}")

On the standardised iris data the result is roughly PC1 73%, PC2 23%, PC3 3.7%, PC4 0.5%. The first two components together cover about 96% of the total variance, so the 4D data can be reduced to 2D with little loss.
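One way to see what "variance not covered" means is to project onto two components, map back with inverse_transform, and measure what was lost. This is a small sketch under the same setup (standardised iris); the names X_back, lost and discarded are illustrative.

Code:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

pca_full = PCA().fit(X_std)               # all 4 components
pca_k = PCA(n_components=2).fit(X_std)    # keep only 2

# Go down to 2D and back up to 4D
X_back = pca_k.inverse_transform(pca_k.transform(X_std))

# Share of the total squared deviation that the round trip loses
lost = np.sum((X_std - X_back) ** 2) / np.sum(X_std ** 2)

# Share of variance carried by the discarded components (PC3 and PC4)
discarded = pca_full.explained_variance_ratio_[2:].sum()

print(f"relative reconstruction error: {lost:.4f}")
print(f"discarded variance ratio:      {discarded:.4f}")
```

The two numbers agree: the relative reconstruction error equals the explained variance ratio of the components that were dropped.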

Optimal Number of Components

Two approaches to the "how many components?" question:

Cumulative variance threshold: keep the smallest number of components whose cumulative explained variance reaches a chosen threshold; 90% or 95% is a common choice.

Code:

threshold = 0.95
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cumvar >= threshold) + 1
print(f"Optimal components: {n_components}")
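scikit-learn can also apply the threshold itself: passing a float between 0 and 1 as n_components keeps just enough components to reach that share of variance. The sketch below checks that this agrees with the manual cumulative-sum approach on iris; pca_auto and n_manual are illustrative names.

Code:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

# A float n_components is interpreted as a variance threshold (needs the full SVD solver)
pca_auto = PCA(n_components=0.95, svd_solver="full").fit(X_std)

# Manual version: first index where the cumulative ratio reaches the threshold
cumvar = np.cumsum(PCA().fit(X_std).explained_variance_ratio_)
n_manual = int(np.argmax(cumvar >= 0.95) + 1)

print("components chosen by scikit-learn:", pca_auto.n_components_)
print("components chosen manually:       ", n_manual)
```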

Scree plot: plot the eigenvalues against the component number and look for the "elbow", the point after which the eigenvalues fall off only slowly:

Code:

plt.figure(figsize=(10, 5))
plt.plot(range(1, len(pca.explained_variance_) + 1),
         pca.explained_variance_, 'bo-')
plt.title('Scree Plot')
plt.xlabel('Principal Component')
plt.ylabel('Eigenvalue')
plt.grid(True)
plt.savefig('scree_plot.png')

2D Projection Visualisation

One of PCA's most powerful applications is visualising high-dimensional data in 2D:

Code:

pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)

plt.figure(figsize=(10, 8))
for label in np.unique(y):
    plt.scatter(X_2d[y == label, 0],
                X_2d[y == label, 1],
                label=data.target_names[label],
                s=50)
plt.xlabel(f'PC1 ({pca_2d.explained_variance_ratio_[0]:.2%})')
plt.ylabel(f'PC2 ({pca_2d.explained_variance_ratio_[1]:.2%})')
plt.title('PCA Projection (2D)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.savefig('pca_2d.png', dpi=150)

Feature Importance

To see how strongly each original feature contributes to each component, inspect the component weights (often called loadings):

Code:

loadings = pd.DataFrame(
    pca.components_.T,
    columns=[f'PC{i+1}' for i in range(pca.n_components_)],
    index=features
)
print(loadings.round(3))

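The absolute value of an entry shows how strongly that feature weights on the component; the sign shows whether the feature moves with or against it. Note that pca.components_ holds unit-length eigenvectors; some texts reserve the term "loadings" for these vectors scaled by the square root of the corresponding eigenvalue.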
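The two conventions can be computed side by side. This sketch builds the raw weights, the eigenvalue-scaled loadings, and the squared weights, which sum to 1 per component and can be read as each feature's share of it. The names weights, scaled and share are illustrative.

Code:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_iris()
X_std = StandardScaler().fit_transform(data.data)
pca = PCA().fit(X_std)

pc_names = [f"PC{i+1}" for i in range(pca.n_components_)]

# Raw weights: each column is a unit-length eigenvector
weights = pd.DataFrame(pca.components_.T, index=data.feature_names, columns=pc_names)

# Scaled loadings: column j multiplied by sqrt(eigenvalue j)
scaled = weights * np.sqrt(pca.explained_variance_)

# Squared weights: each column sums to 1
share = weights ** 2

print(scaled.round(3))
print(share.round(3))
```

For standardised features the scaled loadings are close to the correlation between each feature and each component, which makes them easier to compare across components than the raw weights.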

Practical Use Cases

Anomaly detection: reduce the dimensions with PCA, then compute each observation's Mahalanobis distance in the low-dimensional space; observations above a chosen threshold are flagged as anomalous.

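Model acceleration: reduce a 1000-feature dataset to 50 components; random forest or XGBoost can then train several times faster (on the order of 5-10×, depending on the data and model settings). Check with cross-validation that accuracy holds up after the reduction.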

Visualisation: explore customer segmentation, gene expression, stock returns and other high-dimensional data in 2D.
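The anomaly-detection use case above can be sketched in a few lines. Because principal component scores are uncorrelated, the squared Mahalanobis distance in PCA space reduces to the sum of squared scores divided by their eigenvalues. The chi-square cutoff is an assumption made for this sketch (it presumes roughly Gaussian scores), not part of the original post; in practice the threshold is usually tuned on the data at hand. Iris is used only as a stand-in dataset.

Code:

```python
import numpy as np
from scipy.stats import chi2
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_std = StandardScaler().fit_transform(load_iris().data)

k = 2
pca = PCA(n_components=k).fit(X_std)
scores = pca.transform(X_std)              # coordinates in PCA space

# Scores are uncorrelated, so their covariance is diagonal (the eigenvalues).
# Squared Mahalanobis distance = sum over components of score^2 / eigenvalue.
d2 = np.sum(scores ** 2 / pca.explained_variance_, axis=1)

# Assumed threshold: 99th percentile of a chi-square with k degrees of freedom
cutoff = chi2.ppf(0.99, df=k)
flagged = np.flatnonzero(d2 > cutoff)

print(f"cutoff (squared distance): {cutoff:.2f}")
print("flagged row indices:", flagged)
```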

Important Caveats

Linear assumption: PCA captures only linear structure, since each component is a linear combination of the input features. For non-linear relationships consider t-SNE or UMAP.

Loss of interpretability: PCs are linear combinations of original features and may lose business meaning. If you must explain to non-technical stakeholders, consider feature selection instead.

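Scalability: for matrices too large to fit in memory use IncrementalPCA, which fits the model in mini-batches; for large in-memory matrices the randomised solver, PCA(svd_solver="randomized"), is usually faster than a full SVD.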
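A minimal sketch of both scalable options follows. The data is synthetic (a hypothetical low-rank signal plus noise) and small enough to run anywhere; with real data the chunks would typically come from disk or a database rather than np.array_split.

Code:

```python
import numpy as np
from sklearn.decomposition import PCA, IncrementalPCA

# Synthetic stand-in for a large matrix: 5000 samples, 50 features, rank-5 signal plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(5000, 5))
mixing = rng.normal(size=(5, 50))
X_big = latent @ mixing + 0.1 * rng.normal(size=(5000, 50))

# Option 1: IncrementalPCA sees the data one chunk at a time
ipca = IncrementalPCA(n_components=5)
for chunk in np.array_split(X_big, 10):
    ipca.partial_fit(chunk)

# Option 2: randomised SVD solver on the in-memory matrix
rpca = PCA(n_components=5, svd_solver="randomized", random_state=0).fit(X_big)

# Exact solution for comparison
exact = PCA(n_components=5, svd_solver="full").fit(X_big)

print("explained variance, exact:      ", exact.explained_variance_ratio_.sum().round(4))
print("explained variance, incremental:", ipca.explained_variance_ratio_.sum().round(4))
print("explained variance, randomised: ", rpca.explained_variance_ratio_.sum().round(4))
```

Both approximations should land very close to the exact result here because the signal is strongly low-rank; on data with a flatter spectrum the gap can be larger.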

AIOR Data Science Services

At AIOR we deliver dimensionality reduction, anomaly detection and segmentation studies using the Python data science stack. Our hosting infrastructure ships numpy, scipy, scikit-learn, pandas and dask pre-installed; Jupyter notebook access is included on VPS plans.
