Correlation Analysis with Big Data — A Step-by-Step Application

Aior · Jun 23, 2023

Why Correlation Analysis Matters

Understanding relationships between variables is fundamental to data science. Correlation analysis measures the direction and strength of the relationship between two or more variables — playing a central role in hypothesis formation, feature selection, exploratory analysis before modelling, and business-decision support reporting. At AIOR we routinely run these analyses across industrial datasets, e-commerce sales data and hosting usage metrics.

Correlation Coefficients

Which coefficient to use depends on your data's structure. The three most common:

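Pearson correlation (r): measures the strength of the linear relationship between two continuous variables. Values range from -1 to +1: +1 is a perfect positive linear relationship, -1 a perfect negative one, and 0 means no linear relationship (a non-linear one may still exist). The coefficient itself only describes linear association; the usual significance test additionally assumes the data are approximately bivariate normal.

Spearman rank correlation (ρ): Pearson's r computed on the ranks of the data, so it captures any monotonic relationship, linear or not. Preferred for ordinal data and when you want robustness against outliers.

Kendall tau-b (τ): based on counting concordant and discordant pairs, with a correction for ties (equal values). Often preferred over Spearman for small samples and ordinal data, but typically slower to compute on large datasets.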
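
To make the difference between the three coefficients concrete, here is a minimal sketch on synthetic data (the exponential relationship and the variable names are assumptions chosen for illustration, not part of any real dataset). The relationship is strictly increasing but far from linear, so the rank-based coefficients reach 1 while Pearson stays clearly below it.

```python
import numpy as np
from scipy import stats

# Synthetic data: y grows monotonically with x, but not along a straight line
rng = np.random.default_rng(0)
x = rng.uniform(0, 5, size=200)
y = np.exp(x)

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)
kendall_tau, _ = stats.kendalltau(x, y)  # SciPy computes tau-b by default

print(f"Pearson r    = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
print(f"Kendall tau  = {kendall_tau:.3f}")
```

If Pearson and Spearman disagree strongly on your own data, that is usually a hint to look at a scatter plot before choosing a coefficient.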

Data Preparation

Before computing correlations on a large dataset, do the prep work:

Code:

import pandas as pd
import numpy as np

# Load data
df = pd.read_csv("data.csv")

# Check missing values
print(df.isnull().sum())

# Pick numeric columns
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(f"Numeric columns: {len(numeric_cols)}")

# Outlier detection (IQR method)
for col in numeric_cols:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    outliers = ((df[col] < q1 - 1.5*iqr) | (df[col] > q3 + 1.5*iqr)).sum()
    print(f"  {col}: {outliers} outliers")
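
On wide tables the per-column loop above can be replaced by a vectorised version. The following is a sketch under the assumption that all columns are numeric; `df_demo` is a small made-up frame standing in for your data.

```python
import pandas as pd

# Hypothetical toy frame; column "a" contains one obvious extreme value
df_demo = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [1, 2, 3, 4, 5]})

# Quartiles for every column at once (each is a Series indexed by column name)
q1 = df_demo.quantile(0.25)
q3 = df_demo.quantile(0.75)
iqr = q3 - q1

# The comparison aligns the bounds with the columns, giving a boolean frame
outlier_mask = (df_demo < q1 - 1.5 * iqr) | (df_demo > q3 + 1.5 * iqr)
outlier_counts = outlier_mask.sum()

print(outlier_counts)
```

The boolean mask is also handy later, for example to compare correlations with and without the flagged rows.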

Fast Correlation Matrix with pandas

Compute correlations across all numeric variables in one call:

Code:

# Pearson (default)
corr_pearson = df[numeric_cols].corr(method='pearson')

# Spearman
corr_spearman = df[numeric_cols].corr(method='spearman')

# Kendall
corr_kendall = df[numeric_cols].corr(method='kendall')

print(corr_pearson.round(3))

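The result is a symmetric matrix with 1.0 on the diagonal: rows and columns are variable names, cells are correlation values. pandas excludes missing values pairwise, so each cell may be based on a different number of rows.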
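
With many columns, the matrix is easier to work with as a ranked list of pairs. The sketch below builds a small synthetic frame (the column names and noise levels are assumptions for illustration), keeps only the cells above the diagonal so each pair appears once, and sorts by absolute strength.

```python
import numpy as np
import pandas as pd

# Synthetic data: b is almost a copy of a, c is moderately negative, d is noise
rng = np.random.default_rng(1)
base = rng.normal(size=500)
demo = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.1, size=500),
    "c": -base + rng.normal(scale=1.0, size=500),
    "d": rng.normal(size=500),
})

corr = demo.corr()

# Boolean mask that is True strictly above the diagonal
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# One row per unique pair; masked cells become NaN and are dropped
pairs = corr.where(upper).stack().dropna()

# Order by |r| so strong negative pairs rank next to strong positive ones
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked)
```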

Visualisation — Heatmap

Reading the correlation matrix as a table is hard; a heatmap is far easier to grasp visually:

Code:

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 10))
sns.heatmap(corr_pearson,
            annot=True,
            cmap='coolwarm',
            vmin=-1, vmax=1,   # pin the colour scale to the full correlation range
            center=0,
            fmt='.2f',
            square=True,
            linewidths=0.5)
plt.title('Pearson Correlation Matrix')
plt.tight_layout()
plt.savefig('correlation_heatmap.png', dpi=150)

The coolwarm colour palette shows positive values as red and negative as blue, with 0 set as the centre.

Statistical Significance

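The correlation coefficient is only half the story; its p-value matters too. A low p-value (typically <0.05) indicates that a correlation this strong would be unlikely if the true correlation were zero. With very large samples even a negligible r can clear that threshold, so read the p-value together with the size of r: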

Code:

from scipy.stats import pearsonr

# For two variables
r, p_value = pearsonr(df['x'], df['y'])
print(f"Pearson r = {r:.3f}, p = {p_value:.4f}")

# For all combinations
from itertools import combinations

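results = []
for col1, col2 in combinations(numeric_cols, 2):
    # pearsonr does not skip missing values, so drop incomplete rows per pair
    pair = df[[col1, col2]].dropna()
    r, p = pearsonr(pair[col1], pair[col2])
    results.append({'var1': col1, 'var2': col2, 'r': r, 'p': p})

results_df = pd.DataFrame(results)
# Rank by absolute strength so strong negative correlations are not buried
significant = results_df[results_df['p'] < 0.05].sort_values('r', key=abs, ascending=False)
print(significant.head(20))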
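
Testing every pair of columns means running many tests at once, and at a 0.05 threshold some pairs will look significant by chance alone. One common remedy is the Benjamini-Hochberg adjustment, sketched below as a small hand-written helper. The function name `benjamini_hochberg` and the example p-values are made up for illustration; this is not part of the original workflow.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i
    scaled = p[order] * m / np.arange(1, m + 1)
    # Make the adjusted values non-decreasing, working from the largest down
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

# Hypothetical raw p-values from five pairwise tests
raw_p = [0.001, 0.008, 0.039, 0.041, 0.60]
adj_p = benjamini_hochberg(raw_p)

print("raw significant:     ", int((np.array(raw_p) < 0.05).sum()))
print("adjusted significant:", int((adj_p < 0.05).sum()))
```

In the pairwise loop above, the same helper could be applied to `results_df['p']` before filtering, assuming that column holds one p-value per pair.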

Big Data with Dask

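pandas loads the whole dataset into the memory of a single machine. For datasets that do not fit in RAM (100 GB and up), Dask computes the same correlation matrix in parallel, one partition at a time: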

Code:

import dask.dataframe as dd

ddf = dd.read_csv("huge_data*.csv")
# Keep numeric columns only; Dask's corr() implements Pearson only
corr_matrix = ddf.select_dtypes(include="number").corr().compute()

Dask partitions the data and computes in parallel across cores or, with the dask.distributed scheduler, across machines.
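
Why Pearson lends itself to partitioned computation can be shown with plain NumPy: the coefficient depends only on a handful of running sums, which can be accumulated chunk by chunk and combined at the end. The sketch below illustrates that idea under stated assumptions; it is not Dask's actual implementation, and `streaming_pearson` is a hypothetical helper name.

```python
import numpy as np

def streaming_pearson(chunks):
    """Pearson r from an iterable of (x, y) array chunks, in a single pass."""
    n = 0
    sx = sy = sxx = syy = sxy = 0.0
    for x, y in chunks:
        x = np.asarray(x, dtype=float)
        y = np.asarray(y, dtype=float)
        n += x.size
        sx += x.sum()
        sy += y.sum()
        sxx += (x * x).sum()
        syy += (y * y).sum()
        sxy += (x * y).sum()
    # Combine the running sums into covariance and variances (times n)
    cov = sxy - sx * sy / n
    var_x = sxx - sx * sx / n
    var_y = syy - sy * sy / n
    return cov / np.sqrt(var_x * var_y)

# Synthetic data split into 10 chunks, standing in for partitions on disk
rng = np.random.default_rng(2)
x_all = rng.normal(size=10_000)
y_all = 0.5 * x_all + rng.normal(size=10_000)
chunks = zip(np.array_split(x_all, 10), np.array_split(y_all, 10))

r_stream = streaming_pearson(chunks)
r_full = np.corrcoef(x_all, y_all)[0, 1]
print(f"chunked r = {r_stream:.6f}, in-memory r = {r_full:.6f}")
```

Raw sums of squares can lose precision when values are large relative to their spread, so a production version would likely centre the data or use a more stable update. Rank-based coefficients need a global ordering of the data, which is one reason they are harder to compute chunk by chunk.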

Avoiding Misinterpretations

Common pitfalls in correlation analysis:

The "correlation = causation" trap: high correlation between two variables doesn't mean one causes the other. A third variable may drive both (a confounding variable).
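
A small simulation makes the confounding case tangible. In this sketch (all variables are synthetic and the coefficients are arbitrary assumptions), `z` drives both `x` and `y`, which have no direct link. The raw correlation is strong, but it largely disappears once the linear effect of `z` is removed from both, which is one simple way to estimate a partial correlation.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000
z = rng.normal(size=n)             # hypothetical confounder
x = 2.0 * z + rng.normal(size=n)   # depends on z only
y = -1.5 * z + rng.normal(size=n)  # depends on z only

def residuals(target, control):
    """Part of target that a straight-line fit on control cannot explain."""
    slope, intercept = np.polyfit(control, target, deg=1)
    return target - (slope * control + intercept)

raw_r = np.corrcoef(x, y)[0, 1]
partial_r = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]

print(f"raw r(x, y)         = {raw_r:.3f}")
print(f"partial r(x, y | z) = {partial_r:.3f}")
```

This only works when the confounder is measured and its effect is roughly linear; an unmeasured confounder cannot be adjusted away like this.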

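Missing non-linear relationships: Pearson measures only linear relationships. A symmetric U-shaped relationship can yield r ≈ 0 even though the variables are strongly related, and a logarithmic relationship is understated by r. Always inspect a scatter plot.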
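
Both cases are easy to reproduce on synthetic data. The sketch below uses evenly spaced points (an assumption made to keep the example deterministic): a parabola centred on zero, and a logarithm over a wide positive range.

```python
import numpy as np
from scipy import stats

# U-shape: y is fully determined by x, yet there is no linear trend
x_sym = np.linspace(-1.0, 1.0, 201)
y_u = x_sym ** 2
r_u, _ = stats.pearsonr(x_sym, y_u)

# Logarithm: monotonic, so ranks agree perfectly, but the curve is not a line
x_pos = np.linspace(1.0, 1000.0, 500)
y_log = np.log(x_pos)
r_log, _ = stats.pearsonr(x_pos, y_log)
rho_log, _ = stats.spearmanr(x_pos, y_log)

print(f"U-shape: Pearson r = {r_u:.3f}")
print(f"Log:     Pearson r = {r_log:.3f}, Spearman rho = {rho_log:.3f}")
```

Note that Spearman does not rescue the U-shaped case either, because that relationship is not monotonic.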

Outlier impact: a single extreme value can dramatically shift the correlation. For robustness use Spearman or Kendall.
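
The effect of a single extreme point can be illustrated as follows. This is a sketch on synthetic, independent noise, with one made-up outlier at (50, 50) appended; the exact numbers depend on the random seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x_clean = rng.normal(size=50)
y_clean = rng.normal(size=50)  # generated independently of x_clean

# Append one extreme point far away from the rest of the cloud
x_out = np.append(x_clean, 50.0)
y_out = np.append(y_clean, 50.0)

r_clean, _ = stats.pearsonr(x_clean, y_clean)
r_out, _ = stats.pearsonr(x_out, y_out)
rho_out, _ = stats.spearmanr(x_out, y_out)

print(f"Pearson without outlier: {r_clean:.3f}")
print(f"Pearson with outlier:    {r_out:.3f}")
print(f"Spearman with outlier:   {rho_out:.3f}")
```

Pearson jumps close to 1 because the outlier dominates the sums, while Spearman only sees it as one more top-ranked point.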

AIOR Data Analysis Services

At AIOR we deliver custom data analysis projects: hosting usage pattern correlation, e-commerce conversion drivers, relationships in industrial sensor data. We analyse terabyte-scale datasets using Python + pandas + Dask. Our hosting packages ship with the Python data science stack (pandas, scipy, scikit-learn, dask) pre-installed.
