Day 22: Unsupervised Learning

The code shown below is available on GitHub.

GitHub - dlfrnaos19/FundamentalOfMachineLearning: Machine Learning study notes

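Until now, the algorithms and deep learning models we covered learned from labeled data in order to predict answers for data without labels. However, the world is still full of data that carries no labels at all. Unsupervised learning addresses this situation. For example, what if we face a classification task like the ones we solved with supervised learning, but no classification criteria are given?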

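Clustering

Clustering is the task of analyzing data that has no classification criteria and grouping similar items together.

K-means example

Steps of the algorithm

1. Decide the number of clusters (K).
2. Choose K centroids at random.
3. Compute the Euclidean distance from every remaining point to each centroid and assign each point to the cluster of the nearest centroid.
4. Recompute the K centroids (the mean of the points in each cluster becomes that cluster's centroid in the next iteration).
5. Using the updated centroids, recompute the Euclidean distance to every point and reassign the points to clusters.
6. Repeat steps 4 and 5 (the algorithm converges after a certain number of iterations).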
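
The steps above can be sketched in plain NumPy. This is only an illustrative sketch, not the scikit-learn implementation used in the rest of this post: the function name `kmeans_sketch`, its parameters, and the toy data are assumptions made up for this example.

```python
import numpy as np

def kmeans_sketch(points, k, n_iter=100, seed=0):
    """Plain NumPy K-means (hypothetical helper); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Steps 3 and 5: Euclidean distance from every point to every centroid,
        # then assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: each new centroid is the mean of the points assigned to it
        # (an empty cluster simply keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 6: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Toy data (assumed for illustration): two well-separated groups of 20 points.
rng = np.random.default_rng(1)
toy = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(10, 0.5, (20, 2))])
centroids, labels = kmeans_sketch(toy, k=2)
print(labels)
```

On data this cleanly separated, each group should end up with a single label. Which group receives label 0 depends on the random initialization.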

%matplotlib inline
from sklearn.datasets import make_blobs
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random

# Randomly generate 100 points around 5 centers
points, labels = make_blobs(n_samples=100, centers=5, 
                            n_features=2, random_state=135)
print(points.shape, points[:10])
print(labels.shape, labels[:10]) # the index of each point's center serves as its label

#result
(100, 2) [[ 4.63411914 -6.52590383]
 [-6.52008604  7.16624288]
 [ 2.14142339 -5.21092623]
 [ 1.70054231  8.54077897]
 [-0.33809159  8.76509668]
 [-7.69329744  7.94546313]
 [ 3.89090121 -3.06531839]
 [ 3.22338498 -2.93209009]
 [-6.63962964  5.34777334]
 [ 6.37904965 -6.46617328]]
(100,) [2 1 0 3 3 1 0 0 1 2]

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

points_df = pd.DataFrame(points, columns=['X','Y'])
display(points_df.head())

ax.scatter(points[:,0], points[:,1], c='black', label='random generated data')

ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.legend()
ax.grid()

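from sklearn.cluster import KMeans

# Fit K-means with K=5 on the generated points
kmeans_cluster = KMeans(n_clusters=5)
kmeans_cluster.fit(points)

colors = ['red','blue','green','brown','indigo']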
color_dict = {i:colors[i] for i in range(len(colors))}
fig = plt.figure()
ax = fig.add_subplot(1,1,1)

for cluster in range(5):
    cluster_sub_points = points[kmeans_cluster.labels_ == cluster] # select the points assigned to this cluster
    ax.scatter(cluster_sub_points[:,0], cluster_sub_points[:,1],
              c=color_dict[cluster],
              label='cluster_{}'.format(cluster))
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.legend()
ax.grid()

Cases where K-means does not work well

Example 1: instead of separating the inner ring from the outer ring, K-means cuts the data in two like slicing a cake

from sklearn.datasets import make_circles
# Generate 100 points in a circular distribution
circle_points, circle_labels = make_circles(n_samples=100,factor=0.5, noise=0.01)

fig = plt.figure()
ax = fig.add_subplot(1,1,1)

circle_kmeans = KMeans(n_clusters=2)
circle_kmeans.fit(circle_points) # K-means is unsupervised, so the labels are not needed
color_dict = {0:'red',1:'blue'}
for cluster in range(2):
    cluster_sub_points = circle_points[circle_kmeans.labels_==cluster]
    ax.scatter(cluster_sub_points[:,0], cluster_sub_points[:,1],
              c=color_dict[cluster],label='cluster_{}'.format(cluster))
ax.set_title('K-Means on circle data, K=2')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.legend()
ax.grid()

Example 2: an umbrella-like shape

from sklearn.datasets import make_moons

moon_points, moon_labels = make_moons(n_samples=100, noise=0.01)

fig = plt.figure()
ax = fig.add_subplot(1,1,1)

moon_kmeans = KMeans(n_clusters=2)
moon_kmeans.fit(moon_points) # K-means is unsupervised, so the labels are not needed

color_dict = {0:'red',1:'blue'}
for cluster in range(len(color_dict)):
    cluster_sub_points = moon_points[moon_kmeans.labels_==cluster]
    ax.scatter(cluster_sub_points[:,0],cluster_sub_points[:,1],c=color_dict[cluster], label='cluster_{}'.format(cluster))
ax.set_title('K-Means on moon-shaped data, K=2')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.legend()
ax.grid()

Example 3: diagonally stretched data

from sklearn.datasets import make_circles, make_moons, make_blobs
diag_points, _ = make_blobs(n_samples=100, random_state=170)
transformation = [[0.6,-0.6],[-0.4,0.8]]
diag_points = np.dot(diag_points, transformation)

fig = plt.figure()
ax = fig.add_subplot(1,1,1)

diag_kmeans = KMeans(n_clusters=3)
diag_kmeans.fit(diag_points)
color_dict = {0:'red',1:'blue',2:'green'}
for cluster in range(3):
    cluster_sub_points = diag_points[diag_kmeans.labels_==cluster]
    ax.scatter(cluster_sub_points[:,0], cluster_sub_points[:,1], c=color_dict[cluster],label='cluster_{}'.format(cluster))
ax.set_title('K-means on diagonal-shaped data, K=3')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.legend()
ax.grid()

DBSCAN

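DBSCAN parameters and terms

epsilon: the radius of a cluster neighborhood
minPts: the minimum number of points required to form a cluster
core point: a point that has at least minPts points within radius epsilon
border point: a point that belongs to a cluster but cannot act as its core
noise point: a point that does not belong to any cluster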

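How DBSCAN proceeds

1. Pick an arbitrary point p and count the points, including p itself, that lie within the given cluster radius (epsilon) of p.
2. If that circle contains at least minPts points, treat p as a core point and group the points inside the circle into one cluster.
3. If the circle contains fewer than minPts points, skip p for now.
4. Repeat steps 1 to 3 for every point. If a new point p' becomes a core point and already belongs to an existing cluster (the one whose core point is p), the two clusters are considered connected and are merged into one.
5. Once every point has been processed, any point that cannot be placed in a cluster no matter which point is used as the center is treated as a noise point. Points that belong to a cluster but are not core points are called border points.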
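
The procedure above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not how `sklearn.cluster.DBSCAN` is implemented: the helper name `dbscan_sketch` and the toy points are made up for this example, and the full pairwise distance matrix makes it suitable only for small inputs.

```python
import numpy as np

def dbscan_sketch(points, eps, min_pts):
    """Hypothetical helper: returns (labels, is_core); label -1 means noise."""
    n = len(points)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    # Step 1: neighbors within eps of each point (a point counts itself).
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    # Step 2: core points have at least min_pts points in their neighborhood.
    is_core = np.array([len(nb) >= min_pts for nb in neighbors])
    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        # Step 3: skip non-core points for now, and cores that are already clustered.
        if not is_core[i] or labels[i] != -1:
            continue
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    # Step 4: a core point reached from this cluster keeps
                    # expanding it, which merges connected neighborhoods.
                    if is_core[q]:
                        stack.append(q)
        cluster += 1
    # Step 5: points still labeled -1 are noise; clustered non-core points are border points.
    return labels, is_core

# Toy data (assumed for illustration): two tight groups and one far-away point.
toy = np.array([[0, 0], [0, 0.1], [0.1, 0],
                [5, 5], [5, 5.1], [5.1, 5],
                [20, 20]], dtype=float)
labels, is_core = dbscan_sketch(toy, eps=0.3, min_pts=3)
print(labels)   # [ 0  0  0  1  1  1 -1]
```

The last point has no neighbors other than itself, so it never joins a cluster and stays labeled as noise.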

DBSCAN example

from sklearn.cluster import DBSCAN

fig = plt.figure()
ax = fig.add_subplot(1,1,1)
color = ['red','blue','green','brown','purple']
color_dict = {i:color[i] for i in range(len(color))}

epsilon, minPts = 0.2, 3
circle_dbscan = DBSCAN(eps=epsilon, min_samples=minPts)
circle_dbscan.fit(circle_points)
n_cluster = max(circle_dbscan.labels_)+1

print(f'# of cluster:{n_cluster}')
print(f'DBSCAN y-hat: {circle_dbscan.labels_}')

for cluster in range(n_cluster):
    cluster_sub_points = circle_points[circle_dbscan.labels_==cluster]
    ax.scatter(cluster_sub_points[:,0], cluster_sub_points[:,1],c=color_dict[cluster],label=f'{cluster}')
ax.set_title('DBSCAN on circle data')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.legend()
ax.grid()

# Plot the moon-shaped data, repeating the same steps as above
fig = plt.figure()
ax= fig.add_subplot(1, 1, 1)
color_dict = {0: 'red', 1: 'blue', 2: 'green', 3:'brown',4:'purple'} # maps the n-th cluster to the color used to plot it

epsilon, minPts = 0.4, 3
moon_dbscan = DBSCAN(eps=epsilon, min_samples=minPts)
moon_dbscan.fit(moon_points)
n_cluster = max(moon_dbscan.labels_)+1

print(f'# of cluster: {n_cluster}')
print(f'DBSCAN Y-hat: {moon_dbscan.labels_}')

for cluster in range(n_cluster):
    cluster_sub_points = moon_points[moon_dbscan.labels_ == cluster]
    ax.scatter(cluster_sub_points[:, 0], cluster_sub_points[:, 1], c=color_dict[cluster], label='cluster_{}'.format(cluster))
ax.set_title('DBSCAN on moon data')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.legend()
ax.grid()

# Plot the diagonal-shaped data, repeating the same steps as above
fig = plt.figure()
ax= fig.add_subplot(1, 1, 1)
color_dict = {0: 'red', 1: 'blue', 2: 'green', 3:'brown',4:'purple'} # maps the n-th cluster to the color used to plot it

epsilon, minPts = 0.7, 3
diag_dbscan = DBSCAN(eps=epsilon, min_samples=minPts)
diag_dbscan.fit(diag_points)
n_cluster = max(diag_dbscan.labels_)+1

print(f'# of cluster: {n_cluster}')
print(f'DBSCAN Y-hat: {diag_dbscan.labels_}')

for cluster in range(n_cluster):
    cluster_sub_points = diag_points[diag_dbscan.labels_ == cluster]
    ax.scatter(cluster_sub_points[:, 0], cluster_sub_points[:, 1], c=color_dict[cluster], label='cluster_{}'.format(cluster))
ax.set_title('DBSCAN on diagonal shaped data')
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.legend()
ax.grid()

Comparing the performance of DBSCAN and K-means

As the amount of data grows, the running time of DBSCAN increases sharply.

# Compare the running time of DBSCAN and K-means
import time

n_samples= [100, 500, 1000, 2000, 5000, 7500, 10000, 20000, 30000, 40000, 50000]

kmeans_time = []
dbscan_time = []
x = []
for n_sample in n_samples:
    dummy_circle, dummy_labels = make_circles(n_samples=n_sample, factor=0.5, noise=0.01) # generate data with a circular distribution

    # Measure the K-means running time
    kmeans_start = time.time()
    circle_kmeans = KMeans(n_clusters=2)
    circle_kmeans.fit(dummy_circle)
    kmeans_end = time.time()

    # Measure the DBSCAN running time
    dbscan_start = time.time()
    epsilon, minPts = 0.2, 3
    circle_dbscan = DBSCAN(eps=epsilon, min_samples=minPts)
    circle_dbscan.fit(dummy_circle)
    dbscan_end = time.time()

    x.append(n_sample)
    kmeans_time.append(kmeans_end-kmeans_start)
    dbscan_time.append(dbscan_end-dbscan_start)
    print("# of samples: {} / Elapsed time of K-means: {:.5f}s / DBSCAN: {:.5f}s".format(n_sample, kmeans_end-kmeans_start, dbscan_end-dbscan_start))

# Plot the elapsed time of K-means and DBSCAN
fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)
ax.scatter(x, kmeans_time, c='red', marker='x', label='K-means elapsed time')
ax.scatter(x, dbscan_time, c='green', label='DBSCAN elapsed time')
ax.set_xlabel('# of samples')
ax.set_ylabel('time(s)')
ax.legend()
ax.grid()

#result
# of samples: 100 / Elapsed time of K-means: 0.48583s / DBSCAN: 0.00165s
# of samples: 500 / Elapsed time of K-means: 0.65872s / DBSCAN: 0.00329s
# of samples: 1000 / Elapsed time of K-means: 1.22107s / DBSCAN: 0.00851s
# of samples: 2000 / Elapsed time of K-means: 0.62575s / DBSCAN: 0.03907s
# of samples: 5000 / Elapsed time of K-means: 0.94745s / DBSCAN: 0.07846s
# of samples: 7500 / Elapsed time of K-means: 2.15524s / DBSCAN: 0.19562s
# of samples: 10000 / Elapsed time of K-means: 0.82285s / DBSCAN: 0.18656s
# of samples: 20000 / Elapsed time of K-means: 0.88586s / DBSCAN: 0.45408s
# of samples: 30000 / Elapsed time of K-means: 2.08477s / DBSCAN: 1.01110s
# of samples: 40000 / Elapsed time of K-means: 0.99872s / DBSCAN: 1.48223s
# of samples: 50000 / Elapsed time of K-means: 1.28099s / DBSCAN: 2.22198s

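Dimensionality reduction (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction algorithm and one of the representative methods of unsupervised learning.

Why reduce dimensions?

In a sea of information, we need a way to find the elements that matter. Dimensionality reduction is used for feature extraction: it tells us which features represent the data best.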

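PCA is a method for finding the principal components of a data distribution.
Here, a principal component is the direction vector along which the variance of the data is largest.
PCA finds mutually orthogonal basis vectors (the axes of the high-variance directions) that preserve as much of the data's variance as possible, and projects the high-dimensional space onto a lower-dimensional one.
Also, rather than selecting the important ones among the existing features,
PCA forms linear combinations of the existing features.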
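
As a rough illustration of this description, PCA can be sketched with an eigendecomposition of the covariance matrix. The helper name `pca_sketch` and the synthetic data are assumptions for this example; scikit-learn's `PCA`, used later in this post, is a separate implementation.

```python
import numpy as np

def pca_sketch(X, n_components):
    """Hypothetical helper: returns (projected, components, explained_ratio)."""
    # Center the data so that variance is measured around the mean.
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # eigh returns eigenvalues in ascending order with orthonormal eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Each row of components is one principal direction (a basis vector).
    components = eigvecs[:, order].T
    # Projection: every new coordinate is a linear combination of the features.
    projected = X_centered @ components.T
    explained_ratio = eigvals[order] / eigvals.sum()
    return projected, components, explained_ratio

# Synthetic data (assumed): points scattered along the direction (1, 2).
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + rng.normal(scale=0.1, size=200)])
projected, components, explained_ratio = pca_sketch(X, n_components=1)
print(components[0], explained_ratio)
```

Because the data lies almost on a line, the first principal direction should be nearly parallel to (1, 2) and should account for almost all of the variance. The sign of the direction vector is arbitrary.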

Orthogonal, basis, projection, linear combination?

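Basis

A set of vectors that, following the directions of the arrows, can serve as a new coordinate system

[ Basis found by principal component analysis on an elliptical data distribution (source: https://en.wikipedia.org/wiki/Principal_component_analysis) ]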

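Projection

Projecting data that lives on the X-Y-Z axes onto the X-Y or Y-Z plane means ignoring the Z axis or the X axis, respectively. The ignored coordinate causes a loss of information, so to minimize that loss we look for the directions that carry relatively more important information.

In the figure below, the X-Y projection represents the data better (mathematically, the variance along the Z axis is small).

[ Source: https://www.geeksforgeeks.org/dimensionality-reduction/ ]

PCA does not reduce dimensions along the given coordinate axes. Instead, it finds the basis directions along which the variance is large, keeps only those, and discards the less important basis directions. The important basis directions found this way are called principal component directions, or PC axes.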
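
The point about projection and information loss can be checked numerically. The sketch below uses synthetic 3D data whose spread along Z is deliberately small (the scales 3.0, 2.0 and 0.1 are assumptions chosen for illustration) and compares how much of the total variance survives when one axis is dropped.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic X-Y-Z data: wide along X and Y, very narrow along Z.
data = rng.normal(size=(1000, 3)) * np.array([3.0, 2.0, 0.1])

var = data.var(axis=0)          # variance along each coordinate axis
total = var.sum()

retained_xy = (var[0] + var[1]) / total   # project onto X-Y (Z ignored)
retained_yz = (var[1] + var[2]) / total   # project onto Y-Z (X ignored)

print(f"X-Y projection keeps {retained_xy:.1%} of the variance")
print(f"Y-Z projection keeps {retained_yz:.1%} of the variance")
```

Dropping the low-variance Z axis loses very little, while dropping X discards most of the spread. PCA applies the same idea to directions that need not coincide with the original axes.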

PCA example

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()

# 0: malignant tumor, 1: benign tumor
cancer_X, cancer_y = cancer.data, cancer['target']
train_X, test_X, train_y, test_y = train_test_split(cancer_X, cancer_y,test_size=0.1, random_state=10)
print("Total number of patients: {}".format(len(cancer_X)))
print("Patients in the train dataset: {}".format(len(train_X)))
print("Patients in the test dataset: {}".format(len(test_X)))
cancer_df = pd.DataFrame(cancer_X, columns=cancer['feature_names'])
cancer_df.head()

#result
Total number of patients: 569
Patients in the train dataset: 512
Patients in the test dataset: 57

The effect of normalization in PCA

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn import svm
from sklearn.metrics import accuracy_score
from collections import Counter

color_dict ={0:'red', 1:'blue', 2:'red', 3:'blue'}
target_dict = {0:'mali_train', 1:'be_train',2:'mali_test',3:'be_test'}

train_X_ = StandardScaler().fit_transform(train_X) # standardize the data, since each feature has a different range
train_df = pd.DataFrame(train_X_, columns=cancer['feature_names'])
pca = PCA(n_components=2) # use 2 basis direction vectors
pc = pca.fit_transform(train_df)

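StandardScaler().fit_transform() is applied here because the value ranges differ from column to column.
For example, the first two columns, 'mean radius' and 'mean texture', have different ranges, so even if both happen to hold the value 5, they should not be treated as having the same influence.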
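
To make the effect of standardization concrete, here is a small NumPy sketch of the computation that `StandardScaler` performs per column: subtract the column mean and divide by the column's population standard deviation. The two-column array is made up for illustration and is not taken from the breast cancer data.

```python
import numpy as np

# Hypothetical features on very different scales (tens vs. thousands).
X = np.array([[10.0, 1000.0],
              [20.0, 3000.0],
              [30.0, 5000.0]])

mean = X.mean(axis=0)
std = X.std(axis=0)            # ddof=0, the same convention StandardScaler uses
X_scaled = (X - mean) / std

print(X_scaled)
```

After scaling, both columns have mean 0 and standard deviation 1, so a large raw range no longer lets one feature dominate the variance that PCA looks for.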

The test set is processed in the same way.

test_X_ = StandardScaler().fit_transform(test_X)
test_df = pd.DataFrame(test_X_, columns=cancer['feature_names'])
pca_test = PCA(n_components=2)
pc_test = pca_test.fit_transform(test_df)
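
The code above fits a new scaler and a new PCA on the test set. A common alternative, sketched here, is to fit both on the training set only and apply them to the test set with `transform`, so that train and test points are expressed on the same axes. This is a sketch of that alternative, not the code that produced the results reported in this post; the variable names `scaler`, `pca_shared`, `pc_train_shared` and `pc_test_shared` are assumptions for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.1, random_state=10)

# Fit the scaler and the PCA on the training data only.
scaler = StandardScaler().fit(train_X)
pca_shared = PCA(n_components=2).fit(scaler.transform(train_X))

# Reuse the fitted objects for both splits (transform only, no refitting).
pc_train_shared = pca_shared.transform(scaler.transform(train_X))
pc_test_shared = pca_shared.transform(scaler.transform(test_X))

print(pc_train_shared.shape, pc_test_shared.shape)
```

With this variant the test set contributes nothing to the scaling statistics or the principal directions, which is usually what an evaluation on held-out data intends.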

# Helper that draws the decision boundary of a trained classifier
def plot_decision_boundary(X, clf, ax):
    h = .02 #Step Size 
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min,y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax.contour(xx, yy, Z, cmap='Blues')

# Train a classifier on the PCA-transformed train data, using a Support Vector Machine (SVM)
clf = svm.SVC(kernel = 'rbf', gamma=0.5, C=0.8) # the classifier is an SVM
clf.fit(pc, train_y)

# Train an SVM on the original data without PCA
clf_orig = svm.SVC(kernel = 'rbf', gamma=0.5, C=0.8)
clf_orig.fit(train_df, train_y)

#result
SVC(C=0.8, gamma=0.5)

fig = plt.figure()
ax = fig.add_subplot(1,1,1)
# Draw the SVM decision boundary between malignant and benign
plot_decision_boundary(pc, clf, ax)

# Plot the train data
for cluster in range(2):
    sub_cancer_points = pc[train_y == cluster]
    ax.scatter(sub_cancer_points[:, 0], sub_cancer_points[:, 1],edgecolor=color_dict[cluster] ,c='none', label=target_dict[cluster] )
# Plot the test data
for cluster in range(2):
    sub_cancer_points = pc_test[test_y == cluster]
    ax.scatter(sub_cancer_points[:, 0], sub_cancer_points[:,1], marker='x', c=color_dict[cluster+2], label=target_dict[cluster+2])
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_title('PCA-Breast cancer dataset')
ax.legend()
ax.grid()

# Scores
pca_test_accuracy_dict = Counter(clf.predict(pc_test) == test_y)
orig_test_accuracy_dict = Counter(clf_orig.predict(test_df) == test_y)

print("Test dataset accuracy with PCA: {}/{} patients => {:.3f}".format(pca_test_accuracy_dict[True], sum(pca_test_accuracy_dict.values()), clf.score(pc_test, test_y)))
print("Test dataset accuracy without PCA: {}/{} patients => {:.3f}".format(orig_test_accuracy_dict[True], sum(orig_test_accuracy_dict.values()), clf_orig.score(test_df, test_y)))

#result
Test dataset accuracy with PCA: 54/57 patients => 0.947
Test dataset accuracy without PCA: 43/57 patients => 0.754

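Dimensionality reduction (2): t-SNE (t-distributed Stochastic Neighbor Embedding)

t-SNE is an algorithm widely used for visualization. We live in a three-dimensional world, and data with more dimensions is hard to draw or to perceive by eye, so it has to be brought down to 1 to 3 dimensions to be understood visually.

PCA preserves information well mainly when the data has a linear distribution,

whereas t-SNE keeps points that are close in the original space close in the reduced space as well.

t-SNE example

We explore t-SNE using the MNIST handwritten digits.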

from sklearn.datasets import fetch_openml

mnist = fetch_openml("mnist_784", version=1)
X = mnist.data / 255.0
y = mnist.target
print('X shape',X.shape)
print('Y shape',y.shape)

#result
X shape (70000, 784)
Y shape (70000,)

n_image = X.shape[0]
n_image_pixel = X.shape[1]

pixel_columns = [f'pixel{i}' for i in range(1, n_image_pixel + 1)] # list of column names

import pandas as pd
df = pd.DataFrame(X, columns=pixel_columns)
df['y'] = y
df['label'] = df['y'].apply(lambda i: str(i))
X, y = None, None

Sampling 10,000 data points

import numpy as np
# Random seed
np.random.seed(30)

# Store a permutation array that shuffles the order of the images
rndperm = np.random.permutation(n_image)

# Take 10,000 of the shuffled images and store them in df_subset
n_image_sample = 10000
random_idx = rndperm[:n_image_sample]
df_subset = df.loc[rndperm[:n_image_sample],:].copy()
df_subset.shape

#result
(10000, 786)

We check the images using Matplotlib subplots.

%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

plt.gray()
fig = plt.figure(figsize=(10,6))
n_img_sample = 15
width,height = 28,28

for i in range(0, n_img_sample):
    row = df_subset.iloc[i]
    ax = fig.add_subplot(3,5,i+1, title=f"Digit: {row['label']}")
    ax.matshow(row[pixel_columns]
              .values.reshape((width,height))
              .astype(float))
plt.show()

Dimensionality reduction with PCA

from sklearn.decomposition import PCA

n_dimension = 2
pca = PCA(n_components=n_dimension)

pca_result = pca.fit_transform(df_subset[pixel_columns].values)
df_subset['pca-one'] = pca_result[:,0] # first dimension of the reduced result
df_subset['pca-two'] = pca_result[:,1] # second dimension of the reduced result

print("pca_result shape{}".format(pca_result.shape))

#result
pca_result shape(10000, 2)
#---------------------------

print(f"pca-1: {pca.explained_variance_ratio_[0]:.1%}")
print(f"pca-2: {pca.explained_variance_ratio_[1]:.1%}")


#result
pca-1: 9.6%
pca-2: 7.3%

Reducing 784 dimensions to 2 keeps 16.9% of the total variance (9.6% + 7.3%).
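
A related question is how many components are needed to keep a given share of the variance. The sketch below computes cumulative explained variance with a NumPy SVD on synthetic data. The helper name `components_needed`, the 90% target, and the data are assumptions for illustration and say nothing about MNIST itself.

```python
import numpy as np

def components_needed(X, target=0.9):
    """Hypothetical helper: smallest number of components reaching `target`."""
    X_centered = X - X.mean(axis=0)
    # Squared singular values are proportional to the variance per component.
    s = np.linalg.svd(X_centered, compute_uv=False)
    ratio = s**2 / np.sum(s**2)
    cumulative = np.cumsum(ratio)
    # Index of the first cumulative value that reaches the target, plus one.
    n = int(np.searchsorted(cumulative, target) + 1)
    return n, cumulative

# Synthetic data (assumed): one dominant direction and two minor ones.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 3)) * np.array([10.0, 1.0, 0.1])
n, cumulative = components_needed(X_demo, target=0.9)
print(n, cumulative)
```

With scikit-learn, the same cumulative curve can be obtained from a fitted `PCA` object via `np.cumsum(pca.explained_variance_ratio_)`.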

Visualizing the result:

plt.figure(figsize=(10,6))
sns.scatterplot(
    x="pca-one", y="pca-two",
    hue="y",
    palette=sns.color_palette("hls", 10),
    data=df_subset,
    legend='full',
    alpha=0.4)

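Many of the digits are mixed together, but points with similar vector values do appear close to each other.

Dimensionality reduction with t-SNE

Official scikit-learn documentation: the TSNE module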

from sklearn.manifold import TSNE

data_subset = df_subset[pixel_columns].values
n_dimension = 2
tsne = TSNE(n_components=n_dimension)
tsne_results = tsne.fit_transform(data_subset)

print('done') # this takes a while