-Day 8- Data Visualization

Today's node was about data visualization.

The library used most of the time is matplotlib.

Creating a figure object

import matplotlib.pyplot as plt
fig = plt.figure(figsize=(5,2))
ax1 = fig.add_subplot(1,1,1)

This produces an empty plot, because the code above only creates a kind of canvas that the visualization will be drawn on.

Creating multiple objects (subplot)

fig = plt.figure()
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,4)

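Subplot positions are numbered row by row: top-left is 1, top-right is 2, bottom-left is 3 and bottom-right is 4. The code above fills positions 1, 2 and 4, so the bottom-left cell stays empty.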
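To check the numbering, here is a minimal sketch that fills all four cells of a 2x2 grid and records the row and column of each index. The `positions` dictionary is only an illustrative helper and is not part of the original node.

```python
import matplotlib.pyplot as plt

fig = plt.figure()
positions = {}  # subplot index -> (row, column), both zero-based

for index in range(1, 5):
    ax = fig.add_subplot(2, 2, index)
    # Write the index in the middle of each cell so the layout is visible
    ax.text(0.5, 0.5, str(index), ha='center', va='center', fontsize=20)
    spec = ax.get_subplotspec()
    positions[index] = (spec.rowspan.start, spec.colspan.start)

print(positions)  # {1: (0, 0), 2: (0, 1), 3: (1, 0), 4: (1, 1)}
```

Index 3 lands in the second row, first column (bottom-left), and index 4 in the bottom-right.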

A simple visualization example

subject = ['English', 'Math', 'Korean', 'Science', 'Computer']
points = [40, 90, 50, 60, 100]
fig = plt.figure()
ax1 = fig.add_subplot(1,1,1)
ax1.bar(subject,points)
plt.xlabel('Subject')
plt.ylabel('Points')
plt.title("Yuna's Test Result")

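Plot elements such as the bars are added through the Axes object (ax1) created from fig, while labels and the title are set here through plt, which applies them to the current Axes.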
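As a comparison, the same chart can be written with Axes methods only. This is a minimal sketch, not the LMS version: `set_xlabel`, `set_ylabel` and `set_title` are the Axes counterparts of `plt.xlabel`, `plt.ylabel` and `plt.title`.

```python
import matplotlib.pyplot as plt

subject = ['English', 'Math', 'Korean', 'Science', 'Computer']
points = [40, 90, 50, 60, 100]

fig, ax1 = plt.subplots()          # one figure with a single Axes
bars = ax1.bar(subject, points)    # the bars are drawn on the Axes

# Labels and title set on the Axes itself instead of through plt
ax1.set_xlabel('Subject')
ax1.set_ylabel('Points')
ax1.set_title("Yuna's Test Result")
```

With several subplots this style makes it explicit which Axes each label belongs to.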

Trying out the annotation feature

# The code that loads the DataFrame is omitted (price is presumably a time-indexed Series)
from datetime import datetime
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
price.plot(ax=ax, style='black')
plt.ylim([1600,2200])
plt.xlim(['2019-05-01','2020-03-01'])

# annotation 
important_data = [(datetime(2019, 6, 3),"Low Price"),(datetime(2020,2,19),"Peak Price")]
for d, label in important_data:
    ax.annotate(label, xy=(d,price.asof(d)+10),
               xytext=(d,price.asof(d)+100),
               arrowprops=dict(facecolor='red'))
plt.grid()
ax.set_title('StockPrice')
plt.show()

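This visualizes the time series stored in the DataFrame, with the annotations drawn on top.

The LMS specified the points to annotate by hand for the sake of the example, but in real use it would probably be better to derive them from the column's .max() and .min().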
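A minimal sketch of that idea follows. It assumes a synthetic random-walk `price` Series, since the real data is not shown here. `idxmin()` and `idxmax()` return the dates of the lowest and highest values, so the annotation list no longer has to be typed by hand.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in for the omitted price data (assumption, not the LMS data)
dates = pd.date_range('2019-05-01', '2020-03-01', freq='D')
rng = np.random.default_rng(0)
price = pd.Series(1900 + rng.normal(0, 10, len(dates)).cumsum(), index=dates)

fig, ax = plt.subplots()
ax.plot(price.index, price.values, color='black')

# Dates of the minimum and maximum, found from the data itself
important_data = [(price.idxmin(), 'Low Price'), (price.idxmax(), 'Peak Price')]
for d, label in important_data:
    ax.annotate(label,
                xy=(d, price.loc[d] + 10),
                xytext=(d, price.loc[d] + 100),
                arrowprops=dict(facecolor='red'))
ax.grid()
ax.set_title('StockPrice')
```

The y-limits are left to matplotlib here because the synthetic values do not match the 1600 to 2200 range of the original example.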

Line styles and color control

import numpy as np
x = np.linspace(0, 10, 100) # handy for building a 1-D array
# (start, stop, num)
plt.plot(x,np.sin(x),'o')
plt.plot(x,np.cos(x),'--',color='black')
plt.show()

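Passing a format string draws a marker or line that resembles the characters themselves, which is intuitive.

I had not seen this style before, so it is still confusing to use.

Passing English words, as with color values, feels more familiar.

In this respect plotly actually feels more convenient.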
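For anyone who prefers words over symbols, here is a small sketch that draws one line with a format string and another with the equivalent keyword arguments. The format string `'ro--'` combines a color (`r`), a marker (`o`) and a line style (`--`).

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
fig, ax = plt.subplots()

# Format-string shorthand: red, circle markers, dashed line
line_fmt, = ax.plot(x, np.sin(x), 'ro--')

# The same styling spelled out with keyword arguments
line_kw, = ax.plot(x, np.cos(x), color='red', marker='o', linestyle='--')
```

Both lines end up with the same marker and line style, so the keyword form can be used wherever the shorthand feels cryptic.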

plt.subplot(2,1,1)
plt.plot(x,np.sin(x),'o',color='orange')
plt.subplot(2,1,2)
plt.plot(x,np.cos(x),'orange')

Simple plots can be drawn directly, without creating a fig object.

As in the earlier example, multiple plots (subplots) are possible.

Plotting directly from a DataFrame

import pandas as pd
fig,axes = plt.subplots(2,1)
data = pd.Series(np.random.rand(5),index=list('abcde'))
data.plot(kind='bar',ax=axes[0],color='blue',alpha=1)
data.plot(kind='barh',ax=axes[1],color='red',alpha=0.3)

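The default design is basic, but plotting straight from a DataFrame is very intuitive. Being able to pass in matplotlib Axes objects to draw several plots together, and to switch the plot type with kind, are the advantages.

Good for basic data analysis.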

df = pd.DataFrame(np.random.rand(6,4),columns=['a','b','c','d'])
# Builds 6 rows of random numbers with 4 values (columns) each
df.plot(kind='line')

The plot covers every column.

When the DataFrame also contains other data, selecting columns first should make it possible to plot only the part that is needed.
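A minimal sketch of plotting only some columns, assuming the same kind of random DataFrame as above. Selecting with a list of column names before calling `.plot` limits the chart to those columns.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.rand(6, 4), columns=['a', 'b', 'c', 'd'])

# Only columns 'a' and 'c' are drawn; 'b' and 'd' are left out
ax = df[['a', 'c']].plot(kind='line')
labels = [line.get_label() for line in ax.get_lines()]
print(labels)  # ['a', 'c']
```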

A taste of Seaborn

Using a dataset that ships with Seaborn

import pandas as pd
import seaborn as sns
tips = sns.load_dataset("tips")
tips.shape # (244, 7)

# Check the first 5 rows
tips.head()

# Basic summary statistics
tips.describe()

DataFrame groupby

# Returns a grouped object holding the 'tip' values split by the 'sex' column
grouped = tips['tip'].groupby(tips['sex'])
sex = dict(grouped.mean())
sex
# {'Male': 3.0896178343949043, 'Female': 2.833448275862069}
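The same split-then-aggregate step can be followed by hand on a tiny frame. This is a sketch with made-up values, not the tips dataset: each group's mean is simply the average of the rows that share a key.

```python
import pandas as pd

# Toy data (hypothetical values chosen so the means are easy to verify)
toy = pd.DataFrame({
    'sex': ['Male', 'Female', 'Male', 'Female'],
    'tip': [3.0, 2.0, 5.0, 4.0],
})

means = toy.groupby('sex')['tip'].mean()
print(means.to_dict())  # {'Female': 3.0, 'Male': 4.0}
```

`toy.groupby('sex')['tip']` is an equivalent spelling of the `toy['tip'].groupby(toy['sex'])` form used above.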

Comparing matplotlib and seaborn plots

x = list(sex.keys())
y = list(sex.values())

import matplotlib.pyplot as plt

plt.bar(x=x, height = y)
plt.ylabel('tip[$]')
plt.title('Tip by Sex')

sns.barplot(data=tips, x='sex', y='tip')

The seaborn code is comparatively short, and extra effects such as error bars are added automatically.

Combining matplotlib + seaborn

plt.figure(figsize=(10,6))
sns.barplot(data=tips,x='sex',y='tip')
plt.ylim(0,4)
plt.title('Tip by sex')

The earlier example described plt as a kind of canvas.

So it is possible to use plt only for the canvas and let another library do the drawing.
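One way to make the canvas idea explicit is to create the Axes with matplotlib and hand them to seaborn through the `ax` argument. The sketch below uses a tiny made-up frame instead of the tips dataset, so the values are illustrative only.

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Tiny hand-made stand-in for the tips data (assumption, not the real dataset)
toy = pd.DataFrame({
    'sex': ['Male', 'Male', 'Female', 'Female'],
    'tip': [3.0, 4.0, 2.0, 3.0],
})
means = toy.groupby('sex')['tip'].mean()

# matplotlib provides the canvas: one figure with two Axes side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

# seaborn draws on the left Axes
sns.barplot(data=toy, x='sex', y='tip', ax=axes[0])
axes[0].set_title('Tip by sex (seaborn)')

# plain matplotlib draws on the right Axes
axes[1].bar(list(means.index), list(means.values))
axes[1].set_title('Tip by sex (matplotlib)')
```

Because both plots live on ordinary Axes, calls such as `set_ylim` or `set_title` work the same way on either one.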

Drawing histograms

matplotlib

mu1, mu2, sigma = 100, 130, 15
x1 = mu1 + sigma*np.random.randn(10000)
x2 = mu2 + sigma*np.random.randn(10000)
fig=plt.figure()
ax1 = fig.add_subplot(1,1,1)

patches = ax1.hist(x1, bins=50, density=False) # bins sets the number of intervals
patches = ax1.hist(x2, bins=50, density=False, alpha=0.5)
ax1.xaxis.set_ticks_position('bottom')
ax1.yaxis.set_ticks_position('left')

plt.xlabel('Bins')
plt.ylabel('Number of Values in Bin')
ax1.set_title('Two Frequency Distributions')
plt.show()
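Although the code above stores the result in a variable called patches, `ax.hist` actually returns three things: the counts per bin, the bin edges, and the drawn bars. A minimal sketch with one of the two distributions (seeded here only so the run is repeatable):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x1 = 100 + 15 * rng.standard_normal(10000)

fig, ax = plt.subplots()
# Unpack the three return values of hist
counts, bin_edges, patches = ax.hist(x1, bins=50, density=False)

print(len(counts), len(bin_edges), len(patches))  # 50 51 50
print(int(counts.sum()))                          # 10000
```

With density=False the counts add up to the number of samples, and 50 bins always need 51 edges.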

seaborn

sns.histplot(tips['total_bill'],label="total_bill")
sns.histplot(tips['tip'],label="tip").legend()

pandas

tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips['tip_pct'].hist(bins=50)

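df = sns.load_dataset("flights")  # assumption: df here is seaborn's flights dataset (year, month, passengers)
sns.barplot(data=df, x='year',y='passengers')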

seaborn's default colors look a bit nicer.

The short code is another plus.

sns.pointplot(data=df,x='year',y='passengers')

sns.lineplot(data=df, x='year',y='passengers')

sns.lineplot(data=df,x='year',y='passengers',hue='month',palette='ch:.50')
plt.legend(bbox_to_anchor=(1.03,1),loc=2)

hue adds a per-month breakdown, and the palette option makes the colors more polished.

sns.histplot(df['passengers'])

sns.heatmap(pivot)  # pivot is the pivoted DataFrame built with df.pivot below

DataFrame pivot

pivot = df.pivot(index='year', columns='month',values='passengers')
pivot

pivot reshapes an existing DataFrame around whatever keys you choose.

Setting index, columns and values returns a new DataFrame pivoted on those keys.
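To see what the reshaping does, here is a minimal sketch on a four-row long-form table. The numbers are toy values picked for illustration, not rows from the dataset used above.

```python
import pandas as pd

# Long form: one row per (year, month) pair (toy values)
long_df = pd.DataFrame({
    'year': [2000, 2000, 2001, 2001],
    'month': ['Jan', 'Feb', 'Jan', 'Feb'],
    'passengers': [10, 20, 30, 40],
})

# Wide form: one row per year, one column per month
pivot = long_df.pivot(index='year', columns='month', values='passengers')
print(pivot.shape)              # (2, 2)
print(pivot.loc[2001, 'Feb'])   # 40
```

Each unique value of `index` becomes a row, each unique value of `columns` becomes a column, and `values` fills the cells, which is the shape a heatmap expects.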

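I had previously studied plotly in depth, but most examples and implementations in Korea use matplotlib, which makes studying somewhat tricky.

Still, as seaborn shows, other libraries seem better for basic plots, and the amount of code looks considerably smaller.

It may feel like a burden, but using the official documentation to work with other libraries should lead to good visualizations.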
