[R] Pivoting data with the tidyr package (pivot_longer / pivot_wider)

Last updated: 2021-04-16 | 18 minute read | R

Cutting and reshaping data into the form you need is an essential step toward a smooth analysis. In this post we look at pivoting, a preprocessing technique for converting data between long form and wide form.

Pivot

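Let's first pin down what the word pivot means. If you have used Excel, you have probably come across the term pivot table at least once. Excel's pivot table feature lets you reshape data in many ways, and from that behavior we can guess that the word pivot carries a sense of transformation.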

[Figure: dictionary definition of PIVOT. Source: Naver English Dictionary]

Like the dictionary's verb sense, "to turn around an axis", pivoting data means reshaping it around a particular variable, either stretching it downward into a long form or sideways into a wide form.

Long form & Wide form

So how do we pivot data? R is well suited to data processing, and several packages provide functions for pivoting. We will use the gapminder data to see how pivoting converts between long form and wide form.

library(gapminder)
library(dplyr)  # provides %>%, filter(), select(), arrange(), and glimpse()

gapminder %>% head()
#> # A tibble: 6 x 6
#>   country     continent  year lifeExp      pop gdpPercap
#>   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#> 1 Afghanistan Asia       1952    28.8  8425333      779.
#> 2 Afghanistan Asia       1957    30.3  9240934      821.
#> 3 Afghanistan Asia       1962    32.0 10267083      853.
#> 4 Afghanistan Asia       1967    34.0 11537966      836.
#> 5 Afghanistan Asia       1972    36.1 13079460      740.
#> 6 Afghanistan Asia       1977    38.4 14880372      786.

gapminder %>% tail()
#> # A tibble: 6 x 6
#>   country  continent  year lifeExp      pop gdpPercap
#>   <fct>    <fct>     <int>   <dbl>    <int>     <dbl>
#> 1 Zimbabwe Africa     1982    60.4  7636524      789.
#> 2 Zimbabwe Africa     1987    62.4  9216418      706.
#> 3 Zimbabwe Africa     1992    60.4 10704340      693.
#> 4 Zimbabwe Africa     1997    46.8 11404948      792.
#> 5 Zimbabwe Africa     2002    40.0 11926563      672.
#> 6 Zimbabwe Africa     2007    43.5 12311143      470.

gapminder %>% summary()
#>         country        continent        year         lifeExp     
#>  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
#>  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
#>  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
#>  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
#>  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
#>  Australia  :  12                  Max.   :2007   Max.   :82.60  
#>  (Other)    :1632                                                
#>       pop              gdpPercap       
#>  Min.   :6.001e+04   Min.   :   241.2  
#>  1st Qu.:2.794e+06   1st Qu.:  1202.1  
#>  Median :7.024e+06   Median :  3531.8  
#>  Mean   :2.960e+07   Mean   :  7215.3  
#>  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
#>  Max.   :1.319e+09   Max.   :113523.1  
#> 

gapminder %>% glimpse()
#> Rows: 1,704
#> Columns: 6
#> $ country   <fct> "Afghanistan", "Afghanistan", "Afghanistan", "Afghanistan", ~
#> $ continent <fct> Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, Asia, ~
#> $ year      <int> 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, ~
#> $ lifeExp   <dbl> 28.801, 30.332, 31.997, 34.020, 36.088, 38.438, 39.854, 40.8~
#> $ pop       <int> 8425333, 9240934, 10267083, 11537966, 13079460, 14880372, 12~
#> $ gdpPercap <dbl> 779.4453, 820.8530, 853.1007, 836.1971, 739.9811, 786.1134, ~

The gapminder dataset is a kind of time series that records changes in per capita GDP, average life expectancy, and population for 142 countries around the world. It is provided in long form by default.

country	continent	year	lifeExp	pop	gdpPercap
Afghanistan	Asia	1952	28.801	8425333	779.4453
Afghanistan	Asia	1957	30.332	9240934	820.8530
Afghanistan	Asia	1962	31.997	10267083	853.1007
Afghanistan	Asia	1967	34.020	11537966	836.1971
Afghanistan	Asia	1972	36.088	13079460	739.9811
Afghanistan	Asia	1977	38.438	14880372	786.1134
Afghanistan	Asia	1982	39.854	12881816	978.0114
Afghanistan	Asia	1987	40.822	13867957	852.3959
Afghanistan	Asia	1992	41.674	16317921	649.3414
Afghanistan	Asia	1997	41.763	22227415	635.3414

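Looking at the long form, each variable column holds values of the same kind. The advantage of long form data is that it is easy to work with. It is not the most convenient layout for a human to read, but it suits computation (especially in R), so it is useful when visualizing data or building machine learning models.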

# long form data is easy to work with
library(ggplot2)
library(plotly)

g <- gapminder %>% 
  filter(year == 2007) %>% 
  ggplot(aes(x = gdpPercap, y = lifeExp, color = continent, size = pop, ids = country)) +
  geom_point(alpha = 0.5) +
  ggtitle("Life expectancy versus GDP, 2007") +
  xlab("GDP per capita (US$)") +
  ylab("Life expectancy (years)") +
  scale_color_discrete(name = "Continent") +
  scale_size_continuous(name = "Population") + 
  theme_bw()

ggplotly(g)
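
As a further illustration of why long form is convenient for computation, here is a minimal sketch of a grouped summary. It is not part of the original analysis; the object name life_by_continent and the choice of statistic are assumptions made for this example.

```r
library(gapminder)
library(dplyr)

# Each row is one country in one year, so grouping and summarising
# works directly on the columns without any reshaping first.
life_by_continent <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarise(mean_lifeExp = mean(lifeExp))

life_by_continent
```

The same calculation on a wide table with one column per year would first require picking out the right year column by name.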

Wide form data, by contrast, is harder to work with but easier for people to read. Let's take per capita GDP by year for European countries and compare the long form and the wide form.

long form

country	year	gdpPercap
Albania	1952	1601.056
Albania	1957	1942.284
Albania	1962	2312.889
Albania	1967	2760.197
Albania	1972	3313.422
Albania	1977	3533.004
Albania	1982	3630.881
Albania	1987	3738.933
Albania	1992	2497.438
Albania	1997	3193.055

…

Table 1: long form
country	year	gdpPercap
United Kingdom	1962	12477.18
United Kingdom	1967	14142.85
United Kingdom	1972	15895.12
United Kingdom	1977	17428.75
United Kingdom	1982	18232.42
United Kingdom	1987	21664.79
United Kingdom	1992	22705.09
United Kingdom	1997	26074.53
United Kingdom	2002	29479.00
United Kingdom	2007	33203.26

wide form

Table 2: wide form
country	1952	1957	1962	1967	1972	1977	1982	1987	1992	1997	2002	2007
Albania	1601.0561	1942.284	2312.889	2760.197	3313.422	3533.004	3630.881	3738.933	2497.438	3193.055	4604.212	5937.030
Austria	6137.0765	8842.598	10750.721	12834.602	16661.626	19749.422	21597.084	23687.826	27042.019	29095.921	32417.608	36126.493
Belgium	8343.1051	9714.961	10991.207	13149.041	16672.144	19117.974	20979.846	22525.563	25575.571	27561.197	30485.884	33692.605
Bosnia and Herzegovina	973.5332	1353.989	1709.684	2172.352	2860.170	3528.481	4126.613	4314.115	2546.781	4766.356	6018.975	7446.299
Bulgaria	2444.2866	3008.671	4254.338	5577.003	6597.494	7612.240	8224.192	8239.855	6302.623	5970.389	7696.778	10680.793
Croatia	3119.2365	4338.232	5477.890	6960.298	9164.090	11305.385	13221.822	13822.584	8447.795	9875.605	11628.389	14619.223
Czech Republic	6876.1403	8256.344	10136.867	11399.445	13108.454	14800.161	15377.229	16310.443	14297.021	16048.514	17596.210	22833.309
Denmark	9692.3852	11099.659	13583.314	15937.211	18866.207	20422.901	21688.040	25116.176	26406.740	29804.346	32166.500	35278.419
Finland	6424.5191	7545.415	9371.843	10921.636	14358.876	15605.423	18533.158	21141.012	20647.165	23723.950	28204.591	33207.084
France	7029.8093	8662.835	10560.486	12999.918	16107.192	18292.635	20293.897	22066.442	24703.796	25889.785	28926.032	30470.017
Germany	7144.1144	10187.827	12902.463	14745.626	18016.180	20512.921	22031.533	24639.186	26505.303	27788.884	30035.802	32170.374
Greece	3530.6901	4916.300	6017.191	8513.097	12724.830	14195.524	15268.421	16120.528	17541.496	18747.698	22514.255	27538.412
Hungary	5263.6738	6040.180	7550.360	9326.645	10168.656	11674.837	12545.991	12986.480	10535.629	11712.777	14843.936	18008.944
Iceland	7267.6884	9244.001	10350.159	13319.896	15798.064	19654.962	23269.607	26923.206	25144.392	28061.100	31163.202	36180.789
Ireland	5210.2803	5599.078	6631.597	7655.569	9530.773	11150.981	12618.321	13872.867	17558.816	24521.947	34077.049	40675.996
Italy	4931.4042	6248.656	8243.582	10022.401	12269.274	14255.985	16537.483	19207.235	22013.645	24675.024	27968.098	28569.720
Montenegro	2647.5856	3682.260	4649.594	5907.851	7778.414	9595.930	11222.588	11732.510	7003.339	6465.613	6557.194	9253.896
Netherlands	8941.5719	11276.193	12790.850	15363.251	18794.746	21209.059	21399.460	23651.324	26790.950	30246.131	33724.758	36797.933
Norway	10095.4217	11653.973	13450.402	16361.876	18965.056	23311.349	26298.635	31540.975	33965.661	41283.164	44683.975	49357.190
Poland	4029.3297	4734.253	5338.752	6557.153	8006.507	9508.141	8451.531	9082.351	7738.881	10159.584	12002.239	15389.925
Portugal	3068.3199	3774.572	4727.955	6361.518	9022.247	10172.486	11753.843	13039.309	16207.267	17641.032	19970.908	20509.648
Romania	3144.6132	3943.370	4734.998	6470.867	8011.414	9356.397	9605.314	9696.273	6598.410	7346.548	7885.360	10808.476
Serbia	3581.4594	4981.091	6289.629	7991.707	10522.067	12980.670	15181.093	15870.879	9325.068	7914.320	7236.075	9786.535
Slovak Republic	5074.6591	6093.263	7481.108	8412.902	9674.168	10922.664	11348.546	12037.268	9498.468	12126.231	13638.778	18678.314
Slovenia	4215.0417	5862.277	7402.303	9405.489	12383.486	15277.030	17866.722	18678.535	14214.717	17161.107	20660.019	25768.258
Spain	3834.0347	4564.802	5693.844	7993.512	10638.751	13236.921	13926.170	15764.983	18603.065	20445.299	24835.472	28821.064
Sweden	8527.8447	9911.878	12329.442	15258.297	17832.025	18855.725	20667.381	23586.929	23880.017	25266.595	29341.631	33859.748
Switzerland	14734.2327	17909.490	20431.093	22966.144	27195.113	26982.291	28397.715	30281.705	31871.530	32135.323	34480.958	37506.419
Turkey	1969.1010	2218.754	2322.870	2826.356	3450.696	4269.122	4241.356	5089.044	5678.348	6601.430	6508.086	8458.276
United Kingdom	9979.5085	11283.178	12477.177	14142.851	15895.116	17428.748	18232.425	21664.788	22705.093	26074.531	29478.999	33203.261

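Can you see the difference? In long form, the country and per capita GDP are written out for every year, so the data stretches downward like melted cheese. As a result, the printed output has to be truncated and information is lost. In contrast, the wide form places each year in its own column, so changes in each country's per capita GDP over time are much easier to follow. If you need to present summarized information in a report, the wide form is the more readable choice.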

Pivoting

wide form -> long form

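Now let's actually reshape data in R with functions from packages that support pivoting. Base R functions alone are enough to pivot data, but the procedure is complicated, and since dedicated pivoting functions already exist, using base functions is inefficient. The following functions convert data to long form.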

 - reshape2::melt()
 - tidyr::gather() 
 - tidyr::pivot_longer()
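
To give a sense of why the base R route feels cumbersome, here is a minimal sketch using stats::reshape(). The data frame wide_toy is a made-up two-country, two-year subset copied from the wide form table above; it is not the gapminder_wide object used in the rest of this post.

```r
# Toy wide table (hypothetical subset of the wide form table shown earlier).
wide_toy <- data.frame(
  country = c("Albania", "Austria"),
  `1952` = c(1601.0561, 6137.0765),
  `1957` = c(1942.284, 8842.598),
  check.names = FALSE  # keep the year column names as they are
)

# Base R needs every piece spelled out: which columns vary, what the
# value column is called, and which time value each column stands for.
long_base <- reshape(
  wide_toy,
  direction = "long",
  varying   = c("1952", "1957"),
  v.names   = "gdpPercap",
  timevar   = "year",
  times     = c(1952L, 1957L),
  idvar     = "country"
)

long_base
```

The dedicated functions below infer most of these pieces from the column names, which is the main reason to prefer them.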

melt()

reshape2::melt()

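As the name suggests, the reshape2 package provides functions that are useful for restructuring data. However, the package is more than five years old, so its functions are dated, and its author Hadley Wickham recommends using functions from newer packages instead. The melt function takes data spread out in wide form and converts it to long form, as if melting ice cream so that it runs downward. For the commonly used S3 data frame method, melt is used as follows.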

reshape2::melt(data, id.vars, measure.vars, variable.name = "variable", ..., na.rm = FALSE, value.name = "value", factorsAsStrings = TRUE)

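Parameter descriptions

Parameter	Meaning
data	data frame
id.vars	identifier variable columns
measure.vars	measured variable columns
variable.name	name of the new column that stores the measured variable names
value.name	name of the new column that stores the values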

Let's use melt() to convert wide form data to long form. The per capita GDP data for Europe by year that we looked at above has been stored in advance in the variable gapminder_wide, and we will use it here.
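
The post does not show how gapminder_wide was created. The following is one plausible reconstruction, offered as an assumption rather than the original code; it relies on pivot_wider, which is covered later in this post.

```r
library(gapminder)
library(dplyr)
library(tidyr)

# Hypothetical reconstruction of gapminder_wide (not shown in the post):
# keep the European rows, then spread the years into columns.
gapminder_wide <- gapminder %>%
  filter(continent == "Europe") %>%
  select(country, year, gdpPercap) %>%
  pivot_wider(names_from = year, values_from = gdpPercap)

dim(gapminder_wide)
```

This should give one row per European country and one column per survey year plus country, which matches the 13 columns seen in the printed output below.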

# check the long form data
gapminder_long <- gapminder %>% 
  filter(continent == 'Europe') %>% 
  select(country, year, gdpPercap) %>% 
  as.data.frame()

gapminder_long %>% head(5)
#>   country year gdpPercap
#> 1 Albania 1952  1601.056
#> 2 Albania 1957  1942.284
#> 3 Albania 1962  2312.889
#> 4 Albania 1967  2760.197
#> 5 Albania 1972  3313.422

# check the wide form data
gapminder_wide %>% head(5)
#> # A tibble: 5 x 13
#>   country  `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997`
#>   <fct>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1 Albania   1601.  1942.  2313.  2760.  3313.  3533.  3631.  3739.  2497.  3193.
#> 2 Austria   6137.  8843. 10751. 12835. 16662. 19749. 21597. 23688. 27042. 29096.
#> 3 Belgium   8343.  9715. 10991. 13149. 16672. 19118. 20980. 22526. 25576. 27561.
#> 4 Bosnia ~   974.  1354.  1710.  2172.  2860.  3528.  4127.  4314.  2547.  4766.
#> 5 Bulgaria  2444.  3009.  4254.  5577.  6597.  7612.  8224.  8240.  6303.  5970.
#> # ... with 2 more variables: 2002 <dbl>, 2007 <dbl>

# convert wide form -> long form
gapminder_long_melt <- gapminder_wide %>% 
  reshape2::melt(id.vars = "country", variable.name = "year", value.name = "gdpPercap") %>% 
  arrange(country)

# compare the results
gapminder_long %>% head(5)
#>   country year gdpPercap
#> 1 Albania 1952  1601.056
#> 2 Albania 1957  1942.284
#> 3 Albania 1962  2312.889
#> 4 Albania 1967  2760.197
#> 5 Albania 1972  3313.422

gapminder_long_melt %>% head(5)
#>   country year gdpPercap
#> 1 Albania 1952  1601.056
#> 2 Albania 1957  1942.284
#> 3 Albania 1962  2312.889
#> 4 Albania 1967  2760.197
#> 5 Albania 1972  3313.422

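In melt(), we set id.vars to country, the column of country names, and passed "year" as the name of the variable column and "gdpPercap" as the name of the value column. Once id.vars is specified, the function treats the remaining columns as measure.vars, merges their names into a single categorical variable, and so converts the data to long form.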
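
One detail the printed comparison hides is that melt stores the former column names as a factor, whereas year in gapminder_long is an integer. The sketch below shows the difference and one way to convert it. The data frame wide_toy and the name long_melt are assumptions for this example, using values copied from the wide form table above.

```r
library(reshape2)

# Toy wide table (hypothetical subset of the wide form table shown earlier).
wide_toy <- data.frame(
  country = c("Albania", "Austria"),
  `1952` = c(1601.0561, 6137.0765),
  `1957` = c(1942.284, 8842.598),
  check.names = FALSE
)

long_melt <- melt(
  wide_toy,
  id.vars = "country",
  variable.name = "year",
  value.name = "gdpPercap"
)

# The year column is a factor built from the old column names.
class(long_melt$year)

# Go through character first; as.integer() on a factor alone would
# return the internal level codes instead of the years.
long_melt$year <- as.integer(as.character(long_melt$year))
```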

gather()

tidyr::gather()

The tidyr package belongs to the tidyverse and provides functions that help create tidy data. Tidy data is defined as follows.

Every column is a variable.
Every row is an observation.
Every cell is a single value.

Here is a quick quiz. Is the wide form data we saw earlier tidy or not? Judging by the name, data that looks neatly organized might seem tidy. But the answer is no. To understand what tidy means, you need to understand what a variable and an observation are. Let's unpack the definition above.

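Every column is a variable. >> Each column should hold the values of a single attribute (height, temperature, weight, and so on).
Every row is an observation. >> Each row should be one observation: the values of every attribute for a single unit, with each attribute measured in the same unit across rows (categorical: person, country, year, and so on; continuous: numeric values).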

In the wide form data we have seen so far, each year (an observation point) sits in its own separate column, which makes the data awkward to handle. For this reason, tidy data generally refers to data in long form.

The gather function is used as follows.

tidyr::gather(
  data,
  key = "key",
  value = "value",
  ...,
  na.rm = FALSE,
  convert = FALSE,
  factor_key = FALSE
)

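Parameter descriptions

Parameter	Meaning
data	data frame
key	name of the new key column
value	name of the new value column
...	selection of columns to gather
convert	automatically convert the type of the key column (TRUE/FALSE)
factor_key	store the key column as a factor (TRUE/FALSE)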

This time we use gather to perform the same task as with melt.

# check the long form data
gapminder_long <- gapminder %>% 
  filter(continent == 'Europe') %>% 
  select(country, year, gdpPercap) %>% 
  as.data.frame()

gapminder_long %>% head(5)
#>   country year gdpPercap
#> 1 Albania 1952  1601.056
#> 2 Albania 1957  1942.284
#> 3 Albania 1962  2312.889
#> 4 Albania 1967  2760.197
#> 5 Albania 1972  3313.422

# check the wide form data
gapminder_wide %>% head(5)
#> # A tibble: 5 x 13
#>   country  `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997`
#>   <fct>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1 Albania   1601.  1942.  2313.  2760.  3313.  3533.  3631.  3739.  2497.  3193.
#> 2 Austria   6137.  8843. 10751. 12835. 16662. 19749. 21597. 23688. 27042. 29096.
#> 3 Belgium   8343.  9715. 10991. 13149. 16672. 19118. 20980. 22526. 25576. 27561.
#> 4 Bosnia ~   974.  1354.  1710.  2172.  2860.  3528.  4127.  4314.  2547.  4766.
#> 5 Bulgaria  2444.  3009.  4254.  5577.  6597.  7612.  8224.  8240.  6303.  5970.
#> # ... with 2 more variables: 2002 <dbl>, 2007 <dbl>

# convert wide form -> long form
gapminder_long_gather <- gapminder_wide %>% 
  tidyr::gather(key = "year", value = "gdpPercap", -country, convert = TRUE) %>% 
  arrange(country)

# compare the results
gapminder_long %>% head(5)
#>   country year gdpPercap
#> 1 Albania 1952  1601.056
#> 2 Albania 1957  1942.284
#> 3 Albania 1962  2312.889
#> 4 Albania 1967  2760.197
#> 5 Albania 1972  3313.422

gapminder_long_gather %>% head(5)
#> # A tibble: 5 x 3
#>   country  year gdpPercap
#>   <fct>   <int>     <dbl>
#> 1 Albania  1952     1601.
#> 2 Albania  1957     1942.
#> 3 Albania  1962     2313.
#> 4 Albania  1967     2760.
#> 5 Albania  1972     3313.

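Compared with melt, the way to specify the identifier column differs slightly. In melt the identifier column is given through the id.vars parameter, whereas in gather we used the minus sign ( - ) to exclude it from the columns being gathered. Once country is excluded with the minus sign, the remaining columns are merged automatically, just as with melt, and comparing the results shows that they match.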
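
The minus sign is only one way to select columns. As a small sketch, the same columns can be named positively with a range; the data frame wide_toy and the two result names are assumptions for this example, with values copied from the wide form table above.

```r
library(tidyr)

# Toy wide table (hypothetical subset of the wide form table shown earlier).
wide_toy <- data.frame(
  country = c("Albania", "Austria"),
  `1952` = c(1601.0561, 6137.0765),
  `1957` = c(1942.284, 8842.598),
  check.names = FALSE
)

# Exclude the identifier column ...
by_exclusion <- gather(wide_toy, key = "year", value = "gdpPercap", -country)

# ... or name the columns to gather as a range. Backticks are needed
# because the column names start with digits.
by_selection <- gather(wide_toy, key = "year", value = "gdpPercap", `1952`:`1957`)

identical(by_exclusion, by_selection)
```

Excluding the identifier is usually shorter when there are many year columns, while an explicit range makes the intent clearer when the table also contains other columns that should stay put.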

pivot_longer()

tidyr::pivot_longer()

pivot_longer is tidyr's second function for creating long form data. It improves on gather, whose name and behavior were unintuitive and hard to understand, and it is the approach recommended by Hadley Wickham, the R guru who created both reshape2 and tidyr.

The pivot_longer function is used as follows.

tidyr::pivot_longer(
  data,
  cols,
  names_to = "name",
  names_prefix = NULL,
  names_sep = NULL,
  names_pattern = NULL,
  names_ptypes = list(),
  names_transform = list(),
  names_repair = "check_unique",
  values_to = "value",
  values_drop_na = FALSE,
  values_ptypes = list(),
  values_transform = list(),
  ...
)

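Parameter descriptions

Parameter	Meaning
data	data frame
cols	columns to pivot into long form
names_to	name of the new column (or columns) created from the original column names
names_prefix	regular expression for a prefix to strip from the original column names
names_sep	where to split the original column names (numeric position or string)
names_pattern	regular expression with groups to extract from the original column names
names_ptypes	prototype (type) for the new name columns
names_transform	function to convert the type of the new name columns
names_repair	how to handle invalid column names
values_to	name of the new value column
values_drop_na	drop rows whose value is NA (TRUE/FALSE)
values_ptypes	prototype (type) for the new value column
values_transform	function to convert the type of the new value column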

Let's convert to long form with pivot_longer.

# check the long form data
gapminder_long <- gapminder %>% 
  filter(continent == 'Europe') %>% 
  select(country, year, gdpPercap) %>% 
  as.data.frame()

gapminder_long %>% head(5)
#>   country year gdpPercap
#> 1 Albania 1952  1601.056
#> 2 Albania 1957  1942.284
#> 3 Albania 1962  2312.889
#> 4 Albania 1967  2760.197
#> 5 Albania 1972  3313.422

# check the wide form data
gapminder_wide %>% head(5)
#> # A tibble: 5 x 13
#>   country  `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997`
#>   <fct>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1 Albania   1601.  1942.  2313.  2760.  3313.  3533.  3631.  3739.  2497.  3193.
#> 2 Austria   6137.  8843. 10751. 12835. 16662. 19749. 21597. 23688. 27042. 29096.
#> 3 Belgium   8343.  9715. 10991. 13149. 16672. 19118. 20980. 22526. 25576. 27561.
#> 4 Bosnia ~   974.  1354.  1710.  2172.  2860.  3528.  4127.  4314.  2547.  4766.
#> 5 Bulgaria  2444.  3009.  4254.  5577.  6597.  7612.  8224.  8240.  6303.  5970.
#> # ... with 2 more variables: 2002 <dbl>, 2007 <dbl>

# convert wide form -> long form
gapminder_long_pivot_longer <- gapminder_wide %>% 
  tidyr::pivot_longer(cols = -country, names_to = "year", values_to = "gdpPercap") %>% 
  arrange(country)

# compare the results
gapminder_long %>% head(5)
#>   country year gdpPercap
#> 1 Albania 1952  1601.056
#> 2 Albania 1957  1942.284
#> 3 Albania 1962  2312.889
#> 4 Albania 1967  2760.197
#> 5 Albania 1972  3313.422

gapminder_long_pivot_longer %>% head(5)
#> # A tibble: 5 x 3
#>   country year  gdpPercap
#>   <fct>   <chr>     <dbl>
#> 1 Albania 1952      1601.
#> 2 Albania 1957      1942.
#> 3 Albania 1962      2313.
#> 4 Albania 1967      2760.
#> 5 Albania 1972      3313.

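As with gather, the minus sign ( - ) is used to exclude the identifier column. In gather, however, the columns are selected through unnamed arguments, so it takes some time to understand a call when reading it, whereas pivot_longer makes the selection explicit through the cols argument. Note that year comes back as a character column here, unlike the integer year in gapminder_long. Although this example does not cover them, pivot_longer also offers finer-grained options than gather for building the name and value columns, which makes pivoting smoother.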
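
In the pivot_longer output above, year is a character column. As a minimal sketch of one of the finer-grained options, names_transform can convert it during the pivot. The data frame wide_toy and the name long_int are assumptions for this example, and names_transform is assumed to be available (it appears in the signature listed above).

```r
library(tidyr)

# Toy wide table (hypothetical subset of the wide form table shown earlier).
wide_toy <- data.frame(
  country = c("Albania", "Austria"),
  `1952` = c(1601.0561, 6137.0765),
  `1957` = c(1942.284, 8842.598),
  check.names = FALSE
)

# Convert the new name column to integer while pivoting, so the result
# matches the integer year column of the original long data.
long_int <- pivot_longer(
  wide_toy,
  cols = -country,
  names_to = "year",
  values_to = "gdpPercap",
  names_transform = list(year = as.integer)
)

str(long_int$year)
```

This plays the same role as convert = TRUE in gather, but the conversion is stated explicitly rather than guessed from the values.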

long form -> wide form

Now let's look at the opposite direction. Each of the functions above has a counterpart, as follows.

 - reshape2::dcast()
 - tidyr::spread()
 - tidyr::pivot_wider()

dcast()

reshape2::dcast()

dcast is the counterpart of reshape2's melt and is used to convert data to wide form. It is used as follows.

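dcast(
  data,
  formula,
  fun.aggregate = NULL,
  ...,
  margins = NULL,
  subset = NULL,
  fill = NULL,
  drop = TRUE,
  value.var = guess_value(data)
)

Parameter descriptions

Parameter	Meaning
data	molten (long form) data frame
formula	casting formula (row variables ~ column variables)
fun.aggregate	aggregation function, used when several values fall into one cell
fill	value used to fill missing combinations
drop	whether to drop missing combinations (TRUE/FALSE)
value.var	name of the column that holds the values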

The gapminder_long_melt object is the data we converted from wide form to long form with melt. Let's use dcast to convert it back to wide form.

# check the wide form data
gapminder_wide %>% head(5)
#> # A tibble: 5 x 13
#>   country  `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997`
#>   <fct>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1 Albania   1601.  1942.  2313.  2760.  3313.  3533.  3631.  3739.  2497.  3193.
#> 2 Austria   6137.  8843. 10751. 12835. 16662. 19749. 21597. 23688. 27042. 29096.
#> 3 Belgium   8343.  9715. 10991. 13149. 16672. 19118. 20980. 22526. 25576. 27561.
#> 4 Bosnia ~   974.  1354.  1710.  2172.  2860.  3528.  4127.  4314.  2547.  4766.
#> 5 Bulgaria  2444.  3009.  4254.  5577.  6597.  7612.  8224.  8240.  6303.  5970.
#> # ... with 2 more variables: 2002 <dbl>, 2007 <dbl>

# check the long form data
gapminder_long_melt %>% head(5)
#>   country year gdpPercap
#> 1 Albania 1952  1601.056
#> 2 Albania 1957  1942.284
#> 3 Albania 1962  2312.889
#> 4 Albania 1967  2760.197
#> 5 Albania 1972  3313.422

# convert long form -> wide form
gapminder_wide_dcast <- gapminder_long_melt %>% 
  reshape2::dcast(country ~ ..., value.var = "gdpPercap") %>% 
  as_tibble()

# compare the results
gapminder_wide_dcast %>% head(5)
#> # A tibble: 5 x 13
#>   country  `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997`
#>   <fct>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1 Albania   1601.  1942.  2313.  2760.  3313.  3533.  3631.  3739.  2497.  3193.
#> 2 Austria   6137.  8843. 10751. 12835. 16662. 19749. 21597. 23688. 27042. 29096.
#> 3 Belgium   8343.  9715. 10991. 13149. 16672. 19118. 20980. 22526. 25576. 27561.
#> 4 Bosnia ~   974.  1354.  1710.  2172.  2860.  3528.  4127.  4314.  2547.  4766.
#> 5 Bulgaria  2444.  3009.  4254.  5577.  6597.  7612.  8224.  8240.  6303.  5970.
#> # ... with 2 more variables: 2002 <dbl>, 2007 <dbl>

gapminder_wide %>% head(5)
#> # A tibble: 5 x 13
#>   country  `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997`
#>   <fct>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1 Albania   1601.  1942.  2313.  2760.  3313.  3533.  3631.  3739.  2497.  3193.
#> 2 Austria   6137.  8843. 10751. 12835. 16662. 19749. 21597. 23688. 27042. 29096.
#> 3 Belgium   8343.  9715. 10991. 13149. 16672. 19118. 20980. 22526. 25576. 27561.
#> 4 Bosnia ~   974.  1354.  1710.  2172.  2860.  3528.  4127.  4314.  2547.  4766.
#> 5 Bulgaria  2444.  3009.  4254.  5577.  6597.  7612.  8224.  8240.  6303.  5970.
#> # ... with 2 more variables: 2002 <dbl>, 2007 <dbl>

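In the formula, the variable to the left of the tilde (~) is the identifier, and ... on the right stands for all remaining variables, whose values become the new columns of the wide form data. The conversion worked as expected, but unless you have used dcast for a long time and handle it fluently, pivot_wider will be easier and more convenient.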
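
For readers who find the ... shorthand opaque, here is a small sketch that writes the column variable out explicitly. The data frame long_toy and the two result names are assumptions for this example, with values copied from the tables above.

```r
library(reshape2)

# Toy long table (hypothetical subset of the long form data shown earlier).
long_toy <- data.frame(
  country = c("Albania", "Albania", "Austria", "Austria"),
  year = c(1952L, 1957L, 1952L, 1957L),
  gdpPercap = c(1601.0561, 1942.284, 6137.0765, 8842.598)
)

# "..." stands for every variable not otherwise used in the call,
# which here leaves only year.
wide_dots <- dcast(long_toy, country ~ ..., value.var = "gdpPercap")

# Naming year directly states the same layout explicitly.
wide_year <- dcast(long_toy, country ~ year, value.var = "gdpPercap")

all.equal(wide_dots, wide_year)
```

The explicit form is safer once the long table contains more than one extra column, because ... would then combine all of them into the new column names.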

spread()

tidyr::spread()

spread is the counterpart of gather and, as its name suggests, spreads long form data out into wide form. It is used as follows.

spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)

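Parameter descriptions

Parameter	Meaning
data	data frame
key	column whose values become the new column names
value	column whose values fill the new columns
fill	value used to replace missing values
convert	automatically convert the class of the new columns (TRUE/FALSE)
drop	whether to drop factor levels that do not appear in the data (TRUE/FALSE)
sep	separator; if not NULL, new column names take the form <key_name><sep><key_value>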

# check the wide form data
gapminder_wide %>% head(5)
#> # A tibble: 5 x 13
#>   country  `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997`
#>   <fct>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1 Albania   1601.  1942.  2313.  2760.  3313.  3533.  3631.  3739.  2497.  3193.
#> 2 Austria   6137.  8843. 10751. 12835. 16662. 19749. 21597. 23688. 27042. 29096.
#> 3 Belgium   8343.  9715. 10991. 13149. 16672. 19118. 20980. 22526. 25576. 27561.
#> 4 Bosnia ~   974.  1354.  1710.  2172.  2860.  3528.  4127.  4314.  2547.  4766.
#> 5 Bulgaria  2444.  3009.  4254.  5577.  6597.  7612.  8224.  8240.  6303.  5970.
#> # ... with 2 more variables: 2002 <dbl>, 2007 <dbl>

# check the long form data
gapminder_long %>% head(5)
#>   country year gdpPercap
#> 1 Albania 1952  1601.056
#> 2 Albania 1957  1942.284
#> 3 Albania 1962  2312.889
#> 4 Albania 1967  2760.197
#> 5 Albania 1972  3313.422

# convert long form -> wide form
gapminder_wide_spread <- gapminder_long %>%
  tidyr::spread(key = year, value = gdpPercap) %>% 
  as_tibble()

# compare the results
gapminder_wide_spread %>% head(5)
#> # A tibble: 5 x 13
#>   country  `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997`
#>   <fct>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1 Albania   1601.  1942.  2313.  2760.  3313.  3533.  3631.  3739.  2497.  3193.
#> 2 Austria   6137.  8843. 10751. 12835. 16662. 19749. 21597. 23688. 27042. 29096.
#> 3 Belgium   8343.  9715. 10991. 13149. 16672. 19118. 20980. 22526. 25576. 27561.
#> 4 Bosnia ~   974.  1354.  1710.  2172.  2860.  3528.  4127.  4314.  2547.  4766.
#> 5 Bulgaria  2444.  3009.  4254.  5577.  6597.  7612.  8224.  8240.  6303.  5970.
#> # ... with 2 more variables: 2002 <dbl>, 2007 <dbl>

gapminder_wide %>% head(5)
#> # A tibble: 5 x 13
#>   country  `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997`
#>   <fct>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1 Albania   1601.  1942.  2313.  2760.  3313.  3533.  3631.  3739.  2497.  3193.
#> 2 Austria   6137.  8843. 10751. 12835. 16662. 19749. 21597. 23688. 27042. 29096.
#> 3 Belgium   8343.  9715. 10991. 13149. 16672. 19118. 20980. 22526. 25576. 27561.
#> 4 Bosnia ~   974.  1354.  1710.  2172.  2860.  3528.  4127.  4314.  2547.  4766.
#> 5 Bulgaria  2444.  3009.  4254.  5577.  6597.  7612.  8224.  8240.  6303.  5970.
#> # ... with 2 more variables: 2002 <dbl>, 2007 <dbl>
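
The example above leaves the new columns named after bare years. As a small sketch of the sep argument from the parameter list, the key name can be prepended to each new column. The data frame long_toy and the name wide_sep are assumptions for this example, with values copied from the tables above.

```r
library(tidyr)

# Toy long table (hypothetical subset of the long form data shown earlier).
long_toy <- data.frame(
  country = c("Albania", "Albania", "Austria", "Austria"),
  year = c(1952L, 1957L, 1952L, 1957L),
  gdpPercap = c(1601.0561, 1942.284, 6137.0765, 8842.598)
)

# With sep set, each new column is named <key_name><sep><key_value>,
# which avoids column names that start with a digit.
wide_sep <- spread(long_toy, key = year, value = gdpPercap, sep = "_")

names(wide_sep)
```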

pivot_wider()

tidyr::pivot_wider()

The last function is pivot_wider. It is the counterpart of pivot_longer and, being the most recent of these functions, it is strong in both usability and features. It is used as follows.

pivot_wider(
  data,
  id_cols = NULL,
  names_from = name,
  names_prefix = "",
  names_sep = "_",
  names_glue = NULL,
  names_sort = FALSE,
  names_repair = "check_unique",
  values_from = value,
  values_fill = NULL,
  values_fn = NULL,
  ...
)

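Parameter descriptions

Parameter	Meaning
data	data frame
id_cols	identifier columns
names_from	column (or columns) whose values become the new column names
names_prefix	prefix added to the new column names
names_sep	separator used to join names when several columns are combined
names_glue	glue specification for custom column names
names_sort	sort the new column names (TRUE/FALSE)
names_repair	how to handle invalid column names
values_from	column (or columns) that supply the cell values
values_fill	value used to fill missing cells
values_fn	function applied to the values in each cell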

Let's create wide form data with pivot_wider.

# check the wide form data
gapminder_wide %>% head(5)
#> # A tibble: 5 x 13
#>   country  `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997`
#>   <fct>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1 Albania   1601.  1942.  2313.  2760.  3313.  3533.  3631.  3739.  2497.  3193.
#> 2 Austria   6137.  8843. 10751. 12835. 16662. 19749. 21597. 23688. 27042. 29096.
#> 3 Belgium   8343.  9715. 10991. 13149. 16672. 19118. 20980. 22526. 25576. 27561.
#> 4 Bosnia ~   974.  1354.  1710.  2172.  2860.  3528.  4127.  4314.  2547.  4766.
#> 5 Bulgaria  2444.  3009.  4254.  5577.  6597.  7612.  8224.  8240.  6303.  5970.
#> # ... with 2 more variables: 2002 <dbl>, 2007 <dbl>

# check the long form data
gapminder_long %>% head(5)
#>   country year gdpPercap
#> 1 Albania 1952  1601.056
#> 2 Albania 1957  1942.284
#> 3 Albania 1962  2312.889
#> 4 Albania 1967  2760.197
#> 5 Albania 1972  3313.422

# convert long form -> wide form
gapminder_wide_pivot_wider <- gapminder_long %>% 
  tidyr::pivot_wider(id_cols = country, names_from = year, values_from = gdpPercap)

# compare the results
gapminder_wide_pivot_wider %>% head(5)
#> # A tibble: 5 x 13
#>   country  `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997`
#>   <fct>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1 Albania   1601.  1942.  2313.  2760.  3313.  3533.  3631.  3739.  2497.  3193.
#> 2 Austria   6137.  8843. 10751. 12835. 16662. 19749. 21597. 23688. 27042. 29096.
#> 3 Belgium   8343.  9715. 10991. 13149. 16672. 19118. 20980. 22526. 25576. 27561.
#> 4 Bosnia ~   974.  1354.  1710.  2172.  2860.  3528.  4127.  4314.  2547.  4766.
#> 5 Bulgaria  2444.  3009.  4254.  5577.  6597.  7612.  8224.  8240.  6303.  5970.
#> # ... with 2 more variables: 2002 <dbl>, 2007 <dbl>

gapminder_wide %>% head(5)
#> # A tibble: 5 x 13
#>   country  `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` `1992` `1997`
#>   <fct>     <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
#> 1 Albania   1601.  1942.  2313.  2760.  3313.  3533.  3631.  3739.  2497.  3193.
#> 2 Austria   6137.  8843. 10751. 12835. 16662. 19749. 21597. 23688. 27042. 29096.
#> 3 Belgium   8343.  9715. 10991. 13149. 16672. 19118. 20980. 22526. 25576. 27561.
#> 4 Bosnia ~   974.  1354.  1710.  2172.  2860.  3528.  4127.  4314.  2547.  4766.
#> 5 Bulgaria  2444.  3009.  4254.  5577.  6597.  7612.  8224.  8240.  6303.  5970.
#> # ... with 2 more variables: 2002 <dbl>, 2007 <dbl>

Comparing the results, we again get the same output. Compared with spread, pivot_wider provides options (arguments) for finer control over the name and value columns, and its argument names are more intuitive and user-friendly.
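
As one small illustration of those extra options, names_prefix adds a label in front of every new column name. This is a sketch only; the data frame long_toy, the name wide_prefixed, and the prefix "gdp_" are assumptions for this example, with values copied from the tables above.

```r
library(tidyr)

# Toy long table (hypothetical subset of the long form data shown earlier).
long_toy <- data.frame(
  country = c("Albania", "Albania", "Austria", "Austria"),
  year = c(1952L, 1957L, 1952L, 1957L),
  gdpPercap = c(1601.0561, 1942.284, 6137.0765, 8842.598)
)

# Prefix each new column so the names say what the values are and no
# longer need backticks when referenced in code.
wide_prefixed <- pivot_wider(
  long_toy,
  names_from = year,
  values_from = gdpPercap,
  names_prefix = "gdp_"
)

names(wide_prefixed)
```

The same prefix can later be stripped again by passing names_prefix = "gdp_" to pivot_longer when converting back to long form.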

Conclusion

So far we have looked at three families of pivoting functions. Unless you have long used melt, dcast, gather, and spread and are fully comfortable with them, I recommend pivot_longer and pivot_wider for pivoting. This post could not cover everything, but the pivot functions fit tidy syntax better than the others and, through that syntax, allow richer reshaping and preprocessing. I believe a little practice with them will pay off.

JDW

Data Analyst
UNIST M.S. Student
