## Warning in file(filename, "r", encoding = encoding): URL
## 'https://metrics.rstudioprimers.com/learnr/installClient': status was 'Couldn't
## resolve host name'
[1] “Warning: An error occurred with the client code.”
Welcome
Data is easiest to analyze in R when it is stored in a tidy format. In the last tutorial, you learned how to tidy data that has an untidy layout, but there is another way that data sets can be untidy: a data set can combine multiple values in a single cell or spread a single value across multiple cells. This makes it difficult to extract and use values in your analysis.
This tutorial will teach you two tools that you can use to tidy this type of data:
separate()
- which separates a column of cells into multiple columnsunite()
- which combines multiple columns of cells into a single column
It ends with a case study that requires you to use all of the tidy tools to wrangle a messy real world data set.
This tutorial uses the core tidyverse packages, including tidyr. All of these packages have been pre-installed and pre-loaded for your convenience.
Click the Next Topic button to begin.
separate()
hurricanes
The hurricanes
data set contains historical information
about five hurricanes. At first glance it appears to contain four
variables: name, wind_speed, pressure, and
date. However, there are three more variables hidden in plain
sight. Can you spot them?
Dates
Did you realize that dates are a combination of multiple variables? They are.
You’ll almost always display these variables together to make a date, because a date is itself a variable—one that conveys more than the sum of its parts.
However, there are times where it is convenient to treat each element
of a date separately. For example, what if you wanted to filter
hurricanes
to just the storms that occurred in June
(i.e. month == 6
)? Then it would be convenient to
reorganize the data to look like this.
But how could you do it?
separate()
You can separate the elements of date
with the
separate()
function. separate()
divides a
column of values into multiple columns that each contain a portion of
the original values.
Run the code below to see separate()
in action. Then
click continue to learn about the syntax.
hurricanes %>%
separate(col = date, into = c("year","month","day"), sep = "-")
Syntax
Let’s rewrite our above command without the pipe, to make the syntax
of separate()
easier to see.
separate(hurricanes, col = date, into = c("year","month","day"), sep = "-")
separate()
takes a data frame and then the name of a
column in the data frame to separate. Here our code will separate the
date
column of the hurricane
data set.
The sep = "-"
argument tells separate()
to
split each value in date
wherever a -
appears.
You can choose to split on any character or character string.
Separating on -
will split each date into three dates: a
year, month, and day. As a result, separate()
will need to
add three new columns to the result. The into
argument
gives separate()
a character vector of names to use for the
new columns. Since the result will have three new columns, this vector
will need to have three new names. separate()
will provide
an error message if it ends up creating fewer or more columns than
column names.
Defaults
By default separate()
will separate values at the
location of any non-alphanumeric character, like
-
, ,
, /
, etc. So for example, we
could run our code without the sep = "-"
argument and—in
this case—get the same result.
Or will we? Do a quick mental check and then run the code to see if you are right.
hurricanes %>%
separate(col = date, into = c("year","month","day"))
Separating by position
If you set sep
equal to an integer,
separate()
will split the values at the location indicated
by the integers. For example,
sep = 1
will split the values after the first charactersep = -2
will split the values after the second to last character, no matter how many characters appear in the value. In other words, it will split off the last character of each value.sep = c(2, 4, 6)
will split the values after the second, fourth, and sixth characters, creating four sub-values
Think you have it? Create this version of hurricanes
by
adding a second call to separate()
that uses an integer
separator to the code below:
hurricanes %>%
separate(col = date, into = c("year","month","day"))
hurricanes %>%
separate(col = date, into = c("year","month","day")) %>%
separate(col = year, into = c("century", "year"), sep = 2)
Quiz - What if
Would these two commands return the same result? Why or why not? Once you have an answer, run the code below to see if you were right.
hurricanes %>%
separate(col = pressure, into = c("first", "last"), sep = 1)
hurricanes %>%
separate(col = pressure, into = c("first", "last"), sep = "1")
Convert
You may have noticed that separate()
returns its results
as columns of character strings. However, in some cases, like ours, the
columns will contain integers, doubles, or other types of non-character
data.
You can ask separate()
to convert the new columns to an
appropriate data type by adding convert = TRUE
to your
separate()
call. This is identical to the
convert = TRUE
argument of gather()
.
Identify the data types of year
, month
, and
day
(they appear under the column names) in the output
below. Then add convert = TRUE
and re-run the code. What
changes?
hurricanes %>%
separate(col = date, into = c("year","month","day"))
hurricanes %>%
separate(col = date, into = c("year","month","day"), convert = TRUE)
Remove
Let’s take a look at one last argument for separate()
.
If you add remove = FALSE
to your separate()
call, R will retain the original column in the results.
hurricanes %>%
separate(col = date, into = c("year","month","day"), convert = TRUE, remove = FALSE)
unite()
unite()
You can do the inverse of separate()
with
unite()
. unite()
uses multiple input columns
to create a single output column. It builds this column by pasting
together the cells of the input column with a separator.
hurricanes %>%
separate(date, c("year", "month", "day"), sep = "-") %>%
unite(col = "date", month, day, year, sep = ":")
Syntax
hurricanes %>%
separate(date, c("year", "month", "day"), sep = "-") %>%
unite(col = "date", month, day, year, sep = ":")
Notice that the syntax of unite()
is the inverse of
separate()
:
- The first argument is a character string: the name of the new column
that
unite()
will make - The arguments that follow are the columns to be combine into the new column. You can list as many columns as you like, their names do not need to be in quotes, and each name is listed as its own argument.
Exercise - Separate and Unite
Use separate()
and unite()
to rewrite the
dates in hurricanes in the format below:
- month/day/year, e.g., 1/27/2020
hurricanes %>%
separate(date, c("year", "month", "day"), sep = "-") %>%
unite(col = "date", month, day, year, sep = "/")
Exercise - Separate and Unite 2
Use the chunk below to:
- Use separate to isolate the first two digits of each date as “century”
- Filter the data to just rows where
century == 19
. These will be storms that occurred in the 1900’s. - Use
unite()
to return the results to the original date format. Hint: you can setsep = ""
to avoid including a separator character when uniting.
hurricanes %>%
separate(col = date, c("century", "rest"), sep = 2) %>%
filter(century == 19) %>%
unite(col = "date", century, rest, sep = "")
Tidy data
So far we’ve separated and united date
, a variable that
contains legitimate sub-variables. This is because it makes little sense
to combine unrelated values within the same cells. However, many data
sets follow this senseless practice. If you inherit one, you can use
separate()
and unite()
to reorganize the
values in a tidy fashion.
In the case study that follows, you will do just that. You will also practice using all of the tidyr functions as you do.
Case study
who
The who
data set contains a subset of data from the
World Health Organization Global Tuberculosis Report, available here.
In its original format, the data is very untidy
who
who variables
The first four columns of who
each contain a single
variable:
- country - the name of a country
- iso2 - a two letter country code
- iso3 - a three letter country code
- year - year
The remaining columns are named after codes that contain multiple variables.
who codes
Each column name after the fourth contains a code comprised of three values from three variables: type of TB, gender, and age.
A goal
To make who
easier to use in R, we should tidy it into
the format below. This data set contains six non-redundant variables:
country, year, type, sex, age
(group), and n (the number of cases of TB reported for
each group).
Warm up
It will take a number of operations to tidy the who
data
set. When you look at the end result, the task may seem overwhelming,
but each individual operation will be fairly simple (and familiar). We
will use the pipe operator, %>%
, to string these simple
operations together.
Let’s begin the pipe by removing the redundant variables
iso2
and iso3
from who
. In
otherwords, let’s use a dplyr function to select every column except
iso2
and iso3
. Recall that there is a way to
do this without a large amount of typing.
who %>%
select(-iso2, -iso3)
A strategy
Next, we need to move the type, sex, and age variables out of the column names and into a column of their own. It is true that we want to separate these values into their own cells, but that will be easier to do once they are in their own column.
In short, we want to do something like this:
Exercise - Reshape
Add to the pipe below. Use a tidyr reshaping function to gather the column names into their own column, named “codes”. Place the column cells into a column named “n”. Hint: it may be helpful to know that there are now 58 columns in the data set.
You can think of each column name as a key that combines the values of several variables. We want to move those keys into their own key column.
who %>%
select(-iso2, -iso3)
who %>%
select(-iso2, -iso3) %>%
gather(key = "codes", value = "n", 3:58)
Exercise - Separate
Add to the pipe below. Separate the codes in codes
into
three columns named “new”, “type”, and “sexage”. What type of separator
should you separate on?
who %>%
select(-iso2, -iso3) %>%
gather(key = "codes", value = "n", 3:58)
who %>%
select(-iso2, -iso3) %>%
gather(key = "codes", value = "n", 3:58) %>%
separate(codes, into = c("new", "type", "sexage"), sep = "_")
Exercise - Separate again
Our last separate, isolated two components of the who codes: new and type. However, it did not separate the sex and age variables.
If you look closely at the structure of the sexage
column, you will see that each cell begins with a single letter that
represents a gender, m
or f
, and is then
followed by three or more numbers, which represent an age group. Use
this insight to perform a second separate that isolates the “sex” and
“age” variables:
who %>%
select(-iso2, -iso3) %>%
gather(key = "codes", value = "n", 3:58) %>%
separate(codes, into = c("new", "type", "sexage"), sep = "_")
who %>%
select(-iso2, -iso3) %>%
gather(key = "codes", value = "n", 3:58) %>%
separate(codes, into = c("new", "type", "sexage"), sep = "_") %>%
separate(sexage, into = c("sex", "age"), sep = 1)
Exercise - Select
Add to the pipe to remove the new
variable, which
doesn’t provide any useful information. (Every row in the data set shows
new cases of TB and has the same value of new
).
who %>%
select(-iso2, -iso3) %>%
gather(key = "codes", value = "n", 3:58) %>%
separate(codes, into = c("new", "type", "sexage"), sep = "_") %>%
separate(sexage, into = c("sex", "age"), sep = 1)
who %>%
select(-iso2, -iso3) %>%
gather(key = "codes", value = "n", 3:58) %>%
separate(codes, into = c("new", "type", "sexage"), sep = "_") %>%
separate(sexage, into = c("sex", "age"), sep = 1) %>%
select(-new)
n
Notice that the n
column of who
contains
the most insightful information. You do not need to take any measurments
to list out the country, year, type, sex, and age combinations in the
data set. In a sense, you know these combinations in advance. However,
n
shows how many cases of TB were reported for each
combination. You do not know this information in advance, and you can
only acquire it through field work—yours or someone else’s. As a result,
it is concerning that our data contains so many NA
s for
n
.
NA
NA
is R’s symbol for missing information, and it is
common to have multiple NA
s when you reshape your data from
a wide format to a long format. The rectangular table structure imposed
by wide data requires a place holder for every combination of variable
values—even if no data was collected for that combination.
In contrast, the long data format does not require a place holder for each combination of variable values. Since each combination is saved as its own row, you can simply not include rows that contain an NA.
drop_na()
The tidyr package provides a convenient function for dropping rows
that contain an NA in a specific column. The function is
drop_na()
. To use it, give drop_na()
a data
set (perhaps via a pipe), then list one or more columns in that data
set, e.g.
data %>% drop_na(column1, column2)
drop_na()
will drop every row that contains an NA in one
or more of the listed columns.
Add drop_na()
to the pipe below to drop every row that
has an NA in the n
column.
who %>%
select(-iso2, -iso3) %>%
gather(key = "codes", value = "n", 3:58) %>%
separate(codes, into = c("new", "type", "sexage"), sep = "_") %>%
separate(sexage, into = c("sex", "age"), sep = 1) %>%
select(-new)
who %>%
select(-iso2, -iso3) %>%
gather(key = "codes", value = "n", 3:58) %>%
separate(codes, into = c("new", "type", "sexage"), sep = "_") %>%
separate(sexage, into = c("sex", "age"), sep = 1) %>%
select(-new) %>%
drop_na(n)
Recap
Good job! You’ve wrangled who
into a tidy, polished data
set that is ready to be explored, modelled, and analyzed.
The difference between the initial and final versions of
who
is drastic, but each step in our pipe imposed a small,
logical change. This is by design.
The tidyverse contains a vocabulary of functions that each do one simple thing, but can be combined to do more sophisticated tasks. In this way, the tidyverse is like a written language, it is made up of words (functions) that can be combined into sentences that have a sophisticated meaning (pipes).
This approach also makes it easier to solve problems with code. You can approach any problem by decomposing it into a series of small, simple steps.