Welcome
A line graph displays a functional relationship between two continuous variables. A map displays spatial data. The two may seem different, but they are made in similar ways. This tutorial will examine them both.
In this tutorial, you’ll learn how to:
- Make new types of line plots with geom_step(), geom_area(), geom_path(), and geom_polygon()
- Avoid “whipsawing” with the group aesthetic
- Find and plot map data with geom_map()
- Transform a coordinate system into a map projection with coord_map()
The tutorial is adapted from R for Data Science by Hadley Wickham and Garrett Grolemund, published by O’Reilly Media, Inc., 2016, ISBN: 9781491910399. You can purchase the book at shop.oreilly.com.
The tutorial uses the ggplot2, maps, mapproj, and dplyr packages, which have been pre-loaded for your convenience.
Line graphs
Line Graph vs. Scatterplot
Like scatterplots, line graphs display the relationship between two continuous variables. However, unlike scatterplots, line graphs expect the variables to have a functional relationship, where each value of \(x\) is associated with only one value of \(y\).
For example, in the plot below, there is only one value of unemploy for each value of date.
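A quick way to check whether two variables have this kind of functional relationship is to look for repeated x values. The sketch below assumes only the economics dataset that ships with ggplot2; the name has_unique_dates is made up for illustration.

```r
library(ggplot2)

# A functional relationship needs exactly one row per x value.
# anyDuplicated() returns 0 when no value of date is repeated.
has_unique_dates <- anyDuplicated(economics$date) == 0
has_unique_dates
```

If the check came back FALSE, a single line would have to pass through several y values at the same x position.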
geom_line()
Use the geom_line() function to make line graphs. Like geom_point(), it requires x and y aesthetics.
Use geom_line() in the chunk below to recreate the graph above. The graph uses the economics dataset that comes with ggplot2 and maps the date and unemploy variables to the \(x\) and \(y\) axes. See Visualization Basics if you are completely stuck.
ggplot(economics) +
geom_line(mapping = aes(x = date, y = unemploy))
asia
I’ve used the gapminder package to assemble a new data set named asia to plot. Among other things, asia contains the per capita GDP of four countries from 1952 to 2007.
asia
whipsawing
However, when we plot the asia data, we get an odd-looking graph. The line seems to “whipsaw” up and down. Whipsawing is one of the most commonly encountered challenges with line graphs.
ggplot(asia) +
geom_line(mapping = aes(x = year, y = gdpPercap))
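One way to see where a whipsaw comes from is to count how many rows share each x value. The sketch below uses a small made-up data frame (toy, with hypothetical countries A and B and invented numbers), not the tutorial's asia data.

```r
# Hypothetical stand-in for asia: two countries observed in the same years
toy <- data.frame(
  country   = rep(c("A", "B"), each = 3),
  year      = rep(c(2000, 2005, 2010), times = 2),
  gdpPercap = c(1000, 1200, 1500, 9000, 9500, 9900)
)

# Number of rows that share each year. A count above one means a single
# line has to visit several y values at the same x position.
rows_per_year <- table(toy$year)
rows_per_year
```

Whenever a count like this is greater than one, a single ungrouped line will jump up and down at that x value.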
Review 1 - Whipsawing
Multiple lines
Redraw our graph as a scatterplot. Can you spot more than one “line” in the data?
ggplot(asia) +
geom_point(mapping = aes(x = year, y = gdpPercap))
group
Many geoms, like lines, boxplots, and smooth lines, use a single object to display the entire dataset. You can use the group aesthetic to instruct these geoms to draw separate objects for different groups of observations.
For example, in the code below, you can map group to the grouping variable country to create a separate line for each country. Try it. Be sure to place the group mapping inside of aes().
ggplot(asia) +
geom_line(mapping = aes(x = year, y = gdpPercap))
ggplot(asia) +
geom_line(mapping = aes(x = year, y = gdpPercap, group = country))
aesthetics
You do not have to rely on the group aesthetic to perform a grouping. ggplot2 will automatically group a monolithic geom whenever you map an aesthetic to a categorical variable.
So for example, the code below performs an implied grouping. And since we use the color aesthetic, the plot includes the color legend.
ggplot(asia) +
geom_line(mapping = aes(x = year, y = gdpPercap, color = country))
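If you want to confirm the implied grouping, you can inspect the data that ggplot2 builds for the layer. This is a minimal sketch that uses ggplot2's layer_data() helper on a made-up data frame (toy, with hypothetical countries A and B); the tutorial's asia data would behave the same way, but that is an assumption here.

```r
library(ggplot2)

# Hypothetical stand-in for asia
toy <- data.frame(
  country   = rep(c("A", "B"), each = 3),
  year      = rep(c(2000, 2005, 2010), times = 2),
  gdpPercap = c(1000, 1200, 1500, 9000, 9500, 9900)
)

ungrouped <- ggplot(toy) +
  geom_line(mapping = aes(x = year, y = gdpPercap))

by_color <- ggplot(toy) +
  geom_line(mapping = aes(x = year, y = gdpPercap, color = country))

# layer_data() returns the computed data for a layer, including the
# group column that ggplot2 uses to decide how many lines to draw
n_groups_ungrouped <- length(unique(layer_data(ungrouped)$group))
n_groups_by_color  <- length(unique(layer_data(by_color)$group))
c(ungrouped = n_groups_ungrouped, by_color = n_groups_by_color)
```

The count for the color-mapped plot should match the number of countries, even though the code never mentions group.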
linetype
Lines recognize a useful aesthetic that we haven’t encountered before, linetype. Change color to linetype below and inspect the results. What happens if you map both a color and a linetype to country?
ggplot(asia) +
geom_line(mapping = aes(x = year, y = gdpPercap, color = country))
ggplot(asia) +
geom_line(mapping = aes(x = year, y = gdpPercap, linetype = country, color = country))
Exercise 1 - Life Expectancy
Use what you’ve learned to plot the life expectancy of each country over time. Life expectancy is saved in the asia data set as lifeExp. Which country has the highest life expectancy? The lowest?
ggplot(asia) +
geom_line(mapping = aes(x = year, y = lifeExp, color = country, linetype = country))
Similar geoms
geom_step()
geom_step() draws a line chart in a stepwise fashion. To see what I mean, change the geom in the plot below and rerun the code.
ggplot(asia) +
geom_line(mapping = aes(x = year, y = lifeExp, color = country, linetype = country))
ggplot(asia) +
geom_step(mapping = aes(x = year, y = lifeExp, color = country, linetype = country))
geom_area()
geom_area() is similar to a line graph, but it fills in the area under the line. To see geom_area() in action, change the geom in the plot below and rerun the code.
ggplot(economics) +
geom_line(mapping = aes(x = date, y = unemploy))
ggplot(economics) +
geom_area(mapping = aes(x = date, y = unemploy))
Review 2 - Set vs. Map
Do you recall from Visualization Basics how you would set the fill of our plot to blue (instead of, say, map the fill to a variable)? Give it a try.
ggplot(economics) +
geom_area(mapping = aes(x = date, y = unemploy))
ggplot(economics) +
geom_area(mapping = aes(x = date, y = unemploy), fill = "blue")
Accumulation
geom_area() is a great choice if your measurements represent the accumulation of objects (like unemployed people). Notice that the \(y\) axis for geom_area() always begins or ends at zero.
Perhaps because of this, geom_area() can be quirky when you have multiple groups. Run the code below. Can you tell what happens here?
ggplot(asia) +
geom_area(mapping = aes(x = year, y = lifeExp, fill = country))
Review 3 - Position adjustments
If you answered that people in China were living to be 300 years old, you guessed wrong.
geom_area() is stacking each group above the group below. As a result, the line that should display the life expectancy for China displays the combined life expectancy for all countries.
You can fix this by changing the position adjustment for geom_area(). Give it a try below. Change the position parameter from "stack" (the implied default) to "identity". See Bar Charts if you’d like to learn more about position adjustments.
ggplot(asia) +
geom_area(mapping = aes(x = year, y = lifeExp, fill = country), alpha = 0.3)
ggplot(asia) +
geom_area(mapping = aes(x = year, y = lifeExp, fill = country), position = "identity", alpha = 0.3)
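The arithmetic behind stacking can be sketched in base R, without ggplot2. The data frame below is made up (hypothetical countries A and B with invented life expectancies), and the cumulative sum only imitates what a "stack" adjustment does to the top edge of each band; it is not ggplot2's own implementation.

```r
# Hypothetical data: two countries, two years
toy <- data.frame(
  year    = rep(c(2000, 2005), times = 2),
  country = rep(c("A", "B"), each = 2),
  lifeExp = c(60, 62, 70, 72)
)

# With position = "identity", each band tops out at its own value
toy$identity_top <- toy$lifeExp

# With stacking, each band sits on the bands below it, so its top edge
# is the running total of lifeExp within that year
toy$stacked_top <- ave(toy$lifeExp, toy$year, FUN = cumsum)

toy
```

In the stacked column, the upper band's top edge is the sum of both countries, which is why the stacked plot seems to show implausibly long lives.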
geom_path()
geom_line() comes with a strange bed-fellow, geom_path(). geom_path() draws a line between points like geom_line(), but instead of connecting points in the order that they appear along the \(x\) axis, geom_path() connects the points in the order that they appear in the data set.
It starts with the observation in row one of the data and connects it to the observation in row two, which it then connects to the observation in row three, and so on.
geom_path() example
To see how geom_path() does this, let’s rearrange the rows in the economics dataset. We can reorder them by unemploy value. Now the data set will begin with the observation that had the lowest value of unemploy.
economics2 <- economics %>%
arrange(unemploy)
economics2
geom_path() example continued
If we plot the reordered data with both geom_line() and geom_path(), we get two very different graphs.
ggplot(economics2) +
geom_line(mapping = aes(x = date, y = unemploy))
ggplot(economics2) +
geom_path(mapping = aes(x = date, y = unemploy))
The plot on the left uses geom_line(), hence the points are connected in order along the \(x\) axis. The plot on the right uses geom_path(). These points are connected in the order that they appear in the dataset, which happens to put them in order along the \(y\) axis.
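Put another way, geom_path() should draw the same line as geom_line() once the rows are sorted along the \(x\) axis. The sketch below checks that idea; economics3 is a name I made up for the re-sorted data.

```r
library(ggplot2)
library(dplyr)

# Rows ordered by unemploy, as in the example above
economics2 <- economics %>%
  arrange(unemploy)

# Sorting by date again restores the left-to-right order
economics3 <- economics2 %>%
  arrange(date)

# geom_path() now connects the points in date order
ggplot(economics3) +
  geom_path(mapping = aes(x = date, y = unemploy))
```

This is only a demonstration of row order; in practice geom_line() does the sorting for you.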
A use case
Why would you want to use geom_path()? The code below illustrates one particularly useful case. The tx dataset contains latitude and longitude coordinates saved in a specific order.
tx
tx
What do you think happens when you plot the data in tx? Run the code to find out.
ggplot(tx) +
geom_path(mapping = aes(x = long, y = lat))
geom_polygon()
geom_polygon() extends geom_path() one step further: it connects the last point to the first and then colors the interior region with a fill. The result is a polygon.
ggplot(tx) +
geom_polygon(mapping = aes(x = long, y = lat))
Exercise 2 - Shattered Glass
What do you think went wrong in the plot of Texas below?
Maps
maps
The tx data set comes from the maps package, which is an R package that contains similarly formatted data sets for many regions of the globe.
A short list of the datasets saved in maps includes: france, italy, nz, usa, world, and world2, along with county and state. These last two map the US at the county and state levels. To learn more about maps, run help(package = maps).
map_data
You do not need to access the maps package to use its data. ggplot2 provides the function map_data(), which fetches maps from the maps package and returns them in a format that ggplot2 can plot.
map_data syntax
To use map_data(), give it the name of a dataset to retrieve. You can retrieve a subset of the data by providing an optional region argument. For example, I can use this code to retrieve a map of Florida from state, which is the dataset that contains the contiguous US states (it does not include Alaska or Hawaii).
fl <- map_data("state", region = "florida")
ggplot(fl) +
geom_polygon(mapping = aes(x = long, y = lat))
Alter the code to retrieve and plot your home state (Try Idaho if you are outside of the US). Notice the capitalization.
id <- map_data("state", region = "idaho")
ggplot(id) +
geom_polygon(mapping = aes(x = long, y = lat))
state
If you do not specify a region, map_data() will retrieve the entire data set, in this case state.
us <- map_data("state")
In practice, you will often have to retrieve an entire dataset at least once to learn what region names to use with map_data(). The names will be stored in the region column of the dataset.
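For instance, a quick way to list the available names is to pull the unique values of the region column. This sketch assumes the maps package is installed; region_names is a name chosen for illustration.

```r
library(ggplot2)

# Retrieve the whole state dataset once
us <- map_data("state")

# The distinct region names that map_data() will accept for this dataset
region_names <- unique(us$region)
head(region_names)
```

The names come back in lower case, which is why the earlier examples ask for "florida" and "idaho" rather than capitalized names.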
Hmmm
The code below retrieves and plots the entire state data set, but something goes wrong. What?
us <- map_data("state")
ggplot(us) +
geom_polygon(mapping = aes(x = long, y = lat))
Multiple polygons
In this case, our data is not out of order, but it contains more than one polygon: it contains at least one polygon for each state.
By default, geom_polygon() tries to plot a single polygon, which causes it to connect multiple polygons in weird ways.
groups
Which aesthetic can you use to plot multiple polygons? In the code below, map the aesthetic to the group variable in the state dataset. This variable contains all of the grouping information needed to make a coherent map. Then rerun the code.
ggplot(us) +
geom_polygon(mapping = aes(x = long, y = lat))
ggplot(us) +
geom_polygon(mapping = aes(x = long, y = lat, group = group))
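To see why the map needs the group variable rather than region, you can compare how many distinct values each column holds. This is a small sketch that assumes the maps package is installed; n_regions and n_polygons are illustrative names.

```r
library(ggplot2)

us <- map_data("state")

# One region per named area, but possibly several polygons per region
# (for example, a state that includes islands)
n_regions  <- length(unique(us$region))
n_polygons <- length(unique(us$group))
c(regions = n_regions, polygons = n_polygons)
```

If the polygon count is higher than the region count, grouping by region alone would still join separate outlines together.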
USArrests
R comes with a data set named USArrests that we can use in conjunction with our plot above to make a choropleth map. A choropleth map uses the color of each region in the plot to display some value associated with the region.
In our case we will use the UrbanPop variable of USArrests, which records how urbanized each state was in 1973. UrbanPop is the percent of the population who lived within a city.
USArrests
geom_map()
You can use geom_map() to create choropleth maps. geom_map() pairs a data frame like USArrests with a map dataset like us by matching region names.
Data wrangling
To use geom_map(), we first need to ensure that a common set of region names appears across both datasets.
At the moment, this isn’t the case. USArrests uses capitalized state names and hides them outside of the dataset in the row names (instead of in a column). In contrast, us uses a column of lower case state names. The code below fixes this.
# rownames_to_column() comes from the tibble package
USArrests2 <- USArrests %>%
  tibble::rownames_to_column("region") %>%
  mutate(region = tolower(region))
USArrests2
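Before plotting, it can help to check which region names fail to match across the two datasets. The sketch below uses base R's setdiff() on the lower-cased row names; it assumes the maps package is installed, and the names arrest_regions, missing_from_map, and missing_from_data are made up for illustration.

```r
library(ggplot2)

us <- map_data("state")

# Lower-cased state names from USArrests, as in the wrangling step above
arrest_regions <- tolower(rownames(USArrests))
map_regions    <- unique(us$region)

# States with data but no outline in the map dataset
missing_from_map <- setdiff(arrest_regions, map_regions)

# Map regions with no matching row of data
missing_from_data <- setdiff(map_regions, arrest_regions)

missing_from_map
missing_from_data
```

Regions that appear in only one of the two datasets cannot be filled, so expect them to be absent from the finished map. As far as I know, the state dataset covers only the contiguous US, so Alaska and Hawaii should show up in the first list.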
geom_map() syntax
To use geom_map():
- Initialize a plot with the data set that contains your data. Here that is USArrests2.
- Add geom_map(). Set the map_id aesthetic to the variable that contains the region names. Then set the fill aesthetic to the fill variable. You do not need to supply x and y aesthetics; geom_map() will derive these values from the map data set, which you must set with the map parameter. Since map is a parameter, it should go outside the aes() function.
- Follow geom_map() with expand_limits(), and tell expand_limits() what the \(x\) and \(y\) variables in the map dataset are. This shouldn’t be necessary in future iterations of geom_map(), but for now ggplot2 will use the x and y arguments of expand_limits() to build the bounding box for your plot.
ggplot(USArrests2) +
geom_map(aes(map_id = region, fill = UrbanPop), map = us) +
expand_limits(x = us$long, y = us$lat)
coord_map()
You may have noticed that our maps look a little off. So far, we’ve plotted them in Cartesian coordinates, which distort the spherical surface described by latitude and longitude. Also, ggplot2 adjusts the aspect ratio of our plots to fit our graphing window, which can further distort our maps.
You can avoid both of these distortions by adding coord_map() to your plot. coord_map() displays the plot in a fixed cartographic projection. Note that coord_map() relies on the mapproj package, so you’ll need to have mapproj installed before you use coord_map().
ggplot(USArrests2) +
geom_map(aes(map_id = region, fill = UrbanPop), map = us) +
expand_limits(x = us$long, y = us$lat) +
coord_map()
projections
By default, coord_map() replaces the coordinate system with a Mercator projection. To use a different projection, set the projection argument of coord_map() to a projection name, surrounded by quotation marks.
To see this, extend the code below to view the map in a "sinusoidal" projection.
ggplot(USArrests2) +
geom_map(aes(map_id = region, fill = UrbanPop), map = us) +
expand_limits(x = us$long, y = us$lat)
ggplot(USArrests2) +
geom_map(aes(map_id = region, fill = UrbanPop), map = us) +
expand_limits(x = us$long, y = us$lat) +
coord_map(projection = "sinusoidal")
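Other projection names work the same way. As a sketch, the code below swaps in "orthographic", a projection name that appears in the ggplot2 documentation for coord_map(). It assumes the maps and mapproj packages are installed, and it plots plain state outlines with geom_polygon() instead of the choropleth to keep the example short; ortho is an illustrative name.

```r
library(ggplot2)

us <- map_data("state")

# Build the coordinate system once so that it can be reused across plots
ortho <- coord_map(projection = "orthographic")

ggplot(us) +
  geom_polygon(mapping = aes(x = long, y = lat, group = group)) +
  ortho
```

Storing the coordinate system in a variable is optional; adding coord_map(projection = "orthographic") directly to the plot does the same thing.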
Recap
You can now make all of the plots recommended in the Exploratory Data Analysis tutorial. The next tutorial in this primer will teach you several strategies for dealing with overplotting, a problem that can occur when you have large data or low resolution data.