Dataviz with ggplot2

Marcelino Guerra

Last update: 2/4/2021

Line plots

In this exercise, we are working with urbanization and per capita income of a bunch of countries from Our World in Data. The first chart is a simple line plot. The first step is to import this .csv file. You might want to check the data using View().

library(tidyverse)
urban_gdp<-read_csv("urbanization_gdp.csv", col_names=T)

Let’s check the evolution of the share of urban population (%) in Brazil, from 1950 to 2016. To do that, you need to filter the full dataset.

brazil<-urban_gdp%>%filter(Entity=="Brazil")%>%arrange(Year)%>%filter(Year>=1950)

Working with the filtered data, use ggplot() package to create line plots. The package ggthemes allows you to customize the chart appearance, and you might want to check all the possibilites here. For this exercise, I will use theme_economist. The basic line plot works with geom_line(), and you can set the line type and its size easily. theme() adjust the size of the label in the axis, and scale_x_continuous() gives you some freedom to establish the years that appear in the x axis (note that we are not working with dates yet!).

library(ggthemes)

ggplot(data=brazil, aes(x=Year,y=`Share of population living in urban areas (%)`))+
  geom_line(size=1.4, color="skyblue4")+ labs(x = "Year", y="Share of population living in urban areas (%) in Brazil")+
  theme_economist(base_size = 14) +scale_colour_economist()+
  scale_x_continuous(breaks = seq(from = 1950, to =2016 , by =10))+
  theme(axis.text=element_text(size=12),
        axis.title=element_text(size=12,face="bold"))

Scatter plots

Using a scatter plot, you can see the extent to which two variables are associated, and it is a very compelling way to show simple correlations. For example, one can argue that per capita income goes along with urbanization - as countries urbanize, they get richer. To construct a scatter plot between those two variables for 2016, we filter the full dataset (urban_gdp) for Year==2016. You can check your new dataset using View(urban_gdp). There are entities without information about GDP per capita, and we drop those NA’s. If you look at the data, you will realize that when an Entity has Code equal to NA that means the entity is not actually a single country, and we also want to drop those data points. Finally, we drop the entity World because we only want to work with countries.

As you can see, cleaning the data is a crucial step before plotting. Finally, I will rename columns 4 and 5 and create a new variable applying the natural logarithm to the GDP per capita.

urban_gdp_16<-urban_gdp%>%filter(Year==2016)
urban_gdp_16<-urban_gdp_16%>%drop_na(Code, `GDP per capita (2011 int-$) ($)`)
urban_gdp_16<-urban_gdp_16%>%filter(Entity!="World")
names(urban_gdp_16)[4]<-"urban_pop"
names(urban_gdp_16)[5]<-"gdp"
urban_gdp_16$ln_gdp<-log(urban_gdp_16$gdp)
head(urban_gdp_16,3)

## # A tibble: 3 x 7
##   Entity      Code   Year urban_pop   gdp `Total population (Gapminder)` ln_gdp
##   <chr>       <chr> <dbl>     <dbl> <dbl>                          <dbl>  <dbl>
## 1 Afghanistan AFG    2016      25.0  1929                             NA   7.56
## 2 Albania     ALB    2016      58.4 11285                             NA   9.33
## 3 Algeria     DZA    2016      71.5 13328                             NA   9.50

Back to ggplot()! Instead of geom_line(), we use geom_point() to construct scatter plots.

ggplot(urban_gdp_16, aes(x=urban_pop, y=ln_gdp)) +
  geom_point(color="darkgrey") + 
  scale_x_continuous(name = "Share of population living in urban areas (%)") +
  scale_y_continuous(name = "Ln GDP per capita U$ 2011") +
  theme_economist_white(base_size = 17, gray_bg=FALSE)+
  theme(axis.text=element_text(size=12),
        axis.title=element_text(size=12,face="bold"))

You can do better than that. The package ggrepel allows you to label the dots in your chart (function geom_text_repel()). I will label the dots with respective country codes. Finally, stat_smooth() adds the regression line to your plot.

library(ggrepel)
ggplot(urban_gdp_16, aes(x=urban_pop, y=ln_gdp)) + geom_point(color="darkgrey") + geom_text_repel(aes(label=Code), size=3)+
  stat_smooth(method = "lm", formula =y~x, se=F, color="black") +
  scale_x_continuous(name = "Share of population living in urban areas (%)") +
  scale_y_continuous(name = "Ln GDP per capita U$ 2011") +
  theme_economist_white(base_size = 17, gray_bg=FALSE)+
  theme(axis.text=element_text(size=12),
        axis.title=element_text(size=12,face="bold"))

Here we see the strong and positive relationship between the share of the population living in urban areas (%) and the GDP per capita (U$ 2011) for 164 selected countries. For more facts about urbanization, check Our World in Data.