Data visualization with ggplot2 (1)

An introduction to the R package ggplot2 and its grammar of graphics

1. Introduction

When we start analyzing a data set, the first step is usually exploratory data analysis (EDA). We try to find the main characteristics of the data, and data visualization is an important way to do that.

Here I list five basic types of plot that will be useful in EDA:

  1. Scatter plot

    A scatter plot shows the relationship between two variables, typically both numerical. For example, we can use a scatter plot to display the relationship between people’s weight and height, or between their years of education and average salary.

  2. Line plot

    A line plot also illustrates the relationship between two variables. It is especially useful for time series data, where the observations are indexed by time. For example, we can plot the daily stock price of Apple over the past two years as a line plot.

  3. Histogram

    A histogram shows the distribution of a numerical (continuous) variable, such as people’s height. Loosely speaking, as the sample grows and the bins get narrower, a density-scaled histogram approaches the probability density function of the variable. In practice, we can also add a density plot as a layer on top of the histogram.

  4. Box plot

    Like the histogram, a box plot shows the distribution of a numerical variable. It also summarizes the data through its quantiles: minimum, 25% quantile, median, 75% quantile and maximum. When outliers are drawn as separate points, the whiskers stop at the most extreme non-outlying values, which makes the outliers stand out.

  5. Bar plot

    A bar plot is useful when we deal with the distribution of a categorical variable. For example, we may want to know how often each of the top 10 machine learning algorithms is used in FLAG.

In summary, scatter plots and line plots are applied to two variables; histograms, box plots and bar plots are the result of aggregating the data, and are usually applied to a single variable. A helpful data visualization cheat sheet is available on the RStudio website.
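To make the aggregation behind a box plot concrete, the five summary numbers can be computed directly in base R. This is only a sketch on a made-up vector named heights (not data from this post), and the 1.5 * IQR rule shown is the common convention for flagging outliers:

```r
# Hypothetical sample of heights in cm (made up for illustration)
heights <- c(150, 158, 160, 162, 165, 167, 170, 172, 175, 181, 199)

# Minimum, 25% quantile, median, 75% quantile, maximum
five_num <- quantile(heights, probs = c(0, 0.25, 0.5, 0.75, 1))
print(five_num)

# Values further than 1.5 * IQR from the box are the ones a box plot
# conventionally draws as individual outlier points
iqr <- IQR(heights)
lower <- five_num[["25%"]] - 1.5 * iqr
upper <- five_num[["75%"]] + 1.5 * iqr
outliers <- heights[heights < lower | heights > upper]
print(outliers)
```

A histogram or bar plot aggregates in a similar spirit, except that it counts observations per bin or per category instead of computing quantiles.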

2. Data

Next we illustrate these basic plots on a real-world data set from Kaggle: housing prices in Seattle. The data has 21 columns: an id, the sale price, and 19 other variables such as the number of bedrooms, bathrooms and floors. It has already been split into a training set and a testing set (click them to get the data).

3. Play with five basic plots

ggplot2 was developed by Hadley Wickham, chief scientist at RStudio. It is now one of the most widely used data visualization packages in R, and it’s my favorite plotting package.

First, we install and load the package and import the data:

> install.packages('ggplot2')
> library(ggplot2)
> setwd("C:/Users/Bangda/Desktop/project-housing price analysis")
> training <- read.csv("training.csv", header = TRUE)
> testing <- read.csv("test.csv", header = TRUE)

We start with the training set. Here is a quick look at the data:

> str(training)
'data.frame': 7088 obs. of 21 variables:
$ id : num 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
$ date : Factor w/ 340 levels "20140502T000000",..: 152 202 265 202 258 10 55 231 310 278 ...
$ price : num 221900 538000 180000 604000 510000 ...
$ bedrooms : int 3 3 2 4 3 4 3 3 3 3 ...
$ bathrooms : num 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
$ sqft_living : int 1180 2570 770 1960 1680 5420 1715 1060 1780 1890 ...
$ sqft_lot : int 5650 7242 10000 5000 8080 101930 6819 9711 7470 6560 ...
$ floors : num 1 2 1 1 1 1 2 1 1 2 ...
$ waterfront : int 0 0 0 0 0 0 0 0 0 0 ...
$ view : int 0 0 0 0 0 0 0 0 0 0 ...
$ condition : int 3 3 3 5 3 3 3 3 3 3 ...
$ grade : int 7 7 6 7 8 11 7 7 7 7 ...
$ sqft_above : int 1180 2170 770 1050 1680 3890 1715 1060 1050 1890 ...
$ sqft_basement: int 0 400 0 910 0 1530 0 0 730 0 ...
$ yr_built : int 1955 1951 1933 1965 1987 2001 1995 1963 1960 2003 ...
$ yr_renovated : int 0 1991 0 0 0 0 0 0 0 0 ...
$ zipcode : int 98178 98125 98028 98136 98074 98053 98003 98198 98146 98038 ...
$ lat : num 47.5 47.7 47.7 47.5 47.6 ...
$ long : num -122 -122 -122 -122 -122 ...
$ sqft_living15: int 1340 1690 2720 1360 1800 4760 2238 1650 1780 2390 ...
$ sqft_lot15 : int 5650 7639 8062 5000 7503 101930 6819 9711 8113 7570 ...

Ok, all set. We can start doing some visualizations now!

ggplot2 builds plots from a data.frame, one of the basic data structures in R. Its core idea is the grammar of graphics, which sets it apart from many other packages and software. There are two principles:

  1. graphics are made up of distinct layers of grammatical elements;
  2. meaningful plots are built through aesthetic mappings.

Here are three basic grammatical elements to create a plot using ggplot2:

  1. data;
  2. aesthetics;
  3. geometries.

The basic function is ggplot(), where we specify the data as an argument. Next come the geometric objects, written geom_*() in ggplot2: for example, geom_point() for scatter plots and geom_histogram() for histograms. We also need to specify the aesthetic mapping. The basic form is mapping = aes(x = .., y = ..), and it can be given either in ggplot() or in a geom_*() call.
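The two places where a mapping can go can be sketched as follows. To keep the snippet self-contained it uses mtcars, a data set that ships with R, as a stand-in for the housing data; the object names p_global and p_layer are just for illustration:

```r
library(ggplot2)

# mtcars ships with R; it stands in for the housing data here (assumption)

# Mapping given in ggplot(): inherited by every layer added afterwards
p_global <- ggplot(mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

# Mapping given in the geom: applies to that layer only
p_layer <- ggplot(mtcars) +
  geom_point(mapping = aes(x = wt, y = mpg))

print(p_global)
print(p_layer)
```

Both calls draw the same scatter plot. Putting the mapping in ggplot() saves typing when several layers share it, while putting it in the geom keeps each layer explicit, which is the style used in the rest of this post.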

Scatter plot: we want to show the relationship between sqft_living (square footage of the apartment’s interior living space) and price. We map sqft_living to the x-axis and price to the y-axis, and set size = 2; we will discuss alpha later. Here is our first plot:

> ggplot(training) +
+   geom_point(mapping = aes(x = sqft_living, y = price), alpha = 1/10, size = 2) +
+   labs(x = 'sqft living', y = 'price', title = 'price ~ sqft living')
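As a brief preview of what alpha does in the plot above, here is a sketch on simulated points (not the housing data). The data frame sim and the plot names are made up for illustration:

```r
library(ggplot2)

set.seed(1)
# Simulated, heavily overlapping points (an assumption, not the housing data)
sim <- data.frame(x = rnorm(5000), y = rnorm(5000))

# Fully opaque points: the dense centre turns into one solid blob
p_opaque <- ggplot(sim) +
  geom_point(aes(x = x, y = y), size = 2)

# alpha = 1/10 makes each point mostly transparent, so roughly ten
# overlapping points are needed before the colour looks solid
p_alpha <- ggplot(sim) +
  geom_point(aes(x = x, y = y), alpha = 1/10, size = 2)

print(p_opaque)
print(p_alpha)
```

With many observations, as in the training set, the transparent version shows where the points concentrate instead of hiding them under each other.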

Box plot: next we want to show the distribution of price within each level of view. We map view to the x-axis and price to the y-axis. Note that view is stored as a numerical type in the original data set, but it should be treated as a categorical variable, so we apply factor() to it. This gives our second plot:

> ggplot(training) +
+ geom_boxplot(mapping = aes(x = factor(view), y = price)) +
+ labs(x = "view", y = "price", title = "price ~ views")
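The effect of factor() can be seen in a small sketch. It uses the built-in mtcars data, where cyl is stored as a number in the same way view is in the housing data; this substitution is an assumption made only to keep the example self-contained:

```r
library(ggplot2)

# Without factor(): x is treated as continuous, so ggplot2 does not split
# the data into groups (it typically warns about the continuous x aesthetic)
p_numeric <- ggplot(mtcars) +
  geom_boxplot(aes(x = cyl, y = mpg))

# With factor(): each distinct value of cyl becomes its own box
p_factor <- ggplot(mtcars) +
  geom_boxplot(aes(x = factor(cyl), y = mpg)) +
  labs(x = "cyl", y = "mpg")

print(p_factor)
```

Here p_factor draws one box per number of cylinders (4, 6 and 8), which is the grouped comparison we want for view as well.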

Histogram: now for the third basic plot. Suppose we just want to know the distribution of house prices. price is mapped to the x-axis, and no variable from the data is mapped to the y-axis, because the bar heights come from a statistical transformation (binning). Here we set y to ..density.. so that the y-axis shows density rather than count; recent versions of ggplot2 prefer the spelling after_stat(density).

> ggplot(training) +
+ geom_histogram(aes(x = price, y = ..density..)) +
+ labs(x = "price")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Whenever you draw a histogram with ggplot2 without choosing the bins, it reminds you to try different bin widths to get a better view. That is what we should do when working with histograms; we will discuss this more later.
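A minimal sketch of choosing the bin width explicitly, with a density layer on top of the histogram, might look like this. The values are simulated from a log-normal distribution as a rough stand-in for prices, and the bin width of 50000 is an arbitrary choice for illustration:

```r
library(ggplot2)

set.seed(42)
# Simulated right-skewed values standing in for prices (assumption)
sim <- data.frame(value = rlnorm(2000, meanlog = 13, sdlog = 0.5))

p_hist <- ggplot(sim, aes(x = value)) +
  # after_stat(density) is the newer spelling of ..density..
  # (available in ggplot2 3.3.0 and later)
  geom_histogram(aes(y = after_stat(density)), binwidth = 50000) +
  # Kernel density estimate drawn as a second layer on the same scale
  geom_density(colour = "red") +
  labs(x = "value", y = "density")

print(p_hist)
```

Because binwidth is set, ggplot2 no longer prints the reminder about bins = 30. Trying a few different values is the quickest way to see how much the shape of the histogram depends on that choice.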

Bar plot: a bar plot can in some sense be viewed as a ‘discrete’ version of the histogram. For example, we want to show the distribution of grade. We map grade to the x-axis and set stat = 'count':

> ggplot(training) +
+ geom_bar(aes(x = factor(grade)), stat = 'count') +
+ labs(x = 'grade')
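To show what stat = 'count' computes, the sketch below compares geom_bar() with counts calculated by hand. It uses cyl from the built-in mtcars data in the role of grade, which is an assumption made only so the snippet runs on its own:

```r
library(ggplot2)

# geom_bar() with stat = "count" (its default) counts the rows per category
p_bar <- ggplot(mtcars) +
  geom_bar(aes(x = factor(cyl)), stat = "count") +
  labs(x = "cyl")

# The same counts computed by hand with table()
counts <- as.data.frame(table(cyl = mtcars$cyl))
print(counts)

# geom_col() draws bars whose heights are already in the data
p_col <- ggplot(counts) +
  geom_col(aes(x = cyl, y = Freq)) +
  labs(x = "cyl", y = "count")

print(p_bar)
print(p_col)
```

The two plots show the same bars. geom_bar() is convenient when we have the raw rows, as with the training set, while geom_col() fits data that has already been aggregated.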

4. Summary

We introduced the five most basic plots in data visualization and applied four of them to a real-world data set. Since the syntax of a line plot is similar to that of a scatter plot, we didn’t give an example.
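For completeness, a line plot follows the same pattern with geom_line() in place of geom_point(). The sketch below assumes the economics data set that is bundled with ggplot2 (a monthly US time series) rather than the housing data, which has no natural time series to plot here:

```r
library(ggplot2)

# economics is bundled with ggplot2: one row per month, with a date column
# and unemploy, the number of unemployed people in thousands
p_line <- ggplot(economics) +
  geom_line(mapping = aes(x = date, y = unemploy)) +
  labs(x = "date", y = "unemployed (thousands)", title = "unemploy ~ date")

print(p_line)
```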

We saw that the basic elements for creating a plot with ggplot2 are the data (a data.frame in R), aesthetics and geometric objects. This is a new way of thinking for people who are used to plotting with R’s base graphics package or with other languages like Matlab or Python.

For now the plots are plain. To polish them, we need to specify more aesthetics and parameters, such as the size and color of geometric objects or the scales and ticks of the axes. To extract more information, we sometimes also need techniques like faceting. We will discuss these in later posts.