Introduction to DataExplorer

Boxuan Cui

2018-01-09

This document introduces the package DataExplorer, and shows how it can help you with different tasks throughout your data exploration process.

There are 3 main goals for DataExplorer:

  1. Exploratory Data Analysis (EDA)
  2. Feature Engineering
  3. Data Reporting

The remainder of this guide is organized according to these goals. As the package evolves, more content will be added.

Data

We will be using the nycflights13 datasets for this document. If you have not installed the package, please do the following:

install.packages("nycflights13")
library(nycflights13)

There are 5 datasets in this package: airlines, airports, flights, planes, and weather.

If you want to quickly visualize the structure of all of them, you may do the following:

library(DataExplorer)
data_list <- list(airlines, airports, flights, planes, weather)
plot_str(data_list)

You may also try plot_str(data_list, type = "r") for a radial network.


Now let’s merge the flights, airlines, planes, and airports tables into a single, richer dataset for later sections.

merge_airlines <- merge(flights, airlines, by = "carrier", all.x = TRUE)
merge_planes <- merge(merge_airlines, planes, by = "tailnum", all.x = TRUE, suffixes = c("_flights", "_planes"))
merge_airports_origin <- merge(merge_planes, airports, by.x = "origin", by.y = "faa", all.x = TRUE, suffixes = c("_carrier", "_origin"))
final_data <- merge(merge_airports_origin, airports, by.x = "dest", by.y = "faa", all.x = TRUE, suffixes = c("_origin", "_dest"))
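To see what all.x = TRUE and suffixes do in these merges, here is a minimal sketch on two toy tables. The tables flights_toy and planes_toy are made up for illustration and are not part of nycflights13. Both have a year column, so the suffixes keep the two apart, which is where a name like year_planes comes from.

```r
# Toy tables (hypothetical, for illustration only)
flights_toy <- data.frame(
  tailnum = c("N1", "N2", "N3"),
  year = c(2013L, 2013L, 2013L),
  stringsAsFactors = FALSE
)
planes_toy <- data.frame(
  tailnum = c("N1", "N2"),
  year = c(1999L, 2005L),
  stringsAsFactors = FALSE
)

# Left join: keep every flight, even when no matching plane exists
merged_toy <- merge(
  flights_toy, planes_toy,
  by = "tailnum", all.x = TRUE,
  suffixes = c("_flights", "_planes")
)

# N3 has no plane record, so its year_planes is NA
print(merged_toy)
```

Unmatched rows are one source of the missing values profiled in the next section.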

Exploratory Data Analysis

Exploratory data analysis is the process of getting to know your data so that you can generate and test hypotheses. Visualization techniques are usually applied.

You can easily check the basic statistics with base R, e.g.,

dim(final_data)
summary(final_data)
object.size(final_data)

Missing values

Real-world data is messy. After running the basic descriptive statistics, you might be interested in the missing data profile. You can simply use the plot_missing function for this.

plot_missing(final_data)

You may also store the missing data profile with missing_data <- plot_missing(final_data) for additional analysis.
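If you only need the numbers behind such a profile, a rough base R equivalent is sketched below. The toy data frame and the column names num_missing and pct_missing are assumptions made for this example; they are not necessarily what plot_missing returns.

```r
# Toy data (made up for illustration)
toy <- data.frame(
  dep_delay = c(5, NA, 12, NA),
  tailnum = c("N1", "N2", NA, "N4"),
  distance = c(100, 200, 300, 400),
  stringsAsFactors = FALSE
)

# Count and share of missing values per feature
missing_profile <- data.frame(
  feature = names(toy),
  num_missing = colSums(is.na(toy)),
  pct_missing = colMeans(is.na(toy)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Show the features with the most missing values first
missing_profile <- missing_profile[order(-missing_profile$num_missing), ]
print(missing_profile)
```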

Distributions

To visualize distributions for all discrete features:

plot_bar(final_data)
## 5 columns ignored with more than 50 categories.
## dest: 105 categories
## tailnum: 4044 categories
## time_hour: 6936 categories
## model: 128 categories
## name: 102 categories

To visualize distributions for all continuous features:

plot_histogram(final_data)

You may also visualize just one feature using these functions:

plot_bar(final_data$manufacturer)

plot_histogram(final_data$seats)

Correlation

To visualize a correlation heatmap for discrete and continuous features:

plot_correlation(final_data, use = "pairwise.complete.obs")
## 6 features with more than 20 categories ignored!
## dest: 105 categories
## tailnum: 4044 categories
## time_hour: 6936 categories
## manufacturer: 36 categories
## model: 128 categories
## name: 102 categories
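A correlation needs numeric input, so discrete features have to be encoded first. One common approach is one-hot encoding, sketched below in base R on made-up data. This illustrates the idea under that assumption and is not a copy of what plot_correlation does internally. It also suggests why features with many categories are left out: each category becomes its own column.

```r
# Toy data (made up): two continuous features and one discrete feature
toy <- data.frame(
  delay = c(5, 20, NA, 40, 10, 35),
  distance = c(100, 400, 250, 900, 150, NA),
  origin = factor(c("JFK", "EWR", "JFK", "LGA", "EWR", "LGA"))
)

# One-hot encode the discrete feature: one 0/1 column per category
dummies <- model.matrix(~ origin - 1, data = toy)

# Combine the continuous columns with the indicator columns
numeric_toy <- cbind(toy[, c("delay", "distance")], dummies)

# For each pair of columns, use only the rows where both values are present
cor_matrix <- cor(numeric_toy, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))
```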

Slicing & dicing

Slicing and dicing data in different ways is often crucial to your analysis and can yield insights quickly. For example, if you would like to build a model to predict arrival delays, you can view the distribution of all continuous features by arrival delay with the following code:

## Reduce data size
arr_delay_df <- final_data[, c("arr_delay", "month", "day", "hour", "minute", "dep_delay", "distance", "year_planes", "seats", "speed")]
plot_boxplot(arr_delay_df, "arr_delay")

Among the subtle relationships with arrival delay, one stands out immediately: planes with 300+ seats tend to have much longer delays (16 to 21 hours). You can now drill down further to verify this or to generate more hypotheses.
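One way to drill down is to bin the delay yourself and summarize a feature within each bin. The sketch below uses made-up numbers, and the break points (0 and 60 minutes) are arbitrary assumptions; plot_boxplot chooses its own intervals.

```r
# Toy data (made up for illustration)
toy <- data.frame(
  arr_delay = c(-10, 0, 15, 30, 60, 120, 240, 480),
  seats = c(100, 150, 150, 180, 200, 200, 300, 350)
)

# Bin the continuous delay into three ordered groups (right-closed intervals)
toy$delay_bin <- cut(
  toy$arr_delay,
  breaks = c(-Inf, 0, 60, Inf),
  labels = c("early or on time", "up to 1 hour", "over 1 hour")
)

# Median number of seats within each delay bin
median_seats <- tapply(toy$seats, toy$delay_bin, median)
print(median_seats)
```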

An alternative visualization is a scatterplot. For example:

## Reduce data size
arr_delay_df2 <- final_data[, c("arr_delay", "month", "distance", "seats", "origin", "carrier", "manufacturer")]
plot_scatterplot(arr_delay_df2, "arr_delay", size = 0.8)

Feature Engineering

Feature engineering is the process of creating new features from existing ones. Newly engineered features often generate valuable insights.

Most functions in this section require the data to be a data.table object. You don’t need to know the data.table package, however: simply convert the object back to its original class after feature engineering.

Replace missing values

Missing values may carry meaning for a feature. Besides imputation, we can also set them to sensible fixed values. For example, for discrete features, we may want to group missing values into a new category. For continuous features, we may want to set missing values to a known number based on existing knowledge.

In DataExplorer, this can be done with set_missing. The function automatically matches the argument to either discrete or continuous features, i.e., if you specify a number, all missing continuous values will be set to that number. If you specify a string, all missing discrete values will be set to that string. If you supply both, both types will be set.

library(data.table)
final_dt <- data.table(final_data)
set_missing(final_dt, list(0L, "unknown"))
## Column [dep_time]: Set 8255 missing values to 0
## Column [dep_delay]: Set 8255 missing values to 0
## Column [arr_time]: Set 8713 missing values to 0
## Column [arr_delay]: Set 9430 missing values to 0
## Column [air_time]: Set 9430 missing values to 0
## Column [year_planes]: Set 57912 missing values to 0
## Column [engines]: Set 52606 missing values to 0
## Column [seats]: Set 52606 missing values to 0
## Column [speed]: Set 335813 missing values to 0
## Column [lat_dest]: Set 7602 missing values to 0
## Column [lon_dest]: Set 7602 missing values to 0
## Column [alt_dest]: Set 7602 missing values to 0
## Column [tz_dest]: Set 7602 missing values to 0
## Column [tailnum]: Set 2512 missing values to unknown
## Column [type]: Set 52606 missing values to unknown
## Column [manufacturer]: Set 52606 missing values to unknown
## Column [model]: Set 52606 missing values to unknown
## Column [engine]: Set 52606 missing values to unknown
## Column [name]: Set 7602 missing values to unknown
## Column [dst_dest]: Set 7602 missing values to unknown
## Column [tzone_dest]: Set 7602 missing values to unknown
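The matching rule described above (a number fills continuous features, a string fills discrete ones) can be sketched in base R as follows. The helper fill_missing_by_type is hypothetical and written only for this illustration. Unlike set_missing on a data.table, it returns a modified copy rather than updating the object in place.

```r
# Hypothetical helper: fill NAs according to column type
fill_missing_by_type <- function(df, number = NULL, string = NULL) {
  for (col in names(df)) {
    values <- df[[col]]
    missing_idx <- is.na(values)
    if (!any(missing_idx)) next
    if (is.numeric(values) && !is.null(number)) {
      # Continuous feature: use the supplied number
      values[missing_idx] <- number
    } else if (is.character(values) && !is.null(string)) {
      # Discrete feature stored as character: use the supplied string
      values[missing_idx] <- string
    } else if (is.factor(values) && !is.null(string)) {
      # Discrete feature stored as factor: add the string as a new level
      values <- as.character(values)
      values[missing_idx] <- string
      values <- factor(values)
    }
    df[[col]] <- values
  }
  df
}

# Toy data (made up for illustration)
toy <- data.frame(
  seats = c(100, NA, 300),
  engine = c("Turbo-fan", NA, "Turbo-jet"),
  stringsAsFactors = FALSE
)
filled <- fill_missing_by_type(toy, number = 0L, string = "unknown")
print(filled)
```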

Group sparse categories

From the bar charts above, we observed a number of discrete features with sparse categorical distributions. Sometimes, we want to group the low-frequency categories into a new bucket, or reduce the number of categories to a reasonable range. group_category will do the work.

Take the manufacturer feature as an example, and suppose we want to group the long tail into another category. We could try the bottom 20% (by count) first:

group_category(data = final_dt, feature = "manufacturer", threshold = 0.2)
##    manufacturer   cnt       pct   cum_pct
## 1:       BOEING 82912 0.2461933 0.2461933
## 2:      EMBRAER 66068 0.1961779 0.4423712
## 3:      unknown 52606 0.1562047 0.5985759
## 4:       AIRBUS 47302 0.1404554 0.7390313

As we can see, manufacturer will be shrunk down to 5 categories, i.e., BOEING, EMBRAER, unknown, AIRBUS and OTHER. If you like this threshold, you may specify update = TRUE to update the original dataset:

group_category(data = final_dt, feature = "manufacturer", threshold = 0.2, update = TRUE)
plot_bar(final_dt$manufacturer)
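The thresholding shown in the cum_pct output above can be approximated in base R. The sketch below assumes that categories are kept while their cumulative share stays at or below 1 - threshold, which matches the table shown but is not taken from the package source. The helper group_rare_levels and the toy data are hypothetical.

```r
# Hypothetical helper: lump the long tail of a character vector into one label
group_rare_levels <- function(x, threshold = 0.2, other_label = "OTHER") {
  counts <- table(x)
  # Plain named vector of counts, most frequent category first
  counts <- sort(setNames(as.vector(counts), names(counts)), decreasing = TRUE)
  cum_pct <- cumsum(counts) / sum(counts)
  # Keep categories until the cumulative share exceeds 1 - threshold
  keep <- names(counts)[cum_pct <= 1 - threshold]
  ifelse(x %in% keep, x, other_label)
}

# Toy data (made up): cumulative shares are 0.50, 0.75, 0.92, 1.00
manufacturer <- c(rep("BOEING", 6), rep("EMBRAER", 3), rep("CESSNA", 2), "PIPER")
grouped <- group_rare_levels(manufacturer, threshold = 0.2)
print(table(grouped))
```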

Instead of shrinking categories by frequency, you may also group them by another continuous metric. For example, if you want to bucket the carriers that make up the bottom 20% of distance travelled, you may do the following:

group_category(data = final_dt, feature = "name_carrier", threshold = 0.2, measure = "distance")
##              name_carrier      cnt       pct   cum_pct
## 1:  United Air Lines Inc. 89705524 0.2561422 0.2561422
## 2:   Delta Air Lines Inc. 59507317 0.1699153 0.4260575
## 3:        JetBlue Airways 58384137 0.1667082 0.5927657
## 4: American Airlines Inc. 43864584 0.1252495 0.7180152

Similarly, if you like it, you may add update = TRUE to update the original dataset.

group_category(data = final_dt, feature = "name_carrier", threshold = 0.2, measure = "distance", update = TRUE)
plot_bar(final_dt$name_carrier)
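Grouping by a measure differs from grouping by count because categories are ranked by the summed metric rather than by their number of rows. A base R sketch on made-up data follows; the cumulative-share rule (keep while at or below 1 - threshold) is an assumption, not the package's documented behaviour.

```r
# Toy data (made up): MQ has the most rows but the least total distance
toy <- data.frame(
  carrier = c("UA", "UA", "DL", "B6", "MQ", "MQ", "MQ"),
  distance = c(2000, 2000, 2000, 1500, 200, 200, 100),
  stringsAsFactors = FALSE
)

# Total distance per carrier, largest first
totals <- vapply(split(toy$distance, toy$carrier), sum, numeric(1))
totals <- sort(totals, decreasing = TRUE)

# Keep carriers until the cumulative share of distance exceeds 80%
threshold <- 0.2
cum_pct <- cumsum(totals) / sum(totals)
keep <- names(totals)[cum_pct <= 1 - threshold]

toy$carrier_grouped <- ifelse(toy$carrier %in% keep, toy$carrier, "OTHER")
print(table(toy$carrier_grouped))
```

Here MQ ends up in OTHER despite having the most flights, because its total distance is small.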

Drop features

After viewing the feature distributions, you often want to drop features that are insignificant. For example, a feature like dst_origin has only one value, so it doesn’t provide any valuable information. You can use drop_columns to quickly drop features. The function takes either names or column indices.

drop_columns(final_dt, c("dst_origin", "dst_dest", "tzone_dest"))
drop_columns(final_dt, c(34, 41, 42))
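To find single-valued features programmatically instead of reading them off the charts, a base R sketch like the following may help. The toy data frame is made up for illustration.

```r
# Toy data (made up): dst_origin has a single value
toy <- data.frame(
  dst_origin = c("A", "A", "A"),
  distance = c(100, 200, 300),
  carrier = c("UA", "DL", "UA"),
  stringsAsFactors = FALSE
)

# Flag columns with at most one distinct value (NA counts as a value here)
is_constant <- vapply(toy, function(col) length(unique(col)) <= 1, logical(1))
constant_cols <- names(toy)[is_constant]
print(constant_cols)

# Keep only the informative columns
toy_reduced <- toy[, !is_constant, drop = FALSE]
print(names(toy_reduced))
```

Dropping by name is generally safer than dropping by index, since column positions shift whenever earlier columns are removed.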

Data Reporting

To organize all the data profiling statistics into a report, you may use the create_report() function. It will run most of the EDA functions and output an HTML file.

create_report(final_data)