
Explore penguins

Roland Krasser

2024-04-15

How to explore the penguins dataset using the explore package.

The explore package simplifies Exploratory Data Analysis (EDA). Get faster insights with less code! We will use fewer than 10 lines of code and just 6 function names to explore penguins:

function        package    description
library()       {base}     load a package
filter()        {dplyr}    subset rows using column values
describe()      {explore}  describe variables of the table
explore()       {explore}  graphically explore a variable
explore_all()   {explore}  explore all variables of the table
explain_tree()  {explore}  explain a target using a decision tree

The penguins dataset comes with the palmerpenguins package. It has 344 observations and 8 variables. (https://github.com/allisonhorst/palmerpenguins)

Furthermore, we use the packages {dplyr} for filter() and %>% and {explore} for data exploration.

library(dplyr)
library(explore)
penguins <- use_data_penguins()
# equivalent to 
# penguins <- palmerpenguins::penguins

Describe variables

penguins %>% describe()
#> # A tibble: 8 × 8
#>   variable          type     na na_pct unique    min   mean    max
#>   <chr>             <chr> <int>  <dbl>  <int>  <dbl>  <dbl>  <dbl>
#> 1 species           fct       0    0        3   NA     NA     NA  
#> 2 island            fct       0    0        3   NA     NA     NA  
#> 3 bill_length_mm    dbl       2    0.6    165   32.1   43.9   59.6
#> 4 bill_depth_mm     dbl       2    0.6     81   13.1   17.2   21.5
#> 5 flipper_length_mm int       2    0.6     56  172    201.   231  
#> 6 body_mass_g       int       2    0.6     95 2700   4202.  6300  
#> 7 sex               fct      11    3.2      3   NA     NA     NA  
#> 8 year              int       0    0        3 2007   2008.  2009

There are some NA values (unknown values) in the data. The variable with the most NAs is sex (11 observations, 3.2%). The four measurement variables bill_length_mm, bill_depth_mm, flipper_length_mm and body_mass_g each have only 2 observations with NAs.
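
The NA counts reported by describe() can be cross-checked with base R. This is a minimal sketch, assuming the explore package is installed so that use_data_penguins() is available; the name na_counts is introduced here for illustration only.

```r
library(explore)

penguins <- use_data_penguins()

# Count missing values per column with base R
na_counts <- colSums(is.na(penguins))

# Sort so the variables with the most NAs come first
sort(na_counts, decreasing = TRUE)
```

sex should come first, and the counts should match the na column of the describe() output.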

Data cleaning

We use only penguins with known flipper length for the data exploration!

# Rows where the comparison is NA (unknown flipper length) are dropped by filter()
data <- penguins %>% 
  filter(flipper_length_mm > 0)

This reduces the dataset from 344 to 342 penguins.
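
Two rows disappear because a comparison with NA yields NA, and filter() keeps only rows where the condition is TRUE. The sketch below illustrates this on a small made-up data frame (toy and its values are hypothetical, not taken from penguins) and compares the result with an explicit !is.na() test, which is an alternative way to write the same filter.

```r
library(dplyr)

# Hypothetical toy data: one of three flipper lengths is missing
toy <- data.frame(flipper_length_mm = c(181L, NA, 195L))

# NA > 0 evaluates to NA, so the second row is not kept
kept_by_comparison <- toy %>% filter(flipper_length_mm > 0)

# The same subset, written as an explicit missing-value test
kept_by_is_na <- toy %>% filter(!is.na(flipper_length_mm))

nrow(kept_by_comparison)  # 2
nrow(kept_by_is_na)       # 2
```

The explicit form states the intent more directly. The comparison form would additionally drop any non-positive values, which do not occur in this variable.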

Explore variables

data %>% 
  explore_all(color = "skyblue")

Which species?

What is the relationship between all the variables and species?

data %>% 
  explore_all(
    target = species,
    color = c("darkorange", "purple", "lightseagreen"))

We already see some strong patterns in the data: flipper_length_mm separates Gentoo from the other species, and bill_length_mm separates Adelie from Chinstrap. We also see that Chinstrap and Gentoo are located on separate islands.
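
The island pattern can also be checked numerically. This is a small sketch using count() from {dplyr}, a function that is not otherwise used in this article. It rebuilds data so that the block runs on its own, and the name species_by_island is introduced here for illustration.

```r
library(dplyr)
library(explore)

data <- use_data_penguins() %>%
  filter(flipper_length_mm > 0)

# Number of penguins for each species/island combination;
# combinations that never occur do not appear in the result
species_by_island <- data %>% count(species, island)
species_by_island
```

Chinstrap and Gentoo should each appear on a single island, and not on the same one.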

Now we explain species using a decision tree:

data %>% explain_tree(target = species)

The tree gives a simple rule for identifying the species using only flipper_length_mm and bill_length_mm.
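
To inspect split rules as text rather than as a plot, a comparable classification tree can be fitted directly with the rpart package. This is an independent sketch, not a description of what explain_tree() does internally. The restriction to two predictors and the name fit are choices made here, and the resulting tree may differ from the one plotted by explain_tree().

```r
library(dplyr)
library(explore)
library(rpart)

data <- use_data_penguins() %>%
  filter(flipper_length_mm > 0)

# Fit a classification tree for species on the two measurements
fit <- rpart(
  species ~ flipper_length_mm + bill_length_mm,
  data = data,
  method = "class"
)

# Print the split rules as text
print(fit)
```

Each line of the printed output shows a split condition, the number of observations in the node and the predicted species.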

Now let’s take a closer look at these variables:

data %>% 
  explore(
    flipper_length_mm, bill_length_mm, 
    target = species,
    color = c("darkorange", "purple", "lightseagreen")
    )

The plot shows a good, though not perfect, separation between the three species!
