The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Generate a data dictionnary and search for variables with look_for()

Joseph Larmarange

Showing a summary of a data frame

Default printing of tibbles

It is a common need to easily get a description of all variables in a data frame.

When a data frame is converted into a tibble (e.g. with dplyr::as_tibble()), it as a nice printing showing the first rows of the data frame as well as the type of column.

library(dplyr)
iris %>% as_tibble()
## # A tibble: 150 × 5
##    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
##           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
##  1          5.1         3.5          1.4         0.2 setosa 
##  2          4.9         3            1.4         0.2 setosa 
##  3          4.7         3.2          1.3         0.2 setosa 
##  4          4.6         3.1          1.5         0.2 setosa 
##  5          5           3.6          1.4         0.2 setosa 
##  6          5.4         3.9          1.7         0.4 setosa 
##  7          4.6         3.4          1.4         0.3 setosa 
##  8          5           3.4          1.5         0.2 setosa 
##  9          4.4         2.9          1.4         0.2 setosa 
## 10          4.9         3.1          1.5         0.1 setosa 
## # ℹ 140 more rows

However, when you have too many variables, all of them cannot be printed and their are just listed.

data(fertility, package = "questionr")
women
## # A tibble: 2,000 × 17
##    id_woman id_household weight interview_date date_of_birth   age residency
##       <dbl>        <dbl>  <dbl> <date>         <date>        <dbl> <dbl+lbl>
##  1      391          381  1.80  2012-05-05     1997-03-07       15 2 [rural]
##  2     1643         1515  1.80  2012-01-23     1982-01-06       30 2 [rural]
##  3       85           85  1.80  2012-01-21     1979-01-01       33 2 [rural]
##  4      881          844  1.80  2012-01-06     1968-03-29       43 2 [rural]
##  5     1981         1797  1.80  2012-05-11     1986-05-25       25 2 [rural]
##  6     1072         1015  0.998 2012-02-20     1993-07-03       18 2 [rural]
##  7     1978         1794  0.998 2012-02-23     1967-01-28       45 2 [rural]
##  8     1607         1486  0.998 2012-02-20     1989-01-21       23 2 [rural]
##  9      738          711  0.192 2012-03-09     1962-07-24       49 2 [rural]
## 10     1656         1525  0.192 2012-03-15     1980-12-25       31 2 [rural]
## # ℹ 1,990 more rows
## # ℹ 10 more variables: region <dbl+lbl>, instruction <dbl+lbl>,
## #   employed <dbl+lbl>, matri <dbl+lbl>, religion <dbl+lbl>,
## #   newspaper <dbl+lbl>, radio <dbl+lbl>, tv <dbl+lbl>,
## #   ideal_nb_children <dbl+lbl>, test <dbl+lbl>

Note: in R console, value labels (if defined) are usually printed but they do not appear in a R markdown document like this vignette.

dplyr::glimpse()

The function dplyr::glimpse() allows you to have a quick look at all the variables in a data frame.

glimpse(iris)
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.…
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7, 3.…
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5, 1.…
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2, 0.…
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa, s…
glimpse(women)
## Rows: 2,000
## Columns: 17
## $ id_woman          <dbl> 391, 1643, 85, 881, 1981, 1072, 1978, 1607, 738, 165…
## $ id_household      <dbl> 381, 1515, 85, 844, 1797, 1015, 1794, 1486, 711, 152…
## $ weight            <dbl> 1.803150, 1.803150, 1.803150, 1.803150, 1.803150, 0.…
## $ interview_date    <date> 2012-05-05, 2012-01-23, 2012-01-21, 2012-01-06, 201…
## $ date_of_birth     <date> 1997-03-07, 1982-01-06, 1979-01-01, 1968-03-29, 198…
## $ age               <dbl> 15, 30, 33, 43, 25, 18, 45, 23, 49, 31, 26, 45, 25, …
## $ residency         <dbl+lbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ region            <dbl+lbl> 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, …
## $ instruction       <dbl+lbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 0, …
## $ employed          <dbl+lbl> 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ matri             <dbl+lbl> 0, 2, 2, 2, 1, 0, 1, 1, 2, 5, 2, 3, 0, 2, 1, 2, …
## $ religion          <dbl+lbl> 1, 3, 2, 3, 2, 2, 3, 1, 3, 3, 2, 3, 2, 2, 2, 2, …
## $ newspaper         <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, …
## $ radio             <dbl+lbl> 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, …
## $ tv                <dbl+lbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, …
## $ ideal_nb_children <dbl+lbl>  4,  4,  4,  4,  4,  5, 10,  5,  4,  5,  6, 10, …
## $ test              <dbl+lbl> 0, 9, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, …

It will show you the first values of each variable as well as the type of each variable. However, some important informations are not displayed:

labelled::look_for()

look_for() provided by the labelled package will print in the console a data dictionary of all variables, showing variable labels when available, the type of variable and a list of values corresponding to:

library(labelled)
look_for(iris)
##  pos variable     label           col_type missing values    
##  1   Sepal.Length —               dbl      0                 
##  2   Sepal.Width  —               dbl      0                 
##  3   Petal.Length Length of petal dbl      0                 
##  4   Petal.Width  Width of Petal  dbl      0                 
##  5   Species      —               fct      0       setosa    
##                                                    versicolor
##                                                    virginica
look_for(women)
##  pos variable          label              col_type missing values            
##  1   id_woman          Woman Id           dbl      0                         
##  2   id_household      Household Id       dbl      0                         
##  3   weight            Sample weight      dbl      0                         
##  4   interview_date    Interview date     date     0                         
##  5   date_of_birth     Date of birth      date     0                         
##  6   age               Age at last anniv~ dbl      0                         
##  7   residency         Urban / rural res~ dbl+lbl  0       [1] urban         
##                                                            [2] rural         
##  8   region            Region             dbl+lbl  0       [1] North         
##                                                            [2] East          
##                                                            [3] South         
##                                                            [4] West          
##  9   instruction       Level of instruct~ dbl+lbl  0       [0] none          
##                                                            [1] primary       
##                                                            [2] secondary     
##                                                            [3] higher        
##  10  employed          Employed?          dbl+lbl  7       [0] no            
##                                                            [1] yes           
##                                                            [9] missing       
##  11  matri             Matrimonial status dbl+lbl  0       [0] single        
##                                                            [1] married       
##                                                            [2] living togeth~
##                                                            [3] windowed      
##                                                            [4] divorced      
##                                                            [5] separated     
##  12  religion          Religion           dbl+lbl  4       [1] Muslim        
##                                                            [2] Christian     
##                                                            [3] Protestant    
##                                                            [4] no religion   
##                                                            [5] other         
##  13  newspaper         Read newspaper?    dbl+lbl  0       [0] no            
##                                                            [1] yes           
##  14  radio             Listen to radio?   dbl+lbl  0       [0] no            
##                                                            [1] yes           
##  15  tv                Watch TV?          dbl+lbl  0       [0] no            
##                                                            [1] yes           
##  16  ideal_nb_children Ideal number of c~ dbl+lbl  0       [96] don't know   
##                                                            [99] missing      
##  17  test              Ever tested for H~ dbl+lbl  29      [0] no            
##                                                            [1] yes           
##                                                            [9] missing

Note that lookfor() and generate_dictionary() are synonyms of look_for() and works exactly in the same way.

If there is not enough space to print full labels in the console, they will be truncated (truncation is indicated by a ~).

Searching variables by key

When a data frame has dozens or even hundreds of variables, it could become difficult to find a specific variable. In such case, you can provide an optional list of keywords, which can be simple character strings or regular expression, to search for specific variables.

# Look for a single keyword.
look_for(iris, "petal")
##  pos variable     label           col_type missing values
##  3   Petal.Length Length of petal dbl      0             
##  4   Petal.Width  Width of Petal  dbl      0
look_for(iris, "s")
##  pos variable     label col_type missing values    
##  1   Sepal.Length —     dbl      0                 
##  2   Sepal.Width  —     dbl      0                 
##  5   Species      —     fct      0       setosa    
##                                          versicolor
##                                          virginica
# Look for with a regular expression
look_for(iris, "petal|species")
##  pos variable     label           col_type missing values    
##  3   Petal.Length Length of petal dbl      0                 
##  4   Petal.Width  Width of Petal  dbl      0                 
##  5   Species      —               fct      0       setosa    
##                                                    versicolor
##                                                    virginica
look_for(iris, "s$")
##  pos variable label col_type missing values    
##  5   Species  —     fct      0       setosa    
##                                      versicolor
##                                      virginica
# Look for with several keywords
look_for(iris, "pet", "sp")
##  pos variable     label           col_type missing values    
##  3   Petal.Length Length of petal dbl      0                 
##  4   Petal.Width  Width of Petal  dbl      0                 
##  5   Species      —               fct      0       setosa    
##                                                    versicolor
##                                                    virginica
# Look_for will take variable labels into account
look_for(women, "read", "level")
##  pos variable    label                col_type missing values       
##  9   instruction Level of instruction dbl+lbl  0       [0] none     
##                                                        [1] primary  
##                                                        [2] secondary
##                                                        [3] higher   
##  13  newspaper   Read newspaper?      dbl+lbl  0       [0] no       
##                                                        [1] yes

By default, look_for() will look through both variable names and variables labels. Use labels = FALSE to look only through variable names.

look_for(women, "read")
##  pos variable  label           col_type missing values 
##  13  newspaper Read newspaper? dbl+lbl  0       [0] no 
##                                                 [1] yes
look_for(women, "read", labels = FALSE)
## Nothing found. Sorry.

Similarly, the search is by default case insensitive. To make the search case sensitive, use ignore.case = FALSE.

look_for(iris, "sepal")
##  pos variable     label col_type missing values
##  1   Sepal.Length —     dbl      0             
##  2   Sepal.Width  —     dbl      0
look_for(iris, "sepal", ignore.case = FALSE)
## Nothing found. Sorry.

Level of details

If you just want to use the search feature of look_for() without computing the details of each variable, simply indicate details = "none" or details = FALSE.

look_for(women, "id", details = "none")
##  pos variable          label                   
##   1  id_woman          Woman Id                
##   2  id_household      Household Id            
##   7  residency         Urban / rural residency 
##  16  ideal_nb_children Ideal number of children

If you want more details (but can be time consuming for big data frames), indicate details = "full" or details = TRUE.

look_for(women, details = "full")
##  pos variable          label              col_type missing unique_values
##  1   id_woman          Woman Id           dbl      0       2000         
##  2   id_household      Household Id       dbl      0       1814         
##  3   weight            Sample weight      dbl      0       351          
##  4   interview_date    Interview date     date     0       165          
##  5   date_of_birth     Date of birth      date     0       1740         
##  6   age               Age at last anniv~ dbl      0       36           
##  7   residency         Urban / rural res~ dbl+lbl  0       2            
##                                                                         
##  8   region            Region             dbl+lbl  0       4            
##                                                                         
##                                                                         
##                                                                         
##  9   instruction       Level of instruct~ dbl+lbl  0       4            
##                                                                         
##                                                                         
##                                                                         
##  10  employed          Employed?          dbl+lbl  7       3            
##                                                                         
##                                                                         
##  11  matri             Matrimonial status dbl+lbl  0       6            
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##  12  religion          Religion           dbl+lbl  4       6            
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##  13  newspaper         Read newspaper?    dbl+lbl  0       2            
##                                                                         
##  14  radio             Listen to radio?   dbl+lbl  0       2            
##                                                                         
##  15  tv                Watch TV?          dbl+lbl  0       2            
##                                                                         
##  16  ideal_nb_children Ideal number of c~ dbl+lbl  0       18           
##                                                                         
##  17  test              Ever tested for H~ dbl+lbl  29      3            
##                                                                         
##                                                                         
##  values             na_values na_range
##  range: 1 - 2000                      
##  range: 1 - 1814                      
##  range: 0.044629 -~                   
##  range: 2011-12-01~                   
##  range: 1962-02-07~                   
##  range: 14 - 49                       
##  [1] urban                            
##  [2] rural                            
##  [1] North                            
##  [2] East                             
##  [3] South                            
##  [4] West                             
##  [0] none                             
##  [1] primary                          
##  [2] secondary                        
##  [3] higher                           
##  [0] no             9                 
##  [1] yes                              
##  [9] missing                          
##  [0] single                           
##  [1] married                          
##  [2] living togeth~                   
##  [3] windowed                         
##  [4] divorced                         
##  [5] separated                        
##  [1] Muslim                           
##  [2] Christian                        
##  [3] Protestant                       
##  [4] no religion                      
##  [5] other                            
##  [0] no                               
##  [1] yes                              
##  [0] no                               
##  [1] yes                              
##  [0] no                               
##  [1] yes                              
##  [96] don't know                      
##  [99] missing                         
##  [0] no             9                 
##  [1] yes                              
##  [9] missing
look_for(women, details = "full") %>%
  dplyr::glimpse()
## Rows: 17
## Columns: 14
## $ pos           <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
## $ variable      <chr> "id_woman", "id_household", "weight", "interview_date", …
## $ label         <chr> "Woman Id", "Household Id", "Sample weight", "Interview …
## $ col_type      <chr> "dbl", "dbl", "dbl", "date", "date", "dbl", "dbl+lbl", "…
## $ missing       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 4, 0, 0, 0, 0, 29
## $ levels        <named list> <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, …
## $ value_labels  <named list> <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <1, 2>, …
## $ class         <named list> "numeric", "numeric", "numeric", "Date", "Date", …
## $ type          <chr> "double", "double", "double", "double", "double",…
## $ na_values     <named list> <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <…
## $ na_range      <named list> <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, …
## $ n_na          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 4, 0, 0, 0, 0, 29
## $ unique_values <int> 2000, 1814, 351, 165, 1740, 36, 2, 4, 4, 3, 6, 6,…
## $ range         <named list> <1, 2000>, <1, 1814>, <0.044629, 4.396831>, <2011…

Advanced usages of look_for()

look_for() returns a detailed tibble which is summarized before printing. To deactivate default printing and see full results, simply use dplyr::as_tibble(), dplyr::glimpse() or even utils::View().

look_for(women) %>% View()
look_for(women) %>% as_tibble()
## # A tibble: 17 × 7
##      pos variable          label            col_type missing levels value_labels
##    <int> <chr>             <chr>            <chr>      <int> <name> <named list>
##  1     1 id_woman          Woman Id         dbl            0 <NULL> <NULL>      
##  2     2 id_household      Household Id     dbl            0 <NULL> <NULL>      
##  3     3 weight            Sample weight    dbl            0 <NULL> <NULL>      
##  4     4 interview_date    Interview date   date           0 <NULL> <NULL>      
##  5     5 date_of_birth     Date of birth    date           0 <NULL> <NULL>      
##  6     6 age               Age at last ann… dbl            0 <NULL> <NULL>      
##  7     7 residency         Urban / rural r… dbl+lbl        0 <NULL> <dbl [2]>   
##  8     8 region            Region           dbl+lbl        0 <NULL> <dbl [4]>   
##  9     9 instruction       Level of instru… dbl+lbl        0 <NULL> <dbl [4]>   
## 10    10 employed          Employed?        dbl+lbl        7 <NULL> <dbl [3]>   
## 11    11 matri             Matrimonial sta… dbl+lbl        0 <NULL> <dbl [6]>   
## 12    12 religion          Religion         dbl+lbl        4 <NULL> <dbl [5]>   
## 13    13 newspaper         Read newspaper?  dbl+lbl        0 <NULL> <dbl [2]>   
## 14    14 radio             Listen to radio? dbl+lbl        0 <NULL> <dbl [2]>   
## 15    15 tv                Watch TV?        dbl+lbl        0 <NULL> <dbl [2]>   
## 16    16 ideal_nb_children Ideal number of… dbl+lbl        0 <NULL> <dbl [2]>   
## 17    17 test              Ever tested for… dbl+lbl       29 <NULL> <dbl [3]>
glimpse(look_for(women))
## Rows: 17
## Columns: 7
## $ pos          <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17
## $ variable     <chr> "id_woman", "id_household", "weight", "interview_date", "…
## $ label        <chr> "Woman Id", "Household Id", "Sample weight", "Interview d…
## $ col_type     <chr> "dbl", "dbl", "dbl", "date", "date", "dbl", "dbl+lbl", "d…
## $ missing      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 4, 0, 0, 0, 0, 29
## $ levels       <named list> <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <…
## $ value_labels <named list> <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <NULL>, <1, 2>, <…

The tibble returned by look_for() could be easily manipulated for advanced programming.

When a column has several values for one variable (e.g. levels or value_labels), results as stored with nested named list. You can convert named lists into simpler character vectors, you can use convert_list_columns_to_character().

look_for(women) %>% convert_list_columns_to_character()
## # A tibble: 17 × 7
##      pos variable          label            col_type missing levels value_labels
##    <int> <chr>             <chr>            <chr>      <int> <chr>  <chr>       
##  1     1 id_woman          Woman Id         dbl            0 ""     ""          
##  2     2 id_household      Household Id     dbl            0 ""     ""          
##  3     3 weight            Sample weight    dbl            0 ""     ""          
##  4     4 interview_date    Interview date   date           0 ""     ""          
##  5     5 date_of_birth     Date of birth    date           0 ""     ""          
##  6     6 age               Age at last ann… dbl            0 ""     ""          
##  7     7 residency         Urban / rural r… dbl+lbl        0 ""     "[1] urban;…
##  8     8 region            Region           dbl+lbl        0 ""     "[1] North;…
##  9     9 instruction       Level of instru… dbl+lbl        0 ""     "[0] none; …
## 10    10 employed          Employed?        dbl+lbl        7 ""     "[0] no; [1…
## 11    11 matri             Matrimonial sta… dbl+lbl        0 ""     "[0] single…
## 12    12 religion          Religion         dbl+lbl        4 ""     "[1] Muslim…
## 13    13 newspaper         Read newspaper?  dbl+lbl        0 ""     "[0] no; [1…
## 14    14 radio             Listen to radio? dbl+lbl        0 ""     "[0] no; [1…
## 15    15 tv                Watch TV?        dbl+lbl        0 ""     "[0] no; [1…
## 16    16 ideal_nb_children Ideal number of… dbl+lbl        0 ""     "[96] don't…
## 17    17 test              Ever tested for… dbl+lbl       29 ""     "[0] no; [1…

Alternatively, you can use lookfor_to_long_format() to transform results into a long format with one row per factor level and per value label.

look_for(women) %>% lookfor_to_long_format()
## # A tibble: 41 × 7
##      pos variable       label               col_type missing levels value_labels
##    <int> <chr>          <chr>               <chr>      <int> <chr>  <chr>       
##  1     1 id_woman       Woman Id            dbl            0 <NA>   <NA>        
##  2     2 id_household   Household Id        dbl            0 <NA>   <NA>        
##  3     3 weight         Sample weight       dbl            0 <NA>   <NA>        
##  4     4 interview_date Interview date      date           0 <NA>   <NA>        
##  5     5 date_of_birth  Date of birth       date           0 <NA>   <NA>        
##  6     6 age            Age at last annive… dbl            0 <NA>   <NA>        
##  7     7 residency      Urban / rural resi… dbl+lbl        0 <NA>   [1] urban   
##  8     7 residency      Urban / rural resi… dbl+lbl        0 <NA>   [2] rural   
##  9     8 region         Region              dbl+lbl        0 <NA>   [1] North   
## 10     8 region         Region              dbl+lbl        0 <NA>   [2] East    
## # ℹ 31 more rows

Both can be combined:

look_for(women) %>%
  lookfor_to_long_format() %>%
  convert_list_columns_to_character()
## # A tibble: 41 × 7
##      pos variable       label               col_type missing levels value_labels
##    <int> <chr>          <chr>               <chr>      <int> <chr>  <chr>       
##  1     1 id_woman       Woman Id            dbl            0 <NA>   <NA>        
##  2     2 id_household   Household Id        dbl            0 <NA>   <NA>        
##  3     3 weight         Sample weight       dbl            0 <NA>   <NA>        
##  4     4 interview_date Interview date      date           0 <NA>   <NA>        
##  5     5 date_of_birth  Date of birth       date           0 <NA>   <NA>        
##  6     6 age            Age at last annive… dbl            0 <NA>   <NA>        
##  7     7 residency      Urban / rural resi… dbl+lbl        0 <NA>   [1] urban   
##  8     7 residency      Urban / rural resi… dbl+lbl        0 <NA>   [2] rural   
##  9     8 region         Region              dbl+lbl        0 <NA>   [1] North   
## 10     8 region         Region              dbl+lbl        0 <NA>   [2] East    
## # ℹ 31 more rows

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.