dataProfilerR

Automated exploratory data analysis for R. Point it at a data frame and it returns a structured profile — column types, missingness, distributional statistics, normality tests, outliers, correlations, a data-quality score, and ggplot2 figures — through a single function, profile_data().

The aim is to cover the first hour of EDA that you’d otherwise write by hand for every new dataset, while keeping the result a plain, inspectable object you can build on.

Installation

# install.packages("remotes")
remotes::install_github("mqfarooqi1/dataProfilerR")

Depends on ggplot2. The Anderson–Darling normality test additionally uses the suggested nortest package; if it isn’t installed, only Shapiro–Wilk is run.
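
The optional-dependency behaviour described above can be sketched in base R. This is only an illustration, assuming a hypothetical helper named check_normality(); it is not the package's own code. The 5000-value cap reflects the documented limit of stats::shapiro.test(), and nortest::ad.test() is only called when nortest is installed.

```r
# Hypothetical helper (not part of dataProfilerR): Shapiro-Wilk always,
# Anderson-Darling only when the suggested package nortest is available.
check_normality <- function(x, max_n = 5000L) {
  x <- x[!is.na(x)]
  # shapiro.test() accepts between 3 and 5000 values, so subsample large inputs
  if (length(x) > max_n) x <- sample(x, max_n)
  varies <- length(unique(x)) > 1   # both tests are meaningless on constant data
  out <- list(n = length(x), shapiro_p = NA_real_, ad_p = NA_real_)
  if (varies && length(x) >= 3) {
    out$shapiro_p <- shapiro.test(x)$p.value
  }
  # nortest::ad.test() needs more than 7 observations
  if (varies && length(x) > 7 && requireNamespace("nortest", quietly = TRUE)) {
    out$ad_p <- nortest::ad.test(x)$p.value
  }
  out
}

res <- check_normality(iris$Sepal.Length)
str(res)
```

If nortest is missing, ad_p simply stays NA, which mirrors the "only Shapiro–Wilk is run" fallback.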

Quick start

library(dataProfilerR)

p <- profile_data(iris)
p                                  # concise overview + quality score
summary(p)                         # numeric summary, missingness, normality, outliers, correlations
plot(p, which = "correlation")     # retrieve a figure
plot(p, which = "distribution", column = "Sepal.Length")

# components are just list elements
p$metadata$column_types
p$diagnostics$quality$score
p$statistics$numeric

# grouped comparison + a self-contained HTML report (needs pandoc)
p <- profile_data(iris, group_by = "Species")
p$diagnostics$groups$numeric_by_group
report(p, "iris_report.html")

See the vignette (vignette("dataProfilerR")) for a full walkthrough on a messy dataset.

Architecture and design decisions

The package is organised as a pipeline of independent, individually-callable functions, with one orchestrator on top:

                      profile_data()                    <- orchestrator
        ┌───────────────────┼─────────────────────────────────────┐
   profiling            statistics                          visualization
   ─────────            ──────────                          ─────────────
   infer_column_types   normality_tests                     plot_missing
   analyze_missing      detect_outliers / outlier_summary   plot_distribution
   summarize_columns    correlation_analysis                plot_correlation
   data_quality_score                                       plot_boxplots
                                                            plot_pairs
                            │
                            ▼
                data_profile (S3 object)  ──  print() / summary() / plot()

Design choices worth calling out:

Function reference

Profiling

Function                 Purpose
infer_column_types(df)   Classify each column; character columns split into categorical vs text.
analyze_missing(df)      Per-column and overall missingness; complete-row count.
summarize_columns(df)    Numeric summary (mean, sd, variance, quartiles, IQR, skewness, kurtosis) and categorical cardinality / top level.
data_quality_score(df)   0–100 score and letter grade from completeness, row uniqueness, column variability, and (optionally) outlier rate.
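
To make the missingness and moment statistics concrete, here is a minimal base-R sketch. The names missing_summary(), moment_skewness(), and moment_kurtosis() are hypothetical, and the moment definitions (m3 / m2^1.5 and m4 / m2^2, non-excess kurtosis) are an assumption; the package may use a different convention.

```r
# Assumed moment-based definitions; not taken from the package source.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

moment_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2   # about 3 for normal data (not excess kurtosis)
}

# Per-column missingness plus the number of fully observed rows
missing_summary <- function(df) {
  n_missing <- colSums(is.na(df))
  list(
    by_column = data.frame(
      column      = names(df),
      n_missing   = as.integer(n_missing),
      pct_missing = 100 * as.numeric(n_missing) / nrow(df),
      row.names   = NULL
    ),
    complete_rows = sum(complete.cases(df))
  )
}

df <- data.frame(a = c(1, 2, NA, 4), b = c("x", NA, NA, "y"))
ms <- missing_summary(df)
ms$by_column
ms$complete_rows
```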

Statistics

Function                          Purpose
normality_tests(df)               Shapiro–Wilk (and Anderson–Darling if nortest is present) per numeric column; large columns subsampled to 5000.
detect_outliers(x, method)        "iqr", "zscore", or "robust" (median/MAD) on a vector.
outlier_summary(df, method)       Per-column outlier counts and an overall rate.
correlation_analysis(df, method)  Pearson and/or Spearman matrices over numeric columns.
categorical_association(df)       Cramer’s V matrix between categorical columns.
analyze_dates(df)                 Range, unique count, and largest gap for date/datetime columns.
compare_groups(df, group)         Numeric summaries within the levels of a grouping column.
skewness(x), kurtosis(x)          Moment-based, exported for direct use.
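
The three outlier rules named in the table can be compared side by side with a small base-R sketch. flag_outliers() is a hypothetical stand-in, and the cutoffs (1.5 × IQR, |z| > 3, robust |z| > 3.5) are common defaults assumed here, not values confirmed from the package.

```r
# Hypothetical stand-in for the three rules; thresholds are assumed defaults.
flag_outliers <- function(x, method = c("iqr", "zscore", "robust")) {
  method <- match.arg(method)
  flag <- switch(method,
    iqr = {
      q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
      fence <- 1.5 * (q[2] - q[1])
      x < q[1] - fence | x > q[2] + fence
    },
    zscore = abs(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE) > 3,
    # mad() rescales by 1.4826 by default, giving a robust z-score
    robust = abs(x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE) > 3.5
  )
  !is.na(flag) & flag   # missing values are never flagged
}

x <- c(10, 11, 12, 11, 10, 12, 11, 100)
which(flag_outliers(x, "iqr"))      # 8
which(flag_outliers(x, "zscore"))   # integer(0)
which(flag_outliers(x, "robust"))   # 8
```

On this vector the z-score rule misses the obvious outlier: the extreme value inflates the mean and standard deviation, and with eight observations no |z| can exceed (n - 1) / sqrt(n) ≈ 2.47. This is one reason to have the median/MAD variant available.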

Visualization (ggplot2)

Function                       Purpose
plot_missing(df)               Missing-value heatmap (rows subsampled when large).
plot_distribution(df, column)  Histogram + density (numeric) or bar chart (categorical).
plot_correlation(df, method)   Annotated correlation heatmap.
plot_association(df)           Cramer’s V heatmap for categorical columns.
plot_boxplots(df)              Faceted boxplots for the numeric columns.
plot_pairs(df, columns)        Scatterplot matrix for selected numeric columns.
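
The matrix behind a Cramer's V heatmap can be approximated in base R as follows. This is a sketch under assumptions: cramers_v() and cramers_v_matrix() are hypothetical names, and the formula used is the uncorrected V = sqrt(chi2 / (n * (k - 1))), with k the smaller table dimension; the package may apply a bias correction.

```r
# Hypothetical helpers; uncorrected Cramer's V from a chi-squared statistic.
cramers_v <- function(x, y) {
  tab <- table(x, y)   # pairs with NA are dropped by default
  tab <- tab[rowSums(tab) > 0, colSums(tab) > 0, drop = FALSE]  # drop unused levels
  n <- sum(tab)
  k <- min(dim(tab))
  if (n == 0 || k < 2) return(NA_real_)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  sqrt(unname(chi2) / (n * (k - 1)))
}

# Symmetric matrix over all factor/character columns of a data frame
cramers_v_matrix <- function(df) {
  is_cat <- vapply(df, function(col) is.factor(col) || is.character(col), logical(1))
  cols <- names(df)[is_cat]
  m <- diag(length(cols))
  dimnames(m) <- list(cols, cols)
  for (i in seq_along(cols)) {
    for (j in seq_along(cols)) {
      if (i < j) m[i, j] <- m[j, i] <- cramers_v(df[[cols[i]]], df[[cols[j]]])
    }
  }
  m
}

cars <- data.frame(
  cyl  = factor(mtcars$cyl),
  am   = factor(mtcars$am),
  gear = factor(mtcars$gear)
)
v <- cramers_v_matrix(cars)
round(v, 2)
```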

Pipeline, reporting & object

Function                        Purpose
profile_data(df, ...)           Run everything; return a data_profile. Options include group_by and distributions.
report(x, file)                 Render the profile to a self-contained HTML file (needs pandoc).
print / summary / plot methods  Overview / detail / figures (plot() adds which = "association").
is_data_profile(x)              Class predicate.
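
The orchestrator-plus-S3-object pattern in this table can be illustrated with a deliberately tiny toy class. Everything here (toy_profile(), new_toy_profile(), is_toy_profile(), and the fields inside) is hypothetical and far smaller than a real data_profile; it only shows how a constructor, a class predicate, and a print method fit together while the result stays a plain list.

```r
# Toy illustration of the pattern; not the package's data_profile class.
new_toy_profile <- function(metadata, statistics, call = NULL) {
  structure(
    list(metadata = metadata, statistics = statistics, call = call),
    class = "toy_profile"
  )
}

# Class predicate, analogous in spirit to is_data_profile()
is_toy_profile <- function(x) inherits(x, "toy_profile")

# print() method: a concise overview, returning the object invisibly
print.toy_profile <- function(x, ...) {
  cat("<toy_profile>", x$metadata$n_rows, "rows x", x$metadata$n_cols, "columns\n")
  invisible(x)
}

# Orchestrator: validate input, compute the parts, assemble the object
toy_profile <- function(df) {
  stopifnot(is.data.frame(df))
  cl <- match.call()
  num <- vapply(df, is.numeric, logical(1))
  new_toy_profile(
    metadata = list(
      n_rows = nrow(df),
      n_cols = ncol(df),
      column_types = vapply(df, function(col) class(col)[1], character(1))
    ),
    statistics = list(numeric_means = vapply(df[num], mean, numeric(1), na.rm = TRUE)),
    call = cl
  )
}

p <- toy_profile(iris)
print(p)
p$metadata$column_types   # components are ordinary list elements
```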

The data_profile object

profile_data() returns an S3 list with four parts plus the call. The Quick start above reads from three of them: p$metadata, p$statistics, and p$diagnostics.

Folder structure

dataProfilerR/
├── DESCRIPTION
├── NAMESPACE                # generated by roxygen2
├── LICENSE
├── NEWS.md
├── R/
│   ├── dataProfilerR-package.R
│   ├── utils.R              # validation + skewness/kurtosis
│   ├── profiling.R          # types, missingness, summaries, quality score
│   ├── statistics.R         # normality, outliers, correlation
│   ├── association.R        # Cramer's V for categoricals
│   ├── dates.R              # date/datetime profiling
│   ├── groups.R             # grouped comparison
│   ├── visualization.R      # ggplot2 functions
│   ├── report.R             # HTML report (rmarkdown)
│   ├── profile_data.R       # orchestrator + S3 constructor
│   └── methods.R            # print / summary / plot
├── man/                     # generated by roxygen2
├── tests/testthat/          # unit + edge-case tests
└── vignettes/dataProfilerR.Rmd

Testing

testthat (edition 3) covers each function plus edge cases — empty frames, wrong types, all-NA columns, single-column frames, missing-column plot requests, and output-shape consistency. Run with devtools::test().

Limitations and future improvements

Added in 0.2.0: report() (HTML), categorical_association() (Cramer’s V), analyze_dates(), compare_groups(), and a distributions = FALSE switch to avoid eager per-column plots on wide data. See NEWS.md.

Still open / honest gaps:

License

MIT © Muhammad Farooqi
