dataProfilerR

Automated exploratory data analysis for R. Point it at a data frame and it returns a structured profile — column types, missingness, distributional statistics, normality tests, outliers, correlations, a data-quality score, and ggplot2 figures — through a single function, profile_data().

The aim is to cover the first hour of EDA that you’d otherwise write by hand for every new dataset, while keeping the result a plain, inspectable object you can build on.

Installation

# install.packages("remotes")
remotes::install_github("mqfarooqi1/dataProfilerR")

Depends on ggplot2. The Anderson–Darling normality test additionally uses the suggested nortest package; if it isn’t installed, only Shapiro–Wilk is run.

Quick start

library(dataProfilerR)

p <- profile_data(iris)
p                                  # concise overview + quality score
summary(p)                         # numeric summary, missingness, normality, outliers, correlations
plot(p, which = "correlation")     # retrieve a figure
plot(p, which = "distribution", column = "Sepal.Length")

# components are just list elements
p$metadata$column_types
p$diagnostics$quality$score
p$statistics$numeric

# grouped comparison + a self-contained HTML report (needs pandoc)
p <- profile_data(iris, group_by = "Species")
p$diagnostics$groups$numeric_by_group
report(p, "iris_report.html")

See the vignette (vignette("dataProfilerR")) for a full walkthrough on a messy dataset.

Architecture and design decisions

The package is organised as a pipeline of independent, individually callable functions, with one orchestrator on top:

                       profile_data()                 <- orchestrator
        ┌───────────────────┼────────────────────────────────────┐
   profiling            statistics                         visualization
   ─────────            ──────────                         ─────────────
   infer_column_types   normality_tests                    plot_missing
   analyze_missing      detect_outliers / outlier_summary  plot_distribution
   summarize_columns    correlation_analysis               plot_correlation
   data_quality_score                                      plot_boxplots
                                                           plot_pairs
                            │
                            ▼
                  data_profile (S3 object)  ──  print() / summary() / plot()

Design choices worth calling out:

S3, not S4. A profiling result is data, not behaviour. Modelling it as a plain list with a class keeps it transparent (str(p) just works), serialisable, and easy to extend with new elements without redefining a formal class. S4’s validity and dispatch machinery would be overhead with no payoff here. The methods provided are print, summary, and plot.
Each stage stands alone. infer_column_types(), detect_outliers(), plot_correlation() etc. all work directly on a data frame or vector, so the package is useful piecemeal, not only through the orchestrator.
Type inference drives the rest. Columns are classified once (numeric/integer/date/logical/categorical/text) and that classification routes which statistics and plots apply.
Fail early on bad input. A shared validator rejects non-data-frames, empty frames, and duplicate/blank column names with clear messages rather than letting them surface as cryptic downstream errors.
Minimal dependencies. Only ggplot2 beyond base/recommended packages. Skewness and kurtosis are implemented directly rather than pulling in moments; Anderson–Darling degrades gracefully when nortest is absent.
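
Two of these choices are small enough to sketch in base R. The block below is an illustration only, not the package's source: validate_df, skew_sketch and kurt_sketch are hypothetical names, the error messages are invented, and the moment estimators are the plain population versions, which may differ from whatever bias correction the package applies.

```r
# Hypothetical fail-early validator (assumed behaviour, not the package's helper).
validate_df <- function(df) {
  if (!is.data.frame(df)) {
    stop("`df` must be a data frame, not ", class(df)[1], call. = FALSE)
  }
  if (nrow(df) == 0 || ncol(df) == 0) {
    stop("`df` must have at least one row and one column", call. = FALSE)
  }
  nms <- names(df)
  if (anyNA(nms) || any(!nzchar(trimws(nms)))) {
    stop("`df` has blank column names", call. = FALSE)
  }
  if (anyDuplicated(nms) > 0) {
    stop("`df` has duplicate column names: ",
         paste(unique(nms[duplicated(nms)]), collapse = ", "), call. = FALSE)
  }
  invisible(df)
}

# k-th central moment of a vector with no missing values.
central_moment <- function(x, k) mean((x - mean(x))^k)

# Moment-based skewness: m3 / m2^(3/2), NAs dropped.
skew_sketch <- function(x) {
  x <- x[!is.na(x)]
  central_moment(x, 3) / central_moment(x, 2)^1.5
}

# Moment-based excess kurtosis: m4 / m2^2 - 3, NAs dropped.
kurt_sketch <- function(x) {
  x <- x[!is.na(x)]
  central_moment(x, 4) / central_moment(x, 2)^2 - 3
}

validate_df(iris)                # passes silently
skew_sketch(iris$Sepal.Length)   # positive: slight right skew
kurt_sketch(iris$Sepal.Length)   # negative: flatter than a normal curve
```

Checking the input once, up front, lets every later stage assume a well-formed data frame.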

Function reference

Profiling

Function	Purpose
`infer_column_types(df)`	Classify each column; character columns split into categorical vs text.
`analyze_missing(df)`	Per-column and overall missingness; complete-row count.
`summarize_columns(df)`	Numeric summary (mean, sd, variance, quartiles, IQR, skewness, kurtosis) and categorical cardinality / top level.
`data_quality_score(df)`	0–100 score and letter grade from completeness, row uniqueness, column variability, and (optionally) outlier rate.
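
To make the categorical-versus-text split concrete, here is a rough base-R sketch of a per-column classifier. It is not the logic of infer_column_types(): the function names and both cut-offs (max_levels, max_unique_ratio) are assumptions chosen for illustration.

```r
# Hypothetical classifier; the thresholds are illustrative, not the package's.
classify_column <- function(x, max_levels = 20, max_unique_ratio = 0.5) {
  if (inherits(x, c("Date", "POSIXt"))) return("date")
  if (is.logical(x)) return("logical")
  if (is.integer(x)) return("integer")
  if (is.numeric(x)) return("numeric")
  if (is.factor(x)) return("categorical")
  # Character data: a few repeated values look categorical,
  # mostly distinct values look like free text.
  vals <- x[!is.na(x)]
  n_unique <- length(unique(vals))
  if (n_unique <= max_levels && n_unique <= max_unique_ratio * length(vals)) {
    "categorical"
  } else {
    "text"
  }
}

# Apply the classifier to every column of a data frame.
classify_columns <- function(df) vapply(df, classify_column, character(1))

df <- data.frame(
  id      = 1:6,
  height  = c(1.5, 1.6, 1.7, 1.8, 1.9, 2.0),
  flag    = c(TRUE, FALSE, TRUE, TRUE, FALSE, NA),
  when    = as.Date("2024-01-01") + 0:5,
  colour  = c("red", "blue", "red", "blue", "red", "blue"),
  comment = c("ok", "late delivery", "n/a", "fine", "broken box", "see notes"),
  stringsAsFactors = FALSE
)
classify_columns(df)
# id "integer", height "numeric", flag "logical", when "date",
# colour "categorical", comment "text"
```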

Statistics

Function	Purpose
`normality_tests(df)`	Shapiro–Wilk (and Anderson–Darling if `nortest` is present) per numeric column; columns with more than 5000 values are subsampled to 5000, the largest sample `shapiro.test()` accepts.
`detect_outliers(x, method)`	`"iqr"`, `"zscore"`, or `"robust"` (median/MAD) on a vector.
`outlier_summary(df, method)`	Per-column outlier counts and an overall rate.
`correlation_analysis(df, method)`	Pearson and/or Spearman matrices over numeric columns.
`categorical_association(df)`	Cramer’s V matrix between categorical columns.
`analyze_dates(df)`	Range, unique count, and largest gap for date/datetime columns.
`compare_groups(df, group)`	Numeric summaries within the levels of a grouping column.
`skewness(x)`, `kurtosis(x)`	Moment-based, exported for direct use.
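
The three outlier rules differ mainly in how they estimate centre and spread. The sketch below shows one conventional form of each in base R. The function name flag_outliers and the thresholds (1.5 × IQR, |z| > 3, |robust z| > 3.5) are assumptions, not values taken from detect_outliers().

```r
# Flag outliers in a numeric vector; returns a logical vector (NA stays NA).
# Thresholds are common defaults, assumed here rather than read from the package.
flag_outliers <- function(x, method = c("iqr", "zscore", "robust")) {
  method <- match.arg(method)
  switch(method,
    iqr = {
      # Tukey fences: 1.5 * IQR beyond the quartiles.
      q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
      fence <- 1.5 * (q[2] - q[1])
      x < q[1] - fence | x > q[2] + fence
    },
    # Classical z-score: mean and sd are themselves pulled by outliers.
    zscore = abs((x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)) > 3,
    robust = {
      # mad() already scales by 1.4826, so the ratio is comparable to a z-score.
      # A MAD of zero (more than half the values identical) is not handled here.
      abs((x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE)) > 3.5
    }
  )
}

x <- c(10, 11, 12, 11, 10, 12, 11, 95)
which(flag_outliers(x, "iqr"))     # 8
which(flag_outliers(x, "zscore"))  # integer(0)
which(flag_outliers(x, "robust"))  # 8
```

The z-score rule misses the obvious outlier here because the value 95 inflates the standard deviation it is measured against. That masking effect is the usual argument for the median/MAD variant on small or contaminated samples.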

Visualization (ggplot2)

Function	Purpose
`plot_missing(df)`	Missing-value heatmap (rows subsampled when large).
`plot_distribution(df, column)`	Histogram + density (numeric) or bar chart (categorical).
`plot_correlation(df, method)`	Annotated correlation heatmap.
`plot_association(df)`	Cramer’s V heatmap for categorical columns.
`plot_boxplots(df)`	Faceted boxplots for the numeric columns.
`plot_pairs(df, columns)`	Scatterplot matrix for selected numeric columns.
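
Each cell of the heatmap drawn by plot_association() is a Cramer's V value. As a hedged illustration of the statistic behind it, the base-R sketch below computes the uncorrected form from a contingency table. The function names are hypothetical, and the package may use a bias-corrected variant or handle empty levels differently.

```r
# Cramer's V for two categorical vectors, uncorrected form:
# V = sqrt(chi^2 / (n * (min(rows, cols) - 1)))
cramers_v <- function(a, b) {
  tab <- table(a, b)                 # NA pairs are dropped by table()
  n <- sum(tab)
  k <- min(dim(tab)) - 1
  if (n == 0 || k < 1) return(NA_real_)
  # correct = FALSE: no continuity correction, so 2x2 tables use the same formula.
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (n * k)))
}

# Pairwise matrix over the columns of a data frame of categoricals.
cramers_v_matrix <- function(df) {
  nm <- names(df)
  out <- matrix(1, length(nm), length(nm), dimnames = list(nm, nm))
  for (i in seq_along(nm)) for (j in seq_along(nm)) {
    if (i < j) out[i, j] <- out[j, i] <- cramers_v(df[[i]], df[[j]])
  }
  out
}

cars <- data.frame(cyl  = factor(mtcars$cyl),
                   gear = factor(mtcars$gear),
                   am   = factor(mtcars$am))
round(cramers_v_matrix(cars), 2)
```

V ranges from 0 (no association) to 1 (one variable fully determines the other), which is why it maps cleanly onto a single colour scale.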

Pipeline, reporting & object

Function	Purpose
`profile_data(df, ...)`	Run everything; return a `data_profile`. Options include `group_by` and `distributions`.
`report(x, file)`	Render the profile to a self-contained HTML file (needs pandoc).
`print` / `summary` / `plot` methods	Overview / detail / figures (`plot()` also accepts `which = "association"`).
`is_data_profile(x)`	Class predicate.

The `data_profile` object

profile_data() returns an S3 list with four parts plus the call:

metadata — dataset name, dimensions, per-column types, type counts, timestamp.
statistics — numeric summary, categorical summary, correlation matrices, and the categorical association matrix.
diagnostics — missingness, normality, outliers, date-column profile, the grouped comparison (when group_by is set), and the quality score.
plots — the ggplot2 objects (empty if build_plots = FALSE; the per-column distribution plots are also skipped when distributions = FALSE).
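
The layout above is an ordinary list with a class attribute. The following sketch builds a miniature object of the same shape in base R to show why that keeps it inspectable and extensible. Everything in it (new_profile_sketch, the class name profile_sketch, the fields inside each part) is hypothetical and far simpler than what profile_data() returns.

```r
# Hypothetical constructor mirroring the four-part layout plus the call.
new_profile_sketch <- function(df) {
  num <- df[vapply(df, is.numeric, logical(1))]
  structure(
    list(
      metadata = list(
        dim = dim(df),
        column_types = vapply(df, function(x) class(x)[1], character(1))
      ),
      statistics  = list(numeric = vapply(num, mean, numeric(1), na.rm = TRUE)),
      diagnostics = list(missing = colSums(is.na(df))),
      plots       = list(),            # left empty in this sketch
      call        = match.call()
    ),
    class = "profile_sketch"
  )
}

# A print method is just a function named print.<class>.
print.profile_sketch <- function(x, ...) {
  cat("<profile_sketch> ", x$metadata$dim[1], " rows x ",
      x$metadata$dim[2], " columns\n", sep = "")
  invisible(x)
}

# Class predicate, analogous in spirit to is_data_profile().
is_profile_sketch <- function(x) inherits(x, "profile_sketch")

p <- new_profile_sketch(airquality)
p                      # dispatches to print.profile_sketch
str(p$diagnostics)     # plain list, so str() just works

# Extending the object needs no class redefinition.
p$diagnostics$complete_rows <- sum(complete.cases(airquality))
```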

Folder structure

dataProfilerR/
├── DESCRIPTION
├── NAMESPACE                # generated by roxygen2
├── LICENSE
├── NEWS.md
├── R/
│   ├── dataProfilerR-package.R
│   ├── utils.R              # validation + skewness/kurtosis
│   ├── profiling.R          # types, missingness, summaries, quality score
│   ├── statistics.R         # normality, outliers, correlation
│   ├── association.R        # Cramer's V for categoricals
│   ├── dates.R              # date/datetime profiling
│   ├── groups.R             # grouped comparison
│   ├── visualization.R      # ggplot2 functions
│   ├── report.R             # HTML report (rmarkdown)
│   ├── profile_data.R       # orchestrator + S3 constructor
│   └── methods.R            # print / summary / plot
├── man/                     # generated by roxygen2
├── tests/testthat/          # unit + edge-case tests
└── vignettes/dataProfilerR.Rmd

Testing

testthat (edition 3) covers each function plus edge cases — empty frames, wrong types, all-NA columns, single-column frames, missing-column plot requests, and output-shape consistency. Run with devtools::test().

Limitations and future improvements

Added in 0.2.0: report() (HTML), categorical_association() (Cramer’s V), analyze_dates(), compare_groups(), and a distributions = FALSE switch to avoid eager per-column plots on wide data. See NEWS.md.

Still open / honest gaps:

Numeric-vs-categorical effect sizes (e.g. eta-squared, group-mean differences with tests) aren’t here yet; compare_groups() reports descriptive summaries only, not significance.
Date analysis is shallow — range and gaps, but no seasonality/trend.
Distribution plots are still eager unless you opt out with distributions = FALSE; a fully lazy, build-on-demand path would be cleaner.
report() requires pandoc (the usual R Markdown dependency); there is no pandoc-free fallback.
Text columns are detected but not analysed beyond cardinality.
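
Until an effect-size function exists, eta-squared for one numeric column against one grouping column can be computed by hand. The helper below is a stopgap sketch in base R. The name eta_squared is hypothetical and nothing like it is exported by the package.

```r
# Eta-squared: share of a numeric column's variance explained by a grouping column
# (between-group sum of squares divided by total sum of squares).
eta_squared <- function(x, g) {
  keep <- !is.na(x) & !is.na(g)
  x <- x[keep]
  g <- factor(g[keep])
  group_means <- ave(x, g)                      # each value replaced by its group mean
  ss_between <- sum((group_means - mean(x))^2)
  ss_total   <- sum((x - mean(x))^2)
  ss_between / ss_total
}

eta_squared(iris$Sepal.Length, iris$Species)    # about 0.62
```

For a single grouping factor this equals the R-squared of a one-way linear model, which offers an easy cross-check with lm().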

License
