Automated exploratory data analysis for R. Point it at a data frame
and it returns a structured profile — column types, missingness,
distributional statistics, normality tests, outliers, correlations, a
data-quality score, and ggplot2 figures — through a single
function, profile_data().
The aim is to cover the first hour of EDA that you’d otherwise write by hand for every new dataset, while keeping the result a plain, inspectable object you can build on.
# install.packages("remotes")
remotes::install_github("mqfarooqi1/dataProfilerR")

Depends on ggplot2. The Anderson–Darling normality test
additionally uses the suggested nortest package; if it
isn’t installed, only Shapiro–Wilk is run.
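The optional-dependency behaviour described above can be sketched in a few lines of base R. This is an illustration only, not the package's own code: the helper name check_normality() and its max_n argument are hypothetical, and the 5000 cap mirrors the limit that shapiro.test() itself enforces.

```r
# Hypothetical helper (not part of dataProfilerR): Shapiro-Wilk always runs,
# Anderson-Darling runs only when the suggested nortest package is installed.
check_normality <- function(x, max_n = 5000L) {
  x <- x[is.finite(x)]
  # shapiro.test() accepts between 3 and 5000 values, so cap the sample size
  if (length(x) > max_n) x <- sample(x, max_n)
  out <- list(shapiro_p = shapiro.test(x)$p.value, anderson_p = NA_real_)
  if (requireNamespace("nortest", quietly = TRUE)) {
    out$anderson_p <- nortest::ad.test(x)$p.value
  }
  out
}

res <- check_normality(iris$Sepal.Length)
str(res)
```

When nortest is missing, anderson_p simply stays NA, so downstream code can rely on a fixed output shape.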
library(dataProfilerR)
p <- profile_data(iris)
p # concise overview + quality score
summary(p) # numeric summary, missingness, normality, outliers, correlations
plot(p, which = "correlation") # retrieve a figure
plot(p, which = "distribution", column = "Sepal.Length")
# components are just list elements
p$metadata$column_types
p$diagnostics$quality$score
p$statistics$numeric
# grouped comparison + a self-contained HTML report (needs pandoc)
p <- profile_data(iris, group_by = "Species")
p$diagnostics$groups$numeric_by_group
report(p, "iris_report.html")

See the vignette (vignette("dataProfilerR")) for a full
walkthrough on a messy dataset.
The package is organised as a pipeline of independent, individually-callable functions, with one orchestrator on top:
                      profile_data()   <- orchestrator
    ┌───────────────────────┼───────────────────────────────────┐
profiling             statistics                          visualization
─────────             ──────────                          ────────────
infer_column_types    normality_tests                     plot_missing
analyze_missing       detect_outliers / outlier_summary   plot_distribution
summarize_columns     correlation_analysis                plot_correlation
data_quality_score                                        plot_boxplots
                                                          plot_pairs
                            │
                            ▼
      data_profile (S3 object) ── print() / summary() / plot()
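The bottom layer of the diagram, a plain S3 object with its own print method, can be illustrated with a toy version. Everything here is hypothetical (the class name toy_profile, the constructor, and the fields it fills); it only shows the general pattern of a classed list, not the package's actual constructor.

```r
# Hypothetical constructor: a plain list with a class attribute attached
new_toy_profile <- function(df) {
  stopifnot(is.data.frame(df))
  is_num <- vapply(df, is.numeric, logical(1))
  structure(
    list(
      metadata = list(
        n_rows = nrow(df),
        n_cols = ncol(df),
        column_types = vapply(df, function(col) class(col)[1], character(1))
      ),
      statistics = list(numeric = summary(df[is_num]))
    ),
    class = "toy_profile"
  )
}

# S3 method: dispatched automatically by print() for objects of this class
print.toy_profile <- function(x, ...) {
  cat("<toy_profile>", x$metadata$n_rows, "rows x",
      x$metadata$n_cols, "columns\n")
  invisible(x)
}

tp <- new_toy_profile(iris)
print(tp)
tp$metadata$column_types   # components are ordinary list elements
```

Because the object is just a list, str(), $ access, and saveRDS() all work without extra methods.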
Design choices worth calling out:
- S3 rather than S4. The profile is a plain list (so str(p) just works), serialisable, and easy to extend with new elements without redefining a formal class. S4’s validity and dispatch machinery would be overhead with no payoff here. The methods provided are print, summary, and plot.
- Individually callable functions. infer_column_types(), detect_outliers(), plot_correlation() etc. all work directly on a data frame or vector, so the package is useful piecemeal, not only through the orchestrator.
- Few dependencies. Nothing is required beyond base/recommended packages except ggplot2. Skewness and kurtosis are implemented directly rather than pulling in moments; Anderson–Darling degrades gracefully when nortest is absent.

Profiling
| Function | Purpose |
|---|---|
| infer_column_types(df) | Classify each column; character columns split into categorical vs text. |
| analyze_missing(df) | Per-column and overall missingness; complete-row count. |
| summarize_columns(df) | Numeric summary (mean, sd, variance, quartiles, IQR, skewness, kurtosis) and categorical cardinality / top level. |
| data_quality_score(df) | 0–100 score and letter grade from completeness, row uniqueness, column variability, and (optionally) outlier rate. |
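The skewness and kurtosis columns of the numeric summary are described as moment-based. A minimal sketch of that idea follows; the function names are hypothetical, and it assumes population (divide-by-n) central moments and non-excess kurtosis, which may differ from the convention the package uses.

```r
# Hypothetical moment-based estimators (population moments, NA values dropped)
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

moment_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  # Equals 3 for a normal distribution; subtract 3 for excess kurtosis
  mean(d^4) / mean(d^2)^2
}

moment_skewness(iris$Sepal.Length)
moment_kurtosis(iris$Sepal.Length)
```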
Statistics
| Function | Purpose |
|---|---|
| normality_tests(df) | Shapiro–Wilk (and Anderson–Darling if nortest is present) per numeric column; large columns subsampled to 5000. |
| detect_outliers(x, method) | "iqr", "zscore", or "robust" (median/MAD) on a vector. |
| outlier_summary(df, method) | Per-column outlier counts and an overall rate. |
| correlation_analysis(df, method) | Pearson and/or Spearman matrices over numeric columns. |
| categorical_association(df) | Cramer’s V matrix between categorical columns. |
| analyze_dates(df) | Range, unique count, and largest gap for date/datetime columns. |
| compare_groups(df, group) | Numeric summaries within the levels of a grouping column. |
| skewness(x), kurtosis(x) | Moment-based, exported for direct use. |
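To make the three outlier methods in the table concrete, here is a base-R sketch of what each rule typically means. The function name flag_outliers() and the cut-offs (1.5 × IQR fences, |z| > 3, and 3.5 scaled MADs) are common textbook defaults assumed for illustration; the thresholds that detect_outliers() actually uses are not stated in this README.

```r
# Hypothetical re-implementation; thresholds are assumed conventional defaults
flag_outliers <- function(x, method = c("iqr", "zscore", "robust")) {
  method <- match.arg(method)
  switch(method,
    iqr = {
      # Tukey fences: more than 1.5 IQRs outside the quartiles
      q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
      fence <- 1.5 * (q[2] - q[1])
      x < q[1] - fence | x > q[2] + fence
    },
    # Classical z-score: more than 3 standard deviations from the mean
    zscore = abs(x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE) > 3,
    # Robust z-score: distance from the median in units of the scaled MAD
    robust = abs(x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE) > 3.5
  )
}

x <- c(10, 11, 12, 11, 10, 12, 11, 50)
which(flag_outliers(x, "iqr"))
which(flag_outliers(x, "robust"))
which(flag_outliers(x, "zscore"))
```

On this short vector the IQR and robust rules flag the value 50, while the z-score rule flags nothing: the extreme value inflates the standard deviation enough to mask itself, which is the usual argument for a median/MAD option.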
Visualization (ggplot2)
| Function | Purpose |
|---|---|
| plot_missing(df) | Missing-value heatmap (rows subsampled when large). |
| plot_distribution(df, column) | Histogram + density (numeric) or bar chart (categorical). |
| plot_correlation(df, method) | Annotated correlation heatmap. |
| plot_association(df) | Cramer’s V heatmap for categorical columns. |
| plot_boxplots(df) | Faceted boxplots for the numeric columns. |
| plot_pairs(df, columns) | Scatterplot matrix for selected numeric columns. |
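Each cell of the Cramer’s V heatmap is a single pairwise statistic derived from a chi-squared test. A minimal base-R sketch of one such cell is shown below; cramers_v() is a hypothetical name, and the version here applies no bias correction, which may or may not match categorical_association().

```r
# Hypothetical helper: uncorrected Cramer's V for two categorical vectors
cramers_v <- function(a, b) {
  tab <- table(a, b)
  # Warnings about small expected counts are suppressed for brevity
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  unname(sqrt(chi2 / (n * k)))
}

v <- cramers_v(mtcars$cyl, mtcars$gear)
v
```

The value lies between 0 (no association) and 1 (one variable fully determines the other). Factors with unused levels should be passed through droplevels() first, since empty rows or columns in the table make the chi-squared statistic undefined.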
Pipeline, reporting & object
| Function | Purpose |
|---|---|
| profile_data(df, ...) | Run everything; return a data_profile. Options include group_by and distributions. |
| report(x, file) | Render the profile to a self-contained HTML file (needs pandoc). |
| print / summary / plot methods | Overview / detail / figures (plot() adds which = "association"). |
| is_data_profile(x) | Class predicate. |
data_profile object

profile_data() returns an S3 list with four parts plus the call:
- metadata: dataset name, dimensions, per-column types, type counts, timestamp.
- statistics: numeric summary, categorical summary, correlation matrices, and the categorical association matrix.
- diagnostics: missingness, normality, outliers, date-column profile, the grouped comparison (when group_by is set), and the quality score.
- plots: the ggplot2 objects (empty if build_plots = FALSE; the per-column distribution plots are also skipped when distributions = FALSE).

dataProfilerR/
├── DESCRIPTION
├── NAMESPACE # generated by roxygen2
├── LICENSE
├── NEWS.md
├── R/
│ ├── dataProfilerR-package.R
│ ├── utils.R # validation + skewness/kurtosis
│ ├── profiling.R # types, missingness, summaries, quality score
│ ├── statistics.R # normality, outliers, correlation
│ ├── association.R # Cramer's V for categoricals
│ ├── dates.R # date/datetime profiling
│ ├── groups.R # grouped comparison
│ ├── visualization.R # ggplot2 functions
│ ├── report.R # HTML report (rmarkdown)
│ ├── profile_data.R # orchestrator + S3 constructor
│ └── methods.R # print / summary / plot
├── man/ # generated by roxygen2
├── tests/testthat/ # unit + edge-case tests
└── vignettes/dataProfilerR.Rmd
testthat (edition 3) covers each function plus edge
cases — empty frames, wrong types, all-NA columns,
single-column frames, missing-column plot requests, and output-shape
consistency. Run with devtools::test().
Added in 0.2.0: report() (HTML),
categorical_association() (Cramer’s V),
analyze_dates(), compare_groups(), and a
distributions = FALSE switch to avoid eager per-column
plots on wide data. See NEWS.md.
Still open / honest gaps:
- compare_groups() reports descriptive summaries only, not significance.
- Per-column distribution plots are built eagerly unless distributions = FALSE; a fully lazy, build-on-demand path would be cleaner.
- report() requires pandoc (the usual R Markdown dependency); there is no pandoc-free fallback.

MIT © Muhammad Farooqi