Title:

Automated Exploratory Data Analysis and Dataset Profiling

Version:

0.2.1

Description:

Profiles a data frame with minimal input: column type inference, missing-value analysis, distributional summary statistics (including skewness and kurtosis), normality tests, outlier detection, correlation and categorical-association analysis, date-column profiling, grouped comparisons and an overall data-quality score, alongside a set of 'ggplot2' visualisations. A single entry point, profile_data(), returns a structured S3 object holding metadata, statistics, diagnostics and plots, with print(), summary() and plot() methods, and report() renders the whole profile to a self-contained HTML file. Statistical methods include the Shapiro-Wilk normality test as implemented by Royston (1995) <doi:10.2307/2986146> and the Anderson-Darling test following Stephens (1974) <doi:10.1080/01621459.1974.10480196>, with power comparisons of these tests in Yap and Sim (2011) <doi:10.1080/00949655.2010.520163>, and the categorical association measure of Cramer (1946, ISBN:9780691080048).

License:

MIT + file LICENSE

Encoding:

UTF-8

Language:

en-GB

Depends:

R (≥ 4.1.0)

Imports:

ggplot2, stats, utils

Suggests:

testthat (≥ 3.0.0), knitr, rmarkdown, nortest, spelling

VignetteBuilder:

knitr

URL:

https://github.com/mqfarooqi1/dataProfilerR

BugReports:

https://github.com/mqfarooqi1/dataProfilerR/issues

Config/testthat/edition:

Config/roxygen2/version:

8.0.0

NeedsCompilation:

Packaged:

2026-06-18 22:53:01 UTC; faroo

Author:

Muhammad Farooqi [aut, cre]

Maintainer:

Muhammad Farooqi <mqfarooqi@gmail.com>

Repository:

CRAN

Date/Publication:

2026-06-24 08:20:18 UTC

dataProfilerR: automated exploratory data analysis

Description

dataProfilerR profiles a data frame with a single call. It infers column types, quantifies missingness, computes distributional statistics, runs normality tests, detects outliers, measures correlation, and rolls the findings into a data-quality score. It also builds a set of ggplot2 visualisations. The main entry point is profile_data(), which returns a data_profile S3 object with print(), summary() and plot() methods.

Design

The package uses the S3 object system. The profiling result is a plain list with class "data_profile", which keeps the structure transparent and easy to inspect, serialise, and extend. S4 would add formality (and overhead) that an EDA result object does not need.
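As a rough illustration of this design choice, the sketch below builds a plain list with a class attribute and gives it a print() method. The class name toy_profile, the constructor new_toy_profile() and the list fields are hypothetical and are not the package's internals; the point is only that an S3 result stays an ordinary list that can be inspected with str() or saved with saveRDS().

```r
# Hypothetical constructor: a plain list tagged with a class attribute.
new_toy_profile <- function(df) {
  stopifnot(is.data.frame(df))
  structure(
    list(
      metadata   = list(n_rows = nrow(df), n_cols = ncol(df)),
      statistics = list(n_missing = colSums(is.na(df)))
    ),
    class = "toy_profile"
  )
}

# S3 method: print() dispatches here because of the class attribute.
print.toy_profile <- function(x, ...) {
  cat("<toy_profile> ", x$metadata$n_rows, " rows x ",
      x$metadata$n_cols, " columns\n", sep = "")
  invisible(x)
}

p <- new_toy_profile(iris)
print(p)
str(unclass(p), max.level = 1)  # the underlying list is easy to inspect
```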

Author(s)

Maintainer: Muhammad Farooqi <mqfarooqi@gmail.com>

Authors:

Muhammad Farooqi <mqfarooqi@gmail.com>

Profile date / datetime columns

Description

For each Date/POSIXct column, reports the count, missingness, range, and the largest gap between consecutive (sorted, unique) timestamps – a quick way to spot coverage holes in a time series.

Usage

analyze_dates(df, types = NULL)

Arguments

df

A data frame.

types

Optional named character vector of column types; computed if not supplied.

Value

A data frame with one row per date column (column, n, n_missing, min, max, range_days, n_unique, max_gap_days), or NULL if there are no date columns.

Examples

df <- data.frame(d = as.Date("2026-01-01") + c(0, 1, 2, 10))
analyze_dates(df)
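
The gap calculation described above can be sketched in base R. This is an assumption about how such a figure could be derived, not the package's own code; the variable names are illustrative.

```r
d <- as.Date("2026-01-01") + c(0, 1, 2, 10)

# Sort the unique, non-missing dates, then difference consecutive values.
u <- sort(unique(d[!is.na(d)]))
gaps <- as.numeric(diff(u), units = "days")

max_gap_days <- if (length(gaps) > 0) max(gaps) else NA_real_
range_days   <- as.numeric(max(u) - min(u), units = "days")

max_gap_days  # the jump from 2026-01-03 to 2026-01-11
range_days
```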

Analyse missing values

Description

Reports missingness per column and overall, including how many rows are fully complete. Only NA is counted as missing (blank strings are not).

Usage

analyze_missing(df)

Arguments

df

A data frame.

Value

A list with per_column (a data frame of column, n_missing, pct_missing) and overall (a list with total/missing cell counts, pct_missing, complete_rows and pct_complete_rows).

Examples

analyze_missing(data.frame(a = c(1, NA, 3), b = c("x", "y", NA)))
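
For orientation, the same quantities can be approximated with base R alone. This is a minimal sketch, not the package's implementation; in particular, expressing pct_missing on a 0-100 scale is an assumption.

```r
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

na_mask <- is.na(df)  # logical matrix; blank strings are not NA

per_column <- data.frame(
  column      = names(df),
  n_missing   = unname(colSums(na_mask)),
  pct_missing = unname(100 * colMeans(na_mask))  # assumed 0-100 scale
)

overall <- list(
  total_cells   = length(na_mask),
  missing_cells = sum(na_mask),
  complete_rows = sum(complete.cases(df))
)

per_column
overall
```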

Categorical association (Cramer's V)

Description

Computes Cramer's V between every pair of categorical/logical columns. V ranges from 0 (no association) to 1 (perfect association) and is the categorical analogue of a correlation matrix. It is derived from the chi-squared statistic: V = sqrt(X^2 / (n * (k - 1))), where k is the smaller of the two factors' level counts.

Usage

categorical_association(df, types = NULL, max_levels = 50)

Arguments

df

A data frame.

types

Optional named character vector of column types (from infer_column_types()); computed if not supplied.

max_levels

Categorical columns with more than this many levels are skipped (a high-cardinality column makes the chi-squared test unreliable and the table huge). Default 50.

Value

A symmetric numeric matrix of Cramer's V with a unit diagonal, or NULL if fewer than two eligible categorical columns are present.

Examples

df <- data.frame(a = c("x", "x", "y", "y"), b = c("p", "p", "q", "q"),
                 c = c("m", "n", "m", "n"))
categorical_association(df)
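
The formula in the Description can be checked by hand for a single pair of columns. The helper cramers_v() below is a hypothetical name used only for this sketch; it computes the chi-squared statistic without continuity correction and assumes every level that appears in the table is actually observed (no empty rows or columns).

```r
cramers_v <- function(x, y) {
  tab <- table(x, y)
  n <- sum(tab)
  # Expected counts under independence: row total * column total / n.
  expected <- outer(rowSums(tab), colSums(tab)) / n
  chi2 <- sum((tab - expected)^2 / expected)
  k <- min(nrow(tab), ncol(tab))  # smaller of the two level counts
  sqrt(chi2 / (n * (k - 1)))
}

a <- c("x", "x", "y", "y")
b <- c("p", "p", "q", "q")
c3 <- c("m", "n", "m", "n")

cramers_v(a, b)   # b is fully determined by a
cramers_v(a, c3)  # c3 is balanced within each level of a
```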

Compare numeric columns across groups

Description

Grouped profiling: split the data by a categorical column and summarise each numeric column within each group (count, mean, sd, median, missingness). This is the quickest way to see whether a metric differs by segment.

Usage

compare_groups(df, group, max_groups = 50)

Arguments

df

A data frame.

group

Name of the grouping column. Should be categorical/logical (or a low-cardinality column); a warning is issued if it has many levels.

max_groups

Maximum number of groups before erroring (guards against accidentally grouping on a near-unique column). Default 50.

Value

A list with group_sizes (a data frame of group, n) and numeric_by_group (a long data frame of group, column, n, n_missing, mean, sd, median), or NULL if there are no numeric columns to compare.

Examples

compare_groups(iris, "Species")
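
A comparable grouped summary for a single numeric column can be sketched with split() and lapply(). This only illustrates the split-and-summarise idea and makes no claim about how compare_groups() is implemented; the column choice is arbitrary.

```r
# Summarise Sepal.Length within each Species group.
by_group <- do.call(rbind, lapply(split(iris, iris$Species), function(g) {
  data.frame(
    group     = as.character(g$Species[1]),
    column    = "Sepal.Length",
    n         = sum(!is.na(g$Sepal.Length)),
    n_missing = sum(is.na(g$Sepal.Length)),
    mean      = mean(g$Sepal.Length, na.rm = TRUE),
    sd        = sd(g$Sepal.Length, na.rm = TRUE),
    median    = median(g$Sepal.Length, na.rm = TRUE)
  )
}))
rownames(by_group) <- NULL
by_group
```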

Correlation analysis

Description

Correlation matrices over the numeric columns, using pairwise-complete observations.

Usage

correlation_analysis(df, types = NULL, method = c("pearson", "spearman"))

Arguments

df

A data frame.

types

Optional named character vector of column types.

method

Character vector; any of "pearson", "spearman". Default both.

Value

A named list of correlation matrices (one per requested method), or NULL if there are fewer than two numeric columns.

Examples

correlation_analysis(iris)
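
The same kind of result can be approximated with stats::cor(), which supports pairwise-complete observations through its use argument. A minimal sketch, assuming a named list keyed by method is all that is needed:

```r
# Keep only the numeric columns.
num <- iris[vapply(iris, is.numeric, logical(1))]

methods <- c("pearson", "spearman")
cors <- lapply(methods, function(m) {
  cor(num, use = "pairwise.complete.obs", method = m)
})
names(cors) <- methods

round(cors$pearson, 2)
```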

Data quality score

Description

Rolls several signals into a single 0-100 score and a letter grade. The components are completeness (share of non-missing cells), row uniqueness (penalises duplicate rows), and column variability (penalises constant, single-value columns). If an outlier_rate is supplied, a cleanliness component is added. The score is the weighted average of the available components, using the supplied weights.

Usage

data_quality_score(
  df,
  missing = NULL,
  outlier_rate = NULL,
  weights = c(completeness = 0.4, uniqueness = 0.2, variability = 0.2, cleanliness = 0.2)
)

Arguments

df

A data frame.

missing

Optional result of analyze_missing(); computed if NULL.

outlier_rate

Optional fraction (0-1) of numeric cells flagged as outliers; if supplied, a cleanliness component is included.

weights

Optional named numeric vector of component weights. Missing components are dropped and the rest renormalised.

Value

A list with score (0-100), grade (a letter), and components (a named numeric vector of the component scores).

Examples

data_quality_score(iris)
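
The weighting scheme can be illustrated with a small worked example. The component formulas below are assumptions chosen to match the verbal descriptions (the package may define them differently), and the letter grade is omitted because its thresholds are not documented here. What the sketch does show is how a missing component (cleanliness) is dropped and the remaining weights renormalised.

```r
df <- data.frame(a = c(1, 2, 2, NA), b = c("x", "y", "y", "z"), k = 1)

# Assumed component definitions, each on a 0-1 scale.
components <- c(
  completeness = mean(!is.na(df)),       # 11 of 12 cells present
  uniqueness   = mean(!duplicated(df)),  # 3 of 4 rows are not repeats
  variability  = mean(vapply(
    df,
    function(col) length(unique(col[!is.na(col)])) > 1,
    logical(1)
  ))                                     # 2 of 3 columns vary
)

weights <- c(completeness = 0.4, uniqueness = 0.2,
             variability = 0.2, cleanliness = 0.2)

# Drop weights for absent components, then renormalise to sum to 1.
w <- weights[names(components)]
w <- w / sum(w)

score <- 100 * sum(w * components)
score
```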

Detect outliers in a numeric vector

Description

Three standard rules:

"iqr": below Q1 - k*IQR or above Q3 + k*IQR (Tukey's rule, k = 1.5).
"zscore": absolute z-score above threshold (default 3).
"robust": absolute modified z-score using the median and MAD above threshold (default 3.5); resistant to the outliers it is detecting.

Usage

detect_outliers(x, method = c("iqr", "zscore", "robust"), threshold = NULL)

Arguments

x

A numeric vector.

method

One of "iqr", "zscore", "robust".

threshold

Cutoff for "zscore"/"robust"; the IQR multiplier for "iqr". Defaults: 1.5 (iqr), 3 (zscore), 3.5 (robust).

Value

A list: method, n (non-missing count), n_outliers, pct, is_outlier (a logical vector aligned to x, FALSE for NA), and bounds (lower/upper, where applicable).

Examples

detect_outliers(c(1, 2, 3, 4, 100), method = "iqr")
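
The three rules can be compared side by side in base R. This is a sketch under stated assumptions: quantiles use R's default method, and the modified z-score uses the common 0.6745 * (x - median) / MAD convention, which may differ from the package's exact scaling. It also assumes the MAD is non-zero.

```r
x <- c(1, 2, 3, 4, 100)

# "iqr": Tukey fences at Q1 - 1.5*IQR and Q3 + 1.5*IQR.
q <- quantile(x, c(0.25, 0.75), names = FALSE)
fence <- 1.5 * (q[2] - q[1])
iqr_flag <- x < q[1] - fence | x > q[2] + fence

# "zscore": mean and sd are themselves inflated by the extreme value.
z_flag <- abs((x - mean(x)) / sd(x)) > 3

# "robust": median and raw MAD are barely affected by the extreme value.
robust_flag <- abs(0.6745 * (x - median(x)) / mad(x, constant = 1)) > 3.5

data.frame(x, iqr_flag, z_flag, robust_flag)
```

On this vector the z-score rule flags nothing, because the value 100 inflates the standard deviation enough to mask itself, while the IQR and robust rules both flag it.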

Infer a semantic type for each column

Description

Maps each column to one of "numeric", "integer", "date", "logical", "categorical", "text" or "other". Character columns are split into "categorical" and "text" heuristically: long strings, or high-cardinality columns where most values are unique, are treated as free text; everything else is categorical.

Usage

infer_column_types(df, text_min_avg_chars = 50, text_unique_ratio = 0.8)

Arguments

df

A data frame.

text_min_avg_chars

Average character length above which a character column is considered free text. Default 50.

text_unique_ratio

Fraction of unique values above which a character column (with enough rows) is considered free text. Default 0.8.

Value

A named character vector of inferred types, one per column.

Examples

infer_column_types(data.frame(a = 1:3, b = c("x", "y", "z"),
                              d = Sys.Date() + 0:2))
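
The categorical-versus-text heuristic for character columns might look roughly like the sketch below. The helper classify_character() and the min_rows cutoff are hypothetical: the documentation says only that the uniqueness rule applies to columns "with enough rows", so the value 20 is a placeholder, not the package's threshold.

```r
classify_character <- function(x, text_min_avg_chars = 50,
                               text_unique_ratio = 0.8,
                               min_rows = 20) {  # min_rows is assumed
  x <- x[!is.na(x)]
  if (length(x) == 0) return("categorical")

  long_strings  <- mean(nchar(x)) > text_min_avg_chars
  mostly_unique <- length(x) >= min_rows &&
    length(unique(x)) / length(x) > text_unique_ratio

  if (long_strings || mostly_unique) "text" else "categorical"
}

classify_character(c("x", "y", "z"))         # too few rows for the ratio rule
classify_character(paste("id", 1:100))       # nearly all values unique
classify_character(rep(strrep("a", 60), 3))  # long strings
```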

Is an object a data_profile?

Description

Tests whether an object has class data_profile.

Usage

is_data_profile(x)

Arguments

x

Any object.

Value

TRUE if x has class data_profile.

Examples

is_data_profile(profile_data(iris))

Sample excess kurtosis

Description

Moment-based kurtosis minus 3, so a normal distribution scores near 0.

Usage

kurtosis(x)

Arguments

x

A numeric vector.

Value

A single numeric value, or NA_real_ if there are fewer than four non-missing values or the variance is zero.

Examples

kurtosis(rnorm(100))
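
A moment-based excess kurtosis can be written in a few lines. The sketch assumes the plain (biased) central moments m4 / m2^2 with no small-sample correction; whether the package applies a correction is not stated here, and excess_kurtosis() is a name used only for this illustration.

```r
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) < 4) return(NA_real_)
  d <- x - mean(x)
  m2 <- mean(d^2)                 # second central moment
  if (m2 == 0) return(NA_real_)   # constant input: kurtosis undefined
  mean(d^4) / m2^2 - 3            # subtract 3 so a normal scores near 0
}

excess_kurtosis(c(-1, 1, -1, 1))  # two-point distribution
excess_kurtosis(c(1, 2, 3))       # too few values
```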

Normality tests for numeric columns

Description

Runs the Shapiro-Wilk test on each numeric/integer column, and the Anderson-Darling test as well if the suggested nortest package is installed. Shapiro-Wilk requires 3 to 5000 observations; larger columns are reduced to an evenly-spaced subsample of 5000. The subsample is deterministic and does not touch the session's random-number state.

Usage

normality_tests(df, types = NULL, alpha = 0.05)

Arguments

df

A data frame.

types

Optional named character vector of column types.

alpha

Significance level for the normal verdict. Default 0.05.

Value

A data frame with one row per numeric column: column, n_used, shapiro_W, shapiro_p, ad_A and ad_p (the Anderson-Darling columns are NA if nortest is absent), and a logical normal. Returns NULL if there are no numeric columns.

Examples

normality_tests(iris)
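
The deterministic subsampling step can be sketched as follows. The index construction with seq() and round() is an assumption about one way to take an evenly spaced subsample; because it uses no random numbers, the session's RNG state is untouched. The input is a synthetic vector of normal quantiles standing in for a large column.

```r
x <- qnorm(ppoints(20000))  # deterministic stand-in for a large column
max_n <- 5000               # upper limit accepted by shapiro.test()

# Evenly spaced positions from the first to the last observation.
idx <- if (length(x) > max_n) {
  round(seq(1, length(x), length.out = max_n))
} else {
  seq_along(x)
}

res <- shapiro.test(x[idx])
res$statistic
res$p.value
```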

Outlier summary across numeric columns

Description

Applies detect_outliers() to every numeric column and tabulates the result.

Usage

outlier_summary(df, types = NULL, method = "iqr")

Arguments

df

A data frame.

types

Optional named character vector of column types.

method

Outlier method passed to detect_outliers().

Value

A list with per_column (a data frame of column, n_outliers, pct) and overall_rate (fraction of numeric cells flagged, 0-1), or NULL if there are no numeric columns.

Examples

outlier_summary(iris)

Plot a data profile

Description

Returns one of the figures built by profile_data().

Usage

## S3 method for class 'data_profile'
plot(
  x,
  which = c("missing", "correlation", "association", "boxplots", "pairs", "distribution"),
  column = NULL,
  ...
)

Arguments

x

A data_profile object (built with build_plots = TRUE).

which

Which figure: "missing", "correlation", "association", "boxplots", "pairs", or "distribution".

column

Column name, required when which = "distribution".

...

Ignored.

Value

A ggplot2 object (also drawn when called at the console).

Examples

p <- profile_data(iris)

plot(p, which = "missing")
plot(p, which = "distribution", column = "Sepal.Length")

Categorical association heatmap

Description

Heatmap of the Cramer's V matrix from categorical_association().

Usage

plot_association(df, max_levels = 50)

Arguments

df

A data frame.

max_levels

Passed to categorical_association().

Value

A ggplot2 object, or NULL (with a warning) if there are fewer than two eligible categorical columns.

Examples

plot_association(
  data.frame(a = c("x", "x", "y", "y"), b = c("p", "p", "q", "q"))
)

Boxplots for numeric columns

Description

One boxplot per numeric column, faceted with free y-scales so columns on different scales are still readable. Useful as a quick outlier scan.

Usage

plot_boxplots(df)

Arguments

df

A data frame.

Value

A ggplot2 object, or NULL (with a warning) if there are no numeric columns.

Examples

plot_boxplots(iris)

Correlation heatmap

Description

A heatmap of the correlation matrix over the numeric columns, annotated with the rounded coefficients.

Usage

plot_correlation(df, method = c("pearson", "spearman"))

Arguments

df

A data frame.

method

Correlation method: "pearson" or "spearman".

Value

A ggplot2 object, or NULL (with a warning) if there are fewer than two numeric columns.

Examples

plot_correlation(iris)

Distribution plot for a single column

Description

Histogram with a density overlay for numeric columns; a bar chart of the most frequent levels for categorical/text/logical columns.

Usage

plot_distribution(df, column, bins = 30, max_levels = 20)

Arguments

df

A data frame.

column

Name of the column to plot.

bins

Histogram bins for numeric columns. Default 30.

max_levels

Maximum categories to show for categorical columns. Default 20.

Value

A ggplot2 object.

Examples

plot_distribution(iris, "Sepal.Length")
plot_distribution(iris, "Species")

Missing-value heatmap

Description

A tile plot of where NAs fall: columns on the x-axis, rows on the y-axis, shaded by whether each cell is missing. For tall data the rows are subsampled to max_rows so the plot stays legible.

Usage

plot_missing(df, max_rows = 500)

Arguments

df

A data frame.

max_rows

Maximum rows to display (subsampled if exceeded). Default 500.

Value

A ggplot2 object.

Examples

df <- data.frame(a = c(1, NA, 3), b = c(NA, "y", "z"))
plot_missing(df)
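
The data preparation behind such a tile plot can be sketched in base R, with the drawing step guarded so the example also runs without ggplot2. The evenly spaced row subsample is an assumption (the package may sample differently), and the object names are illustrative.

```r
df <- data.frame(a = c(1, NA, 3), b = c(NA, "y", "z"))
max_rows <- 500

# Subsample row positions only when the data is taller than max_rows.
rows <- seq_len(nrow(df))
if (length(rows) > max_rows) {
  rows <- round(seq(1, nrow(df), length.out = max_rows))
}

# Reshape the logical NA matrix to long form: one record per cell.
na_mask <- is.na(df[rows, , drop = FALSE])
long <- expand.grid(row = rows, column = colnames(na_mask),
                    stringsAsFactors = FALSE)
long$missing <- as.vector(na_mask)  # column-major, matching expand.grid

if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(long, ggplot2::aes(x = column, y = row, fill = missing)) +
    ggplot2::geom_tile()
}
long
```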

Pairwise scatterplot matrix

Description

A scatterplot matrix over selected numeric columns, drawn with facets. Capped at a handful of columns because the number of panels grows quadratically.

Usage

plot_pairs(df, columns = NULL, max_cols = 5)

Arguments

df

A data frame.

columns

Optional character vector of numeric columns to include. If NULL, the first max_cols numeric columns are used.

max_cols

Maximum number of columns to include. Default 5.

Value

A ggplot2 object, or NULL (with a warning) if fewer than two numeric columns are available.

Examples

plot_pairs(iris, c("Sepal.Length", "Sepal.Width", "Petal.Length"))

Print a concise overview of a data profile

Description

Prints a concise overview of a data_profile object.

Usage

## S3 method for class 'data_profile'
print(x, ...)

Arguments

x

A data_profile object.

...

Ignored.

Value

x, invisibly.

Examples

print(profile_data(iris))

Profile a data frame

Description

The package's single entry point. It runs type inference, missing-value analysis, summary statistics, normality tests, outlier detection, correlation analysis and a data-quality score, and (optionally) builds a set of ggplot2 visualisations. The result is a data_profile S3 object with print(), summary() and plot() methods.

Usage

profile_data(
  df,
  dataset_name = NULL,
  build_plots = TRUE,
  distributions = TRUE,
  normality = TRUE,
  outlier_method = "iqr",
  cor_method = c("pearson", "spearman"),
  group_by = NULL,
  verbose = FALSE
)

Arguments

df

A data frame with at least one row and one column and unique, non-empty column names.

dataset_name

Optional label stored in the metadata; defaults to the deparsed name of df.

build_plots

Whether to build the ggplot2 objects. Set FALSE to skip plotting on very wide data. Default TRUE.

distributions

Whether to build a per-column distribution plot (the eager, heaviest part of plotting). Set FALSE on wide data and use plot_distribution() on demand. Ignored if build_plots = FALSE. Default TRUE.

normality

Whether to run normality tests. Default TRUE.

outlier_method

Method passed to outlier_summary(): "iqr", "zscore" or "robust". Default "iqr".

cor_method

Correlation methods: any of "pearson", "spearman".

group_by

Optional name of a categorical column. If supplied, a grouped comparison of the numeric columns is added to the diagnostics (see compare_groups()).

verbose

Print progress messages. Default FALSE.

Value

An object of class data_profile: a list with elements metadata, statistics, diagnostics, plots and call.

Examples

p <- profile_data(iris)
p
summary(p)

plot(p, which = "correlation")

Render a profile to a self-contained HTML report

Description

Turns a data_profile into a standalone HTML file containing the metadata, quality score, statistical tables and every figure. The report is built with rmarkdown, so a working pandoc installation is required (R Markdown's usual dependency); report() errors clearly if pandoc is unavailable.

Usage

report(
  x,
  output_file = "dataProfilerR_report.html",
  title = NULL,
  quiet = TRUE
)

Arguments

x

A data_profile (built with build_plots = TRUE).

output_file

Path to write. A bare file name lands in the working directory. Default "dataProfilerR_report.html".

title

Report title. Defaults to the dataset name.

quiet

Passed to rmarkdown::render(). Default TRUE.

Value

The path to the written file, invisibly.

Examples


if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  p <- profile_data(iris)
  f <- report(p, file.path(tempdir(), "iris_report.html"))
}

Sample skewness

Description

Moment-based skewness, computed as m3 / m2^(3/2) on the non-missing values.

Usage

skewness(x)

Arguments

x

A numeric vector.

Value

A single numeric value, or NA_real_ if there are fewer than three non-missing values or the variance is zero.

Examples

skewness(c(1, 2, 2, 3, 10))
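
The formula m3 / m2^(3/2) translates directly into base R. In this sketch m2 and m3 are the plain (biased) second and third central moments, and moment_skewness() is a name used only for illustration.

```r
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) < 3) return(NA_real_)
  d <- x - mean(x)
  m2 <- mean(d^2)                # second central moment
  if (m2 == 0) return(NA_real_)  # constant input: skewness undefined
  mean(d^3) / m2^(3 / 2)
}

moment_skewness(c(1, 2, 2, 3, 10))  # positive: long right tail
moment_skewness(c(1, 2, 3))         # symmetric input
```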

Summary statistics by column type

Description

Produces a numeric summary data frame (count, missingness, mean, sd, variance, quartiles, IQR, skewness, kurtosis) for numeric and integer columns, and a categorical summary (cardinality and most frequent level) for factor, logical, categorical and text columns.

Usage

summarize_columns(df, types = NULL)

Arguments

df

A data frame.

types

Optional named character vector of column types (as returned by infer_column_types()). Computed if not supplied.

Value

A list with numeric (a data frame, or NULL if no numeric columns) and categorical (a named list, possibly empty).

Examples

summarize_columns(iris)

Detailed summary of a data profile

Description

Prints the numeric summary, the columns with the most missingness, normality verdicts, outlier counts, and the strongest correlations, and returns the same pieces invisibly as a list.

Usage

## S3 method for class 'data_profile'
summary(object, max_rows = 10, ...)

Arguments

object

A data_profile object.

max_rows

Maximum rows to print per table. Default 10.

...

Ignored.

Value

A list of the printed tables, invisibly.

Examples

summary(profile_data(iris))
