---
title: "Profiling a dataset with dataProfilerR"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Profiling a dataset with dataProfilerR}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, comment = "#>",
                      fig.width = 7, fig.height = 4.5)
```

`dataProfilerR` turns a data frame into a structured profile with one call:
type inference, missing-value analysis, summary statistics, normality tests,
outlier detection, correlation, a data-quality score, and `ggplot2` figures.

```{r setup}
library(dataProfilerR)
```

## A deliberately messy dataset

To show what the profiler surfaces, here is a small frame with missing values,
an outlier, a constant column, and a high-cardinality text column.

```{r data}
set.seed(1)
n <- 200
df <- data.frame(
  age        = round(rnorm(n, 40, 12)),
  income     = c(rlnorm(n - 1, log(50000), 0.4), 5e6),   # one extreme outlier
  signup     = as.Date("2025-01-01") + sample(0:600, n, replace = TRUE),
  plan       = sample(c("free", "pro", "enterprise"), n, replace = TRUE),
  region     = sample(c("NA", "EU", "APAC"), n, replace = TRUE),  # "NA" is a region code (string), not a missing value
  constant   = 1L,                                        # zero-variance column
  note       = replicate(n, paste(sample(letters, 12), collapse = "")),
  stringsAsFactors = FALSE
)
df$income[sample(n - 1, 20)] <- NA      # inject missingness, sparing the outlier in row n
df$plan[sample(n, 8)]    <- NA
```
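
Before profiling, it can help to confirm by hand that the planted problems are really there. The chunk below is a minimal base-R sketch; `quick_check()` is a hypothetical helper written for this vignette, not a `dataProfilerR` function, and it is shown on a tiny toy frame so it runs on its own.

```{r sanity-check}
# Hypothetical helper (not part of dataProfilerR): count missing and distinct
# values per column using base R only.
quick_check <- function(data) {
  data.frame(
    column    = names(data),
    n_missing = vapply(data, function(x) sum(is.na(x)), integer(1),
                       USE.NAMES = FALSE),
    n_unique  = vapply(data, function(x) length(unique(x[!is.na(x)])),
                       integer(1), USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

toy <- data.frame(
  x  = c(1, NA, 3, 4),          # one missing value
  k  = 1L,                      # constant column: a single distinct value
  id = c("a", "b", "c", "d"),   # every value distinct
  stringsAsFactors = FALSE
)
chk <- quick_check(toy)
chk
```

Calling `quick_check(df)` on the frame above should show the same pattern: missing values in `income` and `plan`, one distinct value in `constant`, and close to `n` distinct values in `note`.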

## One call to profile it

```{r profile}
p <- profile_data(df, dataset_name = "customers")
p
```

`print()` gives the headline: dimensions, type breakdown, missingness, and the
quality score. Note the score is below 100 -- the missingness and the constant
column both cost points.

## Drilling in with summary()

```{r summary}
summary(p)
```

The numeric summary shows that `income` is heavily right-skewed (large positive
skewness and kurtosis). A log-normal draw is already skewed to the right, and
the injected outlier exaggerates this; the outlier table flags that value.
`age` looks roughly symmetric.
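
To see how much a single extreme value moves the skewness, here is a small base-R sketch. `moment_skewness()` is a hypothetical helper using the plain moment estimator; the estimator `dataProfilerR` uses internally may differ (for example, it could be bias-corrected), so the numbers need not match `summary(p)` exactly.

```{r skew-sketch}
# Moment-based skewness: third central moment divided by the second central
# moment raised to 3/2. Assumption: not necessarily the package's estimator.
moment_skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(42)
base_income <- rlnorm(199, log(50000), 0.4)

moment_skewness(base_income)           # log-normal draw on its own
moment_skewness(c(base_income, 5e6))   # same draw plus one extreme value
```

The second value is far larger than the first, which is why a single row can dominate the shape statistics of a column.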

## The object is just a list

Everything is accessible directly, which makes the profile easy to use
programmatically:

```{r structure}
p$metadata$column_types
p$diagnostics$quality$components
head(p$statistics$numeric[, c("column", "mean", "sd", "skewness")])
```

## Figures

The figures are built during `profile_data()` and retrieved with `plot()`.

```{r missing-plot}
plot(p, which = "missing")
```

```{r dist-plot}
plot(p, which = "distribution", column = "income")
```

```{r corr-plot}
plot(p, which = "correlation")
```

You can also call the plotting functions directly without a full profile, e.g.
`plot_boxplots(df)` or `plot_pairs(df, c("age", "income"))`.

## Tuning the run

- `build_plots = FALSE` skips figure construction, which saves time on very wide data.
- `outlier_method` can be `"iqr"` (default), `"zscore"`, or `"robust"` (median/MAD).
- `cor_method` accepts `"pearson"`, `"spearman"`, or both.
- `normality = FALSE` skips the Shapiro-Wilk / Anderson-Darling tests.

```{r tuning}
p2 <- profile_data(df, build_plots = FALSE, outlier_method = "robust",
                   cor_method = "spearman")
p2$diagnostics$outliers$per_column
```
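
The three outlier rules differ mainly in how they estimate centre and spread. The sketch below implements each in base R to make the difference concrete. The function names and the cut-offs (1.5, 3, and 3.5) are assumptions chosen for illustration; they are conventional values, not ones read from `dataProfilerR`.

```{r outlier-sketch}
# Hypothetical helpers; thresholds are common conventions, not package defaults.
flag_iqr <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}
flag_zscore <- function(x, cut = 3) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  !is.na(z) & abs(z) > cut
}
flag_robust <- function(x, cut = 3.5) {
  # mad() rescales by 1.4826 by default, so it is comparable to sd()
  # for normally distributed data.
  z <- (x - median(x, na.rm = TRUE)) / mad(x, na.rm = TRUE)
  !is.na(z) & abs(z) > cut
}

v <- c(10, 12, 11, 13, 12, 11, 10, 12, 95, 90)
rbind(iqr = flag_iqr(v), zscore = flag_zscore(v), robust = flag_robust(v))
```

On this vector the IQR and median/MAD rules both flag the last two values, while the z-score rule flags nothing: the two extreme values inflate the standard deviation enough to hide themselves. That masking effect is the usual argument for a robust rule on heavy-tailed columns.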

## Beyond correlation (0.2.0)

Categorical columns get their own association matrix (Cramér's V):

```{r association}
p$statistics$association
plot(p, which = "association")
```
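
For reference, Cramér's V for one pair of columns can be computed from the Pearson chi-squared statistic. This is a minimal sketch with a hypothetical `cramers_v()` helper and no bias correction; `dataProfilerR` may correct for bias or treat missing values differently, so small differences from `p$statistics$association` are possible.

```{r cramers-v-sketch}
# Uncorrected Cramér's V: sqrt(chi2 / (n * (min(rows, cols) - 1))).
# Assumption: illustrative only, not the package's implementation.
cramers_v <- function(x, y) {
  tab  <- table(x, y)   # pairs with an NA are dropped by default
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

a <- rep(c("u", "v"), each = 50)
cramers_v(a, a)                              # a column against itself
cramers_v(a, rep(c("s", "t"), times = 50))   # perfectly balanced, unrelated
```

The statistic runs from 0 (no association) to 1 (one column determines the other), which is what makes it usable as a correlation-like matrix for categorical data.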

Date columns are profiled for range and gaps:

```{r dates}
p$diagnostics$dates
```
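
As a rough idea of what "range and gaps" means for a date column, the sketch below computes the span and the largest gap between consecutive distinct dates in base R. `date_gaps()` is a hypothetical helper; the fields actually stored in `p$diagnostics$dates` may be named and defined differently.

```{r date-gap-sketch}
# Hypothetical helper: span of a Date vector and the longest stretch with no
# observations, both in whole days.
date_gaps <- function(d) {
  d    <- sort(unique(d[!is.na(d)]))
  gaps <- as.integer(diff(d))   # days between consecutive distinct dates
  list(
    min              = d[1],
    max              = d[length(d)],
    span_days        = as.integer(d[length(d)] - d[1]),
    largest_gap_days = if (length(gaps)) max(gaps) else NA_integer_
  )
}

dates <- as.Date(c("2025-01-01", "2025-01-03", NA, "2025-02-10", "2025-01-03"))
str(date_gaps(dates))
```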

And you can compare the numeric columns across the groups defined by a categorical column:

```{r groups}
pg <- profile_data(df, group_by = "plan")
head(pg$diagnostics$groups$numeric_by_group, 8)
```
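
The same kind of comparison can be sketched for a single column with base R's `aggregate()`. The toy `sales` frame and its column names are made up for illustration. Note that the formula interface drops rows whose group is `NA`; whether `profile_data()` does the same is not assumed here.

```{r group-sketch}
# Per-group mean and sd for one numeric column, base R only.
sales <- data.frame(
  plan  = c("free", "free", "pro", "pro", "pro", NA),
  spend = c(10, 20, 100, 120, 140, 50),
  stringsAsFactors = FALSE
)
by_plan <- aggregate(spend ~ plan, data = sales,
                     FUN = function(x) c(mean = mean(x), sd = sd(x)))
by_plan <- do.call(data.frame, by_plan)   # flatten the matrix column
by_plan
```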

## A full HTML report

`report()` renders everything above -- tables and figures -- into one
self-contained HTML file. It needs pandoc (the usual R Markdown dependency).

```{r report, eval = FALSE}
report(p, "customers_report.html")
```
