After you have acquired the data, you should do the following:
The dlookr package makes these steps fast and easy:
dlookr increases synergy with dplyr. Particularly in data exploration and data wrangling, it increases the efficiency of the tidyverse package group.
Data diagnosis supports the following data structures.
Tasks | Descriptions | Functions | Support DBI |
---|---|---|---|
describe overview of data | inquire basic information to understand the data in general | overview() | |
summary overview object | summarize the described overview of data | summary.overview() | |
plot overview object | plot the described overview of data | plot.overview() | |
diagnose data quality of variables | the scope of data quality diagnosis is information on missing values and unique values | diagnose() | x |
diagnose data quality of categorical variables | frequency, ratio, and rank by levels of each variable | diagnose_category() | x |
diagnose data quality of numerical variables | descriptive statistics, number of zeros, negative values, and outliers | diagnose_numeric() | x |
diagnose data quality for outlier | number of outliers, ratio, mean of outliers, mean with outliers, mean without outliers | diagnose_outlier() | x |
plot outliers information of numerical data | box plot and histogram with outliers, without outliers | plot_outlier.data.frame() | x |
plot outliers information of numerical data by target variable | box plot and density plot with outliers, without outliers | plot_outlier.target_df() | x |
diagnose combination of categorical variables | check for sparse cases of level combinations of categorical variables | diagnose_sparese() | |
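
The quantities named in the diagnosis table (missing values, unique values, and the outlier counts and means) can be sketched in base R. This is only an illustration of what those columns mean, not dlookr's implementation: the toy data frame, the column names, and the use of `boxplot.stats()` as the outlier rule are assumptions made for the example.

```r
# Toy data; the variable names are illustrative, not from the package.
df <- data.frame(
  income = c(32, 35, 31, NA, 36, 34, 250),
  region = c("north", "south", NA, "south", "north", "north", "south")
)

# Per-variable missing-value and unique-value counts.
quality <- data.frame(
  variable      = names(df),
  missing_count = unname(vapply(df, function(x) sum(is.na(x)), integer(1))),
  unique_count  = unname(vapply(df, function(x) length(unique(x)), integer(1)))
)
quality$missing_percent <- 100 * quality$missing_count / nrow(df)
print(quality)

# Outlier summary for one numeric variable, using the box plot rule
# (points more than 1.5 * IQR beyond the hinges) via boxplot.stats().
x   <- df$income[!is.na(df$income)]
out <- boxplot.stats(x)$out
outlier_summary <- c(
  outliers_cnt   = length(out),
  outliers_ratio = 100 * length(out) / length(x),
  outliers_mean  = mean(out),
  with_mean      = mean(x),
  without_mean   = mean(x[!x %in% out])
)
print(outlier_summary)
```

Comparing `with_mean` and `without_mean` shows how strongly a single extreme value can pull the mean, which is the point of reporting both.
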

Tasks | Descriptions | Functions | Support DBI |
---|---|---|---|
pareto chart for missing value | visualize the Pareto chart for variables with missing values | plot_na_pareto() | |
combination chart for missing value | visualize the distribution of missing values by combining variables | plot_na_hclust() | |
plot the combinations of variables that include missing values | visualize the combinations of missing values across cases | plot_na_intersect() | |

Types | Descriptions | Functions | Support DBI |
---|---|---|---|
report the information of data diagnosis into a PDF file | report the information for diagnosing the data quality | diagnose_report() | x |
report the information of data diagnosis into an HTML file | report the information for diagnosing the data quality | diagnose_report() | x |
report the information of data diagnosis into an HTML file | dynamic report of the information for diagnosing the data quality | diagnose_web_report() | x |
report the information of data diagnosis into PDF and HTML files | paged report of the information for diagnosing the data quality | diagnose_paged_report() | x |

Types | Tasks | Descriptions | Functions | Support DBI |
---|---|---|---|---|
categorical | summaries | frequency tables | univar_category() | |
categorical | summaries | chi-squared test | summary.univar_category() | |
categorical | visualize | bar charts | plot.univar_category() | |
categorical | visualize | bar charts | plot_bar_category() | |
numerical | summaries | descriptive statistics | describe() | x |
numerical | summaries | descriptive statistics | univar_numeric() | |
numerical | summaries | descriptive statistics of standardized variable | summary.univar_numeric() | |
numerical | visualize | histogram, box plot | plot.univar_numeric() | |
numerical | visualize | Q-Q plots | plot_qq_numeric() | |
numerical | visualize | box plot | plot_box_numeric() | |
numerical | visualize | histogram | plot_hist_numeric() | |

Types | Tasks | Descriptions | Functions | Support DBI |
---|---|---|---|---|
categorical | summaries | frequency tables across cases | compare_category() | |
categorical | summaries | contingency tables, chi-squared test | summary.compare_category() | |
categorical | visualize | mosaic plot | plot.compare_category() | |
numerical | summaries | correlation coefficient, linear model summaries | compare_numeric() | |
numerical | summaries | correlation coefficient, linear model summaries with threshold | summary.compare_numeric() | |
numerical | visualize | scatter plot with marginal box plot | plot.compare_numeric() | |
numerical | Correlate | correlation coefficient | correlate() | x |
numerical | Correlate | summaries with correlation matrix | summary.correlate() | x |
numerical | Correlate | visualization of a correlation matrix | plot.correlate() | x |
both | PPS | PPS (Predictive Power Score) | pps() | x |
both | PPS | summaries with PPS | summary.pps() | x |
both | PPS | visualization of a PPS matrix | plot.pps() | x |
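
As a rough illustration of the "correlation coefficient" rows above, the sketch below computes pairwise Pearson correlations in base R and reshapes them into one row per variable pair. It uses the built-in `mtcars` data; the output column names (`var1`, `var2`, `coef_corr`) are chosen for the example and are not claimed to match what `correlate()` returns.

```r
# Three numeric variables from the built-in mtcars data set.
vars <- mtcars[, c("mpg", "wt", "hp")]

# Correlation matrix (Pearson).
cm <- cor(vars, method = "pearson")

# Keep each unordered pair once by taking the upper triangle.
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs_tbl <- data.frame(
  var1      = rownames(cm)[idx[, "row"]],
  var2      = colnames(cm)[idx[, "col"]],
  coef_corr = cm[idx]
)

# Strongest relationships first, regardless of sign.
pairs_tbl <- pairs_tbl[order(-abs(pairs_tbl$coef_corr)), ]
print(pairs_tbl)
```
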

Types | Tasks | Descriptions | Functions | Support DBI |
---|---|---|---|---|
numerical | summaries | Shapiro-Wilk normality test | normality() | x |
numerical | visualize | normality diagnosis plot (histogram, Q-Q plots) | plot_normality() | x |
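
The Shapiro-Wilk test named above is available in base R as `shapiro.test()`. The sketch below runs it on two simulated variables to show how the statistic and p-value are read. The simulated data and the result table layout are assumptions for illustration only.

```r
set.seed(1)                      # reproducible toy data
x_normal <- rnorm(200)           # drawn from a normal distribution
x_skewed <- rexp(200)            # strongly right-skewed

# Shapiro-Wilk test: the null hypothesis is that the sample comes from a
# normal distribution, so a small p-value is evidence against normality.
res_normal <- shapiro.test(x_normal)
res_skewed <- shapiro.test(x_skewed)

normality_tbl <- data.frame(
  variable  = c("x_normal", "x_skewed"),
  statistic = unname(c(res_normal$statistic, res_skewed$statistic)),
  p_value   = c(res_normal$p.value, res_skewed$p.value)
)
print(normality_tbl)
```
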

Target Variable | Predictor | Descriptions | Functions | Support DBI |
---|---|---|---|---|
categorical | categorical | contingency tables | relate() | x |
categorical | categorical | mosaic plot | plot.relate() | x |
categorical | numerical | descriptive statistics for each level and total observations | relate() | x |
categorical | numerical | density plot | plot.relate() | x |
categorical | categorical | bar charts | plot_bar_category() | |
numerical | categorical | ANOVA test | relate() | x |
numerical | categorical | box plot | plot.relate() | x |
numerical | numerical | simple linear model | relate() | x |
numerical | numerical | scatter plot | plot.relate() | x |
categorical | numerical | Q-Q plots | plot_qq_numeric() | |
categorical | numerical | box plot | plot_box_numeric() | |
categorical | numerical | histogram | plot_hist_numeric() | |
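
The four target/predictor combinations in the table correspond to standard base R analyses. The sketch below shows one possible base R equivalent for each combination, using the built-in `mtcars` data. It illustrates the statistical methods named in the table, not how `relate()` itself is implemented.

```r
# Built-in mtcars data; am and cyl are treated as categorical here.
cars <- mtcars
cars$am  <- factor(cars$am, labels = c("automatic", "manual"))
cars$cyl <- factor(cars$cyl)

# categorical target, categorical predictor: contingency table
tab <- table(am = cars$am, cyl = cars$cyl)
print(tab)

# categorical target, numerical predictor: descriptive statistics per level
by_level <- tapply(cars$mpg, cars$am, function(v) c(mean = mean(v), sd = sd(v)))
print(by_level)

# numerical target, categorical predictor: one-way ANOVA
anova_fit <- aov(mpg ~ cyl, data = cars)
print(summary(anova_fit))

# numerical target, numerical predictor: simple linear model
lm_fit <- lm(mpg ~ wt, data = cars)
print(coef(lm_fit))
```
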

Types | Descriptions | Functions | Support DBI |
---|---|---|---|
reporting the information of EDA into a PDF file | reporting the information of EDA | eda_report() | x |
reporting the information of EDA into an HTML file | reporting the information of EDA | eda_report() | x |
reporting the information of EDA into an HTML file | dynamic reporting of the information of EDA | eda_web_report() | x |
reporting the information of EDA into PDF and HTML files | paged reporting of the information of EDA | eda_paged_report() | x |

Types | Descriptions | Functions | Support DBI |
---|---|---|---|
missing values | find the variables that contain missing values in the object that inherits the data.frame | find_na() | |
outliers | find the numerical variables that contain outliers in the object that inherits the data.frame | find_outliers() | |
skewed variable | find the skewed numerical variables in the object that inherits the data.frame | find_skewness() | |

Types | Descriptions | Functions | Support DBI |
---|---|---|---|
missing values | missing values are imputed with some representative values and statistical methods. | imputate_na() | |
outliers | outliers are imputed with some representative values and statistical methods. | imputate_outlier() | |
summaries | calculate descriptive statistics of the original and imputed values. | summary.imputation() | |
visualize | the imputation of a numerical variable is a density plot, and the imputation of a categorical variable is a bar plot. | plot.imputation() | |
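
To make "imputed with some representative values" concrete, the sketch below fills missing values with the mean or the median in base R and then compares summary statistics before and after. The helper `impute_with()` is a hypothetical name written for this example and is not part of dlookr. Model-based imputation methods are out of scope here.

```r
x <- c(12, 15, NA, 14, 13, NA, 90)   # toy numeric variable with gaps

# Hypothetical helper: replace NA with a representative value.
impute_with <- function(x, method = c("mean", "median")) {
  method <- match.arg(method)
  fill <- switch(method,
    mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE)
  )
  replace(x, is.na(x), fill)
}

x_mean   <- impute_with(x, "mean")
x_median <- impute_with(x, "median")

# Compare the distribution of the observed and the imputed versions.
comparison <- sapply(
  list(original = x[!is.na(x)], mean = x_mean, median = x_median),
  function(v) c(mean = mean(v), sd = sd(v), median = median(v))
)
print(round(comparison, 2))
```

With an extreme value such as 90 in the data, mean imputation inserts a value far from the bulk of the observations, while median imputation does not. That is one reason to inspect the before/after comparison.
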

Types | Descriptions | Functions | Support DBI |
---|---|---|---|
binning | converts a numeric variable to a categorical variable | binning() | |
summaries | calculate frequency and relative frequency for each level (bin) | summary.bins() | |
visualize | visualize two plots on a single screen. The plot at the top is a histogram representing the frequency of the level. The plot at the bottom is a bar chart representing the frequency of the level. | plot.bins() | |
optimal binning | categorizes a numeric characteristic into bins for later use in scoring modeling | binning_by() | |
summaries | summary metrics to evaluate the performance of binomial classification model | summary.optimal_bins() | |
visualize | generates plots for understanding distribution, bad rate, and weight of evidence after running binning_by() | plot.optimal_bins() | |
infogain binning | categorizes a numeric characteristic into bins for multi-class variables using recursive information gain ratio maximization | binning_rgr() | |
visualize | generates plots for understanding distribution and distribution by target variable after running binning_rgr() | plot.infogain_bins() | |
evaluate | calculates metrics to evaluate the performance of binned variable for binomial classification model | performance_bin() | |
summaries | summary metrics to evaluate the performance of binomial classification model after performance_bin() | summary.performance_bin() | |
visualize | generates plots to understand frequency and WoE by bins using performance_bin after running binning_by() | plot.performance_bin() | |
extract | extract bins from “bins” and “optimal_bins” objects | extract.bins() | |
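
The simplest case in the table, converting a numeric variable into bins and tabulating the frequency and relative frequency per level, can be sketched with `cut()` in base R. Quantile breaks and four bins are assumptions chosen for this example. The optimal and information-gain binning methods listed above are not reproduced here.

```r
set.seed(42)
x <- round(rnorm(100, mean = 50, sd = 10))   # toy numeric variable

# Quantile binning: breaks at the quartiles give up to 4 bins.
# unique() guards against duplicated break points when values are tied.
breaks <- unique(quantile(x, probs = seq(0, 1, length.out = 5)))
bins   <- cut(x, breaks = breaks, include.lowest = TRUE)

# Frequency and relative frequency for each level (bin).
freq <- table(bins)
bin_tbl <- data.frame(
  levels = names(freq),
  freq   = as.vector(freq),
  rate   = as.vector(freq) / length(x)
)
print(bin_tbl)
```
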

Types | Descriptions | Functions | Support DBI |
---|---|---|---|
diagnosis | calculates metrics to evaluate the performance of binned variable for binomial classification model | performance_bin() | |
summaries | summary method for “performance_bin”. summary metrics to evaluate the performance of the binomial classification model | summary.performance_bin() | |
visualize | visualize frequency and WoE by bins using performance_bin | plot.performance_bin() | |

Types | Descriptions | Functions | Support DBI |
---|---|---|---|
transformation | performs variable transformation for standardization and resolving skewness of numerical variables | transform() | |
summaries | compares the distribution of data before and after data transformation | summary.transform() | |
visualize | visualize two kinds of plot by attribute of the ‘transform’ class. The transformation of a numerical variable is a density plot | plot.transform() | |
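
The two goals named for transformation, standardization and resolving skewness, can be illustrated in base R. The sketch below standardizes a simulated right-skewed variable and applies log and square-root transforms, then compares skewness before and after. The `skew()` helper is a hypothetical name for this example and uses the third standardized moment, which is one of several common definitions.

```r
set.seed(7)
x <- rexp(500, rate = 0.2)       # right-skewed toy data, strictly positive

# Hypothetical helper: sample skewness as the third standardized moment.
skew <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

z      <- (x - mean(x)) / sd(x)  # standardization (z-score)
x_log  <- log(x)                 # log transform, requires x > 0
x_sqrt <- sqrt(x)                # square-root transform, requires x >= 0

skew_tbl <- c(original = skew(x), zscore = skew(z),
              log = skew(x_log), sqrt = skew(x_sqrt))
print(round(skew_tbl, 3))
```

Standardization changes location and scale only, so the skewness of `z` equals that of `x`. Reducing skewness needs a nonlinear transform such as the square root or the log.
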

Types | Descriptions | Functions | Support DBI |
---|---|---|---|
reporting the information of transformation into PDF | reporting the information of transformation | transformation_report() | |
reporting the information of transformation into HTML | reporting the information of transformation | transformation_report() | |
reporting the information of transformation into HTML | dynamic reporting of the information of transformation | transformation_web_report() | |
reporting the information of transformation into PDF and HTML | paged reporting of the information of transformation | transformation_paged_report() | |

Types | Descriptions | Functions | Support DBI |
---|---|---|---|
statistics | calculate the entropy | entropy() | |
statistics | calculate the skewness of the data | skewness() | |
statistics | calculate the kurtosis of the data | kurtosis() | |
statistics | calculate the Jensen-Shannon divergence between two probability distributions | jsd() | |
statistics | calculate the Kullback-Leibler divergence between two probability distributions | kld() | |
statistics | calculate the Cramer’s V statistic between two categorical (discrete) variables | cramer() | |
statistics | calculate the Theil’s U statistic between two categorical (discrete) variables | theil() | |
statistics | find the percentile of a numerical variable | get_percentile() | |
statistics | transform a numeric vector using several methods like “log”, “sqrt”, “log+1”, “log+a”, “1/x”, “x^2”, “x^3”, “Box-Cox”, “Yeo-Johnson” | get_transform() | |
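
For readers unfamiliar with the information-theoretic measures in this table, the sketch below gives textbook definitions of Shannon entropy, Kullback-Leibler divergence, and Jensen-Shannon divergence in base R. The function names (`shannon_entropy()`, `kl_divergence()`, `js_divergence()`) are hypothetical, and the choice of log base 2 is an assumption. dlookr's `entropy()`, `kld()`, and `jsd()` may differ in defaults and input handling.

```r
# Two discrete probability distributions over the same three outcomes.
p <- c(0.5, 0.3, 0.2)
q <- c(0.4, 0.4, 0.2)

# Shannon entropy: H(p) = -sum(p * log(p)); 0 * log(0) is treated as 0.
shannon_entropy <- function(p, base = 2) {
  p <- p[p > 0]
  -sum(p * log(p, base = base))
}

# Kullback-Leibler divergence D(p || q); requires q > 0 wherever p > 0.
kl_divergence <- function(p, q, base = 2) {
  keep <- p > 0
  sum(p[keep] * log(p[keep] / q[keep], base = base))
}

# Jensen-Shannon divergence: average KL divergence of p and q from
# their mixture m. It is symmetric and, with base 2, bounded by 1.
js_divergence <- function(p, q, base = 2) {
  m <- (p + q) / 2
  0.5 * kl_divergence(p, m, base) + 0.5 * kl_divergence(q, m, base)
}

print(c(
  entropy_p = shannon_entropy(p),
  kld_pq    = kl_divergence(p, q),
  kld_qp    = kl_divergence(q, p),
  jsd_pq    = js_divergence(p, q)
))
```

The two KL values differ because the divergence is not symmetric. The Jensen-Shannon divergence gives the same value in either direction.
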

Types | Descriptions | Functions | Support DBI |
---|---|---|---|
programming | extracts variable information having a certain class from an object inheriting data.frame | find_class() | |
programming | gets class of variables in data.frame or tbl_df | get_class() | |
programming | retrieves the column information of the DBMS table through the tbl_dbi object of dplyr | get_column_info() | |
programming | finds the user machine’s OS | get_os() | |
programming | import Google fonts | import_google_font() | |