Repository Mirror for your Cloud Server and Webhosting

Title:

Concise and Efficient Tools for Everyday Statistical Production

Version:

1.0

Description:

A set of concise and efficient tools for statistical production. Can also be used for data management. In statistical production, you deal with complex data and need to control your process at each step of your work. Concise functions are very helpful, because you do not hesitate to use them. The following functions are included in the package. 'dup' checks duplicates. 'miss' checks missing values. 'tac' computes contingency table of all columns. 'toc' compares two tables, spotting significant deviations. 'chi2_find' compares columns within a data.frame, spotting related categories of (a more complex function).

Encoding:

UTF-8

Imports:

dplyr, rlang, tibble

RoxygenNote:

7.3.2

License:

MIT + file LICENSE

NeedsCompilation:

Packaged:

2025-10-28 12:50:38 UTC; vincent.reduron

Author:

Vincent Reduron [cre, aut]

Maintainer:

Vincent Reduron <vincent.reduron@laposte.net>

Repository:

CRAN

Date/Publication:

2025-10-31 18:30:02 UTC

short for 'paste0()'

Description

short for 'paste0()'

Usage

a %+% b

Arguments

a

string

b

string

Value

string

Find modalities related to a criterion

Description

Find modalities related to a criterion

Usage

chi2_find(df, criterion)

Arguments

df

data.frame

criterion

character string: criterion that spots target rows

Value

data.frame

coltypes()

Description

Create vector of df's column types. Similar to colnames(), but with column types instead of names.

Usage

coltypes(df)

Arguments

df

data.frame

Value

vector

Analysis of the cardinality of a key/identifier in a table

Description

Creates multiple result tables. The term "n-plicate" is used to generalize the notion of duplicate: a n_plicate can be a duplicate, a triplicate, etc.

Usage

dup(tab, keyby, count_what = "rows", partition = NULL, view = TRUE)

Arguments

tab

Either an R dataframe or a reference to a remote table ("remote table")

keyby

(character vector) names of the column(s) considered as keys

count_what

(character vector) defines what to count by key (by *keyby*). 'rows' to count distinct rows, otherwise the name of the columns whose distinct values are to be counted

partition

(character vector) names of the columns by which to break down the analysis

view

automatic opening of generated tables

Value

A set of dataframes in the global environment. * nup_r_tab: table of n-plicate counts * nup_xpl_r: table of n-plicate examples * nup_exZ_r: table of examples of (n-plicates with value 0) * nup_r_tab_part: table of n-plicate counts broken down by the modalities of the 'partition' columns

Examples

# Check if "name" is a unique key of the starwars table (yes !)
dup(dplyr::starwars, keyby = "name", view = FALSE)

# Check if "key" is a unique key of the basic table (no !)
basic <- data.frame("key"   = c("a", "b", "c", "d", NA, "a", "e", "f"), 
                    "value" = c(112, 117, 317,  NA,  0,  17, 117, 112))
dup(basic, keyby = "key", view = FALSE)

get_recursion_depth

Description

get recursion depth of a list

Usage

get_recursion_depth(x, depth = 0)

Arguments

x

: input list

depth

: depth of x in another list (1 if x in a list. 2 if x is in a list of lists. Etc.)

Value

integer

Contingency table for column 'col_name' in data.frame 'df

Description

Contingency table for column 'col_name' in data.frame 'df

Usage

get_tac_column(df, col_name, values, strates)

Arguments

df

Input data.frame

col_name

string : name of column to which generate the contingency table

values

Vector of columns that serve as measures (amounts, counts, etc.)

strates

Vector of column names by which to stratify the contingency tables

is.Date

Description

Returns TRUE or FALSE depending on whether its argument is of Date type or not

Usage

is.Date(x)

Arguments

x

object

Value

TRUE/FALSE

is.POSIXct

Description

Returns TRUE or FALSE depending on whether its argument is of POSIXct type or not

Usage

is.POSIXct(x)

Arguments

x

object

Value

TRUE/FALSE

is.POSIXlt

Description

Returns TRUE or FALSE depending on whether its argument is of POSIXlt type or not

Usage

is.POSIXlt(x)

Arguments

x

object

Value

TRUE/FALSE

is.POSIXt

Description

Returns TRUE or FALSE depending on whether its argument is of POSIXxt type or not

Usage

is.POSIXt(x)

Arguments

x

object

Value

TRUE/FALSE

Missing: Generate a synthetic table of missing values for all columns of a data.frame

Description

Get a synthetic table of missing values for all columns of a data.frame

Usage

miss(df, values = NULL, view = FALSE)

Arguments

df

data.frame: Input data.frame

values

column: Variable (~weight) to measure the number of missing values (otherwise, count of rows)

view

boolean: Display a glimpse of cases with NA values

Value

data.frame

Examples

miss(mtcars)  # Checking NA values for all columns of mtcars (none)

Computes a contingency table (tac) of all columns in a dataframe for control purposes

Description

Contingency table (tac) of all columns in a dataframe for control purposes

Usage

tac(
  df,
  values = NULL,
  sample_rate = 0.01,
  num_but_discrete = "NULL",
  strates = NULL
)

Arguments

df

Input data.frame

values

Vector of columns that serve as measures (amounts, counts, etc.)

sample_rate

Sampling rate, if df is a remote table

num_but_discrete

Vector of names of numeric columns with discrete modalities (not continuous)

strates

Vector of column names by which to stratify the contingency tables

Value

data.frame

Examples

tab <- tac(iris) # calculate column frequencies

TAC-based Outlier Control (TOC)

Description

Generalized detection of outlier values in a database, based on contingency tables (tac)

Usage

toc(
  df1,
  df2,
  values = NULL,
  a = 10,
  r = 0.34,
  sample_rate = 0.01,
  num_but_discrete = "NULL"
)

Arguments

df1

Input data.frame (to compare with df2)

df2

Input data.frame (to compare with df1)

values

Vector of columns that serve as measures (amounts, counts, etc.)

a

Allowed absolute variation

r

Allowed relative variation

sample_rate

Sampling rate, if df is a remote table

num_but_discrete

Numeric variables to be treated as discrete modal variables. If 'all', all numeric variables are treated as discrete modal variables.

Details

Currently does not work with values parameter

Value

data.frame

Scoring significativity of difference between two values x and y

Description

Difference score between x and y (0 = no significant difference, >0 = presence of significant difference)

Usage

toc_score(x, y, a)

Arguments

x

(num) First value to compare

y

(num) Second value to compare

a

(num) Absolute difference threshold below which all differences are considered normal

Value

numeric

Examples

toc_score(15, 1500, a = 500) # 1.91: significant difference
toc_score(1432, 1501, a = 100) # 0: non-significant difference