The hardware and bandwidth for this mirror is donated by dogado GmbH, the Webhosting and Full Service-Cloud Provider. Check out our Wordpress Tutorial.
If you wish to report a bug, or if you are interested in having us mirror your free-software or open-source project, please feel free to contact us at mirror[@]dogado.de.

Type: Package
Title: New Experimental Design Based Subsampling Methods for Big Data
Version: 1.0.1
Maintainer: Amalan Mahendran <amalan0595@gmail.com>
Description: Subsampling methods for big data under different models and assumptions. Starting with linear regression and leading to Generalised Linear Models, softmax regression, and quantile regression. Specifically, the model-robust subsampling method proposed in Mahendran, A., Thompson, H., and McGree, J. M. (2023) <doi:10.1007/s00362-023-01446-9>, where multiple models can describe the big data, and the subsampling framework for potentially misspecified Generalised Linear Models in Mahendran, A., Thompson, H., and McGree, J. M. (2025) <doi:10.48550/arXiv.2510.05902>.
License: MIT + file LICENSE
URL: https://github.com/Amalan-ConStat/NeEDS4BigData,https://amalan-constat.github.io/NeEDS4BigData/index.html
BugReports: https://github.com/Amalan-ConStat/NeEDS4BigData/issues
Depends: R (≥ 4.1.0)
Imports: dplyr, foreach, gam, ggh4x, ggplot2, ggridges, matrixStats, mvnfast, psych, Rdpack, Rfast, rlang, stats, tidyr
RdMacros: Rdpack
Suggests: doParallel, ggpubr, kableExtra, knitr, parallel, rmarkdown, spelling, testthat, vctrs, pillar
Encoding: UTF-8
Language: en-GB
LazyData: true
LazyDataCompression: xz
RoxygenNote: 7.3.1
Config/testthat/edition: 3
NeedsCompilation: no
Packaged: 2025-10-18 14:12:20 UTC; amala
Author: Amalan Mahendran ORCID iD [aut, cre]
Repository: CRAN
Date/Publication: 2025-10-22 19:00:08 UTC

A- and L-optimality criteria based subsampling under Generalised Linear Models

Description

Using this function sample from big data under linear, logistic and Poisson regression to describe the data. Subsampling probabilities are obtained based on the A- and L- optimality criteria.

Usage

ALoptimalGLMSub(r0,rf,Y,X,N,family)

Arguments

r0

sample size for initial random sample

rf

final sample size including initial(r0) and optimal(r) samples

Y

response data or Y

X

covariate data or X matrix that has all the covariates (first column is for the intercept)

N

size of the big data

family

a character value for "linear", "logistic" and "poisson" regression from Generalised Linear Models

Details

Two stage subsampling algorithm for big data under Generalised Linear Models (linear, logistic and Poisson regression).

First stage is to obtain a random sample of size r_0 and estimate the model parameters. Using the estimated parameters subsampling probabilities are evaluated for A- and L-optimality criteria.

Through the estimated subsampling probabilities an optimal sample of size r \ge r_0 is obtained. Finally, the two samples are combined and the model parameters are estimated.

NOTE : If input parameters are not in given domain conditions necessary error messages will be provided to go further.

If r \ge r_0 is not satisfied then an error message will be produced.

If the big data X,Y has any missing values then an error message will be produced.

The big data size N is compared with the sizes of X,Y and if they are not aligned an error message will be produced.

A character value is provided for family and if it is not of the any three types an error message will be produced.

Value

The output of ALoptimalGLMSub gives a list of

Beta_Estimates estimated model parameters in a data.frame after subsampling

Variance_Epsilon_Estimates matrix of estimated variance for epsilon in a data.frame after subsampling

Sample_A-Optimality list of indexes for the initial and optimal samples obtained based on A-Optimality criteria

Sample_L-Optimality list of indexes for the initial and optimal samples obtained based on L-Optimality criteria

Subsampling_Probability matrix of calculated subsampling probabilities for A- and L- optimality criteria

References

Wang H, Zhu R, Ma P (2018). “Optimal subsampling for large sample logistic regression.” Journal of the American Statistical Association, 113(522), 829–844.

Ai M, Yu J, Zhang H, Wang H (2021). “Optimal subsampling algorithms for big data regressions.” Statistica Sinica, 31(2), 749–772.

Yao Y, Wang H (2021). “A review on optimal subsampling methods for massive datasets.” Journal of Data Science, 19(1), 151–172.

Examples

Dist<-"Normal"; Dist_Par<-list(Mean=0,Variance=1,Error_Variance=0.5)
No_Of_Var<-2; Beta<-c(-1,2,1); N<-5000; Family<-"linear"
Full_Data<-GenGLMdata(Dist,Dist_Par,No_Of_Var,Beta,N,Family)

r0<-300; rf<-rep(c(6,9)*100,50); Original_Data<-Full_Data$Complete_Data;

ALoptimalGLMSub(r0 = r0, rf = rf,Y = as.matrix(Original_Data[,1]),
                X = as.matrix(Original_Data[,-1]),N = nrow(Original_Data),
                family = "linear")->Results

plot_Beta(Results)

Dist<-"Normal"; Dist_Par<-list(Mean=0,Variance=1)
No_Of_Var<-2; Beta<-c(-1,2,1); N<-5000; Family<-"logistic"
Full_Data<-GenGLMdata(Dist,Dist_Par,No_Of_Var,Beta,N,Family)

r0<-300; rf<-rep(c(6,9)*100,50); Original_Data<-Full_Data$Complete_Data;

ALoptimalGLMSub(r0 = r0, rf = rf,Y = as.matrix(Original_Data[,1]),
                X = as.matrix(Original_Data[,-1]),N = nrow(Original_Data),
                family = "logistic")->Results

plot_Beta(Results)

Dist<-"Normal";
No_Of_Var<-2; Beta<-c(-1,2,1); N<-5000; Family<-"poisson"
Full_Data<-GenGLMdata(Dist,NULL,No_Of_Var,Beta,N,Family)

r0<-300; rf<-rep(c(6,9)*100,50); Original_Data<-Full_Data$Complete_Data;

ALoptimalGLMSub(r0 = r0, rf = rf,Y = as.matrix(Original_Data[,1]),
                X = as.matrix(Original_Data[,-1]),N = nrow(Original_Data),
                family = "poisson")->Results

plot_Beta(Results)


A-optimality criteria based subsampling under Gaussian Linear Models

Description

Using this function sample from big data under Gaussian linear regression models to describe the data. Subsampling probabilities are obtained based on the A-optimality criteria.

Usage

AoptimalGauLMSub(r0,rf,Y,X,N)

Arguments

r0

sample size for initial random sample

rf

final sample size including initial(r0) and optimal(r) samples

Y

response data or Y

X

covariate data or X matrix that has all the covariates (first column is for the intercept)

N

size of the big data

Details

Two stage subsampling algorithm for big data under Gaussian Linear Model.

First stage is to obtain a random sample of size r_0 and estimate the model parameters. Using the estimated parameters subsampling probabilities are evaluated for A-optimality criteria.

Through the estimated subsampling probabilities an optimal sample of size r \ge r_0 is obtained. Finally, the two samples are combined and the model parameters are estimated.

NOTE : If input parameters are not in given domain conditions necessary error messages will be provided to go further.

If r \ge r_0 is not satisfied then an error message will be produced.

If the big data X,Y has any missing values then an error message will be produced.

The big data size N is compared with the sizes of X,Y and if they are not aligned an error message will be produced.

Value

The output of AoptimalGauLMSub gives a list of

Beta_Estimates estimated model parameters in a data.frame after subsampling

Variance_Epsilon_Estimates matrix of estimated variance for epsilon in a data.frame after subsampling

Sample_A-Optimality list of indexes for the initial and optimal samples obtained based on A-Optimality criteria

Subsampling_Probability matrix of calculated subsampling probabilities for A-optimality criteria

References

Lee J, Schifano ED, Wang H (2021). “Fast optimal subsampling probability approximation for generalized linear models.” Econometrics and Statistics.

Examples

Dist<-"Normal"; Dist_Par<-list(Mean=0,Variance=1,Error_Variance=0.5)
No_Of_Var<-2; Beta<-c(-1,2,1); N<-10000; Family<-"linear"
Full_Data<-GenGLMdata(Dist,Dist_Par,No_Of_Var,Beta,N,Family)

r0<-300; rf<-rep(100*c(6,12),50); Original_Data<-Full_Data$Complete_Data;

AoptimalGauLMSub(r0 = r0, rf = rf,Y = as.matrix(Original_Data[,1]),
                 X = as.matrix(Original_Data[,-1]),
                 N = nrow(Original_Data))->Results

plot_Beta(Results)


A-optimality criteria based subsampling under measurement constraints for Generalised Linear Models

Description

Using this function sample from big data under linear, logistic and Poisson regression to describe the data when response y is partially unavailable. Subsampling probabilities are obtained based on the A-optimality criteria.

Usage

AoptimalMCGLMSub(r0,rf,Y,X,N,family)

Arguments

r0

sample size for initial random sample

rf

final sample size including initial(r0) and optimal(r) samples

Y

response data or Y

X

covariate data or X matrix that has all the covariates (first column is for the intercept)

N

size of the big data

family

a character value for "linear", "logistic" and "poisson" regression from Generalised Linear Models

Details

Two stage subsampling algorithm for big data under Generalised Linear Models (linear, logistic and Poisson regression) when the response is not available for subsampling probability evaluation.

First stage is to obtain a random sample of size r_0 and estimate the model parameters. Using the estimated parameters subsampling probabilities are evaluated for A-optimality criteria.

Through the estimated subsampling probabilities an optimal sample of size r \ge r_0 is obtained. Finally, only the optimal sample is used and the model parameters are estimated.

NOTE : If input parameters are not in given domain conditions necessary error messages will be provided to go further.

If r \ge r_0 is not satisfied then an error message will be produced.

If the big data X,Y has any missing values then an error message will be produced.

The big data size N is compared with the sizes of X,Y and if they are not aligned an error message will be produced.

A character value is provided for family and if it is not of the any three types an error message will be produced.

Value

The output of AoptimalMCGLMSub gives a list of

Beta_Estimates estimated model parameters in a data.frame after subsampling

Variance_Epsilon_Estimates matrix of estimated variance for epsilon in a data.frame after subsampling (valid only for linear regression)

Sample_A-Optimality list of indexes for the initial and optimal samples obtained based on A-Optimality criteria

Subsampling_Probability matrix of calculated subsampling probabilities for A-optimality criteria

References

Zhang T, Ning Y, Ruppert D (2021). “Optimal sampling for generalized linear models under measurement constraints.” Journal of Computational and Graphical Statistics, 30(1), 106–114.

Examples

Dist<-"Normal"; Dist_Par<-list(Mean=0,Variance=1,Error_Variance=0.5)
No_Of_Var<-2; Beta<-c(-1,2,1); N<-10000; Family<-"linear"
Full_Data<-GenGLMdata(Dist,Dist_Par,No_Of_Var,Beta,N,Family)

r0<-300; rf<-rep(100*c(6,12),50); Original_Data<-Full_Data$Complete_Data;

AoptimalMCGLMSub(r0 = r0, rf = rf,Y = as.matrix(Original_Data[,1]),
                 X = as.matrix(Original_Data[,-1]),N = nrow(Original_Data),
                 family = "linear")->Results

plot_Beta(Results)

Dist<-"Normal"; Dist_Par<-list(Mean=0,Variance=1)
No_Of_Var<-2; Beta<-c(-1,2,1); N<-10000; Family<-"logistic"
Full_Data<-GenGLMdata(Dist,Dist_Par,No_Of_Var,Beta,N,Family)

r0<-300; rf<-rep(100*c(6,12),50); Original_Data<-Full_Data$Complete_Data;

AoptimalMCGLMSub(r0 = r0, rf = rf,Y = as.matrix(Original_Data[,1]),
                 X = as.matrix(Original_Data[,-1]),N = nrow(Original_Data),
                 family = "logistic")->Results

plot_Beta(Results)

Dist<-"Normal";
No_Of_Var<-2; Beta<-c(-1,2,1); N<-10000; Family<-"poisson"
Full_Data<-GenGLMdata(Dist,NULL,No_Of_Var,Beta,N,Family)

r0<-300; rf<-rep(100*c(6,12),50); Original_Data<-Full_Data$Complete_Data;

AoptimalMCGLMSub(r0 = r0, rf = rf,Y = as.matrix(Original_Data[,1]),
                 X = as.matrix(Original_Data[,-1]),N = nrow(Original_Data),
                 family = "poisson")->Results

plot_Beta(Results)


Electric consumption data

Description

Hebrail and Berard (2012) described data which contains 2,049,280 completed measurements for a house located at Sceaux, France between December 2006 and November 2010. The log scale minute-averaged current intensity is selected as the response and the covariates are voltage, active electrical energy (watt-hour) in the kitchen, the laundry room, and electric water-heater and an air-conditioner.

Usage

Electric_consumption

Format

A data frame with 4 columns and 2,049,280 rows.

Intensity

Minute-averaged current intensity

Voltage

Voltage

EE_Kitchen

Active electrical energy (watt-hour) in the kitchen

EE_Laundry

Active electrical energy (watt-hour) in the laundry room

EE_WH_AC

Active electrical energy (watt-hour) of electric water-heater and an air-conditioner

Source

Extracted from

Hebrail G, Berard A (2012) Individual Household Electric Power Consumption. UCI Machine Learning Repository.

Available at: doi:10.24432/C58K54

Examples

nrow(Electric_consumption)


Generate data for Generalised Linear Models

Description

Function to simulate big data under linear, logistic and Poisson regression for sampling. Covariate data X is through Normal, Multivariate Normal or Uniform distribution for linear regression. Covariate data X is through Exponential, Normal, Multivariate Normal or Uniform distribution for logistic regression. Covariate data X is through Normal or Uniform distribution for Poisson regression.

Usage

GenGLMdata(Dist,Dist_Par,No_Of_Var,Beta,N,family)

Arguments

Dist

a character value for the distribution "Normal", "MVNormal", "Uniform or "Exponential"

Dist_Par

a list of parameters for the distribution that would generate data for covariate X

No_Of_Var

number of variables

Beta

a vector for the model parameters, including the intercept

N

the big data size

family

a character vector for "linear", "logistic" and "poisson" regression from Generalised Linear Models

Details

Big data for the Generalised Linear Models are generated by the "linear", "logistic" and "poisson" regression types.

We have limited the covariate data generation for linear regression through normal, multivariate normal and uniform distribution, logistic regression through exponential, normal, multivariate normal and uniform distribution Poisson regression through normal and uniform distribution.

Value

The output of GenGLMdata gives a list of

Complete_Data a matrix for Y and X

References

Lee Y, Nelder JA (1996). “Hierarchical generalized linear models.” Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(4), 619–656.

Examples

No_Of_Var<-2; Beta<-c(-1,2,1); N<-5000;

Dist<-"MVNormal";
Dist_Par<-list(Mean=rep(0,No_Of_Var),Variance=diag(rep(2,No_Of_Var)),Error_Variance=0.5)
Family<-"linear"
Results<-GenGLMdata(Dist,Dist_Par,No_Of_Var,Beta,N,Family)

Dist<-"Normal"; Dist_Par<-list(Mean=0,Variance=1);
Family<-"logistic"
Results<-GenGLMdata(Dist,Dist_Par,No_Of_Var,Beta,N,Family)

Dist<-"Uniform"; Family<-"poisson"
Results<-GenGLMdata(Dist,NULL,No_Of_Var,Beta,N,Family)


Generate data for Generalised Linear Models under model misspecification scenario

Description

Function to simulate big data under Generalised Linear Models for the model misspecification scenario through any misspecification type.

Usage

GenModelMissGLMdata(N,X_Data,Misspecification,Beta,Var_Epsilon,family)

Arguments

N

the big data size

X_Data

a matrix for the covariate data

Misspecification

a vector of values for the misspecification

Beta

a vector for the model parameters, including the intercept and misspecification term

Var_Epsilon

variance value for the residuals

family

a character vector for "linear", "logistic" and "poisson" regression from Generalised Linear Models

Details

Big data for the Generalised Linear Models are generated by the "linear", "logistic" and "poisson" regression types under model misspecification.

Value

The output of GenModelMissGLMdata gives a list of

Complete_Data a matrix for Y,X and f(x)

References

Adewale AJ, Wiens DP (2009). “Robust designs for misspecified logistic models.” Journal of Statistical Planning and Inference, 139(1), 3–15.

Adewale AJ, Xu X (2010). “Robust designs for generalized linear models with possible overdispersion and misspecified link functions.” Computational statistics & data analysis, 54(4), 875–890.

Examples

Beta<-c(-1,0.75,0.75,1); Var_Epsilon<-0.5; family <- "linear"; N<-10000
X_1 <- replicate(2,stats::runif(n=N,min = -1,max = 1))

Temp<-Rfast::rowprods(X_1)
Misspecification <- (Temp-mean(Temp))/sqrt(mean(Temp^2)-mean(Temp)^2)
X_Data <- cbind(X0=1,X_1);

Results<-GenModelMissGLMdata(N,X_Data,Misspecification,Beta,Var_Epsilon,family)

Results<-GenModelMissGLMdata(N,X_Data,Misspecification,Beta,Var_Epsilon=NULL,family="logistic")

Results<-GenModelMissGLMdata(N,X_Data,Misspecification,Beta,Var_Epsilon=NULL,family="poisson")


Local case control sampling for logistic regression

Description

Using this function sample from big data under logistic regression to describe the data. Sampling probabilities are obtained based on local case control method.

Usage

LCCsampling(r0,rf,Y,X,N)

Arguments

r0

sample size for initial random sample

rf

final sample size including initial(r0) and case control(r) samples

Y

response data or Y

X

covariate data or X matrix that has all the covariates (first column is for the intercept)

N

size of the big data

Details

Two stage sampling algorithm for big data under logistic regression.

First obtain a random sample of size r_0 and estimate the model parameters. Using the estimated parameters sampling probabilities are evaluated for local case control.

Through the estimated sampling probabilities an optimal sample of size r \ge r_0 is obtained. Finally, the optimal sample is used and the model parameters are estimated.

NOTE : If input parameters are not in given domain conditions necessary error messages will be provided to go further.

If r \ge r_0 is not satisfied then an error message will be produced.

If the big data X,Y has any missing values then an error message will be produced.

The big data size N is compared with the sizes of X,Y and if they are not aligned an error message will be produced.

Value

The output of LCCsampling gives a list of

Beta_Estimates estimated model parameters in a data.frame after sampling

Sample_LCC_Sampling list of indexes for the initial and optimal samples obtained based on local case control sampling

Sampling_Probability vector of calculated sampling probabilities for local case control sampling

References

Fithian W, Hastie T (2015). “Local case-control sampling: Efficient subsampling in imbalanced data sets.” Quality control and applied statistics, 60(3), 187–190.

Examples

Dist<-"Normal"; Dist_Par<-list(Mean=0,Variance=1)
No_Of_Var<-2; Beta<-c(-1,2,1); N<-10000; Family<-"logistic"
Full_Data<-GenGLMdata(Dist,Dist_Par,No_Of_Var,Beta,N,Family)

r0<-300; rf<-rep(100*c(6,9,12),50); Original_Data<-Full_Data$Complete_Data;

LCCsampling(r0 = r0, rf = rf, Y = as.matrix(Original_Data[,1]),
            X = as.matrix(Original_Data[,-1]),
            N = nrow(Original_Data))->Results

plot_Beta(Results)


Basic and shrinkage leverage sampling for Generalised Linear Models

Description

Using this function sample from big data under linear, logistic and Poisson regression to describe the data. Sampling probabilities are obtained based on the basic and shrinkage leverage method.

Usage

LeverageSampling(rf,Y,X,N,S_alpha,family)

Arguments

rf

sample size

Y

response data or Y

X

covariate data or X matrix that has all the covariates (first column is for the intercept)

N

size of the big data

S_alpha

shrinkage factor in between 0 and 1

family

a character vector for "linear", "logistic" and "poisson" regression from Generalised Linear Models

Details

Leverage sampling algorithm for big data under Generalised Linear Models (linear, logistic and Poisson regression).

First is to obtain a random sample of size min(rf)/2 and estimate the model parameters. Using the estimated parameters leverage scores are evaluated for leverage sampling.

Through the estimated leverage scores a sample of size rf was obtained. Finally, the sample of size rf is used and the model parameters are estimated.

NOTE : If input parameters are not in given domain conditions necessary error messages will be provided to go further.

If rf is not satisfied then an error message will be produced.

If the big data X,Y has any missing values then an error message will be produced.

The big data size N is compared with the sizes of X,Y and if they are not aligned an error message will be produced.

If 0 < \alpha_{S} < 1 is not satisfied an error message will be produced.

A character vector is provided for family and if it is not of the any three types an error message will be produced.

Value

The output of LeverageSampling gives a list of

Beta_Estimates estimated model parameters in a data.frame after sampling

Variance_Epsilon_Estimates matrix of estimated variance for epsilon in a data.frame after sampling (valid only for linear regression)

Sample_Basic_Leverage list of indexes for the optimal samples obtained based on basic leverage

Sample_Shrinkage_Leverage list of indexes for the optimal samples obtained based on shrinkage leverage

Sampling_Probability matrix of calculated sampling probabilities for basic and shrinkage leverage

References

Ma P, Mahoney M, Yu B (2014). “A statistical perspective on algorithmic leveraging.” In International conference on machine learning, 91–99. PMLR.

Ma P, Sun X (2015). “Leveraging for big data regression.” Wiley Interdisciplinary Reviews: Computational Statistics, 7(1), 70–76.

Examples

Dist<-"Normal"; Dist_Par<-list(Mean=0,Variance=1,Error_Variance=0.5)
No_Of_Var<-2; Beta<-c(-1,2,1); N<-5000; Family<-"linear"
Full_Data<-GenGLMdata(Dist,Dist_Par,No_Of_Var,Beta,N,Family)

rf<-rep(100*c(6,10),50); Original_Data<-Full_Data$Complete_Data;

LeverageSampling(rf = rf, Y = as.matrix(Original_Data[,1]),
                 X = as.matrix(Original_Data[,-1]),N = nrow(Original_Data),
                 S_alpha = 0.95,
                 family = "linear")->Results

plot_Beta(Results)

Dist<-"Normal"; Dist_Par<-list(Mean=0,Variance=1)
No_Of_Var<-2; Beta<-c(-1,2,1); N<-5000; Family<-"logistic"
Full_Data<-GenGLMdata(Dist,Dist_Par,No_Of_Var,Beta,N,Family)

rf<-rep(100*c(6,10),25); Original_Data<-Full_Data$Complete_Data;

LeverageSampling(rf = rf, Y = as.matrix(Original_Data[,1]),
                 X = as.matrix(Original_Data[,-1]),N = nrow(Original_Data),
                 S_alpha = 0.95,
                 family = "logistic")->Results

plot_Beta(Results)

Dist<-"Normal";
No_Of_Var<-2; Beta<-c(-1,0.5,0.5); N<-5000; Family<-"poisson"
Full_Data<-GenGLMdata(Dist,NULL,No_Of_Var,Beta,N,Family)

rf<-rep(100*c(6,10),25); Original_Data<-Full_Data$Complete_Data;

LeverageSampling(rf = rf, Y = as.matrix(Original_Data[,1]),
                 X = as.matrix(Original_Data[,-1]),N = nrow(Original_Data),
                 S_alpha = 0.95,
                 family = "poisson")->Results

plot_Beta(Results)


One million songs data

Description

This data set contains 1,019,318 unique users' music play counts in the Echo Nest, which is available at "http://millionsongdataset.com/tasteprofile/". As a basic step, it is interesting to predict the play counts using the song information collected in the Million Song Dataset (Bertin-Mahieux et al. (2011)). After cleaning up and feature engineering the data in total contains 205,032 observations where we consider the covariates duration, loudness, tempo, artist hotness and song hotness to model the response, the number of song counts.

Usage

One_million_songs

Format

A data frame with 4 columns and 309,685 rows.

Counts

Number of playback counts for songs

Duration

Duration of the song

Loudness

Loudness of the song

Tempo

Tempo of the song

Artist_Hotness

A value between 0 and 1

Song_Hotness

A value between 0 and 1

References

McFee B, Bertin-Mahieux T, Ellis DP, Lanckriet GR (2012). “The million song dataset challenge.” In Proceedings of the 21st International Conference on World Wide Web, 909–916.

Ai M, Yu J, Zhang H, Wang H (2021). “Optimal subsampling algorithms for big data regressions.” Statistica Sinica, 31(2), 749–772.

Examples

nrow(One_million_songs)


Skin segmentation data

Description

Rajen and Abhinav (2012) addressed the challenge of detecting skin-like regions in images as a component of the intricate process of facial recognition. To achieve this goal, they curated the “Skin segmentation” data set, comprising RGB (R-red, G-green, B-blue) values of randomly selected pixels from N = 245,057 facial images, including 50,859 skin samples and 194,198 nonskin samples, spanning diverse age groups, racial backgrounds, and genders.

Usage

Skin_segmentation

Format

A data frame with 4 columns and 245,057 rows.

Skin_presence

Skin presence in the randomly selected pixels

Red

Red values in the randomly selected pixels

Green

Green values in the randomly selected pixels

Blue

Blue values in the randomly selected pixels

Source

Extracted from

Rajen B, Abhinav D (2012) Skin segmentation. UCI Machine Learning Repository.

Available at: doi:10.24432/C5T30C

Examples

nrow(Skin_segmentation)


Subsampling under linear regression for a potentially misspecified model

Description

Using this function sample from big data under linear regression for a potentially misspecified model. Subsampling probabilities are obtained based on the A-, L- and L1- optimality criteria with the RLmAMSE (Reduction of Loss by minimizing the Average Mean Squared Error).

Usage

modelMissLinSub(r0,rf,Y,X,N,Alpha,proportion,model="Auto")

Arguments

r0

sample size for initial random sample

rf

final sample size including initial(r0) and optimal(r) samples

Y

response data or Y

X

covariate data or X matrix that has all the covariates (first column is for the intercept)

N

size of the big data

Alpha

scaling factor when using Log Odds or Power functions to magnify the probabilities

proportion

a proportion of the big data is used to help estimate AMSE values from the subsamples

model

formula for the model used in the GAM or the default choice

Details

The article for this function is in preparation for publication. Please be patient.

Two stage subsampling algorithm for big data under linear regression for potential model misspecification.

First stage is to obtain a random sample of size r_0 and estimate the model parameters. Using the estimated parameters subsampling probabilities are evaluated for A-, L-, L1-optimality criteria, RLmAMSE and enhanced RLmAMSE (log-odds and power) subsampling methods.

Through the estimated subsampling probabilities a sample of size r \ge r_0 is obtained. Finally, the two samples are combined and the model parameters are estimated for A-, L-, L1-optimality, RLmAMSE and enhanced RLmAMSE (log-odds and power).

NOTE : If input parameters are not in given domain conditions necessary error messages will be provided to go further.

If r \ge r_0 is not satisfied then an error message will be produced.

If the big data X,Y has any missing values then an error message will be produced.

The big data size N is compared with the sizes of X,Y,F_estimate_Full and if they are not aligned an error message will be produced.

If \alpha > 1 for the scaling factor is not satisfied an error message will be produced.

If proportion is not in the region of (0,1] an error message will be produced.

model is a formula input formed based on the covariates through the spline terms (s()), squared term (I()), interaction terms (lo()) or automatically. If model is empty or NA or NAN or not one of the defined inputs an error message is printed. As a default we have set model="Auto", which is the main effects model wit the spline terms.

Value

The output of modelMissLinSub gives a list of

Beta_Estimates estimated model parameters after subsampling

Variance_Epsilon_Estimates matrix of estimated variance for epsilon after subsampling

Utility_Estimates estimated A-, L- and L1- optimality values for the obtained subsamples

AMSE_Estimates matrix of estimated AMSE values after subsampling

Sample_A-Optimality list of indexes for the initial and optimal samples obtained based on A-Optimality criteria

Sample_L-Optimality list of indexes for the initial and optimal samples obtained based on L-Optimality criteria

Sample_L1-Optimality list of indexes for the initial and optimal samples obtained based on L1-Optimality criteria

Sample_RLmAMSE list of indexes for the optimal samples obtained based obtained based on RLmAMSE

Sample_RLmAMSE_Log_Odds list of indexes for the optimal samples obtained based on RLmAMSE with Log Odds function

Sample_RLmAMSE_Power list of indexes for the optimal samples obtained based on RLmAMSE with Power function

Subsampling_Probability matrix of calculated subsampling probabilities

References

Adewale AJ, Wiens DP (2009). “Robust designs for misspecified logistic models.” Journal of Statistical Planning and Inference, 139(1), 3–15.

Adewale AJ, Xu X (2010). “Robust designs for generalized linear models with possible overdispersion and misspecified link functions.” Computational statistics & data analysis, 54(4), 875–890.

Mahendran A, Thompson H, McGree JM (2025). “A subsampling approach for large data sets when the Generalised Linear Model is potentially misspecified.” 2510.05902, https://arxiv.org/abs/2510.05902.

Examples

Beta <- c(-1, 0.75, 0.75, 1); Var_Epsilon <- 0.5;
family <- "linear"; N <- 500
X_1 <- replicate(2, stats::runif(n = N, min = -1, max = 1))

Temp <- Rfast::rowprods(X_1)
Misspecification <- (Temp - mean(Temp)) / sqrt(mean(Temp^2) - mean(Temp)^2)
X_Data <- cbind(X0 = 1, X_1)

Full_Data <- GenModelMissGLMdata(N, X_Data, Misspecification, Beta, Var_Epsilon, family)
r0 <- 40; rf <- rep(10 * c(8, 12), 25)
Original_Data <- Full_Data$Complete_Data[, -ncol(Full_Data$Complete_Data)]

Results <- modelMissLinSub(r0 = r0, rf = rf,
                           Y = as.matrix(Original_Data[, 1]),
                           X = as.matrix(Original_Data[, -1]),
                           N = N, Alpha = 10, proportion = 0.5)

plot_Beta(Results)
plot_AMSE(Results)



Subsampling under logistic regression for a potentially misspecified model

Description

Using this function sample from big data under logistic regression for a potentially misspecified model. Subsampling probabilities are obtained based on the A-, L- and L1- optimality criteria with the RLmAMSE (Reduction of Loss by minimizing the Average Mean Squared Error).

Usage

modelMissLogSub(r0,rf,Y,X,N,Alpha,proportion,model="Auto")

Arguments

r0

sample size for initial random sample

rf

final sample size including initial(r0) and optimal(r) samples

Y

response data or Y

X

covariate data or X matrix that has all the covariates (first column is for the intercept)

N

size of the big data

Alpha

scaling factor when using Log Odds or Power functions to magnify the probabilities

proportion

a proportion of the big data is used to help estimate AMSE values from the subsamples

model

formula for the model used in the GAM or the default choice

Details

The article for this function is in preparation for publication. Please be patient.

Two stage subsampling algorithm for big data under logistic regression for potential model misspecification.

First stage is to obtain a random sample of size r_0 and estimate the model parameters. Using the estimated parameters subsampling probabilities are evaluated for A-, L-, L1-optimality criteria, RLmAMSE and enhanced RLmAMSE(log-odds and power) subsampling methods.

Through the estimated subsampling probabilities a sample of size r \ge r_0 is obtained. Finally, the two samples are combined and the model parameters are estimated for A-, L-, L1-optimality, RLmAMSE and enhanced RLmAMSE (log-odds and power).

NOTE : If input parameters are not in given domain conditions necessary error messages will be provided to go further.

If r \ge r_0 is not satisfied then an error message will be produced.

If the big data X,Y has any missing values then an error message will be produced.

The big data size N is compared with the sizes of X,Y,F_estimate_Full and if they are not aligned an error message will be produced.

If \alpha > 1 for the scaling vector is not satisfied an error message will be produced.

If proportion is not in the region of (0,1] an error message will be produced.

model is a formula input formed based on the covariates through the spline terms (s()), squared term (I()), interaction terms (lo()) or automatically. If model is empty or NA or NAN or not one of the defined inputs an error message is printed. As a default we have set model="Auto", which is the main effects model wit the spline terms.

Value

The output of modelMissLogSub gives a list of

Beta_Estimates estimated model parameters after subsampling

Utility_Estimates estimated A-, L- and L1- optimality values for the obtained subsamples

AMSE_Estimates matrix of estimated AMSE values after subsampling

Sample_A-Optimality list of indexes for the initial and optimal samples obtained based on A-Optimality criteria

Sample_L-Optimality list of indexes for the initial and optimal samples obtained based on L-Optimality criteria

Sample_L1-Optimality list of indexes for the initial and optimal samples obtained based on L1-Optimality criteria

Sample_RLmAMSE list of indexes for the optimal samples obtained based on RLmAMSE

Sample_RLmAMSE_Log_Odds list of indexes for the optimal samples obtained based on RLmAMSE with Log Odds function

Sample_RLmAMSE_Power list of indexes for the optimal samples obtained based on RLmAMSE with Power function

Subsampling_Probability matrix of calculated subsampling probabilities

References

Adewale AJ, Wiens DP (2009). “Robust designs for misspecified logistic models.” Journal of Statistical Planning and Inference, 139(1), 3–15.

Adewale AJ, Xu X (2010). “Robust designs for generalized linear models with possible overdispersion and misspecified link functions.” Computational statistics & data analysis, 54(4), 875–890.

Mahendran A, Thompson H, McGree JM (2025). “A subsampling approach for large data sets when the Generalised Linear Model is potentially misspecified.” 2510.05902, https://arxiv.org/abs/2510.05902.

Examples

Beta<-c(-1,0.75,0.75,1); family <- "logistic"; N<-100
X_1 <- replicate(2,stats::runif(n=N,min = -1,max = 1))

Temp<-Rfast::rowprods(X_1)
Misspecification <- (Temp-mean(Temp))/sqrt(mean(Temp^2)-mean(Temp)^2)
X_Data <- cbind(X0=1,X_1);

Full_Data<-GenModelMissGLMdata(N,X_Data,Misspecification,Beta,Var_Epsilon=NULL,family)
r0 <- 20; rf <- rep(10 * c(4, 6), 25)
Original_Data<-Full_Data$Complete_Data[,-ncol(Full_Data$Complete_Data)];

Results<-modelMissLogSub(r0 = r0, rf = rf,
                         Y = as.matrix(Original_Data[,1]),
                         X = as.matrix(Original_Data[,-1]),
                         N = N, Alpha = 10, proportion = 0.3)

plot_Beta(Results)
plot_AMSE(Results)


Subsampling under Poisson regression for a potentially misspecified model

Description

Using this function sample from big data under Poisson regression for a potentially misspecified model. Subsampling probabilities are obtained based on the A-, L- and L1- optimality criteria with the RLmAMSE (Reduction of Loss by minimizing the Average Mean Squared Error).

Usage

modelMissPoiSub(r0,rf,Y,X,N,Alpha,proportion,model="Auto")

Arguments

r0

sample size for initial random sample

rf

final sample size including initial(r0) and optimal(r) samples

Y

response data or Y

X

covariate data or X matrix that has all the covariates (first column is for the intercept)

N

size of the big data

Alpha

scaling factor when using Log Odds or Power functions to magnify the probabilities

proportion

a proportion of the big data is used to help estimate AMSE values from the subsamples

model

formula for the model used in the GAM or the default choice

Details

The article for this function is in preparation for publication. Please be patient.

Two stage subsampling algorithm for big data under Poisson regression for potential model misspecification.

First stage is to obtain a random sample of size r_0 and estimate the model parameters. Using the estimated parameters subsampling probabilities are evaluated for A-, L-, L1-optimality criteria, RLmAMSE and enhanced RLmAMSE (log-odds and power) subsampling methods.

Through the estimated subsampling probabilities a sample of size r \ge r_0 is obtained. Finally, the two samples are combined and the model parameters are estimated for A-, L-, L1-optimality, RLmAMSE and enhanced RLmAMSE (log-odds and power).

NOTE : If input parameters are not in given domain conditions necessary error messages will be provided to go further.

If r \ge r_0 is not satisfied then an error message will be produced.

If the big data X,Y has any missing values then an error message will be produced.

The big data size N is compared with the sizes of X,Y,F_estimate_Full and if they are not aligned an error message will be produced.

If \alpha > 1 for the scaling vector is not satisfied an error message will be produced.

If proportion is not in the region of (0,1] an error message will be produced.

model is a formula input formed based on the covariates through the spline terms (s()), squared term (I()), interaction terms (lo()) or automatically. If model is empty or NA or NAN or not one of the defined inputs an error message is printed. As a default we have set model="Auto", which is the main effects model wit the spline terms.

Value

The output of modelMissPoiSub gives a list of

Beta_Estimates estimated model parameters after subsampling

Utility_Estimates estimated A-, L- and L1- optimality values for the obtained subsamples

AMSE_Estimates matrix of estimated AMSE values after subsampling

Sample_A-Optimality list of indexes for the initial and optimal samples obtained based on A-Optimality criteria

Sample_L-Optimality list of indexes for the initial and optimal samples obtained based on L-Optimality criteria

Sample_L1-Optimality list of indexes for the initial and optimal samples obtained based on L1-Optimality criteria

Sample_RLmAMSE list of indexes for the optimal samples obtained based on RLmAMSE

Sample_RLmAMSE_Log_Odds list of indexes for the optimal samples obtained based on RLmAMSE with Log Odds function

Sample_RLmAMSE_Power list of indexes for the optimal samples obtained based on RLmAMSE with Power function

Subsampling_Probability matrix of calculated subsampling probabilities

References

Adewale AJ, Wiens DP (2009). “Robust designs for misspecified logistic models.” Journal of Statistical Planning and Inference, 139(1), 3–15.

Adewale AJ, Xu X (2010). “Robust designs for generalized linear models with possible overdispersion and misspecified link functions.” Computational statistics & data analysis, 54(4), 875–890.

Mahendran A, Thompson H, McGree JM (2025). “A subsampling approach for large data sets when the Generalised Linear Model is potentially misspecified.” 2510.05902, https://arxiv.org/abs/2510.05902.

Examples

Beta<-c(-1,0.75,0.75,1); family <- "poisson"; N<-100
X_1 <- replicate(2,stats::runif(n=N,min = -1,max = 1))

Temp<-Rfast::rowprods(X_1)
Misspecification <- (Temp-mean(Temp))/sqrt(mean(Temp^2)-mean(Temp)^2)
X_Data <- cbind(X0=1,X_1);

Full_Data<-GenModelMissGLMdata(N,X_Data,Misspecification,Beta,Var_Epsilon=NULL,family)
r0 <- 20; rf <- rep(10 * c(4, 6), 25)
Original_Data<-Full_Data$Complete_Data[,-ncol(Full_Data$Complete_Data)];

Results<-modelMissPoiSub(r0 = r0, rf = rf,
                         Y = as.matrix(Original_Data[,1]),
                         X = as.matrix(Original_Data[,-1]),
                         N = N, Alpha = 10, proportion = 0.3)

plot_Beta(Results)
plot_AMSE(Results)


Model robust optimal subsampling for A- and L- optimality criteria under linear regression

Description

Using this function sample from big data under linear regression when there are more than one model to describe the data. Subsampling probabilities are obtained based on the A- and L- optimality criteria.

Usage

modelRobustLinSub(r0,rf,Y,X,N,Apriori_probs,All_Combinations,All_Covariates)

Arguments

r0

sample size for initial random sample

rf

final sample size including initial(r0) and optimal(r) samples

Y

response data or Y

X

covariate data or X matrix that has all the covariates (first column is for the intercept)

N

size of the big data

Apriori_probs

vector of a priori model probabilities that are used to obtain the model robust subsampling probabilities

All_Combinations

list of possible models that can describe the data

All_Covariates

all the covariates in the models

Details

Two stage subsampling algorithm for big data under linear regression for multiple models that can describe the big data.

First stage is to obtain a random sample of size r_0 and estimate the model parameters for all models. Using the estimated parameters subsampling probabilities are evaluated for A-, L-optimality criteria and model averaging A-, L-optimality subsampling methods.

Through the estimated subsampling probabilities a sample of size r \ge r_0 is obtained. Finally, the two samples are combined and the model parameters are estimated for all the models.

NOTE : If input parameters are not in given domain conditions necessary error messages will be provided to go further.

If r \ge r_0 is not satisfied then an error message will be produced.

If the big data X,Y has any missing values then an error message will be produced.

The big data size N is compared with the sizes of X,Y and if they are not aligned an error message will be produced.

If 0 < \alpha_{q} < 1 for the a priori model probabilities are not satisfied an error message will be produced, where q=1,\ldots,Q and Q is the number of models in the model set.

Value

The output of modelRobustLinSub gives a list of

Beta_Estimates estimated model parameters for each model in a list after subsampling

Variance_Epsilon_Estimates matrix of estimated variance for epsilon for each model after subsampling

Sample_A-Optimality list of indexes for the initial and optimal samples obtained based on A-Optimality criteria

Sample_A-Optimality_MR list of indexes for the initial and model robust optimal samples obtained based on A-Optimality criteria

Sample_L-Optimality list of indexes for the initial and optimal samples obtained based on L-Optimality criteria

Sample_L-Optimality_MR list of indexes for the initial and model robust optimal samples obtained based on L-Optimality criteria

Subsampling_Probability matrix of calculated subsampling probabilities for A- and L- optimality criteria

References

Mahendran A, Thompson H, McGree JM (2023). “A model robust subsampling approach for Generalised Linear Models in big data settings.” Statistical Papers, 64(4), 1137–1157.

Examples

indexes<-1:ceiling(nrow(Electric_consumption)*0.005)
Original_Data<-cbind(Electric_consumption[indexes,1],1,
                     Electric_consumption[indexes,-1])
colnames(Original_Data)<-c("Y",paste0("X",0:ncol(Original_Data[,-c(1,2)])))
for (j in 3:5) {
  Original_Data[,j]<-scale(Original_Data[,j])
}

No_of_Variables<-ncol(Original_Data[,-c(1,2)])
Squared_Terms<-paste0("X",1:No_of_Variables,"^2")
term_no <- 2
All_Models <- list(c("X0",paste0("X",1:No_of_Variables)))

Original_Data<-cbind(Original_Data,Original_Data[,-c(1,2)]^2)
colnames(Original_Data)<-c("Y","X0",paste0("X",1:No_of_Variables),
                           paste0("X",1:No_of_Variables,"^2"))

for (i in 1:No_of_Variables){
  x <- as.vector(combn(Squared_Terms,i,simplify = FALSE))
    for(j in 1:length(x)){
       All_Models[[term_no]] <- c("X0",paste0("X",1:No_of_Variables),x[[j]])
       term_no <- term_no+1
     }
   }

All_Models<-All_Models[c(1,12:16)]
names(All_Models)<-paste0("Model_",1:length(All_Models))

r0<-300; rf<-rep(100*c(6,9),50);

modelRobustLinSub(r0 = r0, rf = rf, Y = as.matrix(Original_Data[,1]),
                  X = as.matrix(Original_Data[,-1]),N = nrow(Original_Data),
                  Apriori_probs = rep(1/length(All_Models),length(All_Models)),
                  All_Combinations = All_Models,
                  All_Covariates = colnames(Original_Data)[-1])->Results

Beta_Plots<-plot_Beta(Results)


Model robust optimal subsampling for A- and L- optimality criteria under logistic regression

Description

Using this function sample from big data under logistic regression when there are more than one model to describe the data. Subsampling probabilities are obtained based on the A- and L- optimality criteria.

Usage

modelRobustLogSub(r0,rf,Y,X,N,Apriori_probs,All_Combinations,All_Covariates)

Arguments

r0

sample size for initial random sample

rf

final sample size including initial(r0) and optimal(r) samples

Y

response data or Y

X

covariate data or X matrix that has all the covariates (first column is for the intercept)

N

size of the big data

Apriori_probs

vector of a priori model probabilities that are used to obtain the model robust subsampling probabilities

All_Combinations

list of possible models that can describe the data

All_Covariates

all the covariates in the models

Details

Two stage subsampling algorithm for big data under logistic regression for multiple models that can describe the big data.

First stage is to obtain a random sample of size r_0 and estimate the model parameters for all models. Using the estimated parameters subsampling probabilities are evaluated for A-, L-optimality criteria and model averaging A-, L-optimality subsampling methods.

Through the estimated subsampling probabilities a sample of size r \ge r_0 is obtained. Finally, the two samples are combined and the model parameters are estimated for all the models.

NOTE : If input parameters are not in given domain conditions necessary error messages will be provided to go further.

If r \ge r_0 is not satisfied then an error message will be produced.

If the big data X,Y has any missing values then an error message will be produced.

The big data size N is compared with the sizes of X,Y and if they are not aligned an error message will be produced.

If 0 < \alpha_{q} < 1 for the a priori model probabilities are not satisfied an error message will be produced, where q=1,\ldots,Q and Q is the number of models in the model set.

Value

The output of modelRobustLinSub gives a list of

Beta_Data estimated model parameters for each model in a list after subsampling

Sample_L-optimality list of indexes for the initial and optimal samples obtained based on L-optimality criteria

Sample_L-optimality_MR list of indexes for the initial and model robust optimal samples obtained based on L-optimality criteria

Sample_A-optimality list of indexes for the initial and optimal samples obtained based on A-optimality criteria

Sample_A-optimality_MR list of indexes for the initial and model robust optimal samples obtained based on A-optimality criteria

Subsampling_Probability matrix of calculated subsampling probabilities for A- and L- optimality criteria

References

Mahendran A, Thompson H, McGree JM (2023). “A model robust subsampling approach for Generalised Linear Models in big data settings.” Statistical Papers, 64(4), 1137–1157.

Examples

indexes<-1:ceiling(nrow(Skin_segmentation)*0.25)
Original_Data<-cbind(Skin_segmentation[indexes,1],1,Skin_segmentation[indexes,-1])
colnames(Original_Data)<-c("Y",paste0("X",0:ncol(Original_Data[,-c(1,2)])))

for (j in 3:5) {
  Original_Data[,j]<-scale(Original_Data[,j])
}
No_of_Variables<-ncol(Original_Data[,-c(1,2)])

Squared_Terms<-paste0("X",1:No_of_Variables,"^2")
term_no <- 2
All_Models <- list(c("X0",paste0("X",1:No_of_Variables)))

Original_Data<-cbind(Original_Data,Original_Data[,-c(1,2)]^2)
colnames(Original_Data)<-c("Y","X0",paste0("X",1:No_of_Variables),
                           paste0("X",1:No_of_Variables,"^2"))

for (i in 1:No_of_Variables){
  x <- as.vector(combn(Squared_Terms,i,simplify = FALSE))
  for(j in 1:length(x)){
    All_Models[[term_no]] <- c("X0",paste0("X",1:No_of_Variables),x[[j]])
    term_no <- term_no+1
  }
}
All_Models<-All_Models[-c(5:7)]
names(All_Models)<-paste0("Model_",1:length(All_Models))

r0<-300; rf<-rep(100*c(6,9),50);

modelRobustLogSub(r0 = r0, rf = rf, Y = as.matrix(Original_Data[,1]),
                  X = as.matrix(Original_Data[,-1]),N = nrow(Original_Data),
                  Apriori_probs = rep(1/length(All_Models),length(All_Models)),
                  All_Combinations = All_Models,
                  All_Covariates = colnames(Original_Data)[-1])->Results

Beta_Plots<-plot_Beta(Results)


Model robust optimal subsampling for A- and L- optimality criteria under Poisson regression

Description

Using this function sample from big data under Poisson regression when there are more than one model to describe the data. Subsampling probabilities are obtained based on the A- and L- optimality criteria.

Usage

modelRobustPoiSub(r0,rf,Y,X,N,Apriori_probs,All_Combinations,All_Covariates)

Arguments

r0

sample size for initial random sample

rf

final sample size including initial(r0) and optimal(r) samples

Y

response data or Y

X

covariate data or X matrix that has all the covariates (first column is for the intercept)

N

size of the big data

Apriori_probs

vector of a priori model probabilities that are used to obtain the model robust subsampling probabilities

All_Combinations

list of possible models that can describe the data

All_Covariates

all the covariates in the models

Details

Two stage subsampling algorithm for big data under Poisson regression for multiple models that can describe the big data.

First stage is to obtain a random sample of size r_0 and estimate the model parameters for all models. Using the estimated parameters subsampling probabilities are evaluated for A-, L-optimality criteria and model averaging A-, L-optimality subsampling methods.

Through the estimated subsampling probabilities a sample of size r \ge r_0 is obtained. Finally, the two samples are combined and the model parameters are estimated for all the models.

NOTE : If input parameters are not in given domain conditions necessary error messages will be provided to go further.

If r \ge r_0 is not satisfied then an error message will be produced.

If the big data X,Y has any missing values then an error message will be produced.

The big data size N is compared with the sizes of X,Y and if they are not aligned an error message will be produced.

If 0 < \alpha_{q} < 1 for the a priori model probabilities are not satisfied an error message will be produced, where q=1,\ldots,Q and Q is the number of models in the model set.

Value

The output of modelRobustLinSub gives a list of

Beta_Data estimated model parameters for each model in a list after subsampling

Sample_L-optimality list of indexes for the initial and optimal samples obtained based on L-optimality criteria

Sample_L-optimality_MR list of indexes for the initial and model robust optimal samples obtained based on L-optimality criteria

Sample_A-optimality list of indexes for the initial and optimal samples obtained based on A-optimality criteria

Sample_A-optimality_MR list of indexes for the initial and model robust optimal samples obtained based on A-optimality criteria

Subsampling_Probability matrix of calculated subsampling probabilities for A- and L- optimality criteria

References

Mahendran A, Thompson H, McGree JM (2023). “A model robust subsampling approach for Generalised Linear Models in big data settings.” Statistical Papers, 64(4), 1137–1157.

Examples

indexes<-1:ceiling(nrow(One_million_songs)*0.5)
Original_Data<-One_million_songs[indexes,]
colnames(Original_Data)<-c("Y",paste0("X",1:ncol(Original_Data[,-1])))

# Scaling the covariate data
for (j in 2:4) {
  Original_Data[,j]<-scale(Original_Data[,j])
}

No_of_Variables<-ncol(Original_Data[,-1])
Squared_Terms<-paste0("X",1:No_of_Variables,"^2")
term_no <- 2
All_Models <- list(paste0("X",1:No_of_Variables))

Original_Data<-cbind(Original_Data,Original_Data[,-1]^2)
colnames(Original_Data)<-c("Y",paste0("X",1:No_of_Variables),
                            paste0("X",1:No_of_Variables,"^2"))

for (i in 1:No_of_Variables)
{
  x <- as.vector(combn(Squared_Terms,i,simplify = FALSE))
  for(j in 1:length(x))
  {
    All_Models[[term_no]] <- c(paste0("X",1:No_of_Variables),x[[j]])
    term_no <- term_no+1
  }
}
All_Models<-All_Models[1:4]
names(All_Models)<-paste0("Model_",1:length(All_Models))

r0<-300; rf<-rep(100*c(6,9),25);

modelRobustPoiSub(r0 = r0, rf = rf, Y = as.matrix(Original_Data[,1]),
                  X = as.matrix(Original_Data[,-1]),N = nrow(Original_Data),
                  Apriori_probs = rep(1/length(All_Models),length(All_Models)),
                  All_Combinations = All_Models,
                  All_Covariates = colnames(Original_Data)[-1])->Results

Beta_Plots<-plot_Beta(Results)


Plotting AMSE outputs for the samples under model misspecification

Description

After using the subsampling methods under potential model misspecification we obtain their respective AMSE values for the predictions. They are summarised as plots here.

Usage

plot_AMSE(object)

Arguments

object

Any object after subsampling from our subsampling function under potential model misspecification

Details

For A- and L-optimality criteria and RLmAMSE subsampling under Generalised Linear Models with potential model misspecification the facets are for variance and bias^2 of AMSE values.

Value

The output is a faceted ggplot result


Plotting model parameter outputs after subsampling

Description

After using the subsampling methods we mostly obtain the estimated model parameter estimates. Here, they are summarised as histogram plots.

Usage

plot_Beta(object)

Arguments

object

Any object after subsampling from our subsampling functions

Details

For local case control sampling the facets are for sample sizes and beta values.

For leverage sampling the facets are for sample sizes and beta values.

For A- and L-optimality criteria subsampling under Generalised Linear Models the facets are for sample sizes and beta values.

For A-optimality criteria subsampling under Gaussian Linear Models the facets are for sample sizes and beta values.

For A-optimality criteria subsampling under Generalised Linear Models with response variable not inclusive the facets are for sample sizes and beta values.

For A- and L-optimality criteria subsampling under Generalised Linear Models where multiple models can describe the data the facets are for sample sizes and beta values.

For A- and L-optimality criteria and LmAMSE subsampling under Generalised Linear Models with potential model misspecification the facets are for sample sizes and beta values.

Value

The output is a faceted ggplot result

These binaries (installable software) and packages are in development.
They may not be fully stable and should be used with caution. We make no claims about them.
Health stats visible at Monitor.