Mastering ProjectTemplate

Before working through this walkthrough, make sure you've read (or at least understand) the beginner's tutorial.

Ad Hoc File Types

In the beginner's tutorial, we showed how ProjectTemplate automatically loads data files from the data and cache directories. If you're working with plain text files or any of the supported binary file formats, this automatic data loading should work out of the box without any effort on your part. But if you have to retrieve data sets from more complex data sources, ProjectTemplate has advanced features that will let you set up ad hoc autoloading. In the rest of this document, we'll talk about working with SQL databases, remote resources available over HTTP and FTP, large data files stored on external drives and R files that contain code that generates data at runtime.

SQLite Databases

Let's start by working with an SQLite database. We'll use a database from the Analytics X competition in which contestants were trying to predict crimes that took place in Philadelphia. You can download the database file here.

Autoloading the Database

The simplest way to access the database is to store the analyticsx.db file in the data directory of a new project. Let's set up a project using the standard ProjectTemplate invocation:

    library('ProjectTemplate')
    create.project('AnalyticsX')

Then we'll shift into the relevant directory and move our database over:

    cd AnalyticsX
    mv ~/Downloads/analyticsx.db data

Then we reload R and load the project. You'll see ProjectTemplate automatically load the five tables found in our SQLite database:

    library('ProjectTemplate')
    load.project()

IMAGE OF AUTOLOADING GOES HERE

For most users, this automatic loading procedure is probably enough. But if you need more fine-grained control, you can use the .sql ad hoc file type to load one specific table from an SQLite database, to load every table from a database stored outside the data directory, or to run an exact SQL query against the database. We'll go through all three cases below.

Load One Specific Table

First, let's move the analyticsx.db file out of the data directory to prevent it from being autoloaded:

    mv data/analyticsx.db .

After that, we'll create a .sql file in the data directory. We need to specify (a) that we're working with an SQLite database, (b) the path to the SQLite database, which now sits in the project's root directory, and (c) the specific table we want to load:

    type: sqlite
    dbname: analyticsx.db
    table: homicides

Running load.project will then load only this table from our database.

IMAGE OF AUTOLOADING GOES HERE

Load All Tables from a Specific Database

If we want to load all of the tables from a database file that we can't place inside the data directory, we can still use a .sql file: we just replace the name of a specific table with an asterisk:

    type: sqlite
    dbname: analyticsx.db
    table: *

IMAGE OF AUTOLOADING GOES HERE

Loading Data with an SQL Query

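You can also load the result of an SQL query instead of a whole table, which lets you pull in just a subset of your data. The query below selects every row of the homicides table; add a WHERE clause to restrict it:

    type: sqlite
    dbname: analyticsx.db
    query: SELECT * FROM homicides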

IMAGE OF AUTOLOADING GOES HERE
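
To make the three cases concrete, here is a rough sketch of what each one amounts to when written by hand with the DBI and RSQLite packages. It is an illustration under stated assumptions, not ProjectTemplate's own code: an in-memory database stands in for analyticsx.db, and the toy rows and the id and year columns are made up.

```r
library(DBI)      # generic database interface
library(RSQLite)  # SQLite driver for DBI

# Assumption: a throwaway in-memory database stands in for analyticsx.db,
# and the toy `homicides` rows below are invented for illustration.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "homicides",
             data.frame(id = 1:3, year = c(2009, 2010, 2010)))

# Case 1: one specific table (compare `table: homicides`).
homicides <- dbReadTable(con, "homicides")

# Case 2: every table in the database (compare `table: *`).
all_tables <- sapply(dbListTables(con),
                     function(name) dbReadTable(con, name),
                     simplify = FALSE)

# Case 3: the result of a query (compare `query: ...`).
recent <- dbGetQuery(con, "SELECT * FROM homicides WHERE year = 2010")

dbDisconnect(con)
```

The .sql file saves you from writing and maintaining this connection code in every analysis script.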

MySQL Databases

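Working with a MySQL database is just as easy as using a .sql file to access an SQLite database. You use the mysql type instead of the sqlite type, and because MySQL runs as a server, dbname is the name of a database rather than a file path and you also supply connection details. The user, password, host and database name below are placeholders:

    type: mysql
    user: sample_user
    password: sample_password
    host: localhost
    dbname: analyticsx
    table: *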

IMAGE OF AUTOLOADING GOES HERE

URL Files

If you need to access a file that's available over HTTP or FTP, you can use a .url file. Inside the file, you'll specify the URL where your data set is available and the separator that delimits its columns:

    url: http://www.johnmyleswhite.com/ProjectTemplate/sample_data.csv
    separator: ,

IMAGE OF AUTOLOADING GOES HERE
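
As a rough picture of what such a file asks for, the sketch below reads a delimited file with base R's read.csv, which accepts either a local path or a URL. The read_delimited helper is hypothetical and not part of ProjectTemplate, and the small CSV is made up so that the example runs without network access.

```r
# Hypothetical helper, not part of ProjectTemplate: read a delimited file
# from a local path or a URL using the given separator.
read_delimited <- function(location, separator = ",") {
  read.csv(location, sep = separator, header = TRUE,
           stringsAsFactors = FALSE)
}

# To stay runnable offline, write a small made-up CSV to a temporary file
# and read it back; a URL string could be passed in the same way.
tmp <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,a", "2,b"), tmp)
sample_data <- read_delimited(tmp, separator = ",")
```

Passing the URL from the .url file as location would fetch the remote file instead.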

.file Files

If you need to access a file that's stored outside of the project's main directory, you can use a .file file. Inside the file, you'll specify the path of the data file and the extension that tells ProjectTemplate how to read it:

    path: /usr/share/dict/words
    extension: csv

IMAGE OF AUTOLOADING GOES HERE

R Files

Sometimes you want to generate random data for your analysis: this, after all, is the heart of Monte Carlo analyses of statistical methods. You can do this by inserting R code into a file in the data directory. We'll put this into the data/d.R file:

    set.seed(1)
    d <- rnorm(1000, 0, 1)

IMAGE OF AUTOLOADING GOES HERE
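
Fixing the seed is what makes the generated data reproducible from one project load to the next. The sketch below checks this with a hypothetical generate_d helper that mirrors the two lines of data/d.R; the helper name and its arguments are assumptions made for illustration.

```r
# Hypothetical helper mirroring data/d.R: fix the seed, then draw the sample.
generate_d <- function(seed = 1, n = 1000) {
  set.seed(seed)
  rnorm(n, mean = 0, sd = 1)
}

first_run  <- generate_d()
second_run <- generate_d()

# Same seed, same draws; a different seed gives a different sample.
same_seed_matches  <- identical(first_run, second_run)
other_seed_differs <- !identical(first_run, generate_d(seed = 2))
```

If you drop the set.seed call, every load of the project produces a different d, which is rarely what you want when comparing results across sessions.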

Unit Testing Your Project

ProjectTemplate has been designed to make it easier to unit test the functions you've written for your analysis. To get started, you can call stub.tests(), which will generate a file at tests/autogenerated.R filled with sample tests for every one of the functions you defined inside of the lib directory. You should edit these tests, as they are expected to fail by default.

After editing your tests, you can call test.project() to run all of the unit tests in the tests directory.

EXAMPLE
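
A minimal sketch of what an edited test might look like, assuming the tests are written with the testthat package. The standardize function is hypothetical: it stands in for something you might define in the lib directory.

```r
library(testthat)

# Hypothetical function of the kind you might keep in the lib directory.
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

# A test you might write in place of an autogenerated stub.
test_that("standardize centers and scales its input", {
  z <- standardize(c(2, 4, 6, 8))
  expect_equal(mean(z), 0)
  expect_equal(sd(z), 1)
  expect_length(z, 4)
})
```

In a real project, the function would live in lib and only the test_that block would go in the tests directory.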

Logging Your Work

If you want to log your work, ProjectTemplate will automatically load a log4r logger object into the logger variable. The logger writes to a plain text file stored at logs/project.log. To use this logger, you only need to change the configuration file to specify:

    logging: on

After making this change, the logger object will be created once you call load.project().
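
Once the logger exists, you write to it with log4r's level functions. The sketch below builds a logger by hand so that it runs outside a project. The temporary log path, the INFO threshold and the messages are assumptions made for illustration; inside a project you would skip the setup and use the logger variable directly.

```r
library(log4r)

# Assumption: build a logger by hand, pointed at a temporary file, so the
# example runs outside a project. Inside a project, load.project() would
# supply `logger` for you.
log_path <- file.path(tempdir(), "project.log")
logger <- create.logger(logfile = log_path, level = "INFO")

info(logger, "Finished loading data")   # written: INFO meets the threshold
debug(logger, "Row counts look fine")   # skipped: DEBUG is below INFO
warn(logger, "Missing values found")    # written: WARN is above INFO

log_lines <- readLines(log_path)
```

Messages below the logger's level are silently dropped, so you can leave debug calls in your code and only raise the level when you need them.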

Data Diagnostics

Coming soon

Profiling Your Project

Coming soon