Main components
There are four main components:

- A source provides an iterator-based interface to a file, string, or raw vector. In C++, these are `SourceFile`, `SourceString`, `SourceRaw`, etc. Each source has a corresponding R representation, usually generated by `datasource()`. `Source::create()` instantiates the appropriate class from the R representation.
- A token is an iterator that points to a single value in the source. A token also contains metadata about the location of the value (e.g. the row and column, needed for informative error messages) and, optionally, an unescape method. The unescape method is used if the tokenizer detects escapes (and hence the memory allocated by the source can't be used directly).
- A tokenizer converts a stream of characters from a source into a stream of tokens. Tokenizers are typically written in DFA (deterministic finite automaton) style. This is a bit more verbose than informal parsing, but it makes it much easier to verify correctness.
- Field collectors take a stream of tokens, parsing each token and storing it in an R vector. There is one collector for each column type: `CollectorLogical`, `CollectorInteger`, `CollectorDouble`, etc. On the R side, these are represented by `col_logical()`, `col_integer()`, `col_double()`, etc. `Collector::create()` dynamically creates a Collector subclass from an R list.
Each component is described in more detail below.
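To make the data flow concrete, here is a rough sketch of how the pieces fit together. It is illustrative glue, not readr's actual code: `sourceSpec`, `tokenizerSpec`, `colSpecs`, `collectorsCreate()`, and the `setValue()`/`col()`/`row()` signatures are all assumptions for the sake of the example.

```cpp
// Hypothetical glue code: route each token to the collector for its column.
SourcePtr source = Source::create(sourceSpec);  // R spec -> C++ source
TokenizerPtr tokenizer = Tokenizer::create(tokenizerSpec);
std::vector<CollectorPtr> collectors = collectorsCreate(colSpecs);

tokenizer->tokenize(source->begin(), source->end());
for (Token t = tokenizer->nextToken(); t.type() != TOKEN_EOF;
     t = tokenizer->nextToken()) {
  // Each token records the row/column it came from, so it can be handed
  // to the collector for its column, which parses and stores the value.
  collectors[t.col()]->setValue(t.row(), t);
}
```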
Sources
There are three main sources (`Source.h`):

- A file on disk (mmapped for optimal performance), `SourceFile.h`.
- A string, `SourceString.h`.
- A raw vector, `SourceRaw.h`.
Sources abstract away the underlying data storage to provide an iterator-based interface (`.begin()` and `.end()`).
Currently, connections are supported by saving them to a file. Eventually, we'll need to fully support streaming connections by implementing a stream-based parsing interface.
Tokens
A token (`Token.h`) is one of:

- Empty
- Missing
- A string, represented by two iterators into the underlying source. If the string is escaped, the token also contains a pointer to an unescaping function.
- EOF, used to indicate that parsing is complete.
Tokens also store their position (the row and column of the field) for informative error messages.
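A minimal sketch of such a token is shown below; the field names are illustrative (see `Token.h` for the real definition).

```cpp
#include <cstddef>

enum TokenType { TOKEN_STRING, TOKEN_MISSING, TOKEN_EMPTY, TOKEN_EOF };

// Simplified sketch of a token. TOKEN_STRING tokens point into the source
// via a pair of iterators; every token carries the position of its field.
class Token {
  TokenType type_;
  const char *begin_, *end_;  // [begin_, end_) in the source (strings only)
  int row_, col_;             // location of the field, for error messages
  bool hasEscapes_;           // if true, unescape before using the text

public:
  Token(TokenType type, int row, int col)
      : type_(type), begin_(NULL), end_(NULL),
        row_(row), col_(col), hasEscapes_(false) {}

  TokenType type() const { return type_; }
  int row() const { return row_; }
  int col() const { return col_; }
};
```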
Tokenizer
The tokenizer (`Tokenizer.h`) turns a source (a stream of characters) into a stream of tokens. To use a tokenizer:
```cpp
// Create the C++ object from the R spec
TokenizerPtr tokenizer = Tokenizer::create(tokenizerSpec);

// Initialise it with a source
tokenizer->tokenize(source->begin(), source->end());

// Call nextToken() until there are no tokens left
for (Token t = tokenizer->nextToken(); t.type() != TOKEN_EOF;
     t = tokenizer->nextToken()) {
  // process t
}
```
The most important tokenizers are:

- `TokenizerDelim`, for parsing general delimited files.
- `TokenizerFixedWidth`, for parsing fixed-width files.
Tokenizers also identify missing values and manage encoding (mediated through the token).
TokenizerDelim with doubled escapes
[State diagram: `,` is shorthand for any delimiter, newline, or EOF.]
This tokenizer is designed to support the most common style of CSV files, where a double quote inside a string is escaped by doubling it. In other words, to create a string containing a single double quote, you use `""""`. In the future, we'll add other tokenizers to support more esoteric formats.
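To make the doubled-escape handling concrete, here is a toy DFA-style tokenizer for a single line. It is illustrative only (`tokenizeLine` is not part of readr, and the real TokenizerDelim also tracks positions, missing values, and records spanning multiple lines), but it shows the state-machine shape described above.

```cpp
#include <string>
#include <vector>

// States: outside a quoted string, inside one, or just after a '"' inside
// a string (which may be a doubled escape or the closing quote).
enum State { STATE_FIELD, STATE_STRING, STATE_QUOTE };

std::vector<std::string> tokenizeLine(const char* begin, const char* end,
                                      char delim = ',') {
  std::vector<std::string> fields;
  std::string field;
  State state = STATE_FIELD;

  for (const char* cur = begin; cur != end; ++cur) {
    char c = *cur;
    switch (state) {
    case STATE_FIELD:
      if (c == delim)      { fields.push_back(field); field.clear(); }
      else if (c == '"')   state = STATE_STRING;
      else                 field.push_back(c);
      break;
    case STATE_STRING:
      if (c == '"')        state = STATE_QUOTE;
      else                 field.push_back(c);
      break;
    case STATE_QUOTE:
      if (c == '"')        { field.push_back('"'); state = STATE_STRING; }
      else if (c == delim) { fields.push_back(field); field.clear();
                             state = STATE_FIELD; }
      else                 { field.push_back(c); state = STATE_FIELD; }
      break;
    }
  }
  fields.push_back(field);  // EOF also terminates the last field
  return fields;
}
```

For example, given the line `a,"b""c"`, this produces the two fields `a` and `b"c`.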
TokenizerDelim with backslash escapes

TokenizerDelim can also be configured to use backslash escapes, where a literal double quote inside a string is written `\"`. The state machine is similar, with an extra state for the character immediately following a backslash.
Column collectors
Column collectors collect corresponding fields across multiple records, parsing the strings and storing the results in the appropriate R vector; a sketch of the interface follows at the end of this section.
Four collectors correspond to existing behaviour in `read.csv()` etc.:
- Logical
- Integer
- Double: decimal
- Character: encoding, trim, emptyIsMissing?
Three others support the most important S3 vectors:
- Factor: levels, ordered
- Date
- DateTime
There are two others that don’t represent existing S3 vectors, but might be useful to add:
- BigInteger (64 bit)
- Time
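For illustration, here is a rough sketch of what a double collector might look like. This is hypothetical code, not readr's: the real collectors parse `Token` objects, store results directly in R vectors, and record a parse problem rather than silently producing NaN.

```cpp
#include <cstdlib>
#include <limits>
#include <string>
#include <vector>

// Simplified sketch of a double collector: parse each field's text and
// store the result at that field's row.
class CollectorDouble {
  std::vector<double> values_;

public:
  void resize(std::size_t n) { values_.resize(n); }

  void setValue(std::size_t i, const std::string& field) {
    char* end;
    double parsed = std::strtod(field.c_str(), &end);
    bool ok = !field.empty() && *end == '\0';
    values_[i] = ok ? parsed : std::numeric_limits<double>::quiet_NaN();
  }
};
```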