The information about each dataset is spread across two files, a metadata file and a data file:
Metadata file names start with “D_”. We can use the file naming conventions to link each metadata file with its corresponding data file.
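As a rough illustration of linking files by name, the sketch below pairs each metadata file with a data file. It assumes, hypothetically, that the data file name is simply the metadata file name without the "D_" prefix; the helper `pair_metadata_with_data` and all file names other than RCR7 are invented for the example.

```python
def pair_metadata_with_data(file_names):
    """Map each metadata file name to its presumed data file name (or None)."""
    names = set(file_names)
    pairs = {}
    for name in sorted(names):
        if name.startswith("D_"):
            # Assumption: the data file is the metadata name minus the "D_" prefix.
            candidate = name[len("D_"):]
            pairs[name] = candidate if candidate in names else None
    return pairs


# Hypothetical directory listing, for illustration only.
files = ["D_INST.TXT", "INST.TXT", "D_RCR7.TXT", "RCR7.TXT", "D_ORPHAN.TXT"]
pairs = pair_metadata_with_data(files)
```

A metadata file with no matching data file maps to `None`, which makes unpaired files easy to spot before any parsing starts.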
Each metadata file, in essence, contains the following information about the variables in the corresponding data file:
Metadata is stored by FCA in a way that is not straightforward to process:
Nevertheless, there are some features of metadata files that come in handy:
These features allow us to use regular expressions to process metadata files.
For the purposes of parsing files to create tidy data frames, the metadata files are classified into the following unique scenarios:
This classification is important because the corresponding data file will have a different structure depending on the scenario.
In general terms, the processing workflow of data files involves:
The specific pivot operations depend on the scenario of the corresponding metadata file.
This is the simplest scenario. The data in the data file is already in the expected format, and the only task is to apply the column names specified in the metadata file.
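A minimal sketch of this step, assuming the data file is comma-separated with no header row and that the column names have already been extracted from the metadata file. The function `read_scenario1` and the sample values are hypothetical.

```python
import csv
import io


def read_scenario1(data_text, column_names):
    """Attach metadata column names to each comma-separated record."""
    rows = []
    for record in csv.reader(io.StringIO(data_text)):
        if not record:
            continue  # skip blank lines
        if len(record) != len(column_names):
            raise ValueError(
                f"expected {len(column_names)} fields, got {len(record)}"
            )
        rows.append(dict(zip(column_names, record)))
    return rows


# Mock data and column names, for illustration only.
data_text = "a1,b1,10\na2,b2,20\n"
rows = read_scenario1(data_text, ["ColumnA", "ColumnB", "Value"])
```

The length check matters here: a mismatch between the record width and the metadata column count suggests the file belongs to one of the other scenarios.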
In this case, the data in the data file does not adhere to “tidy” principles: the multiple-occurrence columns are repeated for each class within the “code” variable. To illustrate, consider a data file that has the following (mock) column names:

ColumnA | ColumnB | Code1 | Metric1_Code1 | Metric2_Code1 | Code2 | Metric1_Code2 | Metric2_Code2

The corresponding metadata file would likely list the column names as:

ColumnA | ColumnB | Code | Metric1 | Metric2

This creates a mismatch between the number of columns identified in the metadata file (5) and the number of columns in the corresponding data file (8). To properly apply column names to the data file, the total number of distinct codes in the Code column must be determined. In the mock example, the 6 data columns beyond ColumnA and ColumnB form blocks of 3 (Code, Metric1, Metric2), so there are 2 codes. Once the column names have been appropriately applied to the data file, we complete the data processing by applying both long and wide pivots.
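The column arithmetic and reshaping for the mock example can be sketched as follows. This is not the package's implementation: it assumes the split between single-occurrence and multiple-occurrence columns is already known from the metadata file, and it regroups each record directly into one row per code instead of performing a long pivot followed by a wide pivot, which should give the same tidy layout. Both function names are hypothetical.

```python
def expand_column_names(single, repeated, n_data_columns):
    """Infer the number of codes and build the wide column names."""
    extra = n_data_columns - len(single)
    if extra <= 0 or extra % len(repeated) != 0:
        raise ValueError("data columns do not fit the metadata layout")
    n_codes = extra // len(repeated)
    code_col = repeated[0]
    names = list(single)
    for i in range(1, n_codes + 1):
        names.append(f"{code_col}{i}")  # e.g. Code1
        names.extend(f"{m}_{code_col}{i}" for m in repeated[1:])  # e.g. Metric1_Code1
    return names, n_codes


def tidy_rows(records, single, repeated):
    """Turn each wide record into one row per code."""
    tidy = []
    for record in records:
        head = dict(zip(single, record[:len(single)]))
        rest = record[len(single):]
        for start in range(0, len(rest), len(repeated)):
            block = rest[start:start + len(repeated)]
            tidy.append({**head, **dict(zip(repeated, block))})
    return tidy


single = ["ColumnA", "ColumnB"]
repeated = ["Code", "Metric1", "Metric2"]
names, n_codes = expand_column_names(single, repeated, 8)
tidy = tidy_rows([["a1", "b1", "X", "1", "2", "Y", "3", "4"]], single, repeated)
```

With the mock layout, `n_codes` is 2 and the single wide record becomes two tidy rows whose columns match the five names listed in the metadata file.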
This is the most complex scenario, currently applicable to only one data file (RCR7). Unlike Scenario 2, there is an additional set of single-occurrence columns that follow a set of multiple-occurrence columns. For each observation in the data file, there is a row that contains comma-separated values of variables that belong to the first set of single-occurrence columns, followed by a row for each class of the ‘code’ variable with comma-separated values of multiple-occurrence variables, and finally, a row that contains comma-separated values of the remaining single-occurrence columns.
In this case, the initial step of reading the data involves concatenating the lines that correspond to the same observation (using loops). Once the data is read, the processing is similar to Scenario 2, except that column naming must account for the repeating columns sitting in the middle of the dataset. After the columns are named, the processing is the same as in Scenario 2. The resulting dataset differs from the metadata file in that it places all single-occurrence columns at the beginning.
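A minimal sketch of the line-concatenation step, under assumptions that are not confirmed by the source: every observation spans exactly `n_codes + 2` lines, the number of codes is known in advance, and no field contains an embedded comma. The function names and sample lines are hypothetical.

```python
def concatenate_observations(lines, n_codes):
    """Join the lines belonging to one observation into a single record."""
    per_obs = n_codes + 2  # leading singles + one line per code + trailing singles
    lines = [line.strip() for line in lines if line.strip()]
    if len(lines) % per_obs != 0:
        raise ValueError("line count is not a multiple of lines per observation")
    records = []
    for start in range(0, len(lines), per_obs):
        chunk = lines[start:start + per_obs]
        # Naive split: assumes no quoted fields with embedded commas.
        records.append(",".join(chunk).split(","))
    return records


def move_trailing_singles_forward(record, n_lead, n_trail):
    """Place all single-occurrence values before the repeating blocks."""
    cut = len(record) - n_trail
    return record[:n_lead] + record[cut:] + record[n_lead:cut]


# Mock lines: two observations, two codes each.
lines = ["a1,b1", "X,1,2", "Y,3,4", "z1",
         "a2,b2", "X,5,6", "Y,7,8", "z2"]
records = concatenate_observations(lines, n_codes=2)
reordered = [move_trailing_singles_forward(r, n_lead=2, n_trail=1) for r in records]
```

After the reorder, each record has the same shape as a Scenario 2 record (single-occurrence values first, repeating blocks after), so the same naming and reshaping logic can be reused.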