Decoding ODB-2 Data
The high-level decoding API in pyodc is compatible with pandas and is designed to be as straightforward as possible.
Trivial Example
To decode the data, read it directly with the read_odb() function:
[1]:
import sys
import os
sys.path.insert(0, os.path.abspath('../../..'))
[2]:
import pandas as pd
import pyodc as odc
df_decoded = odc.read_odb('example-1.odb', single=True)
print(df_decoded)
expver date@hdr statid@hdr wigos@hdr obsvalue@body \
0 1 20210420 stat00 0-12345-0-67890 0.0000
1 1 20210420 stat01 0-12345-0-67891 12.3456
2 1 20210420 stat02 0-12345-0-67892 24.6912
3 1 20210420 stat03 0-12345-0-67893 37.0368
4 1 20210420 stat04 0-12345-0-67894 49.3824
5 1 20210420 stat05 0-12345-0-67895 61.7280
6 1 20210420 stat06 0-12345-0-67896 74.0736
7 1 20210420 stat07 0-12345-0-67897 86.4192
8 1 20210420 stat08 0-12345-0-67898 98.7648
9 1 20210420 stat09 0-12345-0-67899 111.1104
integer_missing double_missing bf_column bf_missing
0 1234.0 12.34 0 0.0
1 4321.0 43.21 9 9.0
2 NaN NaN 6 6.0
3 1234.0 12.34 10 10.0
4 4321.0 43.21 5 5.0
5 NaN NaN 7 NaN
6 1234.0 12.34 15 15.0
7 4321.0 43.21 0 0.0
8 NaN NaN 9 9.0
9 1234.0 12.34 6 6.0
Note
Passing single=True to read_odb() ensures that the data is aggregated and returned as a single frame if possible. For more information on aggregation, see the section "Aggregated or Non-aggregated Decoding" below.
File Type Object
Decoding of ODB-2 data works with file-like objects as well as with file names:
[3]:
with open('example-1.odb', 'rb') as f:
    # Any binary file-like object can be passed in place of a file name
    df_decoded = odc.read_odb(f, single=True)
Decoding a Subset of the Data
For large ODB-2 files, decoding only the columns that are needed can save considerable time and memory. The decode functions accept a list or tuple specifying the columns to decode.
This is especially helpful when the structure of ODB-2 frames in a file is not constant, but all of the frames supply the desired data:
[4]:
df_decoded = odc.read_odb('example-1.odb', single=True, columns=('statid@hdr', 'obsvalue@body'))
print(df_decoded)
statid@hdr obsvalue@body
0 stat00 0.0000
1 stat01 12.3456
2 stat02 24.6912
3 stat03 37.0368
4 stat04 49.3824
5 stat05 61.7280
6 stat06 74.0736
7 stat07 86.4192
8 stat08 98.7648
9 stat09 111.1104
Note
For historical reasons, column references can omit the @ sign and everything after it, but only if the resulting column name is unique and unambiguous. For example, the call above could also refer to the two columns in the following format:
columns=('statid', 'obsvalue')
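The shortening rule described in this note can be sketched in plain Python. This is only an illustration of the rule as stated here, not the code pyodc uses. The helper name resolve_column and the list of available columns are assumptions made for the example.

```python
def resolve_column(name, available):
    """Map a possibly shortened column name onto a full one (illustrative only)."""
    if name in available:
        # An exact match always wins
        return name
    # Otherwise compare against the part of each column name before the '@'
    matches = [col for col in available if col.split("@", 1)[0] == name]
    if len(matches) != 1:
        raise KeyError(f"column {name!r} is missing or ambiguous: {matches}")
    return matches[0]


# Column names taken from the example output earlier in this page
available = ["expver", "date@hdr", "statid@hdr", "obsvalue@body"]
print([resolve_column(c, available) for c in ("statid", "obsvalue")])
```

In this sketch, a short name that matches more than one full column (for example obsvalue@hdr and obsvalue@body in the same frame) raises an error rather than guessing.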
Decoding Bitfield Data
Bitfield data is encoded as integer values. These can be decoded directly and are then represented as ordinary integers (including missing values):
[5]:
df_decoded = odc.read_odb('example-6.odb', single=True, columns=('bf_column', 'bf_missing'))
print(df_decoded)
bf_column bf_missing
0 0 0.0
1 9 9.0
2 6 6.0
3 10 10.0
4 5 5.0
5 7 NaN
6 15 15.0
7 0 0.0
8 9 9.0
9 6 6.0
Note
If integral data contains missing values, the decoded columns will be represented as floating point numbers, with the missing values represented as NaNs. This is because the default numpy integer types used by pandas have no way to represent missing values.
The bits in these values correspond to the bitfields reported by the low-level API:
[6]:
reader = odc.Reader('example-6.odb')
for column in ('bf_column', 'bf_missing'):
    # Look up the bitfield layout of each column in turn
    for bits in reader.frames[0].column_dict[column].bitfields:
        print(f"column={column:<10} name={bits.name:<7} offset={bits.offset} size={bits.size}")
column=bf_column name=bit1 offset=0 size=1
column=bf_column name=bitpair offset=1 size=2
column=bf_column name=bit4 offset=3 size=1
column=bf_missing name=bit1 offset=0 size=1
column=bf_missing name=bitpair offset=1 size=2
column=bf_missing name=bit4 offset=3 size=1
To facilitate direct access to the named bit ranges, these can be explicitly requested when decoding the data. Single bit values will be translated into booleans, whereas multiple-bit values will be translated into the corresponding unsigned integers. Missing values are handled in the same manner as for normal columns.
[7]:
df_decoded = odc.read_odb('example-6.odb',
                          single=True,
                          columns=('bf_column.bit1', 'bf_column.bitpair', 'bf_missing.bit1', 'bf_missing.bitpair'))
print(df_decoded)
bf_column.bit1 bf_column.bitpair bf_missing.bit1 bf_missing.bitpair
0 False 0 False 0.0
1 True 0 True 0.0
2 False 3 False 3.0
3 False 1 False 1.0
4 True 2 True 2.0
5 True 3 None NaN
6 True 3 True 3.0
7 False 0 False 0.0
8 True 0 True 0.0
9 False 3 False 3.0
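The translation from integer values to named bit ranges can be reproduced by hand from the offsets and sizes listed earlier. The following is a minimal sketch in plain Python, not the decoder's own code. The helper names extract_bits and decode_row are hypothetical, and missing values are modelled as None for simplicity.

```python
from typing import Optional, Union

# Bitfield layout as reported for bf_column and bf_missing: (name, offset, size)
LAYOUT = [("bit1", 0, 1), ("bitpair", 1, 2), ("bit4", 3, 1)]


def extract_bits(value: Optional[int], offset: int, size: int) -> Union[bool, int, None]:
    """Return one bit range of an integer (hypothetical helper, not a pyodc function)."""
    if value is None:
        # A missing value stays missing
        return None
    # Shift the range down to bit 0, then mask off `size` bits
    field = (value >> offset) & ((1 << size) - 1)
    # Single bits become booleans, wider ranges stay unsigned integers
    return bool(field) if size == 1 else field


def decode_row(value: Optional[int]) -> dict:
    """Split one encoded integer into all of its named bit ranges."""
    return {name: extract_bits(value, offset, size) for name, offset, size in LAYOUT}


for value in (9, 6, 10, None):
    print(value, decode_row(value))
```

For the value 9 (binary 1001) this gives bit1=True and bitpair=0, matching row 1 of the decoded output.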
Decoding a Sequence of Frames
If ODB-2 data is extremely large, it is undesirable to attempt to decode it into memory in its entirety. Furthermore, if the frames within the file are not compatible, it may be a better idea to consider each of the frames separately.
By default, the read_odb() function returns an iterable sequence that lazily decodes ODB-2 frames as they are needed:
[8]:
for idx, df_decoded in enumerate(odc.read_odb('example-2.odb')):
    if idx > 0:
        print()
    print('Decoded data frame:', idx)
    print(df_decoded)
Decoded data frame: 0
expver date@hdr statid@hdr wigos@hdr obsvalue@body \
0 1 20210420 stat00 0-12345-0-67890 0.0000
1 1 20210420 stat01 0-12345-0-67891 12.3456
2 1 20210420 stat02 0-12345-0-67892 24.6912
3 1 20210420 stat03 0-12345-0-67893 37.0368
4 1 20210420 stat04 0-12345-0-67894 49.3824
5 1 20210420 stat05 0-12345-0-67895 61.7280
6 1 20210420 stat06 0-12345-0-67896 74.0736
7 1 20210420 stat07 0-12345-0-67897 86.4192
8 1 20210420 stat08 0-12345-0-67898 98.7648
9 1 20210420 stat09 0-12345-0-67899 111.1104
integer_missing double_missing bf_column bf_missing
0 1234.0 12.34 0 0.0
1 4321.0 43.21 9 9.0
2 NaN NaN 6 6.0
3 1234.0 12.34 10 10.0
4 4321.0 43.21 5 5.0
5 NaN NaN 7 NaN
6 1234.0 12.34 15 15.0
7 4321.0 43.21 0 0.0
8 NaN NaN 9 9.0
9 1234.0 12.34 6 6.0
Decoded data frame: 1
expver date@hdr statid@hdr obsvalue@body
0 2 20210420 stat00 0.0000
1 2 20210420 stat01 12.3456
2 2 20210420 stat02 24.6912
3 2 20210420 stat03 37.0368
4 2 20210420 stat04 49.3824
5 2 20210420 stat05 61.7280
6 2 20210420 stat06 74.0736
7 2 20210420 stat07 86.4192
8 2 20210420 stat08 98.7648
9 2 20210420 stat09 111.1104
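Because frames arrive one at a time, a statistic can be accumulated across a file without holding every frame in memory. The sketch below shows the pattern with a hypothetical helper, mean_over_frames, and a stand-in generator of dict-based frames so that it runs without any ODB-2 file. It assumes only that each frame supports frame[column] and that the column can be iterated, which holds for pandas data frames.

```python
def mean_over_frames(frames, column):
    """Average one column across an iterable of frames, one frame at a time."""
    total = 0.0
    count = 0
    for frame in frames:
        # Only the current frame is held here; earlier ones can be released
        values = list(frame[column])
        total += sum(values)
        count += len(values)
    return total / count if count else float("nan")


def fake_frames():
    """Stand-in generator of dict-based frames (illustrative data, not decoder output)."""
    for expver in (1, 2):
        yield {
            "expver": [expver] * 10,
            "obsvalue@body": [12.3456 * i for i in range(10)],
        }


print(mean_over_frames(fake_frames(), "obsvalue@body"))

# With pyodc installed, the lazy sequence could be passed in the same way:
# mean_over_frames(odc.read_odb('example-2.odb'), 'obsvalue@body')
```

Note that this sketch does not skip missing values, so a NaN in the column would propagate into the result.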
Aggregated or Non-aggregated Decoding
To page the data through memory without consuming more resources than are available, a sequence of frames may be considered to be one frame that has been split for technical reasons. The library is able to group these frames together into one logical, aggregated frame (and, indeed, it does this by default). Decoding aggregated logical frames in one step significantly improves the performance of the decoder when decoding is offloaded to odc.
Both Reader and read_odb() take two arguments:
aggregated - (default: True) enables or disables aggregation of compatible frames.
max_aggregated - (default: None) sets a maximum number of rows to be combined into one logical frame, before the library will split them anyway (for pagination purposes).
For example, first we encode a file that contains several real frames forming a smaller number of logical frames:
[9]:
df = pd.read_csv('data-1.csv')
df2 = pd.read_csv('data-2.csv')
with open('example-5.odb', 'wb') as f:
    odc.encode_odb(df, f, rows_per_frame=3)
    odc.encode_odb(df2, f, rows_per_frame=3)
Interrogation of the structure can be done by using two different readers:
[10]:
r5a = odc.Reader('example-5.odb')
r5b = odc.Reader('example-5.odb', aggregated=False)
print('aggregated row counts:', [f.nrows for f in r5a.frames])
print('separate row counts:', [f.nrows for f in r5b.frames])
aggregated row counts: [10, 10]
separate row counts: [3, 3, 3, 1, 3, 3, 3, 1]
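The row counts above can be reproduced with a simplified model of splitting and aggregation. This is a sketch under stated assumptions, not the library's algorithm: frames are treated as compatible only when their column names are identical, and max_aggregated is assumed to start a new logical frame whenever adding the next real frame would exceed the limit. The helper names split_rows and aggregate are hypothetical.

```python
def split_rows(nrows, rows_per_frame):
    """Row counts of the frames produced when nrows are written in fixed-size chunks."""
    return [min(rows_per_frame, nrows - start) for start in range(0, nrows, rows_per_frame)]


def aggregate(frames, aggregated=True, max_aggregated=None):
    """Group consecutive frames with identical columns (simplified model).

    frames is a list of (columns, nrows) tuples. The result is a list of row
    counts, one per logical frame.
    """
    if not aggregated:
        return [nrows for _, nrows in frames]
    groups = []  # each entry is [columns, accumulated row count]
    for columns, nrows in frames:
        can_merge = (
            groups
            and groups[-1][0] == columns
            # Assumed limit semantics: never let a group grow past max_aggregated
            and (max_aggregated is None or groups[-1][1] + nrows <= max_aggregated)
        )
        if can_merge:
            groups[-1][1] += nrows
        else:
            groups.append([columns, nrows])
    return [nrows for _, nrows in groups]


# Two tables of 10 rows each, written with rows_per_frame=3, with differing columns
cols_a = ("expver", "date@hdr", "statid@hdr", "wigos@hdr", "obsvalue@body")
cols_b = ("expver", "date@hdr", "statid@hdr", "obsvalue@body")
frames = [(cols_a, n) for n in split_rows(10, 3)] + [(cols_b, n) for n in split_rows(10, 3)]

print("aggregated row counts:", aggregate(frames))
print("separate row counts:", aggregate(frames, aggregated=False))
```

The two tables are never merged with each other in this model because their column sets differ, which is why two logical frames remain.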
By default, data is decoded in an aggregated fashion:
[11]:
for idx, df_decoded in enumerate(odc.read_odb('example-5.odb')):
    if idx > 0:
        print()
    print('Decoded data frame:', idx)
    print(df_decoded)
Decoded data frame: 0
expver date@hdr statid@hdr wigos@hdr obsvalue@body \
0 1 20210420 stat00 0-12345-0-67890 0.0000
1 1 20210420 stat01 0-12345-0-67891 12.3456
2 1 20210420 stat02 0-12345-0-67892 24.6912
3 1 20210420 stat03 0-12345-0-67893 37.0368
4 1 20210420 stat04 0-12345-0-67894 49.3824
5 1 20210420 stat05 0-12345-0-67895 61.7280
6 1 20210420 stat06 0-12345-0-67896 74.0736
7 1 20210420 stat07 0-12345-0-67897 86.4192
8 1 20210420 stat08 0-12345-0-67898 98.7648
9 1 20210420 stat09 0-12345- 111.1104
integer_missing double_missing bf_column bf_missing
0 1234.0 12.34 0 0.0
1 4321.0 43.21 9 9.0
2 NaN NaN 6 6.0
3 1234.0 12.34 10 10.0
4 4321.0 43.21 5 5.0
5 NaN NaN 7 NaN
6 1234.0 12.34 15 15.0
7 4321.0 43.21 0 0.0
8 NaN NaN 9 9.0
9 1234.0 12.34 6 6.0
Decoded data frame: 1
expver date@hdr statid@hdr obsvalue@body
0 2 20210420 stat00 0.0000
1 2 20210420 stat01 12.3456
2 2 20210420 stat02 24.6912
3 2 20210420 stat03 37.0368
4 2 20210420 stat04 49.3824
5 2 20210420 stat05 61.7280
6 2 20210420 stat06 74.0736
7 2 20210420 stat07 86.4192
8 2 20210420 stat08 98.7648
9 2 20210420 stat09 111.1104
However, the real frames can also be decoded separately:
[12]:
for idx, df_decoded in enumerate(odc.read_odb('example-5.odb', aggregated=False)):
    if idx > 0:
        print()
    print('Decoded data frame:', idx)
    print(df_decoded)
Decoded data frame: 0
expver date@hdr statid@hdr wigos@hdr obsvalue@body \
0 1 20210420 stat00 0-12345-0-67890 0.0000
1 1 20210420 stat01 0-12345-0-67891 12.3456
2 1 20210420 stat02 0-12345-0-67892 24.6912
integer_missing double_missing bf_column bf_missing
0 1234.0 12.34 0 0
1 4321.0 43.21 9 9
2 NaN NaN 6 6
Decoded data frame: 1
expver date@hdr statid@hdr wigos@hdr obsvalue@body \
0 1 20210420 stat03 0-12345-0-67893 37.0368
1 1 20210420 stat04 0-12345-0-67894 49.3824
2 1 20210420 stat05 0-12345-0-67895 61.7280
integer_missing double_missing bf_column bf_missing
0 1234.0 12.34 10 10.0
1 4321.0 43.21 5 5.0
2 NaN NaN 7 NaN
Decoded data frame: 2
expver date@hdr statid@hdr wigos@hdr obsvalue@body \
0 1 20210420 stat06 0-12345-0-67896 74.0736
1 1 20210420 stat07 0-12345-0-67897 86.4192
2 1 20210420 stat08 0-12345-0-67898 98.7648
integer_missing double_missing bf_column bf_missing
0 1234.0 12.34 15 15
1 4321.0 43.21 0 0
2 NaN NaN 9 9
Decoded data frame: 3
expver date@hdr statid@hdr wigos@hdr obsvalue@body integer_missing \
0 1 20210420 stat09 0-12345- 111.1104 1234
double_missing bf_column bf_missing
0 12.34 6 6
Decoded data frame: 4
expver date@hdr statid@hdr obsvalue@body
0 2 20210420 stat00 0.0000
1 2 20210420 stat01 12.3456
2 2 20210420 stat02 24.6912
Decoded data frame: 5
expver date@hdr statid@hdr obsvalue@body
0 2 20210420 stat03 37.0368
1 2 20210420 stat04 49.3824
2 2 20210420 stat05 61.7280
Decoded data frame: 6
expver date@hdr statid@hdr obsvalue@body
0 2 20210420 stat06 74.0736
1 2 20210420 stat07 86.4192
2 2 20210420 stat08 98.7648
Decoded data frame: 7
expver date@hdr statid@hdr obsvalue@body
0 2 20210420 stat09 111.1104