Data in Larch#
A larch.Model works with data in two stages: the datatree and the dataset.
The higher level of data organization is the datatree. This represents the
data in close to its “raw” form, as it is read in from the data file(s). This
can be a simple table of data that is entirely in idca (tall) or
idco (wide) format, or it can be a collection of related tables and datasets
(e.g. households, persons, tours, trips, skims). It is generally reasonable to
assemble whatever data sources you have into a single datatree object, and
then refer to this object when building a discrete choice Model with larch.
The lower level of data organization is the dataset. This is a single
xarray.Dataset that is used in model estimation and application. The
dataset is built from the datatree by selecting the data that is needed for
a particular model, and then transforming it into the format that is required
for the estimation or application of that model. All of this is generally done
automatically by larch, so most users will never need to see or interact with
the dataset object directly. Moreover, the dataset is generally not saved
to disk, and will be recreated from the datatree whenever it is needed, or
whenever the model structure is changed in any relevant way.
import pandas as pd
import larch as lx
Simple Datatrees#
The datatree at its simplest is initialized from a simple pandas DataFrame,
which can be in either idca (tall) or idco (wide) format. A simple datatree
like this can be created with the appropriate constructor.
Example idco data#
Here is a simple example of idco data, which is a table of data with one row per case, and one column per variable. This data can easily be read from a CSV and expressed as a pandas DataFrame with a simple one-level index.
data_co = pd.read_csv("example-data/tiny_idco.csv", index_col="caseid")
data_co
caseid | Income | CarTime | CarCost | BusTime | BusCost | WalkTime | Chosen |
---|---|---|---|---|---|---|---|
1 | 30000 | 30 | 150 | 40 | 100 | 20 | Car |
2 | 30000 | 25 | 125 | 35 | 100 | 0 | Bus |
3 | 40000 | 40 | 125 | 50 | 75 | 30 | Walk |
4 | 50000 | 15 | 225 | 20 | 150 | 10 | Walk |
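If the example CSV file is not at hand, the same table can be typed in directly. The following is only a sketch for following along: the name data_co_inline is made up for illustration, and the values are copied from the table above.

```python
import pandas as pd

# Hypothetical inline copy of the four example rows, so no CSV file is needed.
data_co_inline = pd.DataFrame(
    {
        "Income": [30000, 30000, 40000, 50000],
        "CarTime": [30, 25, 40, 15],
        "CarCost": [150, 125, 125, 225],
        "BusTime": [40, 35, 50, 20],
        "BusCost": [100, 100, 75, 150],
        "WalkTime": [20, 0, 30, 10],
        "Chosen": ["Car", "Bus", "Walk", "Walk"],
    },
    # A simple one-level index named after the case identifier.
    index=pd.Index([1, 2, 3, 4], name="caseid"),
)

# idco format means one row per case, so the index must not repeat.
assert data_co_inline.index.is_unique
```

The uniqueness check is a cheap way to confirm that a table really is in idco format before handing it on.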
Converting this data to a Dataset that can be used as a datatree is as simple as calling the from_idco constructor on the DataFrame.
tree_co = lx.Dataset.dc.from_idco(data_co)
tree_co
<xarray.Dataset> Size: 256B
Dimensions:   (caseid: 4)
Coordinates:
  * caseid    (caseid) int64 32B 1 2 3 4
Data variables:
    Income    (caseid) int64 32B 30000 30000 40000 50000
    CarTime   (caseid) int64 32B 30 25 40 15
    CarCost   (caseid) int64 32B 150 125 125 225
    BusTime   (caseid) int64 32B 40 35 50 20
    BusCost   (caseid) int64 32B 100 100 75 150
    WalkTime  (caseid) int64 32B 20 0 30 10
    Chosen    (caseid) object 32B 'Car' 'Bus' 'Walk' 'Walk'
Attributes:
    _caseid_:  caseid
Example idca data#
Here is a simple example of idca data, which is a table of data with one row per alternative for each case, and one column per variable. This data can easily be read from a CSV and expressed as a pandas DataFrame with a two-level MultiIndex, where the first level contains the case id and the second level contains the alternative id.
data_ca = pd.read_csv("example-data/tiny_idca.csv", index_col=["caseid", "altid"])
data_ca
caseid | altid | Income | Time | Cost | Chosen |
---|---|---|---|---|---|
1 | Car | 30000 | 30 | 150 | 1 |
1 | Bus | 30000 | 40 | 100 | 0 |
1 | Walk | 30000 | 20 | 0 | 0 |
2 | Car | 30000 | 25 | 125 | 0 |
2 | Bus | 30000 | 35 | 100 | 1 |
3 | Car | 40000 | 40 | 125 | 0 |
3 | Bus | 40000 | 50 | 75 | 0 |
3 | Walk | 40000 | 30 | 0 | 1 |
4 | Car | 50000 | 15 | 225 | 0 |
4 | Bus | 50000 | 20 | 150 | 0 |
4 | Walk | 50000 | 10 | 0 | 1 |
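To make the two-level index concrete, the table above can also be built by hand with plain pandas. This is a sketch only: data_ca_inline and alts_per_case are illustrative names, and the rows are copied from the table. Note that case 2 has no Walk row, so the number of rows differs between cases.

```python
import pandas as pd

# Hypothetical inline copy of the example rows: (caseid, altid, Income, Time, Cost, Chosen).
rows = [
    (1, "Car", 30000, 30, 150, 1),
    (1, "Bus", 30000, 40, 100, 0),
    (1, "Walk", 30000, 20, 0, 0),
    (2, "Car", 30000, 25, 125, 0),
    (2, "Bus", 30000, 35, 100, 1),
    (3, "Car", 40000, 40, 125, 0),
    (3, "Bus", 40000, 50, 75, 0),
    (3, "Walk", 40000, 30, 0, 1),
    (4, "Car", 50000, 15, 225, 0),
    (4, "Bus", 50000, 20, 150, 0),
    (4, "Walk", 50000, 10, 0, 1),
]
data_ca_inline = pd.DataFrame(
    rows, columns=["caseid", "altid", "Income", "Time", "Cost", "Chosen"]
).set_index(["caseid", "altid"])  # first level: case id, second level: alternative id

# Count the alternatives that appear for each case.
alts_per_case = data_ca_inline.groupby(level="caseid").size()

# Each case should have exactly one chosen alternative.
assert data_ca_inline.groupby(level="caseid")["Chosen"].sum().eq(1).all()
```

Here alts_per_case shows three alternatives for cases 1, 3, and 4, but only two for case 2.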
As long as the DataFrame has a MultiIndex with two levels as described,
the from_idca constructor can be used to convert it to a Dataset that
can be used as a datatree for a larch.Model.
tree_ca = lx.Dataset.dc.from_idca(data_ca)
tree_ca
<xarray.Dataset> Size: 412B
Dimensions:    (caseid: 4, altid: 3)
Coordinates:
  * caseid     (caseid) int64 32B 1 2 3 4
  * altid      (altid) int64 24B 1 2 3
    alt_names  (altid) object 24B 'Bus' 'Car' 'Walk'
Data variables:
    Income     (caseid) int64 32B 30000 30000 40000 50000
    Time       (caseid, altid) int64 96B 40 30 20 35 25 ... 40 30 20 15 10
    Cost       (caseid, altid) int64 96B 100 150 0 100 125 ... 125 0 150 225 0
    Chosen     (caseid, altid) int64 96B 0 1 0 1 0 ... 0 1 0 0 1
    _avail_    (caseid, altid) int8 12B 1 1 1 1 1 0 1 1 1 1 1 1
Attributes:
    _caseid_:  caseid
    _altid_:   altid
You may have noticed in the result shown above that the from_idca constructor
does not simply transform the DataFrame into a Dataset. In addition to the
transformation, from_idca also analyzed the data and determined
that the “Income” variable has no variation across alternatives, and so it was
collapsed into an idco variable, indexed by caseid alone.
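The idea behind this check can be illustrated with plain pandas. The sketch below is not larch's actual implementation, only one assumed way to find columns that take a single value within every case. The names sample, distinct_per_case, and idco_candidates are hypothetical, and the data is the first two cases of the example.

```python
import pandas as pd

# Hypothetical sample: the first two cases of the idca example.
sample = pd.DataFrame(
    {
        "caseid": [1, 1, 1, 2, 2],
        "altid": ["Car", "Bus", "Walk", "Car", "Bus"],
        "Income": [30000, 30000, 30000, 30000, 30000],
        "Time": [30, 40, 20, 25, 35],
        "Cost": [150, 100, 0, 125, 100],
    }
).set_index(["caseid", "altid"])

# Count the distinct values of each column within each case.
distinct_per_case = sample.groupby(level="caseid").nunique()

# A column varies only across cases if it never has more than one
# distinct value inside any single case.
constant_within_case = (distinct_per_case <= 1).all()
idco_candidates = constant_within_case[constant_within_case].index.tolist()
```

In this sample only Income qualifies, because Time and Cost differ between the alternatives of each case.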