Data in Larch

A larch.Model works with data at two levels: the datatree and the dataset. The datatree is the higher level of data organization. It represents the data in close to its “raw” form, as it is read in from the data file(s). It can be a single table that is entirely in idca (tall) or idco (wide) format, or it can be a collection of related tables and datasets (e.g. households, persons, tours, trips, skims). It is generally reasonable to assemble whatever data sources you have into a single datatree object, and then refer to that object when building a discrete choice Model with larch.

The lower level of data organization is the dataset. This is a single xarray.Dataset that is used in model estimation and application. The dataset is built from the datatree by selecting the data needed for a particular model and transforming it into the format required to estimate or apply that model. All of this is generally done automatically by larch, so most users will never need to see or interact with the dataset object directly. Moreover, the dataset is generally not saved to disk; it is recreated from the datatree whenever it is needed, or whenever the model structure changes in any relevant way.

import pandas as pd

import larch as lx

Simple Datatrees

At its simplest, the datatree is initialized from a single pandas DataFrame, which can be in either idca (tall) or idco (wide) format. A simple datatree like this can be created with the appropriate constructor.

Example idco data

Here is a simple example of idco data, which is a table of data with one row per case, and one column per variable. This data can easily be read from a CSV and expressed as a pandas DataFrame with a simple one-level index.

data_co = pd.read_csv("example-data/tiny_idco.csv", index_col="caseid")
data_co
Income CarTime CarCost BusTime BusCost WalkTime Chosen
caseid
1 30000 30 150 40 100 20 Car
2 30000 25 125 35 100 0 Bus
3 40000 40 125 50 75 30 Walk
4 50000 15 225 20 150 10 Walk

Converting this data to a Dataset that can be used as a datatree is as simple as calling the from_idco constructor on the DataFrame.

tree_co = lx.Dataset.dc.from_idco(data_co)
tree_co
<xarray.Dataset> Size: 256B
Dimensions:   (caseid: 4)
Coordinates:
  * caseid    (caseid) int64 32B 1 2 3 4
Data variables:
    Income    (caseid) int64 32B 30000 30000 40000 50000
    CarTime   (caseid) int64 32B 30 25 40 15
    CarCost   (caseid) int64 32B 150 125 125 225
    BusTime   (caseid) int64 32B 40 35 50 20
    BusCost   (caseid) int64 32B 100 100 75 150
    WalkTime  (caseid) int64 32B 20 0 30 10
    Chosen    (caseid) object 32B 'Car' 'Bus' 'Walk' 'Walk'
Attributes:
    _caseid_:  caseid
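
To make the relationship between the wide and tall layouts concrete, here is a minimal pandas sketch that reshapes the wide table above into a tall one. It is only an illustration, not something larch requires or does internally. The `alts` list, the reliance on the column-name pattern (`CarTime`, `BusCost`, and so on), and treating the missing `WalkCost` column as zero are all assumptions made for this example. The table is rebuilt inline so the snippet runs without the CSV file.

```python
import pandas as pd

# The same four cases as in the table above, built inline (no CSV needed).
data_co = pd.DataFrame(
    {
        "Income": [30000, 30000, 40000, 50000],
        "CarTime": [30, 25, 40, 15],
        "CarCost": [150, 125, 125, 225],
        "BusTime": [40, 35, 50, 20],
        "BusCost": [100, 100, 75, 150],
        "WalkTime": [20, 0, 30, 10],
        "Chosen": ["Car", "Bus", "Walk", "Walk"],
    },
    index=pd.Index([1, 2, 3, 4], name="caseid"),
)

# Assumption: alternative names are the prefixes of the wide column names.
alts = ["Car", "Bus", "Walk"]

pieces = []
for alt in alts:
    cost_col = f"{alt}Cost"
    piece = pd.DataFrame(
        {
            "Income": data_co["Income"],
            "Time": data_co[f"{alt}Time"],
            # Assumption: an alternative without a cost column (Walk) costs 0.
            "Cost": data_co[cost_col] if cost_col in data_co else 0,
            # 1 on the row of the chosen alternative, 0 elsewhere.
            "Chosen": (data_co["Chosen"] == alt).astype(int),
        }
    )
    piece["altid"] = alt
    pieces.append(piece)

# One row per (caseid, altid) pair, indexed by a two-level MultiIndex.
data_tall = pd.concat(pieces).set_index("altid", append=True).sort_index()
print(data_tall)
```

This naive reshape produces a row for every combination of case and alternative (12 rows), including Walk for case 2, where the wide table records a WalkTime of 0.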

Example idca data

Here is a simple example of idca data, which is a table of data with one row per alternative for each case, and one column per variable. This data can easily be read from a CSV and expressed as a pandas DataFrame with a two-level MultiIndex, where the first level contains the case id and the second level contains the alternative id.

data_ca = pd.read_csv("example-data/tiny_idca.csv", index_col=["caseid", "altid"])
data_ca
Income Time Cost Chosen
caseid altid
1 Car 30000 30 150 1
Bus 30000 40 100 0
Walk 30000 20 0 0
2 Car 30000 25 125 0
Bus 30000 35 100 1
3 Car 40000 40 125 0
Bus 40000 50 75 0
Walk 40000 30 0 1
4 Car 50000 15 225 0
Bus 50000 20 150 0
Walk 50000 10 0 1
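
If the CSV file is not at hand, the same table can be built inline. The sketch below recreates the rows shown above and runs a few sanity checks on the index before the frame is handed to a constructor. The checks (`has_two_levels`, `index_is_unique`, `one_choice_per_case`) are hypothetical helpers for this example, not validation that larch is documented to perform.

```python
import pandas as pd

# Rows copied from the table above: (caseid, altid, Income, Time, Cost, Chosen).
rows = [
    (1, "Car", 30000, 30, 150, 1),
    (1, "Bus", 30000, 40, 100, 0),
    (1, "Walk", 30000, 20, 0, 0),
    (2, "Car", 30000, 25, 125, 0),
    (2, "Bus", 30000, 35, 100, 1),
    (3, "Car", 40000, 40, 125, 0),
    (3, "Bus", 40000, 50, 75, 0),
    (3, "Walk", 40000, 30, 0, 1),
    (4, "Car", 50000, 15, 225, 0),
    (4, "Bus", 50000, 20, 150, 0),
    (4, "Walk", 50000, 10, 0, 1),
]
columns = ["caseid", "altid", "Income", "Time", "Cost", "Chosen"]

# Case id is the first index level, alternative id the second.
data_ca = pd.DataFrame(rows, columns=columns).set_index(["caseid", "altid"])

# Sanity checks (assumptions about what well-formed idca data looks like).
has_two_levels = data_ca.index.nlevels == 2
index_is_unique = data_ca.index.is_unique
chosen_per_case = data_ca.groupby(level="caseid")["Chosen"].sum()
one_choice_per_case = bool((chosen_per_case == 1).all())

print(has_two_levels, index_is_unique, one_choice_per_case)
```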

As long as the DataFrame has a MultiIndex with two levels as described, the from_idca constructor can be used to convert it to a Dataset that can be used as a datatree for a larch.Model.

tree_ca = lx.Dataset.dc.from_idca(data_ca)
tree_ca
<xarray.Dataset> Size: 412B
Dimensions:    (caseid: 4, altid: 3)
Coordinates:
  * caseid     (caseid) int64 32B 1 2 3 4
  * altid      (altid) int64 24B 1 2 3
    alt_names  (altid) object 24B 'Bus' 'Car' 'Walk'
Data variables:
    Income     (caseid) int64 32B 30000 30000 40000 50000
    Time       (caseid, altid) int64 96B 40 30 20 35 25 ... 40 30 20 15 10
    Cost       (caseid, altid) int64 96B 100 150 0 100 125 ... 125 0 150 225 0
    Chosen     (caseid, altid) int64 96B 0 1 0 1 0 ... 0 1 0 0 1
    _avail_    (caseid, altid) int8 12B 1 1 1 1 1 0 1 1 1 1 1 1
Attributes:
    _caseid_:  caseid
    _altid_:   altid

You may have noticed in the result shown above that the from_idca constructor does not simply transform the DataFrame into a Dataset. In addition to the transformation, from_idca also analyzed the data and determined that the “Income” variable has no variation across alternatives, and so it was collapsed into an idco variable, indexed by caseid alone.
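
The kind of analysis described here can be approximated in plain pandas. The sketch below is not larch's implementation, which is not shown in this section. It is one assumed way to find columns that are constant within every case, and to derive an availability mask like the `_avail_` variable in the output above from the rows that are present. The names `idco_like`, `income_by_case`, and `avail` are introduced only for this example.

```python
import pandas as pd

# The idca example table, rebuilt inline: (caseid, altid, Income, Time, Cost, Chosen).
rows = [
    (1, "Car", 30000, 30, 150, 1),
    (1, "Bus", 30000, 40, 100, 0),
    (1, "Walk", 30000, 20, 0, 0),
    (2, "Car", 30000, 25, 125, 0),
    (2, "Bus", 30000, 35, 100, 1),
    (3, "Car", 40000, 40, 125, 0),
    (3, "Bus", 40000, 50, 75, 0),
    (3, "Walk", 40000, 30, 0, 1),
    (4, "Car", 50000, 15, 225, 0),
    (4, "Bus", 50000, 20, 150, 0),
    (4, "Walk", 50000, 10, 0, 1),
]
columns = ["caseid", "altid", "Income", "Time", "Cost", "Chosen"]
data_ca = pd.DataFrame(rows, columns=columns).set_index(["caseid", "altid"])

# Count the distinct values of each column within each case.
distinct_per_case = data_ca.groupby(level="caseid").nunique()

# A column has no variation across alternatives if every case has one value.
constant_within_case = (distinct_per_case <= 1).all()
idco_like = constant_within_case[constant_within_case].index.tolist()
print(idco_like)  # → ['Income']

# Such a column can be stored once per case instead of once per row.
income_by_case = data_ca["Income"].groupby(level="caseid").first()

# Availability: 1 where a (caseid, altid) row exists, 0 where it is missing.
avail = data_ca["Time"].unstack("altid").notna().astype("int8")
print(avail)
```

Case 2 has no Walk row in the table, so its entry in `avail` is 0, which matches the single 0 in the `_avail_` variable shown earlier.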