Data in Larch#

A larch.Model works with data in two stages: the datatree and the dataset. The higher level of data organization is the datatree. This represents the data in close to its “raw” form, as it is read in from the data file(s). This can be a simple table of data that is entirely in idca (tall) or idco (wide) format, or it can be a collection of related tables and datasets (e.g. households, persons, tours, trips, skims, etc.). It is generally reasonable to assemble whatever data sources you have into a single datatree object, and then reference against this object when building a discrete choice Model with larch.

The lower level of data organization is the dataset. This is a single xarray.Dataset that is used in model estimation and application. The dataset is built from the datatree by selecting the data that is needed for a particular model, and then transforming it into the format that is required for the estimation or application of that model. All of this is generally done automatically by larch, so most users will never need to see or interact with the dataset object directly. Moreover, the dataset is generally not saved to disk, and will be recreated from the datatree whenever it is needed, or whenever the model structure is changed in any relevant way.

import pandas as pd

import larch as lx

Simple Datatrees#

The datatree at its simplest is initialized from a simple pandas DataFrame, which can be in either idca (tall) or idco (wide) format. A simple datatree like this can be created with the appropriate constructor.

Example idco data#

Here is a simple example of idco data, which is a table of data with one row per case, and one column per variable. This data can easily be read from a CSV and expressed as a pandas DataFrame with a simple one-level index.

data_co = pd.read_csv("example-data/tiny_idco.csv", index_col="caseid")
data_co

	Income	CarTime	CarCost	BusTime	BusCost	WalkTime	Chosen
caseid
1	30000	30	150	40	100	20	Car
2	30000	25	125	35	100	0	Bus
3	40000	40	125	50	75	30	Walk
4	50000	15	225	20	150	10	Walk

Converting this data to a Dataset that can be used as a datatree is as simple as calling the from_idco constructor on the DataFrame.

Example idca data#

Here is a simple example of idca data, which is a table of data with one row per alternative, and one column per variable. This data can easily be read from a CSV and expressed as a pandas DataFrame with a two-level MultiIndex, where the first level contains the case id and the second level contains the alternative id.

data_ca = pd.read_csv("example-data/tiny_idca.csv", index_col=["caseid", "altid"])
data_ca

		Income	Time	Cost	Chosen
caseid	altid
1	Car	30000	30	150	1
	Bus	30000	40	100	0
	Walk	30000	20	0	0
2	Car	30000	25	125	0
	Bus	30000	35	100	1
3	Car	40000	40	125	0
	Bus	40000	50	75	0
	Walk	40000	30	0	1
4	Car	50000	15	225	0
	Bus	50000	20	150	0
	Walk	50000	10	0	1

As long as the DataFrame has a MultiIndex with two levels as described, the from_idca constructor can be used to convert it to a Dataset that can be used as a datatree for a larch.Model.

Note that the from_idca constructor does not simply transform the DataFrame into a Dataset. In addition to the transformation, from_idca also analyzes the data. In this example, it determines that the “Income” variable has no variation across alternatives, and so collapses it into an idco variable.
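To make the idea of "no variation across alternatives" concrete, here is a minimal pandas sketch of that kind of check, applied to the tall example table rebuilt inline. It only illustrates the concept and is not a description of larch's internal logic. The names `distinct_per_case`, `idco_like`, `idca_like`, and `income_co` are made up for this example.

```python
import pandas as pd

# The tall (idca) example table, rebuilt inline.
index = pd.MultiIndex.from_tuples(
    [
        (1, "Car"), (1, "Bus"), (1, "Walk"),
        (2, "Car"), (2, "Bus"),
        (3, "Car"), (3, "Bus"), (3, "Walk"),
        (4, "Car"), (4, "Bus"), (4, "Walk"),
    ],
    names=["caseid", "altid"],
)
data_ca = pd.DataFrame(
    {
        "Income": [30000] * 5 + [40000] * 3 + [50000] * 3,
        "Time": [30, 40, 20, 25, 35, 40, 50, 30, 15, 20, 10],
        "Cost": [150, 100, 0, 125, 100, 125, 75, 0, 225, 150, 0],
        "Chosen": [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1],
    },
    index=index,
)

# Count the distinct values of each variable within each case.
distinct_per_case = data_ca.groupby(level="caseid").nunique()

# A variable with exactly one distinct value in every case does not vary
# across alternatives, so it can be stored once per case (idco).
constant_within_case = (distinct_per_case == 1).all()
idco_like = constant_within_case[constant_within_case].index.tolist()
idca_like = constant_within_case[~constant_within_case].index.tolist()

# Collapse the constant variable to one value per case.
income_co = data_ca["Income"].groupby(level="caseid").first()
```

Here `idco_like` contains only "Income", while "Time", "Cost", and "Chosen" remain in `idca_like` because they differ between alternatives within at least one case.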
