Data for Discrete Choice#

import numpy as np
import pandas as pd
import xarray as xr

import larch as lx

This discussion of data for discrete choice modeling is broken into two parts. The first part is a review of the abstract fundamentals, and is recommended for users new to discrete choice modeling.

If you are already familiar with the various kinds of discrete choice data (case-only and case-alt, or wide and tall), you probably want to skip ahead to the second section, which focuses on the practical implementation of these formats when used in Larch.

Fundamental Data Formats#

When working with discrete choice models in Larch, we will generally receive input data in one of two basic formats: the case-only (“idco”) format or the case-alternative (“idca”) format.

These are sometimes referred to as IDCase (each record contains all the information for the mode choice over alternatives for a single trip) or IDCase-IDAlt (each record contains all the information for a single alternative available to each decision maker, so there is one record for each alternative for each choice).

idco Format#

In the idco case-only format, each record provides all the relevant information about an individual choice, including the variables related to the decision maker or the choice itself, as well as alternative-related variables for all available alternatives, and a variable indicating which alternative was chosen. This style of data has a variety of names in the choice modeling literature, including “IDCase”, “case-only”, and “wide”.

data_co = pd.read_csv("example-data/tiny_idco.csv", index_col="caseid")
        Income  CarTime  CarCost  BusTime  BusCost  WalkTime Chosen
caseid
1        30000       30      150       40      100        20    Car
2        30000       25      125       35      100         0    Bus
3        40000       40      125       50       75        30   Walk
4        50000       15      225       20      150        10   Walk

idca Format#

In the idca case-alternative format, each record can include information on the variables related to the decision maker or the choice itself, the attributes of that particular alternative, and a choice variable that indicates whether the alternative was or was not chosen. This style of data has a variety of names in the choice modeling literature, including “IDCase-IDAlt”, “case-alternative”, and “tall”.

data_ca = pd.read_csv("example-data/tiny_idca.csv")
    caseid altid  Income  Time  Cost  Chosen
0        1   Car   30000    30   150       1
1        1   Bus   30000    40   100       0
2        1  Walk   30000    20     0       0
3        2   Car   30000    25   125       0
4        2   Bus   30000    35   100       1
5        3   Car   40000    40   125       0
6        3   Bus   40000    50    75       0
7        3  Walk   40000    30     0       1
8        4   Car   50000    15   225       0
9        4   Bus   50000    20   150       0
10       4  Walk   50000    10     0       1

Sparse vs Dense#

The idca format actually has two technical variations, a sparse version and a dense version. The table shown above is a sparse version, where any alternative that is not available is simply missing from the data table. Thus, caseid 2 above has only 2 rows, not 3. By dropping these rows, this data storage is potentially more efficient than the dense version. But in cases where the number of missing alternatives is manageably small (certainly less than half of all the data), it can be much more computationally efficient to simply store and work with the dense array.

In Larch, these two distinct sub-types of idca data are distinguished by label: the dense version is labeled idca and the sparse version is labeled idce.
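
A quick way to see the sparsity in the table above is to count the rows per case; caseid 2 shows only two, because the Walk alternative is unavailable there:

data_ca.groupby("caseid").size()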

Data Conversion#

Converting between idca and idco format data in Python can be super easy, if the alternative ids are stored appropriately in a two-level MultiIndex. In that case, we can simply stack or unstack the DataFrame to change formats. This is typically more readily done when moving from idca to idco format, as the alternative ids usually appear in a column of the DataFrame that can be used for indexing.

data_ca.set_index(["caseid", "altid"]).unstack()
         Income                     Time               Cost                Chosen
altid       Bus      Car     Walk    Bus   Car  Walk    Bus    Car  Walk    Bus  Car Walk
caseid
1       30000.0  30000.0  30000.0   40.0  30.0  20.0  100.0  150.0   0.0    0.0  1.0  0.0
2       30000.0  30000.0      NaN   35.0  25.0   NaN  100.0  125.0   NaN    1.0  0.0  NaN
3       40000.0  40000.0  40000.0   50.0  40.0  30.0   75.0  125.0   0.0    0.0  0.0  1.0
4       50000.0  50000.0  50000.0   20.0  15.0  10.0  150.0  225.0   0.0    0.0  0.0  1.0
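
Stacking reverses this operation, recovering the tall idca arrangement from the wide table. A brief sketch (future_stack=True is the pandas 2.1+ stacking behavior, which keeps rather than drops all-NaN rows, so the unavailable Walk alternative in caseid 2 comes back as a row of NaNs):

wide = data_ca.set_index(["caseid", "altid"]).unstack()
wide.stack(future_stack=True)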

Getting our original idco data into idca format is not so clean, as there’s no analogous set_columns method in pandas, and even if there were, the alternative codes are not typically neatly arranged in a row of data. We can force it to work, but it’s not pretty.

forced_ca = data_co.T.set_index(
    pd.MultiIndex.from_tuples(
        [
            ["Car", "Income"],
            ["Car", "Time"],
            ["Car", "Cost"],
            ["Bus", "Time"],
            ["Bus", "Cost"],
            ["Walk", "Time"],
            ["Car", "Chosen"],
        ],
        names=("alt", "var"),
    )
).T.stack(0, future_stack=True)
# broadcast the case-level values (Chosen, Income) to every alternative;
# infer_objects avoids a pandas FutureWarning about silent downcasting
forced_ca[["Chosen", "Income"]] = (
    forced_ca[["Chosen", "Income"]]
    .groupby("caseid")
    .transform(lambda x: x.fillna(x.value_counts().index[0]).infer_objects(copy=False))
)
forced_ca["Chosen"] = (
    forced_ca["Chosen"] == forced_ca.index.get_level_values("alt")
).astype(float)
forced_ca
var          Income  Time   Cost  Chosen
caseid alt
1      Bus    30000    40    100     0.0
       Car    30000    30    150     1.0
       Walk   30000    20    NaN     0.0
2      Bus    30000    35    100     1.0
       Car    30000    25    125     0.0
       Walk   30000     0    NaN     0.0
3      Bus    40000    50     75     0.0
       Car    40000    40    125     0.0
       Walk   40000    30    NaN     1.0
4      Bus    50000    20    150     0.0
       Car    50000    15    225     0.0
       Walk   50000    10    NaN     1.0
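
An arguably tidier route, sketched below, is to first rename the idco columns into a stub-plus-suffix pattern that pd.wide_to_long can parse (the renaming scheme is our own invention here); the missing Cost_Walk column simply comes through as NaN:

renamed = data_co.rename(
    columns={
        "CarTime": "Time_Car",
        "CarCost": "Cost_Car",
        "BusTime": "Time_Bus",
        "BusCost": "Cost_Bus",
        "WalkTime": "Time_Walk",
    }
).reset_index()

# melt the Time_* and Cost_* columns into a (caseid, altid) tall format;
# non-stub columns like Income and Chosen are carried along unchanged
pd.wide_to_long(
    renamed, stubnames=["Time", "Cost"], i="caseid", j="altid",
    sep="_", suffix=r"\w+",
)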

Practical Data Formatting in Larch#

The data formats described above are relevant when storing data in a tabular (two-dimensional) format. This is quite common and generally expected, especially for data exchange between most software tools, but Larch doesn’t require you to choose one or the other.

Instead, Larch uses a Dataset structure, based on xarray, to store and use a collection of relevant variables, where each variable can be stored in either idco or idca format, as appropriate.

dataset = lx.dataset.merge(
    [
        data_co[["Income", "Chosen"]].to_xarray(),
        data_ca.set_index(["caseid", "altid"])[["Time", "Cost"]].to_xarray(),
    ],
    caseid="caseid",
    alts="altid",
)
dataset
<xarray.Dataset> Size: 312B
Dimensions:  (caseid: 4, altid: 3)
Coordinates:
  * caseid   (caseid) int64 32B 1 2 3 4
  * altid    (altid) object 24B 'Bus' 'Car' 'Walk'
Data variables:
    Income   (caseid) int64 32B 30000 30000 40000 50000
    Chosen   (caseid) object 32B 'Car' 'Bus' 'Walk' 'Walk'
    Time     (caseid, altid) float64 96B 40.0 30.0 20.0 35.0 ... 20.0 15.0 10.0
    Cost     (caseid, altid) float64 96B 100.0 150.0 0.0 ... 150.0 225.0 0.0
Attributes:
    _caseid_:  caseid
    _altid_:   altid

As we saw above, it’s quite easy to move from idca to idco format, and Larch can apply those transformations automatically when loading idca data, using the Dataset.construct.from_idca method.
In the example below, note that the Income variable has automatically been collapsed to idco, while the other variables remain as idca.

lx.Dataset.construct.from_idca(
    data_ca.set_index(["caseid", "altid"]),
)
<xarray.Dataset> Size: 412B
Dimensions:    (caseid: 4, altid: 3)
Coordinates:
  * caseid     (caseid) int64 32B 1 2 3 4
  * altid      (altid) int64 24B 1 2 3
    alt_names  (altid) object 24B 'Bus' 'Car' 'Walk'
Data variables:
    Income     (caseid) int64 32B 30000 30000 40000 50000
    Time       (caseid, altid) int64 96B 40 30 20 35 25 ... 40 30 20 15 10
    Cost       (caseid, altid) int64 96B 100 150 0 100 125 ... 125 0 150 225 0
    Chosen     (caseid, altid) int64 96B 0 1 0 1 0 ... 0 1 0 0 1
    _avail_    (caseid, altid) int8 12B 1 1 1 1 1 0 1 1 1 1 1 1
Attributes:
    _caseid_:  caseid
    _altid_:   altid

Loading data in sparse format is as easy as swapping out from_idca for from_idce. The resulting dataset will have a similar collection of variables, but each idca variable is stored in a one-dimensional array, using a variant of the compressed sparse row data format.

lx.Dataset.construct.from_idce(
    data_ca.set_index(["caseid", "altid"]),
)
<xarray.Dataset> Size: 403B
Dimensions:    (caseid: 4, _casealt_: 11, altid: 3, _caseptr_: 5)
Coordinates:
  * caseid     (caseid) int64 32B 1 2 3 4
  * altid      (altid) object 24B 'Bus' 'Car' 'Walk'
    alt_idx    (_casealt_) int8 11B 1 0 2 1 0 1 0 2 1 0 2
  * _caseptr_  (_caseptr_) int64 40B 0 3 5 8 11
Dimensions without coordinates: _casealt_
Data variables:
    Income     (caseid) int64 32B 30000 30000 40000 50000
    Time       (_casealt_) int64 88B 30 40 20 25 35 40 50 30 15 20 10
    Cost       (_casealt_) int64 88B 150 100 0 125 100 125 75 0 225 150 0
    Chosen     (_casealt_) int64 88B 1 0 0 0 1 0 0 1 0 0 1
Attributes:
    _exclude_dims_:  ('caseid', 'altid', '_caseptr_')
    _caseid_:        caseid
    _altid_:         altid
    _casealt_:       _casealt_
    _alt_idx_:       alt_idx
    _caseptr_:       _caseptr_
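
The _caseptr_ values work like the row pointer of a standard compressed sparse row matrix: the idca rows belonging to case i occupy the slice from _caseptr_[i] to _caseptr_[i+1]. A small numpy sketch using the values shown above:

caseptr = np.array([0, 3, 5, 8, 11])
time = np.array([30, 40, 20, 25, 35, 40, 50, 30, 15, 20, 10])

# rows for the second case (caseid 2) are time[3:5], i.e. Car 25 and Bus 35
i = 1
time[caseptr[i] : caseptr[i + 1]]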

Data Encoding#

For the most part, data used in the utility functions of discrete choice models enters the utility function as part of a linear-in-parameters function. That is, we have some “data” that expresses an attribute of some part of the transportation system as a number, we multiply that by some numerical parameter that will be estimated, and we sum the total over all the data-times-parameter operations. This kind of structure is linear algebra, and it’s something computers can do super fast, as long as all the data and all the parameters are queued up in memory in the right formats. So it is typically optimal to pre-compute the “data” part of the process into one large contiguous array of floating point values, regardless of whether the values otherwise seem to be binary or integers. Most tools, including Larch, will do much of this work for you, so you don’t need to worry about it too much.
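
To make the linear-in-parameters idea concrete, here is a minimal sketch computing the systematic utility of every alternative in every case as a single matrix product (the beta values are invented purely for illustration):

# stack Time and Cost into a (cases, alternatives, variables) array
X = np.stack([dataset["Time"].values, dataset["Cost"].values], axis=-1)

# hypothetical parameters on Time and Cost
beta = np.array([-0.05, -0.01])

# V[case, alt] = sum over variables of data times parameter;
# the unavailable Walk alternative in caseid 2 propagates as NaN
V = X @ beta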

There are two notable exceptions to this guideline:

  • choices: the data that represents the observed choices, which are inherently categorical

  • availability: data that represents the availability of each choice, which is inherently boolean

Categorical Encoding#

When we are looking at discrete choices, it is natural to employ a categorical data type for at least the “choice” data itself, if not for other columns as well. Pandas can convert columns to categorical data simply by assigning the type “category”.

choices = data_co["Chosen"].astype("category")
choices
caseid
1     Car
2     Bus
3    Walk
4    Walk
Name: Chosen, dtype: category
Categories (3, object): ['Bus', 'Car', 'Walk']

Once we have categorical data, if we like we can work with the underlying code values instead of the original raw data.

choices.cat.codes
caseid
1    1
2    0
3    2
4    2
dtype: int8

The cat.categories attribute contains the array of values matching each of the codes.

choices.cat.categories
Index(['Bus', 'Car', 'Walk'], dtype='object')
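
If you have the codes and the categories separately (say, from a file that stored only the integer codes), pd.Categorical.from_codes reverses the mapping; a brief sketch using the values shown above:

pd.Categorical.from_codes([1, 0, 2, 2], categories=["Bus", "Car", "Walk"])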

When using astype("category"), there’s no control over the ordering of the categories. If we want to control the apparent order (e.g., we already have codes defined elsewhere such that Car is 1, Bus is 2, and Walk is 3), then we can explicitly set the category value positions using pd.CategoricalDtype instead of "category". Note that the cat.codes numbers used internally by categoricals start with zero, as is standard in Python, so if you want codes to start with 1 you need to include a dummy placeholder for zero.

choices1 = data_co["Chosen"].astype(pd.CategoricalDtype(["_", "Car", "Bus", "Walk"]))
choices1
caseid
1     Car
2     Bus
3    Walk
4    Walk
Name: Chosen, dtype: category
Categories (4, object): ['_', 'Car', 'Bus', 'Walk']
choices1.cat.codes
caseid
1    1
2    2
3    3
4    3
dtype: int8

To be clear, by asserting the placement ordering of alternatives like this, we are not simultaneously asserting that the alternatives are ordinal. Put another way, we are forcing Car to be coded as 1 and Bus to be coded as 2, but we are not saying that Car is less than Bus. Pandas categoricals can express a true ordinal relationship as well, by adding ordered=True to the CategoricalDtype.

pd.CategoricalDtype(["NoCars", "1Car", "2Cars", "3+Cars"], ordered=True)
CategoricalDtype(categories=['NoCars', '1Car', '2Cars', '3+Cars'], ordered=True, categories_dtype=object)
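
With an ordered dtype, comparisons respect the declared category order rather than alphabetical order; a brief sketch:

cars = pd.Series(["NoCars", "3+Cars", "1Car"]).astype(
    pd.CategoricalDtype(["NoCars", "1Car", "2Cars", "3+Cars"], ordered=True)
)
cars < "2Cars"  # True, False, True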

One-Hot Encoding#

One-hot encoding, also known as creating dummy variables, is the creation of a separate binary-valued column for every categorical value. We can convert a categorical data column into a set of one-hot encoded columns using the get_dummies function.

pd.get_dummies(choices)
          Bus    Car   Walk
caseid
1       False   True  False
2        True  False  False
3       False  False   True
4       False  False   True

It’s not required to have first converted the data to a categorical data type.

pd.get_dummies(data_co["Chosen"])
          Bus    Car   Walk
caseid
1       False   True  False
2        True  False  False
3       False  False   True
4       False  False   True

Encoding with xarray#

The xarray library doesn’t use formal “categorical” datatypes, but we can still use the get_dummies function to explode choice and availability data as needed.

dataset["Chosen_ca"] = lx.DataArray(
    pd.get_dummies(data_co["Chosen"]).rename_axis(columns="altid")
)
dataset
<xarray.Dataset> Size: 324B
Dimensions:    (caseid: 4, altid: 3)
Coordinates:
  * caseid     (caseid) int64 32B 1 2 3 4
  * altid      (altid) object 24B 'Bus' 'Car' 'Walk'
Data variables:
    Income     (caseid) int64 32B 30000 30000 40000 50000
    Chosen     (caseid) object 32B 'Car' 'Bus' 'Walk' 'Walk'
    Time       (caseid, altid) float64 96B 40.0 30.0 20.0 ... 20.0 15.0 10.0
    Cost       (caseid, altid) float64 96B 100.0 150.0 0.0 ... 150.0 225.0 0.0
    Chosen_ca  (caseid, altid) bool 12B False True False ... False False True
Attributes:
    _caseid_:  caseid
    _altid_:   altid
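
Availability data can be attached the same way. As a sketch (the variable name Avail_ca is our own choice here), we can mark which alternatives actually appear in the sparse idca table and store the resulting boolean array:

avail = (
    data_ca.assign(avail=True)
    .set_index(["caseid", "altid"])["avail"]
    .unstack(fill_value=False)
)
dataset["Avail_ca"] = lx.DataArray(avail.rename_axis(columns="altid"))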

Working with Skims#

Sometimes, data used by Larch for estimating transportation choices is available as “skims”. This data is usually stored in one of two formats: open matrix (OMX) files, or indexed CSV tables. Both formats are readily usable by models in Larch. For open matrix files, there is the OMX class, which is derived from a PyTables File object and allows you to open and refer to open matrix files on disk. When you first access an open matrix file, only the metadata is actually read, allowing you to see what is in the file before actually loading the whole thing into memory.

from larch.omx import OMX

skims = OMX(lx.example_file("exampville_skims.omx"), mode="r")
skims
<larch.OMX> ⋯/exampville_skims.omx
 |  shape:(40, 40)
 |  data:
 |    AUTO_COST    (float64)
 |    AUTO_DIST    (float64)
 |    AUTO_TIME    (float64)
 |    BIKE_TIME    (float64)
 |    TRANSIT_FARE (float64)
 |    TRANSIT_IVTT (float64)
 |    TRANSIT_OVTT (float64)
 |    WALK_DIST    (float64)
 |    WALK_TIME    (float64)
 |  lookup:
 |    TAZ_AREA_TYPE (40 |S3)
 |    TAZ_ID        (40 int64)

It is easy to convert an OMX object into a dataset:

skims_dataset = skims.to_dataset()
skims_dataset
<xarray.Dataset> Size: 116kB
Dimensions:        (otaz: 40, dtaz: 40)
Coordinates:
  * otaz           (otaz) int64 320B 1 2 3 4 5 6 7 8 ... 33 34 35 36 37 38 39 40
  * dtaz           (dtaz) int64 320B 1 2 3 4 5 6 7 8 ... 33 34 35 36 37 38 39 40
Data variables:
    AUTO_COST      (otaz, dtaz) float64 13kB 0.441 1.868 1.754 ... 1.266 0.3288
    AUTO_DIST      (otaz, dtaz) float64 13kB 1.26 5.337 5.013 ... 3.616 0.9393
    AUTO_TIME      (otaz, dtaz) float64 13kB 3.78 7.551 20.04 ... 6.149 2.818
    BIKE_TIME      (otaz, dtaz) float64 13kB 6.3 26.68 22.92 ... 15.89 4.697
    TRANSIT_FARE   (otaz, dtaz) float64 13kB 0.0 2.5 2.5 0.0 ... 2.5 2.5 2.5 0.0
    TRANSIT_IVTT   (otaz, dtaz) float64 13kB 0.0 12.17 3.343 ... 2.093 6.768 0.0
    TRANSIT_OVTT   (otaz, dtaz) float64 13kB 1.68 55.41 53.31 ... 11.14 1.252
    WALK_DIST      (otaz, dtaz) float64 13kB 1.26 5.337 4.583 ... 3.178 0.9393
    WALK_TIME      (otaz, dtaz) float64 13kB 25.2 106.7 91.66 ... 63.56 18.79
    TAZ_AREA_TYPE  (otaz) |S3 120B b'SUB' b'SUB' b'URB' ... b'SUB' b'SUB' b'SUB'
    TAZ_ID         (otaz) int64 320B 1 2 3 4 5 6 7 8 ... 33 34 35 36 37 38 39 40
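
Once in a Dataset, individual skim values can be pulled out by zone label using xarray’s sel method; for example, the auto travel time from zone 1 to zone 3, which matches the 20.04 visible in the repr above:

skims_dataset["AUTO_TIME"].sel(otaz=1, dtaz=3)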

The other common (although less efficient) format for skim data is an indexed table. This could be a CSV file, with origin and destination zone numbers given in a pair of columns, and one or more skim variables in other columns. Data in this format might look something like this:

skim_df = pd.read_csv("example-data/exampville_skims.csv")
skim_df
         I   J  AUTO_COST  AUTO_DIST
0        1   1       0.44       1.26
1        1   2       1.87       5.34
2        1   3       1.75       5.01
3        1   4       0.95       2.72
4        1   5       1.90       5.44
...    ...  ..        ...        ...
1595    40  36       1.07       3.06
1596    40  37       1.38       3.94
1597    40  38       1.50       4.28
1598    40  39       1.27       3.62
1599    40  40       0.33       0.94

1600 rows × 4 columns

As shown above, loading this format of data into a pandas DataFrame is as simple as reading the CSV file in the usual way. Converting it to an xarray Dataset is also easy:

skim_df.set_index(["I", "J"]).to_xarray()
<xarray.Dataset> Size: 26kB
Dimensions:    (I: 40, J: 40)
Coordinates:
  * I          (I) int64 320B 1 2 3 4 5 6 7 8 9 ... 32 33 34 35 36 37 38 39 40
  * J          (J) int64 320B 1 2 3 4 5 6 7 8 9 ... 32 33 34 35 36 37 38 39 40
Data variables:
    AUTO_COST  (I, J) float64 13kB 0.44 1.87 1.75 0.95 ... 1.38 1.5 1.27 0.33
    AUTO_DIST  (I, J) float64 13kB 1.26 5.34 5.01 2.72 ... 3.94 4.28 3.62 0.94
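
From here, a common next step is attaching skim values to case data, which xarray supports through vectorized pointwise indexing; this sketch uses hypothetical per-case origin and destination zones:

# hypothetical origin and destination zones for three cases
otaz = xr.DataArray([1, 1, 2], dims="caseid")
dtaz = xr.DataArray([3, 4, 5], dims="caseid")

# one AUTO_DIST value per case, matched pointwise
skims_dataset["AUTO_DIST"].sel(otaz=otaz, dtaz=dtaz)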

If you’re going to be reading and writing the same datasets over and over, it can be advantageous to store them in a format that’s more efficient for reading and writing, such as the Zarr format. You can write a dataset to Zarr using Dataset.to_zarr:

skims_dataset.to_zarr("example-data/example-skims.zarr", mode="w")
<xarray.backends.zarr.ZarrStore at 0x7f9d55b7dd00>

Opening a dataset that has been saved in Zarr format is possible using the open_zarr function.

z = xr.open_zarr("example-data/example-skims.zarr")
z
<xarray.Dataset> Size: 116kB
Dimensions:        (otaz: 40, dtaz: 40)
Coordinates:
  * dtaz           (dtaz) int64 320B 1 2 3 4 5 6 7 8 ... 33 34 35 36 37 38 39 40
  * otaz           (otaz) int64 320B 1 2 3 4 5 6 7 8 ... 33 34 35 36 37 38 39 40
Data variables:
    AUTO_COST      (otaz, dtaz) float64 13kB dask.array<chunksize=(40, 40), meta=np.ndarray>
    AUTO_DIST      (otaz, dtaz) float64 13kB dask.array<chunksize=(40, 40), meta=np.ndarray>
    AUTO_TIME      (otaz, dtaz) float64 13kB dask.array<chunksize=(40, 40), meta=np.ndarray>
    BIKE_TIME      (otaz, dtaz) float64 13kB dask.array<chunksize=(40, 40), meta=np.ndarray>
    TAZ_AREA_TYPE  (otaz) |S3 120B dask.array<chunksize=(40,), meta=np.ndarray>
    TAZ_ID         (otaz) int64 320B dask.array<chunksize=(40,), meta=np.ndarray>
    TRANSIT_FARE   (otaz, dtaz) float64 13kB dask.array<chunksize=(40, 40), meta=np.ndarray>
    TRANSIT_IVTT   (otaz, dtaz) float64 13kB dask.array<chunksize=(40, 40), meta=np.ndarray>
    TRANSIT_OVTT   (otaz, dtaz) float64 13kB dask.array<chunksize=(40, 40), meta=np.ndarray>
    WALK_DIST      (otaz, dtaz) float64 13kB dask.array<chunksize=(40, 40), meta=np.ndarray>
    WALK_TIME      (otaz, dtaz) float64 13kB dask.array<chunksize=(40, 40), meta=np.ndarray>

The open_zarr function returns promptly because, like the OMX class, by default it does not actually read the underlying data into memory, just the metadata that identifies variable names, shapes, and types. Loading the data itself can be deferred until it is actually needed, or done eagerly using the load method. Below, you can see the dask.array placeholders have been replaced with actual values.

z.load()
<xarray.Dataset> Size: 116kB
Dimensions:        (otaz: 40, dtaz: 40)
Coordinates:
  * dtaz           (dtaz) int64 320B 1 2 3 4 5 6 7 8 ... 33 34 35 36 37 38 39 40
  * otaz           (otaz) int64 320B 1 2 3 4 5 6 7 8 ... 33 34 35 36 37 38 39 40
Data variables:
    AUTO_COST      (otaz, dtaz) float64 13kB 0.441 1.868 1.754 ... 1.266 0.3288
    AUTO_DIST      (otaz, dtaz) float64 13kB 1.26 5.337 5.013 ... 3.616 0.9393
    AUTO_TIME      (otaz, dtaz) float64 13kB 3.78 7.551 20.04 ... 6.149 2.818
    BIKE_TIME      (otaz, dtaz) float64 13kB 6.3 26.68 22.92 ... 15.89 4.697
    TAZ_AREA_TYPE  (otaz) |S3 120B b'SUB' b'SUB' b'URB' ... b'SUB' b'SUB' b'SUB'
    TAZ_ID         (otaz) int64 320B 1 2 3 4 5 6 7 8 ... 33 34 35 36 37 38 39 40
    TRANSIT_FARE   (otaz, dtaz) float64 13kB 0.0 2.5 2.5 0.0 ... 2.5 2.5 2.5 0.0
    TRANSIT_IVTT   (otaz, dtaz) float64 13kB 0.0 12.17 3.343 ... 2.093 6.768 0.0
    TRANSIT_OVTT   (otaz, dtaz) float64 13kB 1.68 55.41 53.31 ... 11.14 1.252
    WALK_DIST      (otaz, dtaz) float64 13kB 1.26 5.337 4.583 ... 3.178 0.9393
    WALK_TIME      (otaz, dtaz) float64 13kB 25.2 106.7 91.66 ... 63.56 18.79