204: Exampville Destination Choice without Logsums#

Welcome to Exampville, the best simulated town in this here part of the internet!

Exampville is a demonstration provided with Larch that walks through some of the data and tools that a transportation planner might use when building a travel model.

import numpy as np
import pandas as pd

import larch as lx
from larch import P, X

In this example notebook, we will walk through the estimation of a tour destination choice model, without using mode choice logsums. First, let’s load the data files from our example.

hh, pp, tour, skims, emp = lx.example(200, ["hh", "pp", "tour", "skims", "emp"])
hh["INCOME_GRP"] = pd.qcut(hh.INCOME, 3)
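The pd.qcut call above cuts households into three equal-count income groups (tertiles). A minimal sketch of that behavior with made-up incomes, not the Exampville data:

```python
import pandas as pd

# Six illustrative incomes; qcut assigns each to one of 3 equal-count bins.
incomes = pd.Series([12_000, 25_000, 40_000, 55_000, 80_000, 120_000])
groups = pd.qcut(incomes, 3)

# Each tertile receives exactly two of the six observations.
print(groups.value_counts().tolist())  # [2, 2, 2]
```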

Preprocessing#

The alternatives in the destinations model are much more regular than in the mode choice model, as every observation will have a similar set of alternatives and the utility function for each of those alternatives will share a common functional form. We’ll leverage this by using idca format arrays in our DataTree to make data management simpler.

The base array we’ll start with is the tours, which has only one dimension (cases). We’ll specify the alternatives separately as alts, and they will form a second dimension, which has only coordinates (the alternative ids) and no other data.

co = lx.Dataset.construct(
    tour.set_index("TOURID"),
    caseid="TOURID",
    alts=skims.TAZ_ID,
)
co
<xarray.Dataset> Size: 2MB
Dimensions:      (TOURID: 20739, TAZ_ID: 40)
Coordinates:
  * TOURID       (TOURID) int64 166kB 0 1 2 3 4 ... 20735 20736 20737 20738
  * TAZ_ID       (TAZ_ID) int64 320B 1 2 3 4 5 6 7 8 ... 33 34 35 36 37 38 39 40
Data variables:
    HHID         (TOURID) int64 166kB 50000 50000 50000 ... 54999 54999 54999
    PERSONID     (TOURID) int64 166kB 60000 60001 60001 ... 72345 72346 72347
    DTAZ         (TOURID) int64 166kB 22 4 10 20 20 25 35 ... 16 4 40 36 5 28 11
    TOURMODE     (TOURID) int64 166kB 1 1 2 1 1 1 1 1 1 1 ... 5 1 1 1 1 5 1 1 2
    TOURPURP     (TOURID) int64 166kB 1 1 2 1 2 2 2 1 2 2 ... 2 1 2 2 1 2 1 2 2
    N_STOPS      (TOURID) int64 166kB 0 0 0 0 1 0 0 2 2 0 ... 1 0 4 3 0 1 3 4 1
    N_TRIPS      (TOURID) int64 166kB 2 2 2 2 3 2 2 4 4 2 ... 3 2 6 5 2 3 5 6 3
    N_TRIPS_HBW  (TOURID) int64 166kB 2 2 0 2 0 0 0 1 0 0 ... 0 2 0 0 2 0 1 0 0
    N_TRIPS_HBO  (TOURID) int64 166kB 0 0 2 0 2 2 2 1 2 2 ... 2 0 2 2 0 2 1 2 2
    N_TRIPS_NHB  (TOURID) int64 166kB 0 0 0 0 1 0 0 2 2 0 ... 1 0 4 3 0 1 3 4 1
Attributes:
    _caseid_:  TOURID
    _altid_:   TAZ_ID

For our destination choice model, we’ll also want to use employment data. This data, as included in our example, has unique values only by alternative and not by caseid, so there are only 40 unique rows. (This kind of structure is common for destination choice models.)

emp.info()
<class 'pandas.core.frame.DataFrame'>
Index: 40 entries, 1 to 40
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   NONRETAIL_EMP  40 non-null     int64
 1   RETAIL_EMP     40 non-null     int64
 2   TOTAL_EMP      40 non-null     int64
dtypes: int64(3)
memory usage: 1.2 KB
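Because this table is keyed only by zone, its 40 rows are effectively shared across every case: linking it by destination behaves like joining a small per-alternative table onto a long case-alternative frame. A toy pandas sketch of that broadcasting idea (illustrative names, not the Larch API):

```python
import pandas as pd

# Toy per-alternative table, one row per zone (like emp).
zones = pd.DataFrame(
    {"RETAIL_EMP": [100, 250]},
    index=pd.Index([1, 2], name="TAZ"),
)

# Toy case-alternative frame: every case pairs with every zone.
ca = pd.DataFrame({"TOURID": [0, 0, 1, 1], "TAZ_ID": [1, 2, 1, 2]})

# The two zone rows broadcast across all four case-alternative rows.
ca = ca.join(zones, on="TAZ_ID")
print(ca["RETAIL_EMP"].tolist())  # [100, 250, 100, 250]
```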

Then we bundle all our raw data into a DataTree structure, which is used to collect the right data for estimation. The Larch DataTree is a slightly augmented version of the regular sharrow.DataTree.

tree = lx.DataTree(
    base=co,
    hh=hh.set_index("HHID"),
    person=pp.set_index("PERSONID"),
    emp=emp,
    skims=lx.Dataset.construct.from_omx(skims),
    relationships=(
        "base.TAZ_ID @ emp.TAZ",
        "base.HHID @ hh.HHID",
        "base.PERSONID @ person.PERSONID",
        "hh.HOMETAZ @ skims.otaz",
        "base.TAZ_ID @ skims.dtaz",
    ),
).digitize_relationships()

Model Definition#

Now we can define our choice model, using data from the tree as appropriate.

m = lx.Model(datatree=tree)
m.title = "Exampville Tour Destination Choice v2"
m.quantity_ca = (
    +P.EmpRetail_HighInc * X("RETAIL_EMP * (INCOME>50000)")
    + P.EmpNonRetail_HighInc * X("NONRETAIL_EMP") * X("INCOME>50000")
    + P.EmpRetail_LowInc * X("RETAIL_EMP") * X("INCOME<=50000")
    + P.EmpNonRetail_LowInc * X("NONRETAIL_EMP") * X("INCOME<=50000")
)

m.quantity_scale = P.Theta
m.utility_ca = +P.distance * X.AUTO_DIST
m.choice_co_code = "base.DTAZ"
m.plock(EmpRetail_HighInc=0, EmpRetail_LowInc=0)
mj = m.copy()
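For reference on what the quantity terms do: Larch exponentiates the quantity parameters so the size function stays positive, and adds a Theta-scaled log of that size to the utility, roughly V_j = beta*dist_j + Theta*log(sum_k exp(phi_k)*x_jk) (see the Larch documentation for the exact expression). The numpy sketch below renders that functional form by hand, with invented data and parameter values rather than a call into Larch:

```python
import numpy as np

# Invented data for three destination zones (illustrative only).
retail_emp = np.array([100.0, 50.0, 200.0])
nonretail_emp = np.array([400.0, 900.0, 100.0])
auto_dist = np.array([2.0, 5.0, 8.0])

# Invented parameter values, not the estimates from this model.
beta_dist, theta = -0.33, 0.68
phi_nonretail = 1.2  # quantity parameter; retail is the base (phi = 0)

# Size function: quantity parameters enter through exp(), keeping it positive.
size = retail_emp + np.exp(phi_nonretail) * nonretail_emp

# Systematic utility = linear-in-parameters part + theta-scaled log of size.
v = beta_dist * auto_dist + theta * np.log(size)

# Logit probabilities over the three zones.
probs = np.exp(v) / np.exp(v).sum()
print(probs.round(3))
```

Note how a zone with little employment can still draw some share, but the log keeps the size effect from growing linearly with employment.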

Model Estimation#

m.compute_engine = "numba"
result = m.maximize_loglike(stderr=True)

Iteration 020 [Optimization terminated successfully]

Best LL = -70650.07583993184

                         value      best  initvalue  minimum  maximum  nullvalue  holdfast
param_name
EmpNonRetail_HighInc  1.245597  1.245597        0.0     -inf      inf        0.0         0
EmpNonRetail_LowInc  -1.090242 -1.090242        0.0     -inf      inf        0.0         0
EmpRetail_HighInc     0.000000  0.000000        0.0     -inf      inf        0.0         1
EmpRetail_LowInc      0.000000  0.000000        0.0     -inf      inf        0.0         1
Theta                 0.676403  0.676403        1.0     0.01      1.0        1.0         0
distance             -0.334748 -0.334748        0.0     -inf      inf        0.0         0
/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/site-packages/larch/model/optimization.py:338: UserWarning: slsqp may not play nicely with unbounded parameters
if you get poor results, consider setting global bounds with model.set_cap()
  warnings.warn(  # infinite bounds #  )
resultj = mj.maximize_loglike(stderr=True)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Larch Model Dashboard                                                                        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ optimization complete                                                                        │
│ Log Likelihood Current =     -70650.070312 Best =     -70650.070312                          │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓ │
│ ┃ Parameter                      ┃ Estimate       ┃ Std. Error     ┃ t-Stat     ┃ Null Val ┃ │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩ │
│ │ EmpNonRetail_HighInc           │  1.2368043     │  0.14443773    │  8.563     │  0       │ │
│ │ EmpNonRetail_LowInc            │ -1.0913        │  0.052348312   │ -20.85     │  0       │ │
│ │ EmpRetail_HighInc              │  0             │  locked        │  locked    │  0       │ │
│ │ EmpRetail_LowInc               │  0             │  locked        │  locked    │  0       │ │
│ │ Theta                          │  0.67628537    │  0.0090090744  │ -35.93     │  1       │ │
│ │ distance                       │ -0.33467243    │  0.0038122181  │ -87.79     │  0       │ │
│ └────────────────────────────────┴────────────────┴────────────────┴────────────┴──────────┘ │
└──────────────────────────────────────────────────────────────────────────────────────────────┘
resultj
      message: Optimization terminated successfully
      success: True
       status: 0
            x: [ 1.237e+00 -1.091e+00  0.000e+00  0.000e+00  6.763e-01
                -3.347e-01]
          nit: 11
          jac: [ 2.939e-01  3.707e-01 -0.000e+00 -0.000e+00  1.941e+00
                 3.127e+00]
         nfev: 40
         njev: 11
      n_cases: 20739
 total_weight: 20739.0
      logloss: 3.4066285892521337
      loglike: -70650.0703125
       stderr: [ 1.444e-01  5.235e-02  0.000e+00  0.000e+00  9.009e-03
                 3.812e-03]
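One quick sanity check on these results: the reported logloss is just the negative log likelihood averaged over the total case weight. Verifying with the numbers shown above:

```python
loglike = -70650.0703125
total_weight = 20739.0

# logloss = negative average log likelihood per (weighted) case
logloss = -loglike / total_weight
print(logloss)  # 3.40662858925..., matching the reported value
```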

Model Visualization#

For destination choice and similar models, it can be helpful to review the observed and modeled choices, and the relative distribution of these choices across different factors. For example, we would probably want to see the distribution of travel distance. The Model object includes a built-in method to create this kind of visualization.

m.histogram_on_idca_variable("AUTO_DIST")
m.histogram_on_idca_variable("RETAIL_EMP")

The histogram_on_idca_variable method has a variety of options, for example to control the number and range of the histogram bins:

m.histogram_on_idca_variable("AUTO_DIST", bins=40, span=(0, 10))

Subsets of the observations can be pulled out, to observe the distribution conditional on other idco factors, like income.

m.histogram_on_idca_variable(
    "AUTO_DIST",
    x_label="Distance (miles)",
    bins=26,
    span=(0, 13),
    filter_co="INCOME<10000",
)

Semi-Aggregate Data#

Some choice models are based on only a limited number of disaggregate data dimensions. The destination choice model shown above is an example of this, where the only explanatory data used is the home zone id (which determines the distances to all the various destination alternatives) and whether the household income is low or not. Our original source data has more than 20,000 case observations, but there can only be up to 80 actual unique choice situations (40 zones, times 2 income categories). Given this simple model, it can be much more efficient to aggregate the data along all the relevant dimensions.

tour_plus = tour.join(hh.set_index("HHID")[["HOMETAZ", "INCOME"]], on="HHID")
tour_plus["LOW_INCOME"] = tour_plus.INCOME < 50_000
tour_agg = (
    tour_plus.groupby(["HOMETAZ", "DTAZ", "LOW_INCOME"])
    .size()
    .unstack("DTAZ")
    .fillna(0)
)
agg_dataset = lx.Dataset.construct.from_idco(
    tour_agg.index.to_frame().reset_index(drop=True)
)
j = tour_agg.reset_index(drop=True)
agg_dataset = agg_dataset.assign(
    destinations=lx.DataArray(
        j.values,
        dims=("index", "DTAZ"),
        coords={"index": j.index, "DTAZ": j.columns},
    )
)
agg_dataset.dc.ALTID = "DTAZ"
agg_dataset
<xarray.Dataset> Size: 27kB
Dimensions:       (index: 79, DTAZ: 40)
Coordinates:
  * index         (index) int64 632B 0 1 2 3 4 5 6 7 ... 71 72 73 74 75 76 77 78
  * DTAZ          (DTAZ) int64 320B 1 2 3 4 5 6 7 8 ... 33 34 35 36 37 38 39 40
Data variables:
    HOMETAZ       (index) int64 632B 1 1 2 2 3 3 4 4 ... 37 37 38 38 39 39 40 40
    LOW_INCOME    (index) bool 79B False True False True ... True False True
    destinations  (index, DTAZ) float64 25kB 4.0 2.0 0.0 3.0 ... 0.0 5.0 10.0
Attributes:
    _caseid_:  index
    _altid_:   DTAZ
agg_tree = lx.DataTree(
    base=agg_dataset,
    emp=emp,
    skims=lx.Dataset.construct.from_omx(skims),
    relationships=(
        "base.DTAZ @ emp.TAZ",
        "base.HOMETAZ @ skims.otaz",
        "base.DTAZ @ skims.dtaz",
    ),
)
mg = lx.Model(datatree=agg_tree, compute_engine="numba")
mg.title = "Exampville Semi-Aggregate Destination Choice"
mg.quantity_ca = (
    +P.EmpRetail_HighInc * X("RETAIL_EMP * (1-LOW_INCOME)")
    + P.EmpNonRetail_HighInc * X("NONRETAIL_EMP") * X("(1-LOW_INCOME)")
    + P.EmpRetail_LowInc * X("RETAIL_EMP") * X("LOW_INCOME")
    + P.EmpNonRetail_LowInc * X("NONRETAIL_EMP") * X("LOW_INCOME")
)

mg.quantity_scale = P.Theta
mg.utility_ca = +P.distance * X.AUTO_DIST
mg.choice_ca_var = "base.destinations"
mg.plock(EmpRetail_HighInc=0, EmpRetail_LowInc=0)
result = mg.maximize_loglike(stderr=True)
result

Iteration 021 [Optimization terminated successfully]

Best LL = -70650.146009632

                         value      best  initvalue  minimum  maximum  nullvalue  holdfast
param_name
EmpNonRetail_HighInc  1.236516  1.236516        0.0     -inf      inf        0.0         0
EmpNonRetail_LowInc  -1.080144 -1.080144        0.0     -inf      inf        0.0         0
EmpRetail_HighInc     0.000000  0.000000        0.0     -inf      inf        0.0         1
EmpRetail_LowInc      0.000000  0.000000        0.0     -inf      inf        0.0         1
Theta                 0.678172  0.678172        1.0     0.01      1.0        1.0         0
distance             -0.333965 -0.333965        0.0     -inf      inf        0.0         0
/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/site-packages/larch/model/optimization.py:338: UserWarning: slsqp may not play nicely with unbounded parameters
if you get poor results, consider setting global bounds with model.set_cap()
  warnings.warn(  # infinite bounds #  )
key               value
x                 EmpNonRetail_HighInc    1.236516
                  EmpNonRetail_LowInc    -1.080144
                  EmpRetail_HighInc       0.000000
                  EmpRetail_LowInc        0.000000
                  Theta                   0.678172
                  distance               -0.333965
logloss           3.4066322392416217
d_logloss         EmpNonRetail_HighInc   -0.000007
                  EmpNonRetail_LowInc    -0.000015
                  EmpRetail_HighInc       0.000000
                  EmpRetail_LowInc        0.000000
                  Theta                  -0.000038
                  distance                0.000116
nit               21
nfev              22
njev              21
status            0
message           'Optimization terminated successfully'
success           True
elapsed_time      0:00:00.167648
method            'slsqp'
n_cases           79
iteration_number  21
loglike           -70650.146009632