204: Exampville Destination Choice without Logsums#
Welcome to Exampville, the best simulated town in this here part of the internet!
Exampville is a demonstration provided with Larch that walks through some of the data and tools that a transportation planner might use when building a travel model.
import numpy as np
import pandas as pd
import larch as lx
from larch import P, X
In this example notebook, we will walk through the estimation of a tour destination choice model, without using mode choice logsums. First, let’s load the data files from our example.
hh, pp, tour, skims, emp = lx.example(200, ["hh", "pp", "tour", "skims", "emp"])
hh["INCOME_GRP"] = pd.qcut(hh.INCOME, 3)
Preprocessing#
The alternatives in the destinations model are much more regular than in the mode choice model, as every observation will have a similar set of alternatives and the utility function for each of those alternatives will share a common functional form. We'll leverage this by using idca format arrays in our DataTree to make data management simpler.

The base array we'll start with is the tours, and it has only one dimension (cases). We'll specify the alternatives separately as alts, which become a second dimension that has only coordinates (alternative id's) and no other data.
co = lx.Dataset.construct(
    tour.set_index("TOURID"),
    caseid="TOURID",
    alts=skims.TAZ_ID,
)
co
<xarray.Dataset> Size: 2MB Dimensions: (TOURID: 20739, TAZ_ID: 40) Coordinates: * TOURID (TOURID) int64 166kB 0 1 2 3 4 ... 20735 20736 20737 20738 * TAZ_ID (TAZ_ID) int64 320B 1 2 3 4 5 6 7 8 ... 33 34 35 36 37 38 39 40 Data variables: HHID (TOURID) int64 166kB 50000 50000 50000 ... 54999 54999 54999 PERSONID (TOURID) int64 166kB 60000 60001 60001 ... 72345 72346 72347 DTAZ (TOURID) int64 166kB 22 4 10 20 20 25 35 ... 16 4 40 36 5 28 11 TOURMODE (TOURID) int64 166kB 1 1 2 1 1 1 1 1 1 1 ... 5 1 1 1 1 5 1 1 2 TOURPURP (TOURID) int64 166kB 1 1 2 1 2 2 2 1 2 2 ... 2 1 2 2 1 2 1 2 2 N_STOPS (TOURID) int64 166kB 0 0 0 0 1 0 0 2 2 0 ... 1 0 4 3 0 1 3 4 1 N_TRIPS (TOURID) int64 166kB 2 2 2 2 3 2 2 4 4 2 ... 3 2 6 5 2 3 5 6 3 N_TRIPS_HBW (TOURID) int64 166kB 2 2 0 2 0 0 0 1 0 0 ... 0 2 0 0 2 0 1 0 0 N_TRIPS_HBO (TOURID) int64 166kB 0 0 2 0 2 2 2 1 2 2 ... 2 0 2 2 0 2 1 2 2 N_TRIPS_NHB (TOURID) int64 166kB 0 0 0 0 1 0 0 2 2 0 ... 1 0 4 3 0 1 3 4 1 Attributes: _caseid_: TOURID _altid_: TAZ_ID
For our destination choice model, we’ll also want to use employment data. This data, as included in our example, has unique values only by alternative and not by caseid, so there are only 40 unique rows. (This kind of structure is common for destination choice models.)
emp.info()
<class 'pandas.core.frame.DataFrame'>
Index: 40 entries, 1 to 40
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 NONRETAIL_EMP 40 non-null int64
1 RETAIL_EMP 40 non-null int64
2 TOTAL_EMP 40 non-null int64
dtypes: int64(3)
memory usage: 1.2 KB
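The relationship definitions below will link this table to the alternatives dimension by zone id, so it matters that it has exactly one row per zone, indexed 1 through 40. A quick hedged check, grounded in the info() output above:

# One row of employment data per destination zone (the index is the TAZ,
# which is what lets emp join onto the TAZ_ID alternatives dimension below)
assert len(emp) == 40
assert emp.index.min() == 1
assert emp.index.max() == 40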
Then we bundle all our raw data into a DataTree structure, which is used to collect the right data for estimation. The Larch DataTree is a slightly augmented version of the regular sharrow.DataTree.
tree = lx.DataTree(
    base=co,
    hh=hh.set_index("HHID"),
    person=pp.set_index("PERSONID"),
    emp=emp,
    skims=lx.Dataset.construct.from_omx(skims),
    relationships=(
        "base.TAZ_ID @ emp.TAZ",
        "base.HHID @ hh.HHID",
        "base.PERSONID @ person.PERSONID",
        "hh.HOMETAZ @ skims.otaz",
        "base.TAZ_ID @ skims.dtaz",
    ),
).digitize_relationships()
Model Definition#
Now we can define our choice model, using data from the tree as appropriate.
m = lx.Model(datatree=tree)
m.title = "Exampville Tour Destination Choice v2"
m.quantity_ca = (
    +P.EmpRetail_HighInc * X("RETAIL_EMP * (INCOME>50000)")
    + P.EmpNonRetail_HighInc * X("NONRETAIL_EMP") * X("INCOME>50000")
    + P.EmpRetail_LowInc * X("RETAIL_EMP") * X("INCOME<=50000")
    + P.EmpNonRetail_LowInc * X("NONRETAIL_EMP") * X("INCOME<=50000")
)
m.quantity_scale = P.Theta
m.utility_ca = +P.distance * X.AUTO_DIST
m.choice_co_code = "base.DTAZ"
m.plock(EmpRetail_HighInc=0, EmpRetail_LowInc=0)
mj = m.copy()  # keep a copy to estimate with the default (jax) compute engine
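As a reading aid, here is (approximately, based on Larch's standard treatment of quantity_ca and quantity_scale) the systematic utility this model assigns to destination $i$:

$$
V_i = \beta_{\text{distance}} \cdot \text{AUTO\_DIST}_i
    + \theta \, \log\!\left( \sum_k e^{\gamma_k} \, Q_{ik} \right)
$$

where the $Q_{ik}$ are the employment (size) variables interacted with the income indicators, and the $\gamma_k$ are the Emp* parameters. The $\gamma_k$ are exponentiated so that every size weight is strictly positive. This is why locking EmpRetail_HighInc and EmpRetail_LowInc at zero (weight $e^0 = 1$) makes retail employment the reference category within each income segment; without that normalization the size terms would not be identified.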
Model Estimation#
m.compute_engine = "numba"
result = m.maximize_loglike(stderr=True)
Iteration 020 [Optimization terminated successfully]
Best LL = -70650.07583993184
| param_name | value | best | initvalue | minimum | maximum | nullvalue | holdfast |
|---|---|---|---|---|---|---|---|
| EmpNonRetail_HighInc | 1.245597 | 1.245597 | 0.0 | -inf | inf | 0.0 | 0 |
| EmpNonRetail_LowInc | -1.090242 | -1.090242 | 0.0 | -inf | inf | 0.0 | 0 |
| EmpRetail_HighInc | 0.000000 | 0.000000 | 0.0 | -inf | inf | 0.0 | 1 |
| EmpRetail_LowInc | 0.000000 | 0.000000 | 0.0 | -inf | inf | 0.0 | 1 |
| Theta | 0.676403 | 0.676403 | 1.0 | 0.01 | 1.0 | 1.0 | 0 |
| distance | -0.334748 | -0.334748 | 0.0 | -inf | inf | 0.0 | 0 |
/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/site-packages/larch/model/optimization.py:338: UserWarning: slsqp may not play nicely with unbounded parameters
if you get poor results, consider setting global bounds with model.set_cap()
  warnings.warn(  # infinite bounds
resultj = mj.maximize_loglike(stderr=True)
Larch Model Dashboard
optimization complete
Log Likelihood: Current = -70650.070312, Best = -70650.070312

| Parameter | Estimate | Std. Error | t-Stat | Null Val |
|---|---|---|---|---|
| EmpNonRetail_HighInc | 1.2368043 | 0.14443773 | 8.563 | 0 |
| EmpNonRetail_LowInc | -1.0913 | 0.052348312 | -20.85 | 0 |
| EmpRetail_HighInc | 0 | locked | locked | 0 |
| EmpRetail_LowInc | 0 | locked | locked | 0 |
| Theta | 0.67628537 | 0.0090090744 | -35.93 | 1 |
| distance | -0.33467243 | 0.0038122181 | -87.79 | 0 |
resultj
message: Optimization terminated successfully
success: True
status: 0
x: [ 1.237e+00 -1.091e+00 0.000e+00 0.000e+00 6.763e-01
-3.347e-01]
nit: 11
jac: [ 2.939e-01 3.707e-01 -0.000e+00 -0.000e+00 1.941e+00
3.127e+00]
nfev: 40
njev: 11
n_cases: 20739
total_weight: 20739.0
logloss: 3.4066285892521337
loglike: -70650.0703125
stderr: [ 1.444e-01 5.235e-02 0.000e+00 0.000e+00 9.009e-03
3.812e-03]
Model Visualization#
For destination choice and similar models, it can be beneficial to review the observed and modeled choices, and the relative distribution of these choices across different factors. For example, we would probably want to see the distribution of travel distance. The Model object includes a built-in method to create this kind of visualization.
m.histogram_on_idca_variable("AUTO_DIST")
m.histogram_on_idca_variable("RETAIL_EMP")
The histogram_on_idca_variable method has a variety of options, for example to control the number and range of the histogram bins:
m.histogram_on_idca_variable("AUTO_DIST", bins=40, span=(0, 10))
Subsets of the observations can be pulled out, to observe the distribution conditional on other idco factors, like income.
m.histogram_on_idca_variable(
    "AUTO_DIST",
    x_label="Distance (miles)",
    bins=26,
    span=(0, 13),
    filter_co="INCOME<10000",
)
Semi-Aggregate Data#
Some choice models are based on only a limited number of disaggregate data dimensions. The destination choice model shown above is an example of this, where the only explanatory data used is the home zone id (which determines the distances to all the various destination alternatives) and whether the household income is low or not. Our original source data has more than 20,000 case observations, but there can only be up to 80 actual unique choice situations (40 zones, times 2 income categories). Given this simple model, it can be much more efficient to aggregate the data along all the relevant dimensions.
tour_plus = tour.join(hh.set_index("HHID")[["HOMETAZ", "INCOME"]], on="HHID")
tour_plus["LOW_INCOME"] = tour_plus.INCOME < 50_000
tour_agg = (
    tour_plus.groupby(["HOMETAZ", "DTAZ", "LOW_INCOME"])
    .size()
    .unstack("DTAZ")
    .fillna(0)
)
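Only 79 of the 80 possible (home zone, income class) combinations actually appear in the data, so the aggregated table has 79 rows and one column per destination zone, matching the dataset dimensions shown below. A quick hedged check:

# 79 unique (HOMETAZ, LOW_INCOME) choice situations by 40 destination columns
assert tour_agg.shape == (79, 40)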
agg_dataset = lx.Dataset.construct.from_idco(
    tour_agg.index.to_frame().reset_index(drop=True)
)
j = tour_agg.reset_index(drop=True)
agg_dataset = agg_dataset.assign(
    destinations=lx.DataArray(
        j.values,
        dims=("index", "DTAZ"),
        coords={"index": j.index, "DTAZ": j.columns},
    )
)
agg_dataset.dc.ALTID = "DTAZ"
agg_dataset
<xarray.Dataset> Size: 27kB Dimensions: (index: 79, DTAZ: 40) Coordinates: * index (index) int64 632B 0 1 2 3 4 5 6 7 ... 71 72 73 74 75 76 77 78 * DTAZ (DTAZ) int64 320B 1 2 3 4 5 6 7 8 ... 33 34 35 36 37 38 39 40 Data variables: HOMETAZ (index) int64 632B 1 1 2 2 3 3 4 4 ... 37 37 38 38 39 39 40 40 LOW_INCOME (index) bool 79B False True False True ... True False True destinations (index, DTAZ) float64 25kB 4.0 2.0 0.0 3.0 ... 0.0 5.0 10.0 Attributes: _caseid_: index _altid_: DTAZ
agg_tree = lx.DataTree(
    base=agg_dataset,
    emp=emp,
    skims=lx.Dataset.construct.from_omx(skims),
    relationships=(
        "base.DTAZ @ emp.TAZ",
        "base.HOMETAZ @ skims.otaz",
        "base.DTAZ @ skims.dtaz",
    ),
)
mg = lx.Model(datatree=agg_tree, compute_engine="numba")
mg.title = "Exampville Semi-Aggregate Destination Choice"
mg.quantity_ca = (
    +P.EmpRetail_HighInc * X("RETAIL_EMP * (1-LOW_INCOME)")
    + P.EmpNonRetail_HighInc * X("NONRETAIL_EMP") * X("(1-LOW_INCOME)")
    + P.EmpRetail_LowInc * X("RETAIL_EMP") * X("LOW_INCOME")
    + P.EmpNonRetail_LowInc * X("NONRETAIL_EMP") * X("LOW_INCOME")
)
mg.quantity_scale = P.Theta
mg.utility_ca = +P.distance * X.AUTO_DIST
mg.choice_ca_var = "base.destinations"
mg.plock(EmpRetail_HighInc=0, EmpRetail_LowInc=0)
result = mg.maximize_loglike(stderr=True)
result
Iteration 021 [Optimization terminated successfully]
Best LL = -70650.146009632
| param_name | value | best | initvalue | minimum | maximum | nullvalue | holdfast |
|---|---|---|---|---|---|---|---|
| EmpNonRetail_HighInc | 1.236516 | 1.236516 | 0.0 | -inf | inf | 0.0 | 0 |
| EmpNonRetail_LowInc | -1.080144 | -1.080144 | 0.0 | -inf | inf | 0.0 | 0 |
| EmpRetail_HighInc | 0.000000 | 0.000000 | 0.0 | -inf | inf | 0.0 | 1 |
| EmpRetail_LowInc | 0.000000 | 0.000000 | 0.0 | -inf | inf | 0.0 | 1 |
| Theta | 0.678172 | 0.678172 | 1.0 | 0.01 | 1.0 | 1.0 | 0 |
| distance | -0.333965 | -0.333965 | 0.0 | -inf | inf | 0.0 | 0 |
/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/site-packages/larch/model/optimization.py:338: UserWarning: slsqp may not play nicely with unbounded parameters
if you get poor results, consider setting global bounds with model.set_cap()
  warnings.warn(  # infinite bounds
| key | value |
|---|---|
| x | [1.236516, -1.080144, 0.0, 0.0, 0.678172, -0.333965] |
| logloss | np.float64(3.4066322392416217) |
| d_logloss | (gradient array not shown) |
| nit | 21 |
| nfev | 22 |
| njev | 21 |
| status | 0 |
| message | 'Optimization terminated successfully' |
| success | np.True_ |
| elapsed_time | 0:00:00.167648 |
| method | 'slsqp' |
| n_cases | 79 |
| iteration_number | 21 |
| loglike | np.float64(-70650.146009632) |
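The semi-aggregate model recovers essentially the same estimates as the disaggregate model (for example, distance is -0.333965 here versus -0.334748 above) while evaluating only 79 cases instead of 20,739. If you want to compare the two side by side, a sketch like the following works, assuming the model's pnames and pvals attributes:

# Side-by-side comparison of the disaggregate (m) and semi-aggregate (mg)
# estimates; the two columns should agree to about three decimal places
comparison = pd.DataFrame(
    {"disaggregate": m.pvals, "semi_aggregate": mg.pvals},
    index=m.pnames,
)
print(comparison)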