203: Exampville Destination Choice#

Welcome to Exampville, the best simulated town in this here part of the internet!

Exampville is a demonstration provided with Larch that walks through some of the data and tools that a transportation planner might use when building a travel model.

import numpy as np
import pandas as pd

import larch as lx
from larch import P, X

In this example notebook, we will walk through the estimation of a tour destination choice model. First, let’s load the data files from our example.

hh, pp, tour, skims, emp = lx.example(200, ["hh", "pp", "tour", "skims", "emp"])
hh["INCOME_GRP"] = pd.qcut(hh.INCOME, 3)

For this destination choice model, we'll want to use the mode choice logsums we calculated previously in the mode choice estimation, but as fixed input data rather than as modeled values. We can load these logsums from the file in which they were saved. For this example, we identify that file using the larch.example function, which will automatically rebuild the file if it doesn't exist. In typical applications, a user would instead give the filename as a string and ensure manually that the file exists before loading it, as sketched below.

logsums_file = lx.example(202, output_file="logsums.zarr")
logsums = lx.DataArray.from_zarr(logsums_file)
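
In that typical-application style, the loading step might look like the following minimal sketch (the explicit existence check is our addition):

import os

if not os.path.exists("logsums.zarr"):
    raise FileNotFoundError("run the mode choice example first to create logsums.zarr")
logsums = lx.DataArray.from_zarr("logsums.zarr")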

Preprocessing#

The alternatives in the destinations model are much more regular than in the mode choice model, as every observation will have a similar set of alternatives and the utility function for each of those alternatives will share a common functional form. We’ll leverage this by using idca format arrays in our DataTree to make data management simpler.

The base array we'll start with will have two dimensions, cases and alternatives, and be formed from the logsums we loaded above.

ca = lx.Dataset.construct(
    {"logsum": logsums},
    caseid="TOURID",
    alts=skims.TAZ_ID,
)
ca
<xarray.Dataset> Size: 2MB
Dimensions:  (TOURID: 7564, TAZ_ID: 40)
Coordinates:
  * TOURID   (TOURID) int64 61kB 0 1 3 7 10 13 ... 20728 20729 20731 20734 20736
  * TAZ_ID   (TAZ_ID) int64 320B 1 2 3 4 5 6 7 8 9 ... 33 34 35 36 37 38 39 40
Data variables:
    logsum   (TOURID, TAZ_ID) float64 2MB dask.array<chunksize=(1891, 20), meta=np.ndarray>
Attributes:
    _caseid_:  TOURID
    _altid_:   TAZ_ID
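
Because idca data is indexed by both case and alternative, individual values can be pulled with ordinary xarray label-based selection; for a quick sanity check (the labels here are chosen arbitrarily):

ca["logsum"].sel(TOURID=0, TAZ_ID=5)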

For our destination choice model, we’ll also want to use employment data. This data, as included in our example, has unique values only by alternative and not by caseid, so there are only 40 unique rows. (This kind of structure is common for destination choice models.)

emp.info()
<class 'pandas.core.frame.DataFrame'>
Index: 40 entries, 1 to 40
Data columns (total 3 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   NONRETAIL_EMP  40 non-null     int64
 1   RETAIL_EMP     40 non-null     int64
 2   TOTAL_EMP      40 non-null     int64
dtypes: int64(3)
memory usage: 1.2 KB
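
Since there is just one row per TAZ, this table will be broadcast across cases when it is referenced through the data tree below. A quick look at the first few rows:

emp.head()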

Then we bundle all our raw data into a DataTree structure, which is used to collect the right data for estimation. The Larch DataTree is a slightly augmented version of the regular sharrow.DataTree. In the relationship definitions below, the @ operator joins tables on matching labels, while -> joins by position.

tree = lx.DataTree(
    base=ca,
    tour=tour.rename_axis(index="TOUR_ID"),
    hh=hh.set_index("HHID"),
    person=pp.set_index("PERSONID"),
    emp=emp,
    skims=lx.Dataset.construct.from_omx(skims),
    relationships=(
        "base.TAZ_ID @ emp.TAZ",
        "base.TOURID -> tour.TOUR_ID",
        "tour.HHID @ hh.HHID",
        "tour.PERSONID @ person.PERSONID",
        "hh.HOMETAZ @ skims.otaz",
        "base.TAZ_ID @ skims.dtaz",
    ),
)
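
With these relationships in place, expressions written against any of the linked tables can be resolved through the tree. As a spot check, sharrow's get_expr method (assuming it is available in your installed version) broadcasts an alternative-level variable out to the full cases-by-alternatives shape:

tree.get_expr("RETAIL_EMP")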

Model Definition#

Now we can define our choice model, using data from the tree as appropriate.

m = lx.Model(datatree=tree)
m.title = "Exampville Work Tour Destination Choice v1"
m.quantity_ca = (
    +P.EmpRetail_HighInc * X("RETAIL_EMP * (INCOME>50000)")
    + P.EmpNonRetail_HighInc * X("NONRETAIL_EMP") * X("INCOME>50000")
    + P.EmpRetail_LowInc * X("RETAIL_EMP") * X("INCOME<=50000")
    + P.EmpNonRetail_LowInc * X("NONRETAIL_EMP") * X("INCOME<=50000")
)

# Theta scales the log-of-total-quantity (size) term in the utility
m.quantity_scale = P.Theta
# utility includes the mode choice logsum and the auto travel distance
m.utility_ca = +P.logsum * X.logsum + P.distance * X.AUTO_DIST
# observed choices are given by the destination TAZ codes on the tour table
m.choice_co_code = "tour.DTAZ"
# lock the retail employment parameters at zero to normalize the size terms
m.plock(EmpRetail_HighInc=0, EmpRetail_LowInc=0)
# keep an identical copy, to be estimated with the JAX compute engine
mj = m.copy()
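
Written out, this specification implies a systematic utility for each destination $j$ of the form

$$
U_j = \beta_{\mathrm{logsum}} \, \mathrm{logsum}_j + \beta_{\mathrm{distance}} \, \mathrm{AUTO\_DIST}_j + \theta \log\Big(\sum_k e^{\gamma_k} x_{jk}\Big)
$$

where the $\gamma_k$ are the employment (size) parameters and the $x_{jk}$ are the corresponding employment variables. Only ratios of the exponentiated size coefficients are identified, which is why the two EmpRetail parameters are locked at zero. The same function appears in larch's own notation in the utility function report near the end of this notebook.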

Model Estimation#

To cross-check the compute engines, we set the original model m to use the numba engine; the copy mj keeps the JAX-based engine (as its float32 gradient output below indicates). First, evaluate the log likelihood and its gradient at the initial parameter values:

m.compute_engine = "numba"
m.loglike()
np.float64(-28238.336880999712)
m.d_loglike()
array([   -67.3844916 ,   -244.55284679,      0.        ,      0.        ,
        -2572.46076256, -12615.66438367,   4845.41481796])
mj.d_loglike()
Array([   -67.404  ,   -244.51994,      0.     ,      0.     ,
        -2572.4595 , -12614.044  ,   4845.3853 ], dtype=float32)
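
The numba and JAX gradients agree to within float32 precision, which we can confirm with a loose numerical comparison (this check is our addition):

np.allclose(m.d_loglike(), mj.d_loglike(), rtol=1e-3)
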
result = m.maximize_loglike(stderr=True)

Iteration 022 [Optimization terminated successfully]

Best LL = -25157.72457687275

                          value      best  initvalue  minimum  maximum  nullvalue  holdfast
param_name
EmpNonRetail_HighInc   1.365492  1.365492        0.0     -inf      inf        0.0         0
EmpNonRetail_LowInc   -0.878688 -0.878688        0.0     -inf      inf        0.0         0
EmpRetail_HighInc      0.000000  0.000000        0.0     -inf      inf        0.0         1
EmpRetail_LowInc       0.000000  0.000000        0.0     -inf      inf        0.0         1
Theta                  0.749166  0.749166        1.0     0.01      1.0        1.0         0
distance              -0.041390 -0.041390        0.0     -inf      inf        0.0         0
logsum                 1.022440  1.022440        0.0     -inf      inf        0.0         0
/opt/hostedtoolcache/Python/3.10.18/x64/lib/python3.10/site-packages/larch/model/optimization.py:338: UserWarning: slsqp may not play nicely with unbounded parameters
if you get poor results, consider setting global bounds with model.set_cap()
  warnings.warn(  # infinite bounds #  )
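
If you do see poor results, the warning's suggestion can be applied before re-estimating; the cap value here is arbitrary (our illustration, not part of the original run):

m.set_cap(20)  # impose global bounds of [-20, 20] on unbounded parameters
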
resultj = mj.maximize_loglike(stderr=True)
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Larch Model Dashboard                                                                        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ optimization complete                                                                        │
│ Log Likelihood Current =     -25157.718750 Best =     -25157.718750                          │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓ │
│ ┃ Parameter                      ┃ Estimate       ┃ Std. Error     ┃ t-Stat     ┃ Null Val ┃ │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩ │
│ │ EmpNonRetail_HighInc           │  1.3628019     │  0.25589651    │  5.326     │  0       │ │
│ │ EmpNonRetail_LowInc            │ -0.88164465    │  0.079109512   │ -11.14     │  0       │ │
│ │ EmpRetail_HighInc              │  0             │  locked        │  locked    │  0       │ │
│ │ EmpRetail_LowInc               │  0             │  locked        │  locked    │  0       │ │
│ │ Theta                          │  0.74935543    │  0.015231621   │ -16.46     │  1       │ │
│ │ distance                       │ -0.041801401   │  0.010723748   │ -3.898     │  0       │ │
│ │ logsum                         │  1.0206302     │  0.031736787   │  32.16     │  0       │ │
│ └────────────────────────────────┴────────────────┴────────────────┴────────────┴──────────┘ │
└──────────────────────────────────────────────────────────────────────────────────────────────┘
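
Both engines converge to effectively the same solution. Assuming the dictionary-style results returned by recent larch versions, the two final log likelihoods can be compared directly:

result["loglike"], resultj["loglike"]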

Model Visualization#

For destination choice and similar models, it can be beneficial to review the observed and modeled choices, and how those choices are distributed across relevant factors, such as travel distance. The Model object includes a built-in method to create this kind of visualization.

m.histogram_on_idca_variable("AUTO_DIST")
m.histogram_on_idca_variable("RETAIL_EMP")

The histogram_on_idca_variable method has a variety of options, for example to control the number and range of the histogram bins:

m.histogram_on_idca_variable("AUTO_DIST", bins=40, span=(0, 10))

Subsets of the observations can be pulled out to observe the distribution conditional on other idco factors, like income.

m.histogram_on_idca_variable(
    "AUTO_DIST",
    x_label="Distance (miles)",
    bins=26,
    span=(0, 13),
    filter_co="INCOME<10000",
)

Save and Report Model#

report = lx.Reporter(title=m.title)
report << "# Parameter Summary" << m.parameter_summary()

Parameter Summary

                        Value  Std Err  t Stat  Signif  Null Value  Constrained
Parameter
EmpNonRetail_HighInc     1.37    0.257    5.32     ***        0.00
EmpNonRetail_LowInc    -0.879   0.0792  -11.10     ***        0.00
EmpRetail_HighInc        0.00     0.00      NA                0.00  fixed value
EmpRetail_LowInc         0.00     0.00      NA                0.00  fixed value
Theta                   0.749   0.0152  -16.47     ***        1.00
distance              -0.0414   0.0107   -3.86     ***        0.00
logsum                   1.02   0.0317   32.21     ***        0.00
report << "# Parameter Summary" << m.parameter_summary()

Parameter Summary

  Value Std Err t Stat Signif Null Value Constrained
Parameter            
EmpNonRetail_HighInc  1.37  0.257  5.32 *** 0.00
EmpNonRetail_LowInc -0.879  0.0792 -11.10 *** 0.00
EmpRetail_HighInc  0.00  0.00  NA 0.00 fixed value
EmpRetail_LowInc  0.00  0.00  NA 0.00 fixed value
Theta  0.749  0.0152 -16.47 *** 1.00
distance -0.0414  0.0107 -3.86 *** 0.00
logsum  1.02  0.0317  32.21 *** 0.00
report << "# Estimation Statistics" << m.estimation_statistics()

Estimation Statistics

Statistic                            Aggregate    Per Case
Number of Cases                           7564
Log Likelihood at Convergence        -25157.72       -3.33
Log Likelihood at Null Parameters    -28238.34       -3.73
Rho Squared w.r.t. Null Parameters       0.109
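
The rho squared statistic is simply one minus the ratio of the two log likelihoods, which is easy to verify by hand:

1 - (-25157.72 / -28238.34)  # ≈ 0.109
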
report << "# Utility Functions" << m.utility_functions()

Utility Functions

+ P.logsum * X.logsum
+ P.distance * X.AUTO_DIST
+ P.Theta * log(
   + exp(P.EmpRetail_HighInc) * X('RETAIL_EMP * (INCOME>50000)')
   + exp(P.EmpNonRetail_HighInc) * X('NONRETAIL_EMP*(INCOME>50000)')
   + exp(P.EmpRetail_LowInc) * X('RETAIL_EMP*(INCOME<=50000)')
   + exp(P.EmpNonRetail_LowInc) * X('NONRETAIL_EMP*(INCOME<=50000)')
)

The figures shown above can also be inserted directly into reports.

figure = m.histogram_on_idca_variable(
    "AUTO_DIST",
    bins=30,
    span=(0, 15),
    x_label="Distance (miles)",
)
report << "# Visualization"
report << figure

Visualization

[Figure: modeled vs. observed relative frequency by Distance (miles)]
report.save(
    "exampville_dest_choice.html",
    overwrite=True,
    metadata=m.dumps(),
)
'exampville_dest_choice.html'