17A: Market Segmentation

Market segmentation can be used to determine whether the impact of other variables differs among population groups. The most common approach is to divide the sample into segments that are mutually exclusive and collectively exhaustive (that is, each case belongs to one and only one segment). A model is estimated for each segment and compared against the pooled model (all segments represented by a single model) to determine whether there are statistically significant and important differences among the market segments.
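
The pooled-versus-segmented comparison is usually made with a likelihood ratio test: the statistic is -2 times the difference between the pooled log likelihood and the sum of the segment log likelihoods, and the degrees of freedom equal the number of extra parameters in the segmented models. Below is a minimal sketch of that calculation; the log likelihood values and parameter counts are placeholders, not results from this example.

from scipy.stats import chi2

# Placeholder values; substitute the fitted log likelihoods and
# parameter counts from your own pooled and segment models.
ll_pooled = -3000.0
ll_segments = [-1050.0, -1900.0]
k_pooled = 26
k_segments = [26, 26]

lr_statistic = -2 * (ll_pooled - sum(ll_segments))
degrees_of_freedom = sum(k_segments) - k_pooled
p_value = chi2.sf(lr_statistic, degrees_of_freedom)
print(f"LR = {lr_statistic:.2f}, df = {degrees_of_freedom}, p = {p_value:.4g}")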

Model 17A segments the market by automobile ownership and covers the households that own one or fewer cars. It would be appealing to include a distinct segment for households with no cars, since the mode choice behavior of this group is very different from the rest of the population due to its dependence on non-automobile modes. However, the number of such households in the dataset is too small to support a separate model, so it is combined with the one-car group. (pp. 129-133)

import larch

larch.__version__
'6.0.37.dev32+gdab7641'

This example is a mode choice model built using the MTC example dataset. First we create the Dataset and Model objects:

d = larch.examples.MTC(format="dataset")
d
<xarray.Dataset> Size: 2MB
Dimensions:    (caseid: 5029, altid: 6)
Coordinates:
  * caseid     (caseid) int64 40kB 1 2 3 4 5 6 ... 5024 5025 5026 5027 5028 5029
  * altid      (altid) int64 48B 1 2 3 4 5 6
    alt_names  (altid) <U7 168B 'DA' 'SR2' 'SR3+' 'Transit' 'Bike' 'Walk'
Data variables: (12/38)
    chose      (caseid, altid) float32 121kB 1.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0
    ivtt       (caseid, altid) float64 241kB 13.38 18.38 20.38 ... 1.59 6.55 0.0
    ovtt       (caseid, altid) float64 241kB 2.0 2.0 2.0 15.2 ... 16.0 4.5 0.0
    tottime    (caseid, altid) float64 241kB 15.38 20.38 22.38 ... 11.05 19.1
    totcost    (caseid, altid) float64 241kB 70.63 35.32 20.18 ... 75.0 0.0 0.0
    hhid       (caseid) int64 40kB 2 3 5 6 8 8 ... 9429 9430 9433 9434 9436 9438
    ...         ...
    corredis   (caseid) int64 40kB 0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
    vehbywrk   (caseid) float64 40kB 4.0 1.0 0.33 1.0 0.0 ... 2.0 2.0 3.0 3.0
    vocc       (caseid) int64 40kB 1 0 1 0 2 0 1 1 1 1 0 ... 1 2 1 1 0 1 2 1 1 1
    wgt        (caseid) int64 40kB 1 1 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1 1 1
    _avail_    (caseid, altid) int8 30kB 1 1 1 1 1 0 1 1 1 ... 1 0 1 1 1 1 1 1 1
    avail      (caseid, altid) int8 30kB 1 1 1 1 1 0 1 1 1 ... 1 0 1 1 1 1 1 1 1
Attributes:
    _caseid_:  caseid
    _altid_:   altid

Include only the cases where the number of vehicles is one or fewer:

d = d.sel(caseid=d.numveh <= 1)
m = larch.Model(d, compute_engine="numba")
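
As a quick check (not part of the original example), we can confirm how many cases remain in this segment after filtering; the count should match the 1221 cases reported in the summaries later in this example.

# Number of cases left in the segment of households with one or fewer vehicles
d.sizes["caseid"]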

Then we can build up the utility function. We’ll use some idco data first, using the Model.utility_co attribute. This attribute is a dict-like object, to which we can assign LinearFunction objects for each alternative code.

from larch import P, X

for a in [4, 5, 6]:  # Transit, Bike, and Walk
    m.utility_co[a] += X("hhinc") * P(f"hhinc#{a}")

Since the model we want to create treats DA, SR2, and SR3+ jointly as the reference alternatives with respect to income, we can simply omit these alternatives from the block that applies to hhinc.

For vehicles per worker, the preferred model includes a joint parameter on SR2 and SR3+ (but not DA) that is not fixed at zero. One way to express this is with a shadow_parameter (also called an alias in some places), which allows us to specify one or more parameters that are simply a fixed proportion of another parameter; for example, we could say that vehbywrk_SR2 is equal to vehbywrk_SR. In this example we achieve the same effect more directly, by attaching the single parameter vehbywrk_SR to both shared-ride alternatives in the loop below.

for i in d["alt_names"][1:3]:
    name = str(i.values)
    a = int(i.altid)
    m.utility_co[a] += (
        +X("vehbywrk") * P("vehbywrk_SR")
        + X("wkccbd+wknccbd") * P("wkcbd_" + name)
        + X("wkempden") * P("wkempden_" + name)
        + P("ASC_" + name)
    )

for i in d["alt_names"][3:]:
    name = str(i.values)
    a = int(i.altid)
    m.utility_co[a] += (
        +X("vehbywrk") * P("vehbywrk_" + name)
        + X("wkccbd+wknccbd") * P("wkcbd_" + name)
        + X("wkempden") * P("wkempden_" + name)
        + P("ASC_" + name)
    )
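
To verify the construction (a hypothetical check, not part of the original example), we can display the utility function assembled for any one alternative; the exact rendering of the expression depends on the Larch version.

# Show the linear-in-parameters utility function for Transit (alternative 4)
print(m.utility_co[4])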

Next we’ll use some idca data, with the utility_ca attribute. This attribute is a single LinearFunction that is applied across all alternatives using idca data. Because the data is structured to vary across alternatives, the parameters (and thus the structure of the LinearFunction) do not need to vary across alternatives.

m.utility_ca = (
    +X("totcost/hhinc") * P("costbyincome")
    + X("tottime * (altid <= 4)") * P("motorized_time")
    + X("tottime * (altid >= 5)") * P("nonmotorized_time")
    + X("ovtt/dist * (altid <= 4)") * P("motorized_ovtbydist")
)

The “totcost/hhinc” expression is computed once as a new variable when the model data is loaded. The same applies to tottime filtered by the motorized modes (we harness the convenient fact that all of the motorized modes have alternative codes of 4 or less) and to “ovtt/dist”.
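
These expressions are evaluated against the variables in the dataset, so as a sanity check (hypothetical, and not part of the original example) the same quantity can be computed directly with xarray arithmetic.

# Cost divided by household income, broadcast from cases across alternatives;
# this mirrors what the "totcost/hhinc" expression evaluates to.
cost_by_income = d["totcost"] / d["hhinc"]
cost_by_income.dims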

Lastly, we need to identify the idca-format data that gives the availability for each alternative, as well as the number of times each alternative is chosen. (In traditional discrete choice analysis, the choice is often 0 or 1, but it need not be binary, or even integral.)

m.availability_ca_var = "avail"
m.choice_ca_var = "chose"

And let’s give our model a descriptive title.

m.title = "MTC Example 17A, Segmented for 1 or fewer cars"

We can view a summary of the choices and alternative availabilities to make sure the model is set up correctly.

m.choice_avail_summary()
    name                        chosen  available
1   DA                             615        998
2   SR2                            152       1221
3   SR3+                            44       1221
4   Transit                        277       1062
5   Bike                            26        345
6   Walk                           107        492
    < Total All Alternatives >    1221       <NA>

We’ll set a parameter cap (bound) at +/- 25, which helps improve the numerical stability of the optimization algorithm used in estimation.

m.set_cap(25)

Having created this model, we can then estimate it:

assert m.compute_engine == "numba"
result = m.maximize_loglike(stderr=True, options={"maxiter": 1000, "ftol": 1e-10})
m.calculate_parameter_covariance()
m.loglike()

Iteration 100 [Optimization terminated successfully]

Best LL = -1049.2816699425903

param_name              value      best  initvalue  minimum  maximum  nullvalue  holdfast
ASC_Bike 0.974214 0.974208 0.0 -25.0 25.0 0.0 0
ASC_SR2 0.595367 0.595364 0.0 -25.0 25.0 0.0 0
ASC_SR3+ -0.782483 -0.782486 0.0 -25.0 25.0 0.0 0
ASC_Transit 2.258507 2.258517 0.0 -25.0 25.0 0.0 0
ASC_Walk 2.903959 2.903968 0.0 -25.0 25.0 0.0 0
costbyincome -0.022640 -0.022640 0.0 -25.0 25.0 0.0 0
hhinc#4 -0.006444 -0.006444 0.0 -25.0 25.0 0.0 0
hhinc#5 -0.011725 -0.011725 0.0 -25.0 25.0 0.0 0
hhinc#6 -0.011990 -0.011990 0.0 -25.0 25.0 0.0 0
motorized_ovtbydist -0.113089 -0.113089 0.0 -25.0 25.0 0.0 0
motorized_time -0.021068 -0.021068 0.0 -25.0 25.0 0.0 0
nonmotorized_time -0.043942 -0.043942 0.0 -25.0 25.0 0.0 0
vehbywrk_Bike -2.644434 -2.644434 0.0 -25.0 25.0 0.0 0
vehbywrk_SR -3.015709 -3.015702 0.0 -25.0 25.0 0.0 0
vehbywrk_Transit -3.963169 -3.963168 0.0 -25.0 25.0 0.0 0
vehbywrk_Walk -3.339830 -3.339824 0.0 -25.0 25.0 0.0 0
wkcbd_Bike 0.371887 0.371879 0.0 -25.0 25.0 0.0 0
wkcbd_SR2 0.370716 0.370714 0.0 -25.0 25.0 0.0 0
wkcbd_SR3+ 0.228933 0.228929 0.0 -25.0 25.0 0.0 0
wkcbd_Transit 1.105637 1.105626 0.0 -25.0 25.0 0.0 0
wkcbd_Walk 0.030613 0.030602 0.0 -25.0 25.0 0.0 0
wkempden_Bike 0.001543 0.001543 0.0 -25.0 25.0 0.0 0
wkempden_SR2 0.002044 0.002044 0.0 -25.0 25.0 0.0 0
wkempden_SR3+ 0.003530 0.003530 0.0 -25.0 25.0 0.0 0
wkempden_Transit 0.003160 0.003160 0.0 -25.0 25.0 0.0 0
wkempden_Walk 0.003786 0.003786 0.0 -25.0 25.0 0.0 0
np.float64(-1049.2816699456966)
m.parameter_summary()
Parameter             Value     Std Err    t Stat  Signif  Null Value
ASC_Bike              0.974     0.703        1.39          0.00
ASC_SR2               0.595     0.303        1.97  *       0.00
ASC_SR3+             -0.782     0.354       -2.21  *       0.00
ASC_Transit           2.26      0.444        5.09  ***     0.00
ASC_Walk              2.90      0.562        5.17  ***     0.00
costbyincome         -0.0226    0.0138      -1.64          0.00
hhinc#4              -0.00644   0.00357     -1.81          0.00
hhinc#5              -0.0117    0.00949     -1.23          0.00
hhinc#6              -0.0120    0.00597     -2.01  *       0.00
motorized_ovtbydist  -0.113     0.0259      -4.36  ***     0.00
motorized_time       -0.0211    0.00604     -3.49  ***     0.00
nonmotorized_time    -0.0439    0.00810     -5.42  ***     0.00
vehbywrk_Bike        -2.64      0.662       -4.00  ***     0.00
vehbywrk_SR          -3.02      0.349       -8.65  ***     0.00
vehbywrk_Transit     -3.96      0.376      -10.55  ***     0.00
vehbywrk_Walk        -3.34      0.444       -7.51  ***     0.00
wkcbd_Bike            0.372     0.538        0.69          0.00
wkcbd_SR2             0.371     0.241        1.54          0.00
wkcbd_SR3+            0.229     0.406        0.56          0.00
wkcbd_Transit         1.11      0.259        4.26  ***     0.00
wkcbd_Walk            0.0306    0.350        0.09          0.00
wkempden_Bike         0.00154   0.00188      0.82          0.00
wkempden_SR2          0.00204   0.000729     2.80  **      0.00
wkempden_SR3+         0.00353   0.000908     3.89  ***     0.00
wkempden_Transit      0.00316   0.000668     4.73  ***     0.00
wkempden_Walk         0.00379   0.000968     3.91  ***     0.00

It is a little tough to read this report because the parameters show up in alphabetical order. We can fix this by setting the model's ordering attribute, which groups the parameters systematically:

m.ordering = (
    (
        "LOS",
        ".*cost.*",
        ".*time.*",
        ".*dist.*",
    ),
    (
        "Zonal",
        "wkcbd.*",
        "wkempden.*",
    ),
    (
        "Household",
        "hhinc.*",
        "vehbywrk.*",
    ),
    (
        "ASCs",
        "ASC.*",
    ),
)
m.parameter_summary()
Category   Parameter             Value     Std Err    t Stat  Signif  Null Value
LOS        costbyincome         -0.0226    0.0138      -1.64          0.00
           motorized_time       -0.0211    0.00604     -3.49  ***     0.00
           nonmotorized_time    -0.0439    0.00810     -5.42  ***     0.00
           motorized_ovtbydist  -0.113     0.0259      -4.36  ***     0.00
Zonal      wkcbd_Bike            0.372     0.538        0.69          0.00
           wkcbd_SR2             0.371     0.241        1.54          0.00
           wkcbd_SR3+            0.229     0.406        0.56          0.00
           wkcbd_Transit         1.11      0.259        4.26  ***     0.00
           wkcbd_Walk            0.0306    0.350        0.09          0.00
           wkempden_Bike         0.00154   0.00188      0.82          0.00
           wkempden_SR2          0.00204   0.000729     2.80  **      0.00
           wkempden_SR3+         0.00353   0.000908     3.89  ***     0.00
           wkempden_Transit      0.00316   0.000668     4.73  ***     0.00
           wkempden_Walk         0.00379   0.000968     3.91  ***     0.00
Household  hhinc#4              -0.00644   0.00357     -1.81          0.00
           hhinc#5              -0.0117    0.00949     -1.23          0.00
           hhinc#6              -0.0120    0.00597     -2.01  *       0.00
           vehbywrk_Bike        -2.64      0.662       -4.00  ***     0.00
           vehbywrk_SR          -3.02      0.349       -8.65  ***     0.00
           vehbywrk_Transit     -3.96      0.376      -10.55  ***     0.00
           vehbywrk_Walk        -3.34      0.444       -7.51  ***     0.00
ASCs       ASC_Bike              0.974     0.703        1.39          0.00
           ASC_SR2               0.595     0.303        1.97  *       0.00
           ASC_SR3+             -0.782     0.354       -2.21  *       0.00
           ASC_Transit           2.26      0.444        5.09  ***     0.00
           ASC_Walk              2.90      0.562        5.17  ***     0.00

Finally, let’s print the model estimation statistics. Note that if you want the log likelihood at constants, you need to estimate a separate constants-only model; a sketch of this follows the statistics below.

m.estimation_statistics()
Statistic                             Aggregate   Per Case
Number of Cases                            1221
Log Likelihood at Convergence          -1049.28      -0.86
Log Likelihood at Null Parameters      -1775.42      -1.45
Rho Squared w.r.t. Null Parameters        0.409
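
As noted above, the log likelihood at constants requires estimating a separate constants-only model on the same segmented data. The following is a minimal sketch using the same API calls as above; the names m0 and result0 are just for illustration, and X("1") is assumed here to supply a constant-1 data term.

# Constants-only model on the same segment of households with one or fewer vehicles,
# giving the log likelihood at constants for this market segment.
m0 = larch.Model(d, compute_engine="numba")
for a, name in zip([2, 3, 4, 5, 6], ["SR2", "SR3+", "Transit", "Bike", "Walk"]):
    m0.utility_co[a] += X("1") * P("ASC_" + name)  # alternative specific constant only
m0.availability_ca_var = "avail"
m0.choice_ca_var = "chose"
m0.set_cap(25)
result0 = m0.maximize_loglike()
m0.loglike()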