Vehicle Choice Mixed Logit#

import pandas as pd

import larch as lx
from larch import PX

In this example, we will demonstrate the estimatation of parameters for mixed logit models. The examples are based on a problem set originally published by Ken Train.

To ensure good preformance and ease of use, estimating parameters for a mixed logit model in Larch requires the jax library as the compute engine. Importing it here allows us to confirm that it is available.

import jax  # noqa: F401

The data represent consumers’ choices among vehicles in stated preference experiments. The data are from a study that he did for Toyota and GM to assist them in their analysis of the potential marketability of electric and hybrid vehicles, back before hybrids were introduced.

In each choice experiment, the respondent was presented with three vehicles, with the price and other attributes of each vehicle described. The respondent was asked to state which of the three vehicles he/she would buy if the these vehicles were the only ones available in the market. There are 100 respondents in our dataset (which, to reduce estimation time, is a subset of the full dataset which contains 500 respondents.) Each respondent was presented with 15 choice experiments, and most respondents answered all 15. The attributes of the vehicles were varied over experiments, both for a given respondent and over respondents. The attributes are: price, operating cost in dollars per month, engine type (gas, electric, or hybrid), range if electric (in hundreds of miles between recharging), and the performance level of the vehicle (high, medium, or low). The performance level was described in terms of top speed and acceleration, and these descriptions did not vary for each level; for example, “High” performance was described as having a top speed of 100 mpg and 12 seconds to reach 60 mpg, and this description was the same for all “high” performance vehicles.

raw_data = pd.read_csv(
    "data.txt",
    names=[
        "person_id",
        "case_id",
        "chosen",
        "price",  # dollars
        "opcost",  # dollars per month
        "max_range",  # hundreds of miles (0 if not electric)
        "ev",  # bool
        "gas",  # bool
        "hybrid",  # bool
        "hiperf",  # High performance (bool)
        "medhiperf",  # Medium or high performance (bool)
    ],
    sep=r"\s+",
)
raw_data

	person_id	case_id	chosen	price	opcost	max_range	ev	gas	hybrid	hiperf	medhiperf
0	1	1	0	46763	47.43	0.0	0	0	1	0	0
1	1	1	1	57209	27.43	1.3	1	0	0	1	1
2	1	1	0	87960	32.41	1.2	1	0	0	0	1
3	1	2	1	33768	4.89	1.3	1	0	0	1	1
4	1	2	0	90336	30.19	0.0	0	0	1	0	1
...	...	...	...	...	...	...	...	...	...	...	...
4447	100	1483	0	28036	14.45	1.6	1	0	0	0	0
4448	100	1483	0	19360	54.76	0.0	0	1	0	1	1
4449	100	1484	1	24054	50.57	0.0	0	1	0	0	0
4450	100	1484	0	52795	21.25	0.0	0	0	1	0	1
4451	100	1484	0	60705	25.41	1.4	1	0	0	0	0

4452 rows × 11 columns

Create alt_id values for each row within each group, and set the index on case and alternative id’s.

raw_data["alt_id"] = raw_data.groupby("case_id").cumcount() + 1
raw_data = raw_data.set_index(["case_id", "alt_id"])
raw_data["price_scaled"] = raw_data["price"] / 10000
raw_data["opcost_scaled"] = raw_data["opcost"] / 10
data = lx.Dataset.construct.from_idca(raw_data)

Simple MNL#

Start with a simple MNL model, using all the variable in the dataset without transformations.

simple = lx.Model(data)
simple.utility_ca = (
    PX("price_scaled")
    + PX("opcost_scaled")
    + PX("max_range")
    + PX("ev")
    + PX("hybrid")
    + PX("hiperf")
    + PX("medhiperf")
)
simple.choice_ca_var = "chosen"
simple.maximize_loglike(stderr=True, options={"ftol": 1e-9});

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Larch Model Dashboard                                                                        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ optimization complete                                                                        │
│ Log Likelihood Current =      -1399.193115 Best =      -1399.193115                          │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓ │
│ ┃ Parameter                      ┃ Estimate       ┃ Std. Error     ┃ t-Stat     ┃ Null Val ┃ │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩ │
│ │ ev                             │ -1.3919506     │  0.27662924    │ -5.032     │  0       │ │
│ │ hiperf                         │  0.1098956     │  0.083822869   │  1.311     │  0       │ │
│ │ hybrid                         │  0.35540944    │  0.1218101     │  2.918     │  0       │ │
│ │ max_range                      │  0.47666917    │  0.17647879    │  2.701     │  0       │ │
│ │ medhiperf                      │  0.38405537    │  0.085450538   │  4.494     │  0       │ │
│ │ opcost_scaled                  │ -0.12880772    │  0.035308      │ -3.648     │  0       │ │
│ │ price_scaled                   │ -0.41662305    │  0.033157621   │ -12.56     │  0       │ │
│ └────────────────────────────────┴────────────────┴────────────────┴────────────┴──────────┘ │
└──────────────────────────────────────────────────────────────────────────────────────────────┘

Mixture Model#

To create a mixed logit model, we can start with a copy of the simple model.

mixed = simple.copy()

We assign mixtures to various parameters by providing a list of Mixture definitions.

mixed.mixtures = [
    lx.mixtures.Normal(k, f"s_{k}")
    for k in ["opcost_scaled", "max_range", "ev", "hybrid", "hiperf", "medhiperf"]
]

The estimation of the mixed logit parameters sometimes performs poorly (i.e. may prematurely call itself converged) if you start from the MNL solution, so it can be advantageous to start from the null parameters instead.

mixed.pvals = "null"

There are several extra settings that can be manipulated on the mixed logit model, including the number of draws and a random seed for generating draws.

mixed.n_draws = 200
mixed.seed = 42

mixed.maximize_loglike(stderr=True, options={"ftol": 1e-9});

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Larch Model Dashboard                                                                        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ optimization complete                                                                        │
│ Log Likelihood Current =      -1385.494385 Best =      -1385.494385                          │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓ │
│ ┃ Parameter                      ┃ Estimate       ┃ Std. Error     ┃ t-Stat     ┃ Null Val ┃ │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩ │
│ │ ev                             │ -3.0322503     │  0.78012139    │ -3.887     │  0       │ │
│ │ hiperf                         │  0.17637114    │  0.12484043    │  1.413     │  0       │ │
│ │ hybrid                         │  0.54619946    │  0.19360504    │  2.821     │  0       │ │
│ │ max_range                      │  0.88870039    │  0.32432306    │  2.74      │  0       │ │
│ │ medhiperf                      │  0.73252345    │  0.20886612    │  3.507     │  0       │ │
│ │ opcost_scaled                  │ -0.21988611    │  0.071590014   │ -3.071     │  0       │ │
│ │ price_scaled                   │ -0.6478546     │  0.1020746     │ -6.347     │  0       │ │
│ │ s_ev                           │ -2.8045241     │  0.75729305    │ -3.703     │  0       │ │
│ │ s_hiperf                       │ -0.52182199    │  0.87339026    │ -0.5975    │  0       │ │
│ │ s_hybrid                       │  1.1432631     │  0.66008466    │  1.732     │  0       │ │
│ │ s_max_range                    │ -0.13063182    │  0.99431813    │ -0.1314    │  0       │ │
│ │ s_medhiperf                    │  1.5746653     │  0.61265969    │  2.57      │  0       │ │
│ │ s_opcost_scaled                │  0.52047953    │  0.1855457     │  2.805     │  0       │ │
│ └────────────────────────────────┴────────────────┴────────────────┴────────────┴──────────┘ │
└──────────────────────────────────────────────────────────────────────────────────────────────┘

Panel Data#

In the mixed logit model above, we have ignored the fact that we have “panel data”, where multiple observations are made by the same decision maker who (presumably) has stable preferences with respect to the choice attributes. We can tell Larch which data column represents the panel identifiers in the groupid attribute, (which we set to "person_id") and then use that information to build a significantly better model.

panel = mixed.copy()
panel.groupid = "person_id"
panel.maximize_loglike(stderr=True, options={"ftol": 1e-9});

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Larch Model Dashboard                                                                        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ optimization complete                                                                        │
│ Log Likelihood Current =      -1330.442871 Best =      -1330.442871                          │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓ │
│ ┃ Parameter                      ┃ Estimate       ┃ Std. Error     ┃ t-Stat     ┃ Null Val ┃ │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩ │
│ │ ev                             │ -1.7029871     │  0.34308228    │ -4.964     │  0       │ │
│ │ hiperf                         │  0.092199017   │  0.10426249    │  0.8843    │  0       │ │
│ │ hybrid                         │  0.46851645    │  0.16408378    │  2.855     │  0       │ │
│ │ max_range                      │  0.46271096    │  0.22773667    │  2.032     │  0       │ │
│ │ medhiperf                      │  0.50906039    │  0.11354448    │  4.483     │  0       │ │
│ │ opcost_scaled                  │ -0.13885747    │  0.052872736   │ -2.626     │  0       │ │
│ │ price_scaled                   │ -0.50148287    │  0.039679293   │ -12.64     │  0       │ │
│ │ s_ev                           │ -0.92091268    │  0.22133178    │ -4.161     │  0       │ │
│ │ s_hiperf                       │ -0.41940384    │  0.14357798    │ -2.921     │  0       │ │
│ │ s_hybrid                       │  0.84454653    │  0.13320869    │  6.34      │  0       │ │
│ │ s_max_range                    │  0.71011009    │  0.18693064    │  3.799     │  0       │ │
│ │ s_medhiperf                    │  0.60289598    │  0.14173014    │  4.254     │  0       │ │
│ │ s_opcost_scaled                │ -0.3470382     │  0.049322512   │ -7.036     │  0       │ │
│ └────────────────────────────────┴────────────────┴────────────────┴────────────┴──────────┘ │
└──────────────────────────────────────────────────────────────────────────────────────────────┘

We can review summary statistics about the distibution of the mixed parameters:

panel.mixture_summary()

	mean	std	share +	share -	q25	median	q75
ev	-1.703002	0.920542	0.03225	0.96775	-2.323959	-1.703039	-1.081916
hiperf	0.092253	0.419601	0.58695	0.41305	-0.190676	0.092188	0.374949
hybrid	0.468636	0.845582	0.71035	0.28965	-0.101023	0.468462	1.037973
max_range	0.462799	0.710399	0.74255	0.25745	-0.016289	0.462716	0.941651
medhiperf	0.509098	0.603171	0.80070	0.19930	0.102518	0.509132	0.915687
opcost_scaled	-0.138829	0.346976	0.34445	0.65555	-0.372937	-0.138818	0.095209

The “share” columns show the share of random simulations with positive, negative, or zero values for each model parameter. This summary shows that about a third of people show a positive parameter on operating costs, i.e. they prefer higher costs. That’s probably not reasonable.

Log-Normal Parameter#

Let’s change the operating cost parameter to have a negative log-normal distribution.

panel2 = panel.copy()
panel2.mixtures[0] = lx.mixtures.NegLogNormal("opcost_scaled", "s_opcost_scaled")
panel2.pvals = "null"
panel2.maximize_loglike(stderr=True, options={"ftol": 1e-9});

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Larch Model Dashboard                                                                        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ optimization complete                                                                        │
│ Log Likelihood Current =      -1333.019165 Best =      -1333.019165                          │
│ ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━┓ │
│ ┃ Parameter                      ┃ Estimate       ┃ Std. Error     ┃ t-Stat     ┃ Null Val ┃ │
│ ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━┩ │
│ │ ev                             │ -1.7584347     │  0.34469509    │ -5.101     │  0       │ │
│ │ hiperf                         │  0.075648619   │  0.10357389    │  0.7304    │  0       │ │
│ │ hybrid                         │  0.39528257    │  0.15409386    │  2.565     │  0       │ │
│ │ max_range                      │  0.51458756    │  0.21852678    │  2.355     │  0       │ │
│ │ medhiperf                      │  0.51126732    │  0.11600985    │  4.407     │  0       │ │
│ │ opcost_scaled                  │ -2.8271971     │  0.56015855    │ -5.047     │  0       │ │
│ │ price_scaled                   │ -0.48387014    │  0.038660176   │ -12.52     │  0       │ │
│ │ s_ev                           │ -1.0334787     │  0.25959203    │ -3.981     │  0       │ │
│ │ s_hiperf                       │ -0.39579697    │  0.13123819    │ -3.016     │  0       │ │
│ │ s_hybrid                       │  0.84501421    │  0.13239691    │  6.382     │  0       │ │
│ │ s_max_range                    │ -0.4796395     │  0.17025113    │ -2.817     │  0       │ │
│ │ s_medhiperf                    │ -0.65290779    │  0.14293973    │ -4.568     │  0       │ │
│ │ s_opcost_scaled                │  1.7193462     │  0.30749312    │  5.591     │  0       │ │
│ └────────────────────────────────┴────────────────┴────────────────┴────────────┴──────────┘ │
└──────────────────────────────────────────────────────────────────────────────────────────────┘

panel2.mixture_summary()

	mean	std	share +	share -	q25	median	q75
ev	-1.758452	1.033063	0.04455	0.95545	-2.455310	-1.758493	-1.061449
hiperf	0.075700	0.395983	0.57585	0.42415	-0.191304	0.075638	0.342483
hybrid	0.395402	0.846050	0.68010	0.31990	-0.174572	0.395228	0.965054
max_range	0.514528	0.479834	0.85815	0.14185	0.191090	0.514584	0.838125
medhiperf	0.511226	0.653205	0.78315	0.21685	0.070910	0.511189	0.951533
opcost_scaled	-0.258067	1.027769	0.00000	1.00000	-0.188721	-0.059167	-0.018558

panel2.mixture_density("opcost_scaled");

../../_images/7d676d20b3810a82e6efc091d93aa7bd9882c1c05764714622d983e4f95743cc.png

panel2.mixture_density("hybrid");

../../_images/b5e50ecef35e6938b57a316c3e62728cd300820d7813b5829bbcf770d146ae7f.png

Vehicle Choice Mixed Logit#

Simple MNL#

Mixture Model#

Panel Data#

Log-Normal Parameter#

This Page