107: Swissmetro Latent Class#

import numpy as np
import pandas as pd
import larch as lx
from larch import P, X

print(lx.versions())
{'larch': '6.0.37.dev32+gdab7641', 'sharrow': '2.13.1.dev4+gafde266', 'numpy': '2.2.5', 'pandas': '2.2.3', 'xarray': '2025.4.0', 'numba': '0.61.2', 'jax': '0.6.0'}

This example is a latent class mode choice model (also known as a discrete mixture model) built using the Swissmetro example dataset. We follow the formulation given in the Biogeme documentation for the equivalent model.

The first step is to load the data and do some pre-processing, so that the format and scale of the data matches that from the Biogeme example.

raw = pd.read_csv(lx.example_file("swissmetro.csv.gz"))
raw["SM_COST"] = raw["SM_CO"] * (raw["GA"] == 0)
raw["TRAIN_COST"] = raw.eval("TRAIN_CO * (GA == 0)")
raw["TRAIN_COST_SCALED"] = raw["TRAIN_COST"] / 100
raw["TRAIN_TT_SCALED"] = raw["TRAIN_TT"] / 100
raw["SM_COST_SCALED"] = raw.eval("SM_COST / 100")
raw["SM_TT_SCALED"] = raw["SM_TT"] / 100
raw["CAR_CO_SCALED"] = raw["CAR_CO"] / 100
raw["CAR_TT_SCALED"] = raw["CAR_TT"] / 100
raw["CAR_AV_SP"] = raw.eval("CAR_AV * (SP!=0)")
raw["TRAIN_AV_SP"] = raw.eval("TRAIN_AV * (SP!=0)")
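The `eval` expressions and direct column arithmetic used above are interchangeable; a minimal pandas-only sketch of the GA (annual pass) cost logic, using toy values rather than the actual data:

```python
import pandas as pd

# toy frame: the second traveler holds a GA annual pass, so the marginal
# ticket cost is zeroed out by the (GA == 0) indicator
toy = pd.DataFrame({"TRAIN_CO": [48, 40], "GA": [0, 1]})
toy["TRAIN_COST"] = toy.eval("TRAIN_CO * (GA == 0)")
print(toy["TRAIN_COST"].tolist())  # [48, 0]
```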

We now have a pandas DataFrame in “wide” or idco format, with one row per choice observation.

raw
GROUP SURVEY SP ID PURPOSE FIRST TICKET WHO LUGGAGE AGE MALE INCOME GA ORIGIN DEST TRAIN_AV CAR_AV SM_AV TRAIN_TT TRAIN_CO TRAIN_HE SM_TT SM_CO SM_HE SM_SEATS CAR_TT CAR_CO CHOICE SM_COST TRAIN_COST TRAIN_COST_SCALED TRAIN_TT_SCALED SM_COST_SCALED SM_TT_SCALED CAR_CO_SCALED CAR_TT_SCALED CAR_AV_SP TRAIN_AV_SP
0 2 0 1 1 1 0 1 1 0 3 0 2 0 2 1 1 1 1 112 48 120 63 52 20 0 117 65 2 52 48 0.48 1.12 0.52 0.63 0.65 1.17 1 1
1 2 0 1 1 1 0 1 1 0 3 0 2 0 2 1 1 1 1 103 48 30 60 49 10 0 117 84 2 49 48 0.48 1.03 0.49 0.60 0.84 1.17 1 1
2 2 0 1 1 1 0 1 1 0 3 0 2 0 2 1 1 1 1 130 48 60 67 58 30 0 117 52 2 58 48 0.48 1.30 0.58 0.67 0.52 1.17 1 1
3 2 0 1 1 1 0 1 1 0 3 0 2 0 2 1 1 1 1 103 40 30 63 52 20 0 72 52 2 52 40 0.40 1.03 0.52 0.63 0.52 0.72 1 1
4 2 0 1 1 1 0 1 1 0 3 0 2 0 2 1 1 1 1 130 36 60 63 42 20 0 90 84 2 42 36 0.36 1.30 0.42 0.63 0.84 0.90 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
10723 3 1 1 1192 4 1 7 1 0 5 1 3 0 2 20 1 1 1 148 13 30 93 17 30 0 156 56 2 17 13 0.13 1.48 0.17 0.93 0.56 1.56 1 1
10724 3 1 1 1192 4 1 7 1 0 5 1 3 0 2 20 1 1 1 148 12 30 96 16 10 0 96 70 3 16 12 0.12 1.48 0.16 0.96 0.70 0.96 1 1
10725 3 1 1 1192 4 1 7 1 0 5 1 3 0 2 20 1 1 1 148 16 60 93 16 20 0 96 56 3 16 16 0.16 1.48 0.16 0.93 0.56 0.96 1 1
10726 3 1 1 1192 4 1 7 1 0 5 1 3 0 2 20 1 1 1 178 16 30 96 17 30 0 96 91 2 17 16 0.16 1.78 0.17 0.96 0.91 0.96 1 1
10727 3 1 1 1192 4 1 7 1 0 5 1 3 0 2 20 1 1 1 148 13 60 96 21 30 0 120 70 3 21 13 0.13 1.48 0.21 0.96 0.70 1.20 1 1

10728 rows × 38 columns

We can convert this to a Larch Dataset, filtering to include only the cases of interest, following the same filter applied in the Biogeme example.

data = lx.Dataset.construct.from_idco(raw).dc.query_cases("PURPOSE in (1,3) and CHOICE != 0")
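The `query_cases` predicate follows pandas `DataFrame.query` syntax; the same filter applied to a toy idco frame (illustrative values only):

```python
import pandas as pd

toy = pd.DataFrame({"PURPOSE": [1, 2, 3, 1], "CHOICE": [1, 1, 0, 2]})
# keep commute and business purposes (1 and 3) with a recorded choice
kept = toy.query("PURPOSE in (1, 3) and CHOICE != 0")
print(len(kept))  # 2 of the 4 toy rows survive the filter
```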

The result is a Dataset with 6,768 cases.

data
<xarray.Dataset> Size: 2MB
Dimensions:            (index: 6768)
Coordinates:
  * index              (index) int64 54kB 0 1 2 3 4 ... 8446 8447 8448 8449 8450
Data variables: (12/38)
    GROUP              (index) int64 54kB 2 2 2 2 2 2 2 2 2 ... 3 3 3 3 3 3 3 3
    SURVEY             (index) int64 54kB 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
    SP                 (index) int64 54kB 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1
    ID                 (index) int64 54kB 1 1 1 1 1 1 ... 939 939 939 939 939
    PURPOSE            (index) int64 54kB 1 1 1 1 1 1 1 1 1 ... 3 3 3 3 3 3 3 3
    FIRST              (index) int64 54kB 0 0 0 0 0 0 0 0 0 ... 1 1 1 1 1 1 1 1
    ...                 ...
    SM_COST_SCALED     (index) float64 54kB 0.52 0.49 0.58 ... 0.16 0.17 0.21
    SM_TT_SCALED       (index) float64 54kB 0.63 0.6 0.67 0.63 ... 0.5 0.53 0.53
    CAR_CO_SCALED      (index) float64 54kB 0.65 0.84 0.52 ... 0.64 1.04 0.8
    CAR_TT_SCALED      (index) float64 54kB 1.17 1.17 1.17 0.72 ... 0.8 0.8 1.0
    CAR_AV_SP          (index) int64 54kB 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1
    TRAIN_AV_SP        (index) int64 54kB 1 1 1 1 1 1 1 1 1 ... 1 1 1 1 1 1 1 1
Attributes:
    _caseid_:  index

For convenience, we can collect the names of the variables that define alternative availability into a dictionary, keyed on the codes we will use to represent each alternative.

availability_co_vars = {
    1: "TRAIN_AV_SP",
    2: "SM_AV",
    3: "CAR_AV_SP",
}

Then we can construct a choice model for each of the classes. This is done in the usual manner for Larch choice models: assigning utility functions, alternative availability conditions, and choice markers.

In this example, Class 1 chooses based on cost, and the utility functions include a set of alternative specific constants.

m1 = lx.Model(title="Model1")
m1.utility_co[1] = P("ASC_TRAIN") + X("TRAIN_COST_SCALED") * P("B_COST")
m1.utility_co[2] = X("SM_COST_SCALED") * P("B_COST")
m1.utility_co[3] = P("ASC_CAR") + X("CAR_CO_SCALED") * P("B_COST")
m1.availability_co_vars = availability_co_vars
m1.choice_co_code = "CHOICE"
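As a sanity check on what this specification implies, here is a minimal NumPy sketch of the class 1 multinomial logit probabilities for a single observation. The parameter and cost values are hypothetical, chosen only to illustrate the calculation:

```python
import numpy as np

# hypothetical parameters and scaled costs (illustrative only)
asc_train, asc_car, b_cost = -0.4, 0.1, -1.3
cost = {"train": 0.48, "sm": 0.52, "car": 0.65}

V = np.array([
    asc_train + b_cost * cost["train"],  # train
    b_cost * cost["sm"],                 # swissmetro (reference, ASC = 0)
    asc_car + b_cost * cost["car"],      # car
])
p = np.exp(V) / np.exp(V).sum()          # softmax over the three alternatives
print(p.round(3))
```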

Class 2 is nearly the same model, defined completely separately but in a similar manner. The only difference is in the utility functions, which add travel time.

m2 = lx.Model(title="Model2")
m2.utility_co[1] = (
    P("ASC_TRAIN")
    + X("TRAIN_TT_SCALED") * P("B_TIME")
    + X("TRAIN_COST_SCALED") * P("B_COST")
)
m2.utility_co[2] = X("SM_TT_SCALED") * P("B_TIME") + X("SM_COST_SCALED") * P("B_COST")
m2.utility_co[3] = (
    P("ASC_CAR") + X("CAR_TT_SCALED") * P("B_TIME") + X("CAR_CO_SCALED") * P("B_COST")
)
m2.availability_co_vars = availability_co_vars
m2.choice_co_code = "CHOICE"

The class membership model is another Larch choice model. This model is quite simple, as it needs neither alternative availability conditions nor choice indicators. The “alternatives” in this model are the class choice models we defined above, and we can assign them unique identifying codes for clarity.

For this latent class model, the definition is quite simple: just a constant for each class. As is typical for logit models, one “alternative” (here, a class) is omitted to serve as the reference point. It is possible to use chooser attributes to inform the relative likelihood of class membership, but that is not what we will do here.

mk = lx.Model()
mk.utility_co[102] = P("W")
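Because class 1 is the omitted reference (utility zero), the implied class shares follow a binary logit in W. A quick sketch using the value of W estimated later in this example:

```python
import math

W = 1.092  # class-membership constant estimated below
p_class2 = math.exp(W) / (1.0 + math.exp(W))
p_class1 = 1.0 - p_class2
print(round(p_class2, 3), round(p_class1, 3))  # 0.749 0.251
```

So roughly three quarters of choices are attributed to the time-sensitive class 2.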

Finally, we assemble all the component models into one LatentClass structure.

b = lx.LatentClass(
    mk,
    {101: m1, 102: m2},
    datatree=data.dc.set_altids([1, 2, 3]),
)
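The LatentClass structure combines the pieces so that each chooser's probability is a membership-weighted mixture of the class models, P(i) = Σ_s π_s · P_s(i). A NumPy sketch of that mixing, with hypothetical numbers:

```python
import numpy as np

pi = np.array([0.25, 0.75])  # hypothetical class shares
p_class = np.array([
    [0.2, 0.5, 0.3],         # class 1 choice probabilities
    [0.1, 0.7, 0.2],         # class 2 choice probabilities
])
p_mix = pi @ p_class         # mixture probability for each alternative
print(p_mix.round(3))
```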

Similar to other models, we can set a parameter cap to improve estimation stability.

b.set_cap(25)
b.d_loglike()
Array([  -99.00003, -1541.5    ,  -224.6052 ,  -923.5084 ,     0.     ],      dtype=float32)
result = b.maximize_loglike(method="slsqp")

Iteration 028 [Optimization terminated successfully]

Best LL = -5208.49853515625

            value  initvalue  nullvalue  minimum  maximum  holdfast   best
param_name
ASC_CAR     0.126        0.0        0.0    -25.0     25.0         0  0.126
ASC_TRAIN  -0.398        0.0        0.0    -25.0     25.0         0 -0.398
B_COST     -1.266        0.0        0.0    -25.0     25.0         0 -1.266
B_TIME     -2.802        0.0        0.0    -25.0     25.0         0 -2.802
W           1.092        0.0        0.0    -25.0     25.0         0  1.092
b.calculate_parameter_covariance();
b.parameter_summary()
            Value  Std Err  t Stat  Signif  Null Value
Parameter
ASC_CAR     0.126   0.0505    2.49       *        0.00
ASC_TRAIN  -0.398   0.0609   -6.53     ***        0.00
B_COST     -1.27    0.0612  -20.67     ***        0.00
B_TIME     -2.80    0.176   -15.95     ***        0.00
W           1.09    0.116     9.41     ***        0.00
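The t statistics in the summary are simply each estimate divided by its standard error. For example, for B_COST (using the rounded values displayed in the table, so the last digit can differ slightly from the reported statistic):

```python
value, std_err = -1.266, 0.0612  # rounded values from the summary table
t_stat = value / std_err
print(round(t_stat, 2))  # about -20.69 from these rounded inputs
```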