# Data in Larch

A `larch.Model` works with data at two levels: the `datatree` and the `dataset`.
The higher level of data organization is the `datatree`. This represents the
data in close to its "raw" form, as it is read in from the data file(s). This
can be a simple table of data that is entirely in [idca](idca) (tall) or
[idco](idco) (wide) format, or it can be a collection of related tables and datasets (e.g. households,
persons, tours, trips, skims, etc.).  It is generally reasonable to assemble
whatever data sources you have into a single `datatree` object, and then refer
to this object when building a discrete choice `Model` with larch.

The lower level of data organization is the `dataset`. This is a single 
:class:`xarray.Dataset` that is used in model estimation and application. The
`dataset` is built from the `datatree` by selecting the data that is needed for
a particular model, and then transforming it into the format that is required
for the estimation or application of that model.  All of this is generally done
automatically by larch, so most users will never need to see or interact with
the `dataset` object directly.  Moreover, the `dataset` is generally not saved
to disk, and will be recreated from the `datatree` whenever it is needed, or 
whenever the model structure is changed in any relevant way.

In [None]:
import pandas as pd

import larch as lx

## Simple Datatrees

The [`datatree`](larch.Model.datatree) at its simplest is initialized from
a simple pandas DataFrame, which can be in either [idca](idca) (tall) or
[idco](idco) (wide) format.  A simple [`datatree`](larch.Model.datatree) like
this can be created with the appropriate constructor.

### Example [idco](idco) data

Here is a simple example of [idco](idco) data, which is a table of data with one row
per case, and one column per variable.  This data can easily be read from a CSV
and expressed as a pandas DataFrame with a simple one-level index.

In [None]:
data_co = pd.read_csv("example-data/tiny_idco.csv", index_col="caseid")
data_co
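To see what `index_col="caseid"` produces without the example file, here is a minimal sketch that reads a small in-memory CSV instead. The CSV text, the column names, and the name `data_co_demo` are hypothetical and are not the contents of `tiny_idco.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a small idco file:
# one row per case, one column per variable.
csv_text = """caseid,Income,CarTime,BusTime
1,30000,30,40
2,50000,25,35
3,70000,40,20
"""

# index_col="caseid" makes the case id the (single-level) row index.
data_co_demo = pd.read_csv(io.StringIO(csv_text), index_col="caseid")

print(data_co_demo.index.name)     # name of the index level
print(data_co_demo.index.nlevels)  # number of index levels
print(data_co_demo.shape)          # (cases, variables)
```

The result has one index level named `caseid`, which is the simple one-level index described above.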

Converting this data to a Dataset that can be used as a [`datatree`](larch.Model.datatree) 
is as simple as calling the [`from_idco`](larch.Model.dc.from_idco) constructor on the DataFrame.

In [None]:
tree_co = lx.Dataset.dc.from_idco(data_co)
tree_co

### Example [idca](idca) data

Here is a simple example of [idca](idca) data, which is a table of data with one row
per alternative for each case, and one column per variable.  This data can easily be read from a CSV
and expressed as a pandas DataFrame with a two-level MultiIndex, where the first level
contains the case id and the second level contains the alternative id.

In [None]:
data_ca = pd.read_csv("example-data/tiny_idca.csv", index_col=["caseid", "altid"])
data_ca

As long as the DataFrame has a MultiIndex with two levels as described, 
the [`from_idca`](larch.Model.dc.construct.from_idca) constructor can be used to convert it to a Dataset that 
can be used as a `datatree` for a :class:`larch.Model`.
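The index requirement can be checked with plain pandas before calling the constructor. The helper below is a hypothetical sketch, not part of larch; the uniqueness test is an added assumption (that each case and alternative pair should appear only once), not something the constructor is documented here to require.

```python
import pandas as pd


def has_case_alt_index(frame):
    """Return True if `frame` has a two-level MultiIndex with unique pairs.

    Hypothetical helper for illustration only; larch does not provide it.
    """
    index = frame.index
    return isinstance(index, pd.MultiIndex) and index.nlevels == 2 and index.is_unique


# A tiny made-up idca table: two cases, two alternatives each.
demo_index = pd.MultiIndex.from_tuples(
    [(1, 1), (1, 2), (2, 1), (2, 2)], names=["caseid", "altid"]
)
demo_ca = pd.DataFrame({"Time": [30, 40, 25, 35]}, index=demo_index)

# Resetting the index moves caseid and altid into ordinary columns,
# so the frame no longer has the required two-level index.
demo_flat = demo_ca.reset_index()

print(has_case_alt_index(demo_ca))
print(has_case_alt_index(demo_flat))
```

If the check fails because the ids are stored as ordinary columns, `demo_flat.set_index(["caseid", "altid"])` restores the two-level index.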

In [None]:
tree_ca = lx.Dataset.dc.from_idca(data_ca)
tree_ca

You may have noticed in the result shown above that the [`from_idca`](larch.Model.dc.from_idca) constructor
does not simply transform the DataFrame into a Dataset.  In addition to the 
transformation, [`from_idca`](larch.Model.dc.from_idca) also analyzed the data and determined
that the "Income" variable has no variation across alternatives, and so it was
collapsed into an [idco](idco) variable.
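The idea behind this check can be sketched in plain pandas: a column is a candidate for collapsing when it takes a single value within every case. This is only an illustration under that assumption, using made-up data and a hypothetical helper name; it is not how larch implements the analysis internally.

```python
import pandas as pd


def constant_within_case(frame, case_level="caseid"):
    """List the columns that have a single value within every case.

    Hypothetical helper for illustration; not part of larch.
    """
    # Count distinct values of each column within each case.
    counts = frame.groupby(level=case_level).nunique()
    return [col for col in frame.columns if (counts[col] <= 1).all()]


# Made-up idca data: Income repeats within each case, Time does not.
demo_index = pd.MultiIndex.from_product(
    [[1, 2], [1, 2, 3]], names=["caseid", "altid"]
)
demo_ca = pd.DataFrame(
    {
        "Income": [30, 30, 30, 50, 50, 50],
        "Time": [10, 20, 30, 15, 25, 35],
    },
    index=demo_index,
)

collapsible = constant_within_case(demo_ca)
print(collapsible)

# Keeping the first row of each case yields one value per case,
# which is the idco layout for those columns.
income_co = demo_ca[collapsible].groupby(level="caseid").first()
print(income_co)
```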