Research Resources

This is a repository of research resources that may be of interest to academics and practitioners who use discrete choice methods. Below, you will find datasets that may be helpful for your analysis, code in various programming languages for estimating different discrete choice models, and working papers. The Institute is deeply committed to the open data and open source movements. The objective of this repository is to encourage further analysis, replication, verification and refinement.

Working Papers

The Unintended Impact of Helmet Use on Bicyclists' Risk-taking Behaviors

Akshay Vij

Abstract: Bicycle helmet compensation effects suggest that bicyclists offset perceived gains in safety from wearing a helmet by behaving more aggressively. A better understanding of these compensation effects can be useful in assessing mandatory helmet use laws. Using a sample of 131 bicyclists, this research studies how bicyclists respond with respect to risk-taking behaviors under various urban-street conditions, as a function of helmet use. Study participants are each shown 12 videos, shot in Berkeley, California, from the perspective of a bicyclist riding behind another bicyclist. A fractional factorial experiment design is used to systematically vary contextual attributes, such as speed, bike lane facilities, on-street parking, passing vehicles, etc., across the videos. After each video, participants are asked to indicate if they would overtake the bicyclist in the video. With the help of data-adaptive estimation techniques, targeted maximum likelihood estimation (TMLE) is applied to estimate the average risk difference between helmet users and non-users, controlling for self-selection effects. An individual-based nonparametric bootstrap is performed to assess the uncertainty associated with the estimator. Our findings suggest that, on average, helmet users are 15.6% more likely to overtake, and the effect is statistically significant under the nonparametric bootstrap. This study serves as a caution that road safety programs may need to consider strategies by which the unintended impact of bicycle helmet use can be mitigated.

Download Here

Machine Learning Meets Microeconomics: The Case of Decision Trees and Discrete Choice

Akshay Vij

Abstract: In the 1960s, the logistic regression model from statistics and the binary probit model from psychology were linked with random utility theory, thereby connecting such methods with economic theory. Since then, the fields of statistics, computer science, and machine learning have created numerous methods for modeling discrete choices. However, these newer methods have not been derived from or linked with economic theories of human decision making. We believe this lack of economic interpretation is one reason discrete choice modelers have been slow to adopt these newer methods.

Our paper begins bridging this gap by providing a microeconomic framework for decision trees: a popular machine learning method. Specifically, we show how decision trees represent a non-compensatory decision protocol known as disjunctions-of-conjunctions and how this protocol generalizes many of the non-compensatory rules used in the discrete choice literature so far. Additionally, we show how existing decision tree variants address many economic concerns that choice modelers might have. Beyond theoretical interpretations, we contribute to the existing literature of two-stage, semi-compensatory modeling and to the existing decision tree literature. In particular, we formulate the first Bayesian model tree, thereby allowing for uncertainty in the estimated non-compensatory rules as well as for context-dependent preference heterogeneity in one's second-stage choice model. Using an application of bicycle mode choice in the San Francisco Bay Area, we estimate our Bayesian model tree, and we find that it is over 1,000 times more likely to be closer to the true data-generating process than a multinomial logit model (MNL). Qualitatively, our Bayesian model tree automatically finds the effect of bicycle infrastructure investment to be moderated by travel distance, socio-demographics and topography, and our model identifies diminishing returns from bicycle lane investments. These qualitative differences lead the Bayesian model trees to produce forecasts that directly align with the observed bicycle mode shares in regions with abundant bicycle infrastructure such as Davis, CA and the Netherlands. In comparison, the forecasts of the MNL model are overly optimistic.

Download Here

Seasonality Effect on US Household Demand for Different Beef Cuts

Ardeshiri and Swait

Abstract: Australia is one of the largest exporters of beef and beef products to the United States (Haley & Jones, 2017). A better understanding of the American demand for beef is important since Australia is facing strong competition from Canada and New Zealand in the beef market. We applied a discrete choice experiment to investigate the preferences and willingness-to-pay (WTP) of 946 American consumers for different beef products. Consumers were presented with a novel experiment in which they indicated “how many” they would purchase for ground, diced, roast, and six cuts of steaks (sirloin, tenderloin, flank, flap, New York and cowboy/rib-eye).

The results from a scale-adjusted ordered logit model showed that, after price, cues related to safety, such as a certified logo, type of packaging, and antibiotic-free and organic products, play a stronger role in American consumers’ decision making than other beef attributes. This is especially so in summer, when warm weather gives foodborne bacteria more opportunity to thrive.

Furthermore, on average, US consumers purchase diced and roast products more often in winter, “a slow cooked season”, than in summer, whereas New York strip and flank steak are more popular in summer, “the grilling season”, than in winter.

Finally, this study provides managerial and policy implications and recommendations to help Australian exporters better understand US consumer preferences for beef through an understanding of seasonal effects on demand.

Download Here

Flexible Mixture-Amount Models for Business and Industry Using Gaussian Processes

Ruseckaite, Fok and Goos

Abstract: Many products and services can be described as mixtures of ingredients whose proportions sum to one. Specialized models have been developed for linking the mixture proportions to outcome variables, such as preference, quality and liking. In many scenarios, only the mixture proportions matter for the outcome variable. In such cases, mixture models suffice. In other scenarios, the total amount of the mixture matters as well. In these cases, one needs mixture-amount models. As an example, consider advertisers who have to decide on the advertising media mix (e.g. 30% of the expenditures on TV advertising, 10% on radio and 60% on online advertising) as well as on the total budget of the entire campaign. To model mixture-amount data, the current strategy is to express the response in terms of the mixture proportions and specify mixture parameters as parametric functions of the amount. However, specifying the functional form for these parameters may not be straightforward, and using a flexible functional form usually comes at the cost of a large number of parameters. In this paper, we present a new modeling approach which is flexible but parsimonious in the number of parameters. The model is based on so-called Gaussian processes and avoids the necessity to a-priori specify the shape of the dependence of the mixture parameters on the amount. We show that our model encompasses two commonly used model specifications as extreme cases. Finally, we demonstrate the model’s added value when compared to standard models for mixture-amount data. We consider two applications. The first one deals with the reaction of mice to mixtures of hormones applied in different amounts. The second one concerns the recognition of advertising campaigns. The mixture here is the particular media mix (TV and magazine advertising) used for a campaign. As the total amount variable, we consider the total advertising campaign exposure.

Download Here

Modeling and Forecasting the Evolution of Preferences over Time: A Hidden Markov Model of Travel Behavior

El Zarwi, Vij and Walker

Abstract: Preferences, as denoted by taste parameters and consideration sets, may evolve over time in response to changes in demographic and situational variables, psychological, sociological and biological constructs, and available alternatives and their attributes. However, existing representations typically overlook the influence of past experiences on present preferences. This study develops a hidden Markov model with a discrete choice kernel for modeling and forecasting the evolution of individual preferences over time. The hidden states denote different latent preferences, and the evolutionary path is hypothesized to be a first order Markov process such that an individual’s preferences during a particular time period are dependent on their preferences during the previous time period. The framework is applied to study the evolution of modal preferences, or modality styles, over time, in response to a major change in the public transportation system. Empirical findings reveal two complementary narratives. At the population level, there are significant shifts in the distribution of individuals across modality styles before and after the change in the system, but the distribution is relatively stable in the periods after the change. At the individual level, greater instability in preferences is observed, much after the change, despite accounting for the inertial influence of past preferences. A comparison between the proposed dynamic framework and comparable static frameworks reveals corresponding differences in aggregate forecasts for different policy scenarios, demonstrating the value of the proposed framework for both individual and population level policy analysis.

Download Here


California Household Travel Survey 2012:
Tour mode choice data from the San Francisco Bay Area

This data was originally collected as part of the California Household Travel Survey (CHTS) in the year 2012. Individuals belonging to sampled households were asked to report their complete activity diary data over an observation period of one day, including which activities were conducted where, when, for how long, with whom and using what mode of travel. More information on the raw data can be found in NuStats, LLC (2013).

The data included here corresponds to individuals from the subset of households located in the nine-county San Francisco Bay Area. The raw trip data was processed into home-based tours that can be used for the purpose of tour-based travel mode choice analysis. The resulting dataset includes 27,054 tours made by 17,717 individuals from 8,228 households.

For each tour, six possible travel mode alternatives are defined: private vehicle, private transit, walk to public transit, drive to public transit, bike, and walk. Private vehicle refers to cases where the individual used a motorized vehicle owned by themselves (or someone they know) as a driver or a passenger. Private transit includes the use of travel modes such as taxis, Uber, carshare, rental cars and private shuttles. Walk to public transit captures all cases in which an individual only used non-motorized travel modes to access public transit, and drive to public transit captures all cases in which a motorized travel mode was used to access public transit.
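The mode definitions above can be read as a small set of classification rules. The following is a minimal sketch of that logic, not the processing script included in the download: the trip-level mode labels (`car_driver`, `bus`, and so on), the function name `classify_tour_mode`, and the order in which private vehicle takes precedence over private transit are all assumptions made for illustration. The actual CHTS mode codes are documented in the data dictionary.

```python
# Hypothetical trip-level mode labels; the real CHTS codes differ.
PRIVATE_VEHICLE = {"car_driver", "car_passenger"}
PRIVATE_TRANSIT = {"taxi", "uber", "carshare", "rental_car", "private_shuttle"}
PUBLIC_TRANSIT = {"bus", "rail", "ferry"}
NON_MOTORIZED = {"walk", "bike"}


def classify_tour_mode(trip_modes):
    """Map the modes used on a tour to one of the six tour-level alternatives."""
    modes = set(trip_modes)
    if not modes:
        raise ValueError("a tour needs at least one trip mode")

    if modes & PUBLIC_TRANSIT:
        # Everything that is not public transit is treated as access/egress.
        access_modes = modes - PUBLIC_TRANSIT
        if access_modes <= NON_MOTORIZED:
            return "walk to public transit"
        return "drive to public transit"

    # Assumed precedence for tours without public transit (not from the source).
    if modes & PRIVATE_VEHICLE:
        return "private vehicle"
    if modes & PRIVATE_TRANSIT:
        return "private transit"
    if "bike" in modes:
        return "bike"
    return "walk"


if __name__ == "__main__":
    print(classify_tour_mode(["walk", "bus", "walk"]))
    print(classify_tour_mode(["car_driver", "rail", "walk"]))
```

Note that, following the definition in the text, a tour that reaches public transit by bicycle falls under "walk to public transit", because only non-motorized access modes were used.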

The level-of-service attributes, namely travel times and costs, for each of the six travel modes for each tour are determined using network skims from the SF MTC for 2010, generated using version 3 of their travel demand model. We are unable to decompose travel time into its constituent elements, such as in-vehicle time and waiting time, as this information was unavailable at the time of processing. Travel costs are in 2000 US dollars.

The download link below contains five files: the processed data file, the Python script used to process the raw data, an IPython notebook included as an example of how to use the data file for analysis, the data dictionary for the raw data and a readme file.

A subset of this data was originally used by Vij et al. (2017) for understanding modal preference shifts in the San Francisco Bay Area over time. For more details, please refer to the original study. And if you have any questions, feel free to contact

Download Here


NuStats, LLC (2013). 2010–2012 California Household Travel Survey Final Report.

Vij, A., Gorripaty, S., & Walker, J. L. (2017). From trend spotting to trend’splaining: Understanding modal preference shifts in the San Francisco Bay Area. Transportation Research Part A: Policy and Practice, 95, 238-258.

Estimation Code

Python estimation code for flexible Latent Class Choice Models (LCCMs)

LCCM is a Python package for estimating latent class choice models using the Expectation Maximization (EM) algorithm to maximize the likelihood function. The package was developed by Feras El Zarwi, a PhD candidate at the University of California, Berkeley, with assistance from Akshay Vij from the Institute for Choice. The package offers several improvements over other estimation packages, some of which are listed below:

  • Supports datasets with multiple observations per decision-maker
  • Supports datasets where the choice set differs across observations
  • Supports model specifications where the coefficient for a given variable may be generic (same coefficient across all alternatives) or alternative specific (coefficients varying across all alternatives or subsets of alternatives) in each latent class
  • Accounts for sampling weights when the data are choice-based, using Weighted Exogenous Sample Maximum Likelihood (WESML; Ben-Akiva and Lerman, 1983) to yield consistent estimates
  • Allows the choice set to be constrained by latent class, so that each latent class can have its own subset of alternatives in its choice set
  • Allows the availability of latent classes to be constrained across individuals, so that a given latent class or set of latent classes can be made unavailable to certain decision-makers
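To make the features above concrete, here is a minimal sketch of the expectation step of an EM algorithm for a latent class choice model with multiple observations per decision-maker, observation-specific availability, class-specific choice sets and optional sampling weights. It is written in plain Python for illustration and does not use or reproduce the LCCM package's API. The function names (`logit_probs`, `e_step`), the dictionary keys and the restriction to generic coefficients are all assumptions. It also assumes every observation has at least one alternative available in every class, and that each decision-maker has positive likelihood under at least one class.

```python
import math
from collections import defaultdict


def logit_probs(utilities, available):
    """Multinomial logit probabilities over the available alternatives only."""
    v_max = max(v for v, a in zip(utilities, available) if a)
    expv = [math.exp(v - v_max) if a else 0.0
            for v, a in zip(utilities, available)]
    total = sum(expv)
    return [e / total for e in expv]


def e_step(observations, betas, class_shares, class_avail, weights=None):
    """Posterior class membership per decision-maker and weighted log-likelihood.

    observations: list of dicts with hypothetical keys
        'person' (id), 'x' (attributes per alternative),
        'avail' (bools per alternative), 'chosen' (index of chosen alternative)
    betas[s][k]: generic coefficient of attribute k in latent class s
    class_shares[s]: prior probability of class s
    class_avail[s][j]: True if alternative j is in the choice set of class s
    weights: optional dict of sampling weights per person (default 1.0)
    """
    n_class = len(betas)
    weights = weights or {}
    # Likelihood of each person's full sequence of choices, by class.
    lik = defaultdict(lambda: [1.0] * n_class)
    for obs in observations:
        for s in range(n_class):
            avail = [a and c for a, c in zip(obs["avail"], class_avail[s])]
            utils = [sum(b * xk for b, xk in zip(betas[s], x))
                     for x in obs["x"]]
            lik[obs["person"]][s] *= logit_probs(utils, avail)[obs["chosen"]]

    posteriors, log_lik = {}, 0.0
    for person, class_lik in lik.items():
        joint = [share * l for share, l in zip(class_shares, class_lik)]
        marginal = sum(joint)
        posteriors[person] = [j / marginal for j in joint]
        log_lik += weights.get(person, 1.0) * math.log(marginal)
    return posteriors, log_lik


if __name__ == "__main__":
    obs = [
        {"person": "a", "x": [[1.0], [2.0]], "avail": [True, True], "chosen": 0},
        {"person": "a", "x": [[1.5], [1.0]], "avail": [True, True], "chosen": 1},
        {"person": "b", "x": [[2.0], [1.0]], "avail": [True, True], "chosen": 0},
    ]
    post, ll = e_step(obs, betas=[[-1.0], [0.5]], class_shares=[0.4, 0.6],
                      class_avail=[[True, True], [True, True]])
    print(post, ll)
```

In a full EM loop, the maximization step would re-estimate the class-specific coefficients and class shares using these posteriors as weights. Multiplying probabilities directly, as done here for readability, can underflow for long panels, so a production implementation would normally accumulate log-probabilities instead.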

For more information about the estimation code, see El Zarwi (2017). If the package is useful in your research or work, please cite the dissertation listed below and the package itself. For any questions, please contact Feras at


El Zarwi, Feras. "Modeling and Forecasting the Impact of Major Technological and Infrastructural Changes on Travel Demand", PhD Dissertation, 2017, University of California at Berkeley.
