Demographic Characters and Access to Public Transit in Greater Vancouver: Machine Learning Modeling

5 minute read

A shorter high-level summary of the project can be found here. For an interactive web application with simulation results of priority neighborhoods, see here.

This is part 3 of a series of in-depth posts about this project including

For the Jupyter Notebook with full analysIs, please see here. The GitHub repo of this analysis is located here.

Remove extreme observations

Now that we have tabular data of demographic characters, transit access and transit usage down to the DA level, I will pre-process the data so that it is ready to be put into machine learning algorithms. Firstly, features with all NA data are removed (in fact there are two such features). Then, I will remove rows that have irregular, or extreme values for important variables, as shown in the following table:

Removal Criteria	Number of DAs removed
DAs with zero population in 2016	8
DAs with zero land size in 2016	1
DAs with with land size 2 standard deviations above the mean	8
DAs with with population density 2 standard deviations above the mean	37
DAs with with population density 2 standard deviations below the mean	165
DAs without transit service or stops in the neighborhood area	49
DAs whose proportions of residents using transit are invalid (NAs)	2

The following map shows in brown DAs that have been removed from our next steps of analysis.

Map: DA removed for analyses

Feature engineering of census data

Most explanatory variables of this study come from the 2016 census, except for two, namely the number of transit services in the neighborhood area per person, and the number of transit stops in the neighborhood area per person, which are from the GTFS Data. I have also calculated some proportional relationships between census variables and added them to our analyses, and removed some confounding variables.

Create proportional variables

Proportional variables are calculated based on numeric features in the census data. There are two types of numeric columns:

(1) Proportions or averages. An example of proportion is “Population density, per square kilometer”, and an example of average is “Average Household Size.”
(2) Counts. Most numeric variables are in this category, for example (number of) “Private households”.

We will put type (1) variables directly into our models. For type (2) variables, I divide them into two sub-types:

(2.1) Counts of subjects belonging to the widest possible category in a DA, for example, total number of people who are immigrants.

We will put type (2.1) variables directly into our models.

(2.2) Counts of subjects belonging to a sub-category of a type (2.1) variable in a DA. There are further two sub-types:
(2.2.1) Counts of subjects belonging to an immediate sub-category of a type (2.1) variable, for example, total number of immigrants who are born in Asia.

We will use type (2.2.1) variables in two ways. Firstly, we will put them directly into our models. Secondly, we will calculate the proportion of each type (2.2.1) variable to the type (2.1) variable that it belongs to.

(2.2.2) Counts of subjects belonging to a further-off sub-category of a type (2.1) variable, for example, total number of immigrants who are born in India, Asia.

We will use type (2.2.2) variables in three ways. Firstly, we will put them directly into our models. Secondly, we will calculate the proportion of each type (2.2.2) variable to the type (2.1) variable that it ultimately (but not immediately, by definition) belongs to. Thirdly, we will calculate the proportion of each type (2.2.2) variable to the type (2.2.1) or type (2.2.2) variable that it immediately belongs to.

The following chart explains types of numeric variables.

Chart: types of numeric variables

Remove confounding variables

I have also removed variables that sit on the causal chain from access to transit to proportion of transit use. Such variables, theoretically, are impacted by the access to public transportation, and can in turn impact the use of public transportation. Confounding variables include:

Mode of Commuting
Duration of Commuting
Time of the day when leaving for work

In total, 572 new proportional variables have been created, and 76 confounding variables have been excluded.

Create train and test splits

There are 2544 observations in the train split and 636 observations in the test set.

Preliminary analyses

Before stepping into modeling, I did some quick sanity check on the correlation between access to, and public usage of, transit across DAs in GVA. The two plots below show that their relationships seem to be positive, which is expected.

Density plot: service per capita and public transit proportion

Density plot: stops per capita and public transit proportion

Modeling

Model selection and hyperparameter tuning

I selected three models, namely (1) dummy regression, (2) LASSO, and (3) Random Forest regression.

Dummy regression

The dummy regression model has both training and testing R-squared close to zero, which is expected.

LASSO regression

I tuned the alpha hyperparameter in the model using grid search. As shown in the table below, an alpha value of 0.0001 turns out to be the best.

Validation score rank	Mean validation R-squared	Alpha hyperparameter	Mean Fit Time
1	0.680	1.00E-04	13.815
2	0.651	1.00E-03	19.746
3	0.627	1.00E-05	14.588
4	0.518	1.00E-02	21.070
5	0.354	1.00E-01	14.889
6	0.255	1.00E+00	2.017
7	0.113	1.00E+01	1.835
8	-0.001	1.00E+02	1.322
9	-0.001	1.00E+03	1.279
9	-0.001	1.00E+04	1.002

Random Forest model

I tuned max_depth, max_features, min_samples_leaf, min_samples_split, and n_estimators hyperparameters using randomized search. The table below shows the best combination of hyperparameters.

Validation score rank	Mean validation R-squared	Max_depth hyperparameter	Max_features hyperparameter	Min_samples_leaf hyperparameter	Min_samples_split hyperparameter	N_estimators hyperparameter	Mean Fit Time
1	0.6760	100	auto	2	5	2000	1401.222
2	0.6759	100	auto	1	5	200	142.685
3	0.6755	40	auto	1	2	200	155.870
4	0.6751	40	auto	2	10	600	336.548
5	0.6745	100	auto	2	10	2000	1302.495
6	0.6720	40	auto	1	5	200	144.661
7	0.6657		sqrt	2	2	200	5.166
8	0.6642	100	sqrt	1	10	2000	45.181
9	0.6639	100	sqrt	4	2	2000	42.367
10	0.6630	10	sqrt	2	5	600	12.829

Model evaluation

After hyperparameter tuning, I compared the performance, as measured by root mean squared error (RMSE), among the three models on the whole train split. Random forest regression seems to perform the best among the three models.

Model	Train Split RMSE
Dummy Model	0.120
LASSO Regression	0.054
Random Forest Regression	0.027

I then fitted the best Random Forest model with the test split, and got a RMSE of about 0.070.

Share on

Twitter Facebook LinkedIn

Mark Wang

Demographic Characters and Access to Public Transit in Greater Vancouver: Machine Learning Modeling

Remove extreme observations

Feature engineering of census data

Create proportional variables

Remove confounding variables

Create train and test splits

Preliminary analyses

Modeling

Model selection and hyperparameter tuning

Model evaluation

Share on

You may also enjoy

It does not look like a Shiny App! Use GitHub Copilot to generate custom Shiny UI

GitHub Copilot seem hesitant? Could be AI safety and fairness measures!

Evaluation Metrics for Binary Classification in Machine Learning: When Accuracy Score Is Not Enough

Combating America’s Other Infodemic