This is part two, where we train our model and select an accuracy measurement.
In part one, we laid out the project, explained the features and target, and walked through the code to create the datasets.
In the following three kernels, we:
*In the code, we create a list named 'classifiers'. The name is a misnomer; these are regression models. The naming has no effect on our results.
Some notes on our error metric:
What we have plotted here is not a ROC (Receiver Operating Characteristic) curve.
What we needed was something to show the relationship between tolerance for error and the model's ability to predict values within a given tolerance threshold. This is precisely the relationship the above graph illustrates.
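To make the idea concrete, here is a minimal sketch of how such a curve could be computed. It is not the code from our kernels: the function name `error_tolerance_curve`, the sample errors, and the tolerance grid from 0 to 1 are all assumptions made for illustration. For each tolerance, it reports the fraction of absolute percentage errors at or below that tolerance, then estimates the area under the resulting curve with the trapezoid rule.

```python
import numpy as np


def error_tolerance_curve(abs_pct_errors, tolerances):
    """Return, for each tolerance, the fraction of errors at or below it."""
    errors = np.asarray(abs_pct_errors, dtype=float)
    tolerances = np.asarray(tolerances, dtype=float)
    return np.array([(errors <= t).mean() for t in tolerances])


# Hypothetical absolute percentage errors (0.05 means 5% off), for illustration only.
errors = [0.02, 0.05, 0.08, 0.12, 0.30]

# Assumed tolerance grid: 0% to 100% in steps of 1%.
tolerances = np.linspace(0.0, 1.0, 101)
coverage = error_tolerance_curve(errors, tolerances)

# Area under the curve via the trapezoid rule, written out explicitly.
widths = np.diff(tolerances)
area = float(np.sum((coverage[1:] + coverage[:-1]) / 2.0 * widths))

print(f"Share of predictions within 10% tolerance: {coverage[10]:.2f}")
print(f"Area under the error-tolerances curve: {area:.4f}")
```

With a grid that runs from 0 to 1, the area also lies between 0 and 1, and a model whose errors are all tiny scores close to 1.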
Note that we took the mean absolute error of each test set to calculate the area under our error-tolerances curve.
If we simply averaged the signed errors, much of the error would cancel out: predictions that were too high in one test set would offset predictions that were too low in another. Taking the absolute value of each error before averaging gives us the average magnitude of the error.
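A tiny example shows the cancellation effect. The numbers below are made up for illustration (one signed percentage error per test set) and are not taken from our results.

```python
import numpy as np

# Hypothetical signed percentage errors, one per test set:
# positive means the prediction was too high, negative means too low.
signed_errors = np.array([0.10, -0.10, 0.20, -0.20])

# Averaging the signed errors lets over- and under-predictions cancel.
mean_signed = signed_errors.mean()

# Averaging the absolute errors keeps the magnitude of each miss.
mean_absolute = np.abs(signed_errors).mean()

print(f"Mean signed error:   {mean_signed:.2f}")
print(f"Mean absolute error: {mean_absolute:.2f}")
```

The signed average suggests a near-perfect model, while the absolute average shows that predictions were off by 15% on average.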
A note on real-world application.
Imagine the following: You are in charge of the corporate strategy for a company that sells group tours to clients from all over the world. You operate in a small island country, with the majority of clients coming from three countries - Scienceville, Datatopia, and Regressionland. You are responsible for setting staffing levels - a number which changes every year - and setting contracts with local operators. These depend on the total quantity of tourists you expect, as well as the composition of those tourists (people from Datatopia have different tastes than people from Scienceville).
If you gain access to our model, you can make fairly informed decisions about how many tourists will come from each of these three countries. Even if the total number of tourists who come to your island remains fairly constant, the proportion of tourists by country may shift dramatically. This model allows you to predict these changes and prepare accordingly. (This assumes that the preferences of the citizens in each of these countries do not significantly change - i.e., your island remains a preferred destination for Datatopian tourists.)
Conclusion & Considerations:
Using nothing but five years of trailing macroeconomic indicators, and a baseline level of departures in a given year, we can make fairly accurate predictions about the number of tourists that will originate from a given country.
Area under Error-Tolerances Curve: 0.8462
Percent of Countries with less than 15% avg. absolute error: 79.41%
This is in line with our initial hypothesis.
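For readers who want to reproduce a figure like the share of countries under 15% average absolute error, the sketch below shows one way it could be computed. The country names are borrowed from the hypothetical scenario above, and the error values, the dictionary layout, and the variable names are all assumptions for illustration, not our actual data or code.

```python
import numpy as np

# Hypothetical absolute percentage errors per country, one value per test set.
errors_by_country = {
    "Scienceville": [0.05, 0.10, 0.08],
    "Datatopia": [0.20, 0.25, 0.30],
    "Regressionland": [0.12, 0.14, 0.10],
}

threshold = 0.15  # 15% average absolute error

# Average the absolute errors for each country across its test sets.
avg_error = {country: float(np.mean(errs)) for country, errs in errors_by_country.items()}

# Keep the countries whose average error falls below the threshold.
within = [country for country, err in avg_error.items() if err < threshold]
pct_within = 100.0 * len(within) / len(avg_error)

print(f"Countries under {threshold:.0%} avg. absolute error: {within}")
print(f"Percent of countries: {pct_within:.2f}")
```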
Note that this model is for illustration only. The purpose of the exercise is to test the predictive power of macroeconomic indicators - not to get the most accurate predictions possible using all the data we could possibly source. A production model would use all available and relevant information (including departures data over time, rather than using it only to establish a baseline in a single year).
Without using additional data, there is still plenty of room for improvement:
~ That's all for now ~