This is part 3 of an analysis of costs of living and local purchasing power, in five hundred major cities around the world.
For part one, click here: Primary Drivers of Costs of Living, Worldwide. For part two, click here: Are People in More Expensive Countries Richer? The data was sourced from Numbeo.com, which hosts user-contributed data - current within the last 18 months. The IPython Notebook for this project is available on GitHub.
Rich and poor, cheap and expensive - these are all relative terms. To an American tourist, for example, China might be cheap. But to a Cambodian, China is fairly expensive. What we are interested in determining is: how does the cost of living in various countries around the world look, depending on your country of origin?
To this end, we've put together two visualizations, using Tableau. (This was fairly easy to do, because we'd already crunched the numbers in the first two sections of this analysis of data from Numbeo.com.)
Note that for the second visualization, we define the baseline cost of living in each country as the median of the top five most expensive cities in that country - except Russia. For Russia, we used the mean of costs in St. Petersburg and Moscow. The logic here is that the typical tourist from a given country tends to be richer than average. People with the resources to travel tend to live in bigger, more expensive cities. A disproportionate number of tourists come from these cities, and the relative costs of living around the world should be in proportion to the costs these tourists actually experience back home.
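A minimal sketch of how that baseline might be computed with pandas (column names are assumptions, not the notebook's actual schema):

```python
import pandas as pd

# Baseline = median of the five most expensive cities per country,
# except Russia, where we average Moscow and St. Petersburg.
def country_baselines(cities: pd.DataFrame) -> pd.Series:
    top5 = (
        cities.sort_values("cost_of_living_index", ascending=False)
              .groupby("country")
              .head(5)
    )
    baselines = top5.groupby("country")["cost_of_living_index"].median()

    # Exception: for Russia, use the mean of Moscow and St. Petersburg.
    russia = cities[cities["city"].isin(["Moscow", "St. Petersburg"])]
    if not russia.empty:
        baselines["Russia"] = russia["cost_of_living_index"].mean()
    return baselines
```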
This is part 2 of an analysis of costs of living and local purchasing power, in five hundred major cities around the world.
For part one, click here: Primary Drivers of Costs of Living, Worldwide. For part three, click here: Are People in More Expensive Countries Richer? The data was sourced from Numbeo.com, which hosts user-contributed data - current within the last 18 months. The IPython Notebook for this project is available on GitHub.

Is there a relationship between cost of living and local purchasing power, in cities around the world?
Note that Local Purchasing Power is a measure that *already* takes into account the cost of living. It is not a measure of absolute wages; rather, it describes the amount of "spending power" a person has, given both:
If wages correlated perfectly with cost of living, there would be *no* variance in local purchasing power. If there is no variance in local purchasing power, then it should hold no relationship to cost of living (wages would scale evenly with costs, so cost of living could not influence local purchasing power). We do, in fact, see a low degree of correlation in most regions of the world:
Regions with no or very weak correlation between cost of living and local purchasing power:
And yet, for two regions in the world, there is a very high correlation:
These relationships become immediately clear when we plot local purchasing power vs cost of living:
The blue line in each chart is our regression line, and the light blue bands are our 95% confidence intervals.
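Plots like these can be produced with seaborn's regplot; a minimal sketch, with hypothetical column names:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Hypothetical columns: 'region', 'cost_of_living_index',
# 'local_purchasing_power'. The published charts may have been
# produced differently.
def plot_region(cities, region):
    subset = cities[cities["region"] == region]
    sns.regplot(
        data=subset,
        x="cost_of_living_index",
        y="local_purchasing_power",
        ci=95,  # the light blue 95% confidence band
    )
    plt.title(f"{region}: local purchasing power vs cost of living")
    plt.show()
```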
We see from these plots that:
In Europe:
In Asia:
Why might this be so? North America and Oceania contain only rich countries, which are economically similar. There is relatively little variance in either income or cost of living between - or within - these countries. Africa and Latin America consist mostly of poorer countries with both low costs of living and low incomes. While there is definite variance between the cost of living in, say, Chile and Paraguay, there is still not a tremendous amount of variance between countries.

Generally speaking - and this is only a hypothesis - it may be that wages and cost of living tend to scale proportionally for cities within a single country, while they do not necessarily scale proportionally between countries. Europe and Asia - more so than any other regions - are both home to economies with a huge amount of variance between countries. Plotting the data in Tableau, this becomes quite clear:
In Europe, we see a very strong correlation between size and color.
Next up - Part Three: "The World Through Whose Eyes?" - Costs of living around the world, relative to your country of residence.
This project explores the costs of living and purchasing power characteristics of 500 major cities around the world.
The analysis in this post concerns itself with the following questions:
The data was sourced from Numbeo.com, which hosts user-contributed data - current within the last 18 months. The IPython Notebook for this project is available on github.
First, let's take a look at our key metrics - local purchasing power and total cost of living (including rent) - on a global scale:
Next, let's look at the distribution of purchasing power in each region of the world.
The first three graphs plot the proportion of cities in each region that enjoy varying levels of wealth, relative to the worldwide median. The fourth graph plots the same metric for the world at large, against the medians for each of our regions. (These are just two slightly different views into the same data).
Next, let's look at the distribution of rent costs - same regions, different metric.
How much does the average price of rent vary from city to city, for each region in the world?
What is the relationship between non-rent costs of living, and cost of rent, for the top 500 major cities, worldwide?
Which of these costs varies more?
The scatterplot on the left illustrates cost of rent vs non-rent costs of living*.
The kdeplot (essentially, a smoothed histogram) shows a much greater spread in the cost of rent, compared to non-rent costs*. What does this tell us?
On average, as non-rent costs of living increase, rent increases a full 2.13 times faster. *The above charts graph the delta in costs for each city, relative to the worldwide median for each metric. The scatterplot, then, does not illustrate absolute costs, but rather the ratio by which costs are more (or less) expensive than average.
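A minimal sketch of how that slope could be estimated (the arrays and the least-squares fit are assumptions, not the notebook's exact approach):

```python
import numpy as np

# rent_ratio and nonrent_ratio are per-city cost ratios relative to the
# worldwide median (1.0 = median); both arrays are assumptions here.
def rent_vs_nonrent_slope(rent_ratio: np.ndarray, nonrent_ratio: np.ndarray) -> float:
    slope, _intercept = np.polyfit(nonrent_ratio, rent_ratio, deg=1)
    return slope  # a value of ~2.13 would match the figure quoted above
```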
This raises an interesting question:
Given that rent and non-rent costs are strongly correlated, and given that rent rises faster (relative to the worldwide median) than non-rent costs of living, to what degree is rent (rather than non-rent costs) the major driver of variance in cost for cities around the world? In simple terms, are "expensive" cities expensive because rent in those cities is expensive, or are they expensive because non-rent factors are driving up the cost of living? The following graph plots each of our major cities relative to the worldwide median cost of living:
From this chart, it becomes immediately apparent that - in the vast majority of cities around the world - rent is the primary driver of cost.
The reason the typical expensive city is expensive is that rent in that city is expensive. Cheap cities, then, are cheap because rent is cheap.

The more astute readers will notice that more than half of the cities in this visualization fall above the "median." This is because we used a calculated median to address sampling bias in the data. Significantly more than half of the cities sampled are from rich countries, which means that taking a simple median (the 250th of 500 data points) would result in a measure more expensive than the true worldwide median cost of living. To fix this, we first calculated the median cost in each region, then took the median of all of our regional medians.

A final point that needs to be addressed is how we calculated the cost ratio for each city, relative to our median. Numbeo creates their total cost of living index by attributing (essentially) equal weight to both rent and non-rent costs of living. Thus, 50% of the cost for any city is derived from rent, and the other 50% from non-rent. Using these figures would have resulted in a boring and quite useless graph, with equal parts red and blue for every city. To get around this, we calculated two additional indexes for each city:
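Those indexes are presumably per-city ratios of rent and non-rent costs to their respective worldwide medians; a minimal sketch, with hypothetical column names:

```python
import pandas as pd

# Each index expresses a city's cost as a ratio to the (region-balanced)
# worldwide median for that component. Column names are assumptions.
def add_cost_indexes(cities: pd.DataFrame,
                     median_rent: float,
                     median_non_rent: float) -> pd.DataFrame:
    cities = cities.copy()
    cities["rent_index_vs_median"] = cities["rent_cost"] / median_rent
    cities["non_rent_index_vs_median"] = cities["non_rent_cost"] / median_non_rent
    return cities
```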
We then used these two columns to calculate one final metric: the proportion of the variance in total cost that is attributable to rent costs, specifically. This is the metric which determines our red/blue splits on the graph displayed above. Here is another graph, highlighting the top five most expensive, and top five least expensive cities in our dataset:
Interestingly, San Francisco (a city in which the author of the study has lived) is the second most expensive location in the world, and nearly all of that expense is due to the cost of rent. This is of course no surprise, as San Francisco lays claim to the most expensive real estate on the entire continent.
Next, let's explore another of our hypotheses: Is there a relationship between cost of living and local purchasing power?

International Tourism & Neural Networks, Part 2: Training the Model & Selecting an Error Metric

This is part two, where we train our model and select an accuracy measurement. In part one, we laid out the project, explained the features and target, and walked through the code to create the datasets. CLICK HERE FOR PART 1.

In the following three kernels, we:
*In the code, we create a list of 'classifiers'. This is incorrect; these are regression models. It has no effect on our results.

Some notes on our error metric: What we have plotted here is not a ROC (Receiver Operating Characteristic) curve.
What we needed was something to show the relationship between tolerance for error and the model's ability to predict values within a given tolerance threshold. This is precisely the relationship the above graph illustrates. Note that we took the mean absolute error of each test set to calculate the area under our error-tolerance curve. If we simply averaged the errors together, much of the error would cancel out: predictions that were over in one test set would cancel out predictions that were under in another. Taking the absolute error gives us the magnitude of the average error. (A minimal sketch of this calculation appears at the end of this post.)

A note on real-world application. Imagine the following: You are in charge of corporate strategy for a company that sells group tours to clients from all over the world. You operate in a small island country, with the majority of clients coming from three countries - Scienceville, Datatopia, and Regressionland. You are responsible for setting staffing levels - a number which changes every year - and for negotiating contracts with local operators. Both depend on the total quantity of tourists you expect, as well as the composition of those tourists (people from Datatopia have different tastes than people from Scienceville). With access to our model, you can make fairly informed decisions about how many tourists will come from each of these three countries. Even if the total number of tourists who come to your island remains fairly constant, the proportion of tourists by country may shift dramatically. This model allows you to predict these changes and prepare accordingly. (This assumes that the preferences of the citizens in each of these countries do not significantly change - i.e., your island remains a preferred destination for Datatopian tourists.)

Conclusion & Considerations: Using nothing but five years of trailing macroeconomic indicators and a baseline level of departures in a given year, we can make fairly accurate predictions about the number of tourists that will originate from a given country.

Area under Error-Tolerance Curve: 0.8462
Percent of countries with less than 15% avg. absolute error: 79.41

This is in line with our initial hypothesis. Note that this model is for illustration only. The purpose of the exercise is to test the predictive power of macroeconomic indicators - not to get the most accurate predictions possible using all the data we could source. A production model would use all available and relevant information (including departures data over time, rather than using it only to establish a baseline in a single year). Even without additional data, there is still plenty of room for improvement:
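For reference, a minimal sketch of how an error-tolerance curve and its area might be computed from per-country absolute errors (the array and the exact construction of the curve are assumptions, not the project's code):

```python
import numpy as np

# per_country_error is a NumPy array holding each country's mean absolute
# percentage error, averaged across the eight per-year test sets.
def error_tolerance_curve(per_country_error, max_tolerance=1.0, steps=101):
    tolerances = np.linspace(0.0, max_tolerance, steps)
    # Fraction of countries predicted within each tolerance threshold.
    fractions = np.array([np.mean(per_country_error <= t) for t in tolerances])
    # Normalized area under the curve: 1.0 would mean every country is
    # predicted perfectly even at the tightest tolerance.
    area = np.trapz(fractions, tolerances) / max_tolerance
    return tolerances, fractions, area
```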
~ That's all for now ~

Introduction:

Cluster analysis is a valuable machine learning tool, with applications across virtually every discipline - anything from analyzing satellite images of agricultural areas to identify different types of crops, to finding themes across the billions of social media posts that are broadcast publicly every month. Clustering is often used together with natural language processing (NLP). NLP allows us to turn unstructured text data into structured numerical data; this is how we transform human language into the computer-friendly inputs required to apply machine learning.

For a straightforward application of how these two techniques can be applied in sequence, we've chosen an easily accessible public data set: the 'Top 50 Songs of 2015' by Genius, the digital media company. In our example, we look at how songs can be broken into similar clusters by their lyrics, and then conduct further statistical analysis against the resulting clusters.

Machine learning and other data science methodologies increasingly confer great power upon those who practice them. They are not, however, without their limitations. The misapplication or misinterpretation of these methodologies can lead not only to a reduced ability to extract insights, but also to categorically false or misleading conclusions. We examine potential mishaps in the transformation and analysis of the data at various stages in this study.

This study explores the use of cluster analysis to answer the following questions:
Notes:
Methodology:

We performed this analysis in Python, using Pandas, Sklearn, and NLTK to model the data. Graphs were plotted with matplotlib and/or seaborn.

Steps:
* We first fed the data into an affinity propagation model, but found the model unsuited to the data.

Further Analysis: After clustering, we analyzed the relationships between word count, vocabulary, songs, and the clusters the songs were assigned to. Results are below.

Project Results and Analysis:
Top 5 Most Similar Songs:
#1: Father John Misty, "The Night Josh Tillman Came To Our Apt" vs. Kendrick Lamar, "The Blacker The Berry" - Dist. = 0.835462649851
#2: Vince Staples, "Jump Off The Roof" vs. The Weeknd, "Tell Your Friends" - Dist. = 0.873377864682
#3: The Weeknd, "Tell Your Friends" vs. Drake, "Back To Back" - Dist. = 0.908259234026
#4: Joey Bada$$, "Paper Trail" vs. Lupe Fiasco, "Prisoner 1 And 2" - Dist. = 0.937557822226
#5: Dr. Dre, "Darkside/Gone" vs. Drake, "Back To Back" - Dist. = 0.939432020598

Top 5 Most Dissimilar Songs:
#1: Sufjan Stevens, "Blue Bucket Of Gold" vs. Major Lazer, "Lean On" - Dist. = 3.0627237839
#2: Major Lazer, "Lean On" vs. D'Angelo And The Vanguard, "Really Love" - Dist. = 3.01516407241
#3: Towkio, "Heaven Only Knows" vs. Major Lazer, "Lean On" - Dist. = 2.8076161803
#4: Post Malone, "White Iverson" vs. Major Lazer, "Lean On" - Dist. = 2.77959991825
#5: Major Lazer, "Lean On" vs. The Weeknd, "The Hills" - Dist. = 2.76888048057
Interestingly, Major Lazer's "Lean On" is so lyrically dissimilar from every other song that it occupies all five of the top dissimilarity ranks, each time paired against a different song.
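A minimal sketch of how such pairwise lyric distances might be computed - TF-IDF vectors with Euclidean distance here; the notebook's actual preprocessing and metric may differ:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances

# `titles` is a list of "Artist - Song" strings, `lyrics` the matching raw
# lyric strings (both assumptions).
def most_similar_pairs(titles, lyrics, top_n=5):
    vectors = TfidfVectorizer(stop_words="english").fit_transform(lyrics)
    dists = pairwise_distances(vectors, metric="euclidean")
    pairs = []
    for i in range(len(titles)):
        for j in range(i + 1, len(titles)):
            pairs.append((dists[i, j], titles[i], titles[j]))
    return sorted(pairs)[:top_n]  # smallest distances = most similar
```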
Principal Component Analysis
Unfortunately, PCA does not help here as much as it might in other cases. We need the first 9 principal components to account for just over 60% of the variance.
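A minimal sketch of checking how many components are needed (the feature matrix name is an assumption):

```python
from sklearn.decomposition import PCA
import numpy as np

# `word_matrix` is the (songs x vocabulary) feature matrix - an assumption.
def components_needed(word_matrix: np.ndarray, target_variance: float = 0.60) -> int:
    cumulative = np.cumsum(PCA().fit(word_matrix).explained_variance_ratio_)
    # Index of the first component at which the cumulative ratio crosses
    # the target, converted to a 1-based count.
    return int(np.searchsorted(cumulative, target_variance) + 1)
```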
Clustering the songs
The Reddit analysis this project was inspired by used an affinity propagation algorithm to perform its clustering. Affinity propagation does not require you to specify the number of clusters up front. This means that, unlike with K-Means, the number of clusters is NOT arbitrary. It is therefore often a great choice, when it works.
We, however, found K-Means to be a better choice for this dataset. The affinity propagation model kept outputting very uneven class distributions. We'd get roughly 8 to 12 clusters - 3 of which would contain about 90% of the songs, while the remaining clusters had just 1 or 2 songs each.
This kind of clustering isn't necessarily bad. If many songs are similar to one another, and a few are very dissimilar to all the others, then the model is just doing its job, and showing those relationships to us. We wanted to perform analysis of the summary statistics for each class, though, and the large number of single-song classes made that difficult.
After some initial tweaking, we fit a KMeans model to the first 9 dimensions of our PCA transform.
We found 8 clusters to be an effective tradeoff between the number of classes, and the distribution of songs within those classes.
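A minimal sketch of this step, assuming the word-feature matrix is available as a NumPy array:

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import numpy as np

# Reduce the (songs x vocabulary) matrix to its first 9 principal
# components, then fit K-Means with 8 clusters. Parameters beyond those
# two figures (e.g. random_state) are assumptions.
def cluster_songs(word_matrix: np.ndarray,
                  n_components: int = 9,
                  n_clusters: int = 8) -> np.ndarray:
    reduced = PCA(n_components=n_components).fit_transform(word_matrix)
    return KMeans(n_clusters=n_clusters, random_state=0).fit_predict(reduced)
```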
Of the 8 classes we get, five of them have six or more songs within them.
Now that we've clustered the Genius Top 50 Songs of 2015 into eight groups, how can we visually represent them?
If we plot the first 3 principal components along three axes, we can already start to see some clustering. This plot is actually more illustrative than might be expected: even though these 3 dimensions only account for ~25% of the variance, we can see a reasonable degree of grouping.
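A minimal sketch of producing such a plot with matplotlib (variable names are carried over from the clustering sketch above and are assumptions):

```python
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (needed on older matplotlib)

# `reduced` is the PCA-transformed matrix and `labels` the K-Means cluster
# assignments from the step above.
def plot_first_three_components(reduced, labels):
    fig = plt.figure()
    ax = fig.add_subplot(111, projection="3d")
    ax.scatter(reduced[:, 0], reduced[:, 1], reduced[:, 2], c=labels)
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    ax.set_zlabel("PC 3")
    plt.show()
```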
Besides lyrics, what do classes share in common?
First, let's visualize some of our distributions:
1. On the left, we have a box plot showing the distribution of words per song.
- The mean is just over 400 words,
- The standard deviation is large relative to the mean, and word count is distributed fairly equally around it.
- The first and third quartiles range between ~300 and ~600 words
2. On the right, we have a histogram depicting the number of songs which fall into bins characterized by word quantity.
- The distribution is right-skewed, due to a single song with over 1400 words (3.5x the mean)
Next, let's see whether there is any correlation between classes, and the average number of unique words within each class.
There is no reason there *should* be, as songs were clustered by the words themselves, rather than the number of words.
Still, there may be some correlation. Perhaps songs with similar lyrics share a similar breadth of vocabulary (number of unique words per song).
There is no correlation between the number of words or number of unique words, and the class a song was assigned to.
This graph is somewhat misleading, because it implies the possibility of some kind of linear relationship between classes and the number of words or unique words (or anything else at all). In reality, the class labels themselves are meaningless; they are arbitrarily assigned by the KMeans algorithm. The information they contain relates to the ways in which songs with the same label relate to one another.
This graph itself, though, conveys valuable information.
If there were a correlation between classes and number of unique words, we would expect to see some kind of clustering along the horizontal axis, for each class along the vertical axis.
There is none of that whatsoever, indicating a complete lack of any correlation between classes and the number of words for songs within them. In fact, the R-Squared (not pictured here) between classes and unique words is ~0.01, or essentially zero.
Word Count & Unique Words Vs Song Ranking
Is Song Rank a function of the number of unique words in a song?
Is Song Rank a function of a song's word count?
No. With R-squared values of 0.0037 and 2.4e-6, there is essentially no correlation between the number of unique words - or total words - in a song and the song's ranking.
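A minimal sketch of how these R-squared values might be computed (column names are hypothetical):

```python
from scipy import stats
import pandas as pd

# `songs` has one row per song, with 'rank', 'total_words' and
# 'unique_words' columns - hypothetical names.
def rank_r_squared(songs: pd.DataFrame) -> dict:
    results = {}
    for col in ("unique_words", "total_words"):
        fit = stats.linregress(songs[col], songs["rank"])
        results[col] = fit.rvalue ** 2
    return results
```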
Is there a correlation between the AVERAGE rank per class, and the AVERAGE word count or unique word count per class?
Note that we dropped 3 clusters here. These clusters had only 1, 1, and 3 songs respectively.
If we keep those clusters, single songs may exert far too much leverage on the rankings for each cluster.
The remaining 5 classes contain 45 of our 50 songs - or 90% of the data.
Each of these has between 6 and 12 songs in it.
Yes. There does appear to be a correlation (13% and 34% of variance).
This finding seems to imply that, while a song's word count or vocabulary breadth holds no value as a predictor of its ranking, the average word count or breadth of vocabulary for each kind* of song may actually hold some predictive value for that kind of song's average ranking.
* The easiest analogy to kinds of songs is genre. But these "genres" are not defined by popularly agreed divisions or titles (e.g. rap, country, rock). The "genres" here are determined** by the songs' lyrical similarities - or dissimilarities - to one another.
** It would be interesting to compare the classifications made by analysis of the lyrics to the genres the songs actually fall under. The question here is: "Do songs that are clustered together tend to fall into the same genre? If so, how significant is the correlation?" This, however, is another job for another time.
Conclusion
1. Songs can be effectively clustered together by analysis of their lyrics.
2. In the context of individual songs, there is no relationship between song ranking and number of words, or unique words, in a song.
3. After clustering songs into similar kinds of songs, relationships do emerge:
1. There is a weak inverse correlation between the average number of words in a cluster, and the average ranking
* This can account for ~13% of the variance.
2. There is a stronger - but still not huge - inverse correlation between the average number of unique words in a cluster, and the average ranking.
* This can account for ~34% of the variance
Considerations:
1. Small sample size: 50 songs, and resultant 8 classes (5 of which we kept - accounting for 90% of the songs)
* The statistical significance of a regression (class rank vs word count) with only 5 data points is *very* limited.
* It is neat, however, to see that there does at least *appear* to be a relationship between word count / vocabulary, and ranking
2. Subjective rankings.
* Genius describes their process as such: "Contributors voted on an initial poll, spent weeks discussing revisions and replacements, and elected to write about their favorite tracks."
* While it does seem that the Genius community at large was polled, and that poll determined the songs that were ultimately selected, the actual ranking was not necessarily reflective of the community at large. Rather, a select group of individual contributors had the final say.
3. Relatively few words per song.
* The average song had between ~300 and ~600 words. This led to a relatively *small* incidence of repeated words between songs. Of the total 3,126 unique words found across the entire dataset, only the top 36 appeared more than 100 times, cumulatively. This means that nearly 99% of the words appeared, on average, fewer than twice per song. This gives tremendous leverage to a small number of words.
* While the statistical significance of regression techniques would be improved by a larger dataset, that would still not likely change the reality that 1% of the words accounts for 100% of the leverage in our model. Indeed, the analysis of Reddit's top 50 subreddits that this project was inspired by was a clustering of entire *forums* - with an associated plethora of data to draw from.
International Tourism & Neural Networks, Part 1: Creating the Dataset, & Cleaning the Data
3/22/2017
Introduction
Hypothesis: We can predict the number of tourists that will originate from a given country in a given year - if we understand just a few key macroeconomic indicators describing that country.
China's outbound tourists spent $258 billion abroad in 2017 - almost double the spend of US tourists, according to the United Nations World Tourism Organisation. Yet it has only been half a decade since China first overtook the US and Germany. Can we explain these trends?
The straightforward answer is that there are four times as many Chinese as Americans (1.3bn vs 330m), and China is growing at a significantly faster pace. As its middle class grows, the Chinese become wealthier - and they travel more. No mystery there, it seems.
But why now? China's GDP has been growing at around or above 10% for the best part of four decades. And after surging dramatically at the start of this century, that growth rate has dipped over the last decade to its current 6-7%. Yet it is only in the past several years that outbound tourism has surged - doubling in just a few years.
So we can assume that:
- Economics do influence the number of tourists that originate from each country, and
- The relationship is non-linear; it may be defined by certain asymptotes or inflection points
The data science techniques
This analysis explores the hypothesis above to walk the reader through the application of the key techniques a data scientist would use to design, test, and refine an ML model. It shows how to extract real-world insights from public data sets. These techniques include:
[Note that the descriptions below are intended to provide a helpful overview, but are simplifications of very complex subjects.]
Data wrangling and cleaning
Over 90% of a typical data scientist's time is spent wrangling and cleaning. Data wrangling is the process of transforming raw data into a usable format for analysis. Data cleaning is the process of addressing errors and inconsistencies in the initial data sets. This is important because real world data is often messy and incomplete - a byproduct of real world interactions. Data scientists must transform the data they have into the data they need.
Exploratory data analysis
EDA is how a data scientist gets to grips with a new data set in order to make it useful. Data scientists must develop an intuitive sense for how to use statistical methodologies and data visualization techniques. This is important in order to explore a new data set and develop a strategy for how to extract insights from it.
Feature selection and engineering
Feature selection is how a data scientist chooses and refines which data they’re going to work with. In more technical terms, it is the process of applying rigorous scientific methodology to the curation, selection, and transformation of data which will be used as inputs in machine learning algorithms.
Model design and algorithm selection
Machine learning is a set of tools which allows computers to generate predictions about the world around us and find correlations and complex insights across data from nearly any domain.
A data scientist will often test multiple algorithms and choose the one best fitted to the data, depending on the requirements of the project.
Error metric selection
Data scientists must choose an appropriate measurement to evaluate the performance of each model. They have a wide range of statistical tests at their disposal, but their selection must be non-arbitrary, defensible, and made before they design the model.
Analysis of output
Fundamentally, the ‘science’ in data science comes from a rigorous application of the scientific method. The final step after every experiment is to analyse the results and decide on the next steps.
The Project:
Objective: Use only macroeconomic data about a given country to predict the number of international tourists that will originate from that country in a given year.
Our target metric - International Departures - is defined as: “The number of individuals who leave their home country for tourism purposes at least one time in a calendar year.”
Key Challenges:
- Sparse data - for countries, or the years in which data on the countries was collected
- Multidimensional time series data
- We have multiple datasets, each of which plots data on 200+ countries, over 50 years
- To combine these, we have to create “wide” test/train datasets, with multiple values for multiple years in the same row (by country). This wouldn't be much of a challenge, except:
- We end up with a fairly small number of countries with sufficient data to train a model on, and need to cross validate using *all* of the countries in multiple test sets. We have to get creative.
- Selecting an appropriate error metric that *meaningfully* conveys the accuracy and implications of our model.
- Mean Squared Error may be the standard choice, but does it really tell us something useful about our predictions?
Datasets Collected:
- World Bank
- Other:
- Series for the US Dollar Index, over time
Plan of Action:
# Read each dataset into list of data frames
# What is our time range?
- For which years do we have sufficient data to train and test our model?
# What is our country set?
- Drop all countries with more than 20% missing data over our selected years
- Determine which of the remaining countries in each dataset are found in *every* dataset
- Drop countries which are not found in every dataset
# Can we trust the data? Does the remaining data reflect an accurate account of reality?
- Examine the ratio of unique entries in each trimmed dataset, to determine where the World Bank interpolated or otherwise gave data that potentially diverges from the ground truth.
- For each dataset: If the entries for a country are not mostly unique, drop the country
# Fill Missing Values: How will we interpolate data?
- Clean and fill remaining (minimal) missing data
# Normalize Local Currency Units (LCU) against Constant-Value US Dollar:
- Divide each LCU rate by US Dollar Index for given year
# Create Training Set
- Select appropriate columns from each dataset
- Filter by the first five years of data, to train the model on
- Rename columns so we can match them in the training set
- Assign values for departures in year six to target year, to train on
# Create Test Set
- Cross Validation: We don’t have enough countries to cross-validate on a stratified selection of rows; we have to get creative (see the sketch after this plan)
- Create (8) test sets - one for each year, 2006 - 2013
- Assign features based on selected data for the unique five-year time series preceding each target year
- Assign values for departures in each target year, to validate against
# Select A Model
- Determine which baseline regressor has the lowest mean squared error, averaged across our 8 test sets
# Tune the Best Model and Choose an Error Metric
- Determine a meaningful accuracy metric, and tune the hyper-parameters for best performance
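As referenced in the cross-validation step above, here is a minimal sketch of how the per-year test sets might be assembled; the data layout (one row per country, one column per year) and all names are assumptions, not the project's actual code:

```python
import pandas as pd

# `indicators` maps indicator names (e.g. "gdp_pc", "inflation") to frames
# with one row per country and one column per year; `departures` has the
# same layout.
def build_test_set(indicators, departures, target_year):
    features = {}
    for name, frame in indicators.items():
        for lag in range(5, 0, -1):                      # five trailing years
            features[f"{name}_t-{lag}"] = frame[target_year - lag]
    features["departures_baseline"] = departures[target_year - 1]
    features["target_departures"] = departures[target_year]
    # Columns align on the shared country index; drop countries with gaps.
    return pd.DataFrame(features).dropna()

# Eight test sets, one per target year from 2006 through 2013:
# test_sets = [build_test_set(indicators, departures, y) for y in range(2006, 2014)]
```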
A Justification of Feature Selection:
GDP Per Capita & GDP Per Capita Growth
GDP Per Capita is a measure of the total size of a country’s economy, divided by the size of that country’s population. It is a widely-used proxy for the wealth of individuals in a given country.
We used World Bank datasets for both the size and the growth of GDP Per Capita over time, even though these two datasets are simply functions of one another. There is useful information at the margins:
We included GDP Per Capita, even though it is collinear with GDP Per Capita growth, in order to establish a baseline value per country, in absolute terms. In other words, GDPPC Growth explains the delta between years, but we need the hard numbers from the first dataset to establish the absolute magnitude.
We could have included only the GDP Per Capita values, but the GDPPC Growth dataset provides insight that the former cannot. The growth in the last year in our range acts as a proxy for the size of the economy in the target year. It is the closest thing we have to an estimate of each country’s economic health in the target year, without leaking future data.
Inflation:
We included inflation, because it stands to reason that there is a negative cross-price elasticity between expenditures in a domestic market, and expenditures in a foreign market. In other words, the more expensive it becomes to travel (or simply live) in a domestic market, the more attractive foreign markets become.
Local Currency Exchange [GDP vs Purchasing Power Parity vs Inflation]
Purchasing Power Parity (PPP) is a measure of how far money, denoted in dollars, goes in each country. Economists weight the GDP for each country by the cost of a standardized basket of consumer goods, to determine a normalized value for the purchasing power that stems from a measure of raw GDP. It is usually a better indicator of a country’s economic environment than raw GDP, but we've excluded it here, because:
Inflation, paired with the exchange rate of a currency vs a basket of international currencies, are the primary factors by which a calculation of GDP PC can be turned into one for GDP PC PPP (Purchasing Power Parity). Thus, inflation is also an excellent (if incomplete) proxy for GDP PC PPP - a measure we chose to leave out because of its strong correlation with GDP.
We've also included the value of a country’s currency, pegged against an international benchmark, as a feature. In simple terms, the more foreign currency one can buy with a single unit of domestic currency, the more attractive travel abroad becomes. If the Chinese Yuan increases significantly against the Thai Baht, it is nearly guaranteed that, all else held constant, Thailand will see an influx of Chinese tourists.
Note that the inverse should also be true, as a predictor of inbound tourists to a country. If Thai Baht decreases against the Yuan, fewer Thais will visit China (all else being equal).
Dollar Index Series
The temptation is to peg each country’s currency against the international reserve currency - the US dollar. After all, this is exactly how economists measure economic activity in every country around the world. However, quick EDA of the currency dataset we are using tells us that the exchange rate for the “United States” against the dollar stays constant (1:1) across the entire time series. While you might *expect* this to be true, this is actually important information. This means that US Dollars have not been normalized.
They could have been normalized against an index year, for purchasing power parity (eg. "Year 2000 Dollars”). They could have been normalized against an international benchmark - such as the "US Dollar Index.”
Because US dollars have not been normalized, and all other currencies here are measured in US Dollars, we will have to normalize all these values against the US Dollar index.
But why must we do this? Assumptions that the “value” of the US dollar remains constant against other currencies simply do not hold. The dollar is not an immutable benchmark around which other currencies fluctuate. Its value is derived from indicators intrinsic to American economic activity, foreign policy, and the actions of the US Federal Reserve. In the simplest possible terms, if a country's currency in a given year goes down ten percent against the dollar, we cannot say *which* currency moved. If the value of the Russian Ruble increases from 1:70 to 1:60, did the Ruble go up, or did the Dollar fall?
We have to find a way to normalize the data. One method would be to find the mean average value of a dollar against ALL currencies in the data set, across each year in our time series, then calculate the change in the average value of the Dollar, in each year. This would work, but it gives equal weight to all currencies and would be misleading. The performance of the Egyptian Pound against the Dollar should not carry as much sway as that of the Euro, Yen, or Yuan.
Fortunately, there IS a standardized index by which the Dollar is measured - the US Dollar Index. It maps the value of the US Dollar over time, against a weighted basket of foreign currencies.
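A minimal sketch of that normalization (layouts and names are assumptions):

```python
import pandas as pd

# `lcu_per_usd` has one row per country and one column per year;
# `dollar_index` is a Series indexed by year.
def normalize_lcu(lcu_per_usd: pd.DataFrame, dollar_index: pd.Series) -> pd.DataFrame:
    # Dividing along the column (year) axis removes moves that are really
    # moves in the dollar itself rather than in the local currency.
    return lcu_per_usd.div(dollar_index, axis="columns")
```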
Country Population
Quite obviously, the number of tourists that originate from a given country is correlated with that country’s population. The wealth of each citizen within a country is an important variable in determining that individual’s likelihood to travel, but the overall population of a country influences our equation in two ways:
- It sets a hard limit on the number of tourists that can come from a country
- When combined with a multiplier determined by the GDP Per Capita for each country, it should be an excellent proxy for the number of tourists that originate from that country.
Urban Population
This is a measure of the percent of each country’s population that lives in urban, versus rural, areas, as defined by the World Bank. Simply put, people who travel tend to have money - and people who have money tend to live in cities.
The urban population ratio is another proxy for individual wealth. Whereas GDP Per Capita is an effective measure, it is an imperfect one. GDP Per Capita tells us the *mean* wealth per person in a country, but says nothing about how that wealth is distributed. A country in which a small group of oligarchs hold all the wealth may have a relatively high GDP Per Capita, but a population of mostly poor people.
Including the ratio of urban population (and how that changes over time) should explain some additional variance.
Departures
This dataset contains the World Bank’s information on the number of outbound international tourists that originate from each country in a given year.
Because this is what we are trying to predict, we include data from this dataset in exactly two places, per train/test set:
- The number of departures in the year preceding the one we want to predict, to establish a baseline.
- The number of departures in the year we want to predict - to train the training set on, or to validate the test predictions against.
Note that we did NOT include any additional information or features from this dataset. To do so would have only helped us, but this project is an exercise in using *other data* to predict trends in the tourism industry.
The Code
Without any further ado, here's the project. We walk through the code in the comments.
~ Putting it all together ~
That's all for part one.
In part two, we iterate over our time series to create cross-validation test sets, select a model, and evaluate our predictions.
CLICK HERE FOR PART TWO.