DATA EXPLORATIONS

The World Through Whose Eyes?

4/3/2017

 
This is part 3 of an analysis of costs of living and local purchasing power, in five hundred major cities around the world.

For part one, click here: Primary Drivers of Costs of Living, Worldwide
For part two, click here:  Are People in More Expensive Countries Richer?

The data was sourced from Numbeo.com, which hosts user-contributed data - current within the last 18 months.
The IPython Notebook for this project is available on github.

Rich and poor, cheap and expensive - these are all relative terms. To an American tourist, for example, China might be cheap. But to a Cambodian, China is fairly expensive. What we are interested in determining is: how does the cost of living in various countries around the world look, depending on your country of origin?

To this end, we've put together two visualizations, using Tableau. (This was fairly easy to do, because we'd already crunched the numbers in the first two sections of this analysis of data from Numbeo.com.)

  • For each region in the world, which are the relatively rich countries, and which are relatively poor?
    • Wealth here is expressed in terms of purchasing power parity, rather than absolute income
  • How do the costs of living relate to the costs of living "back home", for citizens of these four countries?
    • America
    • The UK
    • Russia
    • China

Note that for the second visualization, we define the baseline cost of living in each country as the median of that country's top five most expensive cities - except Russia, where we used the mean of costs in St. Petersburg and Moscow. The logic here is that the typical tourist from a given country tends to be richer than average: people with the resources to travel tend to live in bigger, more expensive cities. A disproportionate share of tourists come from these cities, so the relative costs of living around the world should be measured against the costs these tourists actually experience back home.
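For readers who want to reproduce that baseline, a minimal pandas sketch is below. It assumes a hypothetical dataframe `cities` with columns `country`, `city`, and `cost_of_living_index`; the names are illustrative, not the project's actual variables.

```python
import pandas as pd

def country_baseline(cities: pd.DataFrame) -> pd.Series:
    """Baseline cost per country: median of its five most expensive cities;
    for Russia, the mean of Moscow and St. Petersburg."""
    top5 = (cities.sort_values("cost_of_living_index", ascending=False)
                  .groupby("country")
                  .head(5))
    baseline = top5.groupby("country")["cost_of_living_index"].median()

    # Override for Russia, per the note above.
    russia = cities[cities["city"].isin(["Moscow", "St. Petersburg"])]
    if not russia.empty:
        baseline["Russia"] = russia["cost_of_living_index"].mean()
    return baseline
```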

Are People in More Expensive Countries Richer?

4/3/2017

 

This is part 2 of an analysis of costs of living and local purchasing power, in five hundred major cities around the world.

For part one, click here: Primary Drivers of Costs of Living, Worldwide
For part three, click here: The World Through Whose Eyes?

The data was sourced from Numbeo.com, which hosts user-contributed data - current within the last 18 months.
The IPython Notebook for this project is available on github.

Is there a relationship between cost of living and local purchasing power, in cities around the world?

Note that Local Purchasing Power is a measure that *already* takes into account the cost of living. It is not a measure of absolute wages; rather, it describes the amount of "spending power" a person has, given both:
​
  • Their wage, and
  • The cost of goods and services in their area

If wages scaled exactly in proportion to the cost of living, there would be *no* variance in local purchasing power.
And if there were no variance in local purchasing power, it could hold no relationship to cost of living.
(Wages rising in lockstep with costs would prevent the cost of living from influencing local purchasing power.)

We do, in fact, see a low degree of correlation in most regions in the world:
Regions with no or very weak correlation between cost of living and local purchasing power: 
  • NORTH AMERICA * P-value: 0.925
  • OCEANIA * P-value: 0.812
  • INDIA * P-value: 0.184
  • MIDDLE EAST * P-value: 0.493
  • AFRICA * P-value: 0.778
  • LATIN AMERICA * P-value: 0.215

And yet, for two regions in the world, there is a very high correlation:
  • Europe:
    • Correlation Strength: STRONG
    • P-Value: 0.0
    • R-Squared: 0.53
  • Asia:
    • Correlation Strength: MEDIUM
    • P-Value: 0.001 
    • R-Squared: 0.251
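For readers who want to reproduce this kind of table, a minimal sketch of the per-region test is below. It assumes a dataframe `df` with hypothetical columns `region`, `cost_of_living`, and `purchasing_power` (one row per city); these names are illustrative, not the notebook's actual variables.

```python
import pandas as pd
from scipy import stats

def correlation_by_region(df: pd.DataFrame) -> pd.DataFrame:
    """Linear fit of local purchasing power against cost of living, per region."""
    rows = []
    for region, group in df.groupby("region"):
        fit = stats.linregress(group["cost_of_living"], group["purchasing_power"])
        rows.append({"region": region,
                     "p_value": fit.pvalue,
                     "r_squared": fit.rvalue ** 2,
                     "slope": fit.slope})
    return pd.DataFrame(rows).sort_values("p_value")
```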

These relationships become immediately clear when we plot local purchasing power vs cost of living:
[Scatter plots: local purchasing power vs. cost of living, for Europe and for Asia, with regression lines and 95% confidence bands]
The blue line in each chart is our regression line, and the light blue bands are our 95% confidence intervals.

We see from these plots that:
  • There is a definite correlation between local purchasing power and cost of living
  • The more expensive a city is to live in, the *MORE* spending power the average person in that city enjoys.

In Europe:
  • For every 1% rise in the cost of living, purchasing power goes up by 1.15%.
  • A 1% increase in the cost of living corresponds to a 2.17% rise in expected wages.
  • This relationship alone can explain over half of the variance (53%).

In Asia:
  • For every 1% rise in the cost of living, purchasing power goes up by 0.98%.
  • A 1% increase in the cost of living corresponds to a 2.0% rise in expected wages.
  • This relationship alone can explain approximately one quarter of the variance (25%).
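As a quick sanity check on those figures: if we take local purchasing power to be, roughly, wages deflated by the local cost of living (an assumption on our part - the index's exact construction is more involved), the three percentages in each list hang together:

```latex
\text{PP} \approx \frac{W}{C}
\quad\Longrightarrow\quad
\%\Delta W \;\approx\; \%\Delta C + \%\Delta \text{PP}
```

For Europe, 1% + 1.15% ≈ 2.15%, close to the reported 2.17% rise in expected wages; for Asia, 1% + 0.98% ≈ 2.0%, matching the reported figure.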

Why might this be so?

North America and Oceania contain only rich countries, which are economically similar. There is relatively little variance in income or cost of living between - or within - these countries. Africa and Latin America are composed mostly of poorer countries with both low costs of living and low incomes. While there is definite variance between the cost of living in, say, Chile and Paraguay, there is still not a tremendous amount of variance between countries.

Generally speaking - and this is only a hypothesis - it may be true that wages and cost of living tend to scale proportionally for cities within a single country, while they do not necessarily scale proportionally between countries. Europe and Asia - more so than any other regions - are both home to economies with a huge amount of variance between countries.

This becomes quite clear when we plot the data in Tableau:
​
  • Color indicates the total cost of living in a city (red is expensive, green is cheap).
  • Size indicates local purchasing power in a city (bigger is wealthier).

In Europe, we see a very strong correlation between size and color.
  • More expensive cities are wealthier, despite the higher costs of living.
In America, we see very little correlation between size and color.
  • Some cities are cheap, with wealthy people; some are expensive with poor people; some are rich and expensive, some are poor and cheap.


Next up - Part Three: "The World Through Whose Eyes?" - Costs of living around the world, relative to your country of residence.

Primary Drivers of Costs of Living in 500 Major Cities, Worldwide

3/31/2017

 
This project explores the costs of living and purchasing power characteristics of 500 major cities around the world.

 The analysis in this post concerns itself with the following questions:
  1. Is there a relationship between the cost of living and local purchasing power?
  2. Which is the primary driver of cost in "expensive" cities - rent, or non-rent costs of living?
  3. How can we visualize the data - and highlight the differences between regions?

The data was sourced from Numbeo.com, which hosts user-contributed data - current within the last 18 months.
The IPython Notebook for this project is available on github.

First, let's take a look at our key metrics - local purchasing power and total cost of living (including rent) - on a global scale:

Next, let's look at the distribution of purchasing power in each region of the world.

The first three graphs plot the proportion of cities in each region that enjoy varying levels of wealth, relative to the worldwide median. The fourth graph plots the same metric for the world at large, against the medians for each of our regions. (These are just two slightly different views into the same data).

[Histograms: distribution of local purchasing power relative to the worldwide median, for each region and for the world at large]

Next, let's look at the distribution of rent costs - same regions, different metric.
​

How much does the average price of rent vary from city to city, for each region in the world?
[Histograms: distribution of rent costs relative to the worldwide median, for each region]

What is the relationship between non-rent costs of living, and cost of rent, for the top 500 major cities, worldwide?
Which of these costs varies more?
[Scatter plot and KDE plot: cost of rent vs. non-rent cost of living, relative to the worldwide median]
The scatterplot on the left illustrates cost of rent vs. non-rent costs of living*.
The kdeplot (essentially, a smoothed histogram) shows a much greater spread in the cost of rent, compared to non-rent costs*.

What does this tell us?
  • The cost of rent is indeed correlated with the non-rent cost of living in any given city
  • While it is unclear which, if either, variable is causal, an r-squared of 0.63 indicates a strong correlation.
  • A rise in the non-rent cost of living is correlated with a much greater rise in the cost of rent in that city
  • (Or, conversely, a rise in the cost of rent correlates to a predictable, but smaller increase in non-rent costs of living)

On average, as costs of living increase, rent increases a full 2.13 times faster.

*The above charts graph the delta in costs for each city, with relation to the worldwide median for each metric.
The scatterplot, then, does not illustrate absolute costs, but rather the ratio by which costs are more (or less) expensive than average.
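A minimal sketch of that comparison is below, assuming the city-level ratios described above live in hypothetical columns `rent_ratio` and `non_rent_ratio` (each city's cost as a multiple of the worldwide median); this is an illustration, not the notebook's code.

```python
from scipy import stats

def rent_sensitivity(df):
    """Regress each city's rent ratio on its non-rent cost ratio.

    The slope answers 'how much faster does rent rise than non-rent costs?'
    (reported in the post as roughly 2.13, with an r-squared near 0.63)."""
    fit = stats.linregress(df["non_rent_ratio"], df["rent_ratio"])
    return fit.slope, fit.rvalue ** 2
```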


​This raises an interesting question:

Given that rent and non-rent costs are strongly correlated, and given that rent rises faster (relative to the worldwide median) than non-rent costs of living, to what degree is rent (rather than non-rent) the major driver of variance in cost, for cities around the world?

In simple terms, are "expensive" cities expensive because rent in those cities is expensive, or because non-rent factors drive up the cost of living?

The following graph plots each of our major cities relative to the worldwide median cost of living:

  • Cities range from most expensive on the left, to least expensive on the right.
  • The absolute height of each bar indicates the multiple above or below the worldwide median.
  • Red indicates the portion of the variance due to costs of rent.
  • Blue indicates the portion of the variance due to non-rent costs.
[Bar chart: each city's cost of living relative to the worldwide median, split into rent (red) and non-rent (blue) contributions]
From this chart, it becomes immediately apparent that - in the vast majority of cities around the world - rent is the primary driver of cost.

The typical expensive city is expensive because rent in that city is expensive; cheap cities, in turn, are cheap because rent is cheap.

The more astute readers will notice that more than half of our cities in this visualization fall above the "median." This is because we used a calculated median to address sampling bias in the data. Significantly more than half of the cities sampled are from rich countries. This means that taking a simple median (the 250th of 500 data points) would result in a measure that was more expensive than the true worldwide median cost of living. To fix this we first calculated the median cost in each region, then took the median of all of our regional medians. 
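A sketch of that correction, assuming the same kind of hypothetical `cities` dataframe with `region` and `cost_of_living_index` columns:

```python
def worldwide_median(cities):
    """Median of regional medians, so over-sampled rich regions don't
    drag the 'worldwide' benchmark upward."""
    regional_medians = cities.groupby("region")["cost_of_living_index"].median()
    return regional_medians.median()
```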

A final point that needs to be addressed is how we calculated the cost ratio for each city, relative to our median:

Numbeo creates its total cost of living index by attributing (essentially) equal weight to both rent and non-rent costs of living. Thus, 50% of the cost for any city is derived from rent, and the other 50% from non-rent costs. Using these figures would have resulted in a boring and quite useless graph, with equal parts red and blue for every city. To get around this, we calculated two additional indexes for each city:

  • Cost of living (non-rent), as multiple above/below worldwide median (again, median of regional medians)
  • Cost of rent, as multiple above/below worldwide median

We then used these two columns to calculate one final metric: the proportion of the variance in total cost that is attributable to rent costs, specifically. This is the metric which determines our red/blue splits on the graph displayed above.
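The post does not spell out the exact formula, so the sketch below is one plausible reading of the method, with hypothetical column names and the world medians passed in from the calculation shown earlier; the rent-share formula itself is our assumption, not the notebook's code.

```python
def rent_share_of_deviation(cities, world_median_rent, world_median_non_rent):
    """Two ratio indexes plus the share of each city's cost deviation
    that comes from rent (illustrative construction only)."""
    rent_ratio = cities["rent_index"] / world_median_rent
    non_rent_ratio = cities["non_rent_index"] / world_median_non_rent

    rent_dev = rent_ratio - 1.0          # multiple above/below the median
    non_rent_dev = non_rent_ratio - 1.0
    total_dev = rent_dev + non_rent_dev

    # Fraction of the deviation attributable to rent (the red portion of the bars).
    # Cities sitting almost exactly at the median will produce unstable shares.
    return rent_dev / total_dev
```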

Here is another graph, highlighting the top five most expensive, and top five least expensive cities in our dataset:
[Bar chart: the five most expensive and five least expensive cities, split into rent and non-rent contributions]
Interestingly, San Francisco (a city in which the author of the study has lived) is the second most expensive location in the world, and nearly all of that expense is due to the cost of rent. This is no surprise, as San Francisco holds claim to some of the most expensive real estate on the entire continent.

​
​
Next, let's explore another of our hypotheses:  
​Is there a relationship between cost of living and local purchasing power?

International Tourism and Neural Networks - Part 2: Model Selection & Predictions

3/23/2017

 

​This is part two, where we train our model and select an accuracy measurement.
In part one, we laid out the project, explained the features and target, and walked through the code to create the datasets.
​

​CLICK HERE FOR PART 1

In the following three kernels, we:
  • Import our models 
  • Define a function to create test sets from our data, given a target year (returning a features data frame and a target series)
  • Initialize several regression* models, and train each of them on the data
  • Validate on each of our eight test datasets (one for each year, 2006-2013)
  • Return the average mean squared error (MSE) for each classifier, over the eight datasets
We then select the best model (a neural network), tune it, and develop a meaningful error metric to evaluate it with.

*In the code, we create a list of 'classifiers'. This is incorrect; these are regression models. This has no effect on our results.
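A condensed sketch of that loop using scikit-learn is below. The `make_dataset(year)` helper (returning a features frame and a target series for a given target year) and the particular models listed are stand-ins for whatever the notebook actually defines, not its exact code.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

def average_mse(models, X_train, y_train, make_dataset, test_years=range(2006, 2014)):
    """Train each regression model once, then average its MSE over the
    eight yearly test sets (2006-2013)."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        yearly_mse = []
        for year in test_years:
            X_test, y_test = make_dataset(year)
            yearly_mse.append(mean_squared_error(y_test, model.predict(X_test)))
        scores[name] = np.mean(yearly_mse)
    return scores

# Example usage (illustrative model list):
# models = {"linear": LinearRegression(),
#           "forest": RandomForestRegressor(n_estimators=200),
#           "mlp": MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000)}
# scores = average_mse(models, X_train, y_train, make_dataset)
```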
[Notebook code cells: model comparison across the eight test sets, model tuning, and the error-tolerance curve]
Some notes on our error metric:

What we have plotted here is not a ROC (Receiver Operating Characteristic) curve.
  • ROC is a metric used to evaluate classifiers, while our model is a regression model.
This is also not a Precision-Recall curve.
  • As with ROC, Precision and Recall are metrics which evaluate the tradeoffs of probability thresholds for classification models.

What we needed was something to show the relationship between tolerance for error, and the model's ability to predict values within a given tolerance threshold. This is precisely the relationship the above graph illustrates.

Note that we took the mean absolute error of each test set, to calculate the area under our error-tolerances curve.
If we simply averaged the error together, much of the error would cancel out. Predictions that were over in one test set would cancel out predictions that were under in another. Taking the absolute error gives us the magnitude of the average error.
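One plausible way to construct that curve and its area is sketched below, assuming `avg_abs_pct_error` holds one value per country - the mean absolute percentage error across the eight test sets. The post does not give the exact formula, so treat this as an illustration:

```python
import numpy as np

def error_tolerance_curve(avg_abs_pct_error, tolerances=None):
    """For each error tolerance, the fraction of countries whose average
    absolute error falls within it, plus the normalized area under that curve."""
    if tolerances is None:
        tolerances = np.linspace(0.0, 1.0, 101)   # 0% to 100% tolerance
    errors = np.asarray(avg_abs_pct_error)
    coverage = np.array([(errors <= t).mean() for t in tolerances])
    area = np.trapz(coverage, tolerances) / (tolerances[-1] - tolerances[0])
    return tolerances, coverage, area
```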
​


A note on real-world application.

Imagine the following: You are in charge of corporate strategy for a company that sells group tours to clients from all over the world. You operate in a small island country, with the majority of clients coming from three countries - Scienceville, Datatopia, and Regressionland. You are responsible for setting staffing levels - a number which changes every year - and for setting contracts with local operators. Both depend on the total quantity of tourists you expect, as well as the composition of those tourists (people from Datatopia have different tastes than people from Scienceville).

If you gain access to our model, you can make fairly informed decisions about how many tourists will come from each of these three countries. Even if the total number of tourists who come to your island remains fairly constant, the proportion of tourists by country may shift dramatically. This model allows you to predict these changes, and prepare accordingly. (Assuming that preferences of the citizens in each of these countries do not significantly change - i.e, your island remains a preferred destination for Datatopian tourists.)

Conclusion & Considerations:

Using nothing but five years of trailing macroeconomic indicators, and a baseline level of departures in a given year, we can make fairly accurate predictions about the number of tourists that will originate from a given country.

Area under Error-Tolerances Curve: 0.8462
Percent of Countries with less than 15% avg. absolute error: 79.41%

This is in line with our initial hypothesis.

Note that this model is for illustration only. The purpose of the exercise is to test the predictive power of macro-economic indicators - not to get the most accurate predictions possible, using all data we could possibly source from. A production model would use all available and relevant information (including departures data over time, rather than using it only to establish a baseline in a single year). 

Without using additional data, there is still plenty of room for improvement:

  • Create more robust models. 
    • Merge our predictions together into a single data frame, and fit an ensemble model to those predictions
    • Train and optimize one kind of model (neural network) on three different years, then average the predictions of all three models together.
  • Deal with outliers.
    • Our model is *fairly* robust. If seven test sets predicted a country's departures with under 5% error, and only 1 predicted with 100% error (e.g. predicting 2x the actual number of tourists), we would see a ~17% mean error (which happens to be the actual mean error we got).
    • Despite this, we do get some extreme outliers. Even if most of the countries fall within a safe prediction range, a few of them have wildly-inaccurate predictions. This would have serious implications in a real world model.
  • Create more advanced features. Using just the available data, we could create feature columns such as:
    • Average GDP Growth Rate past 5 years (CAGR)
    • Average inflation over past 5 years
    • Total % Change in local currency value over 5 years, denominated in normalized dollars

~ That's all for now ~

Exploring natural language processing and cluster analysis through song lyrics

3/22/2017

 

Introduction:

Cluster analysis is a valuable machine learning tool, with applications across virtually every discipline. These range from analyzing satellite images of agricultural areas to identify different types of crops, to finding themes across the billions of social media posts that are broadcast publicly every month.
​
Clustering is often used together with natural language processing (NLP). NLP allows us to turn unstructured text data into structured numerical data. This is how we transform human language into the computer-friendly inputs required to apply machine learning.

For a straightforward demonstration of how these two techniques can be applied in sequence, we’ve chosen an easily accessible public data set: the 'Top 50 Songs of 2015' by Genius, the digital media company. In our example, we look at how songs can be broken into similar clusters by their lyrics, and then conduct further statistical analysis on the resulting clusters.
​

Machine learning and other data science methodologies increasingly confer great powers upon those who practice them. They are not, however, without limitations. The misapplication or misinterpretation of these methodologies can lead not only to a reduced ability to extract insights, but also to categorically false or misleading conclusions. We examine potential mishaps in the transformation and analysis of the data at various stages in this study.
This study explores the use of cluster analysis to answer the following questions:
  1. Can you use machine learning models to analyze the lyrics of a set of songs, and cluster them into similar kinds of songs?
  2. Do word count or breadth of vocabulary hold predictive value for a song's rank, within that set?
  3. Do the averages of word count and breadth of vocabulary for each cluster of song hold predictive value for the average rank for songs in that cluster?​
Notes:
  • This analysis was inspired by the thoughtful cluster analysis of Reddit's top 50 subreddits, made available by Ari Morcos.
  • Data was sourced from Genius.com's user-curated list of top songs for 2015, available here.

Methodology:

We performed this analysis in Python, using Pandas, Sklearn, and NLTK to model the data. Graphs were projected with matplotlib and/or seaborn.
​

Steps:
  1. Clean and tokenize each song’s lyrics
  2. Create a dictionary containing every unique word appearing in the data, and its frequency
  3. Select the words which appear more than an average of twice per song (only the top 36, out of 3,126 unique words)
  4. Assign the normalized frequency of each of these top 36 words to a vector, for each song
  5. Calculate the Euclidean distance between each pair of songs, and project these values into a matrix
  6. Perform principal component analysis (PCA) on the matrix
  7. Feed the matrix with reduced dimensionality into a K-Means clustering algorithm*

​ * We first fed the data into an affinity propagation model, but found the model unsuited to the data.
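A compact sketch of steps 1-7, using NLTK, SciPy, and scikit-learn, is below. The tokenization details, the way word frequencies are normalized, and the variable names are assumptions made for illustration (and NLTK's 'punkt' tokenizer data must be downloaded beforehand):

```python
import numpy as np
from collections import Counter
from nltk.tokenize import word_tokenize          # requires nltk.download("punkt")
from scipy.spatial.distance import pdist, squareform
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def cluster_songs(lyrics, n_components=9, n_clusters=8):
    """lyrics: list of raw lyric strings, one per song."""
    # Steps 1-2: clean/tokenize each song and count every unique word.
    tokens = [[w.lower() for w in word_tokenize(text) if w.isalpha()]
              for text in lyrics]
    counts = Counter(word for song in tokens for word in song)

    # Step 3: keep words appearing more than twice per song on average.
    top_words = [w for w, c in counts.items() if c > 2 * len(lyrics)]

    # Step 4: normalized frequency of each top word, per song (one possible normalization).
    X = np.array([[song.count(w) / max(len(song), 1) for w in top_words]
                  for song in tokens])

    # Step 5: pairwise Euclidean distances, projected into a square matrix.
    distances = squareform(pdist(X, metric="euclidean"))

    # Steps 6-7: PCA on the distance matrix, then K-Means on the reduced data.
    reduced = PCA(n_components=n_components).fit_transform(distances)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(reduced)
    return distances, labels
```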

Further Analysis:

After clustering, we analyzed the relationships between word count, vocabulary, songs, and the clusters the songs were assigned to. Results are below.

​Project Results and Analysis:
​

After preprocessing the data and determining the similarity coefficient for each song, we projected them into a matrix:

Cooler colors are more similar, while hotter colors are more dissimilar.
[Heatmap: pairwise lyrical distance between songs]

Top 5 Most Similar Songs:

  1. Father John Misty, "The Night Josh Tillman Came To Our Apt" vs. Kendrick Lamar, "The Blacker The Berry" - dist. 0.835
  2. Vince Staples, "Jump Off The Roof" vs. The Weeknd, "Tell Your Friends" - dist. 0.873
  3. The Weeknd, "Tell Your Friends" vs. Drake, "Back To Back" - dist. 0.908
  4. Joey Bada$$, "Paper Trail" vs. Lupe Fiasco, "Prisoner 1 And 2" - dist. 0.938
  5. Dr. Dre, "Darkside Gone" vs. Drake, "Back To Back" - dist. 0.939

Top 5 Most Dissimilar Songs:

  1. Sufjan Stevens, "Blue Bucket Of Gold" vs. Major Lazer, "Lean On" - dist. 3.063
  2. Major Lazer, "Lean On" vs. D'Angelo and The Vanguard, "Really Love" - dist. 3.015
  3. Towkio, "Heaven Only Knows" vs. Major Lazer, "Lean On" - dist. 2.808
  4. Post Malone, "White Iverson" vs. Major Lazer, "Lean On" - dist. 2.780
  5. Major Lazer, "Lean On" vs. The Weeknd, "The Hills" - dist. 2.769

Interestingly, Major Lazer's "Lean On" is so lyrically dissimilar from every other song that it occupies all five of the top dissimilarity ranks, paired against five different songs.

Principal Component Analysis

Unfortunately, PCA does not help here as much as it might in other cases: we need the first 9 principal components just to account for a little over 60% of the variance.
[Plot: cumulative variance explained by principal component]

Clustering the songs

The Reddit analysis that inspired this one used an affinity propagation algorithm to perform its clustering. Affinity propagation does not require you to specify the number of clusters up front. This means that, unlike with K-Means, the number of clusters is NOT arbitrary. It is therefore often a great choice, when it works.

We, however, found K-Means to be a better choice for this dataset. The affinity propagation model kept outputting very uneven class distributions: we'd get roughly 8 to 12 clusters, 3 of which would contain about 90% of the songs, while the remaining clusters had just 1 or 2 songs each.

This kind of clustering isn't necessarily bad. If many songs are similar to one another, and a few are very dissimilar to all the others, then the model is just doing its job and showing us those relationships. We wanted to perform analysis of the summary statistics for each class, though, and the large number of single-song classes made that difficult.

After some initial tweaking, we fit a KMeans model to the first 9 dimensions of our PCA transform.
We found 8 clusters to be an effective tradeoff between the number of classes, and the distribution of songs within those classes.
[Chart: distribution of songs across the 8 clusters]
Of the 8 classes we get, five of them have six or more songs within them.
Now that we've clustered the Genius Top 50 Songs of 2015 into eight groups, how can we visually represent them?

If we plot the first 3 principal components along 3 axes, we can already start to see some clustering. This plot is actually more illustrative than might be expected: even though these 3 dimensions only account for ~25% of the variance, we can see a reasonable degree of grouping.
[3D scatter plot: songs plotted on the first three principal components, colored by cluster]

Besides lyrics, what do classes share in common?


​First, let's visualize some of our distributions:

1. On the left, we have a box plot showing the distribution of words per song.
  • The mean is just over 400 words.
  • The standard deviation is large relative to the mean, and word count is distributed fairly equally around it.
    • The first and third quartiles range between ~300 and ~600 words

2. On the right, we have a histogram depicting the number of songs which fall into bins characterized by word quantity. 
  • The distribution is right-skewed, due to a single song with over 1400 words (3.5x the mean)
[Box plot and histogram: distribution of words per song]


Next, let's see whether there is any correlation between classes, and the average number of unique words within each class.

There is no reason there *should* be, as songs were clustered by the words themselves, rather than by the number of words.

Still, there may be some correlation. Perhaps songs with similar lyrics share a similar breadth of vocabulary (number of unique words per song).
[Scatter plot: word count and unique-word count vs. assigned cluster]
There is no correlation between the number of words or number of unique words, and the class a song was assigned to.

This graph is somewhat misleading, because it implies the possibility of some kind of linear relationship between classes and the number of words or unique words (or anything else at all). In reality, the class labels themselves are meaningless; they are randomly assigned by the KMeans algorithm. The information they contain relates to the ways in which songs with the same label relate to one another.

This graph itself, though, conveys valuable information.

If there were a correlation between classes and number of unique words, we would expect to see some kind of clustering along the horizontal axis, for each class along the vertical axis.

There is none of that whatsoever, indicating a complete lack of any correlation between classes and the number of words for songs within them. In fact, the R-Squared (not pictured here) between classes and unique words is ~0.01, or essentially zero.
​

Word Count & Unique Words Vs Song Ranking​


​Is Song Rank a function of the number of unique words in a song?
Is Song Rank a function of a song's word count?
[Scatter plots: song rank vs. unique word count, and song rank vs. total word count]
No. With R-Squared values of 0.0037, and 2.4e-6, there is absolutely no correlation between the number of unique words - or total words - in a song, and the song's ranking.
​

Is there a correlation between the AVERAGE rank per class, and the AVERAGE word count or unique word count per class?
Note that we dropped 3 clusters here. These clusters had only 1, 1, and 3 songs, respectively.
If we keep those clusters, single songs may exert far too much leverage on the rankings for each cluster.
The remaining 5 classes contain 45 of our 50 songs - or 90% of the data.
Each of these has between 6 and 12 songs in it.
[Scatter plots: average rank per cluster vs. average word count and average unique-word count]
Yes. There does appear to be a correlation (explaining 13% and 34% of the variance, respectively).

This finding seems to imply that, while a song's word count or vocabulary breadth holds no value as a predictor of its ranking, the average word count or breadth of vocabulary for each kind* of song may actually hold some predictive value for that kind of song's average ranking.

* The easiest analogy to kinds of songs is genre. But these "genres" are not defined by popularly-agreed divisions or titles (e.g. rap, country, rock). The "genres" here are determined** by the songs' lyrical similarities - or dissimilarities - to one another.

** It would be interesting to compare the classifications made by analysis of the lyrics, to the genres the songs actually do fall under. The question here is, "Do songs that are clustered together tend to fall into the same genre? If so, how significant is the correlation?". This, however, is another job for another time.
​

Conclusion

1. Songs can be effectively clustered together by analysis of their lyrics.

2. In the context of individual songs, there is no relationship between song ranking and number of words, or unique words, in a song.  

3. After clustering songs into similar kinds of songs, relationships do emerge:

     1. There is a weak inverse correlation between the average number of words in a cluster, and the average ranking
    
        * This can account for ~13% of the variance.  
        
    2. There is a stronger - but still not huge - inverse correlation between the average number of unique words in a cluster, and the average ranking.
    
        * This can account for ~34% of the variance
        
Considerations:  

1. Small sample size: 50 songs, and resultant 8 classes (5 of which we kept - accounting for 90% of the songs)

    * Statistical significance of a regression (class rank vs. word count) with only 5 data points is *very* limited.

    * It is neat, however, to see that there does at least *appear* to be a relationship between word count / vocabulary, and ranking
    
2. Subjective rankings.

   * Genius describes their process as such: "Contributors voted on an initial poll, spent weeks discussing revisions and replacements, and elected to write about their favorite tracks."  
    
     * While it does seem that the Genius community at large was polled, and that poll determined the songs that were ultimately selected, the actual ranking was not necessarily reflective of the community at large. Rather, a select group of individual contributors had the final say.
    
3. Relatively few words per song.

    * The average song had between ~300 and ~600 words. This led to a relatively *small* incidence of repeated words between songs. Of the 3,126 unique words found across the entire dataset, only the top 36 appeared more than 100 times cumulatively. This means that nearly 99% of the words appeared *less* than twice per song on average. This gives tremendous leverage to a small number of words.
    
    * While the statistical significance of regression techniques would be improved by a larger dataset, that would still not likely change the reality that 1% of the words accounts for 100% of the leverage in our model. Indeed, the analysis of Reddit's top 50 subreddits that this project was inspired by was a clustering of entire *forums* - with an associated plethora of data to draw from. 

International Tourism & Neural Networks, Part 1: Creating the Dataset, & Cleaning the Data

3/22/2017

 
Introduction

​Hypothesis: We can predict the number of tourists that will originate from a given country in a given year - if we understand just a few key macroeconomic indicators describing that country.

​China's outbound tourists spent $258 billion abroad in 2017 - almost double the spend of US tourists, according to the United Nations World Tourism Organisation. Yet it has only been half a decade since China first overtook the US and Germany. Can we explain these trends?

The straightforward answer is that there are four times as many Chinese as Americans (1.3bn vs 330m), and China is growing at a significantly faster pace. As its middle class grows, the Chinese become wealthier - and they travel more. No mystery there, it seems.

But why now? China's GDP has been growing at around or above 10% for the best part of four decades. And after surging dramatically at the start of this century, growth has dipped over the last decade to its current figure of 6-7%. Yet it is only in the past several years that outbound tourism has surged - and then doubled in just a few years.

So we can assume that: 

  1. Economics do influence the amount of tourists that originate from each country, and
  2. The relationship is non-linear, and it may be defined by certain asymptotes or inflection points

The data science techniques
This analysis explores the hypothesis above to walk the reader through the key techniques a data scientist would use to design, test, and refine an ML model. It shows how to extract real-world insights from public data sets. These techniques include:

[Note that the descriptions below are intended to provide a helpful overview, but are simplifications of very complex subjects.]

Data wrangling and cleaning
Over 90% of a typical data scientist's time is spent wrangling and cleaning. Data wrangling is the process of transforming raw data into a usable format for analysis. Data cleaning is the process of addressing errors and inconsistencies in the initial data sets. This is important because real world data is often messy and incomplete - a byproduct of real world interactions. Data scientists must transform the data they have into the data they need.

Exploratory data analysis
EDA is how a data scientist gets to grips with a new data set in order to make it useful. Data scientists must develop an intuitive sense for how to use statistical methodologies and data visualization techniques. This is important in order to explore a new data set and develop a strategy for how to extract insights from it.

Feature selection and engineering
Feature selection is how a data scientist chooses and refines which data they’re going to work with. In more technical terms, it is the process of applying rigorous scientific methodology to the curation, selection, and transformation of data which will be used as inputs in machine learning algorithms.

Model design and algorithm selection
Machine learning is a set of tools which allows computers to generate predictions about the world around us and find correlations and complex insights across data from nearly any domain.
A data scientist will often test multiple algorithms and choose the one best fitted to the data, depending on the requirements of the project.

Error metric selection
Data scientists must choose an appropriate measurement to evaluate the performance of each model. They have a wide range of statistical tests at their disposal, but their selection must be non-arbitrary, defensible, and made before they design the model.

Analysis of output
Fundamentally, the ‘science’ in data science comes from a rigorous application of the scientific method. The final step after every experiment is to analyse the results and decide on the next steps. 
​


​The Project: 

Objective: Use only macro economic data about given countries to predict the number of international tourists that will originate from each country, in a given year.

Our target metric - International Departures -  is defined as: “The number of individuals which leave their home country for tourism purposes, at least one time in a calendar year.”

Key Challenges:

  1. Sparse data - for countries, or the years in which data on the countries was collected
  2. Multidimensional time series data
    1. We have multiple datasets, each of which plots data on 200+ countries, over 50 years
    2. To combine these, we have to create “wide” test/train datasets, with multiple values for multiple years in the same row (by country). This wouldn't be much of a challenge, except:
    3. We end up with a fairly small number of countries with sufficient data to train a model on, and need to cross validate using *all* of the countries in multiple test sets. We have to get creative.
  3.  Selecting an appropriate error metric that *meaningfully* conveys the accuracy and implications of our model.
    1. Mean Squared Error may be the standard choice, but does it really tell us something useful about our predictions?


Datasets Collected:

  • World Bank
    • ​GDP Per Capita
    • GDP Per Capita Growth
    • Inflation
    • Population
    • Urban population, percent of total
    • International tourism, number of departures
  • Other:
    •  Series for the US Dollar Index, over time

Plan of Action:


# Read each dataset into list of data frames 

# What is our time range?

  1. For which years do we have sufficient data to train and test our model?

# What is our country set?

  1. Drop all countries with more than 20% missing data over our selected years
  2. Determine which of the remaining countries in each dataset are found in *every* dataset
  3. Drop countries which are not found in every dataset

# Can we trust the data? Does the remaining data reflect an accurate account of reality? 
  1. Examine the ratio of unique entries in each trimmed dataset, to determine where the World Bank interpolated or otherwise gave data that potentially diverges from the ground truth.
  2. For each dataset: If the entries for a country are not mostly unique, drop the country

# Fill Missing Values: How will we interpolate data?
  1. Clean and fill remaining (minimal) missing data

# Normalize Local Currency Units (LCU) against Constant-Value US Dollar:
  1. Divide each LCU rate by the US Dollar Index for the given year

# Create Training Set
  1. Select appropriate columns from each dataset
  2. Filter by the first five years of data, to train the model on
  3. Rename columns so we can match them in the training set
  4. Assign values for departures in year six to target year, to train on


# Create Test Set
  1. Cross Validation: We don’t have enough countries to cross validate on a stratified selection of rows; we have to get creative.
  2. Create (8) test sets - one for each year, 2006 - 2013
  3. Assign features based on selected data for the unique five-year time series preceding each target year
  4. Assign values for departures in each target year, to validate against (a code sketch of this construction follows the plan)


# Select A Model
  1. Determine which baseline regressor has the lowest mean squared error, averaged across our 8 test sets

# Tune the Best Model and Choose an Error Metric
  1. Determine a meaningful accuracy metric, and tune the hyper-parameters for best performance
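To make the "wide" train/test construction concrete, here is a sketch of what such a dataset-building helper could look like. It assumes each cleaned indicator is a dataframe indexed by country with integer year columns; the layout and names are assumptions for illustration, not the notebook's code.

```python
import pandas as pd

def make_dataset(indicators, departures, target_year, window=5):
    """Wide features/target pair for one target year.

    indicators: dict of {name: DataFrame (index=country, columns=years)}
    departures: DataFrame of departures (index=country, columns=years)
    Features = the `window` years preceding the target year for every
    indicator, plus the prior year's departures as a baseline.
    Target   = departures in the target year."""
    years = list(range(target_year - window, target_year))
    pieces = []
    for name, frame in indicators.items():
        piece = frame[years].copy()
        piece.columns = [f"{name}_minus_{target_year - y}" for y in years]
        pieces.append(piece)

    features = pd.concat(pieces, axis=1).dropna()
    features["departures_baseline"] = departures[target_year - 1].reindex(features.index)
    target = departures[target_year].reindex(features.index)
    return features, target
```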

A Justification of Feature Selection:

GDP Per Capita & GDP Per Capita Growth

GDP Per Capita is a measure of the total size of a country’s economy, divided by the size of that country’s population. It is a widely-used proxy for the wealth of individuals in a given country.

We used World Bank datasets for both the size and the growth of GDP Per Capita over time, even though these two datasets are simply functions of one another. There is useful information at the margins:

We included GDP Per Capita, even though it is collinear with GDP Per Capita growth, in order to establish a baseline value per country, in absolute terms. In other words, GDPPC Growth explains the delta between years, but we need the hard numbers from the first dataset to establish the absolute magnitude.

We could have included only the GDP Per Capita values, but the GDPPC Growth dataset provides insight that the former cannot. The growth in the last year in our range acts as a proxy for the size of the economy in the target year. It is the closest thing we have to an estimate of each country’s economic health in the target year, without leaking future data.


Inflation: 

We included inflation, because it stands to reason that there is a negative cross-price elasticity between expenditures in a domestic market, and expenditures in a foreign market. In other words, the more expensive it becomes to travel (or simply live) in a domestic market, the more attractive foreign markets become. 


Local Currency Exchange  [GDP vs Purchasing Power Parity vs Inflation]

Purchasing Power Parity (PPP) is a measure of how far money, denoted in dollars, goes in each country. Economists weight the GDP for each country by the cost of a standardized basket of consumer goods, to determine a normalized value for the purchasing power that stems from a measure of raw GDP. It is usually a better indicator of a country’s economic environment than raw GDP, but we've excluded it here, because:

Inflation and the exchange rate of a currency against a basket of international currencies are the primary factors by which a calculation of GDP PC can be turned into one for GDP PC PPP (Purchasing Power Parity). Thus, inflation is also an excellent (if incomplete) proxy for GDP PC PPP - a measure we chose to leave out because of its strong correlation with GDP.

We've also included the value of a country’s currency, pegged against an international benchmark, as a feature. In simple terms, the more foreign currency one can buy with a single unit of domestic currency, the more attractive travel abroad becomes. If the Chinese Yuan increases significantly against the Thai Baht, it is nearly guaranteed that, all else held constant, Thailand will see an influx of Chinese tourists. 

Note that the inverse should also be true, as a predictor of inbound tourists to a country. If Thai Baht decreases against the Yuan, fewer Thais will visit China (all else being equal).


Dollar Index Series

The temptation is to peg each country’s currency against the international reserve currency - the US dollar. After all, this is exactly how economists measure economic activity in every country around the world. However, quick EDA of the currency dataset we are using tells us that the exchange rate for the “United States” against the dollar stays constant (1:1) across the entire time series. While you might *expect* this to be true, this is actually important information. This means that US Dollars have not been normalized.

They could have been normalized against an index year, for purchasing power parity (eg. "Year 2000 Dollars”). They could have been normalized against an international benchmark - such as the "US Dollar Index.”

Because US dollars have not been normalized, and all other currencies here are measured ​in US Dollars, we will have to normalize all these values against the US Dollar index.

But why must we do this? The assumption that the "value" of the US dollar remains constant against other currencies simply does not hold. The dollar is not an immutable benchmark around which other currencies fluctuate. Its value is derived from indicators intrinsic to American economic activity, foreign policy, and the actions of the US Federal Reserve. In the simplest possible terms, if a country's currency in a given year goes down ten percent against the dollar, we cannot say *which* currency moved. If the value of the Russian Ruble increases from 1:70 to 1:60, did the Ruble go up, or did the Dollar fall?

We have to find a way to normalize the data. One method would be to find the mean average value of a dollar against ALL currencies in the data set, across each year in our time series, then calculate the change in the average value of the Dollar, in each year. This would work, but it gives equal weight to all currencies and would be misleading. The performance of the Egyptian Pound against the Dollar should not carry as much sway as that of the Euro, Yen, or Yuan.

Fortunately, there IS a standardized index by which the Dollar is measured - the US Dollar Index. It maps the value of the US Dollar over time, against a weighted basket of foreign currencies.
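A sketch of that normalization step, assuming a dataframe of exchange rates (local currency units per US dollar, indexed by country, with year columns) and a yearly US Dollar Index series; the names are illustrative:

```python
def normalize_lcu(lcu_per_usd, dollar_index):
    """Divide each year's LCU-per-USD rate by that year's US Dollar Index,
    so the dollar's own fluctuations are factored out of every series."""
    return lcu_per_usd.div(dollar_index, axis=1)
```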


Country Population

Quite obviously, the number of tourists that originate from a given country is correlated with that country’s population. The wealth of each citizen within a country is an important variable in determining that individual’s likelihood to travel, but the overall population of a country influences our equation in two ways:

  1.    It sets a hard limit on the number of tourists that can come from a country
  2.    When combined with a multiplier determined by the GDP Per Capita for each country, it should be an excellent proxy for the number of tourists that originate from that country.


Urban Population

This is a measure of the percent of each country’s population that lives in urban, versus rural, areas, as defined by the World Bank. Simply put, people who travel tend to have money - and people who have money tend to live in cities. 

The urban population ratio is another proxy for individual wealth. Whereas GDP Per Capita is an effective measure, it is an imperfect one. GDP Per Capita tells us the *mean* wealth per person in a country, but says nothing about how that wealth is distributed. A country in which a small group of oligarchs hold all the wealth may have a relatively high GDP Per Capita, but a population of mostly poor people. 

Including the ratio of urban population (and how that changes over time) should explain some additional variance.


Departures

This dataset contains the World Bank’s information on the number of outbound international tourists that originate from each country in a given year. 

Because this is what we are trying to predict, we include data from this dataset in exactly two places, per train/test set:

  1. The number of departures in the year preceding the one we want to predict, to establish a baseline.
  2. The number of departures in the year we want to predict - to train the training set on, or to validate the test predictions against.

Note that we did NOT include any additional information or features from this dataset. To do so would have only helped us, but this project is an exercise in using *other data* to predict trends in the tourism industry.

​The Code

Without any further ado, here's the project. We walk through the code in the comments.
[Notebook code cells: reading each dataset, trimming to a common set of countries and years, checking data quality, filling missing values, and normalizing currencies]

~ Putting it all together ~


[Notebook code cells: merging the cleaned datasets into the final training set]

That's all for part one.
In part two, we iterate over our time series to create cross-validation test sets, select a model, and evaluate our predictions.
​
CLICK HERE FOR PART TWO.