DATA EXPLORATIONS

International Tourism & Neural Networks, Part 1: Creating the Dataset, & Cleaning the Data

3/22/2017

 
Introduction

Hypothesis: We can predict the number of tourists that will originate from a given country in a given year, if we understand just a few key macroeconomic indicators describing that country.

China's outbound tourists spent $258 billion abroad in 2017 - almost double the spend of US tourists, according to the United Nations World Tourism Organization. Yet it has only been half a decade since China first overtook the US and Germany in outbound tourism spending. Can we explain these trends?

The straightforward answer is that there are four times as many Chinese as Americans (1.3bn vs 330m), and China is growing at a significantly faster pace. As its middle class grows, Chinese households become wealthier - and they travel more. No mystery there, it seems.

But why now? China's GDP grew at or above 10% for the best part of four decades, and after surging dramatically at the start of this century, growth has dipped over the last decade to its current 6-7%. Yet it is only in the past several years that outbound tourism has surged - doubling in just a few years.

So we can assume that: 

  1. Macroeconomic conditions do influence the number of tourists that originate from each country, and
  2. The relationship is non-linear, and may be shaped by certain asymptotes or inflection points

The data science techniques
This analysis explores the hypothesis above, walking the reader through the key techniques a data scientist would use to design, test, and refine a machine learning model, and showing how to extract real-world insights from public data sets. These techniques include: 

[Note that the descriptions below are intended to provide a helpful overview, but are simplifications of very complex subjects.]

Data wrangling and cleaning
A large share of a typical data scientist's time - by many estimates the majority - is spent wrangling and cleaning. Data wrangling is the process of transforming raw data into a usable format for analysis; data cleaning is the process of addressing errors and inconsistencies in the initial data sets. This matters because real-world data is often messy and incomplete - a byproduct of real-world interactions. Data scientists must transform the data they have into the data they need.
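As a small illustration of what this looks like in practice, here is a minimal wrangling sketch in pandas. It assumes a hypothetical World Bank-style export (a file named gdp_per_capita.csv with a "Country Name" column and one column per year); the file name and column labels are placeholders, not the exact files used later in this post.

```python
import pandas as pd

# Hypothetical World Bank-style export: one row per country, one column per year.
raw = pd.read_csv("gdp_per_capita.csv")

# Reshape from wide to long, so each row is one (country, year, value) observation.
year_cols = [c for c in raw.columns if str(c).isdigit()]
tidy = raw.melt(
    id_vars=["Country Name"],
    value_vars=year_cols,
    var_name="year",
    value_name="gdp_per_capita",
)

# Basic cleaning: enforce types and drop rows with no reported value.
tidy["year"] = tidy["year"].astype(int)
tidy = tidy.dropna(subset=["gdp_per_capita"])
```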

Exploratory data analysis
EDA is how a data scientist gets to grips with a new data set in order to make it useful. Data scientists must develop an intuitive sense for applying statistical methods and data visualization techniques, so that they can explore a new data set and develop a strategy for extracting insights from it. 

Feature selection and engineering
Feature selection is how a data scientist chooses and refines which data they’re going to work with. In more technical terms, it is the process of applying rigorous scientific methodology to the curation, selection, and transformation of data which will be used as inputs in machine learning algorithms.

Model design and algorithm selection
Machine learning is a set of tools that allows computers to generate predictions about the world around us and to find correlations and complex patterns in data from nearly any domain.
A data scientist will often test multiple algorithms and choose the one best suited to the data and the requirements of the project.

Error metric selection
Data scientists must choose an appropriate measurement to evaluate the performance of each model. They have a wide range of statistical tests at their disposal, but their selection must be non-arbitrary, defensible, and made before they design the model. 

Analysis of output
Fundamentally, the ‘science’ in data science comes from a rigorous application of the scientific method. The final step after every experiment is to analyse the results and decide on the next steps. 


The Project: 

Objective: Use only macroeconomic data about given countries to predict the number of international tourists that will originate from each country in a given year.

Our target metric - International Departures - is defined as: “The number of individuals who leave their home country for tourism purposes at least once in a calendar year.”

Key Challenges:

  1. Sparse data - for some countries, and for some of the years in which data on those countries was collected
  2. Multidimensional time series data
    1. We have multiple datasets, each of which plots data on 200+ countries, over 50 years
    2. To combine these, we have to create “wide” test/train datasets, with multiple values for multiple years in the same row (by country). This wouldn't be much of a challenge, except:
    3. We end up with a fairly small number of countries with sufficient data to train a model on, and need to cross validate using *all* of the countries in multiple test sets. We have to get creative.
  3.  Selecting an appropriate error metric that *meaningfully* conveys the accuracy and implications of our model.
    1. Mean Squared Error may be the standard choice, but does it really tell us something useful about our predictions?


Datasets Collected:

  • World Bank
    • GDP Per Capita
    • GDP Per Capita Growth
    • Inflation
    • Population
    • Urban population, percent of total
    • International tourism, number of departures
  • Other:
    •  Series for the US Dollar Index, over time

Plan of Action:


# Read each dataset into a list of data frames 

# What is our time range?

  1. For which years do we have sufficient data to train and test our model?
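Here is a sketch of one way to answer this, assuming each dataset has been loaded as a wide DataFrame (countries as rows, year columns labelled by year) and collected in a list called frames; the 80% coverage threshold is illustrative.

```python
def well_covered_years(frames, threshold=0.8):
    """Return the years in which every dataset reports values for at
    least `threshold` of its countries."""
    good_years = None
    for df in frames:
        year_cols = [c for c in df.columns if str(c).isdigit()]
        coverage = df[year_cols].notna().mean()   # fraction of countries reported per year
        years = {int(y) for y, frac in coverage.items() if frac >= threshold}
        good_years = years if good_years is None else good_years & years
    return sorted(good_years)
```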

# What is our country set?

  1. Drop all countries with more than 20% missing data over our selected years
  2. Determine which of the remaining countries in each dataset are found in *every* dataset
  3. Drop countries which are not found in every dataset
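A sketch of these three steps, under the same assumptions as above (each DataFrame indexed by country name, with one column per selected year); the 20% threshold comes from step 1.

```python
def usable_countries(frames, years, max_missing=0.20):
    """Countries with at most `max_missing` missing data over the selected
    years, and present in every dataset."""
    keep = None
    for df in frames:
        window = df[years]                        # just the selected year columns
        frac_missing = window.isna().mean(axis=1)
        ok = set(window.index[frac_missing <= max_missing])
        keep = ok if keep is None else keep & ok
    return sorted(keep)

# countries = usable_countries(frames, years)
# frames = [df.loc[countries] for df in frames]   # drop everything else
```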

# Can we trust the data? Does the remaining data reflect an accurate account of reality? 
  1. Examine the ratio of unique entries in each trimmed dataset, to determine where the World Bank interpolated or otherwise gave data that potentially diverges from the ground truth.
  2. For each dataset: If the entries for a country are not mostly unique, drop the country
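One way to implement this check; the 80% uniqueness cut-off is an illustrative stand-in for "mostly unique".

```python
def mostly_unique(row, min_unique_ratio=0.8):
    """True if most of a country's reported values are distinct, i.e. they
    look measured rather than interpolated or repeated."""
    values = row.dropna()
    return len(values) > 0 and values.nunique() / len(values) >= min_unique_ratio

# For each dataset, keep only the countries that pass the check:
# df = df[df.apply(mostly_unique, axis=1)]
```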

# Fill Missing Values: How will we interpolate data?
  1. Clean and fill remaining (minimal) missing data
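A minimal sketch of the fill step: linear interpolation along each country's time series, with a forward/back fill for any values missing at the edges of the range. This is one reasonable choice, not necessarily the exact method used in the code below.

```python
# df: countries as rows, year columns in chronological order.
filled = (
    df.interpolate(axis=1)   # linear interpolation between known years
      .ffill(axis=1)         # carry the last known value forward
      .bfill(axis=1)         # and the first known value backward
)
```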

# Normalize Local Currency Units (LCU) against Constant-Value US Dollar:
  1. Divide each LCU rate by US Dollar Index for given year
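In pandas terms, this normalization is a single element-wise division. The sketch assumes lcu is a countries-by-years DataFrame of local-currency-per-USD rates and dollar_index is a Series of the US Dollar Index keyed by the same year labels; both names are hypothetical.

```python
# Divide each year's exchange rates by that year's US Dollar Index value.
normalized_lcu = lcu.div(dollar_index, axis="columns")
```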

# Create Training Set
  1. Select appropriate columns from each dataset
  2. Filter by the first five years of data, to train the model on
  3. Rename columns so we can match them in the training set
  4. Assign values for departures in year six to target year, to train on
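Here is a sketch of how such a "wide" training set might be assembled. It assumes datasets is a dict mapping a short feature name to a countries-by-years DataFrame with integer year columns, that departures is the departures DataFrame, and that the first five years of data run 2000-2004 with 2005 as the training target; the names and exact years are illustrative.

```python
import pandas as pd

def make_window(datasets, start_year, n_years=5):
    """One wide row per country: n_years of every feature, with columns
    renamed to relative positions (t0 .. t4) so windows from different
    years line up."""
    pieces = []
    for name, df in datasets.items():
        window = df[[start_year + i for i in range(n_years)]].copy()
        window.columns = [f"{name}_t{i}" for i in range(n_years)]
        pieces.append(window)
    return pd.concat(pieces, axis=1)

# Features from 2000-2004, target = departures in 2005 (years are illustrative).
train = make_window(datasets, start_year=2000)
train["target"] = departures[2005]
train = train.dropna()
```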


# Create Test Set
  1. Cross Validation: We don’t have enough countries to cross validate on a stratified selection of rows; we have to get creative.
  2. Create (8) test sets - one for each year, 2006 - 2013
  3. Assign features based on selected data for the unique five-year time series preceding each target year
  4. Assign values for departures in each target year, to validate against
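Reusing the make_window helper sketched above, the eight test sets can be built the same way, one per target year:

```python
# One test set per target year (2006-2013), each built from the five years
# preceding that target.
test_sets = {}
for target_year in range(2006, 2014):
    test = make_window(datasets, start_year=target_year - 5)
    test["target"] = departures[target_year]
    test_sets[target_year] = test.dropna()
```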


# Select A Model
  1. Determine which baseline regressor has the lowest mean squared error, averaged across our 8 test sets
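A sketch of the comparison, reusing the train and test_sets objects from the sketches above; the candidate regressors shown here are illustrative, not necessarily the ones evaluated in part two.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

candidates = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

X_train, y_train = train.drop(columns="target"), train["target"]
for name, model in candidates.items():
    model.fit(X_train, y_train)
    mse_per_year = [
        mean_squared_error(t["target"], model.predict(t.drop(columns="target")))
        for t in test_sets.values()
    ]
    print(f"{name}: mean MSE across the 8 test sets = {np.mean(mse_per_year):,.0f}")
```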

# Tune the Best Model and Choose an Error Metric
  1. Determine a meaningful accuracy metric, and tune the hyper-parameters for best performance

A Justification of Feature Selection:

GDP Per Capita & GDP Per Capita Growth

GDP Per Capita is a measure of the total size of a country’s economy, divided by the size of that country’s population. It is a widely-used proxy for the wealth of individuals in a given country.

We used World Bank datasets for both the size and the growth of GDP Per Capita over time, even though these two datasets are simply functions of one another. There is useful information at the margins:

We included GDP Per Capita, even though it is collinear with GDP Per Capita growth, in order to establish a baseline value per country, in absolute terms. In other words, GDPPC Growth explains the delta between years, but we need the hard numbers from the first dataset to establish the absolute magnitude.

We could have included only the GDP Per Capita values, but the GDPPC Growth dataset provides insight that the former cannot. The growth in the last year in our range acts as a proxy for the size of the economy in the target year. It is the closest thing we have to an estimate of each country’s economic health in the target year, without leaking future data.


Inflation: 

We included inflation, because it stands to reason that there is a negative cross-price elasticity between expenditures in a domestic market, and expenditures in a foreign market. In other words, the more expensive it becomes to travel (or simply live) in a domestic market, the more attractive foreign markets become. 


Local Currency Exchange  [GDP vs Purchasing Power Parity vs Inflation]

Purchasing Power Parity (PPP) is a measure of how far money, denominated in dollars, goes in each country. Economists weight each country's GDP by the cost of a standardized basket of consumer goods to determine the normalized purchasing power implied by a raw GDP figure. It is usually a better indicator of a country's economic environment than raw GDP, but we've excluded it here, because:

Inflation and the exchange rate of a currency against a basket of international currencies are the primary factors by which a calculation of GDP Per Capita can be turned into one for GDP Per Capita at PPP. Thus, inflation is also an excellent (if incomplete) proxy for GDP Per Capita PPP - a measure we chose to leave out because of its strong correlation with GDP.

We've also included the value of a country's currency, measured against an international benchmark, as a feature. In simple terms, the more foreign currency one can buy with a single unit of domestic currency, the more attractive travel abroad becomes. If the Chinese Yuan strengthens significantly against the Thai Baht, it is nearly guaranteed that, all else held constant, Thailand will see an influx of Chinese tourists. 

Note that the inverse should also be true, as a predictor of inbound tourists to a country. If the Thai Baht falls against the Yuan, fewer Thais will visit China (all else being equal).


Dollar Index Series

The temptation is to peg each country's currency against the international reserve currency - the US dollar. After all, this is exactly how economists measure economic activity in every country around the world. However, a quick EDA of the currency dataset we are using shows that the exchange rate for the “United States” against the dollar stays constant (1:1) across the entire time series. While you might *expect* this to be true, it is actually important information: it means that the US Dollar values have not been normalized.

They could have been normalized against an index year, for purchasing power (e.g. “Year 2000 Dollars”). They could have been normalized against an international benchmark, such as the “US Dollar Index.”

Because US dollars have not been normalized, and all other currencies here are measured ​in US Dollars, we will have to normalize all these values against the US Dollar index.

But why must we do this? Assumptions that the “value” of the US dollar remains constant against other currencies simply do not hold. The dollar is not an immutable benchmark around which other currencies fluctuate. Its value is derived from indicators intrinsic to American economic activity, foreign policy, and the actions of the US Federal Reserve. In the simplest possible terms, if a country's currency goes down ten percent against the dollar in a given year, we cannot say *which* currency moved. If the value of the Russian Ruble increases from 1:70 to 1:60, did the Ruble go up, or did the Dollar fall?

We have to find a way to normalize the data. One method would be to take the mean value of the dollar against all currencies in the data set for each year in our time series, then calculate the year-over-year change in that average. This would work, but it gives equal weight to all currencies, which would be misleading: the performance of the Egyptian Pound against the Dollar should not carry as much sway as that of the Euro, Yen, or Yuan.

Fortunately, there IS a standardized index by which the Dollar is measured - the US Dollar Index. It maps the value of the US Dollar over time, against a weighted basket of foreign currencies.


Country Population

Quite obviously, the number of tourists that originate from a given country is correlated with that country’s population. The wealth of each citizen within a country is an important variable in determining that individual’s likelihood to travel, but the overall population of a country influences our equation in two ways:

  1.    It sets a hard limit on the number of tourists that can come from a country
  2.    When combined with a multiplier determined by the GDP Per Capita for each country, it should be an excellent proxy for the number of tourists that originate from that country.


Urban Population

This is a measure of the percent of each country’s population that lives in urban, versus rural, areas, as defined by the World Bank. Simply put, people who travel tend to have money - and people who have money tend to live in cities. 

The urban population ratio is another proxy for individual wealth. While GDP Per Capita is an effective measure, it is an imperfect one: it tells us the *mean* wealth per person in a country, but says nothing about how that wealth is distributed. A country in which a small group of oligarchs holds all the wealth may have a relatively high GDP Per Capita, but a population of mostly poor people. 

Including the ratio of urban population (and how that changes over time) should explain some additional variance.


Departures

This dataset contains the World Bank’s information on the number of outbound international tourists that originate from each country in a given year. 

Because this is what we are trying to predict, we include data from this dataset in exactly two places, per train/test set:

  1. The number of departures in the year preceding the one we want to predict, to establish a baseline.
  2. The number of departures in the year we want to predict - to train the training set on, or to validate the test predictions against.

Note that we did NOT include any additional information or features from this dataset. Doing so would likely have improved our predictions, but this project is an exercise in using *other* data to predict trends in the tourism industry.

The Code

Without any further ado, here's the project. We walk through the code in the comments.
[Code walkthrough screenshots]

~ Putting it all together ~


[Code screenshots: putting it all together]

That's all for part one.
In part two, we iterate over our time series to create cross-validation test sets, select a model, and evaluate our predictions.
CLICK HERE FOR PART TWO.
