International Tourism & Neural Networks, Part 1: Creating the Dataset, & Cleaning the Data
3/22/2017
Introduction

Hypothesis: We can predict the number of tourists that will originate from a given country in a given year - if we understand just a few key macroeconomic indicators describing that country.

China's outbound tourists spent $258 billion abroad in 2017 - almost double the spend of US tourists, according to the United Nations World Tourism Organisation. Yet it has only been half a decade since China first overtook the US and Germany. Can we explain these trends?

The straightforward answer is that there are four times as many Chinese as Americans (1.3bn vs 330m), and China's economy is growing at a significantly faster pace. As its middle class grows, the Chinese become wealthier - and they travel more. No mystery there, it seems.

But why now? China's GDP has been growing at around or above 10% for the best part of four decades. And after surging dramatically at the start of this century, growth has dipped over the last decade to its current figure of 6-7%. Yet it is only in the past several years that tourism has surged, doubling in just a few years. So we can assume that:
The data science techniques

This analysis uses the hypothesis above to walk the reader through the key techniques a data scientist would use to design, test and refine an ML model, and shows how to extract real-world insights from public data sets. These techniques include:

[Note that the descriptions below are intended to provide a helpful overview, but are simplifications of very complex subjects.]

Data wrangling and cleaning
By some estimates, over 90% of a typical data scientist's time is spent wrangling and cleaning. Data wrangling is the process of transforming raw data into a usable format for analysis. Data cleaning is the process of addressing errors and inconsistencies in the initial data sets. This is important because real-world data is often messy and incomplete - a byproduct of real-world interactions. Data scientists must transform the data they have into the data they need.

Exploratory data analysis
EDA is how a data scientist gets to grips with a new data set in order to make it useful. It calls for an intuitive sense of how to use statistical methods and data visualization techniques to explore a new data set and develop a strategy for extracting insights from it.

Feature selection and engineering
Feature selection is how a data scientist chooses and refines which data they’re going to work with. In more technical terms, it is the process of applying rigorous scientific methodology to the curation, selection, and transformation of the data that will be used as inputs to machine learning algorithms.

Model design and algorithm selection
Machine learning is a set of tools which allows computers to generate predictions about the world around us and find correlations and complex insights across data from nearly any domain. A data scientist will often test multiple algorithms and choose the one best fitted to the data, depending on the requirements of the project.
Error metric selection
Data scientists must choose an appropriate measurement to evaluate the performance of each model. They have a wide range of statistical measures at their disposal, but their selection must be non-arbitrary, defensible, and made before they design the model.

Analysis of output
Fundamentally, the ‘science’ in data science comes from a rigorous application of the scientific method. The final step after every experiment is to analyse the results and decide on the next steps.

The Project:

Objective: Use only macroeconomic data about given countries to predict the number of international tourists that will originate from each country in a given year.

Our target metric - International Departures - is defined as: “The number of individuals which leave their home country for tourism purposes, at least one time in a calendar year.”

Key Challenges:
Datasets Collected:
Plan of Action:
# Read each dataset into a list of data frames
# What is our time range?
# What is our country set?
# Can we trust the data? Does the remaining data reflect an accurate account of reality?
# Fill Missing Values: How will we interpolate data?
# Normalize Local Currency Units (LCU) against Constant-Value US Dollar:
# Create Training Set
# Create Test Set
# Select A Model
# Tune the Best Model and Choose an Error Metric
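A minimal sketch of three of the cleaning steps in the plan above: finding the country set and time range shared by every dataset, filling interior gaps by linear interpolation, and restating a local currency's dollar exchange rate against a dollar index. Plain Python dictionaries stand in for the data frames so that the example runs on its own. The function names, the nested-dictionary layout, the choice of linear interpolation, the index formula, and all of the numbers are illustrative assumptions, not the project's actual code or data.

```python
from typing import Dict, List, Optional, Tuple

# Assumed layout (hypothetical): each dataset maps country -> {year: value or None}.
Series = Dict[int, Optional[float]]
Dataset = Dict[str, Series]


def common_coverage(datasets: List[Dataset]) -> Tuple[List[str], List[int]]:
    """Return the countries and years that appear in every dataset.

    A year counts as present if it has an entry, even if the value is None.
    """
    if not datasets:
        return [], []
    countries = set.intersection(*(set(d) for d in datasets))
    if not countries:
        return [], []
    years = set.intersection(
        *(set(d[country]) for d in datasets for country in countries)
    )
    return sorted(countries), sorted(years)


def interpolate_linear(series: Series, years: List[int]) -> Series:
    """Fill interior gaps by straight-line interpolation between known years.

    Leading and trailing gaps stay None rather than being extrapolated.
    `years` must be sorted in ascending order.
    """
    filled: Series = {y: series.get(y) for y in years}
    known = [y for y in years if filled[y] is not None]
    for left, right in zip(known, known[1:]):
        for y in range(left + 1, right):
            if y in filled:
                weight = (y - left) / (right - left)
                filled[y] = filled[left] + weight * (filled[right] - filled[left])
    return filled


def lcu_vs_basket(lcu_per_usd: Dict[int, float],
                  dollar_index: Dict[int, float],
                  base_year: int) -> Dict[int, float]:
    """Value of one local currency unit in dollars of base-year strength.

    1 / rate is dollars per local unit; scaling by the dollar index relative
    to base_year removes the dollar's own movement against its basket.
    """
    base = dollar_index[base_year]
    return {
        year: (1.0 / rate) * (dollar_index[year] / base)
        for year, rate in lcu_per_usd.items()
    }


# Made-up numbers, for illustration only.
gdp = {"A": {2000: 10.0, 2001: None, 2002: 20.0},
       "B": {2000: 5.0, 2001: 6.0, 2002: 7.0}}
pop = {"A": {2000: 1.0, 2001: 1.1, 2002: 1.2},
       "C": {2000: 3.0, 2001: 3.1, 2002: 3.2}}

countries, years = common_coverage([gdp, pop])   # only "A" is in both datasets
filled = interpolate_linear(gdp["A"], years)     # the 2001 gap is filled
flat_rate = lcu_vs_basket({2000: 70.0, 2001: 70.0},
                          {2000: 100.0, 2001: 110.0},
                          base_year=2000)
```

In the last call the exchange rate is flat at 70 per dollar, yet the currency's value rises by 10% because the made-up dollar index rose by 10% over the same period.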
A Justification of Feature Selection:

GDP Per Capita & GDP Per Capita Growth
GDP Per Capita is a measure of the total size of a country’s economy, divided by the size of that country’s population. It is a widely-used proxy for the wealth of individuals in a given country.

We used World Bank datasets for both the level and the growth of GDP Per Capita over time, even though these two datasets are simply functions of one another. There is useful information at the margins:

We included GDP Per Capita, even though it is collinear with GDP Per Capita growth, in order to establish a baseline value per country, in absolute terms. In other words, GDPPC Growth explains the delta between years, but we need the hard numbers from the first dataset to establish the absolute magnitude.

We could have included only the GDP Per Capita values, but the GDPPC Growth dataset provides insight that the former cannot. The growth in the last year in our range acts as a proxy for the size of the economy in the target year. It is the closest thing we have to an estimate of each country’s economic health in the target year, without leaking future data.

Inflation
We included inflation because it stands to reason that there is a positive cross-price elasticity between expenditures in a domestic market and expenditures in a foreign market: the two are substitutes. In other words, the more expensive it becomes to travel (or simply live) in a domestic market, the more attractive foreign markets become.

Local Currency Exchange [GDP vs Purchasing Power Parity vs Inflation]
Purchasing Power Parity (PPP) is a measure of how far money, denoted in dollars, goes in each country. Economists weight the GDP for each country by the cost of a standardized basket of consumer goods, to determine a normalized value for the purchasing power that stems from a measure of raw GDP.
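As a rough illustration of the PPP adjustment just described, the sketch below converts a GDP per capita figure in local currency into dollars twice: once at the market exchange rate, and once at a PPP conversion factor derived from the price of the same basket of goods at home and in the US. Every number and function name here is hypothetical, and real PPP factors come from large price surveys rather than a single basket.

```python
def ppp_conversion_factor(basket_cost_lcu: float, basket_cost_usd: float) -> float:
    """Local currency units needed to buy what one dollar buys in the US."""
    return basket_cost_lcu / basket_cost_usd


def gdp_pc_nominal_usd(gdp_pc_lcu: float, lcu_per_usd: float) -> float:
    """GDP per capita converted at the market exchange rate."""
    return gdp_pc_lcu / lcu_per_usd


def gdp_pc_ppp(gdp_pc_lcu: float, ppp_factor: float) -> float:
    """GDP per capita converted at the PPP factor (international dollars)."""
    return gdp_pc_lcu / ppp_factor


# Hypothetical country: 500,000 local units per person, 50 local units per dollar.
gdp_pc_lcu = 500_000.0
lcu_per_usd = 50.0

# Hypothetical basket: costs 2,500 local units at home and 100 dollars in the US.
factor = ppp_conversion_factor(basket_cost_lcu=2_500.0, basket_cost_usd=100.0)

nominal = gdp_pc_nominal_usd(gdp_pc_lcu, lcu_per_usd)  # market-rate dollars
ppp = gdp_pc_ppp(gdp_pc_lcu, factor)                   # international dollars
```

Because the basket is cheaper at home than the exchange rate alone implies, the PPP-adjusted figure comes out at twice the nominal one.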
PPP-adjusted GDP is usually a better indicator of a country’s economic environment than raw GDP, but we've excluded it here because inflation and the exchange rate of a currency against a basket of international currencies are the primary factors by which a calculation of GDP PC can be turned into one for GDP PC PPP (Purchasing Power Parity). Thus, inflation is also an excellent (if incomplete) proxy for GDP PC PPP - a measure I chose to leave out because of its strong correlation with GDP.

We've also included the value of a country’s currency, pegged against an international benchmark, as a feature. In simple terms, the more foreign currency one can buy with a single unit of domestic currency, the more attractive travel abroad becomes. If the Chinese Yuan increases significantly against the Thai Baht, it is nearly guaranteed that, all else held constant, Thailand will see an influx of Chinese tourists. Note that the inverse should also be true, as a predictor of inbound tourists to a country. If the Thai Baht decreases against the Yuan, fewer Thais will visit China (all else being equal).

Dollar Index Series
The temptation is to peg each country’s currency against the international reserve currency - the US dollar. After all, this is exactly how economists measure economic activity in every country around the world.

However, a quick EDA of the currency dataset we are using tells us that the exchange rate for the “United States” against the dollar stays constant (1:1) across the entire time series. While you might *expect* this to be true, this is actually important information. It means that US Dollars have not been normalized. They could have been normalized against an index year, for purchasing power parity (e.g. “Year 2000 Dollars”).
They could also have been normalized against an international benchmark - such as the “US Dollar Index.” Because US dollars have not been normalized, and all other currencies here are measured in US Dollars, we will have to normalize all these values against the US Dollar Index.

But why must we do this? The assumption that the “value” of the US dollar remains constant against other currencies simply does not hold. The dollar is not an immutable benchmark around which other currencies fluctuate. Its value is derived from indicators intrinsic to American economic activity, foreign policy, and the actions of the US Federal Reserve.

In the simplest possible terms, if a country's currency goes down ten percent against the dollar in a given year, we cannot say *which* currency moved. If the value of the Russian Ruble increases from 1:70 to 1:60, did the Ruble go up, or did the Dollar fall? We have to find a way to normalize the data.

One method would be to find the mean value of a dollar against ALL currencies in the data set, across each year in our time series, then calculate the change in the average value of the Dollar in each year. This would work, but it gives equal weight to all currencies and would be misleading. The performance of the Egyptian Pound against the Dollar should not carry as much sway as that of the Euro, Yen, or Yuan.

Fortunately, there IS a standardized index by which the Dollar is measured - the US Dollar Index. It maps the value of the US Dollar over time, against a weighted basket of foreign currencies.

Country Population
Quite obviously, the number of tourists that originate from a given country is correlated with that country’s population. The wealth of each citizen within a country is an important variable in determining that individual’s likelihood to travel, but the overall population of a country influences our equation in two ways:
Urban Population
This is a measure of the percentage of each country’s population that lives in urban, versus rural, areas, as defined by the World Bank. Simply put, people who travel tend to have money - and people who have money tend to live in cities.

The urban population ratio is another proxy for individual wealth. GDP Per Capita is an effective measure, but an imperfect one. It tells us the *mean* wealth per person in a country, but says nothing about how that wealth is distributed. A country in which a small group of oligarchs hold all the wealth may have a relatively high GDP Per Capita, but a population of mostly poor people. Including the ratio of urban population (and how that changes over time) should explain some additional variance.

Departures
This dataset contains the World Bank’s information on the number of outbound international tourists that originate from each country in a given year. Because this is what we are trying to predict, we include data from this dataset in exactly two places, per train/test set:
Note that we did NOT include any additional information or features from this dataset. To do so would have only helped us, but this project is an exercise in using *other data* to predict trends in the tourism industry.

The Code
Without any further ado, here's the project. We walk through the code in the comments.

~ Putting it all together ~

That's all for part one.
In part two, we iterate over our time series to create cross-validation test sets, select a model, and evaluate our predictions.