Working with World Bank Data with Pandas

Jeremy M. Mikecz

2019

498 Home Syllabus Schedule Mikecz Home Tutorials

Please also see:

for Pandas:
- Dataquest Pandas cheat sheet
- Pandas cheat sheet
to browse World Bank data from the World Bank Data webpage:
- World Bank Open Data
for using Pandas with World Bank data:
- wbdata documentation: Typical User Session
- Getting Worldbank Data with Python. Easily

Tasks

Open your terminal.
Check to see if pandas, numpy, and wbdata are installed:
1. Open the terminal
2. Type to see if pandas is installed:
```
conda list
```
1. If not installed, install by typing (note install numpy too as it is often used in conjunction with pandas):
```
conda install pandas
conda install numpy
conda install wbdata
```
1. If the “conda install” command does not work for any of these packages try “pip” instead:
```
pip install wbdata
```
In Spyder, create and save a new program in the editor. In the program:
1. First, import pandas, numpy, and os
```
import pandas as pd
import numpy as np
import os
```
Run this code.

Set your working directory(The path to your folder of Python files):

os.chdir("C:/Users/mikecz/Documents/python") #replace with your folder path

choose which countries for which you wish to download data. Some examples include:

#wbdata.get_country(300) # to see list of countries
LAMcountries=['ARG','BLZ','BOL','BRA','CHL','COL','CRI','CUB','SLV','GTM','GUY','HTI','HND',
          'JAM','MEX','NIC','PAN','PRY','PER','PRI','SUR','URY','VEN']
NAMcountries=['USA','CAN']
#regions=['LCN','WLD','NAC'] #WORLD = WLD; LAT AM AND CARRIBEAN = LCN; NAC=North America
regions=['NAC','WLD','LCN']
allcountries=LAMcountries+NAMcountries
allregions=allcountries+regions

to browse available wbdata variables you have two options:

browse online

explore with Python, i.e.:

wbdata.get_topic()

1       Agriculture & Rural Development
2       Aid Effectiveness
3       Economy & Growth
4       Education
5       Energy & Mining
6       Environment
7       Financial Sector
8       Health
9       Infrastructure
10      Social Protection & Labor
11      Poverty
12      Private Sector
13      Public Sector
14      Science & Technology
15      Social Development
16      Urban Development
17      Gender
18      Millenium development goals
19      Climate Change
20      External Debt
21      Trade

To explore variables (“indicators”) available for one of these topics enter:

wbdata.get_indicator(topic=17)

WP15163_4.3             Mobile account, female (% age 15+) [w2]
WP15163_4.2             Mobile account, male (% age 15+) [w2]
WP_time_01.3            Account at a financial institution, female (% age 15+) [ts]
WP_time_01.2            Account at a financial institution, male (% age 15+) [ts]
SP.URB.TOTL.MA.ZS       Urban population, male (% of total)
SP.URB.TOTL.FE.ZS       Urban population, female (% of total)
....#much more returned....

You can also search for variables (“indicators”) by keywords:

wbdata.search_indicators("GDP per capita")

To choose which variables (“indicators”) you will import, create a dictionary linking the wbdata label for the variable with your own, more common-sense label, like this:

indicators={"EN.ATM.PM25.MC.ZS":"air_poll",
       "SP.URB.TOTL":"urbanpop",
       "EN.URB.LCTY.UR.ZS":"popLgstCity",
       "EN.POP.EL5M.UR.ZS":"urbBelow5m",
       "EN.POP.SLUM.UR.ZS":"popSlums",
       "SP.POP.TOTL":"totalpop",
       "EN.POP.DNST":"popDens",
       "1.1_ACCESS.ELECTRICITY.TOT":"electAccess",
       "1.3_ACCESS.ELECTRICITY.URBAN":"electAccessUrb"
    }

With your dictionary of indicators and list of countries, you can now import the date as a pandas dataframe using:
```
df=wbdata.get_dataframe(indicators,country=allregions)
```
Now that your World Bank data has been imported as a dataframe, you review summary information about it just as you would any other dataframe:
```
df.shape
df.columns.values
df.info()
df.describe()
df.head()
df.head(10)
df.tail()
df.tail(10)
```
Notice when you view your data using df.head() or df.tail() the country and date are listed as indices for each row rather than as separate columns. For our purposes, we will convert the indices (country and date) to columns.
```
df2=df.copy() #creates a copy of dataframe; if you instead  do df2=df all changes to df2 will impact df
df2.reset_index(inplace=True)
```
Also, for the purpose of labeling points on our (future) data visualizations, we should create a separate column combining the country and date into one variable.
```
df2['text']=df2['country']+df2['date'].map(str)
```
Now, we can create new columns. For example, we will want to convert absolute numbers into meaningful percentages (i.e. % of population living in urban areas or % of land that is arable)
```
df2['pcturb']=df2['urbanpop']/df2['totalpop']*100
df2['pcturb_inSlums']=df2['popSlums']/df2['urbanpop']*100
```
To calculate the mean (average) of these new columns, we can do either:
```
df2['pcturb'].mean()
df2['pcturb'].describe()
```
Subsetting Data. Often it is useful to create smaller subsets of your dataframe. For example, if you just want to see all countries with a current population over 1 million people. Our all instances (country / year data points) that have population data.
1. If we only wanted to experiment with a few rows of data, we can do something like this:
```
df_1st5rows=df2[0:5]
```
Notice how this works similar to subsetting the first ten items from a list. Subsetting the first ten columns, however, is a little different:
```
df_1st5cols=df2.iloc[:,0:10]
```
You can also create a subset for just a few countries:
```
df_selctrys=df2[df2['country'].isin(['ARG','BRA','COL','VEN','MEX'])]
```
or you can select for a chosen few years:
```
df_selyrs=df2[(df2['date']>=2014) & (df2['date']<=2016)] #produces an error
```
If this produces an era it is because the values in the ‘date’ column are recorded as strings not integers. To convert the entire column to integers try:
```
df2['date']=pd.to_numeric(df2['date'])
df_selyrs=df2[(df2['date']>=2014) & (df2['date']<=2016)]
```
1. We may also want to identify and then eliminate missing data for particular variables.
First, let us identify missing data. In your console, type “df2.info()”. This reveals the number of missing data for each variable. Some data is more complete than others. For example, for the variable “totalpop” notice we have 1566 out of 1593 possible data points. That means out of 1593 possible combinations of countries/regions and years (1960 - 2018) we are missing only 27, which likely means we are missing data for only one year (for all 27 countries). Let’s find out what year is missing.
```
~~~~
dfnas=df2[df2['totalpop'].isnull()] 
~~~~
```
You can then view the ‘dfnas’ dataframe to see what year was missing.

To drop all instances (rows) of missing data for a particular variable, try the following code:
```
df3=df2.dropna(subset=['totalpop'])
```
Now, compare the size of the two dataframes:
```
df2.shape
df3.shape
```
Subset and sort data:

Let us begin by retrieving the data for just one year: 2016, the most recent year with mostly complete data.
```
df2016=df3[df3['date']==2016]
```
Next, let us sort this data to see which country has the highest percent of people living in urban areas:
```
df2016.sort_values(by='pcturb')
```
If you want to sort if from most urban to less, you can modify it as:
```
df2016.sort_values(by='pcturb',ascending=False)
```
You can also subset your dataframe by a conditional. For example, if you only wanted those data points (countries and years) which had a life expectancy greater than 75, you would type:
```
longlifedf=df[df['life_exp']>75]
```
To see what countries and years remain in this dataset and, therefore, have / had a life expectancy greater than 75, see:
```
longlifedf['country'].value_counts()
longlifedf['date'].value_counts()
```