This project is a demonstration of how to use python for data visualization. Modules used for this project includes pandas, numpy, fuzzywuzzy for data cleaning; geopandas, matplotlib, seaborn, bokeh for plotting. I intend to examine what is the most commonly used contraceptive method worldwide, especially contraceptive use in the United States. When thinking of contraceptive use, most likely, pills and condoms may pop into people's minds, which is true for most developed countries. However, the US seems to be a unique case that worth a close examination.
This project is set out as follows: First, I create a dataset that contains the world and country specific contraceptive information. Then, I provide an overview of contraceptive use around the world. Next, I present the distribution of contraceptive prevalence of various methods for twelve countries and the country specific trends from 1980 to 2017. Lastly, a plot is created to illustrate the trend of six contraceptive methods used in the US from 1955 to 2017.
The data used in this project are publicly available on the UN website. Data can be found here:
Note: You may need to install additional dependencies for modules used in this project. Plots created from bokeh may not display automatically in jupyter notebook. I recommend You download this script and data, run locally on your laptop.
import geopandas
import pandas as pd
import seaborn as sns
import numpy as np
import pycountry # used to identify country code
from fuzzywuzzy import fuzz, process # used to fuzzy merge data
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable
from bokeh.palettes import Spectral6, Paired ## used to create interactive plot
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
from bokeh.transform import factor_cmap ## use to apply color palette to categorical variables
from bokeh.models import ColumnDataSource, HoverTool,CrosshairTool, SaveTool, PanTool,Legend,Panel, Tabs
pd.set_option('display.max_rows', 300)
pd.set_option('display.max_columns', 500)
import warnings
warnings.filterwarnings('ignore')
First, let's import a world map from the geopandas module. This is a shapefile without basic information attached. We will later add contraceptive information from our data. A unique column is the "geometry". This variable contais the information we will use to plot the map. Let's first take a look of the data and the map.
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
world.head(5)
world.plot()
Next, we load the datasets that we will need for our illustrations, data are available under the same folder of my git repo. All three datasets are stored in one spreadsheet. The following code allows python to read them all at once. "country2019" contains contraceptive information for all countries and regions in 2019. "area2019" is a aggregated summary of contraceptive use by region and income levels in 2019. "country_tread" contains yearly country sepcific contraceptive prevalence for decades.
# loading contraceptive data
with pd.ExcelFile('Contraceptive_2019.xls') as xls:
country2019 = pd.read_excel(xls, 'Sheet1', na_values= ["."])
area2019 = pd.read_excel(xls, 'Sheet2',na_values= ["."])
country_trend= pd.read_excel(xls, "Sheet3",na_values= ["."])
For the country2019 dataset we also need to perform a management step, in particular,we need to merge the geographic information with the contraceptive information. However, the contraceptive data do not contain a column of unique identifier. A possible way to merge two data ("country2019" and "world") together is based on country names. Here I use a fuzzy method to match two dataset based on the name of each country. A special note: since the fuzzy merge is based on string variables (in our case, the country names), it highly possible to result in undesirable or mismatch. Do check the data after your merging. For our data, it works pretty well.
# The contraceptive data do not contain a col of unique identifier.
# Use fuzzy merge to join two datasets.
def fuzzy_merge(data1, data2, key1, key2, threshold=95, limit=1):
s = data2[key2].tolist()
m = data1[key1].apply(lambda x: process.extract(x, s, limit=limit))
data1['matches'] = m
m2 = data1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
data1['matches'] = m2
return data1
match=fuzzy_merge(world, country2019, 'name', 'area')
#match.loc[match['matches'] == ""] ## display countries that have no matches from contraceptive data
World_con = pd.merge(match, country2019,
left_on='matches',
right_on='area',
how = 'left'
)
# Clean the data
World_con=World_con[World_con.columns.drop('matches')];
World_con=World_con[World_con.columns.drop('area')]
World_con=World_con[World_con.continent!='Antarctica'] # we remove Antarctica from the data. Nobody lives there anyway.
Let's take a quick inspection of the data. Everything looks fine. We can go head and plotting.
round(World_con.describe(), 2)
## setup plot and legend
fig, ax = plt.subplots(1, figsize=(16, 17))
divider = make_axes_locatable(ax) # align the legend to the plot
cax = divider.append_axes("right", size="5%", pad=0.1)
World_con.plot(column='Any method', linewidth=0.3, ax=ax, edgecolor='0.7',alpha=0.95,\
cax=cax, cmap='GnBu',legend= True)
ax.axis('off')# remove the axis
# add plot title and annotation
ax.set_title('Figure 1:Estimated prevalence of contraceptive use among women of reproductive age (15-49 years), 2019(%)',
fontdict={'fontsize': '16', 'fontweight' : '5','horizontalalignment': 'center'})
ax.annotate('Source: United Nations, Population Division (2019).',xy=(0.1, 0.28),
xycoords='figure fraction', horizontalalignment='left', verticalalignment='top',
fontsize=10, color='#555555')
It is obvious that Northern and Southern America, European countries and China have high levels of contraceptive prevalence. Most African countries have lower levels of contraceptive use. This is consistent with our knowledge that contraceptive practice is common in developed countries whereas contraception is less popular in developing countries.
Next, from the "area2019" dataset, let's pull out the information to make some basic plots. Here, I first extract contraceptive prevalence of all method by geographic regions. Then use catplot from the seaborn module to make a barchart.
g_contra=area2019.iloc[1:9,:2]
g_contra=g_contra.sort_values("Any method", ascending=False) ## sort the values on a decending order
g=sns.catplot(y="area", x="Any method",
palette=(sns.cubehelix_palette(8, start=.5, rot=-.5)),
height=4,aspect=2, kind="bar", data=g_contra)
## remove y and x labs.
(g.set_ylabels("")
.set_xlabels(""))
## setup title
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle('Figure 2: Contraceptive use prevalence in geographic regions: all method (%)',
fontsize=18)
Figure 2 presents the distribution of contraceptive use prevalence in different regions. Again, Eastern and South-Eastern Asia (especially China and India), Europe, America, Australia and New Zealand have a contraception prevalence around 60%. Africa and some Asia areas have lower prevalence of contraceptive use.
inc_contra=area2019.iloc[16:21,[0,1,2,3,4,7,8]] # select income and various contraceptive methods
inc_contra
## plot setup,we use PariGrid to setup the multiple plotting format.b
sns.set(style="whitegrid",rc={'font.size': 15, 'axes.labelsize': 15, 'legend.fontsize': 15,
'axes.titlesize': 15, 'xtick.labelsize': 13, 'ytick.labelsize': 15})
inc= sns.PairGrid(inc_contra,
x_vars=inc_contra.columns[1:7], y_vars=["area"],
height=5, aspect=.5)
inc.map(sns.stripplot, size=14, orient="h",
palette="ch:s=1,r=-.1,h=1_r", linewidth=1, edgecolor="w")
inc.set(ylabel="")
## change grid orientation
for ax in inc.axes.flat:
ax.xaxis.grid(False)
ax.yaxis.grid(True)
## setup title
inc.fig.subplots_adjust(top=0.9) ## title position
inc.fig.suptitle('Figure 3:Contraceptive use prevalence among women by income levels(%)', fontsize=18)
If we divide countries into income strata, high-income countries are more likely to adopt short acting contraceptive methods such as hormonal pills and male condoms. Middle income countries present a mixed pattern that both short- and long-acting methods are widely used. Low income countries are less likely to uptake any contraception and female sterilisation is more common than other contraceptive method in these developing countries.
Now, let's see how contraceptive use changed for the past few decades. I select 12 major countries to examine their trend. A trick I used here to select my target countries is to import the pycountry module. This module contains the information of all countries, including their country codes. I first extract the corresponding country ISO code and save them into a list, then select the counties with corresponding codes from "country_trend" dataset and save it as a new dataset "subcountry".
## Select 12 countries to examine the trends
countryname=[]
for i in ['USA', 'CAN', 'GBR','FRA','DEU','MEX' ,'JPN','KOR','CHN','IND','AUS','TUR']:
name=pycountry.countries.get(alpha_3=i).numeric
countryname.append(int(name))
subcountry=country_trend.loc[country_trend['ISO code'].isin(countryname),\
['ISO code','area', 'Survey\nend year','Any method','Female\nsterilization',\
'Pill', 'Male condom'] ]
## rename column names
subcountry = subcountry.rename(columns = {'ISO code':'iso','Survey\nend year':'year','Any method':'Any method' ,\
'Female\nsterilization':'Female sterilization', \
'Pill':'Pill', 'Male condom': 'Male condom'})
## use pivot table function from pandas to present the mean of contraceptive prevalence of each country.
country_mean=round(pd.pivot_table(subcountry, values=['Any method','Female sterilization','Pill','Male condom'],
index=['area'],aggfunc=np.mean),2)
country_mean
These figures presents the mean levels of each contraceptive methods use for 12 countries
## define a function to plot various methods for each country in a bar chart
source = ColumnDataSource(country_mean)
name=sorted(list(set(subcountry["area"])))
color=factor_cmap('area', palette=Paired[12], factors=name)
def plot(title, cate, right):
p1=figure(plot_height=400,y_range=name, min_border=20, toolbar_location=None, title=title)
p1t=p1.hbar(cate, right=right, left=0, height=0.9, source=source, legend="",
fill_color= color, line_color='black')
## hovertool is used to create interactive labels
p1.add_tools(HoverTool(renderers=[p1t], tooltips=[("Method","@{"+right+"}{0.0}")],mode='mouse'))
tab = Panel(child=p1, title=title)
output_notebook()
return tab
# Note: this plot only view locally. Can't be display via github
tab1=plot("Any contraceptive method", 'area',"Any method")
tab2=plot("Female sterilization", 'area',"Female sterilization")
tab3=plot("Male condom", 'area',"Male condom")
tab4=plot("Pill", 'area',"Pill")
tabs = Tabs(tabs=[ tab1, tab2, tab3, tab4 ])
show(tabs)
From the above figures, contraceptions are most commonly used in China and United Kingdom. Female sterilization is frequently used among developing countries such as India, China, Mexico and Korea. However, the US and Canada maintain a high level of female sterilization prevalence (16.3% and 19%, respectively) compared with other high income countries (for example, France is 5.5%). Women living in European countries such as Germany and France are more likely to choose pills. Male condoms are especially common in Japan as a major contraceptive method.
I use information after 1980. I first reshape this data from wide format into a long format for ploting. Then create 12 line plots to present contraceptive trend for various countries.
## Only examine the trend since 1980
subcountry = subcountry.query('year>1979')
long_subc = pd.melt(subcountry, id_vars=['area','iso', 'year'],\
value_vars=['Any method','Female sterilization','Pill','Male condom'])
#Parameters set up
sns.set(style="ticks",rc={"lines.linewidth": 2,'xtick.labelsize': 12, 'ytick.labelsize': 12,\
'font.size': 15, 'axes.labelsize': 12, 'legend.fontsize': 12})
g=sns.relplot(x="year", y="value",
hue="variable",
kind="line",col="area",style="variable", col_wrap=4, height=3,data=long_subc)
(g.set_ylabels("prevalence(%)")
.set_xlabels(""))
g._legend.texts[0].set_text("") ## remove legend title
#title for the plot
g.fig.subplots_adjust(top=0.9)
g.fig.suptitle('Figure 4: Contraceptive use prevalence trend of selected countries', fontsize=18)
# change column titles for each plot.
name=list(set(subcountry['area']))
titles=sorted(name)
for ax, title in zip(g.axes.flat, titles):
ax.set_title(title)
Figure 4 illustrates country-specific trend of contraceptive use. Overall, there is an upward trend of contraceptive use among different countries except for Japan and Australia. In most countries, the prevalence of female sterilization is declining. Male condoms are more and more popular in the recent years. The use of pills is stable in most countries. We observe an increase of pill use in Canana and the UK in more recent years. The use of contraceptive method has been stable in the US since 1980. Compared with European countries, the prevalence of female sterilization is much higher, whereas the short-acting methods remain low in the US.
## Note: this plot only view locally. Can't be display via github
## select usa data
CODE=int(pycountry.countries.get(alpha_3='USA').numeric)
usa=country_trend.loc[country_trend['ISO code']== CODE]
## prepare data for plotting
usa_trend = usa.iloc[:,[3,7,8,9,12,13,21]]
usa_trend = usa_trend.rename(columns = {'Survey\nend year':'year','Female\nsterilization':'Female sterilization',
'Male\nsterilization':'Male sterilization'})
usa_trend=usa_trend.interpolate(method ='linear') # fill NaN use interpolate method
## draw an interactive plot using bokeh
tools_to_show = 'box_zoom, save, reset, tap, wheel_zoom, pan' ## set up side bar tools
# set up the plotting canvas
p = figure(plot_width=850, plot_height=450, x_axis_label='Year', y_axis_label='Prevalence (%)',tools=tools_to_show)
p.title.text = 'Contraceptive prevalence in the United States: 1955-2017'
source = ColumnDataSource(usa_trend)
for method, color in zip(list(usa_trend.columns)[1:], Spectral6) :
p.line(usa_trend['year'], usa_trend[method], line_width=2, color=color, alpha=0.8,
muted_color=color, muted_alpha=0.1, legend= method)
## use source will lead to an error in legend but with source specified in the plot will be desirable to add hovertool.
## the trick is making the scatter plot connected with hovertool without showing legend
plt=p.circle('year', method, source=source, size=5, color=color, alpha=0.6)
## use +method+ to remove the heading and trending space
p.add_tools(HoverTool(renderers=[plt], tooltips=[("Year","@year"),
("Prevalence","@{"+method+"}{0.0}")],
mode='mouse'))
p.legend.location = "top_left"
p.legend.click_policy="mute"
p.legend.glyph_height = 12
p.legend.glyph_width = 8
p.legend.label_text_font_size ='8pt'
output_notebook()
#output_file("markers.html", title="markers.py example")
show(p)
If we take a close look at the contraceptive prevalence in the US since 1955, clearly, an increasing trend of female sterilization is prominent with a slight decline in more recent years. The use of pills was quite common in the 1970s and decreased since then. The use of long-acting reversible contraceptive methods such as IUD began to increase since 1995. Male condoms and male sterlization account for 20% among all contraceptive methods in the US. A small portion of people adopt withdrawal, which acounts for around 5% in all contraceptive methods.
As we have found, the contraceptive prevalence varies across countries and regions. Contraception is widely used in high income/developed countries whereas the prevalence of contraceptive use in less developed countries remains low. Women living in the US are more likely to rely on sterilization as an effective method to avoid unintended pregnancy, which is quite the opposite in countries such as France, Germany and Japan. Pills and male condoms have also been commonly used since 1990s in the US. The use of IUD has increased with the advent of new options. The observed pattern triggers more questions, for example, who are those women choosing female sterilization as a contraceptive method. I suspect there should be cohort differences when practice contraception. Older cohort may be more likely to adopt sterilization but younger cohort may rely on short acting methods. Race, education and marital status would also influence women's contraceptive preferences. The next step is to employ national data to further break down social categories and permit a close investigation of how contraceptive use is determined by demographic, social and behavioral factors.
As for data visualization, python with a variety of modelus, is a handy tool to present highly flexible and beatiful plots and figures. The thing is to master each module needs time and practices. Seaborn is a visually attractive module that is suitable for most basic plots. bokeh has nice functionality to create interactive plots. It is also capable to graph geograpic data. matplotlib is mostly used for academic plotting, I found the code is not so intuitive though.