Create a filled map (chloropleth) with Python

First, big thanks to Stephan Hügel for his excellent guide to mapping with Python. To create this guide, I heavily borrowed from Stephan's work, but tried to provide more contextualizing comment to ensure that a less advanced user can create their own filled map.

For this guide, I focus on creating a filled map of Sri Lanka. I use some basic demographic data plus a shapefile that includes the first level of administration (for example, in the United States this is a state) to demonstrate how to create and manipulate a filled map with Python.

The shapefile I obtained is from the Spatial Data Repository where you can find detailed shapefiles for most countries.

You can find the whole repository for this project on GitHub. Feel free to hit me up with questions on Twitter @brandonmrose.

Contents

Up front, you are going to need to install some packages you might not have used before. I had to install pysal, descartes, shapely, and Basemap. Install Basemap with:

pip install basemap --allow-external basemap --allow-unverified basemap
In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from matplotlib.colors import Normalize
from matplotlib.collections import PatchCollection
from mpl_toolkits.basemap import Basemap
from shapely.geometry import Point, Polygon, MultiPoint, MultiPolygon
from shapely.prepared import prep
from pysal.esda.mapclassify import Natural_Breaks as nb
from descartes import PolygonPatch
import fiona
from itertools import chain
import pysal.esda.mapclassify as mapclassify

Reading in the dataset

I read in the dataset using Pandas. The primary dataset is just a .csv file where each row represents a district (state/province) along with the total population numbers per district plus population numbers for the three major ethnic groups. I've also already calculated the percent of the total population accounted for by each ethnic group. For example, in Colombo, 77% of the population is Sinhalese.

I read in the dataset using Pandas; it's automatically read in as a dataframe. If you're not familiar with Pandas you can learn more here. It is a go-to for data analysis with Python.

In [2]:
data = pd.read_csv('srl_pop.csv')
In [3]:
#print the first 5 rows of data to see what it looks like and make sure it read in properly
data.head()
Out[3]:
District_1 Total_pop Sinhalese Tamils Muslims Others Sinhalese_pct Tamils_pct Muslims_pct Others_pct
0 Colombo 2309809 1771319 258654 242728 37108 77 11 11 2
1 Gampaha 2294641 2079115 90950 95501 29075 91 4 4 1
2 Kalutara 1217260 1054991 47973 112276 2020 87 4 9 0
3 Kandy 1369899 1018323 154874 191159 5543 74 11 14 0
4 Matale 482229 389092 48156 44113 868 81 10 9 0
In [4]:
print 'The data contains ' + str(data.shape[0]) + ' rows and ' + str(data.shape[1]) + ' columns.' 
print 'Each of the 25 rows corresponds to one of the districts in Sri Lanka.'
The data contains 25 rows and 10 columns.
Each of the 25 rows corresponds to one of the districts in Sri Lanka.

Loading in the shapefile

A note on shapefiles

Note that when you download your shapefile you might get multiple shapefile sets. The Sri Lanka shapefile from the Spatial Data Repository that I downloaded contained 3 shapefiles: LKA_adm0.shp, LKA_adm1.shp, and LKA_adm2.shp. As you might imagine, LKA_adm0.shp is the shapefile for the whole country without any states. LKA_adm2.shp is a shapefile outlining the first-level administrative boundaries (states). LKA_adm3.shp goes a level deeper and has second-level administrative boundaries (think counties). For this guide I use the LKA_adm2.sp file and focus on state (or what I've called district level mapping).

So what's in a shapefile? A shapefile is really a set of files that needs to remain together (in the same folder). At minimum, the file should contain .shp, .shx, and .dbf files. The .shp and .shx files define the shapes themselves while the .dbf file will contain metadata on the shape polygons (e.g. state names). Sometimes (and in the case of the Sri Lanka shapefile) you will receive other files too like a .csv and a .prj. The .csv is basically the same data that is in the .dbf but is in a more ubiquitous format in case you want to inspect/analyze it. The .prj file defines the projection. You actually shouldn't mess with any of these files without using a GIS software like the very awesome, open source QGIS.

You absolutely will need to make sure that within the .dbf file the states or districts have the same spelling conventions as in your data set! If you don't have a .csv to inspect (which will mirror the .dbf) you can download Apache's Open Office for free. The Open Office spreadsheet program will let you open and inspect the .dbf file. Don't change it though; it's much easier to change your data file.

Now, back to loading in the shapefile

Import the shapefile using fiona. Also I set some variables which define the boundaries of the map area. One way to think about it is that the shapefile you pass to fiona has some bounds which are defined by 2 latitude, longitude pairs: the first defines the lower left (ll) corner of the map and the second defines the upper right (ur) corner of the map.

By subtracting the left edge longitude from the right edge longitude we can determine the width (w) for the map. We can do the same with the latitutde to get the height of the map.

In [5]:
#load the shape file as shp. Here I have saved my shapefile in the folder 'LKA_adm_2' within my working directory.
#your shapefile should end in .shp
shp = fiona.open('LKA_adm_2/LKA_adm1.shp')

#we can access the boundaries (the 2 lat,long pairs) using shp.bounds
bds = shp.bounds

#close the shp file
shp.close()

#define a variable called extra which we will use for padding the map when we display it (in this case I've selected a 10% pad)
extra = 0.1

#define the lower left hand boundary (longitude, latitude)
ll = (bds[0], bds[1])

#define the upper right hand boundary (longitude, latitude)
ur = (bds[2], bds[3])

#concatenate the lower left and upper right into a variable called coordinates
coords = list(chain(ll, ur))

#define variables for the width and the height of the map
w, h = coords[2] - coords[0], coords[3] - coords[1]

Create the basemap

Below I create the basemap, defined by the variable m. Note that you need to set the lat and long (lon_0, lat_0) so that it is roughly in the center of what your map. You could just use Google Maps to figure out where you want to center your map. Here, I took the average (the midpoint) between the lower and upper longitude and the lower and upper latitude to center the make squarely in the middle of the shapefile.

If you're interested in learning about the Basemap API check out the matplotlib Basemap documentation here. It runs through the various projections you can use.

In [6]:
m = Basemap(
    #set projection to 'tmerc' which is apparently less distorting when close-in
    projection='tmerc',

    #set longitude as average of lower, upper longitude bounds
    lon_0 = np.average([bds[0],bds[2]]),

    #set latitude as average of lower,upper latitude bounds
    lat_0 = np.average([bds[1],bds[3]]),

    #string describing ellipsoid (‘GRS80’ or ‘WGS84’, for example). Not sure what this does...
    ellps = 'WGS84',
    
    #set the map boundaries. Note that we use the extra variable to provide a 10% buffer around the map
    llcrnrlon=coords[0] - extra * w,
    llcrnrlat=coords[1] - extra + 0.01 * h,
    urcrnrlon=coords[2] + extra * w,
    urcrnrlat=coords[3] + extra + 0.01 * h,

    #provide latitude of 'true scale.' Not sure what this means, I would check the Basemap API if you are a GIS guru
    lat_ts=0,

    #resolution of boundary database to use. Can be c (crude), l (low), i (intermediate), h (high), f (full) or None.
    resolution='i',
    
    #don't show the axis ticks automatically
    suppress_ticks=True)

m.readshapefile(
    #provide the path to the shapefile, but leave off the .shp extension
    'LKA_adm_2/LKA_adm1',

    #name your map something useful (I named this 'srilanka')
    'srilanka',

    #set the default shape boundary coloring (default is black) and the zorder (layer order)
    color='none',
    zorder=2)
Out[6]:
(25,
 5,
 [79.52180480957031, 5.918471813201904, 0.0, 0.0],
 [81.87875366210955, 9.835970878601088, 0.0, 0.0],
 <matplotlib.collections.LineCollection at 0x10bc59bd0>)

You can access the m object by passing m.whatever_you_named_your_shape. Each item in m.srilanka is a matrixa 2 x n matrix defining the polygon shape. I believe they are the locations of each point, which is then interpolated to create the polygon outline (not 100% sure about this, but for our purposes it doesn't matter). If you want to take a look at the first item in m.srilanka execute

In: m.srilanka[0]
In [7]:
print 'm.srilanka is a ' + str(type(m.srilanka)) + ' object.'

print 'It contains ' + str(len(m.srilanka)) + ' items.'

print 'The first list item itself contains ' + str(len(m.srilanka[0])) + ' items.'
m.srilanka is a <type 'list'> object.
It contains 156 items.
The first list item itself contains 103 items.

You might be wondering why the list contains 156 items, rather than 25 (1 for each district). When you use the readshapefile class 'rings in individual Polygon shapes are split out into separate polygons, and additional keys ‘RINGNUM’ and ‘SHAPENUM’ are added to the shape attribute dictionary' (per the Basemap documentation). What this means is that if the district shape is actually comprised of multiple polygons, then the shape is split out into each of its associated polygon rings. This occurs frequently when the shape contains complex coastlines (e.g. islands) since the shape for that state or district is actually comprised of multiple fully formed polygons (an island requires its own polygon). Think of the U.S. State of Hawaii -- it is comprised of multiple island polygons.

You can also access the info feature of the polygon which has the associated metadata. Note that RINGNUM and SHAPENUM were added. I'll be using NAME_1 to merge this with my data as this contains the district name.

In [8]:
m.srilanka_info[0]
Out[8]:
{'ENGTYPE_1': 'District',
 'ID_0': 215,
 'ID_1': 1,
 'ISO': 'LKA',
 'NAME_0': 'Sri Lanka',
 'NAME_1': 'Ampara',
 'NL_NAME_1': '                                                  ',
 'RINGNUM': 1,
 'SHAPENUM': 1,
 'TYPE_1': 'Distrikkaya',
 'VARNAME_1': 'Amparai'}

Setting up the map dataframe

Next, we need to take the information from m.srilanka and get it into a nice Panda's dataframe. I use Shapely's Polygon class to convert the x,y information contained in the m.srilanka list to create a defined polygon. You can read more about Shapely here

Then, I grab NAME_1 and call it district and define areas for each polygon. Note that each Shapely polygon has an .area attribute which calculates the polygon area (in meters). I add a column for miles as well.

In [9]:
# set up a map dataframe
df_map = pd.DataFrame({

    #access the x,y coords and define a polygon for each item in m.srilanka
    'poly': [Polygon(xy) for xy in m.srilanka],
    #conver NAME_1 to a column called 'district'
    'district': [district['NAME_1'] for district in m.srilanka_info]})

#add the polygon area
df_map['area_m'] = df_map['poly'].map(lambda x: x.area/1000)

#convert meters to miles
df_map['area_miles'] = df_map['area_m'] * 0.000621371
In [10]:
df_map.head()
Out[10]:
district poly area_m area_miles
0 Ampara POLYGON ((276169.8002706398 175908.0897950437,... 519.996944 0.323111
1 Ampara POLYGON ((217641.0364441486 207208.6241718179,... 4484127.664631 2786.306891
2 Anuradhapura POLYGON ((166929.5980524687 338253.4440058147,... 7206341.191304 4477.811432
3 Badulla POLYGON ((188035.5295273047 193888.385865976, ... 2868839.391369 1782.613601
4 Batticaloa POLYGON ((279070.6564249436 180985.0646625504,... 650.514429 0.404211

Binding the data to the map

Again, note that we will be matching District_1 witin the data to district within the map dataframe.

In [11]:
data.head()
Out[11]:
District_1 Total_pop Sinhalese Tamils Muslims Others Sinhalese_pct Tamils_pct Muslims_pct Others_pct
0 Colombo 2309809 1771319 258654 242728 37108 77 11 11 2
1 Gampaha 2294641 2079115 90950 95501 29075 91 4 4 1
2 Kalutara 1217260 1054991 47973 112276 2020 87 4 9 0
3 Kandy 1369899 1018323 154874 191159 5543 74 11 14 0
4 Matale 482229 389092 48156 44113 868 81 10 9 0

First, let's rename District_1 to district within the data

In [12]:
data=data.rename(columns = {'District_1':'district'})

Did it work? Let's take a look at the row for Colombo.

In [13]:
data[data['district']=='Colombo']
Out[13]:
district Total_pop Sinhalese Tamils Muslims Others Sinhalese_pct Tamils_pct Muslims_pct Others_pct
0 Colombo 2309809 1771319 258654 242728 37108 77 11 11 2

Perfect! Now, let's merge our two data frames with df_map on the left and data on the right based on the key variable district. I redefine df_map as the merged dataframe.

In [14]:
df_map = pd.merge(df_map, data, on='district')

Let's make sure this looks right:

In [15]:
df_map.head()
Out[15]:
district poly area_m area_miles Total_pop Sinhalese Tamils Muslims Others Sinhalese_pct Tamils_pct Muslims_pct Others_pct
0 Ampara POLYGON ((276169.8002706398 175908.0897950437,... 519.996944 0.323111 648057 251018 112915 282484 1640 39 17 44 0
1 Ampara POLYGON ((217641.0364441486 207208.6241718179,... 4484127.664631 2786.306891 648057 251018 112915 282484 1640 39 17 44 0
2 Anuradhapura POLYGON ((166929.5980524687 338253.4440058147,... 7206341.191304 4477.811432 856232 778131 6022 70248 1831 91 1 8 0
3 Badulla POLYGON ((188035.5295273047 193888.385865976, ... 2868839.391369 1782.613601 811758 593120 169997 45886 2755 73 21 6 0
4 Batticaloa POLYGON ((279070.6564249436 180985.0646625504,... 650.514429 0.404211 525142 6127 382300 133844 2871 1 73 25 1

Now you can see that the map dataframe includes the demographic data iterated for each polygon.

Binning the data

Next, we need to decide how we want to split up the data. For a filled map, we need to decide how many bins we want (call this k) and where to split the bins. For example, if most of the data points we would like to plot are between 0 and 10, perhaps we would want to split it evenly in 4 bins ranging from 0-2.5, 2.5-5, 5-7.5 and 7.5-10. The problem with this approach is that your map might not tell you much if most of the data is in the first bin, 0-2.5, except a couple outliers. So, I'll present two options:
  • use Jenks natural breaks
  • define our own breaks

Jenks natural breaks defines optimal breakpoints for the data using a clustering algorithm. This is great if you are not sure where to break or bin your data.

Defining our own breaks (like the example I provided above) is useful for when you want to create a set of maps using the same basemap but plotting different variables. In this case, you generally want to have your data plotted on the same scale for ease of comparison.

Below, toggle the variable Jenks to True if you want to use Jenks bins and False if you want to define your own.

Set the bins
In [16]:
# change False to True to use Jenks binning
jenks = False

Next, set the variable you would like to plot. This should be the column name in df_map that you would like to analyze. I have set it to 'Tamils_pct' for the purpose of the walkthrough.

Set the variable to plot
In [17]:
# change 'Tamils_pct' to the column name of what you want to plot (e.g. 'Total_pop' for total population)
var_2_analyze = 'Tamils_pct'
Bin the data
In [18]:
if jenks == True:
    # Calculate Jenks natural breaks for each polygon
    breaks = nb(
        # set the data to use
        df_map[df_map[var_2_analyze].notnull()][var_2_analyze].values,

        # since this is an optimization function we need to give it a number of initial solutions to find. 
        # you can adjust this number if you are unsatisfied with the bin results
        initial=300,

        # k is the number of natural breaks you would like to apply. I've set it to 10, but you can change.
        k=10)

else:
    # Define my own breaks [even split each 20 percentage points] Note that the bins are the top range so >20, >40, etc
    # you can change the bins to whatever you like, though they should be based on the data you are analyzing
    # since I am going to plot data on a 0 to 100 scale, I chose these break points
    my_bins = [20,40,60,80,100]
    
    # Calculate the user defined breaks for our defined bins
    breaks = mapclassify.User_Defined(
               
            # set the data to use 
            df_map[df_map[var_2_analyze].notnull()][var_2_analyze].values, 
            
            #use my bins
            my_bins)

Note that the output of our breaks variable from pysal.esda.mapclassify has two attributes we can access: breaks.y and breaks.yb

  • breaks.y has the actual data that we broke on
  • breaks.yb has the bin that is output

You can take a look at this if you're interested by executing:

In: breaks.y
In: breaks.yb
Add breaks to the dataframe as 'bins'
In [19]:
# check if 'bins' already exists and drop it if it does so that we can recreate it using our new break information
if 'bins' in df_map.columns:
    df_map = df_map.drop('bins',1)
    print 'Bins column already existed, so we dropped the bins column'


# the notnull method lets us match indices when joining
# b is a dataframe of the bins with the var_2_analyze index
b = pd.DataFrame({'bins': breaks.yb}, index=df_map[df_map[var_2_analyze].notnull()].index)

# join b back to df_map
df_map = df_map.join(b)

# and handle our NA's if there are any
df_map.bins.fillna(-1, inplace=True)
Create labels for the legend
In [20]:
# check if this is a jenks or user-defined break
if jenks == True:
    
    # if jenks, use these labels
    bin_labels = ["<= %0.0f" % b for b in breaks.bins]
else: 
    
    # if user defined, use these ones
    bin_labels = ["< %0.0f" % b for b in breaks.bins]
    
print 'Here are the bin labels:'
for label in bin_labels:
    print label
Here are the bin labels:
< 20
< 40
< 60
< 80
< 100

Plotting the Map

Set up the colors

This is straight from Stephan Hügel's guide to mapping with Python. He does a very nice job defining reusable functions for mapping numbers to colors. I'm not going to get into what exactly these functions do in this walk-through, but keep in mind that we now have each polygon bucketed within a bin. We just want to ensure that we evenly split our color scale across the bins. These functions help us do that using some of matplotlib's built in functionality.

In [21]:
# Convenience functions for working with color ramps and bars
def colorbar_index(ncolors, cmap, labels=None, **kwargs):
    """
    This is a convenience function to stop you making off-by-one errors
    Takes a standard colour ramp, and discretizes it,
    then draws a colour bar with correctly aligned labels
    """
    cmap = cmap_discretize(cmap, ncolors)
    mappable = cm.ScalarMappable(cmap=cmap)
    mappable.set_array([])
    mappable.set_clim(-0.5, ncolors+0.5)
    colorbar = plt.colorbar(mappable, **kwargs)
    colorbar.set_ticks(np.linspace(0, ncolors, ncolors))
    colorbar.set_ticklabels(range(ncolors))
    if labels:
        colorbar.set_ticklabels(labels)
    return colorbar

def cmap_discretize(cmap, N):
    """
    Return a discrete colormap from the continuous colormap cmap.

        cmap: colormap instance, eg. cm.jet. 
        N: number of colors.

    Example
        x = resize(arange(100), (5,100))
        djet = cmap_discretize(cm.jet, 5)
        imshow(x, cmap=djet)

    """
    if type(cmap) == str:
        cmap = get_cmap(cmap)
    colors_i = np.concatenate((np.linspace(0, 1., N), (0., 0., 0., 0.)))
    colors_rgba = cmap(colors_i)
    indices = np.linspace(0, 1., N + 1)
    cdict = {}
    for ki, key in enumerate(('red', 'green', 'blue')):
        cdict[key] = [(indices[i], colors_rgba[i - 1, ki], colors_rgba[i, ki]) for i in xrange(N + 1)]
    return matplotlib.colors.LinearSegmentedColormap(cmap.name + "_%d" % N, cdict, 1024)

Onto plotting the map!

Let's first make sure that we plot our map inline (as opposed to a Python viewer)

In [22]:
%matplotlib inline

Next, we are going to actually plot the map using matplotlib. I've stepped through the process with in-line comments.

One thing worth mentioning is that you can customize the colors with any of the matplotlib color maps. You can take a look at all the options available in the reference document here. For creating a filled map you should stick to the sequential color maps.

In [23]:
# initialize the plot
plt.clf()

# define the figure and set the facecolor (e.g. background) to white
fig = plt.figure(facecolor='white')

# ad a subplot called 'ax'
ax = fig.add_subplot(111, axisbg='w', frame_on=False)

# use a blue colour ramp ('Blues') - we'll be converting it to a map using cmap()
# you could also use 'Oranges' or 'Greens' 
cmap = plt.get_cmap('Blues')


# draw district with grey outlines
df_map['patches'] = df_map['poly'].map(lambda x: PolygonPatch(x, ec='#555555', lw=.2, alpha=1., zorder=4))


# set the PatchCollection with our defined 'patches'
pc = PatchCollection(df_map['patches'], match_original=True)

# normalize our bins between the min and max values within the bins
norm = Normalize(vmin=df_map['bins'].min(), vmax=df_map['bins'].max())

# impose our color map onto the patch collection
pc.set_facecolor(cmap(norm(df_map['bins'].values)))
ax.add_collection(pc)

# Add a color bar which has our bin_labels applied
cb = colorbar_index(ncolors=len(bin_labels), cmap=cmap, shrink=0.5, labels=bin_labels)
# set the font size of the labels (set to size 10 here)
cb.ax.tick_params(labelsize=10)

# Create a bit of small print
smallprint = ax.text(
    # set the x,y location of the smallprint
    1, 1,
    # add whatever text you would like to appear
    'This is a map of Sri Lanka showing ' + var_2_analyze + ' per district.',
    # set the horizontal/vertical alignment
    ha='right', va='bottom',
    # set the size and the color
    size=10,
    color='#555555',
    transform=ax.transAxes)

# Draw a map scale
m.drawmapscale(
    #set the coordinates where the scale should appear
    coords[0] + 0.08, coords[1] + 0.215,
    coords[0], coords[1],
    # what is the max value of the scale (here it's set to 25 for 25 miles)
    25.,
    barstyle='fancy', labelstyle='simple',
    fillcolor1='w', fillcolor2='#555555',
    fontcolor='#555555',
    zorder=5,
    # what units would you like to use. Defaults to km
    units='mi')

# set the layout to maximally fit the bounding area
plt.tight_layout()

# define the size of the figure
fig.set_size_inches(5,6)

# save the figure. Increase the dpi to increase the quality of the output .png. For example, dpi=1000 is super high quality
# note that the figure will be saved as 'sri_lanka_' then the name of the variable under analysis
# you can change this to whatever you want
plt.savefig('sri_lanka_' + var_2_analyze + '.png', dpi=100, alpha=True)

# display our plot
plt.show()
<matplotlib.figure.Figure at 0x10970c390>