Exercise: Geospatial Data Structures in Python#

In discussions about Python you will often read statements like this:

gdal_vs_rasterio

Source: GitHub Issue - rasterio vs python gdal #11

In this notebook, you will …#

  1. write a class representing a Polygon in Python,

  2. profile the execution time of your class and compare it to shapely and ogr implementations,

  3. draw conclusions about the advantages and disadvantages of using ogr, shapely or your own implementation,

  4. and finally, explain what Pythonic means.

Comparing ogr, shapely and pure Python#

We will compare three different classes to represent a polygon in Python and how to calculate its bounding box.

  1. The Geometry class provided by ogr package.

  2. The Polygon class provided by the shapely package.

  3. A simplified Python implementation of a Polygon class written by yourself.

The three classes should be compared in regard to:#

  • Execution time

  • Programming time (how long did it take to write the code)

  • Readability

  • Flexibility (how easily can you adapt it to your needs)

import numpy as np
import shapely

test_coordinates = [(0,1.5),(0.5,1),(1,1.5),(0.5,2),(0,1.5)]

1. OGR package#

E 1.1: Use the ogr package to create a polygon geometry object using the same test_coordinates as above.

E 1.2: Calculate the envelope of the polygon.

2. Shapely package#

E 2.1: Use shapely package to create a polygon object using the same test_coordinates as above.

E 2.2: Calculate the bounding box of the shapely polygon.

E 2.3: Plot the two polygons using matplotlib.

3. Do-It-Yourself Polygon class#

Theory: Classes and objects in Python#

class Person:
    age = None
    name = None
    
    def is_adult(self):
        if self.age >= 18:
            return True
        else:
            return False
person_a = Person()
person_a.is_adult()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 2
      1 person_a = Person()
----> 2 person_a.is_adult()

Cell In[2], line 6, in Person.is_adult(self)
      5 def is_adult(self):
----> 6     if self.age >= 18:
      7         return True
      8     else:

TypeError: '>=' not supported between instances of 'NoneType' and 'int'

E 3.1: Write a class called MyPolygon. It should be created by passing a list of coordinates representing the nodes of the polygon, e.g. MyPolygon(coordinates=[(1,1),(1,2),(2,2),(2,1),(1,1))]). Calling polygon.coordinates should return the coordinates.

my_poly = MyPolygon(coordinates=test_coordinates)

Test whether the instance of class Polygon returns the right value by executing the cell below. If there is no error message, everything’s fine.

assert my_poly.coordinates == test_coordinates

E 3.2: Add a method called envelope to the class which calculates the bounding box of the polygon. The method should return the bounding box as a list containing the minimum and maximum coordinates of the bounding box, i.e.[minimum_x, minimum_y, maximum_x, maximum_y].

Note: A bounding box is often reffered to as the envelope of a geometry.

Check if your results are the same as above.

4. Comparison of execution times#

Compare the exectution times of the ogr, shapely and your own class using the magic command %%timeit. This function will execute the cell multiple times to get a good estimate of the execution time.

E 4.1: Compare the execution times of the calculation of the bounding box of all three classes.

my_poly = MyPolygon(test_coordinates)
%%timeit
my_poly.envelope()

Question: What do you observe when you compre the execution times of all three methods? Can you explain the difference in execution times?

Comparison bounding box creation and object creation#

When choosing the most efficient way to calculate something, we also need to consider the overhead of the calculation. The overhead contains all the processing steps that need to be taken as a preparation before the execution of the desired calculation. Depending on the implementation, this can change your decision.

E 4.2: Measure the execution times of all three implementations including the object creation.

%%timeit
my_poly = MyPolygon(test_coordinates)
my_poly.envelope()

Question: What do you observe when you compare the execution times and the object creation of all three polygon implementations? Can you explain the difference in execution times?

Your Answer:

5. Questions for Discussion#

Q1: Based on your results of the exercises, evaluate the three methods in the table below. (Double click the cell, to edit it)

Execution time
(fast - slow)

Programming time
(fast - slow)

Readability
(high - low)

Flexibility
(high - low)

DIY Python

OGR

Shapely

Q2: What are the advantages and disadvantages of using ogr, shapely or your own implementation for vector data processing?

Your answers:

OGR and GDAL Python bindings#

  • ogr and gdal are automatically generated Python bindings (using SWIG) to the C libraries GDAL and OGR

  • So when you use ogr and gdal classes in Python you are actually executing C code.

  • Still, the syntax or ogr and gdal doesn’t really feel like Python (e.g. it needs many lines of code)

  • shapely is a native Python package and is therefore closer to the way Python code is supposed to be written

But wait this is not the end.#

Why are we still mostly using shapely even though it is so slow?

E.: Create 1000 random polygons and calculate the bounding box of each polygon. Compare the execution times of all three classes. Do this separately using the ‘ogr’ implementation and the ‘shapely’ implementation.

n = 1000000

Shapely#

# Create 1000 random polygons using shapely 
import random   
import shapely.geometry as sg

random_shapely_polygons = [sg.Polygon([(random.uniform(0, 10), random.uniform(0, 10)) for i in range(5)]) for j in range(n)]  
# Loop over all polygons and calculate the area
# Shapely
areas_shapely = np.empty(n)
for i, poly in enumerate(random_shapely_polygons):
    areas_shapely[i] = shapely.area(poly)

OGR#

# Create 1000 random polygons using shapely 
import random   

random_ogr_polygons = []
for i in range(n):
    start_point = (random.uniform(0, 10), random.uniform(0, 10))
    coordinates = [start_point] + [(random.uniform(0, 10), random.uniform(0, 10)) for i in range(3)] + [start_point]
    new_polygon = create_polygon(coordinates)
    random_ogr_polygons.append(new_polygon)
# Loop over all polygons and calculate the area
areas = np.empty(n)
for i, poly in enumerate(random_ogr_polygons):
    areas[i] = poly.GetArea()

Geopandas#

# Create a geodataframe with n random polygons
import geopandas as gpd
random_polygons_df = gpd.GeoDataFrame({'geometry': random_shapely_polygons})
random_polygons_df.area
0          4.954251
1          8.478507
2         18.045323
3          6.314120
4          5.871313
            ...    
999995     6.119961
999996    16.395933
999997    12.813947
999998    12.657971
999999     5.360472
Length: 1000000, dtype: float64

What kind of geometry data type is GeoPandas using?

type(random_polygons_df.geometry[0])
shapely.geometry.polygon.Polygon

It uses shapely but somehow the calculation is now faster than OGR. It uses a vectorization. We will learn tomorrow what this is and how it works.

References#

If you would like to take a deeper look at object oriented programming in Python take a look at the following resources: