Exercise: Geospatial Data Structures in Python#
In discussions about Python you will often read statements like this:
Source: GitHub Issue - rasterio vs python gdal #11
In this notebook, you will …#
write a class representing a Polygon in Python,
profile the execution time of your class and compare it to shapely and ogr implementations,
draw conclusions about the advantages and disadvantages of using ogr, shapely or your own implementation,
and finally, explain what Pythonic means.
Comparing ogr, shapely and pure Python#
We will compare three different classes to represent a polygon in Python and how to calculate its bounding box.
The
Geometry
class provided byogr
package.The
Polygon
class provided by theshapely
package.A simplified Python implementation of a
Polygon
class written by yourself.
The three classes should be compared in regard to:#
Execution time
Programming time (how long did it take to write the code)
Readability
Flexibility (how easily can you adapt it to your needs)
import numpy as np
import shapely
test_coordinates = [(0,1.5),(0.5,1),(1,1.5),(0.5,2),(0,1.5)]
1. OGR package#
E 1.1: Use the ogr
package to create a polygon geometry object using the same test_coordinates
as above.
E 1.2: Calculate the envelope of the polygon.
2. Shapely package#
E 2.1: Use shapely
package to create a polygon object using the same test_coordinates
as above.
E 2.2: Calculate the bounding box of the shapely polygon.
E 2.3: Plot the two polygons using matplotlib.
3. Do-It-Yourself Polygon
class#
Theory: Classes and objects in Python#
class Person:
age = None
name = None
def is_adult(self):
if self.age >= 18:
return True
else:
return False
person_a = Person()
person_a.is_adult()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[3], line 2
1 person_a = Person()
----> 2 person_a.is_adult()
Cell In[2], line 6, in Person.is_adult(self)
5 def is_adult(self):
----> 6 if self.age >= 18:
7 return True
8 else:
TypeError: '>=' not supported between instances of 'NoneType' and 'int'
E 3.1: Write a class called MyPolygon
. It should be created by passing a list of coordinates representing the nodes of the polygon, e.g. MyPolygon(coordinates=[(1,1),(1,2),(2,2),(2,1),(1,1))])
. Calling polygon.coordinates
should return the coordinates.
my_poly = MyPolygon(coordinates=test_coordinates)
Test whether the instance of class Polygon
returns the right value by executing the cell below. If there is no error message, everything’s fine.
assert my_poly.coordinates == test_coordinates
E 3.2: Add a method called envelope
to the class which calculates the bounding box of the polygon. The method should return the bounding box as a list containing the minimum and maximum coordinates of the bounding box, i.e.[minimum_x, minimum_y, maximum_x, maximum_y]
.
Note: A bounding box is often reffered to as the envelope of a geometry.
Check if your results are the same as above.
4. Comparison of execution times#
Compare the exectution times of the ogr, shapely and your own class using the magic command %%timeit
. This function will execute the cell multiple times to get a good estimate of the execution time.
E 4.1: Compare the execution times of the calculation of the bounding box of all three classes.
my_poly = MyPolygon(test_coordinates)
%%timeit
my_poly.envelope()
Question: What do you observe when you compre the execution times of all three methods? Can you explain the difference in execution times?
Comparison bounding box creation and object creation#
When choosing the most efficient way to calculate something, we also need to consider the overhead of the calculation. The overhead contains all the processing steps that need to be taken as a preparation before the execution of the desired calculation. Depending on the implementation, this can change your decision.
E 4.2: Measure the execution times of all three implementations including the object creation.
%%timeit
my_poly = MyPolygon(test_coordinates)
my_poly.envelope()
Question: What do you observe when you compare the execution times and the object creation of all three polygon implementations? Can you explain the difference in execution times?
Your Answer:
5. Questions for Discussion#
Q1: Based on your results of the exercises, evaluate the three methods in the table below. (Double click the cell, to edit it)
Execution time |
Programming time |
Readability |
Flexibility |
|
---|---|---|---|---|
DIY Python |
||||
OGR |
||||
Shapely |
Q2: What are the advantages and disadvantages of using ogr, shapely or your own implementation for vector data processing?
Your answers:
OGR and GDAL Python bindings#
ogr
andgdal
are automatically generated Python bindings (using SWIG) to the C libraries GDAL and OGRSo when you use
ogr
andgdal
classes in Python you are actually executing C code.Still, the syntax or
ogr
andgdal
doesn’t really feel like Python (e.g. it needs many lines of code)shapely
is a native Python package and is therefore closer to the way Python code is supposed to be written
But wait this is not the end.#
Why are we still mostly using shapely even though it is so slow?
E.: Create 1000 random polygons and calculate the bounding box of each polygon. Compare the execution times of all three classes. Do this separately using the ‘ogr’ implementation and the ‘shapely’ implementation.
n = 1000000
Shapely#
# Create 1000 random polygons using shapely
import random
import shapely.geometry as sg
random_shapely_polygons = [sg.Polygon([(random.uniform(0, 10), random.uniform(0, 10)) for i in range(5)]) for j in range(n)]
# Loop over all polygons and calculate the area
# Shapely
areas_shapely = np.empty(n)
for i, poly in enumerate(random_shapely_polygons):
areas_shapely[i] = shapely.area(poly)
OGR#
# Create 1000 random polygons using shapely
import random
random_ogr_polygons = []
for i in range(n):
start_point = (random.uniform(0, 10), random.uniform(0, 10))
coordinates = [start_point] + [(random.uniform(0, 10), random.uniform(0, 10)) for i in range(3)] + [start_point]
new_polygon = create_polygon(coordinates)
random_ogr_polygons.append(new_polygon)
# Loop over all polygons and calculate the area
areas = np.empty(n)
for i, poly in enumerate(random_ogr_polygons):
areas[i] = poly.GetArea()
Geopandas#
# Create a geodataframe with n random polygons
import geopandas as gpd
random_polygons_df = gpd.GeoDataFrame({'geometry': random_shapely_polygons})
random_polygons_df.area
0 4.954251
1 8.478507
2 18.045323
3 6.314120
4 5.871313
...
999995 6.119961
999996 16.395933
999997 12.813947
999998 12.657971
999999 5.360472
Length: 1000000, dtype: float64
What kind of geometry data type is GeoPandas using?
type(random_polygons_df.geometry[0])
shapely.geometry.polygon.Polygon
It uses shapely but somehow the calculation is now faster than OGR. It uses a vectorization. We will learn tomorrow what this is and how it works.
References#
If you would like to take a deeper look at object oriented programming in Python take a look at the following resources: