Part 2: Pandas and GeoPandas#
Pandas is built on NumPy, so it supports vectorized computation as well.
→ Pandas User Guide: Accelerated Operations
1. Brief Introduction to Pandas and GeoPandas#
import pandas as pd
file_path = "./data/new_york_hotels.csv"
hotels = pd.read_csv(file_path, encoding='cp1252')
hotels.head()
 | ean_hotel_id | name | address1 | city | state_province | postal_code | latitude | longitude | star_rating | high_rate | low_rate |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 269955 | Hilton Garden Inn Albany/SUNY Area | 1389 Washington Ave | Albany | NY | 12206 | 42.68751 | -73.81643 | 3.0 | 154.0272 | 124.0216 |
1 | 113431 | Courtyard by Marriott Albany Thruway | 1455 Washington Avenue | Albany | NY | 12206 | 42.68971 | -73.82021 | 3.0 | 179.0100 | 134.0000 |
2 | 108151 | Radisson Hotel Albany | 205 Wolf Rd | Albany | NY | 12205 | 42.72410 | -73.79822 | 3.0 | 134.1700 | 84.1600 |
3 | 254756 | Hilton Garden Inn Albany Medical Center | 62 New Scotland Ave | Albany | NY | 12208 | 42.65157 | -73.77638 | 3.0 | 308.2807 | 228.4597 |
4 | 198232 | CrestHill Suites SUNY University Albany | 1415 Washington Avenue | Albany | NY | 12206 | 42.68873 | -73.81854 | 3.0 | 169.3900 | 89.3900 |
E: Calculate the mean star_rating.
E: How many unique cities are in the city column?
E: How many hotels are there in Newark?
E: How many hotels are there in each city?
E: How many hotels have ‘Hilton’ in their name?
E: Calculate the average star rating and its standard deviation per city.
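The pandas methods that are useful for these exercises can be sketched on a tiny stand-in table. The `toy` DataFrame and its values are made up for illustration and are not taken from the hotels file; the calls shown are only one possible approach.

```python
import pandas as pd

# Hypothetical miniature stand-in for the hotels table (values are made up).
toy = pd.DataFrame({
    "name": ["Hilton Downtown", "Budget Inn", "Hilton Airport", "City Lodge"],
    "city": ["Albany", "Newark", "Newark", "Albany"],
    "star_rating": [3.0, 2.0, 4.0, 3.0],
})

mean_rating = toy["star_rating"].mean()               # mean of one column
n_cities = toy["city"].nunique()                      # number of distinct values
n_newark = (toy["city"] == "Newark").sum()            # boolean mask, then count the True values
per_city = toy["city"].value_counts()                 # number of rows per city
n_hilton = toy["name"].str.contains("Hilton").sum()   # substring match on the names
# Mean and (sample) standard deviation of the rating per city
rating_stats = toy.groupby("city")["star_rating"].agg(["mean", "std"])
```

All of these operate on whole columns at once, so no explicit Python loop is needed.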
2. How to use vectorized computation in Pandas and Geopandas#
To perform geospatial operations with the data, we need to create a geometry column which contains Point geometries based on the latitude and longitude columns for each hotel. This can be done in different ways.
Exercise 2.1: Implement this in different ways and measure the execution time.
Using iterrows() and a for loop#
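A minimal sketch of the loop-based variant, assuming a small made-up table `toy_rows` in place of the hotels DataFrame. Note that shapely's `Point` expects `(x, y)`, which means `(longitude, latitude)`.

```python
import pandas as pd
from shapely.geometry import Point

# Hypothetical stand-in for the hotels table; only the two coordinate columns are needed.
toy_rows = pd.DataFrame({
    "latitude": [42.68751, 42.68971],
    "longitude": [-73.81643, -73.82021],
})

points_iterrows = []
for _, row in toy_rows.iterrows():
    # Each row is materialized as a Series, which makes this the slowest variant.
    points_iterrows.append(Point(row["longitude"], row["latitude"]))
```

In a notebook, putting `%%timeit` on the first line of the cell measures the execution time.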
Using list comprehension#
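One possible list-comprehension variant, again on a made-up table (`toy_lc` is a hypothetical name). Zipping the two columns avoids creating a Series object per row.

```python
import pandas as pd
from shapely.geometry import Point

# Hypothetical stand-in for the hotels table.
toy_lc = pd.DataFrame({
    "latitude": [42.68751, 42.68971],
    "longitude": [-73.81643, -73.82021],
})

# Iterate over the two columns in parallel; Point takes (x, y) = (longitude, latitude).
points_listcomp = [
    Point(lon, lat) for lon, lat in zip(toy_lc["longitude"], toy_lc["latitude"])
]
```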
Using the apply() method#
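A sketch of the `apply()` variant on a made-up table (`toy_apply` is a hypothetical name). With `axis=1` the function is called once per row, so this is still a Python-level loop under the hood.

```python
import pandas as pd
from shapely.geometry import Point

# Hypothetical stand-in for the hotels table.
toy_apply = pd.DataFrame({
    "latitude": [42.68751, 42.68971],
    "longitude": [-73.81643, -73.82021],
})

# axis=1 passes each row to the lambda; the result is a Series of Point objects.
points_apply = toy_apply.apply(
    lambda row: Point(row["longitude"], row["latitude"]), axis=1
)
```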
Using geopandas built-in function#
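GeoPandas provides `points_from_xy()`, which builds all points from two coordinate arrays in one call. The sketch below uses a made-up table (`toy_gpd` is a hypothetical name) and assumes the coordinates are WGS 84 latitude/longitude, hence `EPSG:4326`; the source does not state the CRS.

```python
import geopandas as gpd
import pandas as pd

# Hypothetical stand-in for the hotels table.
toy_gpd = pd.DataFrame({
    "latitude": [42.68751, 42.68971],
    "longitude": [-73.81643, -73.82021],
})

# x = longitude, y = latitude; the whole column is converted at once.
geometry = gpd.points_from_xy(toy_gpd["longitude"], toy_gpd["latitude"])

# The CRS is an assumption (WGS 84), not something given in the data.
toy_gdf = gpd.GeoDataFrame(toy_gpd, geometry=geometry, crs="EPSG:4326")
```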
Geospatial Operations in GeoPandas#
Yesterday we saw that OGR is faster than shapely. Still, GeoPandas uses shapely. Let’s see how they compare when using vectorization.
Instead of a single polygon, we will create a list of n polygons. For each one, we will generate random coordinates.
# Generate random coordinates (5 points each) for n polygons
import random
import shapely.geometry as sg
n = 100000
random_coordinates = [[(random.uniform(0, 10), random.uniform(0, 10)) for i in range(5)] for j in range(n)]
OGR#
Create a list of OGR polygons and calculate the area of each polygon.
from osgeo import ogr
def create_polygon(coords):
    ring = ogr.Geometry(ogr.wkbLinearRing)
    for coord in coords:
        ring.AddPoint(coord[0], coord[1])
    # Create polygon
    poly = ogr.Geometry(ogr.wkbPolygon)
    poly.AddGeometry(ring)
    return poly
# Create n random polygons using OGR
import random
random_ogr_polygons = []
for i in range(n):
    # Repeat the start point at the end so that the ring is closed
    start_point = (random.uniform(0, 10), random.uniform(0, 10))
    coordinates = [start_point] + [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(3)] + [start_point]
    new_polygon = create_polygon(coordinates)
    random_ogr_polygons.append(new_polygon)
import numpy as np
%%timeit
areas = np.empty(n)
for i, poly in enumerate(random_ogr_polygons):
    areas[i] = poly.GetArea()
32.4 ms ± 832 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1. Implementation in pure shapely#
random_shapely_polygons = [sg.Polygon(j) for j in random_coordinates]
import numpy as np
import shapely
Calculate the area of each polygon
%%timeit
369 ms ± 947 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
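The timed cell is not shown above. One way such a pure shapely computation could look is a Python loop that reads the `area` attribute of each polygon, sketched here with its own smaller, hypothetical set of polygons (`n_demo`, `demo_polygons`) so that it runs quickly.

```python
import random

import numpy as np
import shapely.geometry as sg

random.seed(0)
n_demo = 1000  # hypothetical, smaller than the n used in the text

# Random 5-point polygons, built the same way as random_coordinates above.
demo_coords = [
    [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(5)]
    for _ in range(n_demo)
]
demo_polygons = [sg.Polygon(c) for c in demo_coords]

# One attribute access per polygon: each call crosses from Python into GEOS.
areas_loop = np.empty(n_demo)
for i, poly in enumerate(demo_polygons):
    areas_loop[i] = poly.area
```

Random coordinates often produce self-intersecting rings; `area` still returns a number for them, which is enough for a timing comparison.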
2. Implementation in GeoPandas#
Calculate the area of each polygon using GeoPandas.
import geopandas as gpd
%%timeit
1.49 ms ± 13.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
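The timed cell is not shown above. A minimal sketch of an area calculation with GeoPandas, assuming the polygons are wrapped in a `GeoSeries` first; the three squares are made up so that the result is easy to check.

```python
import geopandas as gpd
import shapely.geometry as sg

# Hypothetical example geometries: squares with side lengths 1, 2 and 3.
squares = [sg.Polygon([(0, 0), (s, 0), (s, s), (0, s)]) for s in (1, 2, 3)]

# Building the GeoSeries is a one-time cost and would sit outside the timed cell.
geoseries = gpd.GeoSeries(squares)

# .area is computed for all geometries in a single call and returns a pandas Series.
areas_gpd = geoseries.area
```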
GeoPandas uses shapely, yet the calculation is now much faster than the OGR loop (about 1.5 ms versus 32 ms). What GeoPandas does in the background is …
Create a vectorized representation of the geometries
Use a vectorized function to calculate the area
%%timeit
1.33 ms ± 24.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
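The two steps above can be sketched directly in shapely, assuming shapely 2.0 or newer: put the geometries into a NumPy object array, then call the vectorized `shapely.area()` on it. The squares are made-up example data.

```python
import numpy as np
import shapely
import shapely.geometry as sg

# Hypothetical example geometries: squares with side lengths 1, 2 and 3.
squares = [sg.Polygon([(0, 0), (s, 0), (s, s), (0, s)]) for s in (1, 2, 3)]

# Step 1: vectorized representation, a NumPy array holding the geometry objects.
geom_array = np.array(squares, dtype=object)

# Step 2: vectorized function, evaluated for the whole array in one call.
areas_vec = shapely.area(geom_array)
```

The loop over geometries then runs in compiled code instead of the Python interpreter, which is where the speed-up comes from.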
Vectorized functions are a fairly new feature of shapely, available only from version 2.0 onwards. Previously, GeoPandas used the pygeos package to implement vectorized geometric operations. With shapely 2.0, pygeos was integrated into shapely, so shapely itself now supports vectorized computations. Be careful when using a shapely version < 2.0, since these operations might be slower.
Resources#
More on Pandas and vectorization#
Sofia Heisler’s repository PyCon 2017: Optimizing Pandas Code for Performance gives a more in-depth look at vectorized computation using pandas.
→ Watch her talk on YouTube. I really recommend it (especially if you like panda GIFs).
→ Read her blog post