Part 2: Pandas and GeoPandas#

Pandas is built on NumPy, so it supports vectorized computation as well.
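
As a minimal sketch of what "vectorized" means here, the following compares an element-wise Python loop with a single column-wide expression. The `prices` values and the factor 1.08 are hypothetical and only serve as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, used only for illustration
prices = pd.Series([100.0, 150.0, 200.0])

# Element-wise Python loop: one interpreter step per value
looped = pd.Series([p * 1.08 for p in prices])

# Vectorized: a single expression applied to the whole column at once
vectorized = prices * 1.08

print(np.allclose(looped, vectorized))
```

Both versions give the same values; the vectorized form hands the loop to NumPy's compiled code, which is usually where the speed-up comes from.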

Pandas User Guide: Accelerated Operations

1. Brief Introduction to Pandas and GeoPandas#

import pandas as pd
file_path = "./data/new_york_hotels.csv"
hotels = pd.read_csv(file_path, encoding='cp1252')
hotels.head()
| | ean_hotel_id | name | address1 | city | state_province | postal_code | latitude | longitude | star_rating | high_rate | low_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 269955 | Hilton Garden Inn Albany/SUNY Area | 1389 Washington Ave | Albany | NY | 12206 | 42.68751 | -73.81643 | 3.0 | 154.0272 | 124.0216 |
| 1 | 113431 | Courtyard by Marriott Albany Thruway | 1455 Washington Avenue | Albany | NY | 12206 | 42.68971 | -73.82021 | 3.0 | 179.0100 | 134.0000 |
| 2 | 108151 | Radisson Hotel Albany | 205 Wolf Rd | Albany | NY | 12205 | 42.72410 | -73.79822 | 3.0 | 134.1700 | 84.1600 |
| 3 | 254756 | Hilton Garden Inn Albany Medical Center | 62 New Scotland Ave | Albany | NY | 12208 | 42.65157 | -73.77638 | 3.0 | 308.2807 | 228.4597 |
| 4 | 198232 | CrestHill Suites SUNY University Albany | 1415 Washington Avenue | Albany | NY | 12206 | 42.68873 | -73.81854 | 3.0 | 169.3900 | 89.3900 |

E: Calculate the mean star_rating.

E: How many unique cities are in the city column?

E: How many hotels are there in Newark?

E: How many hotels are there in each city?

E: How many hotels have ‘Hilton’ in their name?

E: Calculate the average star rating and its standard deviation per city.
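
The exercises above can each be approached with a single vectorized pandas call. The sketch below is one possible approach, not the reference solution; it uses a small hypothetical `toy` table with the same column names as `hotels`, so the numbers do not reflect the real data set.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the hotels table
toy = pd.DataFrame({
    "name": ["Hilton Garden Inn", "Radisson Hotel", "Hilton Newark", "Best Inn"],
    "city": ["Albany", "Albany", "Newark", "Newark"],
    "star_rating": [3.0, 3.0, 4.0, 2.0],
})

mean_rating = toy["star_rating"].mean()               # mean of a column
n_cities = toy["city"].nunique()                      # number of distinct cities
n_newark = (toy["city"] == "Newark").sum()            # boolean mask, then count
per_city = toy["city"].value_counts()                 # hotels per city
n_hilton = toy["name"].str.contains("Hilton").sum()   # substring match on names
stats = toy.groupby("city")["star_rating"].agg(["mean", "std"])  # per-city stats
```

Replacing `toy` with `hotels` should give the answers for the real table, assuming the column names match the preview above.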

2. How to use vectorized computation in Pandas and GeoPandas#

To perform geospatial operations with the data, we need to create a geometry column which contains Point geometries based on the latitude and longitude columns for each hotel. This can be done in different ways.

Exercise 2.1: Implement this in different ways and measure the execution time.

Using the iterrows() and a for loop#
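
A minimal sketch of the row-by-row approach, assuming a small hypothetical `df` that stands in for `hotels`. Note that `Point` expects (x, y), which means (longitude, latitude).

```python
import pandas as pd
from shapely.geometry import Point

# Hypothetical stand-in for the hotels table
df = pd.DataFrame({
    "latitude": [42.68751, 42.68971],
    "longitude": [-73.81643, -73.82021],
})

points = []
for _, row in df.iterrows():
    # One Point per row; x is the longitude, y is the latitude
    points.append(Point(row["longitude"], row["latitude"]))

df["geometry"] = points
```

In a notebook, the loop can be placed in a `%%timeit` cell to measure its execution time.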

Using list comprehension#
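
A possible list-comprehension version, again on a hypothetical `df` standing in for `hotels`. It iterates over the two columns directly instead of building a Series object for every row.

```python
import pandas as pd
from shapely.geometry import Point

# Hypothetical stand-in for the hotels table
df = pd.DataFrame({
    "latitude": [42.68751, 42.68971],
    "longitude": [-73.81643, -73.82021],
})

# zip walks both columns in parallel; Point takes (longitude, latitude)
points = [Point(lon, lat) for lon, lat in zip(df["longitude"], df["latitude"])]
```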

Using the apply() method#
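
A sketch using `DataFrame.apply` with `axis=1`, which calls the function once per row. The `df` table is a hypothetical stand-in for `hotels`.

```python
import pandas as pd
from shapely.geometry import Point

# Hypothetical stand-in for the hotels table
df = pd.DataFrame({
    "latitude": [42.68751, 42.68971],
    "longitude": [-73.81643, -73.82021],
})

# axis=1 passes each row to the lambda; the result is a Series of Point objects
points = df.apply(lambda row: Point(row["longitude"], row["latitude"]), axis=1)
```

Although `apply` looks compact, it still runs a Python function per row, so it is not vectorized in the NumPy sense.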

Using geopandas built-in function#
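
GeoPandas provides `points_from_xy`, which builds all point geometries in one call. The sketch below uses a hypothetical `df` standing in for `hotels`; the CRS `EPSG:4326` is an assumption (plain latitude/longitude in WGS 84) and is not stated in the source.

```python
import geopandas as gpd
import pandas as pd

# Hypothetical stand-in for the hotels table
df = pd.DataFrame({
    "latitude": [42.68751, 42.68971],
    "longitude": [-73.81643, -73.82021],
})

# x = longitude, y = latitude; the whole column is converted at once
geometry = gpd.points_from_xy(df["longitude"], df["latitude"])

# Assumption: the coordinates are WGS 84 longitude/latitude (EPSG:4326)
gdf = gpd.GeoDataFrame(df, geometry=geometry, crs="EPSG:4326")
```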

Geospatial Operations in GeoPandas#

Yesterday we saw that OGR is faster than shapely. Still, geopandas uses shapely. Let’s see how they compare when using vectorization.

Instead of a single polygon, we will create a list of n polygons, each built from randomly generated coordinates.

# Generate random coordinates for n polygons (5 vertices each)
import random
import shapely.geometry as sg

n = 100000
random_coordinates = [[(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(5)] for _ in range(n)]

OGR#

Create a list of OGR polygons and calculate the area of each polygon.

from osgeo import ogr
def create_polygon(coords):          
    ring = ogr.Geometry(ogr.wkbLinearRing)
    for coord in coords:
        ring.AddPoint(coord[0], coord[1])

    # Create polygon
    poly = ogr.Geometry(ogr.wkbPolygon)
    poly.AddGeometry(ring)
    return poly
# Create n random OGR polygons; each ring is closed by repeating its start point
import random

random_ogr_polygons = []
for _ in range(n):
    start_point = (random.uniform(0, 10), random.uniform(0, 10))
    coordinates = [start_point] + [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(3)] + [start_point]
    new_polygon = create_polygon(coordinates)
    random_ogr_polygons.append(new_polygon)
import numpy as np
%%timeit
areas = np.empty(n)
for i, poly in enumerate(random_ogr_polygons):
    areas[i] = poly.GetArea()
32.4 ms ± 832 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

1. Implementation in pure shapely#

random_shapely_polygons = [sg.Polygon(j) for j in random_coordinates]  
import numpy as np
import shapely

Calculate the area of each polygon

%%timeit
369 ms ± 947 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
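
The body of the timed cell is not shown. One plausible version, assuming a plain Python loop over the polygons, reads the `area` attribute of each Shapely polygon in turn. The sketch uses a smaller `n` than above so that it runs quickly.

```python
import random
import shapely.geometry as sg

n = 1000  # smaller than the n used above, to keep the sketch quick
random_coordinates = [
    [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(5)]
    for _ in range(n)
]
random_shapely_polygons = [sg.Polygon(coords) for coords in random_coordinates]

# One attribute lookup per polygon, executed in the Python interpreter
areas = [poly.area for poly in random_shapely_polygons]
```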

2. Implementation in GeoPandas#

Calculate the area of each polygon using GeoPandas.

import geopandas as gpd
%%timeit
1.49 ms ± 13.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
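
The timed cell body is not shown here either. A sketch of how the GeoPandas version might look, assuming the polygons are first wrapped in a `GeoSeries` and only the `area` call is timed:

```python
import random
import geopandas as gpd
import numpy as np
import shapely.geometry as sg

n = 1000  # smaller than the n used above, to keep the sketch quick
random_shapely_polygons = [
    sg.Polygon([(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(5)])
    for _ in range(n)
]

# Wrap the list once; this conversion is not part of the area calculation
geoseries = gpd.GeoSeries(random_shapely_polygons)

# A single call returns a pandas Series with the area of every polygon
areas = geoseries.area
```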

GeoPandas uses Shapely, yet the calculation is now faster than with OGR. What GeoPandas does in the background is …

  1. Create a vectorized representation of the geometries

  2. Use a vectorized function to calculate the area

%%timeit
1.33 ms ± 24.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
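
The two steps listed above can be sketched directly in Shapely, assuming Shapely 2.0 or later is installed. Here the "vectorized representation" is taken to be a NumPy object array of geometries, and `shapely.area` is the vectorized function; a smaller `n` keeps the sketch quick.

```python
import random
import numpy as np
import shapely
import shapely.geometry as sg

n = 1000  # smaller than the n used above, to keep the sketch quick
random_shapely_polygons = [
    sg.Polygon([(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(5)])
    for _ in range(n)
]

# Step 1: store the geometries in a NumPy object array
polygon_array = np.array(random_shapely_polygons)

# Step 2: compute all areas in one vectorized call (requires Shapely >= 2.0)
areas = shapely.area(polygon_array)
```

The loop over the geometries now runs inside Shapely's compiled code rather than in the Python interpreter.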

This is a relatively new feature, available only in Shapely 2.0.0 and later. Previously, GeoPandas used the PyGEOS package to implement vectorized geometric operations. In Shapely 2.0.0, PyGEOS was integrated into Shapely, so Shapely itself now supports vectorized computations. Be careful when using versions below 2.0: they might be slower.

Resources#

More on Pandas and vectorization#

See Sofia Heisler’s repository for her PyCon 2017 talk, Optimizing Pandas Code for Performance, for a more in-depth look at vectorized computation using pandas.

→ Watch her talk on YouTube. I really recommend it (especially if you like panda GIFs).

→ Read her blog post

Introducing pygeos

PyGEOS Documentation

Cythonize Pandas

https://stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c

http://homepages.math.uic.edu/~jan/mcs275/running_cython.pdf