Part 2: Pandas and GeoPandas

Pandas is built on NumPy, so it supports vectorized computation as well.

→ Pandas User Guide: Accelerated Operations
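As a small illustration of what vectorized computation means in pandas, the sketch below computes the difference between two rate columns once with a Python-level loop and once with a single column expression. The three rows are stand-in values; the column names `high_rate` and `low_rate` mirror the hotel dataset used below, and `spread_loop` and `spread_vectorized` are hypothetical names chosen for this example.

```python
import numpy as np
import pandas as pd

# Stand-in data (assumption): three rows with the same column names as the hotel dataset
rates = pd.DataFrame({
    "high_rate": [154.0272, 179.01, 134.17],
    "low_rate": [124.0216, 134.0, 84.16],
})

# Loop version: one Python-level subtraction per row
spread_loop = [high - low for high, low in zip(rates["high_rate"], rates["low_rate"])]

# Vectorized version: one expression evaluated on the underlying NumPy arrays
spread_vectorized = rates["high_rate"] - rates["low_rate"]

# Both approaches give the same numbers
assert np.allclose(spread_loop, spread_vectorized)
```

On a table this small the difference is not measurable, but the vectorized form avoids the per-row Python overhead, which matters as the number of rows grows.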

1. Brief Introduction to Pandas and GeoPandas#

import pandas as pd

file_path = "./data/new_york_hotels.csv"

hotels = pd.read_csv(file_path, encoding='cp1252')

hotels.head()

	ean_hotel_id	name	address1	city	state_province	postal_code	latitude	longitude	star_rating	high_rate	low_rate
0	269955	Hilton Garden Inn Albany/SUNY Area	1389 Washington Ave	Albany	NY	12206	42.68751	-73.81643	3.0	154.0272	124.0216
1	113431	Courtyard by Marriott Albany Thruway	1455 Washington Avenue	Albany	NY	12206	42.68971	-73.82021	3.0	179.0100	134.0000
2	108151	Radisson Hotel Albany	205 Wolf Rd	Albany	NY	12205	42.72410	-73.79822	3.0	134.1700	84.1600
3	254756	Hilton Garden Inn Albany Medical Center	62 New Scotland Ave	Albany	NY	12208	42.65157	-73.77638	3.0	308.2807	228.4597
4	198232	CrestHill Suites SUNY University Albany	1415 Washington Avenue	Albany	NY	12206	42.68873	-73.81854	3.0	169.3900	89.3900

E: Calculate the mean star_rating.

E: How many unique cities are in the city column?

E: How many hotels are there in Newark?

E: How many hotels are there in each city?

E: How many hotels have ‘Hilton’ in their name?

E: Calculate the average star rating and its standard deviation per city.
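One possible way to approach the exercises above is sketched here. It is not the only solution, and it runs on a small stand-in DataFrame built from the five rows shown in the `hotels.head()` output rather than on the full CSV, so the numbers will differ from those of the real dataset. The variable names are chosen for this example.

```python
import pandas as pd

# Stand-in (assumption): the five rows shown above, reduced to the columns needed here
hotels = pd.DataFrame({
    "name": [
        "Hilton Garden Inn Albany/SUNY Area",
        "Courtyard by Marriott Albany Thruway",
        "Radisson Hotel Albany",
        "Hilton Garden Inn Albany Medical Center",
        "CrestHill Suites SUNY University Albany",
    ],
    "city": ["Albany"] * 5,
    "star_rating": [3.0, 3.0, 3.0, 3.0, 3.0],
})

# Mean star rating
mean_rating = hotels["star_rating"].mean()

# Number of unique cities
n_cities = hotels["city"].nunique()

# Number of hotels in Newark (boolean mask, then sum)
n_newark = (hotels["city"] == "Newark").sum()

# Number of hotels in each city
hotels_per_city = hotels["city"].value_counts()

# Number of hotels with 'Hilton' in their name; na=False treats missing names as no match
n_hilton = hotels["name"].str.contains("Hilton", na=False).sum()

# Mean and standard deviation of the star rating per city
rating_stats = hotels.groupby("city")["star_rating"].agg(["mean", "std"])
```

All of these are column-level operations, so none of them needs an explicit loop over the rows.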

2. How to use vectorized computation in Pandas and GeoPandas#

To perform geospatial operations with the data, we need to create a geometry column which contains Point geometries based on the latitude and longitude columns for each hotel. This can be done in different ways.

Exercise 2.1: Implement this in different ways and measure the execution time.

Using `iterrows()` and a for loop#
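A minimal sketch of this variant, assuming a stand-in `hotels` DataFrame that holds the coordinates of the first three rows shown earlier. In a notebook, the loop could be placed in its own cell under `%%timeit` to measure it.

```python
import pandas as pd
from shapely.geometry import Point

# Stand-in (assumption): coordinates of the first three hotels shown above
hotels = pd.DataFrame({
    "latitude": [42.68751, 42.68971, 42.72410],
    "longitude": [-73.81643, -73.82021, -73.79822],
})

# Build one Point per row; x is the longitude and y is the latitude
geometry = []
for _, row in hotels.iterrows():
    geometry.append(Point(row["longitude"], row["latitude"]))

hotels["geometry"] = geometry
```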

Using list comprehension#
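A possible list-comprehension version, again on a stand-in `hotels` DataFrame with the first three coordinate pairs shown earlier. Zipping the two columns avoids constructing a row object for every hotel.

```python
import pandas as pd
from shapely.geometry import Point

# Stand-in (assumption): coordinates of the first three hotels shown above
hotels = pd.DataFrame({
    "latitude": [42.68751, 42.68971, 42.72410],
    "longitude": [-73.81643, -73.82021, -73.79822],
})

# One Point per (longitude, latitude) pair
hotels["geometry"] = [
    Point(lon, lat) for lon, lat in zip(hotels["longitude"], hotels["latitude"])
]
```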

Using the `apply()` method#
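A sketch of the `apply()` variant on the same kind of stand-in DataFrame. With `axis=1` the function is called once per row, so this is still a row-by-row computation even though no explicit loop is written.

```python
import pandas as pd
from shapely.geometry import Point

# Stand-in (assumption): coordinates of the first three hotels shown above
hotels = pd.DataFrame({
    "latitude": [42.68751, 42.68971, 42.72410],
    "longitude": [-73.81643, -73.82021, -73.79822],
})

# axis=1 passes each row to the function as a Series
hotels["geometry"] = hotels.apply(
    lambda row: Point(row["longitude"], row["latitude"]), axis=1
)
```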

Using geopandas built-in function#
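A sketch using `geopandas.points_from_xy`, which builds all points from the two coordinate columns in one call. The stand-in DataFrame and the name `hotels_gdf` are chosen for this example, and the CRS `EPSG:4326` is an assumption (plain longitude/latitude in WGS 84), since the source data does not state its reference system.

```python
import geopandas as gpd
import pandas as pd

# Stand-in (assumption): coordinates of the first three hotels shown above
hotels = pd.DataFrame({
    "latitude": [42.68751, 42.68971, 42.72410],
    "longitude": [-73.81643, -73.82021, -73.79822],
})

# x is the longitude, y is the latitude
geometry = gpd.points_from_xy(hotels["longitude"], hotels["latitude"])

# The CRS is an assumption, see the note above
hotels_gdf = gpd.GeoDataFrame(hotels, geometry=geometry, crs="EPSG:4326")
```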

Geospatial Operations in GeoPandas#

Yesterday we saw that OGR is faster than shapely. Still, geopandas uses shapely. Let’s see how they compare when using vectorization.

Instead of a single polygon, we will create a list of n polygons, each with randomly generated coordinates.

# Generate random coordinates for n polygons (5 vertices each)
import random
import shapely.geometry as sg

n = 100000
random_coordinates = [[(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(5)] for _ in range(n)]

OGR#

Create a list of OGR polygons and calculate the area of each polygon.

from osgeo import ogr

def create_polygon(coords):          
    ring = ogr.Geometry(ogr.wkbLinearRing)
    for coord in coords:
        ring.AddPoint(coord[0], coord[1])

    # Create polygon
    poly = ogr.Geometry(ogr.wkbPolygon)
    poly.AddGeometry(ring)
    return poly

# Create n random polygons using OGR
import random

random_ogr_polygons = []
for i in range(n):
    # Closed ring: the start point is repeated at the end
    start_point = (random.uniform(0, 10), random.uniform(0, 10))
    coordinates = [start_point] + [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(3)] + [start_point]
    new_polygon = create_polygon(coordinates)
    random_ogr_polygons.append(new_polygon)

import numpy as np

%%timeit
areas = np.empty(n)
for i, poly in enumerate(random_ogr_polygons):
    areas[i] = poly.GetArea()

32.4 ms ± 832 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

1. Implementation in pure shapely#

random_shapely_polygons = [sg.Polygon(j) for j in random_coordinates]  

import numpy as np
import shapely

Calculate the area of each polygon

%%timeit

369 ms ± 947 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
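The timed cell is not shown above. One way it might look is a plain loop over the polygons that reads the `area` property of each one, mirroring the OGR loop. The sketch below rebuilds the polygons with a smaller `n` so that it runs quickly on its own; in the notebook, only the loop would go under `%%timeit`.

```python
import random

import numpy as np
import shapely.geometry as sg

# Smaller n than in the notebook (assumption), to keep this sketch quick
n = 1000
random_coordinates = [
    [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(5)]
    for _ in range(n)
]
random_shapely_polygons = [sg.Polygon(coords) for coords in random_coordinates]

# One Python-level attribute access per polygon
areas = np.empty(n)
for i, poly in enumerate(random_shapely_polygons):
    areas[i] = poly.area
```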

2. Implementation in GeoPandas#

Calculate the area of each polygon using GeoPandas.

import geopandas as gpd

%%timeit

1.49 ms ± 13.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
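The timed cell is not shown above either. A plausible version wraps the polygons in a `GeoSeries` and reads its `area` attribute, which computes all areas in one call. The sketch uses two fixed polygons with known areas instead of the random ones, and the name `polygons_gs` is chosen for this example. Presumably the `GeoSeries` is built once outside the timed cell, so that only the area computation is measured.

```python
import geopandas as gpd
import shapely.geometry as sg

# Two fixed polygons (assumption): a unit square and a right triangle
polygons = [
    sg.Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
    sg.Polygon([(0, 0), (4, 0), (0, 3)]),
]

# Build the GeoSeries once
polygons_gs = gpd.GeoSeries(polygons)

# Vectorized area computation, returns a pandas Series
areas = polygons_gs.area
```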

GeoPandas uses Shapely, yet the calculation is now faster than OGR. What GeoPandas does in the background is …

Create a vectorized representation of the geometries

Use a vectorized function to calculate the area

%%timeit

1.33 ms ± 24.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
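The two steps above can be sketched with Shapely 2.x alone: the geometries are placed in a NumPy object array, and `shapely.area` is then applied to the whole array at once. This is a simplified picture of what happens inside GeoPandas, not its actual code, and it uses two fixed polygons with known areas rather than the random ones.

```python
import numpy as np
import shapely
import shapely.geometry as sg

# Two fixed polygons (assumption): a unit square and a right triangle
polygons = [
    sg.Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
    sg.Polygon([(0, 0), (4, 0), (0, 3)]),
]

# Step 1: vectorized representation, a NumPy array holding the geometry objects
geometry_array = np.array(polygons, dtype=object)

# Step 2: vectorized function, evaluated over the whole array (requires Shapely >= 2.0)
areas = shapely.area(geometry_array)
```

The loop over the geometries still exists, but it runs in compiled code instead of in the Python interpreter.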

This is a fairly new feature that is only available in Shapely version 2.0.0 and later. Previously, GeoPandas used the PyGEOS package to implement vectorized geometric operations. With Shapely 2.0.0, PyGEOS was integrated into Shapely, so Shapely itself now supports vectorized computations. Be careful when using versions < 2.0; they might be slower.
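To find out which case applies in a given environment, the installed Shapely version can be checked. A minimal sketch, where `has_vectorized_api` is a name chosen for this example:

```python
import shapely

# The major version is the part before the first dot, e.g. "2" in "2.0.1"
major_version = int(shapely.__version__.split(".")[0])

# Vectorized functions such as shapely.area are part of Shapely from version 2.0 on
has_vectorized_api = major_version >= 2

print(f"Shapely {shapely.__version__}, vectorized API available: {has_vectorized_api}")
```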

Resources#

More on Pandas and vectorization#

Have a look at Sofia Heisler’s repository for her PyCon 2017 talk, Optimizing Pandas Code for Performance, to get a more in-depth look at vectorized computation using pandas.

→ Watch her talk on YouTube. I really recommend it (especially if you like panda GIFs).

→ Read her blog post

Introducing pygeos

PyGEOS Documentation

Cythonize Pandas

https://stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c

http://homepages.math.uic.edu/~jan/mcs275/running_cython.pdf