{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": "# Part 2: Pandas and GeoPandas" }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pandas is built on NumPy, so it supports vectorized computation as well. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "→ [Pandas User Guide: Accelerated Operations](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#accelerated-operations)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Brief Introduction to Pandas and GeoPandas" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "file_path = \"./data/new_york_hotels.csv\"" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "hotels = pd.read_csv(file_path, encoding='cp1252')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
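<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>ean_hotel_id</th><th>name</th><th>address1</th><th>city</th><th>state_province</th><th>postal_code</th><th>latitude</th><th>longitude</th><th>star_rating</th><th>high_rate</th><th>low_rate</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>269955</td><td>Hilton Garden Inn Albany/SUNY Area</td><td>1389 Washington Ave</td><td>Albany</td><td>NY</td><td>12206</td><td>42.68751</td><td>-73.81643</td><td>3.0</td><td>154.0272</td><td>124.0216</td></tr>\n",
"    <tr><th>1</th><td>113431</td><td>Courtyard by Marriott Albany Thruway</td><td>1455 Washington Avenue</td><td>Albany</td><td>NY</td><td>12206</td><td>42.68971</td><td>-73.82021</td><td>3.0</td><td>179.0100</td><td>134.0000</td></tr>\n",
"    <tr><th>2</th><td>108151</td><td>Radisson Hotel Albany</td><td>205 Wolf Rd</td><td>Albany</td><td>NY</td><td>12205</td><td>42.72410</td><td>-73.79822</td><td>3.0</td><td>134.1700</td><td>84.1600</td></tr>\n",
"    <tr><th>3</th><td>254756</td><td>Hilton Garden Inn Albany Medical Center</td><td>62 New Scotland Ave</td><td>Albany</td><td>NY</td><td>12208</td><td>42.65157</td><td>-73.77638</td><td>3.0</td><td>308.2807</td><td>228.4597</td></tr>\n",
"    <tr><th>4</th><td>198232</td><td>CrestHill Suites SUNY University Albany</td><td>1415 Washington Avenue</td><td>Albany</td><td>NY</td><td>12206</td><td>42.68873</td><td>-73.81854</td><td>3.0</td><td>169.3900</td><td>89.3900</td></tr>\n",
"  </tbody>\n",
"</table>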
\n", "
" ], "text/plain": [ " ean_hotel_id name \n", "0 269955 Hilton Garden Inn Albany/SUNY Area \\\n", "1 113431 Courtyard by Marriott Albany Thruway \n", "2 108151 Radisson Hotel Albany \n", "3 254756 Hilton Garden Inn Albany Medical Center \n", "4 198232 CrestHill Suites SUNY University Albany \n", "\n", " address1 city state_province postal_code latitude \n", "0 1389 Washington Ave Albany NY 12206 42.68751 \\\n", "1 1455 Washington Avenue Albany NY 12206 42.68971 \n", "2 205 Wolf Rd Albany NY 12205 42.72410 \n", "3 62 New Scotland Ave Albany NY 12208 42.65157 \n", "4 1415 Washington Avenue Albany NY 12206 42.68873 \n", "\n", " longitude star_rating high_rate low_rate \n", "0 -73.81643 3.0 154.0272 124.0216 \n", "1 -73.82021 3.0 179.0100 134.0000 \n", "2 -73.79822 3.0 134.1700 84.1600 \n", "3 -73.77638 3.0 308.2807 228.4597 \n", "4 -73.81854 3.0 169.3900 89.3900 " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hotels.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**E:** Calculate the mean star_rating. " ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**E:** How many unique cities are in the `city` column?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**E:** How many hotels are there in Newark?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**E:** How many hotels are there in each city?" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**E:** How many hotels have 'Hilton' in their name?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**E:** Calculate the average star rating and its standard deviation per city." ] }, { "cell_type": "markdown", "metadata": {}, "source": "## 2. How to use vectorized computation in Pandas and GeoPandas" }, { "cell_type": "markdown", "metadata": {}, "source": [ "To perform geospatial operations with the data, we need to create a `geometry` column that contains a Point geometry for each hotel, built from its `latitude` and `longitude` columns. This can be done in several ways. \n", "\n", "**Exercise 2.1:** Implement each of the approaches below and measure its execution time. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Using `iterrows()` and a for loop" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Using a list comprehension" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Using the `apply()` method" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "##### Using the GeoPandas built-in function" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": "" }, { "metadata": {}, "cell_type": "markdown", "source": [ "### Geospatial Operations in GeoPandas \n", "\n", "Yesterday we saw that OGR is faster than shapely. Still, GeoPandas uses shapely. Let's see how the two compare once vectorization is used." ] }, { "metadata": {}, "cell_type": "markdown", "source": "Instead of only one polygon, we will create a list of n polygons. For each one, we will generate random coordinates. " }, { "metadata": { "ExecuteTime": { "end_time": "2024-08-05T16:22:18.470137Z", "start_time": "2024-08-05T16:22:18.144754Z" } }, "cell_type": "code", "source": [ "# Generate random coordinates (5 vertices each) for n polygons\n", "import random\n", "import shapely.geometry as sg\n", "\n", "n = 100000\n", "random_coordinates = [[(random.uniform(0, 10), random.uniform(0, 10)) for i in range(5)] for j in range(n)]\n" ], "outputs": [], "execution_count": 35 }, { "metadata": {}, "cell_type": "markdown", "source": [ "#### OGR \n", "\n", "Create a list of OGR polygons and calculate the area of each polygon." ] }, { "metadata": { "ExecuteTime": { "end_time": "2024-08-05T16:18:45.330576Z", "start_time": "2024-08-05T16:18:45.088003Z" } }, "cell_type": "code", "source": "from osgeo import ogr", "outputs": [], "execution_count": 21 }, { "metadata": { "ExecuteTime": { "end_time": "2024-08-05T16:18:45.334042Z", "start_time": "2024-08-05T16:18:45.331634Z" } }, "cell_type": "code", "source": [ "def create_polygon(coords):\n", "    # Build a linear ring from the coordinate pairs\n", "    ring = ogr.Geometry(ogr.wkbLinearRing)\n", "    for coord in coords:\n", "        ring.AddPoint(coord[0], coord[1])\n", "\n", "    # Create polygon\n", "    poly = ogr.Geometry(ogr.wkbPolygon)\n", "    poly.AddGeometry(ring)\n", "    return poly" ], "outputs": [], "execution_count": 22 }, { "metadata": { "ExecuteTime": { "end_time": "2024-08-05T16:18:46.275731Z", "start_time": "2024-08-05T16:18:45.568715Z" } }, "cell_type": "code", "source": [ "# Create n random polygons using OGR (the start point is repeated to close each ring)\n", "import random\n", "\n", "random_ogr_polygons = []\n", "for i in range(n):\n", "    start_point = (random.uniform(0, 10), random.uniform(0, 10))\n", "    coordinates = [start_point] + [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(3)] + [start_point]\n", "    new_polygon = create_polygon(coordinates)\n", "    random_ogr_polygons.append(new_polygon)" ], "outputs": [], "execution_count": 23 }, { "metadata": { "ExecuteTime": { "end_time": "2024-08-05T16:20:06.880467Z", "start_time": "2024-08-05T16:20:06.878626Z" } }, "cell_type": "code", "source": "import numpy as np", "outputs": [], "execution_count": 24 }, { "metadata": { "ExecuteTime": { "end_time": "2024-08-05T16:20:30.446953Z", "start_time": "2024-08-05T16:20:27.819016Z" } }, "cell_type": "code", "source": [ "%%timeit\n", "areas = np.empty(n)\n", "for i, poly in enumerate(random_ogr_polygons):\n", "    areas[i] = poly.GetArea()" ], "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "32.4 ms ± 832 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" ] } ], "execution_count": 27 }, { "metadata": {}, "cell_type": "markdown", "source": "#### 1. Implementation in pure shapely" }, { "metadata": { "ExecuteTime": { "end_time": "2024-08-05T16:20:40.746300Z", "start_time": "2024-08-05T16:20:39.495348Z" } }, "cell_type": "code", "source": "random_shapely_polygons = [sg.Polygon(coords) for coords in random_coordinates]", "outputs": [], "execution_count": 28 }, { "metadata": {}, "cell_type": "code", "outputs": [], "execution_count": null, "source": [ "import numpy as np\n", "import shapely" ] }, { "metadata": {}, "cell_type": "markdown", "source": "Calculate the area of each polygon" }, { "metadata": { "ExecuteTime": { "end_time": "2024-08-05T16:20:56.559110Z", "start_time": "2024-08-05T16:20:53.596568Z" } }, "cell_type": "code", "source": "%%timeit\n", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "369 ms ± 947 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" ] } ], "execution_count": 29 }, { "metadata": {}, "cell_type": "markdown", "source": [ "#### 2. Implementation in GeoPandas\n", "\n", "Calculate the area of each polygon using GeoPandas." 
] }, { "metadata": { "ExecuteTime": { "end_time": "2024-08-05T16:21:04.833113Z", "start_time": "2024-08-05T16:21:04.707673Z" } }, "cell_type": "code", "source": "import geopandas as gpd", "outputs": [], "execution_count": 30 }, { "metadata": { "ExecuteTime": { "end_time": "2024-08-05T16:21:24.792247Z", "start_time": "2024-08-05T16:21:12.675362Z" } }, "cell_type": "code", "source": "%%timeit\n", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.49 ms ± 13.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n" ] } ], "execution_count": 31 }, { "metadata": {}, "cell_type": "markdown", "source": "GeoPandas uses shapely, yet the calculation is now faster than with OGR. What GeoPandas does in the background is ..." }, { "metadata": {}, "cell_type": "markdown", "source": "1. Create a vectorized representation of the geometries" }, { "metadata": { "ExecuteTime": { "end_time": "2024-08-05T16:21:53.761452Z", "start_time": "2024-08-05T16:21:53.580829Z" } }, "cell_type": "code", "source": "", "outputs": [], "execution_count": 32 }, { "metadata": {}, "cell_type": "markdown", "source": "2. Use a vectorized function to calculate the area" }, { "metadata": { "ExecuteTime": { "end_time": "2024-08-05T16:22:08.492098Z", "start_time": "2024-08-05T16:21:57.690075Z" } }, "cell_type": "code", "source": "%%timeit", "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1.33 ms ± 24.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n" ] } ], "execution_count": 33 }, { "metadata": {}, "cell_type": "markdown", "source": "This is a relatively new feature, only available in shapely >= 2.0. Previously, GeoPandas used the PyGEOS package to implement vectorized geometric operations. With shapely 2.0, PyGEOS was integrated into shapely, so shapely itself now supports vectorized computations. Be careful with shapely versions < 2.0: they might be slower.
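
The vectorized interface mentioned above can be sketched on a handful of polygons. This is only a minimal sketch, assuming shapely >= 2.0 and NumPy are installed; the three polygons and the variable names are made up for illustration and are not part of the exercise data.

```python
import numpy as np
import shapely
from shapely.geometry import Polygon

# Three small hypothetical polygons (illustrative data, not from the notebook)
polygons = [
    Polygon([(0, 0), (2, 0), (2, 2), (0, 2)]),  # 2 x 2 square, area 4
    Polygon([(0, 0), (4, 0), (0, 3)]),          # right triangle, area 6
    Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),  # unit square, area 1
]

# Element-wise: one Python-level attribute access per geometry
areas_loop = np.array([poly.area for poly in polygons])

# Vectorized: a single call over the whole array (requires shapely >= 2.0)
geometry_array = np.array(polygons, dtype=object)
areas_vectorized = shapely.area(geometry_array)

print(areas_loop)
print(areas_vectorized)
```

On three geometries the difference is negligible. The gain shows up on large arrays such as the 100000 polygons used above, because the loop over the geometries then runs in compiled code instead of the Python interpreter.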
" }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Resources" ] }, { "metadata": {}, "cell_type": "markdown", "source": [ "#### More on Pandas and vectorization\n", "Watch this talk by __Sofia Heisler's repository [PyCon 2017: Optimizing Pandas Code for Performance](https://github.com/s-heisler/pycon2017-optimizing-pandas)__ to get a more indepth look into vecotrized computation using pandas. \n", "\n", "→ Watch her talk on [YouTube](https://www.youtube.com/watch?v=HN5d490_KKk) I really recommend it (especially if you like panda GIFs)\n", "\n", "→ Read her [blog post](https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "[Introducing pygeos](https://caspervdw.github.io/Introducing-Pygeos/)\n", "\n", "[PyGEOS Documentation](https://pygeos.readthedocs.io/en/latest/)\n", " \n", "\n", "[Cythonize Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html)\n", "\n", "https://stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c\n", "\n", "https://www.google.com/url?q=http://homepages.math.uic.edu/~jan/mcs275/running_cython.pdf&sa=U&ved=2ahUKEwiq_M3-vfrqAhWF-KQKHXBXCfwQFjAAegQICRAB&usg=AOvVaw0jX9BZrTt2aPsxKo30zmDb\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 4 }