{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "# Part 2: Pandas and GeoPandas"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Pandas is built on NumPy, so it supports vectorized computation as well."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "&rarr; [Pandas User Guide: Accelerated Operations](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#accelerated-operations)"
   ]
  },
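  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a minimal sketch of what vectorized computation means in pandas, the example below computes the same column difference twice: once with an explicit Python loop over the rows and once with a single column-wise expression. The `rates` DataFrame and its values are made up for illustration; only the column names are borrowed from the hotel dataset used below.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Hypothetical toy data, not taken from the hotel dataset\n",
    "rates = pd.DataFrame({'high_rate': [154.0, 179.0, 134.0],\n",
    "                      'low_rate': [124.0, 134.0, 84.0]})\n",
    "\n",
    "# Row-by-row loop: every iteration runs in the Python interpreter\n",
    "spread_loop = [row['high_rate'] - row['low_rate'] for _, row in rates.iterrows()]\n",
    "\n",
    "# Vectorized: one expression on whole columns, evaluated by NumPy\n",
    "spread_vectorized = rates['high_rate'] - rates['low_rate']\n",
    "\n",
    "# Both approaches give the same values\n",
    "assert spread_vectorized.tolist() == spread_loop\n",
    "```\n",
    "\n",
    "On three rows the difference in speed is negligible; it only becomes visible on larger tables, which is what the timing exercises in this notebook explore."
   ]
  },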
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Brief Introduction to Pandas and GeoPandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "file_path = \"./data/new_york_hotels.csv\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "hotels = pd.read_csv(file_path, encoding='cp1252')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>ean_hotel_id</th>\n",
       "      <th>name</th>\n",
       "      <th>address1</th>\n",
       "      <th>city</th>\n",
       "      <th>state_province</th>\n",
       "      <th>postal_code</th>\n",
       "      <th>latitude</th>\n",
       "      <th>longitude</th>\n",
       "      <th>star_rating</th>\n",
       "      <th>high_rate</th>\n",
       "      <th>low_rate</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>269955</td>\n",
       "      <td>Hilton Garden Inn Albany/SUNY Area</td>\n",
       "      <td>1389 Washington Ave</td>\n",
       "      <td>Albany</td>\n",
       "      <td>NY</td>\n",
       "      <td>12206</td>\n",
       "      <td>42.68751</td>\n",
       "      <td>-73.81643</td>\n",
       "      <td>3.0</td>\n",
       "      <td>154.0272</td>\n",
       "      <td>124.0216</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>113431</td>\n",
       "      <td>Courtyard by Marriott Albany Thruway</td>\n",
       "      <td>1455 Washington Avenue</td>\n",
       "      <td>Albany</td>\n",
       "      <td>NY</td>\n",
       "      <td>12206</td>\n",
       "      <td>42.68971</td>\n",
       "      <td>-73.82021</td>\n",
       "      <td>3.0</td>\n",
       "      <td>179.0100</td>\n",
       "      <td>134.0000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>108151</td>\n",
       "      <td>Radisson Hotel Albany</td>\n",
       "      <td>205 Wolf Rd</td>\n",
       "      <td>Albany</td>\n",
       "      <td>NY</td>\n",
       "      <td>12205</td>\n",
       "      <td>42.72410</td>\n",
       "      <td>-73.79822</td>\n",
       "      <td>3.0</td>\n",
       "      <td>134.1700</td>\n",
       "      <td>84.1600</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>254756</td>\n",
       "      <td>Hilton Garden Inn Albany Medical Center</td>\n",
       "      <td>62 New Scotland Ave</td>\n",
       "      <td>Albany</td>\n",
       "      <td>NY</td>\n",
       "      <td>12208</td>\n",
       "      <td>42.65157</td>\n",
       "      <td>-73.77638</td>\n",
       "      <td>3.0</td>\n",
       "      <td>308.2807</td>\n",
       "      <td>228.4597</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>198232</td>\n",
       "      <td>CrestHill Suites SUNY University Albany</td>\n",
       "      <td>1415 Washington Avenue</td>\n",
       "      <td>Albany</td>\n",
       "      <td>NY</td>\n",
       "      <td>12206</td>\n",
       "      <td>42.68873</td>\n",
       "      <td>-73.81854</td>\n",
       "      <td>3.0</td>\n",
       "      <td>169.3900</td>\n",
       "      <td>89.3900</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   ean_hotel_id                                     name   \n",
       "0        269955       Hilton Garden Inn Albany/SUNY Area  \\\n",
       "1        113431     Courtyard by Marriott Albany Thruway   \n",
       "2        108151                    Radisson Hotel Albany   \n",
       "3        254756  Hilton Garden Inn Albany Medical Center   \n",
       "4        198232  CrestHill Suites SUNY University Albany   \n",
       "\n",
       "                 address1    city state_province postal_code  latitude   \n",
       "0     1389 Washington Ave  Albany             NY       12206  42.68751  \\\n",
       "1  1455 Washington Avenue  Albany             NY       12206  42.68971   \n",
       "2             205 Wolf Rd  Albany             NY       12205  42.72410   \n",
       "3     62 New Scotland Ave  Albany             NY       12208  42.65157   \n",
       "4  1415 Washington Avenue  Albany             NY       12206  42.68873   \n",
       "\n",
       "   longitude  star_rating  high_rate  low_rate  \n",
       "0  -73.81643          3.0   154.0272  124.0216  \n",
       "1  -73.82021          3.0   179.0100  134.0000  \n",
       "2  -73.79822          3.0   134.1700   84.1600  \n",
       "3  -73.77638          3.0   308.2807  228.4597  \n",
       "4  -73.81854          3.0   169.3900   89.3900  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "hotels.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**E:** Calculate the mean star_rating. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**E:** How many unique cities are in the `city` column?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**E:** How many hotels are there in Newark?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**E:** How many hotels are there in each city?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**E:** How many hotels have 'Hilton' in their name?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**E:** Calculate the average star rating and its standard deviation per city."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": "## 2. How to use vectorized computation in Pandas and GeoPandas"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To perform geospatial operations with the data, we need to create a `geometry` column which contains Point geometries based on the `latitude` and `longitude` columns for each hotel. This can be done in different ways. \n",
    "\n",
    "**Exercise 2.1:** Implement this in different ways and measure the execution time.  "
   ]
  },
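  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the difference between the approaches concrete, here is a hedged sketch on a made-up three-row DataFrame. The `toy` frame is hypothetical (the exercise itself should use `hotels`), and the CRS `EPSG:4326` is an assumption that the coordinates are plain WGS84 latitude and longitude. It contrasts an element-wise construction with `geopandas.points_from_xy`.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "import geopandas as gpd\n",
    "from shapely.geometry import Point\n",
    "\n",
    "# Hypothetical toy data with the same column names as the hotel dataset\n",
    "toy = pd.DataFrame({'latitude': [42.68751, 42.68971, 42.72410],\n",
    "                    'longitude': [-73.81643, -73.82021, -73.79822]})\n",
    "\n",
    "# Element-wise: build one Point per row in Python (x = longitude, y = latitude)\n",
    "points_loop = [Point(lon, lat) for lon, lat in zip(toy['longitude'], toy['latitude'])]\n",
    "\n",
    "# Vectorized: let GeoPandas build all points in a single call\n",
    "points_vectorized = gpd.points_from_xy(toy['longitude'], toy['latitude'])\n",
    "\n",
    "# Assumption: the coordinates are WGS84, hence EPSG:4326\n",
    "toy_gdf = gpd.GeoDataFrame(toy, geometry=points_vectorized, crs='EPSG:4326')\n",
    "```\n",
    "\n",
    "Each variant can be timed by putting it on a line that starts with `%timeit`."
   ]
  },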
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Using `iterrows()` and a for loop"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Using a list comprehension"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Using the `apply()` method"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##### Using the GeoPandas built-in function"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": ""
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "### Geospatial Operations in GeoPandas\n",
    "\n",
    "Yesterday we saw that OGR is faster than shapely. Still, GeoPandas uses shapely. Let's see how they compare when using vectorization."
   ]
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "Instead of a single polygon, we will create a list of `n` polygons. For each one, we will generate random coordinates."
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-05T16:22:18.470137Z",
     "start_time": "2024-08-05T16:22:18.144754Z"
    }
   },
   "cell_type": "code",
   "source": [
    "# Generate random coordinates for n polygons (5 vertices each)\n",
    "import random\n",
    "import shapely.geometry as sg\n",
    "\n",
    "n = 100000\n",
    "random_coordinates = [[(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(5)] for _ in range(n)]\n"
   ],
   "outputs": [],
   "execution_count": 35
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "#### OGR \n",
    "\n",
    "Create a list of OGR polygons and calculate the area of each polygon."
   ]
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-05T16:18:45.330576Z",
     "start_time": "2024-08-05T16:18:45.088003Z"
    }
   },
   "cell_type": "code",
   "source": "from osgeo import ogr",
   "outputs": [],
   "execution_count": 21
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-05T16:18:45.334042Z",
     "start_time": "2024-08-05T16:18:45.331634Z"
    }
   },
   "cell_type": "code",
   "source": [
    "def create_polygon(coords):\n",
    "    # Build the outer ring from a sequence of (x, y) tuples\n",
    "    ring = ogr.Geometry(ogr.wkbLinearRing)\n",
    "    for coord in coords:\n",
    "        ring.AddPoint(coord[0], coord[1])\n",
    "\n",
    "    # Create polygon\n",
    "    poly = ogr.Geometry(ogr.wkbPolygon)\n",
    "    poly.AddGeometry(ring)\n",
    "    return poly"
   ],
   "outputs": [],
   "execution_count": 22
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-05T16:18:46.275731Z",
     "start_time": "2024-08-05T16:18:45.568715Z"
    }
   },
   "cell_type": "code",
   "source": [
    "# Create n random polygons using OGR\n",
    "import random\n",
    "\n",
    "random_ogr_polygons = []\n",
    "for _ in range(n):\n",
    "    # Repeat the start point at the end so that the ring is closed\n",
    "    start_point = (random.uniform(0, 10), random.uniform(0, 10))\n",
    "    coordinates = [start_point] + [(random.uniform(0, 10), random.uniform(0, 10)) for _ in range(3)] + [start_point]\n",
    "    new_polygon = create_polygon(coordinates)\n",
    "    random_ogr_polygons.append(new_polygon)"
   ],
   "outputs": [],
   "execution_count": 23
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-05T16:20:06.880467Z",
     "start_time": "2024-08-05T16:20:06.878626Z"
    }
   },
   "cell_type": "code",
   "source": "import numpy as np",
   "outputs": [],
   "execution_count": 24
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-05T16:20:30.446953Z",
     "start_time": "2024-08-05T16:20:27.819016Z"
    }
   },
   "cell_type": "code",
   "source": [
    "%%timeit\n",
    "areas = np.empty(n)\n",
    "for i, poly in enumerate(random_ogr_polygons):\n",
    "    areas[i] = poly.GetArea()"
   ],
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "32.4 ms ± 832 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
     ]
    }
   ],
   "execution_count": 27
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "#### 1. Implementation in pure shapely"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-05T16:20:40.746300Z",
     "start_time": "2024-08-05T16:20:39.495348Z"
    }
   },
   "cell_type": "code",
   "source": "random_shapely_polygons = [sg.Polygon(j) for j in random_coordinates]  ",
   "outputs": [],
   "execution_count": 28
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "import numpy as np\n",
    "import shapely"
   ]
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "Calculate the area of each polygon"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-05T16:20:56.559110Z",
     "start_time": "2024-08-05T16:20:53.596568Z"
    }
   },
   "cell_type": "code",
   "source": "%%timeit\n",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "369 ms ± 947 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
     ]
    }
   ],
   "execution_count": 29
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "#### 2. Implementation in GeoPandas\n",
    "\n",
    "Calculate the area of each polygon using GeoPandas."
   ]
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-05T16:21:04.833113Z",
     "start_time": "2024-08-05T16:21:04.707673Z"
    }
   },
   "cell_type": "code",
   "source": "import geopandas as gpd",
   "outputs": [],
   "execution_count": 30
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-05T16:21:24.792247Z",
     "start_time": "2024-08-05T16:21:12.675362Z"
    }
   },
   "cell_type": "code",
   "source": "%%timeit\n",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.49 ms ± 13.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n"
     ]
    }
   ],
   "execution_count": 31
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "GeoPandas uses shapely, yet the calculation is now faster than with OGR. In the background, GeoPandas does the following:"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "1. Create a vectorized representation of the geometries"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-05T16:21:53.761452Z",
     "start_time": "2024-08-05T16:21:53.580829Z"
    }
   },
   "cell_type": "code",
   "source": "",
   "outputs": [],
   "execution_count": 32
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "2. Use a vectorized function to calculate the area"
  },
  {
   "metadata": {
    "ExecuteTime": {
     "end_time": "2024-08-05T16:22:08.492098Z",
     "start_time": "2024-08-05T16:21:57.690075Z"
    }
   },
   "cell_type": "code",
   "source": "%%timeit",
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "1.33 ms ± 24.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)\n"
     ]
    }
   ],
   "execution_count": 33
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": "This is a fairly new feature, available only in shapely 2.0 and later. Previously, GeoPandas used the pygeos package to implement vectorized geometric operations. With shapely 2.0, pygeos was merged into shapely, so shapely itself now supports vectorized computation. Be careful when using shapely versions < 2.0: they might be slower."
  },
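  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "The two steps above can be sketched roughly as follows, assuming shapely 2.0 or later is installed. The three polygons are small hypothetical examples with known areas, not the random polygons from the benchmark, and the snippet only illustrates the idea rather than the actual GeoPandas internals.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "import shapely\n",
    "from shapely.geometry import Polygon\n",
    "\n",
    "# Three small hypothetical polygons with known areas\n",
    "polygons = [Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),  # unit square\n",
    "            Polygon([(0, 0), (2, 0), (2, 3), (0, 3)]),  # 2 x 3 rectangle\n",
    "            Polygon([(0, 0), (4, 0), (0, 4)])]          # right triangle\n",
    "\n",
    "# Element-wise: one Python-level attribute access per geometry\n",
    "areas_loop = np.array([poly.area for poly in polygons])\n",
    "\n",
    "# Step 1: store the geometries in a NumPy array\n",
    "geometry_array = np.array(polygons)\n",
    "\n",
    "# Step 2: a single vectorized call (requires shapely >= 2.0)\n",
    "areas_vectorized = shapely.area(geometry_array)\n",
    "\n",
    "# Both approaches give the same areas\n",
    "assert np.allclose(areas_loop, areas_vectorized)\n",
    "```\n",
    "\n",
    "With only three geometries the two variants take about the same time; the gap shows up with large arrays such as the `n` polygons used above."
   ]
  },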
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Resources"
   ]
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "#### More on Pandas and vectorization\n",
    "See __Sofia Heisler's repository [PyCon 2017: Optimizing Pandas Code for Performance](https://github.com/s-heisler/pycon2017-optimizing-pandas)__ for a more in-depth look at vectorized computation using pandas.\n",
    "\n",
    "&rarr; Watch her talk on [YouTube](https://www.youtube.com/watch?v=HN5d490_KKk). I really recommend it (especially if you like panda GIFs).\n",
    "\n",
    "&rarr; Read her [blog post](https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "[Introducing pygeos](https://caspervdw.github.io/Introducing-Pygeos/)\n",
    "\n",
    "[PyGEOS Documentation](https://pygeos.readthedocs.io/en/latest/)\n",
    "\n",
    "[Cythonize Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html)\n",
    "\n",
    "https://stackoverflow.com/questions/52673285/performance-of-pandas-apply-vs-np-vectorize-to-create-new-column-from-existing-c\n",
    "\n",
    "http://homepages.math.uic.edu/~jan/mcs275/running_cython.pdf\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}