diff --git a/lessons/02_web_scraping.ipynb b/lessons/02_web_scraping.ipynb
index 385806a..18a8069 100644
--- a/lessons/02_web_scraping.ipynb
+++ b/lessons/02_web_scraping.ipynb
@@ -45,20 +45,68 @@
     "We will use two main packages: [Requests](http://docs.python-requests.org/en/latest/user/quickstart/) and [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). Go ahead and install these packages, if you haven't already:"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "The following section installs the required packages: requests, beautifulsoup4, and lxml.\n",
+    "## What requests does\n",
+    "\n",
+    "* Makes it easy to send HTTP requests and handle the responses returned by a web server.\n",
+    "* Simplifies operations such as sending parameters, headers, and authentication with a request.\n",
+    "\n",
+    "## What beautifulsoup4 does\n",
+    "\n",
+    "* Used to extract and process information from web pages fetched through an HTTP request.\n",
+    "* Lets you navigate and search intuitively through the tags, attributes, and text of an HTML document.\n",
+    "\n",
+    "## What lxml does\n",
+    "\n",
+    "* A library specialized in processing and manipulating XML and HTML documents.\n",
+    "* Very efficient at parsing, transforming, and extracting data structured as a tree.\n"
+   ]
+  },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 3,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Requirement already satisfied: requests in c:\\users\\user\\appdata\\local\\programs\\python\\python313\\lib\\site-packages (2.31.0)\n",
+      "Requirement already satisfied: charset-normalizer<4,>=2 in c:\\users\\user\\appdata\\local\\programs\\python\\python313\\lib\\site-packages (from requests) (3.4.3)\n",
+      "Requirement already satisfied: idna<4,>=2.5 in c:\\users\\user\\appdata\\local\\programs\\python\\python313\\lib\\site-packages (from requests) (3.10)\n",
+      "Requirement already satisfied: urllib3<3,>=1.21.1 in c:\\users\\user\\appdata\\local\\programs\\python\\python313\\lib\\site-packages (from requests) (2.5.0)\n",
+      "Requirement already satisfied: certifi>=2017.4.17 in c:\\users\\user\\appdata\\local\\programs\\python\\python313\\lib\\site-packages (from requests) (2025.8.3)\n",
+      "Note: you may need to restart the kernel to use updated packages.\n"
+     ]
+    }
+   ],
    "source": [
     "%pip install requests"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 4,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Requirement already satisfied: beautifulsoup4 in c:\\users\\user\\appdata\\local\\programs\\python\\python313\\lib\\site-packages (4.13.4)\n",
+      "Requirement already satisfied: soupsieve>1.2 in c:\\users\\user\\appdata\\local\\programs\\python\\python313\\lib\\site-packages (from beautifulsoup4) (2.7)\n",
+      "Requirement already satisfied: typing-extensions>=4.0.0 in c:\\users\\user\\appdata\\local\\programs\\python\\python313\\lib\\site-packages (from beautifulsoup4) (4.14.1)\n",
+      "Note: you may need to restart the kernel to use updated packages.\n"
+     ]
+    }
+   ],
    "source": [
     "%pip install beautifulsoup4"
    ]
   },
@@ -72,16 +120,40 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 5,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Requirement already satisfied: lxml in c:\\users\\user\\appdata\\local\\programs\\python\\python313\\lib\\site-packages (6.0.1)\n",
+      "Note: you may need to restart the kernel to use updated packages.\n"
+     ]
+    }
+   ],
    "source": [
     "%pip install lxml"
    ]
   },
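+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a quick sanity check, the cell below is a minimal sketch that confirms all three packages can be imported and prints their installed versions via the standard-library `importlib.metadata`.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Confirm the three packages are importable and report their installed versions\n",
+    "from importlib.metadata import version\n",
+    "\n",
+    "import requests\n",
+    "import bs4\n",
+    "import lxml\n",
+    "\n",
+    "for pkg in (\"requests\", \"beautifulsoup4\", \"lxml\"):\n",
+    "    print(pkg, version(pkg))\n"
+   ]
+  },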
"output_type": "stream", + "text": [ + "Requirement already satisfied: lxml in c:\\users\\user\\appdata\\local\\programs\\python\\python313\\lib\\site-packages (6.0.1)\n", + "Note: you may need to restart the kernel to use updated packages.\n" + ] + } + ], "source": [ "%pip install lxml" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Importación de librerías\n", + "\n", + "En esta sección se importan las librerías necesarias que serán utilizadas en el script para realizar web scraping, manipulación de fechas y control de tiempo en la ejecución.\n", + "\n", + "- **from bs4 import BeautifulSoup** → importa BeautifulSoup, que se usará para analizar y extraer información de documentos HTML. \n", + "- **from datetime import datetime** → permite trabajar con fechas y horas, como obtener la fecha actual o formatear timestamps. \n", + "- **import requests** → se utiliza para realizar peticiones HTTP y obtener el contenido de páginas web. \n", + "- **import time** → proporciona funciones para controlar pausas en la ejecución del script, por ejemplo usando `time.sleep()`.\n", + "\n" + ] + }, { "cell_type": "code", - "execution_count": null, + "execution_count": 1, "metadata": { "tags": [] }, @@ -124,12 +196,51 @@ ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": { "tags": [] }, - "outputs": [], + "source": [ + "Este bloque realiza un request HTTP mediante el metodo GET a un sitio web y muestra parte del contenido recibido. \n", + "\n", + "- **req = requests.get(http://www.ilga.gov/senate/default.asp)** → Realiza una solicitud HTTP GET a la URL especificada\n", + "\n", + "- **src = req.text** → Obtiene el contenido de la respuesta del servidor en formato de texto (HTML) \n", + "\n", + "- **print(src[:1000])** → Muestra los primeros 1000 caracteres del contenido para ver una vista previa" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "\n", + "
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " ` dentro del árbol HTML y muestra únicamente los primeros 10 resultados encontrados." + ] + }, { "cell_type": "code", - "execution_count": null, + "execution_count": 4, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "[\n", + " English\n", + " , \n", + " Afrikaans\n", + " , \n", + " Albanian\n", + " , \n", + " Arabic\n", + " , \n", + " Armenian\n", + " , \n", + " Azerbaijani\n", + " , \n", + " Basque\n", + " , \n", + " Bengali\n", + " , \n", + " Bosnian\n", + " , \n", + " Catalan\n", + " ]\n" + ] + } + ], "source": [ "# Find all elements with a certain tag\n", "a_tags = soup.find_all(\"a\")\n", @@ -208,13 +390,34 @@ "These two lines of code are equivalent:" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Explicación del bloque del código ##\n", + "Se buscan todos los elementos con la etiqueta `` usando dos formas equivalentes (`soup.find_all(\"a\")` y `soup(\"a\")`) y se imprime el primer elemento obtenido en cada caso. Por ultimo, imprime la cantidad de elementos del primer caso." + ] + }, { "cell_type": "code", - "execution_count": null, + "execution_count": 5, "metadata": { "tags": [] }, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + " English\n", + " \n", + "\n", + " English\n", + " \n" + ] + } + ], "source": [ "a_tags = soup.find_all(\"a\")\n", "a_tags_alt = soup(\"a\")\n", @@ -231,9 +434,17 @@ }, { "cell_type": "code", - "execution_count": null, + "execution_count": 6, "metadata": {}, - "outputs": [], + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "270\n" + ] + } + ], "source": [ "print(len(a_tags))" ] @@ -249,16 +460,43 @@ "We can do this by adding an additional argument to the `find_all`. In the example below, we are finding all the `a` tags, and then filtering those with `class_=\"sidemenu\"`." ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Explicación del bloque del código ##\n", + "Se buscan específicamente las etiquetas `` que pertenecen a la clase `sidemenu` dentro del HTML. Primero se usa el codigo `soup(\"a\", class_=\"sidemenu\")` y luego la sintaxis de selectores CSS con `soup.select(\"a.sidemenu\")`. En ambos casos se mostraria solo los primeros 5 resultados encontrados, sin embargo, no encuentra ninguna clase sidemenu por lo que no muestra algun dato y la cantidad aparece en 0." 
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 13,
    "metadata": {
     "tags": []
    },
-   "outputs": [],
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "0\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "[]"
+      ]
+     },
+     "execution_count": 13,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
     "# Get only the 'a' tags in 'sidemenu' class\n",
     "side_menus = soup(\"a\", class_=\"sidemenu\")\n",
+    "print(len(side_menus))\n",
     "side_menus[:5]"
    ]
   },
@@ -273,11 +511,22 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 14,
    "metadata": {
     "tags": []
    },
-   "outputs": [],
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[]"
+      ]
+     },
+     "execution_count": 14,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
     "# Get elements with \"a.sidemenu\" CSS Selector.\n",
     "selected = soup.select(\"a.sidemenu\")\n",
@@ -293,13 +542,34 @@
    "Use BeautifulSoup to find all the `a` elements with class `mainmenu`."
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Explanation of the code block ##\n",
+    "Here the `<a>` tags belonging to the `mainmenu` class are looked up with CSS selector syntax, and the first 5 elements would be displayed. However, since no `mainmenu` class is found, no data is shown."
+   ]
+  },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 7,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[]"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
    "source": [
-    "# YOUR CODE HERE\n"
+    "# Get elements with \"a.mainmenu\" CSS Selector.\n",
+    "selected_main = soup.select(\"a.mainmenu\")\n",
+    "selected_main[:5]\n"
    ]
   },
@@ -316,22 +586,44 @@
    "Getting the text inside an element is easy. All we have to do is use the `text` member of a `tag` object:"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Explanation of the code block ##\n",
+    "This tries to get all `<a>` links with the `sidemenu` class and examine the first one, also checking the type of the variable that holds that first link.\n",
+    "However, since the list contains no elements with the `sidemenu` class, accessing `side_menu_links[0]` raises an `IndexError`.\n",
+    "The later calls `first_link.text` and `first_link['href']` then raise a `NameError`, because the variable `first_link` was never defined once the previous step failed."
+   ]
+  },
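+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A guarded version of the same lookup avoids this cascade of errors (a sketch, assuming `soup` from above; `tag.get(\"href\")` returns `None` instead of raising when the attribute is missing):\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Look up the sidemenu links defensively\n",
+    "side_menu_links = soup.select(\"a.sidemenu\")\n",
+    "\n",
+    "if side_menu_links:\n",
+    "    first_link = side_menu_links[0]\n",
+    "    print(type(first_link))\n",
+    "    print(first_link.text)\n",
+    "    print(first_link.get(\"href\"))\n",
+    "else:\n",
+    "    print(\"No 'a.sidemenu' elements found; the page structure may have changed.\")\n"
+   ]
+  },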
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 9,
    "metadata": {
     "tags": []
    },
-   "outputs": [],
+   "outputs": [
+    {
+     "ename": "IndexError",
+     "evalue": "list index out of range",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
+      "\u001b[31mIndexError\u001b[39m Traceback (most recent call last)",
+      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[9]\u001b[39m\u001b[32m, line 5\u001b[39m\n\u001b[32m 2\u001b[39m side_menu_links = soup.select(\u001b[33m\"\u001b[39m\u001b[33ma.sidemenu\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m 4\u001b[39m \u001b[38;5;66;03m# Examine the first link\u001b[39;00m\n\u001b[32m----> \u001b[39m\u001b[32m5\u001b[39m first_link = \u001b[43mside_menu_links\u001b[49m\u001b[43m[\u001b[49m\u001b[32;43m0\u001b[39;49m\u001b[43m]\u001b[49m\n\u001b[32m 6\u001b[39m \u001b[38;5;28mprint\u001b[39m(first_link)\n\u001b[32m 8\u001b[39m \u001b[38;5;66;03m# What class is this variable?\u001b[39;00m\n",
+      "\u001b[31mIndexError\u001b[39m: list index out of range"
+     ]
+    }
+   ],
    "source": [
     "# Get all sidemenu links as a list\n",
     "side_menu_links = soup.select(\"a.sidemenu\")\n",
     "\n",
     "# Examine the first link\n",
     "first_link = side_menu_links[0]\n",
     "print(first_link)\n",
     "\n",
     "# What class is this variable?\n",
     "print('Class: ', type(first_link))"
    ]
   },
@@ -344,11 +636,23 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 10,
    "metadata": {
     "tags": []
    },
-   "outputs": [],
+   "outputs": [
+    {
+     "ename": "NameError",
+     "evalue": "name 'first_link' is not defined",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
+      "\u001b[31mNameError\u001b[39m Traceback (most recent call last)",
+      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[10]\u001b[39m\u001b[32m, line 1\u001b[39m\n\u001b[32m----> \u001b[39m\u001b[32m1\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[43mfirst_link\u001b[49m.text)\n",
+      "\u001b[31mNameError\u001b[39m: name 'first_link' is not defined"
+     ]
+    }
+   ],
    "source": [
     "print(first_link.text)"
    ]
   },
@@ -364,11 +668,23 @@
   },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 18,
    "metadata": {
     "tags": []
    },
-   "outputs": [],
+   "outputs": [
+    {
+     "ename": "NameError",
+     "evalue": "name 'first_link' is not defined",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
+      "\u001b[31mNameError\u001b[39m Traceback (most recent call last)",
+      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[18]\u001b[39m\u001b[32m, line 1\u001b[39m\n\u001b[32m----> \u001b[39m\u001b[32m1\u001b[39m \u001b[38;5;28mprint\u001b[39m(\u001b[43mfirst_link\u001b[49m[\u001b[33m'\u001b[39m\u001b[33mhref\u001b[39m\u001b[33m'\u001b[39m])\n",
+      "\u001b[31mNameError\u001b[39m: name 'first_link' is not defined"
+     ]
+    }
+   ],
    "source": [
     "print(first_link['href'])"
    ]
   },
@@ -382,13 +698,38 @@
    "Extract all `href` attributes for each `mainmenu` URL."
   ]
  },
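+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "One way to approach the challenge (a sketch; on this fetched page it yields an empty list, for the same reason as above) is to collect the `href` of every match, skipping anchors that lack the attribute:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Collect the href attribute of every 'a.mainmenu' element\n",
+    "hrefs = [link[\"href\"] for link in soup.select(\"a.mainmenu\") if link.has_attr(\"href\")]\n",
+    "print(len(hrefs))\n",
+    "print(hrefs[:5])\n"
+   ]
+  },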
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Explanation of the code block ##\n",
+    "All `<a>` links belonging to the `mainmenu` class are found and stored in the list `main_menu_links`. The code then tries to access the first element of that list and print its `href` attribute.\n",
+    "This raises an `IndexError` because the list `main_menu_links` is empty, that is, no links with the `mainmenu` class were found in the downloaded HTML."
+   ]
+  },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 15,
    "metadata": {},
-   "outputs": [],
+   "outputs": [
+    {
+     "ename": "IndexError",
+     "evalue": "list index out of range",
+     "output_type": "error",
+     "traceback": [
+      "\u001b[31m---------------------------------------------------------------------------\u001b[39m",
+      "\u001b[31mIndexError\u001b[39m Traceback (most recent call last)",
+      "\u001b[36mCell\u001b[39m\u001b[36m \u001b[39m\u001b[32mIn[15]\u001b[39m\u001b[32m, line 3\u001b[39m\n\u001b[32m 1\u001b[39m \u001b[38;5;66;03m# Get all mainmenu links as a list\u001b[39;00m\n\u001b[32m 2\u001b[39m main_menu_links = soup.select(\u001b[33m\"\u001b[39m\u001b[33ma.mainmenu\u001b[39m\u001b[33m\"\u001b[39m)\n\u001b[32m----> \u001b[39m\u001b[32m3\u001b[39m first_link_menu = \u001b[43mmain_menu_links\u001b[49m\u001b[43m[\u001b[49m\u001b[32;43m0\u001b[39;49m\u001b[43m]\u001b[49m\n\u001b[32m 4\u001b[39m \u001b[38;5;28mprint\u001b[39m(first_link_menu[\u001b[33m'\u001b[39m\u001b[33mhref\u001b[39m\u001b[33m'\u001b[39m])\n",
+      "\u001b[31mIndexError\u001b[39m: list index out of range"
+     ]
+    }
+   ],
    "source": [
-    "# YOUR CODE HERE\n"
+    "# Get all mainmenu links as a list\n",
+    "main_menu_links = soup.select(\"a.mainmenu\")\n",
+    "first_link_menu = main_menu_links[0]\n",
+    "print(first_link_menu['href'])\n"
    ]
   },
@@ -415,9 +756,17 @@
    "Let's scrape and parse the webpage, using the tools we learned in the previous section."
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Explanation of the code block ##\n",
+    "An HTTP GET request is sent to the page with the parameter `GA=98`, the body of the response is read, and it is parsed into an HTML tree with BeautifulSoup using the `lxml` parser."
+   ]
+  },
   {
    "cell_type": "code",
-   "execution_count": null,
+   "execution_count": 16,
    "metadata": {
     "tags": []
    },
@@ -440,11 +789,32 @@
    "Our goal is to obtain the elements in the table on the webpage. Remember: rows are identified by the `tr` tag. Let's use `find_all` to obtain these elements."
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "## Explanation of the code block ##\n",
+    "All table rows (`<tr>`