This package contains a Python script that handles two tasks.

**Task 1 — Broken link check:**
- Read website page URLs from Excel
- Open each page
- Extract all `<a href="">` links
- Check which links are broken/inaccessible
- Export results to a new Excel file
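The core of the broken-link check can be sketched as below. This is a minimal illustration using the libraries the script depends on; the function names (`extract_links`, `is_broken`) are illustrative, not necessarily the names used inside `scraper_task.py`.

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def extract_links(page_url, html):
    """Collect absolute URLs from every <a href> on a page."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(page_url, a["href"]) for a in soup.find_all("a", href=True)]


def is_broken(url, timeout=10):
    """A link counts as broken if the request fails or returns HTTP >= 400."""
    try:
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        if resp.status_code >= 400:
            # Some servers reject HEAD; retry with GET before reporting broken.
            resp = requests.get(url, timeout=timeout, allow_redirects=True)
        return resp.status_code >= 400
    except requests.RequestException:
        return True
```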
**Task 2 — Image download:**
- Read website page URLs from Excel
- Open each page
- Extract all `<img src="">` image links
- Download all images into one folder
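The image-download task can be sketched in the same style. This is an assumed implementation, not the packaged script itself; filenames are taken from the URL path, and images that fail to download are skipped.

```python
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse


def image_urls(page_url, html):
    """Absolute URLs for every <img src> on a page."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(page_url, img["src"]) for img in soup.find_all("img", src=True)]


def download_images(urls, folder):
    """Save each image into one folder, keeping its original filename."""
    os.makedirs(folder, exist_ok=True)
    for url in urls:
        name = os.path.basename(urlparse(url).path) or "image"
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            with open(os.path.join(folder, name), "wb") as f:
                f.write(resp.content)
        except requests.RequestException:
            pass  # skip images that fail to download
```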
**Input Excel format** — keep the first column header as `URL`. Example rows:
- https://lanoequip.com/
- https://lanoequip.com/new-equipment.html
- https://lanoequip.com/parts-toro.html
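Reading that column with `openpyxl` might look like the sketch below (the helper name `read_urls` is illustrative; the real script's internals may differ). It skips the header row and ignores empty cells.

```python
from openpyxl import load_workbook


def read_urls(path):
    """Return non-empty values from the first column, skipping the header row."""
    ws = load_workbook(path).active
    return [row[0] for row in ws.iter_rows(min_row=2, values_only=True) if row[0]]
```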
**Setup and usage** — open a terminal / command prompt:

```
pip install requests beautifulsoup4 openpyxl
python scraper_task.py --task broken_links --input input_urls.xlsx --output broken_links_report.xlsx
python scraper_task.py --task download_images --input input_urls.xlsx --output downloaded_images
```

**Notes:**
- Some websites block scraping or reject `HEAD` requests; the script falls back to `GET` if needed.
- Relative links like `/about` are automatically converted to full URLs.
- `mailto:`, `tel:`, `javascript:`, and `#anchor` links are ignored.
- If a page itself does not open, the script records that page as an error in the report.
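The link-filtering rules described in the notes can be expressed as a small helper. This is a sketch under the stated rules (the name `normalize_link` is an assumption, not taken from the script): skippable schemes and same-page anchors return `None`, and everything else is resolved to an absolute URL.

```python
from urllib.parse import urljoin, urlparse

# Link schemes the scraper should never follow.
SKIPPED_SCHEMES = {"mailto", "tel", "javascript"}


def normalize_link(page_url, href):
    """Return an absolute URL, or None for links that should be skipped."""
    href = href.strip()
    if not href or href.startswith("#"):
        return None  # same-page anchor
    if urlparse(href).scheme in SKIPPED_SCHEMES:
        return None  # mailto:, tel:, javascript:
    return urljoin(page_url, href)  # e.g. /about -> https://site/about
```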
**Report Excel columns:** `Page URL`, `Broken Links`, `Status`
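Writing that report with `openpyxl` could be sketched as follows; the column headers match the ones listed above, while the helper name and the tuple-per-row shape are assumptions for illustration.

```python
from openpyxl import Workbook


def write_report(rows, path):
    """rows: iterable of (page_url, broken_link, status) tuples."""
    wb = Workbook()
    ws = wb.active
    ws.append(["Page URL", "Broken Links", "Status"])  # header row
    for row in rows:
        ws.append(list(row))
    wb.save(path)
```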