Merged
19 changes: 7 additions & 12 deletions docs/01_introduction/index.mdx
Original file line number Diff line number Diff line change
@@ -6,20 +6,15 @@ slug: /overview
description: 'The official library for creating Apify Actors in Python, providing tools for web scraping, automation, and data storage integration.'
---

import CodeBlock from '@theme/CodeBlock';

import IntroductionExample from '!!raw-loader!./code/01_introduction.py';

The Apify SDK for Python is the official library for creating [Apify Actors](https://docs.apify.com/platform/actors) in Python. It provides useful features like Actor lifecycle management, local storage emulation, and Actor event handling.

```python
from apify import Actor
from bs4 import BeautifulSoup
import requests

async def main():
    async with Actor:
        input = await Actor.get_input()
        response = requests.get(input['url'])
        soup = BeautifulSoup(response.content, 'html.parser')
        await Actor.push_data({ 'url': input['url'], 'title': soup.title.string })
```
<CodeBlock className="language-python">
{IntroductionExample}
</CodeBlock>

## What are Actors

67 changes: 30 additions & 37 deletions docs/01_introduction/quick-start.mdx
@@ -13,6 +13,9 @@ import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';

import MainExample from '!!raw-loader!./code/actor_structure/main.py';
import UnderscoreMainExample from '!!raw-loader!./code/actor_structure/__main__.py';

## Step 1: Create Actors

To create and run Actors in Apify Console, refer to the [Console documentation](/platform/actors/development/quick-start/web-ide).
@@ -61,33 +64,14 @@ The Actor's source code is in the `src` folder. This folder contains two importa

<Tabs>
<TabItem value="main.py" label="main.py" default>
<CodeBlock language="python">{
`from apify import Actor
${''}
async def main():
    async with Actor:
        Actor.log.info('Actor input:', await Actor.get_input())
        await Actor.set_value('OUTPUT', 'Hello, world!')`
}</CodeBlock>
<CodeBlock className="language-python">
{MainExample}
</CodeBlock>
</TabItem>
    <TabItem value="__main__.py" label="__main__.py">
<CodeBlock language="python">{
`import asyncio
import logging
${''}
from apify.log import ActorLogFormatter
${''}
from .main import main
${''}
handler = logging.StreamHandler()
handler.setFormatter(ActorLogFormatter())
${''}
apify_logger = logging.getLogger('apify')
apify_logger.setLevel(logging.DEBUG)
apify_logger.addHandler(handler)
${''}
asyncio.run(main())`
}</CodeBlock>
<CodeBlock className="language-python">
{UnderscoreMainExample}
</CodeBlock>
</TabItem>
</Tabs>

@@ -96,21 +80,30 @@ We recommend keeping the entrypoint for the Actor in the `src/__main__.py` file.

## Next steps

### Concepts

To learn more about the features of the Apify SDK and how to use them, check out the Concepts section in the sidebar:

- [Actor lifecycle](../concepts/actor-lifecycle)
- [Actor input](../concepts/actor-input)
- [Working with storages](../concepts/storages)
- [Actor events & state persistence](../concepts/actor-events)
- [Proxy management](../concepts/proxy-management)
- [Interacting with other Actors](../concepts/interacting-with-other-actors)
- [Creating webhooks](../concepts/webhooks)
- [Accessing Apify API](../concepts/access-apify-api)
- [Logging](../concepts/logging)
- [Actor configuration](../concepts/actor-configuration)
- [Pay-per-event monetization](../concepts/pay-per-event)

### Guides

To see how you can integrate the Apify SDK with some of the most popular web scraping libraries, check out our guides for working with:
To see how you can integrate the Apify SDK with popular web scraping libraries, check out our guides:

- [Requests or HTTPX](../guides/requests-and-httpx)
- [Beautiful Soup](../guides/beautiful-soup)
- [BeautifulSoup with HTTPX](../guides/beautifulsoup-httpx)
- [Parsel with Impit](../guides/parsel-impit)
- [Playwright](../guides/playwright)
- [Selenium](../guides/selenium)
- [Crawlee](../guides/crawlee)
- [Scrapy](../guides/scrapy)

### Usage concepts

To learn more about the features of the Apify SDK and how to use them, check out the Usage Concepts section in the sidebar, especially the guides for:

- [Actor lifecycle](../concepts/actor-lifecycle)
- [Working with storages](../concepts/storages)
- [Handling Actor events](../concepts/actor-events)
- [How to use proxies](../concepts/proxy-management)
- [Running webserver](../guides/running-webserver)
6 changes: 6 additions & 0 deletions docs/03_guides/01_beautifulsoup_httpx.mdx
@@ -28,3 +28,9 @@ Below is a simple Actor that recursively scrapes titles from all linked websites
## Conclusion

In this guide, you learned how to use [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) with [HTTPX](https://www.python-httpx.org/) in your Apify Actors. By combining these libraries, you can efficiently extract data from HTML or XML files, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!
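As a minimal sketch of the parsing half of this pattern, the helper below extracts a page title and its links with BeautifulSoup from a static HTML snippet; the function name and sample HTML are illustrative only, and in a real Actor you would obtain the HTML by fetching the page with HTTPX as shown in the guide's example:

```python
from bs4 import BeautifulSoup


def extract_title_and_links(html: str) -> dict:
    """Parse an HTML document and return its title and all link targets.

    In a real Actor, `html` would come from an HTTPX response body;
    here we parse a static snippet to keep the sketch self-contained.
    """
    soup = BeautifulSoup(html, 'html.parser')
    title = soup.title.string if soup.title else None
    links = [a['href'] for a in soup.find_all('a', href=True)]
    return {'title': title, 'links': links}


sample = (
    '<html><head><title>Example Domain</title></head>'
    '<body><a href="https://example.com/about">About</a></body></html>'
)
result = extract_title_and_links(sample)
print(result)  # {'title': 'Example Domain', 'links': ['https://example.com/about']}
```

The same helper can then be called once per enqueued URL inside the Actor's recursive crawl loop.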

## Additional resources

- [Apify templates: BeautifulSoup](https://apify.com/templates/python-beautifulsoup)
- [BeautifulSoup: Official documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [HTTPX: Official documentation](https://www.python-httpx.org/)
6 changes: 6 additions & 0 deletions docs/03_guides/02_parsel_impit.mdx
@@ -26,3 +26,9 @@ The following example shows a simple Actor that recursively scrapes titles from
## Conclusion

In this guide, you learned how to use [Parsel](https://github.com/scrapy/parsel) with [Impit](https://github.com/apify/impit) in your Apify Actors. By combining these libraries, you get a powerful and efficient solution for web scraping: [Parsel](https://github.com/scrapy/parsel) provides excellent CSS selector and XPath support for data extraction, while [Impit](https://github.com/apify/impit) offers a fast and simple HTTP client built by Apify. This combination makes it easy to build scalable web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!

## Additional resources

- [Apify templates: Crawlee + Parsel](https://apify.com/templates/python-crawlee-parsel)
- [Parsel: GitHub repository](https://github.com/scrapy/parsel)
- [Impit: GitHub repository](https://github.com/apify/impit)
12 changes: 10 additions & 2 deletions docs/03_guides/03_playwright.mdx
@@ -10,6 +10,10 @@ import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import PlaywrightExample from '!!raw-loader!roa-loader!./code/03_playwright.py';

In this guide, you'll learn how to use [Playwright](https://playwright.dev) for web scraping in your Apify Actors.

## Introduction

[Playwright](https://playwright.dev) is a tool for web automation and testing that can also be used for web scraping. It allows you to control a web browser programmatically and interact with web pages just as a human would.

Some of the key features of Playwright for web scraping include:
@@ -19,8 +23,6 @@ Some of the key features of Playwright for web scraping include:
- **Powerful selectors** - Playwright provides a variety of powerful selectors that allow you to target specific elements on a web page, including CSS selectors, XPath, and text matching.
- **Emulation of user interactions** - Playwright allows you to emulate user interactions like clicking, scrolling, filling out forms, and even typing in text, which can be useful for scraping websites that have dynamic content or require user input.

## Using Playwright in Actors

To create Actors which use Playwright, start from the [Playwright & Python](https://apify.com/templates/categories/python) Actor template.

On the Apify platform, the Actor will already have Playwright and the necessary browsers preinstalled in its Docker image, including the tools and setup necessary to run browsers in headful mode.
@@ -55,3 +57,9 @@ It uses Playwright to open the pages in an automated Chrome browser, and to extr
## Conclusion

In this guide you learned how to create Actors that use Playwright to scrape websites. Playwright is a powerful tool that can be used to manage browser instances and scrape websites that require JavaScript execution. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!

## Additional resources

- [Apify templates: Playwright + Chrome](https://apify.com/templates/python-playwright)
- [Apify templates: Crawlee + Playwright + Chrome](https://apify.com/templates/python-crawlee-playwright)
- [Playwright: Official documentation](https://playwright.dev/python/)
11 changes: 9 additions & 2 deletions docs/03_guides/04_selenium.mdx
@@ -7,6 +7,10 @@ import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import SeleniumExample from '!!raw-loader!roa-loader!./code/04_selenium.py';

In this guide, you'll learn how to use [Selenium](https://www.selenium.dev/) for web scraping in your Apify Actors.

## Introduction

[Selenium](https://www.selenium.dev/) is a tool for web automation and testing that can also be used for web scraping. It allows you to control a web browser programmatically and interact with web pages just as a human would.

Some of the key features of Selenium for web scraping include:
@@ -21,8 +25,6 @@ including CSS selectors, XPath, and text matching.
- **Emulation of user interactions** - Selenium allows you to emulate user interactions like clicking, scrolling, filling out forms,
and even typing in text, which can be useful for scraping websites that have dynamic content or require user input.

## Using Selenium in Actors

To create Actors which use Selenium, start from the [Selenium & Python](https://apify.com/templates/categories/python) Actor template.

On the Apify platform, the Actor will already have Selenium and the necessary browsers preinstalled in its Docker image,
@@ -44,3 +46,8 @@ It uses Selenium ChromeDriver to open the pages in an automated Chrome browser,
## Conclusion

In this guide you learned how to use Selenium for web scraping in Apify Actors. You can now create your own Actors that use Selenium to scrape dynamic websites and interact with web pages just like a human would. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!

## Additional resources

- [Apify templates: Selenium + Chrome](https://apify.com/templates/python-selenium)
- [Selenium: Official documentation](https://www.selenium.dev/documentation/)
9 changes: 9 additions & 0 deletions docs/03_guides/05_crawlee.mdx
@@ -44,3 +44,12 @@ The [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler
## Conclusion

In this guide, you learned how to use the [Crawlee](https://crawlee.dev/python) library in your Apify Actors. By using the [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler), [`ParselCrawler`](https://crawlee.dev/python/api/class/ParselCrawler), and [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) crawlers, you can efficiently scrape static or dynamic web pages, making it easy to build web scraping tasks in Python. See the [Actor templates](https://apify.com/templates/categories/python) to get started with your own scraping tasks. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU). Happy scraping!

## Additional resources

- [Apify templates: Crawlee + BeautifulSoup](https://apify.com/templates/python-crawlee-beautifulsoup)
- [Apify templates: Crawlee + Parsel](https://apify.com/templates/python-crawlee-parsel)
- [Apify templates: Crawlee + Playwright + Chrome](https://apify.com/templates/python-crawlee-playwright)
- [Crawlee: Official website](https://crawlee.dev/python)
- [Crawlee: Documentation](https://crawlee.dev/python/docs)
- [Crawlee: GitHub repository](https://github.com/apify/crawlee-python)
4 changes: 4 additions & 0 deletions docs/03_guides/06_scrapy.mdx
@@ -13,6 +13,10 @@ import ItemsExample from '!!raw-loader!./code/scrapy_project/src/items.py';
import SpidersExample from '!!raw-loader!./code/scrapy_project/src/spiders/title.py';
import SettingsExample from '!!raw-loader!./code/scrapy_project/src/settings.py';

In this guide, you'll learn how to use the [Scrapy](https://scrapy.org/) framework in your Apify Actors.

## Introduction

[Scrapy](https://scrapy.org/) is an open-source web scraping framework for Python. It provides tools for defining scrapers, extracting data from web pages, following links, and handling pagination. With the Apify SDK, Scrapy projects can be converted into Apify [Actors](https://docs.apify.com/platform/actors), integrated with Apify [storages](https://docs.apify.com/platform/storage), and executed on the Apify [platform](https://docs.apify.com/platform).

## Integrating Scrapy with the Apify platform
18 changes: 15 additions & 3 deletions docs/03_guides/07_running_webserver.mdx
@@ -1,12 +1,16 @@
---
id: running-webserver
title: Running webserver in your Actor
title: Running webserver
---

import RunnableCodeBlock from '@site/src/components/RunnableCodeBlock';

import WebserverExample from '!!raw-loader!roa-loader!./code/07_webserver.py';

In this guide, you'll learn how to run a web server inside your Apify Actor. This is useful for monitoring Actor progress, creating custom APIs, or serving content during the Actor run.

## Introduction

Each Actor run on the Apify platform is assigned a unique hard-to-guess URL (for example `https://8segt5i81sokzm.runs.apify.net`), which enables HTTP access to an optional web server running inside the Actor run's container.

The URL is available in the following places:
@@ -17,10 +21,18 @@ The URL is available in the following places:

The web server running inside the container must listen at the port defined by the `Actor.configuration.container_port` property. When running Actors locally, the port defaults to `4321`, so the web server will be accessible at `http://localhost:4321`.

## Example
## Example Actor

The following example demonstrates how to start a simple web server in your Actor,which will respond to every GET request with the number of items that the Actor has processed so far:
The following example demonstrates how to start a simple web server in your Actor, which will respond to every GET request with the number of items that the Actor has processed so far:

<RunnableCodeBlock className="language-python" language="python">
{WebserverExample}
</RunnableCodeBlock>

## Conclusion

In this guide, you learned how to run a web server inside your Apify Actor. By leveraging the container URL and port provided by the platform, you can expose HTTP endpoints for monitoring, reporting, or serving content during Actor execution. If you have questions or need assistance, feel free to reach out on our [GitHub](https://github.com/apify/apify-sdk-python) or join our [Discord community](https://discord.com/invite/jyEM2PRvMU).

## Additional resources

- [Apify templates: Standby Python project](https://apify.com/templates/python-standby)
7 changes: 3 additions & 4 deletions docs/03_guides/code/07_webserver.py
@@ -7,9 +7,9 @@
http_server = None


# Just a simple handler that will print the number of processed items so far
# on every GET request.
class RequestHandler(BaseHTTPRequestHandler):
    """A handler that prints the number of processed items on every GET request."""

    def do_GET(self) -> None:
        self.log_request()
        self.send_response(200)
@@ -18,8 +18,7 @@ def do_GET(self) -> None:


def run_server() -> None:
    # Start the HTTP server on the provided port,
    # and save a reference to the server.
    """Start the HTTP server on the provided port, and save a reference to the server."""
    global http_server
    with ThreadingHTTPServer(
        ('', Actor.configuration.web_server_port), RequestHandler
@@ -0,0 +1,12 @@
import requests
from bs4 import BeautifulSoup

from apify import Actor


async def main() -> None:
    async with Actor:
        actor_input = await Actor.get_input()
        response = requests.get(actor_input['url'])
        soup = BeautifulSoup(response.content, 'html.parser')
        await Actor.push_data({'url': actor_input['url'], 'title': soup.title.string})
@@ -0,0 +1,15 @@
import asyncio
import logging

from apify.log import ActorLogFormatter

from .main import main

handler = logging.StreamHandler()
handler.setFormatter(ActorLogFormatter())

apify_logger = logging.getLogger('apify')
apify_logger.setLevel(logging.DEBUG)
apify_logger.addHandler(handler)

asyncio.run(main())
@@ -0,0 +1,7 @@
from apify import Actor


async def main() -> None:
    async with Actor:
        Actor.log.info('Actor input: %s', await Actor.get_input())
        await Actor.set_value('OUTPUT', 'Hello, world!')
19 changes: 7 additions & 12 deletions website/versioned_docs/version-1.7/01-introduction/index.mdx
@@ -6,20 +6,15 @@ slug: /overview
description: 'The official library for creating Apify Actors in Python, providing tools for web scraping, automation, and data storage integration.'
---

import CodeBlock from '@theme/CodeBlock';

import IntroductionExample from '!!raw-loader!./code/01_introduction.py';

The Apify SDK for Python is the official library for creating [Apify Actors](https://docs.apify.com/platform/actors) in Python.

```python
from apify import Actor
from bs4 import BeautifulSoup
import requests

async def main():
    async with Actor:
        actor_input = await Actor.get_input()
        response = requests.get(actor_input['url'])
        soup = BeautifulSoup(response.content, 'html.parser')
        await Actor.push_data({ 'url': actor_input['url'], 'title': soup.title.string })
```
<CodeBlock className="language-python">
{IntroductionExample}
</CodeBlock>

## What are Actors?
