Synapse is a highly efficient and pluggable, open-source crawling/scraping framework; for both local and distributed workloads.
There're two integration paths, based on level of control:
-
High-Level API: Built for standard crawling workloads. Extend with built-in plugins and get moving immediately without fiddling with the underlying mechanics. [TODO]
-
Low-Level API: For architecting custom scrapers/crawlers with (Sub)component-level control. Extend with your own implementations. [WIP]
The distributed architecture is [WIP]; essentially tinkering with distributed state-machine. Currently, in experimental phase. Expect breaking changes as the architecture evolves.
Efforts are currently prioritized toward solid core abstractions over polished public documentation. Implementation-specific details are available within each component's directory for developers diving into the internals.
Contributions are welcome!
-
Start by checking contribution guidelines.
-
Any questions, ask on discussions
In neurobiology, a synapse is the junction for signal transmission between neurons. This framework serves as the interface between the web and application-specific logic, decoupling data acquisition from downstream processing.
It's not intended for any malicious or unethical web scraping/crawling activities.
Please ensure you comply with the website's robots.txt directives
and terms of service (TOS) before crawling/scraping.