quick start

        ________      _________________
___  __ \___________  /_____  /____________
__  / / /  __ \  __  /_  __  /_  _ \_  ___/
_  /_/ // /_/ / /_/ / / /_/ / /  __/  /
/_____/ \____/\__,_/  \__,_/  \___//_/     A distributed DHT web crawler that supports cluster deployment.

This is a simplified version that runs on the local file system and has been translated to English. At a high level, the software has one service that scrapes the DHT, saves unique info-hashes to Redis, and streams them to Kafka. Another service then downloads the metadata, stores the torrents, and so on. The portion of interest for now is the DHT scraper, "dodder-dht-server", which can run stand-alone once Redis and Kafka are removed. To use this version, just run it and wait for info-hashes to be written out.
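To illustrate the stand-alone idea, here is a minimal sketch of how unique info-hashes could be deduplicated in memory and appended to a local file once Redis and Kafka are out of the picture. The class name `InfoHashFileSink`, its methods, and the output file are hypothetical and are not taken from the Dodder code base; the in-memory set is only a stand-in for Redis.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Dodder: keeps the set of seen
// info-hashes in memory (standing in for Redis) and appends each
// new one to a text file (standing in for the Kafka stream).
class InfoHashFileSink {
    private final Set<String> seen = new HashSet<>();
    private final Path output;

    InfoHashFileSink(Path output) {
        this.output = output;
    }

    // Returns true if the hash was new and was appended to the file.
    synchronized boolean accept(byte[] infoHash) throws IOException {
        // BitTorrent v1 info-hashes are 20-byte SHA-1 digests.
        if (infoHash == null || infoHash.length != 20) {
            return false;
        }
        String hex = toHex(infoHash);
        if (!seen.add(hex)) {
            return false; // already written once
        }
        Files.writeString(output, hex + System.lineSeparator(),
                StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        return true;
    }

    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("info-hashes", ".txt");
        InfoHashFileSink sink = new InfoHashFileSink(file);
        byte[] sample = new byte[20]; // placeholder digest, all zero bytes
        System.out.println(sink.accept(sample)); // first sighting
        System.out.println(sink.accept(sample)); // duplicate
        System.out.println("written to " + file);
    }
}
```

The set grows without bound in this sketch, so a long-running crawler would likely need an eviction policy or a persistent store instead.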

Original work of https://github.com/xwlcn/Dodder

quick start

environment dependencies

Zookeeper-3.7.0 (http://zookeeper.apache.org/)
Kafka-2.13-2.8.0 (http://kafka.apache.org/)
Redis-2.6 (https://redis.io/)
MongoDB-4.4.5 (https://www.mongodb.com/)
Elasticsearch-7.12.0 (https://www.elastic.co/)
elasticsearch-analysis-ik-7.12.0 (https://github.com/medcl/elasticsearch-analysis-ik)

announce_peer messages:

Stand-alone operating environment:

CPU: Intel Xeon E3-1230 v3 - 3.3 GHz - 4 core(s)
RAM: 32GB - DDR3
Hard Drive(s): 2x 1TB (HDD SATA)
Bandwidth: Unmetered @ 1Gbps

2021-06-11
- Optimized search, related recommendation query speed
- Solve the memory leak problem of dodder-torrent-download-service ...

Overall structure

Note: dht-server, download-service, and store-service can each be deployed as a cluster. dht-server crawls info_hash values from the DHT network and writes them to the Kafka message queue. download-service reads each info_hash and connects to the specified IP to download the metadata of the torrent file. When deploying a cluster, pay attention to the number of partitions of the Kafka topic: number of partitions >= number of service deployments. download-service parses the file information from the downloaded metadata, encapsulates it in a Torrent object, and writes it to the Kafka torrentMessages topic. store-service reads these Torrent objects and stores them in Elasticsearch.
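The partition rule above follows from how Kafka consumer groups work: within one group, each partition is consumed by at most one consumer, so instances beyond the partition count receive nothing. The sketch below illustrates this with a simplified round-robin assignment. It does not use the Kafka client library and is not Kafka's actual assignor; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration only: spreads partitions over consumers
// round-robin, so that each partition has exactly one owner.
class PartitionAssignmentSketch {

    // Returns, for each consumer index, the partitions it owns.
    static List<List<Integer>> assign(int partitions, int consumers) {
        if (partitions < 0 || consumers <= 0) {
            throw new IllegalArgumentException("need partitions >= 0 and consumers > 0");
        }
        List<List<Integer>> result = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            result.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            result.get(p % consumers).add(p);
        }
        return result;
    }

    // Counts consumers that ended up with no partition at all.
    static long idleConsumers(int partitions, int consumers) {
        return assign(partitions, consumers).stream()
                .filter(List::isEmpty)
                .count();
    }

    public static void main(String[] args) {
        // 3 partitions, 4 service instances: one instance stays idle.
        System.out.println(assign(3, 4));        // [[0], [1], [2], []]
        System.out.println(idleConsumers(3, 4)); // 1
        // 4 partitions, 2 service instances: every instance has work.
        System.out.println(assign(4, 2));        // [[0, 2], [1, 3]]
    }
}
```

With fewer partitions than service instances, the extra instances still run but sit idle, which is why the partition count should be at least the number of deployments.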

Deduplication: Redis performs the first deduplication pass. MongoDB and Elasticsearch then use upsert when inserting data, to prevent duplicate records.
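As a rough illustration of why upsert prevents duplicates, the sketch below uses an in-memory map keyed by info_hash in place of MongoDB or Elasticsearch. The `Torrent` fields and the `UpsertDedupSketch` class are assumptions made for this example; they do not reflect the project's real Torrent object or any database client API.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory stand-in for a document store that supports upsert.
class UpsertDedupSketch {

    // Hypothetical, reduced version of a torrent record.
    static final class Torrent {
        final String infoHash;
        final String name;

        Torrent(String infoHash, String name) {
            this.infoHash = infoHash;
            this.name = name;
        }
    }

    private final Map<String, Torrent> store = new HashMap<>();

    // Upsert: insert when the key is absent, overwrite when present.
    // Returns true if this call created a new record.
    boolean upsert(Torrent t) {
        return store.put(t.infoHash, t) == null;
    }

    int size() {
        return store.size();
    }

    Torrent get(String infoHash) {
        return store.get(infoHash);
    }

    public static void main(String[] args) {
        UpsertDedupSketch db = new UpsertDedupSketch();
        String hash = "0123456789abcdef0123456789abcdef01234567";
        System.out.println(db.upsert(new Torrent(hash, "first")));  // true
        System.out.println(db.upsert(new Torrent(hash, "second"))); // false
        System.out.println(db.size());                              // 1
    }
}
```

Because the key is the info_hash, a message that slips past the first deduplication pass, or is delivered twice, updates the existing record instead of creating a second one.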

deploy

After all the dependencies above are set up, clone the entire project locally. For a cluster deployment, modify the IP address parameters in each service module. I have a limited number of servers, so I only use one server for stand-alone deployment. If there is a problem with cluster deployment, please submit an issue.

Notice

dht-server needs a public IP address to crawl info_hash values.

