vgchartz-full-crawler.py is a Python 3 crawler script based on BeautifulSoup. It creates a CSV dataset with data on more than 57,000 games, collected from the VGChartz site.
The dataset is saved to the file specified in cfg/resources.json, by default "dataset/vgsales.csv".
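As a rough illustration of how the output path could be resolved, here is a minimal sketch. It assumes cfg/resources.json is a JSON object; the key name `output_file` and the helper `resolve_output_path` are hypothetical and not taken from the script, so check the actual file for the real key.

```python
import json
from pathlib import Path

# Default output path stated in this README.
DEFAULT_OUTPUT = "dataset/vgsales.csv"


def resolve_output_path(config_path="cfg/resources.json", key="output_file"):
    """Return the CSV path named in the config, or the default.

    NOTE: `key` is a hypothetical name used for illustration only;
    the real key in cfg/resources.json may differ.
    """
    path = Path(config_path)
    if not path.is_file():
        # No config file found: fall back to the documented default.
        return DEFAULT_OUTPUT
    with path.open(encoding="utf-8") as fh:
        config = json.load(fh)
    if not isinstance(config, dict):
        # Unexpected JSON shape: fall back to the documented default.
        return DEFAULT_OUTPUT
    return config.get(key, DEFAULT_OUTPUT)


if __name__ == "__main__":
    print(resolve_output_path())
```

Falling back to the default keeps the sketch usable even when the config file is missing or does not define the key.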
You will need the dependencies listed in requirements.txt.
They can be installed with pip.
# Install dependencies
$> pip install -r requirements.txt
# Run
$> python vgchartz-full-crawler.py
The dataset is composed of the following fields:
| Field | Description |
|---|---|
| Rank | Ranking of overall sales |
| Name | The game's name |
| Genre | Genre of the game |
| Platform | Platform of the game's release (e.g. PC, PS4) |
| Developer | Developer of the game |
| Publisher | Publisher of the game |
| Vgchartz_Score | Score on the VGChartz site |
| Critic_Score | Score given by critics |
| User_Score | Score given by users of the VGChartz site |
| Total_Shipped | Total worldwide shipments (in millions) |
| Total_Sales | Total worldwide sales (in millions) |
| NA_Sales | Sales in North America (in millions) |
| EU_Sales | Sales in Europe (in millions) |
| JP_Sales | Sales in Japan (in millions) |
| Other_Sales | Sales in the rest of the world (in millions) |
| Release_Date | Year of the game's release |
| Last_Update | Last update of this record |
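Once the crawler has run, the fields above can be read with the standard library. The following is a minimal sketch, assuming the CSV has a header row using the field names from the table and that missing sales values are blank or non-numeric; the function `total_sales_by_platform` is illustrative and not part of the crawler.

```python
import csv
import math
from collections import defaultdict


def total_sales_by_platform(csv_path="dataset/vgsales.csv"):
    """Sum Total_Sales (in millions) per Platform.

    Assumes a header row with the column names listed in the table above.
    Rows whose Total_Sales is blank or not a finite number are skipped.
    """
    totals = defaultdict(float)
    with open(csv_path, newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            raw = (row.get("Total_Sales") or "").strip()
            try:
                value = float(raw)
            except ValueError:
                continue  # blank or non-numeric value
            if not math.isfinite(value):
                continue  # ignore NaN / inf placeholders
            totals[row["Platform"]] += value
    return dict(totals)


if __name__ == "__main__":
    # Print platforms ordered by total sales, highest first.
    ranked = sorted(total_sales_by_platform().items(), key=lambda kv: kv[1], reverse=True)
    for platform, sales in ranked:
        print(f"{platform}: {sales:.2f}")
```

The same pattern works for NA_Sales, EU_Sales, JP_Sales, or Other_Sales by changing the column name.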
- Remap the columns according to the values selected in resources.json
- Add some unit testing
- Dockerize (w/ alpine-python) to ease use and avoid installations
- Publish on Docker Hub
Thanks to Chris Albon