Commit 6ae297f

2020 Update (#12)
* Use psycopg2-binary
* Update lxml to 4.5.2

  Allows to use wheel.
* Avoid confusion between libarchive and libarchive-c
* Install libarchive-c for downloader
* Drop distribute

  This project is merged with setuptools.
* Review README

  Document a quickstart setup first and then describe advanced usage for custom tables.
* Change the example to use a different DB name.

  Also, removed mention of unnecessary dependency which was installed for Python 2 support.
* Update requirements.txt

  Remove wsgiref which was required for Python 2 support.

Co-authored-by: Utkarsh Upadhyay <502876+musically-ut@users.noreply.github.com>
1 parent 491e552 commit 6ae297f

2 files changed: +63 -47 lines changed


README.md

Lines changed: 60 additions & 43 deletions
@@ -1,58 +1,74 @@
11
# StackOverflow data to postgres
22

3-
This is a quick script to move the Stackoverflow data from the [StackExchange data dump (Sept '14)](https://archive.org/details/stackexchange) to a Postgres SQL database.
4-
5-
Schema hints are taken from [a post on Meta.StackExchange](http://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede) and from [StackExchange Data Explorer](http://data.stackexchange.com).
6-
7-
## Dependencies
8-
9-
- [`lxml`](http://lxml.de/installation.html)
10-
- [`psycopg2`](http://initd.org/psycopg/docs/install.html)
11-
- [`libarchive-c`](https://pypi.org/project/libarchive-c/)
12-
13-
## Usage
14-
15-
- Create the database `stackoverflow` in your database: `CREATE DATABASE stackoverflow;`
16-
- You can use a custom database name as well. Make sure to explicitly give
17-
it while executing the script later.
18-
- Move the following files to the folder from where the program is executed:
19-
`Badges.xml`, `Votes.xml`, `Posts.xml`, `Users.xml`, `Tags.xml`.
20-
- In some old dumps, the cases in the filenames are different.
21-
- Execute in the current folder (in parallel, if desired):
22-
- `python load_into_pg.py -t Badges`
23-
- `python load_into_pg.py -t Posts`
24-
- `python load_into_pg.py -t Tags` (not present in earliest dumps)
25-
- `python load_into_pg.py -t Users`
26-
- `python load_into_pg.py -t Votes`
27-
- `python load_into_pg.py -t PostLinks`
28-
- `python load_into_pg.py -t PostHistory`
29-
- `python load_into_pg.py -t Comments`
30-
- Finally, after all the initial tables have been created:
31-
- `psql stackoverflow < ./sql/final_post.sql`
32-
- If you used a different database name, make sure to use that instead of
33-
`stackoverflow` while executing this step.
34-
- For some additional indexes and tables, you can also execute the the following;
35-
- `psql stackoverflow < ./sql/optional_post.sql`
36-
- Again, remember to user the correct database name here, if not `stackoverflow`.
37-
38-
## Loading a complete stackexchange project
39-
40-
You can use the script to download a given stackexchange compressed file from
3+
This is a quick script to move the Stackoverflow data from the [StackExchange
4+
data dump (Sept '14)](https://archive.org/details/stackexchange) to a Postgres
5+
SQL database.
6+
7+
Schema hints are taken from [a post on
8+
Meta.StackExchange](http://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede)
9+
and from [StackExchange Data Explorer](http://data.stackexchange.com).
10+
11+
## Quickstart
12+
13+
Install requirements, create a `stackoverflow` database, and use
14+
`load_into_pg.py` script:
15+
16+
``` console
17+
$ pip install -r requirements.txt
18+
...
19+
Successfully installed argparse-1.2.1 libarchive-c-2.9 lxml-4.5.2 psycopg2-binary-2.8.4 six-1.10.0
20+
$ createdb beerSO
21+
$ python load_into_pg.py -s beer -d beerSO
22+
```
23+
24+
This will download compressed files from
4125
[archive.org](https://ia800107.us.archive.org/27/items/stackexchange/) and load
42-
all the tables at once, using the `-s` switch.
26+
all the tables at once.
27+
28+
29+
## Advanced Usage
30+
31+
You can use a custom database name as well. Make sure to explicitly give it
32+
while executing the script later.
33+
34+
Each table data is archived in an XML file. Available tables varies accross
35+
history. `load_into_pg.py` knows how to handle the following tables:
4336

44-
You will need the `urllib` and `libarchive` modules.
37+
- `Badges`.
38+
- `Posts`.
39+
- `Tags` (not present in earliest dumps).
40+
- `Users`.
41+
- `Votes`.
42+
- `PostLinks`.
43+
- `PostHistory`.
44+
- `Comments`.
45+
46+
You can download manually the files to the folder from where the program is
47+
executed: `Badges.xml`, `Votes.xml`, `Posts.xml`, `Users.xml`, `Tags.xml`. In
48+
some old dumps, the cases in the filenames are different.
49+
50+
Then load each file with e.g. `python load_into_pg.py -t Badges`.
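
Review note, not part of this commit: the per-table step described above could be scripted roughly as follows. The `-t` and `-d` switches are the ones shown in this README; the helper name `build_load_commands`, the choice to combine `-t` with `-d`, and the `beerSO` database name in the usage part are assumptions for illustration only.

```python
import shlex
import sys

# Tables the README says load_into_pg.py knows how to handle.
TABLES = ["Badges", "Posts", "Tags", "Users", "Votes",
          "PostLinks", "PostHistory", "Comments"]


def build_load_commands(tables, dbname):
    """Return one argv list per table (hypothetical helper, not in the repo)."""
    return [
        [sys.executable, "load_into_pg.py", "-t", table, "-d", dbname]
        for table in tables
    ]


if __name__ == "__main__":
    for argv in build_load_commands(TABLES, "beerSO"):
        # Only print the commands here; to actually load a table, pass argv
        # to subprocess.run(argv, check=True) instead.
        print(shlex.join(argv))
```

Printing rather than executing keeps the sketch safe to run without a database; tables missing from an older dump (such as `Tags`) would have to be dropped from the list first.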
51+
52+
After all the initial tables have been created:
53+
54+
``` console
55+
$ psql beerSO < ./sql/final_post.sql
56+
```
57+
58+
For some additional indexes and tables, you can also execute the the following;
59+
60+
``` console
61+
$ psql beerSO < ./sql/optional_post.sql
62+
```
4563

4664
If you give a schema name using the `-n` switch, all the tables will be moved
4765
to the given schema. This schema will be created in the script.
4866

49-
To load the _dba.stackexchange.com_ project in the `dba` schema, you would execute:
50-
`./load_into_pg.py -s dba -n dba`
51-
5267
The paths are not changed in the final scripts `sql/final_post.sql` and
5368
`sql/optional_post.sql`. To run them, first set the _search_path_ to your
5469
schema name: `SET search_path TO <myschema>;`
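
Review note, not part of this commit: one way to apply the `search_path` advice is to prefix the SQL script with the `SET` statement before piping it to `psql`. The sketch below assumes the schema name is a plain unquoted identifier; `with_search_path` and the `dba` schema name are hypothetical and used only for illustration.

```python
from pathlib import Path


def with_search_path(sql_text, schema):
    """Prefix a SQL script with a SET search_path statement.

    Hypothetical helper. The schema name is inserted unquoted, so anything
    that is not a simple identifier is rejected rather than escaped.
    """
    if not schema.isidentifier():
        raise ValueError(f"unsupported schema name: {schema!r}")
    return f"SET search_path TO {schema};\n{sql_text}"


if __name__ == "__main__":
    script = Path("sql/final_post.sql")
    if script.exists():
        # Emit the combined script on stdout so it can be piped into psql.
        print(with_search_path(script.read_text(), "dba"))
```

Because the statement and the script then run in the same `psql` session, the setting is still in effect when the script's statements execute.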
5570

71+
5672
## Caveats and TODOs
5773

5874
- It prepares some indexes and views which may not be necessary for your analysis.
@@ -68,3 +84,4 @@ schema name: `SET search_path TO <myschema>;`
6884
## Acknowledgement
6985

7086
[@madtibo](https://github.com/madtibo) made significant contributions by adding `jsonb` and Foreign Key support.
87+
[@bersace](https://github.com/bersace) brought the dependencies and the `README.md` instructions into 2020.

requirements.txt

Lines changed: 3 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,5 @@
11
argparse==1.2.1
2-
distribute==0.6.24
3-
lxml==3.4.1
4-
psycopg2==2.5.4
5-
wsgiref==0.1.2
2+
libarchive-c==2.9
3+
lxml==4.5.2
4+
psycopg2-binary==2.8.4
65
six==1.10.0
