2020 Update (#12)

bersace · musically-ut · web-flow · commit 6ae297f048ac · 2020-08-30T20:04:13.000+02:00
* Use psycopg2-binary

* Update lxml to 4.5.2

Allows installing from a wheel.

* Avoid confusion between libarchive and libarchive-c

* Install libarchive-c for downloader

* Drop distribute

The distribute project has been merged into setuptools.

* Review README

Document a quickstart setup first and then describe advanced usage for
custom tables.

* Change the example to use a different DB name.

Also, remove the mention of an unnecessary dependency that was installed only for Python 2 support.

* Update requirements.txt

Remove wsgiref which was required for Python 2 support.

Co-authored-by: Utkarsh Upadhyay <502876+musically-ut@users.noreply.github.com>
diff --git a/README.md b/README.md
@@ -1,58 +1,74 @@
 # StackOverflow data to postgres
 
-This is a quick script to move the Stackoverflow data from the [StackExchange data dump (Sept '14)](https://archive.org/details/stackexchange) to a Postgres SQL database.
-
-Schema hints are taken from [a post on Meta.StackExchange](http://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede) and from [StackExchange Data Explorer](http://data.stackexchange.com).
-
-## Dependencies
-
- - [`lxml`](http://lxml.de/installation.html)
- - [`psycopg2`](http://initd.org/psycopg/docs/install.html)
- - [`libarchive-c`](https://pypi.org/project/libarchive-c/)
-
-## Usage
-
- - Create the database `stackoverflow` in your database: `CREATE DATABASE stackoverflow;`
-   - You can use a custom database name as well. Make sure to explicitly give
-     it while executing the script later.
- - Move the following files to the folder from where the program is executed:
-   `Badges.xml`, `Votes.xml`, `Posts.xml`, `Users.xml`, `Tags.xml`.
-   - In some old dumps, the cases in the filenames are different.
- - Execute in the current folder (in parallel, if desired):
-   - `python load_into_pg.py -t Badges`
-   - `python load_into_pg.py -t Posts`
-   - `python load_into_pg.py -t Tags` (not present in earliest dumps)
-   - `python load_into_pg.py -t Users`
-   - `python load_into_pg.py -t Votes`
-   - `python load_into_pg.py -t PostLinks`
-   - `python load_into_pg.py -t PostHistory`
-   - `python load_into_pg.py -t Comments`
- - Finally, after all the initial tables have been created:
-   - `psql stackoverflow < ./sql/final_post.sql`
-   - If you used a different database name, make sure to use that instead of
-     `stackoverflow` while executing this step.
- - For some additional indexes and tables, you can also execute the the following;
-   - `psql stackoverflow < ./sql/optional_post.sql`
-   - Again, remember to user the correct database name here, if not `stackoverflow`.
-
-## Loading a complete stackexchange project
-
-You can use the script to download a given stackexchange compressed file from
+This is a quick script to move the Stackoverflow data from the [StackExchange
+data dump (Sept '14)](https://archive.org/details/stackexchange) to a Postgres
+SQL database.
+
+Schema hints are taken from [a post on
+Meta.StackExchange](http://meta.stackexchange.com/questions/2677/database-schema-documentation-for-the-public-data-dump-and-sede)
+and from [StackExchange Data Explorer](http://data.stackexchange.com).
+
+## Quickstart
+
+Install the requirements, create a database (`beerSO` in this example), and
+run the `load_into_pg.py` script:
+
+``` console
+$ pip install -r requirements.txt
+...
+Successfully installed argparse-1.2.1 libarchive-c-2.9 lxml-4.5.2 psycopg2-binary-2.8.4 six-1.10.0
+$ createdb beerSO
+$ python load_into_pg.py -s beer -d beerSO
+```
+
+This will download compressed files from
 [archive.org](https://ia800107.us.archive.org/27/items/stackexchange/) and load
-all the tables at once, using the `-s` switch.
+all the tables at once.
+
+
+## Advanced Usage
+
+You can use a custom database name as well. Make sure to pass it explicitly
+with the `-d` switch when executing the script.
+
+Each table's data is archived in an XML file. The available tables vary across
+dumps. `load_into_pg.py` knows how to handle the following tables:
 
-You will need the `urllib` and `libarchive` modules.
+- `Badges`.
+- `Posts`.
+- `Tags` (not present in earliest dumps).
+- `Users`.
+- `Votes`.
+- `PostLinks`.
+- `PostHistory`.
+- `Comments`.
+
+You can manually download the files to the folder from which the program is
+executed: `Badges.xml`, `Votes.xml`, `Posts.xml`, `Users.xml`, `Tags.xml`. In
+some old dumps, the letter case of the filenames is different.
+
+Then load each file with e.g. `python load_into_pg.py -t Badges`.
+
+After all the initial tables have been created:
+
+``` console
+$ psql beerSO < ./sql/final_post.sql
+```
+
+For some additional indexes and tables, you can also execute the following:
+
+``` console
+$ psql beerSO < ./sql/optional_post.sql
+```
 
 If you give a schema name using the `-n` switch, all the tables will be moved
 to the given schema. This schema will be created in the script.
 
-To load the _dba.stackexchange.com_ project in the `dba` schema, you would execute:
-`./load_into_pg.py -s dba -n dba`
-
 The paths are not changed in the final scripts `sql/final_post.sql` and
 `sql/optional_post.sql`. To run them, first set the _search_path_ to your
 schema name: `SET search_path TO <myschema>;`
 
+
 ## Caveats and TODOs
 
  - It prepares some indexes and views which may not be necessary for your analysis.
@@ -68,3 +84,4 @@ schema name: `SET search_path TO <myschema>;`
 ## Acknowledgement
 
 [@madtibo](https://github.com/madtibo) made significant contributions by adding `jsonb` and Foreign Key support.
+[@bersace](https://github.com/bersace) brought the dependencies and the `README.md` instructions into 2020.
diff --git a/requirements.txt b/requirements.txt
@@ -1,6 +1,5 @@
 argparse==1.2.1
-distribute==0.6.24
-lxml==3.4.1
-psycopg2==2.5.4
-wsgiref==0.1.2
+libarchive-c==2.9
+lxml==4.5.2
+psycopg2-binary==2.8.4
 six==1.10.0
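
Note on the "Advanced Usage" section: the README now gives a single example (`python load_into_pg.py -t Badges`) and leaves the other tables to the reader. A minimal sketch of how the per-table invocations could be generated is below. The helper names `build_commands` and `load_all` are hypothetical and not part of this repository, and the sketch assumes the `-d` switch shown in the quickstart can be combined with `-t`, which the README implies but does not show.

```python
import shlex
import subprocess
import sys

# Tables the README lists as handled by load_into_pg.py.
TABLES = [
    "Badges", "Posts", "Tags", "Users",
    "Votes", "PostLinks", "PostHistory", "Comments",
]


def build_commands(dbname, tables=TABLES, script="load_into_pg.py"):
    """Return one argv list per table.

    Assumption: `-d <dbname>` may be combined with `-t <Table>`.
    """
    return [[sys.executable, script, "-t", table, "-d", dbname]
            for table in tables]


def load_all(dbname, dry_run=True):
    """Print each command and, unless dry_run is set, execute it."""
    for argv in build_commands(dbname):
        print(" ".join(shlex.quote(arg) for arg in argv))
        if not dry_run:
            # check=True stops at the first table that fails to load.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run only: prints the commands without touching any database.
    load_all("beerSO", dry_run=True)
```

Running the commands sequentially keeps failures easy to attribute to a table. The `sql/final_post.sql` step still has to be run afterwards, as described in the README.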