Using techniques learned in the Parallel Computing course at Universidad EAFIT, the processing phase of a Text Analysis app for Information Retrieval and Clustering was implemented over the Project Gutenberg dataset (eBooks in English and Spanish).
- While processing every document of the Project Gutenberg dataset, each word is converted to lowercase, checked against the Spanish and English stopword lists (stopwords-json), stripped of accents, and cleared of special characters (see the filtering sketch after this list).
- After that filtering, a word is only selected to be part of the processing if it is one of the 100 main words.
- The `main_words.py` file is used to find the 100 main words across the whole dataset.
- The `norm_term_frequency.py` file counts the number of occurrences of every main word in every document and normalizes that count by the document's total number of terms.
- The `inverted_document_frequency.py` file calculates a value for each main word using the formula log(TOTAL_NUMB_OF_DOCUMENTS / NUMB_OF_DOCUMENTS_CONTAINING_THE_WORD).
- The `documents_containing_word.py` file builds, for every main word, a list of the documents where that word appears, sorted in descending order by number of occurrences.
- The `pig/calculate_tfidf.pig` file joins the outputs of the `norm_term_frequency.py` and `inverted_document_frequency.py` files by word and calculates the TF-IDF value for each document-word pair (see the TF-IDF sketch after this list).
- The `pig/calculate_magnitude.pig` file uses the output of the `pig/calculate_tfidf.pig` script to calculate the magnitude of every document from its TF-IDF values.
- The `zeppelin/calculate_similarity_btw_docs.scala` file uses the outputs of the `pig/calculate_tfidf.pig` and `pig/calculate_magnitude.pig` scripts to calculate the cosine similarity between every pair of documents, which is what determines the documents most related to a given one (see the similarity sketch after this list).
- The `idf_by_word.py` file is used to transfer the output of the `inverted_document_frequency.py` file to a remote Redis database.
- The `word_in_docs.py` file is used to transfer the output of the `documents_containing_word.py` file to a remote Redis database.
- The `tfidf_for_word_in_docs.py` file is used to transfer the output of the `pig/calculate_tfidf.pig` script to a remote Redis database.
- The `magnitude_by_doc.py` file is used to transfer the output of the `pig/calculate_magnitude.pig` script to a remote Redis database.
- The `similarity_doc.py` file is used to transfer the output of the `zeppelin/calculate_similarity_btw_docs.scala` file to a remote Redis database.
- The remote Redis database is accessed by a web application (gutenberg-web) that was designed for the online part of this Information Retrieval and Clustering System.
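The per-word filtering and the selection of the main words can be summarized with the sketch below. It is only illustrative: the stopword set shown is a tiny sample (the project loads the real English/Spanish lists from stopwords-json), the function names are hypothetical, and it assumes that "main words" means the 100 most frequent filtered words across the corpus, as `main_words.py` computes them.

```python
# Illustrative sketch of the word-filtering step (not the exact MRJob code).
import re
import unicodedata
from collections import Counter

# Tiny sample; the real lists come from stopwords-json.
STOPWORDS = {"the", "and", "of", "el", "la", "de", "que", "y"}

def clean_word(raw):
    """Lowercase, strip accents and delete special characters."""
    word = unicodedata.normalize("NFKD", raw.lower())
    word = "".join(ch for ch in word if not unicodedata.combining(ch))
    return re.sub(r"[^a-z]", "", word)

def filtered_words(text):
    """Yield the cleaned words of a document that are not stopwords."""
    for raw in text.split():
        word = clean_word(raw)
        if word and word not in STOPWORDS:
            yield word

def main_words(all_documents_text):
    """The 100 most frequent filtered words across the whole dataset."""
    counts = Counter(w for text in all_documents_text for w in filtered_words(text))
    return [word for word, _ in counts.most_common(100)]
```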
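The numbers produced by `norm_term_frequency.py`, `inverted_document_frequency.py` and `pig/calculate_tfidf.pig` correspond to the usual TF-IDF definitions; here is a small Python sketch of the formulas (the function names are illustrative, not the actual job code):

```python
# Illustrative TF-IDF formulas used by the jobs described above.
import math
from collections import Counter

def norm_term_frequency(doc_words, main_words):
    """Occurrences of each main word divided by the document's total number of terms."""
    counts = Counter(doc_words)
    total_terms = len(doc_words)
    return {w: counts[w] / total_terms for w in main_words}

def inverted_document_frequency(total_docs, docs_containing_word):
    """log(TOTAL_NUMB_OF_DOCUMENTS / NUMB_OF_DOCUMENTS_CONTAINING_THE_WORD)."""
    return math.log(total_docs / docs_containing_word)

def tfidf(norm_tf, idf):
    """TF-IDF for a document-word pair, as joined by word in calculate_tfidf.pig."""
    return norm_tf * idf
```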
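Likewise, the magnitude computed by `pig/calculate_magnitude.pig` and the cosine similarity computed in `zeppelin/calculate_similarity_btw_docs.scala` follow the standard vector-space formulas. A small Python sketch (the actual implementations are in Pig and Scala, and represent each document as its TF-IDF values over the 100 main words):

```python
# Illustrative magnitude and cosine-similarity formulas; each document is a
# {word: tfidf} dictionary over the 100 main words.
import math

def magnitude(doc_tfidf):
    """Euclidean norm of a document's TF-IDF vector."""
    return math.sqrt(sum(v * v for v in doc_tfidf.values()))

def cosine_similarity(doc_a, doc_b):
    """Dot product over the shared words, divided by both magnitudes."""
    dot = sum(doc_a[w] * doc_b[w] for w in set(doc_a) & set(doc_b))
    denom = magnitude(doc_a) * magnitude(doc_b)
    return dot / denom if denom else 0.0
```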
The following section assumes that the whole dataset and the Python, Pig and Scala files reside on an environment with Hortonworks 2.5 installed, which makes it possible to execute the implemented jobs on Hadoop.
The dataset must be in HDFS, and the path `/user/asanch41/data_out/` must exist there (it can be created as shown below).
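If the output path does not exist yet, it can be created with the standard HDFS shell, for example:
$ hdfs dfs -mkdir -p /user/asanch41/data_out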
The current directory on the terminal should be where the mentioned Python files are located.
- Execute the `norm_term_frequency.py` file by typing on the terminal:
$ python norm_term_frequency.py hdfs://path/to/all/documents/*.txt -r hadoop --output-dir hdfs:///user/asanch41/data_out/norm_term_frequency
- Execute the `inverted_document_frequency.py` file by typing on the terminal:
$ python inverted_document_frequency.py hdfs://path/to/all/documents/*.txt -r hadoop --output-dir hdfs:///user/asanch41/data_out/inverted_document_frequency
- Execute the `documents_containing_word.py` file by typing on the terminal:
$ python documents_containing_word.py hdfs://path/to/all/documents/*.txt -r hadoop --output-dir hdfs:///user/asanch41/data_out/documents_containing_word
- Go to the Hive view on your Hortonworks environment, create a database named `asanch41`, select it, and then execute the following two queries:
CREATE EXTERNAL TABLE hive_doc_name_word_tfidf (doc_name STRING, word STRING, tfidf FLOAT);
CREATE EXTERNAL TABLE hive_doc_magnitude (doc_name STRING, magnitude DOUBLE);
This step is important because the `pig/calculate_tfidf.pig` and `pig/calculate_magnitude.pig` scripts will use the created tables to store their output, which is then read by the `zeppelin/calculate_similarity_btw_docs.scala` file during its execution.
- Go to the Pig view on your Hortonworks environment, create two scripts with the contents of the `pig/calculate_tfidf.pig` and `pig/calculate_magnitude.pig` files, and execute them from there in that order.
- Go to the Zeppelin Notebook on your Hortonworks environment, create a new note by pasting the content of the `zeppelin/calculate_similarity_btw_docs.scala` file, and run it from there.
- When you complete all the previous steps, execute the following commands to send the processed data to the Azure Redis DB that serves the Gutenberg Information Retrieval application:
$ python idf_by_word.py hdfs:///user/asanch41/data_out/inverted_document_frequency/part-00000 -r hadoop --output-dir hdfs:///user/asanch41/data_out/inverted_document_frequency_redis
$ python word_in_docs.py hdfs:///user/asanch41/data_out/documents_containing_word/part-00000 -r hadoop --output-dir hdfs:///user/asanch41/data_out/documents_containing_word_redis
$ python tfidf_from_word_in_docs.py hdfs:///user/asanch41/data_out/doc_name_word_tfidf/part-r-00000 -r hadoop --output-dir hdfs:///user/asanch41/data_out/doc_name_word_tfidf_redis
$ python magnitude_by_doc.py hdfs:///user/asanch41/data_out/doc_magnitude/part-r-00000 -r hadoop --output-dir hdfs:///user/asanch41/data_out/doc_magnitude_redis
$ python similarity_doc.py hdfs:///user/asanch41/data_out/similarity_btw_docs/part-00000 -r hadoop --output-dir hdfs:///user/asanch41/data_out/similarity_btw_docs_redis
- Finally, open the Gutenberg Information Retrieval application on the web. Enter a query and look for documents and their communities.