The workflow follows the algorithm suggested in the paper and has the following stages:
Get rid of cruft in the input data:
- drop empty text fields
- require at least 20 characters of text
- remove unprintable Unicode characters
- filter for English language using Google's
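The first three cleaning rules can be sketched with the standard library alone. This is a minimal illustration, not the repository's code: the names `MIN_CHARS`, `strip_unprintable` and `clean_texts` are hypothetical, the order of the checks (strip first, then measure length) is an assumption, and the language filter is left out because it depends on an external detector.

```python
from typing import Iterable, List

MIN_CHARS = 20  # minimum text length from the list above


def strip_unprintable(text: str) -> str:
    """Drop characters Python treats as unprintable, keeping newlines and tabs."""
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")


def clean_texts(texts: Iterable[object]) -> List[str]:
    """Return only the texts that survive the cleaning rules (hypothetical helper)."""
    cleaned = []
    for raw in texts:
        if not isinstance(raw, str):
            continue  # assumption: missing values such as None count as empty fields
        text = strip_unprintable(raw).strip()
        if len(text) < MIN_CHARS:
            continue  # covers both empty fields and texts that are too short
        cleaned.append(text)
    return cleaned


samples = ["", None, "too short", "This review is long\u200b enough to keep."]
result = clean_texts(samples)
```

Here `result` holds a single entry, the last sample with its zero-width space removed.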
Generate aspects (
Extracts promising phrases (i.e., nouns described by adjectives) using
Aggregate aspects into topics (
Takes the output of the phrase extraction, maps them to
_disambiguate.py) and produces the list of clustered aspects
- networkx for the semantic tree
- pretrained word-vectors (via
- vaderSentiment for sentiment analysis
Analyze descriptors (
Cluster the associated adjectives using constant radius clustering.
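As an illustration of what clustering with a constant radius can look like, the sketch below uses a greedy scheme: each adjective joins the first cluster whose center lies within a fixed distance, otherwise it starts a new cluster. This is one plausible reading of the technique and may differ from the actual implementation. The function names, the Euclidean metric, the toy 2-D vectors and the radius value are all assumptions made for the example.

```python
import math
from typing import Dict, List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def constant_radius_clusters(
    vectors: Dict[str, Sequence[float]], radius: float
) -> List[List[str]]:
    """Greedy fixed-radius clustering (hypothetical sketch).

    The first member of each cluster serves as its center; the center is
    never updated, so the result depends on the input order.
    """
    centers: List[Sequence[float]] = []
    clusters: List[List[str]] = []
    for word, vec in vectors.items():
        for i, center in enumerate(centers):
            if euclidean(vec, center) <= radius:
                clusters[i].append(word)
                break
        else:
            # No existing center is close enough: open a new cluster.
            centers.append(vec)
            clusters.append([word])
    return clusters


# Toy 2-D vectors standing in for real word embeddings.
toy_vectors = {
    "great": (1.0, 0.0),
    "awesome": (0.9, 0.1),
    "slow": (-1.0, 0.2),
    "sluggish": (-0.9, 0.3),
}
adjective_clusters = constant_radius_clusters(toy_vectors, radius=0.5)
```

With these toy vectors, "great" and "awesome" end up in one cluster and "slow" and "sluggish" in another. With real embeddings, the radius would need tuning to the vector space in use.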
Link information (
To make the output more useful, we want to link the topics back to the original texts and vice versa.
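A two-way link of this kind can be kept as a pair of lookup tables. The sketch below assumes the earlier stages yield `(text_id, topic)` pairs; that input shape and the name `build_links` are hypothetical and only serve to illustrate the idea.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_links(
    pairs: Iterable[Tuple[int, str]]
) -> Tuple[Dict[str, Set[int]], Dict[int, Set[str]]]:
    """Build topic -> text ids and text id -> topics lookups (hypothetical helper)."""
    topic_to_texts: Dict[str, Set[int]] = defaultdict(set)
    text_to_topics: Dict[int, Set[str]] = defaultdict(set)
    for text_id, topic in pairs:
        topic_to_texts[topic].add(text_id)
        text_to_topics[text_id].add(topic)
    # Convert to plain dicts so that missing keys raise instead of silently appearing.
    return dict(topic_to_texts), dict(text_to_topics)


# Toy input: text 0 mentions two topics, text 1 mentions one.
pairs = [(0, "battery"), (0, "screen"), (1, "battery")]
topic_to_texts, text_to_topics = build_links(pairs)
```

Looking up `topic_to_texts["battery"]` then gives the ids of all texts mentioning that topic, and `text_to_topics[0]` gives all topics found in the first text.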
Running the whole code produces one CSV file.