For the last four years I have been deeply involved in programming, data wrangling, and algorithm development across different fields of application. Normally, projects of this type, if successful, result in scientific presentations and sometimes spark ideological fights over the validity and usefulness of the proposed methods!
However, since finishing my PhD I have decided to take a different, more pragmatic position, and I became extremely interested in web-application development. As my first ever application, over the last two and a half months I dedicated myself completely to developing a web application for modeling streams of text news: the application collects a large number of news items (about a specific topic) and then categorizes them by their semantic meaning. Of course, this is not a new idea, and there are already several Social Listening (more) and Social Analytics (fewer) applications out there. However, I think our text-modeling methodology is new. During my PhD studies I had sketched a text-modeling framework based on Markov Chains (more conceptually) and Self-Organizing Maps (closer in spirit to the WEBSOM algorithm), and I had also found the Word2Vec algorithm extremely interesting. Nevertheless, I never had time to focus on this project.
Since I am comfortable with Python, I started learning Flask, an amazingly simple microframework for web applications, and of course D3.js for web visualization. For the text-processing part, I relied on the beautiful gensim library: it has many (!) text-processing functionalities and, more importantly, a very nice Python implementation of the Word2Vec algorithm. In this post I won't go further into the methodology.
In the current version, the user has two options for collecting news feeds. The first is to enter the Twitter account name of a specific public ID. The system then creates a list of all the public accounts that this user follows (the friend list) and fetches the public tweets of each account in that list (between the last 200 and 3,200 tweets per account).
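The fetch step above boils down to paging backwards through each friend's timeline, 200 tweets at a time, up to the 3,200-tweet historical limit of the Twitter API. Here is a hedged sketch of that paging logic; `get_page` stands in for a real client call (e.g. a tweepy timeline request), injected as a function so the logic can run without network access:

```python
# Page backwards through one user's timeline `page_size` tweets at a
# time, stopping at `limit` (Twitter's historical per-user cap is 3,200).
def fetch_user_tweets(get_page, user_id, page_size=200, limit=3200):
    """Collect up to `limit` tweets for one user, `page_size` at a time."""
    tweets, max_id = [], None
    while len(tweets) < limit:
        page = get_page(user_id, count=page_size, max_id=max_id)
        if not page:
            break
        tweets.extend(page)
        # Next request asks only for tweets older than the last one seen.
        max_id = page[-1]["id"] - 1
    return tweets[:limit]

# Fake "API" for demonstration: 500 tweets with descending ids.
fake_store = [{"id": 1000 - i, "text": f"tweet {i}"} for i in range(500)]

def fake_get_page(user_id, count, max_id):
    eligible = [t for t in fake_store if max_id is None or t["id"] <= max_id]
    return eligible[:count]

all_tweets = fetch_user_tweets(fake_get_page, user_id=42)
print(len(all_tweets))  # 500 — the fake user only has 500 tweets
```

Running this over every account in the friend list gives the raw corpus the rest of the pipeline works on.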
The second option is to search for a specific keyword (e.g. oil price) within a specific time period. In this case we use the Twitter search API (I tried Google CSE too, but it is a bit limited in comparison). In future versions it might be better to use the Twitter streaming API, but that depends on whether there is demand for it.
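Conceptually, this second mode is just a keyword match restricted to a date window. The app delegates that to the Twitter search API, but a dependency-free illustration of the same idea over an already-fetched list of tweets looks like this (the tweets and dates are made up):

```python
# Filter tweets by keyword and date window — a local stand-in for what
# the Twitter search API does server-side.
from datetime import date

def search(tweets, keyword, start, end):
    kw = keyword.lower()
    return [t for t in tweets
            if kw in t["text"].lower() and start <= t["date"] <= end]

tweets = [
    {"text": "Oil price jumps", "date": date(2015, 6, 1)},
    {"text": "Oil price steady", "date": date(2015, 8, 1)},
    {"text": "Unrelated news",   "date": date(2015, 6, 15)},
]
hits = search(tweets, "oil price", date(2015, 5, 1), date(2015, 6, 30))
print(len(hits))  # 1 — only the June "Oil price" tweet matches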
After downloading the selected number of tweets, the system analyzes and groups them by topic. If the full-text option is requested, the system identifies the links (if any) in the tweets (using the twitter-text-python library) and, in a separate process, collects the full texts from those links (using the jusText library). In that case the cleaned full texts are the basis for grouping the news; otherwise the tweet text itself is.
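The link-extraction step can be illustrated without the real dependencies. The app itself uses twitter-text-python to parse tweet entities (and jusText to strip boilerplate from the fetched pages); here a plain regex stands in for that parser so the snippet stays self-contained:

```python
# Simplified stand-in for the link-extraction step: map each tweet to
# the URLs it contains. The app uses twitter-text-python for this; a
# regex is enough to show the shape of the data flowing onward.
import re

URL_RE = re.compile(r"https?://\S+")

def extract_links(tweets):
    """Map each tweet id to the list of URLs found in its text."""
    return {t["id"]: URL_RE.findall(t["text"]) for t in tweets}

tweets = [
    {"id": 1, "text": "Oil prices fall http://example.com/oil-story"},
    {"id": 2, "text": "No link in this one"},
]
links = extract_links(tweets)
print(links[1])  # ['http://example.com/oil-story']
print(links[2])  # []
```

Each extracted URL is then fetched in a separate process and cleaned, and those cleaned full texts replace the short tweet text as input to the grouping step.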
Finally, after data cleaning, the algorithm categorizes the news into several clusters and uses a word cloud to highlight the main keywords within each cluster, followed by the original tweets.
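The clustering algorithm itself is out of scope for this post, but the keyword-highlighting step is easy to sketch. Assuming each news item has already been assigned a cluster label (by whatever method), this counts the most frequent non-stopword terms per cluster, which is exactly what the word cloud visualizes:

```python
# Surface the most frequent keywords per cluster, given pre-assigned
# cluster labels (the clustering itself is not shown here).
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "to", "in", "of", "and", "after"}

def top_keywords(docs, labels, k=3):
    """docs: list of texts; labels: cluster id per doc.
    Returns {cluster_id: [(word, count), ...]} for the k top words."""
    counts = defaultdict(Counter)
    for text, label in zip(docs, labels):
        words = [w for w in text.lower().split() if w not in STOPWORDS]
        counts[label].update(words)
    return {label: c.most_common(k) for label, c in counts.items()}

# Toy corpus: two obvious topics with hand-assigned labels.
docs = [
    "oil price falls after the opec meeting",
    "opec meeting pushes oil price down",
    "new phone released in the spring event",
    "spring event reveals a new phone lineup",
]
labels = [0, 0, 1, 1]
keywords = top_keywords(docs, labels)
```

Feeding these per-cluster counts to a word-cloud renderer gives the summary view shown above each group of original tweets.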
OK, without further explanation, I would like to invite you to take a look at the current version of the web app, which I have deployed on Red Hat OpenShift servers. Since it is not a scalable application, I am not sure what happens if several people use it at the same time.
Your feedback is highly appreciated.