27/03/2021

[Python] Sentiment analysis with NLTK

NLTK is a powerful Python library for text analysis. A common use case is running sentiment analysis on a piece of text.


While the configuration can go very deep, here is a sample application to analyse the sentiment of online posts written in English. The first time you execute your program, NLTK will complain about missing data and tell you what to download; grab the datasets that best fit your use case.
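
For this example the required datasets are the stopwords corpus and the VADER lexicon, so the one-time download boils down to:

 import nltk
 
 #one-time download of the data used in the snippet below
 nltk.download("stopwords")
 nltk.download("vader_lexicon")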

 import nltk  
 from nltk.sentiment import SentimentIntensityAnalyzer  
   
 #common words that have no impact on the overall analysis  
 stopwords = nltk.corpus.stopwords.words("english")  
 sia = SentimentIntensityAnalyzer()  
   
 #reduce the STRING text to only its relevant words  
 text = ""  
 for word in STRING.split():  
  #stopwords are all provided to us in lowercase  
  if word.lower() not in stopwords:  
   text += word + " "  
   
 #get overall sentiment in this post  
 sentiment = sia.polarity_scores(text)["compound"]  
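
The compound score ranges from -1 (most negative) to +1 (most positive). A common convention, taken from the VADER documentation, is to treat values above 0.05 as positive and below -0.05 as negative:

 #classify the post with the usual VADER thresholds
 if sentiment >= 0.05:
  label = "positive"
 elif sentiment <= -0.05:
  label = "negative"
 else:
  label = "neutral"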


An honorable mention goes to the pyenchant library, which uses Enchant spellchecker dictionaries and lets you check whether a string is a valid word in a given language. Additionally, the match ignores unusual casing:

 import enchant  
   
 english_words = enchant.Dict("en_US")  
 if english_words.check(STRING):  
  #this is an English word  
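
For instance, it can complement the stopword filtering above by keeping only tokens the en_US dictionary recognizes (a sketch reusing the english_words object and the text variable built above):

 #keep only the tokens recognized as English words
 english_only = " ".join(w for w in text.split() if english_words.check(w))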

[Python] Access Twitter tweets with Tweepy

To programmatically access Twitter tweets, you can use Python's Tweepy library.


You must log in to Twitter, then apply for developer access. The process takes a while because you have to wait for Twitter to approve your request (and possibly reply to their emails with additional information), but for academic/testing purposes approval generally takes about a week.


Once you're set, it is simply a matter of:

 import tweepy  
   
 twitter_auth = tweepy.AppAuthHandler(TWITTER_KEY, TWITTER_SECRET)  
 #twitter limits frequency of polls, we need to slow down automatically. You have about 900 requests/15 minutes  
 twitter_api = tweepy.API(twitter_auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)  
   
 #the standard search only covers roughly the last week of tweets; the optional since_id is the ID of the newest tweet you already have, only tweets more recent than it will be returned  
 #we are preparing the request here, but not yet executing it  
 cur = None  
 if since_id is None:  
  cur = tweepy.Cursor(twitter_api.search, count=TWEET_POLL_LIMIT, include_entities=False, result_type="recent", q=YOUR_HASHTAGS)  
 else:  
  cur = tweepy.Cursor(twitter_api.search, count=TWEET_POLL_LIMIT, include_entities=False, result_type="recent", q=YOUR_HASHTAGS, since_id=since_id)  
   
 #we retrieve the tweets here, we can optionally limit the total we retrieve. If we do not limit ourselves and go over our quota, we are stalled until the next window is available to us  
 #the config we did on the twitter_api object will handle this automatically for us (wait_on_rate_limit) and notify us as well (wait_on_rate_limit_notify)  
 tweets = None  
 if TWEET_LIMIT == 0:  
  tweets = cur.items()  
 else:  
  tweets = cur.items(TWEET_LIMIT)  
    
 for tweet in tweets:  
  #do something  

 

Some things to note:

Your query rate will be limited: the free tier currently allows about 900 requests per 15-minute window. If you go over that quota, your app will stall and can only resume after the quota resets.
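
If you prefer to handle the quota yourself instead of relying on wait_on_rate_limit, a minimal sketch for Tweepy 3.x (the version these snippets target, where RateLimitError is raised once the window is exhausted) could look like:

 import time
 import tweepy
 
 def limited(cursor):
  #yield tweets one by one, sleeping out the 15-minute window when the quota runs out
  while True:
   try:
    yield next(cursor)
   except tweepy.RateLimitError:
    time.sleep(15 * 60)
   except StopIteration:
    return
 
 for tweet in limited(cur.items()):
  #do something
  pass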

 

The output of the poll is a set of tweets, which are simply JSON objects; some notable fields (accessed in the sketch after this list):

  • id: unique ID of this tweet
  • created_at: a string representation of the datetime when this tweet was created
  • text: the body of the tweet
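
For instance, assuming the tweets iterator from the snippet above, the raw JSON of each tweet is exposed via its _json attribute:

 for tweet in tweets:
  #read the raw JSON fields of the tweet
  data = tweet._json
  print(data["id"], data["created_at"], data["text"])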

 

Strangely, the created_at value is NOT an epoch but a human-readable string; to parse it into an ISO-like timestamp (keeping the UTC offset) you can:

 from datetime import datetime  
 parsed = datetime.strftime(datetime.strptime(date, '%a %b %d %H:%M:%S %z %Y'), '%Y-%m-%dT%H:%M:%S%z')  


And to convert that to epoch you can:

 import ciso8601  
 epoch = int(ciso8601.parse_datetime(parsed).timestamp())  


To verify whether a tweet is a retweet, you can check if its text starts with RT @:

tweet._json["text"].startswith("RT @")

[Python] Access Reddit posts and comments with PRAW

To programmatically access posts and comments from a Reddit subreddit, you can use Python's PRAW library.


You must login to Reddit, then register your application to get a client ID and secret.


Once you're set, it is simply a matter of:


 import praw  
   
 reddit = praw.Reddit(  
   client_id="CLIENT_ID",  
   client_secret="CLIENT_SECRET",  
   user_agent="WHATEVER"  
 )  
   
 #fetch the most recent posts of the subreddit  
 submissions = reddit.subreddit("SUBREDDIT_NAME").new(limit=POST_LIMIT)  
   
 for submission in submissions:  
  #resolve all "load more comments" placeholders so the full comment tree is available  
  submission.comments.replace_more(limit=None)  
    
  #this iterates top-level comments only; use submission.comments.list() to walk every reply  
  for comment in submission.comments:  
   #do something  


Some things to note:

A post is represented by a Submission object; some interesting attributes (read in the sketch after this list) are:

  • created_utc: the UTC epoch when this post was created
  • author: the Redditor who created the post (None if the account was deleted)
  • score: all posts start with score 1 and it changes based on user votes
  • title
  • selftext: the body of the post
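
For example, assuming the submissions iterator from the snippet above, these attributes can be read directly:

 for submission in submissions:
  #print the metadata and body of each post
  print(submission.created_utc, submission.author, submission.score)
  print(submission.title)
  print(submission.selftext)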

 

To verify whether a post was deleted there is unfortunately no simple way, but in general you can consider this rule valid:

submission.author is None or not submission.is_robot_indexable or submission.selftext == "[deleted]" or submission.selftext == "[removed]"

Remember, however, that the [deleted] and [removed] strings are language dependent, so depending on your settings you might not get them in English.
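
Wrapped as a small helper (a sketch built only on the rule above; adjust the marker strings if your settings are not in English):

 def is_deleted(submission):
  #heuristic: deleted author, post not indexable, or body replaced by a deletion marker
  return (submission.author is None
    or not submission.is_robot_indexable
    or submission.selftext in ("[deleted]", "[removed]"))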


A comment is represented by a Comment object; its body is in the body attribute.

To verify if a comment was posted by a moderator (or a moderator bot), you can check:

comment.distinguished == "moderator"
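
For example, to skip moderator announcements inside the comment loop from the snippet above:

 for comment in submission.comments:
  #ignore comments distinguished as coming from a moderator
  if comment.distinguished == "moderator":
   continue
  #do something with comment.body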