28/06/2013

[Python] Twitter tweets analysis

I recently took a Big Data course on Coursera, where I had the chance to apply different analysis techniques to huge datasets. One of the assignments required us to perform some sort of tweet sentiment analysis using code written in Python (version 2.7.3).

To successfully run the examples, you must download this archive, which contains:
  • README.html: instructions on how to register on Twitter and get your own token and consumer keys
  • AFINN-README.txt: a description of the AFINN-111.txt file which will be used to evaluate the various tweets
  • AFINN-111.txt: a list of English terms and their sentiment scores, manually set by Finn Årup Nielsen (see the sample lines right after this list)
  • twitterstream.py: a Python script used to query Twitter and collect a stream of tweets. It runs indefinitely so you will have to manually stop it after some time
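
For reference, each line of AFINN-111.txt holds a term and its integer score, separated by a single tab (shown here as plain spaces), e.g.:

 abandon    -2
 love    3
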
Firstly, you should run twitterstream.py as:

python twitterstream.py > output.txt

stopping it when you feel you've collected enough data. It will extract some tweets in JSON format and store them in a new file called output.txt, which will be our test dataset. Note that all scripts were automatically graded, so many fine-tuning techniques had to be avoided; otherwise the grader would have failed to correctly evaluate each solution.

Finally, to understand the scripts, you should take a look at Twitter's official tweet JSON format documentation.
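
Each line collected by twitterstream.py is a single JSON object. A heavily trimmed example (the actual values here are made up) showing only the fields the scripts below rely on:

 {"lang": "en", "text": "wow I love this", "user": {"location": "Houston TX"}, "place": {"country_code": "US"}, "entities": {"hashtags": [{"text": "love"}]}}
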
Now, the first task was to write a script that derives the sentiment of each tweet previously collected. The script prints a float value corresponding to the overall sentiment of each tweet; tweets written in a language other than English, and tweets without text, are automatically assigned a score of 0.0.

 import sys
 import json

 def main():
     sent_file = open(sys.argv[1])
     tweet_file = open(sys.argv[2])

     scores = {}  # dictionary mapping each term to its sentiment score
     for line in sent_file:
         term, score = line.split("\t")  # the file is tab-delimited
         scores[term] = int(score)  # convert the score to an integer

     for line in tweet_file:
         # convert the line from the file into a JSON object
         mystr = json.loads(line)
         # check that the language is English and that there is a tweet to read
         if mystr.get("lang") == 'en' and 'text' in mystr:
             resscore = 0.0
             # split the tweet into a list of words
             words = mystr["text"].split()
             for word in words:
                 # if the current word exists in our dictionary, add its value to the total
                 if word in scores:
                     resscore += scores[word]
             print resscore
         else:
             # non-English tweets and tweets without text get a score of 0.0
             print 0.0

 if __name__ == '__main__':
     main()

To call the script simply use:

python tweet_sentiment.py AFINN-111.txt output.txt
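
As a quick sanity check of the scoring logic in an interactive session (illustrative scores; the real values live in AFINN-111):

 >>> scores = {"love": 3, "wow": 4}
 >>> sum(scores.get(word, 0) for word in "wow I love this".split()) + 0.0
 7.0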

The second task was about computing a sentiment score for the terms not listed in the AFINN-111 dataset, by analysing the tweets they appear in. The output is a (term, score) pair (string, float) for each such term.

 import sys
 import json

 # remove duplicates from a list, preserving order
 def remove_duplicates(seq):
     seen = set()
     seen_add = seen.add
     return [x for x in seq if x not in seen and not seen_add(x)]

 def main():
     sent_file = open(sys.argv[1])
     tweet_file = open(sys.argv[2])

     scores = {}  # dictionary mapping each term to its sentiment score
     for line in sent_file:
         term, score = line.split("\t")  # the file is tab-delimited
         scores[term] = int(score)  # convert the score to an integer

     resdict = {}  # result dictionary: term -> computed sentiment

     for line in tweet_file:
         # convert the line from the file into a JSON object
         mystr = json.loads(line)
         # the language check is skipped as it confuses the grader;
         # only check for well-formed responses that contain a tweet
         if len(mystr.keys()) > 1 and 'text' in mystr:
             resscore = 0.0
             # split the tweet into a list of words
             words = mystr["text"].split()
             for word in words:
                 # if the current word exists in our dictionary, add its value to the total
                 if word in scores:
                     resscore += scores[word]
             # now that we know the sentiment of the tweet, assign a score to
             # the non-dictionary terms:
             # term_score = sentiment / number of unique non-dictionary words
             #              in the tweet * occurrences of that word
             non_dict_words = [w for w in remove_duplicates(words) if w not in scores]
             for word in non_dict_words:
                 # count how many times the word occurs in the text
                 occurrences = words.count(word)
                 computed_sentiment = resscore / len(non_dict_words) * occurrences
                 # modifier to widen the gap between positive and negative terms
                 if resscore < 0:
                     computed_sentiment -= 1
                 else:
                     computed_sentiment += 1
                 resdict[word] = computed_sentiment
             # finally encode in UTF-8 and print every term with its computed value
             for word in non_dict_words:
                 encoded_string = (word + ' ' + str(resdict[word])).encode('utf-8')
                 print encoded_string
         else:
             # placeholder pair for malformed responses, kept for the grader
             print "foo " + str(0.0)

 if __name__ == '__main__':
     main()

Call it as:

python term_sentiment.py AFINN-111.txt output.txt
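
To see the formula in action: a tweet with overall sentiment 4.0 containing two unique non-dictionary words, one occurring twice and one once, assigns 4.0/2*2 + 1 = 5.0 to the repeated word and 4.0/2*1 + 1 = 3.0 to the other.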

The third task required us to produce a script to compute the overall frequency of each term across all tweets, returning a (term, frequency) pair (string, float) for each one.

 import sys
 import json

 def main():
     tweet_file = open(sys.argv[1])

     terms = {}  # term -> occurrences in all tweets
     total_terms = 0.0  # total terms in all tweets
     # populate our dict object
     for line in tweet_file:
         # convert the line from the file into a JSON object
         mystr = json.loads(line)
         # the language check is skipped as it confuses the grader;
         # only check for well-formed responses that contain a tweet
         if len(mystr.keys()) > 1 and 'text' in mystr:
             # split the tweet into a list of words
             words = mystr["text"].split()
             for word in words:
                 # increase the word's count, adding it to the dictionary on first sight
                 terms[word] = terms.get(word, 0.0) + 1
                 # in any case, increase the total terms counter
                 total_terms += 1
     # compute and print the frequency of each term
     for term in terms.keys():
         frequency = terms[term] / total_terms
         encoded_str = (term + ' ' + str(frequency)).encode('utf-8')
         print encoded_str

 if __name__ == '__main__':
     main()

Called as:

python frequency.py output.txt
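
For example, a term occurring 5 times in a dataset whose tweets contain 1,000 words in total is printed with a frequency of 5/1000 = 0.005.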

Moving on to more interesting analysis, we were then tasked with producing a script to compute which US state appears to be the happiest, basing the decision on the sentiment of the tweets originating in each state. The script returns the two-letter abbreviation of the happiest state thus calculated; should your dataset include no tweets originating in the US, the script simply returns 'US'.

 import sys
 import json
 import operator

 def main():
     sent_file = open(sys.argv[1])
     tweet_file = open(sys.argv[2])

     scores = {}  # dictionary mapping each term to its sentiment score
     for line in sent_file:
         term, score = line.split("\t")  # the file is tab-delimited
         scores[term] = int(score)  # convert the score to an integer

     happiness = {}  # state -> cumulative sentiment

     for line in tweet_file:
         # convert the line from the file into a JSON object
         mystr = json.loads(line)
         # check the "place"->"country_code" key in the response, it should be "US"
         if mystr.get("place") is not None and mystr["place"].get("country_code") == 'US':
             # get the "user"->"location" field; its last token should be the
             # two-letter abbreviation of the state, e.g. "Houston TX"
             if mystr.get("user") is not None and mystr["user"].get("location") is not None:
                 location = mystr["user"]["location"].split(' ')
                 if len(location) == 2 and len(location[1]) == 2 and 'text' in mystr:
                     resscore = 0.0
                     # split the tweet into a list of words
                     words = mystr["text"].split()
                     for word in words:
                         # if the current word exists in our dictionary, add its value to the total
                         if word in scores:
                             resscore += scores[word]
                     # accumulate each state's happiness
                     happiness[location[1]] = happiness.get(location[1], 0.0) + resscore
     # get the happiest state, if the dict is not empty!
     if happiness:
         happiest = max(happiness.iteritems(), key=operator.itemgetter(1))[0]
         print happiest
     else:
         print 'US'

 if __name__ == '__main__':
     main()

Call it as:

python happiest_state.py AFINN-111.txt output.txt
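
Note that the location heuristic is deliberately naive: it only accepts profiles whose location field is exactly two tokens ending in a two-letter code, so "Houston TX" counts towards TX, while "New York NY" or a bare "TX" are skipped.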

Finally, the last task was to compute the 10 most used hashtags. The output is a (hashtag, count) pair (string, float) for each of them.

 import sys
 import json
 import operator

 def main():
     tweet_file = open(sys.argv[1])

     tags = {}  # hashtag -> occurrences in all tweets
     # populate our dict object
     for line in tweet_file:
         # convert the line from the file into a JSON object
         mystr = json.loads(line)
         # search for hashtags and count them
         if 'entities' in mystr and mystr["entities"].get("hashtags"):
             for hashtag in mystr["entities"]["hashtags"]:
                 tags[hashtag["text"]] = tags.get(hashtag["text"], 0) + 1
     # sort our dict object by descending count
     sorted_tags = sorted(tags.iteritems(), key=operator.itemgetter(1), reverse=True)
     # print the top 10 hashtags; slicing avoids an IndexError
     # when fewer than 10 distinct hashtags were collected
     for tag, count in sorted_tags[:10]:
         encoded_str = (tag + ' ' + str(count + 0.0)).encode('utf-8')
         print encoded_str

 if __name__ == '__main__':
     main()

Called as:

python top_ten.py output.txt
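
Note that the counts are promoted to float (count + 0.0) before being printed, purely to comply with the (string, float) output format expected by the grader.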

Overall it was an interesting assignment, not overly complicated and a good introduction to Python programming.

8 comments:

  1. Getting the error on this as 'list' object does not have an attribute called 'keys'.
    Could you please help on this asap?
    Thanks!

    Replies
    1. Hello,

      please check that the code reads list.keys() (lowercase k), that the list object is not null and, REALLY important, that you did not change the indentation, as it has an actual meaning in Python (https://docs.python.org/release/2.5.1/ref/indentation.html)

      Regards

    2. Latest update:

      I am getting an error on the line below: "list indices must be integers, not str".

      >>> words = mystr["text"].split()

    3. Regarding the indentation, as long as it is correctly preserved, using spaces rather than tabs is not an issue. BUT you have to convert EACH tab to a space, meaning that 2 tabs become two spaces, etc.

      To check if the list object is null you can do something like:

      if not list:
           print "Empty"

      assuming the object is called list; otherwise use its actual name, e.g. mystr.

      However, the mystr object is not a simple list/array but the result of the json.loads() method, so it will definitely have non-numerical keys: the JSON attribute names. This is not an error, see the reference: https://docs.python.org/2/library/json.html, section 'Decoding JSON'.

      Please check what's contained in the mystr object before the line that raises the exception by printing its contents:

      print str(mystr)

      should be enough. Maybe you loaded a blank line or created the mystr object in a different way.

    4. The editor cut out the tab before 'print "Empty"' right below 'if not list:'.

      If you don't put it, it won't work correctly

  2. Thank you for your reply.
    The 'k' in keys is in lower case.
    The indentation gets collapsed to a single blank space when I copy this code.
    That should not be an issue as per my understanding.
    The only thing to verify is the null value. How can I determine that?

  3. I keep getting this error, how can I solve it? Thanks.

    IOError: [Errno 2] No such file or directory.

    Replies
    1. Hello Mark, how are you invoking the scripts? You should pass a valid file path as argument, see the sample calls above. Note that in my case I had the files in the same directory as the scripts; if that's not your case, try invoking them as:

      python script_name.py /path/to/file_name.extension

      Have a nice day

