28/06/2013

[Python] Twitter tweets analysis

I recently took a Big Data course on Coursera, where I had the chance to apply different analysis techniques to huge datasets. One of the assignments required us to perform some sort of tweet sentiment analysis using code written in Python (version 2.7.3).

To successfully run the examples, you must download this archive, which contains:
  • README.html: instructions on how to register on Twitter and get your own token and consumer keys
  • AFINN-README.txt: a description of the AFINN-111.txt file which will be used to evaluate the various tweets
  • AFINN-111.txt: a list of English terms and their sentiment scores, manually set by Finn Årup Nielsen (see the sample lines right after this list)
  • twitterstream.py: a Python script used to query Twitter and collect a stream of tweets. It runs indefinitely so you will have to manually stop it after some time
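
For reference, each line of AFINN-111.txt holds a term and its integer score, separated by a single tab (shown here as plain spaces), e.g.:

 abandon    -2
 love    3
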
Firstly, you should run twitterstream.py as:

python twitterstream.py > output.txt

stopping it when you feel you've collected enough data. It will extract some tweets in JSON format and store them in a new file called output.txt, which will be our test dataset. Note that all scripts were automatically graded, so many fine-tuning techniques had to be avoided; otherwise the grader would have failed to correctly evaluate each solution.

Finally, to understand the scripts, you should take a look at Twitter's official tweet JSON format documentation.
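
Each line collected by twitterstream.py is a single JSON object. A heavily trimmed example (the actual values here are made up) showing only the fields the scripts below rely on:

 {"lang": "en", "text": "wow I love this", "user": {"location": "Houston TX"}, "place": {"country_code": "US"}, "entities": {"hashtags": [{"text": "love"}]}}
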
Now, the first task was to write a script that derives the sentiment of each tweet previously collected. The script prints a float value corresponding to the overall sentiment of each tweet; tweets written in a language other than English, and tweets without text, are automatically assigned a score of 0.0.

 import sys
 import json

 def main():
     sent_file = open(sys.argv[1])
     tweet_file = open(sys.argv[2])

     scores = {}  # dictionary mapping each term to its sentiment score
     for line in sent_file:
         term, score = line.split("\t")  # the file is tab-delimited
         scores[term] = int(score)  # convert the score to an integer

     for line in tweet_file:
         # convert the line from the file into a JSON object
         mystr = json.loads(line)
         # check that the language is English and that there is a tweet to read
         if mystr.get("lang") == 'en' and 'text' in mystr:
             resscore = 0.0
             # split the tweet into a list of words
             words = mystr["text"].split()
             for word in words:
                 # if the current word exists in our dictionary, add its value to the total
                 if word in scores:
                     resscore += scores[word]
             print resscore
         else:
             # non-English tweets and tweets without text get a score of 0.0
             print 0.0

 if __name__ == '__main__':
     main()

To call the script simply use:

python tweet_sentiment.py AFINN-111.txt output.txt
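
As a quick sanity check of the scoring logic in an interactive session (illustrative scores; the real values live in AFINN-111):

 >>> scores = {"love": 3, "wow": 4}
 >>> sum(scores.get(word, 0) for word in "wow I love this".split()) + 0.0
 7.0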

The second task was about computing a sentiment score for the terms not listed in the AFINN-111 dataset, by analysing the tweets they appear in. The output is a (term, score) pair (string, float) for each such term.

 import sys
 import json

 # remove duplicates from a list, preserving order
 def remove_duplicates(seq):
     seen = set()
     seen_add = seen.add
     return [x for x in seq if x not in seen and not seen_add(x)]

 def main():
     sent_file = open(sys.argv[1])
     tweet_file = open(sys.argv[2])

     scores = {}  # dictionary mapping each term to its sentiment score
     for line in sent_file:
         term, score = line.split("\t")  # the file is tab-delimited
         scores[term] = int(score)  # convert the score to an integer

     resdict = {}  # result dictionary: term -> computed sentiment

     for line in tweet_file:
         # convert the line from the file into a JSON object
         mystr = json.loads(line)
         # the language check is skipped as it confuses the grader;
         # only check for well-formed responses that contain a tweet
         if len(mystr.keys()) > 1 and 'text' in mystr:
             resscore = 0.0
             # split the tweet into a list of words
             words = mystr["text"].split()
             for word in words:
                 # if the current word exists in our dictionary, add its value to the total
                 if word in scores:
                     resscore += scores[word]
             # now that we know the sentiment of the tweet, assign a score to
             # the non-dictionary terms:
             # term_score = sentiment / number of unique non-dictionary words
             #              in the tweet * occurrences of that word
             non_dict_words = [w for w in remove_duplicates(words) if w not in scores]
             for word in non_dict_words:
                 # count how many times the word occurs in the text
                 occurrences = words.count(word)
                 computed_sentiment = resscore / len(non_dict_words) * occurrences
                 # modifier to widen the gap between positive and negative terms
                 if resscore < 0:
                     computed_sentiment -= 1
                 else:
                     computed_sentiment += 1
                 resdict[word] = computed_sentiment
             # finally encode in UTF-8 and print every term with its computed value
             for word in non_dict_words:
                 encoded_string = (word + ' ' + str(resdict[word])).encode('utf-8')
                 print encoded_string
         else:
             # placeholder pair for malformed responses, kept for the grader
             print "foo " + str(0.0)

 if __name__ == '__main__':
     main()

Call it as:

python term_sentiment.py AFINN-111.txt output.txt
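
To see the formula in action: a tweet with overall sentiment 4.0 containing two unique non-dictionary words, one occurring twice and one once, assigns 4.0/2*2 + 1 = 5.0 to the repeated word and 4.0/2*1 + 1 = 3.0 to the other.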

The third task required us to produce a script to compute the overall frequency of each term across all tweets, returning a (term, frequency) pair (string, float) for each one.

 import sys
 import json

 def main():
     tweet_file = open(sys.argv[1])

     terms = {}  # term -> occurrences in all tweets
     total_terms = 0.0  # total terms in all tweets
     # populate our dict object
     for line in tweet_file:
         # convert the line from the file into a JSON object
         mystr = json.loads(line)
         # the language check is skipped as it confuses the grader;
         # only check for well-formed responses that contain a tweet
         if len(mystr.keys()) > 1 and 'text' in mystr:
             # split the tweet into a list of words
             words = mystr["text"].split()
             for word in words:
                 # increase the word's count, adding it to the dictionary on first sight
                 terms[word] = terms.get(word, 0.0) + 1
                 # in any case, increase the total terms counter
                 total_terms += 1
     # compute and print the frequency of each term
     for term in terms.keys():
         frequency = terms[term] / total_terms
         encoded_str = (term + ' ' + str(frequency)).encode('utf-8')
         print encoded_str

 if __name__ == '__main__':
     main()

Called as:

python frequency.py output.txt
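
For example, a term occurring 5 times in a dataset whose tweets contain 1,000 words in total is printed with a frequency of 5/1000 = 0.005.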

Moving on to more interesting analysis, we were then tasked with producing a script to compute which US state appears to be the happiest, basing the decision on the sentiment of the tweets originating in each state. The script returns the two-letter abbreviation of the happiest state thus calculated; should your dataset include no tweets originating in the US, the script simply returns 'US'.

 import sys
 import json
 import operator

 def main():
     sent_file = open(sys.argv[1])
     tweet_file = open(sys.argv[2])

     scores = {}  # dictionary mapping each term to its sentiment score
     for line in sent_file:
         term, score = line.split("\t")  # the file is tab-delimited
         scores[term] = int(score)  # convert the score to an integer

     happiness = {}  # state -> cumulative sentiment

     for line in tweet_file:
         # convert the line from the file into a JSON object
         mystr = json.loads(line)
         # check the "place"->"country_code" key in the response, it should be "US"
         if mystr.get("place") is not None and mystr["place"].get("country_code") == 'US':
             # get the "user"->"location" field; its last token should be the
             # two-letter abbreviation of the state, e.g. "Houston TX"
             if mystr.get("user") is not None and mystr["user"].get("location") is not None:
                 location = mystr["user"]["location"].split(' ')
                 if len(location) == 2 and len(location[1]) == 2 and 'text' in mystr:
                     resscore = 0.0
                     # split the tweet into a list of words
                     words = mystr["text"].split()
                     for word in words:
                         # if the current word exists in our dictionary, add its value to the total
                         if word in scores:
                             resscore += scores[word]
                     # accumulate each state's happiness
                     happiness[location[1]] = happiness.get(location[1], 0.0) + resscore
     # get the happiest state, if the dict is not empty!
     if happiness:
         happiest = max(happiness.iteritems(), key=operator.itemgetter(1))[0]
         print happiest
     else:
         print 'US'

 if __name__ == '__main__':
     main()

Call it as:

python happiest_state.py AFINN-111.txt output.txt
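
Note that the location heuristic is deliberately naive: it only accepts profiles whose location field is exactly two tokens ending in a two-letter code, so "Houston TX" counts towards TX, while "New York NY" or a bare "TX" are skipped.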

Finally, the last task was to compute the 10 most used hashtags. The output is a (hashtag, count) pair (string, float) for each of them.

 import sys
 import json
 import operator

 def main():
     tweet_file = open(sys.argv[1])

     tags = {}  # hashtag -> occurrences in all tweets
     # populate our dict object
     for line in tweet_file:
         # convert the line from the file into a JSON object
         mystr = json.loads(line)
         # search for hashtags and count them
         if 'entities' in mystr and mystr["entities"].get("hashtags"):
             for hashtag in mystr["entities"]["hashtags"]:
                 tags[hashtag["text"]] = tags.get(hashtag["text"], 0) + 1
     # sort our dict object by descending count
     sorted_tags = sorted(tags.iteritems(), key=operator.itemgetter(1), reverse=True)
     # print the top 10 hashtags; slicing avoids an IndexError
     # when fewer than 10 distinct hashtags were collected
     for tag, count in sorted_tags[:10]:
         encoded_str = (tag + ' ' + str(count + 0.0)).encode('utf-8')
         print encoded_str

 if __name__ == '__main__':
     main()

Called as:

python top_ten.py output.txt
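
Note that the counts are promoted to float (count + 0.0) before being printed, purely to comply with the (string, float) output format expected by the grader.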

Overall it was an interesting assignment, not overly complicated and a good introduction to Python programming.

8 comments:

  1. Getting the error on this as 'list' object does not have an attribute called 'keys'.
    Could you please help on this asap?
    Thanks!

    Replies
    1. Hello,

      please check that the code reads list.keys() (lowercase k), that the list object is not null and, REALLY important, that you did not change the indentation, as it has an actual meaning in Python (https://docs.python.org/release/2.5.1/ref/indentation.html)

      Regards

    2. Latest update:

      I am getting an error on the line below: "list indices must be integers, not str".

      >>> words = mystr["text"].split()

    3. Regarding the indentation, as long as it is correctly preserved, using spaces rather than tabs is not an issue. BUT you have to convert EACH tab to a space, meaning that 2 tabs become two spaces, etc.

      To check if the list object is null you can do something like:

      if not list:
           print "Empty"

      assuming the object is called list; otherwise use its actual name, e.g. mystr.

      However, the mystr object is not a simple list/array but the result of the json.loads() method, so it will definitely have non-numerical keys: the JSON attribute names. This is not an error, see the reference: https://docs.python.org/2/library/json.html, section 'Decoding JSON'.

      Please check what's contained in the mystr object before the line that raises the exception by printing its contents:

      print str(mystr)

      should be enough. Maybe you loaded a blank line or created the mystr object in a different way.

    4. The editor cut out the tab before 'print "Empty"' right below 'if not list:'.

      If you don't put it, it won't work correctly

  2. Thank you for your reply.
    The 'k' in keys is in lower case.
    The indentation gets collapsed to a single blank space when I copy this code.
    That should not be an issue as per my understanding.
    The only thing to verify is the null value. How can I determine that?

  3. I keep getting this error, how can I solve it? Thanks.

    IOError: [Errno 2] No such file or directory.

    Replies
    1. Hello Mark, how are you invoking the scripts? You should pass a valid file path as argument, see the sample calls above. Note that in my case I had the files in the same directory as the scripts; if that's not your case, try invoking them as:

      python script_name.py /path/to/file_name.extension

      Have a nice day

