I recently took a Big Data course on Coursera, where I had the chance to apply various analysis techniques to huge datasets. One of the assignments required us to perform some tweet sentiment analysis using code written in Python (version 2.7.3).
To successfully run the examples, you must download this archive, which contains:
- README.html: instructions on how to register on Twitter and get your own token and consumer keys
- AFINN-README.txt: a description of the AFINN-111.txt file which will be used to evaluate the various tweets
- AFINN-111.txt: a list of English terms and their sentiment scores, manually set by Finn Årup Nielsen
- twitterstream.py: a Python script used to query Twitter and collect a stream of tweets. It runs indefinitely so you will have to manually stop it after some time
Firstly, you should run twitterstream.py as:
python twitterstream.py > output.txt
stopping it when you feel you have collected enough data. It extracts tweets in JSON format and stores them in a new file called output.txt, which will be our test dataset. Note that all the scripts were automatically graded, so many fine-tuning techniques had to be avoided so that the grader would not fail to evaluate each solution correctly.
Lastly, to understand the scripts, you should take a look at Twitter's official documentation of the tweet JSON format.
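Before diving into the scripts, it can also help to peek at the shape of the data. Here is a quick check of my own (not part of the assignment), assuming the first line of output.txt holds a well-formed tweet:
import json

with open('output.txt') as f:
    first = json.loads(f.readline())
    print sorted(first.keys())  # expect keys such as 'text', 'lang' and 'entities'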
import sys
import json

def main():
    sent_file = open(sys.argv[1])
    tweet_file = open(sys.argv[2])
    scores = {}  # initialize an empty dictionary
    for line in sent_file:
        term, score = line.split("\t")  # the file is tab-delimited ("\t" means "tab character")
        scores[term] = int(score)  # convert the score to an integer
    for line in tweet_file:
        # convert the line from the file into a JSON object
        mystr = json.loads(line)
        # check the language is English, if "lang" is among the keys
        if 'lang' in mystr.keys() and mystr["lang"] == 'en':
            # if "text" is not among the keys, there's no tweet to read, skip it
            if 'text' in mystr.keys():
                resscore = 0
                # split the tweet into a list of words
                words = mystr["text"].split()
                for word in words:
                    # if the current word exists in our dictionary, add its value to the total
                    if word in scores:
                        resscore += scores[word]
                # convert to float
                resscore += 0.0
                print str(resscore)
        else:
            print 0.0

if __name__ == '__main__':
    main()
To call the script, simply use:
python tweet_sentiment.py AFINN-111.txt output.txt
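To make the scoring logic concrete, here is a toy, hand-checked example of my own; the dictionary entries are illustrative, not necessarily the actual AFINN-111 values:
scores = {'good': 3, 'bad': -3}  # illustrative entries, not actual AFINN values
words = "the service was good not bad at all".split()
print sum(scores.get(w, 0) for w in words) + 0.0  # prints 0.0: +3 and -3 cancel out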
The second task was about computing a sentiment score for the terms not listed in the AFINN-111 dataset, by analysing the tweets themselves. The output would be the computed (term, score) pair (string, float).
import sys
import json

# remove duplicates from a list, preserving order
def f7(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if x not in seen and not seen_add(x)]

def main():
    sent_file = open(sys.argv[1])
    tweet_file = open(sys.argv[2])
    scores = {}  # initialize an empty dictionary
    for line in sent_file:
        term, score = line.split("\t")  # the file is tab-delimited ("\t" means "tab character")
        scores[term] = int(score)  # convert the score to an integer
    # create the result dictionary
    resdict = {}
    for line in tweet_file:
        # convert the line from the file into a JSON object
        mystr = json.loads(line)
        # skip the language check, as it confuses the grader; only check for well-formed responses
        # if 'lang' in mystr.keys() and mystr["lang"] == 'en':
        if len(mystr.keys()) > 1:
            # if "text" is not among the keys, there's no tweet to read, skip it
            if 'text' in mystr.keys():
                resscore = 0
                # split the tweet into a list of words
                words = mystr["text"].split()
                uniquewords = f7(words)
                for word in words:
                    # if the current word exists in our dictionary, add its value to the total
                    if word in scores:
                        resscore += scores[word]
                # convert to float
                resscore += 0.0
                # now that we know the sentiment of the tweet, assign a score to each term:
                # term_score = sentiment / number of unique words in the tweet * occurrences of that word
                non_dict_words = len(uniquewords)
                # populate our dictionary
                for word in uniquewords:
                    # count how many times that word occurs in the text
                    occurrences = words.count(word)
                    computed_sentiment = resscore / non_dict_words * occurrences
                    # modifier added to widen the gap between positive and negative terms
                    if resscore < 0:
                        computed_sentiment -= 1
                    else:
                        computed_sentiment += 1
                    resdict[word] = computed_sentiment
                # finally, encode in UTF-8 and print every term's computed value
                for word in words:
                    unicode_str = word + ' ' + str(resdict[word])
                    encoded_string = unicode_str.encode('utf-8')
                    print encoded_string
        else:
            print "foo " + str(0.0)

if __name__ == '__main__':
    main()
Call it as:
python term_sentiment.py AFINN-111.txt output.txt
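Working the formula through by hand on a toy tweet (again, illustrative values of my own):
scores = {'great': 3}  # illustrative dictionary entry
words = "great great day".split()  # toy tweet with 2 unique words
resscore = sum(scores.get(w, 0) for w in words) + 0.0  # 6.0
# 'day':   6.0 / 2 * 1 occurrence  + 1 = 4.0
# 'great': 6.0 / 2 * 2 occurrences + 1 = 7.0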
The third task required us to produce a script computing the overall term frequency across all tweets, returning the computed (term, frequency) pair (string, float) for each term.
import sys
import json

def main():
    tweet_file = open(sys.argv[1])
    terms = {}  # pairs of (term, occurrences across all tweets)
    total_terms = 0.0  # total number of terms across all tweets
    # populate our dict object
    for line in tweet_file:
        # convert the line from the file into a JSON object
        mystr = json.loads(line)
        # skip the language check, as it confuses the grader; only check for well-formed responses
        if len(mystr.keys()) > 1:
            # if "text" is not among the keys, there's no tweet to read, skip it
            if 'text' in mystr.keys():
                # split the tweet into a list of words
                words = mystr["text"].split()
                for word in words:
                    # if the current word is not in the dictionary, add it
                    if word not in terms:
                        terms[word] = 1.0
                    # otherwise, increase its count
                    else:
                        terms[word] += 1
                    # in any case, increase the total terms counter
                    total_terms += 1
    # compute the frequencies
    for term in terms.keys():
        frequency = terms[term] / total_terms
        strprint = term + ' ' + str(frequency)
        encoded_str = strprint.encode('utf-8')
        print encoded_str

if __name__ == '__main__':
    main()
Called as:
python frequency.py output.txt
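Since each frequency is a term's count divided by the total number of terms, the printed values should sum to roughly 1.0. Here is a quick sanity check of my own (check_sum.py is a hypothetical helper name, not part of the assignment):
import sys

total = 0.0
for line in sys.stdin:
    # each output line is "term frequency"; the frequency is the last field
    total += float(line.rsplit(' ', 1)[1])
print total  # should be close to 1.0
Pipe the frequencies through it as:
python frequency.py output.txt | python check_sum.py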
Moving on to more interesting analysis, we were then tasked with producing a script to determine which US state appears to be the happiest, based on the sentiment of the tweets from each state. The script returns the two-letter abbreviation of the happiest state thus calculated; should your dataset include no tweets originating in the US, the script simply returns 'US'.
import sys
import json
import operator

def main():
    sent_file = open(sys.argv[1])
    tweet_file = open(sys.argv[2])
    scores = {}  # initialize an empty dictionary
    for line in sent_file:
        term, score = line.split("\t")  # the file is tab-delimited ("\t" means "tab character")
        scores[term] = int(score)  # convert the score to an integer
    happiness = {}  # pairs of (state, sentiment)
    for line in tweet_file:
        # convert the line from the file into a JSON object
        mystr = json.loads(line)
        # check the "place"->"country_code" key in the response; it should be 'US'
        if 'place' in mystr.keys() and mystr["place"] is not None and mystr["place"]["country_code"] == 'US':
            # get the "user"->"location" parameter; its second token should be the two-character state code
            if 'user' in mystr.keys() and mystr["user"] is not None and mystr["user"]["location"] is not None:
                location = mystr["user"]["location"].split(' ')
                if len(location) == 2 and len(location[1]) == 2:
                    if 'text' in mystr.keys():
                        resscore = 0
                        # split the tweet into a list of words
                        words = mystr["text"].split()
                        for word in words:
                            # if the current word exists in our dictionary, add its value to the total
                            if word in scores:
                                resscore += scores[word]
                        # convert to float
                        resscore += 0.0
                        # accumulate each state's happiness
                        if location[1] not in happiness:
                            happiness[location[1]] = resscore
                        else:
                            happiness[location[1]] += resscore
    # get the happiest state, if the dict is not empty!
    if len(happiness.keys()) > 0:
        happiest = max(happiness.iteritems(), key=operator.itemgetter(1))[0]
        print str(happiest)
    else:
        print 'US'

if __name__ == '__main__':
    main()
Call it as:
python happiest_state.py AFINN-111.txt output.txt
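Note how crude the location heuristic is: it only accepts a user location made of exactly two space-separated tokens whose second token is two characters long. A quick illustration with made-up location strings of my own:
location = "Austin TX".split(' ')  # ['Austin', 'TX']
print len(location) == 2 and len(location[1]) == 2  # True: this tweet is counted
location = "New York, NY".split(' ')  # ['New', 'York,', 'NY']
print len(location) == 2  # False: this tweet is skipped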
Finally, the last task was to compute the 10 most used hashtags. The output would be the computed (hashtag, count) pair (string, float) for each of the top ten.
import sys
import json
import operator

def main():
    tweet_file = open(sys.argv[1])
    tags = {}  # pairs of (tag, occurrences across all tweets)
    # populate our dict object
    for line in tweet_file:
        # convert the line from the file into a JSON object
        mystr = json.loads(line)
        # search for hashtags and count them
        if 'entities' in mystr.keys() and mystr["entities"]["hashtags"] != []:
            for hashtag in mystr["entities"]["hashtags"]:
                if hashtag["text"] not in tags.keys():
                    tags[hashtag["text"]] = 1
                else:
                    tags[hashtag["text"]] += 1
    # sort our dict object by descending values
    sorted_dict = sorted(tags.iteritems(), key=operator.itemgetter(1), reverse=True)
    # print the top 10 hashtag counts (guarding against datasets with fewer than 10 hashtags)
    for i in range(min(10, len(sorted_dict))):
        tag, count = sorted_dict[i]
        count += 0.0  # convert to float
        strprint = tag + ' ' + str(count)
        encoded_str = strprint.encode('utf-8')
        print encoded_str

if __name__ == '__main__':
    main()
Called as:
python top_ten.py output.txt
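As an aside, the counting and sorting can be written more compactly with collections.Counter, available since Python 2.7. Here is a sketch of my own of the same top-ten logic, not the graded solution:
import json
from collections import Counter

tags = Counter()
with open('output.txt') as tweet_file:
    for line in tweet_file:
        mystr = json.loads(line)
        if 'entities' in mystr:
            for hashtag in mystr["entities"]["hashtags"]:
                tags[hashtag["text"]] += 1  # Counter defaults missing keys to 0
for tag, count in tags.most_common(10):
    print (tag + ' ' + str(count + 0.0)).encode('utf-8')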
Overall, it was an interesting assignment: not overly complicated, and a good introduction to Python programming.
Getting an error on this: 'list' object has no attribute 'keys'.
Could you please help with this ASAP?
Thanks!
Hello,
please check that the code reads list.keys() (lowercase k), that the list object is not null, and, REALLY important, that you did not change the indentation, as it has an actual meaning in Python (https://docs.python.org/release/2.5.1/ref/indentation.html).
Regards
I am getting an error on the line below: "list indices must be integers, not str".
>>> words = mystr["text"].split()
Regarding the indentation: as long as it is correctly preserved, using a space rather than a tab is not an issue. BUT you have to convert EACH tab to a space, meaning that 2 tabs become 2 spaces, etc.
To check if the list object is null you can do something like:
if not list:
    print "Empty"
assuming the object is called list; otherwise use its own name, e.g. mystr.
However, the mystr object is not a simple list/array but the result of the json.loads() method, so there will definitely be non-numerical keys (the JSON attribute names); this is not an error. See the reference: https://docs.python.org/2/library/json.html, section 'Decoding JSON'.
Please check what's contained in the mystr object before the line that raises the exception by printing its contents:
print str(mystr)
should be enough. Maybe you loaded a blank line or created the mystr object in a different way.
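If a blank line is indeed the problem, a defensive variant of the loading loop (my own suggestion, not what the grader expects) would skip anything that does not decode to a JSON object:
for line in tweet_file:
    line = line.strip()
    if not line:
        continue  # skip blank lines
    mystr = json.loads(line)
    if not isinstance(mystr, dict):
        continue  # skip anything that is not a JSON object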
The editor cut out the tab before 'print "Empty"' right below 'if not list:'.
If you don't put it back, it won't work correctly.
Thank you for your reply.
The 'k' in keys is in lower case.
The indentation has to be backspaced to one blank space if I copy this code.
That should not be an issue, as per my understanding.
The only thing left to verify is the null value. How can I determine that?
I keep getting an error: how can I solve this? Thanks.
IOError: [Errno 2] No such file or directory.
Hello Mark, how are you invoking the scripts? You should pass a valid file path as an argument (see the sample calls above for that). Note that in my case I had the file in the same directory as the script; if that's not your case, try invoking it as:
python script_name.py /path/to/file_name.extension
Have a nice day