The second assignment in my Coursera Big Data course regarded data analysis in SQL using
SQLite3. We used a test dataset, which you can
download from here, composed of:
- matrix.db: a simple database representing two square sparse matrices stored as tables A and B each with row_num, col_num and value as columns
- reuters.db: a database containing a single table frequency(docid, term, count), where docid is an identifier corresponding to a particular file, term is an English word, and count is the number of the times the term appears within the document indicated by docid
I will not post all scripts produced here since many were just simple SELECTs so we'll just focus on the most interesting stuff:
matrix multiplication in SQL.
It sounds more complicated than it really is but, given the two sparse matrices, it is possible to compute their multiplication AxB by doing a simple JOIN between A columns and B rows, then GROUPing BY A rows and B columns and finally SELECTing, for each matrix cell, the SUM of the multiplication between the two cell values in both matrices:
SELECT a.row_num, b.col_num, SUM(a.value*b.value)
FROM A a JOIN B b ON a.col_num=b.row_num
GROUP BY a.row_num, b.col_num;
Which is exactly the implementation of the matrix multiplication formula.