06/07/2011

[PHP] PageRank implementation




There are two possible PageRank implementations: the power method and the iterative method. We will describe both.

To check the results obtained we provide a scri'pt to generate a .gexf file to be opened with Gephi.

Test datasets are available at the Toronto university website.


Technologies used:
  • PHP
  • Gephi
  • XML
1.Power method
 

This method takes longer than the other one, however the result is equally right. Using a sparse matrix instead of the full one may help speed up the process.

Method description:

  1. Pick the NxN input matrix and make it stochastic, call it A
  2. Create the P" matrix as alpha*A+(1-alpha)*M with alpha a factor which describes the probability of a random jump and M an NxN matrix filled with values 1/N
  3. Create starting vector v_0 with length N filled with values 1/N
  4. Compute v_m = v_m-1*P" where the first time v_m-1 is exactly v_0
  5. Compute the difference v_m - v_m-1 which should converge to 0
  6. When the difference doesn't vary over a certain threshold or i iterations have been made, stop. v_m should contain the PageRank for every page
To run the program you must have the nodes and adj_matrix files from the chosen dataset. The implementation source code is found here here.

NOTE: Due to Apache and PHP limitations, you may have to modify the php.ini file to grant more memory to the scripts (if you don't want to store the matrix on filesystem like we did) by setting the memory_limit parameter to at least 768MB and raising the maximum script execution time to 5 minutes(300 seconds) with max_execution_time.

2. Iterative method - found on phpir

To use this method you must have the
nodes and adj_list files from the chosen dataset.




Method description:


Each page is given a starting PR of 1/N where N is the number of nodes in the graph. Each page then gives to every page it links a PR of current_page_PR/number_of_outbound_links.
It's introduced a dampening factor alpha (0,15) which represents the probability of making a random jump while visiting the graph or when reaching a cul de sac.

PR_new = alpha/n + (1-alpha)*PR_old




Every PR is then normalized between 0 and 1.

The process keeps going until it has reached i iterations or the difference between old and new PR doesn't vary over a certain threshold.

Our implementation is available here.


3. Gephi parser:

To use the
parser you you must have the nodes and adj_list files from the chosen dataset. The parser outputs a .gexf file to be opened with Gephi. Here is a sample .gexf file obtained by parsing the first dataset available on the site (the one about abortion).

No comments:

Post a Comment

With great power comes great responsibility.

Da grandi poteri derivano grandi responsabilità.