
Tuesday, December 16, 2008

Google PageRank (PR) vs. Alexa Traffic Rank Correlation (Regression) Analysis

Abstract: a statistical study (regression analysis) of a random sample of 102 websites has shown that a strong relationship (correlation) exists between Google PageRank and Alexa Traffic Rank.

Introduction

Google PageRank (GPR) and Alexa Traffic Rank (ATR) are two different measures of a website's success. As you know (shame on you if you don't), simply speaking, GPR measures the number of links to the site, while ATR measures the site's traffic. (Official detailed descriptions of these two indicators are available at Google Technology and Alexa Help pages.)

The correlation between ATR and GPR became my concern after I visited two websites in a row, namely CSS Zen Garden and Mail.ru. The first one, a specialized CSS design project, had a Google PageRank of 8 and an Alexa Traffic Rank of over 13 thousand. The second one, a huge Russian portal, had a PageRank of 6, yet ranked 23rd in Alexa! The question occurred to me: «does traffic affect link popularity?» Interestingly, although Mail.ru is a much more popular portal, CSS Zen Garden obviously had many more quality links pointing to it. This can be explained by the nature of CSS Zen Garden: the site is aimed at designers, who are likely to have sites of their own and to link to it directly. Users of Mail.ru, on the other hand, are mere mortals who want free email, videos, chat, news, etc., and are less likely to link to the site.

Here is a comparative table:

Site           | GPR | ATR
--------------------------------
CSS Zen Garden |   8 | 13,138
Mail.ru        |   6 |     23

This gap between a portal's actual popularity and the number of quality links pointing to it made me want to test the statistical correlation between inbound links and traffic, as measured by Google PageRank and Alexa Traffic Rank respectively.

A copy of the original spreadsheet is available, though it does not contain the graphs and charts.

The sample

The sample for this analysis consisted of 102 websites. I tried to pick them as randomly as possible: I caught myself querying Google for phrases I would never normally search for, such as "knitting", "nothing", "rotting" and other crazy queries. Still, I understand that there was a bias, because I was the only one who picked the sites (except for a single site I got from my wife, which is likar.info).

Many of the websites in the sample I like and visit daily, while others I don't even know. To get some lower-quality sites, I went to a lousy web design studio and simply clicked randomly through the sites in their portfolio (most of the clone-looking 0-2 PR sites are their masterpieces). Note that there was a chance of human error; my Google Toolbar might have malfunctioned, or I could simply have misread a value. A complete list of the websites is also available.

The t-distribution (Student's t-distribution) table I used for my analysis offers, among others, 70, 80, 100 and 150 degrees of freedom. The key fact is that it does not offer 98 degrees of freedom. This matters because the formula I used needs n-2 degrees of freedom, where n is the sample size. Thus, I purposely used exactly 102 observations (the last two were added later), so that subtracting 2 from the sample size gives 100 and I could read an exact value from the table.

Each sample value (site) had three parameters (dimensions), namely URL, GPR, and ATR. In the initial spreadsheet, each observation also has an ID number and the date of measurement. (Please note that some of the sites have surely changed since the study, so check the date of measurement. A few of them I run/manage/own/develop.)

Google PageRank Distribution

Normally distributed! From the GPR point of view, the sample was almost perfectly distributed, forming a bell-shaped curve. As you can see from the diagram, there was very little skewness to the right. The mean was 5.05 and the median 5, with a dispersion (variance) of 7. The coefficient of skewness was as low as 0.06, which means that the sample was quite close to normally distributed.

For the histogram above, I used each of the 11 PageRank values as a separate class (exactly 11, not 10; remember that zero is also a separate value).

Alexa Traffic Rank Distribution

For Alexa TR, the distribution was much less normal. The entire sample was significantly skewed to the right, with as many as 73% of observations falling into one eighth of the possible values. This vast majority encompassed sites within the first 1,500,000 positions in the rank.

With a mean of 1,245,677.471 and a median of 109,922, the sample had a huge dispersion (variance) of 5,351,667,608,993.28, a range of 10,960,325, and a skewness coefficient of 1.47. I divided the entire sample into 8 classes with a class step of 1,500,000, altogether ranging from 0 to 12,000,000.

Obviously, the vast majority of the websites in the sample fell into a single class, the first of the eight, which is a limitation of the study. I should have either gathered only sites in the top 2M range, or gathered more lower-quality sites.
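The post does not say which skewness formula was used. As a sanity check, the reported coefficients for both distributions are consistent with Pearson's second (median) skewness coefficient, 3 × (mean - median) / standard deviation, provided that "dispersion" is read as variance. The sketch below recomputes both values from the summary statistics quoted in the text; the helper name is made up for illustration, and the choice of formula is an assumption, not something the original spreadsheet confirms.

```python
import math

def pearson_median_skewness(mean, median, variance):
    """Pearson's second skewness coefficient: 3 * (mean - median) / std dev."""
    return 3 * (mean - median) / math.sqrt(variance)

# Summary statistics quoted in the post.
# Assumption: "dispersion" means variance, so its square root is the std dev.
gpr_skew = pearson_median_skewness(5.05, 5, 7)
atr_skew = pearson_median_skewness(1_245_677.471, 109_922, 5_351_667_608_993.28)

print(round(gpr_skew, 2))  # 0.06
print(round(atr_skew, 2))  # 1.47
```

Both results match the reported 0.06 and 1.47, which supports reading the figures this way.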

Analysis and methods

The initial idea was to test whether ATR actually correlates with GPR. Thus, the null hypothesis H0 was: «no relationship exists between traffic popularity measured by Alexa Traffic Rank and link popularity measured by Google PageRank». The alternative hypothesis Hf was: «there is a correlation between Google PageRank and Alexa Traffic Rank». The purpose of the study was to reject the null hypothesis and to show that there truly is a correlation between the two site success indicators. (I must remind you that the initial CSS Zen Garden vs. Mail.ru encounter that pushed me toward this analysis suggested that there was hardly any correlation between these indicators.)

Simple regression analysis and a t-distribution significance test were used for the study.

Regression analysis

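The two arrays of data (102 observations each) showed a rather high negative correlation. The resulting r (correlation coefficient) was equal to -0.5. The sign is negative because a smaller Alexa rank number means more traffic, so sites with a higher PageRank tend to have a numerically lower (better) ATR. The best-fit line's equation was y = -439630.50x + 3469690.61, with Alexa TR on the Y axis and Google PR on the X axis.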
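To get a feel for what the best-fit line says, here is a small sketch that evaluates it at each possible PageRank value. The slope and intercept are the ones reported above; the function and variable names are my own.

```python
SLOPE = -439630.50      # change in fitted ATR per one PageRank point (from the post)
INTERCEPT = 3469690.61  # fitted ATR at PageRank 0 (from the post)

def predicted_atr(pagerank):
    """Alexa Traffic Rank predicted by the reported best-fit line."""
    return SLOPE * pagerank + INTERCEPT

# Fitted ATR for every possible PageRank value, 0 through 10.
for pr in range(11):
    print(pr, round(predicted_atr(pr)))

# PageRank at which the fitted line crosses zero.
zero_crossing = -INTERCEPT / SLOPE
print(round(zero_crossing, 2))  # 7.89
```

The line drops below zero just under PageRank 8, and no real traffic rank can be negative, so the straight line is best read as a rough trend rather than a usable predictor at the high end of the scale.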

The data points are concentrated vertically at the 11 imaginary lines of PageRank values, because Google's rank only has 11 possible values. This phenomenon creates huge gaps in this discrete data array. Still, the tendency is obvious! There is a strong visible correlation between the two sets of data.
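For readers who want to repeat the regression on their own data, the correlation coefficient and the best-fit line can be computed with ordinary least squares in a few lines. This is a minimal sketch, not the spreadsheet formulas actually used for the study, and the toy numbers below are invented for illustration; they are NOT the study's sample.

```python
import math

def fit_line_and_r(xs, ys):
    """Least-squares slope and intercept, plus Pearson's r, for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Made-up illustration data (PageRank on X, Alexa rank on Y), not the real sample.
toy_gpr = [0, 2, 3, 5, 5, 7, 8]
toy_atr = [9_000_000, 4_000_000, 6_500_000, 900_000, 150_000, 40_000, 13_000]

slope, intercept, r = fit_line_and_r(toy_gpr, toy_atr)
print(f"y = {slope:.2f}x + {intercept:.2f}, r = {r:.3f}")
```

Because PageRank only takes whole values from 0 to 10, repeated X values (such as the two 5s above) are normal and do not affect the calculation.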

Hypothesis testing (significance test)

Regardless of the visual correlation, I had to test whether this was a chance occurrence or a statistically significant phenomenon. As I mentioned earlier, I used the t-distribution for the significance test. The test statistic was t = r/√[(1-r²)/(n-2)], where r is the correlation coefficient and n is the sample size. The number of degrees of freedom is n-2. I used the standard significance level α=0.05. The table value of t for α=0.05 and 100 degrees of freedom was 1.984. Thus, with a two-tailed test, if the absolute value of the calculated t is greater than the tabular t, I can reject the null hypothesis (the hypotheses are described above). The calculated value of t was -5.83125. Since |-5.83125| is greater than 1.984, we reject the null hypothesis and conclude that the correlation between Google PageRank and Alexa Traffic Rank is statistically significant and not a chance occurrence.
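The test above is easy to reproduce. The sketch below implements the test statistic exactly as given and compares it with the table value; the function names are my own. Note that plugging in the rounded r = -0.5 gives a slightly smaller t than the reported -5.83125. Inverting the formula suggests the calculation used an unrounded r of about -0.504, though that is my inference and is not stated in the post.

```python
import math

def t_statistic(r, n):
    """t = r / sqrt((1 - r^2) / (n - 2)), with n - 2 degrees of freedom."""
    return r / math.sqrt((1 - r ** 2) / (n - 2))

def r_from_t(t, n):
    """Invert the formula above: recover r from t and the sample size."""
    return t / math.sqrt(t ** 2 + (n - 2))

N = 102             # sample size from the post
T_CRITICAL = 1.984  # table value quoted in the post (alpha = 0.05, 100 df)

t_rounded = t_statistic(-0.5, N)
print(round(t_rounded, 4))          # -5.7735

implied_r = r_from_t(-5.83125, N)
print(round(implied_r, 4))          # -0.5037

# Two-tailed decision: reject H0 when |t| exceeds the table value.
print(abs(t_rounded) > T_CRITICAL)  # True
```

Either way, the calculated t is far beyond the critical value, so the decision to reject the null hypothesis does not depend on the rounding.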

Outliers and interesting observations

Two potential outliers are at the top of the graph: one at point [6; ~11,000,000] and the other (less likely to be considered an outlier) at [3; ~10,400,000]. These two are potential graphical outliers, visible to the naked eye.

Three questionable points, which are less visible yet very hard to believe, are concentrated near the origin. This is especially true of the one right next to the origin (the bottom point on the Y axis), which is a website with zero PageRank yet relatively high traffic. This is an interesting phenomenon: a popular website with nearly no inlinks. (The site is Red Bean; on the date of measurement, Nov 29th, 2007, it had a PR of 0 and an ATR of 27,214. Yet, at the moment of writing this article, I see it has a PR of 7.)

Conclusion

Regardless of the limitations of the test, the study showed a strong, statistically significant negative relationship between Google PR and Alexa Traffic Rank.

If you notice errors or typos, please leave a comment. The copy of the original spreadsheet is available at Google Spreadsheets.


