Harvesting Wiki Consensus:
Using Wikipedia Entries as Vocabulary for Knowledge Management


On this page, we provide additional information and raw data related to our paper:

IEEE Internet Computing
Martin Hepp, Katharina Siorpaes, Daniel Bachlechner:
Harvesting Wiki Consensus: Using Wikipedia Entries as Vocabulary for Knowledge Management,
IEEE Internet Computing, Vol. 11, No. 5, pp. 54-65, Sept-Oct 2007.

News: Ivan Herman commented on our work in his blog:
http://ivanherman.wordpress.com/2007/10/12/wikipedia-uri-s-as-reliable-identifiers-for-the-semantic-web/

If you have any questions or suggestions, please contact Martin Hepp at mhepp --A-T-- computer --D-O-T-- org.

Updates and Additional Related Work

Sample Data

No URI Proton Top-level Significant change in meaning between first and current version? 
1 http://en.wikipedia.org/wiki/Prisman Agent none
2 http://en.wikipedia.org/wiki/Sullivan_%28town%29%2C_Wisconsin Location none
3 http://en.wikipedia.org/wiki/Alexander_Rybak Person none
4 http://en.wikipedia.org/wiki/Fulgence_Bienven%C3%Bce Person none
5 http://en.wikipedia.org/wiki/Nottawa Location gotBroader
6 http://en.wikipedia.org/wiki/Sensitivity_%28electronics%29 Abstract none
7 http://en.wikipedia.org/wiki/Tasman_Series Event none
8 http://en.wikipedia.org/wiki/Anti-Administration_Party_%28United_States%29 Group none
9 http://en.wikipedia.org/wiki/Geoffrey_de_Mandeville_%2811th_century%29 Person none
10 http://en.wikipedia.org/wiki/Herm%C3%A8s Organization none
11 http://en.wikipedia.org/wiki/Ungku_Tun_Aminah_Mosque Location none
12 http://en.wikipedia.org/wiki/Associated_Students_of_the_University_of_Alaska_Fairbanks Organization none
13 http://en.wikipedia.org/wiki/Delaware_Valley_High_School Organization none
14 http://en.wikipedia.org/wiki/Residents%27_Association_of_London Organization none
15 http://en.wikipedia.org/wiki/List_of_non-fiction_writers Group none
16 http://en.wikipedia.org/wiki/Winning_Colors_%28horse%29 Agent none
17 http://en.wikipedia.org/wiki/John_F._MacArthur Person gotBroader
18 http://en.wikipedia.org/wiki/Kay_Eakin Person none
19 http://en.wikipedia.org/wiki/Hafez_al-Assad Person none
20 http://en.wikipedia.org/wiki/Wu_Yi_%28Shang_dynasty_ruler%29 Person none
21 http://en.wikipedia.org/wiki/Steven_L._Kent Person none
22 http://en.wikipedia.org/wiki/Joachim_Kunz Person none
23 http://en.wikipedia.org/wiki/Oakfield%2C_New_York Location gotBroader
24 http://en.wikipedia.org/wiki/Ugo-Iwaya_Station Location none
25 http://en.wikipedia.org/wiki/New_London_Fire (deleted) (deleted)
26 http://en.wikipedia.org/wiki/Lower_Saxon_Circle Location none
27 http://en.wikipedia.org/wiki/C.V.Padmarajan Person none
28 http://en.wikipedia.org/wiki/Prairie_Dunes_Country_Club Location none
29 http://en.wikipedia.org/wiki/Slater%2C_Colorado Location none
30 http://en.wikipedia.org/wiki/Audit_evidence InformationResource none
31 http://en.wikipedia.org/wiki/I%27ve_Got_a_Feeling Product none
32 http://en.wikipedia.org/wiki/Kitahiroshima%2C_Hokkaid%C5%8D Location none
33 http://en.wikipedia.org/wiki/Carpi GeneralTerm gotBroader
34 http://en.wikipedia.org/wiki/Yahud GeneralTerm gotBroader
35 http://en.wikipedia.org/wiki/Diu_Indo-Portuguese_language Language none
36 http://en.wikipedia.org/wiki/St_Albans_High_School_for_Girls Organization none
37 http://en.wikipedia.org/wiki/Bernhard_Russi Person none
38 http://en.wikipedia.org/wiki/Feedback_%28Jurassic_5_album%29 Product none
39 http://en.wikipedia.org/wiki/Sail_%28letter%29 GeneralTerm none
40 http://en.wikipedia.org/wiki/Metro_Observatorio Location none
41 http://en.wikipedia.org/wiki/W._W._Herenton Person none
42 http://en.wikipedia.org/wiki/Yamagata-Lobanov_Agreement Document none
43 http://en.wikipedia.org/wiki/Simple_Mail_Access_Protocol Document none
44 http://en.wikipedia.org/wiki/Leif_Johansson_%28tennis_player%29 Person none
45 http://en.wikipedia.org/wiki/Outside_the_Atmosphere Product none
46 http://en.wikipedia.org/wiki/Tiran%C3%AB_District Location none
47 http://en.wikipedia.org/wiki/Gaze_%28Death_Note_episode%29 Product none
48 http://en.wikipedia.org/wiki/The_Day_I_Met_God Product none
49 http://en.wikipedia.org/wiki/James_Packer Person none
50 http://en.wikipedia.org/wiki/FN_FNC Product none
51 http://en.wikipedia.org/wiki/Roman_Catholicism_in_Sudan Organization none
52 http://en.wikipedia.org/wiki/Criticism_of_debt Topic none
53 http://en.wikipedia.org/wiki/List_of_burials_in_the_Valley_of_the_Kings Event none
54 http://en.wikipedia.org/wiki/Prokop_%28Belgrade%29 Location none
55 http://en.wikipedia.org/wiki/Naam Event none
56 http://en.wikipedia.org/wiki/Air_Mauritius_destinations Location none
57 http://en.wikipedia.org/wiki/Wish GeneralTerm none
58 http://en.wikipedia.org/wiki/List_of_Chicago_Venture_Capital_Companies Organization none
59 http://en.wikipedia.org/wiki/Vauxhall_Prince_Henry Product none
60 http://en.wikipedia.org/wiki/Storv%C3%A4ttesh%C3%A5gna Location none
61 http://en.wikipedia.org/wiki/1922_Land_Code Document none
62 http://en.wikipedia.org/wiki/Maryetta%2C_Oklahoma Location none
63 http://en.wikipedia.org/wiki/Camp_Falcon Location none
64 http://en.wikipedia.org/wiki/Furneaux_Group Location none
65 http://en.wikipedia.org/wiki/National_Academy_of_Writing Organization none
66 http://en.wikipedia.org/wiki/Nasty_Boys_%28Cincinnati_Reds%29 Group none
67 http://en.wikipedia.org/wiki/Lupin_the_3rd:_Treasure_of_the_Sorcerer_King Product none
68 http://en.wikipedia.org/wiki/Ransdell Group gotBroader
69 http://en.wikipedia.org/wiki/Linda_P%C3%A9tursd%C3%B3ttir Person none
70 http://en.wikipedia.org/wiki/Park_Trammell Person none
71 http://en.wikipedia.org/wiki/1971_in_spaceflight Event none
72 http://en.wikipedia.org/wiki/Subject_to_Change_%28album%29 Product none
73 http://en.wikipedia.org/wiki/NCR_315-RMC Product none
74 http://en.wikipedia.org/wiki/July_7%2C_2003 Event none
75 http://en.wikipedia.org/wiki/Nordseter Location none
76 http://en.wikipedia.org/wiki/Robin_Dunbar Person none
77 http://en.wikipedia.org/wiki/Pittsboro Location none
78 http://en.wikipedia.org/wiki/Torremenga Location none
79 http://en.wikipedia.org/wiki/KBTC-TV Organization none
80 http://en.wikipedia.org/wiki/Diocese_of_Edinburgh Organization none
81 http://en.wikipedia.org/wiki/The_Dust_Collectors Group none
82 http://en.wikipedia.org/wiki/Escape_from_Jupiter Product none
83 http://en.wikipedia.org/wiki/Politics_of_Brazil Topic none
84 http://en.wikipedia.org/wiki/Lee_High_School Organization gotBroader
85 http://en.wikipedia.org/wiki/RTL_Z Organization none
86 http://en.wikipedia.org/wiki/Maurizio_De_Santis Person none
87 http://en.wikipedia.org/wiki/George_Fox_Evangelical_Seminary Organization none
88 http://en.wikipedia.org/wiki/Our_Hearts_Keep_Singing Product none
89 http://en.wikipedia.org/wiki/Bradley%2C_Maine Location none
90 http://en.wikipedia.org/wiki/Yue_Fei Person none
91 http://en.wikipedia.org/wiki/Imparfait GeneralTerm none
92 http://en.wikipedia.org/wiki/Joseph_Thomson_%28explorer%29 Person none
93 http://en.wikipedia.org/wiki/T._J._Bohn Person none
94 http://en.wikipedia.org/wiki/Girard%2C_Texas Location none
95 http://en.wikipedia.org/wiki/Barry_Bertram Person none
96 http://en.wikipedia.org/wiki/Clara_Nunes Person none
97 http://en.wikipedia.org/wiki/Don_Cupitt Person none
98 http://en.wikipedia.org/wiki/Thai_Yuan Group none
99 http://en.wikipedia.org/wiki/Bally Organization none
100 http://en.wikipedia.org/wiki/Japanese_Filipino Group none
101 http://en.wikipedia.org/wiki/Another_World_%28Brian_May_album%29 Product none
102 http://en.wikipedia.org/wiki/159P/LONEOS Object none
103 http://en.wikipedia.org/wiki/Blaze_of_Glory Product none
104 http://en.wikipedia.org/wiki/List_of_country_articles_containing_postal_sections InformationResource none
105 http://en.wikipedia.org/wiki/Mirai GeneralTerm gotBroader
106 http://en.wikipedia.org/wiki/Hollis%2C_Maine Location none
107 http://en.wikipedia.org/wiki/Area_code_641 Location none
108 http://en.wikipedia.org/wiki/MSN_Hotmail Service none
109 http://en.wikipedia.org/wiki/Richard_Seymour Person none
110 http://en.wikipedia.org/wiki/Joystick_%28disambiguation%29 GeneralTerm gotBroader
111 http://en.wikipedia.org/wiki/Off_%28album%29 Product none
112 http://en.wikipedia.org/wiki/Italia_1 Organization none
113 http://en.wikipedia.org/wiki/Zoruru Agent none
114 http://en.wikipedia.org/wiki/Craig_Anthony_Washington Person none
115 http://en.wikipedia.org/wiki/Atle_Antonsen Person none
116 http://en.wikipedia.org/wiki/Titania%27s_Palace Object none
117 http://en.wikipedia.org/wiki/Adelaide_Symphony_Orchestra Organization none
118 http://en.wikipedia.org/wiki/Santa_Isabel_do_Rio_Negro Location none
119 http://en.wikipedia.org/wiki/Florence_Freedom Organization none
120 http://en.wikipedia.org/wiki/Enfield_Lock_railway_station Location none
121 http://en.wikipedia.org/wiki/English_cricket_team_in_Australia_in_2002-03 Happening none
122 http://en.wikipedia.org/wiki/Barbania Location none
123 http://en.wikipedia.org/wiki/Number-one_dance_hits_of_2000_%28USA%29 Product none
124 http://en.wikipedia.org/wiki/Petrovsky_Stadium Location none
125 http://en.wikipedia.org/wiki/Brockville%2C_Ontario Location none
126 http://en.wikipedia.org/wiki/Welcome_Back_My_Friends_to_the_Show_That_Never_Ends Product none
127 http://en.wikipedia.org/wiki/Systematized_Nomenclature_of_Medicine InformationResource none
128 http://en.wikipedia.org/wiki/Greame_Carey Person none
129 http://en.wikipedia.org/wiki/Robert_Kelly_%28comics%29 Agent none
130 http://en.wikipedia.org/wiki/G%C3%B6teborgs_BK Organization none
131 http://en.wikipedia.org/wiki/Luke_Dingle Agent none
132 http://en.wikipedia.org/wiki/Eberhard_IV%2C_Count_of_W%C3%BCrttemberg Person none
133 http://en.wikipedia.org/wiki/Ali_of_Morocco Group none
134 http://en.wikipedia.org/wiki/Parksville%2C_Kentucky Location none
135 http://en.wikipedia.org/wiki/180th_%28Sportsmen%29_Battalion%2C_CEF Group none
136 http://en.wikipedia.org/wiki/Adam_Kasper Person none
137 http://en.wikipedia.org/wiki/Nicholas_of_Flue Person none
138 http://en.wikipedia.org/wiki/Cass_County%2C_Texas Location none
139 http://en.wikipedia.org/wiki/Custodian GeneralTerm gotBroader
140 http://en.wikipedia.org/wiki/Halina_Rozpondek Person none
141 http://en.wikipedia.org/wiki/Quintus_Servilius_Caepio_the_Younger Person none
142 http://en.wikipedia.org/wiki/Westville%2C_Illinois Location none
143 http://en.wikipedia.org/wiki/Humanity_%28The_Mad_Capsule_Markets_album%29 Product none
144 http://en.wikipedia.org/wiki/Rancho_Banquete%2C_Texas Location none
145 http://en.wikipedia.org/wiki/Jean_Van_Hamme Person none
146 http://en.wikipedia.org/wiki/Nekromanteia Product none
147 http://en.wikipedia.org/wiki/Gaijin_tarento GeneralTerm none
148 http://en.wikipedia.org/wiki/The_Everlasting_Blink Product none
149 http://en.wikipedia.org/wiki/Nana_Glen%2C_New_South_Wales Location none
150 http://en.wikipedia.org/wiki/Dierbergs Organization none
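
The sample data can be summarized with a short script. The following is a minimal sketch, assuming the rows are available as whitespace-separated text in the layout shown above (number, URI, Proton class, change flag). The function name `tally_sample` and the three-row excerpt are illustrative only and are not part of the original analysis.

```python
from collections import Counter

# Three rows taken from the sample table (whitespace normalized);
# the full table has 150 rows.
ROWS = """\
1 http://en.wikipedia.org/wiki/Prisman Agent none
5 http://en.wikipedia.org/wiki/Nottawa Location gotBroader
25 http://en.wikipedia.org/wiki/New_London_Fire (deleted) (deleted)
"""


def tally_sample(text):
    """Count Proton classes and change flags in whitespace-separated rows."""
    classes = Counter()
    changes = Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        _number, _uri, proton_class, change = fields
        classes[proton_class] += 1
        changes[change] += 1
    return classes, changes


classes, changes = tally_sample(ROWS)
print(classes)
print(changes)
```

Applied to all 150 rows, the second counter gives the number of entries marked gotBroader versus none.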


Quality of the Random Page functionality

For the "random page" and "random redirect" function, MediaWiki uses the Mersenne Twister pseudo-random number generator, which has been described in the following paper.

M. Matsumoto and T. Nishimura, "Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator", ACM Transactions on Modeling and Computer Simulation, Vol. 8, No. 1, pp. 3-30, January 1998.

We are confident that the distribution properties of this random number source are of sufficient quality for a statistical analysis of our kind, in particular since the results are very clear.
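
To illustrate what drawing a uniform sample with a Mersenne Twister generator looks like, here is a minimal sketch in Python, whose standard `random` module is also based on the Mersenne Twister. It does not reproduce how MediaWiki selects a random page internally. The function name `draw_sample`, the population size, and the seed are hypothetical values chosen for illustration.

```python
import random


def draw_sample(population_size, n, seed=None):
    """Draw n distinct page indices uniformly at random, without replacement."""
    # CPython's random.Random is a Mersenne Twister (MT19937) generator.
    rng = random.Random(seed)
    return rng.sample(range(population_size), n)


# Hypothetical population of one million pages; 150 matches our sample size.
sample = draw_sample(population_size=1_000_000, n=150, seed=42)
print(len(sample), len(set(sample)))
```

Fixing the seed makes the draw reproducible, which is convenient when a sample list has to be revisited later.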

Redirection Pages

The population for our original sample excludes redirection pages, i.e. URIs that automatically forward a user to another Wikipedia page. Note that a redirection page is always recognizable as such (the MediaWiki software shows “Redirected from xyz” at the top of the page). Users are thus able to spot that the page retrieved is not the original content for that URI, but a related substitute. Most redirects just consolidate spelling variants or synonyms for the same page title. In the context of our analysis, a Wiki entry that turns into a redirect is a special type of change.

In order to analyse the impact of redirects in Wikipedia more closely, we took another representative sample of n=50 from the roughly 700,000 redirection pages using the "random redirect" function. The sample shows the following results:
•    78% of the redirection pages are obvious synonyms (in particular spelling variants or changes in word order of composite words),
•    12% reflect pages for which the content was integrated into other pages,
•    for 10%, we could not quickly identify the semantic relationship (we also did not try very hard ;-)).
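
Since these shares come from a sample of only n=50, it can be useful to attach confidence intervals to them. The following is a minimal sketch using the Wilson score interval, a standard technique that was not part of the original analysis. The absolute counts (39, 6, 5) are derived here from the reported percentages and n=50; the function name `wilson_interval` and the category labels are illustrative.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Return the Wilson score interval for a proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Counts derived from the reported shares of the n=50 sample (assumption).
COUNTS = {"synonym": 39, "merged into other page": 6, "unclear": 5}
N = sum(COUNTS.values())

for label, count in COUNTS.items():
    low, high = wilson_interval(count, N)
    print(f"{label}: {count / N:.0%} (approx. 95% CI {low:.1%} to {high:.1%})")
```

With a sample this small the intervals are fairly wide, so the percentages are best read as rough magnitudes rather than precise estimates.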

With regard to the impact on our analysis, we can observe the following: First, for the vast majority (78%) of all URIs that represent redirects, there is no semantic difference, since they are synonyms. For the remaining 22% (10% + 12%) of the redirects, semantic differences between the original URI and the target of the redirect cannot be excluded. In 12% of the cases, the redirect points to a page that incorporates the original content in a larger article.
 
Example: The page for “Ernie” may be redirected to the page for “Sesame Street”.

Even such a change does not invalidate our thesis of stable meaning for URIs, since a redirection page is always recognizable as such. It is thus possible to avoid being misled about the intended meaning of a URI, and http://en.wikipedia.org/Ernie will still identify Ernie and not Sesame Street. Furthermore, when a small page is integrated into a larger one via a redirect, the redirect often uses a fragment identifier in the target page (e.g. http://en.wikipedia.org/wiki/Sesame_Street#The_Muppets); that is, the redirect still points to a retrievable description of the original meaning.

Based on this analysis, we can safely suggest representing Wikipedia redirects as rdfs:seeAlso relations. If precision is not an issue, one could even treat them as statements of equivalence and would be right in about 78% of the cases.
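
As an illustration of this suggestion, the following minimal sketch emits one N-Triples statement per redirect. The function name `redirect_triple` and the example URIs are hypothetical. Using owl:sameAs for the "statement of equivalence" variant is our assumption here; the text above does not prescribe a particular property for equivalence.

```python
RDFS_SEE_ALSO = "http://www.w3.org/2000/01/rdf-schema#seeAlso"
# Assumption: owl:sameAs stands in for a "statement of equivalence".
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def redirect_triple(source_uri, target_uri, treat_as_equivalent=False):
    """Format a redirect as a single N-Triples statement."""
    predicate = OWL_SAME_AS if treat_as_equivalent else RDFS_SEE_ALSO
    return f"<{source_uri}> <{predicate}> <{target_uri}> ."


# Hypothetical redirect, following the "Ernie" example.
print(redirect_triple(
    "http://en.wikipedia.org/wiki/Ernie",
    "http://en.wikipedia.org/wiki/Sesame_Street#The_Muppets",
))
```

Defaulting to rdfs:seeAlso keeps the statement safe for the redirects whose target is not a plain synonym of the source.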

Impact of Deleted Pages

The one deleted entry in our sample was deleted in the very short time frame between the creation of the random sample list (afternoon of January 11, 2007) and the completion of the analysis of the pages in the sample (January 13, 2007). The random page function ignores deleted pages, and we intended to ignore deleted pages, too, because Wikipedia deletes pages following a well-defined process, justified either by an obvious violation of the rules for appropriate content or by social agreement that a page does not have sufficient encyclopedic relevance.

For our analysis, a deleted page does not necessarily (or even frequently) imply that the meaning associated with the URI has changed. It only means that there is no longer a retrievable resource for that URI. In fact, one can observe that people sometimes repeatedly try to create a page for the same topic. For such cases, Wikipedia even maintains “blocking lists” that contain about 200 [1] and about 900 [2] URIs, respectively, for which Wikipedia does not want pages ever to be created.

The one page in our sample that was deleted refers to the band “New London Fire”. A look into the deletion log of Wikipedia shows that there have been three attempts to create such a page (deleted on June 9, 2006; on October 6, 2006; and on January 12, 2007). We assume that the three attempts referred to the same band. If that is the case, then even this deleted URI in a certain way continues to refer to the same conceptual entity.

While Wikipedia does not provide any data for the duration of existence of deleted pages, we assume that most pages are deleted very early in their life, i.e. exist only for a very short period of time.

[1] http://en.wikipedia.org/wiki/Category:Protected_deleted_pages
[2] http://en.wikipedia.org/wiki/Wikipedia:Protected_titles


This URI: http://www.heppnetz.de/harvesting-wikipedia/
Last update: March 15, 2008
