Get Data From Wikipedia
Gems list:
require "rubygems"
require "rdf" : http://rubydoc.info/github/ruby-rdf/rdf/master/frames
require "linkeddata" : http://rdf.greggkellogg.net/yard/index.html
require "active_record"
require "uri"
require "json"
require "rest_client" : http://rubydoc.info/gems/rest-client/1.6.7/frames
require "escape_utils" : http://rubydoc.info/gems/escape_utils/0.2.4/frames
require "date"
This script gets data from Wikipedia (in fact, from DBpedia) and updates the English part from live.dbpedia.org.
Take “http://dbpedia.org/page/Sergio_Agüero” as an example.
By the way, this link encodes to “http://dbpedia.org/page/Sergio_Ag%C3%BCero”, and if we want its RDF format, check
“http://dbpedia.org/resource/Sergio_Ag%C3%BCero” or
“http://dbpedia.org/data/Sergio_Ag%C3%BCero.rdf”,
and its live-site link is “http://live.dbpedia.org/data/Sergio_Ag%C3%BCero.json”, for JSON.
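The encoding step is plain percent-encoding; for example, with the stdlib's URI.encode/URI.decode from the Ruby of that era (deprecated in newer Rubies):

```ruby
require "uri"

link = "http://dbpedia.org/page/Sergio_Agüero"
URI.encode(link)   # => "http://dbpedia.org/page/Sergio_Ag%C3%BCero"
URI.decode("http://dbpedia.org/page/Sergio_Ag%C3%BCero")
# => "http://dbpedia.org/page/Sergio_Agüero"
```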
How To:
Where do we get data from Wikipedia?
Actually, we can use DBpedia. DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. Take a look at “http://en.wikipedia.org/wiki/Sergio_Ag%C3%BCero”, “http://live.dbpedia.org/page/Sergio_Ag%C3%BCero”, and “http://dbpedia.org/page/Sergio_Ag%C3%BCero”. The “live” one is just for English.
So now, from DBpedia, we can get JSON- and RDF-format data from different URLs, such as “http://dbpedia.org/data/Sergio_Agüero.rdf” and “http://dbpedia.org/data/Sergio_Agüero.json”, and the data or file will be returned to us.
How do we get data from DBpedia?
But how can we deal with the data? Try this:
First: use the “rest-client” gem and the JSON link (“http://dbpedia.org/data/Sergio_Agüero.json”) to get the JSON: json_hash = JSON.parse(RestClient.get(json_link)). If the link's encoding needs changing, add a loop and a check to decode it, e.g. URI.decode(json_link) (a full sketch follows below).
Second: match and catch the data. For example:

redirects = json_hash[search_url][match_info]
abstract = json_hash[search_url][match_info]

where match_info is a predicate URI that can be taken from the “rdf”-format webpage (for example, “http://dbpedia.org/ontology/wikiPageRedirects” for redirects, or “http://dbpedia.org/ontology/abstract” for the abstract).
Then we can get what we want. The “live” site should be newer for English, so replace the English data with it.
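Putting both steps together, a minimal sketch (the predicate URIs are DBpedia ontology terms used as examples, and the top-level key may be the encoded or the decoded URI, so we try both):

```ruby
require "rubygems"
require "rest_client"
require "json"
require "uri"

search_url = "http://dbpedia.org/resource/Sergio_Agüero"
json_link  = URI.encode(search_url.sub("/resource/", "/data/") + ".json")

json_hash = JSON.parse(RestClient.get(json_link))

# Top-level keys are resource URIs; each value maps predicate URIs
# (the match_info above) to arrays of values.
resource = json_hash[search_url] || json_hash[URI.encode(search_url)] || {}

redirects = resource["http://dbpedia.org/ontology/wikiPageRedirects"]
abstracts = resource["http://dbpedia.org/ontology/abstract"]
# Each abstract entry is a hash with "lang" and "value" keys.
```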
Part one: logic
csvparser.rb
This script reads a CSV file, which has identity and wiki_link columns, then updates each identity (update_identity(wiki_data_hash)) through the ??API.
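A minimal sketch of that loop, assuming header columns named identity and wiki_link (update_identity is the script's own function, only called here):

```ruby
require "csv"

CSV.foreach("identities.csv", :headers => true) do |row|
  wiki_data_hash = {
    "identity"  => row["identity"],
    "wiki_link" => row["wiki_link"]
  }
  update_identity(wiki_data_hash)  # push the update to the ??API
end
```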
wikibot.rb
This script's fetch_wiki_urls creates wiki_identity_list (a hash of identities); each identity has the keys “id”, “name”, and “wiki_url”.
Then parse_wiki_json_url(search_url) (take http://dbpedia.org/resource/Sergio_Ag%C3%BCero as search_url) first changes search_url into a JSON URL and parses it into a JSON hash, then matches the abstract (for “ru” and “zh”; “en” comes from get_live_english_abstract()) and the name.
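A sketch of what that could look like, assuming the DBpedia JSON layout from the examples above (the predicate URIs and the /resource/ → /data/ rewrite are illustrative):

```ruby
def parse_wiki_json_url(search_url)
  # search_url is already percent-encoded, e.g.
  # http://dbpedia.org/resource/Sergio_Ag%C3%BCero
  json_url  = search_url.sub("/resource/", "/data/") + ".json"
  json_hash = JSON.parse(RestClient.get(json_url))
  resource  = json_hash[search_url] || json_hash[URI.decode(search_url)] || {}

  abstracts = resource["http://dbpedia.org/ontology/abstract"] || []
  result    = {}
  %w(ru zh).each do |lang|
    entry = abstracts.find { |a| a["lang"] == lang }
    result[lang] = entry["value"] if entry
  end
  # English comes from the live site instead, since it is newer there.
  result["en"] = get_live_english_abstract(search_url.split("/").last)

  names = resource["http://xmlns.com/foaf/0.1/name"] || []
  result["name"] = names.first["value"] if names.first
  result
end
```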
self.find(search_url)
first loads the RDF page, then uses load_rdf_query_list
to get the other data, like “name”, “birthDate”, “deathDate”, “abstract”, “thumbnail”, “subject”, and “type”, splitting it into two parts, “data_query_list” and “data_name_list”.
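A minimal sketch of that flow (the predicate URIs below are common DBpedia/FOAF terms used as examples; the real lists live in load_rdf_query_list):

```ruby
require "rdf"
require "linkeddata"

graph = RDF::Graph.load("http://dbpedia.org/data/Sergio_Ag%C3%BCero.rdf")

data_query_list = {
  "name"      => RDF::URI.new("http://xmlns.com/foaf/0.1/name"),
  "birthDate" => RDF::URI.new("http://dbpedia.org/ontology/birthDate"),
  "abstract"  => RDF::URI.new("http://dbpedia.org/ontology/abstract")
}

data_query_list.each do |key, predicate|
  graph.query(:predicate => predicate) do |statement|
    puts "#{key}: #{statement.object}"
  end
end
```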
run_category_create_update(category_list)
lists all the categories from the webpage and uses Category.find(category_id)
to check whether those categories from the website exist in the ??API; if not, it creates and saves them. In effect, categories are updated whenever there are new ones.
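A sketch of the check-then-create logic, assuming an ActiveRecord-style Category model behind the ??API (the field names are assumptions):

```ruby
def run_category_create_update(category_list)
  category_list.each do |category_id, category_name|
    begin
      Category.find(category_id)          # already in the ??API, nothing to do
    rescue ActiveRecord::RecordNotFound
      category = Category.new(:name => category_name)
      category.id = category_id           # keep the website's own id
      category.save                       # new category: create and save it
    end
  end
end
```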
run_identity_create_update(identity, final_output, category_id_list_identity)
combines all the data from the functions above, then saves the identity to the ??API.
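A sketch of that last step, assuming ActiveRecord-style assignment (category_ids is an assumption about how identities and categories are linked):

```ruby
def run_identity_create_update(identity, final_output, category_id_list_identity)
  identity.attributes   = final_output               # merge the matched wiki data
  identity.category_ids = category_id_list_identity  # attach the categories from above
  identity.save                                      # write the identity to the ??API
end
```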
Part two: wikibot functions
Function 1: update_log_file, for updating the log and writing it to log.txt.
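Something as small as this would do (the message format is an assumption):

```ruby
def update_log_file(message)
  File.open("log.txt", "a") do |file|   # append, never overwrite
    file.puts "#{Time.now} #{message}"
  end
end
```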
Function 2: get_live_english_abstract(name), for getting the English abstract from the live site.
This function uses RestClient.get(json_url)
to get the JSON data, parses it with JSON.parse(json_data),
and then matches the abstract.
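A sketch under the same assumptions as before (live.dbpedia.org keys its JSON by the dbpedia.org resource URI, and each abstract entry carries “lang” and “value”):

```ruby
def get_live_english_abstract(name)
  json_url  = "http://live.dbpedia.org/data/#{name}.json"
  json_data = RestClient.get(json_url)
  json_hash = JSON.parse(json_data)

  resource  = json_hash["http://dbpedia.org/resource/#{name}"] || {}
  abstracts = resource["http://dbpedia.org/ontology/abstract"] || []
  entry     = abstracts.find { |a| a["lang"] == "en" }
  entry && entry["value"]
end
```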
Function 3: self.find(search_url), which first uses “RDF::Graph.load(search_url)” to load the RDF page, then uses Function 4, load_rdf_query_list,
to help query the data, which becomes the output.
Function 4: load_rdf_query_list, which creates data_query_list and data_name_query_list.
Function 5: abstract_subject_live(search_url), to get the English abstract text content from the live site.
Function 6: run_category_create_update(category_list), which takes the category list from its parameter and checks the categories against the ??API using Category.find(category_id);
if a category does not exist, it creates and updates it.
Function 7: run_identity_create_update(identity, final_output, category_id_list_identity), which matches the data and updates it. (Saving “external_links” doesn't work?)
Function 8: fetch_wiki_urls_dummy,
which fetches wiki URLs; here we can use a = identity_id to test. It finds the identity with find,
then matches and gives us the wiki_urls.
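A sketch of the dummy version, assuming an ActiveRecord-style Identity model with a wiki_url field:

```ruby
def fetch_wiki_urls_dummy(identity_id)
  identity = Identity.find(identity_id)   # look up a single identity by id for testing
  [{ "id" => identity.id, "name" => identity.name, "wiki_url" => identity.wiki_url }]
end
```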
Function 9: fetch_wiki_urls, which works the same as fetch_wiki_urls_dummy, but it gets the identities itself, using wiki_urls.length
as a limit.
Function 10: parse_wiki_json_url(search_url),
which first changes search_url into a JSON URL and parses it into a JSON hash, then matches the abstract (for “ru” and “zh”; “en” comes from get_live_english_abstract())
and the name.