Short talk by Ian Ritchie, who in retrospect wouldn't have turned TimBL down, I guess.
Instant classic due to his closing words: “But is he happy?”
TED: Hans Rosling – Religions and Babies
Heh, Hans Rosling did it again… fascinating stats, explaining the world's population growth (the big fill-up) with some leftover boxes.
http://www.ted.com/talks/hans_rosling_religions_and_babies.html
Setting up a local DBpedia 3.7 mirror with Virtuoso 6.1.5+
Newer version available: Setting up a Linked Data mirror from RDF dumps (DBpedia 2015-04, Freebase, Wikidata, LinkedGeoData, …) with Virtuoso 7.2.1 and Docker (optional)
Nearly 1.5 years after I initially published a post about how to set up a local DBpedia mirror, I recently revisited the problem myself to set up a local mirror of DBpedia 3.7.
Unlike with the previous updates, so many things have changed that I decided to put them into a separate post instead of continuing to update the old one and making it ever more complicated.
Two of the most notable changes: Virtuoso 6.1.5+ includes a setting that makes the importer more robust, so repacking the files is no longer needed, and DBpedia 3.7 now also provides internationalized versions, which causes a couple of problems / inconsistencies.
In this step-by-step guide I'll show you how to install a local mirror of DBpedia 3.7 hosting a combination of the regular English and the i18n German datasets, adding up to nearly half a billion triples!
Let’s jump in.
Versions
- DBpedia 3.7 + 3.7-i18n datasets
- Virtuoso 6.1.5+ (actually a 6.1.6-dev version with some bugfixes for the DBpedia VAD files, as detailed below)
- Ubuntu 12.04 LTS
Prerequisites
A strong machine with root access and enough RAM: we used a VM with 4 cores and 32 GB of RAM. For installing I recommend more than 128 GB of free disk space, especially for downloading and repacking the datasets, as well as for the growing database file when importing (mine grew to 45 GB).
Let’s go
Download and install virtuoso
Go and download virtuoso opensource:
Initially I started this guide with virtuoso-opensource-6.1.5 from http://sourceforge.net/projects/virtuoso/, but later in the process I ran into problems with the DBpedia VAD file, which is used for resolving and content negotiation of instances via HTTP. If you only intend to use the SPARQL endpoint, you can download that version, but if you want to be able to actually resolve the local versions of pages like http://dbpedia.org/resource/Kaiserslautern with content negotiation, you need a version with some bugfixes from GitHub:
https://github.com/openlink/virtuoso-opensource/tree/674df8668d7dd3018b3a8a14c23702c583d64961. If 6.1.6 has been officially released by the time you read this, I'd probably use the official release.
To get that version from github do the following:
cd ~
git clone git://github.com/openlink/virtuoso-opensource.git virtuoso-opensource
cd virtuoso-opensource
# the following command will actually set your working directory to the
# correct revision
git checkout 674df8668d7dd3018b3a8a14c23702c583d64961
./autogen.sh
(Skip the following if you got the version from GitHub.)
If you downloaded one of the release .tar.gz files instead: put the file in your home dir on the server, then do the following.
cd ~
tar -xvzf virtuoso-*
cd virtuoso-opensource-6.1.5 # or 6.1.6, depending what you got
Alright, now, no matter how you got your Virtuoso version, do the following to install the prerequisites and then build Virtuoso:
sudo aptitude install libxml2-dev libssl-dev autoconf libgraphviz-dev \
    libmagickcore-dev libmagickwand-dev dnsutils gawk bison flex gperf
export CFLAGS="-O2 -m64"
./configure --with-layout=debian
# NOTICE: this will _not_ install into /usr/local but into /usr
# (so might clash with packages by your distribution if you install
# "the" virtuoso package)
# You'll find the db in /var/lib/virtuoso/db !
# check output for errors and FIX THEM! (e.g., install missing packages)
make -j5
This will take about 1 hour. In parallel, you might want to start with downloading the DBpedia files (next section) and come back.
sudo make install
Now change the following values in /var/lib/virtuoso/db/virtuoso.ini; the performance tuning values follow http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtRDFPerformanceTuning:
# note: virtuoso ignores lines starting with whitespace
[Parameters]
# you need to include the directory where your datasets will be downloaded
# to, in our case /usr/local/data/datasets:
DirsAllowed = ., /usr/share/virtuoso/vad, /usr/local/data/datasets
# IMPORTANT: for performance also do this
[Parameters]
# the following two are as suggested by comments in the original .ini
# file in order to use the RAM on your server:
NumberOfBuffers = 2720000
MaxDirtyBuffers = 2000000
# each buffer caches a 8K page of data and occupies approx. 8700 bytes of
# memory. It's suggested to set this value to 65 % of ram for a db only server
# so if you have 32 GB of ram: 32*1000^3*0.65/8700 = 2390804
# default is 2000 which will use 16 MB ram ;)
# Make sure to remove whitespace if you uncomment existing lines!
[Database]
MaxCheckpointRemap = 625000
# set this to 1/4th of NumberOfBuffers
[SPARQL]
# I like to increase the ResultSetMaxrows, MaxQueryCostEstimationTime
# and MaxQueryExecutionTime drastically as it's a local store where we
# do quite complex queries... up to you (don't do this if a lot of people
# use it).
# In any case for the importer to be more robust add the following setting
# to this section:
ShortenLongURIs = 1
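If you want to derive NumberOfBuffers and MaxDirtyBuffers from your machine's actual RAM instead of copying my values, here is a minimal sketch (assuming a Linux system with /proc/meminfo; the 65 % and ~8700 bytes per buffer figures are the ones from the comments above, and MaxDirtyBuffers is set to roughly 3/4 of NumberOfBuffers, mirroring the ratio used above):
awk '/MemTotal/ {
  buffers = int($2 * 1024 * 0.65 / 8700);
  printf "NumberOfBuffers = %d\nMaxDirtyBuffers = %d\n", buffers, int(buffers * 3 / 4)
}' /proc/meminfo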
The next step installs an init script (for autostart) and starts the Virtuoso server. (If you changed directories to edit /var/lib/virtuoso/db/virtuoso.ini, go back to the Virtuoso source dir first!):
sudo cp debian/init.d /etc/init.d/virtuoso-opensource &&
sudo chmod a+x /etc/init.d/virtuoso-opensource &&
sudo bash debian/virtuoso-opensource.postinst.debhelper
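To quickly check that the server is up (the postinst above should have started it already; 8890 is the default HTTP port, and the ASK query is just a cheap liveness test):
sudo /etc/init.d/virtuoso-opensource start   # harmless if it is already running
curl -sG http://localhost:8890/sparql --data-urlencode 'query=ASK {}'
If you get an answer instead of a connection error, Virtuoso is up and listening.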
Downloading the DBpedia dump files and a word about problems / inconsistencies in them
DBpedia 3.7 is split into two separate datasets: a standard version and an i18n version. The standard version mints URIs by going through all English Wikipedia articles. For all of these, the cross-language links are used to extract corresponding labels for the en URIs. This is problematic, as articles that only exist in the German Wikipedia, for example, won't be extracted. To solve this problem the i18n version exists and creates IRIs of the form de.dbpedia.org/… for every article in the German Wikipedia. There are also interlinking datasets providing owl:sameAs links between the new IRIs and the corresponding ones in the other datasets. Note that the i18n IDs for concepts are IRIs, while the ones derived from the English Wikipedia are URIs. Also, even though the i18n dataset includes all languages, only the Greek (el), German (de) and Russian (ru) Wikipedias have minted their own IRIs. The others are broken… they use URIs starting with http://dbpedia.org but are linked to their corresponding language codes in the interlanguage links (e.g., the French interlanguage links falsely point to fr.dbpedia.org). So it's a mess! If you have a cleaned version of the datasets let us know, or just wait for DBpedia 3.8 as we all do 😉
Besides that, the el, de and ru i18n files ending in .nt.gz are actually not valid NT files, because the IRIs in them are UTF-8 encoded. After finding this out I simply renamed all the German files to .n3.gz. As Virtuoso actually uses a Turtle (TTL) parser even for NT files (NT being a subset of Turtle, which itself is a subset of N3), I guess that renaming wasn't all that important for Virtuoso. Still, I had a bad feeling about having files with wrong extensions flying around.
We decided that we only needed the German and English files in NT format. If you need something different, do so (and maybe report back if there were problems and how you solved them). If you decide to download the all-languages tar, make sure to exclude the NQ files from the later importing steps. One simple way to do this is to move everything you don't want to import out of the directory. Also don't forget to import the dbpedia_3.7.owl file (last step in the script below)!
Another hint: Virtuoso can only import plain (uncompressed) or gzipped files, but the DBpedia dumps are bzipped, so you either repack them into gzip format or extract them. On our server the importing procedure was noticeably slower from extracted files than from gzipped ones (ignoring the vast amount of wasted disk space for the extracted files): file access becomes a bottleneck if you have 4 cores idling. This is why I decided to repack all the files from bz2 to gz. As you can see below, I repack the en and de files in parallel; if that's not suitable for you, feel free to change it. You might also want to change this if you want to repack in parallel to downloading. The repacking below took about 1 hour but was worth it in the end. The more CPUs you have, the more you can parallelize this process (see the sketch after the script).
sudo -i # get root
# see comment above, you could also get the all_language.tar or another DBpedia version...
mkdir -p /usr/local/data/datasets/dbpedia/3.7/3.7/en
cd /usr/local/data/datasets/dbpedia/3.7/3.7/en
wget -r -np -nd -nc -A'*.nt.bz2' http://downloads.dbpedia.org/3.7/en/
# if you want to save space do this:
for i in *.bz2 ; do bzcat $i | gzip > ${i%.bz2}.gz && rm $i ; done &
# else do:
#bunzip2 *&
cd ..
wget http://downloads.dbpedia.org/3.7/dbpedia_3.7.owl
mkdir -p ../3.7-i18n/de && cd ../3.7-i18n/de
wget -r -np -nd -nc -A'*.nt.bz2' http://downloads.dbpedia.org/3.7-i18n/de/
# if you want to save space do this:
for i in *.nt.bz2 ; do bzcat $i | gzip > ${i%.nt.bz2}.n3.gz && rm $i ; done &
# else do:
#bunzip2 *
# notice that the extraction (and repacking) of *.bz2 takes quite a while (about 1 hour)
# gzipped data is reasonably packed, but still very fast to access (in contrast to bz2), so maybe this is the best choice.
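If you have more cores to spare, a per-file parallel variant of the repacking loop could look like the following sketch (it uses GNU xargs and nproc and assumes filenames without spaces, which holds for the DBpedia dumps; run it in the directory with the .bz2 files and adapt the target extension as needed):
ls *.bz2 | xargs -P "$(nproc)" -n 1 \
  sh -c 'bzcat "$0" | gzip > "${0%.bz2}.gz" && rm "$0"'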
Data cleaning and the bulk loader scripts
In contrast to the previous version of this article, the Virtuoso import now takes care of shortening too-long IRIs itself. Also, the bulk loader script seems to be included in more recent Virtuoso versions, so for reference only: see the old version of this post for the cleaning script, and see VirtBulkRDFLoaderExampleDbpedia and http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoaderScript for info about the bulk loader scripts.
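Should your Virtuoso build lack ld_dir_all() / rdf_loader_run(), you can load the bulk loader procedures manually; a sketch (the path is a placeholder for wherever you saved the script from the wiki page above, and 1111 / dba / dba are the default port and credentials):
isql 1111 dba dba /path/to/VirtBulkRDFLoaderScript.sql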
Importing the DBpedia dumps into Virtuoso
Now, AFTER the re-/unpacking of the DBpedia dumps, we will register all files in the dbpedia dir (recursively, using ld_dir_all) to be added to the dbpedia graph. As mentioned above: if you use this method, make sure that only files you really want to import reside in the given subtree.
If you only want to add the files of one directory (non-recursively), use ld_dir.
If you want to add individual files manually, use ld_add.
See the VirtBulkRDFLoaderScript file for the arguments to pass.
Be warned that it might be a bad idea to import the normal and the i18n dataset into one graph if you didn't select specific languages, as that might introduce a lot of duplicates. In order to keep track of what was selected and imported into which graph (see Note 2 below), we linked (ln -s) the files from the English (orig) and German (i18n) datasets into a directory structure beneath /usr/local/data/datasets/dbpedia/3.7/importedGraphs/ and imported from there instead; a small sketch of this follows below. To make sure you think about this, I use that path below, so it won't work if you didn't pay attention. If you really want, just import /usr/local/data/datasets/dbpedia/3.7/.
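A rough sketch of how such a symlink structure could be set up for the default http://dbpedia.org graph (the importedGraphs/dbpedia.org directory name is just our own convention; only link the dumps you actually want in that graph, and create analogous directories for the other graphs mentioned in Note 2):
mkdir -p /usr/local/data/datasets/dbpedia/3.7/importedGraphs/dbpedia.org
cd /usr/local/data/datasets/dbpedia/3.7/importedGraphs/dbpedia.org
# the en dumps and the ontology live two levels up (see the download section):
ln -s ../../3.7/en/*.gz .
ln -s ../../3.7/dbpedia_3.7.owl .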
Note: in the following I will assume that your Virtuoso isql command is called isql. If you lack such a command, it might be called isql-vt.
Note 2: in our case we actually decided not to import all the files into just one graph, but instead used separate graphs for en and de as well as for the pagelinks, infoboxprops, extlinks and interlanguage_links dumps. Be warned though that only a certain amount of triples from the http://dbpedia.org graph will be shown when you visit the local pages with your browser.
isql # enter virtuoso sql mode
-- we are in sql mode now
ld_dir_all('/usr/local/data/datasets/dbpedia/3.7/importedGraphs/dbpedia.org', '*.*', 'http://dbpedia.org');
-- do the following to see which files were registered to be added:
select * from DB.DBA.LOAD_LIST;
-- if unsatisfied use:
-- delete from DB.DBA.LOAD_LIST;
EXIT;
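For the split into separate graphs mentioned in Note 2, you would register each directory with its own graph IRI. A sketch using isql's exec= parameter (assuming the default port 1111 and dba/dba credentials; the importedGraphs/ subdirectory names are, again, just our own symlink convention):
isql 1111 dba dba exec="ld_dir_all('/usr/local/data/datasets/dbpedia/3.7/importedGraphs/pagelinks.dbpedia.org', '*.*', 'http://pagelinks.dbpedia.org');"
isql 1111 dba dba exec="ld_dir_all('/usr/local/data/datasets/dbpedia/3.7/importedGraphs/de.dbpedia.org', '*.*', 'http://de.dbpedia.org');"
# ... and so on for the other graphs you can see in the result table further below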
OK, now comes the fun (and long) part: about 7 hours… We registered the files to be added; now let's finally start the process. Fire up screen (see the comment below) if you didn't already.
sudo aptitude install screen
screen isql
rdf_loader_run();
-- DO NOT USE THE DB BESIDES THE FOLLOWING COMMANDS:
-- (I had some warnings about a possibly corrupt db in the log,
-- when I visited the Virtuoso Conductor during the first run...)
-- you can watch the progress from another isql session with:
-- select * from DB.DBA.LOAD_LIST;
-- if you need to stop the loading for any reason: rdf_load_stop ();
-- if you want to force stopping: rdf_load_stop(1);
checkpoint;
commit work;
checkpoint;
EXIT;
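Optional variation of the step above: on a multi-core machine the bulk loader wiki page suggests running several rdf_loader_run() calls in parallel. A sketch from the shell, assuming the default port 1111 and dba/dba credentials:
isql 1111 dba dba exec="rdf_loader_run();" &
isql 1111 dba dba exec="rdf_loader_run();" &
wait
isql 1111 dba dba exec="checkpoint;"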
After this:
Take a look at the /var/lib/virtuoso/db/virtuoso.log file. Should you find any errors in there… FIX THEM! You can still use the dump, but it will be incomplete then. Any error aborts the loading of the corresponding file and the loader continues with the next one, so you're only getting the part of that file up to the place where the error occurred. (Should you find errors you can't fix in the way I did above, please leave a comment.)
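A quick (if crude) way to scan the log for problems:
grep -iE 'error|warn' /var/lib/virtuoso/db/virtuoso.log | less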
Final polishing
You can & should now install the DBpedia and RDF Mappers packages from the Virtuoso Conductor.
http://your-server:8890
login: dba
pw: dba
Go to System Admin / Packages. Install the dbpedia and rdf_mappers packages (takes about 5 minutes).
Testing your local mirror
Go to the SPARQL endpoint of your server, http://your-server:8890/sparql (or, in isql, prefix the query with: SPARQL), and run:
SELECT count(*) WHERE { ?s ?p ?o } ;
This might take about 15 minutes and then returns 437,768,995. Subsequent queries are a lot faster. (If you find another way, preferably automatic, to warm up the caches, please leave me a note; one simple idea is sketched below.)
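A simple, hypothetical way to automate the warm-up would be to fire that count query at the HTTP endpoint right after each start, e.g. from a small script or a cron @reboot job (your-server is a placeholder as above):
curl -sG http://your-server:8890/sparql \
  --data-urlencode 'query=SELECT (COUNT(*) AS ?c) WHERE { ?s ?p ?o }'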
I also like this query showing all the graphs and how many triples are in them:
SELECT ?g COUNT(*) { GRAPH ?g {?s ?p ?o.} } GROUP BY ?g ORDER BY DESC 2;
g callret-1
LONG VARCHAR LONG VARCHAR
___________________________________________________________
http://dbpedia.org 131477215
http://pagelinks.dbpedia.org 118039661
http://rawinfoboxproperties.dbpedia.org 83705116
http://pagelinks.de.dbpedia.org 41397135
http://extlinks.dbpedia.org 31354613
http://de.dbpedia.org 19748791
http://rawinfoboxproperties.de.dbpedia.org 11076144
http://interlanguagelinks.de.dbpedia.org 694064
http://www.openlinksw.com/schemas/RDF_Mapper_Ontology/1.0/ 256065
http://localhost:8890/DAV 4009
http://www.openlinksw.com/schemas/virtrdf# 2066
http://open.vocab.org/terms 1480
http://purl.org/ontology/bibo/ 1226
http://purl.org/goodrelations/v1 937
http://purl.org/dc/terms/ 857
http://www.openlinksw.com/schemas/opengraph 804
http://www.openlinksw.com/schemas/googleplus 696
http://www.openlinksw.com/schemas/google-base 691
http://www.openlinksw.com/schemas/cv 661
virtrdf-label 638
http://www.openlinksw.com/schemas/linkedin 613
http://xmlns.com/foaf/0.1/ 557
http://rdfs.org/sioc/ns# 553
http://www.openlinksw.com/schemas/evri 482
http://www.openlinksw.com/schemas/crunchbase 426
http://bblfish.net/work/atom-owl/2006-06-06/ 386
http://scot-project.org/scot/ns# 332
http://www.openlinksw.com/schemas/zillow 311
http://www.w3.org/2004/02/skos/core 252
http://www.openlinksw.com/schemas/cnet 225
http://www.openlinksw.com/schemas/tesco 183
http://www.openlinksw.com/schemas/bestbuy 172
http://www.w3.org/2002/07/owl# 167
http://www.w3.org/2002/07/owl 160
http://www.openlinksw.com/schemas/angel# 144
http://www.openlinksw.com/schemas/amazon 143
http://purl.org/dc/elements/1.1/ 139
http://www.w3.org/2007/05/powder-s# 117
http://www.openlinksw.com/schemas/twitter 103
http://www.openlinksw.com/schemas/stackoverflow# 102
http://www.openlinksw.com/schemas/klout 90
http://www.w3.org/2000/01/rdf-schema# 87
http://www.w3.org/1999/02/22-rdf-syntax-ns# 85
http://www.openlinksw.com/schemas/ebay 79
http://www.openlinksw.com/schema/attribution# 68
http://www.openlinksw.com/schemas/nyt 41
http://www.openlinksw.com/schemas/oplbase 26
http://www.openlinksw.com/schemas/cert# 23
http://www.openlinksw.com/schemas/dbpedia-spotlight# 21
http://www.openlinksw.com/schemas/money 21
http://dbpedia.org/schema/property_rules# 12
dbprdf-label 6
52 Rows. -- 1711753 msec.
Congratulations, you just imported nearly half a billion triples.
Backing up this initial state
Now is a good moment to back up the whole db (takes about half an hour):
sudo -i
cd /
/etc/init.d/virtuoso-opensource stop &&
tar -cvf - /var/lib/virtuoso | gzip --fast > virtuoso-6.1.6-dev-DBDUMP-dbpedia-3.7-en_de-$(date '+%F').tar.gz &&
/etc/init.d/virtuoso-opensource start
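Should you ever need to roll back to this state, the restore is basically the reverse (a sketch; the tarball stores paths relative to /, and the filename is whatever the command above produced):
sudo -i
/etc/init.d/virtuoso-opensource stop
cd /
tar -xvzf /virtuoso-6.1.6-dev-DBDUMP-dbpedia-3.7-en_de-YYYY-MM-DD.tar.gz
/etc/init.d/virtuoso-opensource start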
Yay, done 😉
As always, feel free to leave comments to tell us about your problems or how happy you are :D.
Thanks
Many thanks to the DBpedia team for their endless efforts of providing us all with a great dataset. Also many thanks to the Virtuoso crew for releasing an opensource version of their DB; especially to Hugh Williams and Patrick van Kleef for helping me out with a couple of problems in the newer version.
My name
Thanks to Paul for telling me about my name day: "Jörn (Der eberstarke, mutige, gute Freund)" (Jörn, the boar-strong, brave, good friend). Actually it's fun to read the Plattdüütsch (Low German) version, which says as much as that the name comes from Jürgen, which in turn comes from Georg, but as people speaking that dialect were "mundfuul" (talk-lazy) they shortened it to Jörn 😀 (and I can really read it).
The English Wikipedia told me that I'm actually a village (yay, we all love RDF, don't we? At least de.dbpedia.org knows more).
Oh, and there's a train station with my name on it.
Lovely.
Git ad-hoc sharing
I recently found quite a cool way for easily sharing git code between two machines in a LAN or WLAN (as easy as in Mercurial). The following command creates a git alias called "serve" (you only need to run this once, so you don't have to manually call git daemon ... each time):
git config --global alias.serve 'daemon --reuseaddr --base-path=. --export-all --verbose'
Get your IP with ifconfig. After this you can just cd into your code directory (where the .git dir is) and then run:
git serve
This will host a small git daemon (server), which you can stop at any time with CTRL+C. While it is still running, simply run this on the fetching computer (client):
git clone git://the_server_ip/
You can also run the server in a parent directory and serve multiple git repositories at once. If you do, you need to include the relative path to the directory containing the .git dir on the client side:
git clone git://the_server_ip/your/subdirs/here/
Subsequent calls like git fetch should also work. If the IP changes, just change the origin remote's URL.
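For example, to point an existing clone at the server's new IP (the_new_ip being a placeholder):
git remote set-url origin git://the_new_ip/your/subdirs/here/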
For more you might want to have a look at these:
http://stackoverflow.com/questions/5817095/what-tools-exist-for-simple-one-off-sharing-of-a-git-repo
http://stackoverflow.com/questions/377213/git-serve-i-would-like-it-that-simple
Duolingo: Learn a language and translate the web
Another one of Luis von Ahn's ingenious projects: http://duolingo.com lets you learn a language for free while translating the web in the background.
There is a pretty recent TED talk by him, and below you can find their introductory video on youtube:
Interesting talk about “Filter Bubbles”
A few days ago I stumbled over an interesting TED talk by Eli Pariser about the ever-increasing personalization of the web: its search results, your Facebook news feed, … Do you think you still see the whole picture, or are you already caught in your own filtered information bubble? (thx to Kingsley Idehen)
Mac OS X Harddisk high Load Cycle Counts
Short summary: Mac OS X's default power management settings might wear your hard drive down unnecessarily. This post provides a lot of background information and explains how to change these settings. Continue reading
Live mapping of tweets, facebook msgs, emails, sms…
Reading the Wikimedia blog I stumbled over this interesting post. They mention a framework called Ushahidi (Swahili for "testimony") with its subproject SwiftRiver, which can be used to track and verify the reliability of news concerning current trending topics, possibly helping Wikipedia's editors to enhance quality.
Digging into it, I found out that the framework is used for live mapping (collection, aggregation and visualization) of disaster- and event-related messages sent via all kinds of transports (e.g., Twitter, Facebook, email, SMS…). One example is the 2010 Haiti earthquake, where it helped to coordinate the search and rescue teams.
As I find it quite fascinating how much people who sit at home in their living rooms might be able to help others in a disaster region, I’d like to suggest this talk:
LaTeX Thesis Skeleton
As it might be useful for other students (especially for computer science students at the University of Kaiserslautern), I decided to invest some time and create a skeleton for a thesis.
The project can be found on github: http://github.com/joernhees/thesis-skeleton.
I’ll happily include / pull changes.
Quick instructions to get started with your thesis:
- Make sure you have git, otherwise install it (e.g., on Ubuntu: sudo aptitude install git-core).
- Run this:
  git clone git://github.com/joernhees/thesis-skeleton.git myMasterThesis
  It will create a directory called myMasterThesis in the current directory, which actually is a git repository and includes a thesis directory.
- Enter it and have a look at thesis.pdf.
- Insert your name, title, supervisors, etc. in thesis.tex.
- Get familiar with git; this is a good start.
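To rebuild the PDF after editing, something along these lines should work, assuming a plain pdflatex + bibtex setup (the skeleton may ship its own Makefile or build script; if so, prefer that):
cd myMasterThesis/thesis
pdflatex thesis && bibtex thesis && pdflatex thesis && pdflatex thesis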
That’s it.