April | 2014 | Jörn's Blog

Newer version available: Setting up a Linked Data mirror from RDF dumps (DBpedia 2015-04, Freebase, Wikidata, LinkedGeoData, …) with Virtuoso 7.2.1 and Docker (optional)

I just found this aged post in my drafts folder, maybe someone will still like it…

So you’re the guy who is allowed to set up a local DBpedia mirror or, more generally, a local Linked Data mirror for your work group? OK, today is your lucky day and you’re in the right place. I hope you’ll be able to benefit from my many hours of trial and error. If anything goes wrong, feel free to leave me a comment below.

Versions of this guide

There are two older versions of this guide:

Oct. 2010: The first version focusing on DBpedia 3.5 – 3.6 and Virtuoso 6.1
May 2012: A bigger update to DBpedia 3.7 (new local language versions) and Virtuoso 6.1.5+ (with a lot of updates making pre-processing of the dumps easier)

With the recent release of Virtuoso 7 (way faster, thanks to OpenLink!) and DBpedia 3.9, I again felt the urge to update this guide, as a couple of things changed.

In this step-by-step guide I’ll show you how to install a local Linked Data mirror of DBpedia 3.9, hosting a combination of the regular English datasets and (as an example) the i18n German datasets, adding up to more than half a billion triples.

Let’s jump in.

Used Versions

DBpedia 3.9 + 3.9-i18n dataset
Virtuoso OpenSource 7.0.0
Ubuntu 12.04 LTS

Prerequisites

A strong machine with root access and enough RAM: we use a VM with 4 cores and 32 GB of RAM. For installing I recommend more than 128 GB of free disk space, especially for downloading and repacking the datasets, as well as for the database file that grows during the import (mine grew to 41 GB).

Let’s go

Download and install virtuoso

Go and download Virtuoso OpenSource from http://sourceforge.net/projects/virtuoso/ (make sure you get v7.0.0, as used in this guide, or a newer version).

Put the file in your home dir on the server, then extract it and switch to the directory:

cd ~
tar -xvzf virtuoso-7.0.0.tar.gz
cd virtuoso-opensource-7.0.0 # or newer, depending what you got

Now do the following to install the prerequisites and then build virtuoso:

sudo aptitude install libxml2-dev libssl-dev autoconf libgraphviz-dev \
     libmagickcore-dev libmagickwand-dev dnsutils gawk bison flex gperf

# NOTICE: this will _not_ install into /usr/local but into /usr
# (so might clash with packages by your distribution if you install
# "the" virtuoso package)
# You'll find the db in /var/lib/virtuoso/db !
# check output for errors and FIX THEM! (e.g., install missing packages)
export CFLAGS="-O2 -m64"
./configure --with-layout=debian

# the following will build with 5 processes in parallel
# choose something like your server's #CPUs + 1
make -j5

This will take about 5 min

sudo make install

Now change the following values in /var/lib/virtuoso/db/virtuoso.ini. The performance tuning settings follow http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtRDFPerformanceTuning:

# note: virtuoso ignores lines starting with whitespace and stuff after a ;
[Parameters]
# you need to include the directory where your datasets will be downloaded
# to, in our case /usr/local/data/datasets:
DirsAllowed = ., /usr/share/virtuoso/vad, /usr/local/data/datasets
# IMPORTANT: for performance also do this
[Parameters]
# the following two are as suggested by comments in the original .ini
# file in order to use the RAM on your server:
NumberOfBuffers = 2720000
MaxDirtyBuffers = 2000000
# each buffer caches a 8K page of data and occupies approx. 8700 bytes of
# memory. It's suggested to set this value to 65 % of ram for a db only server
# so if you have 32 GB of ram: 32*1000^3*0.65/8700 = 2390804
# default is 2000 which will use 16 MB ram ;)
# Make sure to remove whitespace if you uncomment existing lines!
[Database]
MaxCheckpointRemap = 625000
# set this to roughly 1/4th of NumberOfBuffers
[SPARQL]
# I like to increase the ResultSetMaxrows, MaxQueryCostEstimationTime
# and MaxQueryExecutionTime drastically as it's a local store where we
# do quite complex queries... up to you (don't do this if a lot of people
# use it).
# In any case for the importer to be more robust add the following setting
# to this section:
ShortenLongURIs = 1
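
If your machine has a different amount of RAM, the rule of thumb from the comments above is easy to recompute. The following is just a small sketch of that arithmetic, assuming (as stated above) about 65 % of RAM for a DB-only server and roughly 8700 bytes per buffer. The function names suggest_buffers and suggest_checkpoint_remap are made up for this example and are not part of Virtuoso:

```shell
# Estimate NumberOfBuffers for a given amount of RAM in GB.
# Assumptions (from the ini comments above): 65 % of RAM goes to
# buffers, and each buffer occupies approx. 8700 bytes.
suggest_buffers() {
  sb_ram_gb=$1
  echo $(( sb_ram_gb * 1000 * 1000 * 1000 * 65 / 100 / 8700 ))
}

# MaxCheckpointRemap as 1/4th of a given NumberOfBuffers value.
suggest_checkpoint_remap() {
  echo $(( $1 / 4 ))
}

suggest_buffers 32                  # prints 2390804
suggest_checkpoint_remap 2390804    # prints 597701
```

Treat the result as a starting point only. The values I actually used above are the ones suggested in the comments of the original .ini file.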

The next step installs an init-script (autostart) and starts the virtuoso server. (If you’ve changed directories to edit /var/lib/virtuoso/db/virtuoso.ini, go back to the virtuoso source dir!):

sudo cp debian/init.d /etc/init.d/virtuoso-opensource &&
sudo chmod a+x /etc/init.d/virtuoso-opensource &&
sudo bash debian/virtuoso-opensource.postinst.debhelper

You should now have a running virtuoso server.

DBpedia URIs (en) vs. DBpedia IRIs (i18n)

DBpedia 3.9 consists of several datasets: one “standard” English version and several localized versions for other languages (i18n). The standard version mints URIs by going through all English Wikipedia articles. For all of these, the Wikipedia cross-language links are used to extract corresponding labels in other languages for the en URIs (e.g., de/labels_en_uris_de.nt.bz2). This is problematic, as for example articles which exist only in the German Wikipedia won’t be extracted. To solve this problem the i18n versions exist and create IRIs in the form of de.dbpedia.org for every article in the German Wikipedia (e.g., de/labels_de.nt.bz2).

This approach has several implications. For backwards compatibility reasons the standard DBpedia makes statements about URIs such as http://dbpedia.org/resource/Gerhard_Schr%C3%B6der while the local chapters, like the German one, make statements about IRIs such as http://de.dbpedia.org/resource/Gerhard_Schröder (note the ö). In other words, and as written above: the standard DBpedia uses URIs to identify things, while the localized versions use IRIs.

This also means that http://dbpedia.org/resource/Gerhard_Schröder shouldn’t work. That said, clicking the link will actually work, as your browser does some magic to give you what you probably meant. Using curl (curl -i -L -H "Accept: application/rdf+xml" http://dbpedia.org/resource/Gerhard_Schröder) or SPARQLing the endpoint is not as forgiving and can cause quite some headache: compare select * where { dbpedia:Gerhard_Schröder ?p ?o. } with select * where { <http://dbpedia.org/resource/Gerhard_Schr%C3%B6der> ?p ?o. }. To mitigate this historic problem a bit, DBpedia offers owl:sameAs links from IRIs to URIs in en/iri_same_as_uri_en, which you should load, so you at least have a link to what you want if someone tries to get info about an IRI.
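
To make the difference between the two forms a bit more tangible, here is a rough sketch of the generic IRI to URI mapping: every non-ASCII byte of the UTF-8 encoded IRI is percent-encoded, everything else is passed through. The function name iri_to_uri is made up for this example, and I'm only assuming that this generic mapping matches what DBpedia does for names like the one above. DBpedia's own encoding rules may differ in details for some ASCII characters.

```shell
# Percent-encode all non-ASCII bytes of a UTF-8 encoded IRI.
# ASCII bytes (including an already present '%') are passed through.
iri_to_uri() {
  printf '%s' "$1" | od -An -v -tx1 | tr -s ' ' '\n' | while read -r hex; do
    [ -n "$hex" ] || continue              # skip empty lines from od/tr
    dec=$(( 0x$hex ))                      # hex byte -> decimal
    if [ "$dec" -lt 128 ]; then
      # plain ASCII: print the byte itself via an octal escape
      printf '%b' "\\0$(printf '%03o' "$dec")"
    else
      # non-ASCII byte: print as %XX
      printf '%%%02X' "$dec"
    fi
  done
  echo
}

# Usage (the input has to be UTF-8 encoded):
#   iri_to_uri 'http://de.dbpedia.org/resource/Gerhard_Schröder'
```

This spawns a few processes per byte, so it's only meant for trying out a handful of IRIs, not for converting whole dumps.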

As if this isn’t confusing enough, there is another trap: if you were to download the .ttl files, you would suddenly have all statements associated with the IRI for the standard DBpedia (unlike the online endpoint). The only reason I can think of for this inconsistency is that at some point the actual inconsistency of URIs in EN vs. IRIs in everything else will be resolved. For now these files are most certainly not what you want! So use the .nt files!

As the standard DBpedia provides labels, abstracts and a couple other things in several languages, there are two types of files in the localized DBpedia folders: There are triples directly associating the English URIs with for example the German labels (de/labels_en_uris_de) and there are the localized triple files which associate for example the DE IRIs with the German labels (de/labels_de).

Downloading the DBpedia dump files & Repacking

For our group we decided that we wanted a reasonably complete mirror of the standard DBpedia (EN) (have a look at datasets loaded into the public DBpedia SPARQL Endpoint), but also the i18n versions for the German and French DBpedia loaded in separate graphs, as well as each of their pagelink datasets in another separate graph. For this we download the corresponding files in (NT) format (also see previous section with remarks about the TTL files!). If you need something different do so (and maybe report back if there were problems and how you solved them).

Another hint: Virtuoso can only import plain (uncompressed) or gzipped files, but the DBpedia dumps are bzipped, so you either repack them into gzip format or extract them. On our server, importing from extracted files was noticeably slower than from gzipped ones (ignoring the vast amount of disk space wasted by the extracted files). File access becomes a bottleneck if you have 4 cores idling. This is why I decided on repacking all the files from bz2 to gz. As you can see, I do the repacking per folder in parallel. If that’s not suitable for you, feel free to change it. You might also want to change this if you want to repack in parallel to downloading. The repacking process below took about 1 hour but was worth it in the end. The more CPUs you have, the more you can parallelize this process.

sudo -i # get root
# see comment above, you could also get the all_language.tar or another DBpedia version...
mkdir -p /usr/local/data/datasets/dbpedia/3.9
cd /usr/local/data/datasets/dbpedia/3.9
wget -r -nc -nH --cut-dirs=1 -np -l1 -A '*.nt.bz2' -A '*.owl' -R '*unredirected*' http://downloads.dbpedia.org/3.9/{en/,de/,fr/,links/,wikidata/,dbpedia_3.9.owl}

# if you want to save space do this:
for d in */ ; do for i in "${d%/}"/*.bz2 ; do bzcat "$i" | gzip > "${i%.bz2}.gz" && rm "$i" ; done & done
# else do:
#bunzip2 */*.bz2 &

# notice that the extraction (and repacking) of *.bz2 takes quite a while (about 1 hour)
# gzipped data is reasonably packed, but still very fast to access (in contrast to bz2), so maybe this is the best choice.
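
If the per-folder parallelism above doesn't fit your machine (one background job per folder, no matter how many cores you have), here is an alternative sketch that repacks per file with a fixed number of parallel jobs. It assumes find and xargs with -print0 / -0 / -P support (as on Ubuntu) and bash for pipefail, so that a broken .bz2 file is not deleted. The function name repack_bz2_to_gz is made up for this example:

```shell
# Repack all *.bz2 files below a directory to *.gz, N files in parallel.
# $1: directory to search, $2: number of parallel jobs (default 4)
repack_bz2_to_gz() {
  rp_dir=$1
  rp_jobs=${2:-4}
  find "$rp_dir" -type f -name '*.bz2' -print0 |
    xargs -0 -n 1 -P "$rp_jobs" bash -c '
      [ -n "$1" ] || exit 0     # nothing to do if find found no files
      set -o pipefail           # fail if bzcat fails, not only gzip
      bzcat "$1" | gzip > "${1%.bz2}.gz" && rm "$1"
    ' _
}

# Usage:
#   repack_bz2_to_gz /usr/local/data/datasets/dbpedia/3.9 5
```

In contrast to the one-liner above, this call blocks until all files are done, so you know when it's safe to continue with the import.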

Data Cleaning and The bulk loader scripts

In contrast to the previous versions of this article, the Virtuoso import will take care of shortening overly long IRIs itself. Also, it seems the bulk loader script is included in the more recent Virtuoso versions, so as a reference only: see the old version for the cleaning script, and VirtBulkRDFLoaderExampleDbpedia and http://virtuoso.openlinksw.com/dataspace/dav/wiki/Main/VirtBulkRDFLoaderScript for info about the bulk loader scripts.

Importing DBpedia dumps into virtuoso

Now AFTER the re-/unpacking of the DBpedia dumps we will register all files in the dbpedia dir (recursively ld_dir_all) to be added to the dbpedia graph. If you use this method make sure that only files reside in the given subtree that you really want to import.
Also don’t forget to import the dbpedia_3.9.owl file (last step in the script below)!
If you only want one directory’s files to be added (non recursive) use ld_dir.
If you manually want to add some files, use ld_add.
See the VirtBulkRDFLoaderScript file for args to pass.

Be warned that it might be a bad idea to import the normal and i18n dataset into one graph if you didn’t select specific languages, as it might introduce a lot of duplicates.

In order to keep track what was selected and imported into which graph, I actually link (ln -s) the repacked files into a directory structure beneath /usr/local/data/datasets/dbpedia/3.9/importedGraphs/ and import from there instead. To make sure you think about this, I use that path below, so it won’t work if you didn’t pay attention. If you really want to import all downloaded files, just import /usr/local/data/datasets/dbpedia/3.9/.
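
As typing long ln -s lists is error-prone, here is a small sketch of a helper that links a list of files into one graph directory and complains about files that don't exist (e.g., because of a typo or a failed download). The function name link_into_graph is made up for this example. It links with absolute paths, which is a design choice of this sketch and differs from the relative links I use below:

```shell
# Link the given files into a graph directory, report missing ones.
# $1: graph directory (created if needed), remaining args: files to link
# Returns non-zero if at least one file was missing.
link_into_graph() {
  lg_dir=$1
  shift
  mkdir -p "$lg_dir" || return 1
  lg_missing=0
  for lg_f in "$@"; do
    if [ -f "$lg_f" ]; then
      # absolute target, so the link works from any working directory
      lg_abs="$(cd "$(dirname "$lg_f")" && pwd)/$(basename "$lg_f")"
      ln -sf "$lg_abs" "$lg_dir/"
    else
      echo "missing: $lg_f" >&2
      lg_missing=$(( lg_missing + 1 ))
    fi
  done
  [ "$lg_missing" -eq 0 ]
}

# Usage:
#   cd /usr/local/data/datasets/dbpedia/3.9/
#   link_into_graph importedGraphs/de.dbpedia.org de/labels_de.nt.gz
```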

Also be aware of the fact that if you load certain parts of dumps in different graphs (such as I did with the pagelinks, as well as the i18n versions of the DE and FR datasets) that only triples from the http://dbpedia.org graph will be shown when you visit the local pages with your browser (SPARQL is unaffected by this)!

So if you want to load the same datasets as loaded on the official endpoint (but restricted to the EN, DE and FR ones), the following should do the trick to link them up for the next steps:

cd /usr/local/data/datasets/dbpedia/3.9/
mkdir -p importedGraphs/dbpedia.org
cd importedGraphs/dbpedia.org
ln -s \
    ../../en/article_categories_en.nt.gz \
    ../../en/category_labels_en.nt.gz \
    ../../en/disambiguations_en.nt.gz \
    ../../en/external_links_en.nt.gz \
    ../../en/geo_coordinates_en.nt.gz \
    ../../en/homepages_en.nt.gz \
    ../../en/images_en.nt.gz \
    ../../en/instance_types_en.nt.gz \
    ../../en/instance_types_heuristic_en.nt.gz \
    ../../en/interlanguage_links_chapters_en.nt.gz \
    ../../en/iri_same_as_uri_en.nt.gz \
    ../../en/labels_en.nt.gz \
    ../../en/long_abstracts_en.nt.gz \
    ../../en/mappingbased_properties_cleaned_en.nt.gz \
    ../../en/page_ids_en.nt.gz \
    ../../en/persondata_en.nt.gz \
    ../../en/pnd_en.nt.gz \
    ../../en/raw_infobox_properties_en.nt.gz \
    ../../en/raw_infobox_property_definitions_en.nt.gz \
    ../../en/redirects_transitive_en.nt.gz \
    ../../en/revision_ids_en.nt.gz \
    ../../en/revision_uris_en.nt.gz \
    ../../en/short_abstracts_en.nt.gz \
    ../../en/skos_categories_en.nt.gz \
    ../../en/specific_mappingbased_properties_en.nt.gz \
    ../../en/wikipedia_links_en.nt.gz \
    ../../de/labels_en_uris_de.nt.gz \
    ../../de/long_abstracts_en_uris_de.nt.gz \
    ../../de/pnd_en_uris_de.nt.gz \
    ../../de/short_abstracts_en_uris_de.nt.gz \
    ../../fr/labels_en_uris_fr.nt.gz \
    ../../fr/long_abstracts_en_uris_fr.nt.gz \
    ../../fr/short_abstracts_en_uris_fr.nt.gz \
    ../../links/amsterdammuseum_links.nt.gz \
    ../../links/bbcwildlife_links.nt.gz \
    ../../links/bookmashup_links.nt.gz \
    ../../links/bricklink_links.nt.gz \
    ../../links/cordis_links.nt.gz \
    ../../links/dailymed_links.nt.gz \
    ../../links/dblp_links.nt.gz \
    ../../links/dbtune_links.nt.gz \
    ../../links/diseasome_links.nt.gz \
    ../../links/drugbank_links.nt.gz \
    ../../links/eunis_links.nt.gz \
    ../../links/eurostat_linkedstatistics_links.nt.gz \
    ../../links/eurostat_wbsg_links.nt.gz \
    ../../links/factbook_links.nt.gz \
    ../../links/flickrwrappr_links.nt.gz \
    ../../links/freebase_links.nt.gz \
    ../../links/gadm_links.nt.gz \
    ../../links/geonames_links.nt.gz \
    ../../links/geospecies_links.nt.gz \
    ../../links/gho_links.nt.gz \
    ../../links/gutenberg_links.nt.gz \
    ../../links/italian_public_schools_links.nt.gz \
    ../../links/linkedgeodata_links.nt.gz \
    ../../links/linkedmdb_links.nt.gz \
    ../../links/musicbrainz_links.nt.gz \
    ../../links/nytimes_links.nt.gz \
    ../../links/opencyc_links.nt.gz \
    ../../links/openei_links.nt.gz \
    ../../links/revyu_links.nt.gz \
    ../../links/sider_links.nt.gz \
    ../../links/tcm_links.nt.gz \
    ../../links/umbel_links.nt.gz \
    ../../links/uscensus_links.nt.gz \
    ../../links/wikicompany_links.nt.gz \
    ../../links/wordnet_links.nt.gz \
    ../../links/yago_links.nt.gz \
    ../../links/yago_taxonomy.nt.gz \
    ../../links/yago_type_links.nt.gz \
    ../../links/yago_types.nt.gz \
    ../../dbpedia_3.9.owl \
    ./

Note: in the following I will assume that your Virtuoso isql command is called isql. If you don’t have such a command, it might be called isql-vt, but this usually means you installed Virtuoso using some other method than the one described here.

isql # enter virtuoso sql mode

-- we are in sql mode now
ld_dir_all('/usr/local/data/datasets/dbpedia/3.9/importedGraphs/dbpedia.org', '*.*', 'http://dbpedia.org');
ld_dir_all('/usr/local/data/datasets/dbpedia/3.9/importedGraphs/de.dbpedia.org', '*.*', 'http://de.dbpedia.org');
ld_dir_all('/usr/local/data/datasets/dbpedia/3.9/importedGraphs/pagelinks.dbpedia.org', '*.*', 'http://pagelinks.dbpedia.org');
ld_dir_all('/usr/local/data/datasets/dbpedia/3.9/importedGraphs/pagelinks.de.dbpedia.org', '*.*', 'http://pagelinks.de.dbpedia.org');
ld_dir_all('/usr/local/data/datasets/dbpedia/3.9/importedGraphs/topicalconcepts.dbpedia.org', '*.*', 'http://topicalconcepts.dbpedia.org');

-- do the following to see which files were registered to be added:
select * from DB.DBA.LOAD_LIST;
-- if unsatisfied use:
-- delete from DB.DBA.LOAD_LIST;
EXIT;

OK, now comes the fun (and long) part: about 1.5 hours (the new Virtuoso 7 is cool 😉)… We registered the files to be added, now let’s finally start the process. Fire up screen if you didn’t already.

sudo aptitude install screen
screen isql

rdf_loader_run();
-- DO NOT USE THE DB BESIDES THE FOLLOWING COMMANDS:
-- (I had some warnings about a possibly corrupt db in the log,
-- when I visited the virtuoso conductor during the first run...)
-- you can watch the progress from another isql session with:
-- select * from DB.DBA.LOAD_LIST;
-- if you need to stop the loading for any reason: rdf_load_stop ();
-- if you want to force stopping: rdf_load_stop(1);
checkpoint;
commit work;
checkpoint;
EXIT;
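
While the loader is running, the full LOAD_LIST output gets quite long. Here is a small sketch that only prints how many files are in which state, run from a second shell. It assumes the default port 1111 and the dba / dba login, and that ll_state is 0 for files still to be loaded, 1 for files being loaded and 2 for finished ones. The function name load_progress and the VIRT_* variables are made up for this example:

```shell
# Summarize the bulk loader's progress: number of files per state.
# Connection settings can be overridden via VIRT_PORT/VIRT_USER/VIRT_PASS.
load_progress() {
  isql "${VIRT_PORT:-1111}" "${VIRT_USER:-dba}" "${VIRT_PASS:-dba}" \
    exec="SELECT ll_state, COUNT(*) FROM DB.DBA.LOAD_LIST GROUP BY ll_state;"
}

# Usage, e.g. refreshing every minute:
#   while sleep 60; do load_progress; done
```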

After this:
Take a look into the /var/lib/virtuoso/db/virtuoso.log file. Should you find any errors in there… FIX THEM! You might use the imported dump anyway, but it will be incomplete: any error aborts the loading of the corresponding file and the loader continues with the next one, so you’re only using the part of that file up to the place where the error occurred. (Should you find errors you can’t fix in the way I did above, please leave a comment.)
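
A quick way to do this check could look like the following sketch. The first function asks the loader itself which files failed (assuming the ll_error column of DB.DBA.LOAD_LIST is filled for failed files), the second one does a crude keyword search in the log. The function names and the search keywords are my own guesses for this example, so don't rely on them catching everything:

```shell
# List files for which the bulk loader recorded an error.
failed_load_files() {
  isql "${VIRT_PORT:-1111}" "${VIRT_USER:-dba}" "${VIRT_PASS:-dba}" \
    exec="SELECT ll_file, ll_error FROM DB.DBA.LOAD_LIST WHERE ll_error IS NOT NULL;"
}

# Print log lines (with line numbers) that look like problems.
# $1: log file, defaults to the location used in this guide
scan_virtuoso_log() {
  grep -n -i -E 'error|corrupt' "${1:-/var/lib/virtuoso/db/virtuoso.log}"
}

# Usage:
#   failed_load_files
#   scan_virtuoso_log
```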

Final polishing

You can & should now install the DBpedia and RDF Mappers packages from the Virtuoso Conductor.
http://your-server:8890

login: dba
pw: dba

Go to System Admin / Packages. Install the dbpedia (v. 1.3.83) and rdf_mappers (v. 1.34.72) packages (takes about 5 minutes).

Testing your local mirror

Go to the sparql-endpoint of your server http://your-server:8890/sparql (or in isql prefix with: SPARQL)

sparql SELECT count(*) WHERE { ?s ?p ?o } ;

This shouldn’t take long in Virtuoso 7 anymore and for me now returns 567,173,934.
I also like this query showing all the graphs and how many triples are in them:

sparql SELECT ?g COUNT(*) { GRAPH ?g {?s ?p ?o.} } GROUP BY ?g ORDER BY DESC 2;
g                                                          callret-1
LONG VARCHAR                                               LONG VARCHAR
___________________________________________________________

http://dbpedia.org                                         312505120
http://pagelinks.dbpedia.org                               136591822
http://de.dbpedia.org                                      67997676
http://pagelinks.de.dbpedia.org                            49664737
http://www.openlinksw.com/schemas/RDF_Mapper_Ontology/1.0/ 256065
http://topicalconcepts.dbpedia.org                         136887
http://localhost:8890/DAV/                                 4709
http://www.openlinksw.com/schemas/virtrdf#                 2617
http://open.vocab.org/terms                                1480
http://purl.org/ontology/bibo/                             1226
http://purl.org/goodrelations/v1                           937
http://purl.org/dc/terms/                                  857
http://www.openlinksw.com/schemas/opengraph                804
http://www.openlinksw.com/schemas/linkedin                 741
http://www.openlinksw.com/schemas/googleplus               696
http://www.openlinksw.com/schemas/google-base              691
http://www.openlinksw.com/schemas/cv                       661
virtrdf-label                                              638
http://xmlns.com/foaf/0.1/                                 557
http://rdfs.org/sioc/ns#                                   553
http://www.openlinksw.com/schemas/evri                     482
http://www.openlinksw.com/schemas/crunchbase               444
http://bblfish.net/work/atom-owl/2006-06-06/               386
http://scot-project.org/scot/ns#                           332
http://www.openlinksw.com/schemas/zillow                   311
http://www.w3.org/2004/02/skos/core                        252
http://www.openlinksw.com/schemas/cnet                     225
http://www.openlinksw.com/schemas/tesco                    183
http://www.openlinksw.com/schemas/bestbuy                  172
http://www.w3.org/2002/07/owl#                             160
http://www.w3.org/2002/07/owl                              160
http://www.openlinksw.com/schemas/angel#                   144
http://www.openlinksw.com/schemas/amazon                   143
http://purl.org/dc/elements/1.1/                           139
http://www.w3.org/2007/05/powder-s#                        117
http://www.openlinksw.com/schemas/twitter                  103
http://www.openlinksw.com/schemas/stackoverflow#           102
http://www.openlinksw.com/schemas/klout                    90
http://www.w3.org/2000/01/rdf-schema#                      87
http://www.w3.org/1999/02/22-rdf-syntax-ns#                85
http://www.openlinksw.com/schemas/ebay                     79
http://www.openlinksw.com/schema/attribution#              68
http://www.openlinksw.com/schemas/nyt                      41
http://www.openlinksw.com/schemas/wolframalpha#            32
http://www.openlinksw.com/schemas/oplbase                  26
http://www.openlinksw.com/schemas/cert#                    23
http://www.openlinksw.com/schemas/money                    21
http://www.openlinksw.com/schemas/dbpedia-spotlight#       21
http://localhost:8890/sparql                               14
http://dbpedia.org/schema/property_rules#                  12
dbprdf-label                                               6

51 Rows. -- 4563 msec.
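
If you'd rather test the endpoint from the command line (the same way your scripts will use it later), something like the following sketch should work. It sends the query as an HTTP GET parameter with curl and asks for JSON results. The function name sparql_query is made up for this example, and your-server is the same placeholder as above:

```shell
# Send a SPARQL query to an endpoint and print the JSON result.
# $1: endpoint URL, $2: query string
sparql_query() {
  curl -s -G "$1" \
    -H 'Accept: application/sparql-results+json' \
    --data-urlencode "query=$2"
}

# Usage:
#   sparql_query http://your-server:8890/sparql \
#     'SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }'
```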

Congratulations, you just imported more than half a billion triples.

Backing up this initial state

Now is a good moment to back up the whole DB (takes about half an hour):

sudo -i
cd /
/etc/init.d/virtuoso-opensource stop &&
tar -cvjf virtuoso-7.0.0-DBDUMP-dbpedia-3.9-en_de-$(date '+%F').tar.bz2 /var/lib/virtuoso &&
/etc/init.d/virtuoso-opensource start

Yay, done 😉
As always, feel free to leave comments if I made a mistake or to tell us about your problems or how happy you are :D.

Thanks

Many thanks to the DBpedia team for their endless efforts of providing us all with a great dataset. Also many thanks to the Virtuoso crew for releasing an opensource version of their DB.


Setting up a local DBpedia 3.9 mirror with Virtuoso 7